feat: add follower controller source debug step

Codex
2026-07-03 19:52:55 +00:00
parent ba09dfa8b2
commit 20a61b47e1
4 changed files with 88 additions and 4 deletions
@@ -12,6 +12,7 @@ bun scripts/cli.ts cicd branch-follower status --live
bun scripts/cli.ts cicd branch-follower run-once --all --dry-run
bun scripts/cli.ts cicd branch-follower run-once --follower <id> --confirm --wait
bun scripts/cli.ts cicd branch-follower debug-step --follower <id> --step state-read
bun scripts/cli.ts cicd branch-follower debug-step --follower <id> --step controller-source
bun scripts/cli.ts cicd branch-follower debug-step --follower <id> --step status-read
bun scripts/cli.ts cicd branch-follower debug-step --follower <id> --step decide
bun scripts/cli.ts cicd branch-follower debug-step --follower <id> --step state-write --confirm
@@ -24,6 +25,7 @@ bun scripts/cli.ts cicd branch-follower logs --follower <id>
`debug-step` is the required single-step troubleshooting entry point: use it before changing branch-follower code in response to repeated CI/CD convergence issues. When called from the operator host it runs in a bounded target-side Job, and it uses the same controller modules as the real flow:
- `state-read`: read only the compact ConfigMap state, value bytes, resourceVersion and `_updatedAt`.
- `controller-source`: read only the current target-side one-shot checkout identity: HEAD, branch, registry sha and key file markers. Use this before attributing a failed/slow self-upgrade run to new controller code.
- `status-read`: read native source/Tekton/Argo/runtime status without triggering adapters.
- `decide`: run the decision function in dry-run mode without triggering adapters or writing state.
- `state-write --confirm`: patch the stored follower state back through the normal ConfigMap write helper and report before/after resourceVersion; this is for isolating state write failures, not for normal rollout.
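
The step list above can be summarized as a small lookup that separates the read-only steps from the one step that writes state. This is only a sketch: `DebugStep`, `DEBUG_STEPS` and `buildDebugStepArgs` are hypothetical names, not the real CLI internals, and refusing `state-write` without `--confirm` is an assumption drawn from how the command is documented here.

```typescript
// Hypothetical model of the documented debug steps; not the actual branch-follower code.
type DebugStep =
  | "state-read"
  | "controller-source"
  | "status-read"
  | "decide"
  | "state-write";

interface StepInfo {
  mutates: boolean; // true only for the step that writes follower state
  reads: string;    // what the step inspects, paraphrased from the list above
}

const DEBUG_STEPS: Record<DebugStep, StepInfo> = {
  "state-read": { mutates: false, reads: "compact ConfigMap state" },
  "controller-source": { mutates: false, reads: "target-side checkout identity" },
  "status-read": { mutates: false, reads: "native source/Tekton/Argo/runtime status" },
  "decide": { mutates: false, reads: "dry-run decision result" },
  "state-write": { mutates: true, reads: "before/after resourceVersion" },
};

// Builds the argument list for `bun`, mirroring the documented command shape.
// Assumption: a mutating step must be confirmed explicitly.
function buildDebugStepArgs(
  followerId: string,
  step: DebugStep,
  confirm = false,
): string[] {
  const info = DEBUG_STEPS[step];
  if (info.mutates && !confirm) {
    throw new Error(`step "${step}" writes state and requires --confirm`);
  }
  const args = [
    "scripts/cli.ts", "cicd", "branch-follower", "debug-step",
    "--follower", followerId,
    "--step", step,
  ];
  if (info.mutates) {
    args.push("--confirm");
  }
  return args;
}
```

For example, `buildDebugStepArgs("<id>", "controller-source")` yields the arguments for the read-only checkout identity check, while `buildDebugStepArgs("<id>", "state-write")` throws unless `confirm` is passed as `true`.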
@@ -122,6 +124,8 @@ State writes must preserve same-source total timing at the target side. When a l
Controller self-upgrade has a one-loop source boundary: the controller Deployment uses the stable tools image, syncs UniDesk source into the k8s git-mirror cache, then clones `/work/unidesk` on each reconcile. A UniDesk source commit that changes branch-follower controller logic can therefore still be triggered by the previous checkout if the loop observes that commit before cloning it for execution. Do not use that self-upgrade source change to validate new controller-state semantics, and do not backfill its missing total timing. First confirm that the target Pod checkout contains the fix, then validate future timing/state behavior with a later source change or an explicit target-side `run-once` that starts from a stored state written by the fixed controller.
When self-upgrade timing is unclear, use `debug-step --step controller-source` before pushing another source change. If the checkout identity is not visible, add that single-step visibility first; do not infer controller code version from a slow automatic rollout alone.
If a deterministic Kubernetes Job or PipelineRun is reused and there is no already-stored `timings.startedAt`, the reused object's current wait/check duration is only a stage observation; it must not be promoted to `timings.totalSeconds`.
When `run-once --confirm --wait` resumes a source change that is already `ClosingOut`, the CLI may wait for native closeout and report a `closeout` stage duration. That closeout-only wait is not the end-to-end total unless the stored state already contains a valid `timings.startedAt`.
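
The two timing rules above reduce to one guard: a stage or closeout duration is reported as an observation, and an end-to-end total is derived only from an already-stored, valid `timings.startedAt`. The sketch below illustrates that guard; `StoredTimings`, `StageObservation` and `summarizeTiming` are hypothetical names, and the assumption that `startedAt` is an ISO 8601 timestamp string is not confirmed by the source.

```typescript
// Hypothetical shapes; the real stored follower state may differ.
interface StoredTimings {
  startedAt?: string; // assumed to be an ISO 8601 timestamp
  totalSeconds?: number;
}

interface StageObservation {
  stage: string;   // e.g. "closeout" or a reused Job/PipelineRun wait
  seconds: number; // duration observed by this invocation only
}

interface TimingSummary {
  stage: StageObservation;
  totalSeconds: number | null; // null when no end-to-end total can be claimed
}

// Returns epoch milliseconds, or null if the stored value is missing or unparsable.
function parseStartedAt(value: string | undefined): number | null {
  if (!value) {
    return null;
  }
  const ms = Date.parse(value);
  return Number.isNaN(ms) ? null : ms;
}

// Never promotes the stage duration to a total: the total comes only from a
// valid stored startedAt that is not in the future relative to nowMs.
function summarizeTiming(
  stored: StoredTimings | undefined,
  observation: StageObservation,
  nowMs: number,
): TimingSummary {
  const startedMs = parseStartedAt(stored?.startedAt);
  if (startedMs === null || startedMs > nowMs) {
    return { stage: observation, totalSeconds: null };
  }
  return {
    stage: observation,
    totalSeconds: Math.round((nowMs - startedMs) / 1000),
  };
}
```

With no stored `startedAt`, a 45-second closeout wait stays a stage observation and `totalSeconds` is `null`; with a stored `startedAt` ten minutes before `nowMs`, the same call reports `totalSeconds` of 600 alongside the stage duration.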