# CI/CD Branch Follower

SPEC: PJ2026-01060703 CI/CD branch follower draft-2026-07-03-p0-branch-follower

## Entrypoints

```bash
bun scripts/cli.ts cicd branch-follower plan
bun scripts/cli.ts cicd branch-follower apply --confirm --wait
bun scripts/cli.ts cicd branch-follower status
bun scripts/cli.ts cicd branch-follower status --live
bun scripts/cli.ts cicd branch-follower run-once --all --dry-run
bun scripts/cli.ts cicd branch-follower run-once --follower <id> --confirm --wait
bun scripts/cli.ts cicd branch-follower debug-step --follower <id> --step state-read
bun scripts/cli.ts cicd branch-follower debug-step --follower <id> --step controller-source
bun scripts/cli.ts cicd branch-follower debug-step --follower <id> --step status-read
bun scripts/cli.ts cicd branch-follower debug-step --follower <id> --step decide
bun scripts/cli.ts cicd branch-follower debug-step --follower <id> --step state-write --confirm
bun scripts/cli.ts cicd branch-follower events --follower <id>
bun scripts/cli.ts cicd branch-follower logs --follower <id>
bun scripts/cli.ts cicd branch-follower gate --follower hwlab-jd01-v03 --gate control-plane-refresh --source-commit <sha> --confirm --json
```

`apply --confirm --wait` is the one-command deploy/update entry for the K8s controller. `status` is the default intermediate-state query. `status --live` and local `run-once` submit a bounded K8s reconcile Job; the Job performs all source, Tekton, Argo and runtime reads inside the cluster and may write only the compact state summary. `events` and `logs` are read-only drill-downs for the same Kubernetes-native state. `run-once --confirm --wait` is the manual one-command trigger and closeout path.

`debug-step` is the required single-step troubleshooting entry before changing branch-follower code for repeated CI/CD convergence issues. It runs in a bounded target-side Job when called from the operator host, and uses the same controller modules as the real flow:

- `state-read`: read only the compact ConfigMap state, value bytes, resourceVersion and `_updatedAt`.
- `controller-source`: read only the current target-side one-shot checkout identity: HEAD, branch, registry sha and key file markers. Use this before attributing a failed/slow self-upgrade run to new controller code.
- `status-read`: read native source/Tekton/Argo/runtime status without triggering adapters.
- `decide`: run the decision function in dry-run mode without triggering adapters or writing state.
- `state-write --confirm`: patch the stored follower state back through the normal ConfigMap write helper and report before/after resourceVersion; this is for isolating state write failures, not for normal rollout.

Do not debug the same state/read/write problem by repeatedly pushing empty or tiny source commits to drive the full automatic follower loop.

When a branch-follower issue remains ambiguous after a debug step or drill-down, split the CLI into a smaller single-step probe before any new end-to-end run. Add or use a focused `debug-step`, follower-scoped drill-down, or bounded target-side diagnostic for the exact missing edge, such as PipelineRun -> Pipeline spec, controller refresh apply object, state write, closeout re-read, or log/timing extraction. Do not use another source PR, merge, or full automatic follower loop as the next diagnostic action until the narrower step can show the needed evidence.

For HWLAB native `control-plane-refresh`, the bounded evidence chain must preserve both the rendered Pipeline summary and the applied cluster object summary for the same source commit: rendered Pipeline name, bounded `runtime-ready` task/when summary, source commit/stage ref, applied Pipeline name, resourceVersion, and a short annotation/label subset proving which object was patched. If the Job TTL has already removed the original Job, status/events/logs must show `-` or a bounded missing reason from stored state instead of inferring the missing edge.

CI/CD validation must be decomposable into ordered single-step gates before a full rollout observation is accepted: first validate the reuse plan, then CI parallelism/TaskRun plan, then CD rollout plan, then post-deploy monitoring/health evidence. "Single-step" means an independently triggerable and independently executable target-side CLI/debug-step/drill-down entry, not a passive observation extracted from one end-to-end follower run. Each gate must be runnable against a selected follower/source snapshot, must emit bounded evidence, and must be retryable/fixable without creating a new source PR or replaying the full follower loop.

The owner should not stop at explaining historical timing/status after a gate is clear enough to exercise; within the assigned boundary, it should autonomously trigger the relevant target-side single-step gate, inspect the short native result, tune the smallest failing edge, and rerun that same gate until it passes or a real permission/external/architecture blocker is proven.

For HWLAB Pipeline render changes, `gate --gate control-plane-refresh --source-commit <sha> --confirm` is the independently triggerable native refresh gate. Do not use issue comments, repeated PR merges, or end-to-end follower loops as substitutes for a missing single-step validator; add the missing bounded CLI step first.

PRs that change branch-follower convergence, reuse, Tekton/Argo closeout, runtime readiness or gate visibility must be submitted only after the author has run the affected independently triggerable single-step gates on the target NODE/k8s and captured bounded pass evidence. If a required gate cannot be triggered independently or does not pass, do not open the PR as a validation vehicle; leave a short issue comment with the missing gate, target object names and next minimal fix scope, then fix the gate first.

An independently triggered single-step failure is an actionable defect inside that gate until proven otherwise. The owner must narrow the failing gate, patch the relevant CLI/helper/controller logic, rerun the same target-side step, and iterate until it passes before submitting a PR. Issue-only blockers are reserved for permission gaps, external service outages, architecture decisions, or work that is explicitly outside the assigned runtime/repo boundary; ordinary failing step evidence is not a blocker.

When a repeated runtime pitfall or visibility defect is found during branch-follower work, update this reference or the skill entry first, then continue with the narrow debug step. Do not proceed to `run-once`, controller loop observation, automatic follower validation, or source-commit-driven integration until the relevant `state-read`, `status-read`, `decide`, and `state-write` debug steps pass for the affected follower.

Stage and end-to-end timing budgets are observability and guidance signals, not hard failure gates. When a stage or total wall-clock exceeds its YAML budget, the CLI/controller should record `overBudget`, emit a warning/hint, keep exposing state and continue toward native completion when the underlying Tekton/Argo/runtime operation is still making progress. Do not fail, kill, or permanently block a follower solely because the timing budget elapsed; otherwise the timeout checker itself can become the source of hung or failed delivery. Real failures must come from native objects such as Job/TaskRun/PipelineRun/Argo/runtime conditions, explicit command failures, missing required source/config, or operator cancellation.
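
A minimal sketch of the budget rule above, assuming hypothetical names (`BudgetObservation`, `observeBudget`, `isFailed`); only the `overBudget` field name comes from this reference. It shows that an overrun produces a warning and never a failure verdict.

```typescript
// Hypothetical shape for illustration; only `overBudget` is named in this reference.
interface BudgetObservation {
  stage: string;
  elapsedSeconds: number;
  budgetSeconds: number;
  overBudget: boolean;
  warning: string | null;
}

// Records a budget overrun as a signal. It never returns or throws a failure.
function observeBudget(stage: string, elapsedSeconds: number, budgetSeconds: number): BudgetObservation {
  const overBudget = elapsedSeconds > budgetSeconds;
  return {
    stage,
    elapsedSeconds,
    budgetSeconds,
    overBudget,
    warning: overBudget
      ? `${stage} exceeded its ${budgetSeconds}s budget (${elapsedSeconds}s elapsed); still waiting on native completion`
      : null,
  };
}

// Failure is decided only by native conditions; the budget observation is deliberately ignored.
function isFailed(nativeFailed: boolean, _budget: BudgetObservation): boolean {
  return nativeFailed;
}
```

In this sketch `budgetSeconds` is passed in by the caller, which is expected to have read it from YAML.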

`debug-step` wrappers must be failure-visible and non-crashing. If the target-side Job fails, returns an older schema, or omits optional summary fields, the operator-facing CLI must render `-`/null plus the target error and Job identity; it must not throw a local TypeError before showing the target evidence.
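
One way to keep the wrapper non-crashing is to treat every target field as optional and render missing values as `-`. The sketch below assumes a hypothetical `TargetResult` shape; the field names are illustrative and not the actual schema.

```typescript
// Hypothetical target result; any field may be absent on failure or with an older schema.
interface TargetResult {
  jobName?: string;
  error?: string;
  summary?: { phase?: string; resourceVersion?: string };
}

// Render a possibly missing value as "-" instead of throwing.
function cell(value: string | number | null | undefined): string {
  return value === null || value === undefined || value === "" ? "-" : String(value);
}

// Always produces a line, even when the whole result is missing.
function renderDebugStep(result: TargetResult | null | undefined): string {
  return [
    `job=${cell(result?.jobName)}`,
    `phase=${cell(result?.summary?.phase)}`,
    `resourceVersion=${cell(result?.summary?.resourceVersion)}`,
    `error=${cell(result?.error)}`,
  ].join(" ");
}
```

Optional chaining keeps the target error and Job identity visible even when the summary object is absent.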

`debug-step` output must stay bounded in both text and JSON modes. The default machine payload should include step result, compact state/status/decision/write summaries, target Job identity and short error/timing fields only. Full target Job logs, full target JSON and long stdout/stderr tails belong behind explicit drill-down, not in the default `--json` payload.

Bounded JSON means the operator-facing `--json` payload must remain below the YAML-configured stdout limit in normal successful debug cases. Do not duplicate the same evidence as full top-level objects, compact `targetResult`, full `stateAfter` and target stdout tail at the same time; choose one compact representation by default and put full payload/log drill-down behind explicit commands.

Target-side state summaries used by `status`, `events`, `logs` and `debug-step state-read` must also remain below the transport stdout limit. When exposing stored native payloads, return gate summaries only: git-mirror, Tekton, Argo, runtime and short errors. Do not include full source objects, TaskRun item arrays, plan-artifact arrays, report payloads or full command payloads in the default state summary; a truncated state summary is a visibility defect because the operator can no longer parse the follower state.
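
As an illustration of "gate summaries only", the sketch below copies just the four gate summaries and short errors out of a stored payload and drops everything else. The `StoredState` shape, the gate key names and `compactState` are assumptions for this example; the message limit is a parameter because it is expected to come from YAML.

```typescript
type GateName = "gitMirror" | "tekton" | "argo" | "runtime";
const GATE_NAMES: GateName[] = ["gitMirror", "tekton", "argo", "runtime"];

interface GateSummary { ready: boolean; message: string }

// Hypothetical stored payload: gate summaries plus bulky fields that must stay on the target.
interface StoredState {
  gates: Partial<Record<GateName, GateSummary>>;
  errors?: string[];
  taskRuns?: unknown[];      // full item arrays: excluded from the default summary
  commandPayload?: unknown;  // full command payload: excluded from the default summary
}

interface CompactState {
  gates: Partial<Record<GateName, GateSummary>>;
  errors: string[];
}

// Truncate long text to at most `maxChars` characters, marking the cut with "...".
function short(text: string, maxChars: number): string {
  return text.length <= maxChars ? text : text.slice(0, Math.max(0, maxChars - 3)) + "...";
}

// Build the default summary from an allow-list, so new bulky fields are never copied by accident.
function compactState(state: StoredState, maxMessageChars: number): CompactState {
  const gates: Partial<Record<GateName, GateSummary>> = {};
  for (const name of GATE_NAMES) {
    const gate = state.gates[name];
    if (gate) gates[name] = { ready: gate.ready, message: short(gate.message, maxMessageChars) };
  }
  return { gates, errors: (state.errors ?? []).map((e) => short(e, maxMessageChars)) };
}
```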

Follower-scoped commands such as `status --follower`, `events --follower`, `logs --follower` and `debug-step --follower` must ask the target summary helper for only that follower's state. Do not fetch every follower and filter locally at the operator side; multi-follower summaries have different size budgets and should use lower per-follower stage limits.

Multi-follower status summaries should omit per-follower `command.payload`/native drill-down payloads entirely; those belong to follower-scoped `events`/`logs`/`debug-step` queries. Default all-follower status must remain parseable below the transport stdout limit.

`scripts/src/cicd.ts` must stay a thin top-level CI/CD route entry. Branch-follower implementation belongs in `scripts/src/cicd-branch-follower.ts` and responsibility-specific modules; rendering, debug steps, controller manifests, native K8s helpers, adapter-specific trigger/status logic and large data compactors must be split before any implementation file approaches the 3000-line hard split point.

`status-read`, `events`, `logs` and debug summaries must expose compact closeout gate details when a follower is not aligned: git-mirror readiness, Tekton PipelineRun condition, Argo sync/health, runtime target sha/readiness and short errors. Repeating only phase/observed/target/message is a visibility defect and must be fixed before further rollout tuning.

Argo closeout visibility must include the bounded reason for non-ready health, not only `Synced/Progressing`: health message, operation phase/message, short Application conditions and a small list of non-healthy resources when available.

Tekton failure visibility must include bounded TaskRun detail, not only PipelineRun `Failed`: failed TaskRuns, active TaskRuns and slow TaskRuns with task name, reason and duration. Without this, performance/failure work cannot move past the PipelineRun gate.

Default stage timing tables must prioritize failed, active and slow TaskRun rows before ordinary succeeded TaskRuns when the row budget is tight. Do not truncate TaskRuns purely by Kubernetes start time if that hides the first failing or slow task.
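
The row selection could look like the sketch below. The `TaskRunRow` shape and the relative order among failed, active and slow rows are assumptions; this reference only requires that all three come before ordinary succeeded rows.

```typescript
type TaskRunState = "failed" | "active" | "slow" | "succeeded";

interface TaskRunRow {
  task: string;
  state: TaskRunState;
  startedAt: string; // ISO 8601 timestamp
}

// Lower number means higher priority. The order among the first three is an assumption.
const PRIORITY: Record<TaskRunState, number> = { failed: 0, active: 1, slow: 2, succeeded: 3 };

// Sort by priority first and start time second, then apply the row budget.
function selectRows(rows: TaskRunRow[], rowBudget: number): TaskRunRow[] {
  return [...rows]
    .sort((a, b) => {
      const byPriority = PRIORITY[a.state] - PRIORITY[b.state];
      if (byPriority !== 0) return byPriority;
      return a.startedAt < b.startedAt ? -1 : a.startedAt > b.startedAt ? 1 : 0;
    })
    .slice(0, rowBudget);
}
```

With a budget of three rows, an early pair of succeeded tasks no longer hides a later failed task.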

When Argo exposes operation start/finish timestamps, stage timing rows should report the Argo operation duration directly. Missing timestamps still render `-`; do not infer Argo duration from total elapsed time or from unrelated runtime polling.

The automatic controller loop is non-blocking, so closeout acceleration cannot live only in the user-facing `--wait` path. Once a triggered PipelineRun has succeeded and required runtime/GitOps gates are not aligned, the in-cluster controller path should perform the same bounded target-side Argo refresh used by wait closeout; otherwise convergence depends on Argo's background poll interval and can exceed the 120s budget even when Tekton finished quickly.

The same rule applies to git-mirror post-flush. If native status shows runtime/Argo are aligned but GitOps mirror is still pending flush, the automatic controller loop must run the bounded target-side git-mirror flush instead of leaving a follower in `ClosingOut` until a manual wait/closeout path is used.

After an automatic closeout accelerator runs, the same reconcile must do a bounded native status re-read/poll and write the resulting state when it is already aligned. Do not defer the final `Noop` write to the next controller loop; loop interval plus another status-read can add enough idle time to exceed the 120s end-to-end budget even when PipelineRun, Argo and runtime are already ready. The re-read timeout must come from YAML follower budgets, and the short poll interval must come from YAML controller budgets. A single immediate re-read is insufficient when Argo accepts refresh first and updates operation/runtime state a few seconds later.
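
The bounded re-read can be reasoned about as a pure decision taken after each status read. The sketch below is an assumption about how such a decision might look; `nextPollAction` and its action names are hypothetical, and both budgets are parameters that the caller is expected to read from YAML.

```typescript
type PollAction = "write-aligned-state" | "poll-again" | "stop-and-leave-closing-out";

// Decide what the reconcile should do after one native status re-read.
// `timeoutSeconds` stands for the follower budget, `intervalSeconds` for the controller poll interval.
function nextPollAction(
  aligned: boolean,
  elapsedSeconds: number,
  intervalSeconds: number,
  timeoutSeconds: number,
): PollAction {
  // Aligned: write the final state in this reconcile instead of deferring to the next loop.
  if (aligned) return "write-aligned-state";
  // Another poll would exceed the bounded re-read window: stop and let a later loop continue.
  if (elapsedSeconds + intervalSeconds > timeoutSeconds) return "stop-and-leave-closing-out";
  return "poll-again";
}
```

Because the first read usually returns `poll-again` when Argo updates its state a few seconds after accepting a refresh, a single immediate re-read would miss the aligned state that this loop catches.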

Stage timing rows must not label optional gates as `not-ready` when they are not part of that follower's closeout contract. For sentinel-like followers without a GitOps branch flush gate, git-mirror source snapshot readiness should render as source-ready/ready, while missing GitOps `githubInSync` remains `-`/not-applicable instead of a failure-looking state.

## Source Authority

- Follower decisions must not read host source worktrees, target dev directories, `.worktree/*`, local git state, or direct GitHub branch refs.
- Controller pods use EmptyDir plus the YAML-declared k8s git-mirror cache PVC, sync GitHub refs from inside Kubernetes, clone UniDesk controller source from `/cache`, then run the CLI with the mounted registry.
- All GitHub/Git egress used by branch-follower source sync, adapter git-mirror sync/flush, PR/merge closeout helpers and controller bootstrap must resolve proxy settings from YAML/sourceRef. Controller GitHub SSH uses `config/cicd-branch-followers.yaml#controller.source.githubSsh`; runtime adapters use their owning lane/control-plane YAML host proxy refs such as `config/hwlab-node-control-plane.yaml#nodes.<NODE>.egressProxy`. Do not rely on undeclared pod env, host shell proxy variables, direct GitHub transport, or trans-side proxy defaults.
- Runtime source commits, build contexts, publish inputs and closeout status remain owned by each adapter's k8s git-mirror snapshot and runtime objects.
- Trigger adapters communicate through the Kubernetes API with the controller service account. Formal triggering, observation and closeout must not depend on downstream CLI stdout parsing, host worktrees, or operator shell state.
- Dirty, stale, or missing-dependency host worktrees are non-authoritative and must not change observed sha, trigger sha, PipelineRun, GitOps, or status output.
- `trans` or SSH may be used only by the operator CLI as a transport to create/read Kubernetes objects on the target cluster. It must not be part of branch-follower source sync, GitHub communication, status collection, decision making or closeout.

## YAML Ownership

`config/cicd-branch-followers.yaml` owns controller settings and the follower registry: id, adapter, source/target configRefs, command argv, native status object refs, closeout check labels and budgets.

It must not copy runtime/GitOps/Secret details from owning configs:

- HWLAB node lanes: `config/hwlab-node-lanes.yaml`
- AgentRun lanes: `config/agentrun.yaml`
- Web sentinel profiles/scenarios/reports/secrets: `config/hwlab-web-probe-sentinel/*.yaml`

Use configRef summaries in plan/status; do not create a `full.md` or super Markdown index.

Timeout, TTL, retry/backoff, reconcile interval and end-to-end budget values must be declared in YAML/source-of-truth fields. Do not introduce hidden numeric defaults in TypeScript, shell, native helper scripts, or controller manifests; helper code should read the configured values and fail structurally when required timing policy is missing.
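
A helper that "fails structurally" might look like the following sketch. `requireSeconds` and the config key used in the usage note are hypothetical; the point is that there is no `?? 30` style fallback anywhere.

```typescript
// Read a required positive number from parsed YAML by dotted path.
// Throws instead of substituting a default when the value is missing or invalid.
function requireSeconds(config: Record<string, unknown>, path: string): number {
  let node: unknown = config;
  for (const key of path.split(".")) {
    if (typeof node !== "object" || node === null) {
      throw new Error(`missing required timing policy: ${path}`);
    }
    node = (node as Record<string, unknown>)[key];
  }
  if (typeof node !== "number" || !Number.isFinite(node) || node <= 0) {
    throw new Error(`missing required timing policy: ${path}`);
  }
  return node;
}
```

For example, `requireSeconds(parsed, "controller.reconcileIntervalSeconds")` returns the configured value or throws with the offending path; that key name is illustrative only.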

## First Followers

- `hwlab-jd01-v03`: follows `pikasTech/HWLAB@v0.3`, adapter `hwlab-node-runtime`, native trigger `Tekton PipelineRun -> Argo Application closeout -> runtime Deployment sourceCommit readiness`.
- `agentrun-jd01-v02`: follows `pikasTech/agentrun@v0.2`, adapter `agentrun-yaml-lane`, native trigger `build image Job -> GitOps publish Job -> git-mirror flush Job -> Tekton PipelineRun -> Argo Application closeout -> runtime Deployment sourceCommit readiness`. The same source commit must use deterministic Job names so a later controller loop can resume or reuse already completed stages.
- `web-probe-sentinel-master`: follows `pikasTech/unidesk@master`, adapter `web-probe-sentinel-cicd`, native trigger `Tekton PipelineRun -> Argo Application closeout -> runtime Deployment sourceCommit readiness`.

These three followers are the initial production set. HWLAB and AgentRun both run on JD01; there is no D601 target in the automatic follower set unless YAML is explicitly changed.

## Reuse And Mirror Contract

The controller must preserve the runtime reuse capabilities that already exist in the runtime lanes:

- runtime reuse: if both code identity and env identity are unchanged for a microservice, skip rebuild and rollout for that service;
- env reuse: if code changed but env identity is unchanged, reuse the previous environment image and publish only the changed service artifact;
- git mirror: source sync, immutable source snapshot creation and GitOps flush are generic branch-follower stages, not adapter-local afterthoughts.

Runtime/env reuse configuration for branch-followed source repositories must live in the followed repository at `./gitops/reuse.yaml`. The branch-follower reads that file from the k8s git-mirror source snapshot, parses it through the shared reuse-config parser, and passes only the bounded redacted summary to adapter status/trigger payloads. Do not keep separate adapter-local reuse config as the authoritative source for branch-follower runs.

The reuse-plan gate must emit an adapter-consumable per-service decision, not only a parsed config summary. At minimum the bounded plan must show the source/env identity comparison outcome, runtime reuse hit/miss, env reuse hit/miss, whether CI should build an image, the existing image/ref to reuse when known, and the short reason. When a service's Dockerfile/env identity is unchanged, the plan decision is `skipImageBuild` for that service: CI must consume that decision and must not rebuild the environment image by independently re-inferring changes from source files, pipeline defaults or TaskRun logs. If only code changed, rollout should move the new source through git-mirror/runtime source sync while reusing the prior env image.
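
The per-service decision can be sketched as a small pure function. All type and field names below except `skipImageBuild` are assumptions, and the identity comparison itself is taken as an input rather than computed.

```typescript
// Hypothetical input: the outcome of the source/env identity comparison for one service.
interface ServiceIdentity {
  service: string;
  codeChanged: boolean;
  envChanged: boolean;
  previousEnvImage: string | null;
}

// Hypothetical bounded decision that CI and CD consume instead of re-inferring changes.
interface ReuseDecision {
  service: string;
  runtimeReuse: "hit" | "miss";
  envReuse: "hit" | "miss";
  skipImageBuild: boolean;
  skipRollout: boolean;
  reuseImage: string | null;
  reason: string;
}

function planService(id: ServiceIdentity): ReuseDecision {
  if (!id.codeChanged && !id.envChanged) {
    // Runtime reuse: nothing changed, so skip rebuild and rollout.
    return { service: id.service, runtimeReuse: "hit", envReuse: "hit", skipImageBuild: true,
      skipRollout: true, reuseImage: id.previousEnvImage, reason: "code and env identity unchanged" };
  }
  if (!id.envChanged) {
    // Env reuse: only code changed, so keep the previous environment image.
    return { service: id.service, runtimeReuse: "miss", envReuse: "hit", skipImageBuild: true,
      skipRollout: false, reuseImage: id.previousEnvImage, reason: "env identity unchanged; code changed" };
  }
  // Env identity changed: the environment image must be rebuilt.
  return { service: id.service, runtimeReuse: "miss", envReuse: "miss", skipImageBuild: false,
    skipRollout: false, reuseImage: null, reason: "env identity changed" };
}
```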

Adapters should expose reuse evidence through compact native state. HWLAB uses the `plan-artifacts` task event summary (`affectedServices`, `buildServices`, `reusedServices`, `artifactProvenanceAudit`). AgentRun publishes deterministic image/GitOps/git-mirror stage names and source-commit labels so a later loop can resume closeout without rebuilding completed stages. Sentinel keeps the same source/CI/Argo/runtime contract but has no GitOps branch flush gate.

The normal convergence budget is 120 seconds per source change. A follower may report `ClosingOut` while waiting for Argo/runtime readiness, but it must not report `Noop` when the source sha matches and required native gates such as git-mirror flush are still incomplete.
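
The `Noop` rule can be expressed as a check over the follower's required gates. The function and gate names in this sketch are hypothetical; it only covers the case where the observed source sha already matches the target.

```typescript
interface MatchingShaResult {
  phase: "Noop" | "ClosingOut";
  pendingGates: string[];
}

// Given the required native gates for a follower whose source sha already matches,
// report Noop only when every required gate is complete.
function phaseWhenShaMatches(requiredGates: Record<string, boolean>): MatchingShaResult {
  const pendingGates = Object.keys(requiredGates).filter((name) => requiredGates[name] !== true);
  return { phase: pendingGates.length === 0 ? "Noop" : "ClosingOut", pendingGates };
}
```

Optional gates that are not part of a follower's closeout contract are simply left out of `requiredGates`, so they cannot hold the follower in `ClosingOut`.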

## Status Contract

Default `status` output must show follower id, phase, adapter, source branch + observed sha, target sha, last triggered sha, last succeeded sha, in-flight job/PipelineRun, budget source, timing summary and next drill-down commands.

Stage timing must be queryable through normal CLI output, not only raw JSON. `status` and `run-once` print a bounded `STAGE TIMINGS` table with `total`, `status-read`, git-mirror, Kubernetes Job, PipelineRun, TaskRun, Argo, runtime and closeout rows when available. `followers[].timings` remains available in `--raw`/JSON for machine consumers.

`run-once` also prints a bounded `STATE WRITES` table whenever it writes follower state. The table must include follower id, write status, before/after ConfigMap resourceVersion, whether timing was preserved, exit code and a short message. Missing write evidence is a visibility defect; use `debug-step --step state-write` before any further full-loop validation.

`timings.totalSeconds` is the authoritative end-to-end wall-clock measurement for a triggered run: measure from `timings.startedAt` until `timings.finishedAt`, or until query time while closeout is still running. Do not compute total by summing stage rows, because stage rows can overlap, omit external waiting, or be reported by different native objects.
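
A minimal sketch of that measurement, assuming ISO 8601 timestamps and a hypothetical helper name. It returns `null` (rendered as `-`) when `startedAt` is missing rather than estimating from stage rows.

```typescript
// Wall-clock total for a triggered run: startedAt to finishedAt, or to query time while still running.
function totalSeconds(startedAt: string | null, finishedAt: string | null, queryTimeMs: number): number | null {
  if (startedAt === null) return null; // no start recorded: unknown, never inferred
  const start = Date.parse(startedAt);
  const end = finishedAt === null ? queryTimeMs : Date.parse(finishedAt);
  if (Number.isNaN(start) || Number.isNaN(end) || end < start) return null;
  return Math.round((end - start) / 1000);
}
```

As an example of why summing is wrong: two stages of 60s and 50s that overlap by 15s take 95s of wall-clock time, not 110s.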

Do not backfill, infer, or migrate old branch-follower state when historical timing, stage timing, or other observability fields are missing or known to be unreliable. Compatibility starts with future state written by the current controller; old missing data must render as `-`/unknown in CLI output instead of being recovered from unrelated native objects.

State writes must preserve same-source total timing at the target side. When a later native observation for the same follower and same observed source sha lacks `timings.totalSeconds` or `timings.startedAt`, the ConfigMap patch helper must read the existing follower state on the target node, keep the already-recorded total timing, and only replace stage rows/current gate details. This merge must happen in the target-side patch operation, not by host-side parsing or by a prior local read that can be overwritten by the next controller loop.
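
The merge rule can be sketched as below. The `FollowerState` and `Timings` shapes are assumptions; in the real flow this logic would have to run inside the target-side patch operation, against the state read at patch time.

```typescript
interface StageRow { name: string; seconds: number | null }

interface Timings {
  startedAt: string | null;
  totalSeconds: number | null;
  stages: StageRow[];
}

interface FollowerState {
  observedSha: string;
  timings: Timings;
}

// Keep already-recorded total timing for the same source sha; always take the new stage rows.
function mergeTimings(existing: FollowerState | null, incoming: FollowerState): FollowerState {
  if (existing === null || existing.observedSha !== incoming.observedSha) {
    return incoming; // different source change: nothing to preserve
  }
  return {
    ...incoming,
    timings: {
      startedAt: incoming.timings.startedAt ?? existing.timings.startedAt,
      totalSeconds: incoming.timings.totalSeconds ?? existing.timings.totalSeconds,
      stages: incoming.timings.stages,
    },
  };
}
```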

Controller self-upgrade has a one-loop source boundary: the controller Deployment uses the stable tools image, syncs UniDesk source into the k8s git-mirror cache, then clones `/work/unidesk` each reconcile. A UniDesk source commit that changes branch-follower controller logic can still be triggered by the previous checkout if the loop observes that commit before cloning it for execution. Do not use that self-upgrade source change to validate new controller-state semantics, and do not backfill its missing total timing. First confirm the target Pod checkout contains the fix, then validate future timing/state behavior with a later source change or an explicit target-side `run-once` that starts from a stored state written by the fixed controller.

When self-upgrade timing is unclear, use `debug-step --step controller-source` before pushing another source change. If the checkout identity is not visible, add that single-step visibility first; do not infer controller code version from a slow automatic rollout alone.

If a deterministic Kubernetes Job or PipelineRun is reused and there is no already-stored `timings.startedAt`, the reused object's current wait/check duration is only a stage observation; it must not be promoted to `timings.totalSeconds`.

When `run-once --confirm --wait` resumes a source change that is already `ClosingOut`, the CLI may wait for native closeout and report a `closeout` stage duration. That closeout-only wait is not the end-to-end total unless the stored state already contains a valid `timings.startedAt`.

State machine phases are `Observed`, `Noop`, `PendingTrigger`, `Triggering`, `ClosingOut`, `Succeeded`, `Failed`, `Superseded`, `Blocked`, and `Skipped`.
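
For consumers of the JSON output, the phase list can be modelled as a closed union. This is an illustrative typing, not necessarily how the controller declares it.

```typescript
// The ten phases listed above, as a readonly tuple.
const PHASES = [
  "Observed", "Noop", "PendingTrigger", "Triggering", "ClosingOut",
  "Succeeded", "Failed", "Superseded", "Blocked", "Skipped",
] as const;

type Phase = (typeof PHASES)[number];

// Narrow an arbitrary string (for example from stored state) to a known phase.
function isPhase(value: string): value is Phase {
  return (PHASES as readonly string[]).indexOf(value) !== -1;
}
```

A reader that meets an unknown phase string can then render it as `-`/unknown instead of guessing.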

Status and decision inputs are Kubernetes-native:

- source: k8s git-mirror cache ref and immutable snapshot ref;
- CI: Tekton `PipelineRun.status.conditions`;
- CI drill-down: compact TaskRun timings and plan-artifact reuse summary when available;
- git mirror: source snapshot readiness plus GitOps `pendingFlush`/`githubInSync` when the follower owns a GitOps branch;
- deployment: Argo `Application.status.sync` and `Application.status.health`;
- runtime: selected Deployment/StatefulSet readiness plus source commit labels, annotations or env.

The branch follower must not parse downstream CLI stdout/stderr, `kubectl` human tables, `argo` text, `tkn` text, or curl output to infer observed sha, target sha, readiness or closeout. `kubectl -o json` may be used inside the controller/Job as a structured Kubernetes API transport only.

In-cluster controller and native helper scripts must not require a `kubectl` binary in the image. Native helpers that read or write ConfigMaps, Jobs, PipelineRuns, Argo Applications, Pods or logs must use the serviceaccount token and Kubernetes HTTPS API directly, or a shared native helper that does the same. A missing `kubectl` binary is a product defect in the helper, not a node problem. Operator-side `kubectl` through the controlled CLI/trans boundary remains acceptable only as a transport/debug wrapper.

Native helper scripts that are reused in both execution planes must make the plane explicit. Inside a Pod/Job they use serviceaccount HTTPS API; from the operator/trans boundary they may use the controlled `kubectl` transport. A helper must not assume serviceaccount files exist on the target node, and must not assume `kubectl` exists inside the controller image.

The controller automatic loop submits trigger work without a blocking wait; later loops close out via the native state objects above. Failed state must not dedupe a source commit forever: retries may reuse deterministic native objects for the same source commit, and a new compact observation should be able to move the follower back into triggering or closeout.

State ConfigMaps must stay bounded and human-queryable. Store compact summaries, stage refs, conditions, short messages, and drill-down object names; do not store full API payloads or long log dumps. Cleanup is an explicit operator operation for stale/broken state and must not be required for normal convergence.

When retesting the same source sha after fixing controller/render inputs, `cleanup-state` only deletes the stored follower state. It does not delete deterministic native objects such as an existing PipelineRun, and decision logic may still treat that sha as already triggered. Do not loop on cleanup plus `run-once`; use an independently triggerable gate such as control-plane refresh or an explicit rerun/cleanup of the native object, then re-read status.

Status readers must compute near the data. When the operator CLI reaches a target node or k8s route through `trans`, the target NODE/k8s side must parse ConfigMap values, Kubernetes objects and log/event lists locally, then return only the bounded follower summary, timing rows, object names, counts and short tails needed by the CLI. Do not transmit complete ConfigMap entries, full API objects or long logs back to the host just so host-side TypeScript can parse and trim them.

Operator transport timing warnings such as `UNIDESK_SSH_TIMING` measure CLI/trans latency, not branch-follower CI/CD stage time or end-to-end convergence time. Do not mix those warnings into `timings.totalSeconds`, stage rows, or performance closeout evidence; when transport cost becomes noisy, reduce round trips by adding a target-side debug/status summary instead of pulling more raw output to the host.

Validation, test and performance evidence for branch-follower changes must also run on the target NODE/k8s runtime, not on the local/master host. For CI/CD changes, use the target node's Tekton/Argo/runtime objects, controlled CLI jobs, and target-side summary scripts as the evidence source; local tests may not be cited as convergence or performance proof.

Operator-facing commands must use intuitive target-side verbs instead of internal execution flags. From a local/master host, use `status --live`, `run-once ...`, `events`, or `logs`; these commands create a bounded target-side Job when live state is needed. The internal `--in-cluster` flag is reserved for the Kubernetes Job/Pod command line after the registry, serviceaccount, in-cluster API endpoint and EmptyDir source checkout are mounted. It must not appear in user-facing examples.

`--in-cluster` only selects the execution environment. It must not imply `--wait`, closeout blocking, longer budgets, or sequential waiting across followers. Only an explicit user-facing `--wait` may perform blocking closeout waits; the automatic controller loop must submit/observe one bounded step and let later loops advance state from native Kubernetes objects.

Legacy `--controller` is accepted only as a compatibility spelling: inside Kubernetes it maps to `--in-cluster`, while outside Kubernetes it behaves like the ordinary public target-side path rather than running in-cluster logic locally. If an internal flag, hidden mode, or operator shortcut is misused and can write partial state or misleading evidence, stop feature work and simplify the public command semantics plus this reference before continuing.
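
The flag semantics from the last three paragraphs can be summarized in one resolver. `resolveExecution` and its types are hypothetical; how the CLI detects that it runs inside Kubernetes is left to the caller and passed in as a boolean.

```typescript
interface Flags {
  inCluster: boolean;   // internal --in-cluster
  controller: boolean;  // legacy --controller
  wait: boolean;        // user-facing --wait
}

interface Execution {
  mode: "in-cluster" | "target-side-job";
  blockingWait: boolean;
}

function resolveExecution(flags: Flags, insideKubernetes: boolean): Execution {
  // Legacy --controller maps to --in-cluster only when actually running inside Kubernetes.
  const inCluster = flags.inCluster || (flags.controller && insideKubernetes);
  return {
    mode: inCluster ? "in-cluster" : "target-side-job",
    // Only an explicit --wait enables blocking closeout; the execution mode never implies it.
    blockingWait: flags.wait,
  };
}
```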

`run-once --dry-run` is read-only for deployment: it may refresh the state ConfigMap with current native observations, but it must not trigger adapters.