
CI/CD Branch Follower

SPEC: PJ2026-01060703 CI/CD branch follower draft-2026-07-03-p0-branch-follower

Entrypoints

bun scripts/cli.ts cicd branch-follower plan
bun scripts/cli.ts cicd branch-follower apply --confirm --wait
bun scripts/cli.ts cicd branch-follower status
bun scripts/cli.ts cicd branch-follower status --live
bun scripts/cli.ts cicd branch-follower run-once --all --dry-run
bun scripts/cli.ts cicd branch-follower run-once --follower <id> --confirm --wait
bun scripts/cli.ts cicd branch-follower debug-step --follower <id> --step state-read
bun scripts/cli.ts cicd branch-follower debug-step --follower <id> --step status-read
bun scripts/cli.ts cicd branch-follower debug-step --follower <id> --step decide
bun scripts/cli.ts cicd branch-follower debug-step --follower <id> --step state-write --confirm
bun scripts/cli.ts cicd branch-follower events --follower <id>
bun scripts/cli.ts cicd branch-follower logs --follower <id>

apply --confirm --wait is the one-command deploy/update entry for the K8s controller. status is the default intermediate-state query. status --live and local run-once submit a bounded K8s reconcile Job; the Job performs all source, Tekton, Argo and runtime reads inside the cluster and may write only the compact state summary. events and logs are read-only drill-downs for the same Kubernetes-native state. run-once --confirm --wait is the manual one-command trigger and closeout path.

debug-step is the required single-step troubleshooting entry before changing branch-follower code for repeated CI/CD convergence issues. It runs in a bounded target-side Job when called from the operator host, and uses the same controller modules as the real flow:

  • state-read: read only the compact ConfigMap state, value bytes, resourceVersion and _updatedAt.
  • status-read: read native source/Tekton/Argo/runtime status without triggering adapters.
  • decide: run the decision function in dry-run mode without triggering adapters or writing state.
  • state-write --confirm: patch the stored follower state back through the normal ConfigMap write helper and report before/after resourceVersion; this is for isolating state write failures, not for normal rollout.

Do not debug the same state/read/write problem by repeatedly pushing empty or tiny source commits to drive the full automatic follower loop.

When a repeated runtime pitfall or visibility defect is found during branch-follower work, update this reference or the skill entry first, then continue with the narrow debug step. Do not proceed to run-once, controller loop observation, automatic follower validation, or source-commit-driven integration until the relevant state-read, status-read, decide, and state-write debug steps pass for the affected follower.

debug-step wrappers must be failure-visible and non-crashing. If the target-side Job fails, returns an older schema, or omits optional summary fields, the operator-facing CLI must render -/null plus the target error and Job identity; it must not throw a local TypeError before showing the target evidence.
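A minimal sketch of the failure-visible rendering rule above. The `DebugStepResult` shape and its field names (`jobName`, `state.resourceVersion`, and so on) are hypothetical, chosen only for illustration; the actual CLI schema may differ.

```typescript
// Hypothetical shape of what a target-side debug-step Job may return.
// Every field is optional because a failed Job or an older schema can omit it.
interface DebugStepResult {
  ok?: boolean;
  jobName?: string;
  error?: string;
  state?: { resourceVersion?: string; updatedAt?: string } | null;
}

// Render a possibly-missing value as "-" instead of throwing.
function show(value: unknown): string {
  return value === undefined || value === null || value === "" ? "-" : String(value);
}

// Build operator-facing lines without assuming any field exists.
function renderDebugStep(result: DebugStepResult | null | undefined): string[] {
  const r: DebugStepResult = result ?? {};
  return [
    `status: ${r.ok === true ? "ok" : "failed"}`,
    `job: ${show(r.jobName)}`,
    `resourceVersion: ${show(r.state?.resourceVersion)}`,
    `updatedAt: ${show(r.state?.updatedAt)}`,
    `error: ${show(r.error)}`,
  ];
}
```

With this shape, a Job that fails before reading state still yields the Job identity and the target error, and the missing state fields render as `-`.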

debug-step output must stay bounded in both text and JSON modes. The default machine payload should include step result, compact state/status/decision/write summaries, target Job identity and short error/timing fields only. Full target Job logs, full target JSON and long stdout/stderr tails belong behind explicit drill-down, not in the default --json payload.
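One way to keep the default payload bounded is to trim log tails explicitly and record that trimming happened. This is a sketch only; `boundTail` is a hypothetical helper, and the limit is assumed to be passed in from configuration rather than hard-coded.

```typescript
// Keep only the last `maxChars` characters of a log tail and flag the truncation.
// The caller supplies the limit; this helper does not define a default.
function boundTail(text: string, maxChars: number): { tail: string; truncated: boolean } {
  if (text.length <= maxChars) return { tail: text, truncated: false };
  return { tail: text.slice(text.length - maxChars), truncated: true };
}
```

The `truncated` flag tells the reader that a drill-down command is needed to see the full output.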

status-read, events, logs and debug summaries must expose compact closeout gate details when a follower is not aligned: git-mirror readiness, Tekton PipelineRun condition, Argo sync/health, runtime target sha/readiness and short errors. Repeating only phase/observed/target/message is a visibility defect and must be fixed before further rollout tuning.

Stage timing rows must not label optional gates as not-ready when they are not part of that follower's closeout contract. For sentinel-like followers without a GitOps branch flush gate, git-mirror source snapshot readiness should render as source-ready/ready, while missing GitOps githubInSync remains -/not-applicable instead of a failure-looking state.
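The rule above can be sketched as a small mapping from "is this gate part of the closeout contract" to the rendered state. The function name and the `GateState` labels are assumptions for illustration.

```typescript
type GateState = "ready" | "not-ready" | "not-applicable";

// `required`: whether the gate belongs to this follower's closeout contract.
// `observed`: the native readiness value, or undefined when nothing reports it.
function gateState(required: boolean, observed: boolean | undefined): GateState {
  if (!required) return "not-applicable"; // optional gate: never failure-looking
  return observed === true ? "ready" : "not-ready";
}
```

For a sentinel-like follower, the GitOps flush gate would be called with `required = false` and so never shows up as `not-ready`.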

Source Authority

  • Follower decisions must not read host source worktrees, target dev directories, .worktree/*, local git state, or direct GitHub branch refs.
  • Controller pods use EmptyDir plus the YAML-declared k8s git-mirror cache PVC, sync GitHub refs from inside Kubernetes, clone UniDesk controller source from /cache, then run the CLI with the mounted registry.
  • All GitHub/Git egress used by branch-follower source sync, adapter git-mirror sync/flush, PR/merge closeout helpers and controller bootstrap must resolve proxy settings from YAML/sourceRef. Controller GitHub SSH uses config/cicd-branch-followers.yaml#controller.source.githubSsh; runtime adapters use their owning lane/control-plane YAML host proxy refs such as config/hwlab-node-control-plane.yaml#nodes.<NODE>.egressProxy. Do not rely on undeclared pod env, host shell proxy variables, direct GitHub transport, or trans-side proxy defaults.
  • Runtime source commits, build contexts, publish inputs and closeout status remain owned by each adapter's k8s git-mirror snapshot and runtime objects.
  • Trigger adapters communicate through the Kubernetes API with the controller service account. Formal triggering, observation and closeout must not depend on downstream CLI stdout parsing, host worktrees, or operator shell state.
  • Dirty, stale, or missing-dependency host worktrees are non-authoritative and must not change observed sha, trigger sha, PipelineRun, GitOps, or status output.
  • trans or SSH may be used only by the operator CLI as a transport to create/read Kubernetes objects on the target cluster. It must not be part of branch-follower source sync, GitHub communication, status collection, decision making or closeout.

YAML Ownership

config/cicd-branch-followers.yaml owns controller settings and the follower registry: id, adapter, source/target configRefs, command argv, native status object refs, closeout check labels and budgets.

It must not copy runtime/GitOps/Secret details from owning configs:

  • HWLAB node lanes: config/hwlab-node-lanes.yaml
  • AgentRun lanes: config/agentrun.yaml
  • Web sentinel profiles/scenarios/reports/secrets: config/hwlab-web-probe-sentinel/*.yaml

Use configRef summaries in plan/status; do not create a full.md or super Markdown index.

Timeout, TTL, retry/backoff, reconcile interval and end-to-end budget values must be declared in YAML/source-of-truth fields. Do not introduce hidden numeric defaults in TypeScript, shell, native helper scripts, or controller manifests; helper code should read the configured values and fail structurally when required timing policy is missing.
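A minimal sketch of a helper that reads a configured timing value and fails when it is missing, rather than substituting a default. The `TimingPolicy` keys are hypothetical; the real YAML field names are not specified in this reference.

```typescript
// Hypothetical timing policy fields, assumed to be parsed from YAML by the caller.
interface TimingPolicy {
  reconcileIntervalSeconds?: number;
  endToEndBudgetSeconds?: number;
}

// Return the configured value, or throw an error that names the missing key and its source.
function requireSeconds(policy: TimingPolicy, key: keyof TimingPolicy, source: string): number {
  const value = policy[key];
  if (typeof value !== "number" || !Number.isFinite(value) || value <= 0) {
    throw new Error(`missing timing policy: ${source}#${String(key)}`);
  }
  return value;
}
```

The error message points back at the owning config, so the fix is made in YAML and not in code.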

First Followers

  • hwlab-jd01-v03: follows pikasTech/HWLAB@v0.3, adapter hwlab-node-runtime, native trigger Tekton PipelineRun -> Argo Application closeout -> runtime Deployment sourceCommit readiness.
  • agentrun-jd01-v02: follows pikasTech/agentrun@v0.2, adapter agentrun-yaml-lane, native trigger build image Job -> GitOps publish Job -> git-mirror flush Job -> Tekton PipelineRun -> Argo Application closeout -> runtime Deployment sourceCommit readiness. The same source commit must use deterministic Job names so a later controller loop can resume or reuse already completed stages.
  • web-probe-sentinel-master: follows pikasTech/unidesk@master, adapter web-probe-sentinel-cicd, native trigger Tekton PipelineRun -> Argo Application closeout -> runtime Deployment sourceCommit readiness.

These three followers are the initial production set. HWLAB and AgentRun both run on JD01; there is no D601 target in the automatic follower set unless YAML is explicitly changed.
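The deterministic Job naming requirement for AgentRun can be sketched as below. The naming scheme, the 12-character sha prefix and the conservative 63-character limit are assumptions for illustration, not the adapter's actual convention.

```typescript
// Derive a stable Job name from follower id, stage and source commit so that
// a later controller loop computes the same name and can find the existing Job.
function deterministicJobName(followerId: string, stage: string, sourceSha: string): string {
  // Lowercase and replace anything outside [a-z0-9-] with "-", then trim dashes.
  const clean = (s: string): string =>
    s.toLowerCase().replace(/[^a-z0-9-]+/g, "-").replace(/^-+|-+$/g, "");
  const sha = clean(sourceSha).slice(0, 12);
  // Leave room for "-<sha>" inside an assumed 63-character budget.
  const prefix = clean(`${followerId}-${stage}`).slice(0, 63 - sha.length - 1).replace(/-+$/, "");
  return `${prefix}-${sha}`;
}
```

Because the name depends only on its inputs, retrying the same source commit resolves to the same object instead of creating a duplicate.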

Reuse And Mirror Contract

The controller must preserve the runtime reuse capabilities that already exist in the runtime lanes:

  • runtime reuse: if both code identity and env identity are unchanged for a microservice, skip rebuild and rollout for that service;
  • env reuse: if code changed but env identity is unchanged, reuse the previous environment image and publish only the changed service artifact;
  • git mirror: source sync, immutable source snapshot creation and GitOps flush are generic branch-follower stages, not adapter-local afterthoughts.
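The first two reuse rules above can be sketched as a per-service decision. The identity strings are treated as opaque values, and the `full-rebuild` fallback for a changed or unknown env identity is an assumption; the reference does not spell out that case.

```typescript
type ReuseDecision = "skip" | "reuse-env-image" | "full-rebuild";

// Opaque identities (for example content hashes); how they are computed is out of scope.
interface ServiceIdentity {
  code: string;
  env: string;
}

function decideReuse(previous: ServiceIdentity | undefined, current: ServiceIdentity): ReuseDecision {
  if (!previous) return "full-rebuild"; // assumption: nothing to reuse yet
  const codeSame = previous.code === current.code;
  const envSame = previous.env === current.env;
  if (codeSame && envSame) return "skip"; // runtime reuse
  if (envSame) return "reuse-env-image"; // env reuse: publish only the changed artifact
  return "full-rebuild"; // assumption: env identity changed
}
```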

Adapters should expose reuse evidence through compact native state. HWLAB uses the plan-artifacts task event summary (affectedServices, buildServices, reusedServices, artifactProvenanceAudit). AgentRun publishes deterministic image/GitOps/git-mirror stage names and source-commit labels so a later loop can resume closeout without rebuilding completed stages. Sentinel keeps the same source/CI/Argo/runtime contract but has no GitOps branch flush gate.

The normal convergence budget is 120 seconds per source change. A follower may report ClosingOut while waiting for Argo/runtime readiness. It must not report Noop while required native gates such as git-mirror flush are still incomplete, even when the source sha already matches.
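A simplified sketch of the Noop rule, covering only three of the phases. The `Observation` fields are hypothetical and the real decision function takes more inputs.

```typescript
type SketchPhase = "Noop" | "PendingTrigger" | "ClosingOut";

interface Observation {
  observedSha: string;
  lastTriggeredSha?: string;
  lastSucceededSha?: string;
  requiredGatesReady: boolean; // e.g. git-mirror flush, Argo, runtime readiness
}

function decidePhase(o: Observation): SketchPhase {
  const known = o.lastTriggeredSha === o.observedSha || o.lastSucceededSha === o.observedSha;
  if (!known) return "PendingTrigger";
  // A matching sha alone is not enough for Noop: required gates must also be complete.
  return o.requiredGatesReady ? "Noop" : "ClosingOut";
}
```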

Status Contract

Default status output must show follower id, phase, adapter, source branch + observed sha, target sha, last triggered sha, last succeeded sha, in-flight job/PipelineRun, budget source, timing summary and next drill-down commands.

Stage timing must be queryable through normal CLI output, not only raw JSON. status and run-once print a bounded STAGE TIMINGS table with total, status-read, git-mirror, Kubernetes Job, PipelineRun, TaskRun, Argo, runtime and closeout rows when available. followers[].timings remains available in --raw/JSON for machine consumers.

run-once also prints a bounded STATE WRITES table whenever it writes follower state. The table must include follower id, write status, before/after ConfigMap resourceVersion, whether timing was preserved, exit code and a short message. Missing write evidence is a visibility defect; use debug-step --step state-write before any further full-loop validation.

timings.totalSeconds is the authoritative end-to-end wall-clock measurement for a triggered run: measure from timings.startedAt until timings.finishedAt, or until query time while closeout is still running. Do not compute total by summing stage rows, because stage rows can overlap, omit external waiting, or be reported by different native objects.
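The measurement rule can be sketched as follows, assuming `startedAt` and `finishedAt` are ISO 8601 timestamps. The function name and the choice to return `undefined` for missing or invalid input are illustrative.

```typescript
interface Timings {
  startedAt?: string;
  finishedAt?: string;
}

// Wall-clock total in seconds, or undefined when no valid start is recorded.
// While closeout is still running (no finishedAt), measure up to `now`.
function totalSeconds(t: Timings, now: Date): number | undefined {
  if (!t.startedAt) return undefined;
  const start = Date.parse(t.startedAt);
  if (Number.isNaN(start)) return undefined;
  const end = t.finishedAt ? Date.parse(t.finishedAt) : now.getTime();
  if (Number.isNaN(end) || end < start) return undefined;
  return Math.round((end - start) / 1000);
}
```

Stage durations are deliberately not an input here, which keeps the total independent of overlapping or missing stage rows.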

Do not backfill, infer, or migrate old branch-follower state when historical timing, stage timing, or other observability fields are missing or known to be unreliable. Compatibility starts with future state written by the current controller; old missing data must render as -/unknown in CLI output instead of being recovered from unrelated native objects.

State writes must preserve same-source total timing at the target side. When a later native observation for the same follower and same observed source sha lacks timings.totalSeconds or timings.startedAt, the ConfigMap patch helper must read the existing follower state on the target node, keep the already-recorded total timing, and only replace stage rows/current gate details. This merge must happen in the target-side patch operation, not by host-side parsing or by a prior local read that can be overwritten by the next controller loop.
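The merge rule itself can be sketched as a pure function. The `FollowerState` shape is hypothetical, and the sketch shows only the merge logic; per the paragraph above, it would have to run inside the target-side patch operation, not on the host.

```typescript
interface StoredTimings {
  startedAt?: string;
  totalSeconds?: number;
  stages?: Record<string, number>;
}

interface FollowerState {
  observedSha: string;
  timings?: StoredTimings;
}

// Keep already-recorded total timing for the same source sha;
// stage rows and everything else come from the incoming observation.
function mergeFollowerState(existing: FollowerState | undefined, incoming: FollowerState): FollowerState {
  if (!existing || existing.observedSha !== incoming.observedSha) return incoming;
  const prev: StoredTimings = existing.timings ?? {};
  const next: StoredTimings = incoming.timings ?? {};
  return {
    ...incoming,
    timings: {
      ...next,
      startedAt: next.startedAt ?? prev.startedAt,
      totalSeconds: next.totalSeconds ?? prev.totalSeconds,
    },
  };
}
```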

Controller self-upgrade has a one-loop source boundary: the controller Deployment uses the stable tools image, syncs UniDesk source into the k8s git-mirror cache, then clones it into /work/unidesk on each reconcile. A UniDesk source commit that changes branch-follower controller logic can still be triggered by the previous checkout if the loop observes that commit before cloning it for execution. Do not use that self-upgrade source change to validate new controller-state semantics, and do not backfill its missing total timing. First confirm the target Pod checkout contains the fix, then validate future timing/state behavior with a later source change or an explicit target-side run-once that starts from a stored state written by the fixed controller.

If a deterministic Kubernetes Job or PipelineRun is reused and there is no already-stored timings.startedAt, the reused object's current wait/check duration is only a stage observation; it must not be promoted to timings.totalSeconds.

When run-once --confirm --wait resumes a source change that is already ClosingOut, the CLI may wait for native closeout and report a closeout stage duration. That closeout-only wait is not the end-to-end total unless the stored state already contains a valid timings.startedAt.

State machine phases are Observed, Noop, PendingTrigger, Triggering, ClosingOut, Succeeded, Failed, Superseded, Blocked, and Skipped.
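The phase list can be expressed as a closed TypeScript union with a runtime guard. The names `PHASES`, `FollowerPhase` and `isFollowerPhase` are illustrative and not taken from the controller source.

```typescript
const PHASES = [
  "Observed",
  "Noop",
  "PendingTrigger",
  "Triggering",
  "ClosingOut",
  "Succeeded",
  "Failed",
  "Superseded",
  "Blocked",
  "Skipped",
] as const;

type FollowerPhase = (typeof PHASES)[number];

// Runtime guard for values read back from stored state, which may be stale or malformed.
function isFollowerPhase(value: unknown): value is FollowerPhase {
  return typeof value === "string" && (PHASES as readonly string[]).includes(value);
}
```

A guard like this lets status rendering fall back to `-`/unknown for an unrecognized stored phase instead of trusting it.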

Status and decision inputs are Kubernetes-native:

  • source: k8s git-mirror cache ref and immutable snapshot ref;
  • CI: Tekton PipelineRun.status.conditions;
  • CI drill-down: compact TaskRun timings and plan-artifact reuse summary when available;
  • git mirror: source snapshot readiness plus GitOps pendingFlush/githubInSync when the follower owns a GitOps branch;
  • deployment: Argo Application.status.sync and Application.status.health;
  • runtime: selected Deployment/StatefulSet readiness plus source commit labels, annotations or env.
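A sketch of reading two of the native inputs listed above from already-parsed JSON objects. Tekton reports PipelineRun completion through a condition of type `Succeeded` (status `True`, `False` or `Unknown`), and Argo CD exposes `status.sync.status` and `status.health.status`. The function names and the returned labels are assumptions.

```typescript
interface Condition {
  type: string;
  status: string;
  reason?: string;
  message?: string;
}

// Map the Tekton "Succeeded" condition to a compact outcome.
function pipelineRunOutcome(
  conditions: Condition[] | undefined,
): "succeeded" | "failed" | "running" | "unknown" {
  const c = conditions?.find((x) => x.type === "Succeeded");
  if (!c) return "unknown";
  if (c.status === "True") return "succeeded";
  if (c.status === "False") return "failed";
  return "running"; // status "Unknown" means the run has not finished
}

interface ArgoApplication {
  status?: { sync?: { status?: string }; health?: { status?: string } };
}

// An Application counts as aligned only when it is both Synced and Healthy.
function argoAligned(app: ArgoApplication): boolean {
  return app.status?.sync?.status === "Synced" && app.status?.health?.status === "Healthy";
}
```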

The branch follower must not parse downstream CLI stdout/stderr, kubectl human tables, argo text, tkn text, or curl output to infer observed sha, target sha, readiness or closeout. kubectl -o json may be used inside the controller/Job as a structured Kubernetes API transport only.

In-cluster controller and native helper scripts must not require a kubectl binary in the image. Native helpers that read or write ConfigMaps, Jobs, PipelineRuns, Argo Applications, Pods or logs must use the serviceaccount token and Kubernetes HTTPS API directly, or a shared native helper that does the same. A missing kubectl binary is a product defect in the helper, not a node problem. Operator-side kubectl through the controlled CLI/trans boundary remains acceptable only as a transport/debug wrapper.

Native helper scripts that are reused in both execution planes must make the plane explicit. Inside a Pod/Job they use serviceaccount HTTPS API; from the operator/trans boundary they may use the controlled kubectl transport. A helper must not assume serviceaccount files exist on the target node, and must not assume kubectl exists inside the controller image.
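A minimal sketch of the in-Pod plane: building a ConfigMap read request against the Kubernetes HTTPS API without kubectl. `KUBERNETES_SERVICE_HOST` and `KUBERNETES_SERVICE_PORT` are the standard environment variables Kubernetes injects into Pods. The function name is hypothetical, and the caller is assumed to read the token and CA from the serviceaccount mount (`/var/run/secrets/kubernetes.io/serviceaccount`).

```typescript
interface ApiRequest {
  url: string;
  headers: Record<string, string>;
}

// Build the request for GET /api/v1/namespaces/<ns>/configmaps/<name>.
// Fails explicitly when not inside a Pod rather than guessing an endpoint.
function configMapRequest(
  env: Record<string, string | undefined>,
  token: string,
  namespace: string,
  name: string,
): ApiRequest {
  const host = env.KUBERNETES_SERVICE_HOST;
  const port = env.KUBERNETES_SERVICE_PORT;
  if (!host || !port) {
    throw new Error("in-cluster plane unavailable: KUBERNETES_SERVICE_HOST/PORT not set");
  }
  const path = `/api/v1/namespaces/${encodeURIComponent(namespace)}/configmaps/${encodeURIComponent(name)}`;
  return {
    url: `https://${host}:${port}${path}`,
    headers: { Authorization: `Bearer ${token}`, Accept: "application/json" },
  };
}
```

Passing `env` in as an argument keeps the plane explicit: the operator-side path never calls this helper, and a call outside a Pod fails with a clear message.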

The controller automatic loop submits trigger work without a blocking wait; later loops close out via the native state objects above. A Failed state must not dedupe a source commit permanently: retries may reuse deterministic native objects for the same source commit, and a new compact observation should be able to move the follower back into triggering or closeout.

State ConfigMaps must stay bounded and human-queryable. Store compact summaries, stage refs, conditions, short messages, and drill-down object names; do not store full API payloads or long log dumps. Cleanup is an explicit operator operation for stale/broken state and must not be required for normal convergence.

Status readers must compute near the data. When the operator CLI reaches a target node or k8s route through trans, the target NODE/k8s side must parse ConfigMap values, Kubernetes objects and log/event lists locally, then return only the bounded follower summary, timing rows, object names, counts and short tails needed by the CLI. Do not transmit complete ConfigMap entries, full API objects or long logs back to the host just so host-side TypeScript can parse and trim them.

Validation, test and performance evidence for branch-follower changes must also run on the target NODE/k8s runtime, not on the local/master host. For CI/CD changes, use the target node's Tekton/Argo/runtime objects, controlled CLI jobs, and target-side summary scripts as the evidence source; local tests must not be cited as convergence or performance proof.

Operator-facing commands must use intuitive target-side verbs instead of internal execution flags. From a local/master host, use status --live, run-once ..., events, or logs; these commands create a bounded target-side Job when live state is needed. The internal --in-cluster flag is reserved for the Kubernetes Job/Pod command line after the registry, serviceaccount, in-cluster API endpoint and EmptyDir source checkout are mounted. It must not appear in user-facing examples.

--in-cluster only selects the execution environment. It must not imply --wait, closeout blocking, longer budgets, or sequential waiting across followers. Only an explicit user-facing --wait may perform blocking closeout waits; the automatic controller loop must submit/observe one bounded step and let later loops advance state from native Kubernetes objects.

Legacy --controller is accepted only as a compatibility spelling: inside Kubernetes it maps to --in-cluster, while outside Kubernetes it behaves like the ordinary public target-side path rather than running in-cluster logic locally. If an internal flag, hidden mode, or operator shortcut is misused and can write partial state or misleading evidence, stop feature work and simplify the public command semantics plus this reference before continuing.
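The flag semantics from the last two paragraphs can be sketched as a pure resolver. The `Flags` shape and the `insideKubernetes` input are assumptions; how the real CLI detects its environment is not specified here.

```typescript
interface Flags {
  inCluster?: boolean;
  controller?: boolean; // legacy compatibility spelling
  wait?: boolean;
}

interface Execution {
  inCluster: boolean;
  blockOnCloseout: boolean;
}

function resolveExecution(flags: Flags, insideKubernetes: boolean): Execution {
  // Legacy --controller maps to --in-cluster only inside Kubernetes;
  // outside, it falls through to the ordinary public target-side path.
  const inCluster = flags.inCluster === true || (flags.controller === true && insideKubernetes);
  // Only an explicit --wait blocks on closeout; the execution plane never implies it.
  return { inCluster, blockOnCloseout: flags.wait === true };
}
```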

run-once --dry-run is read-only for deployment: it may refresh the state ConfigMap with current native observations, but it must not trigger adapters.