DevOps Hygiene
This document is the authoritative source for UniDesk deployment hygiene: Git-backed deployment truth, dirty-environment boundaries, bounded manual operations and CI source-auth rules. Release-line and CI/CD runtime-version governance is owned by docs/reference/release-governance.md and GitHub issue #6. If the same hygiene rule would need edits in docs/reference/dev-environment.md, docs/reference/deploy.md, docs/reference/ci.md, docs/reference/dev-ci-runner.md, docs/reference/deployment.md, AGENTS.md or TEST.md, keep the detailed rule here and leave only a cross-reference elsewhere.
Source Of Truth
UniDesk deployment state is healthy only when all three layers point back to pushed Git commits:
- Desired state: origin/master:deploy.json, config.json and committed service manifests.
- Runtime state: live service metadata, image tags, Deployment annotations/env stamps or Compose labels that identify the deployed commit.
- Verification state: deploy plan, health checks, server status, service proxy checks and CI/e2e results that agree with the desired commit.
Local worktrees, D601 runtime files, copied scripts, copied images, ad-hoc Kubernetes objects and one-off curl results are never deployment truth.
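The three-layer rule can be read as a simple agreement check. The following is a minimal sketch, assuming each layer can be reduced to a single commit identifier; the interface and function names are hypothetical and are not part of the UniDesk CLI.

```typescript
// Hypothetical shapes: field names are illustrative, not a UniDesk schema.
interface LayerCommits {
  desired: string;      // commit resolved from origin/master:deploy.json
  runtime: string;      // commit read from live metadata (annotation, env stamp, label)
  verification: string; // commit that deploy plan / CI / e2e results were produced for
}

type LayerReport =
  | { healthy: true; commit: string }
  | { healthy: false; mismatches: string[] };

// Deployment state is healthy only when every layer points at the desired commit.
function compareLayers(layers: LayerCommits): LayerReport {
  const mismatches: string[] = [];
  if (layers.runtime !== layers.desired) {
    mismatches.push(`runtime ${layers.runtime} != desired ${layers.desired}`);
  }
  if (layers.verification !== layers.desired) {
    mismatches.push(`verification ${layers.verification} != desired ${layers.desired}`);
  }
  return mismatches.length === 0
    ? { healthy: true, commit: layers.desired }
    : { healthy: false, mismatches };
}
```

The desired commit is the reference point on purpose: runtime and verification are compared against it, never against each other.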
When stable release lanes such as release/v1 are enabled, the desired-state ref must be explicit in the command, job log and deploy output. Until that support exists, commands that are documented to read origin/master:deploy.json must keep doing so and must not silently switch to another branch or a dirty manifest.
Prohibited Deployment Truth
The following practices are not acceptable as the long-term or hidden source of a working environment:
- Hand editing D601 runtime files, k3s manifests, ConfigMaps, Deployments or container env values and treating the live result as source of truth.
- Rebuilding backend-core, frontend, k3sctl-adapter or other managed services from a dirty worktree on the master server, D601 or an operator machine.
- Copying large local shell scripts, generated manifests, Docker images or application source to D601 as the main deployment mechanism.
- Fixing dev or production reachability by adding direct D601 public ports, NodePorts or backend-core hardcoded service entries instead of updating the proper catalog/control bridge.
- Treating server rebuild backend-core as a Rust backend-core iteration path; Rust build/check belongs to D601 CI and CD must consume the published artifact.
- Using local-manifest production deploy for services that already have artifact consumers. backend-core, frontend, baidu-netdisk and decision-center production deploys must enter through deploy apply --env prod so CD consumes a commit-pinned registry artifact instead of silently falling back to target-side source build.
- Treating upstream images as UniDesk source-build services. filebrowser and filebrowser-d601 are upstream-image consumers; they require digest pin or digest-verified mirror governance and must not be added to Dockerfile CI artifacts.
- Considering manual curl, kubectl or Docker checks sufficient when live commit metadata, deploy plan, health checks and CI/e2e disagree.
Bounded Manual Operations
Manual operations are allowed only when they are narrow, visible and followed by commit-based verification:
- server rebuild dev-frontend-proxy may update the thin main-server nginx proxy for the public dev UI port; it must not build backend/frontend application code.
- deploy apply --service k3sctl-adapter may update the k3s control bridge catalog through the documented local manifest exception in docs/reference/deploy.md.
- Host SSH/provider-gateway dispatch may start the ci run-dev-e2e short launcher described in docs/reference/dev-ci-runner.md; it must not carry large shell bodies or become a general deployment path.
- Read-only smoke checks such as curl http://74.48.78.17:18083/health, server status, microservice proxy .../health and kubectl get may validate state, but they do not replace desired-state and live-commit verification.
- bun scripts/cli.ts check recovery-guardrails and the equivalent check --recovery-guardrails gate are commander-safe read-only recovery diagnostic surfaces for D601 reboot incidents. They may read /proc/mounts, inspect local path metadata, parse committed k3s manifests, and run bounded read-only kubectl get pods -A -o json / crictl pods -o json probes when the tools are present. They must not restart k3s, delete pods or CRI sandboxes, apply manifests, rollout workloads, mutate hostPath directories, prune Docker state, or repair symlinks automatically.
- File Browser recovery may use the existing provider-local image-only/docker-run path only as a bounded repair path. Standardization requires first resolving docker.io/filebrowser/filebrowser:v2.63.3 to an upstream manifest digest or a digest-verified local mirror, then validating the running container through the UniDesk private proxy.
Manual Secret/env/rollout repair is allowed only as a bounded runtime recovery path. It must have explicit authorization for the target environment and service, a narrow object scope, redacted evidence, an issue review trail and a durable source fix. Acceptable evidence includes object names, revision changes, health status, rollout status and redacted key presence; it must not include secret values, tokens, full env dumps or copy-pastable sensitive mutation commands.
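The "redacted key presence" evidence described above can be sketched as follows. This is an illustration only: the function name, the evidence shape and the rule that an empty value counts as missing are assumptions, not an existing UniDesk helper.

```typescript
// Hypothetical evidence shape; none of these names come from UniDesk code.
interface RedactedEvidence {
  objectName: string;
  keysPresent: string[]; // key names only; values are never copied
  keysMissing: string[]; // expected keys that are absent or empty
}

// Builds evidence that records which keys exist without exposing any value.
function redactSecretEvidence(
  objectName: string,
  data: Record<string, string>,
  expectedKeys: string[],
): RedactedEvidence {
  // Assumption: an empty string is treated the same as a missing key.
  const present = Object.keys(data).filter((key) => data[key] !== "");
  return {
    objectName,
    keysPresent: [...present].sort(),
    keysMissing: expectedKeys.filter((key) => !present.includes(key)).sort(),
  };
}
```

Because the returned object is assembled only from key names, it can be pasted into an issue review trail without carrying secret values or a full env dump.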
Any manual repair that changes live credentials, env wiring, DNS/egress assumptions, ConfigMaps, Deployments or rollout state must be followed by a source-of-truth update in the owning repository, secret management path, manifest, deployment catalog or a tracked remediation issue. The live environment is recovery evidence, not deployment truth.
If a manual repair is needed to unblock the platform, the durable fix must be committed and pushed, then redeployed or revalidated through the normal path. Do not preserve the repair only as hidden runtime state.
D601 Recovery Hotfix Exception
D601 reboot recovery has a narrow hotfix exception because k3s, Code Queue and hostPath readiness can fail before normal UniDesk proxy/CD surfaces are healthy. The exception authorizes diagnosis and carefully scoped host repair only; it does not make live host edits a new deployment path.
Allowed read-only recovery checks:
- bun scripts/cli.ts check recovery-guardrails or bun scripts/cli.ts check --recovery-guardrails on the host or runner environment.
- /proc/mounts inspection for malformed Docker Desktop/Docker/host 9p rows that may break kubelet mount-table validation.
- kubectl get/describe/logs/events and crictl pods -o json as bounded observation.
- Path metadata checks for /home/ubuntu/unidesk-code-queue-deploy, /home/ubuntu/cq-deploy, Code Queue hostPath directories, .codex files, .ssh, .agents/skills, and MDTODO workspace/log paths.
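As an illustration of the read-only /proc/mounts inspection, here is a minimal sketch that flags rows whose field count is off. It assumes that "malformed" means a row that does not have the usual six whitespace-separated fields; the actual recovery-guardrails check may apply different or additional rules, and the names below are hypothetical.

```typescript
interface MountFinding {
  lineNumber: number; // 1-based position in the mount table text
  line: string;
  reason: string;
}

// Assumption: a well-formed /proc/mounts row has exactly six whitespace-separated
// fields (device, mount point, fstype, options, dump, pass).
function findSuspectMountRows(procMounts: string): MountFinding[] {
  const findings: MountFinding[] = [];
  procMounts.split("\n").forEach((line, index) => {
    if (line.trim() === "") return; // ignore blank lines, e.g. the trailing newline
    const fields = line.trim().split(/\s+/);
    if (fields.length !== 6) {
      findings.push({
        lineNumber: index + 1,
        line,
        reason: `expected 6 fields, found ${fields.length}`,
      });
    }
  });
  return findings;
}
```

The function only reports findings; it takes the mount table as a string and never touches the host, which keeps it inside the read-only boundary.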
Manual host hotfix may be considered only after the read-only output identifies a concrete redline and the operator has reviewed whether the target is source checkout, credential material, log/cache state, or user data. Examples include restoring a missing Git worktree from pushed remote state, recreating an intended compatibility symlink after confirming its target, restoring a missing runtime Secret source, or fixing a Docker Desktop/WSL mount-table condition. The repair must be recorded in the relevant issue or commander brief and followed by a source or runbook update when the root cause is durable.
Forbidden automatic recovery actions:
- systemctl restart k3s, service k3s restart, kubectl delete pod, crictl rmp, crictl rm, docker system prune, docker volume prune, recursive chmod/chown/rm under /home/ubuntu, and git reset --hard of a live worktree.
- Deleting CRI sandboxes or Kubernetes Pods because a diagnostic counted stale sandboxes.
- Creating, deleting or replacing MDTODO workspace content from Code Queue or the generic CLI. MDTODO hostPaths contain user-authored Markdown data.
- Treating DirectoryOrCreate as permission to mass-create parent trees or credential directories. Kubelet may create the final directory only when mount validation and parent permissions are already healthy; humans still decide user-data boundaries.
ClaudeQQ or direct user approval is required before any high-risk host action that restarts k3s/kubelet/Docker Desktop, deletes pods/sandboxes, changes credential or SSH paths, touches MDTODO workspace data, force-resets a worktree, changes production rollout state, or could interrupt active Code Queue tasks. If ClaudeQQ is unavailable, the operator must stop at the written plan and record the approval gap instead of silently executing the action.
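A diagnostic tool could enforce part of the forbidden list above with a guard like the following. This is a hedged sketch: the command patterns are transcribed from the list, but the matching strategy (regular expressions over a command string) and all identifiers are assumptions. It does not cover the recursive chmod/chown/rm rule, which needs path-aware analysis.

```typescript
// Patterns transcribed from the forbidden list; regex matching is an
// illustrative choice, not the project's actual enforcement mechanism.
const FORBIDDEN_AUTOMATIC_ACTIONS: RegExp[] = [
  /\bsystemctl\s+restart\s+k3s\b/,
  /\bservice\s+k3s\s+restart\b/,
  /\bkubectl\s+delete\s+pods?\b/,
  /\bcrictl\s+(rmp|rm)\b/,
  /\bdocker\s+(system|volume)\s+prune\b/,
  /\bgit\s+reset\s+--hard\b/,
];

// Returns true when the command matches any forbidden automatic recovery action.
function isForbiddenAutomaticAction(command: string): boolean {
  return FORBIDDEN_AUTOMATIC_ACTIONS.some((pattern) => pattern.test(command));
}
```

A caller would stop and record the approval gap when this returns true, rather than prompting for confirmation and continuing on its own.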
CI And Private Source Auth
Private repository access is part of the CI contract. ci run must not rely on unauthenticated HTTPS clone, an operator's local dirty worktree or an ad-hoc secret copied by hand into one PipelineRun.
Acceptable source access implementations are:
- a first-class CI Git credential installed by ci install and referenced by the Tekton Pipeline;
- an in-cluster Git mirror managed by UniDesk; or
- the same commit-pinned host-fetch boundary used by ci run-dev-e2e, where D601 fetches the manifest commit and passes only verified inputs into Tekton.
For backend-core artifact publication, the required implementation is the host-fetch boundary: D601 uses the existing GitHub SSH deploy identity plus the node-local provider-gateway WS egress proxy, exports the requested commit to /home/ubuntu/.unidesk/ci/backend-core-artifacts/<commit>, and passes only that verified source directory into Tekton. This path must not be replaced by an in-cluster Git mirror, a third-party source mirror, an operator local checkout, or Tekton-mounted GitHub credentials.
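The commit-pinned export directory can be derived as in the sketch below. The root path comes from the paragraph above; the requirement that the commit be a full 40-character hexadecimal SHA, and the function name, are assumptions made for illustration.

```typescript
const ARTIFACT_ROOT = "/home/ubuntu/.unidesk/ci/backend-core-artifacts";

// Assumption: commits are referenced by full 40-character lowercase hex SHAs.
// Rejecting anything else keeps branch names and path fragments out of the path.
function exportDirForCommit(commit: string): string {
  if (!/^[0-9a-f]{40}$/.test(commit)) {
    throw new Error(`refusing non-pinned ref: ${commit}`);
  }
  return `${ARTIFACT_ROOT}/${commit}`;
}
```

Validating the identifier before building the path means a value such as a branch name or `../` sequence fails loudly instead of producing an unpinned or unexpected source directory.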
If a CI repo-check task fails at git clone because credentials are unavailable, classify it as a CI infrastructure/auth gap, not as an application test failure.
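The classification rule above might be applied to a task log as follows. This is a minimal sketch: the marker strings are common git and SSH authentication errors chosen for illustration, and neither the list nor the type names are taken from UniDesk's CI code.

```typescript
type FailureClass = "ci-infrastructure-auth" | "unclassified";

// Assumption: these substrings indicate missing or rejected credentials.
// The list is illustrative and should be tuned against real task logs.
const AUTH_FAILURE_MARKERS: string[] = [
  "Permission denied (publickey)",
  "Authentication failed",
  "could not read Username",
  "terminal prompts disabled",
];

// Classifies a failed clone step; anything without an auth marker stays
// unclassified so that it is triaged by a person, not guessed at.
function classifyCloneFailure(taskLog: string): FailureClass {
  const hit = AUTH_FAILURE_MARKERS.some((marker) => taskLog.includes(marker));
  return hit ? "ci-infrastructure-auth" : "unclassified";
}
```

Returning "unclassified" instead of "application failure" for unknown logs avoids the opposite mistake of blaming application code by default.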
Verification Priority
When checks disagree, use this priority order:
- The pushed desired commit and deploy.json/manifest contract.
- Live runtime metadata proving which commit is deployed.
- Controlled health checks and service proxy checks for the same runtime object.
- CI/e2e results tied to the same commit.
- Manual curl/kubectl output as supporting evidence only.
A passing lower-priority manual check cannot override a higher-priority mismatch. Fix the desired state, deploy path, runtime metadata or CI infrastructure until all layers agree.
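The priority order can be expressed as a lookup that always reports the highest-priority failing source. The sketch below assumes each check reduces to a pass/fail result; the source labels and type names are hypothetical.

```typescript
// Evidence sources in the priority order listed above (index 0 is highest).
const VERIFICATION_PRIORITY = [
  "desired-commit",
  "runtime-metadata",
  "health-and-proxy-checks",
  "ci-e2e",
  "manual-checks",
] as const;

type VerificationSource = (typeof VERIFICATION_PRIORITY)[number];

interface CheckResult {
  source: VerificationSource;
  passed: boolean;
}

// Returns the highest-priority source with a failing check, or null when
// nothing failed. A passing lower-priority check never hides a failure above it.
function highestPriorityMismatch(results: CheckResult[]): VerificationSource | null {
  for (const source of VERIFICATION_PRIORITY) {
    if (results.some((r) => r.source === source && !r.passed)) {
      return source;
    }
  }
  return null;
}
```

For example, a passing manual curl combined with failing runtime metadata yields "runtime-metadata", which is the layer to fix first.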