DevOps Hygiene

This document is the authoritative source for UniDesk deployment hygiene: Git-backed deployment truth, dirty-environment boundaries, bounded manual operations and CI source-auth rules. Release-line and CI/CD runtime-version governance is owned by docs/reference/release-governance.md and GitHub issue #6. If the same hygiene rule would need edits in docs/reference/dev-environment.md, docs/reference/deploy.md, docs/reference/ci.md, docs/reference/dev-ci-runner.md, docs/reference/deployment.md, AGENTS.md or TEST.md, keep the detailed rule here and leave only a cross-reference elsewhere.

Source Of Truth

UniDesk deployment state is healthy only when all three layers point back to pushed Git commits:

  • Desired state: origin/master:deploy.json, config.json and committed service manifests.
  • Runtime state: live service metadata, image tags, Deployment annotations/env stamps or Compose labels that identify the deployed commit.
  • Verification state: deploy plan, health checks, server status, service proxy checks and CI/e2e results that agree with the desired commit.
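
The three-layer rule above can be sketched as a single agreement check. This is a minimal illustration, not UniDesk code: the `LayerCommits` type and its field names are hypothetical, and each field is assumed to already hold the commit hash extracted from that layer.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LayerCommits:
    """Commit identity reported by each layer (hypothetical field names)."""

    desired: str   # commit pinned by the pushed desired state
    runtime: str   # commit stamped on live service metadata
    verified: str  # commit the health/CI/e2e results were produced for


def is_healthy(layers: LayerCommits) -> bool:
    """Healthy only when every layer names the same non-empty commit."""
    commits = {layers.desired, layers.runtime, layers.verified}
    return len(commits) == 1 and "" not in commits
```

A layer with no commit identity at all counts as unhealthy here, which matches the rule that unverifiable state is never deployment truth.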

Local worktrees, D601 runtime files, copied scripts, copied images, ad-hoc Kubernetes objects and one-off curl results are never deployment truth.

When stable release lanes such as release/v1 are enabled, the desired-state ref must be explicit in the command, job log and deploy output. Until that support exists, commands that are documented to read origin/master:deploy.json must keep doing so and must not silently switch to another branch or a dirty manifest.

Prohibited Deployment Truth

The following practices are not acceptable as the long-term or hidden source of a working environment:

  • Hand editing D601 runtime files, k3s manifests, ConfigMaps, Deployments or container env values and treating the live result as source of truth.
  • Rebuilding backend-core, frontend, k3sctl-adapter or other managed services from a dirty worktree on the master server, D601 or an operator machine.
  • Copying large local shell scripts, generated manifests, Docker images or application source to D601 as the main deployment mechanism.
  • Fixing dev or production reachability by adding direct D601 public ports, NodePorts or backend-core hardcoded service entries instead of updating the proper catalog/control bridge.
  • Treating server rebuild backend-core as a Rust backend-core iteration path; Rust build/check belongs to D601 CI and CD must consume the published artifact.
  • Using local-manifest production deploy for services that already have artifact consumers. backend-core, frontend, baidu-netdisk and decision-center production deploys must enter through deploy apply --env prod so CD consumes a commit-pinned registry artifact instead of silently falling back to target-side source build.
  • Treating upstream images as UniDesk source-build services. filebrowser and filebrowser-d601 are upstream-image consumers; they require digest pin or digest-verified mirror governance and must not be added to Dockerfile CI artifacts.
  • Considering manual curl, kubectl or Docker checks sufficient when live commit metadata, deploy plan, health checks and CI/e2e disagree.

Bounded Manual Operations

Manual operations are allowed only when they are narrow, visible and followed by commit-based verification:

  • server rebuild dev-frontend-proxy may update the thin main-server nginx proxy for the public dev UI port; it must not build backend/frontend application code.
  • deploy apply --service k3sctl-adapter may update the k3s control bridge catalog through the documented local manifest exception in docs/reference/deploy.md.
  • Host SSH/provider-gateway dispatch may start the ci run-dev-e2e short launcher described in docs/reference/dev-ci-runner.md; it must not carry large shell bodies or become a general deployment path.
  • Read-only smoke checks such as curl http://74.48.78.17:18083/health, server status, microservice proxy .../health and kubectl get may validate state, but they do not replace desired-state and live-commit verification.
  • bun scripts/cli.ts check recovery-guardrails and the equivalent check --recovery-guardrails gate are commander-safe read-only recovery diagnostic surfaces for D601 reboot incidents. They may read /proc/mounts, inspect local path metadata, parse committed k3s manifests, and run bounded read-only kubectl get pods -A -o json / crictl pods -o json probes when the tools are present. They must not restart k3s, delete pods or CRI sandboxes, apply manifests, rollout workloads, mutate hostPath directories, prune Docker state, or repair symlinks automatically.
  • File Browser recovery may use the existing provider-local image-only/docker-run path only as a bounded repair path. Standardization requires first resolving docker.io/filebrowser/filebrowser:v2.63.3 to an upstream manifest digest or a digest-verified local mirror, then validating the running container through the UniDesk private proxy.

Manual Secret/env/rollout repair is allowed only as a bounded runtime recovery path. It must have explicit authorization for the target environment and service, a narrow object scope, redacted evidence, an issue review trail and a durable source fix. Acceptable evidence includes object names, revision changes, health status, rollout status and redacted key presence; it must not include secret values, tokens, full env dumps or copy-pastable sensitive mutation commands.
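
As an illustration of "redacted key presence", the sketch below reports whether each expected key is set without ever echoing a value. The function name and the `present`/`missing` labels are assumptions made for this example, not an existing UniDesk helper.

```python
def redacted_key_presence(env: dict[str, str], expected_keys: list[str]) -> dict[str, str]:
    """Report which expected keys are set, without including any value."""
    report = {}
    for key in expected_keys:
        # Only the fact that a non-empty value exists is recorded.
        report[key] = "present" if env.get(key) else "missing"
    return report
```

The output is safe to paste into an issue because it contains key names and presence flags only.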

Any manual repair that changes live credentials, env wiring, DNS/egress assumptions, ConfigMaps, Deployments or rollout state must be followed by a source-of-truth update in the owning repository, secret management path, manifest, deployment catalog or a tracked remediation issue. The live environment is recovery evidence, not deployment truth.

If a manual repair is needed to unblock the platform, the durable fix must be committed and pushed, then redeployed or revalidated through the normal path. Do not preserve the repair only as hidden runtime state.

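Distributed Agile Flow

"Distributed agile" (分布式敏捷) is UniDesk's fixed process name for distributed agile field repair. When a later issue, PR, commander record or user feedback mentions "distributed agile", it refers by default to the flow below: first probe the real distributed runtime surface quickly and try experimental patches there, produce reproducible evidence and a recap issue, then converge the effective fix into durable Git/PR/CI/CD delivery, and finally re-test from the original user entry point. The flow permits fast on-site learning, but it does not allow runtime changes to become hidden deployment truth.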

Before classifying a failure as an external blocker, the operator must complete the field anti-misclassification check. This is P0 for model providers, API providers, hardware links, cross-platform bridges, CLI/tran paths and frequently used tooling:

  1. Confirm the exact runtime configuration used by the failing path: committed source ref, deployed image or script revision, redacted Secret names and key presence, env/proxy/NO_PROXY shape, endpoint identity and command args. Do not infer these values from memory or from a different workspace.
  2. Reproduce the symptom from the actual target provider, pod, host bridge or service port through UniDesk passthrough or the service entry that failed. A commander-machine-only check is supporting evidence, not classification evidence.
  3. Compare with the mature local implementation when one exists. For Codex/model-provider work, inspect the current UniDesk/HWLAB stdio, forwarder, proxy, env-stripping and config-loading paths before concluding the provider itself is broken.
  4. Run narrow one-variable experiments in the live target environment. Typical variables are explicit versus config-derived model, endpoint, proxy or NO_PROXY, env inheritance, secret mount shape, CLI version, protocol start parameters and request payload. Record the success case and the failure case with trace ids, run ids, job names, rollout objects or bounded logs.
  5. Only call the condition an external blocker after the current runtime config has been verified, the minimal real-path probe still fails, a mature reference path or equivalent cross-check also fails, and the evidence rules out local adapter/config mistakes.
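
Step 5 is a conjunction of four conditions, which the following sketch makes explicit. The `BlockerEvidence` type, its field names and the two returned labels are hypothetical and exist only to illustrate the rule.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BlockerEvidence:
    """One flag per condition in step 5 (hypothetical names)."""

    runtime_config_verified: bool   # current runtime config was confirmed
    real_path_probe_fails: bool     # minimal probe on the real path still fails
    reference_path_fails: bool      # mature reference or cross-check also fails
    local_mistakes_ruled_out: bool  # adapter/config mistakes are excluded


def classify(evidence: BlockerEvidence) -> str:
    """Return 'external-blocker' only when every condition holds."""
    conditions = (
        evidence.runtime_config_verified,
        evidence.real_path_probe_fails,
        evidence.reference_path_fails,
        evidence.local_mistakes_ruled_out,
    )
    return "external-blocker" if all(conditions) else "field-repair"
```

Any single missing condition keeps the failure in field repair mode.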

If user feedback or fresh evidence contradicts an initial blocker claim, the operator must stop repeating the blocker narrative and switch to field repair mode immediately. The expected sequence is passthrough probing, single-variable live experiments, a bounded hotfix experiment when needed, a source PR, CI/CD rollout and re-test from the original entry point. The hotfix proves direction or restores a live path; it does not complete the task.

The standard flow is:

  1. Probe the real runtime surface first. Use structured UniDesk passthrough, service health endpoints, trace/result polling, bounded logs, object metadata and user-entry requests to reproduce the symptom on the actual target environment. Prefer short single-step commands that return promptly and can be repeated.
  2. Apply an experimental runtime patch only when it is needed to prove a fix direction. The patch must be narrow, named, reversible and scoped to the affected deployment, pod, ConfigMap, env key, script mount or file. It must not include secrets, broad filesystem rewrites, unmanaged image builds, destructive resets or unrelated cleanup.
  3. Validate the runtime patch from the user or service entry that exposed the problem. Supporting internal checks are useful, but the decisive evidence should include the external URL, API route, trace id, operation id, rollout object, health metadata or other runtime identity that proves the real path changed.
  4. Write a recap issue before treating the fix as complete. The issue must include reproduction steps, runtime evidence, root cause, exact experimental patch shape, rollback or cleanup notes, durable source changes needed, and post-CD validation criteria. Sensitive values stay redacted; copy-pastable secret or credential mutation commands do not belong in the recap.
  5. Convert the working runtime patch into the owning repository. The formal fix must be committed, pushed, reviewed through PR when applicable, and validated by the smallest appropriate CI gate on an approved execution surface. Master server local checks remain prohibited for heavy gates.
  6. Roll the fix through the standard CD path. CD must consume commit-pinned artifacts or desired-state manifests rather than the live hotfix. If publish and desired-state commits differ, the rollout target is the already published source commit, not a later documentation-only or desired-state commit.
  7. Re-test after CD rollout from the original user or service entry. The final evidence must show the deployed commit/image/runtime metadata, the relevant health or trace result, and that the temporary runtime patch is gone or no longer active.
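
The recap issue requirements in step 4 can be checked mechanically before a fix is treated as complete. The sketch below assumes the recap is available as a dictionary of section name to text; the section keys are illustrative names for the items listed in step 4, not a defined UniDesk schema.

```python
REQUIRED_RECAP_SECTIONS = (
    "reproduction_steps",
    "runtime_evidence",
    "root_cause",
    "experimental_patch_shape",
    "rollback_or_cleanup_notes",
    "durable_source_changes",
    "post_cd_validation_criteria",
)


def missing_recap_sections(recap: dict[str, str]) -> list[str]:
    """List required sections that are absent or contain only whitespace."""
    return [
        name
        for name in REQUIRED_RECAP_SECTIONS
        if not recap.get(name, "").strip()
    ]
```

An empty result means every required section has some content; it says nothing about the quality of that content.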

This flow deliberately separates agility from persistence: runtime probing and experimental patches are allowed to shorten diagnosis, while Git, PR, CI/CD and post-rollout validation remain the only durable completion path. If a recurring field step is painful because of quoting, target routing, kubeconfig selection, output volume or missing helpers, improve the UniDesk passthrough tool and document the new helper instead of preserving another one-off command recipe.

Full CI/CD, GitOps rollout, image build, hardware run, long trace replay and model-provider compatibility work must not be used as the inner loop. First prove the smallest real loop inside the target provider, Pod, host bridge or service port, then promote the proven change into source and the normal release path. For model-provider work, prefer official docs and mature protocol bridges over hand-written protocol translation; a local compatibility shim should stay thin and preserve provider cache/tool semantics.

Distributed Command Passthrough

Distributed runtime work should prefer structured CLI passthrough over ad-hoc nested shell strings. The standard escalation order is:

  1. Use a purpose-built UniDesk route plus operation or helper such as ssh D601:k3s kubectl ..., ssh D601:k3s script, ssh D601:k3s:<namespace>:<workload> logs, ssh D601:k3s:<namespace>:<workload> script, ssh D601:k3s:<namespace>:<workload>/<workspace> apply-patch, ssh <providerId>:/absolute/workspace apply-patch, ssh <providerId> py, ssh <providerId> find, ssh <providerId> glob or ssh <providerId> skills. Use legacy apply-patch-v1 only when the old remote helper is explicitly required.
  2. If no helper exists, use ssh <providerId> argv <command> [args...] so the CLI quotes each argv token once.
  3. If shell features such as pipes, redirects, loops or variable expansion are required, use a single quoted heredoc with ssh <providerId> script or ssh D601:k3s:<namespace>:<workload> script so the script body travels over stdin instead of through shell command-string arguments.
  4. Treat free-form ssh-like command strings as an interactive compatibility path, not as the default automation surface.
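
The reason step 2 prefers argv tokens over free-form strings can be shown with Python's standard `shlex` module. This is a generic illustration of single-pass quoting, not the UniDesk CLI's implementation; the example command and label value are made up.

```python
import shlex


def quote_once(argv: list[str]) -> str:
    """Quote every argv token exactly once for a single shell layer."""
    return shlex.join(argv)


def naive_join(argv: list[str]) -> str:
    """Join tokens with spaces and no quoting, as ad-hoc strings often do."""
    return " ".join(argv)


# A token containing a space survives only when it is quoted.
example_argv = ["kubectl", "get", "pods", "-l", "app=my app"]
round_trip = shlex.split(quote_once(example_argv))
broken = shlex.split(naive_join(example_argv))
```

`round_trip` equals the original token list, while `broken` splits `app=my app` into two tokens. Each extra shell layer that a command string crosses requires another round of quoting, which is why the escalation order keeps the number of layers fixed.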

For D601 Kubernetes work, route syntax is preferred over positional shell recipes, but the route must stay a pure locator. D601:k3s means the native k3s control plane, and D601:k3s:<namespace>:<workload>[:container] means a namespaced workload or pod. Operations come after the route: kubectl runs on the control plane, logs reads bounded workload logs, script streams a local heredoc/stdin script into the host or target pod, and apply-patch is the default remote text patch operation for host or pod workspaces. The route-operation split keeps distributed location and execution behavior independently extensible. It also pins KUBECONFIG=/etc/rancher/k3s/k3s.yaml, refuses long-follow logs, and assembles the common kubectl exec / kubectl logs / stdin script / pod patch target arguments without requiring a provider-gateway protocol change. This prevents the common failure mode in which a command crosses the local shell, the UniDesk SSH broker, remote shell command strings, kubectl exec and container shell quoting layers before reaching the process that should run it.
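
To illustrate the "pure locator" idea, the sketch below parses only the two k3s route shapes named above and carries no operation. It is an assumption-laden example: the `K3sRoute` type and `parse_k3s_route` function are hypothetical, workspace suffixes are not handled, and the real CLI may validate routes differently.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class K3sRoute:
    """Locator only: where to run, never what to run."""

    provider: str
    namespace: Optional[str] = None
    workload: Optional[str] = None
    container: Optional[str] = None


def parse_k3s_route(route: str) -> K3sRoute:
    """Parse <provider>:k3s or <provider>:k3s:<namespace>:<workload>[:container]."""
    parts = route.split(":")
    if len(parts) < 2 or not parts[0] or parts[1] != "k3s":
        raise ValueError(f"not a k3s route: {route!r}")
    if len(parts) == 2:
        return K3sRoute(provider=parts[0])  # control-plane route
    if len(parts) in (4, 5) and all(parts[2:]):
        return K3sRoute(parts[0], *parts[2:])  # workload route
    raise ValueError(f"malformed k3s route: {route!r}")
```

Because the parsed value has no operation field, adding a new operation never requires changing the route grammar.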

Longer scripts should move across stdin (ssh py, ssh script or k3s script operation), and remote text patches should default to apply-patch with a host or pod workspace route. Legacy apply-patch-v1 remains available as the explicit fallback and uses the injected sh helper path instead of assuming target containers have python3, node or repository-local tools. Avoid heredocs nested inside remote command strings, python - <<EOF inside SSH strings, or JSON/Markdown bodies passed through shell arguments. These patterns often bind stdin to the wrong process, strip quotes, or leave a half-open provider SSH session that looks like a platform outage.

When structured passthrough is missing for a recurring workflow, fix the CLI first and then document the durable helper. Do not preserve a growing collection of one-off shell recipes as the long-term runbook.

tran and non-interactive ssh are short-operation tools. Their outer runtime limit is intentionally bounded, so long builds, downloads, Tekton/Argo observations, device operations and Code Agent trace waits must use short-start plus poll semantics: start or observe a named job, write logs/state on the target, then return; follow-up commands read bounded status, log tails and terminal evidence. A UNIDESK_TRAN_TIMEOUT_HINT means the caller held the transport too long, not that the target task necessarily failed.
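
The short-start plus poll pattern can be sketched as a bounded loop of short status reads. The function, the status strings and the `read_status` callback are hypothetical; in practice the callback would be one short passthrough command that reads state written on the target.

```python
import time
from typing import Callable

TERMINAL_STATES = {"succeeded", "failed"}


def poll_job(read_status: Callable[[], str], max_polls: int, interval_s: float = 0.0) -> str:
    """Poll a short status reader a bounded number of times.

    Returns the terminal state if one is observed, otherwise 'still-running'.
    Running out of polls says nothing about whether the target job failed.
    """
    for _ in range(max_polls):
        status = read_status()
        if status in TERMINAL_STATES:
            return status
        time.sleep(interval_s)
    return "still-running"
```

Returning `still-running` instead of raising mirrors the timeout-hint rule: an exhausted caller budget is distinct from a failed target task.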

D601 Recovery Hotfix Exception

D601 reboot recovery has a narrow hotfix exception because k3s, Code Queue and hostPath readiness can fail before normal UniDesk proxy/CD surfaces are healthy. The exception authorizes diagnosis and carefully scoped host repair only; it does not make live host edits a new deployment path.

Allowed read-only recovery checks:

  • bun scripts/cli.ts check recovery-guardrails or bun scripts/cli.ts check --recovery-guardrails on the host or runner environment.
  • /proc/mounts inspection for malformed Docker Desktop /Docker/host 9p rows that may break kubelet mount-table validation.
  • kubectl get/describe/logs/events and crictl pods -o json as bounded observation.
  • Path metadata checks for /home/ubuntu/unidesk-code-queue-deploy, /home/ubuntu/cq-deploy, Code Queue hostPath directories, .codex files, .ssh, .agents/skills, and MDTODO workspace/log paths.

A manual host hotfix may be considered only after the read-only output identifies a concrete redline and the operator has reviewed whether the target is a source checkout, credential material, log/cache state or user data. Examples include restoring a missing Git worktree from pushed remote state, recreating an intended compatibility symlink after confirming its target, restoring a missing runtime Secret source, or fixing a Docker Desktop/WSL mount-table condition. The repair must be recorded in the relevant issue or commander brief and followed by a source or runbook update when the root cause is durable.

Forbidden automatic recovery actions:

  • systemctl restart k3s, service k3s restart, kubectl delete pod, crictl rmp, crictl rm, docker system prune, docker volume prune, recursive chmod/chown/rm under /home/ubuntu, and git reset --hard of a live worktree.
  • Deleting CRI sandboxes or Kubernetes Pods because a diagnostic counted stale sandboxes.
  • Creating, deleting or replacing MDTODO workspace content from Code Queue or the generic CLI. MDTODO hostPaths contain user-authored Markdown data.
  • Treating DirectoryOrCreate as permission to mass-create parent trees or credential directories. Kubelet may create the final directory only when mount validation and parent permissions are already healthy; humans still decide user-data boundaries.
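
A recovery script could refuse the commands in the first bullet with a simple prefix deny-list. The sketch below is hypothetical and deliberately incomplete: it matches only commands that start with the listed tokens, so variants with flags before the verb (for example a namespace flag before `delete`) and the recursive chmod/chown/rm cases would need additional handling.

```python
FORBIDDEN_PREFIXES = [
    ("systemctl", "restart", "k3s"),
    ("service", "k3s", "restart"),
    ("kubectl", "delete", "pod"),
    ("crictl", "rmp"),
    ("crictl", "rm"),
    ("docker", "system", "prune"),
    ("docker", "volume", "prune"),
    ("git", "reset", "--hard"),
]


def is_forbidden(argv: list[str]) -> bool:
    """True when the command starts with one of the forbidden token prefixes."""
    return any(
        tuple(argv[: len(prefix)]) == prefix
        for prefix in FORBIDDEN_PREFIXES
    )
```

A deny-list like this is a guardrail for automation only; it does not replace the approval requirement for high-risk host actions.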

ClaudeQQ or direct user approval is required before any high-risk host action that restarts k3s/kubelet/Docker Desktop, deletes pods/sandboxes, changes credential or SSH paths, touches MDTODO workspace data, force-resets a worktree, changes production rollout state, or could interrupt active Code Queue tasks. If ClaudeQQ is unavailable, the operator must stop at the written plan and record the approval gap instead of silently executing the action.

CI And Private Source Auth

Private repository access is part of the CI contract. ci run must not rely on unauthenticated HTTPS clone, an operator's local dirty worktree or an ad-hoc secret copied by hand into one PipelineRun.

Acceptable source access implementations are:

  • a first-class CI Git credential installed by ci install and referenced by the Tekton Pipeline;
  • an in-cluster Git mirror managed by UniDesk; or
  • the same commit-pinned host-fetch boundary used by ci run-dev-e2e, where D601 fetches the manifest commit and passes only verified inputs into Tekton.

For backend-core artifact publication, the required implementation is the host-fetch boundary: D601 uses the existing GitHub SSH deploy identity plus the node-local provider-gateway WS egress proxy, exports the requested commit to /home/ubuntu/.unidesk/ci/backend-core-artifacts/<commit>, and passes only that verified source directory into Tekton. This path must not be replaced by an in-cluster Git mirror, a third-party source mirror, an operator local checkout, or Tekton-mounted GitHub credentials.

If a CI repo-check task fails at git clone because credentials are unavailable, classify it as a CI infrastructure/auth gap, not as an application test failure.

Verification Priority

When checks disagree, use this priority order:

  1. The pushed desired commit and deploy.json/manifest contract.
  2. Live runtime metadata proving which commit is deployed.
  3. Controlled health checks and service proxy checks for the same runtime object.
  4. CI/e2e results tied to the same commit.
  5. Manual curl/kubectl output as supporting evidence only.

A passing lower-priority manual check cannot override a higher-priority mismatch. Fix the desired state, deploy path, runtime metadata or CI infrastructure until all layers agree.
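
The priority order can be expressed as a scan that reports the highest-priority layer that does not agree. The layer keys below are illustrative names for the five items in the list, and a layer with no recorded result is treated as a mismatch by assumption.

```python
from typing import Optional

# Highest priority first, following the numbered list above.
VERIFICATION_PRIORITY = [
    "desired_commit",
    "runtime_metadata",
    "health_checks",
    "ci_e2e",
    "manual_checks",
]


def first_mismatch(results: dict[str, bool]) -> Optional[str]:
    """Return the highest-priority layer that fails or is missing, else None."""
    for layer in VERIFICATION_PRIORITY:
        if not results.get(layer, False):
            return layer
    return None
```

With this ordering, a passing `manual_checks` entry can never hide a failing `runtime_metadata` entry, because the scan stops at the first higher-priority mismatch.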