# DevOps Hygiene

This document is the authoritative source for UniDesk deployment hygiene: Git-backed deployment truth, dirty-environment boundaries, bounded manual operations and CI source-auth rules. Release-line and CI/CD runtime-version governance is owned by `docs/reference/release-governance.md` and [GitHub issue #6](https://github.com/pikasTech/unidesk/issues/6).

If the same hygiene rule would need edits in `docs/reference/dev-environment.md`, `docs/reference/deploy.md`, `docs/reference/ci.md`, `docs/reference/dev-ci-runner.md`, `docs/reference/deployment.md`, `AGENTS.md` or `TEST.md`, keep the detailed rule here and leave only a cross-reference elsewhere.

## Source Of Truth

UniDesk deployment state is healthy only when all three layers point back to pushed Git commits:

- Desired state: `origin/master:deploy.json`, `config.json` and committed service manifests.
- Runtime state: live service metadata, image tags, Deployment annotations/env stamps or Compose labels that identify the deployed commit.
- Verification state: `deploy plan`, health checks, `server status`, service proxy checks and CI/e2e results that agree with the desired commit.

Local worktrees, D601 runtime files, copied scripts, copied images, ad-hoc Kubernetes objects and one-off curl results are never deployment truth.

When stable release lanes such as `release/v1` are enabled, the desired-state ref must be explicit in the command, job log and deploy output. Until that support exists, commands that are documented to read `origin/master:deploy.json` must keep doing so and must not silently switch to another branch or a dirty manifest.

## Prohibited Deployment Truth

The following practices are not acceptable as the long-term or hidden source of a working environment:

- Hand editing D601 runtime files, k3s manifests, ConfigMaps, Deployments or container env values and treating the live result as source of truth.
- Rebuilding backend-core, frontend, k3sctl-adapter or other managed services from a dirty worktree on the master server, D601 or an operator machine.
- Copying large local shell scripts, generated manifests, Docker images or application source to D601 as the main deployment mechanism.
- Fixing dev or production reachability by adding direct D601 public ports, NodePorts or backend-core hardcoded service entries instead of updating the proper catalog/control bridge.
- Treating `server rebuild backend-core` as a Rust backend-core iteration path; Rust build/check belongs to D601 CI and CD must consume the published artifact.
- Using local-manifest production deploy for services that already have artifact consumers. `backend-core`, `frontend`, `baidu-netdisk` and `decision-center` production deploys must enter through `deploy apply --env prod` so CD consumes a commit-pinned registry artifact instead of silently falling back to target-side source build.
- Treating upstream images as UniDesk source-build services. `filebrowser` and `filebrowser-d601` are upstream-image consumers; they require digest pin or digest-verified mirror governance and must not be added to Dockerfile CI artifacts.
- Considering manual curl, kubectl or Docker checks sufficient when live commit metadata, deploy plan, health checks and CI/e2e disagree.

## Bounded Manual Operations

Manual operations are allowed only when they are narrow, visible and followed by commit-based verification:

- `server rebuild dev-frontend-proxy` may update the thin main-server nginx proxy for the public dev UI port; it must not build backend/frontend application code.
- `deploy apply --service k3sctl-adapter` may update the k3s control bridge catalog through the documented local manifest exception in `docs/reference/deploy.md`.
- Host SSH/provider-gateway dispatch may start the `ci run-dev-e2e` short launcher described in `docs/reference/dev-ci-runner.md`; it must not carry large shell bodies or become a general deployment path.
- Read-only smoke checks such as `curl http://74.48.78.17:18083/health`, `server status`, `microservice proxy .../health` and `kubectl get` may validate state, but they do not replace desired-state and live-commit verification.
- `bun scripts/cli.ts check recovery-guardrails` and the equivalent `check --recovery-guardrails` gate are commander-safe read-only recovery diagnostic surfaces for D601 reboot incidents. They may read `/proc/mounts`, inspect local path metadata, parse committed k3s manifests, and run bounded read-only `kubectl get pods -A -o json` / `crictl pods -o json` probes when the tools are present. They must not restart k3s, delete pods or CRI sandboxes, apply manifests, rollout workloads, mutate hostPath directories, prune Docker state, or repair symlinks automatically.
- File Browser recovery may use the existing provider-local image-only/docker-run path only as a bounded repair path. Standardization requires first resolving `docker.io/filebrowser/filebrowser:v2.63.3` to an upstream manifest digest or a digest-verified local mirror, then validating the running container through the UniDesk private proxy.

Manual Secret/env/rollout repair is allowed only as a bounded runtime recovery path. It must have explicit authorization for the target environment and service, a narrow object scope, redacted evidence, an issue review trail and a durable source fix. Acceptable evidence includes object names, revision changes, health status, rollout status and redacted key presence; it must not include secret values, tokens, full env dumps or copy-pastable sensitive mutation commands.

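
The "redacted key presence" evidence rule above can be illustrated with a small helper. This is a minimal sketch, not part of the UniDesk CLI: the function name `redact_env` and the `present`/`empty` labels are assumptions made for this example.

```python
from typing import Dict, Mapping


def redact_env(env: Mapping[str, str]) -> Dict[str, str]:
    """Reduce an env/Secret mapping to key presence only.

    Hypothetical helper for illustration: values are inspected for
    emptiness but never copied into the returned evidence.
    """
    return {
        key: ("present" if value else "empty")
        for key, value in sorted(env.items())
    }


if __name__ == "__main__":
    # Example values are made up; only key names and presence are printed.
    live_env = {"PROVIDER_API_KEY": "not-a-real-secret", "HTTPS_PROXY": ""}
    print(redact_env(live_env))
```

Evidence shaped this way can be pasted into a recap issue without exposing secret values or a full env dump.
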
Any manual repair that changes live credentials, env wiring, DNS/egress assumptions, ConfigMaps, Deployments or rollout state must be followed by a source-of-truth update in the owning repository, secret management path, manifest, deployment catalog or a tracked remediation issue. The live environment is recovery evidence, not deployment truth.

If a manual repair is needed to unblock the platform, the durable fix must be committed and pushed, then redeployed or revalidated through the normal path. Do not preserve the repair only as hidden runtime state.

## Distributed Agile Flow

"Distributed agile" (分布式敏捷) is UniDesk's fixed process name for distributed agile field repair. When a later issue, PR, command record or user feedback mentions "分布式敏捷", it refers by default to the flow below: first probe and apply experimental patches quickly on the real distributed runtime surface, produce reproducible evidence and a recap issue, then converge the working fix into durable Git/PR/CI/CD delivery, and finally re-test from the original user entry point. The flow allows fast field learning, but it does not allow runtime changes to become hidden deployment truth.

Before classifying a failure as an external blocker, the operator must complete the field anti-misclassification check. This is P0 for model providers, API providers, hardware links, cross-platform bridges, CLI/trans/tran paths and frequently used tooling:

1. Confirm the exact runtime configuration used by the failing path: committed source ref, deployed image or script revision, redacted Secret names and key presence, env/proxy/NO_PROXY shape, endpoint identity and command args. Do not infer these values from memory or from a different workspace.
2. Reproduce the symptom from the actual target provider, pod, host bridge or service port through UniDesk passthrough or the service entry that failed. A commander-machine-only check is supporting evidence, not classification evidence.
3. Compare with the mature local implementation when one exists. For Codex/model-provider work, inspect the current UniDesk/HWLAB stdio, forwarder, proxy, env-stripping and config-loading paths before concluding the provider itself is broken.
4. Run narrow one-variable experiments in the live target environment.
   Typical variables are explicit versus config-derived model, endpoint, proxy or NO_PROXY, env inheritance, secret mount shape, CLI version, protocol start parameters and request payload. Record the success case and the failure case with trace ids, run ids, job names, rollout objects or bounded logs.
5. Only call the condition an external blocker after the current runtime config has been verified, the minimal real-path probe still fails, a mature reference path or equivalent cross-check also fails, and the evidence rules out local adapter/config mistakes.

If user feedback or fresh evidence contradicts an initial blocker claim, the operator must stop repeating the blocker narrative and switch to field repair mode immediately. The expected sequence is passthrough probing, single-variable live experiments, a bounded hotfix experiment when needed, a source PR, CI/CD rollout and re-test from the original entry point. The hotfix proves direction or restores a live path; it does not complete the task.

The standard flow is:

1. Probe the real runtime surface first. Use structured UniDesk passthrough, service health endpoints, trace/result polling, bounded logs, object metadata and user-entry requests to reproduce the symptom on the actual target environment. Prefer short single-step commands that return promptly and can be repeated.
2. Apply an experimental runtime patch only when it is needed to prove a fix direction. The patch must be narrow, named, reversible and scoped to the affected deployment, pod, ConfigMap, env key, script mount or file. It must not include secrets, broad filesystem rewrites, unmanaged image builds, destructive resets or unrelated cleanup.
3. Validate the runtime patch from the user or service entry that exposed the problem. Supporting internal checks are useful, but the decisive evidence should include the external URL, API route, trace id, operation id, rollout object, health metadata or other runtime identity that proves the real path changed.
4. Write a recap issue before treating the fix as complete. The issue must include reproduction steps, runtime evidence, root cause, exact experimental patch shape, rollback or cleanup notes, durable source changes needed, and post-CD validation criteria. Sensitive values stay redacted; copy-pastable secret or credential mutation commands do not belong in the recap.
5. Convert the working runtime patch into the owning repository. The formal fix must be committed, pushed, reviewed through PR when applicable, and validated by the smallest appropriate CI gate on an approved execution surface. Master server local checks remain prohibited for heavy gates.
6. Roll the fix through the standard CD path. CD must consume commit-pinned artifacts or desired-state manifests rather than the live hotfix. If publish and desired-state commits differ, the rollout target is the already published source commit, not a later documentation-only or desired-state commit.
7. Re-test after CD rollout from the original user or service entry. The final evidence must show the deployed commit/image/runtime metadata, the relevant health or trace result, and that the temporary runtime patch is gone or no longer active.

This flow deliberately separates agility from persistence: runtime probing and experimental patches are allowed to shorten diagnosis, while Git, PR, CI/CD and post-rollout validation remain the only durable completion path. If a recurring field step is painful because of quoting, target routing, kubeconfig selection, output volume or missing helpers, improve the UniDesk passthrough tool and document the new helper instead of preserving another one-off command recipe.

Full CI/CD, GitOps rollout, image build, hardware run, long trace replay and model-provider compatibility work must not be used as the inner loop. First prove the smallest real loop inside the target provider, Pod, host bridge or service port, then promote the proven change into source and the normal release path.
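
The single-variable live experiments that this flow calls for can be enumerated mechanically from a known baseline. The sketch below is an illustration only, not a UniDesk tool: the function `one_variable_experiments` and the example keys and values are hypothetical.

```python
from typing import Dict, Iterator, List, Tuple

Config = Dict[str, str]


def one_variable_experiments(
    baseline: Config, alternatives: Dict[str, List[str]]
) -> Iterator[Tuple[str, Config]]:
    """Yield (changed_key, config) pairs that differ from the baseline
    in exactly one key, so each run isolates a single variable."""
    for key, values in alternatives.items():
        for value in values:
            if baseline.get(key) == value:
                continue  # identical to the baseline, nothing to learn
            config = dict(baseline)
            config[key] = value
            yield key, config


if __name__ == "__main__":
    # Hypothetical variable names, loosely following the list above.
    baseline = {"model": "explicit", "proxy": "on", "env": "inherited"}
    alternatives = {
        "model": ["config-derived"],
        "proxy": ["off"],
        "env": ["stripped"],
    }
    for changed, config in one_variable_experiments(baseline, alternatives):
        print(changed, config)
```

Each generated config is one experiment to run and record next to the baseline result, together with its trace id or run id.
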
For model-provider work, prefer official docs and mature protocol bridges over hand-written protocol translation; a local compatibility shim should stay thin and preserve provider cache/tool semantics.

## Distributed Command Passthrough

Distributed runtime work should prefer structured CLI passthrough over ad-hoc nested shell strings. The standard escalation order is:

1. Use a purpose-built UniDesk route plus operation or helper such as `trans D601:k3s kubectl ...`, `trans D601:k3s script`, `trans D601:k3s:: logs`, `trans D601:k3s:: script`, `trans D601:k3s::/ apply-patch`, `trans :/absolute/workspace apply-patch`, `trans py`, `trans find`, `trans glob` or `trans skills`. Use legacy `apply-patch-v1` only when the old remote helper is explicitly required.
2. If no helper exists, use `trans argv [args...]` so the CLI quotes each argv token once.
3. If shell features such as pipes, redirects, loops or variable expansion are required, use a single quoted heredoc with `trans script` or `trans D601:k3s:: script` so the script body travels over stdin instead of through shell command-string arguments.
4. Treat free-form ssh-like command strings as an interactive compatibility path, not as the default automation surface.

For D601 Kubernetes work, route syntax is preferred over positional shell recipes, but the route must stay a pure locator. `D601:k3s` means the native k3s control plane, and `D601:k3s::[:container]` means a namespaced workload or pod. Operations come after the route: `kubectl` runs on the control plane, `logs` reads bounded workload logs, `script` streams a local heredoc/stdin script into the host or target pod, and `apply-patch` is the default remote text patch operation for host or pod workspaces.
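
To illustrate why the escalation order prefers quoting each argv token exactly once, the sketch below compares naive string joining with the standard library's `shlex.join`. It is a generic Python illustration and does not describe how `trans argv` is implemented; the sample command is made up.

```python
import shlex
from typing import List


def naive_command(argv: List[str]) -> str:
    """Join tokens with spaces; token boundaries are lost when a token
    itself contains whitespace."""
    return " ".join(argv)


def quoted_command(argv: List[str]) -> str:
    """Quote every token once so a POSIX shell re-splits it into the
    same argv."""
    return shlex.join(argv)


if __name__ == "__main__":
    argv = ["grep", "-r", "two words", "/tmp/my dir"]
    print(shlex.split(naive_command(argv)))   # token boundaries changed
    print(shlex.split(quoted_command(argv)))  # original argv recovered
```

Quoting once only covers a single shell layer. Every extra layer that re-parses the string needs another round of quoting, which is the reason longer bodies are better sent over stdin.
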
The route-operation split keeps distributed location and execution behavior independently extensible, fixes `KUBECONFIG=/etc/rancher/k3s/k3s.yaml`, refuses long-follow logs, and assembles common `kubectl exec` / `kubectl logs` / stdin script / pod patch target arguments without adding a provider-gateway protocol change. This prevents the common failure mode where a command crosses local shell, UniDesk SSH broker, remote shell command strings, `kubectl exec`, and container shell quoting layers before reaching the process that should run it.

Longer scripts should move across stdin (`trans py`, `trans script` or k3s `script` operation), and remote text patches should default to `apply-patch` with a host or pod workspace route. Legacy `apply-patch-v1` remains available as the explicit fallback and uses the injected `sh` helper path instead of assuming target containers have `python3`, `node` or repository-local tools.

Avoid heredocs nested inside remote command strings, `python - <`, and passes only that verified source directory into Tekton. This path must not be replaced by an in-cluster Git mirror, a third-party source mirror, an operator local checkout, or Tekton-mounted GitHub credentials. If a CI repo-check task fails at `git clone` because credentials are unavailable, classify it as a CI infrastructure/auth gap, not as an application test failure.

## Verification Priority

When checks disagree, use this priority order:

1. The pushed desired commit and `deploy.json`/manifest contract.
2. Live runtime metadata proving which commit is deployed.
3. Controlled health checks and service proxy checks for the same runtime object.
4. CI/e2e results tied to the same commit.
5. Manual curl/kubectl output as supporting evidence only.

A passing lower-priority manual check cannot override a higher-priority mismatch. Fix the desired state, deploy path, runtime metadata or CI infrastructure until all layers agree.
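
The priority order above can be sketched as a small resolver that reports the highest-priority layer disagreeing with the desired commit. This is a minimal illustration under stated assumptions: the layer labels in `PRIORITY` and the function `first_mismatch` are hypothetical names, and each layer is assumed to report the commit it observed (or `None` when it cannot).

```python
from typing import Dict, Optional

# Layers in the priority order listed above, highest first.
# The labels are hypothetical names chosen for this sketch.
PRIORITY = ["desired", "runtime", "health", "ci", "manual"]


def first_mismatch(observed: Dict[str, Optional[str]]) -> Optional[str]:
    """Return the highest-priority layer whose commit differs from the
    desired commit, or None when every layer agrees."""
    desired = observed["desired"]
    for layer in PRIORITY[1:]:
        if observed.get(layer) != desired:
            return layer
    return None


if __name__ == "__main__":
    # Made-up commit ids: the manual check passes, runtime metadata does not.
    observed = {
        "desired": "abc1234",
        "runtime": "0ld5678",
        "health": "abc1234",
        "ci": "abc1234",
        "manual": "abc1234",
    }
    print(first_mismatch(observed))
```

Because the walk starts at the top of the list, a passing manual check never hides a runtime or CI mismatch.
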