# DevOps Hygiene

This document is the authoritative source for UniDesk deployment hygiene: Git-backed deployment truth, dirty-environment boundaries, bounded manual operations and CI source-auth rules. Release-line and CI/CD runtime-version governance is owned by `docs/reference/release-governance.md` and [GitHub issue #6](https://github.com/pikasTech/unidesk/issues/6). If the same hygiene rule would need edits in `docs/reference/dev-environment.md`, `docs/reference/deploy.md`, `docs/reference/ci.md`, `docs/reference/dev-ci-runner.md`, `docs/reference/deployment.md`, `AGENTS.md` or `TEST.md`, keep the detailed rule here and leave only a cross-reference elsewhere.

## Source Of Truth

UniDesk deployment state is healthy only when all three layers point back to pushed Git commits:

- Desired state: `origin/master:deploy.json`, `config.json` and committed service manifests.
- Runtime state: live service metadata, image tags, Deployment annotations/env stamps or Compose labels that identify the deployed commit.
- Verification state: `deploy plan`, health checks, `server status`, service proxy checks and CI/e2e results that agree with the desired commit.

Local worktrees, D601 runtime files, copied scripts, copied images, ad-hoc Kubernetes objects and one-off curl results are never deployment truth.
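
The three-layer rule above can be read as a simple equality check on commit identifiers. The following is a minimal sketch under stated assumptions: `LayerCommits`, `AgreementResult` and `checkLayerAgreement` are hypothetical names for illustration and are not part of the UniDesk CLI, and how each commit is actually collected is left out.

```typescript
// Hypothetical shape: one observed commit per layer described above.
interface LayerCommits {
  desired: string; // commit behind origin/master:deploy.json and committed manifests
  runtime: string; // commit stamped on the live service (annotation, label or image tag)
  verification: string; // commit that deploy plan, health checks and CI/e2e results refer to
}

interface AgreementResult {
  healthy: boolean;
  mismatches: string[];
}

// State is healthy only when runtime and verification both point back
// to the desired (pushed) commit.
function checkLayerAgreement(layers: LayerCommits): AgreementResult {
  const mismatches: string[] = [];
  if (layers.runtime !== layers.desired) {
    mismatches.push(`runtime=${layers.runtime} differs from desired=${layers.desired}`);
  }
  if (layers.verification !== layers.desired) {
    mismatches.push(`verification=${layers.verification} differs from desired=${layers.desired}`);
  }
  return { healthy: mismatches.length === 0, mismatches };
}
```

The desired commit is the only reference point in this sketch, so a runtime and verification pair that agree with each other but not with the pushed commit is still reported as unhealthy.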

When stable release lanes such as `release/v1` are enabled, the desired-state ref must be explicit in the command, job log and deploy output. Until that support exists, commands that are documented to read `origin/master:deploy.json` must keep doing so and must not silently switch to another branch or a dirty manifest.

## Prohibited Deployment Truth

The following practices are not acceptable as the long-term or hidden source of a working environment:

- Hand editing D601 runtime files, k3s manifests, ConfigMaps, Deployments or container env values and treating the live result as source of truth.
- Rebuilding backend-core, frontend, k3sctl-adapter or other managed services from a dirty worktree on the master server, D601 or an operator machine.
- Copying large local shell scripts, generated manifests, Docker images or application source to D601 as the main deployment mechanism.
- Fixing dev or production reachability by adding direct D601 public ports, NodePorts or backend-core hardcoded service entries instead of updating the proper catalog/control bridge.
- Treating `server rebuild backend-core` as a Rust backend-core iteration path; Rust build/check belongs to D601 CI and CD must consume the published artifact.
- Using local-manifest production deploy for services that already have artifact consumers. `backend-core`, `frontend`, `baidu-netdisk` and `decision-center` production deploys must enter through `deploy apply --env prod` so CD consumes a commit-pinned registry artifact instead of silently falling back to target-side source build.
- Treating upstream images as UniDesk source-build services. `filebrowser` and `filebrowser-d601` are upstream-image consumers; they require a digest pin or digest-verified mirror governance and must not be added to Dockerfile CI artifacts.
- Considering manual curl, kubectl or Docker checks sufficient when live commit metadata, deploy plan, health checks and CI/e2e disagree.

## Bounded Manual Operations

Manual operations are allowed only when they are narrow, visible and followed by commit-based verification:

- `server rebuild dev-frontend-proxy` may update the thin main-server nginx proxy for the public dev UI port; it must not build backend/frontend application code.
- `deploy apply --service k3sctl-adapter` may update the k3s control bridge catalog through the documented local manifest exception in `docs/reference/deploy.md`.
- Host SSH/provider-gateway dispatch may start the `ci run-dev-e2e` short launcher described in `docs/reference/dev-ci-runner.md`; it must not carry large shell bodies or become a general deployment path.
- Read-only smoke checks such as `curl http://74.48.78.17:18083/health`, `server status`, `microservice proxy .../health` and `kubectl get` may validate state, but they do not replace desired-state and live-commit verification.
- `bun scripts/cli.ts check recovery-guardrails` and the equivalent `check --recovery-guardrails` gate are commander-safe read-only recovery diagnostic surfaces for D601 reboot incidents. They may read `/proc/mounts`, inspect local path metadata, parse committed k3s manifests, and run bounded read-only `kubectl get pods -A -o json` / `crictl pods -o json` probes when the tools are present. They must not restart k3s, delete pods or CRI sandboxes, apply manifests, rollout workloads, mutate hostPath directories, prune Docker state, or repair symlinks automatically.
- File Browser recovery may use the existing provider-local image-only/docker-run path only as a bounded repair path. Standardization requires first resolving `docker.io/filebrowser/filebrowser:v2.63.3` to an upstream manifest digest or a digest-verified local mirror, then validating the running container through the UniDesk private proxy.

Manual Secret/env/rollout repair is allowed only as a bounded runtime recovery path. It must have explicit authorization for the target environment and service, a narrow object scope, redacted evidence, an issue review trail and a durable source fix. Acceptable evidence includes object names, revision changes, health status, rollout status and redacted key presence; it must not include secret values, tokens, full env dumps or copy-pastable sensitive mutation commands.
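
One way to keep repair evidence within the redaction boundary is to record key presence instead of values. This is a minimal sketch, assuming a hypothetical `RedactedEvidence` record and `buildRedactedEvidence` helper; neither is an existing UniDesk API, and the field names are illustrative only.

```typescript
// Hypothetical evidence record for a bounded Secret/env repair.
interface RedactedEvidence {
  objectName: string;
  revisionBefore: number;
  revisionAfter: number;
  keyPresence: { [key: string]: boolean };
}

function buildRedactedEvidence(
  objectName: string,
  revisionBefore: number,
  revisionAfter: number,
  expectedKeys: string[],
  liveEnv: { [key: string]: string },
): RedactedEvidence {
  const keyPresence: { [key: string]: boolean } = {};
  for (const key of expectedKeys) {
    // Record only whether the key exists and is non-empty; the value is never copied.
    const value = liveEnv[key];
    keyPresence[key] = typeof value === "string" && value.length > 0;
  }
  return { objectName, revisionBefore, revisionAfter, keyPresence };
}
```

Because the record is built from an explicit list of expected keys, unrelated entries in the live env never reach the evidence, which avoids the full env dump that the rule forbids.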

Any manual repair that changes live credentials, env wiring, DNS/egress assumptions, ConfigMaps, Deployments or rollout state must be followed by a source-of-truth update in the owning repository, secret management path, manifest, deployment catalog or a tracked remediation issue. The live environment is recovery evidence, not deployment truth.

If a manual repair is needed to unblock the platform, the durable fix must be committed and pushed, then redeployed or revalidated through the normal path. Do not preserve the repair only as hidden runtime state.

## Distributed Agile Workflow

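"Distributed agile" (分布式敏捷) is UniDesk's fixed process name for distributed agile field repair. The generic P1/P2/P3/P4 phases, prohibited behaviors and evidence boundaries are maintained by the `$dad-dev` skill and are not repeated in this reference. UniDesk keeps only the following project-specific constraints: real providers, pods, host bridges and service ports must be entered through structured `trans`/UniDesk CLI; a runtime hot patch can only prove a direction or provide temporary recovery and must not become hidden deployment truth; durable completion must go back through Git/PR/CI/CD and then be retested through the original entry point.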

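The fixed main repo is the source-truth anchor, not a scratch area for source or runtime work. Any work that produces source code, configuration, issue closeout, deployment scripts, acceptance artifacts or high-risk dad-dev / post-task deliverables must first create a task-specific `.worktree/<task>` from the latest remote/base of the target fixed repo; all later editing, verification, commits, pushes and controlled CLI write operations happen inside that worktree. The fixed repo is used only for `git fetch`, `git status`, fast-forward sync, reading rules and `git worktree add`. Parallel uncommitted changes already present there stay untouched by default, are not folded into the current task, and must not be "cleaned up" with reset, checkout or deletion.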

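When starting to use any fixed main/target worktree, if it is behind remote/base, complete the source-truth sync before continuing analysis or creating a task worktree. If the working tree is dirty, first save the changes (including untracked files) with `git stash push -u`, then fast-forward with `git pull --ff-only <remote> <branch>`, then run `git stash apply` and merge conflicts or duplicated content according to the post-fast-forward semantics. If the working tree is clean, run `git pull --ff-only` directly. The stash exists to protect parallel changes so that the fixed main repo can return to the latest source truth; it is not a way to clean up, discard or implicitly take over parallel tasks. After the apply, commit only files clearly related to the current task and leave other parallel changes as they were. Do not continue source analysis, create task branches, modify code or write issue conclusions based on stale source in a fixed worktree that is behind.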
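
The sync order described above can be sketched as a small planner that returns the Git commands to run. `WorktreeState` and `planSourceTruthSync` are hypothetical names used only for illustration; the sketch produces command strings and does not execute them, and resolving conflicts after `git stash apply` remains a manual step.

```typescript
// Hypothetical summary of a fixed worktree before work starts.
interface WorktreeState {
  behindRemote: boolean; // local branch is behind remote/base
  dirty: boolean; // tracked or untracked changes are present
}

function planSourceTruthSync(state: WorktreeState, remote: string, branch: string): string[] {
  // Nothing to do when the worktree already matches remote/base.
  if (!state.behindRemote) return [];
  const pull = `git pull --ff-only ${remote} ${branch}`;
  // Clean worktree: fast-forward directly.
  if (!state.dirty) return [pull];
  // Dirty worktree: protect parallel changes (including untracked files),
  // fast-forward, then restore them on top of the new base.
  return ["git stash push -u", pull, "git stash apply"];
}
```

The plan uses `git stash apply` rather than `git stash pop`, matching the text: the stash entry survives if the apply hits a conflict, so parallel work is not lost.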

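All source, configuration, deployment-script and runtime fixes must be done in an independent `.worktree/<task>` created from the latest remote/base; the fixed main worktree does not carry code changes directly. After a PR merge or equivalent integration lands in the remote base, return promptly to the corresponding fixed main worktree and run `git fetch` and `git pull --ff-only`, so that the next round of source-truth preflight, web-probe review, issue closeout and new task worktrees all start from the merged latest source. If the fixed main worktree cannot be fast-forwarded safely because of parallel dirty changes, keep the dirty changes and continue in a clean worktree at the latest remote/base; do not reset, checkout or delete other people's changes.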

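A semantic merge check is required before cleaning up a task worktree. The minimum requirements are: the worktree is clean; the relevant commits are already ancestors of `origin/master`, or `git log --left-right --cherry-pick <worktree-head>...origin/master` shows no left-only unabsorbed patch; and, where needed, key file diffs, the PR merge commit, issue closeout and runtime verification are checked against the latest `master`. `git worktree remove <path>` is allowed only after confirming that the task's semantics have landed in `master` or have been equivalently superseded by a newer implementation. Do not delete a worktree merely because the branch is behind, the PR is closed, the files look similar or local disk space is tight. If uncommitted files, unpushed commits, left-only patches or semantic uncertainty are found, first commit/merge/push the content that should be kept to `master`, or record the blocker and keep the worktree.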
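
The left-only check can be automated by scanning the log output for commits marked `<`. This is a minimal sketch, assuming the command is run with an added `--oneline` flag so that each line starts with the side marker followed by an abbreviated hash; `findLeftOnlyCommits` and `safeToRemoveWorktree` are hypothetical helper names, and the sketch covers only the mechanical part of the check, not the semantic review.

```typescript
// Assumes output of:
//   git log --left-right --cherry-pick --oneline <worktree-head>...origin/master
// where "<" marks commits only on the worktree side and ">" marks commits
// only on the origin/master side.
function findLeftOnlyCommits(logOutput: string): string[] {
  const leftOnly: string[] = [];
  for (const line of logOutput.split("\n")) {
    const match = /^([<>])\s*([0-9a-f]{4,40})\b/.exec(line.trim());
    if (match === null || match[1] !== "<") continue;
    const hash = match[2];
    if (hash !== undefined) leftOnly.push(hash);
  }
  return leftOnly;
}

// Mechanical gate only: a clean worktree with no unabsorbed left-only patch.
function safeToRemoveWorktree(worktreeClean: boolean, logOutput: string): boolean {
  return worktreeClean && findLeftOnlyCommits(logOutput).length === 0;
}
```

A `false` result means the worktree must be kept; a `true` result still leaves the key-file, PR and runtime checks named above to the operator.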

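Documentation governance is a lightweight exception to the fixed main repo protection rule. Pure documentation, `AGENTS.md`, `docs/reference/*.md`, skill rules, runbooks, distillation of process documents and other consolidation of long-term references do not require a new `.worktree` or a short-lived PR. Instead, align the current main worktree with the latest remote using the stash-if-dirty + `git pull --ff-only` procedure above, then edit directly, run a minimal syntax/diff check, commit and push. The exception covers only the documents/rules themselves and must not carry along source, configuration, deployment, runtime or issue-lifecycle writes. If the main worktree already has parallel documentation changes, commit only the files clearly related to this change; do not reset, drop the stash or merge other people's changes along the way.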

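Scenarios that may skip creating a new `.worktree` include P1 read-only probing, temporary runtime hot patches, the lightweight documentation/skill/long-term-reference edits described above, and direct-edit exceptions explicitly declared in the target project's long-term reference. An exception must be able to explain why it will not pollute the fixed repo's source truth and must not touch unrelated parallel changes. As soon as the work needs to write source, configuration, issue closeout, deployment scripts, acceptance artifacts or other high-risk delivery records, switch back to an independent `.worktree` immediately.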

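For problems involving model providers, API providers, hardware links, cross-platform bridges, CLI/trans/tran or high-frequency toolchains, UniDesk's misjudgment-prevention check must still be completed before declaring an external blocker: confirm the current runtime config / Secret key presence / env / proxy / NO_PROXY / endpoint / args, reproduce on the actual target runtime, and compare against mature UniDesk/HWLAB implementations where possible. When user feedback or new evidence overturns the blocker judgment, switch back to the `$dad-dev` field-repair loop immediately.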

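If a field step is repeatedly painful because of quoting, route location, kubeconfig, output volume or a missing helper, improve the UniDesk passthrough / CLI first and record the stable entry point in this file's `Distributed Command Passthrough` section or in `docs/reference/cli.md`; do not accumulate a batch of one-off shell recipes.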

## Distributed Command Passthrough

Distributed runtime work should prefer structured CLI passthrough over ad-hoc nested shell strings. The standard escalation order is:

1. Use a purpose-built UniDesk route plus operation or helper such as `trans D601:k3s kubectl ...`, `trans D601:k3s sh`, `trans D601:k3s:<namespace>:<workload> logs`, `trans D601:k3s:<namespace>:<workload> sh`, `trans D601:k3s:<namespace>:<workload>[:<container>] apply-patch --cwd /workspace`, `trans <providerId>:/absolute/workspace apply-patch`, `trans <providerId> py`, `trans <providerId> find`, `trans <providerId> glob` or `trans <providerId> skills`. Use legacy `apply-patch-v1` only when the old remote helper is explicitly required.
2. If no helper exists, use `trans <providerId> argv <command> [args...]` so the CLI quotes each argv token once.
3. If shell features such as pipes, redirects, loops or variable expansion are required, use a single quoted heredoc with explicit `trans <providerId> sh|bash` or `trans D601:k3s:<namespace>:<workload> sh|bash` so the script body travels over stdin instead of through shell command-string arguments.
4. Treat free-form ssh-like command strings as an interactive compatibility path, not as the default automation surface.
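
To illustrate what "quotes each argv token once" in step 2 means, the sketch below applies standard POSIX single-quote quoting to each token. It is not the `trans` implementation, which this document does not show; `quoteShellArg` and `quoteArgv` are hypothetical names, and the set of characters treated as safe is a conservative assumption.

```typescript
// POSIX single-quote quoting: wrap the token in '...' and rewrite every
// embedded single quote as '\'' (close quote, escaped quote, reopen quote).
function quoteShellArg(arg: string): string {
  // Tokens made only of conservative safe characters pass through unchanged.
  // "=" is deliberately not in the safe set, so KEY=value tokens are quoted
  // and cannot be parsed as variable assignments.
  if (/^[A-Za-z0-9_\/:.,@+-]+$/.test(arg)) return arg;
  return "'" + arg.replace(/'/g, "'\\''") + "'";
}

// Quote every token exactly once and join with single spaces.
function quoteArgv(argv: string[]): string {
  return argv.map((token) => quoteShellArg(token)).join(" ");
}
```

For example, `quoteArgv(["echo", "it's ok"])` yields `echo 'it'\''s ok'`. Quoting in one place only is the point: a token that is quoted again at each hop ends up with a different meaning at the process that finally runs it.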

For D601 Kubernetes work, route syntax is preferred over positional shell recipes, but the route must stay a pure locator. `D601:k3s` means the native k3s control plane, and `D601:k3s:<namespace>:<workload>[:container]` means a namespaced workload or pod/container. `:` is the distributed route separator; `/` is only an in-container filesystem cwd, so container selection must use `:<container>` or `--container <container>`, not `pod/<pod>/<container>`. Operations come after the route: `kubectl` runs on the control plane, `logs` reads bounded workload logs, `sh`/`bash` stream a local heredoc/stdin script into the host or target pod with an explicit shell dialect, and `apply-patch --cwd /workspace` is the default remote text patch operation for pod workspaces. The route-operation split keeps distributed location and execution behavior independently extensible, fixes `KUBECONFIG=/etc/rancher/k3s/k3s.yaml`, refuses long-follow logs, and assembles common `kubectl exec` / `kubectl logs` / stdin shell / pod patch target arguments without adding a provider-gateway protocol change. This prevents the common failure mode where a command crosses local shell, UniDesk SSH broker, remote shell command strings, `kubectl exec`, and container shell quoting layers before reaching the process that should run it.
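
The route grammar above can be sketched as a small parser that keeps the route a pure locator. This is an illustrative reading of the documented syntax only: `K3sRoute` and `parseK3sRoute` are hypothetical names, and the real CLI may accept forms that this sketch rejects.

```typescript
type K3sRoute =
  | { kind: "control-plane"; provider: string }
  | { kind: "workload"; provider: string; namespace: string; workload: string; container?: string };

function parseK3sRoute(route: string): K3sRoute {
  const parts = route.split(":");
  // ":" is the only route separator; "/" belongs to in-container paths,
  // so forms like pod/<pod>/<container> are rejected.
  if (parts.some((part) => part.length === 0 || part.indexOf("/") !== -1)) {
    throw new Error(`route segments must be non-empty and must not contain "/": ${route}`);
  }
  const [provider = "", plane = "", namespace, workload, container] = parts;
  if (plane !== "k3s") throw new Error(`not a k3s route: ${route}`);
  if (parts.length === 2) return { kind: "control-plane", provider };
  if (namespace === undefined || workload === undefined || parts.length > 5) {
    throw new Error(`expected <provider>:k3s:<namespace>:<workload>[:<container>]: ${route}`);
  }
  if (container === undefined) return { kind: "workload", provider, namespace, workload };
  return { kind: "workload", provider, namespace, workload, container };
}
```

The parser returns only location data. Operations such as `kubectl`, `logs` or `sh` would be handled separately after the route, mirroring the route-operation split described above.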

Longer scripts should move across stdin (`trans py`, explicit `trans sh|bash`, or k3s `sh|bash` operation), and remote text patches should default to `apply-patch` with a host or pod workspace route. Legacy `apply-patch-v1` remains available as the explicit fallback and uses the injected `sh` helper path instead of assuming target containers have `python3`, `node` or repository-local tools. Avoid heredocs nested inside remote command strings, `python - <<EOF` inside SSH strings, or JSON/Markdown bodies passed through shell arguments. These patterns often bind stdin to the wrong process, strip quotes, or leave a half-open provider SSH session that looks like a platform outage.

When structured passthrough is missing for a recurring workflow, fix the CLI first and then document the durable helper. Do not preserve a growing collection of one-off shell recipes as the long-term runbook.

`trans`/`tran` and non-interactive `ssh` are short-operation tools. Their outer runtime limit is intentionally bounded, so long builds, downloads, Tekton/Argo observations, device operations and Code Agent trace waits must use short-start plus poll semantics: start or observe a named job, write logs/state on the target, then return; follow-up commands read bounded status, log tails and terminal evidence. A `UNIDESK_TRAN_TIMEOUT_HINT` means the caller held the transport too long, not that the target task necessarily failed.
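
The short-start plus poll pattern can be sketched as a bounded loop over a status reader. This is a simplified, synchronous illustration with hypothetical names (`JobStatus`, `PollOutcome`, `pollNamedJob`); in practice each poll would be its own short `trans` invocation that reads state written on the target, and the status values shown are assumptions.

```typescript
type JobStatus = "running" | "succeeded" | "failed";

interface PollOutcome {
  status: JobStatus;
  polls: number;
  // True when the caller ran out of polls while the job was still running.
  // This mirrors a transport timeout: it says nothing about job failure.
  callerBudgetExhausted: boolean;
}

// readStatus stands in for one short, bounded status read on the target.
function pollNamedJob(readStatus: () => JobStatus, maxPolls: number): PollOutcome {
  let status: JobStatus = "running";
  let polls = 0;
  while (polls < maxPolls) {
    status = readStatus();
    polls += 1;
    if (status !== "running") {
      return { status, polls, callerBudgetExhausted: false };
    }
  }
  return { status, polls, callerBudgetExhausted: true };
}
```

Keeping `callerBudgetExhausted` separate from `status` prevents an exhausted polling budget from being reported as a failed job.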

## D601 Recovery Hotfix Exception

D601 reboot recovery has a narrow hotfix exception because k3s, Code Queue and hostPath readiness can fail before normal UniDesk proxy/CD surfaces are healthy. The exception authorizes diagnosis and carefully scoped host repair only; it does not make live host edits a new deployment path.

Allowed read-only recovery checks:

- `bun scripts/cli.ts check recovery-guardrails` or `bun scripts/cli.ts check --recovery-guardrails` on the host or runner environment.
- `/proc/mounts` inspection for malformed Docker Desktop `/Docker/host` 9p rows that may break kubelet mount-table validation.
- `kubectl get/describe/logs/events` and `crictl pods -o json` as bounded observation.
- Path metadata checks for `/home/ubuntu/unidesk-code-queue-deploy`, `/home/ubuntu/cq-deploy`, Code Queue hostPath directories, `.codex` files, `.ssh`, `.agents/skills`, and MDTODO workspace/log paths.

Manual host hotfix may be considered only after the read-only output identifies a concrete redline and the operator has reviewed whether the target is source checkout, credential material, log/cache state, or user data. Examples include restoring a missing Git worktree from pushed remote state, recreating an intended compatibility symlink after confirming its target, restoring a missing runtime Secret source, or fixing a Docker Desktop/WSL mount-table condition. The repair must be recorded in the relevant issue or commander brief and followed by a source or runbook update when the root cause is durable.

Forbidden automatic recovery actions:

- `systemctl restart k3s`, `service k3s restart`, `kubectl delete pod`, `crictl rmp`, `crictl rm`, `docker system prune`, `docker volume prune`, recursive chmod/chown/rm under `/home/ubuntu`, and `git reset --hard` of a live worktree.
- Deleting CRI sandboxes or Kubernetes Pods because a diagnostic counted stale sandboxes.
- Creating, deleting or replacing MDTODO workspace content from Code Queue or the generic CLI. MDTODO hostPaths contain user-authored Markdown data.
- Treating `DirectoryOrCreate` as permission to mass-create parent trees or credential directories. Kubelet may create the final directory only when mount validation and parent permissions are already healthy; humans still decide user-data boundaries.
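
Automation that proposes recovery commands could screen them against the forbidden list before anything runs. The sketch below is a non-exhaustive illustration: `FORBIDDEN_PATTERNS` and `isForbiddenRecoveryAction` are hypothetical names, the regular expressions are assumptions that cover only the literal commands listed above, and a passing result does not replace the approval requirement for high-risk host actions.

```typescript
// Patterns for the literal commands named in the forbidden list (illustrative only).
const FORBIDDEN_PATTERNS: RegExp[] = [
  /\bsystemctl\s+restart\s+k3s\b/,
  /\bservice\s+k3s\s+restart\b/,
  /\bkubectl\s+delete\s+pods?\b/,
  /\bcrictl\s+(rmp|rm)\b/,
  /\bdocker\s+(system|volume)\s+prune\b/,
  /\bgit\s+reset\s+--hard\b/,
  // Recursive chmod/chown/rm targeting /home/ubuntu (flag containing r or R).
  /\b(chmod|chown|rm)\b.*\s-\w*[rR]\w*\b.*\/home\/ubuntu/,
];

// Returns true when the command line matches any forbidden pattern.
function isForbiddenRecoveryAction(command: string): boolean {
  return FORBIDDEN_PATTERNS.some((pattern) => pattern.test(command));
}
```

Read-only probes such as `kubectl get pods -A -o json` and `crictl pods -o json` pass this check, while `crictl rmp <id>` or `rm -rf /home/ubuntu/...` are flagged.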

ClaudeQQ or direct user approval is required before any high-risk host action that restarts k3s/kubelet/Docker Desktop, deletes pods/sandboxes, changes credential or SSH paths, touches MDTODO workspace data, force-resets a worktree, changes production rollout state, or could interrupt active Code Queue tasks. If ClaudeQQ is unavailable, the operator must stop at the written plan and record the approval gap instead of silently executing the action.

## CI And Private Source Auth

Private repository access is part of the CI contract. `ci run` must not rely on unauthenticated HTTPS clone, an operator's local dirty worktree or an ad-hoc secret copied by hand into one PipelineRun.

Acceptable source access implementations are:

- a first-class CI Git credential installed by `ci install` and referenced by the Tekton Pipeline;
- an in-cluster Git mirror managed by UniDesk; or
- the same commit-pinned host-fetch boundary used by `ci run-dev-e2e`, where D601 fetches the manifest commit and passes only verified inputs into Tekton.

For backend-core artifact publication, the required implementation is the host-fetch boundary: D601 uses the existing GitHub SSH deploy identity plus the node-local provider-gateway WS egress proxy, exports the requested commit to `/home/ubuntu/.unidesk/ci/backend-core-artifacts/<commit>`, and passes only that verified source directory into Tekton. This path must not be replaced by an in-cluster Git mirror, a third-party source mirror, an operator local checkout, or Tekton-mounted GitHub credentials.

If a CI repo-check task fails at `git clone` because credentials are unavailable, classify it as a CI infrastructure/auth gap, not as an application test failure.

## Verification Priority

When checks disagree, use this priority order:

1. The pushed desired commit and `deploy.json`/manifest contract.
2. Live runtime metadata proving which commit is deployed.
3. Controlled health checks and service proxy checks for the same runtime object.
4. CI/e2e results tied to the same commit.
5. Manual curl/kubectl output as supporting evidence only.

A passing lower-priority manual check cannot override a higher-priority mismatch. Fix the desired state, deploy path, runtime metadata or CI infrastructure until all layers agree.
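
The priority rule can be expressed as "report the highest-priority failing check, regardless of what passes below it". This is a minimal sketch with hypothetical names (`CheckResult`, `highestPriorityMismatch`); priorities follow the numbered list above, with 1 as the highest.

```typescript
interface CheckResult {
  priority: number; // 1 = pushed desired commit ... 5 = manual curl/kubectl output
  name: string;
  passes: boolean;
}

// Returns the failing check with the highest priority (lowest number),
// or null when every layer agrees.
function highestPriorityMismatch(checks: CheckResult[]): CheckResult | null {
  let worst: CheckResult | null = null;
  for (const check of checks) {
    if (check.passes) continue;
    if (worst === null || check.priority < worst.priority) worst = check;
  }
  return worst;
}
```

A passing priority-5 manual check never appears in the result, so it cannot mask a failing priority-2 runtime metadata check.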