pikasTech-unidesk/docs/reference/devops-hygiene.md
2026-06-13 11:07:51 +00:00

DevOps Hygiene

This document is the authoritative source for UniDesk deployment hygiene: Git-backed deployment truth, dirty-environment boundaries, bounded manual operations and CI source-auth rules. Release-line and CI/CD runtime-version governance is owned by docs/reference/release-governance.md and GitHub issue #6. If the same hygiene rule would need edits in docs/reference/dev-environment.md, docs/reference/deploy.md, docs/reference/ci.md, docs/reference/dev-ci-runner.md, docs/reference/deployment.md, AGENTS.md or TEST.md, keep the detailed rule here and leave only a cross-reference elsewhere.

Source Of Truth

UniDesk deployment state is healthy only when all three layers point back to pushed Git commits:

  • Desired state: origin/master:deploy.json, config.json and committed service manifests.
  • Runtime state: live service metadata, image tags, Deployment annotations/env stamps or Compose labels that identify the deployed commit.
  • Verification state: deploy plan, health checks, server status, service proxy checks and CI/e2e results that agree with the desired commit.

Local worktrees, D601 runtime files, copied scripts, copied images, ad-hoc Kubernetes objects and one-off curl results are never deployment truth.
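
The three-layer rule above can be sketched as a small agreement check. This is only an illustration under stated assumptions: the `LayerReport` type, its field names and the way each layer's commit is collected are hypothetical and are not part of the UniDesk CLI.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LayerReport:
    """Commit identity reported by one layer (hypothetical structure)."""
    layer: str    # "desired", "runtime" or "verification"
    commit: str   # commit SHA the layer points at
    pushed: bool  # True only if that commit exists on the remote


def deployment_is_healthy(reports: list[LayerReport]) -> tuple[bool, str]:
    """Healthy only when all three layers point at the same pushed commit."""
    required = {"desired", "runtime", "verification"}
    missing = required - {r.layer for r in reports}
    if missing:
        return False, "missing layers: " + ", ".join(sorted(missing))
    unpushed = sorted(r.layer for r in reports if not r.pushed)
    if unpushed:
        return False, "not backed by a pushed commit: " + ", ".join(unpushed)
    if len({r.commit for r in reports}) != 1:
        return False, "layers disagree on the deployed commit"
    return True, "all layers agree"
```

A dirty worktree or a hand-edited runtime object would show up here either as `pushed=False` or as a commit mismatch, which is why neither can count as deployment truth.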

When stable release lanes such as release/v1 are enabled, the desired-state ref must be explicit in the command, job log and deploy output. Until that support exists, commands that are documented to read origin/master:deploy.json must keep doing so and must not silently switch to another branch or a dirty manifest.

Prohibited Deployment Truth

The following practices are not acceptable as the long-term or hidden source of a working environment:

  • Hand editing D601 runtime files, k3s manifests, ConfigMaps, Deployments or container env values and treating the live result as source of truth.
  • Rebuilding backend-core, frontend, k3sctl-adapter or other managed services from a dirty worktree on the master server, D601 or an operator machine.
  • Copying large local shell scripts, generated manifests, Docker images or application source to D601 as the main deployment mechanism.
  • Fixing dev or production reachability by adding direct D601 public ports, NodePorts or backend-core hardcoded service entries instead of updating the proper catalog/control bridge.
  • Treating server rebuild backend-core as a Rust backend-core iteration path; Rust build/check belongs to D601 CI and CD must consume the published artifact.
  • Using local-manifest production deploy for services that already have artifact consumers. backend-core, frontend, baidu-netdisk and decision-center production deploys must enter through deploy apply --env prod so CD consumes a commit-pinned registry artifact instead of silently falling back to target-side source build.
  • Treating upstream images as UniDesk source-build services. filebrowser and filebrowser-d601 are upstream-image consumers; they require digest pin or digest-verified mirror governance and must not be added to Dockerfile CI artifacts.
  • Considering manual curl, kubectl or Docker checks sufficient when live commit metadata, deploy plan, health checks and CI/e2e disagree.

Bounded Manual Operations

Manual operations are allowed only when they are narrow, visible and followed by commit-based verification:

  • server rebuild dev-frontend-proxy may update the thin main-server nginx proxy for the public dev UI port; it must not build backend/frontend application code.
  • deploy apply --service k3sctl-adapter may update the k3s control bridge catalog through the documented local manifest exception in docs/reference/deploy.md.
  • Host SSH/provider-gateway dispatch may start the ci run-dev-e2e short launcher described in docs/reference/dev-ci-runner.md; it must not carry large shell bodies or become a general deployment path.
  • Read-only smoke checks such as curl http://74.48.78.17:18083/health, server status, microservice proxy .../health and kubectl get may validate state, but they do not replace desired-state and live-commit verification.
  • bun scripts/cli.ts check recovery-guardrails and the equivalent check --recovery-guardrails gate are commander-safe read-only recovery diagnostic surfaces for D601 reboot incidents. They may read /proc/mounts, inspect local path metadata, parse committed k3s manifests, and run bounded read-only kubectl get pods -A -o json / crictl pods -o json probes when the tools are present. They must not restart k3s, delete pods or CRI sandboxes, apply manifests, rollout workloads, mutate hostPath directories, prune Docker state, or repair symlinks automatically.
  • File Browser recovery may use the existing provider-local image-only/docker-run path only as a bounded repair path. Standardization requires first resolving docker.io/filebrowser/filebrowser:v2.63.3 to an upstream manifest digest or a digest-verified local mirror, then validating the running container through the UniDesk private proxy.

Manual Secret/env/rollout repair is allowed only as a bounded runtime recovery path. It must have explicit authorization for the target environment and service, a narrow object scope, redacted evidence, an issue review trail and a durable source fix. Acceptable evidence includes object names, revision changes, health status, rollout status and redacted key presence; it must not include secret values, tokens, full env dumps or copy-pastable sensitive mutation commands.
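
As a minimal sketch of "redacted key presence", the helper below reports only whether each expected key is set and never echoes a value. The function name and the sample key names are hypothetical; how the env mapping is obtained from the target is outside this sketch.

```python
def redact_env_evidence(env: dict[str, str], expected_keys: list[str]) -> dict[str, str]:
    """Return presence-only evidence; values never appear in the output.

    An empty string is reported as "missing" because an empty credential
    is usually as broken as an absent one (an assumption of this sketch).
    """
    return {
        key: ("present" if env.get(key) else "missing")
        for key in sorted(expected_keys)
    }


# Hypothetical example input; the key names are illustrative only.
evidence = redact_env_evidence(
    {"API_TOKEN": "s3cr3t", "EMPTY_KEY": ""},
    ["API_TOKEN", "DB_URL", "EMPTY_KEY"],
)
```

The resulting mapping is safe to paste into an issue review trail, whereas a full env dump is not.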

Any manual repair that changes live credentials, env wiring, DNS/egress assumptions, ConfigMaps, Deployments or rollout state must be followed by a source-of-truth update in the owning repository, secret management path, manifest, deployment catalog or a tracked remediation issue. The live environment is recovery evidence, not deployment truth.

If a manual repair is needed to unblock the platform, the durable fix must be committed and pushed, then redeployed or revalidated through the normal path. Do not preserve the repair only as hidden runtime state.

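Distributed Agile Process

"Distributed agile" (分布式敏捷) is UniDesk's fixed process name for distributed agile field repair. The generic P1/P2/P3/P4 stages, prohibited behaviors and evidence boundaries are maintained by the $dad-dev skill and are not repeated in this reference. UniDesk keeps only the following project-specific constraints: real providers, pods, host bridges and service ports must be entered through the structured trans/UniDesk CLI; a runtime hot patch can only prove a direction or provide temporary recovery and must not become hidden deployment truth; durable completion must go back through Git/PR/CI/CD and then be retested through the original entry point.

The fixed main repo is the source truth anchor, not a scratch area for source or runtime work. Any work that produces source code, configuration, issue closeout, deployment scripts, acceptance artifacts or high-risk dad-dev / post-task deliverables must first create a task-specific .worktree/<task> from the latest remote/base of the target fixed repo. All subsequent editing, verification, commits, pushes and controlled CLI write operations happen inside that worktree. The fixed repo is used only for git fetch, git status, reading rules and git worktree add. Parallel uncommitted changes already present there are left untouched by default, are not part of the current task, and must not be "cleaned up" with stash, reset, checkout or deletion.

Documentation governance is a lightweight exception to the fixed main repo protection rule. Pure documentation, AGENTS.md, docs/reference/*.md, skill rules, runbooks, distillation of process documents and other long-term reference consolidation do not need a new .worktree or a short-lived PR. Instead, run git pull --ff-only in the current main worktree to align with the latest remote, then edit directly, run a minimal syntax/diff check, commit and push. The exception covers only the documentation and rules themselves and must not carry source, configuration, deployment, runtime or issue-lifecycle write operations. If the main worktree already has parallel documentation changes, commit only the files clearly related to the current change; do not stash, reset or casually merge other people's changes.

Scenarios that may skip creating a new .worktree include P1 read-only probing, temporary runtime hot patches, the lightweight documentation/skill/long-term reference edits described above, and direct-edit exceptions explicitly declared in the target project's long-term reference. An exception must be able to explain why it will not pollute the fixed repo source truth, and it must not touch unrelated parallel changes. As soon as the work needs to write source code, configuration, issue closeout, deployment scripts, acceptance artifacts or other high-risk delivery records, switch back to an independent .worktree immediately.

For problems involving model providers, API providers, hardware links, cross-platform bridges, CLI/trans/tran or high-frequency toolchains, UniDesk's misjudgment check must still be completed before an external blocker is declared: confirm the current runtime config / Secret key presence / env / proxy / NO_PROXY / endpoint / args, reproduce the problem on the actual target runtime, and compare against mature UniDesk/HWLAB implementations where possible. When user feedback or new evidence overturns the blocker judgment, switch back to the $dad-dev field-repair loop immediately.

If a field step is repeatedly painful because of quoting, route location, kubeconfig, output volume or a missing helper, improve the UniDesk passthrough / CLI first and record the stable entry point in this file's Distributed Command Passthrough section and in docs/reference/cli.md. Do not accumulate a batch of one-off shell recipes.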
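
The task-worktree rule can be sketched as the short command sequence a helper might assemble before any write happens. This is an assumption-laden illustration: the `task/<task>` branch naming and the `origin/master` default base are hypothetical choices, and the function only builds argv lists rather than running Git.

```python
def worktree_commands(task: str, base: str = "origin/master") -> list[list[str]]:
    """Build the only Git commands the fixed repo should see for a new task.

    Returns argv lists so a caller can run them without shell quoting.
    The branch name `task/<task>` is a hypothetical convention.
    """
    if not task or "/" in task or task.startswith(".") or " " in task:
        raise ValueError(f"unsafe task name: {task!r}")
    return [
        ["git", "fetch", "origin"],      # align with the latest remote
        ["git", "status", "--short"],    # observe, never clean, parallel edits
        ["git", "worktree", "add", "-b", f"task/{task}", f".worktree/{task}", base],
    ]


commands = worktree_commands("fix-dev-proxy")
```

Nothing in the sequence stashes, resets or checks out files in the fixed repo, so parallel uncommitted changes there stay untouched.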

Distributed Command Passthrough

Distributed runtime work should prefer structured CLI passthrough over ad-hoc nested shell strings. The standard escalation order is:

  1. Use a purpose-built UniDesk route plus operation or helper such as trans D601:k3s kubectl ..., trans D601:k3s script, trans D601:k3s:<namespace>:<workload> logs, trans D601:k3s:<namespace>:<workload> script, trans D601:k3s:<namespace>:<workload>/<workspace> apply-patch, trans <providerId>:/absolute/workspace apply-patch, trans <providerId> py, trans <providerId> find, trans <providerId> glob or trans <providerId> skills. Use legacy apply-patch-v1 only when the old remote helper is explicitly required.
  2. If no helper exists, use trans <providerId> argv <command> [args...] so the CLI quotes each argv token once.
  3. If shell features such as pipes, redirects, loops or variable expansion are required, use a single quoted heredoc with trans <providerId> script or trans D601:k3s:<namespace>:<workload> script so the script body travels over stdin instead of through shell command-string arguments.
  4. Treat free-form ssh-like command strings as an interactive compatibility path, not as the default automation surface.
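
To illustrate why step 2 insists on quoting each argv token exactly once, the sketch below uses Python's standard `shlex` module as a stand-in for one shell layer. It is not the trans implementation, and the sample kubectl arguments are only an example.

```python
import shlex


def quote_once(argv: list[str]) -> str:
    """Render argv as one shell-safe string, quoting each token once."""
    return shlex.join(argv)


# Example tokens containing a space and shell metacharacters.
argv = [
    "kubectl", "get", "pods",
    "-l", "app=my svc",
    "-o", "jsonpath={.items[*].metadata.name}",
]

quoted = quote_once(argv)
naive = " ".join(argv)

# One shell-style parse recovers the original tokens only from the quoted form.
round_trip_ok = shlex.split(quoted) == argv
naive_ok = shlex.split(naive) == argv
```

Here `round_trip_ok` is true and `naive_ok` is false, because the naive join lets `app=my svc` split into two tokens. Each additional shell layer would need another round of quoting, which is the failure mode the argv operation is meant to remove.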

For D601 Kubernetes work, route syntax is preferred over positional shell recipes, but the route must stay a pure locator. D601:k3s means the native k3s control plane, and D601:k3s:<namespace>:<workload>[:container] means a namespaced workload or pod. Operations come after the route: kubectl runs on the control plane, logs reads bounded workload logs, script streams a local heredoc/stdin script into the host or target pod, and apply-patch is the default remote text patch operation for host or pod workspaces. The route-operation split keeps distributed location and execution behavior independently extensible, fixes KUBECONFIG=/etc/rancher/k3s/k3s.yaml, refuses long-follow logs, and assembles common kubectl exec / kubectl logs / stdin script / pod patch target arguments without adding a provider-gateway protocol change. This prevents the common failure mode where a command crosses local shell, UniDesk SSH broker, remote shell command strings, kubectl exec, and container shell quoting layers before reaching the process that should run it.
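
The "pure locator" idea can be sketched with a tiny parser for the two k3s route shapes named above. The `K3sRoute` type and `parse_k3s_route` function are hypothetical, the `/<workspace>` suffix form is deliberately not handled, and the real CLI may validate routes differently.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class K3sRoute:
    """Where to run, with no information about what to run (hypothetical type)."""
    provider: str
    namespace: Optional[str] = None
    workload: Optional[str] = None
    container: Optional[str] = None

    @property
    def is_control_plane(self) -> bool:
        return self.namespace is None


def parse_k3s_route(route: str) -> K3sRoute:
    """Parse `<provider>:k3s` or `<provider>:k3s:<namespace>:<workload>[:container]`."""
    parts = route.split(":")
    if any(p == "" or " " in p for p in parts):
        raise ValueError("a route is a pure locator and must not embed an operation")
    if len(parts) < 2 or parts[1] != "k3s":
        raise ValueError(f"not a k3s route: {route!r}")
    if len(parts) == 2:
        return K3sRoute(provider=parts[0])
    if len(parts) in (4, 5):
        container = parts[4] if len(parts) == 5 else None
        return K3sRoute(parts[0], parts[2], parts[3], container)
    raise ValueError(f"expected <namespace>:<workload>[:container] in {route!r}")
```

Because the parsed route carries no operation, `kubectl`, `logs`, `script` and `apply-patch` can be added or changed independently of how targets are located.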

Longer scripts should move across stdin (trans py, trans script or k3s script operation), and remote text patches should default to apply-patch with a host or pod workspace route. Legacy apply-patch-v1 remains available as the explicit fallback and uses the injected sh helper path instead of assuming target containers have python3, node or repository-local tools. Avoid heredocs nested inside remote command strings, python - <<EOF inside SSH strings, or JSON/Markdown bodies passed through shell arguments. These patterns often bind stdin to the wrong process, strip quotes, or leave a half-open provider SSH session that looks like a platform outage.
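
A minimal local sketch of the stdin pattern follows. It uses a plain `sh -s` process as a stand-in for a remote script operation, so the runner argv is an assumption; the point is only that the script body never passes through a shell argument.

```python
import subprocess


def run_script_over_stdin(runner_argv: list[str], script: str,
                          timeout_s: float = 30.0) -> subprocess.CompletedProcess:
    """Send `script` to the runner's stdin instead of embedding it in argv."""
    return subprocess.run(
        runner_argv,
        input=script,          # the body travels over stdin, unquoted
        text=True,
        capture_output=True,
        timeout=timeout_s,     # keep the transport bounded
        check=False,
    )


# Script body with nested quotes and a pipe; no extra escaping layer is needed.
SCRIPT = """msg="quotes 'survive' intact"
printf '%s\\n' "$msg" | tr a-z A-Z
"""

# `sh -s` reads the script from stdin; a local stand-in for a remote target.
result = run_script_over_stdin(["sh", "-s"], SCRIPT)
```

The same body embedded in a nested command string would need one extra escaping pass per shell layer, which is where quotes typically get stripped.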

When structured passthrough is missing for a recurring workflow, fix the CLI first and then document the durable helper. Do not preserve a growing collection of one-off shell recipes as the long-term runbook.

trans/tran and non-interactive ssh are short-operation tools. Their outer runtime limit is intentionally bounded, so long builds, downloads, Tekton/Argo observations, device operations and Code Agent trace waits must use short-start plus poll semantics: start or observe a named job, write logs/state on the target, then return; follow-up commands read bounded status, log tails and terminal evidence. A UNIDESK_TRAN_TIMEOUT_HINT means the caller held the transport too long, not that the target task necessarily failed.
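
The short-start plus poll semantics can be sketched locally as two independent calls: one that starts a named job and returns at once, and one that reads bounded state later. The state directory layout (`job.log`, `exit_code`) and both function names are hypothetical and do not describe how UniDesk stores job state.

```python
import os
import subprocess
from pathlib import Path

# Run the job, then publish its exit code atomically through a rename.
_WRAPPER = ('"$@"; echo $? > "$STATE_DIR/exit_code.tmp"; '
            'mv "$STATE_DIR/exit_code.tmp" "$STATE_DIR/exit_code"')


def start_job(state_dir: Path, argv: list[str]) -> None:
    """Start argv detached, log to state_dir, and return without waiting."""
    state_dir.mkdir(parents=True, exist_ok=True)
    (state_dir / "exit_code").unlink(missing_ok=True)  # drop stale results
    with open(state_dir / "job.log", "wb") as log:
        subprocess.Popen(
            ["sh", "-c", _WRAPPER, "job", *argv],
            stdin=subprocess.DEVNULL, stdout=log, stderr=subprocess.STDOUT,
            env={**os.environ, "STATE_DIR": str(state_dir)},
            start_new_session=True,  # the job outlives the caller's transport
        )


def poll_job(state_dir: Path, tail_bytes: int = 2048) -> dict:
    """Read bounded status: state, exit code and at most tail_bytes of log."""
    tail = ""
    log_file = state_dir / "job.log"
    if log_file.exists():
        with open(log_file, "rb") as fh:
            fh.seek(0, os.SEEK_END)
            fh.seek(max(0, fh.tell() - tail_bytes))
            tail = fh.read().decode(errors="replace")
    exit_file = state_dir / "exit_code"
    if not exit_file.exists():
        return {"state": "running", "exit_code": None, "log_tail": tail}
    code = int(exit_file.read_text().strip())
    state = "succeeded" if code == 0 else "failed"
    return {"state": state, "exit_code": code, "log_tail": tail}
```

Each call is short, so a transport timeout on the caller side says nothing about the job itself; the next `poll_job` call still reports the real state.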

D601 Recovery Hotfix Exception

D601 reboot recovery has a narrow hotfix exception because k3s, Code Queue and hostPath readiness can fail before normal UniDesk proxy/CD surfaces are healthy. The exception authorizes diagnosis and carefully scoped host repair only; it does not make live host edits a new deployment path.

Allowed read-only recovery checks:

  • bun scripts/cli.ts check recovery-guardrails or bun scripts/cli.ts check --recovery-guardrails on the host or runner environment.
  • /proc/mounts inspection for malformed Docker Desktop /Docker/host 9p rows that may break kubelet mount-table validation.
  • kubectl get/describe/logs/events and crictl pods -o json as bounded observation.
  • Path metadata checks for /home/ubuntu/unidesk-code-queue-deploy, /home/ubuntu/cq-deploy, Code Queue hostPath directories, .codex files, .ssh, .agents/skills, and MDTODO workspace/log paths.

Manual host hotfix may be considered only after the read-only output identifies a concrete redline and the operator has reviewed whether the target is source checkout, credential material, log/cache state, or user data. Examples include restoring a missing Git worktree from pushed remote state, recreating an intended compatibility symlink after confirming its target, restoring a missing runtime Secret source, or fixing a Docker Desktop/WSL mount-table condition. The repair must be recorded in the relevant issue or commander brief and followed by a source or runbook update when the root cause is durable.

Forbidden automatic recovery actions:

  • systemctl restart k3s, service k3s restart, kubectl delete pod, crictl rmp, crictl rm, docker system prune, docker volume prune, recursive chmod/chown/rm under /home/ubuntu, and git reset --hard of a live worktree.
  • Deleting CRI sandboxes or Kubernetes Pods because a diagnostic counted stale sandboxes.
  • Creating, deleting or replacing MDTODO workspace content from Code Queue or the generic CLI. MDTODO hostPaths contain user-authored Markdown data.
  • Treating DirectoryOrCreate as permission to mass-create parent trees or credential directories. Kubelet may create the final directory only when mount validation and parent permissions are already healthy; humans still decide user-data boundaries.
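
A diagnostic tool could enforce part of the list above with a simple argv-prefix denylist, sketched below. This is a hypothetical guard rather than the recovery-guardrails implementation: it only matches commands whose leading tokens equal a listed prefix, so flag-first variants and the recursive chmod/chown/rm cases would need real argument parsing.

```python
# Command prefixes taken from the forbidden list; matching is prefix-only.
FORBIDDEN_PREFIXES: list[tuple[str, ...]] = [
    ("systemctl", "restart", "k3s"),
    ("service", "k3s", "restart"),
    ("kubectl", "delete", "pod"),
    ("crictl", "rmp"),
    ("crictl", "rm"),
    ("docker", "system", "prune"),
    ("docker", "volume", "prune"),
    ("git", "reset", "--hard"),
]


def is_forbidden(argv: list[str]) -> bool:
    """True when argv starts with a command that must never run automatically."""
    return any(tuple(argv[:len(prefix)]) == prefix for prefix in FORBIDDEN_PREFIXES)


def guard(argv: list[str]) -> list[str]:
    """Return argv unchanged, or raise before anything is executed."""
    if is_forbidden(argv):
        raise PermissionError("forbidden automatic recovery action: " + " ".join(argv))
    return argv
```

A denylist like this is a backstop only; the safer design is for the diagnostic surface to expose read-only operations and nothing else.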

ClaudeQQ or direct user approval is required before any high-risk host action that restarts k3s/kubelet/Docker Desktop, deletes pods/sandboxes, changes credential or SSH paths, touches MDTODO workspace data, force-resets a worktree, changes production rollout state, or could interrupt active Code Queue tasks. If ClaudeQQ is unavailable, the operator must stop at the written plan and record the approval gap instead of silently executing the action.

CI And Private Source Auth

Private repository access is part of the CI contract. ci run must not rely on unauthenticated HTTPS clone, an operator's local dirty worktree or an ad-hoc secret copied by hand into one PipelineRun.

Acceptable source access implementations are:

  • a first-class CI Git credential installed by ci install and referenced by the Tekton Pipeline;
  • an in-cluster Git mirror managed by UniDesk; or
  • the same commit-pinned host-fetch boundary used by ci run-dev-e2e, where D601 fetches the manifest commit and passes only verified inputs into Tekton.

For backend-core artifact publication, the required implementation is the host-fetch boundary: D601 uses the existing GitHub SSH deploy identity plus the node-local provider-gateway WS egress proxy, exports the requested commit to /home/ubuntu/.unidesk/ci/backend-core-artifacts/<commit>, and passes only that verified source directory into Tekton. This path must not be replaced by an in-cluster Git mirror, a third-party source mirror, an operator local checkout, or Tekton-mounted GitHub credentials.

If a CI repo-check task fails at git clone because credentials are unavailable, classify it as a CI infrastructure/auth gap, not as an application test failure.
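
The classification rule can be sketched as a small triage helper. The step name `git-clone`, the label strings and the marker list are assumptions made for illustration; the markers are common Git and SSH authentication error fragments, not an exhaustive set.

```python
# Lower-cased fragments of typical Git/SSH authentication failures.
AUTH_MARKERS = (
    "could not read username",
    "authentication failed",
    "permission denied (publickey)",
)


def classify_failure(step: str, stderr: str) -> str:
    """Label a failed CI step; the step names and labels are hypothetical."""
    text = stderr.lower()
    if step == "git-clone" and any(marker in text for marker in AUTH_MARKERS):
        return "ci-infrastructure-auth-gap"
    # Anything else still needs a human or a more specific rule.
    return "needs-triage"
```

Keeping the fallback as `needs-triage` rather than an application failure avoids blaming the code under test for a missing credential.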

Verification Priority

When checks disagree, use this priority order:

  1. The pushed desired commit and deploy.json/manifest contract.
  2. Live runtime metadata proving which commit is deployed.
  3. Controlled health checks and service proxy checks for the same runtime object.
  4. CI/e2e results tied to the same commit.
  5. Manual curl/kubectl output as supporting evidence only.

A passing lower-priority manual check cannot override a higher-priority mismatch. Fix the desired state, deploy path, runtime metadata or CI infrastructure until all layers agree.
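
The priority order can be sketched as a function that returns the highest-priority layer that fails, so a passing manual check never masks an earlier mismatch. The layer keys are hypothetical labels for the five items above, and treating missing evidence in the first four layers as a mismatch is an assumption of this sketch.

```python
from typing import Optional

# Highest priority first; keys are illustrative labels for the list above.
PRIORITY = [
    "desired-commit",
    "runtime-metadata",
    "health-and-proxy-checks",
    "ci-e2e",
    "manual-checks",
]
REQUIRED = set(PRIORITY[:4])  # manual checks are supporting evidence only


def first_blocking_mismatch(results: dict[str, bool]) -> Optional[str]:
    """Return the highest-priority failing layer, or None when all agree."""
    for layer in PRIORITY:
        passed = results.get(layer)
        if passed is False or (passed is None and layer in REQUIRED):
            return layer
    return None


# A green manual curl does not override stale runtime metadata.
blocking = first_blocking_mismatch({
    "desired-commit": True,
    "runtime-metadata": False,
    "health-and-proxy-checks": True,
    "ci-e2e": True,
    "manual-checks": True,
})
```

In this example `blocking` is `"runtime-metadata"`, which is the layer to fix before any lower-priority result is trusted.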