feat: add d601 recovery guardrails

Adds read-only D601 recovery diagnostics, fixture coverage, CLI wiring, and recovery hotfix runbook updates. Validated with recovery contract, check --files, scripts tsc, artifact matrix direct contract, and read-only live diagnostic.
Lyon
2026-05-23 21:18:44 +08:00
committed by GitHub
parent 6c44f66289
commit e2646763c0
9 changed files with 1406 additions and 5 deletions
@@ -35,6 +35,7 @@ Manual operations are allowed only when they are narrow, visible and followed by
- `deploy apply --service k3sctl-adapter` may update the k3s control bridge catalog through the documented local manifest exception in `docs/reference/deploy.md`.
- Host SSH/provider-gateway dispatch may start the `ci run-dev-e2e` short launcher described in `docs/reference/dev-ci-runner.md`; it must not carry large shell bodies or become a general deployment path.
- Read-only smoke checks such as `curl http://74.48.78.17:18083/health`, `server status`, `microservice proxy .../health` and `kubectl get` may validate state, but they do not replace desired-state and live-commit verification.
- `bun scripts/cli.ts check recovery-guardrails` and the equivalent `check --recovery-guardrails` gate are commander-safe read-only recovery diagnostic surfaces for D601 reboot incidents. They may read `/proc/mounts`, inspect local path metadata, parse committed k3s manifests, and run bounded read-only `kubectl get pods -A -o json` / `crictl pods -o json` probes when the tools are present. They must not restart k3s, delete pods or CRI sandboxes, apply manifests, rollout workloads, mutate hostPath directories, prune Docker state, or repair symlinks automatically.
- File Browser recovery may use the existing provider-local image-only/docker-run path only as a bounded repair path. Standardization requires first resolving `docker.io/filebrowser/filebrowser:v2.63.3` to an upstream manifest digest or a digest-verified local mirror, then validating the running container through the UniDesk private proxy.
Manual Secret/env/rollout repair is allowed only as a bounded runtime recovery path. It must have explicit authorization for the target environment and service, a narrow object scope, redacted evidence, an issue review trail and a durable source fix. Acceptable evidence includes object names, revision changes, health status, rollout status and redacted key presence; it must not include secret values, tokens, full env dumps or copy-pastable sensitive mutation commands.
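As a minimal sketch of the "redacted key presence" evidence described above, the following reduces an env-style map to key names and presence flags, so values never reach the review trail. The `redactEnv` helper, the `KeyPresence` type and the sample keys are hypothetical and are not part of the UniDesk CLI.

```typescript
// Hypothetical helper (not part of scripts/cli.ts): report only whether each
// expected key is present and non-empty. Values are never copied to the output.
type KeyPresence = { key: string; present: boolean };

function redactEnv(
  env: Record<string, string | undefined>,
  expectedKeys: string[],
): KeyPresence[] {
  return expectedKeys.map((key) => {
    const value = env[key];
    return { key, present: typeof value === "string" && value !== "" };
  });
}

// Illustrative input only; the key names and values are made up.
const evidence = redactEnv(
  { API_TOKEN: "s3cr3t", DB_URL: "" },
  ["API_TOKEN", "DB_URL", "MISSING_KEY"],
);
console.log(JSON.stringify(evidence));
```

The output can be pasted into an issue next to object names and rollout status, because it carries no secret material.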
@@ -43,6 +44,28 @@ Any manual repair that changes live credentials, env wiring, DNS/egress assumpti
If a manual repair is needed to unblock the platform, the durable fix must be committed and pushed, then redeployed or revalidated through the normal path. Do not preserve the repair only as hidden runtime state.
## D601 Recovery Hotfix Exception
D601 reboot recovery has a narrow hotfix exception because k3s, Code Queue and hostPath readiness can fail before normal UniDesk proxy/CD surfaces are healthy. The exception authorizes diagnosis and carefully scoped host repair only; it does not make live host edits a new deployment path.
Allowed read-only recovery checks:
- `bun scripts/cli.ts check recovery-guardrails` or `bun scripts/cli.ts check --recovery-guardrails` on the host or runner environment.
- `/proc/mounts` inspection for malformed Docker Desktop `/Docker/host` 9p rows that may break kubelet mount-table validation.
- `kubectl get/describe/logs/events` and `crictl pods -o json` as bounded observation.
- Path metadata checks for `/home/ubuntu/unidesk-code-queue-deploy`, `/home/ubuntu/cq-deploy`, Code Queue hostPath directories, `.codex` files, `.ssh`, `.agents/skills`, and MDTODO workspace/log paths.
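The `/proc/mounts` check in the list above can be sketched as a pure parser over the file's text. This is an illustration under stated assumptions, not the shipped diagnostic: it assumes "malformed" means a row that does not split into the six whitespace-separated fields of the mount table format, or whose last two fields are not integers. The `findSuspect9pRows` name and the sample rows are hypothetical.

```typescript
// Hypothetical sketch: flag 9p or /Docker/host rows in /proc/mounts text that
// do not have the usual six fields (device, mountpoint, fstype, options, dump, pass).
type MountFinding = { line: number; raw: string; reason: string };

function findSuspect9pRows(mountsText: string): MountFinding[] {
  const findings: MountFinding[] = [];
  mountsText.split("\n").forEach((raw, index) => {
    if (raw.trim() === "") return;
    // Heuristic filter: only consider rows that mention 9p or /Docker/host.
    if (!raw.includes("9p") && !raw.includes("/Docker/host")) return;
    const fields = raw.trim().split(/\s+/);
    if (fields.length !== 6) {
      findings.push({
        line: index + 1,
        raw,
        reason: `expected 6 fields, found ${fields.length}`,
      });
      return;
    }
    const dump = fields[4] ?? "";
    const pass = fields[5] ?? "";
    if (!/^\d+$/.test(dump) || !/^\d+$/.test(pass)) {
      findings.push({ line: index + 1, raw, reason: "dump/pass fields are not integers" });
    }
  });
  return findings;
}

// Made-up sample rows for illustration; not captured from a real host.
const sampleMounts = [
  "overlay / overlay rw,relatime 0 0",
  "drvfs /Docker/host 9p rw,noatime,path=C:\\Program Files\\Docker 0 0",
  "drvfs /mnt/c 9p rw,noatime 0 0",
].join("\n");

console.log(findSuspect9pRows(sampleMounts));
```

On a host, the input would come from reading `/proc/mounts` as text. The function only reports findings and leaves any repair decision to the operator.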
A manual host hotfix may be considered only after the read-only output identifies a concrete redline and the operator has reviewed whether the target is a source checkout, credential material, log/cache state, or user data. Examples include restoring a missing Git worktree from pushed remote state, recreating an intended compatibility symlink after confirming its target, restoring a missing runtime Secret source, or fixing a Docker Desktop/WSL mount-table condition. The repair must be recorded in the relevant issue or commander brief and followed by a source or runbook update when the root cause is durable.
Forbidden automatic recovery actions:
- `systemctl restart k3s`, `service k3s restart`, `kubectl delete pod`, `crictl rmp`, `crictl rm`, `docker system prune`, `docker volume prune`, recursive chmod/chown/rm under `/home/ubuntu`, and `git reset --hard` of a live worktree.
- Deleting CRI sandboxes or Kubernetes Pods because a diagnostic counted stale sandboxes.
- Creating, deleting or replacing MDTODO workspace content from Code Queue or the generic CLI. MDTODO hostPaths contain user-authored Markdown data.
- Treating `DirectoryOrCreate` as permission to mass-create parent trees or credential directories. Kubelet may create the final directory only when mount validation and parent permissions are already healthy; humans still decide user-data boundaries.
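One way to keep automation away from the commands listed above is a denylist check that runs before any command is dispatched. The sketch below is an assumption about how such a guard could look, not the guardrail shipped in this commit. The `forbiddenReason` function and its patterns are hypothetical and not exhaustive, and a denylist complements human review rather than replacing it.

```typescript
// Hypothetical denylist guard: returns a label when a command line matches one
// of the forbidden automatic recovery actions, or null when nothing matches.
const FORBIDDEN_PATTERNS: { pattern: RegExp; label: string }[] = [
  { pattern: /^systemctl\s+restart\s+k3s\b/, label: "restart k3s via systemctl" },
  { pattern: /^service\s+k3s\s+restart\b/, label: "restart k3s via service" },
  { pattern: /^kubectl\s+delete\s+pods?\b/, label: "delete Kubernetes pods" },
  { pattern: /^crictl\s+(rmp|rm)\b/, label: "remove CRI sandboxes or containers" },
  { pattern: /^docker\s+(system|volume)\s+prune\b/, label: "prune Docker state" },
  { pattern: /^git\s+.*\breset\s+--hard\b/, label: "hard-reset a worktree" },
  {
    pattern: /^(chmod|chown|rm)\s+.*-\w*[rR]\w*\s.*\/home\/ubuntu/,
    label: "recursive chmod/chown/rm under /home/ubuntu",
  },
];

function forbiddenReason(command: string): string | null {
  // Normalise whitespace and drop a leading sudo so it cannot mask a match.
  const normalised = command.trim().replace(/\s+/g, " ").replace(/^sudo /, "");
  for (const { pattern, label } of FORBIDDEN_PATTERNS) {
    if (pattern.test(normalised)) return label;
  }
  return null;
}

console.log(forbiddenReason("sudo systemctl restart k3s"));
console.log(forbiddenReason("kubectl get pods -A -o json"));
```

Read-only probes such as `kubectl get pods -A -o json` and `crictl pods -o json` pass through, while each command in the first bullet is rejected with a label that can be recorded in the issue.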
ClaudeQQ or direct user approval is required before any high-risk host action that restarts k3s/kubelet/Docker Desktop, deletes pods/sandboxes, changes credential or SSH paths, touches MDTODO workspace data, force-resets a worktree, changes production rollout state, or could interrupt active Code Queue tasks. If ClaudeQQ is unavailable, the operator must stop at the written plan and record the approval gap instead of silently executing the action.
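The stop-at-the-plan rule above can be illustrated with a small decision function. It models only the approval gate and does not cover the earlier requirement for a read-only redline. The types, the `decideHostAction` name and the sample reference string are hypothetical, and how approval is actually obtained from ClaudeQQ is outside this sketch.

```typescript
// Hypothetical model of the approval gate for high-risk host actions.
type HostAction = { description: string; highRisk: boolean };
type Approval = { approver: "ClaudeQQ" | "user"; reference: string } | null;
type Decision =
  | { proceed: true; approvedBy: string }
  | { proceed: false; note: string };

function decideHostAction(action: HostAction, approval: Approval): Decision {
  if (!action.highRisk) {
    // Not covered by this gate; other runbook rules still apply.
    return { proceed: true, approvedBy: "not-required" };
  }
  if (approval === null) {
    // No approval available: stop at the written plan and record the gap.
    return {
      proceed: false,
      note: `Approval gap: "${action.description}" is planned but not executed.`,
    };
  }
  return { proceed: true, approvedBy: `${approval.approver}:${approval.reference}` };
}

console.log(decideHostAction({ description: "restart k3s", highRisk: true }, null));
```

Returning a note instead of throwing keeps the approval gap visible in the operator's record, so the plan is written down and nothing is executed silently.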
## CI And Private Source Auth
Private repository access is part of the CI contract. `ci run` must not rely on unauthenticated HTTPS clone, an operator's local dirty worktree or an ad-hoc secret copied by hand into one PipelineRun.