diff --git a/docs/reference/devops-hygiene.md b/docs/reference/devops-hygiene.md index 29ef8651..495f0dd2 100644 --- a/docs/reference/devops-hygiene.md +++ b/docs/reference/devops-hygiene.md @@ -44,6 +44,20 @@ Any manual repair that changes live credentials, env wiring, DNS/egress assumpti If a manual repair is needed to unblock the platform, the durable fix must be committed and pushed, then redeployed or revalidated through the normal path. Do not preserve the repair only as hidden runtime state. +## Distributed Field Repair Flow + +Distributed development should support fast field learning without letting runtime edits become hidden deployment truth. The standard flow is: + +1. Probe the real runtime surface first. Use structured UniDesk passthrough, service health endpoints, trace/result polling, bounded logs, object metadata and user-entry requests to reproduce the symptom on the actual target environment. Prefer short single-step commands that return promptly and can be repeated. +2. Apply an experimental runtime patch only when it is needed to prove a fix direction. The patch must be narrow, named, reversible and scoped to the affected deployment, pod, ConfigMap, env key, script mount or file. It must not include secrets, broad filesystem rewrites, unmanaged image builds, destructive resets or unrelated cleanup. +3. Validate the runtime patch from the user or service entry that exposed the problem. Supporting internal checks are useful, but the decisive evidence should include the external URL, API route, trace id, operation id, rollout object, health metadata or other runtime identity that proves the real path changed. +4. Write a recap issue before treating the fix as complete. The issue must include reproduction steps, runtime evidence, root cause, exact experimental patch shape, rollback or cleanup notes, durable source changes needed, and post-CD validation criteria. Sensitive values stay redacted; copy-pastable secret or credential mutation commands do not belong in the recap. +5. Convert the working runtime patch into the owning repository. The formal fix must be committed, pushed, reviewed through PR when applicable, and validated by the smallest appropriate CI gate on an approved execution surface. Master server local checks remain prohibited for heavy gates. +6. Roll the fix through the standard CD path. CD must consume commit-pinned artifacts or desired-state manifests rather than the live hotfix. If publish and desired-state commits differ, the rollout target is the already published source commit, not a later documentation-only or desired-state commit. +7. Re-test after CD rollout from the original user or service entry. The final evidence must show the deployed commit/image/runtime metadata, the relevant health or trace result, and that the temporary runtime patch is gone or no longer active. + +This flow deliberately separates agility from persistence: runtime probing and experimental patches are allowed to shorten diagnosis, while Git, PR, CI/CD and post-rollout validation remain the only durable completion path. If a recurring field step is painful because of quoting, target routing, kubeconfig selection, output volume or missing helpers, improve the UniDesk passthrough tool and document the new helper instead of preserving another one-off command recipe. + ## Distributed Command Passthrough Distributed runtime work should prefer structured CLI passthrough over ad-hoc nested shell strings. The standard escalation order is: