diff --git a/AGENTS.md b/AGENTS.md
index e161eeb4..d0fc1c65 100644
--- a/AGENTS.md
+++ b/AGENTS.md
@@ -68,6 +68,7 @@ UniDesk is a distributed work platform with the main server as its unified entry point; this document
 - `docs/reference/arch.md`: Long-term architecture constraints for the UniDesk distributed work platform.
 - `docs/reference/repo-tree.md`: Repository structure goals and component boundaries.
 - `docs/reference/strategy-governance.md`: UniDesk external-benefit constraints, the split between short-term and long-term benefits, and requirement review guidelines; strategic analysis records are in GitHub issue #7.
+- `docs/reference/code-queue-supervision.md`: Rules for Code Queue central scheduling, concurrent queue splitting, in-flight monitoring, infrastructure defect triage, and acceptance closeout.
 - `docs/reference/observability.md`: Observability rules for service logs, task liveness, the general performance metrics API, and the performance panel.
 - `docs/reference/microservices.md`: Configuration, proxying, security boundaries, unidesk-direct/k3sctl-managed deployment modes, Todo Note/Baidu Netdisk on main-server, k3s Control/Code Queue/MDTODO/Decision Center/FindJob/Pipeline/MET Nonlinear on D601, and verification rules for user services (compatibility name `microservice`).
 - `docs/reference/windows-passthrough.md`: Long-term rules for the WSL provider invoking Windows cmd/PowerShell, Keil, COM serial ports, and Windows-side skills through SSH passthrough.
diff --git a/docs/reference/code-queue-supervision.md b/docs/reference/code-queue-supervision.md
new file mode 100644
index 00000000..34934248
--- /dev/null
+++ b/docs/reference/code-queue-supervision.md
@@ -0,0 +1,121 @@
+# Code Queue Supervision Policy
+
+This document defines the long-term operating model for using Code Queue as a parallel delivery infrastructure under a human or lead-agent supervisor. It is a coordination policy, not a replacement for `docs/reference/microservices.md`, `docs/reference/observability.md`, `docs/reference/user-service-delivery.md`, or the Code Queue runtime contracts.
+
+## Scope
+
+Use this policy when a delivery goal is too large for one Code Queue task and must be split across multiple queues, services, or infrastructure lanes.
+
+This policy applies to:
+
+- user-service CI/CD rollouts;
+- multi-service fixes that need several isolated worktrees;
+- infrastructure defects found while user-service work is running;
+- follow-up validation, retry, and acceptance coordination;
+- manual supervisor work that keeps Code Queue tasks moving without taking over their implementation.
+
+It does not authorize bypassing the normal deployment, Git, or production safety rules.
+
+## Operating Principle
+
+The supervisor owns the outcome. Code Queue tasks own bounded execution.
+
+The supervisor should keep a live map of the delivery goal, active queues, blockers, evidence gaps, and next recovery action. Code Queue workers should receive self-contained prompts with enough context to execute without relying on GitHub issue visibility or chat history.
+
+The goal of central supervision is to increase delivery throughput and availability without turning every defect into a manual one-off fix.
+
+## Task Design
+
+Every Code Queue task must have a narrow ownership boundary.
+
+- Assign one service, module, or infrastructure defect per task when possible.
+- Give each task its own detached worktree under the shared workspace.
+- State the write scope, validation scope, commit/push requirement, and forbidden actions directly in the prompt.
+- Include the relevant background in the prompt itself; issue links are supporting references, not required context.
+- Prefer existing queues and create new queues only when the existing lanes cannot express the ownership boundary.
+- Keep queue concurrency bounded by real execution capacity. A target around five concurrent lanes is useful only when the active tasks have distinct write scopes and the execution plane is healthy.
+
+Prompts for production-adjacent work must explicitly forbid heavy local checks on the master server when those checks are known to risk OOM, and must tell the worker which validation belongs in D601 CI, dev env, or a target service container.
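Editorial sketch (not part of the patch): the Task Design rules above could be captured in a small prompt structure. The `TaskPromptSpec` type, `renderPrompt`, and `writeScopesOverlap` are hypothetical names introduced only for illustration and do not come from the UniDesk codebase; the sketch assumes write scopes are expressed as repository-relative paths.

```typescript
// Hypothetical shape for a self-contained Code Queue task prompt.
// Field names are illustrative assumptions, not UniDesk APIs.
interface TaskPromptSpec {
  owner: string;             // one service, module, or infrastructure defect
  worktree: string;          // detached worktree dedicated to this task
  background: string;        // context inlined in the prompt, not only linked
  writeScope: string[];      // paths the worker may modify
  validationScope: string[]; // which checks to run, and where
  requireCommitAndPush: boolean;
  forbiddenActions: string[];
}

// Render the prompt text, refusing specs that would not be self-contained.
function renderPrompt(spec: TaskPromptSpec): string {
  const required: Array<[string, string | string[]]> = [
    ["owner", spec.owner],
    ["worktree", spec.worktree],
    ["background", spec.background],
    ["writeScope", spec.writeScope],
    ["validationScope", spec.validationScope],
    ["forbiddenActions", spec.forbiddenActions],
  ];
  const missing = required.filter(([, v]) => v.length === 0).map(([k]) => k);
  if (missing.length > 0) {
    throw new Error(`prompt is not self-contained, missing: ${missing.join(", ")}`);
  }
  return [
    `Owner: ${spec.owner}`,
    `Worktree: ${spec.worktree}`,
    `Background: ${spec.background}`,
    `Write scope: ${spec.writeScope.join(", ")}`,
    `Validation scope: ${spec.validationScope.join(", ")}`,
    `Commit and push required: ${spec.requireCommitAndPush ? "yes" : "no"}`,
    `Forbidden actions: ${spec.forbiddenActions.join("; ")}`,
  ].join("\n");
}

// Strip trailing slashes so "a/b/" and "a/b" compare equal.
function normalizePath(p: string): string {
  return p.replace(/\/+$/, "");
}

// Two lanes have distinct write scopes only if no path in one is equal to,
// or nested under, a path in the other.
function writeScopesOverlap(a: TaskPromptSpec, b: TaskPromptSpec): boolean {
  return a.writeScope.some((p) =>
    b.writeScope.some((q) => {
      const x = normalizePath(p);
      const y = normalizePath(q);
      return x === y || x.startsWith(y + "/") || y.startsWith(x + "/");
    }),
  );
}
```

A supervisor script might call `writeScopesOverlap` on each pair of candidate tasks before running them in parallel lanes, and fall back to sequential execution when scopes collide.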
+
+## Monitoring
+
+The supervisor must monitor Code Queue with task-level and queue-level evidence, not with a single status field.
+
+Use:
+
+- `bun scripts/cli.ts codex queues` for queue counts, active task ids, unread terminal tasks, and control-plane diagnostics.
+- `bun scripts/cli.ts codex task ` for attempt, last assistant message, last error, cancel flag, and current status.
+- `bun scripts/cli.ts codex task --trace --limit N` or `codex output` only when the summary is insufficient.
+- The liveness rules in `docs/reference/observability.md` when master control-plane state and D601 scheduler state appear split.
+
+Long-running tasks with fresh trace or heartbeat evidence should normally be left alone. Polling every few minutes is preferred over repeated interrupt/retry cycles.
+
+## Intervention Rules
+
+Intervene only when there is a clear reason.
+
+- If a task is running and trace or scheduler heartbeat is fresh, guide rather than interrupt.
+- If a task reaches terminal state but lacks required acceptance evidence, retry the same task with a focused continuation prompt.
+- If a task is blocked by a reusable infrastructure defect, assign that defect to an appropriate empty or low-risk queue and keep the original business task waiting or retried after the fix.
+- If an infrastructure defect affects Code Queue control-plane availability, the supervisor may apply the smallest controlled deploy needed to restore the queue, then verify the original task can continue.
+- If retry, cancel, move, or scheduler behavior is wrong, do not patch PostgreSQL manually as the final fix. Fix the code path, deploy the fix if needed, then recover the affected task through the normal API.
+
+Manual intervention must preserve the original task identity whenever that helps continuity. Creating duplicate replacement tasks is a fallback, not the default.
+
+## Completion Criteria
+
+A Code Queue task is not complete merely because it pushed code.
+
+For CI/CD delivery tasks, acceptance must include the evidence required by the target delivery policy. For user-service artifact delivery this means:
+
+- the CI artifact producer ran from a pushed commit;
+- the artifact reference and digest are recorded;
+- the dev environment consumed the same artifact;
+- production CD consumed the artifact without source rebuild;
+- live health and live commit or image label evidence match the requested commit.
+
+For infrastructure tasks, acceptance must prove that the original blocked workflow can proceed, or must state the remaining deployment step needed for the live system to consume the fix.
+
+Completed but unread tasks are still supervisor work. They must be read, classified, and either accepted, retried, or turned into a new bounded follow-up task.
+
+## Infrastructure Defect Handling
+
+Infrastructure defects discovered during a delivery program should be split from user-service work when the split improves throughput or reduces confusion.
+
+Examples of infrastructure defects include:
+
+- a retry API that leaves stale cancellation state;
+- a healthcheck that no longer matches the runtime image;
+- CLI observability that cannot show running, recently completed, or unread terminal tasks;
+- a proxy path that differs between WebUI and CLI;
+- a deploy job that reports failure even though the service API is healthy.
+
+These defects should be assigned to infrastructure queues with prompts that include the concrete observed failure, the expected long-term contract, and the recovery action required for the original delivery task.
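Editorial sketch (not part of the patch): the user-service acceptance list under Completion Criteria can be read as a checklist over collected evidence. The `DeliveryEvidence` type and `evidenceGaps` function below are hypothetical; the policy does not define a data model, and how each field gets populated (CLI output, CD logs, live health endpoints) is left open. The sketch also cannot verify that a commit was actually pushed; it only compares recorded values.

```typescript
// Hypothetical record of the evidence a supervisor has collected for one
// user-service delivery. All field names are illustrative assumptions.
interface DeliveryEvidence {
  requestedCommit: string;         // commit the delivery was asked to ship
  ciSourceCommit?: string;         // commit the CI artifact producer ran from
  imageRef?: string;               // recorded artifact reference
  digest?: string;                 // recorded artifact digest
  devDigest?: string;              // digest the dev environment consumed
  prodDigest?: string;             // digest production CD consumed
  prodRebuiltFromSource?: boolean; // must be explicitly false to pass
  liveHealthy?: boolean;           // live health check result
  liveCommit?: string;             // live commit or image label
}

// Return a human-readable gap for every acceptance item that is not yet
// proven. An empty array means the evidence list is complete.
function evidenceGaps(e: DeliveryEvidence): string[] {
  const gaps: string[] = [];
  if (e.ciSourceCommit !== e.requestedCommit) {
    gaps.push("CI artifact was not produced from the requested commit");
  }
  if (!e.imageRef || !e.digest) {
    gaps.push("artifact reference or digest is not recorded");
  }
  if (!e.digest || e.devDigest !== e.digest) {
    gaps.push("dev environment did not consume the same artifact");
  }
  if (!e.digest || e.prodDigest !== e.digest || e.prodRebuiltFromSource !== false) {
    gaps.push("production CD did not consume the artifact without a source rebuild");
  }
  if (e.liveHealthy !== true || e.liveCommit !== e.requestedCommit) {
    gaps.push("live health or live commit/image label does not match the requested commit");
  }
  return gaps;
}
```

A task that has only pushed code yields every gap, which matches the rule that pushing alone is not completion. The returned gaps could seed the focused continuation prompt used when retrying the same task.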
+
+## Supervisor Boundaries
+
+The supervisor may:
+
+- read task, queue, health, job, and service status;
+- submit, retry, interrupt, or cancel through the normal Code Queue and microservice proxy APIs;
+- create self-contained follow-up tasks;
+- run controlled production deploys for infrastructure recovery when the user has allowed production repair and the deploy path is already established;
+- use a clean detached worktree for documentation or controlled deployment actions when the main worktree has unrelated parallel changes.
+
+The supervisor must not:
+
+- redo a worker's assigned implementation locally unless the user explicitly asks for manual takeover;
+- run full check, full e2e, or Playwright on the master server when those checks are known to risk OOM;
+- revert unrelated dirty worktree changes;
+- treat local deployment state as source of truth when Git remote is the required truth;
+- mark a delivery complete without acceptance evidence.
+
+## Documentation Feedback Loop
+
+Every repeated or delivery-blocking failure should feed back into one of:
+
+- a Code Queue task that fixes the defect;
+- a GitHub issue or issue comment that records the blocking condition and recovery dependency;
+- a long-term reference document when the lesson is durable.
+
+Reference docs should capture the reusable rule, not the full incident timeline. Process knowledge should reduce future supervision cost rather than become another one-off log.
diff --git a/docs/reference/repo-tree.md b/docs/reference/repo-tree.md
index ddddd993..368747e7 100644
--- a/docs/reference/repo-tree.md
+++ b/docs/reference/repo-tree.md
@@ -28,6 +28,7 @@
  - repo-tree.md (This repository structure reference)
  - cli.md (CLI command model and async job contract)
  - config.md (Config and runtime rules)
+ - code-queue-supervision.md (Supervisor policy for parallel Code Queue delivery programs)
  - deployment.md (Docker stack deployment and health criteria)
  - frontend.md (Frontend layout and design rules)
  - provider-gateway.md (Provider connection and host SSH maintenance bridge)
diff --git a/docs/reference/user-service-delivery.md b/docs/reference/user-service-delivery.md
index 47362240..d288d2eb 100644
--- a/docs/reference/user-service-delivery.md
+++ b/docs/reference/user-service-delivery.md
@@ -36,6 +36,7 @@ The default release flow for a user-service change is:
 - The standard CI artifact producer is `bun scripts/cli.ts ci publish-user-service --service --commit `. It accepts only a pushed Git commit and a registered service id, and reports `serviceId`, `sourceCommit`, `sourceRepo`, `dockerfile`, `imageRef`, `tag`, `digest` and `digestRef`.
 - The CI artifact producer is not a deploy executor. It must not mutate the production namespace, restart production services, or update `deploy.json`.
 - Every production release must finish with a manual acceptance step after the automated checks pass.
+- Multi-service delivery programs may use Code Queue parallelization, but the supervisor must follow `docs/reference/code-queue-supervision.md`: tasks need self-contained prompts, isolated worktrees, bounded queue concurrency, explicit acceptance evidence, and infrastructure defects split into separate follow-up tasks when they block several lanes.
 
 ## Frontend Pairing