Codex 8f316db317 docs: clarify code queue supervisor responsibilities

2026-05-20 02:17:55 +00:00

Code Queue Supervision Policy

This document defines the long-term operating model for using Code Queue as parallel delivery infrastructure under a human or lead-agent supervisor. It is a coordination policy, not a replacement for docs/reference/microservices.md, docs/reference/observability.md, docs/reference/user-service-delivery.md, or the Code Queue runtime contracts.

Scope

Use this policy when a delivery goal is too large for one Code Queue task and must be split across multiple queues, services, or infrastructure lanes.

This policy applies to:

user-service CI/CD rollouts;
multi-service fixes that need several isolated worktrees;
infrastructure defects found while user-service work is running;
follow-up validation, retry, and acceptance coordination;
manual supervisor work that keeps Code Queue tasks moving without taking over their implementation.

It does not authorize bypassing the normal deployment, Git, or production safety rules.

Operating Principle

The supervisor owns the outcome. Code Queue tasks own bounded execution.

The supervisor should keep a live map of the delivery goal, active queues, blockers, evidence gaps, and next recovery action. Code Queue workers should receive self-contained prompts with enough context to execute without relying on GitHub issue visibility or chat history.

The goal of central supervision is to increase delivery throughput and availability without turning every defect into a manual one-off fix.

The supervisor's end goal is to keep the broader development wave moving: track queue progress, correct course early, review completed work for quality, and schedule the next round of tasks so the delivery program keeps compounding instead of stalling after one batch finishes.

Task Design

Every Code Queue task must have a narrow ownership boundary.

Assign one service, module, or infrastructure defect per task when possible.
Give each task its own detached worktree under the shared workspace.
State the write scope, validation scope, commit/push requirement, and forbidden actions directly in the prompt.
Include the relevant background in the prompt itself; issue links are supporting references, not required context.
Prefer existing queues and create new queues only when the existing lanes cannot express the ownership boundary.
Keep queue concurrency bounded by real execution capacity. A target around five concurrent lanes is the normal operating point; the supervisor should push toward ten concurrent lanes when the active tasks have distinct write scopes, heartbeat/trace evidence is healthy, and the observed success rate stays acceptable. If success quality starts to slip, back off before expanding further.
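
The concurrency rule above can be sketched as a small decision function. This is only an illustration: the `LaneSignals` shape, the function name, the one-lane step size, and the 0.8 success threshold are assumptions, since the policy says only that the success rate must stay acceptable.

```typescript
// Hypothetical inputs; none of these names come from the Code Queue API.
interface LaneSignals {
  distinctWriteScopes: boolean; // active tasks do not overlap in write scope
  heartbeatsHealthy: boolean;   // heartbeat/trace evidence is fresh
  successRate: number;          // observed fraction of accepted tasks, 0..1
}

const NORMAL_LANES = 5;
const MAX_LANES = 10;
// Assumed threshold; the policy does not define a number.
const ACCEPTABLE_SUCCESS_RATE = 0.8;

function nextLaneTarget(current: number, s: LaneSignals): number {
  if (s.successRate < ACCEPTABLE_SUCCESS_RATE) {
    // Quality is slipping: back off one lane, never below a single lane.
    return Math.max(1, current - 1);
  }
  if (s.distinctWriteScopes && s.heartbeatsHealthy) {
    // Move to at least the normal point, then grow one lane per decision,
    // capped at MAX_LANES.
    return Math.min(MAX_LANES, Math.max(NORMAL_LANES, current + 1));
  }
  // Conditions for expansion are not met: do not exceed the normal point.
  return Math.min(current, NORMAL_LANES);
}
```

Stepping one lane at a time keeps each expansion observable, so a drop in quality can be attributed to the most recent change.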

Prompts for production-adjacent work must explicitly forbid heavy local checks on the master server when those checks are known to risk OOM, and must tell the worker which validation belongs in D601 CI, dev env, or a target service container.
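
One way to make these prompt requirements hard to forget is to render prompts from a structured spec. The sketch below is an assumption-laden illustration: `TaskPromptSpec`, `renderPrompt`, and the output layout are hypothetical and not part of Code Queue.

```typescript
// Hypothetical prompt spec; field names are illustrative only.
interface TaskPromptSpec {
  goal: string;
  background: string;          // self-contained context, not just issue links
  writeScope: string[];
  validationScope: string[];
  requireCommitAndPush: boolean;
  forbiddenActions: string[];
  productionAdjacent: boolean;
  // Where heavy validation should run instead, e.g. "D601 CI".
  heavyValidationTarget?: string;
}

function renderPrompt(spec: TaskPromptSpec): string {
  if (spec.writeScope.length === 0) {
    throw new Error("prompt must state a write scope");
  }
  const forbidden = [...spec.forbiddenActions];
  const sections: string[] = [
    `Goal: ${spec.goal}`,
    `Background: ${spec.background}`,
    `Write scope: ${spec.writeScope.join(", ")}`,
    `Validation scope: ${spec.validationScope.join(", ")}`,
    `Commit and push required: ${spec.requireCommitAndPush ? "yes" : "no"}`,
  ];
  if (spec.productionAdjacent) {
    if (!spec.heavyValidationTarget) {
      throw new Error("production-adjacent prompts must name where heavy validation runs");
    }
    // Always add the OOM guard for production-adjacent work.
    forbidden.push("running heavy local checks on the master server (OOM risk)");
    sections.push(`Heavy validation runs in: ${spec.heavyValidationTarget}`);
  }
  sections.push(`Forbidden: ${forbidden.join("; ")}`);
  return sections.join("\n");
}
```

Failing fast on a missing write scope or validation target means an incomplete prompt is caught before it reaches a worker.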

When one supervisor machine is creating many Code Queue tasks in a burst, submit calls should default to serial or near-serial behavior. A short local lock or delay is acceptable if it prevents the control plane from being flooded faster than it can acknowledge tasks, especially on low-memory hosts. The goal is to keep task creation observable and stable, not to maximize raw enqueue throughput.
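
A minimal sketch of serial submission with a fixed gap follows. The `submit` callback stands in for whatever actually creates a Code Queue task; its signature, the function name, and the gap value are assumptions for illustration.

```typescript
// Hypothetical callback types; the real submit call is not specified here.
type Submit<T> = (prompt: string) => Promise<T>;
type Sleep = (ms: number) => Promise<void>;

const realSleep: Sleep = (ms) =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

// Submits prompts one at a time: each call waits for the previous
// acknowledgement, then pauses for gapMs before the next submit.
async function submitSerially<T>(
  prompts: string[],
  submit: Submit<T>,
  gapMs: number,
  sleep: Sleep = realSleep,
): Promise<T[]> {
  const acks: T[] = [];
  let first = true;
  for (const prompt of prompts) {
    if (!first) {
      await sleep(gapMs);
    }
    first = false;
    acks.push(await submit(prompt));
  }
  return acks;
}
```

Because every submit is awaited before the next begins, at most one creation request is in flight, which trades raw enqueue throughput for stable, observable task creation.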

Monitoring

The supervisor must monitor Code Queue with task-level and queue-level evidence, not with a single status field.

Use:

bun scripts/cli.ts codex queues for queue counts, active task ids, unread terminal tasks, and control-plane diagnostics.
bun scripts/cli.ts codex task <taskId> for attempt, last assistant message, last error, cancel flag, and current status.
bun scripts/cli.ts codex task <taskId> --trace --limit N or codex output only when the summary is insufficient.
The liveness rules in docs/reference/observability.md when master control-plane state and D601 scheduler state appear split.

A split-brain state in queue diagnostics signals divergence between the control plane and the execution plane; it is not automatic evidence that the work is dead. If the task heartbeats are fresh and the trace is still advancing, treat the task as live and keep supervising it rather than interrupting or replacing it. The queue summary should expose this as effectiveLiveness=live, splitBrainLive=true, and recommendedAction=continue-supervision; expired, missing, or stale-recovery heartbeat evidence should instead surface effectiveLiveness=at-risk.
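
The liveness rule can be sketched as a pure function over the evidence. The output field names and the `live`, `at-risk`, and `continue-supervision` values come from this policy; the input shape, the `inspect-task` action, and the handling of a fresh heartbeat with a stalled trace are assumptions.

```typescript
type HeartbeatState = "fresh" | "expired" | "missing" | "stale-recovery";

// Hypothetical input shape for illustration.
interface QueueEvidence {
  splitBrain: boolean;      // control-plane and scheduler state disagree
  heartbeat: HeartbeatState;
  traceAdvancing: boolean;
}

interface LivenessSummary {
  effectiveLiveness: "live" | "at-risk";
  splitBrainLive: boolean;
  // "inspect-task" is an assumed value; the policy names only
  // "continue-supervision".
  recommendedAction: "continue-supervision" | "inspect-task";
}

function summarizeLiveness(e: QueueEvidence): LivenessSummary {
  // Assumption: a fresh heartbeat with a stalled trace is treated as at-risk,
  // which means "look closer", not "interrupt".
  const live = e.heartbeat === "fresh" && e.traceAdvancing;
  return {
    effectiveLiveness: live ? "live" : "at-risk",
    splitBrainLive: live && e.splitBrain,
    recommendedAction: live ? "continue-supervision" : "inspect-task",
  };
}
```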

Long-running tasks with fresh trace or heartbeat evidence should normally be left alone. Polling every few minutes is preferred over repeated interrupt/retry cycles.

For broad CI/CD migration waves, use a fixed supervision cadence unless an incident demands faster action. A five-minute poll loop is the default: read codex queues, read terminal or suspect task summaries, then accept, retry, split a blocker, or leave healthy tasks alone. The loop should keep the supervisor doing useful non-overlapping work, such as documentation or issue triage, but that side work must not take over a worker's assigned implementation.

When a task leaves running or judging, treat the result as unread work until the final response and judge record have been inspected. Only then should the supervisor decide whether to refill the concurrency window.
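
The unread-work rule affects how free capacity is counted. In the sketch below, a terminal task keeps occupying its slot until both the final response and the judge record have been read. The `TaskSnapshot` shape and the `terminal` status label are hypothetical simplifications, not Code Queue's actual status values.

```typescript
// Default cadence from this policy: one poll every five minutes.
const POLL_INTERVAL_MS = 5 * 60 * 1000;

// Hypothetical snapshot; "terminal" stands for any status after judging.
interface TaskSnapshot {
  id: string;
  status: "running" | "judging" | "terminal";
  finalResponseRead: boolean;
  judgeRecordRead: boolean;
}

function isUnread(t: TaskSnapshot): boolean {
  return t.status === "terminal" && !(t.finalResponseRead && t.judgeRecordRead);
}

// Active tasks and unread terminal tasks both hold a slot.
function occupiesSlot(t: TaskSnapshot): boolean {
  return t.status !== "terminal" || isUnread(t);
}

function slotsToRefill(tasks: TaskSnapshot[], windowSize: number): number {
  const occupied = tasks.filter(occupiesSlot).length;
  return Math.max(0, windowSize - occupied);
}
```

Counting unread results against the window prevents the supervisor from queuing new work faster than finished work is reviewed.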

Supervisor Workflow

For each active task, evaluate four things in order:

completion quality: did it actually satisfy the task's acceptance boundary;
completion state: is it terminal, retryable, or still making progress;
self-blocking risk: is the task stuck on a problem it cannot solve alone;
next action: accept, continue, replace with a narrower task, or raise an infrastructure issue.

If the blocker is a reusable infrastructure problem, do not keep re-running the business task blindly. First record the infrastructure defect in an issue, then fix the infrastructure manually if Code Queue cannot move past it, and only then resume the delivery wave.

The supervisor should prefer read-only analysis and new narrowly scoped tasks over local implementation takeover. Manual work is reserved for infrastructure blockers, live recovery, and other cases where the queue cannot safely unblock itself.

Intervention Rules

Intervene only when there is a clear reason.

If a task is running and trace or scheduler heartbeat is fresh, guide rather than interrupt.
If a task reaches terminal state but lacks required acceptance evidence, retry the same task with a focused continuation prompt.
If a task is blocked by a reusable infrastructure defect, assign that defect to an appropriate empty or low-risk queue, and keep the original business task waiting or retry it after the fix.
If an infrastructure defect affects Code Queue control-plane availability, the supervisor may apply the smallest controlled deploy needed to restore the queue, then verify the original task can continue.
If retry, cancel, move, or scheduler behavior is wrong, do not patch PostgreSQL manually as the final fix. Fix the code path, deploy the fix if needed, then recover the affected task through the normal API.

Manual intervention must preserve the original task identity whenever that helps continuity. Creating duplicate replacement tasks is a fallback, not the default.
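
The intervention rules can be read as a lookup from an observed condition to the least invasive response. The sketch below is one possible encoding: the type names, the action labels, and the `inspect-liveness` case for a running task with stale evidence (which the rules above do not cover) are assumptions.

```typescript
// Hypothetical observation shape for illustration.
interface TaskObservation {
  state: "running" | "terminal";
  evidenceFresh: boolean;          // trace or scheduler heartbeat is fresh
  hasAcceptanceEvidence: boolean;
  blocker:
    | "none"
    | "reusable-infra-defect"
    | "control-plane-outage"
    | "queue-behavior-bug";        // retry, cancel, move, or scheduler is wrong
}

type Intervention =
  | "none"
  | "guide-only"
  | "inspect-liveness"
  | "retry-same-task-with-continuation-prompt"
  | "assign-defect-to-infra-queue"
  | "smallest-controlled-deploy-then-verify"
  | "fix-code-path-then-recover-via-api";

function chooseIntervention(o: TaskObservation): Intervention {
  switch (o.blocker) {
    case "control-plane-outage":
      return "smallest-controlled-deploy-then-verify";
    case "queue-behavior-bug":
      // Never patch the database by hand as the final fix.
      return "fix-code-path-then-recover-via-api";
    case "reusable-infra-defect":
      return "assign-defect-to-infra-queue";
    case "none":
      break;
  }
  if (o.state === "running") {
    // Fresh evidence: guidance is allowed, interruption is not.
    return o.evidenceFresh ? "guide-only" : "inspect-liveness";
  }
  // Terminal task: retry the same task identity if evidence is missing.
  return o.hasAcceptanceEvidence
    ? "none"
    : "retry-same-task-with-continuation-prompt";
}
```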

Completion Criteria

A Code Queue task is not complete merely because it pushed code.

For CI/CD delivery tasks, acceptance must include the evidence required by the target delivery policy. For user-service artifact delivery this means:

the CI artifact producer ran from a pushed commit;
the artifact reference and digest are recorded;
the dev environment consumed the same artifact;
production CD consumed the artifact without source rebuild;
live health and live commit or image label evidence match the requested commit.
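
The evidence list above lends itself to a checklist function that reports what is still missing. The record shape and field names below are hypothetical; using the artifact digest as the test for "the same artifact" is also an assumption of this sketch.

```typescript
// Hypothetical evidence record; field names are illustrative only.
interface ArtifactDeliveryEvidence {
  requestedCommit: string;
  ciCommit?: string;               // pushed commit the CI producer ran from
  artifactRef?: string;
  artifactDigest?: string;
  devConsumedDigest?: string;
  prodConsumedDigest?: string;
  prodRebuiltFromSource?: boolean;
  liveHealthy?: boolean;
  liveCommit?: string;             // live commit or image label
}

// Returns the acceptance items that are not yet proven; empty means complete.
function missingEvidence(e: ArtifactDeliveryEvidence): string[] {
  const missing: string[] = [];
  if (!e.ciCommit) {
    missing.push("CI artifact producer ran from a pushed commit");
  }
  if (!e.artifactRef || !e.artifactDigest) {
    missing.push("artifact reference and digest recorded");
  }
  if (!e.artifactDigest || e.devConsumedDigest !== e.artifactDigest) {
    missing.push("dev environment consumed the same artifact");
  }
  if (
    !e.artifactDigest ||
    e.prodConsumedDigest !== e.artifactDigest ||
    e.prodRebuiltFromSource !== false
  ) {
    missing.push("production CD consumed the artifact without source rebuild");
  }
  if (e.liveHealthy !== true || e.liveCommit !== e.requestedCommit) {
    missing.push("live health and live commit match the requested commit");
  }
  return missing;
}
```

Unknown values count as missing, so a task that merely pushed code fails every check rather than passing by omission.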

For infrastructure tasks, acceptance must prove that the original blocked workflow can proceed, or must state the remaining deployment step needed for the live system to consume the fix.

Completed but unread tasks are still supervisor work. They must be read, classified, and either accepted, retried, or turned into a new bounded follow-up task.

Infrastructure Defect Handling

Infrastructure defects discovered during a delivery program should be split from user-service work when the split improves throughput or reduces confusion.

Examples of infrastructure defects include:

a retry API that leaves stale cancellation state;
a healthcheck that no longer matches the runtime image;
CLI observability that cannot show running, recently completed, or unread terminal tasks;
a proxy path that differs between WebUI and CLI;
a deploy job that reports failure even though the service API is healthy;
a supervisor-side submit burst that can saturate the Code Queue manager or low-memory host before the queue has a chance to acknowledge tasks;
a Code Queue container missing a basic operator tool or credential path needed for supervision, such as gh, hub, or a GitHub token injection path.

These defects should be assigned to infrastructure queues with prompts that include the concrete observed failure, the expected long-term contract, and the recovery action required for the original delivery task.

When the defect is only in the Code Queue execution environment and the service can be safely patched live in dev without touching prod, prefer the smallest temporary live remedy first. Then persist the fix in the relevant Dockerfile, container image, or credential propagation path, and verify the persistent fix in dev before considering the issue closed.

If a business task discovers this kind of missing tool or missing credential path, the supervisor should split it into a dedicated infra task rather than leaving it buried in the business task prompt. When possible, the business task should continue with the temporary live remedy in place as a bridge until the persistent fix lands.

Supervisor Boundaries

The supervisor may:

read task, queue, health, job, and service status;
submit, retry, interrupt, or cancel through the normal Code Queue and microservice proxy APIs;
create self-contained follow-up tasks;
run controlled production deploys for infrastructure recovery when the user has allowed production repair and the deploy path is already established;
use a clean detached worktree for documentation or controlled deployment actions when the main worktree has unrelated parallel changes.

The supervisor must not:

redo a worker's assigned implementation locally unless the user explicitly asks for manual takeover;
run full check, full e2e, or Playwright on the master server when those checks are known to risk OOM;
revert unrelated dirty worktree changes;
treat local deployment state as source of truth when Git remote is the required truth;
mark a delivery complete without acceptance evidence.

Documentation Feedback Loop

Every repeated or delivery-blocking failure should feed back into one of:

a Code Queue task that fixes the defect;
a GitHub issue or issue comment that records the blocking condition and recovery dependency;
a long-term reference document when the lesson is durable.

Reference docs should capture the reusable rule, not the full incident timeline. Process knowledge should reduce future supervision cost rather than become another one-off log.
