Codex fced0520fe fix: serialize codex submit bursts

2026-05-19 23:55:31 +00:00

Code Queue Supervision Policy

This document defines the long-term operating model for using Code Queue as parallel delivery infrastructure under a human or lead-agent supervisor. It is a coordination policy, not a replacement for docs/reference/microservices.md, docs/reference/observability.md, docs/reference/user-service-delivery.md, or the Code Queue runtime contracts.

Scope

Use this policy when a delivery goal is too large for one Code Queue task and must be split across multiple queues, services, or infrastructure lanes.

This policy applies to:

user-service CI/CD rollouts;
multi-service fixes that need several isolated worktrees;
infrastructure defects found while user-service work is running;
follow-up validation, retry, and acceptance coordination;
manual supervisor work that keeps Code Queue tasks moving without taking over their implementation.

It does not authorize bypassing the normal deployment, Git, or production safety rules.

Operating Principle

The supervisor owns the outcome. Code Queue tasks own bounded execution.

The supervisor should keep a live map of the delivery goal, active queues, blockers, evidence gaps, and next recovery action. Code Queue workers should receive self-contained prompts with enough context to execute without relying on GitHub issue visibility or chat history.

The goal of central supervision is to increase delivery throughput and availability without turning every defect into a manual one-off fix.

Task Design

Every Code Queue task must have a narrow ownership boundary.

Assign one service, module, or infrastructure defect per task when possible.
Give each task its own detached worktree under the shared workspace.
State the write scope, validation scope, commit/push requirement, and forbidden actions directly in the prompt.
Include the relevant background in the prompt itself; issue links are supporting references, not required context.
Prefer existing queues and create new queues only when the existing lanes cannot express the ownership boundary.
Keep queue concurrency bounded by real execution capacity. A target of around five concurrent lanes is useful only when the active tasks have distinct write scopes and the execution plane is healthy.

Prompts for production-adjacent work must explicitly forbid heavy local checks on the master server when those checks are known to risk OOM, and must tell the worker which validation belongs in D601 CI, the dev environment, or a target service container.
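The prompt requirements above can be sketched as a small builder that refuses to render a prompt with an empty required section. This is only an illustration: the `TaskPromptSpec` shape, its field names, and both function names are hypothetical and are not part of the Code Queue runtime contracts.

```typescript
// Hypothetical prompt shape; every name here is illustrative only.
interface TaskPromptSpec {
  ownership: string;             // one service, module, or infrastructure defect
  worktree: string;              // detached worktree under the shared workspace
  background: string;            // context inlined so the worker needs no issue access
  writeScope: string[];
  validationScope: string[];     // where each check belongs, e.g. D601 CI or dev env
  commitPushRequirement: string;
  forbiddenActions: string[];    // e.g. heavy local checks on the master server
  references?: string[];         // issue links: supporting references only
}

// Returns the names of required sections that are empty.
function missingPromptFields(spec: TaskPromptSpec): string[] {
  const missing: string[] = [];
  if (!spec.ownership.trim()) missing.push("ownership");
  if (!spec.worktree.trim()) missing.push("worktree");
  if (!spec.background.trim()) missing.push("background");
  if (spec.writeScope.length === 0) missing.push("writeScope");
  if (spec.validationScope.length === 0) missing.push("validationScope");
  if (!spec.commitPushRequirement.trim()) missing.push("commitPushRequirement");
  if (spec.forbiddenActions.length === 0) missing.push("forbiddenActions");
  return missing;
}

// Renders a self-contained prompt, or throws if a required section is empty.
function renderPrompt(spec: TaskPromptSpec): string {
  const missing = missingPromptFields(spec);
  if (missing.length > 0) {
    throw new Error(`prompt is not self-contained, missing: ${missing.join(", ")}`);
  }
  const bullets = (items: string[]) => items.map((item) => `- ${item}`).join("\n");
  const sections = [
    `Ownership: ${spec.ownership}`,
    `Worktree: ${spec.worktree}`,
    `Background:\n${spec.background}`,
    `Write scope:\n${bullets(spec.writeScope)}`,
    `Validation scope:\n${bullets(spec.validationScope)}`,
    `Commit/push requirement: ${spec.commitPushRequirement}`,
    `Forbidden actions:\n${bullets(spec.forbiddenActions)}`,
  ];
  if (spec.references && spec.references.length > 0) {
    sections.push(`References (supporting only):\n${bullets(spec.references)}`);
  }
  return sections.join("\n\n");
}
```

Treating references as optional while the background is mandatory mirrors the rule that issue links support the prompt but do not carry its context.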

When one supervisor machine is creating many Code Queue tasks in a burst, submit calls should default to serial or near-serial behavior. A short local lock or delay is acceptable if it prevents the control plane from being flooded faster than it can acknowledge tasks, especially on low-memory hosts. The goal is to keep task creation observable and stable, not to maximize raw enqueue throughput.
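One way to get serial submit behavior inside a single supervisor process is to chain each submit call behind the previous one and hold a short gap between them. The sketch below assumes the real submit call is wrapped in an async function passed in by the caller. `createSerialSubmitter` and `minGapMs` are hypothetical names, and the sketch does not coordinate across separate processes, which would need something like a file lock.

```typescript
// Sketch: run async submit calls one at a time, with a minimum gap between them.
type Submit<T, R> = (task: T) => Promise<R>;

function createSerialSubmitter<T, R>(submit: Submit<T, R>, minGapMs = 0): Submit<T, R> {
  // Tail of the chain: every new call waits for the previous call plus its gap.
  let tail: Promise<unknown> = Promise.resolve();
  const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

  return (task: T): Promise<R> => {
    const run = tail.then(() => submit(task));
    // Keep the chain alive whether this submit succeeds or fails, then hold the gap.
    tail = run.then(
      () => sleep(minGapMs),
      () => sleep(minGapMs),
    );
    // The caller still sees the real result or error of its own submit.
    return run;
  };
}
```

A failed submit does not block later ones, so one rejected task cannot stall the rest of the burst. The gap trades enqueue throughput for a control plane that can acknowledge each task before the next one arrives.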

Monitoring

The supervisor must monitor Code Queue with task-level and queue-level evidence, not with a single status field.

Use:

bun scripts/cli.ts codex queues for queue counts, active task ids, unread terminal tasks, and control-plane diagnostics.
bun scripts/cli.ts codex task <taskId> for attempt, last assistant message, last error, cancel flag, and current status.
bun scripts/cli.ts codex task <taskId> --trace --limit N, or codex output, only when the task summary is insufficient.
The liveness rules in docs/reference/observability.md when master control-plane state and D601 scheduler state appear split.

Long-running tasks with fresh trace or heartbeat evidence should normally be left alone. Polling every few minutes is preferred over repeated interrupt/retry cycles.

Intervention Rules

Intervene only when there is a clear reason.

If a task is running and trace or scheduler heartbeat is fresh, guide rather than interrupt.
If a task reaches terminal state but lacks required acceptance evidence, retry the same task with a focused continuation prompt.
If a task is blocked by a reusable infrastructure defect, assign that defect to an appropriate empty or low-risk queue, and either keep the original business task waiting or retry it after the fix.
If an infrastructure defect affects Code Queue control-plane availability, the supervisor may apply the smallest controlled deploy needed to restore the queue, then verify the original task can continue.
If retry, cancel, move, or scheduler behavior is wrong, do not patch PostgreSQL manually as the final fix. Fix the code path, deploy the fix if needed, then recover the affected task through the normal API.

Manual intervention must preserve the original task identity whenever that helps continuity. Creating duplicate replacement tasks is a fallback, not the default.
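The intervention rules can be read as a small decision table. The sketch below is one possible encoding, not an existing Code Queue API: the `TaskObservation` fields and action names are hypothetical, and the order in which the rules are checked is an assumption, since the policy does not rank them.

```typescript
// Hypothetical summary of what the supervisor has observed about one task.
interface TaskObservation {
  status: "running" | "terminal";
  evidenceFresh: boolean;             // fresh trace or scheduler heartbeat
  hasAcceptanceEvidence: boolean;
  blockedByInfraDefect: boolean;      // reusable infrastructure defect
  defectAffectsControlPlane: boolean; // defect hits Code Queue availability
  queueBehaviorWrong: boolean;        // retry, cancel, move, or scheduler misbehaves
}

type Intervention =
  | "fix-code-path-then-recover-via-api" // never patch PostgreSQL as the final fix
  | "smallest-controlled-deploy"
  | "assign-defect-to-infra-queue"
  | "guide-without-interrupting"
  | "retry-same-task"                    // keeps the original task identity
  | "no-rule-matched";                   // the supervisor decides case by case

// Assumed priority: broken queue behavior, then infra defects, then task state.
function decideIntervention(t: TaskObservation): Intervention {
  if (t.queueBehaviorWrong) return "fix-code-path-then-recover-via-api";
  if (t.blockedByInfraDefect) {
    return t.defectAffectsControlPlane
      ? "smallest-controlled-deploy"
      : "assign-defect-to-infra-queue";
  }
  if (t.status === "running" && t.evidenceFresh) return "guide-without-interrupting";
  if (t.status === "terminal" && !t.hasAcceptanceEvidence) return "retry-same-task";
  return "no-rule-matched";
}
```

None of the branches creates a replacement task, which reflects the rule that duplicate tasks are a fallback rather than the default.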

Completion Criteria

A Code Queue task is not complete merely because it pushed code.

For CI/CD delivery tasks, acceptance must include the evidence required by the target delivery policy. For user-service artifact delivery this means:

the CI artifact producer ran from a pushed commit;
the artifact reference and digest are recorded;
the dev environment consumed the same artifact;
production CD consumed the artifact without source rebuild;
live health and live commit or image label evidence match the requested commit.
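The checklist above can be expressed as a function that reports which pieces of evidence are still missing. This is a minimal sketch under stated assumptions: the `ArtifactDeliveryEvidence` fields are hypothetical names, and "the same artifact" is interpreted here as a matching digest.

```typescript
// Hypothetical evidence record for one user-service artifact delivery.
interface ArtifactDeliveryEvidence {
  requestedCommit: string;
  ciProducerCommit?: string;       // pushed commit the CI artifact producer ran from
  artifactRef?: string;
  artifactDigest?: string;
  devConsumedDigest?: string;      // digest the dev environment consumed
  prodConsumedDigest?: string;     // digest production CD consumed
  prodRebuiltFromSource?: boolean; // must be explicitly false
  liveHealthy?: boolean;
  liveCommit?: string;             // live commit or image label
}

// Returns one entry per unmet criterion; an empty list means acceptable.
function acceptanceGaps(e: ArtifactDeliveryEvidence): string[] {
  const gaps: string[] = [];
  if (!e.ciProducerCommit) gaps.push("CI artifact producer run from a pushed commit");
  if (!e.artifactRef || !e.artifactDigest) gaps.push("artifact reference and digest");
  if (!e.artifactDigest || e.devConsumedDigest !== e.artifactDigest) {
    gaps.push("dev environment consumed the same artifact");
  }
  if (
    !e.artifactDigest ||
    e.prodConsumedDigest !== e.artifactDigest ||
    e.prodRebuiltFromSource !== false
  ) {
    gaps.push("production CD consumed the artifact without source rebuild");
  }
  if (e.liveHealthy !== true || e.liveCommit !== e.requestedCommit) {
    gaps.push("live health and live commit match the requested commit");
  }
  return gaps;
}
```

A task that pushed code but recorded nothing else reports every gap, which makes the difference between pushed and accepted explicit.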

For infrastructure tasks, acceptance must prove that the original blocked workflow can proceed, or must state the remaining deployment step needed for the live system to consume the fix.

Completed but unread tasks are still supervisor work. They must be read, classified, and either accepted, retried, or turned into a new bounded follow-up task.

Infrastructure Defect Handling

Infrastructure defects discovered during a delivery program should be split from user-service work when the split improves throughput or reduces confusion.

Examples of infrastructure defects include:

a retry API that leaves stale cancellation state;
a healthcheck that no longer matches the runtime image;
CLI observability that cannot show running, recently completed, or unread terminal tasks;
a proxy path that differs between WebUI and CLI;
a deploy job that reports failure even though the service API is healthy;
a supervisor-side submit burst that can saturate the Code Queue manager or low-memory host before the queue has a chance to acknowledge tasks.

These defects should be assigned to infrastructure queues with prompts that include the concrete observed failure, the expected long-term contract, and the recovery action required for the original delivery task.

Supervisor Boundaries

The supervisor may:

read task, queue, health, job, and service status;
submit, retry, interrupt, or cancel through the normal Code Queue and microservice proxy APIs;
create self-contained follow-up tasks;
run controlled production deploys for infrastructure recovery when the user has allowed production repair and the deploy path is already established;
use a clean detached worktree for documentation or controlled deployment actions when the main worktree has unrelated parallel changes.

The supervisor must not:

redo a worker's assigned implementation locally unless the user explicitly asks for manual takeover;
run full check, full e2e, or Playwright on the master server when those checks are known to risk OOM;
revert unrelated dirty worktree changes;
treat local deployment state as the source of truth when the Git remote is the required source of truth;
mark a delivery complete without acceptance evidence.

Documentation Feedback Loop

Every repeated or delivery-blocking failure should feed back into one of:

a Code Queue task that fixes the defect;
a GitHub issue or issue comment that records the blocking condition and recovery dependency;
a long-term reference document when the lesson is durable.

Reference docs should capture the reusable rule, not the full incident timeline. Process knowledge should reduce future supervision cost rather than become another one-off log.
