fix: name commander active runner count

Codex
2026-05-23 01:01:53 +00:00
parent 031c3fda39
commit 982f21ec62
6 changed files with 81 additions and 3 deletions
+2 -2
@@ -47,7 +47,7 @@ CLI 可以从 `master` 快速演进,但必须兼容 `deploy.json` 固定的 CI
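- `codex submit [prompt] [--prompt-file path|--prompt-stdin] [--queue queueId] [--provider-id id] [--cwd path] [--model model] [--reasoning-effort effort] [--execution-mode mode] [--max-attempts N] [--reference-task-id id] [--dry-run]` submits a task to the stable `code-queue` user-service path through the backend-core private proxy. The prompt must come from exactly one of the positional argument, a file, or stdin; `--dry-run` only returns the structured request and does not enqueue anything. Long prompts, multi-line prompts, and prompts containing quotes, backticks, Markdown tables, JSON, or backslashes must prefer `--prompt-stdin` or `--prompt-file` rather than being spliced into a single shell argument; the positional argument is only suitable for short single-line smoke prompts. For stdin, a quoted heredoc is recommended: `cat <<'PROMPT' | bun scripts/cli.ts codex submit --prompt-stdin --queue <id> --dry-run`; for a file path, `bun scripts/cli.ts codex submit --prompt-file /tmp/code-queue-prompt.md --queue <id> --dry-run` is recommended. After confirming the dry run, remove `--dry-run` to submit the same payload. The dry run additionally outputs `routingRecommendation`, which contains the recommended route, runner, model, risk signals, and guard states such as prompt self-containment, issue-not-the-sole-source, prod-secret-DB prohibition, runtime-or-release prohibition, evidence requirements, and medium-complexity candidacy; it also outputs `policyContract`, which always exposes the risk tiers, concurrency caps, and external-provider 429 backoff handling for GPT-5.5, DeepSeek, and MiniMax. This recommendation is for commander preflight only: it does not rewrite the payload, does not change runtime admission, and does not assume that production MiniMax or DeepSeek is available. `--dry-run` must return the full prompt, its character count, and `truncated=false` for manual acceptance. A real submission is a write operation: by default it returns only `accepted=true`, the task id, the queue, a write-protection summary, and follow-up inspection commands; it must set `promptOmitted=true` and must not echo the prompt or promptPreview. Real submissions pass through local serialization and a short throttle on the submitting machine, so that concurrent submits from one commander do not destabilize a low-memory host or the `code-queue-mgr` control plane; the return value carries a low-noise `submitConcurrencyGuard` describing the lock and wait information for this submission. By default, backend-core routes submission, queue CRUD, read status, history summaries, and lightweight Trace reads to `code-queue-mgr` on the main server, which writes to the main PostgreSQL; the D601 scheduler only polls and executes tasks already stored in the database.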
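- `codex pr-preflight [--remote] [--push-dry-run --push-dry-run-ref refs/heads/probe/<name>] [--pr-create-dry-run --pr-create-dry-run-head <head>] [--issue N] [--full]` requests `/api/runtime-preflight` on the D601 scheduler through the stable `code-queue` proxy and is used for admission of PR-type dispatches. The output gives a compressed view of scheduler/runner token coverage, Auth Broker source/capability/nextAction, tools, agent port, Git worktree, GitHub egress, read-only repo/issue/PR probes, the optional push dry-run, and the optional PR body/create dry-run guard; it only reports whether `GH_TOKEN`/`GITHUB_TOKEN` exist and their source key, never the values. When an auth-broker configuration exists, `tokenCoverage.source="auth-broker"`, `credentialSource="broker-issued-token"`, and a runner env token is not a precondition for success; when only an env token exists, `credentialSource="env-token"` and `preflight.authBroker.nextAction="use-env-token-until-auth-broker-live"`; when both are missing, the top level reports `ok=false`, `runnerDisposition=infra-blocked`, and `degradedReason=auth-broker-needed`, `tokenCoverage.missing` lists both `GH_TOKEN` and `GITHUB_TOKEN`, and the output includes `preflight.authBroker.source="broker/auth-broker-needed"` and `capability.source="missing-token"`. The scope of this `auth-missing` is `scheduler-runner-env`; it must not be simplified to "the current active runner/dev container cannot create PRs". The output must carry `scopeBoundary` and `activeRunnerDevContainer`, requiring the caller to separately run `bun scripts/cli.ts gh auth status --repo pikasTech/unidesk` and a PR dry-run to confirm the current dev container's capability. `preflight.prCapabilityContract` is the runner-facing contract summary and must include the target branch, the token/auth source, `systemGhBinaryRequiredForWrites=false`, the availability of the UniDesk REST `bun scripts/cli.ts gh`, `writesRemote=false` for the push dry-run/PR create dry-run, the expected PR handoff, the requirement of commander authorization for real PR creation, and the `gh pr merge` `unsupported-command` boundary; a missing system `gh` binary is reported only under `tools.systemGhBinary` and must not be misread as the UniDesk REST `gh` CLI being unavailable. In runner-like environments, `--remote` no longer depends on the local `unidesk-backend-core`, `unidesk-database`, and `baidu-netdisk-backend` containers existing; their absence is recorded only as local observation evidence. If the remote control plane is reachable, the remote control-plane result is used; if it is unreachable, the command returns a structured `failureKind=control-plane-missing` / `degradedReason=remote-control-plane-unreachable` instead of treating the local `backend-core-container-missing` as the final blocker. `--pr-create-dry-run` does not POST to GitHub; it only proves that PR body generation inside the runner, `scripts/cli.ts gh pr create --dry-run`, and the branch argument shape are usable. Server-side creation permission is still determined by the token/auth broker, repo/issue/PR read, the push dry-run, and the result of the real PR creation after final authorization.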
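- `codex task <taskId>` queries a structured review summary by task ID through the Code Queue private proxy. By default it returns only the task identity, execution Provider, working directory, attempt count, original prompt, final response, last error, and progressive-disclosure commands, which suits a commander reviewing completed unread tasks without blowing up the context. `--detail` is still a bounded detailed summary: by default it returns only a few attempt/tool rows, short prompt/response/stderr/feedback previews, and omitted/truncated metadata; when the full prompt/response text or more tool/attempt detail is needed, explicitly add `--full`, `--tool-limit N`, `--trace`, or `codex output`. This summary is served by default from PostgreSQL by `code-queue-mgr` on the main server and does not depend on the D601 `code-queue-read` Service being available.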
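- `codex tasks [--view supervisor|full] [--queue id] [--status succeeded|running|queued|failed|canceled|judging|retry_wait[,..]] [--unread|--unread-only] [--limit N] [--before-id id]` outputs a progressive-disclosure view through the same private proxy. The default `supervisor` view is a low-noise commander view that returns only compact rows for `running`, `completedUnread`, `recentCompleted`, `queued`, and `executionDiagnostics`; prompt/body fields carry only a short preview and the original character count. `running`/`completedUnread`/`queued` return only one very small bounded page by default and continue paging via the section's `commands.next`; `recentCompleted` is limited by default and does not repeat the unread terminal tasks in `completedUnread`; no full Trace, final response, or full overview is embedded. In supervisor, `--limit` is mainly a scan/paging budget, not a switch for returning dozens of fat rows; when more detailed rows for the current page are needed, explicitly use `--view full` or `--full`, still subject to `--limit` and `--before-id` paging.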
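- `codex tasks [--view supervisor|full] [--queue id] [--status succeeded|running|queued|failed|canceled|judging|retry_wait[,..]] [--unread|--unread-only] [--limit N] [--before-id id]` outputs a progressive-disclosure view through the same private proxy. The default `supervisor` view is a low-noise commander view that returns only compact rows for `running`, `completedUnread`, `recentCompleted`, `queued`, `activity`, `commanderConcurrency`, and `executionDiagnostics`; prompt/body fields carry only a short preview and the original character count. `running`/`completedUnread`/`queued` return only one bounded page by default and continue paging via the section's `commands.next`; `recentCompleted` is limited by default and does not repeat the unread terminal tasks in `completedUnread`; no full Trace, final response, or full overview is embedded. `commanderConcurrency.activeRunnerCount` is the active/running count that concurrency policy should use and equals `activity.effectiveActiveTaskCount`; under the 15-concurrency policy, the remaining window is `15 - activeRunnerCount`. `commanderConcurrency.splitBrainDisposition=live-count-as-active` means the split-brain has fresh heartbeat evidence and should stay under supervision and be counted as active; only `interventionRequired=true` signals that intervention is needed. Each entry keeps only the task id, queue, status, issue, classification, and a short summary; `show/detail/trace/output/full/read` live in the section template to avoid repeated noise, and a `kind` marker tags categories such as direct progress, deployment fix, and verification/report noise. In supervisor, `--limit` is mainly a scan/paging budget, not a switch for returning dozens of fat rows; `--unread` is an alias of `--unread-only` and must keep only unread terminal tasks; `--status` must actually filter on the supported statuses, and unknown arguments or unknown statuses must fail in a structured way rather than being silently ignored. When more detailed rows for the current page are needed, explicitly use `--view full` or `--full`, still subject to `--limit` and `--before-id` paging.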
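- `codex task <taskId> --trace --tail|--from-start|--after-seq N|--before-seq N --limit N` pulls the Code Queue logical trace page by page. The response returns `nextAfterSeq`, `previousBeforeSeq`, `hasMore`, `hasBefore`, and next-page/previous-page commands. By default `--trace` fetches the latest page and remains centered on the paged trace; add `--full` for the full prompt/final response, and `--detail` for a detailed task summary.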
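- `codex output <taskId> --tail|--from-start|--after-seq N|--before-seq N --limit N [--full-text]` reads the underlying records page by page by raw output seq; when a trace row reports `commandOmittedLines`, `bodyOmittedLines`, or `rawSeqs`, use this command to fetch the missing information by seq. The default is a low-noise raw-output summary: even with a very large `--limit`, output without `--full-text` caps the number of returned rows and the per-record text preview, and explains how to expand further via `disclosure.limitCapped`, `requestedLimit`, `effectiveLimit`, and `commands.fullText`; only an explicit `--full-text` returns the full text of that page.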
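- `codex read <taskId>` marks a single terminal task as read after manual review; lists, overview, and the supervisor view only return this command field, and must neither execute it automatically nor bulk-clear unread status.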
@@ -56,7 +56,7 @@ CLI 可以从 `master` 快速演进,但必须兼容 `deploy.json` 固定的 CI
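- `codex steer <taskId> [prompt|--prompt-file path|--prompt-stdin] [--dry-run] [--no-retry|--retry-attempts N]` injects a corrective prompt into a running task through the Code Queue private proxy, formally replacing the low-level `microservice proxy code-queue /api/tasks/<taskId>/steer` call. The prompt must come from exactly one of the positional argument, a file, or stdin. `--dry-run` only outputs `method`, `path`, `stableProxyPath`, the retry policy, the prompt character count, a truncated preview, and the raw-proxy equivalent command; it does not touch the running session and must not leak the full text of an overlong prompt. Real execution is a write operation: on success it returns only `accepted=true`, the task id, the prompt character count, `promptOmitted=true`, a bounded task/queue confirmation, an attempt summary, and follow-up inspection commands, without echoing the prompt or the full task state. The path is fixed at `/api/microservices/code-queue/proxy/api/tasks/<taskId>/steer` and can only act on a running task that has an active steerable turn on the D601 scheduler. By default, retryable control-plane failures such as `stable-proxy-failed` and `backend-core-unreachable` get one bounded retry; `--retry-attempts N` is capped at 3, `--retry-delay-ms N` is capped at 5000, and `--no-retry` is for reproducing a single failure.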
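- A non-dry-run `codex steer` failure still outputs JSON and exits non-zero. `.data.diagnostics.reason` is used for runner triage and currently includes `backend-core-unreachable`, `code-queue-microservice-unregistered`, `proxy-unauthorized`, `proxy-404`, `steer-endpoint-404`, `upstream-runtime-rejected`, `stable-proxy-failed`, and `invalid-proxy-response`. `scope` distinguishes `backend-core`, `stable-proxy`, `code-queue-runtime`, and `unknown`, and the diagnostics carry `status`, `exitCode`, `retryable`, a bounded `upstreamBodyPreview`, `attempts`, `retryPolicy`, and recommended cross-check commands. If the task is not in a running/active-turn state, the failure is usually classified as `upstream-runtime-rejected` and must not silently succeed. `502 provider HTTP tunnel failed`, `provider-gateway-http-fetch`, `The operation was aborted`, or a tunnel wait abort of about 30 seconds is classified as `stable-proxy-failed`; the CLI first retries according to the retry policy. If it still fails, `.data.diagnostics.operatorGuidance.rawProxyEquivalentIsFallback=false` indicates that the raw-proxy equivalent command goes through the same tunnel, so it is only useful for comparative diagnosis and should not be treated as a lower-noise fallback. In that case `.data.steer.deliveryUnconfirmed=true`, and the commander should first check `codex tasks --view supervisor`, `codex task <taskId>`, and `microservice health code-queue`, then retry the same `codex steer` from the main server CLI or an explicit SSH transport.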
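- `codex interrupt|cancel <taskId>` requests an interruption through the Code Queue private proxy. For running/judging tasks it asks D601 to stop the current agent run; cancellation of queued/retry_wait tasks must also use the same proxy path as the WebUI, and returns a bounded task summary and follow-up query commands. Any action that needs to touch an active run still belongs to the D601 execution plane.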
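- Code Queue multi-queue lanes are managed by the `codex` command namespace: `queues [--full|--all] [--limit N] [--page N|--offset N]` lists, `queue create <queueId>` creates, `queue merge <sourceQueueId> --into <targetQueueId>` merges, and `move <taskId> --queue <queueId>` migrates. By default these queue-management entry points are handled directly against PostgreSQL by `code-queue-mgr` on the main server, while still being accessed through the stable `code-queue` user-service proxy path. By default `codex queues` returns only active/nonempty/unread/runnable queue summaries, activity, global counts, and execution diagnostics; `--full` or `--all` only switches to one page of the full queue-row view, still subject to `--limit`/`--page`/`--offset` paging, and no longer carries the deprecated full array by default. The stable machine-readable path for both summary and full is `.data.queues.items[]`, and global metadata is fixed at `.data.queues.activity`, `.data.queues.counts`, `.data.queues.executionDiagnostics`, `.data.queues.activeTaskIds`, and `.data.queues.queuedTaskIds`; when the full upstream is needed, use the raw command in the output. `activity.effectiveActiveTaskCount` is the effective active count for commander concurrency decisions; `schedulerLocalActiveQueueCount`/`activeQueueIds` describe only the local scheduler's active-run slots and cannot override the database running count or the heartbeat-fresh runner count. The old full top-level array semantics are recorded as deprecated compatibility information and are no longer the primary shape of `.data.queues`. Tasks within one queue run serially; different queues run in parallel. Migration is only allowed for `queued`/`retry_wait` tasks not yet claimed by the scheduler, which must satisfy `startedAt=null`, `currentAttempt=0`, and no active thread/turn; tasks that have entered `running`/`judging` or already have a claim marker return 409 and must not be written back to queued by move/merge. A merge moves ownership of the migratable tasks and automatically deletes the source queue record, keeping only the merged target queue; if the source or target queue has active/claimed tasks, the whole merge returns 409. The merged target queue runs serially in the order of each task's original `queueEnteredAt`/`createdAt` time; after queued/retry_wait tasks are migrated successfully, the D601 scheduler advances them by polling.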
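- Code Queue multi-queue lanes are managed by the `codex` command namespace: `queues [--full|--all] [--limit N] [--page N|--offset N]` lists, `queue create <queueId>` creates, `queue merge <sourceQueueId> --into <targetQueueId>` merges, and `move <taskId> --queue <queueId>` migrates. By default these queue-management entry points are handled directly against PostgreSQL by `code-queue-mgr` on the main server, while still being accessed through the stable `code-queue` user-service proxy path. By default `codex queues` returns only active/nonempty/unread/runnable queue summaries, activity, commanderConcurrency, global counts, and execution diagnostics; `--full` or `--all` only switches to one page of the full queue-row view, still subject to `--limit`/`--page`/`--offset` paging, and no longer carries the deprecated full array by default. The stable machine-readable path for both summary and full is `.data.queues.items[]`, and global metadata is fixed at `.data.queues.commanderConcurrency`, `.data.queues.activity`, `.data.queues.counts`, `.data.queues.executionDiagnostics`, `.data.queues.activeTaskIds`, and `.data.queues.queuedTaskIds`; when the full upstream is needed, use the raw command in the output. `commanderConcurrency.activeRunnerCount` / `activity.effectiveActiveTaskCount` is the effective active count for commander concurrency decisions; `schedulerLocalActiveQueueCount`/`activeQueueIds` describe only the local scheduler's active-run slots and cannot override the database running count or the heartbeat-fresh runner count. The old full top-level array semantics are recorded as deprecated compatibility information and are no longer the primary shape of `.data.queues`. Tasks within one queue run serially; different queues run in parallel. Migration is only allowed for `queued`/`retry_wait` tasks not yet claimed by the scheduler, which must satisfy `startedAt=null`, `currentAttempt=0`, and no active thread/turn; tasks that have entered `running`/`judging` or already have a claim marker return 409 and must not be written back to queued by move/merge. A merge moves ownership of the migratable tasks and automatically deletes the source queue record, keeping only the merged target queue; if the source or target queue has active/claimed tasks, the whole merge returns 409. The merged target queue runs serially in the order of each task's original `queueEnteredAt`/`createdAt` time; after queued/retry_wait tasks are migrated successfully, the D601 scheduler advances them by polling.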
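- All `codex` query and management commands must go through the same backend-core private proxy path as the WebUI, `/api/microservices/code-queue/proxy/...`; the CLI must not call D601 internal Services, the database, pod curl, or the k3sctl scheduler sub-service directly in order to submit, move, interrupt, cancel, or manage queues. If that path fails, fix the CLI/backend/provider tunnel chain first rather than bypassing the control plane.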
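- `job list [--limit N] [--include-command]` and `job status <jobId|latest> [--tail-bytes N]` query the filesystem state under `.state/jobs/` and are the observability entry points for asynchronous commands. `job list` returns only the latest 50 summaries by default; `job status` returns only the last 12000 bytes of stdout/stderr by default, along with `tailPolicy` and the full log path.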
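- `debug health`, `debug dispatch`, and `debug task` exercise the real internal core, WebSocket, database, provider, system metrics, Docker status, and Host SSH maintenance bridge flows; they are for development debugging only and are not written into the formal acceptance steps in `TEST.md`.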
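
The migration and merge preconditions in the queue-management bullet above can be sketched as two small predicates. This is an illustrative sketch only, not the CLI's implementation: the `QueueTask` shape, the function names, and the `claimedBy`, `activeThreadId`, and `activeTurnId` fields are assumptions introduced here; only `status`, `startedAt`, and `currentAttempt` are named in the text.

```typescript
type TaskStatus = "queued" | "retry_wait" | "running" | "judging" | "succeeded" | "failed" | "canceled";

// Hypothetical task shape; claim and thread/turn markers are assumed field names.
interface QueueTask {
  id: string;
  status: TaskStatus;
  startedAt: string | null;
  currentAttempt: number;
  claimedBy: string | null;
  activeThreadId: string | null;
  activeTurnId: string | null;
}

interface MoveCheck {
  movable: boolean;
  reason: string;
}

// A task may move only while it is unclaimed, unstarted, and queued/retry_wait.
function checkMovable(task: QueueTask): MoveCheck {
  if (task.status !== "queued" && task.status !== "retry_wait") {
    return { movable: false, reason: `status ${task.status} is not queued/retry_wait` };
  }
  if (task.claimedBy !== null) {
    return { movable: false, reason: "task already carries a claim marker" };
  }
  if (task.startedAt !== null || task.currentAttempt !== 0) {
    return { movable: false, reason: "task has already started an attempt" };
  }
  if (task.activeThreadId !== null || task.activeTurnId !== null) {
    return { movable: false, reason: "task has an active thread/turn" };
  }
  return { movable: true, reason: "unclaimed queued/retry_wait task" };
}

function isActiveOrClaimed(task: QueueTask): boolean {
  return task.status === "running" || task.status === "judging" || task.claimedBy !== null;
}

// A merge is all-or-nothing: one active/claimed task in either queue rejects it.
function checkMergeable(source: QueueTask[], target: QueueTask[]): MoveCheck {
  const blocker = [...source, ...target].find(isActiveOrClaimed);
  return blocker
    ? { movable: false, reason: `task ${blocker.id} is active or claimed` }
    : { movable: true, reason: "no active/claimed tasks in either queue" };
}

const idle: QueueTask = {
  id: "t1", status: "queued", startedAt: null, currentAttempt: 0,
  claimedBy: null, activeThreadId: null, activeTurnId: null,
};
const busy: QueueTask = { ...idle, id: "t2", status: "running", startedAt: "2026-05-23T00:00:00Z", currentAttempt: 1 };

console.log(checkMovable(idle), checkMovable(busy), checkMergeable([idle], [busy]));
```

In the documented behavior a rejected move or merge surfaces as HTTP 409; the sketch returns a reason string instead so the rule can be read in isolation.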
+1 -1
@@ -238,7 +238,7 @@ bun scripts/cli.ts codex pr-preflight --remote --issue <issue-number>
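In queue diagnostics, `split-brain` means that the control-plane and execution-plane observations have diverged; it does not automatically prove that tasks are dead. As long as a task's heartbeat is still refreshing and its trace is still advancing, it must not be judged as a service outage or be asked to stop immediately; it should be treated as a live task with `splitBrainLive=true`, and the commander should continue supervising and advancing the tasks already scheduled under #20 rather than interrupting, replacing, or treating the backend as already down. The queue summary should show `effectiveLiveness=live`, `splitBrainLive=true`, and `recommendedAction=continue-supervision`; compact output should also repeat these low-noise fields in `executionDiagnostics.liveness` and highlight `activeHeartbeatCount`, the bounded `heartbeatFreshTaskIds`, `databaseActiveTaskCount`, and `schedulerActiveRunSlotCount`. When the master/control-plane reports `schedulerActiveRunSlotCount=0` while `heartbeatFreshTaskIds` is non-empty, the active count should preferentially be interpreted as live from the scheduler heartbeat summary, rather than as stalled execution from the master's local slot count of 0. Only when the heartbeat is expired/missing or the stale-recovery conditions are met should `effectiveLiveness=at-risk` be shown and recovery evaluation begin.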
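The `activity` block of `codex queues` and the default supervisor view is the primary reading for commander concurrency governance. `effectiveActiveTaskCount` is the effective active task count used for scheduling decisions; `databaseRunningTaskCount` is the count of `running` rows in PostgreSQL; `databaseActiveTaskCount` covers database-active tasks such as running/judging; `heartbeatFreshActiveTaskCount` is the number of effective runners with a fresh heartbeat; `schedulerLocalActiveQueueCount` and `schedulerLocalActiveRunSlotCount` only represent the active run slots visible locally to the current control plane. `activeQueueIds` and `activeQueueCount` are scheduler-local fields and may be 0 while `counts.running>0` and heartbeats are fresh; when this combination appears, decide based on `activity.effectiveActiveTaskCount`, `activity.heartbeatFreshActiveTaskCount`, and `splitBrainLive`, and do not treat an empty `activeQueueIds` as evidence of zero concurrency or a stall.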
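The `activity` / `commanderConcurrency` blocks of `codex queues` and the default supervisor view are the primary readings for commander concurrency governance. Concurrency decisions always use `commanderConcurrency.activeRunnerCount`, which equals `activity.effectiveActiveTaskCount`; under the 15-concurrency policy the refillable window is `15 - activeRunnerCount`, and neither `activeQueueIds.length` nor the scheduler-local slot count may be substituted for it. `effectiveActiveTaskCount` is the effective active task count used for scheduling decisions; `databaseRunningTaskCount` is the count of `running` rows in PostgreSQL; `databaseActiveTaskCount` covers database-active tasks such as running/judging; `heartbeatFreshActiveTaskCount` is the number of effective runners with a fresh heartbeat; `schedulerLocalActiveQueueCount` and `schedulerLocalActiveRunSlotCount` only represent the active run slots visible locally to the current control plane. `activeQueueIds` and `activeQueueCount` are scheduler-local fields and may be 0 while `counts.running>0` and heartbeats are fresh; when this combination appears, decide based on `activity.effectiveActiveTaskCount`, `activity.heartbeatFreshActiveTaskCount`, and `splitBrainLive`, and do not treat an empty `activeQueueIds` as evidence of zero concurrency or a stall. `commanderConcurrency.splitBrainDisposition=live-count-as-active` means the split-brain is still live and should be counted toward active runners; only `interventionRequired=true`, heartbeat risk, stale recovery candidates, or a recommended action other than `continue-supervision` leads to manual intervention/recovery evaluation.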
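
The `15 - activeRunnerCount` rule from the paragraph above can be illustrated from the consumer side. This is a minimal sketch under stated assumptions: `decideRemainingSlots`, `SlotDecision`, and the `maySubmit` gate are hypothetical names introduced here, and withholding new work while `interventionRequired=true` is one possible reading of the guidance, not documented CLI behavior.

```typescript
// Only the commanderConcurrency fields named in the text are modeled.
interface CommanderConcurrency {
  activeRunnerCount: number;
  splitBrainDisposition: string;
  interventionRequired: boolean;
}

// Hypothetical decision shape for a commander-side helper.
interface SlotDecision {
  remainingSlots: number;
  maySubmit: boolean;
  reason: string;
}

// The text describes a 15-runner concurrency policy.
const CONCURRENCY_TARGET = 15;

function decideRemainingSlots(
  concurrency: CommanderConcurrency,
  target: number = CONCURRENCY_TARGET,
): SlotDecision {
  // Clamp at zero in case more runners are active than the target allows.
  const remainingSlots = Math.max(0, target - concurrency.activeRunnerCount);
  if (concurrency.interventionRequired) {
    // Assumption: hold new submissions until the intervention signal is cleared.
    return { remainingSlots, maySubmit: false, reason: "intervention required; inspect before adding work" };
  }
  if (remainingSlots === 0) {
    return { remainingSlots, maySubmit: false, reason: "concurrency window is full" };
  }
  return { remainingSlots, maySubmit: true, reason: `${remainingSlots} slot(s) available` };
}

// Live split-brain: activeQueueIds may be empty, yet 8 runners still count as active.
const decision = decideRemainingSlots({
  activeRunnerCount: 8,
  splitBrainDisposition: "live-count-as-active",
  interventionRequired: false,
});
console.log(decision.remainingSlots, decision.maySubmit); // prints: 7 true
```

Reading `activeRunnerCount` rather than `activeQueueIds.length` is the point of the sketch: in the live split-brain case the latter would be 0 and suggest 15 free slots.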
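A single `provider is not online`, SSH timeout, proxy timeout, or registry request failure only proves that "the current observation path failed"; on its own it must not be escalated to D601 being globally offline, CI/CD being globally blocked, or business tasks being unable to progress. The commander and runners must adjudicate the runtime state from multiple signals, distinguishing at least the following observation planes: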
@@ -169,6 +169,9 @@ function assertQueuesShape(label: string, result: unknown, expectedView: string)
assertCondition(activity.databaseRunningTaskCount === 1, `${label} activity should distinguish database running tasks`, activity);
assertCondition(activity.heartbeatFreshActiveTaskCount === 1, `${label} activity should distinguish heartbeat-fresh active runners`, activity);
assertCondition(activity.schedulerLocalActiveQueueCount === 1, `${label} activity should distinguish scheduler-local active queues`, activity);
const commanderConcurrency = asRecord(activity.commanderConcurrency);
assertCondition(commanderConcurrency.activeRunnerCount === 1, `${label} activity should expose commander-facing active runner count`, commanderConcurrency);
assertCondition(commanderConcurrency.activeRunnerCountField === "activity.effectiveActiveTaskCount", `${label} activity should name the active runner count field`, commanderConcurrency);
assertCondition(activity.activeQueueIdsScope === "scheduler-local-active-run-slots", `${label} activity should label activeQueueIds scope`, activity);
assertCondition(Array.isArray(queues.activeTaskIds), `${label} activeTaskIds should be present`, queues);
assertCondition(Array.isArray(queues.queuedTaskIds), `${label} queuedTaskIds should be present`, queues);
@@ -183,6 +186,7 @@ function assertSplitBrainLiveActivity(label: string, result: unknown): void {
assertCondition(totals.databaseRunningTaskCount === 8, `${label} should foreground DB running task count`, totals);
assertCondition(totals.databaseActiveTaskCount === 8, `${label} should foreground DB active task count`, totals);
assertCondition(totals.heartbeatFreshActiveTaskCount === 8, `${label} should foreground heartbeat-effective active runners`, totals);
assertCondition(totals.commanderActiveRunnerCount === 8, `${label} should mirror commander active count in totals`, totals);
assertCondition(totals.effectiveActiveTaskCount === 8, `${label} should foreground effective active task count`, totals);
assertCondition(asArray(queues.activeQueueIds).length === 0, `${label} activeQueueIds should remain the scheduler-local list`, queues);
assertCondition(queues.activeQueueIdsScope === "scheduler-local-active-run-slots", `${label} activeQueueIds should be scoped`, queues);
@@ -196,6 +200,13 @@ function assertSplitBrainLiveActivity(label: string, result: unknown): void {
assertCondition(activity.schedulerLocalActiveRunSlotCount === 0, `${label} activity should expose scheduler-local slot count`, activity);
assertCondition(activity.runnableQueueCount === 0, `${label} activity should expose runnable queue count`, activity);
assertCondition(activity.splitBrainLive === true, `${label} activity should preserve split-brain live`, activity);
assertCondition(activity.splitBrainDisposition === "live-count-as-active", `${label} activity should count live split-brain as active`, activity);
const commanderConcurrency = asRecord(queues.commanderConcurrency);
assertCondition(commanderConcurrency.activeRunnerCount === 8, `${label} should expose commander-facing active runner count`, commanderConcurrency);
assertCondition(commanderConcurrency.activeRunnerCountField === "activity.effectiveActiveTaskCount", `${label} should name the active runner count field`, commanderConcurrency);
assertCondition(commanderConcurrency.splitBrainDisposition === "live-count-as-active", `${label} should classify live split-brain capacity`, commanderConcurrency);
assertCondition(commanderConcurrency.interventionRequired === false, `${label} should not require intervention for fresh split-brain`, commanderConcurrency);
assertCondition(String(commanderConcurrency.decisionRule ?? "").includes("15 - activeRunnerCount"), `${label} should give 15-concurrency arithmetic`, commanderConcurrency);
assertCondition(String(activity.activeQueueIdsNote ?? "").includes("zero local queue ids does not mean zero active runners"), `${label} activity note should prevent zero-active misread`, activity);
assertCondition(String(activity.interpretation ?? "").includes("continue supervision"), `${label} activity interpretation should keep supervision action`, activity);
}
@@ -256,6 +267,7 @@ export function runCodeQueueQueuesShapeContract(): JsonRecord {
"full explicit limit remains bounded and paged",
"offset pagination",
"split-brain live activity counts distinguish scheduler-local queues, DB running tasks, and heartbeat-fresh runners",
"commander concurrency block names the active runner count and 15-concurrency rule",
],
};
}
@@ -255,6 +255,7 @@ export function runCodeQueueSupervisorDisclosureContract(): JsonRecord {
const omittedCounts = asRecord(listBudget.omittedCounts);
const splitBrainLiveView = asRecord(asRecord(splitBrainLive).supervisor);
const splitBrainLiveActivity = asRecord(splitBrainLiveView.activity);
const splitBrainLiveConcurrency = asRecord(splitBrainLiveView.commanderConcurrency);
const splitBrainLiveCounts = asRecord(splitBrainLiveView.counts);
assertCondition(supervisorBody.length < fullBody.length * 0.55, "supervisor output should be materially smaller than full output", { supervisorChars: supervisorBody.length, fullChars: fullBody.length });
@@ -294,6 +295,7 @@ export function runCodeQueueSupervisorDisclosureContract(): JsonRecord {
assertCondition(asArray(unreadFilteredSection.items).length <= 3, "unread list should be locally paged below --limit", unreadFilteredSection);
assertCondition(unreadFilteredBody.length < 14_000, "unread output should remain bounded", { chars: unreadFilteredBody.length });
assertCondition(splitBrainLiveCounts.running === 8, "split-brain supervisor should preserve DB running task count", splitBrainLiveCounts);
assertCondition(splitBrainLiveCounts.commanderActiveRunnerCount === 8, "split-brain supervisor should mirror commander active count in counts", splitBrainLiveCounts);
assertCondition(splitBrainLiveCounts.effectiveActive === 8, "split-brain supervisor should foreground effective active count", splitBrainLiveCounts);
assertCondition(splitBrainLiveCounts.databaseRunning === 8, "split-brain supervisor should distinguish database running tasks", splitBrainLiveCounts);
assertCondition(splitBrainLiveCounts.heartbeatFreshActive === 8, "split-brain supervisor should distinguish heartbeat-effective active runners", splitBrainLiveCounts);
@@ -305,6 +307,13 @@ export function runCodeQueueSupervisorDisclosureContract(): JsonRecord {
assertCondition(splitBrainLiveActivity.schedulerLocalActiveQueueCount === 0, "split-brain supervisor activity should expose scheduler-local queue count", splitBrainLiveActivity);
assertCondition(splitBrainLiveActivity.schedulerLocalActiveRunSlotCount === 0, "split-brain supervisor activity should expose scheduler-local slot count", splitBrainLiveActivity);
assertCondition(splitBrainLiveActivity.splitBrainLive === true, "split-brain supervisor activity should mark live split-brain", splitBrainLiveActivity);
assertCondition(splitBrainLiveActivity.splitBrainDisposition === "live-count-as-active", "split-brain supervisor activity should classify live split-brain as active capacity", splitBrainLiveActivity);
assertCondition(splitBrainLiveActivity.commanderConcurrency !== undefined, "split-brain supervisor activity should include commander concurrency guidance", splitBrainLiveActivity);
assertCondition(splitBrainLiveConcurrency.activeRunnerCount === 8, "split-brain supervisor should expose commander-facing active runner count", splitBrainLiveConcurrency);
assertCondition(splitBrainLiveConcurrency.activeRunnerCountField === "activity.effectiveActiveTaskCount", "split-brain supervisor should name the field to use", splitBrainLiveConcurrency);
assertCondition(splitBrainLiveConcurrency.splitBrainDisposition === "live-count-as-active", "split-brain supervisor should explain live split-brain disposition", splitBrainLiveConcurrency);
assertCondition(splitBrainLiveConcurrency.interventionRequired === false, "fresh split-brain supervisor should not require intervention", splitBrainLiveConcurrency);
assertCondition(String(splitBrainLiveConcurrency.decisionRule ?? "").includes("15 - activeRunnerCount"), "split-brain supervisor should give 15-concurrency arithmetic", splitBrainLiveConcurrency);
assertCondition(String(splitBrainLiveActivity.activeQueueIdsNote ?? "").includes("zero local queue ids does not mean zero active runners"), "split-brain supervisor activity should explain activeQueueIds are local-only", splitBrainLiveActivity);
assertCondition(String(splitBrainLiveActivity.interpretation ?? "").includes("continue supervision"), "split-brain supervisor activity should not imply scheduler stoppage", splitBrainLiveActivity);
@@ -320,6 +329,7 @@ export function runCodeQueueSupervisorDisclosureContract(): JsonRecord {
"drill-down commands preserved",
"full view remains detailed",
"split-brain live supervisor activity distinguishes scheduler-local, database, and heartbeat counts",
"commander concurrency block names the active runner count and 15-concurrency rule",
],
supervisorChars: supervisorBody.length,
fullChars: fullBody.length,
+55
@@ -1359,6 +1359,12 @@ function compactCodeQueueActivity(
stringListCount(rawDiagnostics.heartbeatRiskTaskIds),
stringListCount(compactDiagnostics.heartbeatRiskTaskIds),
) ?? 0;
const staleRecoveryCandidateTaskCount = firstFiniteNumber(
rawLiveness.staleRecoveryCandidateTaskCount,
compactLiveness.staleRecoveryCandidateTaskCount,
stringListCount(rawDiagnostics.staleRecoveryCandidateTaskIds),
stringListCount(compactDiagnostics.staleRecoveryCandidateTaskIds),
) ?? 0;
const schedulerLocalActiveRunSlotCount = firstFiniteNumber(
rawDiagnostics.schedulerActiveRunSlotCount,
queue.schedulerActiveRunSlotCount,
@@ -1367,6 +1373,10 @@ function compactCodeQueueActivity(
);
const runnableQueueCount = firstFiniteNumber(options.runnableQueueCount, queue.runnableQueueCount);
const effectiveActiveTaskCount = Math.max(databaseActiveTaskCount, databaseRunningTaskCount, heartbeatFreshActiveTaskCount);
const executionState = asString(rawDiagnostics.state ?? compactDiagnostics.state);
const effectiveLiveness = asString(rawDiagnostics.effectiveLiveness ?? compactDiagnostics.effectiveLiveness ?? rawLiveness.effectiveLiveness ?? compactLiveness.effectiveLiveness);
const recommendedAction = asString(rawDiagnostics.recommendedAction ?? compactDiagnostics.recommendedAction ?? rawLiveness.recommendedAction ?? compactLiveness.recommendedAction);
const splitBrain = asBoolean(rawDiagnostics.splitBrain) || asBoolean(compactDiagnostics.splitBrain) || executionState === "split-brain";
const splitBrainLive = splitBrainLiveFromDiagnostics(rawDiagnostics) || splitBrainLiveFromDiagnostics(compactDiagnostics);
const effectiveActiveSource = heartbeatFreshActiveTaskCount > 0 && heartbeatFreshActiveTaskCount >= databaseActiveTaskCount
? "heartbeat-fresh"
@@ -1378,6 +1388,39 @@ function compactCodeQueueActivity(
const activeQueueIdsNote = schedulerLocalActiveQueueIds.length === 0 && effectiveActiveTaskCount > 0
? "activeQueueIds are scheduler-local only; zero local queue ids does not mean zero active runners when database or heartbeat counts are nonzero."
: "activeQueueIds are scheduler-local active-run slots; use effectiveActiveTaskCount for commander concurrency decisions.";
const recommendedActionIntervenes = recommendedAction.length > 0 && recommendedAction !== "continue-supervision";
const interventionRequired = heartbeatRiskTaskCount > 0 || staleRecoveryCandidateTaskCount > 0 || recommendedActionIntervenes || (splitBrain && !splitBrainLive);
const splitBrainDisposition = splitBrainLive
? "live-count-as-active"
: splitBrain
? "risk-investigate-before-new-work"
: "not-split-brain";
const splitBrainReason = splitBrainLive
? "database active/running tasks have fresh heartbeat evidence and no heartbeat-risk candidates in the compact summary."
: splitBrain
? "split-brain without fresh-heartbeat live evidence is risky; inspect heartbeat risk, stale recovery candidates, and raw diagnostics before changing capacity."
: "control-plane and execution-plane activity signals are not split-brain.";
const interventionReason = splitBrainLive
? "fresh heartbeat makes split-brain live; count these runners as active and continue supervision."
: heartbeatRiskTaskCount > 0
? "heartbeat risk is present; inspect before adding replacement work or recovering tasks."
: staleRecoveryCandidateTaskCount > 0
? "stale recovery candidates are present; follow the recovery runbook before changing concurrency."
: recommendedActionIntervenes
? `execution diagnostics recommend ${recommendedAction}; intervene before adding work.`
: splitBrain
? "split-brain is not proven live; inspect raw diagnostics before treating capacity as available."
: "no intervention signal in compact activity summary.";
const commanderConcurrency = {
activeRunnerCount: effectiveActiveTaskCount,
activeRunnerCountField: "activity.effectiveActiveTaskCount",
activeRunnerCountSource: effectiveActiveSource,
decisionRule: "subtract activeRunnerCount from the commander concurrency target; for a 15-runner policy, remaining slots = 15 - activeRunnerCount.",
splitBrainDisposition,
splitBrainReason,
interventionRequired,
interventionReason,
};
return {
effectiveActiveTaskCount,
effectiveActiveSource,
@@ -1386,10 +1429,16 @@ function compactCodeQueueActivity(
heartbeatFreshActiveTaskCount,
activeHeartbeatTaskCount,
heartbeatRiskTaskCount,
staleRecoveryCandidateTaskCount,
schedulerLocalActiveQueueCount: schedulerLocalActiveQueueIds.length,
schedulerLocalActiveRunSlotCount,
runnableQueueCount,
effectiveLiveness: effectiveLiveness || null,
recommendedAction: recommendedAction || null,
splitBrainLive,
splitBrainDisposition,
splitBrainReason,
commanderConcurrency,
activeQueueIdsScope: "scheduler-local-active-run-slots",
activeQueueIdsNote,
interpretation: splitBrainLive
@@ -2454,6 +2503,7 @@ function codexTasksOverviewResult(
const pagination = taskPage.pagination;
const diagnostics = supervisorExecutionDiagnostics(asRecord(taskPage.queue)?.executionDiagnostics);
const activity = compactCodeQueueActivity(asRecord(taskPage.queue) ?? {}, diagnostics);
const commanderConcurrency = asRecord(activity.commanderConcurrency) ?? {};
const visibleSupervisorItems = [...runningSection.items, ...unreadSection.items, ...recentSection.items, ...queuedSection.items];
const classifierCounts = visibleSupervisorItems.reduce((counts, item) => {
const key = item.kind;
@@ -2504,12 +2554,14 @@ function codexTasksOverviewResult(
completedUnread: unreadSection.count,
recentCompleted: recentSection.count,
queued: queuedSection.count,
commanderActiveRunnerCount: asNumber(commanderConcurrency.activeRunnerCount, 0),
effectiveActive: asNumber(activity.effectiveActiveTaskCount, 0),
databaseRunning: asNumber(activity.databaseRunningTaskCount, 0),
heartbeatFreshActive: asNumber(activity.heartbeatFreshActiveTaskCount, 0),
schedulerLocalActiveQueues: asNumber(activity.schedulerLocalActiveQueueCount, 0),
},
classifierCounts,
commanderConcurrency,
activity,
executionDiagnostics: diagnostics,
degraded,
@@ -2803,6 +2855,7 @@ function compactQueuesResponse(body: Record<string, unknown>, options: CodexQueu
const visible = selected.slice(options.offset, options.offset + options.limit);
const diagnostics = compactQueueExecutionDiagnostics(queue.executionDiagnostics);
const activity = compactCodeQueueActivity(queue, diagnostics, { schedulerLocalActiveQueueIds: activeIds, runnableQueueCount: runnableQueues.length });
const commanderConcurrency = asRecord(activity.commanderConcurrency) ?? {};
const activeTaskIds = boundedUniqueStringList(queue.activeTaskIds, Math.min(options.limit, maxTasksLimit));
const queuedTaskIds = boundedUniqueStringList(queue.queuedTaskIds, Math.min(options.limit, maxTasksLimit));
const nextOffset = options.offset + visible.length;
@@ -2832,6 +2885,7 @@ function compactQueuesResponse(body: Record<string, unknown>, options: CodexQueu
activeQueueCount: activeIds.length,
activeQueueCountScope: "scheduler-local-active-run-slots",
schedulerLocalActiveQueueCount: activeIds.length,
commanderActiveRunnerCount: commanderConcurrency.activeRunnerCount,
effectiveActiveTaskCount: activity.effectiveActiveTaskCount,
databaseRunningTaskCount: activity.databaseRunningTaskCount,
databaseActiveTaskCount: activity.databaseActiveTaskCount,
@@ -2840,6 +2894,7 @@ function compactQueuesResponse(body: Record<string, unknown>, options: CodexQueu
unreadQueueCount: unreadQueues.length,
runnableQueueCount: runnableQueues.length,
},
commanderConcurrency,
activity,
activeQueueIds: queue.activeQueueIds ?? [],
activeQueueIdsScope: "scheduler-local-active-run-slots",
+1
@@ -287,6 +287,7 @@ function codexHelp(): unknown {
activityFields: {
path: "data.queues.activity and data.supervisor.activity",
effectiveActiveTaskCount: "Commander-facing active count derived from database active/running tasks and heartbeat-fresh runners.",
commanderConcurrency: "Use data.supervisor.commanderConcurrency.activeRunnerCount or data.queues.commanderConcurrency.activeRunnerCount for concurrency decisions; the block states the 15-runner arithmetic and intervention signal.",
schedulerLocalActiveQueueCount: "Only queues currently visible in this scheduler-local active-run slot view; zero does not override DB or heartbeat activity.",
heartbeatFreshActiveTaskCount: "Heartbeat-effective active runner count used to avoid split-brain zero-active mistakes.",
},
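
The derivation added to `compactCodeQueueActivity` in this commit can be read in isolation as the following sketch. It mirrors the expressions shown in the diff, but `ActivitySignals` and `deriveCommanderConcurrency` are hypothetical names introduced here, and the inputs are assumed to be already normalized (the real function resolves them from raw and compact diagnostics).

```typescript
// Hypothetical, pre-normalized inputs; the real code resolves these from diagnostics.
interface ActivitySignals {
  databaseActiveTaskCount: number;
  databaseRunningTaskCount: number;
  heartbeatFreshActiveTaskCount: number;
  heartbeatRiskTaskCount: number;
  staleRecoveryCandidateTaskCount: number;
  recommendedAction: string; // "" when diagnostics give no recommendation
  splitBrain: boolean;
  splitBrainLive: boolean;
}

type SplitBrainDisposition =
  | "live-count-as-active"
  | "risk-investigate-before-new-work"
  | "not-split-brain";

interface ConcurrencySummary {
  activeRunnerCount: number;
  splitBrainDisposition: SplitBrainDisposition;
  interventionRequired: boolean;
}

function deriveCommanderConcurrency(s: ActivitySignals): ConcurrencySummary {
  // The effective count is the largest of the DB and heartbeat views, so an
  // empty scheduler-local slot list can never pull it down to zero.
  const activeRunnerCount = Math.max(
    s.databaseActiveTaskCount,
    s.databaseRunningTaskCount,
    s.heartbeatFreshActiveTaskCount,
  );
  const recommendedActionIntervenes =
    s.recommendedAction.length > 0 && s.recommendedAction !== "continue-supervision";
  const interventionRequired =
    s.heartbeatRiskTaskCount > 0 ||
    s.staleRecoveryCandidateTaskCount > 0 ||
    recommendedActionIntervenes ||
    (s.splitBrain && !s.splitBrainLive);
  const splitBrainDisposition: SplitBrainDisposition = s.splitBrainLive
    ? "live-count-as-active"
    : s.splitBrain
      ? "risk-investigate-before-new-work"
      : "not-split-brain";
  return { activeRunnerCount, splitBrainDisposition, interventionRequired };
}

// The contract-test scenario: 8 heartbeat-fresh runners during a live split-brain.
const liveSplitBrain = deriveCommanderConcurrency({
  databaseActiveTaskCount: 8,
  databaseRunningTaskCount: 8,
  heartbeatFreshActiveTaskCount: 8,
  heartbeatRiskTaskCount: 0,
  staleRecoveryCandidateTaskCount: 0,
  recommendedAction: "continue-supervision",
  splitBrain: true,
  splitBrainLive: true,
});
console.log(liveSplitBrain);
```

With these inputs the summary matches what the contract tests above assert: an active runner count of 8, `live-count-as-active`, and no intervention required.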