fix: name commander active runner count

2026-05-23 01:01:53 +00:00
parent 031c3fda39
commit 982f21ec62
6 changed files with 81 additions and 3 deletions
@@ -47,7 +47,7 @@ The CLI may evolve quickly on `master`, but must remain compatible with the CI pinned by `deploy.json`
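 - `codex submit [prompt] [--prompt-file path|--prompt-stdin] [--queue queueId] [--provider-id id] [--cwd path] [--model model] [--reasoning-effort effort] [--execution-mode mode] [--max-attempts N] [--reference-task-id id] [--dry-run]` submits a task through the backend-core private proxy to the stable `code-queue` user-service path.
   - The prompt must come from exactly one source: the positional argument, a file, or stdin. `--dry-run` only returns the structured request and does not enqueue anything.
   - Long prompts, multi-line prompts, and prompts containing quotes, backticks, Markdown tables, JSON, or backslashes must use `--prompt-stdin` or `--prompt-file`. Do not pack them into a single shell argument. The positional argument is only suitable for short single-line smoke prompts.
   - For stdin, use a quoted heredoc: `cat <<'PROMPT' | bun scripts/cli.ts codex submit --prompt-stdin --queue <id> --dry-run`. For files, use `bun scripts/cli.ts codex submit --prompt-file /tmp/code-queue-prompt.md --queue <id> --dry-run`. After confirming the dry run, remove `--dry-run` and submit the same payload.
   - The dry run also outputs `routingRecommendation`. It contains the recommended route, runner, model, and risk signals, plus guard states: prompt self-contained, issue not the sole source, prod-secret-DB forbidden, runtime-state or release forbidden, evidence requirements, and medium-complexity candidate.
   - The dry run also outputs `policyContract`. It always exposes the risk tiers, concurrency caps, and external-provider 429 backoff handling for GPT-5.5, DeepSeek, and MiniMax.
   - These recommendations are for commander preflight only. They do not rewrite the payload, do not change runtime admission, and do not assume that production MiniMax or DeepSeek is available.
   - `--dry-run` must return the full prompt, its character count, and `truncated=false` for manual acceptance.
   - A real submission is a write operation. By default it returns only `accepted=true`, the task id, the queue, a write-protection summary, and follow-up view commands. It must set `promptOmitted=true` and must not echo the prompt or promptPreview.
   - Real submissions pass through host-local serialization and a short throttle. This keeps concurrent submits from the same commander from thrashing a low-memory host or the `code-queue-mgr` control plane. The response includes a low-noise `submitConcurrencyGuard` describing the lock and wait for this submission.
   - By default, backend-core routes submissions, queue CRUD, read state, history summaries, and lightweight Trace reads to `code-queue-mgr` on the main server, which writes to the main PostgreSQL. The D601 scheduler only polls for and executes tasks already stored in the database.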
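 - `codex pr-preflight [--remote] [--push-dry-run --push-dry-run-ref refs/heads/probe/<name>] [--pr-create-dry-run --pr-create-dry-run-head <head>] [--issue N] [--full]` requests the D601 scheduler's `/api/runtime-preflight` through the stable `code-queue` proxy. It is used for admission of PR-type dispatches.
   - The output shows, in compressed form: scheduler/runner token coverage; Auth Broker source/capability/nextAction; tools; agent port; Git worktree; GitHub egress; read-only repo/issue/PR probes; the optional push dry run; and the optional PR body/create dry-run guard.
   - It only reports whether `GH_TOKEN`/`GITHUB_TOKEN` is present and which key it came from. It never prints the value.
   - When auth-broker configuration exists: `tokenCoverage.source="auth-broker"` and `credentialSource="broker-issued-token"`. A runner env token is not required for success.
   - When only an env token exists: `credentialSource="env-token"` and `preflight.authBroker.nextAction="use-env-token-until-auth-broker-live"`.
   - When both are missing, the top level reports `ok=false`, `runnerDisposition=infra-blocked`, and `degradedReason=auth-broker-needed`. `tokenCoverage.missing` lists both `GH_TOKEN` and `GITHUB_TOKEN`, and the output includes `preflight.authBroker.source="broker/auth-broker-needed"` and `capability.source="missing-token"`.
   - The scope of this `auth-missing` is `scheduler-runner-env`. It must not be reduced to "the current active runner/dev container cannot create PRs". The output must carry `scopeBoundary` and `activeRunnerDevContainer`. Callers must separately run `bun scripts/cli.ts gh auth status --repo pikasTech/unidesk` and a PR dry run to confirm the current dev container's capability.
   - `preflight.prCapabilityContract` is a runner-facing contract summary. It must include:
     - the target branch;
     - the token/auth source;
     - `systemGhBinaryRequiredForWrites=false`;
     - availability of the UniDesk REST `bun scripts/cli.ts gh`;
     - `writesRemote=false` for both the push dry run and the PR create dry run;
     - the expected PR handoff;
     - the rule that real PR creation requires commander authorization;
     - the `unsupported-command` boundary for `gh pr merge`.
   - A missing system `gh` binary is recorded only in `tools.systemGhBinary`. It must not be misread as the UniDesk REST `gh` CLI being unavailable.
   - In runner-like environments, `--remote` no longer depends on the local `unidesk-backend-core`, `unidesk-database`, or `baidu-netdisk-backend` containers. If they are missing, that is recorded only as local observational evidence.
   - If the remote control plane is reachable, the remote result is used. If it is unreachable, the command returns a structured `failureKind=control-plane-missing` / `degradedReason=remote-control-plane-unreachable`. It does not treat the local `backend-core-container-missing` as the final blocker.
   - `--pr-create-dry-run` does not POST to GitHub. It only proves that PR body generation inside the runner, `scripts/cli.ts gh pr create --dry-run`, and the branch argument shape work.
   - Server-side creation permission is still decided by four things: the token/auth broker, repo/issue/PR read access, the push dry run, and the result of the real PR creation after final authorization.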
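 - `codex task <taskId>` queries a structured review summary by task ID through the Code Queue private proxy.
   - By default it returns only: the task identity, execution Provider, working directory, attempt count, original prompt, final response, last error, and progressive-disclosure commands. This lets commanders review completed unread tasks without blowing up their context.
   - `--detail` is still a bounded summary. By default it returns only a few attempt/tool rows, short prompt/response/stderr/feedback previews, and omitted/truncated metadata.
   - For the full prompt/response text or more tool/attempt detail, add `--full`, `--tool-limit N`, or `--trace` explicitly, or use `codex output`.
   - By default, `code-queue-mgr` on the main server serves this summary from PostgreSQL. It does not depend on the D601 `code-queue-read` Service being available.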
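- `codex tasks [--view supervisor|full] [--queue id] [--status succeeded|running|queued|failed|canceled|judging|retry_wait[,..]] [--unread|--unread-only] [--limit N] [--before-id id]` outputs a progressive-disclosure view through the same private proxy.
   - The default `supervisor` view is a low-noise commander view. It returns only compact rows for `running`, `completedUnread`, `recentCompleted`, `queued`, and `executionDiagnostics`.
   - Prompts/bodies get only a short preview and the original character count.
   - By default, `running`/`completedUnread`/`queued` return only a very small bounded page. Use the section's `commands.next` to continue paging.
   - `recentCompleted` is also limited by default and does not repeat unread terminal tasks already shown in `completedUnread`. No full Trace, final response, or full overview is embedded.
   - In supervisor, `--limit` is mainly a scan/paging budget, not a switch for returning dozens of fat rows.
   - For more detailed task rows on the current page, use `--view full` or `--full` explicitly. Both are still bound by `--limit` and `--before-id` paging.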
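+- `codex tasks [--view supervisor|full] [--queue id] [--status succeeded|running|queued|failed|canceled|judging|retry_wait[,..]] [--unread|--unread-only] [--limit N] [--before-id id]` outputs a progressive-disclosure view through the same private proxy.
+   - The default `supervisor` view is a low-noise commander view. It returns only compact rows for `running`, `completedUnread`, `recentCompleted`, `queued`, `activity`, `commanderConcurrency`, and `executionDiagnostics`.
+   - Prompts/bodies get only a short preview and the original character count.
+   - By default, `running`/`completedUnread`/`queued` return only a small bounded page. Use the section's `commands.next` to continue paging.
+   - `recentCompleted` is limited by default and does not repeat unread terminal tasks already shown in `completedUnread`. No full Trace, final response, or full overview is embedded.
+   - `commanderConcurrency.activeRunnerCount` is the active/running count the concurrency policy should use. It equals `activity.effectiveActiveTaskCount`. The 15-concurrency policy computes the remaining window as `15 - activeRunnerCount`.
+   - `commanderConcurrency.splitBrainDisposition=live-count-as-active` means the split-brain has fresh heartbeat evidence. It should keep being supervised and counted as active. Only `interventionRequired=true` signals that intervention is needed.
+   - Each entry keeps only the task id, queue, status, issue, category, and a short summary. The `show/detail/trace/output/full/read` commands live in the section template to avoid repetitive noise.
+   - Each entry carries a `kind` tag marking its category, such as direct progress, deployment fix, or verification/report noise.
+   - In supervisor, `--limit` is mainly a scan/paging budget, not a switch for returning dozens of fat rows.
+   - `--unread` is an alias for `--unread-only` and must keep only unread terminal tasks.
+   - `--status` must actually filter by the supported statuses. Unknown arguments or unknown statuses must fail with a structured error rather than being silently ignored.
+   - For more detailed task rows on the current page, use `--view full` or `--full` explicitly. Both are still bound by `--limit` and `--before-id` paging.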
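 - `codex task <taskId> --trace --tail|--from-start|--after-seq N|--before-seq N --limit N` fetches the Code Queue logical trace one page at a time.
   - The response returns `nextAfterSeq`, `previousBeforeSeq`, `hasMore`, `hasBefore`, and next-page/previous-page commands.
   - By default, `--trace` fetches the latest page and stays centered on the paginated trace.
   - Add `--full` for the complete prompt/final response, or `--detail` for the detailed task summary.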
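 - `codex output <taskId> --tail|--from-start|--after-seq N|--before-seq N --limit N [--full-text]` reads the underlying records one page at a time, by raw output seq.
   - When a trace row indicates `commandOmittedLines`, `bodyOmittedLines`, or `rawSeqs`, use this command to fetch the missing information by seq.
   - The default is a low-noise raw-output summary. Without `--full-text`, the number of returned rows and the per-entry text preview are capped even when `--limit` is very large.
   - `disclosure.limitCapped`, `requestedLimit`, `effectiveLimit`, and `commands.fullText` explain how to expand further.
   - Only an explicit `--full-text` returns the full text of the page.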
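 - `codex read <taskId>` marks a single terminal task as read after manual review. List, overview, and supervisor views only return this command as a field. They must never run it automatically or bulk-clear unread state.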
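The `supervisor` view defines a commander concurrency rule: the remaining dispatch window is `15 - activeRunnerCount`, and `activeRunnerCount` equals `activity.effectiveActiveTaskCount`. The TypeScript below is a minimal sketch of that rule under the following assumptions:
- `SupervisorSnapshot` is a hypothetical subset of the CLI's JSON. It models only the field names quoted above.
- `planDispatch` is an illustrative helper, not part of the CLI.
- The exact JSON path where these objects appear in real output is not assumed.

```typescript
// Hypothetical subset of the supervisor JSON: only field names quoted in the
// docs are modelled; the real payload carries more fields.
interface CommanderConcurrency {
  activeRunnerCount: number;
  // e.g. "live-count-as-active": already counted as active, keep supervising.
  splitBrainDisposition?: string;
  // Only this flag signals that a human should intervene.
  interventionRequired?: boolean;
}

interface SupervisorSnapshot {
  activity: { effectiveActiveTaskCount: number };
  commanderConcurrency: CommanderConcurrency;
}

interface DispatchPlan {
  activeRunnerCount: number;
  remainingWindow: number;
  interventionRequired: boolean;
  warnings: string[];
}

// Policy cap taken from the "15 concurrency" rule; kept as a parameter.
const DEFAULT_CAP = 15;

// Hypothetical helper: how many more tasks the commander may dispatch now.
function planDispatch(snapshot: SupervisorSnapshot, cap: number = DEFAULT_CAP): DispatchPlan {
  const cc = snapshot.commanderConcurrency;
  const effective = snapshot.activity.effectiveActiveTaskCount;
  const warnings: string[] = [];

  // The docs say both counts are equal; report drift instead of silently
  // preferring one. activeRunnerCount remains the policy input.
  if (cc.activeRunnerCount !== effective) {
    warnings.push(
      `activeRunnerCount=${cc.activeRunnerCount} differs from effectiveActiveTaskCount=${effective}`,
    );
  }

  return {
    activeRunnerCount: cc.activeRunnerCount,
    // Split-brain tasks with a fresh heartbeat are already inside
    // activeRunnerCount, so they shrink the window without extra handling.
    // Clamp at zero so an over-subscribed fleet never yields a negative window.
    remainingWindow: Math.max(0, cap - cc.activeRunnerCount),
    interventionRequired: cc.interventionRequired === true,
    warnings,
  };
}

const sample: SupervisorSnapshot = {
  activity: { effectiveActiveTaskCount: 3 },
  commanderConcurrency: { activeRunnerCount: 3, splitBrainDisposition: "live-count-as-active" },
};
console.log(planDispatch(sample).remainingWindow); // 12
```

In practice the snapshot would be parsed from the CLI's JSON output. For `codex queues`, the docs place `commanderConcurrency` and `activity` under `.data.queues`. Keeping the cap as a parameter avoids hard-coding the policy number into the helper.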
@@ -56,7 +56,7 @@ The CLI may evolve quickly on `master`, but must remain compatible with the CI pinned by `deploy.json`
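 - `codex steer <taskId> [prompt|--prompt-file path|--prompt-stdin] [--dry-run] [--no-retry|--retry-attempts N]` injects a course-correction prompt into a running task through the Code Queue private proxy. It formally replaces the low-level `microservice proxy code-queue /api/tasks/<taskId>/steer` call.
   - The prompt must come from exactly one source: the positional argument, a file, or stdin.
   - `--dry-run` only outputs `method`, `path`, `stableProxyPath`, the retry policy, the prompt character count, a truncated preview, and the raw proxy equivalent command. It does not touch the running session and must not leak the full text of an overlong prompt.
   - A real run is a write operation. On success it returns only `accepted=true`, the task id, the prompt character count, `promptOmitted=true`, a bounded task/queue confirmation, an attempt summary, and follow-up view commands. It does not echo the prompt or the full task state.
   - The path is fixed at `/api/microservices/code-queue/proxy/api/tasks/<taskId>/steer`. It only applies to running tasks that have an active steerable turn on the D601 scheduler.
   - By default, retryable control-plane failures such as `stable-proxy-failed` and `backend-core-unreachable` get one bounded retry. `--retry-attempts N` is capped at 3 and `--retry-delay-ms N` at 5000. Use `--no-retry` to reproduce a single failure.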
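 - On a non-dry-run failure, `codex steer` still outputs JSON and exits non-zero.
   - `.data.diagnostics.reason` is used for runner triage. Current values: `backend-core-unreachable`, `code-queue-microservice-unregistered`, `proxy-unauthorized`, `proxy-404`, `steer-endpoint-404`, `upstream-runtime-rejected`, `stable-proxy-failed`, and `invalid-proxy-response`.
   - `scope` is one of `backend-core`, `stable-proxy`, `code-queue-runtime`, or `unknown`. It comes with `status`, `exitCode`, `retryable`, a bounded `upstreamBodyPreview`, `attempts`, `retryPolicy`, and recommended cross-check commands.
   - If the task is not in a running/active-turn state, the failure is usually classified as `upstream-runtime-rejected`. It must not silently succeed.
   - The following are classified as `stable-proxy-failed`, and the CLI first retries them according to the retry policy:
     - `502 provider HTTP tunnel failed`;
     - `provider-gateway-http-fetch`;
     - `The operation was aborted`;
     - a tunnel-wait abort after roughly 30 seconds.
   - If the retry still fails, `.data.diagnostics.operatorGuidance.rawProxyEquivalentIsFallback=false` indicates that the raw proxy equivalent command goes through the same tunnel. It is only useful as a diagnostic comparison and should not be treated as a lower-noise fallback.
   - In that case `.data.steer.deliveryUnconfirmed=true`. The commander should first check `codex tasks --view supervisor`, `codex task <taskId>`, and `microservice health code-queue`. Then retry the same `codex steer` from the main-server CLI or through an explicit SSH transport.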
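 - `codex interrupt|cancel <taskId>` requests an interrupt through the Code Queue private proxy.
   - For running/judging tasks, it asks D601 to stop the current agent run.
   - Cancelling queued/retry_wait tasks must also use the same proxy path as the WebUI.
   - The command returns a bounded task summary and follow-up query commands.
   - Any action that touches an active run still belongs to the D601 execution plane.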
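- Code Queue multi-queue lanes are managed under the `codex` command namespace:
   - `queues [--full|--all] [--limit N] [--page N|--offset N]` lists queues;
   - `queue create <queueId>` creates a queue;
   - `queue merge <sourceQueueId> --into <targetQueueId>` merges queues;
   - `move <taskId> --queue <queueId>` moves a task.
   - By default, `code-queue-mgr` on the main server serves these entry points and manages PostgreSQL directly. They are still accessed through the stable `code-queue` user-service proxy path.
   - By default, `codex queues` returns only active/nonempty/unread/runnable queue summaries, activity, global counts, and execution diagnostics.
   - `--full` or `--all` switches to a single page of the full queue-row view. It is still bound by `--limit`/`--page`/`--offset` paging and no longer carries the deprecated full array by default.
   - The stable machine-readable path for both summary and full views is `.data.queues.items[]`. Global metadata is fixed at `.data.queues.activity`, `.data.queues.counts`, `.data.queues.executionDiagnostics`, `.data.queues.activeTaskIds`, and `.data.queues.queuedTaskIds`. For the full upstream response, use the raw command in the output.
   - `activity.effectiveActiveTaskCount` is the effective active count for commander concurrency decisions.
   - `schedulerLocalActiveQueueCount`/`activeQueueIds` only describe local scheduler active-run slots. They cannot override the database running count or the heartbeat-fresh runner count.
   - The old top-level full-array semantics are recorded as deprecated compatibility information and are no longer the primary shape of `.data.queues`.
   - Tasks within the same queue run serially; different queues run in parallel.
   - Moves are allowed only for `queued`/`retry_wait` tasks that the scheduler has not yet claimed. Such a task must satisfy `startedAt=null` and `currentAttempt=0`, and must have no active thread/turn.
   - Moving a task that has entered `running`/`judging` or carries a claim marker returns 409. Move/merge must not write such a task back to queued.
   - A merge transfers ownership of movable tasks and automatically deletes the source queue record, keeping only the merged target queue. If the source or target queue has any active/claimed task, the whole merge returns 409.
   - The merged target queue runs serially, in the tasks' original `queueEnteredAt`/`createdAt` order. After queued/retry_wait tasks are moved successfully, the D601 scheduler advances them by polling.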
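+- Code Queue multi-queue lanes are managed under the `codex` command namespace:
+   - `queues [--full|--all] [--limit N] [--page N|--offset N]` lists queues;
+   - `queue create <queueId>` creates a queue;
+   - `queue merge <sourceQueueId> --into <targetQueueId>` merges queues;
+   - `move <taskId> --queue <queueId>` moves a task.
+   - By default, `code-queue-mgr` on the main server serves these entry points and manages PostgreSQL directly. They are still accessed through the stable `code-queue` user-service proxy path.
+   - By default, `codex queues` returns only active/nonempty/unread/runnable queue summaries, activity, commanderConcurrency, global counts, and execution diagnostics.
+   - `--full` or `--all` switches to a single page of the full queue-row view. It is still bound by `--limit`/`--page`/`--offset` paging and no longer carries the deprecated full array by default.
+   - The stable machine-readable path for both summary and full views is `.data.queues.items[]`. Global metadata is fixed at `.data.queues.commanderConcurrency`, `.data.queues.activity`, `.data.queues.counts`, `.data.queues.executionDiagnostics`, `.data.queues.activeTaskIds`, and `.data.queues.queuedTaskIds`. For the full upstream response, use the raw command in the output.
+   - `commanderConcurrency.activeRunnerCount` / `activity.effectiveActiveTaskCount` is the effective active count for commander concurrency decisions.
+   - `schedulerLocalActiveQueueCount`/`activeQueueIds` only describe local scheduler active-run slots. They cannot override the database running count or the heartbeat-fresh runner count.
+   - The old top-level full-array semantics are recorded as deprecated compatibility information and are no longer the primary shape of `.data.queues`.
+   - Tasks within the same queue run serially; different queues run in parallel.
+   - Moves are allowed only for `queued`/`retry_wait` tasks that the scheduler has not yet claimed. Such a task must satisfy `startedAt=null` and `currentAttempt=0`, and must have no active thread/turn.
+   - Moving a task that has entered `running`/`judging` or carries a claim marker returns 409. Move/merge must not write such a task back to queued.
+   - A merge transfers ownership of movable tasks and automatically deletes the source queue record, keeping only the merged target queue. If the source or target queue has any active/claimed task, the whole merge returns 409.
+   - The merged target queue runs serially, in the tasks' original `queueEnteredAt`/`createdAt` order. After queued/retry_wait tasks are moved successfully, the D601 scheduler advances them by polling.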
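 - All `codex` query and management commands must use the same backend-core private proxy path as the WebUI: `/api/microservices/code-queue/proxy/...`.
   - For submit, move, interrupt, cancel, or queue management, the CLI must not call D601 internal Services, the database, pod curl, or k3sctl scheduler sub-services directly.
   - If this path fails, fix the CLI/backend/provider tunnel chain first. Do not bypass the control plane.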
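 - `job list [--limit N] [--include-command]` and `job status <jobId|latest> [--tail-bytes N]` query the `.state/jobs/` filesystem state. They are the observability entry points for asynchronous commands.
   - By default, `job list` returns only the latest 50 summaries.
   - By default, `job status` returns only the last 12000 bytes of stdout/stderr, together with `tailPolicy` and the full log path.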
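 - `debug health`, `debug dispatch`, and `debug task` exercise the real flows of the internal core, WebSocket, database, provider, system metrics, Docker status, and the Host SSH maintenance bridge. They are for development debugging only and are not part of the formal acceptance steps in `TEST.md`.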
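The Code Queue lane rules in the multi-queue bullet above (which tasks may be moved, when a merge returns 409, and how the merged queue is ordered) can be expressed as two predicates plus an ordering step. The TypeScript below is a minimal sketch under the following assumptions:
- `QueueTask` is a hypothetical record.
- The names `activeThreadId`, `activeTurnId`, and `claimed` are stand-ins for the "active thread/turn" and "claim marker" the docs mention.
- `planMerge` is illustrative only.
- The 409 is modelled as a plain number, not a real HTTP response.

```typescript
type TaskStatus =
  | "queued" | "retry_wait" | "running" | "judging"
  | "succeeded" | "failed" | "canceled";

// Hypothetical task record; field names other than status, startedAt,
// currentAttempt, queueEnteredAt and createdAt are assumptions.
interface QueueTask {
  id: string;
  status: TaskStatus;
  startedAt: string | null;
  currentAttempt: number;
  activeThreadId: string | null; // assumed name for "active thread"
  activeTurnId: string | null;   // assumed name for "active turn"
  claimed: boolean;              // assumed flag for "has a claim marker"
  queueEnteredAt: string | null;
  createdAt: string;
}

// A task may be moved only if the scheduler has never claimed or started it.
function isMovable(t: QueueTask): boolean {
  return (t.status === "queued" || t.status === "retry_wait")
    && t.startedAt === null
    && t.currentAttempt === 0
    && t.activeThreadId === null
    && t.activeTurnId === null
    && !t.claimed;
}

// Any active or claimed task in either queue blocks the whole merge.
function isActiveOrClaimed(t: QueueTask): boolean {
  return t.status === "running" || t.status === "judging"
    || t.claimed || t.activeThreadId !== null || t.activeTurnId !== null;
}

// Ordering key: the original queueEnteredAt, falling back to createdAt.
function enteredAt(t: QueueTask): number {
  return Date.parse(t.queueEnteredAt ?? t.createdAt);
}

type MergePlan =
  | { status: 409; blockingTaskIds: string[] }
  | { status: 200; orderedTaskIds: string[] };

// Hypothetical merge planner: all-or-nothing, never a partial merge.
function planMerge(source: QueueTask[], target: QueueTask[]): MergePlan {
  const blocking = [...source, ...target].filter(isActiveOrClaimed);
  if (blocking.length > 0) {
    return { status: 409, blockingTaskIds: blocking.map((t) => t.id) };
  }
  // The target keeps its own pending tasks; movable source tasks join them.
  // Pending-but-unmovable source tasks are left out of this sketch's plan,
  // since the docs do not say how they are handled.
  const targetPending = target.filter((t) => t.status === "queued" || t.status === "retry_wait");
  const pending = [...targetPending, ...source.filter(isMovable)];
  // Array.prototype.sort is stable, so ties keep target-first order.
  pending.sort((a, b) => enteredAt(a) - enteredAt(b));
  return { status: 200, orderedTaskIds: pending.map((t) => t.id) };
}
```

Blockers are checked across both queues before anything moves. This mirrors the rule that the whole merge returns 409 rather than merging partially. Ordering by the original entry time, instead of by move time, preserves the serial order tasks had before the merge.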