fix: clarify code queue liveness snapshots

2026-05-23 12:36:20 +00:00
parent 15d074424c
commit 758377c551
4 changed files with 338 additions and 23 deletions
@@ -289,9 +289,13 @@ Task classification in the commander view must be a deterministic field that distinguishes at least `business

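 In queue diagnostics, `split-brain` means that the control plane and the execution plane disagree in what they observe; it does not by itself prove that a task has died. As long as the task heartbeat is still refreshing and the trace is still advancing, the task must not be classified as a service outage, and no immediate stop may be demanded. Treat it as a live task with `splitBrainLive=true`: keep supervising and keep advancing the tasks already scheduled in #20, rather than interrupting it, replacing it, or treating the backend as already down. The queue summary should show `effectiveLiveness=live`, `splitBrainLive=true`, and `recommendedAction=continue-supervision`; compact output should also repeat these low-noise fields in `executionDiagnostics.liveness` and highlight `activeHeartbeatCount`, the bounded `heartbeatFreshTaskIds`, `databaseActiveTaskCount`, and `schedulerActiveRunSlotCount`. When the master/control-plane reports `schedulerActiveRunSlotCount=0` but `heartbeatFreshTaskIds` is non-empty, the active count should be interpreted as live based on the scheduler heartbeat summary, not as an execution stall based on the master's local slot count of 0. `effectiveLiveness=at-risk` should be shown, and recovery evaluation entered, only when the heartbeat is expired/missing or the stale-recovery conditions are met.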

+The `queue` returned after a successful `codex submit` is an immediate, bounded submit-confirmation snapshot, not a recovery criterion. It must distinguish all of the following: the queued/running state of the just-submitted tasks in `submitted.taskStates[]`; PostgreSQL active running in `queue.countContext.databaseActive` / `activity.databaseActiveTaskCount`; scheduler heartbeat freshness in `activity.schedulerHeartbeatFreshness`; and transient risk in `activity.recovery`. If the same submit echo shows `counts.running>0`, `queued>0`, `heartbeatRiskTaskCount>0`, or `staleRecoveryCandidateTaskCount>0`, but `lastObservedAgentEventAt` is clearly earlier than the submit time or no subsequent supervisor poll has confirmed the state yet, the output should keep the candidates visible and give the hint `re-poll supervisor before recovery`. A single bounded snapshot of this kind may only set `attentionRequired=true`; it must not escalate `commanderConcurrency.interventionRequired` directly to high-risk recovery. The default drill-down is to re-run `codex tasks --view supervisor --limit 20` or the raw overview, not to restart, cancel, interrupt, or write to the DB.
+
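+The default supervisor poll follows the same low-noise semantics: heartbeat expired/missing, `heartbeatRiskTaskIds`, and `staleRecoveryCandidateTaskIds` must be visible, but a first poll only indicates `transient-needs-repoll`, and `activity.recovery.hint` should be `re-poll supervisor before recovery`. Entry into a bounded dry-run reconcile is allowed only when a repeated poll still confirms that the owner heartbeat is expired, the scheduler has no local active run, and the database-active task still exists, and the output explicitly carries `repeatedPollConfirmed=true` or a confirmed stale candidate; real recovery remains subject to the high-risk boundary.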
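
The re-poll gate described in the two added paragraphs can be sketched as a small decision function. This is an illustrative sketch only and is not part of the committed change: the flat `poll` dictionary, the `ownerHeartbeatExpired` flag, the function name, and the `bounded-dry-run-reconcile` label are hypothetical, and the "confirmed stale candidate" alternative is not modeled.

```python
REPOLL_HINT = "re-poll supervisor before recovery"


def recovery_gate(poll: dict) -> str:
    """Decide what a single supervisor poll permits (hypothetical flat layout)."""
    has_risk = bool(
        poll.get("heartbeatRiskTaskIds") or poll.get("staleRecoveryCandidateTaskIds")
    )
    if not has_risk:
        return "continue-supervision"

    # All four conditions must hold; missing fields default to "not confirmed".
    confirmed = (
        poll.get("repeatedPollConfirmed") is True
        and poll.get("ownerHeartbeatExpired") is True  # hypothetical flag
        and poll.get("schedulerLocalActiveRunSlotCount", 1) == 0
        and poll.get("databaseActiveTaskCount", 0) > 0
    )
    # A first or unconfirmed poll only asks for another poll.
    return "bounded-dry-run-reconcile" if confirmed else REPOLL_HINT


first_poll = {"heartbeatRiskTaskIds": ["t-1"]}
print(recovery_gate(first_poll))
```

Missing fields fall on the conservative side here, so an incomplete snapshot yields the re-poll hint and never the reconcile path.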
+
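 For stale-active recovery and the `/api/scheduler/reconcile?staleMs=...` diagnostic endpoint, the heartbeat stale threshold must be normalized against a safe lower bound: a missing value and any value below the 5-minute default are both treated as 5 minutes, overly large values are truncated to the 24-hour upper bound, and the structured response returns `requestedStaleMs*`, `staleMsAdjusted`, `staleMsAdjustmentReason`, `minStaleMs`, and `maxStaleMs`. Neither `staleMs=0` nor any other too-low threshold may cause a task that still has a fresh scheduler heartbeat to be classified as stale/recoverable.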
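
The threshold normalization above amounts to a clamp with a reported reason. A minimal sketch, not part of the committed change: the function name, the `staleMs` and `requestedStaleMs` keys, the reason strings, and the choice to report a missing value as adjusted are all assumptions made for illustration.

```python
from typing import Optional

MIN_STALE_MS = 5 * 60 * 1000        # 5-minute default and lower bound (from the text)
MAX_STALE_MS = 24 * 60 * 60 * 1000  # 24-hour upper bound (from the text)


def normalize_stale_ms(requested: Optional[int]) -> dict:
    """Clamp a requested stale threshold into [MIN_STALE_MS, MAX_STALE_MS]."""
    if requested is None:
        effective, reason = MIN_STALE_MS, "missing-defaulted"  # hypothetical reason
    elif requested < MIN_STALE_MS:
        effective, reason = MIN_STALE_MS, "below-minimum"      # hypothetical reason
    elif requested > MAX_STALE_MS:
        effective, reason = MAX_STALE_MS, "above-maximum"      # hypothetical reason
    else:
        effective, reason = requested, None
    return {
        "requestedStaleMs": requested,
        "staleMs": effective,  # hypothetical key for the effective threshold
        "staleMsAdjusted": reason is not None,
        "staleMsAdjustmentReason": reason,
        "minStaleMs": MIN_STALE_MS,
        "maxStaleMs": MAX_STALE_MS,
    }


print(normalize_stale_ms(0)["staleMs"])
```

With this shape, `staleMs=0` resolves to the 5-minute floor, so a task with a heartbeat fresher than 5 minutes cannot be flagged stale by a low request value.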

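-The `activity` / `commanderConcurrency` fields of `codex queues`, `codex tasks --view commander`, and the default supervisor view are the primary readings for commander concurrency governance. Concurrency decisions always use `commanderConcurrency.activeRunnerCount` or the commander `activeRunners.count`, which equals `activity.effectiveActiveTaskCount`; under the 15-concurrency policy the refill window is computed as `15 - activeRunnerCount` and must not be replaced by `activeQueueIds.length` or the scheduler-local slot count. `effectiveActiveTaskCount` is the effective number of active tasks used for scheduling decisions; `databaseRunningTaskCount` is the count of tasks in the `running` state in PostgreSQL; `databaseActiveTaskCount` covers database-active tasks such as running/judging; `heartbeatFreshActiveTaskCount` is the number of effective runners with a fresh heartbeat; `schedulerLocalActiveQueueCount` and `schedulerLocalActiveRunSlotCount` only describe the active run slots visible locally to the current control plane. `activeQueueIds` and `activeQueueCount` are scheduler-local fields and may be 0 while `counts.running>0` and heartbeats are fresh; in that combination, decide based on `activity.effectiveActiveTaskCount`, `activity.heartbeatFreshActiveTaskCount`, and `splitBrainLive`, and do not treat an empty `activeQueueIds` as evidence of zero concurrency or a stall. `commanderConcurrency.splitBrainDisposition=live-count-as-active` means the split-brain task is still live and should count as an active runner; only `interventionRequired=true`, heartbeat risk, stale recovery candidates, or a recommended action other than `continue-supervision` lead into human intervention/recovery evaluation.
+The `activity` / `commanderConcurrency` fields of `codex queues`, `codex tasks --view commander`, and the default supervisor view are the primary readings for commander concurrency governance. Concurrency decisions always use `commanderConcurrency.activeRunnerCount` or the commander `activeRunners.count`, which equals `activity.effectiveActiveTaskCount`; under the 15-concurrency policy the refill window is computed as `15 - activeRunnerCount` and must not be replaced by `activeQueueIds.length` or the scheduler-local slot count. `effectiveActiveTaskCount` is the effective number of active tasks used for scheduling decisions; `databaseRunningTaskCount` is the count of tasks in the `running` state in PostgreSQL; `databaseActiveTaskCount` covers database-active tasks such as running/judging; `heartbeatFreshActiveTaskCount` is the number of effective runners with a fresh heartbeat; `schedulerLocalActiveQueueCount` and `schedulerLocalActiveRunSlotCount` only describe the active run slots visible locally to the current control plane. `activeQueueIds` and `activeQueueCount` are scheduler-local fields and may be 0 while `counts.running>0` and heartbeats are fresh; in that combination, decide based on `activity.effectiveActiveTaskCount`, `activity.heartbeatFreshActiveTaskCount`, and `splitBrainLive`, and do not treat an empty `activeQueueIds` as evidence of zero concurrency or a stall. `commanderConcurrency.splitBrainDisposition=live-count-as-active` means the split-brain task is still live and should count as an active runner; `attentionRequired=true` means a human should take a look or re-poll, and only `interventionRequired=true` means the current output is already sufficient to enter the high-risk intervention path. A single heartbeat risk, stale recovery candidates, or `recommendedAction=investigate-heartbeat-risk` should first resolve to `attentionRequired=true` plus `re-poll supervisor before recovery`, and must not be treated as equivalent to recovery authorization.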
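
The refill-window rule can be illustrated with a short sketch. It is not part of the committed change: the function name, the nested dictionary layout, the floor at zero, and the decision to raise when neither count is present are assumptions; only the field names and the `15 - activeRunnerCount` formula come from the text.

```python
CONCURRENCY_LIMIT = 15  # the 15-concurrency policy named in the text


def refill_window(summary: dict) -> int:
    """Return how many more runners may be started under the policy."""
    concurrency = summary.get("commanderConcurrency", {})
    activity = summary.get("activity", {})

    active = concurrency.get("activeRunnerCount")
    if active is None:
        # The text states the two counts are equal, so fall back to this one.
        active = activity.get("effectiveActiveTaskCount")
    if active is None:
        # Refuse to guess: a missing count is not evidence of zero concurrency.
        raise ValueError("no activeRunnerCount or effectiveActiveTaskCount in summary")

    # activeQueueIds is deliberately ignored: it is scheduler-local only.
    return max(0, CONCURRENCY_LIMIT - active)


split_brain_summary = {
    "activeQueueIds": [],  # empty on the control plane during split-brain
    "commanderConcurrency": {"activeRunnerCount": 12},
}
print(refill_window(split_brain_summary))  # prints 3
```

The example input mirrors the split-brain case: `activeQueueIds` is empty, yet the window is 3 because 12 runners are still counted as active.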

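 A single `provider is not online`, SSH timeout, proxy timeout, or failed registry request only proves that "the current observation path failed"; on its own it must not be escalated to D601 global offline, a global CI/CD block, or business tasks being unable to advance. The commander and runners must adjudicate the runtime state from multiple signals, distinguishing at least the following observation surfaces: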