diff --git a/.agents/skills/unidesk-sub2api/SKILL.md b/.agents/skills/unidesk-sub2api/SKILL.md index 9fde20aa..b4adb2fd 100644 --- a/.agents/skills/unidesk-sub2api/SKILL.md +++ b/.agents/skills/unidesk-sub2api/SKILL.md @@ -74,15 +74,17 @@ bun scripts/cli.ts platform-infra sub2api codex-pool cleanup-probes --confirm - `pool.apiKeySecretName` / `pool.apiKeySecretKey`: 统一消费 API key 的 k3s Secret 位置,默认 `platform-infra/sub2api-codex-pool-api-key.API_KEY`。 - `pool.minOwnerBalanceUsd`: pool key owner 最低余额,sync/validate 会补齐。 - `pool.minOwnerConcurrency`: 可选统一消费 API key owner 最低并发;省略时 CLI 自动使用所有已解析账号 capacity 的总和,sync/validate 会补齐。显式 YAML 值只作为 override,仍必须不小于账号 capacity 总和;未显式写 `profiles.entries[].capacity` 的账号会使用 `pool.defaultAccountCapacity` 参与求和,不要用提高某个 provider capacity 来掩盖用户并发层 WS 1013。 -- `pool.defaultTempUnschedulable`: Sub2API 内置临时不可调度开关和 YAML 规则列表。当前要求是 `enabled=false`,YAML 保留规则用于以后显式恢复;sync 按 WebUI 关闭开关语义删除运行时 `temp_unschedulable_enabled` / `temp_unschedulable_rules` credentials 字段,不让 Sub2API 内置规则参与调度。 -- `pool.defaultTempUnschedulable` 与外部 `sentinel.*` 分开配置、互不驱动。内置开关关闭不影响哨兵;哨兵配置变化也不能隐式打开内置规则。 +- `pool.defaultTempUnschedulable`: Sub2API 内置请求路径临时不可调度开关和 YAML 规则列表。当前要求是按 YAML 开启通用规则;sync 把 `temp_unschedulable_enabled` / `temp_unschedulable_rules` 渲染到 managed accounts,让匹配的 400/5xx/超时/模型路由/加密内容错误短暂冷却当前账号并触发同组 failover。 +- `pool.defaultTempUnschedulable` 与外部 `sentinel.*` 分开配置、互不驱动。内置规则负责 near-real-time request-path cooling/failover;哨兵负责 marker health、账号级隔离/恢复和 probe 退避。 +- 外部 sentinel 的写入面只允许通过 Sub2API admin `schedulable` 接口冻结/恢复账号;不能写入、清理或间接清理 `temp_unschedulable_until` / `temp_unschedulable_reason`、rate-limit、overload、model-rate-limit 等 Sub2API 请求路径 runtime 状态,也不能调用 `recover-state` 作为恢复动作。看到 UI 里的“触发时间/解除时间/规则序号/匹配关键词”临时不可调度状态时,默认先归因到 Sub2API 内置 request-path temp-unschedulable,而不是 sentinel。 - YAML 只选择和配置 Codex 上游,不声明 `schedulable` 长期字段;`schedulable=true` 只能作为 `codex-pool sync --confirm` 对未处于哨兵隔离账号的过程控制基线恢复。 - `profiles.entries`: 从 master `~/.codex/` 选择上游 profile 并映射到 
Sub2API account。 - `profiles.entries[].capacity`: 可选 per-account concurrency override;不写则使用 `pool.defaultAccountCapacity`。具体数值只以 `config/platform-infra/sub2api-codex-pool.yaml` 为准,skill 和长期参考只描述规则,不重复写当前值。 - `profiles.entries[].loadFactor`: 可选 per-account Sub2API `load_factor` override;不写则使用 `pool.defaultAccountLoadFactor`。具体数值只以 `config/platform-infra/sub2api-codex-pool.yaml` 为准,修改后必须 `codex-pool sync --confirm` 和 `codex-pool validate`。 - `profiles.entries[].trustUpstream`: 可选账号级哨兵信任标记;默认 `false`。可信账号使用 `sentinel.cadence.trustedSuccessMaxIntervalMinutes` 作为连续成功后的最大探测退避,不可信账号使用 `sentinel.cadence.untrustedSuccessMaxIntervalMinutes`。它只影响哨兵探测频率和状态可见性,不改变 Sub2API account priority/capacity/loadFactor。 - 除非用户明确要求修改配置,不要仅凭推断改账号 membership、priority、capacity、loadFactor、WebSocket mode 或其他调度策略;先保留 YAML,完成 provenance/runtime evidence 溯源,并把结论写回相关 issue 或 runbook 后再提出变更。 -- `profiles.entries[].tempUnschedulable`: 可选 per-account Sub2API 内置临时不可调度覆盖;当前同样应保持开关关闭,规则只保留在 YAML,不作为调度健康机制。 +- Sub2API 是 UniDesk 可读源码和可观测运行面的受控组件;排查 Sub2API 调度、failover、错误传播、临时不可调度或 account selection 时,默认先读当前 Sub2API 源码实现,再用真实 request id、Sub2API 日志和原入口流量验证。不要用 mock upstream、临时 probe account 或测试桩作为默认结论来源;这类探针最多是显式 debug 辅助,不能替代源码链路和真实运行证据。 +- `profiles.entries[].tempUnschedulable`: 可选 per-account Sub2API 内置临时不可调度覆盖;只用于明确偏离 pool 默认规则,不用它给某个账号特殊优先级或临时绕过通用 failover。 - `profiles.entries[].openaiResponsesWebSocketsV2Mode`: 需要 Responses WebSocket v2 的上游才设置,值为 `off`、`ctx_pool` 或 `passthrough`。 - `profiles.entries[].upstreamUserAgent`: 少数要求 Codex CLI User-Agent 的上游才设置,不能含换行。 - `sentinel.monitor.enabled`: 账号级 marker 哨兵监控开关;开启后 `codex-pool sync --confirm` 会在 `platform-infra` 创建/更新 k8s CronJob、ConfigMap、Secret、ServiceAccount、Role 和 RoleBinding。CronJob 直打 YAML-managed 上游账号的 OpenAI Responses `gpt-5.5`,用确定 marker 作为唯一健康标准,并在独立 state ConfigMap 中记录 token/cost 账本。 @@ -94,11 +96,11 @@ bun scripts/cli.ts platform-infra sub2api codex-pool cleanup-probes --confirm - `sentinel.freeze`: 失败冻结 TTL 指数退避配置。当前口径是初始 1 分钟,失败后 `1m -> 2m -> 
4m -> 8m -> 10m`,最大 10 分钟;失败 probe 基本不消耗有效输出 token,因此冻结窗口保持短周期。冻结到期后只做恢复 probe,通过才自动恢复,不能仅靠 TTL 到期解封。 - `sentinel.pricing`: 直打上游时哨兵自己的 token/cost 估算价格。因为 direct upstream probe 不经过 Sub2API 普通用量账本,哨兵必须自己记录全局与 per-account token/cost;这些账本只用于观察,不作为跳过探测的预算门禁。 -`sync --confirm` 会登录 Sub2API admin、创建/更新 group、创建/更新 YAML 中的 `unidesk-codex-*` accounts、创建/复用统一 API key Secret,并把未触发质检的 managed account 的 `schedulable=true` 恢复为过程控制基线;它默认不删除 YAML 中缺席的 managed account。只有明确退役上游时才使用 `sync --confirm --prune-removed` 删除缺席且 `extra.unidesk_managed=true` 的 `unidesk-codex-*` account。 +`sync --confirm` 会登录 Sub2API admin、创建/更新 group、创建/更新 YAML 中的 `unidesk-codex-*` accounts、创建/复用统一 API key Secret,并把未处于哨兵 active quarantine 的 managed account 的 `schedulable=true` 恢复为过程控制基线;它默认不删除 YAML 中缺席的 managed account。只有明确退役上游时才使用 `sync --confirm --prune-removed` 删除缺席且 `extra.unidesk_managed=true` 的 `unidesk-codex-*` account。 `sentinel-image status|build` 管理哨兵 Python 运行环境镜像。镜像由 YAML 的 `sentinel.image` 基础镜像和 `sentinel.sdk.openaiPythonVersion` 派生,发布到 G14 本地 registry `127.0.0.1:5000/platform-infra/sub2api-account-sentinel:`;`build --confirm` 会先检查 registry tag,存在则快速复用,不存在才在 G14 host 构建并 push。CronJob 启动时只校验 SDK 版本,不在运行时 `pip install`。 -`sync --confirm` 同时会按 YAML 渲染账号级哨兵资源,并在 monitor 开启时先确保可复用哨兵镜像存在。当前目标是 `sentinel.monitor.enabled=true` + `sentinel.actions.enabled=true` 的 marker-only 自动冻结/恢复;不要手工 patch CronJob、Secret 或 Sub2API account。若 YAML 新增账号或修改 profile/base URL/API key fingerprint/upstream User-Agent/Responses WebSocket mode,sync 会先从变更前 runtime state 写入 pending quality gate,再更新账号并保持默认冻结,最后立即安排 sentinel probe;marker 通过后才自动恢复调度,避免坏号未经质检混入池子。无关账号的既有成功/失败退避不能被重置。若 YAML 下调失败冻结最大窗口,sync 会把仍 active 的旧冻结状态迁移到当前最大窗口内并立即安排 recovery probe,但不会直接解冻。若怀疑某个账号被误判,先用 `codex-pool sentinel-probe --account --confirm` 立即触发该账号测量;该命令从现有 CronJob 模板派生一次性 Job,复用同一份 Secret、ConfigMap、OpenAI SDK probe、token/cost 账本和冻结/恢复状态机。 +`sync --confirm` 同时会按 YAML 渲染账号级哨兵资源,并在 monitor 开启时先确保可复用哨兵镜像存在。当前目标是 `sentinel.monitor.enabled=true` + 
`sentinel.actions.enabled=true` 的 marker-only 自动冻结/恢复;不要手工 patch CronJob、Secret 或 Sub2API account。若 YAML 新增账号或修改 profile/base URL/API key fingerprint/upstream User-Agent/Responses WebSocket mode,sync 会从变更前 runtime state 写入 pending probe 记录并立即安排 sentinel probe,但默认仍保持该 account 可调度;只有实际 marker probe 非命中或已有 active quarantine 才会冻结账号。sentinel 冻结/恢复只改 `schedulable=false|true`,不得顺手调用 Sub2API `recover-state` 清除请求路径临时不可调度或其他 runtime backoff。无关账号的既有成功/失败退避不能被重置。若 YAML 下调失败冻结最大窗口,sync 会把仍 active 的旧冻结状态迁移到当前最大窗口内并立即安排 recovery probe,但不会直接解冻。若怀疑某个账号被误判,先用 `codex-pool sentinel-probe --account --confirm` 立即触发该账号测量;该命令从现有 CronJob 模板派生一次性 Job,复用同一份 Secret、ConfigMap、OpenAI SDK probe、token/cost 账本和冻结/恢复状态机。 `sentinel-report` 是只读低噪声报表,不触发 probe、不修改账号。默认输出类似 `ps` 的文本表,展示每个账号的探测次数、最近 marker/HTTP/动作、冻结 TTL、成功退避、下一次 probe 和最近 run 事件;需要机器处理时使用 `sentinel-report --raw`。 @@ -179,7 +181,7 @@ bun scripts/cli.ts platform-infra sub2api codex-pool configure-local --confirm ## 排障 -- Codex pool 哨兵、账号冻结/恢复、marker-only 判断或 probe 周期看不清:第一步跑 `bun scripts/cli.ts platform-infra sub2api codex-pool sentinel-report`。这个报表是主观察面;只有报表缺字段或需要底层证据时,才继续看 `--raw`、CronJob log、state ConfigMap 或 Sub2API 管理 UI。 +- Codex pool 哨兵、账号冻结/恢复、marker-only 判断或 probe 周期看不清:第一步跑 `bun scripts/cli.ts platform-infra sub2api codex-pool sentinel-report`。这个报表是主观察面;只有报表缺字段或需要底层证据时,才继续看 `--raw`、CronJob log、state ConfigMap 或 Sub2API 管理 UI。若看到“临时不可调度状态”且包含规则序号/匹配关键词,检查 Sub2API `account_temp_unschedulable` 日志和账号 `temp_unschedulable_*` 字段;sentinel 只解释 `schedulable=false` 的 active quarantine,不解释这类内置临时冷却。 - profile invalid:先修 `~/.codex/config.toml.` 的 `base_url`、`wire_api`、`model` 或 `auth.json.` 的 API key;不要在 YAML 中写密钥。 - Sub2API 卡在 `wait-postgres` / `wait-redis` 或服务内大量 `context deadline exceeded`:先跑 `sub2api status` 看 `networkPolicy.ok`,再跑 `sub2api validate` 看 `postgresCrossPodPgIsReady` / `redisCrossPodPing`;缺失或异常时用 `sub2api apply --confirm` 恢复受控 `NetworkPolicy/allow-all`,不要保留手工 iptables bypass 作为长期修复。 - pool key 401:跑 `codex-pool sync 
--confirm` 重建 Sub2API key 与 k3s Secret 绑定,再跑 `codex-pool validate`。 @@ -191,11 +193,11 @@ bun scripts/cli.ts platform-infra sub2api codex-pool configure-local --confirm - 上游需要 WebSocket v2:先做 direct Codex WSv2 probe;通过后才给该 profile 配 `openaiResponsesWebSocketsV2Mode: ctx_pool|passthrough` 并跑 `sync --confirm`;把它当 capability candidate,容量仍以 YAML 中的 `capacity` 或默认值为准。 - Codex 启动 WebSocket 回退:用原入口 Codex smoke 复现,再用 bounded Sub2API 日志确认 account;对 WS handshake 4xx/5xx、`openai.websocket_account_select_failed` 或 close-before-`response.completed` 的账号关闭 YAML WSv2 能力后同步。若没有剩余 WSv2-capable account,把 `localCodex.supportsWebSockets` 和 `localCodex.responsesWebSocketsV2` 一起关掉,不把临时可用性推断写成调度配置。 - 上游要求 Codex User-Agent:只给该 profile 配 `upstreamUserAgent`,跑 `sync --confirm`。 -- 上游报 capacity/rate-limit/overload/Bad Gateway/Gateway Timeout 后没有隔离或频繁先失败再恢复:先看 `codex-pool sentinel-report` 的 marker、动作、冻结 TTL 和下一次 probe;必要时用 `codex-pool sentinel-probe --account --confirm` 立即测量。不要通过开启 Sub2API 内置临时不可调度、手动禁用账号、删除账号或从 YAML 移除问题账号来替代哨兵隔离/恢复。 +- 上游报 capacity/rate-limit/overload/Bad Gateway/Gateway Timeout 后没有隔离或频繁先失败再恢复:先看 `codex-pool sentinel-report` 的 marker、动作、冻结 TTL 和下一次 probe,也看 `codex-pool validate --full` 的 recent gateway failover/forward failure 证据;同时对照当前 Sub2API 源码里 `/v1/responses` handler、`Forward`、`shouldFailoverOpenAIUpstreamResponse` 和 `handleOpenAIAccountUpstreamError` 的真实传播路径。不要手动禁用账号、删除账号、改 membership/priority/capacity/loadFactor 或从 YAML 移除问题账号来替代通用 failover 与哨兵隔离/恢复。 - `codex-pool sync --confirm` 或 `codex-pool validate` 超时:先区分 CLI 传输超时和 Sub2API 运行失败。受控 CLI 应返回远端作业进度和 stdout/stderr tail;如果只是低层 `trans` 60s 超时,不能据此判定 Sub2API failover 不工作。改用或修复 CLI 的远端 job/poll 路径后重跑,并以最终结构化结果作为证据。 - Codex 报 weekly-limit、`less than 10% of your weekly limit left`、`Run /status for a breakdown` 等账号状态/软配额提示并要求切号:不要把新关键词写成 Sub2API 内置临时不可调度策略来恢复可用性;由 marker-only 哨兵按非 marker 响应统一冻结,并用 `sentinel-report` / `sentinel-probe` 验证。 -- 上游 400/503 响应体出现 
`invalid_encrypted_content`、`bad_response_status_code`、`invalid_request_error` + 稳定 unsupported-model 文案、unsupported-model、`暂不支持` / `可用模型`、`model_not_found`、`No available channel for model ...` 或同类稳定模型路由 / Responses encrypted-content 兼容性失败:按哨兵 marker 失败处理,不用 account membership、priority、capacity、loadFactor、WebSocket mode、User-Agent 或 Sub2API 内置临时不可调度改动掩盖该错误族。 -- 上游错误反复触发:`invalid_encrypted_content`、unsupported-model、`Recovered upstream error ...`、`Bad Gateway`、`Gateway Timeout`、Cloudflare `524`、Codex-facing `Upstream request failed`、`Unknown error`、`context deadline exceeded`、`context canceled`、`model_not_found`、`No available channel for model`、大上下文 `413` 和 `openai_error` 这类稳定包装文案都由外部哨兵和运行日志证据处理;内置临时不可调度规则保留但默认关闭,不作为当前恢复路径。长期判定见 `docs/reference/platform-infra.md`。 +- 上游 400/503 响应体出现 `invalid_encrypted_content`、`bad_response_status_code`、`invalid_request_error` + 稳定 unsupported-model 文案、unsupported-model、`暂不支持` / `可用模型`、`model_not_found`、`No available channel for model ...` 或同类稳定模型路由 / Responses encrypted-content 兼容性失败:按通用 temp-unschedulable/failover 加哨兵 marker 证据处理,不用 account membership、priority、capacity、loadFactor、WebSocket mode、User-Agent 或 provider pinning 掩盖该错误族。 +- 上游错误反复触发:`invalid_encrypted_content`、unsupported-model、`Recovered upstream error ...`、`Bad Gateway`、`Gateway Timeout`、Cloudflare `524`、Codex-facing `Upstream request failed`、`Unknown error`、`context deadline exceeded`、`context canceled`、`model_not_found`、`No available channel for model`、大上下文 `413` 和 `openai_error` 这类稳定包装文案,先确认 YAML temp-unschedulable 已同步、Sub2API 源码会把该错误族传播成 `UpstreamFailoverError`、运行日志出现 `openai.upstream_failover_switching`。若匹配规则后仍只看到 `openai.forward_failed`,根因是 Sub2API HTTP `/responses` 没把该错误传播成 `UpstreamFailoverError`,应修 Sub2API failover classifier/error propagation,不硬编码账号或给 `only` 特权。 - Codex auto compact 后丢上下文:先确认 YAML `localCodex` 是否声明启用 WSv2;若启用,再确认本机 `~/.codex/config.toml` 是否有 `supports_websockets = true` 和 `responses_websockets_v2 = true`,并看 `codex-pool validate` 的 WSv2 
candidate 和 Sub2API 日志里的 `transport=responses_websockets_v2`。若 YAML 当前禁用 WSv2,则按 HTTP Responses 稳定性排查,不把旧 WS 口径当成验收要求。 - Codex smoke 有 reconnect/1013:这是上游并发/可用性问题,和 HTTP-only compact context-loss 分开处理;记录 session/log 证据并关联专项 issue,不要用运行时手补覆盖 YAML 容量。 @@ -207,5 +209,5 @@ bun scripts/cli.ts platform-infra sub2api codex-pool configure-local --confirm - 不给 Sub2API manifest 添加 CPU/memory limits,除非有新的 YAML 化明确决策。 - 不打印完整 API key、admin password 或 Secret 明文。 - 不把普通上游增删做成代码变更、CI/CD、feature flag 或兼容双路径。 -- 不把手动禁用账号、删除账号、移除 YAML entry、降低 membership、临时改 priority/capacity/loadFactor 或打开 Sub2API 内置临时不可调度当作哨兵隔离/恢复问题的修复。 +- 不把手动禁用账号、删除账号、移除 YAML entry、降低 membership、临时改 priority/capacity/loadFactor、provider pinning 或给某个账号特权当作通用 failover / 哨兵隔离恢复问题的修复。 - 不魔改 Sub2API:Sub2API 本身不支持的能力就不做,不通过 UniDesk 脚本、k8s 原地热补、本地 fork、YAML 伪声明或隐藏 fallback 代替上游实现。 diff --git a/config/platform-infra/sub2api-codex-pool.yaml b/config/platform-infra/sub2api-codex-pool.yaml index 549d7bd0..13d58bb0 100644 --- a/config/platform-infra/sub2api-codex-pool.yaml +++ b/config/platform-infra/sub2api-codex-pool.yaml @@ -8,7 +8,7 @@ pool: defaultAccountCapacity: 10 defaultAccountLoadFactor: 10 defaultTempUnschedulable: - enabled: false + enabled: true rules: - statusCode: 400 keywords: [invalid_encrypted_content, encrypted content, could not be verified, could not be decrypted, bad_response_status_code, model_not_found, no available channel for model, unsupported, not supported, not support, 暂不支持, 可用模型] diff --git a/docs/reference/platform-infra.md b/docs/reference/platform-infra.md index 1bff5891..b1029ec9 100644 --- a/docs/reference/platform-infra.md +++ b/docs/reference/platform-infra.md @@ -26,8 +26,9 @@ - `pool.groupName` names the Sub2API group that represents the pool. - `pool.apiKeySecretName` and `pool.apiKeySecretKey` name the k3s Secret that stores the single consumer API key. 
- `pool.minOwnerConcurrency` is optional; when omitted, the CLI automatically uses the sum of all resolved account capacities as the minimum concurrency for the Sub2API user that owns the unified consumer API key. A YAML value is only an explicit override and must still be at least that capacity sum, so the shared key does not fail requests or WS sessions at the user-concurrency layer. "Resolved" means each account's explicit `profiles.entries[].capacity` or, when omitted, `pool.defaultAccountCapacity`. Do not compensate for owner-concurrency 1013 errors by pinning capacity to one provider. -- `pool.defaultTempUnschedulable` is the Sub2API built-in temporary-unschedulable switch plus its YAML rule list. UniDesk keeps this built-in switch disabled by default while preserving the rule list in YAML for explicit future recovery; sync follows the WebUI close-switch behavior by omitting the runtime `temp_unschedulable_enabled` and `temp_unschedulable_rules` credential fields. The external account-level sentinel is the active account health and freeze/restore mechanism. -- The built-in temporary-unschedulable configuration and external `sentinel.*` configuration are separate control surfaces. Changing `pool.defaultTempUnschedulable.enabled` or `profiles.entries[].tempUnschedulable` must not change sentinel cadence, marker health semantics, or sentinel quarantine state; changing sentinel settings must not implicitly enable Sub2API built-in temporary-unschedulable rules. +- `pool.defaultTempUnschedulable` is the Sub2API built-in request-path temporary-unschedulable switch plus its YAML rule list. When enabled, `codex-pool sync --confirm` renders `temp_unschedulable_enabled` and `temp_unschedulable_rules` into every managed account unless an account-level override says otherwise. 
This is the generic same-request recovery path for selected-account upstream failures: a matching upstream error briefly cools the selected account so Sub2API's existing failover loop can select another account in the same group. +- The built-in temporary-unschedulable configuration and external `sentinel.*` configuration are separate control surfaces. `pool.defaultTempUnschedulable` handles near-real-time request-path cooling and failover; `sentinel.*` handles account-level marker health, quarantine, restore, and probe cadence. Changing one surface must not silently rewrite the other surface's cadence, marker semantics, quarantine state, or rule list. +- The external sentinel write surface is intentionally limited to the Sub2API admin `schedulable` action. Sentinel freeze/restore may set `schedulable=false|true`, but must not write, clear, or indirectly clear Sub2API request-path runtime state such as `temp_unschedulable_until`, `temp_unschedulable_reason`, rate-limit, overload, or model-rate-limit state. In particular, sentinel restore must not call Sub2API `recover-state`, because that endpoint is a broader runtime-state recovery operation rather than a pure schedulability restore. - Codex accounts selected by YAML do not declare `schedulable` as durable configuration. `schedulable=true` is a `codex-pool sync --confirm` process-control baseline for UniDesk-managed accounts that are not under sentinel quarantine, not a YAML field. - `codex-pool sync --confirm` preserves UniDesk-managed accounts that are absent from YAML by default; explicit upstream retirement requires `codex-pool sync --confirm --prune-removed`. This keeps account deletion out of the normal availability-recovery path and prevents temporary YAML edits from becoming destructive runtime changes. - `profiles.entries` selects local Codex profile files from `~/.codex/` and maps them to Sub2API account names. 
@@ -36,8 +37,9 @@ - `profiles.entries[].capacity` optionally overrides `pool.defaultAccountCapacity` for one account. Capacity is a YAML-controlled routing input; concrete current values belong only in `config/platform-infra/sub2api-codex-pool.yaml` and runtime validation output, not in long-term reference prose. Code constants, Secrets, ad-hoc runtime patches, or stale tests must not override YAML source of truth. - `profiles.entries[].loadFactor` optionally overrides `pool.defaultAccountLoadFactor` for one account and is rendered to Sub2API `load_factor`. Treat it as routing policy: values belong in YAML and `codex-pool validate` output, not code constants, Secrets, or ad-hoc runtime patches. - Do not change account membership, priority, capacity, load factor, WebSocket mode, or other routing policy from inference alone. Unless the user explicitly asks for a configuration change, first preserve the current YAML, collect provenance and runtime evidence, and write the finding to the relevant issue or runbook before proposing a change. -- `profiles.entries[].tempUnschedulable` may override the pool default for one account. When enabled, the CLI renders it into Sub2API credentials as `temp_unschedulable_enabled` and `temp_unschedulable_rules`; when disabled, runtime credentials omit both fields and the YAML rule list remains only source-side configuration. -- Codex account-state, quota prompts, model-routing failures, gateway wrappers, and timeout-like upstream errors are handled by the external marker-only sentinel unless the Sub2API built-in temporary-unschedulable switch is explicitly re-enabled. Do not change membership, priority, capacity, load factor, WebSocket mode, or `pool_mode` merely to work around those errors. +- Sub2API is a source-available UniDesk-operated runtime component. 
For Sub2API scheduling, failover, temporary-unschedulable behavior, error propagation, and account selection, the default investigation path is to read the current Sub2API source implementation and then verify it with real request ids, gateway logs, and original-entry traffic. Do not use mock upstreams, temporary probe accounts, or test stubs as the default proof for Sub2API behavior; those are explicit debug aids only and do not replace source-path review plus runtime evidence. +- `profiles.entries[].tempUnschedulable` may override the pool default for one account. When enabled, the CLI renders it into Sub2API credentials as `temp_unschedulable_enabled` and `temp_unschedulable_rules`; when disabled, runtime credentials omit both fields. Use account-level override only for an explicit deviation from the pool policy, not as an availability workaround for a named account. +- Codex account-state, quota prompts, model-routing failures, encrypted-content affinity failures, gateway wrappers, and timeout-like upstream errors must be handled by the generic temporary-unschedulable/failover path plus the external marker sentinel. Do not change membership, priority, capacity, load factor, WebSocket mode, `pool_mode`, or a specific provider's status merely to work around those errors. If a matching upstream failure still logs `openai.forward_failed` without `openai.upstream_failover_switching`, the missing fix is in Sub2API's HTTP `/responses` failover classification/error propagation, not in account pinning. - `profiles.entries[].openaiResponsesWebSocketsV2Mode` is the account-level Responses WebSocket v2 switch for OpenAI-compatible upstreams that require WebSocket transport. Allowed values are `off`, `ctx_pool`, and `passthrough`; omit the field unless that upstream needs it. - `profiles.entries[].upstreamUserAgent` is an optional account-level upstream request User-Agent override. 
Use it only for upstreams that require a Codex CLI compatible User-Agent; keep the value YAML-controlled and newline-free. - `publicExposure` controls the optional FRP bridge from master server to the G14 ClusterIP service. @@ -51,9 +53,13 @@ When Codex startup repeatedly reports WebSocket reconnects or HTTPS fallback, pr Do not encode current availability assumptions in long-term reference prose. If an account needs a higher concurrency or load factor than the pool default, make that a deliberate YAML override and verify it with `codex-pool validate`; the reference document should describe the rule, not repeat the current numeric value. -Do not enable Sub2API `pool_mode` for UniDesk-managed Codex accounts. `pool_mode` retries the same selected account path and does not replace sentinel quarantine. The current failover and recovery model is: the external marker-only sentinel freezes or restores account schedulability, while Sub2API routes among currently schedulable accounts in the group. +Do not enable Sub2API `pool_mode` for UniDesk-managed Codex accounts. `pool_mode` retries the same selected account path and does not replace temporary-unschedulable request failover or sentinel quarantine. The current failover and recovery model is: matching request-path errors temporarily cool the selected account and trigger same-group failover, while the external marker-only sentinel freezes or restores account schedulability from direct marker probes. -Sub2API temporary-unschedulable rules require both an HTTP status match and a response-body keyword match in the upstream failure/error path when the built-in switch is enabled. UniDesk currently keeps that switch disabled and does not use built-in rules as a successful-response content filter. HTTP 200 private content, maintenance text, quota prompts, ads, and similar semantic failures are handled by the external account-level sentinel. 
+Sub2API temporary-unschedulable rules require both an HTTP status match and a response-body keyword match in the upstream failure/error path. UniDesk uses these rules as a generic request-path failover trigger, not as a successful-response content classifier. Runtime UI fields such as trigger time, release time, matched keyword, and rule index identify this built-in request-path state and should not be attributed to sentinel unless separate sentinel state shows an active quarantine. HTTP 200 private content, maintenance text, quota prompts, ads, and similar semantic failures remain the external account-level sentinel's job. + +The `invalid_encrypted_content` failure mode is a stable regression guard for Codex pool routing. It means an upstream could not verify or parse encrypted Responses/Codex state carried by the request; a fresh account probe can still pass while a large resumed request fails because the encrypted content is not acceptable to that selected upstream. The required behavior is generic: Sub2API should perform its built-in recoverable handling for encrypted reasoning state when available, mark the selected account temporarily unschedulable when the configured status/keyword rule matches, and continue same-group failover before the client sees a final failure whenever the response has not already been committed. Do not interpret this failure as proof that the pool should pin to `only`, delete the selected account, change membership/priority/capacity/load factor, or move the error into sentinel-specific provider logic. + +For this failure class, the regression evidence must come from the real request path. A valid investigation should connect the client request id to Sub2API gateway logs showing the selected account id, upstream status, `account_temp_unschedulable`, `openai.upstream_failover_switching`, and the final access-log status. 
A `sentinel-report` row with `quarantineActive=false` and marker success proves only that the external marker sentinel did not quarantine that account; it does not disprove request-path temporary cooling. Conversely, a marker sentinel recovery must not call `recover-state` or clear the temporary-unschedulable state created by the failed request. If this failure still reaches the client as 502/503 while another schedulable account is available and no stream bytes were committed, fix Sub2API failover classification/error propagation or the UniDesk sync/render path rather than adding mock probes, provider pinning, or account-specific exceptions. ## Sub2API Account Test Semantics @@ -69,13 +75,13 @@ An external account-level sentinel that wants parity with this WebUI path should ## Account Sentinel Marker Contract -The UniDesk account-level sentinel uses marker-only health semantics. A probe is healthy only when the upstream response satisfies the configured marker match. Every other result is unhealthy and must enter the same exponential freeze state machine, regardless of whether the immediate response is HTTP 200, 400, 403, 429, 500, 502, 503, 504, a streaming error event, malformed output, empty output, timeout, or any other transport/API failure. HTTP status, upstream error code, body hash, body preview, headers, and SDK exception class are diagnostics only; they must not become additional allow/deny criteria that bypass marker mismatch. +The UniDesk account-level sentinel uses marker-only health semantics. A probe is healthy only when the upstream response satisfies the configured marker match. Every other result is unhealthy and must enter the same exponential freeze state machine, regardless of whether the immediate response is HTTP 200, 400, 403, 429, 500, 502, 503, 504, a streaming error event, malformed output, empty output, timeout, or any other transport/API failure. 
HTTP status, upstream error code, body hash, body preview, headers, and SDK exception class are diagnostics only; they must not become additional allow/deny criteria that bypass marker mismatch. Sentinel actions are only `schedulable=false` on freeze and `schedulable=true` on marker-matching recovery; they must not clear Sub2API temporary-unschedulable or rate-limit state as part of marker recovery. The sentinel must not maintain separate classifiers for "private content", "maintenance", "quota", "ads", or provider-specific body phrases as health gates. The only recovery condition is a later recovery probe that matches the marker. Freeze TTL expiry only schedules the next recovery probe; it does not restore an account by itself. Repeated non-marker results use a short exponential freeze backoff because failed marker probes produce little or no useful output token usage; repeated marker-matching results use the configured success cadence backoff. This contract applies equally to OpenAI Responses `gpt-5.5` direct account probes and manual `codex-pool sentinel-probe --account ... --confirm` measurements. `profiles.entries[].trustUpstream` is the durable account-level trust marker for sentinel success cadence, and the absence of the field means untrusted. Trusted and untrusted accounts use separate YAML cadence maximums after marker-matching probes; the values belong only in `config/platform-infra/sub2api-codex-pool.yaml`. This field must not change Sub2API scheduler priority, capacity, load factor, membership, built-in temporary-unschedulable settings, or the marker-only health contract. Its purpose is to keep intermittently unreliable 200-success providers under more frequent direct probes without adding provider-specific content classifiers. 
-When `codex-pool sync --confirm` creates a YAML-managed account or changes direct-probe-relevant account inputs such as the profile mapping, upstream base URL, API key fingerprint, upstream User-Agent, Responses WebSocket mode, or `trustUpstream`, only that account must be default-frozen before it can enter the scheduler. Sync first records a pending sentinel quality gate from the pre-mutation runtime state, then updates the account, then schedules the account probe immediately. This ordering prevents a new or changed account from being written to Sub2API without a matching sentinel quarantine record if sync fails midway. Passing the marker clears the quality gate and restores schedulability; any non-marker result continues the failure freeze backoff. Unchanged accounts must not have their existing success or failure backoff reset by unrelated YAML syncs. +When `codex-pool sync --confirm` creates a YAML-managed account or changes direct-probe-relevant account inputs such as the profile mapping, upstream base URL, API key fingerprint, upstream User-Agent, Responses WebSocket mode, or `trustUpstream`, sync records a pending sentinel probe from the pre-mutation runtime state, updates the account, restores `schedulable=true` unless an active sentinel quarantine already exists, and schedules the account probe immediately. New or changed accounts are not default-frozen; only an actual non-marker probe result or an existing active quarantine may remove an account from the scheduler. This avoids zero-available windows during sync while still ensuring that later marker failures enter the normal freeze/restore state machine. Unchanged accounts must not have their existing success or failure backoff reset by unrelated YAML syncs. If the YAML failure freeze maximum is lowered, `codex-pool sync --confirm` may migrate only currently active sentinel quarantines whose stored interval or next recovery time exceeds the current maximum. 
The migration keeps the account frozen, marks the next recovery probe due immediately, and lets the next marker result decide restore versus the new shorter failure backoff. It must not clear quarantine or restore schedulability merely because an older TTL has expired. @@ -113,6 +119,7 @@ Kubernetes readiness is not the same as pool availability: - The FRP client deployment is currently a simple connector deployment and does not itself prove that master-local traffic reaches Sub2API. - No scheduled `CronJob`, `ServiceMonitor`, or `PodMonitor` currently proves the full unified Codex API path. - `platform-infra sub2api validate` and `platform-infra sub2api codex-pool validate` are on-demand checks. Operational usage is documented in `$unidesk-sub2api`; they are acceptable for deployment closeout, but they are not continuous monitoring. `codex-pool validate` must test both `GET /v1/models` and a small `POST /v1/responses` request, and the Responses smoke should report request id, selected/final account evidence, upstream failover count, and whether the validation succeeded only after failover. It should also summarize recent `/responses` and `/responses/compact` gateway failures separately so ordinary long streaming failures are not hidden behind compact-only evidence. +- `codex-pool validate` must not create mock upstreams or temporary failover-probe accounts as its default proof of Sub2API behavior. When a suspected failover path is in question, validate should surface the relevant source-path expectation and real runtime evidence: request ids, selected/final account ids, `openai.upstream_failover_switching`, `openai.forward_failed`, `openai.account_select_failed`, and final status. If runtime evidence contradicts the source-path expectation, fix Sub2API or the UniDesk integration path rather than converting the mismatch into a mock-only success. - Public exposure closeout must include the edge layer when the user-facing URL is involved. 
A Sub2API-side compact success summary does not rule out Caddy/FRP 502s that happened before Sub2API received the request; inspect the edge Caddy/frps/frpc evidence or use a CLI report that summarizes it before declaring public compact stable. - Because `codex-pool validate` includes account alignment, recent-log inspection, and gateway smoke, timeout of the CLI transport is not valid negative evidence about Sub2API scheduling by itself. Closeout evidence must come from the final structured validation result or from an explicitly reported remote job failure with stdout/stderr tail, not from a single low-level `trans` timeout. diff --git a/scripts/src/platform-infra-sub2api-codex-sentinel.ts b/scripts/src/platform-infra-sub2api-codex-sentinel.ts index ef1835fd..6b22e400 100644 --- a/scripts/src/platform-infra-sub2api-codex-sentinel.ts +++ b/scripts/src/platform-infra-sub2api-codex-sentinel.ts @@ -804,16 +804,6 @@ class Sub2ApiAdmin: self.request("POST", f"/api/v1/admin/accounts/{account['id']}/schedulable", {"schedulable": bool(schedulable)}) return {"accountId": account.get("id"), "schedulable": bool(schedulable)} - def recover_state(self, account_name): - account = self.account(account_name) - if not account or account.get("id") is None: - return {"skipped": True, "reason": "account-not-found"} - try: - self.request("POST", f"/api/v1/admin/accounts/{account['id']}/recover-state", {}) - return {"ok": True, "accountId": account.get("id")} - except Exception as exc: - return {"ok": False, "accountId": account.get("id"), "error": str(exc)} - def upstream_base_url(base_url): base = str(base_url).rstrip("/") return base if base.endswith("/v1") else base + "/v1" @@ -1228,18 +1218,20 @@ def apply_result(result, state, config, now, admin, profile): was_recovery = bool(quarantine and quarantine.get("active") is True) action = {"taken": False, "type": None} if result.get("ok") is True: + quality_gate = account_state.get("qualityGate") if 
isinstance(account_state.get("qualityGate"), dict) else None if was_recovery: if actions_enabled and quarantine.get("applied") is True: try: - action = {"taken": True, "type": "restore", "result": admin.set_schedulable(name, True), "recoverState": admin.recover_state(name)} + action = {"taken": True, "type": "restore", "result": admin.set_schedulable(name, True)} except Exception as exc: action = {"taken": False, "type": "restore-failed", "error": str(exc)} account_state["quarantine"] = {"active": False, "clearedAt": iso(now), "lastApplied": quarantine.get("applied") is True} - quality_gate = account_state.get("qualityGate") if isinstance(account_state.get("qualityGate"), dict) else None - if quality_gate and quality_gate.get("pending") is True: - account_state["qualityGate"] = {**quality_gate, "pending": False, "clearedAt": iso(now)} account_state["successStreak"] = 0 account_state["successIntervalMinutes"] = 0 + elif isinstance(quarantine, dict) and quarantine.get("active") is not True: + account_state["quarantine"] = {"active": False, "clearedAt": iso(now), "lastApplied": quarantine.get("applied") is True} + if quality_gate and quality_gate.get("pending") is True: + account_state["qualityGate"] = {**quality_gate, "pending": False, "clearedAt": iso(now)} interval = next_success_interval(account_state, config, profile) account_state["successStreak"] = int(account_state.get("successStreak") or 0) + 1 account_state["successIntervalMinutes"] = interval diff --git a/scripts/src/platform-infra-sub2api-codex.ts b/scripts/src/platform-infra-sub2api-codex.ts index f67288ba..24c57499 100644 --- a/scripts/src/platform-infra-sub2api-codex.ts +++ b/scripts/src/platform-infra-sub2api-codex.ts @@ -1573,10 +1573,6 @@ function compactTempUnschedulableStatus(block: unknown): Record const result = pickSummaryFields(item, [ "accountName", "accountId", - "expectedEnabled", - "runtimeEnabled", - "expectedRuleCount", - "runtimeRuleCount", "status", "schedulable", 
"tempUnschedulableUntil", @@ -1602,8 +1598,17 @@ function compactTempUnschedulableStatus(block: unknown): Record mismatched: block.mismatched, itemCount: compactItems.length, frozenCount: frozenItems.length, - frozen: focusedFrozenItems, - manuallyUnschedulable: compactItems.filter((item) => item.schedulable === false && item.tempUnschedulableSet !== true), + frozenShown: focusedFrozenItems.length, + frozen: focusedFrozenItems.map((item) => pickSummaryFields(item, [ + "accountName", + "accountId", + "schedulable", + "tempUnschedulableUntil", + "tempUnschedulableReason", + ])), + manuallyUnschedulable: compactItems + .filter((item) => item.schedulable === false && item.tempUnschedulableSet !== true) + .map((item) => pickSummaryFields(item, ["accountName", "accountId", "status", "schedulable"])), valuesPrinted: false, }; } @@ -1636,7 +1641,7 @@ function compactRuntimeCapability(block: unknown): unknown { if (!isRecord(block)) return block; const probe = isRecord(block.probe) ? block.probe : {}; const logEvidence = isRecord(probe.logEvidence) ? probe.logEvidence : {}; - const accountState = isRecord(probe.accountState) ? probe.accountState : {}; + const accountState = isRecord(probe.accountState) ? probe.accountState : (isRecord(probe.badAccountState) ? probe.badAccountState : {}); const resources = isRecord(block.resources) ? block.resources : {}; const requirement = isRecord(block.requirement) ? block.requirement : {}; return { @@ -1646,11 +1651,11 @@ function compactRuntimeCapability(block: unknown): unknown { outcome: block.outcome, requirement: Object.keys(requirement).length === 0 ? undefined : { statusCode: requirement.statusCode, - probeKeyword: requirement.probeKeyword, + representativeKeyword: requirement.representativeKeyword, durationMinutes: requirement.durationMinutes, sourceAccountName: requirement.sourceAccountName, }, - probe: Object.keys(probe).length === 0 ? undefined : { + requestEvidence: Object.keys(probe).length === 0 ? 
undefined : {
       requestId: probe.requestId,
       durationMs: probe.durationMs,
       httpStatus: probe.httpStatus,
@@ -1766,6 +1771,78 @@ function compactGatewayCompactRecent(block: unknown): unknown {
   };
 }
 
+function compactSentinelErrorDetails(value: unknown): Record<string, unknown> | undefined {
+  if (!isRecord(value)) return undefined;
+  const body = isRecord(value.body) ? value.body : {};
+  const openaiError = isRecord(value.openaiError) ? value.openaiError : {};
+  return {
+    kind: value.kind,
+    statusCode: value.statusCode,
+    code: value.code ?? body.code ?? openaiError.code,
+    type: value.type ?? body.type ?? openaiError.type,
+    bodyHash: value.bodyHash,
+    valuesPrinted: false,
+  };
+}
+
+function compactSentinelQuarantine(item: unknown): Record<string, unknown> {
+  if (!isRecord(item)) return {};
+  return {
+    accountName: item.accountName,
+    until: item.until,
+    applied: item.applied,
+    reason: item.reason,
+    failureKind: item.failureKind,
+    intervalMinutes: item.intervalMinutes,
+    error: compactSentinelErrorDetails(item.errorDetails),
+  };
+}
+
+function compactSentinelRecentAccount(item: unknown): Record<string, unknown> {
+  if (!isRecord(item)) return {};
+  const action = isRecord(item.action) ? item.action : {};
+  const error = compactSentinelErrorDetails(item.errorDetails);
+  return {
+    accountName: item.accountName,
+    lastProbeAt: item.lastProbeAt,
+    lastStatus: item.lastStatus,
+    nextProbeAfter: item.nextProbeAfter,
+    ok: item.ok,
+    purpose: item.purpose,
+    httpStatus: item.httpStatus,
+    durationMs: item.durationMs,
+    markerMatched: item.markerMatched,
+    failureKind: item.failureKind,
+    requestShape: item.requestShape,
+    action: Object.keys(action).length > 0 ? {
+      taken: action.taken,
+      type: action.type,
+    } : undefined,
+    errorStatusCode: error?.statusCode,
+    errorCode: error?.code,
+    errorBodyHash: error?.bodyHash,
+  };
+}
+
+function compactSentinelLastRun(value: unknown): Record<string, unknown> | undefined {
+  if (!isRecord(value)) return undefined;
+  return {
+    at: value.at,
+    monitorEnabled: value.monitorEnabled,
+    actionsEnabled: value.actionsEnabled,
+    profileCount: value.profileCount,
+    selected: value.selected,
+    okCount: value.okCount,
+    mismatchCount: value.mismatchCount,
+    markerMismatchCount: value.markerMismatchCount,
+    transportFailureCount: value.transportFailureCount,
+    actionsTaken: value.actionsTaken,
+    gatewayFailureMonitor: value.gatewayFailureMonitor,
+    selection: value.selection,
+    reconcileCount: Array.isArray(value.reconcile) ? value.reconcile.length : undefined,
+  };
+}
+
 function compactSentinelStatus(block: unknown): unknown {
   if (!isRecord(block)) return block;
   const runtime = isRecord(block.runtime) ? block.runtime : block;
@@ -1777,6 +1854,13 @@ function compactSentinelStatus(block: unknown): unknown {
   const freezeReassert = isRecord(block.freezeReassert) ? block.freezeReassert : {};
   const qualityGatePrepare = isRecord(block.qualityGatePrepare) ? block.qualityGatePrepare : {};
   const qualityGate = isRecord(block.qualityGate) ?
block.qualityGate : {}; + const quarantined = recordArray(state.quarantined).map(compactSentinelQuarantine); + const recentAccounts = recordArray(state.recentAccounts).map(compactSentinelRecentAccount); + const recentAttention = recentAccounts.filter((item) => item.ok === false || isRecord(item.action) && item.action.taken === true); + const recentHealthy = recentAccounts + .filter((item) => item.ok === true) + .slice(-3) + .map((item) => pickSummaryFields(item, ["accountName", "lastProbeAt", "lastStatus", "nextProbeAfter", "ok", "httpStatus", "markerMatched"])); return { ok: block.ok, action: block.action, @@ -1813,9 +1897,12 @@ function compactSentinelStatus(block: unknown): unknown { exists: state.exists, accountCount: state.accountCount, quarantinedCount: state.quarantinedCount, - quarantined: state.quarantined, - recentAccounts: state.recentAccounts, - lastRun: state.lastRun, + quarantinedShown: Math.min(quarantined.length, 3), + quarantined: quarantined.slice(-3), + recentAccountCount: recentAccounts.length, + recentAttention: uniqueByAccountName(recentAttention.slice(-3)), + recentHealthy, + lastRun: compactSentinelLastRun(state.lastRun), error: state.error, }, freezeReassert: Object.keys(freezeReassert).length > 0 ? { @@ -1825,7 +1912,7 @@ function compactSentinelStatus(block: unknown): unknown { itemCount: freezeReassert.itemCount, attentionItems: freezeReassert.attentionItems, } : undefined, - qualityGatePrepare: Object.keys(qualityGatePrepare).length > 0 ? { + pendingProbePrepare: Object.keys(qualityGatePrepare).length > 0 ? { ok: qualityGatePrepare.ok, skipped: qualityGatePrepare.skipped, reason: qualityGatePrepare.reason, @@ -1834,7 +1921,7 @@ function compactSentinelStatus(block: unknown): unknown { pendingOnly: qualityGatePrepare.pendingOnly, items: qualityGatePrepare.items, } : undefined, - qualityGate: Object.keys(qualityGate).length > 0 ? { + pendingProbe: Object.keys(qualityGate).length > 0 ? 
{ ok: qualityGate.ok, skipped: qualityGate.skipped, reason: qualityGate.reason, @@ -3698,7 +3785,8 @@ def planned_sentinel_account_results(profiles, existing_accounts): "sentinelProbeConfigFingerprint": profile.get("sentinelProbeConfigFingerprint"), "sentinelProbeRequired": quality_gate_required, "sentinelChangeReasons": change_reasons if quality_gate_required else [], - "sentinelDefaultFrozen": quality_gate_required, + "sentinelProbePending": quality_gate_required, + "sentinelDefaultFrozen": False, "valuesPrinted": False, }) return results @@ -3728,7 +3816,7 @@ def ensure_accounts(token, profiles, group_id, prune_removed=False, protected_fr data = ensure_success(curl_api("POST", "/api/v1/admin/accounts", bearer=token, payload=payload), f"create account {profile['accountName']}") action = "created" if isinstance(data, dict) and data.get("id") is not None: - schedulable_data = ensure_account_schedulable(token, data["id"], profile["accountName"], not quality_gate_required and not keep_frozen) + schedulable_data = ensure_account_schedulable(token, data["id"], profile["accountName"], not keep_frozen) if schedulable_data: data = schedulable_data results.append({ @@ -3746,7 +3834,8 @@ def ensure_accounts(token, profiles, group_id, prune_removed=False, protected_fr "sentinelProbeConfigFingerprint": profile.get("sentinelProbeConfigFingerprint"), "sentinelProbeRequired": quality_gate_required, "sentinelChangeReasons": change_reasons if quality_gate_required else [], - "sentinelDefaultFrozen": quality_gate_required, + "sentinelProbePending": quality_gate_required, + "sentinelDefaultFrozen": False, "sentinelFreezeProtected": keep_frozen, "openaiResponsesWebSocketsV2Mode": profile.get("openaiResponsesWebSocketsV2Mode"), "trustUpstream": profile.get("trustUpstream") is True, @@ -4145,7 +4234,6 @@ def ensure_sentinel_state_for_sync(account_results, pending_only=False): accounts_state = {} state["accounts"] = accounts_state now = utc_iso() - pending_until = utc_iso(3600) items = 
[] clamped_items = [] if pending_only else clamp_sentinel_freezes_for_config(state, now) cadence_clamped_items = [] if pending_only else clamp_sentinel_success_cadence_for_config(state, [item.get("profileConfig") for item in account_results if isinstance(item.get("profileConfig"), dict)], now) @@ -4167,18 +4255,15 @@ def ensure_sentinel_state_for_sync(account_results, pending_only=False): continue changed_count += 1 reasons = item.get("sentinelChangeReasons") if isinstance(item.get("sentinelChangeReasons"), list) else [] - account_state["quarantine"] = { - "active": True, - "applied": True, - "until": pending_until if pending_only else now, - "intervalMinutes": 0, - "reason": "yaml-account-change-pending-sentinel-probe", - "failureKind": "pending-quality-gate", - "changeReasons": reasons, - "startedAt": now, - "lastBadAt": now, - } - account_state["nextProbeAfter"] = pending_until if pending_only else now + quarantine = account_state.get("quarantine") if isinstance(account_state.get("quarantine"), dict) else None + if not (isinstance(quarantine, dict) and quarantine.get("active") is True): + account_state["quarantine"] = { + "active": False, + "reason": "yaml-account-change-pending-sentinel-probe", + "lastPendingAt": now, + "changeReasons": reasons, + } + account_state["nextProbeAfter"] = now account_state["successStreak"] = 0 account_state["successIntervalMinutes"] = 0 profile_config = item.get("profileConfig") if isinstance(item.get("profileConfig"), dict) else {} @@ -4191,17 +4276,18 @@ def ensure_sentinel_state_for_sync(account_results, pending_only=False): "changeReasons": reasons, "markedAt": now, "pendingOnly": pending_only, + "defaultFrozen": False, } - items.append({"accountName": name, "changeReasons": reasons, "nextProbeAfter": pending_until if pending_only else now, "defaultFrozen": True, "pendingOnly": pending_only}) + items.append({"accountName": name, "changeReasons": reasons, "nextProbeAfter": now, "defaultFrozen": False, "defaultSchedulable": True, 
"pendingOnly": pending_only}) if changed_count <= 0 and len(clamped_items) <= 0 and len(cadence_clamped_items) <= 0: return {"ok": True, "skipped": False, "reason": "no-new-or-changed-accounts", "changedCount": 0, "fingerprintOnlyCount": fingerprint_only_count, "clampedCount": 0, "cadenceClampedCount": 0, "items": [], "valuesPrinted": False} update = update_sentinel_state_configmap(state_obj, state) if pending_only and changed_count > 0: - reason = "new-or-changed-accounts-pending-quality-gate-prepared" + reason = "new-or-changed-accounts-pending-probe-prepared-default-schedulable" elif changed_count > 0 and (len(clamped_items) > 0 or len(cadence_clamped_items) > 0): - reason = "new-or-changed-accounts-default-frozen-and-sentinel-cadence-clamped" + reason = "new-or-changed-accounts-default-schedulable-and-sentinel-cadence-clamped" elif changed_count > 0: - reason = "new-or-changed-accounts-default-frozen" + reason = "new-or-changed-accounts-default-schedulable" elif len(cadence_clamped_items) > 0: reason = "success-cadence-clamped-to-current-config" else: @@ -4777,7 +4863,7 @@ def success_body_reclassification_requirement(): "sourceAccountName": name, "statusCode": error_code, "keywords": keywords, - "probeKeyword": keywords[0], + "representativeKeyword": keywords[0], "durationMinutes": rule.get("duration_minutes"), } return { @@ -4785,12 +4871,12 @@ def success_body_reclassification_requirement(): "sourceAccountName": None, "statusCode": None, "keywords": [], - "probeKeyword": None, + "representativeKeyword": None, "durationMinutes": None, } def model_routing_400_failover_requirement(): - preferred = ["暂不支持", "可用模型", "unsupported model", "model not supported", "does not support", "not supported", "model_not_found", "no available channel for model"] + preferred = ["invalid_encrypted_content", "encrypted content", "could not be verified", "could not be decrypted", "暂不支持", "可用模型", "unsupported model", "model not supported", "does not support", "not supported", 
"model_not_found", "no available channel for model"] for name in sorted(EXPECTED_ACCOUNT_TEMP_UNSCHEDULABLE): expected = normalize_temp_unschedulable_credentials(EXPECTED_ACCOUNT_TEMP_UNSCHEDULABLE[name]) if expected["enabled"] is not True: @@ -4800,13 +4886,13 @@ def model_routing_400_failover_requirement(): keywords = rule.get("keywords") or [] if error_code != 400 or not keywords: continue - probe_keyword = next((item for item in preferred if item in keywords), keywords[0]) + representative_keyword = next((item for item in preferred if item in keywords), keywords[0]) return { "required": True, "sourceAccountName": name, "statusCode": error_code, "keywords": keywords, - "probeKeyword": probe_keyword, + "representativeKeyword": representative_keyword, "durationMinutes": rule.get("duration_minutes"), } return { @@ -4814,7 +4900,7 @@ def model_routing_400_failover_requirement(): "sourceAccountName": None, "statusCode": None, "keywords": [], - "probeKeyword": None, + "representativeKeyword": None, "durationMinutes": None, } @@ -4834,291 +4920,25 @@ def delete_probe_resource(token, method, path, label): "valuesPrinted": False, } -def launch_success_body_mock_upstream(status_code, body_text): - port = 28000 + secrets.randbelow(2000) - body_b64 = base64.b64encode(body_text.encode("utf-8")).decode("ascii") - script = r''' -set -eu -port="$1" -status="$2" -body_b64="$3" -body="$(printf "%s" "$body_b64" | base64 -d)" -length="$(printf "%s" "$body" | wc -c | tr -d " ")" -{ printf "HTTP/1.1 %s OK\r\nContent-Type: application/json\r\nContent-Length: %s\r\nConnection: close\r\n\r\n" "$status" "$length"; printf "%s" "$body"; } | nc -l -p "$port" -w 5 -''' - proc = subprocess.Popen([ - "kubectl", "-n", NAMESPACE, "exec", "-i", APP_POD, - "--", "sh", "-c", script, "sh", str(port), str(status_code), body_b64, - ], stdout=subprocess.PIPE, stderr=subprocess.PIPE) - time.sleep(0.35) - if proc.poll() is None: - return port, proc, None - stdout, stderr = proc.communicate(timeout=2) - 
return None, None, { - "ok": False, - "error": "mock-upstream-exited-before-probe", - "exitCode": proc.returncode, - "stdoutTail": text(stdout, 1000), - "stderrTail": text(stderr, 1000), - } - -def finish_mock_upstream(proc): - if proc is None: - return None - timed_out = False - try: - stdout, stderr = proc.communicate(timeout=2) - except subprocess.TimeoutExpired: - timed_out = True - proc.kill() - stdout, stderr = proc.communicate(timeout=2) - return { - "exitCode": proc.returncode, - "timedOut": timed_out, - "stdoutTail": text(stdout, 1000), - "stderrTail": text(stderr, 1000), - } - -def create_success_body_probe_resources(token, base_url, requirement): - stamp = str(int(time.time() * 1000)) - suffix = stamp + "-" + "".join(secrets.choice(string.ascii_lowercase + string.digits) for _ in range(6)) - group_payload_obj = group_payload() - group_payload_obj["name"] = "unidesk-probe-2xx-body-" + suffix - group_payload_obj["description"] = "UniDesk validate probe for OpenAI 2xx success-body reclassification." 
- group_id = None - account_id = None - api_key_id = None - try: - group = ensure_success(curl_api("POST", "/api/v1/admin/groups", bearer=token, payload=group_payload_obj), "create 2xx success-body probe group") - group_id = group.get("id") if isinstance(group, dict) else None - if group_id is None: - raise RuntimeError("2xx success-body probe group id missing") - - account_payload_obj = { - "name": "unidesk-probe-2xx-body-" + suffix, - "notes": "Temporary UniDesk validate probe account; safe to delete.", - "platform": "openai", - "type": "apikey", - "credentials": { - "api_key": "sk-unidesk-probe-upstream", - "base_url": base_url, - "temp_unschedulable_enabled": True, - "temp_unschedulable_rules": [{ - "error_code": requirement["statusCode"], - "keywords": [requirement["probeKeyword"]], - "duration_minutes": 1, - "description": "UniDesk runtime capability probe for OpenAI 2xx success-body reclassification.", - }], - }, - "extra": { - "openai_responses_mode": "force_responses", - "unidesk_probe": "success_body_reclassification", - }, - "concurrency": 1, - "priority": 0, - "rate_multiplier": 1, - "load_factor": 1, - "group_ids": [group_id], - "confirm_mixed_channel_risk": True, - } - account = ensure_success(curl_api("POST", "/api/v1/admin/accounts", bearer=token, payload=account_payload_obj), "create 2xx success-body probe account") - account_id = account.get("id") if isinstance(account, dict) else None - if account_id is None: - raise RuntimeError("2xx success-body probe account id missing") - - api_key = "sk-unidesk-probe-" + "".join(secrets.choice(string.ascii_letters + string.digits) for _ in range(36)) - api_key_obj = ensure_success(curl_api("POST", "/api/v1/keys", bearer=token, payload={ - "name": "unidesk-probe-2xx-body-" + suffix, - "group_id": group_id, - "custom_key": api_key, - "quota": 0, - "rate_limit_5h": 0, - "rate_limit_1d": 0, - "rate_limit_7d": 0, - }), "create 2xx success-body probe API key") - api_key_id = api_key_obj.get("id") if 
isinstance(api_key_obj, dict) else None - return { - "groupId": group_id, - "groupName": group_payload_obj["name"], - "accountId": account_id, - "accountName": account_payload_obj["name"], - "apiKeyId": api_key_id, - "apiKey": api_key, - "keyPreview": api_key_preview(api_key), - "valuesPrinted": False, - } - except Exception: - if api_key_id is not None: - delete_probe_resource(token, "DELETE", f"/api/v1/keys/{api_key_id}", "api-key") - if account_id is not None: - delete_probe_resource(token, "DELETE", f"/api/v1/admin/accounts/{account_id}", "account") - if group_id is not None: - delete_probe_resource(token, "DELETE", f"/api/v1/admin/groups/{group_id}", "group") - raise - -def gateway_success_body_probe_request(api_key, request_id): - payload = { - "model": RESPONSES_SMOKE_MODEL, - "input": "Reply exactly: success-body probe", - "stream": False, - "store": False, - "max_output_tokens": 8, - } - body = json.dumps(payload, separators=(",", ":")).encode("utf-8") - script = r''' -set -eu -token="$1" -request_id="$2" -tmp="$(mktemp)" -trap 'rm -f "$tmp"' EXIT -cat > "$tmp" -curl -sS -w '\\n__HTTP_CODE__:%{http_code}' -X POST \ - -H "Authorization: Bearer $token" \ - -H 'Content-Type: application/json' \ - -H "X-Request-ID: $request_id" \ - -H "OpenAI-Client-Request-ID: $request_id" \ - --data-binary @"$tmp" \ - http://127.0.0.1:8080/v1/responses -''' - proc = run([ - "kubectl", "-n", NAMESPACE, "exec", "-i", APP_POD, - "--", "sh", "-c", script, "sh", api_key, request_id, - ], body) - return parse_curl_output(proc) - -def account_temp_unschedulable_probe_state(token, account_id): - detail = ensure_success(curl_api("GET", f"/api/v1/admin/accounts/{account_id}", bearer=token), "get temp-unschedulable probe account") - if not isinstance(detail, dict): - return { - "accountId": account_id, - "status": None, - "schedulable": None, - "tempUnschedulableUntil": None, - "tempUnschedulableReasonPreview": "", - "tempUnschedulableSet": False, - } - until = 
detail.get("temp_unschedulable_until") or detail.get("tempUnschedulableUntil") - reason = detail.get("temp_unschedulable_reason") or detail.get("tempUnschedulableReason") or "" - return { - "accountId": account_id, - "status": detail.get("status"), - "schedulable": detail.get("schedulable"), - "tempUnschedulableUntil": until, - "tempUnschedulableReasonPreview": text(str(reason), 500) if reason else "", - "tempUnschedulableSet": until is not None or bool(reason), - } - -def validate_success_body_reclassification(token): - requirement = success_body_reclassification_requirement() - if not requirement["required"]: - return { - "ok": True, - "required": False, - "capability": "openai-2xx-success-body-temp-unschedulable-failover", - "outcome": "not-required-by-yaml", - "valuesPrinted": False, - } - - resources = None - cleanup = [] - mock_proc = None - try: - keyword = requirement["probeKeyword"] - upstream_body = json.dumps({ - "id": "resp_unidesk_success_body_probe", - "object": "response", - "created_at": int(time.time()), - "status": "completed", - "model": RESPONSES_SMOKE_MODEL, - "output": [{ - "type": "message", - "role": "assistant", - "content": [{"type": "output_text", "text": keyword}], - }], - "usage": {"input_tokens": 1, "output_tokens": 1, "total_tokens": 2}, - }, separators=(",", ":")) - port, mock_proc, mock_error = launch_success_body_mock_upstream(requirement["statusCode"], upstream_body) - if mock_error is not None: - return { - "ok": False, - "required": True, - "capability": "openai-2xx-success-body-temp-unschedulable-failover", - "outcome": "probe-infrastructure-failed", - "requirement": requirement, - "mock": mock_error, - "valuesPrinted": False, - } - resources = create_success_body_probe_resources(token, f"http://127.0.0.1:{port}", requirement) - request_id = "unidesk-2xx-body-probe-" + str(int(time.time() * 1000)) - started = time.time() - response = gateway_success_body_probe_request(resources["apiKey"], request_id) - mock_result = 
finish_mock_upstream(mock_proc) - mock_proc = None - state = account_temp_unschedulable_probe_state(token, resources["accountId"]) - evidence = request_log_evidence(request_id) - supported = response.get("ok") is not True and state.get("tempUnschedulableSet") is True - return { - "ok": supported, - "required": True, - "capability": "openai-2xx-success-body-temp-unschedulable-failover", - "outcome": "supported" if supported else "unsupported-runtime-image", - "requirement": requirement, - "probe": { - "requestId": request_id, - "durationMs": int((time.time() - started) * 1000), - "httpStatus": response.get("httpStatus"), - "transportExitCode": response.get("transportExitCode"), - "responseOk": response.get("ok"), - "bodyPreview": text(response.get("body", ""), 800), - "stderr": response.get("stderr", ""), - "accountState": state, - "logEvidence": evidence, - }, - "mock": mock_result, - "resources": { - "groupId": resources["groupId"], - "accountId": resources["accountId"], - "apiKeyId": resources["apiKeyId"], - "keyPreview": resources["keyPreview"], - "valuesPrinted": False, - }, - "cleanup": cleanup, - "message": "Sub2API image must reclassify matching 2xx OpenAI bodies into failover/temp-unschedulable before statusCode=200 YAML rules are effective." 
if not supported else "Matching 2xx OpenAI bodies are reclassified into account failover/temp-unschedulable.", - "valuesPrinted": False, - } - except Exception as exc: - return { - "ok": False, - "required": True, - "capability": "openai-2xx-success-body-temp-unschedulable-failover", - "outcome": "probe-failed", - "requirement": requirement, - "error": str(exc), - "cleanup": cleanup, - "valuesPrinted": False, - } - finally: - if mock_proc is not None: - _ = finish_mock_upstream(mock_proc) - if resources is not None: - cleanup.append(delete_probe_resource(token, "DELETE", f"/api/v1/keys/{resources['apiKeyId']}" if resources.get("apiKeyId") is not None else "", "api-key")) - cleanup.append(delete_probe_resource(token, "DELETE", f"/api/v1/admin/accounts/{resources['accountId']}", "account")) - cleanup.append(delete_probe_resource(token, "DELETE", f"/api/v1/admin/groups/{resources['groupId']}", "group")) - def validate_runtime_capabilities(token): - success_body = validate_success_body_reclassification(token) + success_body = success_body_reclassification_requirement() model_routing_400 = model_routing_400_failover_requirement() return { - "ok": success_body.get("ok") is True, + "ok": success_body.get("required") is not True and model_routing_400.get("required") is True, "runtimeImage": app_pod_runtime_image(), - "successBodyReclassification": success_body, + "successBodyReclassification": { + "ok": success_body.get("required") is not True, + "required": success_body.get("required"), + "outcome": "not-required-by-yaml" if success_body.get("required") is not True else "requires-source-review-and-real-traffic-evidence", + "requirement": success_body, + "valuesPrinted": False, + }, "modelRouting400Failover": { - "ok": True, - "required": model_routing_400.get("required") is True, - "capability": "openai-400-model-routing-temp-unschedulable-failover", - "outcome": "declared-by-yaml-and-checked-by-runtime-rules", + "ok": model_routing_400.get("required") is True, + 
"required": model_routing_400.get("required"), + "outcome": "yaml-rules-present-runtime-observed-by-real-traffic", "requirement": model_routing_400, - "message": "Default validate stays short; runtime proof comes from synced rules plus real request logs/failover evidence.", + "evidence": "Use Sub2API source review plus validation.gatewayResponsesRecent and Artificer/real request ids for runtime proof; default validate does not create mock upstreams or temporary failover accounts.", "valuesPrinted": False, }, "valuesPrinted": False, @@ -5418,7 +5238,7 @@ def run_sync(): planned_account_results = planned_sentinel_account_results(profiles, existing_accounts) sentinel_quality_prepare = ensure_sentinel_state_for_sync(planned_account_results, True) if sentinel_quality_prepare.get("ok") is not True: - raise RuntimeError("prepare sentinel quality gate failed: " + json.dumps(sentinel_quality_prepare, ensure_ascii=False)) + raise RuntimeError("prepare sentinel pending probe failed: " + json.dumps(sentinel_quality_prepare, ensure_ascii=False)) protected_frozen_names = active_sentinel_quarantine_names() account_results, pruned_account_results = ensure_accounts(token, profiles, group_id, prune_removed, protected_frozen_names, existing_accounts) capacity_status = account_capacity_status(token) @@ -5438,7 +5258,7 @@ def run_sync(): sentinel_quality = ensure_sentinel_state_for_sync(account_results) sentinel_reassert = reassert_sentinel_freezes_after_sync(token) return { - "ok": gateway["ok"] is True and responses_smoke["ok"] is True and owner_concurrency["ok"] is True and capacity_status["ok"] is True and load_factor_status["ok"] is True and ws_v2_status["ok"] is True and temp_unschedulable_status["ok"] is True and sentinel.get("ok") is True and sentinel_quality_prepare.get("ok") is True and sentinel_quality.get("ok") is True and sentinel_reassert.get("ok") is True, + "ok": gateway["ok"] is True and responses_smoke["ok"] is True and owner_concurrency["ok"] is True and 
capacity_status["ok"] is True and load_factor_status["ok"] is True and ws_v2_status["ok"] is True and temp_unschedulable_status["ok"] is True and sentinel.get("ok") is True and sentinel_quality_prepare.get("ok") is True and sentinel_quality.get("ok") is True and sentinel_reassert.get("ok") is True and runtime_capabilities.get("ok") is True, "degraded": bool(responses_smoke.get("degraded")) or bool(compact_evidence.get("degraded")) or bool(responses_evidence.get("degraded")) or runtime_capabilities.get("ok") is not True, "mode": "sync", "namespace": NAMESPACE, @@ -5502,7 +5322,7 @@ def run_validate(): runtime_capabilities = validate_runtime_capabilities(token) sentinel = sentinel_runtime_status() return { - "ok": gateway["ok"] is True and responses_smoke["ok"] is True and (owner_concurrency is None or owner_concurrency["ok"] is True) and capacity_status["ok"] is True and load_factor_status["ok"] is True and ws_v2_status["ok"] is True and temp_unschedulable_status["ok"] is True and sentinel.get("ok") is True, + "ok": gateway["ok"] is True and responses_smoke["ok"] is True and (owner_concurrency is None or owner_concurrency["ok"] is True) and capacity_status["ok"] is True and load_factor_status["ok"] is True and ws_v2_status["ok"] is True and temp_unschedulable_status["ok"] is True and sentinel.get("ok") is True and runtime_capabilities.get("ok") is True, "degraded": bool(responses_smoke.get("degraded")) or bool(compact_evidence.get("degraded")) or bool(responses_evidence.get("degraded")) or runtime_capabilities.get("ok") is not True, "mode": "validate", "namespace": NAMESPACE,