fix: harden sub2api codex pool sync controls

Codex
2026-06-10 12:04:03 +00:00
parent 2485591138
commit 375117271a
4 changed files with 745 additions and 56 deletions
+13 -6
@@ -50,6 +50,7 @@ bun scripts/cli.ts platform-infra sub2api validate
bun scripts/cli.ts platform-infra sub2api codex-pool plan
bun scripts/cli.ts platform-infra sub2api codex-pool sync --confirm
bun scripts/cli.ts platform-infra sub2api codex-pool validate
bun scripts/cli.ts platform-infra sub2api codex-pool cleanup-probes --confirm
```
`config/platform-infra/sub2api-codex-pool.yaml` controls:
@@ -57,8 +58,10 @@ bun scripts/cli.ts platform-infra sub2api codex-pool validate
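- `pool.groupName`: the Sub2API group name.
- `pool.apiKeySecretName` / `pool.apiKeySecretKey`: location of the k3s Secret that holds the unified consumer API key; defaults to `platform-infra/sub2api-codex-pool-api-key.API_KEY`.
- `pool.minOwnerBalanceUsd`: minimum balance for the pool key owner; sync/validate tops it up.
- `pool.minOwnerConcurrency`: minimum concurrency for the owner of the unified consumer API key; sync/validate raises the owner to this value. It must be at least the sum of the capacities declared for all accounts in YAML, so the shared key does not trigger WS 1013 at the user-concurrency layer. Do not mask that error by raising one provider's capacity.
- `pool.minOwnerConcurrency`: optional minimum concurrency for the owner of the unified consumer API key. When omitted, the CLI automatically uses the sum of all resolved account capacities, and sync/validate raises the owner to it. An explicit YAML value is only an override and must still be at least the account capacity sum; accounts without an explicit `profiles.entries[].capacity` contribute `pool.defaultAccountCapacity` to the sum. Do not mask user-concurrency-layer WS 1013 by raising one provider's capacity.
- `pool.defaultTempUnschedulable`: default account-level temporary-unschedulable rules. Declare only error-path capabilities that Sub2API already supports, so that when an upstream returns capacity, rate-limit, overload, service unavailable, gateway timeout, stable model-routing, or authentication-state errors, Sub2API cools down that account and switches to another account in the same group. Do not use YAML, the UniDesk CLI, k8s hot patches, or a local fork to hack in behavior that Sub2API does not support.
- When automatic freezing or account switching fails, fix the `temp_unschedulable` and failover mechanism itself, and prove with runtime evidence that the failing account was temporarily frozen and that requests moved to another schedulable account. Never substitute manually disabling an account, deleting an account, removing a YAML entry, lowering membership, or temporarily changing scheduling policy for automatic recovery. Only an explicit upstream retirement or ownership change goes through the delete/disable upstream procedure.
- YAML only selects and configures Codex upstreams; it does not declare `schedulable` as a durable field. `schedulable=true` may only be restored as the process-control baseline of `codex-pool sync --confirm`. Automatic freezing must appear as `temp_unschedulable_until` / `temp_unschedulable_reason`, so that a permanently unschedulable state is not mistaken for automatic freezing.
- `profiles.entries`: selects upstream profiles from `~/.codex/` on master and maps them to Sub2API accounts.
- `profiles.entries[].capacity`: optional per-account concurrency override; when omitted, `pool.defaultAccountCapacity` applies. `config/platform-infra/sub2api-codex-pool.yaml` is the only source of truth for actual values; the skill and the long-term reference describe rules only and do not repeat current values.
- `profiles.entries[].loadFactor`: optional per-account Sub2API `load_factor` override; when omitted, `pool.defaultAccountLoadFactor` applies. `config/platform-infra/sub2api-codex-pool.yaml` is the only source of truth for actual values; after changing it, run `codex-pool sync --confirm` and `codex-pool validate`.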
@@ -67,7 +70,7 @@ bun scripts/cli.ts platform-infra sub2api codex-pool validate
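- `profiles.entries[].openaiResponsesWebSocketsV2Mode`: set only for upstreams that need Responses WebSocket v2; the value is `off`, `ctx_pool`, or `passthrough`.
- `profiles.entries[].upstreamUserAgent`: set only for the few upstreams that require a Codex CLI User-Agent; the value must not contain newlines.
`sync --confirm` logs in to Sub2API admin, creates or updates the group, creates or updates the `unidesk-codex-*` accounts in YAML, creates or reuses the unified API key Secret, and deletes `unidesk-codex-*` accounts with `extra.unidesk_managed=true` that have been removed from YAML.
`sync --confirm` logs in to Sub2API admin, creates or updates the group, creates or updates the `unidesk-codex-*` accounts in YAML, creates or reuses the unified API key Secret, and restores `schedulable=true` on managed accounts as the process-control baseline; by default it does not delete managed accounts that are absent from YAML. Use `sync --confirm --prune-removed` only when an upstream is explicitly retired, to delete absent `unidesk-codex-*` accounts with `extra.unidesk_managed=true`.
`sync --confirm` and `validate` can run longer than a single short SSH/runtime connection window. Keep using `bun scripts/cli.ts platform-infra sub2api codex-pool ...`, which lets the CLI submit a job on the remote G14 host and short-poll its status; do not switch to a bare `trans G14:k3s script` that waits on one long connection for the full result. If you see `UNIDESK_SSH_RUNTIME_TIMEOUT`, first treat it as a control-plane visibility problem per the rules in `docs/reference/platform-infra.md`: fix the CLI/job/poll path or rerun the controlled command; do not hand-patch Sub2API credentials or source code.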
@@ -91,13 +94,15 @@ At startup, Codex repeatedly shows WebSocket reconnect, HTTPS fallback, `websocket cl
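## Removing an upstream
Remove an upstream only for an explicit retirement, a change of credential ownership, or an explicit user request to remove the provider. It is not a recovery measure for upstream 5xx, compact failures, rate limiting, model-routing failures, or defects in automatic freezing/account switching.
1. Delete the corresponding `profiles.entries` item from `config/platform-infra/sub2api-codex-pool.yaml`.
2. Run `codex-pool plan` and check the desired list.
3. Run `codex-pool sync --confirm`.
3. Run `codex-pool sync --confirm --prune-removed`.
4. Confirm that `accounts.pruned` in the output contains only the accounts you intended to delete.
5. Run `codex-pool validate`.
The CLI prunes absent accounts whose `name` starts with `unidesk-codex-` and that have `extra.unidesk_managed=true`.
By default the CLI keeps absent accounts, so that an availability problem is not mishandled as a deletion; only an explicit `--prune-removed` prunes absent accounts whose `name` starts with `unidesk-codex-` and that have `extra.unidesk_managed=true`.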
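The keep-by-default and explicit-prune rule above can be sketched in TypeScript. This is a minimal illustration under stated assumptions, not the CLI's actual code: `ManagedAccount`, `selectPruneCandidates`, and the `pruneRemoved` parameter are hypothetical names, and only the `unidesk-codex-` prefix and the `extra.unidesk_managed=true` condition come from the text.

```typescript
// Hypothetical shape of a Sub2API account as the sync step might see it.
interface ManagedAccount {
  name: string;
  extra?: { unidesk_managed?: boolean };
}

// Returns the accounts a sync run would delete.
// Without pruneRemoved nothing is deleted, even when accounts are absent from YAML.
function selectPruneCandidates(
  runtimeAccounts: ManagedAccount[],
  desiredNames: Set<string>,
  pruneRemoved: boolean,
): ManagedAccount[] {
  if (!pruneRemoved) return [];
  return runtimeAccounts.filter(
    (account) =>
      !desiredNames.has(account.name) &&
      account.name.startsWith("unidesk-codex-") &&
      account.extra?.unidesk_managed === true,
  );
}

// Example with made-up account names.
const exampleAccounts: ManagedAccount[] = [
  { name: "unidesk-codex-alpha", extra: { unidesk_managed: true } },
  { name: "unidesk-codex-beta", extra: { unidesk_managed: true } },
  { name: "unidesk-codex-handmade" }, // not marked as managed, never pruned
];
const exampleDesired = new Set<string>(["unidesk-codex-alpha"]);

console.log(selectPruneCandidates(exampleAccounts, exampleDesired, false).map((a) => a.name));
console.log(selectPruneCandidates(exampleAccounts, exampleDesired, true).map((a) => a.name));
```

Returning an empty list when the flag is absent keeps deletion out of the ordinary sync path, which matches the intent stated above.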
## FRP exposure
@@ -144,16 +149,17 @@ bun scripts/cli.ts platform-infra sub2api codex-pool configure-local --confirm
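- profile invalid: first fix `base_url`, `wire_api`, and `model` in `~/.codex/config.toml.<profile>` and the API key in `auth.json.<profile>`; do not write secrets into YAML.
- pool key 401: run `codex-pool sync --confirm` to rebuild the binding between the Sub2API key and the k3s Secret, then run `codex-pool validate`.
- Leftover probes from past validation runs: clean up `unidesk-probe-*` temporary resources only with `codex-pool cleanup-probes --confirm`; do not treat deleting a real managed account as probe cleanup or availability recovery.
- FRP unreachable: first check `masterFrps`, `masterCaddy`, `sub2api-frpc`, and the public 401 probe in the output of `codex-pool expose --confirm`; when you need lower-level evidence, use `trans G14:k3s` only for bounded queries.
- `/responses/compact` returns 504 after about 30 seconds but the Sub2API log later records `codex.remote_compact.succeeded`: first check whether the master Caddy `response_header_timeout` is rendered from YAML `publicExposure.masterCaddy.responseHeaderTimeoutSeconds`, then run `codex-pool expose --confirm` after correcting it. This kind of edge-proxy timeout does not trigger Sub2API account-level temporary unscheduling.
- default profile recursion: check whether the YAML default entry uses the `*.pre-sub2api` backup files; if necessary, restore the backup and rerun `configure-local --confirm`.
- Upstream needs WebSocket v2: run a direct Codex WSv2 probe first; only after it passes, set `openaiResponsesWebSocketsV2Mode: ctx_pool|passthrough` on that profile and run `sync --confirm`. Treat the account as a capability candidate; its capacity still comes from `capacity` in YAML or the default.
- Codex falls back from WebSocket at startup: reproduce with a Codex smoke test through the original entry point, then confirm the account with bounded Sub2API logs. For accounts that show WS handshake 4xx/5xx, `openai.websocket_account_select_failed`, or close-before-`response.completed`, turn off the YAML WSv2 capability and sync. If no WSv2-capable account remains, turn off `localCodex.supportsWebSockets` and `localCodex.responsesWebSocketsV2` together; do not write inferences about temporary availability into scheduling configuration.
- Upstream requires a Codex User-Agent: set `upstreamUserAgent` only on that profile, then run `sync --confirm`.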
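- No account switch, or frequent fail-then-recover, after an upstream reports capacity/rate-limit/overload/Bad Gateway/Gateway Timeout: first confirm that `codex-pool validate` reports `tempUnschedulable.ok=true`, that the target account has `runtimeEnabled=true`, and that the rule count matches YAML; then look at the account/upstream status in `validation.gatewayResponses.evidence.failovers`. On a mismatch, run `codex-pool sync --confirm`; do not hand-patch Sub2API credentials.
- No account switch, or frequent fail-then-recover, after an upstream reports capacity/rate-limit/overload/Bad Gateway/Gateway Timeout: first confirm that `codex-pool validate` reports `tempUnschedulable.ok=true`, that the target account has `runtimeEnabled=true`, and that the rule count matches YAML; then look at the account/upstream status in `validation.gatewayResponses.evidence.failovers`. On a mismatch, run `codex-pool sync --confirm`. If the runtime rules already match but the account is still not frozen or not switched, keep fixing the Sub2API automatic freezing/failover capability and retest. Do not hand-patch Sub2API credentials, and do not manually disable, delete, or remove the problem account from YAML to work around a defect in the mechanism.
- `codex-pool sync --confirm` or `codex-pool validate` times out: first distinguish a CLI transport timeout from a Sub2API runtime failure. The controlled CLI should return remote job progress and a stdout/stderr tail; if it is only a low-level `trans` 60s timeout, you cannot conclude from it that Sub2API failover is not working. Switch to, or fix, the CLI's remote job/poll path, rerun, and use the final structured result as evidence.
- Codex reports account-state or soft-quota prompts such as weekly-limit, `less than 10% of your weekly limit left`, or `Run /status for a breakdown`, and asks to switch accounts: if the upstream returns them with an error status such as 403/429, put the stable body keywords into the corresponding rule of `pool.defaultTempUnschedulable`, run `codex-pool sync --confirm`, then use `codex-pool validate` to confirm that the runtime rules of every managed account include those keywords. If the text arrives as HTTP 200 success content, Sub2API currently cannot reclassify it as an account cooldown; do not write a YAML 200 rule, do not hot-patch Sub2API, and do not bypass sync. File an issue for the upstream capability gap when necessary.
- Upstream 400/503 response bodies contain `invalid_encrypted_content`, `bad_response_status_code`, unsupported-model, `可用模型` ("available models"), `model_not_found`, `No available channel for model ...`, or a similar stable model-routing / Responses encrypted-content compatibility failure: put the stable body keywords into the corresponding 400/503 rule of `pool.defaultTempUnschedulable`, run `codex-pool sync --confirm`, then use `codex-pool validate` to confirm that the target account's runtime rule includes those keywords. Do not mask this error family by changing account membership, priority, capacity, loadFactor, WebSocket mode, or User-Agent.
- Upstream 400/503 response bodies contain `invalid_encrypted_content`, `bad_response_status_code`, `invalid_request_error` with a stable unsupported-model message, unsupported-model, `暂不支持` ("not yet supported") / `可用模型` ("available models"), `model_not_found`, `No available channel for model ...`, or a similar stable model-routing / Responses encrypted-content compatibility failure: put the stable body keywords into the corresponding 400/503 rule of `pool.defaultTempUnschedulable`, run `codex-pool sync --confirm`, then use `codex-pool validate` to confirm that the target account's runtime rule includes those keywords. Do not mask this error family by changing account membership, priority, capacity, loadFactor, WebSocket mode, or User-Agent.
- Upstream errors keep recurring: default error cooldowns are tiered by severity. Transient problems can start at 10 minutes; gateway, service-unavailable, overload, and model-routing errors should cool down longer; authentication, permission, quota, account-state, and account-compatibility errors use the longest cooldown. Stable wrapper texts such as `invalid_encrypted_content`, unsupported-model, `Recovered upstream error ...`, `Bad Gateway`, `Gateway Timeout`, Cloudflare `524`, Codex-facing `Upstream request failed`, `Unknown error`, `context deadline exceeded`, `context canceled`, `model_not_found`, `No available channel for model`, and large-context `413` `openai_error` should all stay in the corresponding YAML cooldown policy, especially because on the ordinary `/responses` and compact paths an upstream compatibility error or 524 may finally surface to the client as 502/504 + `Unknown error`. YAML is the only source of truth for actual values; after changing them, run `codex-pool sync --confirm` and `codex-pool validate`. For the long-term criteria, see `docs/reference/platform-infra.md`.
- Context lost after Codex auto compact: first confirm whether YAML `localCodex` declares WSv2 enabled. If it does, confirm that the local `~/.codex/config.toml` has `supports_websockets = true` and `responses_websockets_v2 = true`, and check the WSv2 candidates in `codex-pool validate` and `transport=responses_websockets_v2` in the Sub2API logs. If YAML currently disables WSv2, troubleshoot it as an HTTP Responses stability problem; do not treat the old WS criteria as an acceptance requirement.
- Codex smoke shows reconnect/1013: this is an upstream concurrency/availability problem; handle it separately from HTTP-only compact context loss. Record session/log evidence and link it to the dedicated issue; do not override YAML capacity with manual runtime patches.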
@@ -166,4 +172,5 @@ bun scripts/cli.ts platform-infra sub2api codex-pool configure-local --confirm
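- Do not add CPU/memory limits to the Sub2API manifest unless there is a new, explicit decision expressed in YAML.
- Do not print a complete API key, the admin password, or Secret plaintext.
- Do not turn routine upstream additions or removals into code changes, CI/CD, feature flags, or dual compatibility paths.
- Do not treat manually disabling an account, deleting an account, removing a YAML entry, lowering membership, or temporarily changing priority/capacity/loadFactor as a fix for failed automatic freezing/account switching.
- Do not hack Sub2API: if Sub2API itself does not support a capability, do not build it. Do not substitute UniDesk scripts, in-place k8s hot patches, a local fork, fake YAML declarations, or hidden fallbacks for an upstream implementation.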
@@ -4,7 +4,6 @@ pool:
apiKeySecretName: sub2api-codex-pool-api-key
apiKeySecretKey: API_KEY
minOwnerBalanceUsd: 1000
minOwnerConcurrency: 120
defaultAccountPriority: 10
defaultAccountCapacity: 10
defaultAccountLoadFactor: 1
+7 -2
@@ -25,8 +25,11 @@
- `pool.groupName` names the Sub2API group that represents the pool.
- `pool.apiKeySecretName` and `pool.apiKeySecretKey` name the k3s Secret that stores the single consumer API key.
- `pool.minOwnerConcurrency` declares the minimum concurrency for the Sub2API user that owns the unified consumer API key. It must be at least the sum of all declared account capacities, so the shared key does not fail requests or WS sessions at the user-concurrency layer. Do not compensate for owner-concurrency 1013 errors by pinning capacity to one provider.
- `pool.minOwnerConcurrency` is optional; when omitted, the CLI automatically uses the sum of all resolved account capacities as the minimum concurrency for the Sub2API user that owns the unified consumer API key. A YAML value is only an explicit override and must still be at least that capacity sum, so the shared key does not fail requests or WS sessions at the user-concurrency layer. "Resolved" means each account's explicit `profiles.entries[].capacity` or, when omitted, `pool.defaultAccountCapacity`. Do not compensate for owner-concurrency 1013 errors by pinning capacity to one provider.
- `pool.defaultTempUnschedulable` declares Sub2API account-level temporary unschedulable rules for capabilities that Sub2API itself already supports. Keep 429/overload/capacity, service-unavailable, gateway timeout, and stable model-routing failures in this YAML policy so the scheduler can cool down a failing account and choose another candidate instead of hard-pinning one provider. Do not declare unsupported Sub2API behavior in YAML as a promise that UniDesk code or runtime patches should emulate.
- When a managed upstream repeatedly causes `/v1/responses` or `/responses/compact` failures, the required fix path is to make automatic temporary-unschedulable and failover work, then verify it with runtime evidence. Do not restore availability by manually disabling an account, deleting a managed account, removing its YAML entry, lowering membership, or otherwise changing routing policy merely to avoid the failing upstream; those actions are allowed only for an explicit upstream retirement or ownership change.
- Codex accounts selected by YAML do not declare `schedulable` as durable configuration. `schedulable=true` is a `codex-pool sync --confirm` process-control baseline for UniDesk-managed accounts, not a YAML field. Account cooling must be represented by `temp_unschedulable_until` / `temp_unschedulable_reason`, so validation can distinguish real automatic cooldown from stale manual unschedulable state.
- `codex-pool sync --confirm` preserves UniDesk-managed accounts that are absent from YAML by default; explicit upstream retirement requires `codex-pool sync --confirm --prune-removed`. This keeps account deletion out of the normal availability-recovery path and prevents temporary YAML edits from becoming destructive runtime changes.
- `profiles.entries` selects local Codex profile files from `~/.codex/` and maps them to Sub2API account names.
- The unsuffixed master `~/.codex/config.toml` and `~/.codex/auth.json` are reserved for the unified Sub2API consumer. `config.toml` must keep `base_url = "https://sub2api.74-48-78-17.nip.io/"`, and `auth.json` must contain the unified pool API key from `pool.apiKeySecretName` / `pool.apiKeySecretKey`. Do not replace these two files with direct upstream account credentials.
- Additional upstream accounts must use suffixed local profile files such as `config.toml.<profile>` and `auth.json.<profile>`, then be declared through `profiles.entries` in `config/platform-infra/sub2api-codex-pool.yaml`.
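The `pool.minOwnerConcurrency` rule in the list above (default to the sum of resolved capacities, accept an explicit value only if it is at least that sum) can be sketched as follows. This is an illustration under stated assumptions, not the CLI's implementation: `PoolConfig`, `ProfileEntry`, and `resolveOwnerConcurrency` are hypothetical names, the capacities in the example are made up, and throwing on an undersized override is one possible way to enforce the "must still be at least" requirement.

```typescript
// Hypothetical subset of the YAML `pool` section.
interface PoolConfig {
  defaultAccountCapacity: number;
  minOwnerConcurrency?: number; // optional explicit override
}

// Hypothetical subset of one `profiles.entries` item.
interface ProfileEntry {
  name: string;
  capacity?: number; // optional per-account override
}

function resolveOwnerConcurrency(pool: PoolConfig, entries: ProfileEntry[]): number {
  // "Resolved" capacity: the explicit value, else the pool default.
  const capacitySum = entries.reduce(
    (sum, entry) => sum + (entry.capacity ?? pool.defaultAccountCapacity),
    0,
  );
  if (pool.minOwnerConcurrency === undefined) {
    return capacitySum;
  }
  if (pool.minOwnerConcurrency < capacitySum) {
    // Assumption: an undersized override is rejected rather than silently raised.
    throw new Error(
      `pool.minOwnerConcurrency ${pool.minOwnerConcurrency} is below the resolved capacity sum ${capacitySum}`,
    );
  }
  return pool.minOwnerConcurrency;
}

// Example with made-up values: one entry uses the default, one overrides it.
const exampleEntries: ProfileEntry[] = [
  { name: "profile-a" },
  { name: "profile-b", capacity: 25 },
];
console.log(resolveOwnerConcurrency({ defaultAccountCapacity: 10 }, exampleEntries));
```

Deriving the default from the entries means that adding or removing an account changes the owner concurrency without a second YAML edit.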
@@ -35,7 +38,7 @@
- Do not change account membership, priority, capacity, load factor, WebSocket mode, or other routing policy from inference alone. Unless the user explicitly asks for a configuration change, first preserve the current YAML, collect provenance and runtime evidence, and write the finding to the relevant issue or runbook before proposing a change.
- `profiles.entries[].tempUnschedulable` may override the pool default for one account. The CLI renders it into Sub2API credentials as `temp_unschedulable_enabled` and `temp_unschedulable_rules`; rules match HTTP status plus response-body keywords and place only that account into a temporary unschedulable cooldown.
- Codex account-state or quota prompts that stop a task and ask the operator to switch accounts belong in `pool.defaultTempUnschedulable`, not in account membership, priority, capacity, load factor, WebSocket mode, or `pool_mode`. Keep stable body phrases such as weekly-limit and `/status` prompts in both the 403 account-state rule and the 429 quota/rate-limit rule, then run `codex-pool sync --confirm` and `codex-pool validate`. The validation evidence must include runtime temporary-unschedulable alignment for each managed account, not only successful group-level `/v1/models` or `/v1/responses` smoke output.
- Upstream model-routing and Responses compatibility failures that surface as 400 responses, such as `invalid_encrypted_content`, `bad_response_status_code`, unsupported-model wrappers, or stable "available models" messages, belong in `pool.defaultTempUnschedulable` when another account can handle the same Codex request. Upstream model-routing failures that surface as 503 responses, such as `model_not_found` or "no available channel for model" wrappers, also belong there. Gateway and timeout failures that surface as 502, 504, or 524 responses, including `Gateway Timeout`, `Unknown error`, `Upstream request failed`, `context deadline exceeded`, `context canceled`, or recovered upstream-error wrappers, belong in the same YAML policy. This is especially important for compact and long `/responses` requests, where an upstream Cloudflare 524 or account-specific compatibility failure may eventually reach Codex as a 502/504 unknown-error wrapper after failover or client cancellation. They are not membership, priority, capacity, load factor, WebSocket mode, or User-Agent decisions by themselves. After adding stable body phrases, run `codex-pool sync --confirm` and `codex-pool validate`, and verify the affected account's runtime status-specific rule includes the new keywords.
- Upstream model-routing and Responses compatibility failures that surface as 400 responses, such as `invalid_encrypted_content`, `bad_response_status_code`, `invalid_request_error` with a stable unsupported-model message, unsupported-model wrappers, or stable "available models" messages, belong in `pool.defaultTempUnschedulable` when another account can handle the same Codex request. Upstream model-routing failures that surface as 503 responses, such as `model_not_found` or "no available channel for model" wrappers, also belong there. Gateway and timeout failures that surface as 502, 504, or 524 responses, including `Gateway Timeout`, `Unknown error`, `Upstream request failed`, `context deadline exceeded`, `context canceled`, or recovered upstream-error wrappers, belong in the same YAML policy. This is especially important for compact and long `/responses` requests, where an upstream Cloudflare 524 or account-specific compatibility failure may eventually reach Codex as a 502/504 unknown-error wrapper after failover or client cancellation. They are not membership, priority, capacity, load factor, WebSocket mode, or User-Agent decisions by themselves. After adding stable body phrases, run `codex-pool sync --confirm` and `codex-pool validate`, and verify the affected account's runtime status-specific rule includes the new keywords.
- `profiles.entries[].openaiResponsesWebSocketsV2Mode` is the account-level Responses WebSocket v2 switch for OpenAI-compatible upstreams that require WebSocket transport. Allowed values are `off`, `ctx_pool`, and `passthrough`; omit the field unless that upstream needs it.
- `profiles.entries[].upstreamUserAgent` is an optional account-level upstream request User-Agent override. Use it only for upstreams that require a Codex CLI compatible User-Agent; keep the value YAML-controlled and newline-free.
- `publicExposure` controls the optional FRP bridge from master server to the G14 ClusterIP service.
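The constraints on `openaiResponsesWebSocketsV2Mode` and `upstreamUserAgent` described in the list above could be checked before a sync with a small validator. The sketch below is hypothetical: `EntryOverrides` and `validateEntryOverrides` are illustrative names, the User-Agent string in the example is invented, and the real CLI may validate these fields differently.

```typescript
// Allowed values taken from the field description above.
const WS_V2_MODES: string[] = ["off", "ctx_pool", "passthrough"];

// Hypothetical subset of one `profiles.entries` item.
interface EntryOverrides {
  openaiResponsesWebSocketsV2Mode?: string;
  upstreamUserAgent?: string;
}

// Returns a list of human-readable problems; an empty list means the entry passes.
function validateEntryOverrides(entry: EntryOverrides): string[] {
  const errors: string[] = [];
  const mode = entry.openaiResponsesWebSocketsV2Mode;
  if (mode !== undefined && WS_V2_MODES.indexOf(mode) === -1) {
    errors.push(`openaiResponsesWebSocketsV2Mode must be one of ${WS_V2_MODES.join(", ")}`);
  }
  const userAgent = entry.upstreamUserAgent;
  if (userAgent !== undefined && /[\r\n]/.test(userAgent)) {
    errors.push("upstreamUserAgent must not contain newlines");
  }
  return errors;
}

// Example: both fields are optional, so an empty entry is valid.
console.log(validateEntryOverrides({}));
console.log(validateEntryOverrides({ openaiResponsesWebSocketsV2Mode: "on", upstreamUserAgent: "example-agent\n" }));
```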
@@ -52,6 +55,8 @@ Do not enable Sub2API `pool_mode` for UniDesk-managed Codex accounts. `pool_mode
Sub2API temporary-unschedulable rules require both an HTTP status match and a response-body keyword match in the upstream failure/error path. Do not treat them as a general successful-response content filter. If an upstream returns a quota warning or maintenance prompt as normal HTTP 200 assistant content, do not add a YAML 200 cooldown rule, patch Sub2API in place, fork behavior in UniDesk, or bypass `codex-pool sync` to make the pool pretend that account cooling exists. Record the upstream capability gap in an issue when it matters operationally; until upstream Sub2API supports that behavior and `codex-pool validate` proves it, UniDesk should not implement or rely on it.
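The matching requirement in the paragraph above (HTTP status and body keyword must both match, and only on the error path) can be illustrated with a short sketch. It is not Sub2API's code: the `TempUnschedulableRule` shape, the `findMatchingRule` name, the cooldown values, case-insensitive matching, and "any one keyword is enough" are all assumptions made for the example.

```typescript
// Hypothetical rule shape; the real Sub2API fields may differ.
interface TempUnschedulableRule {
  status: number; // HTTP status the rule applies to
  keywords: string[]; // response-body keywords; any one is assumed sufficient
  cooldownMinutes: number; // made-up value for illustration
}

function findMatchingRule(
  status: number,
  body: string,
  rules: TempUnschedulableRule[],
): TempUnschedulableRule | undefined {
  // Successful responses are never cooldown candidates, whatever the body says.
  if (status >= 200 && status < 300) return undefined;
  const lowerBody = body.toLowerCase();
  return rules.find(
    (rule) =>
      rule.status === status &&
      rule.keywords.some((keyword) => lowerBody.indexOf(keyword.toLowerCase()) !== -1),
  );
}

const exampleRules: TempUnschedulableRule[] = [
  { status: 429, keywords: ["weekly limit"], cooldownMinutes: 60 },
  { status: 503, keywords: ["model_not_found"], cooldownMinutes: 30 },
];

const quotaText = "You have less than 10% of your weekly limit left";
console.log(findMatchingRule(429, quotaText, exampleRules) !== undefined); // status and keyword both match
console.log(findMatchingRule(200, quotaText, exampleRules) !== undefined); // same text as HTTP 200 content
```

The second call shows why a quota warning delivered as normal HTTP 200 content cannot cool an account under this model.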
If automatic cooling or same-request failover does not happen for an error that the YAML policy declares, treat that as a Sub2API capability or integration defect. The closeout must show the failing account being marked temporarily unschedulable and the next request or same request selecting another schedulable account; a manually disabled, deleted, or pruned account is not valid evidence for this class of fix.
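When reviewing closeout evidence, it can help to classify each account's runtime state so that a manual disable is not mistaken for automatic cooling. The sketch below assumes a hypothetical `AccountRuntimeState` shape in which `temp_unschedulable_until` is an ISO 8601 timestamp string; only the field names `schedulable`, `temp_unschedulable_until`, and `temp_unschedulable_reason` come from the text, and `classifyAccount` is an illustrative helper.

```typescript
// Hypothetical runtime view of one account; the timestamp format is an assumption.
interface AccountRuntimeState {
  name: string;
  schedulable: boolean;
  temp_unschedulable_until?: string;
  temp_unschedulable_reason?: string;
}

type CoolingState = "automatic-cooldown" | "manual-unschedulable" | "schedulable";

function classifyAccount(state: AccountRuntimeState, now: Date): CoolingState {
  const until = state.temp_unschedulable_until
    ? new Date(state.temp_unschedulable_until)
    : undefined;
  // Valid evidence: a cooldown deadline in the future plus a recorded reason.
  if (until && until.getTime() > now.getTime() && state.temp_unschedulable_reason) {
    return "automatic-cooldown";
  }
  // schedulable=false without cooldown fields is manual or stale state, not evidence.
  if (!state.schedulable) {
    return "manual-unschedulable";
  }
  return "schedulable";
}

const exampleNow = new Date("2026-06-10T12:00:00Z");
console.log(
  classifyAccount(
    {
      name: "unidesk-codex-example",
      schedulable: true,
      temp_unschedulable_until: "2026-06-10T12:30:00Z",
      temp_unschedulable_reason: "429 rate limit",
    },
    exampleNow,
  ),
);
```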
The request path is:
1. A client sends an OpenAI-compatible request to the configured consumer base URL, normally `https://sub2api.74-48-78-17.nip.io/v1/...`, with the unified API key.
File diff suppressed because it is too large.