docs: record sub2api codex pool long-run control path

Codex
2026-06-10 13:23:49 +00:00
parent 85e2450f5b
commit 2485591138
2 changed files with 6 additions and 0 deletions
+3
@@ -69,6 +69,8 @@ bun scripts/cli.ts platform-infra sub2api codex-pool validate
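`sync --confirm` logs in to the Sub2API admin, creates or updates the group, creates or updates the `unidesk-codex-*` accounts defined in YAML, creates or reuses the unified API key Secret, and deletes any `unidesk-codex-*` account that has been removed from YAML and has `extra.unidesk_managed=true`.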
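`sync --confirm` and `validate` may exceed a single SSH/runtime short-connection window. Keep using `bun scripts/cli.ts platform-infra sub2api codex-pool ...`, which lets the CLI submit the job on the G14 remote and short-poll its status; do not switch to a bare `trans G14:k3s script` that holds one long connection open while waiting for the full result. If you see `UNIDESK_SSH_RUNTIME_TIMEOUT`, first treat it as a control-plane visibility problem per the rules in `docs/reference/platform-infra.md`: fix the CLI/job/poll path or rerun the controlled command. Do not hand-patch Sub2API credentials or source code.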
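Do not enable Sub2API `pool_mode` on UniDesk-managed Codex accounts. The failover UniDesk expects is to temporarily mark the failed account unschedulable so that other accounts in the same group take over; `pool_mode` retries the same account path.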
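WebSocket v2 is an account capability set, not a scheduling pin. `openaiResponsesWebSocketsV2Mode` only declares that the account can carry the Codex Responses WSv2 path; `codex-pool validate` is required to see at least one `webSocketsV2.schedulableEnabled` account only when `localCodex.supportsWebSockets=true` / `localCodex.responsesWebSocketsV2=true`. Real availability is still determined by the direct Codex WSv2 probe, the Sub2API logs, and the Codex smoke test through the original entry point.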
@@ -149,6 +151,7 @@ bun scripts/cli.ts platform-infra sub2api codex-pool configure-local --confirm
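- Codex falls back from WebSocket at startup: reproduce with the Codex smoke test through the original entry point, then confirm the account from bounded Sub2API logs. For accounts showing WS handshake 4xx/5xx, `openai.websocket_account_select_failed`, or close-before-`response.completed`, disable the WSv2 capability in YAML and sync. If no WSv2-capable account remains, turn off `localCodex.supportsWebSockets` and `localCodex.responsesWebSocketsV2` together; do not encode an inference about temporary availability as scheduling configuration.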
- Upstream requires a Codex User-Agent: set `upstreamUserAgent` on that profile only, then run `sync --confirm`.
- No account switch, or frequent fail-then-recover, after the upstream reports capacity/rate-limit/overload/Bad Gateway/Gateway Timeout: first confirm that `codex-pool validate` reports `tempUnschedulable.ok=true`, that the target account has `runtimeEnabled=true`, and that the rule count matches YAML; then check the account/upstream status in `validation.gatewayResponses.evidence.failovers`. On a mismatch, run `codex-pool sync --confirm`; do not hand-patch Sub2API credentials.
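- `codex-pool sync --confirm` or `codex-pool validate` times out: first distinguish a CLI transport timeout from a Sub2API runtime failure. The controlled CLI should return remote job progress and a stdout/stderr tail; if the only symptom is the low-level `trans` 60s timeout, that is not grounds for concluding that Sub2API failover is broken. Switch to, or fix, the CLI's remote job/poll path, rerun, and use the final structured result as evidence.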
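- Codex reports weekly-limit, `less than 10% of your weekly limit left`, `Run /status for a breakdown`, or similar account-status/soft-quota notices, and an account switch is required: if the upstream returns them with an error status such as 403/429, put the stable body keywords in the matching rule under `pool.defaultTempUnschedulable`, run `codex-pool sync --confirm`, then use `codex-pool validate` to confirm that every managed account's runtime rules contain those keywords. If the text arrives as HTTP 200 success content, Sub2API currently cannot reclassify it as an account cooldown; do not write a YAML rule for 200, do not hot-patch Sub2API, and do not bypass sync. If necessary, file an issue for the upstream capability gap.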
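- Upstream 400/503 response body contains `invalid_encrypted_content`, `bad_response_status_code`, unsupported-model, `可用模型` (literally "available models"), `model_not_found`, `No available channel for model ...`, or a similar stable model-routing / Responses encrypted-content compatibility failure: put the stable body keywords in the matching 400 or 503 rule under `pool.defaultTempUnschedulable`, run `codex-pool sync --confirm`, then use `codex-pool validate` to confirm that the target account's runtime rule contains those keywords. Do not mask this error family by changing account membership, priority, capacity, loadFactor, WebSocket mode, or User-Agent.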
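- Upstream errors trigger repeatedly: default error cooldowns are tiered by severity. Transient problems can start at 10 minutes; gateway, service-unavailable, overload, and model-routing errors should be longer; authentication, permission, quota, account-status, and account-compatibility errors use the longest cooldown. Stable wrapper texts such as `invalid_encrypted_content`, unsupported-model, `Recovered upstream error ...`, `Bad Gateway`, `Gateway Timeout`, Cloudflare `524`, the Codex-facing `Upstream request failed`, `Unknown error`, `context deadline exceeded`, `context canceled`, `model_not_found`, `No available channel for model`, large-context `413`, and `openai_error` should all stay in the corresponding YAML cooldown policy, especially because on the ordinary `/responses` and compact paths an upstream compatibility error or 524 can surface to the client as 502/504 + `Unknown error`. Concrete values are authoritative only in YAML; after any change you must run `codex-pool sync --confirm` and `codex-pool validate`. For long-term criteria see `docs/reference/platform-infra.md`.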
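
The status-plus-keyword rules described in the bullets above can be pictured with a minimal TypeScript sketch. Everything here is an assumption made for illustration: the `TempUnschedulableRule` shape, the `matchRule` helper, plain substring matching, and the cooldown numbers are hypothetical and are not the Sub2API or UniDesk schema. Real rules and values live only in YAML under `pool.defaultTempUnschedulable`.

```typescript
// Hypothetical rule shape; the real Sub2API / UniDesk YAML schema may differ.
interface TempUnschedulableRule {
  statusCode: number; // upstream HTTP status the rule applies to
  keywords: string[]; // stable substrings expected in the response body
  cooldownMinutes: number; // how long the account stays unschedulable
}

// Return the first rule whose status matches and whose keyword list contains
// a substring of the body, or undefined when no rule applies.
function matchRule(
  rules: TempUnschedulableRule[],
  status: number,
  body: string,
): TempUnschedulableRule | undefined {
  // Success responses are never reclassified as account cooldowns.
  if (status >= 200 && status < 300) return undefined;
  for (const rule of rules) {
    if (rule.statusCode !== status) continue;
    // Assumption: case-sensitive substring match on the raw body.
    if (rule.keywords.some((k) => body.indexOf(k) !== -1)) return rule;
  }
  return undefined;
}

// Illustrative values only; they are not taken from any real YAML.
const exampleRules: TempUnschedulableRule[] = [
  { statusCode: 429, keywords: ["weekly limit"], cooldownMinutes: 60 },
  {
    statusCode: 503,
    keywords: ["model_not_found", "No available channel for model"],
    cooldownMinutes: 30,
  },
];
```

In this sketch a 429 body mentioning `weekly limit` selects the first rule, while the same text in an HTTP 200 body selects nothing, mirroring the guidance not to write YAML rules for 200 responses.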
+3
@@ -62,6 +62,8 @@ The request path is:
Adding, removing, exposing, validating, and configuring local Codex consumers are daily operations covered by `$unidesk-sub2api`. The development rule is that ordinary pool membership changes stay YAML-only and do not add code or CI/CD. Code changes are only appropriate when UniDesk needs to render or validate a Sub2API capability that already exists upstream, such as account-level WebSocket mode or per-account upstream User-Agent. If Sub2API itself does not support a desired behavior, do not magic-patch it through UniDesk scripts, Kubernetes hotfixes, local forks, or hidden compatibility paths; either leave the behavior unsupported or pursue it upstream as an explicit Sub2API feature.
`codex-pool sync --confirm` and `codex-pool validate` are runtime operations that may need more than one SSH short-connection window because they log in to Sub2API, reconcile accounts, inspect recent logs, and run gateway smoke requests. The formal entry remains the UniDesk CLI, which must use a submit-and-short-poll control shape or an equivalent remote job wrapper instead of one long `trans G14:k3s script` call. If these commands fail with `UNIDESK_SSH_RUNTIME_TIMEOUT` while the remote operation may still be running, treat it as a control-plane visibility gap first: improve or use the CLI's job/poll path, then rerun `sync` or `validate`. Do not replace it with raw `kubectl`, manual Sub2API admin API patches, repeated blind full loops, or Sub2API source modifications.
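
The submit-and-short-poll control shape described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the UniDesk CLI's implementation: `JobTransport`, `JobStatus`, `Outcome`, and `submitAndPoll` are hypothetical names, and the callbacks are synchronous stand-ins for what would be separate short connections in practice.

```typescript
// Hypothetical status returned by one short poll; not the real CLI schema.
type JobStatus =
  | { state: "running"; progress: string }
  | { state: "succeeded"; result: string }
  | { state: "failed"; stderrTail: string };

// Each method stands in for one short connection (hypothetical interface).
interface JobTransport {
  submit(): string; // starts the remote job and returns a job id
  poll(jobId: string): JobStatus; // one quick status check
  sleep(ms: number): void; // wait between polls
}

type Outcome =
  | { kind: "succeeded"; result: string }
  | { kind: "failed"; stderrTail: string }
  | { kind: "still-running"; jobId: string; lastProgress: string };

function submitAndPoll(
  transport: JobTransport,
  maxPolls: number,
  intervalMs: number,
): Outcome {
  const jobId = transport.submit();
  let lastProgress = "";
  for (let i = 0; i < maxPolls; i++) {
    const status = transport.poll(jobId);
    if (status.state === "succeeded") {
      return { kind: "succeeded", result: status.result };
    }
    if (status.state === "failed") {
      return { kind: "failed", stderrTail: status.stderrTail };
    }
    lastProgress = status.progress;
    // No need to wait after the final poll.
    if (i < maxPolls - 1) transport.sleep(intervalMs);
  }
  // Running out of polls is a visibility gap, not proof that the job failed.
  return { kind: "still-running", jobId, lastProgress };
}
```

The design point is the third outcome: exhausting the poll budget returns the job id and last known progress so the caller can resume polling, rather than reporting the operation as failed.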
After `codex-pool configure-local --confirm`, the default `~/.codex/config.toml` / `auth.json` pair must remain the unified Sub2API consumer and must not be reused as an upstream account profile. Keep every upstream source profile in suffixed files such as `config.toml.<profile>` / `auth.json.<profile>` and register it through YAML `profiles.entries`.
## Public FRP Boundary
@@ -78,6 +80,7 @@ Kubernetes readiness is not the same as pool availability:
- The FRP client deployment is currently a simple connector deployment and does not itself prove that master-local traffic reaches Sub2API.
- No scheduled `CronJob`, `ServiceMonitor`, or `PodMonitor` currently proves the full unified Codex API path.
- `platform-infra sub2api validate` and `platform-infra sub2api codex-pool validate` are on-demand checks. Operational usage is documented in `$unidesk-sub2api`; they are acceptable for deployment closeout, but they are not continuous monitoring. `codex-pool validate` must test both `GET /v1/models` and a small `POST /v1/responses` request, and the Responses smoke should report request id, selected/final account evidence, upstream failover count, and whether the validation succeeded only after failover. It should also summarize recent `/responses` and `/responses/compact` gateway failures separately so ordinary long streaming failures are not hidden behind compact-only evidence.
- Because `codex-pool validate` includes account alignment, recent-log inspection, and gateway smoke, timeout of the CLI transport is not valid negative evidence about Sub2API scheduling by itself. Closeout evidence must come from the final structured validation result or from an explicitly reported remote job failure with stdout/stderr tail, not from a single low-level `trans` timeout.
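
The closeout-evidence rule in the last bullet can be restated as a small decision function. This is a hedged sketch: the `Evidence` variants, the `Verdict` labels, and `closeoutVerdict` are hypothetical names chosen for illustration and do not reflect the CLI's real output format. Treating a failure report with an empty tail as inconclusive is also an assumption.

```typescript
// Hypothetical evidence shapes; field names are illustrative only.
type Evidence =
  | { source: "transport-timeout" }
  | { source: "remote-job-failure"; stderrTail: string }
  | { source: "structured-result"; ok: boolean };

type Verdict = "pass" | "fail" | "inconclusive";

function closeoutVerdict(evidence: Evidence): Verdict {
  switch (evidence.source) {
    case "structured-result":
      // The final structured validation result is authoritative.
      return evidence.ok ? "pass" : "fail";
    case "remote-job-failure":
      // Assumption: a failure only counts when it carries an output tail.
      return evidence.stderrTail.length > 0 ? "fail" : "inconclusive";
    case "transport-timeout":
      // A low-level timeout says nothing about Sub2API scheduling.
      return "inconclusive";
  }
}
```

An `inconclusive` verdict means the controlled command should be rerun through the job/poll path before anything is concluded about failover.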
When an automatic availability probe is added, it should be YAML-controlled and cover these layers without printing secrets: