Merge pull request #243 from pikasTech/fix/sub2api-ws-capability-238

Fix Sub2API Codex WebSocket capability misdeclaration
Lyon
2026-06-10 11:47:44 +08:00
committed by GitHub
5 changed files with 32 additions and 13 deletions
+6 -6
@@ -70,9 +70,9 @@ bun scripts/cli.ts platform-infra sub2api codex-pool validate
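Do not enable Sub2API `pool_mode` for UniDesk-managed Codex accounts. The failover UniDesk expects is to mark the failing account temporarily unschedulable so that other accounts in the same group take over; `pool_mode` retries the same account path instead.
WebSocket v2 is an account capability set, not a scheduling pin. `openaiResponsesWebSocketsV2Mode` only declares that the account can carry the Codex Responses WSv2 chain; `codex-pool validate` must see at least one `webSocketsV2.schedulableEnabled` account, and real availability is still judged by Codex smoke and runtime logs.
WebSocket v2 is an account capability set, not a scheduling pin. `openaiResponsesWebSocketsV2Mode` only declares that the account can carry the Codex Responses WSv2 chain; only when `localCodex.supportsWebSockets=true` / `localCodex.responsesWebSocketsV2=true` must `codex-pool validate` see at least one `webSocketsV2.schedulableEnabled` account, and real availability is still judged by the direct Codex WSv2 probe, Sub2API logs, and a Codex smoke through the original entry point.
When Codex startup repeatedly shows WebSocket reconnect, HTTPS fallback, or `websocket closed by server before response.completed`, or when Sub2API logs show `openai.websocket_proxy_failed` / upstream WS handshake 5xx, first use runtime evidence to identify the specific account and transport. If only one account's WSv2 handshake fails, prefer narrowing just that account's `openaiResponsesWebSocketsV2Mode` to `off` in YAML, then run `codex-pool sync --confirm`; do not change membership, priority, capacity, Secret, or code fallback along the way.
When Codex startup repeatedly shows WebSocket reconnect, HTTPS fallback, or `websocket closed by server before response.completed`, or when Sub2API logs show `openai.websocket_proxy_failed` / `openai.websocket_account_select_failed` / upstream WS handshake 4xx/5xx, first use runtime evidence to identify the specific account and transport. If an account's WSv2 handshake fails, prefer narrowing just that account's `openaiResponsesWebSocketsV2Mode` to `off` in YAML; if no direct Codex WSv2 probe passes at all, also narrow `localCodex.supportsWebSockets` and `localCodex.responsesWebSocketsV2` to `false`, then run `codex-pool sync --confirm`; do not change membership, priority, capacity, Secret, or code fallback along the way.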
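The narrowing rule above can be sketched as a pure function. This is only an illustration under stated assumptions: `ProfileEntry`, `LocalCodexFlags`, `convergeWsCapability`, and the `probePassed` map are hypothetical names, the real YAML schema carries more fields, and how a direct Codex WSv2 probe is actually run is outside the sketch.

```typescript
// Hypothetical shapes for illustration; not the repository's real types.
type WsMode = "off" | "ctx_pool" | "passthrough";

interface ProfileEntry {
  accountName: string;
  openaiResponsesWebSocketsV2Mode?: WsMode;
}

interface LocalCodexFlags {
  supportsWebSockets: boolean;
  responsesWebSocketsV2: boolean;
}

// probePassed maps accountName -> whether a direct Codex WSv2 probe succeeded
// (assumed input; missing accounts count as "not passed").
function convergeWsCapability(
  entries: ProfileEntry[],
  probePassed: Record<string, boolean>,
): { entries: ProfileEntry[]; localCodex: LocalCodexFlags } {
  const next: ProfileEntry[] = entries.map((entry) => {
    const declared = entry.openaiResponsesWebSocketsV2Mode ?? "off";
    // Only the WSv2 mode is narrowed; every other field is passed through untouched.
    if (declared !== "off" && probePassed[entry.accountName] !== true) {
      return { ...entry, openaiResponsesWebSocketsV2Mode: "off" as WsMode };
    }
    return entry;
  });
  const anyCapable = next.some(
    (entry) => (entry.openaiResponsesWebSocketsV2Mode ?? "off") !== "off",
  );
  // Both consumer flags move together: on only while some account stays WSv2-capable.
  return {
    entries: next,
    localCodex: { supportsWebSockets: anyCapable, responsesWebSocketsV2: anyCapable },
  };
}
```

The function only ever narrows declarations; it never turns WSv2 on for an account that was declared `off`, which keeps enabling a capability a deliberate YAML change.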
## Adding an upstream
@@ -119,7 +119,7 @@ bun scripts/cli.ts platform-infra sub2api codex-pool configure-local --confirm
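- Read the unified API key from `platform-infra/<apiKeySecretName>.<apiKeySecretKey>`.
- Back up the current `~/.codex/config.toml` and `~/.codex/auth.json` as `.<backupSuffix>` (default `.pre-sub2api`).
- Rewrite the default `~/.codex` consumer to point at `publicExposure.masterBaseUrl`; the provider name and wire API come from `localCodex`.
- Write the Codex Responses WebSocket v2 capability markers: the provider section must have `supports_websockets = true` and `[features]` must have `responses_websockets_v2 = true`; otherwise the chain after auto compact falls back to HTTP-only summary context.
- Write the Codex transport markers according to `localCodex`: `supports_websockets` and `[features] responses_websockets_v2` must be turned on or off together. Enable them only when at least one upstream passes a direct Codex WSv2 probe; otherwise stay on HTTP Responses so that the original entry point does not go through a futile WS reconnect on every run.
- Run one gateway verification with the unified key.
Anti-recursion rule: the default upstream in `profiles.entries` should point to `config.toml.pre-sub2api` / `auth.json.pre-sub2api`; do not feed default files that have already been rewritten as the Sub2API consumer back into the upstream pool.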
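As a rough illustration of the "on together, off together" transport markers, the sketch below renders the two TOML key lines from a `localCodex`-style policy. `LocalCodexTransport` and `renderTransportFlags` are hypothetical names and this is not the repository's `renderCodexLocalConsumerToml`; only the two rendered key strings are taken from the text and tests in this change.

```typescript
// Hypothetical policy shape mirroring the two localCodex booleans.
interface LocalCodexTransport {
  supportsWebSockets: boolean;
  responsesWebSocketsV2: boolean;
}

// Returns the key line for the provider section and the key line for [features].
// Section placement inside config.toml is left to the caller in this sketch.
function renderTransportFlags(policy: LocalCodexTransport): { provider: string; features: string } {
  if (policy.supportsWebSockets !== policy.responsesWebSocketsV2) {
    // Mixed states are rejected instead of being silently normalized.
    throw new Error("supportsWebSockets and responsesWebSocketsV2 must be changed together");
  }
  return {
    provider: `supports_websockets = ${policy.supportsWebSockets}`,
    features: `responses_websockets_v2 = ${policy.responsesWebSocketsV2}`,
  };
}
```

Throwing on a mixed state is a design choice of this sketch: a half-enabled consumer is the hardest state to diagnose from Codex reconnect logs, so failing early seems safer than guessing which flag was intended.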
@@ -141,14 +141,14 @@ bun scripts/cli.ts platform-infra sub2api codex-pool configure-local --confirm
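- pool key 401: run `codex-pool sync --confirm` to rebuild the binding between the Sub2API key and the k3s Secret, then run `codex-pool validate`.
- FRP unreachable: first check `masterFrps`, `sub2api-frpc`, and the public 401 probe in the `codex-pool expose --confirm` output; when lower-level evidence is needed, use only `trans G14:k3s` for bounded queries.
- default profile recursion: check whether the YAML default entry uses the `*.pre-sub2api` backup files; if necessary, restore the backups and rerun `configure-local --confirm`.
- Upstream needs WebSocket v2: set `openaiResponsesWebSocketsV2Mode: ctx_pool|passthrough` on that profile and run `sync --confirm`; treat it as a capability candidate, with capacity still governed by `capacity` in YAML or the default.
- Codex startup WebSocket fallback: reproduce with a Codex smoke through the original entry point, then confirm the account with bounded Sub2API logs; for accounts with WS handshake 5xx, turn off the YAML WSv2 capability and sync; do not write temporary availability inferences into scheduling configuration.
- Upstream needs WebSocket v2: run a direct Codex WSv2 probe first; only after it passes, set `openaiResponsesWebSocketsV2Mode: ctx_pool|passthrough` on that profile and run `sync --confirm`; treat it as a capability candidate, with capacity still governed by `capacity` in YAML or the default.
- Codex startup WebSocket fallback: reproduce with a Codex smoke through the original entry point, then confirm the account with bounded Sub2API logs; for accounts with WS handshake 4xx/5xx, `openai.websocket_account_select_failed`, or close-before-`response.completed`, turn off the YAML WSv2 capability and sync. If no WSv2-capable account remains, turn off `localCodex.supportsWebSockets` and `localCodex.responsesWebSocketsV2` together; do not write temporary availability inferences into scheduling configuration.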
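- Upstream requires a Codex User-Agent: set `upstreamUserAgent` on that profile only, then run `sync --confirm`.
- No account switch, or frequent fail-then-recover, after the upstream reports capacity/rate-limit/overload/Bad Gateway/Gateway Timeout: first confirm `tempUnschedulable.ok=true` in `codex-pool validate`, that the target account has `runtimeEnabled=true`, and that the rule count matches YAML; then check the account/upstream status in `validation.gatewayResponses.evidence.failovers`. On a mismatch, run `codex-pool sync --confirm`; do not hand-patch Sub2API credentials.
- Codex reports weekly-limit, `less than 10% of your weekly limit left`, `Run /status for a breakdown`, or similar account-status/soft-quota messages, and an account switch is wanted: put the stable body keywords into the 403 and 429 rules of `pool.defaultTempUnschedulable`, run `codex-pool sync --confirm`, then use `codex-pool validate` to confirm that every managed account's runtime 403/429 rules contain those keywords. Sub2API temporary-unschedulable rules match on HTTP status + body keyword; if the message arrives as HTTP 200 success content, file a separate issue for a response-classification capability, because declaring a YAML cooldown rule alone cannot solve it.
- Upstream 503 response body contains `model_not_found`, `No available channel for model ...`, or similar stable model-routing failure text: put the stable body keywords into the 503 rule of `pool.defaultTempUnschedulable`, run `codex-pool sync --confirm`, then use `codex-pool validate` to confirm that the target account's runtime 503 rule contains those keywords; do not mask this error family by changing account membership, priority, capacity, loadFactor, WebSocket mode, or User-Agent.
- Upstream errors keep recurring: default error cooldowns are tiered by severity; transient problems can start at 10 minutes, gateway/service-unavailable/overload/model-routing errors should be longer, and authentication/permission/quota/account-status errors use the longest cooldown. Stable wrapper texts such as `Recovered upstream error ...`, `Bad Gateway`, `Gateway Timeout`, Cloudflare `524`, Codex-facing `Upstream request failed`, `Unknown error`, `context deadline exceeded`, `context canceled`, `model_not_found`, `No available channel for model`, large-context `413`, and `openai_error` should all stay in the YAML cooldown policy. Concrete values are defined only by YAML; after any change, run `codex-pool sync --confirm` and `codex-pool validate`. For long-term rulings, see `docs/reference/platform-infra.md`.
- Context lost after Codex auto compact: first confirm that the local `~/.codex/config.toml` has `supports_websockets = true` and `responses_websockets_v2 = true`, then check the WSv2 candidate in `codex-pool validate` and `transport=responses_websockets_v2` in the Sub2API logs.
- Context lost after Codex auto compact: first confirm whether YAML `localCodex` declares WSv2 enabled; if it does, confirm that the local `~/.codex/config.toml` has `supports_websockets = true` and `responses_websockets_v2 = true`, then check the WSv2 candidate in `codex-pool validate` and `transport=responses_websockets_v2` in the Sub2API logs. If YAML currently disables WSv2, troubleshoot HTTP Responses stability instead and do not treat the old WS criteria as an acceptance requirement.
- Codex smoke shows reconnect/1013: this is an upstream concurrency/availability problem and is handled separately from HTTP-only compact context loss; record session/log evidence and link the dedicated issue; do not override YAML capacity with runtime hand-patches.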
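The "HTTP status + body keyword" matching that several of these items rely on can be pictured with a small sketch. The `CooldownRule` shape and `matchCooldownRule` are hypothetical and do not reflect the real `pool.defaultTempUnschedulable` schema or Sub2API's implementation; case-insensitive substring matching is also an assumption made here for illustration.

```typescript
// Hypothetical rule shape; field names are illustrative only.
interface CooldownRule {
  status: number; // HTTP status the rule applies to
  keywords: string[]; // stable substrings expected in the response body
  cooldownMinutes: number; // example value only; real values live in YAML
}

// Returns the first rule whose status equals the response status and whose
// keyword list has at least one entry contained in the body.
function matchCooldownRule(
  rules: CooldownRule[],
  status: number,
  body: string,
): CooldownRule | undefined {
  const haystack = body.toLowerCase();
  return rules.find(
    (rule) =>
      rule.status === status &&
      rule.keywords.some((keyword) => haystack.includes(keyword.toLowerCase())),
  );
}
```

Because the status must match before any keyword is considered, a soft-quota message delivered inside an HTTP 200 body never triggers a 403 or 429 rule under this model, which is why that case needs a separate response-classification capability rather than another keyword.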
## Prohibited actions
@@ -56,7 +56,7 @@ profiles:
accountName: unidesk-codex-hy
configFile: config.toml.HY
authFile: auth.json.HY
openaiResponsesWebSocketsV2Mode: passthrough
openaiResponsesWebSocketsV2Mode: off
capacity: 10
loadFactor: 10
priority: 1
@@ -107,6 +107,6 @@ localCodex:
backupSuffix: pre-sub2api
providerName: OpenAI
wireApi: responses
supportsWebSockets: true
responsesWebSocketsV2: true
supportsWebSockets: false
responsesWebSocketsV2: false
responsesSmokeModel: gpt-5.5
+3 -3
@@ -37,11 +37,11 @@
- `profiles.entries[].openaiResponsesWebSocketsV2Mode` is the account-level Responses WebSocket v2 switch for OpenAI-compatible upstreams that require WebSocket transport. Allowed values are `off`, `ctx_pool`, and `passthrough`; omit the field unless that upstream needs it.
- `profiles.entries[].upstreamUserAgent` is an optional account-level upstream request User-Agent override. Use it only for upstreams that require a Codex CLI compatible User-Agent; keep the value YAML-controlled and newline-free.
- `publicExposure` controls the optional FRP bridge from master server to the G14 ClusterIP service.
- `localCodex` controls how the master server's current `~/.codex` consumer files are backed up and rewritten. Codex consumers using Sub2API must keep `supportsWebSockets` and `responsesWebSocketsV2` enabled so compacted long sessions can continue through the Responses WebSocket v2 response chain instead of falling back to HTTP-only summary context. `localCodex.responsesSmokeModel` is the YAML-declared model used by `codex-pool validate` for the lightweight `POST /v1/responses` smoke.
- `localCodex` controls how the master server's current `~/.codex` consumer files are backed up and rewritten. Keep `supportsWebSockets` and `responsesWebSocketsV2` in the same state, and enable them only when at least one YAML-managed account has a current direct Codex WSv2 smoke that passes. If no upstream profile can sustain Responses WSv2, the honest long-term state is `false/false` so Codex uses HTTP Responses directly instead of repeatedly reconnecting before `response.completed`. `localCodex.responsesSmokeModel` is the YAML-declared model used by `codex-pool validate` for the lightweight `POST /v1/responses` smoke.
Enable account-level WebSocket v2 only for upstream profiles that have passed a direct Codex WSv2 probe. Treat this as a YAML-declared capability set, not a hard scheduling pin to one profile; `codex-pool validate` must show at least one current `webSocketsV2.schedulableEnabled` account, and runtime smoke remains the availability proof. The same validation reports each managed account's runtime WebSocket v2 mode and whether it matches YAML, so stale `ctx_pool` settings cannot silently keep routing Codex WS sessions to an upstream that closes with `no available account`, WS handshake 5xx, or before `response.completed`.
Enable account-level WebSocket v2 only for upstream profiles that have passed a direct Codex WSv2 probe. Treat this as a YAML-declared capability set, not a hard scheduling pin to one profile; if `localCodex` enables WebSocket transport, `codex-pool validate` must show at least one current `webSocketsV2.schedulableEnabled` account, and runtime smoke remains the availability proof. The same validation reports each managed account's runtime WebSocket v2 mode and whether it matches YAML, so stale `ctx_pool` / `passthrough` settings cannot silently keep routing Codex WS sessions to an upstream that closes with `no available account`, WS handshake 5xx/4xx, or before `response.completed`.
When Codex startup repeatedly reports WebSocket reconnects or HTTPS fallback, preserve membership, priority, capacity, load factor, and other routing policy until runtime logs identify the failing account and transport. If bounded Sub2API logs show repeated `openai.websocket_proxy_failed` or upstream WS handshake 5xx for one account, remove only that account from the WSv2 capability set in YAML, run `codex-pool sync --confirm`, and prove the result with Codex smoke plus `codex-pool validate`.
When Codex startup repeatedly reports WebSocket reconnects or HTTPS fallback, preserve membership, priority, capacity, load factor, and other routing policy until runtime logs identify the failing account and transport. If bounded Sub2API logs show repeated `openai.websocket_proxy_failed`, `openai.websocket_account_select_failed`, upstream WS handshake 4xx/5xx, or repeated close-before-`response.completed` for the only WS-capable account, remove that account from the WSv2 capability set in YAML; if the resulting capability set is empty, also turn off the `localCodex` WS feature flags. Then run `codex-pool sync --confirm`, `codex-pool validate`, and prove the result with a Codex smoke that no longer emits reconnects.
Do not encode current availability assumptions in long-term reference prose. If an account needs a higher concurrency or load factor than the pool default, make that a deliberate YAML override and verify it with `codex-pool validate`; the reference document should describe the rule, not repeat the current numeric value.
@@ -49,10 +49,20 @@ assertCondition(fresh.includes("[features]"), "fresh TOML must create the featur
assertCondition(fresh.includes("supports_websockets = true"), "fresh TOML must enable WebSocket transport", fresh);
assertCondition(fresh.includes("responses_websockets_v2 = true"), "fresh TOML must enable Responses WebSocket v2", fresh);
const disabled = renderCodexLocalConsumerToml(existing, {
...baseOptions,
supportsWebSockets: false,
responsesWebSocketsV2: false,
});
assertCondition(disabled.includes("supports_websockets = false"), "disabled localCodex policy must render provider WebSocket transport off", disabled);
assertCondition(disabled.includes("responses_websockets_v2 = false"), "disabled localCodex policy must render Responses WebSocket v2 off", disabled);
console.log(JSON.stringify({
ok: true,
checks: [
"existing Codex TOML is upgraded to the Sub2API WSv2 consumer settings",
"fresh Codex TOML creates provider and feature sections with WSv2 enabled",
"disabled localCodex WebSocket policy renders both consumer flags off",
],
}));
@@ -17,7 +17,7 @@ const parsed = Bun.YAML.parse(readFileSync(configPath, "utf8")) as {
};
};
profiles?: { entries?: Array<{ profile?: string; accountName?: string; capacity?: number; loadFactor?: number; openaiResponsesWebSocketsV2Mode?: string | null }> };
localCodex?: { responsesSmokeModel?: string };
localCodex?: { supportsWebSockets?: boolean; responsesWebSocketsV2?: boolean; responsesSmokeModel?: string };
};
const entries = parsed.profiles?.entries ?? [];
@@ -26,6 +26,8 @@ const defaultCapacity = parsed.pool?.defaultAccountCapacity ?? 0;
const defaultLoadFactor = parsed.pool?.defaultAccountLoadFactor ?? 0;
const desiredCapacity = entries.reduce((total, entry) => total + (entry.capacity ?? defaultCapacity), 0);
const allowedWebSocketModes = new Set(["off", "ctx_pool", "passthrough"]);
const wsEnabledEntries = entries.filter((entry) => entry.openaiResponsesWebSocketsV2Mode && entry.openaiResponsesWebSocketsV2Mode !== "off");
const localWsEnabled = parsed.localCodex?.supportsWebSockets === true || parsed.localCodex?.responsesWebSocketsV2 === true;
assertCondition(entries.length > 0, "Codex pool must declare YAML-managed profile entries", parsed.profiles);
assertCondition(Number.isInteger(defaultCapacity) && defaultCapacity > 0, "defaultAccountCapacity must be a positive integer", parsed.pool);
@@ -39,6 +41,12 @@ assertCondition(
"profile WebSocket mode overrides must use supported values when declared",
entries,
);
assertCondition(parsed.localCodex?.supportsWebSockets === parsed.localCodex?.responsesWebSocketsV2, "local Codex WebSocket feature flags must be changed together", parsed.localCodex);
if (localWsEnabled) {
assertCondition(wsEnabledEntries.length > 0, "local Codex WebSocket transport must not be enabled without at least one YAML WSv2-capable account", { localCodex: parsed.localCodex, entries });
} else {
assertCondition(wsEnabledEntries.length === 0, "local Codex WebSocket transport disabled means all account WSv2 capability declarations must be off or omitted", { localCodex: parsed.localCodex, wsEnabledEntries });
}
assertCondition((parsed.pool?.minOwnerConcurrency ?? 0) >= desiredCapacity, "pool owner concurrency must not bottleneck the declared account capacity set", { minOwnerConcurrency: parsed.pool?.minOwnerConcurrency, desiredCapacity });
if (parsed.pool?.defaultTempUnschedulable?.enabled === true) {
assertCondition(rules.length > 0, "enabled temporary unschedulable policy must declare rules", parsed.pool?.defaultTempUnschedulable);
@@ -86,6 +94,7 @@ console.log(JSON.stringify({
"pool owner concurrency covers the YAML account capacity set",
"profile load factor overrides are YAML-controlled positive integers",
"optional WebSocket mode overrides use supported values",
"local Codex WebSocket transport is consistent with YAML-declared WSv2-capable accounts",
"temporary unschedulable rules are structurally valid when enabled",
"generic recovered upstream error wrappers are caught by cooldown rules",
"large-context upstream failures are caught by the 413 cooldown rule",