fix: set Sub2API error cooldowns by severity

Codex
2026-06-09 10:27:20 +00:00
parent 7dd7b893ed
commit 1120ed286e
4 changed files with 22 additions and 22 deletions
+1 -1
@@ -144,7 +144,7 @@ bun scripts/cli.ts platform-infra sub2api codex-pool configure-local --confirm
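 - Codex falls back from WebSocket at startup: reproduce with a Codex smoke test through the original entry point, then confirm the account from bounded Sub2API logs; for accounts whose WS handshake returns 5xx, disable the WSv2 capability in YAML and sync. Do not encode temporary availability inferences as scheduling configuration.
 - Upstream requires a Codex User-Agent: set `upstreamUserAgent` only on that profile, then run `sync --confirm`.
 - No account switch, or frequent fail-then-recover, after upstream reports capacity/rate-limit/overload/Bad Gateway: first confirm that `codex-pool validate` reports `tempUnschedulable.ok=true`, that the target account has `runtimeEnabled=true`, and that the rule count matches YAML; then check the account/upstream status in `validation.gatewayResponses.evidence.failovers`. On a mismatch, run `codex-pool sync --confirm`; do not hand-patch Sub2API credentials.
-- Upstream errors keep recurring: keep the default error cooldown at 2 hours or longer so the failing account stays out of scheduling long enough that the same error does not repeatedly affect Codex within a short window; do not shorten the cooldown unless the user explicitly asks.
+- Upstream errors keep recurring: the default error cooldown is tiered by severity; transient problems can start at 10 minutes, gateway/service-unavailable/overload failures should cool down longer, and authentication/permission/quota/account-state failures use the longest cooldown. Exact values come only from YAML; after any change, run `codex-pool sync --confirm` and `codex-pool validate`.
 - Context lost after Codex auto compact: first check whether the local `~/.codex/config.toml` has `supports_websockets = true` and `responses_websockets_v2 = true`, then look at the WSv2 candidate in `codex-pool validate` and at `transport=responses_websockets_v2` in the Sub2API logs.
 - Codex smoke shows reconnect/1013: this is an upstream concurrency/availability problem and is handled separately from HTTP-only compact context loss; record session/log evidence and link the dedicated issue. Do not override YAML capacity with manual runtime patches.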
+10 -10
@@ -19,24 +19,24 @@ pool:
 description: Permission, quota, or account-state failures should fail over to another account for at least two hours.
 - statusCode: 429
 keywords: [capacity, rate limit, rate_limit, quota, too many requests, overloaded, resource_exhausted]
-durationMinutes: 120
-description: Capacity and rate-limit responses should cool down this account for at least two hours and use another account.
+durationMinutes: 10
+description: Capacity and rate-limit responses are often temporary; start with a ten-minute cooldown and use another account.
 - statusCode: 500
 keywords: [capacity, overloaded, temporarily unavailable, temporary, upstream]
-durationMinutes: 120
-description: Transient upstream server failures should prefer another account for at least two hours.
+durationMinutes: 10
+description: Transient upstream server failures should start with a ten-minute cooldown and prefer another account.
 - statusCode: 502
 keywords: [capacity, overloaded, temporarily unavailable, temporary, upstream, bad gateway, upstream request failed, websocket dial, handshake response]
-durationMinutes: 120
-description: Gateway upstream failures should prefer another account for at least two hours.
+durationMinutes: 30
+description: Gateway upstream failures are more disruptive than transient capacity signals and should cool down longer.
 - statusCode: 503
 keywords: [capacity, overloaded, temporarily unavailable, temporary, upstream]
-durationMinutes: 120
-description: Service unavailable responses should prefer another account for at least two hours.
+durationMinutes: 30
+description: Service unavailable responses should cool down longer than one-off transient failures.
 - statusCode: 529
 keywords: [capacity, overloaded, temporarily unavailable, temporary]
-durationMinutes: 120
-description: Provider overloaded responses should cool down this account for at least two hours and use another account.
+durationMinutes: 30
+description: Provider overloaded responses should cool down longer than generic transient failures and use another account.
 profiles:
 entries:
 - profile: default
+1 -1
@@ -42,7 +42,7 @@ When Codex startup repeatedly reports WebSocket reconnects or HTTPS fallback, pr
 Do not encode current availability assumptions in long-term reference prose. If an account needs a higher concurrency than `pool.defaultAccountCapacity`, make that a deliberate YAML override and verify it with `codex-pool validate`; the reference document should describe the rule, not repeat the current numeric value.
-Do not enable Sub2API `pool_mode` for UniDesk-managed Codex accounts. `pool_mode` retries the same selected account path, while UniDesk's desired failover behavior is to mark the failing account temporarily unschedulable and let Sub2API choose another account from the group. `codex-pool validate` reports each managed account's temporary-unschedulable runtime alignment and should be used after `codex-pool sync --confirm`. Generic 502 bodies such as `Bad Gateway` and Codex-facing `Upstream request failed` must stay in the YAML cooldown policy so an intermittently bad account is cooled down instead of repeatedly adding latency at the next compact or Responses request. The Codex pool default error cooldown should be long enough to avoid repeated error loops; keep the default policy at two hours or longer unless the user explicitly asks for a different value.
+Do not enable Sub2API `pool_mode` for UniDesk-managed Codex accounts. `pool_mode` retries the same selected account path, while UniDesk's desired failover behavior is to mark the failing account temporarily unschedulable and let Sub2API choose another account from the group. `codex-pool validate` reports each managed account's temporary-unschedulable runtime alignment and should be used after `codex-pool sync --confirm`. Generic 502 bodies such as `Bad Gateway` and Codex-facing `Upstream request failed` must stay in the YAML cooldown policy so an intermittently bad account is cooled down instead of repeatedly adding latency at the next compact or Responses request. The Codex pool default error cooldown is severity-tiered: temporary signals can start at ten minutes, gateway/service/overload failures should cool down longer, and credential, permission, quota, or account-state failures should use the longest cooldown. Exact current values belong in YAML and runtime validation output.
 The request path is:
+10 -10
@@ -658,32 +658,32 @@ export function defaultCodexTempUnschedulablePolicy(): CodexTempUnschedulablePol
 {
 statusCode: 429,
 keywords: ["capacity", "rate limit", "rate_limit", "quota", "too many requests", "overloaded", "resource_exhausted"],
-durationMinutes: 120,
-description: "Capacity and rate-limit responses should cool down this account for at least two hours and use another account.",
+durationMinutes: 10,
+description: "Capacity and rate-limit responses are often temporary; start with a ten-minute cooldown and use another account.",
 },
 {
 statusCode: 500,
 keywords: ["capacity", "overloaded", "temporarily unavailable", "temporary", "upstream"],
-durationMinutes: 120,
-description: "Transient upstream server failures should prefer another account for at least two hours.",
+durationMinutes: 10,
+description: "Transient upstream server failures should start with a ten-minute cooldown and prefer another account.",
 },
 {
 statusCode: 502,
 keywords: ["capacity", "overloaded", "temporarily unavailable", "temporary", "upstream", "bad gateway", "upstream request failed", "websocket dial", "handshake response"],
-durationMinutes: 120,
-description: "Gateway upstream failures should prefer another account for at least two hours.",
+durationMinutes: 30,
+description: "Gateway upstream failures are more disruptive than transient capacity signals and should cool down longer.",
 },
 {
 statusCode: 503,
 keywords: ["capacity", "overloaded", "temporarily unavailable", "temporary", "upstream"],
-durationMinutes: 120,
-description: "Service unavailable responses should prefer another account for at least two hours.",
+durationMinutes: 30,
+description: "Service unavailable responses should cool down longer than one-off transient failures.",
 },
 {
 statusCode: 529,
 keywords: ["capacity", "overloaded", "temporarily unavailable", "temporary"],
-durationMinutes: 120,
-description: "Provider overloaded responses should cool down this account for at least two hours and use another account.",
+durationMinutes: 30,
+description: "Provider overloaded responses should cool down longer than generic transient failures and use another account.",
 },
 ],
 };
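
To make the tiering in this commit easier to review, here is a minimal sketch of how a rule table of this shape could be consulted. The `CooldownRule` type and the `pickCooldownMinutes` helper are hypothetical names introduced only for illustration, and the matching semantics (exact status code plus a case-insensitive keyword hit in the response body) are an assumption; Sub2API's real matching logic is not shown in this diff. The durations are the new values from the hunks above.

```typescript
// Hypothetical rule shape for illustration; not the repository's actual type.
interface CooldownRule {
  statusCode: number;
  keywords: string[];
  durationMinutes: number;
}

// Durations taken from the new side of the diff above.
const rules: CooldownRule[] = [
  { statusCode: 429, keywords: ["capacity", "rate limit", "rate_limit", "quota", "too many requests", "overloaded", "resource_exhausted"], durationMinutes: 10 },
  { statusCode: 500, keywords: ["capacity", "overloaded", "temporarily unavailable", "temporary", "upstream"], durationMinutes: 10 },
  { statusCode: 502, keywords: ["capacity", "overloaded", "temporarily unavailable", "temporary", "upstream", "bad gateway", "upstream request failed", "websocket dial", "handshake response"], durationMinutes: 30 },
  { statusCode: 503, keywords: ["capacity", "overloaded", "temporarily unavailable", "temporary", "upstream"], durationMinutes: 30 },
  { statusCode: 529, keywords: ["capacity", "overloaded", "temporarily unavailable", "temporary"], durationMinutes: 30 },
];

// Assumed semantics: the status code must match exactly and the body must
// contain at least one keyword, compared case-insensitively.
function pickCooldownMinutes(statusCode: number, body: string): number | undefined {
  const text = body.toLowerCase();
  const rule = rules.find(
    (r) => r.statusCode === statusCode && r.keywords.some((k) => text.includes(k)),
  );
  return rule?.durationMinutes;
}

console.log(pickCooldownMinutes(502, "Bad Gateway"));       // 30
console.log(pickCooldownMinutes(429, "Too Many Requests")); // 10
console.log(pickCooldownMinutes(404, "Not Found"));         // undefined
```

Under these assumptions, a generic `Bad Gateway` body lands in the 30-minute tier while a rate-limit response lands in the 10-minute tier, which mirrors the severity ordering described in the commit. The authoritative values remain the YAML policy and the output of `codex-pool validate`.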