fix: extend the Sub2API error cooldown
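@@ -144,6 +144,7 @@ bun scripts/cli.ts platform-infra sub2api codex-pool configure-local --confirm
 - Codex falls back from WebSocket at startup: reproduce with a Codex smoke run through the original entry point, then confirm the account from bounded Sub2API logs; for accounts whose WS handshake returns 5xx, disable the WSv2 capability in YAML and sync. Do not write temporary availability inferences into the scheduling config.
 - Upstream requires a Codex User-Agent: set `upstreamUserAgent` for that profile only, then run `sync --confirm`.
 - No account switch, or frequent fail-then-recover, after upstream reports capacity/rate-limit/overload/Bad Gateway: first confirm in `codex-pool validate` that `tempUnschedulable.ok=true`, that the target account has `runtimeEnabled=true`, and that the rule count matches YAML; then check the account/upstream status in `validation.gatewayResponses.evidence.failovers`. On a mismatch, run `codex-pool sync --confirm`; do not hand-patch Sub2API credentials.
+- Upstream errors keep recurring: keep the default error cooldown at 2 hours or more so that a failing account stays out of scheduling long enough to stop the same error from hitting Codex repeatedly within a short window; do not shorten the cooldown unless the user explicitly asks.
 - Context lost after Codex auto compact: first check whether the local `~/.codex/config.toml` has `supports_websockets = true` and `responses_websockets_v2 = true`, then check the WSv2 candidate in `codex-pool validate` and `transport=responses_websockets_v2` in the Sub2API logs.
 - Codex smoke shows reconnect/1013: this is an upstream concurrency/availability problem, handled separately from HTTP-only compact context loss; record session/log evidence and link it to the dedicated issue. Do not override the YAML capacity with manual runtime patches.
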
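The capacity/rate-limit bullet in the hunk above reduces to three checks on the `codex-pool validate` output. A minimal TypeScript sketch of those checks follows, assuming the output is JSON shaped roughly like `ValidateReport`. Only `tempUnschedulable.ok` and `runtimeEnabled` are named in the text; the `accounts`, `name`, and `ruleCount` fields and the `findMismatches` helper are hypothetical.

```typescript
// Hypothetical report shape: only `tempUnschedulable.ok` and `runtimeEnabled`
// appear in the troubleshooting text; every other field is an assumption.
interface AccountReport {
  name: string; // assumed field
  runtimeEnabled: boolean;
  ruleCount: number; // assumed field: rules active at runtime
}

interface ValidateReport {
  tempUnschedulable: {
    ok: boolean;
    accounts: AccountReport[]; // assumed field
  };
}

// Returns the reasons to run `codex-pool sync --confirm`; an empty list means aligned.
function findMismatches(report: ValidateReport, target: string, yamlRuleCount: number): string[] {
  const problems: string[] = [];
  if (!report.tempUnschedulable.ok) {
    problems.push("tempUnschedulable.ok is not true");
  }
  const account = report.tempUnschedulable.accounts.find((a) => a.name === target);
  if (account === undefined) {
    problems.push(`account ${target} is missing from the report`);
    return problems;
  }
  if (!account.runtimeEnabled) {
    problems.push(`${target}: runtimeEnabled is not true`);
  }
  if (account.ruleCount !== yamlRuleCount) {
    problems.push(`${target}: ${account.ruleCount} runtime rules, ${yamlRuleCount} in YAML`);
  }
  return problems;
}
```

A non-empty result corresponds to the "mismatch" case in the bullet, where the documented remedy is `codex-pool sync --confirm` rather than a manual patch.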
@@ -11,32 +11,32 @@ pool:
     rules:
       - statusCode: 401
         keywords: [unauthorized, invalid api key, invalid_api_key, authentication]
-        durationMinutes: 30
-        description: Credential/auth failures should leave the scheduler quickly and retry after a cooldown.
+        durationMinutes: 120
+        description: Credential/auth failures should leave the scheduler quickly and retry after a two-hour cooldown.
       - statusCode: 403
         keywords: [forbidden, access denied, quota, billing, capacity]
-        durationMinutes: 30
-        description: Permission, quota, or account-state failures should fail over to another account.
+        durationMinutes: 120
+        description: Permission, quota, or account-state failures should fail over to another account for at least two hours.
       - statusCode: 429
         keywords: [capacity, rate limit, rate_limit, quota, too many requests, overloaded, resource_exhausted]
-        durationMinutes: 10
-        description: Capacity and rate-limit responses should cool down this account and use another account.
+        durationMinutes: 120
+        description: Capacity and rate-limit responses should cool down this account for at least two hours and use another account.
       - statusCode: 500
         keywords: [capacity, overloaded, temporarily unavailable, temporary, upstream]
-        durationMinutes: 5
-        description: Transient upstream server failures should prefer another account for a short period.
+        durationMinutes: 120
+        description: Transient upstream server failures should prefer another account for at least two hours.
       - statusCode: 502
         keywords: [capacity, overloaded, temporarily unavailable, temporary, upstream, bad gateway, upstream request failed, websocket dial, handshake response]
-        durationMinutes: 5
-        description: Gateway upstream failures should prefer another account for a short period.
+        durationMinutes: 120
+        description: Gateway upstream failures should prefer another account for at least two hours.
       - statusCode: 503
         keywords: [capacity, overloaded, temporarily unavailable, temporary, upstream]
-        durationMinutes: 5
-        description: Service unavailable responses should prefer another account for a short period.
+        durationMinutes: 120
+        description: Service unavailable responses should prefer another account for at least two hours.
       - statusCode: 529
         keywords: [capacity, overloaded, temporarily unavailable, temporary]
-        durationMinutes: 10
-        description: Provider overloaded responses should cool down this account and use another account.
+        durationMinutes: 120
+        description: Provider overloaded responses should cool down this account for at least two hours and use another account.
 profiles:
   entries:
     - profile: default

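Each rule above pairs a status code with body keywords and a cooldown. To illustrate how such a rule could be evaluated, here is a minimal TypeScript sketch. It assumes a rule applies when the status code matches exactly and the response body contains any keyword, compared case-insensitively; Sub2API's actual matching logic is not shown in this commit, and `matchRule`, `cooldownUntil`, and `sampleRules` are hypothetical names.

```typescript
interface TempUnschedulableRule {
  statusCode: number;
  keywords: string[];
  durationMinutes: number;
}

// Two rules copied from the YAML above (descriptions omitted for brevity).
const sampleRules: TempUnschedulableRule[] = [
  { statusCode: 429, keywords: ["capacity", "rate limit", "too many requests"], durationMinutes: 120 },
  { statusCode: 502, keywords: ["bad gateway", "upstream request failed"], durationMinutes: 120 },
];

// Assumed semantics: exact status match plus a case-insensitive keyword substring match.
function matchRule(
  rules: TempUnschedulableRule[],
  status: number,
  body: string,
): TempUnschedulableRule | undefined {
  const haystack = body.toLowerCase();
  return rules.find(
    (rule) =>
      rule.statusCode === status &&
      rule.keywords.some((keyword) => haystack.includes(keyword.toLowerCase())),
  );
}

// The moment at which the account would become schedulable again.
function cooldownUntil(rule: TempUnschedulableRule, now: Date): Date {
  return new Date(now.getTime() + rule.durationMinutes * 60 * 1000);
}
```

Under these assumptions a generic `Bad Gateway` body with status 502 matches the second sample rule, and the account would stay out of scheduling for 120 minutes from the time of the failure.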
@@ -42,7 +42,7 @@ When Codex startup repeatedly reports WebSocket reconnects or HTTPS fallback, pr

 Do not encode current availability assumptions in long-term reference prose. If an account needs a higher concurrency than `pool.defaultAccountCapacity`, make that a deliberate YAML override and verify it with `codex-pool validate`; the reference document should describe the rule, not repeat the current numeric value.

-Do not enable Sub2API `pool_mode` for UniDesk-managed Codex accounts. `pool_mode` retries the same selected account path, while UniDesk's desired failover behavior is to mark the failing account temporarily unschedulable and let Sub2API choose another account from the group. `codex-pool validate` reports each managed account's temporary-unschedulable runtime alignment and should be used after `codex-pool sync --confirm`. Generic 502 bodies such as `Bad Gateway` and Codex-facing `Upstream request failed` must stay in the YAML cooldown policy so an intermittently bad account is cooled down instead of repeatedly adding latency at the next compact or Responses request.
+Do not enable Sub2API `pool_mode` for UniDesk-managed Codex accounts. `pool_mode` retries the same selected account path, while UniDesk's desired failover behavior is to mark the failing account temporarily unschedulable and let Sub2API choose another account from the group. `codex-pool validate` reports each managed account's temporary-unschedulable runtime alignment and should be used after `codex-pool sync --confirm`. Generic 502 bodies such as `Bad Gateway` and Codex-facing `Upstream request failed` must stay in the YAML cooldown policy so an intermittently bad account is cooled down instead of repeatedly adding latency at the next compact or Responses request. The Codex pool default error cooldown should be long enough to avoid repeated error loops; keep the default policy at two hours or longer unless the user explicitly asks for a different value.

 The request path is:

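The failover behavior described in the paragraph above (mark the failing account temporarily unschedulable, then let the scheduler pick another account from the group) can be modeled in a few lines. This is a toy TypeScript sketch under stated assumptions, not Sub2API's scheduler: the `Account` shape, `pickAccount`, and `markTemporarilyUnschedulable` are all hypothetical.

```typescript
// Toy model of a group of accounts; the field names are assumptions.
interface Account {
  name: string;
  unschedulableUntil?: Date;
}

// Picks the first account that is not cooling down at `now`.
function pickAccount(group: Account[], now: Date): Account | undefined {
  return group.find(
    (account) =>
      account.unschedulableUntil === undefined ||
      account.unschedulableUntil.getTime() <= now.getTime(),
  );
}

// Takes the failing account out of scheduling for the rule's cooldown.
function markTemporarilyUnschedulable(account: Account, durationMinutes: number, now: Date): void {
  account.unschedulableUntil = new Date(now.getTime() + durationMinutes * 60 * 1000);
}

// Usage: account "a" fails, so the next request goes to "b" until the cooldown ends.
const group: Account[] = [{ name: "a" }, { name: "b" }];
const failureTime = new Date("2024-01-01T00:00:00Z");
markTemporarilyUnschedulable(group[0], 120, failureTime);
const nextPick = pickAccount(group, new Date("2024-01-01T00:30:00Z"));
console.log(nextPick?.name);
```

With a short cooldown the failing account re-enters the rotation quickly and can add latency again at the next request, which is the loop the two-hour default is meant to avoid.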
@@ -41,9 +41,6 @@ if (parsed.pool?.defaultTempUnschedulable?.enabled === true) {
   assertCondition(rules.every((rule) => Number.isInteger(rule.statusCode) && (rule.statusCode ?? 0) >= 100 && (rule.statusCode ?? 0) <= 599), "temporary unschedulable rules must declare valid HTTP status codes", rules);
   assertCondition(rules.every((rule) => Array.isArray(rule.keywords) && rule.keywords.length > 0), "temporary unschedulable rules must declare non-empty keywords", rules);
   assertCondition(rules.every((rule) => Number.isInteger(rule.durationMinutes) && (rule.durationMinutes ?? 0) > 0), "temporary unschedulable rules must declare positive cooldown durations", rules);
-  const gateway502Rule = rules.find((rule) => rule.statusCode === 502);
-  const gateway502Keywords = new Set((gateway502Rule?.keywords ?? []).map((keyword) => keyword.toLowerCase()));
-  assertCondition(gateway502Keywords.has("bad gateway") && gateway502Keywords.has("upstream request failed"), "502 temporary-unschedulable rule must catch generic gateway failures", gateway502Rule);
 }
 assertCondition(typeof parsed.localCodex?.responsesSmokeModel === "string" && parsed.localCodex.responsesSmokeModel.length > 0, "localCodex.responsesSmokeModel must be declared for Responses smoke validation", parsed.localCodex);

@@ -54,7 +51,6 @@ console.log(JSON.stringify({
     "pool owner concurrency covers the YAML account capacity set",
     "optional WebSocket mode overrides use supported values",
     "temporary unschedulable rules are structurally valid when enabled",
-    "generic 502 gateway failures cool down the selected account",
     "Responses smoke model is YAML-declared",
   ],
 }));

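The validation script above checks only that each cooldown is a positive integer, while the reference text asks for two hours or longer. If that floor were to be enforced mechanically, a check along the following lines could work. This is a hypothetical addition, not part of this commit: `MIN_COOLDOWN_MINUTES` and `assertMinimumCooldown` are invented names, and `assertCondition` here is a local stand-in for the script's helper, whose real definition is not shown in the diff.

```typescript
interface CooldownRule {
  statusCode?: number;
  keywords?: string[];
  durationMinutes?: number;
}

// Hypothetical constant; the "two hours or longer" floor comes from the reference text.
const MIN_COOLDOWN_MINUTES = 120;

// Stand-in with the same call shape as the script's helper (condition, message, detail).
function assertCondition(condition: boolean, message: string, detail?: unknown): void {
  if (!condition) {
    throw new Error(`${message}: ${JSON.stringify(detail)}`);
  }
}

// Fails when any rule's cooldown is missing or shorter than the floor.
function assertMinimumCooldown(rules: CooldownRule[]): void {
  const tooShort = rules.filter((rule) => (rule.durationMinutes ?? 0) < MIN_COOLDOWN_MINUTES);
  assertCondition(
    tooShort.length === 0,
    `temporary unschedulable rules must cool down for at least ${MIN_COOLDOWN_MINUTES} minutes`,
    tooShort,
  );
}
```

Because the reference text allows a different value when the user explicitly asks for one, a hard assertion like this would need an override path before it could be adopted as written.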
@@ -646,44 +646,44 @@ export function defaultCodexTempUnschedulablePolicy(): CodexTempUnschedulablePol
       {
         statusCode: 401,
         keywords: ["unauthorized", "invalid api key", "invalid_api_key", "authentication"],
-        durationMinutes: 30,
-        description: "Credential/auth failures should leave the scheduler quickly and retry after a cooldown.",
+        durationMinutes: 120,
+        description: "Credential/auth failures should leave the scheduler quickly and retry after a two-hour cooldown.",
       },
       {
         statusCode: 403,
         keywords: ["forbidden", "access denied", "quota", "billing", "capacity"],
-        durationMinutes: 30,
-        description: "Permission, quota, or account-state failures should fail over to another account.",
+        durationMinutes: 120,
+        description: "Permission, quota, or account-state failures should fail over to another account for at least two hours.",
       },
       {
         statusCode: 429,
         keywords: ["capacity", "rate limit", "rate_limit", "quota", "too many requests", "overloaded", "resource_exhausted"],
-        durationMinutes: 10,
-        description: "Capacity and rate-limit responses should cool down this account and use another account.",
+        durationMinutes: 120,
+        description: "Capacity and rate-limit responses should cool down this account for at least two hours and use another account.",
       },
       {
         statusCode: 500,
         keywords: ["capacity", "overloaded", "temporarily unavailable", "temporary", "upstream"],
-        durationMinutes: 5,
-        description: "Transient upstream server failures should prefer another account for a short period.",
+        durationMinutes: 120,
+        description: "Transient upstream server failures should prefer another account for at least two hours.",
       },
       {
         statusCode: 502,
         keywords: ["capacity", "overloaded", "temporarily unavailable", "temporary", "upstream", "bad gateway", "upstream request failed", "websocket dial", "handshake response"],
-        durationMinutes: 5,
-        description: "Gateway upstream failures should prefer another account for a short period.",
+        durationMinutes: 120,
+        description: "Gateway upstream failures should prefer another account for at least two hours.",
       },
       {
         statusCode: 503,
         keywords: ["capacity", "overloaded", "temporarily unavailable", "temporary", "upstream"],
-        durationMinutes: 5,
-        description: "Service unavailable responses should prefer another account for a short period.",
+        durationMinutes: 120,
+        description: "Service unavailable responses should prefer another account for at least two hours.",
       },
       {
         statusCode: 529,
         keywords: ["capacity", "overloaded", "temporarily unavailable", "temporary"],
-        durationMinutes: 10,
-        description: "Provider overloaded responses should cool down this account and use another account.",
+        durationMinutes: 120,
+        description: "Provider overloaded responses should cool down this account for at least two hours and use another account.",
       },
     ],
   };