diff --git a/.agents/skills/unidesk-sub2api/SKILL.md b/.agents/skills/unidesk-sub2api/SKILL.md
index fb684c48..6468a1fd 100644
--- a/.agents/skills/unidesk-sub2api/SKILL.md
+++ b/.agents/skills/unidesk-sub2api/SKILL.md
@@ -145,7 +145,7 @@ bun scripts/cli.ts platform-infra sub2api codex-pool configure-local --confirm
 - Codex 启动 WebSocket 回退:用原入口 Codex smoke 复现,再用 bounded Sub2API 日志确认 account;对 WS handshake 5xx 的账号关闭 YAML WSv2 能力后同步,不把临时可用性推断写成调度配置。
 - 上游要求 Codex User-Agent:只给该 profile 配 `upstreamUserAgent`,跑 `sync --confirm`。
 - 上游报 capacity/rate-limit/overload/Bad Gateway 后没有切号或频繁先失败再恢复:先确认 `codex-pool validate` 里 `tempUnschedulable.ok=true` 且目标 account `runtimeEnabled=true`、规则数符合 YAML;再看 `validation.gatewayResponses.evidence.failovers` 的 account/upstream status。若 mismatch,跑 `codex-pool sync --confirm`,不要手工 patch Sub2API credentials。
-- 上游错误反复触发:默认错误冷却按严重程度分层;临时问题可从 10 分钟起步,网关/服务不可用/过载类应更长,认证/权限/配额/账号状态类使用最长冷却。具体数值只以 YAML 为准,修改后必须 `codex-pool sync --confirm` 和 `codex-pool validate`。
+- 上游错误反复触发:默认错误冷却按严重程度分层;临时问题可从 10 分钟起步,网关/服务不可用/过载类应更长,认证/权限/配额/账号状态类使用最长冷却。`Recovered upstream error ...`、`Bad Gateway` 和 Codex-facing `Upstream request failed` 这类通用包装文案都应留在 YAML 冷却政策里。具体数值只以 YAML 为准,修改后必须 `codex-pool sync --confirm` 和 `codex-pool validate`。
 - Codex auto compact 后丢上下文:先确认本机 `~/.codex/config.toml` 是否有 `supports_websockets = true` 和 `responses_websockets_v2 = true`,再看 `codex-pool validate` 的 WSv2 candidate 和 Sub2API 日志里的 `transport=responses_websockets_v2`。
 - Codex smoke 有 reconnect/1013:这是上游并发/可用性问题,和 HTTP-only compact context-loss 分开处理;记录 session/log 证据并关联专项 issue,不要用运行时手补覆盖 YAML 容量。
 
diff --git a/config/platform-infra/sub2api-codex-pool.yaml b/config/platform-infra/sub2api-codex-pool.yaml
index e4e64774..27e569be 100644
--- a/config/platform-infra/sub2api-codex-pool.yaml
+++ b/config/platform-infra/sub2api-codex-pool.yaml
@@ -11,31 +11,31 @@ pool:
     enabled: true
     rules:
       - statusCode: 401
-        keywords: [unauthorized, invalid api key, invalid_api_key, authentication]
+        keywords: [unauthorized, invalid api key, invalid_api_key, authentication, recovered upstream error]
         durationMinutes: 120
-        description: Credential/auth failures should leave the scheduler quickly and retry after a two-hour cooldown.
+        description: Credential/auth failures should use the longest cooldown.
       - statusCode: 403
-        keywords: [forbidden, access denied, quota, billing, capacity]
+        keywords: [forbidden, access denied, quota, billing, capacity, recovered upstream error]
         durationMinutes: 120
-        description: Permission, quota, or account-state failures should fail over to another account for at least two hours.
+        description: Permission, quota, or account-state failures should use the longest cooldown.
       - statusCode: 429
-        keywords: [capacity, rate limit, rate_limit, quota, too many requests, overloaded, resource_exhausted]
+        keywords: [capacity, rate limit, rate_limit, quota, too many requests, overloaded, resource_exhausted, recovered upstream error]
         durationMinutes: 10
         description: Capacity and rate-limit responses are often temporary; start with a ten-minute cooldown and use another account.
       - statusCode: 500
-        keywords: [capacity, overloaded, temporarily unavailable, temporary, upstream]
+        keywords: [capacity, overloaded, temporarily unavailable, temporary, upstream, recovered upstream error]
         durationMinutes: 10
         description: Transient upstream server failures should start with a ten-minute cooldown and prefer another account.
       - statusCode: 502
-        keywords: [capacity, overloaded, temporarily unavailable, temporary, upstream, bad gateway, upstream request failed, websocket dial, handshake response]
+        keywords: [capacity, overloaded, temporarily unavailable, temporary, upstream, bad gateway, upstream request failed, websocket dial, handshake response, recovered upstream error]
         durationMinutes: 30
-        description: Gateway upstream failures are more disruptive than transient capacity signals and should cool down longer.
+        description: Gateway upstream failures, including recovered upstream error wrappers, should cool down longer.
       - statusCode: 503
-        keywords: [capacity, overloaded, temporarily unavailable, temporary, upstream]
+        keywords: [capacity, overloaded, temporarily unavailable, temporary, upstream, recovered upstream error]
         durationMinutes: 30
         description: Service unavailable responses should cool down longer than one-off transient failures.
       - statusCode: 529
-        keywords: [capacity, overloaded, temporarily unavailable, temporary]
+        keywords: [capacity, overloaded, temporarily unavailable, temporary, recovered upstream error]
         durationMinutes: 30
         description: Provider overloaded responses should cool down longer than generic transient failures and use another account.
 profiles:
diff --git a/docs/reference/platform-infra.md b/docs/reference/platform-infra.md
index ec7e98e2..7dd26d98 100644
--- a/docs/reference/platform-infra.md
+++ b/docs/reference/platform-infra.md
@@ -43,7 +43,7 @@ When Codex startup repeatedly reports WebSocket reconnects or HTTPS fallback, pr
 
 Do not encode current availability assumptions in long-term reference prose. If an account needs a higher concurrency or load factor than the pool default, make that a deliberate YAML override and verify it with `codex-pool validate`; the reference document should describe the rule, not repeat the current numeric value.
 
-Do not enable Sub2API `pool_mode` for UniDesk-managed Codex accounts. `pool_mode` retries the same selected account path, while UniDesk's desired failover behavior is to mark the failing account temporarily unschedulable and let Sub2API choose another account from the group. `codex-pool validate` reports each managed account's temporary-unschedulable runtime alignment and should be used after `codex-pool sync --confirm`. Generic 502 bodies such as `Bad Gateway` and Codex-facing `Upstream request failed` must stay in the YAML cooldown policy so an intermittently bad account is cooled down instead of repeatedly adding latency at the next compact or Responses request. The Codex pool default error cooldown is severity-tiered: temporary signals can start at ten minutes, gateway/service/overload failures should cool down longer, and credential, permission, quota, or account-state failures should use the longest cooldown. Exact current values belong in YAML and runtime validation output.
+Do not enable Sub2API `pool_mode` for UniDesk-managed Codex accounts. `pool_mode` retries the same selected account path, while UniDesk's desired failover behavior is to mark the failing account temporarily unschedulable and let Sub2API choose another account from the group. `codex-pool validate` reports each managed account's temporary-unschedulable runtime alignment and should be used after `codex-pool sync --confirm`. Generic 502 bodies such as `Recovered upstream error 502`, `Bad Gateway`, and Codex-facing `Upstream request failed` must stay in the YAML cooldown policy so an intermittently bad account is cooled down instead of repeatedly adding latency at the next compact or Responses request. The Codex pool default error cooldown is severity-tiered: temporary signals can start at ten minutes, gateway/service/overload failures should cool down longer, and credential, permission, quota, or account-state failures should use the longest cooldown. Exact current values belong in YAML and runtime validation output.
 
 The request path is:
 
diff --git a/scripts/platform-infra-sub2api-codex-routing-contract-test.ts b/scripts/platform-infra-sub2api-codex-routing-contract-test.ts
index 4b0cc32a..f3ebcb4e 100644
--- a/scripts/platform-infra-sub2api-codex-routing-contract-test.ts
+++ b/scripts/platform-infra-sub2api-codex-routing-contract-test.ts
@@ -45,6 +45,9 @@ if (parsed.pool?.defaultTempUnschedulable?.enabled === true) {
   assertCondition(rules.every((rule) => Number.isInteger(rule.statusCode) && (rule.statusCode ?? 0) >= 100 && (rule.statusCode ?? 0) <= 599), "temporary unschedulable rules must declare valid HTTP status codes", rules);
   assertCondition(rules.every((rule) => Array.isArray(rule.keywords) && rule.keywords.length > 0), "temporary unschedulable rules must declare non-empty keywords", rules);
   assertCondition(rules.every((rule) => Number.isInteger(rule.durationMinutes) && (rule.durationMinutes ?? 0) > 0), "temporary unschedulable rules must declare positive cooldown durations", rules);
+  const gateway502Rule = rules.find((rule) => rule.statusCode === 502);
+  const gateway502Keywords = new Set((gateway502Rule?.keywords ?? []).map((keyword) => keyword.toLowerCase()));
+  assertCondition(gateway502Keywords.has("recovered upstream error"), "502 temporary-unschedulable rule must catch recovered upstream error wrappers", gateway502Rule);
 }
 
 assertCondition(typeof parsed.localCodex?.responsesSmokeModel === "string" && parsed.localCodex.responsesSmokeModel.length > 0, "localCodex.responsesSmokeModel must be declared for Responses smoke validation", parsed.localCodex);
@@ -56,6 +59,7 @@ console.log(JSON.stringify({
     "profile load factor overrides are YAML-controlled positive integers",
     "optional WebSocket mode overrides use supported values",
     "temporary unschedulable rules are structurally valid when enabled",
+    "generic recovered upstream error wrappers are caught by cooldown rules",
     "Responses smoke model is YAML-declared",
   ],
 }));
diff --git a/scripts/src/platform-infra-sub2api-codex.ts b/scripts/src/platform-infra-sub2api-codex.ts
index 42fdb253..0afeb614 100644
--- a/scripts/src/platform-infra-sub2api-codex.ts
+++ b/scripts/src/platform-infra-sub2api-codex.ts
@@ -654,43 +654,43 @@ export function defaultCodexTempUnschedulablePolicy(): CodexTempUnschedulablePol
     rules: [
       {
         statusCode: 401,
-        keywords: ["unauthorized", "invalid api key", "invalid_api_key", "authentication"],
+        keywords: ["unauthorized", "invalid api key", "invalid_api_key", "authentication", "recovered upstream error"],
         durationMinutes: 120,
-        description: "Credential/auth failures should leave the scheduler quickly and retry after a two-hour cooldown.",
+        description: "Credential/auth failures should use the longest cooldown.",
       },
       {
         statusCode: 403,
-        keywords: ["forbidden", "access denied", "quota", "billing", "capacity"],
+        keywords: ["forbidden", "access denied", "quota", "billing", "capacity", "recovered upstream error"],
         durationMinutes: 120,
-        description: "Permission, quota, or account-state failures should fail over to another account for at least two hours.",
+        description: "Permission, quota, or account-state failures should use the longest cooldown.",
       },
       {
         statusCode: 429,
-        keywords: ["capacity", "rate limit", "rate_limit", "quota", "too many requests", "overloaded", "resource_exhausted"],
+        keywords: ["capacity", "rate limit", "rate_limit", "quota", "too many requests", "overloaded", "resource_exhausted", "recovered upstream error"],
         durationMinutes: 10,
         description: "Capacity and rate-limit responses are often temporary; start with a ten-minute cooldown and use another account.",
       },
       {
         statusCode: 500,
-        keywords: ["capacity", "overloaded", "temporarily unavailable", "temporary", "upstream"],
+        keywords: ["capacity", "overloaded", "temporarily unavailable", "temporary", "upstream", "recovered upstream error"],
         durationMinutes: 10,
         description: "Transient upstream server failures should start with a ten-minute cooldown and prefer another account.",
       },
       {
         statusCode: 502,
-        keywords: ["capacity", "overloaded", "temporarily unavailable", "temporary", "upstream", "bad gateway", "upstream request failed", "websocket dial", "handshake response"],
+        keywords: ["capacity", "overloaded", "temporarily unavailable", "temporary", "upstream", "bad gateway", "upstream request failed", "websocket dial", "handshake response", "recovered upstream error"],
         durationMinutes: 30,
-        description: "Gateway upstream failures are more disruptive than transient capacity signals and should cool down longer.",
+        description: "Gateway upstream failures, including recovered upstream error wrappers, should cool down longer.",
       },
       {
         statusCode: 503,
-        keywords: ["capacity", "overloaded", "temporarily unavailable", "temporary", "upstream"],
+        keywords: ["capacity", "overloaded", "temporarily unavailable", "temporary", "upstream", "recovered upstream error"],
         durationMinutes: 30,
         description: "Service unavailable responses should cool down longer than one-off transient failures.",
       },
       {
         statusCode: 529,
-        keywords: ["capacity", "overloaded", "temporarily unavailable", "temporary"],
+        keywords: ["capacity", "overloaded", "temporarily unavailable", "temporary", "recovered upstream error"],
         durationMinutes: 30,
         description: "Provider overloaded responses should cool down longer than generic transient failures and use another account.",
       },