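diff --git a/.agents/skills/unidesk-sub2api/SKILL.md b/.agents/skills/unidesk-sub2api/SKILL.md
index 297342b0..f80755e0 100644
--- a/.agents/skills/unidesk-sub2api/SKILL.md
+++ b/.agents/skills/unidesk-sub2api/SKILL.md
@@ -144,7 +144,7 @@ bun scripts/cli.ts platform-infra sub2api codex-pool configure-local --confirm
 - Codex startup WebSocket fallback: reproduce with a Codex smoke run through the original entry point, then confirm the account from bounded Sub2API logs; for accounts whose WS handshake returns 5xx, disable the WSv2 capability in YAML and sync, and do not write temporary availability inferences into scheduling config.
 - Upstream requires a Codex User-Agent: set `upstreamUserAgent` only on that profile, then run `sync --confirm`.
 - Upstream reports capacity/rate-limit/overload/Bad Gateway but there is no account switch, or requests frequently fail first and then recover: first confirm in `codex-pool validate` that `tempUnschedulable.ok=true`, that the target account has `runtimeEnabled=true`, and that the rule count matches YAML; then check the account/upstream status in `validation.gatewayResponses.evidence.failovers`. On mismatch, run `codex-pool sync --confirm`; do not hand-patch Sub2API credentials.
-- Upstream errors trigger repeatedly: keep the default error cooldown at no less than 2 hours so that a failing account stays out of scheduling long enough and the same error does not hit Codex repeatedly within a short window; do not shorten the cooldown unless the user explicitly asks.
+- Upstream errors trigger repeatedly: the default error cooldown is tiered by severity; temporary problems can start at 10 minutes, gateway/service-unavailable/overload failures should cool down longer, and authentication/permission/quota/account-state failures use the longest cooldown. YAML is the only source of truth for the exact values; after any change, run `codex-pool sync --confirm` and `codex-pool validate`.
 - Context lost after Codex auto compact: first check whether the local `~/.codex/config.toml` has `supports_websockets = true` and `responses_websockets_v2 = true`, then check the WSv2 candidate in `codex-pool validate` and `transport=responses_websockets_v2` in the Sub2API logs.
 - Codex smoke shows reconnect/1013: this is an upstream concurrency/availability problem, handled separately from HTTP-only compact context-loss; record session/log evidence and link the dedicated issue, and do not override YAML capacity with manual runtime patches.
diff --git a/config/platform-infra/sub2api-codex-pool.yaml b/config/platform-infra/sub2api-codex-pool.yaml
index 25f5843d..5c16f42b 100644
--- a/config/platform-infra/sub2api-codex-pool.yaml
+++ b/config/platform-infra/sub2api-codex-pool.yaml
@@ -19,24 +19,24 @@ pool:
       description: Permission, quota, or account-state failures should fail over to another account for at least two hours.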
     - statusCode: 429
       keywords: [capacity, rate limit, rate_limit, quota, too many requests, overloaded, resource_exhausted]
-      durationMinutes: 120
-      description: Capacity and rate-limit responses should cool down this account for at least two hours and use another account.
+      durationMinutes: 10
+      description: Capacity and rate-limit responses are often temporary; start with a ten-minute cooldown and use another account.
     - statusCode: 500
       keywords: [capacity, overloaded, temporarily unavailable, temporary, upstream]
-      durationMinutes: 120
-      description: Transient upstream server failures should prefer another account for at least two hours.
+      durationMinutes: 10
+      description: Transient upstream server failures should start with a ten-minute cooldown and prefer another account.
     - statusCode: 502
       keywords: [capacity, overloaded, temporarily unavailable, temporary, upstream, bad gateway, upstream request failed, websocket dial, handshake response]
-      durationMinutes: 120
-      description: Gateway upstream failures should prefer another account for at least two hours.
+      durationMinutes: 30
+      description: Gateway upstream failures are more disruptive than transient capacity signals and should cool down longer.
     - statusCode: 503
       keywords: [capacity, overloaded, temporarily unavailable, temporary, upstream]
-      durationMinutes: 120
-      description: Service unavailable responses should prefer another account for at least two hours.
+      durationMinutes: 30
+      description: Service unavailable responses should cool down longer than one-off transient failures.
     - statusCode: 529
       keywords: [capacity, overloaded, temporarily unavailable, temporary]
-      durationMinutes: 120
-      description: Provider overloaded responses should cool down this account for at least two hours and use another account.
+      durationMinutes: 30
+      description: Provider overloaded responses should cool down longer than generic transient failures and use another account.
 profiles:
   entries:
     - profile: default
diff --git a/docs/reference/platform-infra.md b/docs/reference/platform-infra.md
index 79a7c2e3..907de29f 100644
--- a/docs/reference/platform-infra.md
+++ b/docs/reference/platform-infra.md
@@ -42,7 +42,7 @@ When Codex startup repeatedly reports WebSocket reconnects or HTTPS fallback, pr
 Do not encode current availability assumptions in long-term reference prose. If an account needs a higher concurrency than `pool.defaultAccountCapacity`, make that a deliberate YAML override and verify it with `codex-pool validate`; the reference document should describe the rule, not repeat the current numeric value.
 
-Do not enable Sub2API `pool_mode` for UniDesk-managed Codex accounts. `pool_mode` retries the same selected account path, while UniDesk's desired failover behavior is to mark the failing account temporarily unschedulable and let Sub2API choose another account from the group. `codex-pool validate` reports each managed account's temporary-unschedulable runtime alignment and should be used after `codex-pool sync --confirm`. Generic 502 bodies such as `Bad Gateway` and Codex-facing `Upstream request failed` must stay in the YAML cooldown policy so an intermittently bad account is cooled down instead of repeatedly adding latency at the next compact or Responses request. The Codex pool default error cooldown should be long enough to avoid repeated error loops; keep the default policy at two hours or longer unless the user explicitly asks for a different value.
+Do not enable Sub2API `pool_mode` for UniDesk-managed Codex accounts. `pool_mode` retries the same selected account path, while UniDesk's desired failover behavior is to mark the failing account temporarily unschedulable and let Sub2API choose another account from the group. `codex-pool validate` reports each managed account's temporary-unschedulable runtime alignment and should be used after `codex-pool sync --confirm`. Generic 502 bodies such as `Bad Gateway` and Codex-facing `Upstream request failed` must stay in the YAML cooldown policy so an intermittently bad account is cooled down instead of repeatedly adding latency at the next compact or Responses request. The Codex pool default error cooldown is severity-tiered: temporary signals can start at ten minutes, gateway/service/overload failures should cool down longer, and credential, permission, quota, or account-state failures should use the longest cooldown. Exact current values belong in YAML and runtime validation output.
 
 The request path is:
diff --git a/scripts/src/platform-infra-sub2api-codex.ts b/scripts/src/platform-infra-sub2api-codex.ts
index 86d17353..85bfd3a8 100644
--- a/scripts/src/platform-infra-sub2api-codex.ts
+++ b/scripts/src/platform-infra-sub2api-codex.ts
@@ -658,32 +658,32 @@ export function defaultCodexTempUnschedulablePolicy(): CodexTempUnschedulablePol
       {
         statusCode: 429,
         keywords: ["capacity", "rate limit", "rate_limit", "quota", "too many requests", "overloaded", "resource_exhausted"],
-        durationMinutes: 120,
-        description: "Capacity and rate-limit responses should cool down this account for at least two hours and use another account.",
+        durationMinutes: 10,
+        description: "Capacity and rate-limit responses are often temporary; start with a ten-minute cooldown and use another account.",
       },
       {
         statusCode: 500,
         keywords: ["capacity", "overloaded", "temporarily unavailable", "temporary", "upstream"],
-        durationMinutes: 120,
-        description: "Transient upstream server failures should prefer another account for at least two hours.",
+        durationMinutes: 10,
+        description: "Transient upstream server failures should start with a ten-minute cooldown and prefer another account.",
       },
       {
         statusCode: 502,
         keywords: ["capacity", "overloaded", "temporarily unavailable", "temporary", "upstream", "bad gateway", "upstream request failed", "websocket dial", "handshake response"],
-        durationMinutes: 120,
-        description: "Gateway upstream failures should prefer another account for at least two hours.",
+        durationMinutes: 30,
+        description: "Gateway upstream failures are more disruptive than transient capacity signals and should cool down longer.",
       },
       {
         statusCode: 503,
         keywords: ["capacity", "overloaded", "temporarily unavailable", "temporary", "upstream"],
-        durationMinutes: 120,
-        description: "Service unavailable responses should prefer another account for at least two hours.",
+        durationMinutes: 30,
+        description: "Service unavailable responses should cool down longer than one-off transient failures.",
       },
       {
         statusCode: 529,
         keywords: ["capacity", "overloaded", "temporarily unavailable", "temporary"],
-        durationMinutes: 120,
-        description: "Provider overloaded responses should cool down this account for at least two hours and use another account.",
+        durationMinutes: 30,
+        description: "Provider overloaded responses should cool down longer than generic transient failures and use another account.",
       },
     ],
   };