fix: cool down sub2api accounts on upstream 400 Responses failures

2026-06-10 11:35:44 +00:00
parent f56216a6cf
commit 0afc927d88
6 changed files with 101 additions and 9 deletions
@@ -11,6 +11,10 @@ pool:
  defaultTempUnschedulable:
    enabled: true
    rules:
+      - statusCode: 400
+        keywords: [invalid_encrypted_content, encrypted content, could not be verified, could not be decrypted, bad_response_status_code, model_not_found, no available channel for model, unsupported, not supported, not support, 暂不支持, 可用模型]
+        durationMinutes: 120
+        description: Stable upstream 400 model-routing or Responses encrypted-content compatibility failures should use another account.
      - statusCode: 401
        keywords: [unauthorized, invalid api key, invalid_api_key, authentication, recovered upstream error]
        durationMinutes: 120
@@ -1,6 +1,6 @@
 image:
  repository: weishaw/sub2api
-  tag: 0.1.135
+  tag: 0.1.136
  pullPolicy: IfNotPresent
 security:
  urlAllowlist:
@@ -35,7 +35,7 @@
 - Do not change account membership, priority, capacity, load factor, WebSocket mode, or other routing policy from inference alone. Unless the user explicitly asks for a configuration change, first preserve the current YAML, collect provenance and runtime evidence, and write the finding to the relevant issue or runbook before proposing a change.
 - `profiles.entries[].tempUnschedulable` may override the pool default for one account. The CLI renders it into Sub2API credentials as `temp_unschedulable_enabled` and `temp_unschedulable_rules`; rules match HTTP status plus response-body keywords and place only that account into a temporary unschedulable cooldown.
 - Codex account-state or quota prompts that stop a task and ask the operator to switch accounts belong in `pool.defaultTempUnschedulable`, not in account membership, priority, capacity, load factor, WebSocket mode, or `pool_mode`. Keep stable body phrases such as weekly-limit and `/status` prompts in both the 403 account-state rule and the 429 quota/rate-limit rule, then run `codex-pool sync --confirm` and `codex-pool validate`. The validation evidence must include runtime temporary-unschedulable alignment for each managed account, not only successful group-level `/v1/models` or `/v1/responses` smoke output.
-- Upstream model-routing failures that surface as 503 responses, such as `model_not_found` or "no available channel for model" wrappers, also belong in `pool.defaultTempUnschedulable`. Gateway and timeout failures that surface as 502, 504, or 524 responses, including `Gateway Timeout`, `Unknown error`, `Upstream request failed`, `context deadline exceeded`, `context canceled`, or recovered upstream-error wrappers, belong in the same YAML policy. This is especially important for compact requests, where an upstream Cloudflare 524 may eventually reach Codex as a 502/504 unknown-error wrapper after failover or client cancellation. They are not membership, priority, capacity, load factor, WebSocket mode, or User-Agent decisions by themselves. After adding stable body phrases, run `codex-pool sync --confirm` and `codex-pool validate`, and verify the affected account's runtime status-specific rule includes the new keywords.
+- Upstream model-routing and Responses compatibility failures that surface as 400 responses, such as `invalid_encrypted_content`, `bad_response_status_code`, unsupported-model wrappers, or stable "available models" messages, belong in `pool.defaultTempUnschedulable` when another account can handle the same Codex request. Upstream model-routing failures that surface as 503 responses, such as `model_not_found` or "no available channel for model" wrappers, also belong there. Gateway and timeout failures that surface as 502, 504, or 524 responses, including `Gateway Timeout`, `Unknown error`, `Upstream request failed`, `context deadline exceeded`, `context canceled`, or recovered upstream-error wrappers, belong in the same YAML policy. This is especially important for compact and long `/responses` requests, where an upstream Cloudflare 524 or account-specific compatibility failure may eventually reach Codex as a 502/504 unknown-error wrapper after failover or client cancellation. They are not membership, priority, capacity, load factor, WebSocket mode, or User-Agent decisions by themselves. After adding stable body phrases, run `codex-pool sync --confirm` and `codex-pool validate`, and verify the affected account's runtime status-specific rule includes the new keywords.
 - `profiles.entries[].openaiResponsesWebSocketsV2Mode` is the account-level Responses WebSocket v2 switch for OpenAI-compatible upstreams that require WebSocket transport. Allowed values are `off`, `ctx_pool`, and `passthrough`; omit the field unless that upstream needs it.
 - `profiles.entries[].upstreamUserAgent` is an optional account-level upstream request User-Agent override. Use it only for upstreams that require a Codex CLI compatible User-Agent; keep the value YAML-controlled and newline-free.
 - `publicExposure` controls the optional FRP bridge from master server to the G14 ClusterIP service.
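
The runtime-alignment check described in the bullets above (verifying that an account's status-specific runtime rule includes the new keywords) could look roughly like the following. This is a minimal sketch, not the `codex-pool validate` implementation: the helper name `missing_runtime_keywords` is hypothetical, the YAML-side field names (`statusCode`, `keywords`) and rendered field names (`error_code`, `keywords`) are taken from this change, and the case-insensitive comparison is an assumption.

```python
def missing_runtime_keywords(desired_rules, runtime_rules):
    """Return {status_code: [keywords missing at runtime]} for every desired rule.

    desired_rules: YAML-style rules using `statusCode` / `keywords`.
    runtime_rules: rendered rules using `error_code` / `keywords`.
    Hypothetical helper for illustration only.
    """
    # Index the runtime keywords by status code, lowercased (assumed comparison).
    runtime_by_status = {}
    for rule in runtime_rules:
        keywords = {keyword.lower() for keyword in rule.get("keywords", [])}
        runtime_by_status.setdefault(rule.get("error_code"), set()).update(keywords)

    missing = {}
    for rule in desired_rules:
        status = rule["statusCode"]
        present = runtime_by_status.get(status, set())
        absent = [keyword for keyword in rule["keywords"] if keyword.lower() not in present]
        if absent:
            missing[status] = absent
    return missing


desired = [
    {"statusCode": 400, "keywords": ["invalid_encrypted_content", "可用模型"]},
    {"statusCode": 503, "keywords": ["model_not_found"]},
]
runtime = [
    {"error_code": 400, "keywords": ["invalid_encrypted_content"]},
    {"error_code": 503, "keywords": ["model_not_found"]},
]
drift = missing_runtime_keywords(desired, runtime)
```

An empty result would mean every desired keyword is present in the account's runtime rule for that status; any non-empty entry points at a rule that `codex-pool sync --confirm` has not yet propagated.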
@@ -48,7 +48,7 @@ When Codex startup repeatedly reports WebSocket reconnects or HTTPS fallback, pr

 Do not encode current availability assumptions in long-term reference prose. If an account needs a higher concurrency or load factor than the pool default, make that a deliberate YAML override and verify it with `codex-pool validate`; the reference document should describe the rule, not repeat the current numeric value.

-Do not enable Sub2API `pool_mode` for UniDesk-managed Codex accounts. `pool_mode` retries the same selected account path, while UniDesk's desired failover behavior is to mark the failing account temporarily unschedulable and let Sub2API choose another account from the group. `codex-pool validate` reports each managed account's temporary-unschedulable runtime alignment and should be used after `codex-pool sync --confirm`. Generic 502/503/504 bodies such as `Recovered upstream error 502`, `Bad Gateway`, `Gateway Timeout`, Codex-facing `Upstream request failed`, `Unknown error`, context-deadline/canceled wrappers, and stable `model_not_found` / "no available channel for model" wrappers must stay in the YAML cooldown policy so an intermittently bad account is cooled down instead of repeatedly adding latency at the next compact or Responses request. The Codex pool default error cooldown is severity-tiered: temporary signals can start at ten minutes, gateway/service/overload/model-routing failures should cool down longer, and credential, permission, quota, or account-state failures should use the longest cooldown. Exact current values belong in YAML and runtime validation output.
+Do not enable Sub2API `pool_mode` for UniDesk-managed Codex accounts. `pool_mode` retries the same selected account path, while UniDesk's desired failover behavior is to mark the failing account temporarily unschedulable and let Sub2API choose another account from the group. `codex-pool validate` reports each managed account's temporary-unschedulable runtime alignment and should be used after `codex-pool sync --confirm`. Generic 502/503/504 bodies such as `Recovered upstream error 502`, `Bad Gateway`, `Gateway Timeout`, Codex-facing `Upstream request failed`, `Unknown error`, context-deadline/canceled wrappers, stable 400 `invalid_encrypted_content` / unsupported-model wrappers, and stable `model_not_found` / "no available channel for model" wrappers must stay in the YAML cooldown policy so an intermittently bad account is cooled down instead of repeatedly adding latency at the next compact or Responses request. The Codex pool default error cooldown is severity-tiered: temporary signals can start at ten minutes, gateway/service/overload/model-routing failures should cool down longer, and credential, permission, quota, account-compatibility, or account-state failures should use the longest cooldown. Exact current values belong in YAML and runtime validation output.

 Sub2API temporary-unschedulable rules require both an HTTP status match and a response-body keyword match in the upstream failure/error path. Do not treat them as a general successful-response content filter. If an upstream returns a quota warning or maintenance prompt as normal HTTP 200 assistant content, do not add a YAML 200 cooldown rule, patch Sub2API in place, fork behavior in UniDesk, or bypass `codex-pool sync` to make the pool pretend that account cooling exists. Record the upstream capability gap in an issue when it matters operationally; until upstream Sub2API supports that behavior and `codex-pool validate` proves it, UniDesk should not implement or rely on it.
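
The two-part match described above can be illustrated with a small sketch. It is not Sub2API's matcher: the `CooldownRule` type and both function names are hypothetical, the substring and case-insensitive matching are assumptions, and the keywords are a subset of the 400 rule added in this change.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple


@dataclass(frozen=True)
class CooldownRule:
    """Hypothetical in-memory form of one temporary-unschedulable rule."""
    status_code: int
    keywords: Tuple[str, ...]
    duration_minutes: int


def rule_matches(rule: CooldownRule, status_code: int, body: str) -> bool:
    # Both conditions are required: the status must match AND a keyword must
    # appear in the response body. Case-insensitive substring matching is an
    # assumption made for this sketch.
    if status_code != rule.status_code:
        return False
    lowered = body.lower()
    return any(keyword.lower() in lowered for keyword in rule.keywords)


def cooldown_minutes(rules: Sequence[CooldownRule], status_code: int, body: str) -> Optional[int]:
    """Return the cooldown of the first matching rule, or None when nothing matches."""
    for rule in rules:
        if rule_matches(rule, status_code, body):
            return rule.duration_minutes
    return None


RULE_400 = CooldownRule(400, ("invalid_encrypted_content", "model_not_found", "暂不支持"), 120)

matched = cooldown_minutes([RULE_400], 400, '{"error":{"code":"invalid_encrypted_content"}}')
plain_400 = cooldown_minutes([RULE_400], 400, '{"error":"missing required parameter"}')
success_200 = cooldown_minutes([RULE_400], 200, "invalid_encrypted_content")
```

In this sketch an ordinary client-side 400 without a listed keyword does not cool the account, and a keyword that appears in a 200 body does not either, which mirrors the requirement that the rules are not a successful-response content filter.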

@@ -77,7 +77,7 @@ Kubernetes readiness is not the same as pool availability:
 - The Sub2API app, PostgreSQL, and Redis manifests include container-level health probes. These only prove the pods and local dependencies are healthy enough for Kubernetes scheduling.
 - The FRP client deployment is currently a simple connector deployment and does not itself prove that master-local traffic reaches Sub2API.
 - No scheduled `CronJob`, `ServiceMonitor`, or `PodMonitor` currently proves the full unified Codex API path.
-- `platform-infra sub2api validate` and `platform-infra sub2api codex-pool validate` are on-demand checks. Operational usage is documented in `$unidesk-sub2api`; they are acceptable for deployment closeout, but they are not continuous monitoring. `codex-pool validate` must test both `GET /v1/models` and a small `POST /v1/responses` request, and the Responses smoke should report request id, selected/final account evidence, upstream failover count, and whether the validation succeeded only after failover.
+- `platform-infra sub2api validate` and `platform-infra sub2api codex-pool validate` are on-demand checks. Operational usage is documented in `$unidesk-sub2api`; they are acceptable for deployment closeout, but they are not continuous monitoring. `codex-pool validate` must test both `GET /v1/models` and a small `POST /v1/responses` request, and the Responses smoke should report request id, selected/final account evidence, upstream failover count, and whether the validation succeeded only after failover. It should also summarize recent `/responses` and `/responses/compact` gateway failures separately so ordinary long streaming failures are not hidden behind compact-only evidence.
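
Keeping `/responses` and `/responses/compact` failures in separate buckets, as the last bullet asks, could be sketched as below. This is an illustration under stated assumptions rather than the validator's code: the function names are hypothetical, the `path` and `status_code` log fields follow the log parsing added in this change, and the `/v1/responses/compact` path variant is assumed by analogy with `/v1/responses`.

```python
import json

RESPONSES_PATHS = ("/responses", "/v1/responses")
# The /v1 compact variant is an assumption made for this sketch.
COMPACT_PATHS = ("/responses/compact", "/v1/responses/compact")


def classify_path(path):
    """Map a request path to a reporting bucket, or None when it is out of scope."""
    if path in COMPACT_PATHS:
        return "compact"
    if path in RESPONSES_PATHS:
        return "responses"
    return None


def summarize_gateway_failures(log_lines):
    """Count HTTP >= 400 completions per bucket from JSON-suffixed log lines."""
    counts = {"responses": 0, "compact": 0}
    for line in log_lines:
        json_start = line.find("{")
        if json_start < 0:
            continue  # no JSON payload on this line
        try:
            item = json.loads(line[json_start:])
        except ValueError:
            continue  # malformed or truncated JSON is skipped, not fatal
        if not isinstance(item, dict):
            continue
        bucket = classify_path(item.get("path"))
        status = item.get("status_code")
        if bucket is not None and isinstance(status, int) and status >= 400:
            counts[bucket] += 1
    return counts


sample_lines = [
    'ts INFO {"path": "/v1/responses", "status_code": 502}',
    '{"path": "/responses/compact", "status_code": 504}',
    '{"path": "/v1/responses", "status_code": 200}',
    "plain text line without JSON",
    '{"path": "/v1/models", "status_code": 500}',
    "prefix {broken",
]
summary = summarize_gateway_failures(sample_lines)
```

Reporting the two counts side by side makes it visible when long streaming `/responses` requests fail even though compact evidence looks clean.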

 When an automatic availability probe is added, it should be YAML-controlled and cover these layers without printing secrets:

@@ -84,20 +84,24 @@ if (parsed.pool?.defaultTempUnschedulable?.enabled === true) {
    assertCondition(cloudflare524Keywords.has(keyword), "524 temporary-unschedulable rule must catch Cloudflare timeout wrappers", { keyword, cloudflare524Rule });
  }
  const accountState403Rule = rules.find((rule) => rule.statusCode === 403);
+  const clientError400Rule = rules.find((rule) => rule.statusCode === 400);
  const quota429Rule = rules.find((rule) => rule.statusCode === 429);
  const successBody200Rule = rules.find((rule) => rule.statusCode === 200);
  const serviceUnavailable503Rule = rules.find((rule) => rule.statusCode === 503);
  const accountState403Keywords = new Set((accountState403Rule?.keywords ?? []).map((keyword) => keyword.toLowerCase()));
+  const clientError400Keywords = new Set((clientError400Rule?.keywords ?? []).map((keyword) => keyword.toLowerCase()));
  const quota429Keywords = new Set((quota429Rule?.keywords ?? []).map((keyword) => keyword.toLowerCase()));
  const successBody200Keywords = new Set((successBody200Rule?.keywords ?? []).map((keyword) => keyword.toLowerCase()));
  const serviceUnavailable503Keywords = new Set((serviceUnavailable503Rule?.keywords ?? []).map((keyword) => keyword.toLowerCase()));
  const accountStatePhrases = ["weekly limit", "less than 10% of your weekly limit left", "run /status for a breakdown"];
  const successBodyPhrase = "less than 10% of your weekly limit left";
+  for (const keyword of ["invalid_encrypted_content", "encrypted content", "could not be verified", "bad_response_status_code", "暂不支持", "可用模型"]) {
+    assertCondition(clientError400Keywords.has(keyword), "400 temporary-unschedulable rule must catch upstream Responses compatibility and model-routing failures", { keyword, clientError400Rule });
+  }
  for (const accountStatePhrase of accountStatePhrases) {
    assertCondition(accountState403Keywords.has(accountStatePhrase), "403 temporary-unschedulable rule must catch Codex account-state phrases", { accountStatePhrase, accountState403Rule });
    assertCondition(quota429Keywords.has(accountStatePhrase), "429 temporary-unschedulable rule must catch Codex account-state phrases", { accountStatePhrase, quota429Rule });
  }
-  assertCondition(successBody200Rule !== undefined, "200 temporary-unschedulable rule must be declared when YAML needs success-body reclassification", rules);
  if (successBody200Rule !== undefined) {
    assertCondition(successBody200Keywords.size === 1 && successBody200Keywords.has(successBodyPhrase), "200 temporary-unschedulable rule must use one stable success-body classifier phrase", successBody200Rule);
    assertCondition(/reclassification/u.test(successBody200Rule.description ?? ""), "200 temporary-unschedulable rule must be documented as a runtime reclassification requirement", successBody200Rule);
@@ -118,6 +122,7 @@ console.log(JSON.stringify({
    "optional WebSocket mode overrides use supported values",
    "local Codex WebSocket transport is consistent with YAML-declared WSv2-capable accounts",
    "temporary unschedulable rules are structurally valid when enabled",
+    "upstream 400 Responses compatibility and model-routing failures are caught by the 400 cooldown rule",
    "generic recovered upstream error wrappers are caught by cooldown rules",
    "large-context upstream failures are caught by the 413 cooldown rule",
    "gateway timeout wrappers are caught by the 504 cooldown rule",
@@ -28,6 +28,7 @@ assertCondition(rules.every((rule, index) => rule.description === policy.rules[i
 assertCondition(!("pool_mode" in credentials), "pool_mode must not be enabled because it retries the same account instead of cooling it down", credentials);
 assertCondition(!("api_key" in credentials) && !("base_url" in credentials), "temporary-unschedulable rendering must not include secrets or endpoints", credentials);
 const accountState403Rule = rules.find((rule) => rule.error_code === 403);
+const clientError400Rule = rules.find((rule) => rule.error_code === 400);
 const quota429Rule = rules.find((rule) => rule.error_code === 429);
 const successBody200Rule = rules.find((rule) => rule.error_code === 200);
 const gateway502Rule = rules.find((rule) => rule.error_code === 502);
@@ -38,6 +39,9 @@ const cloudflare524Rule = rules.find((rule) => rule.error_code === 524);
 const accountStatePhrases = ["weekly limit", "less than 10% of your weekly limit left", "run /status for a breakdown"];
 const successBodyPhrase = "less than 10% of your weekly limit left";
 assertCondition(successBody200Rule?.keywords?.length === 1 && successBody200Rule.keywords.includes(successBodyPhrase), "200 rendered rule must use the single stable success-body account-state phrase", successBody200Rule);
+for (const keyword of ["invalid_encrypted_content", "encrypted content", "could not be verified", "bad_response_status_code", "暂不支持", "可用模型"]) {
+  assertCondition(clientError400Rule?.keywords?.includes(keyword), "400 rendered rule must catch upstream Responses compatibility and model-routing failures", { keyword, clientError400Rule });
+}
 for (const accountStatePhrase of accountStatePhrases) {
  assertCondition(accountState403Rule?.keywords?.includes(accountStatePhrase), "403 rendered rule must preserve Codex account-state phrases", { accountStatePhrase, accountState403Rule });
  assertCondition(quota429Rule?.keywords?.includes(accountStatePhrase), "429 rendered rule must preserve Codex account-state phrases", { accountStatePhrase, quota429Rule });
@@ -72,6 +76,7 @@ console.log(JSON.stringify({
    "temporary unschedulable policy renders to Sub2API credential field names",
    "temporary unschedulable rendering follows the input policy without hard-coded policy gates",
    "Codex account-state prompt uses one stable phrase, including the 200 success-body rule",
+    "upstream 400 Responses compatibility and model-routing failures render into the 400 cooldown rule",
    "large-context upstream failures render into the 413 cooldown rule",
    "upstream model-routing failures render into the 503 cooldown rule",
    "gateway timeout wrappers render into the 504 cooldown rule",
@@ -676,6 +676,12 @@ export function defaultCodexTempUnschedulablePolicy(): CodexTempUnschedulablePol
        durationMinutes: 120,
        description: "Success-body account-state prompts require Sub2API 2xx body reclassification before they can cool accounts.",
      },
+      {
+        statusCode: 400,
+        keywords: ["invalid_encrypted_content", "encrypted content", "could not be verified", "could not be decrypted", "bad_response_status_code", "model_not_found", "no available channel for model", "unsupported", "not supported", "not support", "暂不支持", "可用模型"],
+        durationMinutes: 120,
+        description: "Stable upstream 400 model-routing or Responses encrypted-content compatibility failures should use another account.",
+      },
      {
        statusCode: 401,
        keywords: ["unauthorized", "invalid api key", "invalid_api_key", "authentication", "recovered upstream error"],
@@ -2412,6 +2418,76 @@ def recent_compact_gateway_evidence():
        "valuesPrinted": False,
    }

+def recent_responses_gateway_evidence():
+    proc = kubectl(["-n", NAMESPACE, "logs", "deployment/sub2api", "--since=6h", "--tail=2500"])
+    stdout = proc.stdout.decode("utf-8", errors="replace")
+    failovers = []
+    forward_failures = []
+    final_errors = []
+    context_canceled = []
+    slow_final_errors = []
+    for line in stdout.splitlines():
+        if '"/responses"' not in line and '"/v1/responses"' not in line:
+            continue
+        json_start = line.find("{")
+        if json_start < 0:
+            continue
+        try:
+            item = json.loads(line[json_start:])
+        except Exception:
+            continue
+        path = item.get("path")
+        if path not in ("/responses", "/v1/responses"):
+            continue
+        entry = {
+            "requestId": item.get("request_id"),
+            "clientRequestId": item.get("client_request_id"),
+            "accountId": item.get("account_id"),
+            "statusCode": item.get("status_code"),
+            "upstreamStatus": item.get("upstream_status"),
+            "latencyMs": item.get("latency_ms"),
+            "path": path,
+        }
+        if "upstream_failover_switching" in line:
+            failovers.append({
+                **entry,
+                "switchCount": item.get("switch_count"),
+                "maxSwitches": item.get("max_switches"),
+            })
+        elif "openai.forward_failed" in line:
+            forward_failures.append({
+                **entry,
+                "errorPreview": text(str(item.get("error") or ""), 500),
+                "fallbackErrorResponseWritten": item.get("fallback_error_response_written"),
+                "upstreamErrorResponseAlreadyWritten": item.get("upstream_error_response_already_written"),
+            })
+        elif "http request completed" in line and isinstance(item.get("status_code"), int) and item.get("status_code") >= 400:
+            final_errors.append(entry)
+            latency_ms = item.get("latency_ms")
+            if isinstance(latency_ms, int) and latency_ms >= 30000:
+                slow_final_errors.append(entry)
+        if "context canceled" in line:
+            context_canceled.append(entry)
+    return {
+        "ok": True,
+        "degraded": len(forward_failures) > 0 or len(final_errors) > 0 or len(context_canceled) > 0,
+        "window": "6h",
+        "tailLines": 2500,
+        "failoverCount": len(failovers),
+        "forwardFailureCount": len(forward_failures),
+        "finalErrorCount": len(final_errors),
+        "slowFinalErrorCount": len(slow_final_errors),
+        "contextCanceledCount": len(context_canceled),
+        "recentFailovers": failovers[-8:],
+        "recentForwardFailures": forward_failures[-8:],
+        "recentFinalErrors": final_errors[-8:],
+        "recentSlowFinalErrors": slow_final_errors[-5:],
+        "recentContextCanceled": context_canceled[-5:],
+        "logsExitCode": proc.returncode,
+        "logsStderr": text(proc.stderr, 1000),
+        "valuesPrinted": False,
+    }
+
 def validate_gateway_responses(api_key):
    request_id = "unidesk-codex-pool-validate-" + str(int(time.time() * 1000))
    payload = {
@@ -3101,10 +3177,11 @@ def run_sync():
    gateway = validate_gateway(api_key)
    responses_smoke = validate_gateway_responses(api_key)
    compact_evidence = recent_compact_gateway_evidence()
+    responses_evidence = recent_responses_gateway_evidence()
    runtime_capabilities = validate_runtime_capabilities(token)
    return {
        "ok": gateway["ok"] is True and responses_smoke["ok"] is True and owner_concurrency["ok"] is True and capacity_status["ok"] is True and load_factor_status["ok"] is True and ws_v2_status["ok"] is True and temp_unschedulable_status["ok"] is True,
-        "degraded": bool(responses_smoke.get("degraded")) or bool(compact_evidence.get("degraded")) or runtime_capabilities.get("ok") is not True,
+        "degraded": bool(responses_smoke.get("degraded")) or bool(compact_evidence.get("degraded")) or bool(responses_evidence.get("degraded")) or runtime_capabilities.get("ok") is not True,
        "mode": "sync",
        "namespace": NAMESPACE,
        "serviceDns": SERVICE_DNS,
@@ -3139,7 +3216,7 @@ def run_sync():
        "ownerBalance": owner_balance,
        "ownerConcurrency": owner_concurrency,
        "runtimeCapabilities": runtime_capabilities,
-        "validation": {"gatewayModels": gateway, "gatewayResponses": responses_smoke, "gatewayCompactRecent": compact_evidence},
+        "validation": {"gatewayModels": gateway, "gatewayResponses": responses_smoke, "gatewayResponsesRecent": responses_evidence, "gatewayCompactRecent": compact_evidence},
    }

 def run_validate():
@@ -3160,10 +3237,11 @@ def run_validate():
    gateway = validate_gateway(api_key)
    responses_smoke = validate_gateway_responses(api_key)
    compact_evidence = recent_compact_gateway_evidence()
+    responses_evidence = recent_responses_gateway_evidence()
    runtime_capabilities = validate_runtime_capabilities(token)
    return {
        "ok": gateway["ok"] is True and responses_smoke["ok"] is True and (owner_concurrency is None or owner_concurrency["ok"] is True) and capacity_status["ok"] is True and load_factor_status["ok"] is True and ws_v2_status["ok"] is True and temp_unschedulable_status["ok"] is True,
-        "degraded": bool(responses_smoke.get("degraded")) or bool(compact_evidence.get("degraded")) or runtime_capabilities.get("ok") is not True,
+        "degraded": bool(responses_smoke.get("degraded")) or bool(compact_evidence.get("degraded")) or bool(responses_evidence.get("degraded")) or runtime_capabilities.get("ok") is not True,
        "mode": "validate",
        "namespace": NAMESPACE,
        "serviceDns": SERVICE_DNS,
@@ -3183,7 +3261,7 @@ def run_validate():
        "webSocketsV2": ws_v2_status,
        "tempUnschedulable": temp_unschedulable_status,
        "runtimeCapabilities": runtime_capabilities,
-        "validation": {"gatewayModels": gateway, "gatewayResponses": responses_smoke, "gatewayCompactRecent": compact_evidence},
+        "validation": {"gatewayModels": gateway, "gatewayResponses": responses_smoke, "gatewayResponsesRecent": responses_evidence, "gatewayCompactRecent": compact_evidence},
    }

 try: