fix: disable sub2api builtin temp unschedulable

2026-06-11 15:44:53 +00:00
parent 5ba46f0dc3
commit 01b19d9238
4 changed files with 46 additions and 113 deletions
@@ -26,9 +26,9 @@
 - `pool.groupName` names the Sub2API group that represents the pool.
 - `pool.apiKeySecretName` and `pool.apiKeySecretKey` name the k3s Secret that stores the single consumer API key.
 - `pool.minOwnerConcurrency` is optional; when omitted, the CLI automatically uses the sum of all resolved account capacities as the minimum concurrency for the Sub2API user that owns the unified consumer API key. A YAML value is only an explicit override and must still be at least that capacity sum, so the shared key does not fail requests or WS sessions at the user-concurrency layer. "Resolved" means each account's explicit `profiles.entries[].capacity` or, when omitted, `pool.defaultAccountCapacity`. Do not compensate for owner-concurrency 1013 errors by pinning capacity to one provider.
- `pool.defaultTempUnschedulable` declares Sub2API account-level temporary unschedulable rules for capabilities that Sub2API itself already supports. Keep 429/overload/capacity, service-unavailable, gateway timeout, and stable model-routing failures in this YAML policy so the scheduler can cool down a failing account and choose another candidate instead of hard-pinning one provider. Do not declare unsupported Sub2API behavior in YAML as a promise that UniDesk code or runtime patches should emulate.
- When a managed upstream repeatedly causes `/v1/responses` or `/responses/compact` failures, the required fix path is to make automatic temporary-unschedulable and failover work, then verify it with runtime evidence. Do not restore availability by manually disabling an account, deleting a managed account, removing its YAML entry, lowering membership, or otherwise changing routing policy merely to avoid the failing upstream; those actions are allowed only for an explicit upstream retirement or ownership change.
- Codex accounts selected by YAML do not declare `schedulable` as durable configuration. `schedulable=true` is a `codex-pool sync --confirm` process-control baseline for UniDesk-managed accounts, not a YAML field. Account cooling must be represented by `temp_unschedulable_until` / `temp_unschedulable_reason`, so validation can distinguish real automatic cooldown from stale manual unschedulable state.
+- `pool.defaultTempUnschedulable` is the Sub2API built-in temporary-unschedulable switch plus its YAML rule list. UniDesk keeps this built-in switch disabled by default while preserving the rule list in YAML for explicit future recovery; sync follows the WebUI close-switch behavior by omitting the runtime `temp_unschedulable_enabled` and `temp_unschedulable_rules` credential fields. The external account-level sentinel is the active account health and freeze/restore mechanism.
+- The built-in temporary-unschedulable configuration and external `sentinel.*` configuration are separate control surfaces. Changing `pool.defaultTempUnschedulable.enabled` or `profiles.entries[].tempUnschedulable` must not change sentinel cadence, marker health semantics, or sentinel quarantine state; changing sentinel settings must not implicitly enable Sub2API built-in temporary-unschedulable rules.
+- Codex accounts selected by YAML do not declare `schedulable` as durable configuration. `schedulable=true` is a `codex-pool sync --confirm` process-control baseline for UniDesk-managed accounts that are not under sentinel quarantine, not a YAML field.
 - `codex-pool sync --confirm` preserves UniDesk-managed accounts that are absent from YAML by default; explicit upstream retirement requires `codex-pool sync --confirm --prune-removed`. This keeps account deletion out of the normal availability-recovery path and prevents temporary YAML edits from becoming destructive runtime changes.
 - `profiles.entries` selects local Codex profile files from `~/.codex/` and maps them to Sub2API account names.
 - The unsuffixed master `~/.codex/config.toml` and `~/.codex/auth.json` are reserved for the unified Sub2API consumer. `config.toml` must keep `base_url = "https://sub2api.74-48-78-17.nip.io/"`, and `auth.json` must contain the unified pool API key from `pool.apiKeySecretName` / `pool.apiKeySecretKey`. Do not replace these two files with direct upstream account credentials.
@@ -36,9 +36,8 @@
 - `profiles.entries[].capacity` optionally overrides `pool.defaultAccountCapacity` for one account. Capacity is a YAML-controlled routing input; concrete current values belong only in `config/platform-infra/sub2api-codex-pool.yaml` and runtime validation output, not in long-term reference prose. Code constants, Secrets, ad-hoc runtime patches, or stale tests must not override YAML source of truth.
 - `profiles.entries[].loadFactor` optionally overrides `pool.defaultAccountLoadFactor` for one account and is rendered to Sub2API `load_factor`. Treat it as routing policy: values belong in YAML and `codex-pool validate` output, not code constants, Secrets, or ad-hoc runtime patches.
 - Do not change account membership, priority, capacity, load factor, WebSocket mode, or other routing policy from inference alone. Unless the user explicitly asks for a configuration change, first preserve the current YAML, collect provenance and runtime evidence, and write the finding to the relevant issue or runbook before proposing a change.
- `profiles.entries[].tempUnschedulable` may override the pool default for one account. The CLI renders it into Sub2API credentials as `temp_unschedulable_enabled` and `temp_unschedulable_rules`; rules match HTTP status plus response-body keywords and place only that account into a temporary unschedulable cooldown.
- Codex account-state or quota prompts that stop a task and ask the operator to switch accounts belong in `pool.defaultTempUnschedulable`, not in account membership, priority, capacity, load factor, WebSocket mode, or `pool_mode`. Keep stable body phrases such as weekly-limit and `/status` prompts in both the 403 account-state rule and the 429 quota/rate-limit rule, then run `codex-pool sync --confirm` and `codex-pool validate`. The validation evidence must include runtime temporary-unschedulable alignment for each managed account, not only successful group-level `/v1/models` or `/v1/responses` smoke output.
- Upstream model-routing and Responses compatibility failures that surface as 400 responses, such as `invalid_encrypted_content`, `bad_response_status_code`, `invalid_request_error` with a stable unsupported-model message, unsupported-model wrappers, or stable "available models" messages, belong in `pool.defaultTempUnschedulable` when another account can handle the same Codex request. Upstream model-routing failures that surface as 503 responses, such as `model_not_found` or "no available channel for model" wrappers, also belong there. Gateway and timeout failures that surface as 502, 504, or 524 responses, including `Gateway Timeout`, `Unknown error`, `Upstream request failed`, `context deadline exceeded`, `context canceled`, or recovered upstream-error wrappers, belong in the same YAML policy. This is especially important for compact and long `/responses` requests, where an upstream Cloudflare 524 or account-specific compatibility failure may eventually reach Codex as a 502/504 unknown-error wrapper after failover or client cancellation. They are not membership, priority, capacity, load factor, WebSocket mode, or User-Agent decisions by themselves. After adding stable body phrases, run `codex-pool sync --confirm` and `codex-pool validate`, and verify the affected account's runtime status-specific rule includes the new keywords.
+- `profiles.entries[].tempUnschedulable` may override the pool default for one account. When enabled, the CLI renders it into Sub2API credentials as `temp_unschedulable_enabled` and `temp_unschedulable_rules`; when disabled, runtime credentials omit both fields and the YAML rule list remains only source-side configuration.
+- Codex account-state, quota prompts, model-routing failures, gateway wrappers, and timeout-like upstream errors are handled by the external marker-only sentinel unless the Sub2API built-in temporary-unschedulable switch is explicitly re-enabled. Do not change membership, priority, capacity, load factor, WebSocket mode, or `pool_mode` merely to work around those errors.
 - `profiles.entries[].openaiResponsesWebSocketsV2Mode` is the account-level Responses WebSocket v2 switch for OpenAI-compatible upstreams that require WebSocket transport. Allowed values are `off`, `ctx_pool`, and `passthrough`; omit the field unless that upstream needs it.
 - `profiles.entries[].upstreamUserAgent` is an optional account-level upstream request User-Agent override. Use it only for upstreams that require a Codex CLI compatible User-Agent; keep the value YAML-controlled and newline-free.
 - `publicExposure` controls the optional FRP bridge from master server to the G14 ClusterIP service.
@@ -52,11 +51,9 @@ When Codex startup repeatedly reports WebSocket reconnects or HTTPS fallback, pr

 Do not encode current availability assumptions in long-term reference prose. If an account needs a higher concurrency or load factor than the pool default, make that a deliberate YAML override and verify it with `codex-pool validate`; the reference document should describe the rule, not repeat the current numeric value.

-Do not enable Sub2API `pool_mode` for UniDesk-managed Codex accounts. `pool_mode` retries the same selected account path, while UniDesk's desired failover behavior is to mark the failing account temporarily unschedulable and let Sub2API choose another account from the group. `codex-pool validate` reports each managed account's temporary-unschedulable runtime alignment and should be used after `codex-pool sync --confirm`. Generic 502/503/504 bodies such as `Recovered upstream error 502`, `Bad Gateway`, `Gateway Timeout`, Codex-facing `Upstream request failed`, `Unknown error`, context-deadline/canceled wrappers, stable 400 `invalid_encrypted_content` / unsupported-model wrappers, and stable `model_not_found` / "no available channel for model" wrappers must stay in the YAML cooldown policy so an intermittently bad account is cooled down instead of repeatedly adding latency at the next compact or Responses request. Exact current cooldown values and any business-policy grouping belong only in YAML and runtime validation output; do not repeat those values here, encode them as code/schema hard limits, or require contract tests for value changes.
+Do not enable Sub2API `pool_mode` for UniDesk-managed Codex accounts. `pool_mode` retries the same selected account path and does not replace sentinel quarantine. The current failover and recovery model is: the external marker-only sentinel freezes or restores account schedulability, while Sub2API routes among currently schedulable accounts in the group.

-Sub2API temporary-unschedulable rules require both an HTTP status match and a response-body keyword match in the upstream failure/error path. Do not treat them as a general successful-response content filter, and do not add a YAML 200 cooldown rule, patch Sub2API in place, fork Sub2API behavior in UniDesk, or bypass `codex-pool sync` to make the native pool pretend that HTTP 200 content cooling exists. HTTP 200 private content, maintenance text, quota prompts, ads, and similar semantic failures are handled by the external account-level sentinel when that sentinel is enabled, not by Sub2API native `temp_unschedulable_rules`.
-
-If automatic cooling or same-request failover does not happen for an error that the YAML policy declares, treat that as a Sub2API capability or integration defect. The closeout must show the failing account being marked temporarily unschedulable and the next request or same request selecting another schedulable account; a manually disabled, deleted, or pruned account is not valid evidence for this class of fix.
+Sub2API temporary-unschedulable rules require both an HTTP status match and a response-body keyword match in the upstream failure/error path when the built-in switch is enabled. UniDesk currently keeps that switch disabled and does not use built-in rules as a successful-response content filter. HTTP 200 private content, maintenance text, quota prompts, ads, and similar semantic failures are handled by the external account-level sentinel.

 ## Sub2API Account Test Semantics

@@ -76,7 +73,7 @@ The UniDesk account-level sentinel uses marker-only health semantics. A probe is

 The sentinel must not maintain separate classifiers for "private content", "maintenance", "quota", "ads", or provider-specific body phrases as health gates. The only recovery condition is a later recovery probe that matches the marker. Freeze TTL expiry only schedules the next recovery probe; it does not restore an account by itself. Repeated non-marker results use a short exponential freeze backoff because failed marker probes produce little or no useful output token usage; repeated marker-matching results use the configured success cadence backoff. This contract applies equally to OpenAI Responses `gpt-5.5` direct account probes and manual `codex-pool sentinel-probe --account ... --confirm` measurements.

-`profiles.entries[].trustUpstream` is the durable account-level trust marker for sentinel success cadence, and the absence of the field means untrusted. Trusted and untrusted accounts use separate YAML cadence maximums after marker-matching probes; the values belong only in `config/platform-infra/sub2api-codex-pool.yaml`. This field must not change Sub2API scheduler priority, capacity, load factor, membership, native temporary-unschedulable rules, or the marker-only health contract. Its purpose is to keep intermittently unreliable 200-success providers under more frequent direct probes without adding provider-specific content classifiers.
+`profiles.entries[].trustUpstream` is the durable account-level trust marker for sentinel success cadence, and the absence of the field means untrusted. Trusted and untrusted accounts use separate YAML cadence maximums after marker-matching probes; the values belong only in `config/platform-infra/sub2api-codex-pool.yaml`. This field must not change Sub2API scheduler priority, capacity, load factor, membership, built-in temporary-unschedulable settings, or the marker-only health contract. Its purpose is to keep intermittently unreliable 200-success providers under more frequent direct probes without adding provider-specific content classifiers.

 When `codex-pool sync --confirm` creates a YAML-managed account or changes direct-probe-relevant account inputs such as the profile mapping, upstream base URL, API key fingerprint, upstream User-Agent, Responses WebSocket mode, or `trustUpstream`, only that account must be default-frozen before it can enter the scheduler. Sync first records a pending sentinel quality gate from the pre-mutation runtime state, then updates the account, then schedules the account probe immediately. This ordering prevents a new or changed account from being written to Sub2API without a matching sentinel quarantine record if sync fails midway. Passing the marker clears the quality gate and restores schedulability; any non-marker result continues the failure freeze backoff. Unchanged accounts must not have their existing success or failure backoff reset by unrelated YAML syncs.

@@ -106,7 +103,7 @@ When `publicExposure.enabled` is true, the same FRP TCP bridge exposes both Open

 The public management UI is an operations endpoint. Keep Sub2API itself in `platform-infra`, keep the Kubernetes Service as ClusterIP, and treat FRP as the only public bridge unless a later decision explicitly changes the exposure model.

-The public bridge has two separate failure classes. Sub2API upstream/account failures are visible in Sub2API logs and should be handled by temporary-unschedulable rules, sentinel quarantine, or Sub2API failover. Edge failures between master Caddy and the FRP remotePort are not visible to Sub2API; symptoms include Caddy `connect: connection refused`, EOF, connection reset, or short 502 bursts while frps closes and reopens the configured remotePort. Those failures must be diagnosed from Caddy and frps/frpc evidence and mitigated through YAML-controlled Caddy edge retry or FRP stability fixes, not by disabling accounts or changing pool membership.
+The public bridge has two separate failure classes. Sub2API upstream/account failures are visible in Sub2API logs and currently belong to sentinel quarantine plus normal Sub2API routing among schedulable accounts. Edge failures between master Caddy and the FRP remotePort are not visible to Sub2API; symptoms include Caddy `connect: connection refused`, EOF, connection reset, or short 502 bursts while frps closes and reopens the configured remotePort. Those failures must be diagnosed from Caddy and frps/frpc evidence and mitigated through YAML-controlled Caddy edge retry or FRP stability fixes, not by disabling accounts or changing pool membership.

 ## Availability And Probes