fix: add sub2api edge retry

2026-06-11 12:01:21 +00:00
parent 5e44545f07
commit ea92eed148
4 changed files with 136 additions and 3 deletions
@@ -43,6 +43,7 @@
 - `profiles.entries[].upstreamUserAgent` is an optional account-level upstream request User-Agent override. Use it only for upstreams that require a Codex CLI compatible User-Agent; keep the value YAML-controlled and newline-free.
 - `publicExposure` controls the optional FRP bridge from master server to the G14 ClusterIP service.
 - `publicExposure.masterCaddy.responseHeaderTimeoutSeconds` controls the master Caddy `response_header_timeout` for the public Sub2API site. It must be long enough for Codex `/responses/compact` requests; otherwise Caddy can return a client-visible 504 before Sub2API finishes the upstream compact request, and that edge timeout is not an account-level upstream failure that Sub2API can use for temporary-unschedulable failover. The numeric value belongs only in `config/platform-infra/sub2api-codex-pool.yaml`; after changing it, use `codex-pool expose --confirm` to reload Caddy and verify the rendered `response_header_timeout`. Requests that were already in flight before the reload may still finish with the previous timeout, so post-change evidence should check only requests that started after the reload.
+- `publicExposure.masterCaddy.edgeRetry` controls the master Caddy reverse-proxy retry window for the public Sub2API site. This belongs at the edge because FRP remotePort listener loss, `connection refused`, EOF, or connection reset can happen before a request reaches Sub2API, so Sub2API account failover and sentinel logic cannot observe or recover that request. Keep retry scope narrow, especially for non-idempotent POST traffic: connection-attempt failures may be retried by the reverse proxy, while round-trip retry after an upstream connection was established should be limited by YAML `retryMatch` to paths that are safe to repeat, such as compact. Retry durations and intervals belong only in YAML; after changing them, run `codex-pool expose --confirm` and verify the rendered Caddyfile contains the expected `lb_try_duration`, `lb_try_interval`, and `lb_retry_match`.
 - `localCodex` controls how the master server's current `~/.codex` consumer files are backed up and rewritten. Keep `supportsWebSockets` and `responsesWebSocketsV2` in the same state, and enable them only when at least one YAML-managed account has a current direct Codex WSv2 smoke that passes. If no upstream profile can sustain Responses WSv2, the honest long-term state is `false/false` so Codex uses HTTP Responses directly instead of repeatedly reconnecting before `response.completed`. `localCodex.responsesSmokeModel` is the YAML-declared model used by `codex-pool validate` for the lightweight `POST /v1/responses` smoke.

 Enable account-level WebSocket v2 only for upstream profiles that have passed a direct Codex WSv2 probe. Treat this as a YAML-declared capability set, not a hard scheduling pin to one profile; if `localCodex` enables WebSocket transport, `codex-pool validate` must show at least one current `webSocketsV2.schedulableEnabled` account, and runtime smoke remains the availability proof. The same validation reports each managed account's runtime WebSocket v2 mode and whether it matches YAML, so stale `ctx_pool` / `passthrough` settings cannot silently keep routing Codex WS sessions to an upstream that closes with `no available account`, WS handshake 5xx/4xx, or before `response.completed`.
@@ -101,6 +102,8 @@ When `publicExposure.enabled` is true, the same FRP TCP bridge exposes both Open

 The public management UI is an operations endpoint. Keep Sub2API itself in `platform-infra`, keep the Kubernetes Service as ClusterIP, and treat FRP as the only public bridge unless a later decision explicitly changes the exposure model.

+The public bridge has two separate failure classes. Sub2API upstream/account failures are visible in Sub2API logs and should be handled by temporary-unschedulable rules, sentinel quarantine, or Sub2API failover. Edge failures between master Caddy and the FRP remotePort are not visible to Sub2API; symptoms include Caddy `connect: connection refused`, EOF, connection reset, or short 502 bursts while frps closes and reopens the configured remotePort. Those failures must be diagnosed from Caddy and frps/frpc evidence and mitigated through YAML-controlled Caddy edge retry or FRP stability fixes, not by disabling accounts or changing pool membership.
+
 ## Availability And Probes

 Kubernetes readiness is not the same as pool availability:
@@ -109,6 +112,7 @@ Kubernetes readiness is not the same as pool availability:
 - The FRP client deployment is currently a simple connector deployment and does not itself prove that master-local traffic reaches Sub2API.
 - No scheduled `CronJob`, `ServiceMonitor`, or `PodMonitor` currently proves the full unified Codex API path.
 - `platform-infra sub2api validate` and `platform-infra sub2api codex-pool validate` are on-demand checks. Operational usage is documented in `$unidesk-sub2api`; they are acceptable for deployment closeout, but they are not continuous monitoring. `codex-pool validate` must test both `GET /v1/models` and a small `POST /v1/responses` request, and the Responses smoke should report request id, selected/final account evidence, upstream failover count, and whether the validation succeeded only after failover. It should also summarize recent `/responses` and `/responses/compact` gateway failures separately so ordinary long streaming failures are not hidden behind compact-only evidence.
+- Public exposure closeout must include the edge layer when the user-facing URL is involved. A Sub2API-side compact success summary does not rule out Caddy/FRP 502s that happened before Sub2API received the request; inspect the edge Caddy/frps/frpc evidence or use a CLI report that summarizes it before declaring public compact stable.
 - Because `codex-pool validate` includes account alignment, recent-log inspection, and gateway smoke, timeout of the CLI transport is not valid negative evidence about Sub2API scheduling by itself. Closeout evidence must come from the final structured validation result or from an explicitly reported remote job failure with stdout/stderr tail, not from a single low-level `trans` timeout.

 When an automatic availability probe is added, it should be YAML-controlled and cover these layers without printing secrets: