Files
pikasTech-unidesk/docs/reference/platform-infra.md
T
2026-06-12 05:37:16 +00:00

173 lines
36 KiB
Markdown

# Platform Infra
`platform-infra` is the k3s namespace for UniDesk-operated shared platform services. G14 is the active default runtime for this namespace; D601 may host explicitly declared standby platform targets when the service needs node-local preparation or cutover capacity. It is separate from HWLAB runtime lanes, AgentRun lanes, D601 user services, and legacy `devops-infra` control-plane helpers. New shared infra should land here first; old `devops-infra` resources migrate gradually only when a concrete owner and validation path exist.
## Source Of Truth
- UniDesk-owned platform configuration must be YAML-first. `config/platform-infra/*.yaml` is the durable source for images, versions, endpoints, FRP exposure, account profile selection, and local consumer configuration.
- Runtime Secrets and local `~/.codex/config.toml*` / `auth.json*` files are inputs or generated local state, not committed truth. CLI output may show Secret paths, byte counts, fingerprints, and short previews only; it must not print complete API keys.
- Code that reads platform YAML must validate object shape, field types, required fields, Kubernetes names, image strings, and ports before mutating G14 k3s or local consumer files.
- Do not hide image versions, namespace names, endpoint URLs, FRP ports, or profile lists in Python/TOML/JSON helper constants when they are UniDesk-owned choices. External tools may still require their own TOML/JSON/env file formats at the edge.
## Sub2API Deployment Boundary
- Sub2API is a platform service operated by UniDesk in namespace `platform-infra`. It is not a HWLAB lane workload, AgentRun workload, D601 user service, or master server daemon.
- The canonical deployment entrypoint is `bun scripts/cli.ts platform-infra sub2api plan|apply|status|validate|codex-pool`. Runtime targets are selected with `--target`; `G14` is the active default target and `D601` is a standby target controlled by the same YAML. Daily operation procedures live in `$unidesk-sub2api` at `.agents/skills/unidesk-sub2api/SKILL.md`. This reference keeps only development boundaries and project-specific source-of-truth rules.
- Raw `kubectl` through `trans <target>:k3s` is only for bounded diagnosis and evidence, not a formal mutate path.
- The image version is controlled by `config/platform-infra/sub2api.yaml`. Image update procedures are daily operations owned by `$unidesk-sub2api`; the development boundary is that image choices remain YAML-controlled.
- Sub2API should stay ClusterIP-only by default. Do not add Ingress, NodePort, LoadBalancer, or broad FRP exposure unless a YAML-controlled public exposure decision exists.
- Sub2API currently has no resource limits by design. Do not add CPU or memory limits unless a later explicit decision changes that policy and stores the new policy in YAML.
- Master server is a consumer/control host, not the runtime location. Do not deploy Sub2API, PostgreSQL, Redis, or heavy validation loops on master server.
- D601 Sub2API is a predeployment target, not a second active singleton. While the platform database handoff is pending, it must render without a local PostgreSQL StatefulSet, keep the Sub2API app and local Redis cache scaled to zero, and use only ephemeral Redis storage when Redis is later activated. After the external platform DB endpoint, Secret, and runtime images are ready, activation must be expressed by YAML and applied through the same `platform-infra sub2api --target D601` CLI path.
- External platform PostgreSQL endpoints for Sub2API are produced by the platform DB YAML and its `platform-db postgres` CLI. Cross-node Sub2API consumers connect directly to that endpoint; the master server is not a PostgreSQL data-plane relay. DNS aliases are optional when the exported `DATABASE_URL` uses a reachable IP with `sslmode=require`; current PK01-specific rules live in `docs/reference/pk01.md`.
- Sub2API account sentinel and public FRP exposure remain singleton concerns. Do not create a second sentinel or public management surface for D601 unless a later YAML-controlled decision explicitly moves or splits that responsibility.
## Codex Pool Routing
`config/platform-infra/sub2api-codex-pool.yaml` controls the Codex-facing OpenAI-compatible pool:
- `pool.groupName` names the Sub2API group that represents the pool.
- `pool.apiKeySecretName` and `pool.apiKeySecretKey` name the k3s Secret that stores the single consumer API key.
- `pool.minOwnerConcurrency` is optional; when omitted, the CLI automatically uses the sum of all resolved account capacities as the minimum concurrency for the Sub2API user that owns the unified consumer API key. A YAML value is only an explicit override and must still be at least that capacity sum, so the shared key does not fail requests or WS sessions at the user-concurrency layer. "Resolved" means each account's explicit `profiles.entries[].capacity` or, when omitted, `pool.defaultAccountCapacity`. Do not compensate for owner-concurrency 1013 errors by pinning capacity to one provider.
- `pool.defaultTempUnschedulable` is the Sub2API built-in request-path temporary-unschedulable switch plus its YAML rule list. When enabled, `codex-pool sync --confirm` renders `temp_unschedulable_enabled` and `temp_unschedulable_rules` into every managed account unless an account-level override says otherwise. This is the generic same-request recovery path for selected-account upstream failures: a matching upstream error briefly cools the selected account so Sub2API's existing failover loop can select another account in the same group.
- The built-in temporary-unschedulable configuration and external `sentinel.*` configuration are separate control surfaces. `pool.defaultTempUnschedulable` handles near-real-time request-path cooling and failover; `sentinel.*` handles account-level marker health, quarantine, restore, and probe cadence. Changing one surface must not silently rewrite the other surface's cadence, marker semantics, quarantine state, or rule list.
- The external sentinel write surface is intentionally limited to the Sub2API admin `schedulable` action. Sentinel freeze/restore may set `schedulable=false|true`, but must not write, clear, or indirectly clear Sub2API request-path runtime state such as `temp_unschedulable_until`, `temp_unschedulable_reason`, rate-limit, overload, or model-rate-limit state. In particular, sentinel restore must not call Sub2API `recover-state`, because that endpoint is a broader runtime-state recovery operation rather than a pure schedulability restore.
- Codex accounts selected by YAML do not declare `schedulable` as durable configuration. `schedulable=true` is a `codex-pool sync --confirm` process-control baseline for UniDesk-managed accounts that are not under sentinel quarantine, not a YAML field.
- `codex-pool sync --confirm` preserves UniDesk-managed accounts that are absent from YAML by default; explicit upstream retirement requires `codex-pool sync --confirm --prune-removed`. This keeps account deletion out of the normal availability-recovery path and prevents temporary YAML edits from becoming destructive runtime changes.
- `profiles.entries` selects local Codex profile files from `~/.codex/` and maps them to Sub2API account names.
- The unsuffixed master `~/.codex/config.toml` and `~/.codex/auth.json` are reserved for the unified Sub2API consumer. `config.toml` must keep `base_url = "https://sub2api.74-48-78-17.nip.io/"`, and `auth.json` must contain the unified pool API key from `pool.apiKeySecretName` / `pool.apiKeySecretKey`. Do not replace these two files with direct upstream account credentials.
- Additional upstream accounts must use suffixed local profile files such as `config.toml.<profile>` and `auth.json.<profile>`, then be declared through `profiles.entries` in `config/platform-infra/sub2api-codex-pool.yaml`.
- `profiles.entries[].capacity` optionally overrides `pool.defaultAccountCapacity` for one account. Capacity is a YAML-controlled routing input; concrete current values belong only in `config/platform-infra/sub2api-codex-pool.yaml` and runtime validation output, not in long-term reference prose. Code constants, Secrets, ad-hoc runtime patches, or stale tests must not override YAML source of truth.
- `profiles.entries[].loadFactor` optionally overrides `pool.defaultAccountLoadFactor` for one account and is rendered to Sub2API `load_factor`. Treat it as routing policy: values belong in YAML and `codex-pool validate` output, not code constants, Secrets, or ad-hoc runtime patches.
- Do not change account membership, priority, capacity, load factor, WebSocket mode, or other routing policy from inference alone. Unless the user explicitly asks for a configuration change, first preserve the current YAML, collect provenance and runtime evidence, and write the finding to the relevant issue or runbook before proposing a change.
- Sub2API is a source-available UniDesk-operated runtime component. For Sub2API scheduling, failover, temporary-unschedulable behavior, error propagation, and account selection, the default investigation path is to read the current Sub2API source implementation and then verify it with real request ids, gateway logs, and original-entry traffic. Do not use mock upstreams, temporary probe accounts, or test stubs as the default proof for Sub2API behavior; those are explicit debug aids only and do not replace source-path review plus runtime evidence.
- `profiles.entries[].tempUnschedulable` may override the pool default for one account. When enabled, the CLI renders it into Sub2API credentials as `temp_unschedulable_enabled` and `temp_unschedulable_rules`; when disabled, runtime credentials omit both fields. Use account-level override only for an explicit deviation from the pool policy, not as an availability workaround for a named account.
- Codex account-state, quota prompts, model-routing failures, encrypted-content affinity failures, gateway wrappers, and timeout-like upstream errors must be handled by the generic temporary-unschedulable/failover path plus the external marker sentinel. Do not change membership, priority, capacity, load factor, WebSocket mode, `pool_mode`, or a specific provider's status merely to work around those errors. If a matching upstream failure still logs `openai.forward_failed` without `openai.upstream_failover_switching`, the missing fix is in Sub2API's HTTP `/responses` failover classification/error propagation, not in account pinning.
- `profiles.entries[].openaiResponsesWebSocketsV2Mode` is the account-level Responses WebSocket v2 switch for OpenAI-compatible upstreams that require WebSocket transport. Allowed values are `off`, `ctx_pool`, and `passthrough`; omit the field unless that upstream needs it.
- `profiles.entries[].upstreamUserAgent` is an optional account-level upstream request User-Agent override. Use it only for upstreams that require a Codex CLI compatible User-Agent; keep the value YAML-controlled and newline-free.
- `publicExposure` controls the optional FRP bridge from master server to the G14 ClusterIP service.
- `publicExposure.masterCaddy.responseHeaderTimeoutSeconds` controls the master Caddy `response_header_timeout` for the public Sub2API site. It must be long enough for Codex `/responses/compact` requests; otherwise Caddy can return a client-visible 504 before Sub2API finishes the upstream compact request, and that edge timeout is not an account-level upstream failure that Sub2API can use for temporary-unschedulable failover. The numeric value belongs only in `config/platform-infra/sub2api-codex-pool.yaml`; after changing it, use `codex-pool expose --confirm` to reload Caddy and verify the rendered `response_header_timeout`. Requests that were already in flight before the reload may still finish with the previous timeout, so post-change evidence should check only requests that started after the reload.
- `publicExposure.masterCaddy.edgeRetry` controls the master Caddy reverse-proxy retry window for the public Sub2API site. This belongs at the edge because FRP remotePort listener loss, `connection refused`, EOF, or connection reset can happen before a request reaches Sub2API, so Sub2API account failover and sentinel logic cannot observe or recover that request. Keep retry scope narrow, especially for non-idempotent POST traffic: connection-attempt failures may be retried by the reverse proxy, while round-trip retry after an upstream connection was established should be limited by YAML `retryMatch` to paths that are safe to repeat, such as compact. Retry durations and intervals belong only in YAML; after changing them, run `codex-pool expose --confirm` and verify the rendered Caddyfile contains the expected `lb_try_duration`, `lb_try_interval`, and `lb_retry_match`.
- `localCodex` controls how the master server's current `~/.codex` consumer files are backed up and rewritten. Keep `supportsWebSockets` and `responsesWebSocketsV2` in the same state, and enable them only when at least one YAML-managed account has a current direct Codex WSv2 smoke that passes. If no upstream profile can sustain Responses WSv2, the honest long-term state is `false/false` so Codex uses HTTP Responses directly instead of repeatedly reconnecting before `response.completed`. `localCodex.responsesSmokeModel` is the YAML-declared model used by `codex-pool validate` for the lightweight `POST /v1/responses` smoke.
Enable account-level WebSocket v2 only for upstream profiles that have passed a direct Codex WSv2 probe. Treat this as a YAML-declared capability set, not a hard scheduling pin to one profile; if `localCodex` enables WebSocket transport, `codex-pool validate` must show at least one current `webSocketsV2.schedulableEnabled` account, and runtime smoke remains the availability proof. The same validation reports each managed account's runtime WebSocket v2 mode and whether it matches YAML, so stale `ctx_pool` / `passthrough` settings cannot silently keep routing Codex WS sessions to an upstream that closes with `no available account`, WS handshake 5xx/4xx, or before `response.completed`.
When Codex startup repeatedly reports WebSocket reconnects or HTTPS fallback, preserve membership, priority, capacity, load factor, and other routing policy until runtime logs identify the failing account and transport. If bounded Sub2API logs show repeated `openai.websocket_proxy_failed`, `openai.websocket_account_select_failed`, upstream WS handshake 4xx/5xx, or repeated close-before-`response.completed` for the only WS-capable account, remove that account from the WSv2 capability set in YAML; if the resulting capability set is empty, also turn off the `localCodex` WS feature flags. Then run `codex-pool sync --confirm`, `codex-pool validate`, and prove the result with a Codex smoke that no longer emits reconnects.
Do not encode current availability assumptions in long-term reference prose. If an account needs a higher concurrency or load factor than the pool default, make that a deliberate YAML override and verify it with `codex-pool validate`; the reference document should describe the rule, not repeat the current numeric value.
Do not enable Sub2API `pool_mode` for UniDesk-managed Codex accounts. `pool_mode` retries the same selected account path and does not replace temporary-unschedulable request failover or sentinel quarantine. The current failover and recovery model is: matching request-path errors temporarily cool the selected account and trigger same-group failover, while the external marker-only sentinel freezes or restores account schedulability from direct marker probes.
Sub2API temporary-unschedulable rules require both an HTTP status match and a response-body keyword match in the upstream failure/error path. UniDesk uses these rules as a generic request-path failover trigger, not as a successful-response content classifier. Runtime UI fields such as trigger time, release time, matched keyword, and rule index identify this built-in request-path state and should not be attributed to sentinel unless separate sentinel state shows an active quarantine. HTTP 200 private content, maintenance text, quota prompts, ads, and similar semantic failures remain the external account-level sentinel's job.
The `invalid_encrypted_content` failure mode is a stable regression guard for Codex pool routing. It means an upstream could not verify or parse encrypted Responses/Codex state carried by the request; a fresh account probe can still pass while a large resumed request fails because the encrypted content is not acceptable to that selected upstream. The required behavior is generic: Sub2API should perform its built-in recoverable handling for encrypted reasoning state when available, mark the selected account temporarily unschedulable when the configured status/keyword rule matches, and continue same-group failover before the client sees a final failure whenever the response has not already been committed. Do not interpret this failure as proof that the pool should pin to `only`, delete the selected account, change membership/priority/capacity/load factor, or move the error into sentinel-specific provider logic.
For this failure class, the regression evidence must come from the real request path. A valid investigation should connect the client request id to Sub2API gateway logs showing the selected account id, upstream status, `account_temp_unschedulable`, `openai.upstream_failover_switching`, and the final access-log status. A `sentinel-report` row with `quarantineActive=false` and marker success proves only that the external marker sentinel did not quarantine that account; it does not disprove request-path temporary cooling. Conversely, a marker sentinel recovery must not call `recover-state` or clear the temporary-unschedulable state created by the failed request. If this failure still reaches the client as 502/503 while another schedulable account is available and no stream bytes were committed, fix Sub2API failover classification/error propagation or the UniDesk sync/render path rather than adding mock probes, provider pinning, or account-specific exceptions.
## Sub2API Account Test Semantics
Sub2API v0.1.136 has a separate management-plane account connection test. The admin WebUI account modal calls `POST /api/v1/admin/accounts/:id/test` with `model_id` and, for the admin account table modal, no OpenAI `mode`; the backend binds this to `AccountTestService.TestAccountConnection`, which normalizes an empty mode to `default`.
For OpenAI API-key accounts in default mode, the test loads the account by id, applies `account.GetMappedModel(model_id)`, checks `openai_compat.ShouldUseResponsesAPI(account.Extra)`, and then builds an upstream URL from the account base URL with `/v1/responses`. It sends a direct upstream request through `httpUpstream.DoWithTLS` with `Content-Type: application/json` and `Authorization: Bearer <account-key>`. The request body is Responses API SSE, not a non-streaming JSON request: `model` is the mapped model, `input` is one user message whose text is `hi`, `stream` is `true`, and `instructions` is Sub2API's embedded OpenAI default instructions. For API-key accounts it does not set `store: false`, `max_output_tokens`, Codex CLI `User-Agent`, `OpenAI-Beta`, `Originator`, `Version`, `Session_ID`, or `Conversation_ID`; those Codex-like headers appear in other paths such as compact probing, not in the default account test.
The management test success criterion is transport and stream completion, not semantic content. A non-200 upstream response becomes an SSE error. A 200 response is considered successful when `processOpenAIStream` sees `response.completed` or `response.done`; `response.output_text.delta` chunks are forwarded to the WebUI as display text, while `response.failed`, `error`, or EOF before completion fails the test. Therefore a WebUI "hi" success proves that this direct account can complete a streaming `/v1/responses` request with Sub2API's default payload shape, but it does not prove that a non-streaming Responses request, marker prompt, `max_output_tokens`, `store: false`, Codex header set, compact path, WebSocket path, or normal pool-scheduled gateway request will behave identically.
This management-plane test is also outside the normal consumer gateway scheduler. It fetches the account by id instead of listing only schedulable accounts, so `status=active` in the modal and a successful account test can coexist with `schedulable=false` in scheduler state. Because the test performs its own outbound `DoWithTLS` call, regular gateway access logs and usage logs may not contain the upstream account id/path/status evidence expected from ordinary `/v1/responses` traffic. When diagnosing account tests, use the management route semantics above or Sub2API source, not gateway access-log absence or an unrelated pool request as proof.
An external account-level sentinel that wants parity with this WebUI path should reuse the same request shape as far as the standard OpenAI SDK allows: direct account credentials, Responses API, `stream=true`, no `store: false` for API-key accounts, no upstream `max_output_tokens` field, and success parsing based on the streaming events. A local stream delta collection limit is acceptable as a sentinel safety bound, but it should not change the upstream request body. The sentinel may replace the user text `hi` with a marker prompt, but it should not introduce extra request fields or Codex/compact headers merely for convenience. If a marker-only sentinel intentionally diverges from the management test shape, the divergence must be documented in probe output so a WebUI success and sentinel failure are not misread as operator error.
## Account Sentinel Marker Contract
The UniDesk account-level sentinel uses marker-only health semantics. A probe is healthy only when the upstream response satisfies the configured marker match. Every other result is unhealthy and must enter the same exponential freeze state machine, regardless of whether the immediate response is HTTP 200, 400, 403, 429, 500, 502, 503, 504, a streaming error event, malformed output, empty output, timeout, or any other transport/API failure. HTTP status, upstream error code, body hash, body preview, headers, and SDK exception class are diagnostics only; they must not become additional allow/deny criteria that bypass marker mismatch. Sentinel actions are only `schedulable=false` on freeze and `schedulable=true` on marker-matching recovery; they must not clear Sub2API temporary-unschedulable or rate-limit state as part of marker recovery.
The sentinel must not maintain separate classifiers for "private content", "maintenance", "quota", "ads", or provider-specific body phrases as health gates. The only recovery condition is a later recovery probe that matches the marker. Freeze TTL expiry only schedules the next recovery probe; it does not restore an account by itself. Repeated non-marker results use a short exponential freeze backoff because failed marker probes produce little or no useful output token usage; repeated marker-matching results use the configured success cadence backoff. This contract applies equally to OpenAI Responses `gpt-5.5` direct account probes and manual `codex-pool sentinel-probe --account ... --confirm` measurements.
`profiles.entries[].trustUpstream` is the durable account-level trust marker for sentinel success cadence, and the absence of the field means untrusted. Trusted and untrusted accounts use separate YAML cadence maximums after marker-matching probes; the values belong only in `config/platform-infra/sub2api-codex-pool.yaml`. This field must not change Sub2API scheduler priority, capacity, load factor, membership, built-in temporary-unschedulable settings, or the marker-only health contract. Its purpose is to keep intermittently unreliable 200-success providers under more frequent direct probes without adding provider-specific content classifiers.
When `codex-pool sync --confirm` creates a YAML-managed account or changes direct-probe-relevant account inputs such as the profile mapping, upstream base URL, API key fingerprint, upstream User-Agent, Responses WebSocket mode, or `trustUpstream`, sync records a pending sentinel probe from the pre-mutation runtime state, updates the account, restores `schedulable=true` unless an active sentinel quarantine already exists, and schedules the account probe immediately. New or changed accounts are not default-frozen; only an actual non-marker probe result or an existing active quarantine may remove an account from the scheduler. This avoids zero-available windows during sync while still ensuring that later marker failures enter the normal freeze/restore state machine. Unchanged accounts must not have their existing success or failure backoff reset by unrelated YAML syncs.
If the YAML failure freeze maximum is lowered, `codex-pool sync --confirm` may migrate only currently active sentinel quarantines whose stored interval or next recovery time exceeds the current maximum. The migration keeps the account frozen, marks the next recovery probe due immediately, and lets the next marker result decide restore versus the new shorter failure backoff. It must not clear quarantine or restore schedulability merely because an older TTL has expired.
If the YAML success cadence maximum is lowered or an account changes trust class, `codex-pool sync --confirm` may clamp existing successful account state so the next probe is due under the current YAML policy instead of waiting for an older, longer success window to expire. This clamp only affects sentinel state and probe timing; it does not by itself restore a quarantined account or bypass the next marker result.
Operational observation for this sentinel should use the read-only `codex-pool sentinel-report` table or its `--raw` form. It is the canonical low-noise view for per-account probe count, trust class, marker result, HTTP/error diagnostics, freeze TTL, success cadence, success cadence maximum, next probe time, and recent CronJob runs; raw ConfigMap dumps and ad hoc log scraping are fallback diagnostics, not the primary state surface.
The request path is:
1. A client sends an OpenAI-compatible request to the configured consumer base URL, normally `https://sub2api.74-48-78-17.nip.io/v1/...`, with the unified API key.
2. master `frps` forwards the TCP connection to `platform-infra/sub2api-frpc` when `publicExposure.enabled` is true.
3. `sub2api-frpc` forwards to `sub2api.platform-infra.svc.cluster.local:8080`.
4. Sub2API validates the unified key and resolves its `group_id`.
5. Accounts listed in `profiles.entries` are bound to the same group via `group_ids`, so Sub2API dispatches through that group using its own account selection semantics.
Adding, removing, exposing, validating, and configuring local Codex consumers are daily operations covered by `$unidesk-sub2api`. The development rule is that ordinary pool membership changes stay YAML-only and do not add code or CI/CD. Code changes are only appropriate when UniDesk needs to render or validate a Sub2API capability that already exists upstream, such as account-level WebSocket mode or per-account upstream User-Agent. If Sub2API itself does not support a desired behavior, do not magic-patch it through UniDesk scripts, Kubernetes hotfixes, local forks, or hidden compatibility paths; either leave the behavior unsupported or pursue it upstream as an explicit Sub2API feature.
`codex-pool sync --confirm` and `codex-pool validate` are runtime operations that may need more than one SSH short-connection window because they log in to Sub2API, reconcile accounts, inspect recent logs, and run gateway smoke requests. The formal entry remains the UniDesk CLI, which must use a submit-and-short-poll control shape or an equivalent remote job wrapper instead of one long `trans G14:k3s script` call. If these commands fail with `UNIDESK_SSH_RUNTIME_TIMEOUT` while the remote operation may still be running, treat it as a control-plane visibility gap first: improve or use the CLI's job/poll path, then rerun `sync` or `validate`. Do not replace it with raw `kubectl`, manual Sub2API admin API patches, repeated blind full loops, or Sub2API source modifications.
After `codex-pool configure-local --confirm`, the default `~/.codex/config.toml` / `auth.json` pair must remain the unified Sub2API consumer and must not be reused as an upstream account profile. Keep every upstream source profile in suffixed files such as `config.toml.<profile>` / `auth.json.<profile>` and register it through YAML `profiles.entries`.
## Public FRP Boundary
When `publicExposure.enabled` is true, the same FRP TCP bridge exposes both OpenAI-compatible API paths and the built-in Sub2API management frontend. The management UI is reachable at the configured `publicExposure.publicBaseUrl` and its `/login` route; do not allocate a second public port unless a separate YAML-controlled exposure decision exists.
The public management UI is an operations endpoint. Keep Sub2API itself in `platform-infra`, keep the Kubernetes Service as ClusterIP, and treat FRP as the only public bridge unless a later decision explicitly changes the exposure model.
The public bridge has two separate failure classes. Sub2API upstream/account failures are visible in Sub2API logs and currently belong to sentinel quarantine plus normal Sub2API routing among schedulable accounts. Edge failures between master Caddy and the FRP remotePort are not visible to Sub2API; symptoms include Caddy `connect: connection refused`, EOF, connection reset, or short 502 bursts while frps closes and reopens the configured remotePort. Those failures must be diagnosed from Caddy and frps/frpc evidence and mitigated through YAML-controlled Caddy edge retry or FRP stability fixes, not by disabling accounts or changing pool membership.
## Availability And Probes
Kubernetes readiness is not the same as pool availability:
- The Sub2API app, PostgreSQL, and Redis manifests include container-level health probes. These only prove the pods and local dependencies are healthy enough for Kubernetes scheduling.
- The FRP client deployment is currently a simple connector deployment and does not itself prove that master-local traffic reaches Sub2API.
- No scheduled `CronJob`, `ServiceMonitor`, or `PodMonitor` currently proves the full unified Codex API path.
- `platform-infra sub2api validate` and `platform-infra sub2api codex-pool validate` are on-demand checks. Operational usage is documented in `$unidesk-sub2api`; they are acceptable for deployment closeout, but they are not continuous monitoring. `codex-pool validate` must test both `GET /v1/models` and a small `POST /v1/responses` request, and the Responses smoke should report request id, selected/final account evidence, upstream failover count, and whether the validation succeeded only after failover. It should also summarize recent `/responses` and `/responses/compact` gateway failures separately so ordinary long streaming failures are not hidden behind compact-only evidence.
- `codex-pool validate` must not create mock upstreams or temporary failover-probe accounts as its default proof of Sub2API behavior. When a suspected failover path is in question, validate should surface the relevant source-path expectation and real runtime evidence: request ids, selected/final account ids, `openai.upstream_failover_switching`, `openai.forward_failed`, `openai.account_select_failed`, and final status. If runtime evidence contradicts the source-path expectation, fix Sub2API or the UniDesk integration path rather than converting the mismatch into a mock-only success.
- Public exposure closeout must include the edge layer when the user-facing URL is involved. A Sub2API-side compact success summary does not rule out Caddy/FRP 502s that happened before Sub2API received the request; inspect the edge Caddy/frps/frpc evidence or use a CLI report that summarizes it before declaring public compact stable.
- Because `codex-pool validate` includes account alignment, recent-log inspection, and gateway smoke, timeout of the CLI transport is not valid negative evidence about Sub2API scheduling by itself. Closeout evidence must come from the final structured validation result or from an explicitly reported remote job failure with stdout/stderr tail, not from a single low-level `trans` timeout.
When an automatic availability probe is added, it should be YAML-controlled and cover these layers without printing secrets:
1. G14 in-cluster `GET /v1/models` through `sub2api.platform-infra.svc.cluster.local:8080` with the unified key.
2. master-local `GET /v1/models` through the configured FRP endpoint when public exposure is enabled.
3. A tiny `POST /v1/responses` call through the same consumer URL for true OpenAI-compatible request validation.
4. Optional per-upstream account probes if Sub2API exposes a safe account selection or admin-health mechanism; otherwise document that group-level success does not prove every upstream account is healthy.
Until continuous probing exists, closeout comments must state that validation was on-demand and include the exact CLI/API entrypoints used.
## k3s Network Policy Requirements
G14 k3s runs kube-router as its network policy controller. When any NetworkPolicy CRD exists in a namespace, kube-router replaces its default allow-all behavior with explicit iptables/ipset rules that only permit traffic matching declared policies. If a namespace has NetworkPolicy resources but the generated iptables rules miss or incorrectly evaluate a traffic path, pods in that namespace will experience silent connection timeouts (REJECT with `icmp-port-unreachable`) even though `kubectl get networkpolicy` shows the policy and DNS/service resolution works.
The `platform-infra` namespace **must** have a `NetworkPolicy` named `allow-all` (or equivalent) that explicitly permits all ingress and egress within the namespace. Without it, kube-router's default-deny iptables chains block cross-pod traffic including Sub2API → PostgreSQL and Sub2API → Redis connections, causing Sub2API init containers and background services to hang with `context deadline exceeded` or `no response` errors.
Diagnostic symptoms:
- Sub2API pod stuck `Init:0/2` with `wait-postgres` logging `sub2api-postgres:5432 - no response` perpetually
- `pg_isready` succeeds inside the postgres pod itself but TCP from any other pod times out
- `kubectl exec` from a different pod or `nc -zv` to the postgres ClusterIP/pod-IP returns `Operation timed out`
- `iptables -L KUBE-ROUTER-INPUT -n | grep <namespace>` shows per-pod FW chains; the chain ends with `REJECT ... mark match ! 0x10000/0x10000`
If kube-router iptables rules become stale after a NetworkPolicy create/update cycle (e.g., ipset references old pod IPs or mark-bit logic fails to match), the fastest recovery is: `iptables -I FORWARD 1 -s 10.42.0.0/16 -d 10.42.0.0/16 -j ACCEPT` as a temporary bypass, then recreate the NetworkPolicy or restart kube-router/k3s to force a full iptables sync. After recovery, remove the temporary rule: `iptables -D FORWARD -s 10.42.0.0/16 -d 10.42.0.0/16 -j ACCEPT`.
The manifest for the required `allow-all` policy is:
```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: allow-all
namespace: platform-infra
spec:
podSelector: {}
policyTypes:
- Ingress
- Egress
ingress:
- {}
egress:
- {}
```
This policy must be included in the `sub2api plan` / `apply` manifest rendering so that it is created as part of the normal deployment flow, not maintained as a manual one-off.
`platform-infra sub2api status` must report whether `NetworkPolicy/allow-all` exists and still has `podSelector: {}`, `policyTypes: [Ingress, Egress]`, `ingress: [{}]`, and `egress: [{}]`. For active bundled targets, `platform-infra sub2api validate` must also run temporary in-namespace probe pods that connect to `sub2api-postgres:5432` and `sub2api-redis:6379`; local `pg_isready` inside the PostgreSQL pod alone is insufficient because it does not exercise kube-router cross-pod policy evaluation. For external-DB pending standby targets, `validate --target` checks the predeployment shape instead: no local PostgreSQL, app replicas zero, ClusterIP services, allow-all NetworkPolicy, and local Redis declared as ephemeral cache with readiness required only when Redis replicas are above zero.