G14 Platform Infra

platform-infra is the G14 k3s namespace for UniDesk-operated shared platform services. It is separate from HWLAB runtime lanes, AgentRun lanes, D601 user services, and legacy devops-infra control-plane helpers. New shared infra should land here first; old devops-infra resources migrate gradually only when a concrete owner and validation path exist.

Source Of Truth

  • UniDesk-owned platform configuration must be YAML-first. config/platform-infra/*.yaml is the durable source for images, versions, endpoints, FRP exposure, account profile selection, and local consumer configuration.
  • Runtime Secrets and local ~/.codex/config.toml* / auth.json* files are inputs or generated local state, not committed truth. CLI output may show Secret paths, byte counts, fingerprints, and short previews only; it must not print complete API keys.
  • Code that reads platform YAML must validate object shape, field types, required fields, Kubernetes names, image strings, and ports before mutating G14 k3s or local consumer files.
  • Do not hide image versions, namespace names, endpoint URLs, FRP ports, or profile lists in Python/TOML/JSON helper constants when they are UniDesk-owned choices. External tools may still require their own TOML/JSON/env file formats at the edge.
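The validate-before-mutate rule above could look roughly like the following TypeScript sketch. It assumes the YAML has already been parsed into a plain object. The `Sub2ApiConfig` shape, its field names (`namespace`, `image`, `servicePort`), and the function `validateSub2ApiConfig` are hypothetical illustrations, not the real schema of config/platform-infra/sub2api.yaml.

```typescript
// Hypothetical config shape; the real sub2api.yaml schema may differ.
interface Sub2ApiConfig {
  namespace: string;
  image: string;
  servicePort: number;
}

// RFC 1123 DNS label, the rule Kubernetes applies to namespace names.
const DNS_LABEL = /^[a-z0-9]([-a-z0-9]*[a-z0-9])?$/;
// Loose image check (an assumption): lowercase repository path plus an explicit tag or digest.
const IMAGE_REF = /^[a-z0-9][a-z0-9._\/:-]*(:[\w][\w.-]{0,127}|@sha256:[a-f0-9]{64})$/;

// Throws on the first problem so that no mutate step runs on a bad config.
function validateSub2ApiConfig(raw: unknown): Sub2ApiConfig {
  if (typeof raw !== "object" || raw === null || Array.isArray(raw)) {
    throw new Error("config must be a YAML mapping");
  }
  const { namespace, image, servicePort } = raw as Record<string, unknown>;
  if (typeof namespace !== "string" || namespace.length > 63 || !DNS_LABEL.test(namespace)) {
    throw new Error("namespace must be a valid Kubernetes DNS label");
  }
  if (typeof image !== "string" || !IMAGE_REF.test(image)) {
    throw new Error("image must be a reference with an explicit tag or digest");
  }
  if (
    typeof servicePort !== "number" ||
    !Number.isInteger(servicePort) ||
    servicePort < 1 ||
    servicePort > 65535
  ) {
    throw new Error("servicePort must be an integer in 1..65535");
  }
  return { namespace, image, servicePort };
}
```

Throwing before any cluster or local-file write keeps a malformed YAML edit from turning into a partial mutation.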

Sub2API Deployment Boundary

  • Sub2API is a G14 platform service operated by UniDesk in namespace platform-infra. It is not a HWLAB lane workload, AgentRun workload, D601 service, or master server daemon.
  • The canonical deployment entrypoint is bun scripts/cli.ts platform-infra sub2api plan|apply|status|validate|codex-pool; daily operation procedures live in $unidesk-sub2api at .agents/skills/unidesk-sub2api/SKILL.md. This reference keeps only development boundaries and project-specific source-of-truth rules.
  • Raw kubectl through trans G14:k3s is only for bounded diagnosis and evidence, not a formal mutate path.
  • The image version is controlled by config/platform-infra/sub2api.yaml. Image update procedures are daily operations owned by $unidesk-sub2api; the development boundary is that image choices remain YAML-controlled.
  • Sub2API should stay ClusterIP-only by default. Do not add Ingress, NodePort, LoadBalancer, or broad FRP exposure unless a YAML-controlled public exposure decision exists.
  • Sub2API currently has no resource limits by design. Do not add CPU or memory limits unless a later explicit decision changes that policy and stores the new policy in YAML.
  • Master server is a consumer/control host, not the runtime location. Do not deploy Sub2API, PostgreSQL, Redis, or heavy validation loops on master server.

Codex Pool Routing

config/platform-infra/sub2api-codex-pool.yaml controls the Codex-facing OpenAI-compatible pool:

  • pool.groupName names the Sub2API group that represents the pool.
  • pool.apiKeySecretName and pool.apiKeySecretKey name the k3s Secret that stores the single consumer API key.
  • pool.minOwnerConcurrency is optional; when omitted, the CLI automatically uses the sum of all resolved account capacities as the minimum concurrency for the Sub2API user that owns the unified consumer API key. A YAML value is only an explicit override and must still be at least that capacity sum, so the shared key does not fail requests or WS sessions at the user-concurrency layer. "Resolved" means each account's explicit profiles.entries[].capacity or, when omitted, pool.defaultAccountCapacity. Do not compensate for owner-concurrency 1013 errors by pinning capacity to one provider.
  • pool.defaultTempUnschedulable declares Sub2API account-level temporary unschedulable rules for capabilities that Sub2API itself already supports. Keep 429/overload/capacity, service-unavailable, gateway timeout, and stable model-routing failures in this YAML policy so the scheduler can cool down a failing account and choose another candidate instead of hard-pinning one provider. Do not declare unsupported Sub2API behavior in YAML as a promise that UniDesk code or runtime patches should emulate.
  • When a managed upstream repeatedly causes /v1/responses or /responses/compact failures, the required fix path is to make automatic temporary-unschedulable and failover work, then verify it with runtime evidence. Do not restore availability by manually disabling an account, deleting a managed account, removing its YAML entry, lowering membership, or otherwise changing routing policy merely to avoid the failing upstream; those actions are allowed only for an explicit upstream retirement or ownership change.
  • Codex accounts selected by YAML do not declare schedulable as durable configuration. schedulable=true is a codex-pool sync --confirm process-control baseline for UniDesk-managed accounts, not a YAML field. Account cooling must be represented by temp_unschedulable_until / temp_unschedulable_reason, so validation can distinguish real automatic cooldown from stale manual unschedulable state.
  • codex-pool sync --confirm preserves UniDesk-managed accounts that are absent from YAML by default; explicit upstream retirement requires codex-pool sync --confirm --prune-removed. This keeps account deletion out of the normal availability-recovery path and prevents temporary YAML edits from becoming destructive runtime changes.
  • profiles.entries selects local Codex profile files from ~/.codex/ and maps them to Sub2API account names.
  • The unsuffixed master ~/.codex/config.toml and ~/.codex/auth.json are reserved for the unified Sub2API consumer. config.toml must keep base_url = "https://sub2api.74-48-78-17.nip.io/", and auth.json must contain the unified pool API key from pool.apiKeySecretName / pool.apiKeySecretKey. Do not replace these two files with direct upstream account credentials.
  • Additional upstream accounts must use suffixed local profile files such as config.toml.<profile> and auth.json.<profile>, then be declared through profiles.entries in config/platform-infra/sub2api-codex-pool.yaml.
  • profiles.entries[].capacity optionally overrides pool.defaultAccountCapacity for one account. Capacity is a YAML-controlled routing input; concrete current values belong only in config/platform-infra/sub2api-codex-pool.yaml and runtime validation output, not in long-term reference prose. Code constants, Secrets, ad-hoc runtime patches, or stale tests must not override the YAML source of truth.
  • profiles.entries[].loadFactor optionally overrides pool.defaultAccountLoadFactor for one account and is rendered to Sub2API load_factor. Treat it as routing policy: values belong in YAML and codex-pool validate output, not code constants, Secrets, or ad-hoc runtime patches.
  • Do not change account membership, priority, capacity, load factor, WebSocket mode, or other routing policy from inference alone. Unless the user explicitly asks for a configuration change, first preserve the current YAML, collect provenance and runtime evidence, and write the finding to the relevant issue or runbook before proposing a change.
  • profiles.entries[].tempUnschedulable may override the pool default for one account. The CLI renders it into Sub2API credentials as temp_unschedulable_enabled and temp_unschedulable_rules; rules match HTTP status plus response-body keywords and place only that account into a temporary unschedulable cooldown.
  • Codex account-state or quota prompts that stop a task and ask the operator to switch accounts belong in pool.defaultTempUnschedulable, not in account membership, priority, capacity, load factor, WebSocket mode, or pool_mode. Keep stable body phrases such as weekly-limit and /status prompts in both the 403 account-state rule and the 429 quota/rate-limit rule, then run codex-pool sync --confirm and codex-pool validate. The validation evidence must include runtime temporary-unschedulable alignment for each managed account, not only successful group-level /v1/models or /v1/responses smoke output.
  • Upstream model-routing and Responses compatibility failures that surface as 400 responses, such as invalid_encrypted_content, bad_response_status_code, invalid_request_error with a stable unsupported-model message, unsupported-model wrappers, or stable "available models" messages, belong in pool.defaultTempUnschedulable when another account can handle the same Codex request. Upstream model-routing failures that surface as 503 responses, such as model_not_found or "no available channel for model" wrappers, also belong there. Gateway and timeout failures that surface as 502, 504, or 524 responses, including Gateway Timeout, Unknown error, Upstream request failed, context deadline exceeded, context canceled, or recovered upstream-error wrappers, belong in the same YAML policy. This is especially important for compact and long /responses requests, where an upstream Cloudflare 524 or account-specific compatibility failure may eventually reach Codex as a 502/504 unknown-error wrapper after failover or client cancellation. They are not membership, priority, capacity, load factor, WebSocket mode, or User-Agent decisions by themselves. After adding stable body phrases, run codex-pool sync --confirm and codex-pool validate, and verify the affected account's runtime status-specific rule includes the new keywords.
  • profiles.entries[].openaiResponsesWebSocketsV2Mode is the account-level Responses WebSocket v2 switch for OpenAI-compatible upstreams that require WebSocket transport. Allowed values are off, ctx_pool, and passthrough; omit the field unless that upstream needs it.
  • profiles.entries[].upstreamUserAgent is an optional account-level upstream request User-Agent override. Use it only for upstreams that require a Codex CLI compatible User-Agent; keep the value YAML-controlled and newline-free.
  • publicExposure controls the optional FRP bridge from master server to the G14 ClusterIP service.
  • publicExposure.masterCaddy.responseHeaderTimeoutSeconds controls the master Caddy response_header_timeout for the public Sub2API site. It must be long enough for Codex /responses/compact requests; otherwise Caddy can return a client-visible 504 before Sub2API finishes the upstream compact request, and that edge timeout is not an account-level upstream failure that Sub2API can use for temporary-unschedulable failover. The numeric value belongs only in config/platform-infra/sub2api-codex-pool.yaml; after changing it, use codex-pool expose --confirm to reload Caddy and verify the rendered response_header_timeout. Requests that were already in flight before the reload may still finish with the previous timeout, so post-change evidence should check only requests that started after the reload.
  • localCodex controls how the master server's current ~/.codex consumer files are backed up and rewritten. Keep supportsWebSockets and responsesWebSocketsV2 in the same state, and enable them only when at least one YAML-managed account has a current direct Codex WSv2 smoke that passes. If no upstream profile can sustain Responses WSv2, the honest long-term state is false/false so Codex uses HTTP Responses directly instead of repeatedly reconnecting before response.completed. localCodex.responsesSmokeModel is the YAML-declared model used by codex-pool validate for the lightweight POST /v1/responses smoke.
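The pool.minOwnerConcurrency rule in the list above can be sketched as follows. The field names come from the YAML paths named in this document; the interfaces, the `name` field, and the function `resolveOwnerConcurrency` are hypothetical, and any numbers used with it are illustrative only, not current pool values.

```typescript
// Mirrors profiles.entries[]; only the fields needed for this rule.
interface ProfileEntry {
  name: string; // hypothetical field for the Sub2API account name
  capacity?: number; // optional per-account override
}

// Mirrors the relevant pool.* fields.
interface PoolConfig {
  defaultAccountCapacity: number;
  minOwnerConcurrency?: number; // optional explicit override
}

// Returns the minimum concurrency for the user that owns the unified key.
function resolveOwnerConcurrency(pool: PoolConfig, entries: ProfileEntry[]): number {
  // "Resolved" capacity: the explicit entry value, else the pool default.
  const capacitySum = entries.reduce(
    (sum, entry) => sum + (entry.capacity ?? pool.defaultAccountCapacity),
    0,
  );
  if (pool.minOwnerConcurrency === undefined) {
    return capacitySum; // omitted: use the capacity sum automatically
  }
  if (pool.minOwnerConcurrency < capacitySum) {
    // An override below the sum would fail requests at the user-concurrency layer.
    throw new Error(
      `pool.minOwnerConcurrency ${pool.minOwnerConcurrency} is below the resolved capacity sum ${capacitySum}`,
    );
  }
  return pool.minOwnerConcurrency;
}
```

With two entries, one declaring capacity 3 and one relying on a default of 2, the resolved sum is 5; an override of 4 is rejected and an override of 8 is accepted.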

Enable account-level WebSocket v2 only for upstream profiles that have passed a direct Codex WSv2 probe. Treat this as a YAML-declared capability set, not a hard scheduling pin to one profile; if localCodex enables WebSocket transport, codex-pool validate must show at least one current webSocketsV2.schedulableEnabled account, and runtime smoke remains the availability proof. The same validation reports each managed account's runtime WebSocket v2 mode and whether it matches YAML, so stale ctx_pool / passthrough settings cannot silently keep routing Codex WS sessions to an upstream that closes with no available account, WS handshake 5xx/4xx, or before response.completed.
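A YAML-level consistency check for the WebSocket rules above might look like the sketch below. The field names follow the YAML paths in this document; the function `checkWebSocketConsistency`, the `name` field, and the choice to treat an omitted mode as "off" are assumptions. It checks declared configuration only, so runtime smoke remains the availability proof.

```typescript
type WsV2Mode = "off" | "ctx_pool" | "passthrough";

// Mirrors profiles.entries[]; only the WebSocket-related field.
interface WsProfileEntry {
  name: string; // hypothetical field for the account name
  openaiResponsesWebSocketsV2Mode?: WsV2Mode;
}

// Mirrors the two localCodex feature flags.
interface LocalCodexFlags {
  supportsWebSockets: boolean;
  responsesWebSocketsV2: boolean;
}

// Returns a list of problems; an empty list means the YAML is self-consistent.
function checkWebSocketConsistency(
  localCodex: LocalCodexFlags,
  entries: WsProfileEntry[],
): string[] {
  const problems: string[] = [];
  if (localCodex.supportsWebSockets !== localCodex.responsesWebSocketsV2) {
    problems.push("localCodex.supportsWebSockets and responsesWebSocketsV2 must be in the same state");
  }
  // Assumption: an omitted mode means the account is not WSv2-capable.
  const wsCapable = entries.filter(
    (entry) => (entry.openaiResponsesWebSocketsV2Mode ?? "off") !== "off",
  );
  if (localCodex.supportsWebSockets && wsCapable.length === 0) {
    problems.push("localCodex enables WebSocket transport but no account declares a WSv2 mode");
  }
  return problems;
}
```

The check works on the capability set as a whole, so it does not pin scheduling to any one profile.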

When Codex startup repeatedly reports WebSocket reconnects or HTTPS fallback, preserve membership, priority, capacity, load factor, and other routing policy until runtime logs identify the failing account and transport. If bounded Sub2API logs show repeated openai.websocket_proxy_failed, openai.websocket_account_select_failed, upstream WS handshake 4xx/5xx, or repeated close-before-response.completed for the only WS-capable account, remove that account from the WSv2 capability set in YAML; if the resulting capability set is empty, also turn off the localCodex WS feature flags. Then run codex-pool sync --confirm, codex-pool validate, and prove the result with a Codex smoke that no longer emits reconnects.

Do not encode current availability assumptions in long-term reference prose. If an account needs a higher concurrency or load factor than the pool default, make that a deliberate YAML override and verify it with codex-pool validate; the reference document should describe the rule, not repeat the current numeric value.

Do not enable Sub2API pool_mode for UniDesk-managed Codex accounts. pool_mode retries the same selected account path, while UniDesk's desired failover behavior is to mark the failing account temporarily unschedulable and let Sub2API choose another account from the group. codex-pool validate reports each managed account's temporary-unschedulable runtime alignment and should be used after codex-pool sync --confirm. Generic 502/503/504 bodies such as Recovered upstream error 502, Bad Gateway, Gateway Timeout, Codex-facing Upstream request failed, Unknown error, context-deadline/canceled wrappers, stable 400 invalid_encrypted_content / unsupported-model wrappers, and stable model_not_found / "no available channel for model" wrappers must stay in the YAML cooldown policy so an intermittently bad account is cooled down instead of repeatedly adding latency at the next compact or Responses request. The Codex pool default error cooldown is severity-tiered: temporary signals can start at ten minutes, gateway/service/overload/model-routing failures should cool down longer, and credential, permission, quota, account-compatibility, or account-state failures should use the longest cooldown. Exact current values belong in YAML and runtime validation output.

Sub2API temporary-unschedulable rules require both an HTTP status match and a response-body keyword match in the upstream failure/error path. Do not treat them as a general successful-response content filter. If an upstream returns a quota warning or maintenance prompt as normal HTTP 200 assistant content, do not add a YAML 200 cooldown rule, patch Sub2API in place, fork behavior in UniDesk, or bypass codex-pool sync to make the pool pretend that account cooling exists. Record the upstream capability gap in an issue when it matters operationally; until upstream Sub2API supports that behavior and codex-pool validate proves it, UniDesk should not implement or rely on it.
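To make the "status and body keyword" requirement concrete, here is a minimal sketch of how such a rule could be evaluated. The actual matching happens inside Sub2API; the `TempUnschedulableRule` field names, the case-insensitive substring comparison, and the any-keyword semantics are all assumptions made for illustration.

```typescript
// Hypothetical rule shape: one HTTP status plus a list of body keywords.
interface TempUnschedulableRule {
  status: number;
  keywords: string[];
}

// A rule fires only when BOTH the status and at least one keyword match.
function matchesRule(rule: TempUnschedulableRule, status: number, body: string): boolean {
  if (status !== rule.status) {
    return false; // a keyword in a 200 body never triggers a 429 rule
  }
  const haystack = body.toLowerCase();
  // An empty keyword list matches nothing, so status alone is never enough.
  return rule.keywords.some((keyword) => haystack.includes(keyword.toLowerCase()));
}
```

Under these assumptions, a rule for status 429 with keyword weekly-limit matches a 429 error body containing that phrase, but not the same phrase delivered as normal HTTP 200 assistant content.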

If automatic cooling or same-request failover does not happen for an error that the YAML policy declares, treat that as a Sub2API capability or integration defect. The closeout must show the failing account being marked temporarily unschedulable and the next request or same request selecting another schedulable account; a manually disabled, deleted, or pruned account is not valid evidence for this class of fix.

The request path is:

  1. A client sends an OpenAI-compatible request to the configured consumer base URL, normally https://sub2api.74-48-78-17.nip.io/v1/..., with the unified API key.
  2. master frps forwards the TCP connection to platform-infra/sub2api-frpc when publicExposure.enabled is true.
  3. sub2api-frpc forwards to sub2api.platform-infra.svc.cluster.local:8080.
  4. Sub2API validates the unified key and resolves its group_id.
  5. Accounts listed in profiles.entries are bound to the same group via group_ids, so Sub2API dispatches through that group using its own account selection semantics.

Adding or removing accounts, exposing the pool, validating it, and configuring local Codex consumers are daily operations covered by $unidesk-sub2api. The development rule is that ordinary pool membership changes stay YAML-only and do not add code or CI/CD. Code changes are only appropriate when UniDesk needs to render or validate a Sub2API capability that already exists upstream, such as account-level WebSocket mode or per-account upstream User-Agent. If Sub2API itself does not support a desired behavior, do not magic-patch it through UniDesk scripts, Kubernetes hotfixes, local forks, or hidden compatibility paths; either leave the behavior unsupported or pursue it upstream as an explicit Sub2API feature.

codex-pool sync --confirm and codex-pool validate are runtime operations that may need more than one SSH short-connection window because they log in to Sub2API, reconcile accounts, inspect recent logs, and run gateway smoke requests. The formal entry remains the UniDesk CLI, which must use a submit-and-short-poll control shape or an equivalent remote job wrapper instead of one long trans G14:k3s script call. If these commands fail with UNIDESK_SSH_RUNTIME_TIMEOUT while the remote operation may still be running, treat it as a control-plane visibility gap first: improve or use the CLI's job/poll path, then rerun sync or validate. Do not replace it with raw kubectl, manual Sub2API admin API patches, repeated blind full loops, or Sub2API source modifications.
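The submit-and-short-poll control shape described above can be sketched as follows. The `RemoteJobTransport` interface, `JobState`, and `runWithShortPolls` are hypothetical names for illustration and are not the UniDesk CLI's real job API; the sketch only shows the shape in which every remote call is short and the long-running work continues on the remote side.

```typescript
// Hypothetical job state returned by one short poll.
type JobState =
  | { done: false }
  | { done: true; exitCode: number; outputTail: string };

type FinishedJob = Extract<JobState, { done: true }>;

// Hypothetical transport: each method is one short remote call.
interface RemoteJobTransport {
  submit(command: string): Promise<string>; // starts the job, returns a job id
  poll(jobId: string): Promise<JobState>; // cheap status check
}

// Submits once, then polls in short calls until the job finishes.
async function runWithShortPolls(
  transport: RemoteJobTransport,
  command: string,
  maxPolls: number,
  delayMs: number,
): Promise<FinishedJob> {
  const jobId = await transport.submit(command);
  for (let attempt = 0; attempt < maxPolls; attempt++) {
    const state = await transport.poll(jobId);
    if (state.done) {
      return state; // final structured result, usable as closeout evidence
    }
    await new Promise<void>((resolve) => setTimeout(resolve, delayMs));
  }
  // The job may still be running remotely; report that instead of resubmitting.
  throw new Error(`job ${jobId} still running after ${maxPolls} polls`);
}
```

Because the job id outlives any single connection, a transport timeout becomes a reason to poll again rather than negative evidence about the remote operation.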

After codex-pool configure-local --confirm, the default ~/.codex/config.toml / auth.json pair must remain the unified Sub2API consumer and must not be reused as an upstream account profile. Keep every upstream source profile in suffixed files such as config.toml.<profile> / auth.json.<profile> and register it through YAML profiles.entries.

Public FRP Boundary

When publicExposure.enabled is true, the same FRP TCP bridge exposes both OpenAI-compatible API paths and the built-in Sub2API management frontend. The management UI is reachable at the configured publicExposure.publicBaseUrl and its /login route; do not allocate a second public port unless a separate YAML-controlled exposure decision exists.

The public management UI is an operations endpoint. Keep Sub2API itself in platform-infra, keep the Kubernetes Service as ClusterIP, and treat FRP as the only public bridge unless a later decision explicitly changes the exposure model.

Availability And Probes

Kubernetes readiness is not the same as pool availability:

  • The Sub2API app, PostgreSQL, and Redis manifests include container-level health probes. These only prove the pods and local dependencies are healthy enough for Kubernetes scheduling.
  • The FRP client deployment is currently a simple connector deployment and does not itself prove that master-local traffic reaches Sub2API.
  • No scheduled CronJob, ServiceMonitor, or PodMonitor currently proves the full unified Codex API path.
  • platform-infra sub2api validate and platform-infra sub2api codex-pool validate are on-demand checks. Operational usage is documented in $unidesk-sub2api; they are acceptable for deployment closeout, but they are not continuous monitoring. codex-pool validate must test both GET /v1/models and a small POST /v1/responses request, and the Responses smoke should report request id, selected/final account evidence, upstream failover count, and whether the validation succeeded only after failover. It should also summarize recent /responses and /responses/compact gateway failures separately so ordinary long streaming failures are not hidden behind compact-only evidence.
  • Because codex-pool validate includes account alignment, recent-log inspection, and gateway smoke, timeout of the CLI transport is not valid negative evidence about Sub2API scheduling by itself. Closeout evidence must come from the final structured validation result or from an explicitly reported remote job failure with stdout/stderr tail, not from a single low-level trans timeout.

When an automatic availability probe is added, it should be YAML-controlled and cover these layers without printing secrets:

  1. G14 in-cluster GET /v1/models through sub2api.platform-infra.svc.cluster.local:8080 with the unified key.
  2. master-local GET /v1/models through the configured FRP endpoint when public exposure is enabled.
  3. A tiny POST /v1/responses call through the same consumer URL for true OpenAI-compatible request validation.
  4. Optional per-upstream account probes if Sub2API exposes a safe account selection or admin-health mechanism; otherwise document that group-level success does not prove every upstream account is healthy.
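One way to derive the first three probe layers from YAML-controlled values is sketched below. `ProbeConfig`, `ProbeStep`, `buildProbePlan`, and `inClusterBaseUrl` are hypothetical; the other inputs correspond to publicExposure.enabled, publicExposure.publicBaseUrl, and localCodex.responsesSmokeModel. The sketch assumes that the POST /v1/responses layer runs through the public endpoint and is therefore skipped when exposure is disabled, and that "ping" is an acceptable smoke input.

```typescript
// Hypothetical probe inputs, all expected to come from YAML.
interface ProbeConfig {
  inClusterBaseUrl: string; // hypothetical field for the ClusterIP service URL
  publicExposureEnabled: boolean;
  publicBaseUrl: string;
  responsesSmokeModel: string;
}

// A planned request; it deliberately carries no API key.
interface ProbeStep {
  layer: string;
  method: "GET" | "POST";
  url: string;
  body?: string;
}

// Joins a base URL and a path without producing a double slash.
function joinUrl(base: string, path: string): string {
  return base.replace(/\/+$/, "") + path;
}

function buildProbePlan(cfg: ProbeConfig): ProbeStep[] {
  const steps: ProbeStep[] = [
    { layer: "in-cluster models", method: "GET", url: joinUrl(cfg.inClusterBaseUrl, "/v1/models") },
  ];
  if (cfg.publicExposureEnabled) {
    steps.push({
      layer: "master-local models",
      method: "GET",
      url: joinUrl(cfg.publicBaseUrl, "/v1/models"),
    });
    steps.push({
      layer: "responses smoke",
      method: "POST",
      url: joinUrl(cfg.publicBaseUrl, "/v1/responses"),
      body: JSON.stringify({ model: cfg.responsesSmokeModel, input: "ping" }),
    });
  }
  return steps;
}
```

Keeping the unified key out of the plan means the plan itself can be printed or logged; whatever executes the steps would attach the key at send time.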

Until continuous probing exists, closeout comments must state that validation was on-demand and include the exact CLI/API entrypoints used.

k3s Network Policy Requirements

G14 k3s runs kube-router as its network policy controller. When any NetworkPolicy resource exists in a namespace, kube-router replaces its default allow-all behavior with explicit iptables/ipset rules that only permit traffic matching declared policies. If a namespace has NetworkPolicy resources but the generated iptables rules miss or incorrectly evaluate a traffic path, pods in that namespace will experience silent connection timeouts (REJECT with icmp-port-unreachable) even though kubectl get networkpolicy shows the policy and DNS/service resolution works.

The platform-infra namespace must have a NetworkPolicy named allow-all (or equivalent) that explicitly permits all ingress and egress within the namespace. Without it, kube-router's default-deny iptables chains block cross-pod traffic including Sub2API → PostgreSQL and Sub2API → Redis connections, causing Sub2API init containers and background services to hang with context deadline exceeded or no response errors.

Diagnostic symptoms:

  • Sub2API pod stuck in Init:0/2, with wait-postgres logging sub2api-postgres:5432 - no response indefinitely
  • pg_isready succeeds inside the postgres pod itself but TCP from any other pod times out
  • nc -zv to the postgres ClusterIP or pod IP, run through kubectl exec from a different pod, returns Operation timed out
  • iptables -L KUBE-ROUTER-INPUT -n | grep <namespace> shows per-pod FW chains; the chain ends with REJECT ... mark match ! 0x10000/0x10000

If kube-router iptables rules become stale after a NetworkPolicy create/update cycle (e.g., ipset references old pod IPs or mark-bit logic fails to match), the fastest recovery is: iptables -I FORWARD 1 -s 10.42.0.0/16 -d 10.42.0.0/16 -j ACCEPT as a temporary bypass, then recreate the NetworkPolicy or restart kube-router/k3s to force a full iptables sync. After recovery, remove the temporary rule: iptables -D FORWARD -s 10.42.0.0/16 -d 10.42.0.0/16 -j ACCEPT.

The manifest for the required allow-all policy is:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-all
  namespace: platform-infra
spec:
  podSelector: {}
  policyTypes:
  - Ingress
  - Egress
  ingress:
  - {}
  egress:
  - {}

This policy must be included in the sub2api plan / apply manifest rendering so that it is created as part of the normal deployment flow, not maintained as a manual one-off.

platform-infra sub2api status must report whether NetworkPolicy/allow-all exists and still has podSelector: {}, policyTypes: [Ingress, Egress], ingress: [{}], and egress: [{}]. platform-infra sub2api validate must also run temporary in-namespace probe pods that connect to sub2api-postgres:5432 and sub2api-redis:6379; local pg_isready inside the PostgreSQL pod alone is insufficient because it does not exercise kube-router cross-pod policy evaluation.
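The shape check that status is expected to report could be sketched as below. It assumes the NetworkPolicy spec has already been fetched and parsed into a plain object; `NetworkPolicySpec` is a trimmed, hypothetical view of the Kubernetes object, and `checkAllowAll` is an illustrative name, not the CLI's real function.

```typescript
// Trimmed view of a NetworkPolicy spec; only the fields checked here.
interface NetworkPolicySpec {
  podSelector?: Record<string, unknown>;
  policyTypes?: string[];
  ingress?: unknown[];
  egress?: unknown[];
}

function isEmptyObject(value: unknown): boolean {
  return (
    typeof value === "object" &&
    value !== null &&
    !Array.isArray(value) &&
    Object.keys(value).length === 0
  );
}

// True only for exactly one empty rule, i.e. [{}].
function isSingleEmptyRule(rules: unknown[] | undefined): boolean {
  return rules !== undefined && rules.length === 1 && isEmptyObject(rules[0]);
}

// Returns a list of deviations from the required allow-all shape.
function checkAllowAll(spec: NetworkPolicySpec): string[] {
  const problems: string[] = [];
  if (!isEmptyObject(spec.podSelector)) {
    problems.push("podSelector must be {}");
  }
  // Order-insensitive comparison of the two required policy types.
  const types = [...(spec.policyTypes ?? [])].sort();
  if (types.join(",") !== "Egress,Ingress") {
    problems.push("policyTypes must be [Ingress, Egress]");
  }
  if (!isSingleEmptyRule(spec.ingress)) {
    problems.push("ingress must be [{}]");
  }
  if (!isSingleEmptyRule(spec.egress)) {
    problems.push("egress must be [{}]");
  }
  return problems;
}
```

A passing shape check says only that the declared policy is correct; the in-namespace probe pods are still needed to show that kube-router actually enforces it that way.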