Codex ceb3fb4627 docs: converge trans shell examples

2026-06-15 05:25:59 +00:00

Platform Infra

platform-infra is the k3s namespace for UniDesk-operated shared platform services. Runtime placement is service-specific and YAML-selected. For Sub2API, D601 is the active externally backed target and G14 is a predeployed standby target scaled to zero; other platform services may still declare G14 as their active runtime in their own YAML. The platform-infra namespace is separate from HWLAB runtime lanes, AgentRun lanes, D601 user services, and legacy devops-infra control-plane helpers. New shared infra should land here first; old devops-infra resources should migrate gradually, and only when a concrete owner and validation path exist.

Source Of Truth

UniDesk-owned platform configuration must be YAML-first. config/platform-infra/*.yaml is the durable source for images, versions, endpoints, FRP exposure, account profile selection, and local consumer configuration.
Runtime Secrets and local ~/.codex/config.toml* / auth.json* files are inputs or generated local state, not committed truth. CLI output may show Secret paths, byte counts, fingerprints, and short previews only; it must not print complete API keys.
Code that reads platform YAML must validate object shape, field types, required fields, Kubernetes names, image strings, and ports before mutating G14 k3s or local consumer files.
Do not hide image versions, namespace names, endpoint URLs, FRP ports, or profile lists in Python/TOML/JSON helper constants when they are UniDesk-owned choices. External tools may still require their own TOML/JSON/env file formats at the edge.
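The validate-before-mutate rule above can be sketched as a small TypeScript check over an already-parsed YAML object. This is only an illustration: the field names (`namespace`, `image`, `port`) and the helpers `isPinnedImage` and `validateServiceConfig` are hypothetical and do not describe the real schema of config/platform-infra/*.yaml or the real CLI code.

```typescript
// Hypothetical field names; the real platform YAML defines its own schema.
// RFC 1123 DNS label, the format Kubernetes requires for namespace names.
const DNS_LABEL = /^[a-z0-9]([-a-z0-9]{0,61}[a-z0-9])?$/;

// Illustrative rule: the image must carry an explicit tag or digest so the
// version choice stays visible in YAML instead of defaulting to "latest".
function isPinnedImage(image: string): boolean {
  if (image.length === 0 || /\s/.test(image)) return false;
  const lastSegment = image.slice(image.lastIndexOf("/") + 1);
  const sep = lastSegment.search(/[:@]/);
  return sep > 0 && sep < lastSegment.length - 1;
}

// Returns a list of problems; an empty list means the object may be used.
function validateServiceConfig(raw: unknown): string[] {
  if (typeof raw !== "object" || raw === null || Array.isArray(raw)) {
    return ["config must be a mapping"];
  }
  const cfg = raw as Record<string, unknown>;
  const errors: string[] = [];

  const namespace = cfg["namespace"];
  if (typeof namespace !== "string" || !DNS_LABEL.test(namespace)) {
    errors.push("namespace must be a valid Kubernetes DNS label");
  }
  const image = cfg["image"];
  if (typeof image !== "string" || !isPinnedImage(image)) {
    errors.push("image must be a string with an explicit tag or digest");
  }
  const port = cfg["port"];
  if (typeof port !== "number" || !Number.isInteger(port) || port < 1 || port > 65535) {
    errors.push("port must be an integer between 1 and 65535");
  }
  return errors;
}
```

A caller would refuse to touch k3s or local consumer files whenever the returned list is non-empty, so a malformed YAML edit fails before any mutation happens.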

Secret Distribution Boundary

UniDesk-owned platform service credential distribution must be YAML-controlled: declare the sourceRef, source key, target Secret, and target key first, then use the controlled CLI to sync/apply it. Runtime Kubernetes Secrets, pod env, logs, and database state are observation surfaces, not credential source of truth.
config/secrets-distribution.yaml is the current shared distribution map and the canonical entrypoint is bun scripts/cli.ts secrets plan|sync|status --config config/secrets-distribution.yaml --scope platform-infra.
The YAML maps local secret source files under the declared sources.root to target Kubernetes Secret names and keys. It is the source of authority for LangBot/n8n runtime Secret handoff and the pattern for future platform services; do not reverse-engineer passwords, API keys, JWT/encryption keys, database passwords or DATABASE_URL values from live pods or existing Kubernetes Secrets.
secrets plan is read-only and may show sourceRef paths, required key names, generated-key intent, target Secret names, target keys, presence, missing keys and fingerprints. secrets sync --confirm may create missing local generated keys only when YAML explicitly allows createIfMissing; database passwords exported by platform-db postgres are not regenerated here. secrets status verifies live Secret key presence without decoding values.
CLI output for Secret distribution may disclose key names, object names, sourceRef names, byte/count-style metadata and fingerprints only. It must not print base64 payloads, decoded values, full DATABASE_URL, API keys, JWT secrets, encryption keys, database passwords, copy-pastable credential mutation commands or remote raw transcripts.
Service-specific platform-infra <service> apply commands may read the declared local sourceRef files to render/apply runtime Secrets, but they must not infer missing values from the current runtime. If required local source keys are absent, the durable fix is the owning YAML/sourceRef/Secret generation entrypoint followed by secrets sync or the service apply path, not a runtime reverse lookup.
When a runtime Secret already contains a value that is missing locally, treat that as drift to resolve through declared source authority. Do not decode it for local repair, do not copy it into YAML or env files, and do not make live Secret contents the bootstrap source for a new service.
If a platform CLI, service error, log, issue, trace, or terminal transcript exposes a credential value, treat that credential as compromised. Rotate it from the declared YAML/sourceRef authority, push it through secrets sync and the relevant service apply/bootstrap entrypoint, then revoke stale service-side API keys or tokens without printing old or new values.
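The output boundary described in this section (key names, presence, byte counts, and fingerprints, but never values) could look roughly like the following TypeScript. The `SecretSummary` shape, the `summarizeSecret` helper, and the truncated `sha256:` fingerprint format are assumptions made for illustration, not the format the real `secrets plan|sync|status` commands emit.

```typescript
import { createHash } from "node:crypto";

// Hypothetical summary shape: metadata only, never the secret value itself.
interface SecretSummary {
  key: string;
  present: boolean;
  bytes: number;
  fingerprint: string | null;
}

// Summarise a secret without returning, logging, or previewing its value.
function summarizeSecret(key: string, value: string | undefined): SecretSummary {
  if (value === undefined || value.length === 0) {
    return { key, present: false, bytes: 0, fingerprint: null };
  }
  const bytes = new TextEncoder().encode(value).length;
  // A truncated SHA-256 is enough to compare local and live values for drift
  // without making the value recoverable from CLI output.
  const digest = createHash("sha256").update(value, "utf8").digest("hex");
  return { key, present: true, bytes, fingerprint: `sha256:${digest.slice(0, 12)}` };
}
```

Comparing the fingerprint of a local source key with the fingerprint reported for the live Secret is one way to detect drift without decoding either side.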

Sub2API Deployment Boundary

Sub2API is a platform service operated by UniDesk in namespace platform-infra. It is not a HWLAB lane workload, AgentRun workload, D601 user service, or master server daemon.
The canonical deployment entrypoint is bun scripts/cli.ts platform-infra sub2api plan|apply|status|validate|codex-pool. Runtime targets are selected with --target; the Sub2API active target is the target whose YAML role/database mode enables active replicas, currently D601, and G14 is kept as a standby predeploy. Daily operation procedures live in $unidesk-sub2api at .agents/skills/unidesk-sub2api/SKILL.md. This reference keeps only development boundaries and project-specific source-of-truth rules.
Raw kubectl through trans <target>:k3s is only for bounded diagnosis and evidence, not a formal mutate path.
The image version is controlled by config/platform-infra/sub2api.yaml. Image update procedures are daily operations owned by $unidesk-sub2api; the development boundary is that image choices remain YAML-controlled.
Sub2API should stay ClusterIP-only by default. Do not add Ingress, NodePort, LoadBalancer, or broad FRP exposure unless a YAML-controlled public exposure decision exists.
Sub2API currently has no resource limits by design. Do not add CPU or memory limits unless a later explicit decision changes that policy and stores the new policy in YAML.
Master server is a consumer/control host, not the runtime location. Do not deploy Sub2API, PostgreSQL, Redis, or heavy validation loops on master server.
Sub2API active/standby placement is selected by YAML, not by ad hoc runtime patches. A standby target must render without a local PostgreSQL StatefulSet, keep the Sub2API app and local Redis cache scaled to zero, use only ephemeral Redis storage if Redis is later activated, and omit public FRP, HTTPS egress proxy, and account sentinel resources unless YAML explicitly promotes that target. An externally backed active target connects directly to the YAML-declared external PostgreSQL endpoint with sslmode=require, keeps durable app state outside the k3s node, and uses local Redis only as ephemeral cache. Promotion or failback must be applied by editing config/platform-infra/sub2api.yaml and running the same platform-infra sub2api --target <id> CLI path.
External platform PostgreSQL endpoints for Sub2API are produced by the platform DB YAML and its platform-db postgres CLI. Cross-node Sub2API consumers connect directly to that endpoint; the master server is not a PostgreSQL data-plane relay. DNS aliases are optional when the exported DATABASE_URL uses a reachable IP with sslmode=require; current PK01-specific rules live in docs/reference/pk01.md.
Sub2API account sentinel, public exposure, and HTTPS egress proxy are target-scoped YAML decisions. The active target may run them when YAML enables them; the standby G14 target must stay deployed but inactive until YAML promotion. Do not create a second sentinel, FRP client, public management surface, or edge proxy by hand; enable or move those resources only through the target YAML and the platform-infra sub2api / codex-pool --target CLI paths.
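The standby and externally backed active rules above can be expressed as invariants checked against whatever a render step produces. The sketch below assumes a hypothetical `RenderedTarget` summary; none of these field names come from config/platform-infra/sub2api.yaml, and the case where YAML explicitly promotes a standby target is deliberately not modeled.

```typescript
// Hypothetical summary of the resources rendered for one --target.
interface RenderedTarget {
  role: "standby" | "external-active";
  appReplicas: number;
  redisReplicas: number;
  hasLocalPostgres: boolean;
  redisStorage: "ephemeral" | "persistent";
  publicExposure: boolean;
  egressProxy: boolean;
  accountSentinel: boolean;
  databaseSslMode?: string; // only the sslmode, never the full DATABASE_URL
}

// Returns the violated rules; an empty list means the render is acceptable.
function checkTargetInvariants(t: RenderedTarget): string[] {
  const violations: string[] = [];
  if (t.hasLocalPostgres) violations.push("must not render a local PostgreSQL StatefulSet");
  if (t.redisStorage !== "ephemeral") violations.push("Redis may only use ephemeral storage");

  if (t.role === "standby") {
    if (t.appReplicas !== 0) violations.push("standby Sub2API app must be scaled to zero");
    if (t.redisReplicas !== 0) violations.push("standby Redis must be scaled to zero");
    if (t.publicExposure) violations.push("standby must omit public FRP exposure");
    if (t.egressProxy) violations.push("standby must omit the HTTPS egress proxy");
    if (t.accountSentinel) violations.push("standby must omit the account sentinel");
  } else if (t.databaseSslMode !== "require") {
    violations.push("active target must use external PostgreSQL with sslmode=require");
  }
  return violations;
}
```

Running a check like this inside `plan` would surface a bad promotion or failback edit before `apply` changes anything.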

LangBot Deployment Boundary

LangBot is a UniDesk-operated public platform service in namespace platform-infra. The canonical entrypoint is bun scripts/cli.ts platform-infra langbot plan|apply|status|logs|validate|bootstrap-api-key|query; G14 is the default runtime target.
LangBot configuration is YAML-first in config/platform-infra/langbot.yaml. Image tag, target namespace, PVCs, PK01 Caddy/FRP exposure, API key seed source, and official WeChat adapter metadata must stay in YAML rather than helper constants or manual runtime patches.
LangBot runtime Secret handoff uses config/secrets-distribution.yaml and bun scripts/cli.ts secrets ... --scope platform-infra. platform-infra langbot apply must not create hidden passwords or reverse-read live Kubernetes Secret values to fill missing local source keys.
LangBot uses the existing PK01 host-native PostgreSQL instance through config/platform-db/postgres-pk01.yaml and platform-db postgres. Adding LangBot state means adding a dedicated database and role inside that existing instance; do not deploy a second PostgreSQL StatefulSet, container, or external DB instance for LangBot.
Public exposure uses PK01 Caddy plus FRP to the G14 ClusterIP service. Do not add Kubernetes Ingress, NodePort, LoadBalancer, host networking, or host ports for LangBot unless a later YAML-controlled platform decision changes the exposure model.
LangBot's built-in Web frontend and API share the same public HTTPS origin. CLI queries must use the YAML-declared API key source and must report key names/fingerprints only, never the API key value.
bootstrap-api-key writes the YAML-declared key into LangBot's api_keys table after the app has initialized its schema. If the table is absent, start LangBot first and let its migrations run; do not create a parallel auth table or print the key while seeding it.
LangBot startup logs may include upstream env override values. platform-infra langbot logs must redact env keys containing PASSWORD, SECRET, TOKEN, API_KEY, or DATABASE_URL; any leaked DB password, JWT secret, or API key must be rotated through YAML/Secret sources and rolled out through the controlled apply path.
LangBot Secret material changes must update the app Deployment template with a Secret fingerprint annotation so apply rolls the Pod. Manual Pod deletion is only a temporary recovery action, not the durable rotation mechanism.
Closeout for public LangBot changes requires platform-infra langbot status, platform-infra langbot validate, and an API-key-backed platform-infra langbot query; frontend exposure is proved by the same public origin returning the built-in Web UI.
LangBot Box is disabled by default for the public service because the official Box deployment needs Docker socket access. Enabling Box requires a separate explicit platform decision and YAML-controlled security boundary.
Official WeChat support is through LangBot's official platform adapters such as officialaccount, wecom, and wecomcs; real AppID, token, EncodingAESKey and channel credentials are bound in LangBot after deployment. Personal WeChat or OpenClaw-style adapters are not part of the default public-service boundary.
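The log redaction rule for platform-infra langbot logs described above can be sketched as a line filter. This assumes env overrides appear in logs as whitespace-delimited `KEY=value` pairs, which is an assumption about the upstream log format; the `redactLogLine` helper and the `[REDACTED]` placeholder are illustrative names only.

```typescript
// Env key fragments that mark a value as sensitive, per the rule above.
const SENSITIVE_MARKERS = ["PASSWORD", "SECRET", "TOKEN", "API_KEY", "DATABASE_URL"];

// Assumes KEY=value pairs without spaces in the value (hypothetical format).
const ENV_PAIR = /\b([A-Za-z_][A-Za-z0-9_]*)=(\S+)/g;

// Replace the value of any env pair whose key contains a sensitive marker.
function redactLogLine(line: string): string {
  return line.replace(ENV_PAIR, (match: string, key: string) => {
    const upper = key.toUpperCase();
    const sensitive = SENSITIVE_MARKERS.some((marker) => upper.includes(marker));
    return sensitive ? `${key}=[REDACTED]` : match;
  });
}
```

Redaction of this kind limits exposure in CLI output only; a value that already reached a log is still treated as leaked and rotated through the YAML/Secret sources.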

n8n Workflow Boundary

n8n is the UniDesk-operated workflow/automation layer for LangBot and platform service integration. It is a workflow bridge for webhook orchestration, service calls, manual approval flows and external integrations; it does not replace LangBot or become the chat runtime.
The canonical entrypoint is bun scripts/cli.ts platform-infra n8n plan|apply|status|logs|validate; G14 is the default runtime target and config/platform-infra/n8n.yaml is the YAML source of truth.
n8n runtime Secret handoff uses config/secrets-distribution.yaml and bun scripts/cli.ts secrets ... --scope platform-infra. platform-infra n8n apply must not create hidden encryption keys or reverse-read live Kubernetes Secret values to fill missing local source keys.
n8n uses the existing Pika01/PK01 host-native PostgreSQL instance through config/platform-db/postgres-pk01.yaml and platform-db postgres. Adding n8n state means adding a dedicated n8n database and role inside that single external PostgreSQL instance; do not deploy an in-cluster PostgreSQL StatefulSet, a second PostgreSQL instance, or long-term SQLite state for n8n.
Public exposure uses PK01 Caddy plus FRP to the G14 ClusterIP service at https://n8n.pikapython.com. Do not add Kubernetes Ingress, NodePort, LoadBalancer, host networking, or host ports for n8n unless a later YAML-controlled platform decision changes the exposure model.
n8n reverse-proxy and webhook settings such as public base URL, WEBHOOK_URL, proxy hop trust and PostgreSQL connection fields must be rendered from YAML. Secret output may show key names, presence and fingerprints only; it must not print the database password, N8N_ENCRYPTION_KEY, or full DATABASE_URL.
Closeout for public n8n changes requires platform-infra n8n status and platform-infra n8n validate --full, proving both in-cluster HTTP and public HTTPS. Actual LangBot workflows, credentials and business automations are separate follow-up scope after the base n8n service is healthy.

WeChat Archive Workflow Boundary

WeChat-to-Baidu archive automation is a shared platform workflow, not a separate service-specific fork. Its durable source of truth is config/platform-infra/wechat-archive.yaml; the canonical entrypoint is bun scripts/cli.ts platform-infra wechat-archive plan|apply|status|validate|pull.
The workflow composes the existing LangBot public service, existing n8n public service, and the private baidu-netdisk microservice. LangBot remains the chat ingress, n8n owns webhook normalization/orchestration, and Baidu upload/download is performed through backend-core microservice proxy so Baidu OAuth tokens are never exposed in G14 or CLI output.
Text and image archive policy, remote path templates, staging roots, webhook path, timeout and validation fixtures must stay in YAML. CLI code may validate the YAML shape and render n8n workflow JSON, but it must not hard-code current path roots, credentials, message channel IDs, or Baidu account choices outside YAML/service runtime.
The archive callback token is controlled by archiveCallback.secretRoot, archiveCallback.tokenSourceRef, and archiveCallback.tokenKey in YAML plus config/secrets-distribution.yaml. secrets sync may create the local source when YAML explicitly allows it; n8n receives the token only through controlled workflow rendering. Do not recover this token from the n8n database, frontend runtime, Baidu runtime, pod env, or logs.
For the current n8n runtime, production webhook reachability uses the registered path shape workflowId/nodeName/webhookPath; workflow node names used in generated webhooks should be ASCII path-safe, and webhookPath in YAML should remain one relative path segment.
Generated n8n workflows should use n8n-native HTTP Request nodes for outbound service callbacks. Code nodes may normalize payloads, but must not assume sandbox globals such as fetch exist in the runtime.
Personal WeChat ingestion must be read-only. The durable shape is a YAML-declared LangBot inbound webhook that mirrors messages to the archive workflow and returns skip_pipeline=true; the OpenClaw/LangBot bot must also have discard routing as fallback so webhook failure does not produce an automated reply. Do not connect personal WeChat through a normal reply pipeline, do not enable send-message surfaces for this purpose, and do not treat a successful archive upload as permission to reply.
D601 personal WeChat ingestion is a YAML-declared upstream of the same archive workflow. config/platform-infra/wechat-archive.yaml owns the Windows host route, isolated PC WeChat version pin, WeChatFerry release pin, RPC ports, Windows user-session supervisor, firewall boundary, D601 k3s collector runtime and read-only method allowlist. The Windows PC WeChat process and WeChatFerry SDK/RPC host must run in the same Windows user session; the collector/client must run in the existing D601 platform-infra namespace with createNamespace=false, not in a newly created namespace.
WeChatFerry compatibility is part of the upstream contract, not something UniDesk should bypass. If the YAML-pinned PC WeChat version can reach QR login but the WeChat service rejects login as too old, classify the personal WeChat upstream as blocked by version compatibility. Preserve prepared Windows artifacts and collector Kubernetes objects for later reuse, but pause the collector by changing the YAML-declared replica count to zero and re-running the controlled platform-infra wechat-archive collector-apply path. Do not keep a CrashLooping collector as the desired state, do not use raw kubectl scale, do not create a new namespace, and do not adopt third-party version-check bypass tools as a durable platform path.
The WeChatFerry raw RPC surface must not be exposed publicly or reused as a general bot API. A collector may call only the YAML allowlisted read operations and must report sendCapability=false; send, friend/group management, database query, timeline, transfer or other outbound/control methods are policy violations. Login state, WeChat profile data, WCF session material and client databases remain runtime state and must not be decoded, printed, copied into YAML, or reconstructed from the running host.
The first D601 WCF-host PoC must use a test or low-risk WeChat account and the YAML-declared observation window before any production account promotion. RDP operations should disconnect instead of logging out so the Windows user-session processes keep running; this is an operational boundary until a controlled Windows supervisor/collector CLI fully owns start, status and validate.
If LangBot or n8n public HTTPS fails while in-cluster service and FRP local-port probes are healthy, restore the PK01 Caddy managed blocks through platform-infra langbot apply --confirm --wait or platform-infra n8n apply --confirm --wait. Do not manually edit Caddy as the durable fix.
The archive uses the same single PK01/Pika01 PostgreSQL instance indirectly through the existing LangBot and n8n databases. Adding this workflow must not create another PostgreSQL instance, in-cluster PostgreSQL StatefulSet, or ad hoc database namespace.
platform-infra-wechat-archive and future similar public workflow CLIs should reuse the common platform-infra operations library for YAML parsing, target selection, workflow sync, private microservice proxy calls, transfer polling, staging path mapping, redaction and bounded output. Service-specific modules should keep only their business mapping and workflow payload rendering.
Closeout for the LangBot/n8n/Baidu workflow requires platform-infra wechat-archive apply --confirm --wait, platform-infra wechat-archive status, platform-infra wechat-archive validate --full, and a platform-infra wechat-archive pull command that retrieves an uploaded file by remote path or fsId and reports local path plus hash. Closeout for the optional D601 personal WeChat upstream additionally requires a supported PC WeChat/WeChatFerry pair that can log in and receive the YAML-required message types; a service-side version rejection is a blocker, not a successful deployment.
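The webhook path constraints in this section (ASCII path-safe node names, a single relative segment for webhookPath, and the registered shape workflowId/nodeName/webhookPath) can be sketched as follows. The helper names and the exact allowed character set are assumptions, and any casing or encoding the n8n runtime itself applies when registering a webhook is not modeled here.

```typescript
// One relative path segment: starts alphanumeric, no slashes, no dot-only
// segments, and only unreserved URL characters (illustrative character set).
const PATH_SEGMENT = /^[A-Za-z0-9][A-Za-z0-9._~-]*$/;

function isPathSafeSegment(segment: string): boolean {
  return PATH_SEGMENT.test(segment);
}

// Builds the registered production path shape described in the text above.
function productionWebhookPath(workflowId: string, nodeName: string, webhookPath: string): string {
  const parts: Array<[string, string]> = [
    ["workflowId", workflowId],
    ["nodeName", nodeName],
    ["webhookPath", webhookPath],
  ];
  for (const [label, value] of parts) {
    if (!isPathSafeSegment(value)) {
      throw new Error(`${label} must be one ASCII path-safe segment`);
    }
  }
  return `${workflowId}/${nodeName}/${webhookPath}`;
}
```

Rejecting unsafe segments at render time keeps the failure in `plan` or `apply` rather than surfacing later as an unreachable production webhook.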

Codex Pool Routing

config/platform-infra/sub2api-codex-pool.yaml controls the Codex-facing OpenAI-compatible pool:

pool.groupName names the Sub2API group that represents the pool.
pool.apiKeySecretName and pool.apiKeySecretKey name the k3s Secret that stores the single consumer API key.
pool.minOwnerConcurrency is optional; when omitted, the CLI automatically uses the sum of all resolved account capacities as the minimum concurrency for the Sub2API user that owns the unified consumer API key. A YAML value is only an explicit override and must still be at least that capacity sum, so the shared key does not fail requests or WS sessions at the user-concurrency layer. "Resolved" means each account's explicit profiles.entries[].capacity or, when omitted, pool.defaultAccountCapacity. Do not compensate for owner-concurrency 1013 errors by pinning capacity to one provider.
pool.defaultTempUnschedulable is the Sub2API built-in request-path temporary-unschedulable switch plus its YAML rule list. When enabled, codex-pool sync --confirm renders temp_unschedulable_enabled and temp_unschedulable_rules into every managed account unless an account-level override says otherwise. This is the generic same-request recovery path for selected-account upstream failures: a matching upstream error briefly cools the selected account so Sub2API's existing failover loop can select another account in the same group.
The built-in temporary-unschedulable configuration and external sentinel.* configuration are separate control surfaces. pool.defaultTempUnschedulable handles near-real-time request-path cooling and failover; sentinel.* handles account-level marker health, quarantine, restore, and probe cadence. Changing one surface must not silently rewrite the other surface's cadence, marker semantics, quarantine state, or rule list.
The external sentinel write surface is intentionally limited to the Sub2API admin schedulable action. Sentinel freeze/restore may set schedulable=false|true, but must not write, clear, or indirectly clear Sub2API request-path runtime state such as temp_unschedulable_until, temp_unschedulable_reason, rate-limit, overload, or model-rate-limit state. In particular, sentinel restore must not call Sub2API recover-state, because that endpoint is a broader runtime-state recovery operation rather than a pure schedulability restore.
Codex accounts selected by YAML do not declare schedulable as durable configuration. codex-pool sync --confirm must not restore existing account schedulability merely because YAML selects the account or sentinel state lacks an active quarantine. Existing schedulable=false is runtime state: the sentinel first reads Sub2API's actual account state, schedules a recovery probe for unschedulable managed accounts, and restores schedulable=true only after the marker probe matches.
codex-pool sync --confirm preserves UniDesk-managed accounts that are absent from YAML by default; explicit upstream retirement requires codex-pool sync --confirm --prune-removed. This keeps account deletion out of the normal availability-recovery path and prevents temporary YAML edits from becoming destructive runtime changes.
profiles.entries selects local Codex profile files from ~/.codex/ and maps them to Sub2API account names.
The unsuffixed master ~/.codex/config.toml and ~/.codex/auth.json are reserved for the unified Sub2API consumer. config.toml must keep the YAML-selected consumer base URL written by codex-pool configure-local --target <active> --confirm, and auth.json must contain the unified pool API key from pool.apiKeySecretName / pool.apiKeySecretKey on that active target. Do not replace these two files with direct upstream account credentials.
Additional upstream accounts must use suffixed local profile files such as config.toml.<profile> and auth.json.<profile>, then be declared through profiles.entries in config/platform-infra/sub2api-codex-pool.yaml.
profiles.entries[].capacity optionally overrides pool.defaultAccountCapacity for one account. Capacity is a YAML-controlled routing input; concrete current values belong only in config/platform-infra/sub2api-codex-pool.yaml and runtime validation output, not in long-term reference prose. Code constants, Secrets, ad-hoc runtime patches, or stale tests must not override YAML source of truth.
profiles.entries[].loadFactor optionally overrides pool.defaultAccountLoadFactor for one account and is rendered to Sub2API load_factor. Treat it as routing policy: values belong in YAML and codex-pool validate output, not code constants, Secrets, or ad-hoc runtime patches.
Do not change account membership, priority, capacity, load factor, WebSocket mode, or other routing policy from inference alone. Unless the user explicitly asks for a configuration change, first preserve the current YAML, collect provenance and runtime evidence, and write the finding to the relevant issue or runbook before proposing a change.
Sub2API is a source-available UniDesk-operated runtime component. For Sub2API scheduling, failover, temporary-unschedulable behavior, error propagation, and account selection, the default investigation path is to read the current Sub2API source implementation and then verify it with real request ids, gateway logs, and original-entry traffic. Do not use mock upstreams, temporary probe accounts, or test stubs as the default proof for Sub2API behavior; those are explicit debug aids only and do not replace source-path review plus runtime evidence.
profiles.entries[].tempUnschedulable may override the pool default for one account. When enabled, the CLI renders it into Sub2API credentials as temp_unschedulable_enabled and temp_unschedulable_rules; when disabled, runtime credentials omit both fields. Use account-level override only for an explicit deviation from the pool policy, not as an availability workaround for a named account.
Codex account-state, quota prompts, model-routing failures, encrypted-content affinity failures, gateway wrappers, and timeout-like upstream errors must be handled by the generic temporary-unschedulable/failover path plus the external marker sentinel. Do not change membership, priority, capacity, load factor, WebSocket mode, pool_mode, or a specific provider's status merely to work around those errors. If a matching upstream failure still logs openai.forward_failed without openai.upstream_failover_switching, the missing fix is in Sub2API's HTTP /responses failover classification/error propagation, not in account pinning.
profiles.entries[].openaiResponsesWebSocketsV2Mode is the account-level Responses WebSocket v2 switch for OpenAI-compatible upstreams that require WebSocket transport. Allowed values are off, ctx_pool, and passthrough; omit the field unless that upstream needs it.
profiles.entries[].upstreamUserAgent is an optional account-level upstream request User-Agent override. Use it only for upstreams that require a Codex CLI compatible User-Agent; keep the value YAML-controlled and newline-free.
manualAccounts.protected declares Sub2API accounts that were created or edited manually and must stay outside UniDesk-managed Codex pool credentials, scheduler policy, and sentinel control. The only allowed reconciliation for such an account is an explicitly declared narrow capability such as proxyBinding, which may align the account's Sub2API proxy_id to the YAML-selected target egress proxy, or groupBinding, which may attach the account to the YAML-selected pool group so the unified consumer key can use it. codex-pool sync --confirm must not rewrite protected account credentials, status, schedulability, priority, capacity, load factor, or sentinel state, and sentinel-probe --account ... must refuse protected manual accounts.
publicExposure in config/platform-infra/sub2api-codex-pool.yaml controls the legacy Codex-pool public bridge from master server to the G14 ClusterIP service and should stay disabled unless that bridge is explicitly reintroduced. Target-level publicExposure in config/platform-infra/sub2api.yaml controls the active public edge such as D601-to-PK01.
publicExposure.masterCaddy.responseHeaderTimeoutSeconds controls the master Caddy response_header_timeout for the public Sub2API site. It must be long enough for Codex /responses/compact requests; otherwise Caddy can return a client-visible 504 before Sub2API finishes the upstream compact request, and that edge timeout is not an account-level upstream failure that Sub2API can use for temporary-unschedulable failover. The numeric value belongs only in config/platform-infra/sub2api-codex-pool.yaml; after changing it, use codex-pool expose --confirm to reload Caddy and verify the rendered response_header_timeout. Requests that were already in flight before the reload may still finish with the previous timeout, so post-change evidence should check only requests that started after the reload.
publicExposure.masterCaddy.edgeRetry controls the master Caddy reverse-proxy retry window for the public Sub2API site. This belongs at the edge because FRP remotePort listener loss, connection refused, EOF, or connection reset can happen before a request reaches Sub2API, so Sub2API account failover and sentinel logic cannot observe or recover that request. Keep retry scope narrow, especially for non-idempotent POST traffic: connection-attempt failures may be retried by the reverse proxy, while round-trip retry after an upstream connection was established should be limited by YAML retryMatch to paths that are safe to repeat, such as compact. Retry durations and intervals belong only in YAML; after changing them, run codex-pool expose --confirm and verify the rendered Caddyfile contains the expected lb_try_duration, lb_try_interval, and lb_retry_match.
localCodex controls how the master server's current ~/.codex consumer files are backed up and rewritten. Keep supportsWebSockets and responsesWebSocketsV2 in the same state, and enable them only when at least one YAML-managed account has a current direct Codex WSv2 smoke that passes. If no upstream profile can sustain Responses WSv2, the honest long-term state is false/false so Codex uses HTTP Responses directly instead of repeatedly reconnecting before response.completed. localCodex.responsesSmokeModel is the YAML-declared model used by codex-pool validate for the lightweight POST /v1/responses smoke.
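The pool.minOwnerConcurrency rule above reduces to a small calculation: resolve each account's capacity, sum the results, and accept a YAML override only if it is at least that sum. The TypeScript below is a sketch of that arithmetic with hypothetical type and function names; the numbers in the usage note are made up and are not current pool values.

```typescript
// Hypothetical, simplified views of the YAML fields named in the text.
interface ProfileEntry {
  name: string;
  capacity?: number; // profiles.entries[].capacity
}
interface PoolConfig {
  defaultAccountCapacity: number; // pool.defaultAccountCapacity
  minOwnerConcurrency?: number; // pool.minOwnerConcurrency (optional override)
}

// "Resolved" capacity: the explicit per-account value, else the pool default.
function resolvedCapacity(pool: PoolConfig, entry: ProfileEntry): number {
  return entry.capacity ?? pool.defaultAccountCapacity;
}

// Owner concurrency is the capacity sum unless YAML overrides it, and an
// override smaller than the sum is rejected rather than silently raised.
function resolveOwnerConcurrency(pool: PoolConfig, entries: ProfileEntry[]): number {
  const capacitySum = entries.reduce((sum, e) => sum + resolvedCapacity(pool, e), 0);
  if (pool.minOwnerConcurrency === undefined) return capacitySum;
  if (pool.minOwnerConcurrency < capacitySum) {
    throw new Error(
      `minOwnerConcurrency ${pool.minOwnerConcurrency} is below the capacity sum ${capacitySum}`,
    );
  }
  return pool.minOwnerConcurrency;
}
```

For example, with an illustrative default capacity of 2 and two accounts, one of which declares capacity 5, the resolved sum is 2 + 5 = 7; an override of 10 is accepted, while an override of 3 is rejected.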

Enable account-level WebSocket v2 only for upstream profiles that have passed a direct Codex WSv2 probe. Treat this as a YAML-declared capability set, not a hard scheduling pin to one profile; if localCodex enables WebSocket transport, codex-pool validate must show at least one current webSocketsV2.schedulableEnabled account, and runtime smoke remains the availability proof. The same validation reports each managed account's runtime WebSocket v2 mode and whether it matches YAML, so stale ctx_pool / passthrough settings cannot silently keep routing Codex WS sessions to an upstream that closes with no available account, fails the WS handshake with 4xx/5xx, or closes before response.completed.
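The WebSocket consistency rules (matching localCodex flags, runtime mode equal to YAML mode, and at least one schedulable WS-capable account when WS is enabled) can be sketched as one check. The types and the `checkWebSocketConsistency` helper are hypothetical; they are not the structure of codex-pool validate output, and an omitted YAML mode is assumed here to mean `off`.

```typescript
type WsMode = "off" | "ctx_pool" | "passthrough";

// Hypothetical views of localCodex flags and per-account validation data.
interface LocalCodexFlags {
  supportsWebSockets: boolean;
  responsesWebSocketsV2: boolean;
}
interface AccountWsState {
  name: string;
  yamlMode: WsMode; // assumed "off" when the YAML field is omitted
  runtimeMode: WsMode;
  schedulable: boolean;
}

function checkWebSocketConsistency(flags: LocalCodexFlags, accounts: AccountWsState[]): string[] {
  const problems: string[] = [];
  if (flags.supportsWebSockets !== flags.responsesWebSocketsV2) {
    problems.push("localCodex WebSocket flags must be in the same state");
  }
  for (const account of accounts) {
    if (account.runtimeMode !== account.yamlMode) {
      problems.push(`${account.name}: runtime mode ${account.runtimeMode} differs from YAML ${account.yamlMode}`);
    }
  }
  const wsEnabled = flags.supportsWebSockets || flags.responsesWebSocketsV2;
  const hasCapable = accounts.some((a) => a.yamlMode !== "off" && a.schedulable);
  if (wsEnabled && !hasCapable) {
    problems.push("WebSocket transport is enabled but no schedulable WSv2 account exists");
  }
  return problems;
}
```

A passing check only shows that the declared state is coherent; as the text says, the runtime smoke remains the availability proof.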

When Codex startup repeatedly reports WebSocket reconnects or HTTPS fallback, preserve membership, priority, capacity, load factor, and other routing policy until runtime logs identify the failing account and transport. If bounded Sub2API logs show repeated openai.websocket_proxy_failed, openai.websocket_account_select_failed, upstream WS handshake 4xx/5xx, or repeated close-before-response.completed for the only WS-capable account, remove that account from the WSv2 capability set in YAML; if the resulting capability set is empty, also turn off the localCodex WS feature flags. Then run codex-pool sync --confirm, codex-pool validate, and prove the result with a Codex smoke that no longer emits reconnects.

Do not encode current availability assumptions in long-term reference prose. If an account needs a higher concurrency or load factor than the pool default, make that a deliberate YAML override and verify it with codex-pool validate; the reference document should describe the rule, not repeat the current numeric value.

Do not enable Sub2API pool_mode for UniDesk-managed Codex accounts. pool_mode retries the same selected account path and does not replace temporary-unschedulable request failover or sentinel quarantine. The current failover and recovery model is: matching request-path errors temporarily cool the selected account and trigger same-group failover, while the external marker-only sentinel freezes or restores account schedulability from direct marker probes.

Sub2API temporary-unschedulable rules require both an HTTP status match and a response-body keyword match in the upstream failure/error path. UniDesk uses these rules as a generic request-path failover trigger, not as a successful-response content classifier. Runtime UI fields such as trigger time, release time, matched keyword, and rule index identify this built-in request-path state and should not be attributed to sentinel unless separate sentinel state shows an active quarantine. HTTP 200 private content, maintenance text, quota prompts, ads, and similar semantic failures remain the external account-level sentinel's job.
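The two-part match described above (HTTP status and response-body keyword must both match) can be sketched as below. The rule shape, the case-insensitive substring comparison, and the `matchTempUnschedulable` helper are assumptions for illustration; Sub2API's actual rule schema and matching details should be taken from its source, and the status and keyword in the usage note are made up.

```typescript
// Hypothetical rule shape; the real temp_unschedulable_rules schema may differ.
interface TempUnschedRule {
  status: number;
  keywords: string[];
}
interface RuleMatch {
  ruleIndex: number;
  matchedKeyword: string;
}

// A rule matches only when BOTH the upstream status and a body keyword match.
function matchTempUnschedulable(
  rules: TempUnschedRule[],
  upstreamStatus: number,
  upstreamBody: string,
): RuleMatch | null {
  const body = upstreamBody.toLowerCase();
  for (let i = 0; i < rules.length; i++) {
    const rule = rules[i];
    if (rule === undefined || rule.status !== upstreamStatus) continue;
    const keyword = rule.keywords.find((k) => k.length > 0 && body.includes(k.toLowerCase()));
    if (keyword !== undefined) return { ruleIndex: i, matchedKeyword: keyword };
  }
  return null;
}
```

With an illustrative rule of status 503 and keyword `overloaded`, a 503 body containing that word matches rule index 0, while the same text in an HTTP 200 response does not match, which is why semantic failures in successful responses stay with the external sentinel.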

The invalid_encrypted_content failure mode is a stable regression guard for Codex pool routing. It means an upstream could not verify or parse encrypted Responses/Codex state carried by the request; a fresh account probe can still pass while a large resumed request fails because the encrypted content is not acceptable to that selected upstream. The required behavior is generic: Sub2API should perform its built-in recoverable handling for encrypted reasoning state when available, mark the selected account temporarily unschedulable when the configured status/keyword rule matches, and continue same-group failover before the client sees a final failure whenever the response has not already been committed. Do not interpret this failure as proof that the pool should pin to only one provider, delete the selected account, change membership/priority/capacity/load factor, or move the error into sentinel-specific provider logic.

For this failure class, the regression evidence must come from the real request path. A valid investigation should connect the client request id to Sub2API gateway logs showing the selected account id, upstream status, account_temp_unschedulable, openai.upstream_failover_switching, and the final access-log status. A sentinel-report row with quarantineActive=false and marker success proves only that the external marker sentinel did not quarantine that account; it does not disprove request-path temporary cooling. Conversely, a marker sentinel recovery must not call recover-state or clear the temporary-unschedulable state created by the failed request. If this failure still reaches the client as 502/503 while another schedulable account is available and no stream bytes were committed, fix Sub2API failover classification/error propagation or the UniDesk sync/render path rather than adding mock probes, provider pinning, or account-specific exceptions.

Sub2API Account Test Semantics

Sub2API v0.1.136 has a separate management-plane account connection test. The admin WebUI account modal calls POST /api/v1/admin/accounts/:id/test with model_id and, for the admin account table modal, no OpenAI mode; the backend binds this to AccountTestService.TestAccountConnection, which normalizes an empty mode to default.

For OpenAI API-key accounts in default mode, the test loads the account by id, applies account.GetMappedModel(model_id), checks openai_compat.ShouldUseResponsesAPI(account.Extra), and then builds an upstream URL from the account base URL with /v1/responses. It sends a direct upstream request through httpUpstream.DoWithTLS with Content-Type: application/json and Authorization: Bearer <account-key>. The request body is Responses API SSE, not a non-streaming JSON request: model is the mapped model, input is one user message whose text is hi, stream is true, and instructions is Sub2API's embedded OpenAI default instructions. For API-key accounts it does not set store: false, max_output_tokens, Codex CLI User-Agent, OpenAI-Beta, Originator, Version, Session_ID, or Conversation_ID; those Codex-like headers appear in other paths such as compact probing, not in the default account test.

The management test success criterion is transport and stream completion, not semantic content. A non-200 upstream response becomes an SSE error. A 200 response is considered successful when processOpenAIStream sees response.completed or response.done; response.output_text.delta chunks are forwarded to the WebUI as display text, while response.failed, error, or EOF before completion fails the test. Therefore a WebUI "hi" success proves that this direct account can complete a streaming /v1/responses request with Sub2API's default payload shape, but it does not prove that a non-streaming Responses request, marker prompt, max_output_tokens, store: false, Codex header set, compact path, WebSocket path, or normal pool-scheduled gateway request will behave identically.
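
The success rule above can be sketched as a classifier over the sequence of stream event types. This is an illustration only: the event names come from the paragraph above, while the function name and the array-of-strings input are hypothetical simplifications of real SSE parsing, not the `processOpenAIStream` implementation.

```typescript
type StreamVerdict = "success" | "failure";

// Walks event types in arrival order and returns the verdict of the first
// terminal event. The end of the array models EOF.
function classifyAccountTestStream(eventTypes: string[]): StreamVerdict {
  for (const type of eventTypes) {
    if (type === "response.completed" || type === "response.done") {
      return "success";
    }
    if (type === "response.failed" || type === "error") {
      return "failure";
    }
    // Other events, such as response.output_text.delta, are display-only
    // and never decide the outcome.
  }
  // Stream ended before any completion event arrived.
  return "failure";
}
```

Note that the verdict never inspects the delta text, which is why a successful test says nothing about the semantic content of the reply.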

This management-plane test is also outside the normal consumer gateway scheduler. It fetches the account by id instead of listing only schedulable accounts, so status=active in the modal and a successful account test can coexist with schedulable=false in scheduler state. Because the test performs its own outbound DoWithTLS call, regular gateway access logs and usage logs may not contain the upstream account id/path/status evidence expected from ordinary /v1/responses traffic. When diagnosing account tests, use the management route semantics above or Sub2API source, not gateway access-log absence or an unrelated pool request as proof.

The management test uses Sub2API's account-level proxy selection, not the Pod environment as a fallback. In Sub2API v0.1.136 the upstream HTTP transport is configured from the account's ProxyID / proxy URL; an account with no proxy binding goes direct even if the Sub2API Pod has HTTP_PROXY or HTTPS_PROXY set. For protected manual accounts that need the target egress path, declare manualAccounts.protected[].proxyBinding in config/platform-infra/sub2api-codex-pool.yaml and reconcile it with codex-pool sync --target <active> --confirm; do not hand-patch the runtime account or infer proxy coverage from Pod env alone.

The management test is also not proof that the unified consumer key can select the account. A protected manual account must be attached to the pool group before ordinary /responses or /v1/responses traffic can use it. When that is intended, declare manualAccounts.protected[].groupBinding.source: pool-group; sync should add the account to the current pool.groupName without making it a YAML-managed profile or sentinel target.

An external account-level sentinel that wants parity with this WebUI path should reuse the same request shape as far as the standard OpenAI SDK allows: direct account credentials, Responses API, stream=true, no store: false for API-key accounts, no upstream max_output_tokens field, and success parsing based on the streaming events. A local stream delta collection limit is acceptable as a sentinel safety bound, but it should not change the upstream request body. The sentinel may replace the user text hi with a marker prompt, but it should not introduce extra request fields or Codex/compact headers merely for convenience. If a marker-only sentinel intentionally diverges from the management test shape, the divergence must be documented in probe output so a WebUI success and sentinel failure are not misread as operator error.
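
As a rough sketch of that parity, a sentinel could build its request body as below. The function and type names are hypothetical, the `input` message layout is an assumption based on the public Responses API format, and `instructions` is a parameter because Sub2API's embedded default text is not reproduced here.

```typescript
// Only the fields described above; store and max_output_tokens are
// intentionally not part of the type.
interface SentinelProbeBody {
  model: string;
  instructions: string;
  input: { role: "user"; content: { type: "input_text"; text: string }[] }[];
  stream: true;
}

function buildSentinelProbeBody(
  mappedModel: string,
  instructions: string,
  markerPrompt: string, // replaces the management test's "hi"
): SentinelProbeBody {
  return {
    model: mappedModel,
    instructions,
    input: [{ role: "user", content: [{ type: "input_text", text: markerPrompt }] }],
    stream: true,
  };
}

// Local safety bound: stop collecting deltas once maxChars is reached.
// This limits what the sentinel keeps, not what is sent upstream.
function collectDeltas(deltas: string[], maxChars: number): string {
  let out = "";
  for (const delta of deltas) {
    if (out.length >= maxChars) break;
    out += delta.slice(0, maxChars - out.length);
  }
  return out;
}
```

Keeping the collection limit on the client side preserves the upstream request shape, so a difference between WebUI and sentinel outcomes can be traced to the prompt rather than to extra fields.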

Account Sentinel Marker Contract

The UniDesk account-level sentinel uses marker-only health semantics. A probe is healthy only when the upstream response satisfies the configured marker match. Every other result is unhealthy and must enter the same exponential freeze state machine, regardless of whether the immediate response is HTTP 200, 400, 403, 429, 500, 502, 503, 504, a streaming error event, malformed output, empty output, timeout, or any other transport/API failure. HTTP status, upstream error code, body hash, body preview, headers, and SDK exception class are diagnostics only; they must not become additional allow/deny criteria that bypass marker mismatch. Sentinel actions are only schedulable=false on freeze and schedulable=true on marker-matching recovery; they must not clear Sub2API temporary-unschedulable or rate-limit state as part of marker recovery.

The sentinel must not maintain separate classifiers for "private content", "maintenance", "quota", "ads", or provider-specific body phrases as health gates. The only recovery condition is a later recovery probe that matches the marker. Freeze TTL expiry only schedules the next recovery probe; it does not restore an account by itself. Repeated non-marker results use a short exponential freeze backoff because failed marker probes produce little or no useful output token usage; repeated marker-matching results use the configured success cadence backoff. This contract applies equally to OpenAI Responses gpt-5.5 direct account probes and manual codex-pool sentinel-probe --account ... --confirm measurements.
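
A minimal sketch of the marker-only decision and the failure backoff follows. Everything named here is hypothetical: the `ProbeResult` and `FreezePolicy` shapes, the substring match, and the policy numbers used in the usage note are placeholders, since the real match configuration and values live only in YAML.

```typescript
interface ProbeResult {
  httpStatus?: number; // diagnostic only; never consulted for health
  outputText?: string; // undefined on timeout or transport failure
}

interface FreezePolicy {
  initialSeconds: number;
  multiplier: number;
  maxSeconds: number;
}

// Healthy only when the output satisfies the marker match. Substring
// matching is an assumption; the configured match may differ.
function isHealthy(result: ProbeResult, marker: string): boolean {
  return typeof result.outputText === "string" && result.outputText.includes(marker);
}

// Freeze interval after the n-th consecutive non-marker result (n >= 1),
// capped at the policy maximum.
function freezeSeconds(policy: FreezePolicy, consecutiveFailures: number): number {
  const raw = policy.initialSeconds * Math.pow(policy.multiplier, consecutiveFailures - 1);
  return Math.min(raw, policy.maxSeconds);
}
```

With a placeholder policy of `{ initialSeconds: 60, multiplier: 2, maxSeconds: 600 }`, an HTTP 200 reply containing maintenance text and an HTTP 503 are both unhealthy and follow the same 60, 120, 240 second progression up to the cap.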

profiles.entries[].trustUpstream is the durable account-level trust marker for sentinel success cadence, and the absence of the field means untrusted. Trusted and untrusted accounts use separate YAML cadence maximums after marker-matching probes; the values belong only in config/platform-infra/sub2api-codex-pool.yaml. This field must not change Sub2API scheduler priority, capacity, load factor, membership, built-in temporary-unschedulable settings, or the marker-only health contract. Its purpose is to keep intermittently unreliable 200-success providers under more frequent direct probes without adding provider-specific content classifiers.

pool.defaultSentinelProtect is the default protection policy for sentinel freeze decisions, and profiles.entries[].sentinelProtect may override it for a specific account. For protected accounts, the marker-only health contract still applies, but the sentinel must exhaust the configured consecutive marker confirmation attempts before treating the account as failed and entering the freeze state machine. The retry count, initial delay, maximum delay, and backoff multiplier are YAML values; long-term reference prose must not duplicate the current numbers. This policy exists only to absorb occasional marker/probe or gateway-failure confirmation jitter. It must not change Sub2API scheduler priority, capacity, load factor, membership, built-in temporary-unschedulable settings, or the recovery condition.
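
The confirmation behavior could look roughly like the sketch below. The `ProtectPolicy` field names are hypothetical, and the sketch assumes the initial probe is counted separately from the configured confirmation attempts; the actual counting rule and all numbers are defined by YAML and the sentinel implementation.

```typescript
interface ProtectPolicy {
  attempts: number;       // consecutive confirmation attempts after the first miss
  initialDelayMs: number;
  maxDelayMs: number;
  multiplier: number;
}

// Delay to wait before each confirmation attempt, with capped backoff.
function confirmationDelays(p: ProtectPolicy): number[] {
  const delays: number[] = [];
  let delay = p.initialDelayMs;
  for (let i = 0; i < p.attempts; i++) {
    delays.push(Math.min(delay, p.maxDelayMs));
    delay *= p.multiplier;
  }
  return delays;
}

// markerMatches[0] is the initial probe; later entries are confirmations.
// Any marker match is healthy; failure requires every attempt to be used up.
function protectedVerdict(
  markerMatches: boolean[],
  p: ProtectPolicy,
): "healthy" | "pending" | "failed" {
  const considered = markerMatches.slice(0, p.attempts + 1);
  if (considered.some((matched) => matched)) return "healthy";
  return considered.length >= p.attempts + 1 ? "failed" : "pending";
}
```

Only the `failed` verdict would hand the account to the freeze state machine; the recovery condition afterwards is still a marker-matching probe.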

When codex-pool sync --confirm creates a YAML-managed account or changes direct-probe-relevant account inputs such as the profile mapping, upstream base URL, API key fingerprint, upstream User-Agent, Responses WebSocket mode, trustUpstream, or pool/profile sentinelProtect, sync records a pending sentinel probe from the pre-mutation runtime state, updates the account, and schedules the account probe immediately. It does not restore existing accounts to schedulable=true; restoration belongs to the marker-only sentinel after it has synced Sub2API runtime state and observed a marker-matching probe. New or changed accounts are not default-frozen; only an actual non-marker probe result or an existing active quarantine may remove an account from the scheduler. This avoids zero-available windows during sync while still ensuring that later marker failures enter the normal freeze/restore state machine. Unchanged accounts must not have their existing success or failure backoff reset by unrelated YAML syncs.

If the YAML failure freeze maximum is lowered, codex-pool sync --confirm may migrate only currently active sentinel quarantines whose stored interval or next recovery time exceeds the current maximum. The migration keeps the account frozen, marks the next recovery probe due immediately, and lets the next marker result decide restore versus the new shorter failure backoff. It must not clear quarantine or restore schedulability merely because an older TTL has expired.

If the YAML success cadence maximum is lowered or an account changes trust class, codex-pool sync --confirm may clamp existing successful account state so the next probe is due under the current YAML policy instead of waiting for an older, longer success window to expire. This clamp only affects sentinel state and probe timing; it does not by itself restore a quarantined account or bypass the next marker result.
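
One way to picture the success-cadence clamp is the sketch below. The `SuccessState` fields are hypothetical, times are epoch milliseconds, and treating "due under the current YAML policy" as "no later than now plus the current maximum" is an assumption about the intended bound.

```typescript
interface SuccessState {
  cadenceMs: number;        // current success cadence interval
  nextProbeAtMs: number;    // when the next probe is due
  quarantineActive: boolean;
}

// Clamp an existing success window to the current maximum. Only probe timing
// changes; quarantine state is copied through untouched.
function clampSuccessState(
  state: SuccessState,
  nowMs: number,
  maxCadenceMs: number,
): SuccessState {
  return {
    ...state,
    cadenceMs: Math.min(state.cadenceMs, maxCadenceMs),
    nextProbeAtMs: Math.min(state.nextProbeAtMs, nowMs + maxCadenceMs),
  };
}
```

A state that is already within the current maximum is returned with the same values, so unrelated syncs do not reset an account's existing cadence.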

Operational observation for this sentinel should use the read-only codex-pool sentinel-report table or its --raw form. It is the canonical low-noise view for per-account probe count, trust class, Sub2API runtime schedulability, protect threshold and latest protect confirmation result, marker result, HTTP/error diagnostics, freeze TTL, success cadence, success cadence maximum, next probe time, and recent CronJob runs; raw ConfigMap dumps and ad hoc log scraping are fallback diagnostics, not the primary state surface.

The active Codex-pool request path follows the YAML-selected active target:

A client sends an OpenAI-compatible request to the configured consumer base URL with the unified API key.
The target-level public edge forwards traffic to that target's sub2api-frpc when config/platform-infra/sub2api.yaml enables publicExposure.
sub2api-frpc forwards to sub2api.platform-infra.svc.cluster.local:8080 inside the active target namespace.
Sub2API validates the unified key and resolves its group_id.
Accounts listed in profiles.entries are bound to the same group via group_ids, so Sub2API dispatches through that group using its own account selection semantics.

For the current D601 externally backed active target, client traffic reaches PK01 Caddy, PK01 forwards to the YAML-declared FRP remote port, D601 sub2api-frpc connects directly to PK01 frps, and FRP forwards to sub2api.platform-infra.svc.cluster.local:8080 on D601. This path does not pass through the master server or the pikanode reverse proxy. api.pikapython.com must resolve to the YAML-declared PK01 public address before Caddy can obtain or renew the public certificate; when DNS is missing, PK01 local FRP probes and public-IP remote-port probes may prove the edge path, but they are not a substitute for final https://api.pikapython.com validation.

When target-level egressProxy.enabled=true, the D601 target renders an in-cluster HTTP(S) proxy client from the master VPN subscription source declared in YAML. The CLI injects the resulting proxy URL and NO_PROXY into Sub2API and, when requested by YAML, the Codex account sentinel. platform-infra sub2api validate --target D601 --full must prove the proxy Deployment/Service is ready and that an app pod can complete the YAML-declared health probe through the proxy. This target-level injection does not by itself bind manually created Sub2API accounts to that proxy; account tests and account-specific upstream transports still need a YAML-declared manualAccounts.protected[].proxyBinding when the account must avoid direct egress. Subscription contents and generated proxy configs are Secret material and must not be printed.

Adding, removing, exposing, validating, and configuring local Codex consumers are daily operations covered by $unidesk-sub2api. The development rule is that ordinary pool membership changes stay YAML-only and do not add code or CI/CD. Code changes are only appropriate when UniDesk needs to render or validate a Sub2API capability that already exists upstream, such as account-level WebSocket mode or per-account upstream User-Agent. If Sub2API itself does not support a desired behavior, do not magic-patch it through UniDesk scripts, Kubernetes hotfixes, local forks, or hidden compatibility paths; either leave the behavior unsupported or pursue it upstream as an explicit Sub2API feature.

codex-pool sync --confirm and codex-pool validate are runtime operations that may need more than one SSH short-connection window because they log in to Sub2API, reconcile accounts, inspect recent logs, and run gateway smoke requests. The formal entry remains the UniDesk CLI, which must use a submit-and-short-poll control shape or an equivalent remote job wrapper instead of one long trans G14:k3s sh call. If these commands fail with UNIDESK_SSH_RUNTIME_TIMEOUT while the remote operation may still be running, treat it as a control-plane visibility gap first: improve or use the CLI's job/poll path, then rerun sync or validate. Do not replace it with raw kubectl, manual Sub2API admin API patches, repeated blind full loops, or Sub2API source modifications.

After codex-pool configure-local --confirm, the default ~/.codex/config.toml / auth.json pair must remain the unified Sub2API consumer and must not be reused as an upstream account profile. Keep every upstream source profile in suffixed files such as config.toml.<profile> / auth.json.<profile> and register it through YAML profiles.entries.

Public FRP Boundary

When publicExposure.enabled is true, the same FRP TCP bridge exposes both OpenAI-compatible API paths and the built-in Sub2API management frontend. The management UI is reachable at the configured publicExposure.publicBaseUrl and its /login route; do not allocate a second public port unless a separate YAML-controlled exposure decision exists.

The public management UI is an operations endpoint. Keep Sub2API itself in platform-infra, keep the Kubernetes Service as ClusterIP, and treat FRP as the only public bridge unless a later decision explicitly changes the exposure model.

The public bridge has two separate failure classes. Sub2API upstream/account failures are visible in Sub2API logs and currently belong to sentinel quarantine plus normal Sub2API routing among schedulable accounts. Edge failures between Caddy and the FRP remote port are not visible to Sub2API; symptoms include Caddy connect: connection refused, EOF, connection reset, TLS/certificate failures, DNS NXDOMAIN, or short 502 bursts while frps closes and reopens the configured remote port. Those failures must be diagnosed from DNS, Caddy, and frps/frpc evidence and mitigated through YAML-controlled Caddy edge retry, DNS correction, or FRP stability fixes, not by disabling accounts or changing pool membership.

PK01 /etc/caddy/Caddyfile is a shared edge artifact for multiple YAML owners, including platform-infra services and HWLAB node public exposure. Every platform-infra writer must use the shared managed-block helper in scripts/src/pk01-caddy.ts or the platform public-service wrapper around it. The helper preserves existing UniDesk managed blocks, updates only the caller's marker block, validates the merged Caddyfile before install, and reloads Caddy only after validation succeeds.

Do not render and install a whole PK01 Caddyfile from a single service YAML. Sub2API, LangBot, n8n, HWLAB and future public services must coexist by distinct # BEGIN unidesk managed <owner> blocks. A public exposure closeout should verify the service's own public URL and, when the operation touched PK01 Caddy, confirm that unrelated managed blocks are still present or that the apply output reports they were preserved.
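
The managed-block idea can be sketched as a pure string transformation. This is not the code in scripts/src/pk01-caddy.ts: the function name is hypothetical, the `# END unidesk managed <owner>` marker is an assumed counterpart to the BEGIN marker named above, and validation and Caddy reload are left out.

```typescript
const beginMarker = (owner: string): string => `# BEGIN unidesk managed ${owner}`;
// Assumption: the closing marker mirrors the BEGIN form.
const endMarker = (owner: string): string => `# END unidesk managed ${owner}`;

// Replaces only the caller's block, or appends it when absent. Every other
// line, including other owners' blocks, is preserved as-is.
function upsertManagedBlock(caddyfile: string, owner: string, body: string): string {
  const lines = caddyfile.split("\n");
  const block = [beginMarker(owner), ...body.split("\n"), endMarker(owner)];
  const start = lines.indexOf(beginMarker(owner));
  const end = lines.indexOf(endMarker(owner));
  if (start === -1 && end === -1) {
    const base = caddyfile.endsWith("\n") || caddyfile === "" ? caddyfile : caddyfile + "\n";
    return base + block.join("\n") + "\n";
  }
  if (start === -1 || end === -1 || end < start) {
    // Refuse to guess when markers are unbalanced.
    throw new Error(`unbalanced managed block markers for ${owner}`);
  }
  lines.splice(start, end - start + 1, ...block);
  return lines.join("\n");
}
```

Matching on whole marker lines keeps an owner such as `n8n` from colliding with a longer owner name that shares its prefix. A real writer would still validate the merged file before install, as described above.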

Availability And Probes

Kubernetes readiness is not the same as pool availability:

The Sub2API app, PostgreSQL, and Redis manifests include container-level health probes. These only prove the pods and local dependencies are healthy enough for Kubernetes scheduling.
The FRP client deployment is a connector deployment and does not itself prove that edge traffic reaches Sub2API.
No scheduled CronJob, ServiceMonitor, or PodMonitor currently proves the full unified Codex API path.
platform-infra sub2api validate and platform-infra sub2api codex-pool validate are on-demand checks. Operational usage is documented in $unidesk-sub2api; they are acceptable for deployment closeout, but they are not continuous monitoring. codex-pool validate must test both GET /v1/models and a small POST /v1/responses request, and the Responses smoke should report request id, selected/final account evidence, upstream failover count, and whether the validation succeeded only after failover. It should also summarize recent /responses and /responses/compact gateway failures separately so ordinary long streaming failures are not hidden behind compact-only evidence.
codex-pool validate must not create mock upstreams or temporary failover-probe accounts as its default proof of Sub2API behavior. When a suspected failover path is in question, validate should surface the relevant source-path expectation and real runtime evidence: request ids, selected/final account ids, openai.upstream_failover_switching, openai.forward_failed, openai.account_select_failed, and final status. If runtime evidence contradicts the source-path expectation, fix Sub2API or the UniDesk integration path rather than converting the mismatch into a mock-only success.
Public exposure closeout must include the edge layer when the user-facing URL is involved. A Sub2API-side compact success summary does not rule out DNS, Caddy, TLS, or FRP failures that happened before Sub2API received the request; inspect the edge evidence or use a CLI report that summarizes it before declaring the public URL stable.
Because codex-pool validate includes account alignment, recent-log inspection, and gateway smoke, timeout of the CLI transport is not valid negative evidence about Sub2API scheduling by itself. Closeout evidence must come from the final structured validation result or from an explicitly reported remote job failure with stdout/stderr tail, not from a single low-level trans timeout.
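
As an illustration of the Responses smoke evidence described in this list, a structured result might be summarized as follows. The field names and the 2xx success test are hypothetical; they do not describe the CLI's actual output schema.

```typescript
interface ResponsesSmokeEvidence {
  requestId: string;
  selectedAccountId: string;    // account chosen first
  finalAccountId: string;       // account that produced the final response
  upstreamFailoverCount: number;
  finalStatus: number;          // final access-log status
}

interface ResponsesSmokeSummary extends ResponsesSmokeEvidence {
  ok: boolean;
  succeededOnlyAfterFailover: boolean;
}

// Derives the two closeout flags from the raw evidence.
function summarizeResponsesSmoke(e: ResponsesSmokeEvidence): ResponsesSmokeSummary {
  const ok = e.finalStatus >= 200 && e.finalStatus < 300;
  return { ...e, ok, succeededOnlyAfterFailover: ok && e.upstreamFailoverCount > 0 };
}
```

Reporting `succeededOnlyAfterFailover` separately from `ok` keeps a passing smoke from hiding an account that failed on the first attempt.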

When an automatic availability probe is added, it should be YAML-controlled and cover these layers without printing secrets:

G14 in-cluster GET /v1/models through sub2api.platform-infra.svc.cluster.local:8080 with the unified key.
master-local GET /v1/models through the configured FRP endpoint when public exposure is enabled.
A tiny POST /v1/responses call through the same consumer URL for true OpenAI-compatible request validation.
Optional per-upstream account probes if Sub2API exposes a safe account selection or admin-health mechanism; otherwise document that group-level success does not prove every upstream account is healthy.

For D601 public exposure, the equivalent probe set must use the target URL from config/platform-infra/sub2api.yaml, include the PK01 Caddy/FRP edge, and require api.pikapython.com DNS to resolve to the YAML-declared address before treating HTTPS as validated.

Until continuous probing exists, closeout comments must state that validation was on-demand and include the exact CLI/API entrypoints used.

k3s Network Policy Requirements

G14 k3s runs kube-router as its network policy controller. When any NetworkPolicy resource exists in a namespace, kube-router replaces its default allow-all behavior with explicit iptables/ipset rules that only permit traffic matching declared policies. If a namespace has NetworkPolicy resources but the generated iptables rules miss or incorrectly evaluate a traffic path, pods in that namespace will experience silent connection timeouts (REJECT with icmp-port-unreachable) even though kubectl get networkpolicy shows the policy and DNS/service resolution works.

The platform-infra namespace must have a NetworkPolicy named allow-all (or equivalent) that explicitly permits all ingress and egress within the namespace. Without it, kube-router's default-deny iptables chains block cross-pod traffic including Sub2API → PostgreSQL and Sub2API → Redis connections, causing Sub2API init containers and background services to hang with context deadline exceeded or no response errors.

Diagnostic symptoms:

Sub2API pod stuck Init:0/2 with wait-postgres logging sub2api-postgres:5432 - no response perpetually
pg_isready succeeds inside the postgres pod itself but TCP from any other pod times out
kubectl exec from a different pod or nc -zv to the postgres ClusterIP/pod-IP returns Operation timed out
iptables -L KUBE-ROUTER-INPUT -n | grep <namespace> shows per-pod FW chains; the chain ends with REJECT ... mark match ! 0x10000/0x10000

If kube-router iptables rules become stale after a NetworkPolicy create/update cycle (e.g., ipset references old pod IPs or mark-bit logic fails to match), the fastest recovery is: iptables -I FORWARD 1 -s 10.42.0.0/16 -d 10.42.0.0/16 -j ACCEPT as a temporary bypass, then recreate the NetworkPolicy or restart kube-router/k3s to force a full iptables sync. After recovery, remove the temporary rule: iptables -D FORWARD -s 10.42.0.0/16 -d 10.42.0.0/16 -j ACCEPT.

The manifest for the required allow-all policy is:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-all
  namespace: platform-infra
spec:
  podSelector: {}
  policyTypes:
  - Ingress
  - Egress
  ingress:
  - {}
  egress:
  - {}

This policy must be included in the sub2api plan / apply manifest rendering so that it is created as part of the normal deployment flow, not maintained as a manual one-off.

platform-infra sub2api status must report whether NetworkPolicy/allow-all exists and still has podSelector: {}, policyTypes: [Ingress, Egress], ingress: [{}], and egress: [{}]. For active bundled targets, platform-infra sub2api validate must also run temporary in-namespace probe pods that connect to sub2api-postgres:5432 and sub2api-redis:6379; local pg_isready inside the PostgreSQL pod alone is insufficient because it does not exercise kube-router cross-pod policy evaluation. For external-DB standby targets, validate --target checks the predeployment shape: no local PostgreSQL, app replicas zero, ClusterIP services, allow-all NetworkPolicy, local Redis declared as ephemeral cache with readiness required only when Redis replicas are above zero, and no standby-disabled public FRP, egress proxy, or sentinel CronJob remains. For external-DB active targets, validate --target checks that the app uses the external database endpoint, local Redis is ephemeral, no local PostgreSQL StatefulSet exists, and any YAML-declared egress proxy and public exposure resources are present and probed through their configured paths.
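
The shape check that status is expected to report could be sketched as below. The function name and the `NetworkPolicyLike` type are hypothetical; the sketch only covers the four fields listed above and assumes the policy object has already been fetched and parsed.

```typescript
// Minimal structural view of a parsed NetworkPolicy object.
interface NetworkPolicyLike {
  metadata?: { name?: string };
  spec?: {
    podSelector?: Record<string, unknown>;
    policyTypes?: string[];
    ingress?: Record<string, unknown>[];
    egress?: Record<string, unknown>[];
  };
}

const isEmptyObject = (v: unknown): boolean =>
  typeof v === "object" && v !== null && !Array.isArray(v) && Object.keys(v).length === 0;

// True only for a rule list that is exactly [{}].
const isSingleEmptyRule = (rules: unknown): boolean =>
  Array.isArray(rules) && rules.length === 1 && isEmptyObject(rules[0]);

function isAllowAllPolicy(policy: NetworkPolicyLike): boolean {
  const spec = policy.spec;
  if (policy.metadata?.name !== "allow-all" || !spec) return false;
  const types = [...(spec.policyTypes ?? [])].sort();
  return (
    isEmptyObject(spec.podSelector) &&
    types.length === 2 &&
    types[0] === "Egress" &&
    types[1] === "Ingress" &&
    isSingleEmptyRule(spec.ingress) &&
    isSingleEmptyRule(spec.egress)
  );
}
```

A passing shape check only shows that the declared policy is correct; the cross-pod probe pods described above are still needed to show that kube-router enforces it as declared.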
