feat: activate D601 sub2api external edge

2026-06-12 07:56:46 +00:00
parent f5a6610fe9
commit c23d9b5ea0
8 changed files with 2375 additions and 164 deletions
@@ -58,6 +58,16 @@ PK01 currently hosts existing Docker workloads:

 `pikanode` mounts `/home/ubuntu/pikanode` read-write into the container. Static/generated download artifacts under `html/download/` and repository data under `files/` may be user-visible or needed by the service. They are not generic GC candidates.

+## Sub2API Caddy Edge
+
+PK01 may act as the public Caddy edge for a YAML-declared D601 Sub2API target. The durable source of truth is `config/platform-infra/sub2api.yaml`; do not hand-edit PK01 Caddy or FRP state as a separate routing truth.
+
+In this mode, public ports `80` and `443` belong to Caddy. The existing `pikanode` container must be bound to a loopback HTTP port and used only as the apex PikaPython/PikaNode upstream. The `api.pikapython.com` site must reverse proxy directly to the YAML-declared FRP remote port, so API traffic follows `client -> PK01 Caddy -> PK01 frps remote port -> D601 frpc -> D601 Sub2API`. It must not pass through pikanode or a master-server reverse proxy.
+
+Caddy binary installation is also YAML-controlled. If `publicExposure.pk01.caddyDownloadProxyUrl` is set, PK01 Caddy downloads must use that proxy URL; the PK01 loopback provider egress proxy is the preferred source. A slow or failing Caddy download should first be treated as missing proxy use, not as a reason to keep retrying a naked GitHub release download.
+
+The public certificate depends on DNS. The `api.pikapython.com` record must resolve to the YAML-declared PK01 public address before Caddy can complete ACME issuance. If the DNS record is absent or stale, local probes such as PK01 `127.0.0.1:<frp-remote-port>` and public-IP remote-port checks can prove the FRP data path, but final `https://api.pikapython.com` validation remains blocked until DNS is corrected.
+
 ## Host PostgreSQL

 PK01 host-native PostgreSQL is declared by `config/platform-db/postgres-pk01.yaml` and managed through `bun scripts/cli.ts platform-db postgres plan|status|apply`; daily operation commands live in `$unidesk-ops` at `.agents/skills/unidesk-ops/SKILL.md`. It is a host systemd service, not a Docker container or k3s workload. The YAML is the source of truth for PostgreSQL version, TLS mode, listening addresses, `pg_hba` source CIDRs, generated Secret source files, exported `DATABASE_URL`, and backup timer settings.
@@ -1,6 +1,6 @@
 # Platform Infra

-`platform-infra` is the k3s namespace for UniDesk-operated shared platform services. G14 is the active default runtime for this namespace; D601 may host explicitly declared standby platform targets when the service needs node-local preparation or cutover capacity. It is separate from HWLAB runtime lanes, AgentRun lanes, D601 user services, and legacy `devops-infra` control-plane helpers. New shared infra should land here first; old `devops-infra` resources migrate gradually only when a concrete owner and validation path exist.
+`platform-infra` is the k3s namespace for UniDesk-operated shared platform services. G14 is the default active runtime for this namespace; D601 may host explicitly declared standby or externally backed active targets when the service needs node-local preparation, cutover capacity, or a direct public edge. It is separate from HWLAB runtime lanes, AgentRun lanes, D601 user services, and legacy `devops-infra` control-plane helpers. New shared infra should land here first; old `devops-infra` resources migrate gradually only when a concrete owner and validation path exists.

 ## Source Of Truth

@@ -12,15 +12,15 @@
 ## Sub2API Deployment Boundary

 - Sub2API is a platform service operated by UniDesk in namespace `platform-infra`. It is not a HWLAB lane workload, AgentRun workload, D601 user service, or master server daemon.
- The canonical deployment entrypoint is `bun scripts/cli.ts platform-infra sub2api plan|apply|status|validate|codex-pool`. Runtime targets are selected with `--target`; `G14` is the active default target and `D601` is a standby target controlled by the same YAML. Daily operation procedures live in `$unidesk-sub2api` at `.agents/skills/unidesk-sub2api/SKILL.md`. This reference keeps only development boundaries and project-specific source-of-truth rules.
+- The canonical deployment entrypoint is `bun scripts/cli.ts platform-infra sub2api plan|apply|status|validate|codex-pool`. Runtime targets are selected with `--target`; `G14` is the default active target and `D601` is controlled by the same YAML as either standby predeploy or externally backed active runtime. Daily operation procedures live in `$unidesk-sub2api` at `.agents/skills/unidesk-sub2api/SKILL.md`. This reference keeps only development boundaries and project-specific source-of-truth rules.
 - Raw `kubectl` through `trans <target>:k3s` is only for bounded diagnosis and evidence, not a formal mutate path.
 - The image version is controlled by `config/platform-infra/sub2api.yaml`. Image update procedures are daily operations owned by `$unidesk-sub2api`; the development boundary is that image choices remain YAML-controlled.
 - Sub2API should stay ClusterIP-only by default. Do not add Ingress, NodePort, LoadBalancer, or broad FRP exposure unless a YAML-controlled public exposure decision exists.
 - Sub2API currently has no resource limits by design. Do not add CPU or memory limits unless a later explicit decision changes that policy and stores the new policy in YAML.
 - Master server is a consumer/control host, not the runtime location. Do not deploy Sub2API, PostgreSQL, Redis, or heavy validation loops on master server.
- D601 Sub2API is a predeployment target, not a second active singleton. While the platform database handoff is pending, it must render without a local PostgreSQL StatefulSet, keep the Sub2API app and local Redis cache scaled to zero, and use only ephemeral Redis storage when Redis is later activated. After the external platform DB endpoint, Secret, and runtime images are ready, activation must be expressed by YAML and applied through the same `platform-infra sub2api --target D601` CLI path.
+- D601 Sub2API is selected by YAML, not by ad hoc runtime patches. In standby mode it must render without a local PostgreSQL StatefulSet, keep the Sub2API app and local Redis cache scaled to zero, and use only ephemeral Redis storage when Redis is later activated. In externally backed active mode it connects directly to the YAML-declared external PostgreSQL endpoint with `sslmode=require`, keeps durable app state outside the D601 k3s node, and uses local Redis only as ephemeral cache. Activation must be applied through the same `platform-infra sub2api --target D601` CLI path.
 - External platform PostgreSQL endpoints for Sub2API are produced by the platform DB YAML and its `platform-db postgres` CLI. Cross-node Sub2API consumers connect directly to that endpoint; the master server is not a PostgreSQL data-plane relay. DNS aliases are optional when the exported `DATABASE_URL` uses a reachable IP with `sslmode=require`; current PK01-specific rules live in `docs/reference/pk01.md`.
- Sub2API account sentinel and public FRP exposure remain singleton concerns. Do not create a second sentinel or public management surface for D601 unless a later YAML-controlled decision explicitly moves or splits that responsibility.
+- Sub2API account sentinel and public exposure are target-scoped YAML decisions. Do not create a second sentinel, FRP client, public management surface, or edge proxy by hand; enable or move those resources only through the target YAML and the `platform-infra sub2api` / `codex-pool --target` CLI paths.

 ## Codex Pool Routing

@@ -45,7 +45,7 @@
 - Codex account-state, quota prompts, model-routing failures, encrypted-content affinity failures, gateway wrappers, and timeout-like upstream errors must be handled by the generic temporary-unschedulable/failover path plus the external marker sentinel. Do not change membership, priority, capacity, load factor, WebSocket mode, `pool_mode`, or a specific provider's status merely to work around those errors. If a matching upstream failure still logs `openai.forward_failed` without `openai.upstream_failover_switching`, the missing fix is in Sub2API's HTTP `/responses` failover classification/error propagation, not in account pinning.
 - `profiles.entries[].openaiResponsesWebSocketsV2Mode` is the account-level Responses WebSocket v2 switch for OpenAI-compatible upstreams that require WebSocket transport. Allowed values are `off`, `ctx_pool`, and `passthrough`; omit the field unless that upstream needs it.
 - `profiles.entries[].upstreamUserAgent` is an optional account-level upstream request User-Agent override. Use it only for upstreams that require a Codex CLI compatible User-Agent; keep the value YAML-controlled and newline-free.
- `publicExposure` controls the optional FRP bridge from master server to the G14 ClusterIP service.
+- `publicExposure` in `config/platform-infra/sub2api-codex-pool.yaml` controls the default Codex-pool public bridge from master server to the G14 ClusterIP service. Target-level `publicExposure` in `config/platform-infra/sub2api.yaml` controls non-master exposure such as a D601-to-PK01 edge.
 - `publicExposure.masterCaddy.responseHeaderTimeoutSeconds` controls the master Caddy `response_header_timeout` for the public Sub2API site. It must be long enough for Codex `/responses/compact` requests; otherwise Caddy can return a client-visible 504 before Sub2API finishes the upstream compact request, and that edge timeout is not an account-level upstream failure that Sub2API can use for temporary-unschedulable failover. The numeric value belongs only in `config/platform-infra/sub2api-codex-pool.yaml`; after changing it, use `codex-pool expose --confirm` to reload Caddy and verify the rendered `response_header_timeout`. Requests that were already in flight before the reload may still finish with the previous timeout, so post-change evidence should check only requests that started after the reload.
 - `publicExposure.masterCaddy.edgeRetry` controls the master Caddy reverse-proxy retry window for the public Sub2API site. This belongs at the edge because FRP remotePort listener loss, `connection refused`, EOF, or connection reset can happen before a request reaches Sub2API, so Sub2API account failover and sentinel logic cannot observe or recover that request. Keep retry scope narrow, especially for non-idempotent POST traffic: connection-attempt failures may be retried by the reverse proxy, while round-trip retry after an upstream connection was established should be limited by YAML `retryMatch` to paths that are safe to repeat, such as compact. Retry durations and intervals belong only in YAML; after changing them, run `codex-pool expose --confirm` and verify the rendered Caddyfile contains the expected `lb_try_duration`, `lb_try_interval`, and `lb_retry_match`.
 - `localCodex` controls how the master server's current `~/.codex` consumer files are backed up and rewritten. Keep `supportsWebSockets` and `responsesWebSocketsV2` in the same state, and enable them only when at least one YAML-managed account has a current direct Codex WSv2 smoke that passes. If no upstream profile can sustain Responses WSv2, the honest long-term state is `false/false` so Codex uses HTTP Responses directly instead of repeatedly reconnecting before `response.completed`. `localCodex.responsesSmokeModel` is the YAML-declared model used by `codex-pool validate` for the lightweight `POST /v1/responses` smoke.
@@ -92,7 +92,7 @@ If the YAML success cadence maximum is lowered or an account changes trust class

 Operational observation for this sentinel should use the read-only `codex-pool sentinel-report` table or its `--raw` form. It is the canonical low-noise view for per-account probe count, trust class, marker result, HTTP/error diagnostics, freeze TTL, success cadence, success cadence maximum, next probe time, and recent CronJob runs; raw ConfigMap dumps and ad hoc log scraping are fallback diagnostics, not the primary state surface.

-The request path is:
+The default G14 Codex-pool request path is:

 1. A client sends an OpenAI-compatible request to the configured consumer base URL, normally `https://sub2api.74-48-78-17.nip.io/v1/...`, with the unified API key.
 2. master `frps` forwards the TCP connection to `platform-infra/sub2api-frpc` when `publicExposure.enabled` is true.
@@ -100,6 +100,10 @@ The request path is:
 4. Sub2API validates the unified key and resolves its `group_id`.
 5. Accounts listed in `profiles.entries` are bound to the same group via `group_ids`, so Sub2API dispatches through that group using its own account selection semantics.

+The D601 externally backed request path is different when target-level `publicExposure.enabled=true` in `config/platform-infra/sub2api.yaml`: client traffic reaches PK01 Caddy, PK01 forwards to the YAML-declared FRP remote port, D601 `sub2api-frpc` connects directly to PK01 `frps`, and FRP forwards to `sub2api.platform-infra.svc.cluster.local:8080` on D601. This path does not pass through the master server or the pikanode reverse proxy. `api.pikapython.com` must resolve to the YAML-declared PK01 public address before Caddy can obtain or renew the public certificate; when DNS is missing, PK01 local FRP probes and public-IP remote-port probes may prove the edge path, but they are not a substitute for final `https://api.pikapython.com` validation.
+
+When target-level `egressProxy.enabled=true`, the D601 target renders an in-cluster HTTP(S) proxy client from the master VPN subscription source declared in YAML. The CLI injects the resulting proxy URL and `NO_PROXY` into Sub2API and, when requested by YAML, the Codex account sentinel. `platform-infra sub2api validate --target D601 --full` must prove the proxy Deployment/Service is ready and that an app pod can complete the YAML-declared health probe through the proxy. Subscription contents and generated proxy configs are Secret material and must not be printed.
+
 Adding, removing, exposing, validating, and configuring local Codex consumers are daily operations covered by `$unidesk-sub2api`. The development rule is that ordinary pool membership changes stay YAML-only and do not add code or CI/CD. Code changes are only appropriate when UniDesk needs to render or validate a Sub2API capability that already exists upstream, such as account-level WebSocket mode or per-account upstream User-Agent. If Sub2API itself does not support a desired behavior, do not magic-patch it through UniDesk scripts, Kubernetes hotfixes, local forks, or hidden compatibility paths; either leave the behavior unsupported or pursue it upstream as an explicit Sub2API feature.

 `codex-pool sync --confirm` and `codex-pool validate` are runtime operations that may need more than one SSH short-connection window because they log in to Sub2API, reconcile accounts, inspect recent logs, and run gateway smoke requests. The formal entry remains the UniDesk CLI, which must use a submit-and-short-poll control shape or an equivalent remote job wrapper instead of one long `trans G14:k3s script` call. If these commands fail with `UNIDESK_SSH_RUNTIME_TIMEOUT` while the remote operation may still be running, treat it as a control-plane visibility gap first: improve or use the CLI's job/poll path, then rerun `sync` or `validate`. Do not replace it with raw `kubectl`, manual Sub2API admin API patches, repeated blind full loops, or Sub2API source modifications.
@@ -112,18 +116,18 @@ When `publicExposure.enabled` is true, the same FRP TCP bridge exposes both Open

 The public management UI is an operations endpoint. Keep Sub2API itself in `platform-infra`, keep the Kubernetes Service as ClusterIP, and treat FRP as the only public bridge unless a later decision explicitly changes the exposure model.

-The public bridge has two separate failure classes. Sub2API upstream/account failures are visible in Sub2API logs and currently belong to sentinel quarantine plus normal Sub2API routing among schedulable accounts. Edge failures between master Caddy and the FRP remotePort are not visible to Sub2API; symptoms include Caddy `connect: connection refused`, EOF, connection reset, or short 502 bursts while frps closes and reopens the configured remotePort. Those failures must be diagnosed from Caddy and frps/frpc evidence and mitigated through YAML-controlled Caddy edge retry or FRP stability fixes, not by disabling accounts or changing pool membership.
+The public bridge has two separate failure classes. Sub2API upstream/account failures are visible in Sub2API logs and currently belong to sentinel quarantine plus normal Sub2API routing among schedulable accounts. Edge failures between Caddy and the FRP remote port are not visible to Sub2API; symptoms include Caddy `connect: connection refused`, EOF, connection reset, TLS/certificate failures, DNS NXDOMAIN, or short 502 bursts while frps closes and reopens the configured remote port. Those failures must be diagnosed from DNS, Caddy, and frps/frpc evidence and mitigated through YAML-controlled Caddy edge retry, DNS correction, or FRP stability fixes, not by disabling accounts or changing pool membership.

 ## Availability And Probes

 Kubernetes readiness is not the same as pool availability:

 - The Sub2API app, PostgreSQL, and Redis manifests include container-level health probes. These only prove the pods and local dependencies are healthy enough for Kubernetes scheduling.
- The FRP client deployment is currently a simple connector deployment and does not itself prove that master-local traffic reaches Sub2API.
+- The FRP client deployment is a connector deployment and does not itself prove that edge traffic reaches Sub2API.
 - No scheduled `CronJob`, `ServiceMonitor`, or `PodMonitor` currently proves the full unified Codex API path.
 - `platform-infra sub2api validate` and `platform-infra sub2api codex-pool validate` are on-demand checks. Operational usage is documented in `$unidesk-sub2api`; they are acceptable for deployment closeout, but they are not continuous monitoring. `codex-pool validate` must test both `GET /v1/models` and a small `POST /v1/responses` request, and the Responses smoke should report request id, selected/final account evidence, upstream failover count, and whether the validation succeeded only after failover. It should also summarize recent `/responses` and `/responses/compact` gateway failures separately so ordinary long streaming failures are not hidden behind compact-only evidence.
 - `codex-pool validate` must not create mock upstreams or temporary failover-probe accounts as its default proof of Sub2API behavior. When a suspected failover path is in question, validate should surface the relevant source-path expectation and real runtime evidence: request ids, selected/final account ids, `openai.upstream_failover_switching`, `openai.forward_failed`, `openai.account_select_failed`, and final status. If runtime evidence contradicts the source-path expectation, fix Sub2API or the UniDesk integration path rather than converting the mismatch into a mock-only success.
- Public exposure closeout must include the edge layer when the user-facing URL is involved. A Sub2API-side compact success summary does not rule out Caddy/FRP 502s that happened before Sub2API received the request; inspect the edge Caddy/frps/frpc evidence or use a CLI report that summarizes it before declaring public compact stable.
+- Public exposure closeout must include the edge layer when the user-facing URL is involved. A Sub2API-side compact success summary does not rule out DNS, Caddy, TLS, or FRP failures that happened before Sub2API received the request; inspect the edge evidence or use a CLI report that summarizes it before declaring the public URL stable.
 - Because `codex-pool validate` includes account alignment, recent-log inspection, and gateway smoke, timeout of the CLI transport is not valid negative evidence about Sub2API scheduling by itself. Closeout evidence must come from the final structured validation result or from an explicitly reported remote job failure with stdout/stderr tail, not from a single low-level `trans` timeout.

 When an automatic availability probe is added, it should be YAML-controlled and cover these layers without printing secrets:
@@ -133,6 +137,8 @@ When an automatic availability probe is added, it should be YAML-controlled and
 3. A tiny `POST /v1/responses` call through the same consumer URL for true OpenAI-compatible request validation.
 4. Optional per-upstream account probes if Sub2API exposes a safe account selection or admin-health mechanism; otherwise document that group-level success does not prove every upstream account is healthy.

+For D601 public exposure, the equivalent probe set must use the target URL from `config/platform-infra/sub2api.yaml`, include the PK01 Caddy/FRP edge, and require `api.pikapython.com` DNS to resolve to the YAML-declared address before treating HTTPS as validated.
+
 Until continuous probing exists, closeout comments must state that validation was on-demand and include the exact CLI/API entrypoints used.

 ## k3s Network Policy Requirements
@@ -169,4 +175,4 @@ spec:

 This policy must be included in the `sub2api plan` / `apply` manifest rendering so that it is created as part of the normal deployment flow, not maintained as a manual one-off.

-`platform-infra sub2api status` must report whether `NetworkPolicy/allow-all` exists and still has `podSelector: {}`, `policyTypes: [Ingress, Egress]`, `ingress: [{}]`, and `egress: [{}]`. For active bundled targets, `platform-infra sub2api validate` must also run temporary in-namespace probe pods that connect to `sub2api-postgres:5432` and `sub2api-redis:6379`; local `pg_isready` inside the PostgreSQL pod alone is insufficient because it does not exercise kube-router cross-pod policy evaluation. For external-DB pending standby targets, `validate --target` checks the predeployment shape instead: no local PostgreSQL, app replicas zero, ClusterIP services, allow-all NetworkPolicy, and local Redis declared as ephemeral cache with readiness required only when Redis replicas are above zero.
+`platform-infra sub2api status` must report whether `NetworkPolicy/allow-all` exists and still has `podSelector: {}`, `policyTypes: [Ingress, Egress]`, `ingress: [{}]`, and `egress: [{}]`. For active bundled targets, `platform-infra sub2api validate` must also run temporary in-namespace probe pods that connect to `sub2api-postgres:5432` and `sub2api-redis:6379`; local `pg_isready` inside the PostgreSQL pod alone is insufficient because it does not exercise kube-router cross-pod policy evaluation. For external-DB standby targets, `validate --target` checks the predeployment shape: no local PostgreSQL, app replicas zero, ClusterIP services, allow-all NetworkPolicy, and local Redis declared as ephemeral cache with readiness required only when Redis replicas are above zero. For external-DB active targets, `validate --target` checks that the app uses the external database endpoint, local Redis is ephemeral, no local PostgreSQL StatefulSet exists, and any YAML-declared egress proxy and public exposure resources are present and probed through their configured paths.