docs: capture post-task platform state

2026-06-09 04:23:19 +00:00
parent b0fef5b44d
commit d41f9a0de1
4 changed files with 65 additions and 2 deletions
@@ -139,7 +139,7 @@ trans G14 script -- '/usr/local/sbin/g14-platform-db-backup'
 - `postgresql` systemd service is active.
 - `ss -ltnp` shows listeners only on `127.0.0.1:5432` and `10.42.0.1:5432`.
 - `/usr/local/sbin/g14-platform-db-health` lists the expected databases.
- The `g14-platform-postgres` Service/Endpoints are visible in `hwlab-v03`.
+- The `g14-platform-postgres` Service is visible in `hwlab-v03`, and at least one bridge path is visible through Endpoints or EndpointSlice.
 - `hwlab-cloud-api` `/health/live` returns `status=ok`, `ready=true`, `db.connectionResult=connected`, and `runtime.connection.queryResult=durable_readiness_ready`.
 - `hwlab nodes control-plane status --node G14 --lane v03` shows Argo `Synced/Healthy`, and the runtime workload summary does not include the old self-hosted Postgres.

@@ -79,7 +79,7 @@ The `devops-infra` git mirror/relay remains manual and CLI-controlled, not CronJ

 After a `v0.2` PipelineRun completes, treat runtime rollout and remote GitOps persistence as two separate checks. `hwlab g14 control-plane status --lane v02` is the runtime check: it must show the expected source commit, PipelineRun completed, Argo `Synced/Healthy`, public 19666/19667 probes passing, and Cloud Web asset probes such as `/app.js` readable. `hwlab g14 git-mirror status` is the persistence check: `cache.summary.pendingFlush` must be false and `cache.summary.githubInSync` true before declaring GitOps fully flushed back to GitHub. The PR monitor performs this flush automatically for its own merged PRs and records the result in the PR comment. Manual operators should run `bun scripts/cli.ts hwlab g14 git-mirror flush --confirm` and poll the returned job with `bun scripts/cli.ts job status <jobId> --tail-bytes 12000` only when they used lower-level manual trigger/status paths or when the monitor reports a flush failure; do not replace this with raw `kubectl`, native `git push`, or a long SSH wait.
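
The persistence check described above can be sketched as a small fail-closed gate. This is a minimal illustration, assuming the JSON output of `hwlab g14 git-mirror status` has already been parsed; only the `cache.summary.pendingFlush` and `cache.summary.githubInSync` fields come from the text, while the type names and the `checkGitOpsFlushed` helper are hypothetical.

```typescript
// Hypothetical payload shape: only the two `cache.summary` fields named in
// the text are taken from the document; the rest of the payload is unknown.
interface GitMirrorStatus {
  cache?: { summary?: { pendingFlush?: unknown; githubInSync?: unknown } };
}

interface FlushVerdict {
  flushed: boolean;
  reasons: string[];
}

// Reports flushed=true only when pendingFlush is exactly false and
// githubInSync is exactly true. Missing or non-boolean fields fail closed.
function checkGitOpsFlushed(status: GitMirrorStatus): FlushVerdict {
  const summary = status.cache?.summary;
  const reasons: string[] = [];
  if (summary?.pendingFlush !== false) {
    reasons.push("cache.summary.pendingFlush is not false");
  }
  if (summary?.githubInSync !== true) {
    reasons.push("cache.summary.githubInSync is not true");
  }
  return { flushed: reasons.length === 0, reasons };
}
```

Treating an absent field as a failure keeps a truncated or malformed status payload from being read as "fully flushed".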

-If `gitops-promote` fails because the mirror write hook rejects a rendered GitOps path as outside the allowed lane outputs, treat it as `devops-infra` mirror control-plane drift until proven otherwise. The recovery path is `hwlab g14 git-mirror apply --confirm` to reinstall the current hook/ConfigMap, `hwlab g14 git-mirror sync --confirm --wait` to realign source and GitOps refs, then a targeted `control-plane cleanup-runs --pipeline-run <failed-run> --confirm` before retriggering the same lane. Do not patch the hook inside the pod, delete PipelineRuns with raw kubectl, or bypass `git-mirror flush`; closeout still requires the target PipelineRun status, Argo health, public probes, and `git-mirror status` with `pendingFlush=false`.
+If `gitops-promote` fails because the git mirror control plane drifted, refs are inconsistent, or publish/flush did not complete, recover through the controlled mirror path: `hwlab g14 git-mirror apply --confirm` to reinstall the current hook/ConfigMap, `hwlab g14 git-mirror sync --confirm --wait` to realign source and GitOps refs, then a targeted `control-plane cleanup-runs --pipeline-run <failed-run> --confirm` before retriggering the same lane. The old branch/path allowlist gate has been removed; do not restore it, patch the hook inside the pod, delete PipelineRuns with raw kubectl, or bypass `git-mirror flush`. Closeout still requires the target PipelineRun status, Argo health, public probes, and `git-mirror status` with `pendingFlush=false`.

 When closing an issue against a specific completed `v0.2` PipelineRun, use targeted status instead of the latest-head status if `origin/v0.2` has already advanced through a parallel task:

@@ -0,0 +1,57 @@
+# G14 Platform Infra
+
+`platform-infra` is the G14 k3s namespace for UniDesk-operated shared platform services. It is separate from HWLAB runtime lanes, AgentRun lanes, D601 user services, and legacy `devops-infra` control-plane helpers. New shared infra should land here first; old `devops-infra` resources migrate gradually only when a concrete owner and validation path exist.
+
+## Source Of Truth
+
+- UniDesk-owned platform configuration must be YAML-first. `config/platform-infra/*.yaml` is the durable source for images, versions, endpoints, FRP exposure, account profile selection, and local consumer configuration.
+- Runtime Secrets and local `~/.codex/config.toml*` / `auth.json*` files are inputs or generated local state, not committed truth. CLI output may show Secret paths, byte counts, fingerprints, and short previews only; it must not print complete API keys.
+- Code that reads platform YAML must validate object shape, field types, required fields, Kubernetes names, image strings, and ports before mutating G14 k3s or local consumer files.
+- Do not hide image versions, namespace names, endpoint URLs, FRP ports, or profile lists in Python/TOML/JSON helper constants when they are UniDesk-owned choices. External tools may still require their own TOML/JSON/env file formats at the edge.
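
The validate-before-mutate rule in the list above could look roughly like the following. It is a minimal sketch, not the repository's validator: the `namespace`, `image`, and `port` field names are hypothetical stand-ins for whatever `config/platform-infra/*.yaml` actually defines, and the input is assumed to be an already-parsed YAML document.

```typescript
// Kubernetes DNS-1123 label: lowercase alphanumerics or '-', starting and
// ending with an alphanumeric; the 63-character cap is checked separately.
const DNS_LABEL = /^[a-z0-9]([-a-z0-9]*[a-z0-9])?$/;

// Loose image check (an assumption, not a full reference parser): no
// whitespace, and the last path segment carries a tag or a sha256 digest.
function isPinnedImage(value: string): boolean {
  if (value.length === 0 || /\s/.test(value)) return false;
  const lastSegment = value.slice(value.lastIndexOf("/") + 1);
  return /^[^:@]+(:[\w][\w.-]*|@sha256:[a-f0-9]{64})$/.test(lastSegment);
}

// Hypothetical field names, used for illustration only.
interface ServiceConfig {
  namespace: string;
  image: string;
  port: number;
}

// Collects every problem instead of stopping at the first one, and returns a
// typed config only when the document is fully valid.
function validateServiceConfig(raw: unknown): { config?: ServiceConfig; errors: string[] } {
  if (typeof raw !== "object" || raw === null || Array.isArray(raw)) {
    return { errors: ["document must be a mapping"] };
  }
  const { namespace, image, port } = raw as Record<string, unknown>;
  const errors: string[] = [];
  if (typeof namespace !== "string" || namespace.length > 63 || !DNS_LABEL.test(namespace)) {
    errors.push("namespace must be a DNS-1123 label");
  }
  if (typeof image !== "string" || !isPinnedImage(image)) {
    errors.push("image must be a string with an explicit tag or digest");
  }
  if (typeof port !== "number" || !Number.isInteger(port) || port < 1 || port > 65535) {
    errors.push("port must be an integer between 1 and 65535");
  }
  if (errors.length > 0) return { errors };
  return {
    config: { namespace: namespace as string, image: image as string, port: port as number },
    errors,
  };
}
```

A caller would refuse to touch G14 k3s or local consumer files whenever `errors` is non-empty.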
+
+## Sub2API Deployment Boundary
+
+- Sub2API is a G14 platform service operated by UniDesk in namespace `platform-infra`. It is not a HWLAB lane workload, AgentRun workload, D601 service, or master server daemon.
+- The canonical deployment entrypoint is `bun scripts/cli.ts platform-infra sub2api plan|apply|status|validate|codex-pool`; raw `kubectl` through `trans G14:k3s` is only for bounded diagnosis and evidence.
+- The image version is controlled by `config/platform-infra/sub2api.yaml`. Updating the image must be a YAML change plus `platform-infra sub2api apply --confirm` and follow-up runtime validation.
+- Sub2API should stay ClusterIP-only by default. Do not add Ingress, NodePort, LoadBalancer, or broad FRP exposure unless a YAML-controlled public exposure decision exists.
+- Sub2API currently has no resource limits by design. Do not add CPU or memory limits unless a later explicit decision changes that policy and stores the new policy in YAML.
+- Master server is a consumer/control host, not the runtime location. Do not deploy Sub2API, PostgreSQL, Redis, or heavy validation loops on master server.
+
+## Codex Pool Routing
+
+`config/platform-infra/sub2api-codex-pool.yaml` controls the Codex-facing OpenAI-compatible pool:
+
+- `pool.groupName` names the Sub2API group that represents the pool.
+- `pool.apiKeySecretName` and `pool.apiKeySecretKey` name the k3s Secret that stores the single consumer API key.
+- `profiles.entries` selects local Codex profile files from `~/.codex/` and maps them to Sub2API account names.
+- `publicExposure` controls the optional FRP bridge from master server to the G14 ClusterIP service.
+- `localCodex` controls how the master server's current `~/.codex` consumer files are backed up and rewritten.
+
+The request path is:
+
+1. A client sends an OpenAI-compatible request to the configured consumer base URL, normally master-local `http://127.0.0.1:<frp-port>/v1/...`, with the unified API key.
+2. master `frps` forwards the TCP connection to `platform-infra/sub2api-frpc` when `publicExposure.enabled` is true.
+3. `sub2api-frpc` forwards to `sub2api.platform-infra.svc.cluster.local:8080`.
+4. Sub2API validates the unified key and resolves its `group_id`.
+5. Accounts listed in `profiles.entries` are bound to the same group via `group_ids`, so Sub2API dispatches through that group using its own account selection semantics.
+
+After `codex-pool configure-local --confirm`, the default upstream profile must not recursively import the just-created Sub2API consumer endpoint as an upstream account. Keep the default source profile pointed at `config.toml.<backupSuffix>` and `auth.json.<backupSuffix>`; fallback to the current default files is only for first bootstrap before backups exist.
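
The backup-first rule above can be illustrated with a small selector. This is a sketch under stated assumptions: the function name, the injected `exists` predicate, and the decision to reject a half-present backup pair are all hypothetical; the text only specifies that backups win and that the current default files are a first-bootstrap fallback.

```typescript
interface SourceProfile {
  configPath: string;
  authPath: string;
  fromBackup: boolean;
}

// Picks the files the default upstream profile should import from.
// `exists` is injected so the logic can be exercised without touching disk.
function selectDefaultSourceProfile(
  codexDir: string,
  backupSuffix: string,
  exists: (path: string) => boolean,
): SourceProfile {
  const backupConfig = `${codexDir}/config.toml.${backupSuffix}`;
  const backupAuth = `${codexDir}/auth.json.${backupSuffix}`;
  const hasConfig = exists(backupConfig);
  const hasAuth = exists(backupAuth);

  if (hasConfig && hasAuth) {
    return { configPath: backupConfig, authPath: backupAuth, fromBackup: true };
  }
  if (hasConfig !== hasAuth) {
    // Assumption: a partial backup is treated as an error, because falling
    // back here could re-import the rewritten consumer endpoint.
    throw new Error("incomplete backup pair; refusing to fall back to current default files");
  }
  // First bootstrap: no backups exist yet, so the current defaults are used.
  return {
    configPath: `${codexDir}/config.toml`,
    authPath: `${codexDir}/auth.json`,
    fromBackup: false,
  };
}
```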
+
+## Availability And Probes
+
+Kubernetes readiness is not the same as pool availability:
+
+- The Sub2API app, PostgreSQL, and Redis manifests include container-level health probes. These only prove the pods and local dependencies are healthy enough for Kubernetes scheduling.
+- The FRP client deployment is currently a simple connector deployment and does not itself prove that master-local traffic reaches Sub2API.
+- No scheduled `CronJob`, `ServiceMonitor`, or `PodMonitor` currently proves the full unified Codex API path.
+- `platform-infra sub2api validate` and `platform-infra sub2api codex-pool validate` are on-demand checks. They are acceptable for deployment closeout, but they are not continuous monitoring.
+
+When an automatic availability probe is added, it should be YAML-controlled and cover these layers without printing secrets:
+
+1. G14 in-cluster `GET /v1/models` through `sub2api.platform-infra.svc.cluster.local:8080` with the unified key.
+2. master-local `GET /v1/models` through the configured FRP endpoint when public exposure is enabled.
+3. A tiny `POST /v1/responses` call through the same consumer URL for true OpenAI-compatible request validation.
+4. Optional per-upstream account probes if Sub2API exposes a safe account selection or admin-health mechanism; otherwise document that group-level success does not prove every upstream account is healthy.
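
The first three layers above can be expressed as a probe plan derived from configuration. The sketch below only builds the plan and never performs a request; `ProbeInputs`, `frpPort`, and `planProbes` are hypothetical names, plain `http` for the in-cluster URL is an assumption, and skipping the `POST /v1/responses` step when public exposure is disabled is a simplification the text does not prescribe.

```typescript
interface ProbeStep {
  layer: "in-cluster" | "master-local" | "responses";
  method: "GET" | "POST";
  url: string;
}

// Hypothetical inputs; in practice these would be read from the YAML config.
interface ProbeInputs {
  publicExposureEnabled: boolean; // mirrors `publicExposure.enabled`
  frpPort: number; // master-local FRP port (field name assumed)
}

// Service address taken from the text; the http scheme is an assumption.
const IN_CLUSTER_BASE = "http://sub2api.platform-infra.svc.cluster.local:8080";

// Builds the ordered list of probes. The unified key is deliberately not
// part of the plan, so logging the plan cannot leak a secret.
function planProbes(inputs: ProbeInputs): ProbeStep[] {
  const steps: ProbeStep[] = [
    { layer: "in-cluster", method: "GET", url: `${IN_CLUSTER_BASE}/v1/models` },
  ];
  if (inputs.publicExposureEnabled) {
    const consumerBase = `http://127.0.0.1:${inputs.frpPort}`;
    steps.push({ layer: "master-local", method: "GET", url: `${consumerBase}/v1/models` });
    steps.push({ layer: "responses", method: "POST", url: `${consumerBase}/v1/responses` });
  }
  return steps;
}
```

A runner would attach the unified key at request time and report only status codes and latencies per layer.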
+
+Until continuous probing exists, closeout comments must state that validation was on-demand and include the exact CLI/API entrypoints used.