diff --git a/AGENTS.md b/AGENTS.md
index f9907b1b..153c89b7 100644
--- a/AGENTS.md
+++ b/AGENTS.md
@@ -7,6 +7,11 @@ UniDesk is a distributed work platform with the main server as its unified entry point; this document
 - P0: UniDesk-owned configuration must always prefer YAML (`.yaml`/`.yml`), including the runtime surface, platform infrastructure, node/lane, deployment parameter, and tunable version configuration under `config/`; unless an external tool strictly requires JSON/TOML/ENV or a similar format, do not add new JSON as the source of truth for UniDesk-owned configuration.
 - P0: YAML configuration that code reads must explicitly validate the schema, field types, and required fields; silent fallback, lenient guessing, and hiding configuration in script constants are forbidden, and tunable items such as future versions, images, namespaces, and endpoints must enter the controlled CLI from YAML configuration.
 
+## P0 Highest Priority: G14 platform-infra Rules
+
+- P0: `platform-infra` is the platform infrastructure namespace that UniDesk operates on G14 k3s; for the long-term boundaries, routing, and probe conventions of Sub2API, the Codex pool, FRP exposure, the unified consumer API key, and later platform infrastructure migrations, see `docs/reference/platform-infra.md`.
+- P0: `devops-infra` serves only as the source from which existing control-plane infrastructure is gradually migrated and is no longer the default namespace for new platform services; new or migrated services must land in `platform-infra` first and be controlled through `config/platform-infra/*.yaml` and `bun scripts/cli.ts platform-infra ...`.
+
 ## P0 Highest Priority: CaseRun Service-Free and Single-Step Debugging Rules
 
 - P0: CaseRun, case registry artifact cleanup, trace semanticization, harness diagnosis, short-lived-connection CLI, and debugging of runners that can run directly on the local or target host are service-free workflows by default; as long as no change to cloud-api, web, gateway, GitOps, the k3s runtime, or another resident service is needed, they must be run and verified directly without services, and triggering CI/CD, a rollout, or a service release just to run CaseRun is forbidden.
@@ -262,6 +267,7 @@ UniDesk is a distributed work platform with the main server as its unified entry point; this document
 - `docs/reference/code-queue-supervision.md`: Code Queue central scheduling, concurrent queue splitting, in-flight monitoring, infrastructure defect triage, and acceptance closeout rules.
 - `docs/reference/hwlab.md`: HWLAB command-side fixed workspace, G14 primary runtime surface, D601 legacy/hardware bridge boundary, minimal device-agent/gateway bridge model, and controlled release boundary.
 - `docs/reference/g14.md`: G14 provider node, k3s control bridge, legacy DEV/PROD retirement boundary, current HWLAB runtime lane, device-agent manual experiment boundary, Code Queue/CI candidate targets, and node-local VPN proxy bootstrap boundary.
+- `docs/reference/platform-infra.md`: G14 `platform-infra` namespace, YAML-first shared service configuration, Sub2API/Codex pool, FRP exposure, and on-demand availability probe boundary.
 - `docs/reference/g14-observability-infra.md`: Prometheus Operator on G14 native k3s, `devops-infra` monitoring infrastructure, cross-namespace scrape declarations, and security boundary.
 - `docs/reference/gc.md`: UniDesk main server and provider disk GC, G14/HWLAB registry retention, safe-stop line, and long-term anti-bloat benefit rules.
 - `docs/reference/observability.md`: observability rules for service logs, task liveness, the general performance metrics API, and the performance dashboard.
diff --git a/docs/reference/g14-platform-db.md b/docs/reference/g14-platform-db.md
index f5c0873e..b42d5634 100644
--- a/docs/reference/g14-platform-db.md
+++ b/docs/reference/g14-platform-db.md
@@ -139,7 +139,7 @@ trans G14 script -- '/usr/local/sbin/g14-platform-db-backup'
 - `postgresql` systemd service is active.
 - `ss -ltnp` shows listeners only on `127.0.0.1:5432` and `10.42.0.1:5432`.
 - `/usr/local/sbin/g14-platform-db-health` can list the expected databases.
-- The `g14-platform-postgres` Service/Endpoints are visible in `hwlab-v03`.
+- The `g14-platform-postgres` Service is visible in `hwlab-v03`, and at least one bridge path is visible through Endpoints or EndpointSlice.
 - `hwlab-cloud-api` `/health/live` returns `status=ok`, `ready=true`, `db.connectionResult=connected`, and `runtime.connection.queryResult=durable_readiness_ready`.
 - `hwlab nodes control-plane status --node G14 --lane v03` shows Argo `Synced/Healthy`, and the runtime workload summary does not include the old self-hosted Postgres.
 
diff --git a/docs/reference/g14.md b/docs/reference/g14.md
index 00c64ec9..3f017807 100644
--- a/docs/reference/g14.md
+++ b/docs/reference/g14.md
@@ -79,7 +79,7 @@ The `devops-infra` git mirror/relay remains manual and CLI-controlled, not CronJ
 
 After a `v0.2` PipelineRun completes, treat runtime rollout and remote GitOps persistence as two separate checks. `hwlab g14 control-plane status --lane v02` is the runtime check: it must show the expected source commit, PipelineRun completed, Argo `Synced/Healthy`, public 19666/19667 probes passing, and Cloud Web asset probes such as `/app.js` readable. `hwlab g14 git-mirror status` is the persistence check: `cache.summary.pendingFlush` must be false and `cache.summary.githubInSync` true before declaring GitOps fully flushed back to GitHub. The PR monitor performs this flush automatically for its own merged PRs and records the result in the PR comment. Manual operators should run `bun scripts/cli.ts hwlab g14 git-mirror flush --confirm` and poll the returned job with `bun scripts/cli.ts job status --tail-bytes 12000` only when they used lower-level manual trigger/status paths or when the monitor reports a flush failure; do not replace this with raw `kubectl`, native `git push`, or a long SSH wait.
 
-If `gitops-promote` fails because the mirror write hook rejects a rendered GitOps path as outside the allowed lane outputs, treat it as `devops-infra` mirror control-plane drift until proven otherwise. The recovery path is `hwlab g14 git-mirror apply --confirm` to reinstall the current hook/ConfigMap, `hwlab g14 git-mirror sync --confirm --wait` to realign source and GitOps refs, then a targeted `control-plane cleanup-runs --pipeline-run --confirm` before retriggering the same lane. Do not patch the hook inside the pod, delete PipelineRuns with raw kubectl, or bypass `git-mirror flush`; closeout still requires the target PipelineRun status, Argo health, public probes, and `git-mirror status` with `pendingFlush=false`.
+If `gitops-promote` fails because the git mirror control plane drifted, refs are inconsistent, or publish/flush did not complete, recover through the controlled mirror path: `hwlab g14 git-mirror apply --confirm` to reinstall the current hook/ConfigMap, `hwlab g14 git-mirror sync --confirm --wait` to realign source and GitOps refs, then a targeted `control-plane cleanup-runs --pipeline-run --confirm` before retriggering the same lane. The old branch/path allowlist gate has been removed; do not restore it, patch the hook inside the pod, delete PipelineRuns with raw kubectl, or bypass `git-mirror flush`. Closeout still requires the target PipelineRun status, Argo health, public probes, and `git-mirror status` with `pendingFlush=false`.
 
 When closing an issue against a specific completed `v0.2` PipelineRun, use targeted status instead of the latest-head status if `origin/v0.2` has already advanced through a parallel task:
 
diff --git a/docs/reference/platform-infra.md b/docs/reference/platform-infra.md
new file mode 100644
index 00000000..34221aa0
--- /dev/null
+++ b/docs/reference/platform-infra.md
@@ -0,0 +1,57 @@
+# G14 Platform Infra
+
+`platform-infra` is the G14 k3s namespace for UniDesk-operated shared platform services. It is separate from HWLAB runtime lanes, AgentRun lanes, D601 user services, and legacy `devops-infra` control-plane helpers. New shared infra should land here first; old `devops-infra` resources migrate gradually only when a concrete owner and validation path exist.
+
+## Source Of Truth
+
+- UniDesk-owned platform configuration must be YAML-first. `config/platform-infra/*.yaml` is the durable source for images, versions, endpoints, FRP exposure, account profile selection, and local consumer configuration.
+- Runtime Secrets and local `~/.codex/config.toml*` / `auth.json*` files are inputs or generated local state, not committed truth. CLI output may show Secret paths, byte counts, fingerprints, and short previews only; it must not print complete API keys.
+- Code that reads platform YAML must validate object shape, field types, required fields, Kubernetes names, image strings, and ports before mutating G14 k3s or local consumer files.
+- Do not hide image versions, namespace names, endpoint URLs, FRP ports, or profile lists in Python/TOML/JSON helper constants when they are UniDesk-owned choices. External tools may still require their own TOML/JSON/env file formats at the edge.
+
+## Sub2API Deployment Boundary
+
+- Sub2API is a G14 platform service operated by UniDesk in namespace `platform-infra`. It is not a HWLAB lane workload, AgentRun workload, D601 service, or master server daemon.
+- The canonical deployment entrypoint is `bun scripts/cli.ts platform-infra sub2api plan|apply|status|validate|codex-pool`; raw `kubectl` through `trans G14:k3s` is only for bounded diagnosis and evidence.
+- The image version is controlled by `config/platform-infra/sub2api.yaml`. Updating the image must be a YAML change plus `platform-infra sub2api apply --confirm` and follow-up runtime validation.
+- Sub2API should stay ClusterIP-only by default. Do not add Ingress, NodePort, LoadBalancer, or broad FRP exposure unless a YAML-controlled public exposure decision exists.
+- Sub2API currently has no resource limits by design. Do not add CPU or memory limits unless a later explicit decision changes that policy and stores the new policy in YAML.
+- Master server is a consumer/control host, not the runtime location. Do not deploy Sub2API, PostgreSQL, Redis, or heavy validation loops on master server.
+
+## Codex Pool Routing
+
+`config/platform-infra/sub2api-codex-pool.yaml` controls the Codex-facing OpenAI-compatible pool:
+
+- `pool.groupName` names the Sub2API group that represents the pool.
+- `pool.apiKeySecretName` and `pool.apiKeySecretKey` name the k3s Secret that stores the single consumer API key.
+- `profiles.entries` selects local Codex profile files from `~/.codex/` and maps them to Sub2API account names.
+- `publicExposure` controls the optional FRP bridge from master server to the G14 ClusterIP service.
+- `localCodex` controls how the master server's current `~/.codex` consumer files are backed up and rewritten.
+
+The request path is:
+
+1. A client sends an OpenAI-compatible request to the configured consumer base URL, normally master-local `http://127.0.0.1:/v1/...`, with the unified API key.
+2. master `frps` forwards the TCP connection to `platform-infra/sub2api-frpc` when `publicExposure.enabled` is true.
+3. `sub2api-frpc` forwards to `sub2api.platform-infra.svc.cluster.local:8080`.
+4. Sub2API validates the unified key and resolves its `group_id`.
+5. Accounts listed in `profiles.entries` are bound to the same group via `group_ids`, so Sub2API dispatches through that group using its own account selection semantics.
+
+After `codex-pool configure-local --confirm`, the default upstream profile must not recursively import the just-created Sub2API consumer endpoint as an upstream account. Keep the default source profile pointed at `config.toml.` and `auth.json.`; fallback to the current default files is only for first bootstrap before backups exist.
+
+## Availability And Probes
+
+Kubernetes readiness is not the same as pool availability:
+
+- The Sub2API app, PostgreSQL, and Redis manifests include container-level health probes. These only prove the pods and local dependencies are healthy enough for Kubernetes scheduling.
+- The FRP client deployment is currently a simple connector deployment and does not itself prove that master-local traffic reaches Sub2API.
+- No scheduled `CronJob`, `ServiceMonitor`, or `PodMonitor` currently proves the full unified Codex API path.
+- `platform-infra sub2api validate` and `platform-infra sub2api codex-pool validate` are on-demand checks. They are acceptable for deployment closeout, but they are not continuous monitoring.
+
+When an automatic availability probe is added, it should be YAML-controlled and cover these layers without printing secrets:
+
+1. G14 in-cluster `GET /v1/models` through `sub2api.platform-infra.svc.cluster.local:8080` with the unified key.
+2. master-local `GET /v1/models` through the configured FRP endpoint when public exposure is enabled.
+3. A tiny `POST /v1/responses` call through the same consumer URL for true OpenAI-compatible request validation.
+4. Optional per-upstream account probes if Sub2API exposes a safe account selection or admin-health mechanism; otherwise document that group-level success does not prove every upstream account is healthy.
+
+Until continuous probing exists, closeout comments must state that validation was on-demand and include the exact CLI/API entrypoints used.
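
The probe layers and closeout rule that the new `docs/reference/platform-infra.md` describes can be sketched in TypeScript as follows. This is only an illustration and is not part of the diff above. `ProbeConfig`, `buildProbePlan`, `describeSecret`, and `formatCloseout` are hypothetical names, the config shape is not the real YAML schema, and the port in the usage line is invented. Gating layer 3 on public exposure is also an assumption, made because the text says layer 3 goes through "the same consumer URL" as layer 2.

```typescript
// Hypothetical config shape; the real probe YAML schema is not defined in the source.
interface ProbeConfig {
  clusterBaseUrl: string; // in-cluster Sub2API service base URL
  consumerBaseUrl: string; // master-local FRP endpoint; the port must come from YAML
  publicExposureEnabled: boolean;
}

interface ProbeStep {
  layer: string;
  method: "GET" | "POST";
  url: string;
}

// Build the ordered list of probe layers without performing any network call.
function buildProbePlan(cfg: ProbeConfig): ProbeStep[] {
  const steps: ProbeStep[] = [
    { layer: "in-cluster models", method: "GET", url: `${cfg.clusterBaseUrl}/v1/models` },
  ];
  if (cfg.publicExposureEnabled) {
    // Layers 2 and 3 both go through the consumer URL (assumption: both need FRP exposure).
    steps.push({ layer: "master-local models", method: "GET", url: `${cfg.consumerBaseUrl}/v1/models` });
    steps.push({ layer: "consumer responses", method: "POST", url: `${cfg.consumerBaseUrl}/v1/responses` });
  }
  return steps;
}

// Describe a secret by length and a short preview only; never return the full value.
function describeSecret(value: string): string {
  const preview = value.length > 8 ? `${value.slice(0, 4)}...` : "(hidden)";
  return `length=${value.length} preview=${preview}`;
}

// Closeout text that states validation was on-demand and lists the entrypoints used.
function formatCloseout(steps: ProbeStep[]): string {
  const entrypoints = steps.map((s) => `${s.method} ${s.url}`).join("; ");
  return `Validation was on-demand, not continuous. Entrypoints: ${entrypoints}`;
}

const plan = buildProbePlan({
  clusterBaseUrl: "http://sub2api.platform-infra.svc.cluster.local:8080",
  consumerBaseUrl: "http://127.0.0.1:18080", // hypothetical port for illustration only
  publicExposureEnabled: true,
});
console.log(formatCloseout(plan));
```

Keeping plan construction separate from execution lets the layer list be reviewed and logged without touching the unified key, which would only be attached when a request is actually sent.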