docs: add yaml-first distributed ops spec

2026-06-13 03:59:07 +00:00
parent c5c1ff7f58
commit c29ca99a50
2 changed files with 130 additions and 0 deletions
@@ -11,6 +11,7 @@ UniDesk is a distributed work platform with the main server as its unified entry point; this document

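 - P0: UniDesk-owned configuration must always prefer YAML (`.yaml`/`.yml`), including the runtime plane, platform infrastructure, node/lane, deployment parameter and tunable version configuration under `config/`; unless an external tool strictly requires JSON/TOML/ENV or another format, adding new JSON as UniDesk-owned configuration truth is prohibited.
 - P0: YAML configuration that code reads must be explicitly validated for format, field types and required fields; configuration validation only guarantees that the configuration "can be read and rendered correctly", and must not turn business policy, scheduling policy or value choices into hard-coded values, hard schema ranges, contract tests or hidden defaults. Tunable items such as versions, images, namespaces, endpoints, capacities, cooldown times and backoff windows must enter the controlled CLI from YAML configuration, and the concrete values are whatever the YAML declares.
+- P0: For the YAML-first heterogeneous distributed ops architecture, the existing-YAML-ownership-first rule, the ban on hard-coded node/service values, common ops layer extraction and the thin domain CLI rules, see `docs/reference/yaml-first-ops.md`.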

 ## P0 Highest Priority: G14 platform-infra Rules

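@@ -278,6 +279,7 @@ UniDesk is a distributed work platform with the main server as its unified entry point; this document
 - `docs/reference/hwlab.md`: HWLAB command-side fixed workspace, G14 main runtime plane, D601 legacy/hardware bridging boundary, minimal device-agent/gateway bridging model and controlled release boundary.
 - `docs/reference/g14.md`: G14 provider node, k3s control bridge, legacy DEV/PROD retirement boundary, current HWLAB runtime lane, device-agent manual experiment boundary, Code Queue/CI candidate targets and node-local VPN proxy bootstrap boundary.
 - `docs/reference/pk01.md`: PK01 Tencent Cloud provider-gateway, pikanode/MET Docker workload, SSH passthrough, disk GC and pikanode temp long-term retention boundary.
+- `docs/reference/yaml-first-ops.md`: YAML-first heterogeneous distributed ops architecture, the existing-YAML-ownership-first rule, common ops layer extraction, the ban on hard-coded node/service values and the thin domain CLI rules.
 - `docs/reference/platform-infra.md`: G14 `platform-infra` namespace, YAML-first shared service configuration, Sub2API/Codex pool, FRP exposure and on-demand availability probe development boundary; for day-to-day Sub2API operations see `$unidesk-sub2api` (`.agents/skills/unidesk-sub2api/SKILL.md`).
 - `docs/reference/master-server-ops.md`: main server local Codex profile wrapper, ACX/GOCX/Moon Bridge routing boundary, default model, real-call acceptance and MiniMax session recovery rules.
 - `docs/reference/g14-observability-infra.md`: Prometheus Operator on G14 native k3s, `devops-infra` monitoring infrastructure, cross-namespace scrape declarations and security boundary.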
@@ -0,0 +1,128 @@
+# YAML-First Heterogeneous Distributed Ops
+
+This document defines the UniDesk architecture for YAML-first heterogeneous distributed operations. It is the long-term reference for turning node, lane, service, Secret, exposure, database, rollout and probe decisions into declared configuration plus reusable CLI execution. Concrete values belong in YAML under `config/`; this document defines ownership and architecture only.
+
+## Scope
+
+YAML-first ops applies to UniDesk-owned distributed runtime management across heterogeneous targets: host services, k3s namespaces, public exposure bridges, external databases, app runtime Secrets, CI/CD control-plane bootstrap, workflow services and managed service probes.
+
+It is not a new global orchestrator. Existing domain ownership stays intact:
+
+- Platform shared services keep their truth in the existing platform infra YAML family.
+- Platform database state keeps its truth in platform database YAML.
+- Runtime lane services keep their truth in their existing node/lane YAML.
+- Agent execution infrastructure keeps its truth in its own infrastructure YAML.
+
+Add a new top-level YAML registry only after multiple existing domains share the same lifecycle, owner and command model, and after the common blocks have already proven reusable. The default path is to extend the owning domain YAML and shared ops helpers, not to create another parallel control plane.
+
+## Source Of Truth
+
+UniDesk-owned distributed ops choices must enter through YAML:
+
+- target route and execution plane
+- namespace, workload, service, Secret and ConfigMap identifiers
+- image references, versions and pull policy
+- public URL, DNS expectation, FRP/Caddy edge settings and probe endpoints
+- database host, role/database declarations, Secret exports and connection mode
+- Secret source references, key mappings, transforms and rollout triggers
+- readiness, validation and smoke probe shape
+- retention, cadence, timeout and policy values when they are UniDesk-owned choices
+
+Code may validate that YAML is present, typed, syntactically valid and renderable. Code must not become the hidden source for node names, service names, namespaces, ports, image tags, Secret names, URLs, account lists, capacities, cooldowns or retry windows. These values must be read from YAML or from explicit external tool/runtime APIs.
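As an illustration of "validate shape, not policy", the following minimal Python sketch checks only that fields are present and correctly typed. The field names (`route`, `namespace`, `image`, `timeoutSeconds`) and the `validate_shape` helper are hypothetical and are not taken from the UniDesk codebase; the input stands for a block already parsed from YAML.

```python
from typing import Any

# Hypothetical required fields and types for an illustrative target block.
# Only presence and type are listed here; no value, range or default appears.
REQUIRED_FIELDS: dict[str, type] = {
    "route": str,
    "namespace": str,
    "image": str,
    "timeoutSeconds": int,
}


def validate_shape(block: dict[str, Any]) -> list[str]:
    """Return shape errors only: missing or mistyped fields, never policy."""
    errors: list[str] = []
    for name, expected in REQUIRED_FIELDS.items():
        if name not in block:
            errors.append(f"missing required field: {name}")
        # bool is a subclass of int in Python, so reject it explicitly.
        elif isinstance(block[name], bool) or not isinstance(block[name], expected):
            errors.append(f"field {name} must be {expected.__name__}")
    return errors
```

Note that the sketch reports a missing `timeoutSeconds` instead of substituting a default, so the concrete number can only come from YAML.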
+
+External formats such as JSON, TOML, env files, Kubernetes YAML, Caddyfile, systemd units or app-specific config files may still be generated or consumed at the edge when the external tool requires them. They are inputs or rendered artifacts, not UniDesk desired-state truth.
+
+## Architecture Layers
+
+YAML-first ops uses five layers.
+
+1. Domain YAML
+
+The owning `config/**/*.yaml` file declares the desired runtime state and all tunable values. A domain YAML may contain reusable blocks such as `publicExposure`, `externalDatabase`, `runtimeSecrets`, `rollout`, `probes`, `staging`, `retention` or `controlPlane`, but the exact block is owned by the domain until it is promoted into a shared helper.
+
+2. Domain Parser
+
+Each domain has a parser that resolves a selected target and validates only shape, field type, required fields and renderability. It may validate generic syntax such as Kubernetes resource names, route token format, URL shape, image reference shape, relative source references and key names. It must not hard-code current policy values or silently fill business defaults that should live in YAML.
+
+3. Common Ops Library
+
+Shared behavior belongs in reusable modules under `scripts/src/`, not in service-specific command files. The existing reusable seeds are the platform infra public-service helpers and the platform infra ops library. New common helpers should be extracted when the same operation appears in more than one domain, especially for:
+
+- route execution and bounded capture
+- YAML parsing primitives
+- redacted output, fingerprints and compact evidence
+- Secret source loading and source path redaction
+- Kubernetes Secret apply from local source material
+- rollout restart/status from YAML-declared workload refs
+- public exposure rendering through FRP/Caddy
+- manifest staging, dry-run and server-side apply wrappers
+- probe execution and response summarization
+- async job submission and short polling for long operations
+
+4. Thin Domain CLI
+
+The domain CLI resolves the target from YAML, calls shared helpers and prints structured JSON. It should not contain large inline shell bodies, duplicated secret-sync scripts, hard-coded service names or app-specific operational workflows. A domain CLI may keep a stable command namespace for compatibility and discoverability, but the implementation should delegate to common helpers.
+
+5. Runtime Executor
+
+Runtime mutation goes through UniDesk CLI and `trans` route execution. Direct `kubectl`, raw SSH, hand-written Caddy edits, direct GitHub API calls or ad hoc shell scripts may be diagnostic or emergency recovery tools only. Repeated operational writes must be promoted into a controlled CLI command that reads YAML and reports redacted structured output.
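The layering above can be sketched roughly as follows. This is a hedged illustration, not the actual UniDesk implementation: `run_domain_command`, `rollout_status`, the `targets` key and the field names are all hypothetical, and the config dict stands for already parsed and validated domain YAML.

```python
import json
from typing import Any, Callable

Helper = Callable[[dict[str, Any]], dict[str, Any]]


def rollout_status(target: dict[str, Any]) -> dict[str, Any]:
    """Stand-in for a shared helper; a real one would go through the route executor."""
    return {
        "action": "rollout-status",
        "namespace": target["namespace"],
        "workload": target["workload"],
    }


def run_domain_command(config: dict[str, Any], target_name: str, helper: Helper) -> str:
    """Thin adapter: resolve the target from config, delegate, emit structured JSON."""
    targets = config.get("targets", {})
    if target_name not in targets:
        return json.dumps({"ok": False, "error": f"unknown target: {target_name}"})
    result = helper(targets[target_name])
    return json.dumps({"ok": True, "target": target_name, "result": result}, sort_keys=True)
```

The adapter holds no service names or shell bodies of its own; everything specific arrives through the config or the injected helper.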
+
+## Common Block Rules
+
+Reusable blocks must describe operations in data, not in service-specific code branches.
+
+### Target Blocks
+
+A target block should declare the route, execution plane, namespace and any workload refs required by the operation. Code must not infer these from a node id, lane id or service id by concatenating strings unless that concatenation rule itself is explicitly declared and stable for the domain.
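A minimal sketch of resolving a target strictly from declared fields, assuming hypothetical key names (`route`, `executionPlane`, `namespace`, `workloadRefs`) that may differ from the real YAML schema:

```python
from dataclasses import dataclass
from typing import Any

# Hypothetical field names for an illustrative target block.
TARGET_KEYS = ("route", "executionPlane", "namespace", "workloadRefs")


@dataclass(frozen=True)
class Target:
    route: str
    execution_plane: str
    namespace: str
    workload_refs: tuple[str, ...]


def resolve_target(block: dict[str, Any]) -> Target:
    """Build a Target from declared fields; raise instead of guessing a missing one."""
    missing = [key for key in TARGET_KEYS if key not in block]
    if missing:
        raise ValueError(f"target block missing: {', '.join(missing)}")
    return Target(
        route=block["route"],
        execution_plane=block["executionPlane"],
        namespace=block["namespace"],
        workload_refs=tuple(block["workloadRefs"]),
    )
```

The function never derives a namespace or workload name from a node or lane id; an undeclared field is an error the operator fixes in YAML.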
+
+### Secret Blocks
+
+A runtime Secret block should declare source reference, source key, target Secret, target key, optional transform and rollout trigger. Secret values must stay in git-ignored owner-only source files or external Secret stores. CLI output may show sourceRef, target object names, key names, presence, byte counts, fingerprints, mutation and next commands; it must not print secret values, full tokens, decoded base64, passwords or complete connection strings.
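One way to produce the redacted evidence described above is sketched below. The `summarize_secret` helper and its output keys are hypothetical, and the truncated SHA-256 fingerprint is an assumption for illustration, not a documented UniDesk format.

```python
import hashlib
from typing import Any, Optional


def summarize_secret(
    source_ref: str,
    target_secret: str,
    target_key: str,
    value: Optional[bytes],
) -> dict[str, Any]:
    """Describe a Secret entry by names, presence, size and fingerprint only."""
    summary: dict[str, Any] = {
        "sourceRef": source_ref,
        "targetSecret": target_secret,
        "targetKey": target_key,
        "present": value is not None,
    }
    if value is not None:
        summary["bytes"] = len(value)
        # A short digest prefix lets two runs be compared without exposing the value.
        summary["fingerprint"] = hashlib.sha256(value).hexdigest()[:12]
    return summary
```

The secret bytes are consumed only by `len` and `hashlib.sha256`, so the returned dict is safe to print as structured output.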
+
+App-specific transforms are allowed only as isolated named transform functions. The transform name is data in YAML; the implementation belongs in a shared transform registry or a small domain adapter, not in a one-off reset command.
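A named transform registry could look roughly like this. The registry, the decorator and the two transform names (`identity`, `strip-whitespace`) are hypothetical examples, not transforms known to exist in UniDesk.

```python
from typing import Callable, Optional

Transform = Callable[[bytes], bytes]

# Shared registry: YAML carries only the transform name, code carries the logic.
TRANSFORMS: dict[str, Transform] = {}


def register(name: str) -> Callable[[Transform], Transform]:
    """Decorator that adds a transform to the registry under a stable name."""
    def wrap(fn: Transform) -> Transform:
        TRANSFORMS[name] = fn
        return fn
    return wrap


@register("identity")
def identity(value: bytes) -> bytes:
    return value


@register("strip-whitespace")
def strip_whitespace(value: bytes) -> bytes:
    return value.strip()


def apply_transform(name: Optional[str], value: bytes) -> bytes:
    """Apply the YAML-named transform; an undeclared transform leaves the value as-is."""
    if name is None:
        return value
    if name not in TRANSFORMS:
        raise KeyError(f"unknown transform: {name}")
    return TRANSFORMS[name](value)
```

An unknown name fails loudly, which keeps a typo in YAML from silently syncing an untransformed value.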
+
+### Exposure Blocks
+
+Public exposure must be declared as an edge topology, including DNS expectation, public base URL, bridge settings, edge host route and target service. The existing FRP/Caddy path is a reusable public-service primitive. New public exposure code should extend that primitive instead of adding per-service Caddy or FRP scripts.
+
+### Database Blocks
+
+External database consumers must reference the YAML-owned platform database source and exported Secret shape. A consumer should not deploy a new database, copy connection strings by hand, or derive credentials from live runtime objects unless the owning database YAML declares that export.
+
+### Probe Blocks
+
+Probes are validation data, not hidden policy. YAML should declare what endpoint or runtime object proves the operation for that service. CLI code may execute the probe, bound output and classify failure, but should not hard-code current URLs, credentials, namespaces or service paths.
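The split between probe data and probe execution might be sketched as follows. The probe keys (`name`, `url`, `expectStatus`), the `run_probe` helper and the injected `fetch` callable are assumptions made for this example; a real executor would presumably run the request through the route layer.

```python
from typing import Any, Callable

# fetch takes a URL and returns (status_code, body); it is injected so the
# sketch stays free of network access and hard-coded endpoints.
Fetch = Callable[[str], tuple[int, str]]


def run_probe(probe: dict[str, Any], fetch: Fetch, *, max_chars: int) -> dict[str, Any]:
    """Execute one declared probe and return a bounded, classified summary."""
    status, body = fetch(probe["url"])
    ok = status == probe["expectStatus"]
    return {
        "name": probe["name"],
        "ok": ok,
        "status": status,
        "classification": "pass" if ok else "unexpected-status",
        # Bound the captured output so large responses do not flood CLI evidence.
        "bodyPreview": body[:max_chars],
    }
```

The URL and expected status come from the probe block, and the output bound is passed in by the caller, so the helper itself carries no service-specific values.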
+
+## Refactoring Rule
+
+When adding YAML-first ops to an existing domain, follow this order:
+
+1. Inventory the existing YAML, CLI commands and helper modules.
+2. Choose the owning domain YAML; do not start with a new global registry.
+3. Add or refine a reusable block in that YAML with all concrete values declared there.
+4. Extend the domain parser with shape/type/renderability validation only.
+5. Extract common execution into shared helper modules before adding domain-specific code.
+6. Keep the domain CLI as a thin adapter over the common helper.
+7. Validate with the narrowest syntax check that covers the change, plus whichever command-shape or original-entry runtime check the change requires.
+
+Large domain command files must be split by responsibility before receiving more operational logic. Typical split boundaries are target resolution, manifest rendering, Secret sync, public exposure, database bridge, rollout, probes, cleanup and status summarization.
+
+## Anti-Patterns
+
+Avoid these patterns:
+
+- creating a per-service reset script when a YAML-declared Secret sync plus rollout block is enough
+- adding a second control plane for a service that already has an owning YAML and CLI namespace
+- hard-coding node ids, service ids, namespaces, ports, URLs, Secret names or workload names in code
+- deriving live state by string conventions when YAML can declare the object directly
+- keeping repeated `kubectl apply`, Caddy edits, FRP edits or rollout restarts as runbook shell snippets
+- printing secret values, complete env files, full `DATABASE_URL` values or reusable API keys
+- writing long-term docs that duplicate current YAML values as prose
+- using contract tests or hidden guards to freeze policy values that should remain YAML-controlled
+- preserving legacy command branches after the latest YAML-first path supersedes them
+
+## Documentation Boundary
+
+Long-term references should point to this architecture for common YAML-first ops rules, then document only domain-specific ownership and entrypoints. They should not repeat common Secret, exposure, target, redaction or no-hardcoding rules unless a domain adds a stricter constraint.
+
+When a recurring operation becomes stable, update the owning reference document and the relevant skill with the domain entrypoint and decision boundary. Do not document one-off manual recovery as the standard path; manual repair remains recovery evidence until the YAML and CLI path exists.