diff --git a/AGENTS.md b/AGENTS.md
index bb347763..0d752245 100644
--- a/AGENTS.md
+++ b/AGENTS.md
@@ -276,6 +276,7 @@ UniDesk is a distributed work platform with the main server as its unified entry point; this document
 - `docs/reference/code-queue-supervision.md`: command and supervision policy for AgentRun Queue and the legacy Code Queue, concurrency windows, polling cadence, terminal-state reads, blocker splitting, PR handoff, and acceptance close-out rules.
 - `docs/reference/hwlab.md`: HWLAB command-side fixed workspace, G14 primary runtime surface, D601 legacy/hardware bridge boundary, minimal device-agent/gateway bridge model, and controlled release boundary.
 - `docs/reference/g14.md`: G14 provider node, k3s control bridge, legacy DEV/PROD retirement boundary, current HWLAB runtime lane, device-agent manual experiment boundary, Code Queue/CI candidate targets, and node-local VPN proxy bootstrap boundary.
+- `docs/reference/pk01.md`: PK01 Tencent Cloud provider-gateway, pikanode/MET Docker workloads, SSH passthrough, disk GC, and the long-term pikanode temp retention boundary.
 - `docs/reference/platform-infra.md`: G14 `platform-infra` namespace, YAML-first shared service configuration, Sub2API/Codex pool, FRP exposure, and on-demand availability probe development boundary; for day-to-day Sub2API operations see `$unidesk-sub2api` (`.agents/skills/unidesk-sub2api/SKILL.md`).
 - `docs/reference/master-server-ops.md`: main server local Codex profile wrapper, ACX/GOCX/Moon Bridge routing boundary, default models, real-call acceptance, and MiniMax session recovery rules.
 - `docs/reference/g14-observability-infra.md`: Prometheus Operator on G14 native k3s, `devops-infra` monitoring infrastructure, cross-namespace scrape declarations, and security boundary.
diff --git a/docs/reference/gc.md b/docs/reference/gc.md
index 39a46661..3765c5f1 100644
--- a/docs/reference/gc.md
+++ b/docs/reference/gc.md
@@ -70,6 +70,12 @@ UniDesk's disk governance entry point is `bun scripts/cli.ts gc ...`. This entry point is used for
 
 Restricted core dump cleanup matches only regular files at `/root/unidesk/core.`. Before execution, re-verify the path allowlist and that the file is untracked by Git, is not a symlink, and has no active `fuser` references. Estimated savings must be computed from actually allocated blocks, and `apparentSizeBytes` may be disclosed separately; the apparent size of a sparse core dump must not be treated as reclaimable disk space.
 
+## Remote PK01 Policy
+
+PK01 is a Tencent Cloud Docker provider, not a G14 k3s/registry node; its long-term operating boundary is in `docs/reference/pk01.md`. `gc remote PK01 ...` can be used for generic low-risk candidates (allowlisted `/tmp`, Docker json-file logs, BuildKit cache, apt cache, restricted core dumps, and the journald plan), but the main source of pikanode growth is managed by the PK01 node-local retention mechanism, not by G14 registry/PVC retention.
+
+PK01 pikanode temp retention may only clean direct child directories of `/home/ubuntu/pikanode/html/temp` that are older than the retention window, and must protect `html/download/`, `html/upload/`, `files/`, certificates, Git state, direct log files, and recent temp workspaces. This policy is installed as a PK01 node-local systemd timer and logrotate configuration; when troubleshooting manually, check `systemctl status unidesk-pk01-pikanode-temp-gc.timer` and `/var/log/unidesk-pk01/pikanode-temp-gc.log` first. If the PK01 high-water mark still cannot be brought down through temp retention and generic low-risk GC, stop and escalate to a decision on pikanode download artifact retention, Docker image retention, or capacity expansion; `download/`, `files/`, and Docker overlay must not be deleted as if they were ordinary temporary directories.
+
 ## HWLAB Registry Retention
 
 G14 HWLAB registry cleanup must explicitly use `--include-hwlab-registry`; the default `gc remote G14 plan` does not enter the registry. The policy must be conservative: it must not keep only latest, and it must not delete only tag links and then wrongly conclude that space has been freed.
diff --git a/docs/reference/pk01.md b/docs/reference/pk01.md
new file mode 100644
index 00000000..9d728bbc
--- /dev/null
+++ b/docs/reference/pk01.md
@@ -0,0 +1,131 @@
+# PK01 Provider Operations Reference
+
+PK01 is a Tencent Cloud compute provider attached to UniDesk through `provider-gateway` with Provider ID `PK01`. This reference is the long-term operating boundary for PK01 host access, provider-gateway bootstrap state, pikanode retention, and disk GC. General provider-gateway rules remain authoritative in `docs/reference/provider-gateway.md`; general GC safety rules remain authoritative in `docs/reference/gc.md`.
+
+## Operating Entry Points
+
+Use UniDesk SSH passthrough for PK01 host operations:
+
+```bash
+trans PK01 argv hostname
+trans PK01 script <<'SCRIPT'
+df -h /
+docker ps --format 'table {{.Names}}\t{{.Image}}\t{{.Status}}'
+SCRIPT
+```
+
+Before closing an operation, verify both the provider channel and host workload state:
+
+```bash
+bun scripts/cli.ts debug health
+trans PK01 argv bash -lc 'docker inspect --format "name={{.Name}} restart={{.HostConfig.RestartPolicy.Name}} pid={{.HostConfig.PidMode}} state={{.State.Status}} image={{.Config.Image}}" unidesk-provider-gateway-pk01'
+trans PK01 argv bash -lc 'docker ps --format "table {{.Names}}\t{{.Image}}\t{{.Status}}"'
+```
+
+PK01 has no k3s control plane. `trans PK01:k3s ...` is not an operating truth. If a future PK01 k3s lane is introduced, it must get a separate runtime-lane reference and must not reuse the current pikanode host-data policy as a Kubernetes retention policy.
+
+## Provider Gateway Bootstrap State
+
+PK01 currently uses a direct Docker provider-gateway deployment rather than a full UniDesk source checkout. The node-local runtime bundle is:
+
+| Item | Path / value | Boundary |
+|---|---|---|
+| Provider ID | `PK01` | Must stay unique in the UniDesk node registry. |
+| Container | `unidesk-provider-gateway-pk01` | Must be `restart=always`, `pid=host`, and `running`. |
+| Runtime bundle | `/home/ubuntu/unidesk-provider-pk01` | Minimal workspace mounted read-only into the gateway container. |
+| Env file | `/home/ubuntu/.unidesk/state/provider-pk01/provider.env` | Contains provider token and must not be printed, copied into docs, or committed. |
+| Host SSH key | `/home/ubuntu/.unidesk/host-ssh-pk01/id_ed25519` | Mounted read-only at `/run/host-ssh`; public key is authorized for `ubuntu`. |
+| Logs | `/home/ubuntu/.unidesk/logs/provider-pk01` | Node-local runtime logs, not a Git source of truth. |
+| Egress proxy | `127.0.0.1:18789` | Loopback only; never expose as a public endpoint. |
+
+Long-term provider-gateway upgrades should converge to the standard `provider.upgrade mode=schedule` flow described in `docs/reference/provider-gateway.md`. If PK01 is still on the direct Docker bootstrap path, do not rebuild the gateway synchronously through the gateway's own `trans PK01` session. Use a detached node-local job or first move PK01 to the standard attach/upgrade bundle.
+
+The minimal PK01 provider-gateway health contract is:
+
+- `debug health` shows `providerId=PK01` as online.
+- labels include `providerGatewayVersion`, `providerGatewayRuntimeGuardOk=true`, `providerGatewaySshDataTransport=tcp-pool`, and a nonzero ready SSH data pool.
+- `trans PK01 argv hostname` reaches the Tencent Cloud host and returns the host name.
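
The container boundary from the bootstrap table (`restart=always`, `pid=host`, `running`) could be checked mechanically against the single line that the `docker inspect --format` command in Operating Entry Points prints. The following is a minimal sketch, not part of this change: the function name `check_gateway_state` is hypothetical, and it assumes the inspect output arrives as one space-separated line of `key=value` fields.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the UniDesk CLI): validate one line in the
# "name=... restart=... pid=... state=... image=..." format used by the
# docker inspect command in this reference.
# Returns 0 only when restart=always, pid=host and state=running are all present.
check_gateway_state() {
  local line="$1" field missing=0
  for field in restart=always pid=host state=running; do
    # Pad with spaces so that only whole fields match.
    case " $line " in
      *" $field "*) ;;
      *) echo "missing: $field" >&2; missing=1 ;;
    esac
  done
  return "$missing"
}
```

On a real check, the argument would be the captured output of the `docker inspect` passthrough command; this assumes `trans` relays that line unchanged.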
+
+## Host Workloads
+
+PK01 currently hosts existing Docker workloads:
+
+| Container | Role | Protection boundary |
+|---|---|---|
+| `pikanode` | Public PikaPython/PikaNode service rooted at `/home/ubuntu/pikanode` | Do not delete source, `files/`, `html/download/`, `html/upload/`, certificates, or Git state without a service-owner retention decision. |
+| `met_server` | Existing MET service | Treat as protected runtime unless a separate owner-approved retention plan exists. |
+| `unidesk-provider-gateway-pk01` | UniDesk maintenance bridge | Must remain running; do not stop it as part of generic disk GC. |
+
+`pikanode` mounts `/home/ubuntu/pikanode` read-write into the container. Static/generated download artifacts under `html/download/` and repository data under `files/` may be user-visible or needed by the service. They are not generic GC candidates.
+
+## Disk GC Policy
+
+PK01 follows the same safe-stop principle as G14: first produce a bounded attribution, then clean only classified candidates, and stop when remaining pressure is in protected runtime data.
+
+Default sequence for a high-water incident:
+
+1. Run generic remote GC plan and, if useful, confirmed run:
+   ```bash
+   bun scripts/cli.ts gc remote PK01 plan --target-use-percent 60 --limit 100 --full
+   bun scripts/cli.ts gc remote PK01 run --confirm --target-use-percent 60 --limit 100 --full
+   ```
+2. Inspect PK01-specific host data with short passthrough commands; avoid full-root `du` in one `trans` call because `trans` has a 60 second hard timeout.
+3. For pikanode growth, clean only `html/temp` direct child directories that are older than the configured node-local retention window. Preserve direct files such as `stdout.log`, `update.log`, `accesstoken.json`, `pullrequest.json`, and any recent temp workspaces.
+4. Re-check `df -h /`, provider health, Docker container state, and a pikanode local HTTPS probe.
+5. If the target still cannot be reached without touching `html/download/`, `files/`, Docker images, or other protected runtime data, stop and make a retention/capacity decision instead of widening deletion scope.
+
+PK01 pikanode temp directories are safe to remove only under this narrow definition:
+
+- path is a direct child directory of `/home/ubuntu/pikanode/html/temp`;
+- path is not a symlink;
+- parent is exactly `/home/ubuntu/pikanode/html/temp`;
+- mtime is older than the configured retention window;
+- deletion uses `rm -rf --one-file-system` and never follows paths outside that root.
+
+Never use `rm -rf /home/ubuntu/pikanode/html/temp/*` as an unbounded shell expansion. It risks deleting current generation workspaces and direct state/log files.
+
+## Long-Term Retention Mechanisms
+
+PK01 has node-local retention controls installed so that pikanode temp output and logs do not grow without bound:
+
+| Mechanism | Node-local path | Purpose |
+|---|---|---|
+| pikanode temp timer | `/etc/systemd/system/unidesk-pk01-pikanode-temp-gc.timer` | Runs pikanode temp retention on a daily timer. |
+| pikanode temp service | `/etc/systemd/system/unidesk-pk01-pikanode-temp-gc.service` | Executes `/usr/local/sbin/unidesk-pk01-pikanode-temp-gc` as a one-shot cleanup. |
+| pikanode temp script | `/usr/local/sbin/unidesk-pk01-pikanode-temp-gc` | Deletes only old direct temp directories under the protected root. |
+| retention log | `/var/log/unidesk-pk01/pikanode-temp-gc.log` | Bounded operational evidence for the timer. |
+| pikanode logrotate | `/etc/logrotate.d/unidesk-pk01-pikanode` | Rotates pikanode temp/runtime logs and the retention log. |
+| journald cap | `/etc/systemd/journald.conf.d/99-unidesk-pk01.conf` | Caps systemd journal growth on PK01. |
+
+Operational checks:
+
+```bash
+trans PK01 argv bash -lc 'systemctl status unidesk-pk01-pikanode-temp-gc.timer --no-pager'
+trans PK01 argv bash -lc 'sudo systemctl start unidesk-pk01-pikanode-temp-gc.service && tail -n 40 /var/log/unidesk-pk01/pikanode-temp-gc.log'
+trans PK01 argv bash -lc 'sudo logrotate -d /etc/logrotate.d/unidesk-pk01-pikanode'
+```
+
+The timer and logrotate configuration are node-local operational state. If a future UniDesk CLI subcommand manages PK01 retention centrally, it must first render a dry-run plan, show the same protected paths, and then install/update these node-local files through a confirmed operation.
+
+## Space Attribution Baseline
+
+PK01 space attribution should use short, bounded commands. Recommended probes:
+
+```bash
+trans PK01 argv bash -lc 'df -h / && df -i /'
+trans PK01 argv bash -lc 'sudo timeout 20 du -xhd1 /var /home/ubuntu/pikanode /home/ubuntu/.vscode-server /var/lib/docker /var/log 2>/dev/null | sort -h | tail -80'
+trans PK01 argv bash -lc 'docker system df -v | sed -n "1,220p"'
+trans PK01 argv bash -lc 'sudo find /home/ubuntu/pikanode/html/temp -xdev -mindepth 1 -maxdepth 1 -printf "%TY-%Tm-%Td %TH:%TM %p\n" | sort | tail -40'
+```
+
+Interpretation guide:
+
+| Path | Meaning | Default action |
+|---|---|---|
+| `/home/ubuntu/pikanode/html/temp` | Generated pikanode build workspaces | Managed by PK01 temp retention. |
+| `/home/ubuntu/pikanode/html/download` | Generated ZIP downloads | Protected unless a separate download retention policy is approved. |
+| `/home/ubuntu/pikanode/files` | pikanode repository/service data | Protected. |
+| `/home/ubuntu/.vscode-server` | VS Code remote server, extensions, and cache | Do not delete installed servers/extensions by default; cached VSIX cleanup needs an explicit policy. |
+| `/var/lib/docker` | Docker overlay/image/container state for PK01 workloads | Do not prune generically; inspect running containers first. |
+| `/var/log/journal` | systemd journal | Managed by journald cap; use sudo when vacuuming manually. |
+
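
The narrow definition of removable temp directories in Disk GC Policy can be illustrated with a dry-run listing. This is a sketch under stated assumptions, not the installed `/usr/local/sbin/unidesk-pk01-pikanode-temp-gc` script, whose contents this reference does not show. The function name `list_temp_gc_candidates` and the retention window argument are hypothetical, and the sketch only prints candidates; it never deletes anything.

```shell
#!/usr/bin/env bash
# Hypothetical dry-run helper: print direct child directories of ROOT whose
# mtime is older than DAYS. Files and symlinks directly under ROOT are never
# listed, which keeps direct state/log files out of the candidate set.
list_temp_gc_candidates() {
  local root="$1" days="$2"
  # Refuse a missing root or a symlinked root instead of guessing.
  if [ ! -d "$root" ] || [ -L "$root" ]; then
    echo "refusing: $root is not a real directory" >&2
    return 1
  fi
  # -mindepth/-maxdepth 1: direct children only.
  # -type d: real directories only (find does not follow symlinks by default).
  # -xdev: stay on the root's filesystem.
  # -mtime +N: age in whole 24-hour periods is greater than N.
  find "$root" -xdev -mindepth 1 -maxdepth 1 -type d -mtime +"$days" -print | sort
}
```

A call such as `list_temp_gc_candidates /home/ubuntu/pikanode/html/temp 7` would print the candidate set for an assumed 7-day window; the actual window is whatever the node-local configuration defines.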