diff --git a/.agents/skills/unidesk-gc/SKILL.md b/.agents/skills/unidesk-gc/SKILL.md
new file mode 100644
index 00000000..845920ea
--- /dev/null
+++ b/.agents/skills/unidesk-gc/SKILL.md
@@ -0,0 +1,146 @@
+---
+name: unidesk-gc
+description: UniDesk disk GC and host pressure relief workflow. Use when Codex needs to diagnose or reduce UniDesk host/root filesystem usage, run `bun scripts/cli.ts gc ...`, handle `/tmp/unidesk-cli-output` growth, clean merged UniDesk worktrees, prune controlled BuildKit/tool caches, tune journald caps, investigate Web observe/Chrome growth, or decide safe-stop boundaries for local host or `gc remote` operations. Trigger on gc, disk cleanup, disk full, root filesystem high water, host disk pressure, worktree cleanup, BuildKit cache cleanup, Web observe artifact growth, Chrome memory pressure, or UniDesk GC retention tasks.
+---
+
+# UniDesk GC
+
+Use this skill for UniDesk disk pressure work. Prefer the controlled UniDesk CLI and stop at protected boundaries instead of expanding into ad hoc `rm -rf`, Docker prune, database cleanup, raw Kubernetes deletion, or runtime state deletion.
+
+Long-term policy lives in `docs/reference/gc.md`. Read that reference before remote GC, k3s/PVC attribution, JD01 Web observe/Chrome growth, G14 registry retention, CI workspace retention, or any safe-stop decision.
+
+## Local Host Workflow
+
+Start with read-only attribution:
+
+```bash
+df -h /
+df -BG /
+bun scripts/cli.ts gc plan --target-use-percent 69 --limit 50
+```
+
+If the default plan has a shortfall, use explicit opt-in candidates:
+
+```bash
+bun scripts/cli.ts gc plan --target-use-percent 69 --limit 2000 \
+  --include-tool-caches \
+  --include-stale-tmp \
+  --include-vscode-stale-servers \
+  --include-vscode-stale-extensions \
+  --include-vscode-cached-vsix \
+  --include-baidu-staging \
+  --include-state-artifacts \
+  --include-state-stale-scratch \
+  --include-codex-sessions \
+  --include-merged-worktrees \
+  --include-vpn-diagnostic-logs
+```
+
+Run the same candidate surface only after reviewing the plan:
+
+```bash
+bun scripts/cli.ts gc run --confirm --target-use-percent 69 --limit 2000 \
+  --include-tool-caches \
+  --include-stale-tmp \
+  --include-vscode-stale-servers \
+  --include-vscode-stale-extensions \
+  --include-vscode-cached-vsix \
+  --include-baidu-staging \
+  --include-state-artifacts \
+  --include-state-stale-scratch \
+  --include-codex-sessions \
+  --include-merged-worktrees \
+  --include-vpn-diagnostic-logs
+```
+
+When worktree candidates are protected by merge/cherry timeout, rerun only the worktree surface with higher temporary budgets:
+
+```bash
+bun scripts/cli.ts gc plan --target-use-percent 69 --limit 2000 \
+  --include-merged-worktrees \
+  --worktree-scan-budget-ms 120000 \
+  --worktree-cherry-check-timeout-ms 10000 \
+  --no-file-logs --no-docker-logs --no-journal --no-build-cache --no-tmp --no-db-summary
+```
+
+Use the matching `run --confirm` only for candidates still shown by that plan. Dirty, recent, active, unmerged, and timeout-protected worktrees must remain protected.
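The plan and run examples above repeat the same eleven opt-in flags, so the two lists can drift apart when one is edited. A minimal sketch of keeping them in one place follows; `gc_args` is a hypothetical helper, not part of the UniDesk CLI, and the flags and values are copied from the examples above.

```shell
# Hypothetical helper (not part of the UniDesk CLI).
# gc_args MODE: print the argument vector for `gc plan` or `gc run --confirm`,
# one argument per line, so both modes share the same candidate surface.
gc_args() {
  mode=$1
  set -- gc "$mode"
  if [ "$mode" = run ]; then
    # Only the run form carries the confirmation flag.
    set -- "$@" --confirm
  fi
  set -- "$@" --target-use-percent 69 --limit 2000 \
    --include-tool-caches \
    --include-stale-tmp \
    --include-vscode-stale-servers \
    --include-vscode-stale-extensions \
    --include-vscode-cached-vsix \
    --include-baidu-staging \
    --include-state-artifacts \
    --include-state-stale-scratch \
    --include-codex-sessions \
    --include-merged-worktrees \
    --include-vpn-diagnostic-logs
  printf '%s\n' "$@"
}
```

Because none of the arguments contain whitespace, the output can be passed back unquoted, for example `bun scripts/cli.ts $(gc_args plan)`, followed by `bun scripts/cli.ts $(gc_args run)` once the plan has been reviewed.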
+
+## Cache And Logs
+
+Check Docker image cleanup separately:
+
+```bash
+bun scripts/cli.ts server cleanup plan --min-age-hours 24 --limit 80
+```
+
+If it returns zero stale image candidates, do not use `docker image prune` or `docker system prune`; protected images may be current or rollback/runtime truth.
+
+Default BuildKit cleanup can estimate reclaim but actually release `0B` if all cache is recent. Use `--build-cache-all` only as an explicit pressure-relief step:
+
+```bash
+bun scripts/cli.ts gc plan --target-use-percent 69 --build-cache-all --limit 50 \
+  --no-file-logs --no-docker-logs --no-journal --no-tmp --no-db-summary
+bun scripts/cli.ts gc run --confirm --target-use-percent 69 --build-cache-all --limit 50 \
+  --no-file-logs --no-docker-logs --no-journal --no-tmp --no-db-summary
+```
+
+Journald can be capped through the same CLI:
+
+```bash
+bun scripts/cli.ts gc plan --target-use-percent 69 --journal-target-size 128M --limit 50 \
+  --no-file-logs --no-docker-logs --no-build-cache --no-tmp --no-db-summary
+```
+
+Use the matching `run --confirm` if the plan is acceptable.
+
+## Temporary Dumps
+
+`/tmp/unidesk-cli-output` is a CLI dump directory for oversized JSON/stdout. It can grow close to GiB scale during GC diagnosis because each truncated plan/run writes another dump. After extracting needed evidence and confirming no active writers, it is acceptable to remove the dump directory:
+
+```bash
+fuser -v /tmp/unidesk-cli-output 2>&1 || true
+rm -rf -- /tmp/unidesk-cli-output
+```
+
+Prefer turning repeated dump cleanup into a controlled CLI retention policy instead of making manual removal the normal interface.
+
+For other `/tmp` directories, check size, mtime, and active fds first. Avoid deleting same-day source/workspace scratch that may belong to parallel tasks unless its owner and recreatability are clear.
+
+## Remote Hosts
+
+Use `bun scripts/cli.ts gc remote ...` for provider hosts. Remote long work must be asynchronous and queried with `status --job-id`; do not keep a long SSH session open.
+
+Read `docs/reference/gc.md` before these remote cases:
+
+- G14 registry retention, CI workspace retention, k3s/PVC attribution, and safe-stop decisions.
+- PK01 pikanode temp retention and Docker-provider safe boundaries.
+- JD01 k3s/PVC attribution, Web observe artifact retention, Chrome/observer memory growth, and YAML-first source-of-truth checks.
+
+For JD01, Chrome memory growth should first be treated as an observer lifecycle problem: sentinel/quick-verify terminal paths must stop their observer, and runner TTL/maxSamples/artifact caps must come from YAML. Do not solve it by killing Chrome directly or deleting web-observe directories; use controlled observe stop and GC plan candidates.
+
+## Protected Boundaries
+
+Never use these as generic disk relief:
+
+- `docker system prune`, `docker image prune`, Docker volume removal, or Compose volume deletion.
+- PostgreSQL PGDATA or database trace cleanup without the dedicated `gc db-trace` flow, backup, and maintenance window.
+- `/var/lib/containerd`, `/var/lib/rancher/k3s`, `/var/lib/kubelet`, PVC paths, registry blobs, runtime snapshots, or k3s/container runtime state.
+- Codex auth/config/profile state. Codex session cleanup must use `--include-codex-sessions`; large active Codex SQLite log files require `fuser` checks and a dedicated retention decision.
+- Active Web observe runs, live observer runners, live Chrome process trees, or web-observe state roots without manifest/heartbeat/pid/open-fd based stale classification.
+- Dirty, active, unmerged, recent, or timeout-protected worktrees.
+- `backend-core` rebuild/restart/replacement while solving disk pressure unless the user explicitly asks.
+
+If `summary.target.safeStop=true` remains after all low-risk candidates, stop and report the remaining protected pressure sources and decision options. Do not bypass the CLI to hit a percentage target.
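As an illustration of the runtime-state boundary above, a manual cleanup candidate can be checked against the three literal runtime roots from the list before anything is touched. This is a sketch only: `is_protected_runtime_path` is a hypothetical helper, not part of the UniDesk CLI, and it does not cover PVC paths, registry blobs, or the other protected classes, which still need their own checks.

```shell
# Hypothetical helper (not part of the UniDesk CLI).
# is_protected_runtime_path PATH: succeed when PATH is one of the runtime roots
# named in the Protected Boundaries list, or lies beneath one of them.
# The match is purely lexical: symlinks and ".." are not resolved, so pass a
# canonical path (for example from `realpath`).
is_protected_runtime_path() {
  case "$1" in
    /var/lib/containerd | /var/lib/containerd/*) return 0 ;;
    /var/lib/rancher/k3s | /var/lib/rancher/k3s/*) return 0 ;;
    /var/lib/kubelet | /var/lib/kubelet/*) return 0 ;;
    *) return 1 ;;
  esac
}
```

A caller might use it as `is_protected_runtime_path "$(realpath -- "$candidate")" && echo "protected: stop and report"`. A non-match does not make a path safe to delete; it only means the path is outside these three roots.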
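+
+## Verification
+
+Close with concise evidence:
+
+```bash
+df -h /
+df -BG /
+docker system df
+du -sh /root/unidesk/.worktree /root/unidesk/.state /tmp /var/log 2>/dev/null || true
+```
+
+Summarize the starting and final `df` percentage, major successful cleanup classes, protected failures, and remaining high-risk pressure sources.
diff --git a/.agents/skills/unidesk-gc/agents/openai.yaml b/.agents/skills/unidesk-gc/agents/openai.yaml
new file mode 100644
index 00000000..7c4e3348
--- /dev/null
+++ b/.agents/skills/unidesk-gc/agents/openai.yaml
@@ -0,0 +1,4 @@
+interface:
+  display_name: "UniDesk GC"
+  short_description: "Controlled disk GC for UniDesk hosts"
+  default_prompt: "Use $unidesk-gc to plan and run controlled disk cleanup for a UniDesk host."
diff --git a/docs/reference/gc.md b/docs/reference/gc.md
index ee9707a2..b1beac1c 100644
--- a/docs/reference/gc.md
+++ b/docs/reference/gc.md
@@ -84,6 +84,20 @@ PK01 is a Tencent Cloud Docker provider, not a G14 k3s/registry node; long-term
 
 PK01 pikanode temp retention may only clean direct subdirectories of `/home/ubuntu/pikanode/html/temp` that are older than the retention window, and must protect `html/download/`, `html/upload/`, `files/`, certificates, Git state, direct log files, and recent temp workspaces. This policy is already codified as a node-local systemd timer and logrotate on PK01; when troubleshooting manually, check `systemctl status unidesk-pk01-pikanode-temp-gc.timer` and `/var/log/unidesk-pk01/pikanode-temp-gc.log` first. If temp retention and generic low-risk GC still cannot bring PK01 below high water, stop and move to a decision on pikanode download artifact retention, Docker image retention, or capacity expansion; do not delete `download/`, `files/`, or the Docker overlay as if they were ordinary temp directories.
 
+## Remote JD01 Policy
+
+JD01 is a YAML-first k3s provider that hosts AgentRun, HWLAB v0.3, the Web probe sentinel, and the related PVC/state artifacts. For `gc remote JD01 ...`, the target node, lane, namespace set, state root, protected paths, scan budget, artifact cap, retention window, observer stop policy, and output limit must come from `config/unidesk-cli.yaml#gc.remote.targets.JD01` or an equivalent source of truth; CLI arguments serve only as one-off overrides, and JD01 namespaces, paths, or thresholds must not be written into scripts as hidden defaults.
+
+JD01 remote plans must work over short-lived connections: `snapshot` and lightweight `plan` return bounded JSON; long tasks involving actual k3s/PVC usage, deep scans of the state root, history/trend, or large protected path sizes must create an asynchronous job and be polled progressively through `gc remote JD01 status --job-id `. Protected path size collection must be budgeted/progressive: each protected item discloses `sizeState`, elapsed time, and the timeout or failure reason; after a `du` timeout, it must not fall back to an unbounded recursive scan, nor may size accounting for protected objects block the plan from returning.
+
+JD01 PVC attribution must read the k8s API for the YAML-configured namespace set and must not reuse the hard-coded G14-specific namespaces. The report includes at least namespace, PVC, PV, host path, requested size, estimated actual bytes, active mount pods, owner/session/PipelineRun/runId, phase, and reclaim policy. By default only plan and attribution are performed; deleting PVCs/PVs, local-path host paths, k3s storage, containerd snapshots/blobs, or workload objects must go through the corresponding high-level retention subcommand and a GitOps/runtime owner decision, and remote GC must not be widened into raw `kubectl delete` or host path deletion.
+
+JD01 Web observe artifacts are first-class GC objects. The state root must come from YAML; candidates are aggregated per run and read `manifest.json`, `heartbeat.json`, `pid`, the report sha, and top files. Age is determined from the started/completed/updated fields of the manifest/heartbeat, pid liveness, and open-fd checks, not from directory mtime alone, because manual GC or directory traversal may refresh mtime. Active runs, runs with a live pid, runs with open fds, and runs that have not produced the required report are all protected. Safe candidates cover only large artifacts that exceed the YAML retention and can be regenerated, such as raw samples, browser-process, network/trace, and screenshots; long-term retention of the report summary, report json/md, final screenshots, or diagnostic summaries is controlled by the YAML cap/retention policy.
+
+JD01 Chrome memory governance should manage the observer runner lifecycle first rather than clean up Chrome processes in isolation. After the Web probe sentinel and quick-verify start an observer, every terminal path (success, blocked, failure, timeout, exception) must run the YAML-controlled `web-probe observe stop`/force stop flow and verify that the corresponding runner/Chrome process tree has exited; the observe runner itself must also obtain a maximum runtime or max samples fallback from the scenario/YAML, so that it stops sampling and closes the browser even if the caller exits. The browser freeze policy may serve only as protection against anomalies; it does not replace the stop that follows the normal end of the task lifecycle.
+
+JD01 plan and status should also disclose a memory pressure summary: active observer count, Chrome process count, Chrome RSS, stale observer count, state root artifact bytes, last cleanup, last stop failure, and drill-down commands. GC run may only execute low-risk actions explicitly marked safe in the plan, such as the apt/journal/tmp allowlist, dead web-observe raw artifact retention, and handling stale observers through controlled observe stop; PVCs, k3s runtime, containerd, Docker volumes, Secrets, and auth/config state always stay protected or are handed off to the dedicated retention entry point.
+
 ## HWLAB Registry Retention
 
 G14 HWLAB registry cleanup must explicitly use `--include-hwlab-registry`; the default `gc remote G14 plan` does not touch the registry. The policy must be conservative: it must not keep only latest, and it must not delete only tag links and then wrongly conclude that space has been freed.
@@ -240,6 +254,8 @@ trans G14 sh -- 'du -xh -d 2 /root/hwlab-v02/.worktree 2>/dev/null | sort -h | t
 
 rsyslog file logs are not among the objects that `gc remote` can change by default. If `/var/log/syslog*`, `/var/log/kern.log*`, or similar files become the last gap to the 50% target, first add a controlled logrotate/compress/truncate CLI whose output discloses the retained tail, the compressed objects, the reclaim estimate, and failure recovery; directly running `truncate` or deleting log files is prohibited as a long-term workflow. `/root/hwlab-v02/.worktree` may be cleaned only after its owner, branch, dirty state, and recreatability are clear; it must not be deleted based on directory size alone.
 
+JD01 disk space and Chrome pressure audits are likewise read-only by default. When a deeper look is needed, prefer exposing bounded summaries through `gc remote JD01 snapshot|plan|status`; any necessary remote probing serves only to supplement CLI evidence and is not a long-term manual workflow. Report fields include at least the root usage level, inodes, the gap to the usage target, YAML config references, k8s PVC namespace attribution, web-observe run artifact topN, an observer/Chrome process tree summary, protected sizeState, and the safe/blocked classification. The stale classification of a Web observe run must state its manifest/heartbeat/pid/open-fd basis; directory mtime must not be treated as the only evidence.
+
 ## Validation Checklist
 
 After G14 GC, the following must be verified:
@@ -276,4 +292,6 @@ bun scripts/cli.ts hwlab g14 control-plane cleanup-released-pvs --lane all --lim
 | Core dump limits | Limit dump size, or delete actually allocated blocks by allowlist | Prevents crash dumps from polluting observations; sparse dumps should not be overestimated |
 | Containerd image audit | Periodic read-only report on the composition of the runtime image cache | Provides evidence for maintenance-window prunes; nothing is deleted by default |
 | Worktree TTL audit | Report `.worktree` owner, branch, dirty state, and node_modules/cache usage | Provides evidence for safely cleaning scratch from parallel tasks |
+| Web observe artifact caps | Control the samples/browser-process/network/screenshot raw artifact caps, summary+tail retention, and dead run retention from YAML | Prevents long-running Web sentinel patrols from filling the disk linearly with JSONL and screenshot artifacts |
+| Observer lifecycle cap | quick-verify/sentinel stop the observer on every terminal path; the runner stops itself per the YAML TTL/maxSamples | Prevents detached observers and Chrome process trees from leaking memory during long-running patrols |
 | Capacity trigger | Output a safe-stop decision table when high water is reached | Avoids damaging the runtime plane to hit a percentage target |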
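The pid-liveness part of the Web observe stale classification under Remote JD01 Policy can be sketched in shell. This is an illustration under assumptions, not the CLI's implementation: `observe_run_pid_alive` is a hypothetical helper, and it assumes each run directory holds a `pid` file containing a single process id.

```shell
# Hypothetical helper (not part of the UniDesk CLI).
# observe_run_pid_alive RUN_DIR: succeed when RUN_DIR/pid names a live process.
# Assumption: the `pid` file holds one numeric process id.
observe_run_pid_alive() {
  pid_file=$1/pid
  [ -r "$pid_file" ] || return 1
  # Keep digits only so stray whitespace or newlines do not reach kill.
  pid=$(tr -cd '0-9' < "$pid_file")
  [ -n "$pid" ] || return 1
  # kill -0 sends no signal; it only tests whether the process can be signalled.
  # The /proc check covers live processes owned by another user on Linux.
  kill -0 "$pid" 2>/dev/null || [ -d "/proc/$pid" ]
}
```

A live pid means the run is protected. A dead or missing pid is not sufficient on its own to call a run stale: the manifest/heartbeat timestamps, open-fd checks, and report presence still have to agree.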