---
name: unidesk-ops
description: UniDesk manual operations CLI. The `server`, `gc`, and PK01 `platform-db postgres` subcommands cover main server start/stop, health checks, swap, logs, Docker image cleanup, disk GC, service rebuild, and PK01 host PostgreSQL operations. Use when the user mentions server start, server status, server swap, server rebuild, gc, disk cleanup, platform-db, PK01 PostgreSQL, or operations.
---

# UniDesk Manual Operations CLI

Operations entry point for the main server, driven through `bun scripts/cli.ts server ...` and `bun scripts/cli.ts gc ...`.

**Fixed entry prefix**: `cd /root/unidesk && bun scripts/cli.ts ...`

---

## Start and Stop

```bash
bun scripts/cli.ts server start
bun scripts/cli.ts server stop
```

Both commands run as asynchronous jobs and return a `job.id` and a log path. `start` builds and starts the services with Docker; `stop` stops every service in the Compose project.

---

## Health Check

```bash
bun scripts/cli.ts server status
```

Returns the public ports, restricted host ports, internal ports, a swap summary, Compose container states, per-service health checks, and access URLs.

When memory is low, `swap.warning` is non-empty; run `server swap ensure` first.

---

## Swap Management

```bash
bun scripts/cli.ts server swap status
bun scripts/cli.ts server swap ensure [--path /swapfile] [--size 2GiB] [--dry-run]
```

`ensure` creates a swapfile when no swap is active (`chmod 600`, `mkswap`, `swapon`, then an `/etc/fstab` entry). If swap is already active, it is a no-op. If writing fstab fails, it returns `degraded`.

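
The steps `ensure` is described as taking can be sketched as a print-only script, similar in spirit to `--dry-run`. This is a minimal illustration under stated assumptions, not the CLI's implementation: the function names, the use of `fallocate` to allocate the file, and the exact fstab line are all assumptions.

```bash
# Hypothetical sketch of the `server swap ensure` steps. It only prints the
# commands it would run; nothing is executed.

# Return 0 if a /proc/swaps-style file lists at least one active swap area.
# The first line of /proc/swaps is a header, so entries start at line 2.
swap_is_active() {
  local swaps_file="${1:-/proc/swaps}"
  awk 'NR > 1 && NF > 0 { found = 1 } END { exit !found }' "$swaps_file"
}

# Print the commands for creating and enabling a swapfile.
#   $1 = swapfile path, $2 = size string understood by fallocate (e.g. 2GiB)
print_swap_plan() {
  local path="$1" size="$2"
  printf 'fallocate -l %s %s\n' "$size" "$path"   # allocation step is an assumption
  printf 'chmod 600 %s\n' "$path"
  printf 'mkswap %s\n' "$path"
  printf 'swapon %s\n' "$path"
  printf "echo '%s none swap sw 0 0' >> /etc/fstab\n" "$path"
}

# No-op when swap is already active, otherwise print the plan.
#   $1 = swaps file, $2 = swapfile path, $3 = size
ensure_swap_plan() {
  local swaps_file="${1:-/proc/swaps}" path="${2:-/swapfile}" size="${3:-2GiB}"
  if swap_is_active "$swaps_file"; then
    echo "no-op: swap already active"
  else
    print_swap_plan "$path" "$size"
  fi
}
```
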
---

## Logs

```bash
bun scripts/cli.ts server logs
```

Returns the tail of the file logs and of the Docker container logs; output size is limited by default.

---

## Docker Image Cleanup

```bash
bun scripts/cli.ts server cleanup plan [--min-age-hours 24] [--limit N]
bun scripts/cli.ts server cleanup run --confirm [--min-age-hours 24] [--limit N]
```

`plan` only produces a dry-run plan; `run --confirm` deletes only the stale Docker images selected by the same classifier. The allowlist is conservative: images used by running or stopped containers, commit-pinned artifacts referenced by deploy.json/CI.json, and Compose stable images are kept. `docker system prune`, `docker image prune`, `docker volume rm`, `docker compose down -v`, and database cleanup are forbidden. High-risk candidates are processed only when `--include-high-risk` is passed explicitly.

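
The selection rule described above (old enough and not protected) can be illustrated with a small filter. This is only a sketch: the real classifier is not shown in this document, and the function name, the `id created_epoch` input format, and the protected-ID file are hypothetical.

```bash
# Hypothetical sketch: pick stale image IDs from "id created_epoch" lines on
# stdin. An image is a candidate only if it is at least min_age_hours old and
# its ID does not appear in the protected list.
#   $1 = minimum age in hours
#   $2 = current time in epoch seconds
#   $3 = file with one protected image ID per line
select_stale_images() {
  local min_age_hours="$1" now="$2" protected_file="$3"
  awk -v min_age="$((min_age_hours * 3600))" -v now="$now" -v pf="$protected_file" '
    BEGIN {
      # Load the protected IDs first so every candidate is checked against them.
      while ((getline line < pf) > 0) if (line != "") protected[line] = 1
    }
    NF >= 2 && (now - $2) >= min_age && !($1 in protected) { print $1 }
  '
}
```
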
---

## Disk GC

```bash
bun scripts/cli.ts gc plan
bun scripts/cli.ts gc run --confirm
bun scripts/cli.ts gc db-trace
bun scripts/cli.ts gc policy
bun scripts/cli.ts gc remote <providerId> [--target-use-percent N] [--dry-run|--confirm]
```

Relieves disk high-water conditions on the main server and on providers. `plan` is read-only and prints candidates, risks, estimated savings, and protected objects. `run` requires `--confirm`. `remote` runs GC on the remote host, passed through over SSH.

Commonly used explicit candidates and target:

```bash
bun scripts/cli.ts gc plan --target-use-percent 69 \
  --include-tool-caches \
  --include-stale-tmp \
  --include-vscode-stale-servers \
  --include-vscode-stale-extensions \
  --include-vscode-cached-vsix \
  --include-baidu-staging \
  --include-vpn-diagnostic-logs
```

`--target-use-percent` estimates the shortfall using the usage figure as `df` displays it.

The following candidates are disabled by default. Each must be included explicitly before it enters the candidate set, and execution is still guarded by path assertions:

- tool caches
- direct children of `/tmp` that are not on the allowlist
- historical VS Code server/extension versions
- the VS Code CachedExtensionVSIXs download cache
- old PGDATA tarballs in Baidu staging
- historical UniDesk `.state` diagnostic/deployment artifacts
- VPN diagnostic ring pcaps

The stale `/tmp` scan enumerates candidates bounded by `--limit`, so it does not run for a long time without output just to estimate the entire temp directory.

`.state` retention is selected only through `--include-state-artifacts --state-artifact-keep-days N`. It picks regular files older than the retention period under `.state/e2e`, `.state/validation`, `.state/jobs`, and `.state/codex-queue/output-archive`, plus direct child directories older than the retention period under `.state/deploy/exports` and `.state/deploy/resolve`. The default retention period is 14 days.

VS Code cached VSIX selection covers only top-level regular cache files under `/root/.vscode-server/data/CachedExtensionVSIXs` that are older than `--vscode-cached-vsix-keep-days`. Active fds are checked before execution, and installed extensions, servers, and user data are never deleted.

VPN diagnostic log selection covers only regular files matching `/root/vpn-server/logs/hy2-udp-ring-*.pcap` and `hy2-monitor-ring-*.pcap` that are older than `--vpn-diagnostic-log-keep-hours`. Active fds are checked before execution, and evidence JSONL is never deleted.

Default GC never touches `.state/recovery`, `.state/codex-queue/codex-home`, `.state/deploy/work`, `.state/baidu-netdisk`, PGDATA, Docker volumes/images, Codex sessions/auth state, active worktrees, runtime image/snapshot state, the Baidu staging root directory, the VPN log root directory, or VS Code user data.

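
As a worked example of a `df`-based shortfall, the sketch below computes how many 1K blocks would have to be freed to reach a target percentage. It assumes GNU `df`'s convention that Use% is `used / (used + available)` rounded up; whether `gc plan` uses exactly this formula is an assumption, and both function names are hypothetical.

```bash
# Sketch: blocks to free so that df's Use% column drops to the target.
# Freeing x blocks moves them from "used" to "available", so the sum stays
# constant and the largest allowed "used" is floor(total * target / 100).
#   $1 = used 1K blocks, $2 = available 1K blocks, $3 = target use percent
gc_shortfall_kb() {
  local used="$1" avail="$2" target="$3"
  local total=$((used + avail))
  local allowed=$((total * target / 100))
  local shortfall=$((used - allowed))
  [ "$shortfall" -gt 0 ] || shortfall=0
  printf '%s\n' "$shortfall"
}

# Read used/available for a mount point from `df -Pk` (POSIX format, 1K blocks:
# column 3 is Used, column 4 is Available).
gc_shortfall_for_mount() {
  local mount="$1" target="$2" used avail
  read -r used avail < <(df -Pk "$mount" | awk 'NR == 2 { print $3, $4 }')
  gc_shortfall_kb "$used" "$avail" "$target"
}
```

With 80 blocks used and 20 available, a target of 69 allows at most 69 used blocks, so `gc_shortfall_kb 80 20 69` prints `11`.
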
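The daily timer installed by `gc policy install` automatically applies 24-hour VPN diagnostic pcap retention, 14-day UniDesk `.state` artifact retention, and 7-day VS Code CachedExtensionVSIXs retention. This bounds the long-term growth of diagnostic/deployment artifacts, tcpdump ring files, and the VS Code download cache. Manual `gc plan` and `gc run` still list or delete these objects only when `--include-vpn-diagnostic-logs`, `--include-state-artifacts`, or `--include-vscode-cached-vsix` is passed explicitly.
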
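
An age-based retention rule of this kind can be previewed with `find`. The sketch below only lists what is past the retention period and never deletes anything; it is not the GC implementation, and the function names are hypothetical.

```bash
# Sketch: list regular files under a directory whose mtime is older than
# keep_days. Listing only; nothing is removed.
#   $1 = directory, $2 = retention in days
list_expired_files() {
  local dir="$1" keep_days="$2"
  [ -d "$dir" ] || return 0
  # -mmin counts minutes, which avoids the whole-day rounding of -mtime.
  find "$dir" -type f -mmin "+$((keep_days * 24 * 60))" -print
}

# Same rule for direct child directories only (no recursion).
list_expired_child_dirs() {
  local dir="$1" keep_days="$2"
  [ -d "$dir" ] || return 0
  find "$dir" -mindepth 1 -maxdepth 1 -type d -mmin "+$((keep_days * 24 * 60))" -print
}

# Example calls (paths taken from the retention rules above):
#   list_expired_files      .state/jobs            14
#   list_expired_child_dirs .state/deploy/exports  14
```
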
---

## Service Rebuild

```bash
bun scripts/cli.ts server rebuild <service>
```

Valid values for `<service>`: `backend-core` | `frontend` | `dev-frontend-proxy` | `provider-gateway` | `todo-note` | `code-queue-mgr` | `project-manager` | `baidu-netdisk` | `oa-event-flow`

The rebuild runs as an asynchronous job: build the image → serialize on `.state/locks/server-compose.lock` → replace the container with `--no-deps --force-recreate` → wait for `healthy/running`.

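
The serialization step can be pictured as running the Compose command while holding an exclusive file lock. The sketch below uses `flock(1)` from util-linux; that the real job uses `flock`, and the helper name `with_compose_lock`, are assumptions.

```bash
# Sketch: run a command while holding an exclusive lock on a lock file, so
# concurrent invocations run one after another instead of overlapping.
#   $1 = lock file path, rest = command to run
with_compose_lock() {
  local lock_file="$1"; shift
  mkdir -p "$(dirname "$lock_file")"
  (
    flock -x 9   # block until the exclusive lock on fd 9 is acquired
    "$@"         # the lock is released when the subshell exits
  ) 9>"$lock_file"
}

# Example (not executed here): replace one service without touching its
# dependencies.
#   with_compose_lock .state/locks/server-compose.lock \
#     docker compose up -d --no-deps --force-recreate backend-core
```

Holding the lock in a subshell keeps its lifetime tied to the command, so a crash cannot leave the lock held.
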
After submitting the job you must poll it; a submitted job is not a finished job:

```bash
bun scripts/cli.ts server rebuild backend-core
bun scripts/cli.ts job status <jobId> --tail-bytes 12000
```

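
The polling step above can be sketched as a small retry helper. `poll_until` is a hypothetical name, and the `job_finished` check in the comment guesses at the `job status` output format, which this document does not specify.

```bash
# Sketch: run a check command repeatedly until it succeeds or the attempts
# run out. Returns 0 on success, 1 on timeout.
#   $1 = max attempts, $2 = seconds between attempts, rest = check command
poll_until() {
  local attempts="$1" interval="$2"; shift 2
  local i
  for ((i = 1; i <= attempts; i++)); do
    if "$@"; then
      return 0
    fi
    # Sleep between attempts, but not after the last one.
    if [ "$i" -lt "$attempts" ]; then
      sleep "$interval"
    fi
  done
  return 1
}

# Hypothetical check; the grep pattern is an assumption about the output:
#   job_finished() {
#     bun scripts/cli.ts job status "$1" --tail-bytes 12000 | grep -q 'succeeded'
#   }
#   poll_until 60 5 job_finished "<jobId>"
```
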
Verify the runtime only after the backend-core rebuild has finished:

```bash
bun scripts/cli.ts server status
docker exec unidesk-backend-core sh -lc 'backend-core --fetch-json http://127.0.0.1:8080/health --require-ok'
```

**Prohibited**:

- Routine backend-core iteration must not be compiled on the master server. Use the controlled asynchronous job `server rebuild backend-core` only when a committed fix has to go live on the main server's Compose runtime.
- The D601 Code Queue execution plane is not managed by `server rebuild`.
- Do not rebuild or delete the database named volumes.

---

## PK01 Host PostgreSQL

PK01's host-native PostgreSQL is the reference example of an external platform state database. Its declaration file is `config/platform-db/postgres-pk01.yaml`, and the controlled entry points are:

```bash
bun scripts/cli.ts platform-db postgres plan --config config/platform-db/postgres-pk01.yaml
bun scripts/cli.ts platform-db postgres status --config config/platform-db/postgres-pk01.yaml
bun scripts/cli.ts platform-db postgres apply --config config/platform-db/postgres-pk01.yaml --confirm
bun scripts/cli.ts platform-db postgres apply --config config/platform-db/postgres-pk01.yaml --confirm --wait
```

- `plan` and `status` are read-only. `apply --confirm` creates a local asynchronous job by default; `apply --confirm --wait` starts a root-owned job on the PK01 side and polls it briefly.
- Output shows only Secret key names, presence, fingerprint, connection host, SSL state, and a status summary. Printing passwords or the full `DATABASE_URL` is forbidden.
- Cross-node consumers must connect directly to `postgres.network.connectionHost` from the YAML, which is currently the PK01 public endpoint. Do not route PostgreSQL traffic from D601/G14/Sub2API/HWLAB/AgentRun through the master server.
- The current TLS setup is PostgreSQL native TLS with `sslmode=require`. `publicDns` is only an optional alias: as long as `connectionHost` is a reachable IP, an unresolved DNS name does not block the database cutover.
- After the remote PostgreSQL configuration or the `pg_hba` source CIDRs change, run `apply --confirm --wait` first, then `status`. If a consumer's public egress IP changes, update `allowSources` in the YAML and the corresponding `pg_hba` entry first.

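
The rule against printing passwords or a full `DATABASE_URL` can be backed by a small redaction helper in any script that logs connection details. This is a sketch with hypothetical function names. The SHA-256 prefix is only one possible fingerprint scheme; how the CLI computes its fingerprint is not documented here.

```bash
# Sketch: mask the password part of a connection URL before it is logged.
# URLs without a password are returned unchanged.
redact_database_url() {
  printf '%s\n' "$1" |
    sed -E 's#^([a-zA-Z][a-zA-Z0-9+.-]*://[^:/@]+):[^@]*@#\1:***@#'
}

# Sketch: short, non-reversible fingerprint of a secret (assumed scheme:
# first 12 hex characters of its SHA-256 digest).
secret_fingerprint() {
  printf '%s' "$1" | sha256sum | cut -c1-12
}

# Example with a documentation-range IP and a made-up credential:
#   redact_database_url 'postgres://app:s3cr3t@203.0.113.10:5432/db?sslmode=require'
```

The fingerprint lets two operators confirm they hold the same secret without either of them displaying it.
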
Recommended routine re-verification:

```bash
bun scripts/cli.ts platform-db postgres status --config config/platform-db/postgres-pk01.yaml
trans PK01 script <<'SCRIPT'
systemctl is-active postgresql
systemctl is-enabled postgresql
systemctl is-active unidesk-pk01-sub2api-pgdump.timer
SCRIPT
```

For long-term boundaries see `docs/reference/pk01.md`; for the Sub2API consumer-side boundaries see `docs/reference/platform-infra.md`.

---

## Moon Bridge Management

Moon Bridge is the bridge service between Codex and upstream providers (Codex ↔ provider), managed through per-profile wrappers:

```bash
# DeepSeek profile
dscx bridge-start
dscx bridge-status
dscx bridge-smoke dscx-bridge-ok
dscx bridge-stop

# MiniMax profile
mxcx bridge-start
mxcx bridge-status
mxcx bridge-smoke mxcx-bridge-ok
mxcx bridge-stop
```

- `dscx` → `127.0.0.1:38440` (Codex custom provider `deepseek`, DeepSeek V4 Pro)
- `mxcx` → `127.0.0.1:38441` (Codex custom provider `minimax`, MiniMax-M3)
- The bridge is started with `setsid` and a profile-local PID file, so the process does not exit with the CLI
- Logs are in `<CODEX_HOME>/logs/moonbridge/`

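
The `setsid` plus PID file pattern mentioned above can be sketched as follows. This is a generic illustration rather than the wrappers' actual code; the function names and file layout are hypothetical.

```bash
# Sketch: start a long-running command in its own session and record its PID.
#   $1 = PID file, $2 = log file, rest = command to run
start_detached() {
  local pid_file="$1" log_file="$2"; shift 2
  # The inner shell writes its own PID and then exec replaces it with the
  # command, so the PID file names the real process even if setsid forks.
  setsid sh -c 'echo "$$" > "$0"; exec "$@"' "$pid_file" "$@" \
    >>"$log_file" 2>&1 </dev/null &
}

# Return 0 if the PID file exists and the recorded process is alive.
is_running() {
  local pid_file="$1"
  [ -s "$pid_file" ] || return 1
  kill -0 "$(cat "$pid_file")" 2>/dev/null
}

# Stop the recorded process (if alive) and remove the PID file.
stop_detached() {
  local pid_file="$1"
  if is_running "$pid_file"; then
    kill "$(cat "$pid_file")"
  fi
  rm -f "$pid_file"
}
```
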
---

## Codex Profile Smoke

```bash
# DeepSeek
dscx doctor
dscx bridge-smoke dscx-bridge-ok
dscx exec --skip-git-repo-check 'Reply exactly: dscx-codex-ok'

# MiniMax
mxcx doctor
mxcx bridge-smoke mxcx-bridge-ok
mxcx exec --skip-git-repo-check 'Reply exactly: mxcx-codex-ok'
```

`bridge-smoke` verifies the Moon Bridge → provider link. `exec` verifies the full Codex CLI → bridge → provider chain.

---

## MiniMax Session Recovery

Recovery procedure for a MiniMax session whose `resume` keeps failing because of invalid tool-call arguments:

```bash
# 1. Clean up invalid tool arguments
mxcx session-clean <session-id-or-jsonl>

# 2. Confirm idempotency (should return changed=false)
mxcx session-clean <session-id-or-jsonl>

# 3. Inject a guard to prevent recurrence
mxcx session-guard <session-id-or-jsonl>

# 4. Non-interactive smoke test to verify recovery
mxcx exec resume <session-id> 'Reply exactly: recovered-ok'

# 5. apply-patch smoke test (if remote editing is involved)
# Verify that trans <route> apply-patch is used, not download/upload/sed
```

`mxcx resume <session-id>` runs `session-clean` and `session-guard` automatically before invoking Codex. The repair is minimal: it fixes only invalid `function_call.arguments` and does not compress, truncate, or reorder the transcript.

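
Besides reading `changed=false` in step 2, idempotency can be cross-checked independently by comparing file checksums across two runs. The helper below is a generic sketch with a hypothetical name. It assumes the repair command takes the file path as its last argument and edits the file in place, so it is best run against a copy of the session JSONL.

```bash
# Sketch: run a repair command twice on a file and confirm the second run
# leaves the file byte-for-byte unchanged.
#   $1 = file, rest = repair command (the file path is appended as last arg)
# Returns 0 if idempotent, 1 if the second run changed the file,
# 2 if the repair command itself failed.
check_idempotent() {
  local file="$1"; shift
  local first second
  "$@" "$file" >/dev/null || return 2
  first="$(cksum < "$file")"
  "$@" "$file" >/dev/null || return 2
  second="$(cksum < "$file")"
  [ "$first" = "$second" ]
}

# Example (not executed here), on a copy of the transcript:
#   cp session.jsonl /tmp/session-copy.jsonl
#   check_idempotent /tmp/session-copy.jsonl mxcx session-clean
```
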
---

## References

- **Main server architecture and behavior specification**: `docs/reference/master-server-ops.md` (Execution Boundary, Codex Provider Profile architecture, Moon Bridge internal rules, MiniMax session-clean behavior constraints, apply-patch policy)
- **Long-term disk GC rules**: `docs/reference/gc.md`
- **Deployment boundaries**: `docs/reference/deployment.md`