docs: converge trans shell examples
@@ -118,7 +118,7 @@ dry-run 输出会暴露 registry probe URL、required labels、目标 image、
 `status` and `health` go through:

 ```bash
-trans D601 argv bash -lc '<readonly script>'
+trans D601 sh -- '<readonly shell command>'
 ```

 Read-only check of D601 status. The checks include:
@@ -62,14 +62,14 @@ If a manual repair is needed to unblock the platform, the durable fix must be co

 Distributed runtime work should prefer structured CLI passthrough over ad-hoc nested shell strings. The standard escalation order is:

-1. Use a purpose-built UniDesk route plus operation or helper such as `trans D601:k3s kubectl ...`, `trans D601:k3s script`, `trans D601:k3s:<namespace>:<workload> logs`, `trans D601:k3s:<namespace>:<workload> script`, `trans D601:k3s:<namespace>:<workload>[:<container>] apply-patch --cwd /workspace`, `trans <providerId>:/absolute/workspace apply-patch`, `trans <providerId> py`, `trans <providerId> find`, `trans <providerId> glob` or `trans <providerId> skills`. Use legacy `apply-patch-v1` only when the old remote helper is explicitly required.
+1. Use a purpose-built UniDesk route plus operation or helper such as `trans D601:k3s kubectl ...`, `trans D601:k3s sh`, `trans D601:k3s:<namespace>:<workload> logs`, `trans D601:k3s:<namespace>:<workload> sh`, `trans D601:k3s:<namespace>:<workload>[:<container>] apply-patch --cwd /workspace`, `trans <providerId>:/absolute/workspace apply-patch`, `trans <providerId> py`, `trans <providerId> find`, `trans <providerId> glob` or `trans <providerId> skills`. Use legacy `apply-patch-v1` only when the old remote helper is explicitly required.
 2. If no helper exists, use `trans <providerId> argv <command> [args...]` so the CLI quotes each argv token once.
-3. If shell features such as pipes, redirects, loops or variable expansion are required, use a single quoted heredoc with `trans <providerId> script` or `trans D601:k3s:<namespace>:<workload> script` so the script body travels over stdin instead of through shell command-string arguments.
+3. If shell features such as pipes, redirects, loops or variable expansion are required, use a single quoted heredoc with explicit `trans <providerId> sh|bash` or `trans D601:k3s:<namespace>:<workload> sh|bash` so the script body travels over stdin instead of through shell command-string arguments.
 4. Treat free-form ssh-like command strings as an interactive compatibility path, not as the default automation surface.

-For D601 Kubernetes work, route syntax is preferred over positional shell recipes, but the route must stay a pure locator. `D601:k3s` means the native k3s control plane, and `D601:k3s:<namespace>:<workload>[:container]` means a namespaced workload or pod/container. `:` is the distributed route separator; `/` is only an in-container filesystem cwd, so container selection must use `:<container>` or `--container <container>`, not `pod/<pod>/<container>`. Operations come after the route: `kubectl` runs on the control plane, `logs` reads bounded workload logs, `script` streams a local heredoc/stdin script into the host or target pod, and `apply-patch --cwd /workspace` is the default remote text patch operation for pod workspaces. The route-operation split keeps distributed location and execution behavior independently extensible, fixes `KUBECONFIG=/etc/rancher/k3s/k3s.yaml`, refuses long-follow logs, and assembles common `kubectl exec` / `kubectl logs` / stdin script / pod patch target arguments without adding a provider-gateway protocol change. This prevents the common failure mode where a command crosses local shell, UniDesk SSH broker, remote shell command strings, `kubectl exec`, and container shell quoting layers before reaching the process that should run it.
+For D601 Kubernetes work, route syntax is preferred over positional shell recipes, but the route must stay a pure locator. `D601:k3s` means the native k3s control plane, and `D601:k3s:<namespace>:<workload>[:container]` means a namespaced workload or pod/container. `:` is the distributed route separator; `/` is only an in-container filesystem cwd, so container selection must use `:<container>` or `--container <container>`, not `pod/<pod>/<container>`. Operations come after the route: `kubectl` runs on the control plane, `logs` reads bounded workload logs, `sh`/`bash` stream a local heredoc/stdin script into the host or target pod with an explicit shell dialect, and `apply-patch --cwd /workspace` is the default remote text patch operation for pod workspaces. The route-operation split keeps distributed location and execution behavior independently extensible, fixes `KUBECONFIG=/etc/rancher/k3s/k3s.yaml`, refuses long-follow logs, and assembles common `kubectl exec` / `kubectl logs` / stdin shell / pod patch target arguments without adding a provider-gateway protocol change. This prevents the common failure mode where a command crosses local shell, UniDesk SSH broker, remote shell command strings, `kubectl exec`, and container shell quoting layers before reaching the process that should run it.

-Longer scripts should move across stdin (`trans py`, `trans script` or k3s `script` operation), and remote text patches should default to `apply-patch` with a host or pod workspace route. Legacy `apply-patch-v1` remains available as the explicit fallback and uses the injected `sh` helper path instead of assuming target containers have `python3`, `node` or repository-local tools. Avoid heredocs nested inside remote command strings, `python - <<EOF` inside SSH strings, or JSON/Markdown bodies passed through shell arguments. These patterns often bind stdin to the wrong process, strip quotes, or leave a half-open provider SSH session that looks like a platform outage.
+Longer scripts should move across stdin (`trans py`, explicit `trans sh|bash`, or k3s `sh|bash` operation), and remote text patches should default to `apply-patch` with a host or pod workspace route. Legacy `apply-patch-v1` remains available as the explicit fallback and uses the injected `sh` helper path instead of assuming target containers have `python3`, `node` or repository-local tools. Avoid heredocs nested inside remote command strings, `python - <<EOF` inside SSH strings, or JSON/Markdown bodies passed through shell arguments. These patterns often bind stdin to the wrong process, strip quotes, or leave a half-open provider SSH session that looks like a platform outage.

 When structured passthrough is missing for a recurring workflow, fix the CLI first and then document the durable helper. Do not preserve a growing collection of one-off shell recipes as the long-term runbook.

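The stdin rule in step 3 of the escalation order can be illustrated locally. The following is a minimal sketch, not the `trans` implementation: `remote_sh` is a hypothetical stand-in that feeds the heredoc body to a local `sh -s`, which is assumed here to approximate how `trans <providerId> sh` receives its script over stdin.

```bash
# Hypothetical stand-in for `trans <providerId> sh`: runs the stdin body in a local sh.
remote_sh() { sh -s; }

# The quoted delimiter ('SH') stops the local shell from expanding or unquoting
# anything in the body, so the receiving shell sees it byte for byte.
result=$(remote_sh <<'SH'
greeting='single "and" double quotes survive'
printf '%s\n' "$greeting" | tr 'a-z' 'A-Z'
SH
)
printf '%s\n' "$result"
```

Because the body never passes through a command-string argument, the pipe, the variable expansion and both quote styles reach the receiving shell unchanged.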
@@ -15,8 +15,8 @@ G14 platform DB 是 G14 host OS 上的原生 PostgreSQL,不是 k3s workload,
 The G14 platform DB is always managed by systemd:

 ```bash
-trans G14 script -- 'systemctl status postgresql'
-trans G14 script -- '/usr/local/sbin/g14-platform-db-health'
+trans G14 sh -- 'systemctl status postgresql'
+trans G14 sh -- '/usr/local/sbin/g14-platform-db-health'
 ```

 PostgreSQL listens only on the G14 host loopback and the node gateway address reachable from k3s pods:
@@ -76,8 +76,8 @@ HWLAB v0.3 的 source truth 在 `G14:/root/hwlab-v03`、branch `v0.3`。`deploy/
 Standard verification:

 ```bash
-trans G14:/root/hwlab-v03 script -- 'npm run gitops:ts:check'
-trans G14:/root/hwlab-v03 script -- 'npm run gitops:render -- --lane v03 --out /tmp/hwlab-v03-render-check'
+trans G14:/root/hwlab-v03 sh -- 'npm run gitops:ts:check'
+trans G14:/root/hwlab-v03 sh -- 'npm run gitops:render -- --lane v03 --out /tmp/hwlab-v03-render-check'
 bun scripts/cli.ts hwlab nodes control-plane trigger-current --node G14 --lane v03 --confirm
 bun scripts/cli.ts hwlab nodes control-plane status --node G14 --lane v03 --pipeline-run <pipeline-run>
 ```
@@ -132,8 +132,8 @@ bun scripts/cli.ts hwlab nodes secret cleanup-obsolete --node G14 --lane v03 --n
 The backup scripts live at fixed paths on the G14 host:

 ```bash
-trans G14 script -- 'systemctl status g14-platform-db-backup.timer'
-trans G14 script -- '/usr/local/sbin/g14-platform-db-backup'
+trans G14 sh -- 'systemctl status g14-platform-db-backup.timer'
+trans G14 sh -- '/usr/local/sbin/g14-platform-db-backup'
 ```

 Backup directory:
+18
-18
@@ -25,7 +25,7 @@ The old `/root/hwlab` workspace on branch `G14` is no longer a default source tr
 The standard entry forms are:

 ```bash
-trans G14:/root/hwlab script -- 'git fetch origin G14 && git pull --ff-only origin G14 && git status --short --branch && git remote -v'
+trans G14:/root/hwlab sh -- 'git fetch origin G14 && git pull --ff-only origin G14 && git status --short --branch && git remote -v'
 trans G14:/root/hwlab apply-patch < patch.diff
 trans G14:k3s kubectl get pods -n hwlab-v02
 ```
@@ -55,7 +55,7 @@ HWLAB `v0.2` is the supported G14 runtime lane for the v0.2 branch. It must not
 The fixed `v0.2` source branch is `v0.2`, forked from the current `G14` branch after the G14 long-term reference docs record this decision. The fixed G14 development workspace for that branch is:

 ```bash
-trans G14:/root/hwlab-v02 script -- 'git status --short --branch && git remote -v'
+trans G14:/root/hwlab-v02 sh -- 'git status --short --branch && git remote -v'
 ```

 `/root/hwlab-v02` is the long-lived `v0.2` development workspace, not a scratch clone or CI/CD source selector. It must track `origin/v0.2` with `origin git@github.com:pikasTech/HWLAB.git`; local dirty state, stale `HEAD`, and untracked `.worktree/` only affect human development. Do not reuse retired `/root/hwlab` or `/root/hwlab/.worktree/*` as the `v0.2` fixed workspace.
@@ -122,13 +122,13 @@ The generic P2/P3/P4 flow is owned by `$dad-dev`; this section fixes the G14/v0.
 Direct-lightweight precheck:

 ```bash
-trans G14:/root/hwlab-v02 script -- 'git fetch origin v0.2 && git pull --ff-only origin v0.2 && git status --short --branch'
+trans G14:/root/hwlab-v02 sh -- 'git fetch origin v0.2 && git pull --ff-only origin v0.2 && git status --short --branch'
 ```

 Service workflow setup:

 ```bash
-trans G14:/root/hwlab-v02 script -- 'git worktree add .worktree/<task> -b fix/issue<N>-<short-name> origin/v0.2'
+trans G14:/root/hwlab-v02 sh -- 'git worktree add .worktree/<task> -b fix/issue<N>-<short-name> origin/v0.2'
 ```

 The fixed repo at `/root/hwlab-v02` is not a scratch area for service/runtime work, but it is the direct-lightweight source workspace. When a direct-lightweight task sees parallel dirty state in the fixed repo, inspect and include or separate it according to the current user instruction and project Git rules; never discard it silently. Worktree branches for service workflow should follow the `fix/issue<N>-<short-name>` naming so PR titles and merge commits stay scannable. GitHub PR writes, merge, rollout trigger and final original-entry validation follow `$dad-dev` plus the UniDesk CLI control rules in `AGENTS.md`.
@@ -137,17 +137,17 @@ The fixed repo at `/root/hwlab-v02` is not a scratch area for service/runtime wo

 Direct-lightweight commits are allowed and do not need recovery. A direct commit on `v0.2` only needs recovery when it changed service/runtime/GitOps/CI/CD/public behavior that should have used the PR/rollout workflow. The recovery is bounded and audit-friendly, but it is also a `git push --force-with-lease` against the protected branch, so it is only acceptable when the unapproved direct commit is the only new content on `v0.2` since the last merged PR:

-1. Confirm no parallel worktree was in flight and the commit is the only delta. `trans G14:/root/hwlab-v02 script -- 'git log origin/v0.2..HEAD'` and `git log HEAD..origin/v0.2` must show the direct commit as a single fast-forward candidate.
+1. Confirm no parallel worktree was in flight and the commit is the only delta. `trans G14:/root/hwlab-v02 sh -- 'git log origin/v0.2..HEAD'` and `git log HEAD..origin/v0.2` must show the direct commit as a single fast-forward candidate.
 2. Capture the commit identity and patch for the recovery record:
 ```bash
-trans G14:/root/hwlab-v02 script -- 'git show <direct-commit-sha> > /tmp/v0.2-recovery.patch'
+trans G14:/root/hwlab-v02 sh -- 'git show <direct-commit-sha> > /tmp/v0.2-recovery.patch'
 ```
 3. Roll the fixed repo back to the previous merged PR head. Use `git reset --hard <previous-pr-sha>`; this preserves any autostash (e.g. from a parallel `git checkout` snapshot in another worktree) on the stash list and does not touch the other worktree's working tree.
-4. In the pre-existing worktree (e.g. `.worktree/<task>` on `fix/issue<N>-<short-name>`) bring the branch up to the previous PR head with `trans G14:/root/hwlab-v02/.worktree/<task> script -- 'git reset --hard <previous-pr-sha>'`, then `git cherry-pick <direct-commit-sha>` to replay the direct commit on the feature branch. If the worktree branch was already a clean clone of `origin/v0.2` at the previous PR head, the reset is a no-op.
+4. In the pre-existing worktree (e.g. `.worktree/<task>` on `fix/issue<N>-<short-name>`) bring the branch up to the previous PR head with `trans G14:/root/hwlab-v02/.worktree/<task> sh -- 'git reset --hard <previous-pr-sha>'`, then `git cherry-pick <direct-commit-sha>` to replay the direct commit on the feature branch. If the worktree branch was already a clean clone of `origin/v0.2` at the previous PR head, the reset is a no-op.
 5. Push the feature branch and force-push `v0.2` back to the rolled-back head with `--force-with-lease` (refuses to clobber a concurrent push):
 ```bash
-trans G14:/root/hwlab-v02/.worktree/<task> script -- 'git push -u origin fix/issue<N>-<short-name>'
-trans G14:/root/hwlab-v02 script -- 'git push --force-with-lease origin v0.2'
+trans G14:/root/hwlab-v02/.worktree/<task> sh -- 'git push -u origin fix/issue<N>-<short-name>'
+trans G14:/root/hwlab-v02 sh -- 'git push --force-with-lease origin v0.2'
 ```
 6. Open the PR through UniDesk CLI, squash-merge, then `git pull --ff-only origin v0.2` to bring the fixed repo back in sync. The previous PR's merge commit will not be in the new PR's history; the new PR's diff equals the original direct commit's diff, so the PR trail still contains the exact same bytes.
 7. `bun scripts/cli.ts hwlab g14 control-plane status --lane v02` will read the new merge commit; the previously-staged PipelineRun for the direct commit was created on the v0.2 head and `trigger-current` will delete + recreate it for the post-merge head, so no manual PipelineRun cleanup is required.
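The "only delta" precondition from step 1 of the recovery can be expressed as a small guard. This is a sketch under stated assumptions: `is_single_delta` is a hypothetical helper, and the `git rev-list --count` calls shown in the comments are one way to obtain the two counts; the runbook itself only names the two `git log` ranges.

```bash
# Succeeds only when exactly one commit is ahead of origin/v0.2 and none is behind.
is_single_delta() {
  ahead=$1
  behind=$2
  [ "$ahead" -eq 1 ] && [ "$behind" -eq 0 ]
}

# Intended use inside the fixed workspace (not executed in this sketch):
#   ahead=$(git rev-list --count origin/v0.2..HEAD)
#   behind=$(git rev-list --count HEAD..origin/v0.2)
#   is_single_delta "$ahead" "$behind" || echo "recovery not applicable" >&2

if is_single_delta 1 0; then echo "single delta: recovery may proceed"; fi
if ! is_single_delta 2 0; then echo "two commits ahead: refuse"; fi
if ! is_single_delta 1 1; then echo "origin moved: refuse"; fi
```

Any other combination means more than the direct commit would be affected by the force-push, so the recovery should stop there.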
@@ -160,7 +160,7 @@ Cloud Web layout, status-panel, collapsed-control, and modal issues on `v0.2` ne

 Use these surfaces together:

-- `trans G14:/root/hwlab-v02/.worktree/<task>/web/hwlab-cloud-web script -- 'bun run check'` for approved static source/layout checks and dist freshness.
+- `trans G14:/root/hwlab-v02/.worktree/<task>/web/hwlab-cloud-web sh -- 'bun run check'` for approved static source/layout checks and dist freshness.
 - `bun scripts/cli.ts hwlab g14 control-plane status --lane v02` for runtime, Argo, public endpoint, and GitOps alignment. If `origin/v0.2` moved through a parallel PR, use `--pipeline-run` or `--source-commit` and treat same-branch supersession as context rather than failure.
 - Public API probes for both `/health/live` and `/v1/live-builds`. `/health/live` proves live service health/revision, but Cloud Web build time, image tag/digest, source metadata, and actual runtime commit/revision should be read from `/v1/live-builds`.
 - A bounded browser/DOM probe against `http://74.48.78.17:19666/` that asserts the deployed page state relevant to the issue.
@@ -183,9 +183,9 @@ When a `v0.2` Cloud Web fix removes a button from `index.html` or a field from t

 ```bash
 # 1. Web assets rebuild and the orphan is gone from the dist
-trans G14:/root/hwlab-v02/.worktree/<task> script -- 'cd web/hwlab-cloud-web && bun run build'
-trans G14:/root/hwlab-v02/.worktree/<task> script -- "grep -c '<removed-field>' web/hwlab-cloud-web/dist/app.js" # must be 0
-trans G14:/root/hwlab-v02/.worktree/<task> script -- "grep -c 'id=\"<removed-id>\"' web/hwlab-cloud-web/index.html" # must be 0
+trans G14:/root/hwlab-v02/.worktree/<task> sh -- 'cd web/hwlab-cloud-web && bun run build'
+trans G14:/root/hwlab-v02/.worktree/<task> sh -- "grep -c '<removed-field>' web/hwlab-cloud-web/dist/app.js" # must be 0
+trans G14:/root/hwlab-v02/.worktree/<task> sh -- "grep -c 'id=\"<removed-id>\"' web/hwlab-cloud-web/index.html" # must be 0

 # 2. Live 19666/19667 confirms the deployed bundle is the new build
 curl -fsS http://74.48.78.17:19666/ | grep -c '<removed-id>' # must be 0
@@ -196,7 +196,7 @@ bun scripts/cli.ts hwlab g14 control-plane status --lane v02
 While the PR is open, the author can also run a one-liner to surface any orphan `el.<field>.addEventListener` whose field is not declared in the `el` literal of `app.ts`:

 ```bash
-trans G14:/root/hwlab-v02/.worktree/<task> script -- 'awk "/^const el = /,/^};/" web/hwlab-cloud-web/app.ts | tr -d "," | awk "{print \$1}" | grep -E "^[a-zA-Z]" | sort -u > /tmp/el-fields.txt; grep -nEo "el\\.([A-Za-z_$][A-Za-z0-9_$]*)\\.addEventListener" web/hwlab-cloud-web/*.ts | while read m; do field=$(echo "$m" | sed -E "s/.*el\\.([A-Za-z_$][A-Za-z0-9_$]*)\\.addEventListener.*/\\1/"); if ! grep -q "^$field$" /tmp/el-fields.txt; then echo "ORPHAN: el.$field.addEventListener"; fi; done'
+trans G14:/root/hwlab-v02/.worktree/<task> sh -- 'awk "/^const el = /,/^};/" web/hwlab-cloud-web/app.ts | tr -d "," | awk "{print \$1}" | grep -E "^[a-zA-Z]" | sort -u > /tmp/el-fields.txt; grep -nEo "el\\.([A-Za-z_$][A-Za-z0-9_$]*)\\.addEventListener" web/hwlab-cloud-web/*.ts | while read m; do field=$(echo "$m" | sed -E "s/.*el\\.([A-Za-z_$][A-Za-z0-9_$]*)\\.addEventListener.*/\\1/"); if ! grep -q "^$field$" /tmp/el-fields.txt; then echo "ORPHAN: el.$field.addEventListener"; fi; done'
 ```

 Document the explicit `grep` / curl evidence in the issue closeout comment. Tightening the `el` literal with proper TypeScript types is tracked separately and must not be done as part of a runtime fix PR.
@@ -276,10 +276,10 @@ A live `workspace.evidence` / `debug.evidence` / `download evidence` selector th
 Device-pod fixes still follow `$dad-dev` and the service/runtime side of the `## v0.2 Source Workflow` route above. The device-pod-specific closeout is the three-layer runtime matrix below; keep these checks because they prove the cloud-api -> executor -> D601 host chain, while generic PR/CI/CD and worktree mechanics stay in `$dad-dev`.

 ```bash
-trans G14:/root/hwlab-v02/.worktree/<task> script -- 'cd tools && bun test device-pod-cli.test.ts'
-trans G14:/root/hwlab-v02/.worktree/<task> script -- 'cd cmd/hwlab-device-pod && bun test main.test.ts'
-trans G14:/root/hwlab-v02/.worktree/<task> script -- 'cd internal/cloud && bun test access-control.test.ts'
-trans G14:/root/hwlab-v02/.worktree/<task> script -- 'node --check skills/device-pod-cli/assets/device-host-cli.mjs'
+trans G14:/root/hwlab-v02/.worktree/<task> sh -- 'cd tools && bun test device-pod-cli.test.ts'
+trans G14:/root/hwlab-v02/.worktree/<task> sh -- 'cd cmd/hwlab-device-pod && bun test main.test.ts'
+trans G14:/root/hwlab-v02/.worktree/<task> sh -- 'cd internal/cloud && bun test access-control.test.ts'
+trans G14:/root/hwlab-v02/.worktree/<task> sh -- 'node --check skills/device-pod-cli/assets/device-host-cli.mjs'
 ```

 Treat `access-control.test.ts` workbench failures as pre-existing on the v0.2 base unless the new test list explicitly covers them. After PR merge and `trigger-current --lane v02 --confirm`, the live `http://74.48.78.17:19667/` CLI acceptance check must hit all three layers:
+11
-11
@@ -214,11 +214,11 @@ Registry 报告必须区分 `uniqueBlobBytes` 和 `sharedBlobBytes`。多个 rep
 G14 disk-space audits are read-only by default. When a report is needed, collect the following summaries first and avoid dumping large JSON in full:

 ```bash
-trans G14 script -- 'df -h / | tail -1'
-trans G14 script -- 'du -xh -d 1 / /var /var/lib /root 2>/dev/null | sort -h | tail -40'
-trans G14 script -- 'du -xh -d 2 /var/lib/rancher/k3s /var/lib/containerd /var/log 2>/dev/null | sort -h | tail -80'
-trans G14 script -- 'KUBECONFIG=/etc/rancher/k3s/k3s.yaml kubectl get pv,pvc,pod -A -o wide'
-trans G14 script -- 'find /var/lib/hwlab/registry/docker/registry/v2/repositories -path "*/_manifests/tags/*/current/link" -type f | wc -l'
+trans G14 sh -- 'df -h / | tail -1'
+trans G14 sh -- 'du -xh -d 1 / /var /var/lib /root 2>/dev/null | sort -h | tail -40'
+trans G14 sh -- 'du -xh -d 2 /var/lib/rancher/k3s /var/lib/containerd /var/log 2>/dev/null | sort -h | tail -80'
+trans G14:k3s kubectl get pv,pvc,pod -A -o wide
+trans G14 sh -- 'find /var/lib/hwlab/registry/docker/registry/v2/repositories -path "*/_manifests/tags/*/current/link" -type f | wc -l'
 ```

 When digging into the registry, the report must include at least repo, tag count, manifest revision count, latest tags, protected digest closure, unique blob bytes and shared blob bytes. When digging into the k3s runtime, the report must include at least namespace/PVC, PV host path, owner workload, actual PVC usage, and the total size of k3s containerd snapshots/blobs. Do not simply add `/var/lib/kubelet/pods` and `/var/lib/rancher/k3s/storage` together: the kubelet pod directory may contain PVC bind mounts or runtime metadata, so the sum risks double counting.
@@ -226,8 +226,8 @@ trans G14 script -- 'find /var/lib/hwlab/registry/docker/registry/v2/repositorie
 When digging into logs and worktrees, report read-only by default and do not clean up directly:

 ```bash
-trans G14 script -- 'du -xh -d 1 /var/log 2>/dev/null | sort -h | tail -40'
-trans G14 script -- 'du -xh -d 2 /root/hwlab-v02/.worktree 2>/dev/null | sort -h | tail -60'
+trans G14 sh -- 'du -xh -d 1 /var/log 2>/dev/null | sort -h | tail -40'
+trans G14 sh -- 'du -xh -d 2 /root/hwlab-v02/.worktree 2>/dev/null | sort -h | tail -60'
 ```

 rsyslog file logs are not among the objects `gc remote` may modify by default. If `/var/log/syslog*`, `/var/log/kern.log*` or similar files become the last gap to the 50% target, first add a controlled logrotate/compress/truncate CLI whose output discloses the retained tail, the compressed objects, the estimated space freed and the failure recovery path; directly running `truncate` or deleting log files is forbidden as a long-term procedure. `/root/hwlab-v02/.worktree` may be cleaned only after its owner, branch, dirty state and rebuildability are established; do not delete by directory size alone.
@@ -237,10 +237,10 @@ rsyslog 文件日志不属于当前 `gc remote` 默认可变更对象。若 `/va
 After G14 GC, the following must be verified:

 ```bash
-trans G14 script -- 'df -h / | tail -1'
-trans G14 script -- 'curl -fsS http://127.0.0.1:5000/v2/ >/dev/null && echo ok'
-trans G14 script -- 'KUBECONFIG=/etc/rancher/k3s/k3s.yaml kubectl -n hwlab-ci get deploy hwlab-registry'
-trans G14 script -- 'KUBECONFIG=/etc/rancher/k3s/k3s.yaml kubectl -n hwlab-ci get cronjob hwlab-g14-branch-poller -o custom-columns=NAME:.metadata.name,SUSPEND:.spec.suspend --no-headers && ! KUBECONFIG=/etc/rancher/k3s/k3s.yaml kubectl -n hwlab-ci get cronjob hwlab-v02-branch-poller >/dev/null 2>&1'
+trans G14 sh -- 'df -h / | tail -1'
+trans G14 sh -- 'curl -fsS http://127.0.0.1:5000/v2/ >/dev/null && echo ok'
+trans G14:k3s kubectl -n hwlab-ci get deploy hwlab-registry
+trans G14:k3s sh -- 'kubectl -n hwlab-ci get cronjob hwlab-g14-branch-poller -o custom-columns=NAME:.metadata.name,SUSPEND:.spec.suspend --no-headers && ! kubectl -n hwlab-ci get cronjob hwlab-v02-branch-poller >/dev/null 2>&1'
 ```

 DEV workload verification should check whether workloads with non-zero replicas are ready; an explicitly disabled `0/0` deployment must not be misreported as an incident. The registry tag count is supporting evidence only and cannot replace workload ref protection or registry API health.
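The `0/0` rule above can be made mechanical. The sketch below assumes the `ready/desired` form that `kubectl get deploy` prints in its READY column; `classify_ready` and its three labels are hypothetical names chosen for illustration, not part of the UniDesk CLI.

```bash
# Classify one READY value such as "2/2", "1/3" or "0/0".
classify_ready() {
  ready=${1%%/*}     # text before the slash
  desired=${1##*/}   # text after the slash
  if [ "$desired" -eq 0 ]; then
    echo disabled      # explicitly scaled to zero: not an incident
  elif [ "$ready" -eq "$desired" ]; then
    echo ready
  else
    echo not-ready     # non-zero replicas that are not all ready
  fi
}

for value in 0/0 2/2 1/3; do
  printf '%s -> %s\n' "$value" "$(classify_ready "$value")"
done
```

Only the `not-ready` class should be escalated after GC; `disabled` deployments are reported as context.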
@@ -86,7 +86,7 @@ Sanitizer rules: recursively scans `ResponsesRequest.input`, repairs tool-call `

 ## MiniMax Apply-Patch Operations

-MiniMax-backed sessions must use the same UniDesk remote text patch contract as other agents: route first, operation second, and `apply-patch` v2 by default. The stable write shape is `trans <provider>:/absolute/workspace apply-patch < patch.diff`; read-only inspection may use `trans <provider>:/absolute/workspace script -- 'nl -ba file'` or equivalent bounded commands.
+MiniMax-backed sessions must use the same UniDesk remote text patch contract as other agents: route first, operation second, and `apply-patch` v2 by default. The stable write shape is `trans <provider>:/absolute/workspace apply-patch < patch.diff`; read-only inspection may use `trans <provider>:/absolute/workspace nl -ba file` or equivalent bounded commands.

 - If `apply-patch` reports `failed to find expected lines`, first read the exact current target block, then retry with a smaller `Update File` hunk, an `@@ <unique anchor>` hint, or multiple small hunks. This is normal stale-context recovery, not a reason to switch tools.
 - Do not recover text patch failures by using `download` / `upload`, remote Python/Perl/sed heredocs, `cat >` / `tee` whole-file rewrites, or `apply-patch-v1`, unless `apply-patch` itself is unavailable or the target is non-text / bulk mechanical generated content.
@@ -42,7 +42,7 @@ UniDesk 用户服务是挂载到 UniDesk 核心服务上的、面向用户使用
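
 Business repositories are maintained by the business systems themselves, including source code, Dockerfile, docker-compose, configuration templates and business tests. UniDesk only references the business repository URL, commit id, Dockerfile/docker-compose path and running container name; the full business code must not be copied into `src/components/microservices/`, which would create dual maintenance. `src/components/microservices/` may hold only generic examples or UniDesk's own examples and is not a mirror of business repositories.

-The Code Queue runner is also a distributed development execution surface. The runner image must ship with `tran` built in so that, while executing tasks, the runner can reach D601, G14, host workspaces, the k3s control plane and target pods through the public frontend control plane. Inside the runner, prefer structured commands such as `tran <provider> argv ...`, `tran <provider>:k3s kubectl ...` and `tran <provider>:k3s:<namespace>:<workload> argv ...`; the stdin-based `script`, `apply-patch` and `py` operations likewise run over the frontend `/ws/ssh` streaming channel and must not fall back to `/api/dispatch` task polling. This boundary avoids making provider tokens, backend-core internal DNS or multi-layer quoting of long commands a prerequisite for runner availability, and it keeps large stdout from being truncated by task JSON compaction.
+The Code Queue runner is also a distributed development execution surface. The runner image must ship with `tran` built in so that, while executing tasks, the runner can reach D601, G14, host workspaces, the k3s control plane and target pods through the public frontend control plane. Inside the runner, prefer structured commands such as `tran <provider> argv ...`, `tran <provider>:k3s kubectl ...` and `tran <provider>:k3s:<namespace>:<workload> argv ...`; the stdin-based `sh`/`bash`, `apply-patch` and `py` operations likewise run over the frontend `/ws/ssh` streaming channel and must not fall back to `/api/dispatch` task polling. This boundary avoids making provider tokens, backend-core internal DNS or multi-layer quoting of long commands a prerequisite for runner availability, and it keeps large stdout from being truncated by task JSON compaction.

 ## Main Server User Services
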
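The route forms used in these commands (`<provider>`, `<provider>:/absolute/workspace`, `<provider>:k3s` and `<provider>:k3s:<namespace>:<workload>[:<container>]`) can be split on `:` alone. The following is an illustrative sketch, not the actual UniDesk route parser: `parse_route`, its output format and the sample names `ns1`, `app1` and `c1` are assumptions made for this example.

```bash
# Split a route into provider plus locator; prints one key=value summary line.
parse_route() {
  route=$1
  provider=${route%%:*}
  rest=${route#"$provider"}
  rest=${rest#:}
  case $rest in
    "")  echo "provider=$provider kind=host" ;;
    /*)  echo "provider=$provider kind=workspace cwd=$rest" ;;
    k3s) echo "provider=$provider kind=k3s-control-plane" ;;
    k3s:*)
      old_ifs=$IFS
      IFS=:
      set -- $rest   # $1=k3s $2=namespace $3=workload $4=container (optional)
      IFS=$old_ifs
      if [ -z "${2:-}" ] || [ -z "${3:-}" ]; then
        echo "route needs namespace and workload: $route" >&2
        return 1
      fi
      echo "provider=$provider kind=k3s-workload namespace=$2 workload=$3 container=${4:-none}"
      ;;
    *)   echo "unsupported route: $route" >&2; return 1 ;;
  esac
}

parse_route D601:k3s
parse_route G14:/root/hwlab-v03
parse_route D601:k3s:ns1:app1:c1
```

Keeping the route a pure locator is what makes this split trivial: the operation (`argv`, `kubectl`, `sh`, `apply-patch`) is a separate argument and never has to be recovered from inside the route string.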
@@ -227,7 +227,7 @@ D601 上必须显式使用原生 k3s kubeconfig:`KUBECONFIG=/etc/rancher/k3s/k
|
||||
- Skill 注入边界:DEV Code Queue scheduler/read/write Pod 必须把宿主 `/home/ubuntu/.agents/skills` 只读挂载到容器 `/root/.agents/skills`,并设置 `UNIDESK_SKILLS_PATH=/root/.agents/skills`,让执行任务能读取 `cli-spec` 等技能;只允许挂载 skill 目录本身,不得把宿主 `~/.agents`、`~/.codex`、token、auth JSON 或其他隐私配置整体暴露给任务容器。`/health` 和 `/api/dev-ready` 必须暴露非敏感 `skills` 状态:路径、exists、available、readonly、skillCount、`cliSpecAvailable` 和修复建议;CLI `codex dev-ready` 可读取该摘要。当前交付只要求 DEV manifest 和旧 direct Compose 诊断路径具备只读 skill 注入;PROD Code Queue 发布前必须单独审查隔离级别,不能把 DEV 桥接模式直接推广为生产默认。
|
||||
- Develop-ready 镜像:Code Queue 镜像必须在启动前预装 UniDesk/Pipeline 调试所需工具,至少包含 `codex`、`bun`、`node`、`npm`/`npx`、`git`、`rg`、`curl`、`python3`/`pip3`、`docker`、`docker compose`、`docker-compose`、`jq`、`ssh`、`rsync`、`make`、`gcc`/`g++`、`iptables`、`tar`、`gzip` 和 `unzip`;不得依赖 Codex 任务运行时再 `apt-get install` 这些基础环境。
|
||||
- 远程开发容器与任务执行 Provider:Code Queue 必须能通过 live API 拉起 D601 等计算节点上的开发容器,入口为 `POST /api/dev-containers/<providerId>/start`,默认 Provider 为 `D601`。该流程由 Code Queue 调用 UniDesk SSH 维护桥在目标节点创建 `unidesk-codex-dev-<providerId>`;人工入口写 `trans <providerId>`,内部服务调用仍复用同一 route parser 和 broker。在 Code Queue 所在节点与开发容器之间建立 `ssh -w` TUN 点对点链路;服务所在节点负责对开发容器的 TUN 源地址做 NAT/MASQUERADE,开发容器默认路由和 DNS 改走该 TUN,从而让 `ping google.com`、DNS、HTTP(S) 等出网都经主 server 全局代理,而不是依赖 D601 本地网络。提交 Code Queue 任务时必须支持选择执行 Provider:`D601` 在 D601 原生 k3s 的 active Code Queue scheduler/runner Pod 中本机执行,默认工作目录为 `/workspace`,并且 `/workspace` 必须映射 D601 WSL host 的 `/home/ubuntu`;同一个 hostPath 还必须挂载到容器内 `/home/ubuntu`,让 WSL home 里的绝对 symlink(例如 `/workspace/cq-deploy -> /home/ubuntu/unidesk-code-queue-deploy`)在任务中可解析,不能只看到 symlink 名而无法进入目标目录。`/root/unidesk` 与 `/app` 必须单独映射 `/home/ubuntu/cq-deploy` 作为服务部署仓库;其他 Provider 在对应 `unidesk-codex-dev-<providerId>` 容器中执行,默认工作目录为 `/home/ubuntu`,可按任务覆盖 `cwd`。远程任务启动前必须自动复用或拉起该 Provider 的开发容器、同步 Codex 配置和允许的运行时 provider 环境变量,并通过同一 master TUN/NAT 链路出网;目标 host 存在 `/mnt` 时,开发容器必须挂载 host `/mnt:/mnt`,确保 D601 这类 WSL 节点的 Windows 盘符路径如 `/mnt/f/Work/ConStart` 在任务容器内可见,避免 agent 因缺少真实工作区而搜索到无关项目。TUN 建立必须幂等处理 stale 状态:启动前清理旧 `tun<id>`、默认路由、旧 tunnel SSH 进程和旧 OUTPUT 跳转,缺失旧设备不能导致失败,冷启动运行时准备要有有界但足够的 timeout。TUN 建立后必须创建 `UD-CQ-EGRESS-<provider>` OUTPUT 链,规则只允许 loopback、既有连接、`tun<id>` 出口以及到 master server 的 SSH tunnel 控制连接,随后 reject 其他 IPv4/IPv6 出站包;这条网络层封口是开发/执行容器的权威外网边界,不能用 `HTTP_PROXY`/`NO_PROXY` 环境变量替代,容器镜像也必须使用已解析出的唯一 `unidesk-code-queue:<provider>` 或显式 `image`,缺失时直接失败,禁止 provider-gateway image、`latest` 或其他隐式镜像 fallback。验收必须保留三类日志:容器建隧道后 `ping google.com` 成功、强制指定原 Docker 网卡直连外网被 `sealed_direct_ping=blocked_expected` 拦截、服务所在节点上对应 `UNIDESK-CODEX-DEV-<providerId>` NAT 链或 `tun<id>` 计数在 ping 前后增长;涉及 WSL 工作区任务时还必须在开发容器内验证目标 `/mnt/...` 路径可读。`GET /api/dev-containers/<providerId>/status` 必须展示默认路由、`route_8_8_8_8`、`egressFirewallChain` 和 OUTPUT 链跳转。开发容器代理密钥只生成到 `.state/code-queue/dev-proxy/` 与目标节点用户目录,不得提交到仓库。
|
||||
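- Remote maintenance bridge calls: after Code Queue migrated to D601, the Code Queue backend Pod no longer contains the main server's `unidesk-backend-core` container, so `trans ...` can no longer be implemented as a local `docker exec unidesk-backend-core`. Provider maintenance commands issued by the Code Queue runner must go through the main server frontend's authenticated `/ws/ssh` streaming proxy into the backend-core SSH bridge, and the target provider-gateway then executes Host SSH/WSL SSH. stdout/stderr stream directly back to the runner and must not pass through `/api/dispatch` task polling or JSON compact. Scripts, `py`, and `apply-patch` payloads use the same stdin streaming channel, which avoids falling back to the local Docker broker, manual chunked base64 uploads, an interactive shell fallback, or multiple layers of quoting.
- Remote maintenance bridge calls: after Code Queue migrated to D601, the Code Queue backend Pod no longer contains the main server's `unidesk-backend-core` container, so `trans ...` can no longer be implemented as a local `docker exec unidesk-backend-core`. Provider maintenance commands issued by the Code Queue runner must go through the main server frontend's authenticated `/ws/ssh` streaming proxy into the backend-core SSH bridge, and the target provider-gateway then executes Host SSH/WSL SSH. stdout/stderr stream directly back to the runner and must not pass through `/api/dispatch` task polling or JSON compact. `sh`/`bash` stdin shell bodies, `py`, and `apply-patch` payloads use the same stdin streaming channel, which avoids falling back to the local Docker broker, manual chunked base64 uploads, an interactive shell fallback, or multiple layers of quoting.
- Remote Provider preparation must not block the control plane: in the request handling, queue scheduling, remote dev-container preparation, Host SSH/WSL SSH passthrough, Codex/OpenCode startup, and log export paths, Code Queue must not use synchronous child-process calls that occupy the Bun event loop for long periods, such as `spawnSync`, `execSync`, or `execFileSync` against a remote Provider. Remote commands must run through asynchronous child processes with an explicit timeout, a kill on timeout, stdout/stderr caps, and progress recorded in the task output. A remote preparation failure may only move the affected task to failed or retry; it must not make control-plane requests such as `POST /api/tasks`, SSE `/api/events`, `/health`, overview, or the frontend/core user-service proxy wait for the remote SSH to finish. Any change to D601/remote Provider preparation, `api/dev-containers/*`, task enqueue/startup, or paths such as `runCodeQueueSsh` must be accepted by concurrently verifying, while a remote SSH/status/start probe is running, that direct-to-container `/health` and `/api/tasks/overview` still return within 1s, proving that a remote timeout will not recur as a site-wide refresh hang.
- OpenCode remote execution: when the two parallel configurations `minimax-m3` and `minimax-m2.7` go through the OpenCode JSON event port, both local and remote commands must explicitly run `opencode run ...`; remote Docker exec must not degrade to `exec run ...`, which turns into `bash: exec: run: not found` inside the target container. The terminal state of an OpenCode JSON stream is determined by "current process exit code + final assistant response of the current attempt": when `exit=0` and the current attempt produced a non-empty final reply, the attempt counts as a normal terminal even if upstream never emitted a `step_finish` event; a non-zero exit, no current final reply, or a closed transport leads to retry. Each attempt's `finalResponse` must come only from the current OpenCode/Codex turn. Falling back to the task's previous `finalResponse` when the current turn produced no final reply is forbidden, because it would misjudge old task content as completion of this round.
- Codex control: the service internally starts `codex app-server --listen stdio://`, calls `thread/start`, `turn/start`, `turn/steer`, and `turn/interrupt` over JSON-RPC, and listens for notifications such as `turn/completed`, assistant delta, reasoning delta, command output delta, and file diff delta to build a transcript the frontend can poll.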
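As a rough illustration of the non-blocking rule for remote Provider preparation in the list above, the following sketch runs a command through an asynchronous child process with an explicit timeout, a kill on timeout, and a cap on captured output. The helper name `runBounded`, its option names, and the result shape are assumptions made for this example and are not taken from the UniDesk code base; recording progress into the task output is left out.

```typescript
import { spawn } from "node:child_process";

// Result shape is an assumption of this sketch, not a UniDesk type.
interface BoundedResult {
  exitCode: number | null; // null when the process was killed by a signal
  timedOut: boolean;
  stdout: string;
  stderr: string;
  truncated: boolean; // true when output exceeded the cap
}

// Hypothetical helper: run a command asynchronously with a timeout,
// a kill on timeout, and a cap on captured stdout/stderr characters.
function runBounded(
  command: string,
  args: string[],
  opts: { timeoutMs: number; maxOutputChars: number },
): Promise<BoundedResult> {
  return new Promise((resolve, reject) => {
    const child = spawn(command, args);
    child.stdin.end(); // this sketch sends no stdin
    let stdout = "";
    let stderr = "";
    let truncated = false;
    let timedOut = false;

    // Keep at most maxOutputChars across both streams.
    const capped = (chunk: Buffer): string => {
      const room = opts.maxOutputChars - (stdout.length + stderr.length);
      const text = chunk.toString("utf8");
      if (text.length > room) truncated = true;
      return text.slice(0, Math.max(room, 0));
    };
    child.stdout.on("data", (chunk: Buffer) => { stdout += capped(chunk); });
    child.stderr.on("data", (chunk: Buffer) => { stderr += capped(chunk); });

    // Kill the child when the deadline passes instead of waiting on it.
    const timer = setTimeout(() => {
      timedOut = true;
      child.kill("SIGKILL");
    }, opts.timeoutMs);

    child.on("error", (err) => { clearTimeout(timer); reject(err); });
    child.on("close", (code) => {
      clearTimeout(timer);
      resolve({ exitCode: code, timedOut, stdout, stderr, truncated });
    });
  });
}
```

Because the call returns a promise, the event loop stays free to serve `/health` or overview requests while the remote command is still running; the caller decides whether a timed-out result moves the task to failed or retry.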
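The OpenCode terminal-state rule in the list above can be sketched as a small pure function. The type and field names here are hypothetical, and treating a closed transport as retry even when the exit code is 0 is an assumption of this sketch; the text only lists the three retry conditions without ordering them.

```typescript
type AttemptOutcome = "terminal" | "retry";

// Observation of a single attempt; field names are assumptions of this sketch.
interface AttemptObservation {
  exitCode: number | null;      // exit code of the current process
  finalResponse: string | null; // final assistant reply from the current turn only
  transportClosed: boolean;     // the stream closed before a normal end
  sawStepFinish: boolean;       // recorded for logging; not required for terminal
}

function classifyAttempt(o: AttemptObservation): AttemptOutcome {
  // Only a reply produced by the current attempt counts; an earlier
  // task-level finalResponse must never be substituted here.
  const hasCurrentReply = o.finalResponse !== null && o.finalResponse.trim().length > 0;
  if (o.transportClosed) return "retry"; // assumption: closure takes precedence
  if (o.exitCode === 0 && hasCurrentReply) return "terminal";
  return "retry";
}
```

Note that `sawStepFinish` does not influence the result, which mirrors the rule that a missing `step_finish` event alone is not a failure.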
+11
-11
@@ -8,18 +8,18 @@ Use UniDesk SSH passthrough for PK01 host operations:

```bash
trans PK01 argv hostname
trans PK01 script <<'SCRIPT'
trans PK01 sh <<'SH'
df -h /
docker ps --format 'table {{.Names}}\t{{.Image}}\t{{.Status}}'
SCRIPT
SH
```

Before closing an operation, verify both the provider channel and host workload state:

```bash
bun scripts/cli.ts debug health
trans PK01 argv bash -lc 'docker inspect --format "name={{.Name}} restart={{.HostConfig.RestartPolicy.Name}} pid={{.HostConfig.PidMode}} state={{.State.Status}} image={{.Config.Image}}" unidesk-provider-gateway-pk01'
trans PK01 argv bash -lc 'docker ps --format "table {{.Names}}\t{{.Image}}\t{{.Status}}"'
trans PK01 sh -- 'docker inspect --format "name={{.Name}} restart={{.HostConfig.RestartPolicy.Name}} pid={{.HostConfig.PidMode}} state={{.State.Status}} image={{.Config.Image}}" unidesk-provider-gateway-pk01'
trans PK01 sh -- 'docker ps --format "table {{.Names}}\t{{.Image}}\t{{.Status}}"'
```

PK01 has no k3s control plane. `trans PK01:k3s ...` is not an operating truth. If a future PK01 k3s lane is introduced, it must get a separate runtime-lane reference and must not reuse the current pikanode host-data policy as a Kubernetes retention policy.
@@ -130,9 +130,9 @@ PK01 has node-local retention controls installed so that pikanode temp output an
Operational checks:

```bash
trans PK01 argv bash -lc 'systemctl status unidesk-pk01-pikanode-temp-gc.timer --no-pager'
trans PK01 argv bash -lc 'sudo systemctl start unidesk-pk01-pikanode-temp-gc.service && tail -n 40 /var/log/unidesk-pk01/pikanode-temp-gc.log'
trans PK01 argv bash -lc 'sudo logrotate -d /etc/logrotate.d/unidesk-pk01-pikanode'
trans PK01 sh -- 'systemctl status unidesk-pk01-pikanode-temp-gc.timer --no-pager'
trans PK01 sh -- 'sudo systemctl start unidesk-pk01-pikanode-temp-gc.service && tail -n 40 /var/log/unidesk-pk01/pikanode-temp-gc.log'
trans PK01 sh -- 'sudo logrotate -d /etc/logrotate.d/unidesk-pk01-pikanode'
```

The timer and logrotate configuration are node-local operational state. If a future UniDesk CLI subcommand manages PK01 retention centrally, it must first render a dry-run plan, show the same protected paths, and then install/update these node-local files through a confirmed operation.
@@ -142,10 +142,10 @@ The timer and logrotate configuration are node-local operational state. If a fut
PK01 space attribution should use short, bounded commands. Recommended probes:

```bash
trans PK01 argv bash -lc 'df -h / && df -i /'
trans PK01 argv bash -lc 'sudo timeout 20 du -xhd1 /var /home/ubuntu/pikanode /home/ubuntu/.vscode-server /var/lib/docker /var/log 2>/dev/null | sort -h | tail -80'
trans PK01 argv bash -lc 'docker system df -v | sed -n "1,220p"'
trans PK01 argv bash -lc 'sudo find /home/ubuntu/pikanode/html/temp -xdev -mindepth 1 -maxdepth 1 -printf "%TY-%Tm-%Td %TH:%TM %p\n" | sort | tail -40'
trans PK01 sh -- 'df -h / && df -i /'
trans PK01 sh -- 'sudo timeout 20 du -xhd1 /var /home/ubuntu/pikanode /home/ubuntu/.vscode-server /var/lib/docker /var/log 2>/dev/null | sort -h | tail -80'
trans PK01 sh -- 'docker system df -v | sed -n "1,220p"'
trans PK01 sh -- 'sudo find /home/ubuntu/pikanode/html/temp -xdev -mindepth 1 -maxdepth 1 -printf "%TY-%Tm-%Td %TH:%TM %p\n" | sort | tail -40'
```

Interpretation guide:
@@ -167,7 +167,7 @@ When target-level `egressProxy.enabled=true`, the D601 target renders an in-clus

Adding, removing, exposing, validating, and configuring local Codex consumers are daily operations covered by `$unidesk-sub2api`. The development rule is that ordinary pool membership changes stay YAML-only and do not add code or CI/CD. Code changes are only appropriate when UniDesk needs to render or validate a Sub2API capability that already exists upstream, such as account-level WebSocket mode or per-account upstream User-Agent. If Sub2API itself does not support a desired behavior, do not magic-patch it through UniDesk scripts, Kubernetes hotfixes, local forks, or hidden compatibility paths; either leave the behavior unsupported or pursue it upstream as an explicit Sub2API feature.

`codex-pool sync --confirm` and `codex-pool validate` are runtime operations that may need more than one SSH short-connection window because they log in to Sub2API, reconcile accounts, inspect recent logs, and run gateway smoke requests. The formal entry remains the UniDesk CLI, which must use a submit-and-short-poll control shape or an equivalent remote job wrapper instead of one long `trans G14:k3s script` call. If these commands fail with `UNIDESK_SSH_RUNTIME_TIMEOUT` while the remote operation may still be running, treat it as a control-plane visibility gap first: improve or use the CLI's job/poll path, then rerun `sync` or `validate`. Do not replace it with raw `kubectl`, manual Sub2API admin API patches, repeated blind full loops, or Sub2API source modifications.
`codex-pool sync --confirm` and `codex-pool validate` are runtime operations that may need more than one SSH short-connection window because they log in to Sub2API, reconcile accounts, inspect recent logs, and run gateway smoke requests. The formal entry remains the UniDesk CLI, which must use a submit-and-short-poll control shape or an equivalent remote job wrapper instead of one long `trans G14:k3s sh` call. If these commands fail with `UNIDESK_SSH_RUNTIME_TIMEOUT` while the remote operation may still be running, treat it as a control-plane visibility gap first: improve or use the CLI's job/poll path, then rerun `sync` or `validate`. Do not replace it with raw `kubectl`, manual Sub2API admin API patches, repeated blind full loops, or Sub2API source modifications.

After `codex-pool configure-local --confirm`, the default `~/.codex/config.toml` / `auth.json` pair must remain the unified Sub2API consumer and must not be reused as an upstream account profile. Keep every upstream source profile in suffixed files such as `config.toml.<profile>` / `auth.json.<profile>` and register it through YAML `profiles.entries`.

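@@ -46,7 +46,7 @@ Provider WebSocket is the registration, heartbeat, dispatch, `provider.upgrade` and short

The long-term direction for the TCP pool is to turn `trans`/`tran` into a truly concurrent short-connection tool, rather than stacking more queues on a single Provider WebSocket. backend-core uses the provider WebSocket only to send control frames such as open/dispatch/exit; stdin/stdout/stderr data frames must travel over a pre-warmed TCP channel. Each SSH session, script, file transfer, or Windows passthrough command owns one channel exclusively and releases it back to the pool when it finishes. Pool exhaustion, a lost channel, an unreachable data port, and an outdated provider version must all be structured fast failures; queuing a request in an application-layer queue and then not returning for a long time is forbidden.

The current default pool size is 10 channels, designed to cover high-frequency short SSH, concurrent small files, and a single large file that does not block other requests. The verified target state is: on a WSL provider such as D601, 10 concurrent `trans ... argv bash -lc 'sleep 2'` calls no longer produce `provider ssh tcp data pool has no idle channel`, stderr is empty, every stdout contains the command's start and end output, and afterwards the labels return to `ready=desired` and `claimed=0`. A fixed end-to-end overhead remains, so the wall-clock time of 10 concurrent short commands may be noticeably higher than the remote command's own runtime; this belongs to later performance work on connection setup, broker scheduling, WSL SSH spawn, or the provider startup path, and must not be masked with queues, gates, or hidden retries.
The current default pool size is 10 channels, designed to cover high-frequency short SSH, concurrent small files, and a single large file that does not block other requests. The verified target state is: on a WSL provider such as D601, 10 concurrent `trans ... sh -- 'sleep 2'` calls no longer produce `provider ssh tcp data pool has no idle channel`, stderr is empty, every stdout contains the command's start and end output, and afterwards the labels return to `ready=desired` and `claimed=0`. A fixed end-to-end overhead remains, so the wall-clock time of 10 concurrent short commands may be noticeably higher than the remote command's own runtime; this belongs to later performance work on connection setup, broker scheduling, WSL SSH spawn, or the provider startup path, and must not be masked with queues, gates, or hidden retries.

The easiest trap during development is mistaking "the dependency layer is online" for "the data plane is usable". `host.ssh` only proves that the provider can run maintenance SSH; only `host.ssh.tcp-pool`, `providerGatewaySshDataPoolReady`, `providerGatewaySshDataPoolClaimed`, and `providerGatewaySshDataPoolLastError` prove the state of the TCP data pool. The second trap is a lost output tail: after receiving `ssh.data`, the backend-core broker must write and flush stdout/stderr before handling `ssh.exit`, otherwise a short command may return rc=0 while the last chunk of stdout never reaches the caller. The third trap is session release: the `ssh.exit`, error, and timeout paths must all release the claimed channel so the next batch of concurrent requests does not see a false pool exhaustion. The fourth trap is pool state drift between core and provider: if the provider returns `host_ssh_error` over the control WebSocket with the hint `requested ssh tcp data channel is not ready`, the channel claimed on the core side is no longer recognized by the provider, and backend-core must drop that `providerId + dataChannelId` instead of releasing it back to the idle pool to be claimed again.
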
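The claim, release, and drop rules for the TCP data pool described above can be pictured with a small in-memory model. This is only a sketch under stated assumptions: the class `DataChannelPool`, its method names, and the `idle`/`claimed` counters are hypothetical and do not describe the actual backend-core implementation; only the error string is taken from the text.

```typescript
// One pre-warmed TCP data channel; the record shape is an assumption of this sketch.
interface DataChannel {
  providerId: string;
  dataChannelId: string;
  claimed: boolean;
}

class DataChannelPool {
  private channels: DataChannel[] = [];

  add(providerId: string, dataChannelId: string): void {
    this.channels.push({ providerId, dataChannelId, claimed: false });
  }

  // Claim an idle channel or fail fast; requests are never queued.
  claim(providerId: string): DataChannel {
    const idle = this.channels.filter((c) => c.providerId === providerId && !c.claimed)[0];
    if (!idle) throw new Error("provider ssh tcp data pool has no idle channel");
    idle.claimed = true;
    return idle;
  }

  // Exit, error, and timeout paths all return the channel to idle.
  release(channel: DataChannel): void {
    channel.claimed = false;
  }

  // The provider no longer recognizes the channel: remove it instead of releasing it.
  drop(providerId: string, dataChannelId: string): void {
    this.channels = this.channels.filter(
      (c) => !(c.providerId === providerId && c.dataChannelId === dataChannelId),
    );
  }

  counts(providerId: string): { idle: number; claimed: number } {
    const own = this.channels.filter((c) => c.providerId === providerId);
    const claimed = own.filter((c) => c.claimed).length;
    return { idle: own.length - claimed, claimed };
  }
}
```

The distinction between `release` and `drop` is the point of the fourth trap: a channel that the provider reports as not ready leaves the pool entirely, so it cannot be claimed again.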
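@@ -162,7 +162,7 @@ backend-core can use real WebSocket dispatch to send online providers `provi

`bun scripts/cli.ts provider triage <PROVIDER_ID>` is the read-only, multi-signal adjudication entry for provider runtime state. Its output must include `decision`, `retryable`, `healthyScopes`, `failedScopes`, `degradedScopes`, `blockingDisposition`, `rationale`, `signals`, and `recommendedCrossChecks`. The long-term semantics of `decision` are: `global-offline` means several independent critical surfaces such as provider heartbeat, Host SSH, k3s, or scheduler failed at the same time with no healthy cross-evidence; `service-degraded` means the registry, service proxy, or a single user service is locally degraded while provider-level healthy signals still exist; `retryable-transient` means a single runner-local, SSH, proxy, or API timeout is insufficient evidence and should be retried or cross-verified; `healthy` means no failure or degradation signal was observed.

`recommendedCrossChecks` must retain the argv-form Host SSH self-check: `trans <PROVIDER_ID> argv true`. This command proves that the non-interactive maintenance bridge is still usable. If the free ssh-like form hits a timeout, `kex_exchange_identification`, or `Connection closed by remote host`, first follow the `UNIDESK_SSH_HINT` printed by the CLI and retest with `trans D601 argv bash -lc '<command>'`, then use `provider triage` to judge whether it really is a provider-level fault.
`recommendedCrossChecks` must retain the argv-form Host SSH self-check: `trans <PROVIDER_ID> argv true`. This command proves that the non-interactive maintenance bridge is still usable. If the free ssh-like form hits a timeout, `kex_exchange_identification`, or `Connection closed by remote host`, first follow the `UNIDESK_SSH_HINT` printed by the CLI and retest with `trans D601 sh -- '<command>'` or `trans D601 bash -- '<bash command>'`, then use `provider triage` to judge whether it really is a provider-level fault.

A long-lived WSL provider such as D601 must not be recorded as globally offline because a single path failed. A typical local degradation is the artifact registry's `unidesk-artifact-registry.service` being inactive while the registry container is still running, the listener is still bound to loopback, and `http://127.0.0.1:5000/v2/` returns 200. This state should show as degraded within the registry scope and land on `decision=service-degraded` in provider triage, prompting only a fix for the systemd drift without blocking judgments about Code Queue, k3sctl-adapter, or business APIs on D601.
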
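To make the four `decision` values described above more concrete, here is a minimal classification sketch. The scope names in `CRITICAL_SCOPES`, the threshold of two failed critical scopes, and the function `decide` are all assumptions chosen for illustration; the real `provider triage` command may weigh its signals differently.

```typescript
type Decision = "global-offline" | "service-degraded" | "retryable-transient" | "healthy";

interface ScopeSummary {
  healthyScopes: string[];
  failedScopes: string[];
  degradedScopes: string[];
}

// Hypothetical scope names for the critical surfaces named in the text.
const CRITICAL_SCOPES = ["heartbeat", "host-ssh", "k3s", "scheduler"];

function decide(s: ScopeSummary): Decision {
  const critical = (scope: string): boolean => CRITICAL_SCOPES.includes(scope);
  if (s.failedScopes.length === 0 && s.degradedScopes.length === 0) return "healthy";

  // Several critical surfaces down at once and no healthy cross-evidence.
  const criticalFailed = s.failedScopes.filter(critical).length;
  if (criticalFailed >= 2 && s.healthyScopes.length === 0) return "global-offline";

  // A local problem (degraded scope or non-critical failure) next to healthy signals.
  const localIssue = s.degradedScopes.length > 0 || s.failedScopes.some((x) => !critical(x));
  if (localIssue && s.healthyScopes.length > 0) return "service-degraded";

  // Anything else is treated as insufficient evidence: retry or cross-check.
  return "retryable-transient";
}
```

With these assumptions, the D601 registry example (registry degraded, heartbeat and Host SSH healthy) yields `service-degraded`, while a single Host SSH timeout next to a healthy heartbeat yields `retryable-transient`.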
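@@ -186,6 +186,6 @@ When a WSL provider needs to call a Windows-only toolchain, prefer, in the WSL user's `~

The maintenance bridge is exposed as the `host.ssh` command through real WebSocket dispatch. The default payload uses `mode: "probe"`: the remote side runs a single short command and returns `UNIDESK_SSH_TEST user=... host=... bridge=host.ssh cwd=...`. For manual diagnosis, `mode: "exec"` with a `command` field can be used explicitly to run a bounded command. Every `host.ssh` execution must have a timeout, and stdout/stderr are shown truncated in the task result. Automatic upgrades and ordinary tasks must still use the Docker socket and `provider.upgrade`; the WSL SSH maintenance bridge must not be used as a scheduling channel.

The human-facing terminal entry is `trans <PROVIDER_ID> [ssh-like args...]`. With no further arguments it opens a remote login shell; with further arguments it runs the remote command and returns the remote exit code. The client side of this entry still connects to the backend-core intranet `/ws/ssh` broker. core uses the provider WebSocket only to send open/dispatch control messages, and the terminal stdin/stdout/stderr data plane must go through the `host.ssh.tcp-pool` TCP warm pool, which the provider opens by connecting out to the main server; this adds no inbound requirement on compute nodes and keeps no legacy WebSocket data fallback. Traditional ssh transport parameters are controlled centrally by provider-gateway environment variables; the CLI only passes through the remote command after the Provider ID and the terminal stdin/stdout/stderr. Non-interactive remote commands should prefer the argv entry: `trans D601 argv true`, or `trans D601 argv bash -lc '<command>'` when shell features are needed. When a WSL node needs a clear view of both the Linux/WSL and the Windows skill sets, use `trans <PROVIDER_ID> skills`; this command only reads `SKILL.md` metadata over the already established maintenance bridge and does not require provider-gateway to add a business API.
The human-facing terminal entry is `trans <PROVIDER_ID> [ssh-like args...]`. With no further arguments it opens a remote login shell; with further arguments it runs the remote command and returns the remote exit code. The client side of this entry still connects to the backend-core intranet `/ws/ssh` broker. core uses the provider WebSocket only to send open/dispatch control messages, and the terminal stdin/stdout/stderr data plane must go through the `host.ssh.tcp-pool` TCP warm pool, which the provider opens by connecting out to the main server; this adds no inbound requirement on compute nodes and keeps no legacy WebSocket data fallback. Traditional ssh transport parameters are controlled centrally by provider-gateway environment variables; the CLI only passes through the remote command after the Provider ID and the terminal stdin/stdout/stderr. Non-interactive single-process remote commands should prefer the argv entry: `trans D601 argv true`; when shell features are needed, write `sh` or `bash` explicitly in the operation position, for example `trans D601 sh -- '<command>'` or `trans D601 bash -- '<bash command>'`. When a WSL node needs a clear view of both the Linux/WSL and the Windows skill sets, use `trans <PROVIDER_ID> skills`; this command only reads `SKILL.md` metadata over the already established maintenance bridge and does not require provider-gateway to add a business API.

To verify the WSL SSH bridge, first start sshd in the target WSL and make sure the maintenance public key is written to the target user's `authorized_keys`, then confirm that `unideskCapabilities` in the target provider's registration labels includes `host.ssh`. After running `bun scripts/cli.ts debug dispatch <PROVIDER_ID> host.ssh --wait-ms 15000`, the result in `debug task latest` or the frontend task history should show `status: succeeded`, a `probeLine` containing `UNIDESK_SSH_TEST`, and `exitCode: 0`, and `hostSshKeyPresent` in the target node's labels should be true. Then run `trans <PROVIDER_ID> argv true` to verify non-interactive argv maintenance commands, and `trans <PROVIDER_ID> hostname` to verify the near-native ssh remote command experience. When self-testing on the compute node itself, pass the same set of commands through the remote CLI: `bun scripts/cli.ts --main-server-ip 74.48.78.17 debug health`, `bun scripts/cli.ts --main-server-ip 74.48.78.17 debug dispatch <PROVIDER_ID> host.ssh --wait-ms 15000`, `bun scripts/cli.ts --main-server-ip 74.48.78.17 ssh <PROVIDER_ID> argv true`, and `bun scripts/cli.ts --main-server-ip 74.48.78.17 ssh <PROVIDER_ID> hostname`. The default remote CLI uses the public frontend login session and does not need the main server SSH key. The health check must show that the Provider is online, `hostSshConfigured=true`, `hostSshKeyPresent=true`, a correct `hostSshTarget`, and `unideskCapabilities` including `host.ssh`; the probe must return `UNIDESK_SSH_TEST`; and `ssh <PROVIDER_ID> argv true` and `ssh <PROVIDER_ID> hostname` must exit with code 0. If a WSL node such as D518 has no public SSH entry, it must still be verified through this provider-gateway self-connecting maintenance bridge rather than requiring the main server to connect directly to the node's public port 22. An older provider that does not declare `host.ssh` must upgrade provider-gateway first, otherwise core rejects SSH passthrough.
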
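The acceptance conditions for the WSL SSH bridge listed above can be collected into one check. The sketch below assumes the labels and the probe result have already been parsed into the hypothetical shapes `HostSshLabels` and `HostSshProbe`; the actual label encoding (for example strings instead of booleans) is not specified in the text and may differ.

```typescript
// Parsed provider labels; this shape is an assumption of the sketch.
interface HostSshLabels {
  online: boolean;
  hostSshConfigured: boolean;
  hostSshKeyPresent: boolean;
  hostSshTarget: string;
  unideskCapabilities: string[];
}

// Parsed result of the host.ssh probe task; also an assumed shape.
interface HostSshProbe {
  status: string;
  probeLine: string;
  exitCode: number;
}

// Returns a list of failed conditions; an empty list means the bridge passed.
function hostSshFailures(labels: HostSshLabels, probe: HostSshProbe, expectedTarget: string): string[] {
  const failures: string[] = [];
  if (!labels.online) failures.push("provider is not online");
  if (!labels.hostSshConfigured) failures.push("hostSshConfigured is not true");
  if (!labels.hostSshKeyPresent) failures.push("hostSshKeyPresent is not true");
  if (labels.hostSshTarget !== expectedTarget) failures.push("hostSshTarget mismatch");
  if (!labels.unideskCapabilities.includes("host.ssh")) failures.push("host.ssh capability missing");
  if (probe.status !== "succeeded") failures.push("probe status is not succeeded");
  if (!probe.probeLine.includes("UNIDESK_SSH_TEST")) failures.push("probeLine lacks UNIDESK_SSH_TEST");
  if (probe.exitCode !== 0) failures.push("probe exit code is not 0");
  return failures;
}
```

Returning every failed condition, rather than stopping at the first, makes it easier to tell an outdated provider (missing capability) apart from a key or target misconfiguration.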
@@ -63,7 +63,7 @@
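**Before starting any formal work (entering scheduling, re-reading the Todo Note, writing back decisions, arranging time blocks), the secretary must fetch the latest Beijing time itself as the reference time.** It must not infer the time from the date at the top of the conversation, a "now" in the context, historical timestamps, or the previous session's "today". A time stated by the user (such as "it is 14:57 now") serves as cross-validation, not as the reference. How to fetch the time (verified 2026-06-01):

- **Local call on the main server** (the secretary's own machine): `TZ='Asia/Shanghai' date '+%Y-%m-%d %H:%M:%S %Z'`; 15:39 CST = real Beijing time, with no need to go through an ssh route. Advantages: low friction, a single command, no dependencies.
- **Cross-node/remote time check** (when the remote host clock needs confirming): `trans <route> script -- 'date "+%Y-%m-%d %H:%M:%S %Z"'`, with the route stating the location (such as `G14` / `D601`).
- **Cross-node/remote time check** (when the remote host clock needs confirming): `trans <route> sh -- 'date "+%Y-%m-%d %H:%M:%S %Z"'`, with the route stating the location (such as `G14` / `D601`).

State the result explicitly in the first sentence of the schedule ("The time I see now is X; I am rescheduling from this"), so the user can spot a time mismatch immediately. The reason: the user may rest or step away between two conversations; if the secretary schedules from a cached time, the plan drifts from the user's actual position, feedback in the "done/not done + blocker" format no longer lines up, and the schedule sequence accumulates misalignment.
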
@@ -32,13 +32,13 @@ trans <PROVIDER_ID>:win skills --limit 20
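
## Windows Long-Lived Process Detach

`trans <PROVIDER_ID>:win cmd ...` and `trans <PROVIDER_ID> script -- powershell.exe ...` are suited to short commands, read-only probes, and bounded skill calls, not to directly starting long-lived Windows processes. Windows `cmd start`, `cmd /c ... &`, PowerShell `Start-Process -PassThru`, or child processes with stdout/stderr redirection may still be waited on by the provider-gateway/SSH broker through the child process tree or inherited handles. The result is that the remote process has in fact started, but `trans` keeps holding the provider session, and subsequent high-frequency D601/G14 calls are queued serially behind the provider session lock.
`trans <PROVIDER_ID>:win cmd ...` and `trans <PROVIDER_ID>:win ps ...` are suited to short commands, read-only probes, and bounded skill calls, not to directly starting long-lived Windows processes. Windows `cmd start`, `cmd /c ... &`, PowerShell `Start-Process -PassThru`, or child processes with stdout/stderr redirection may still be waited on by the provider-gateway/SSH broker through the child process tree or inherited handles. The result is that the remote process has in fact started, but `trans` keeps holding the provider session, and subsequent high-frequency D601/G14 calls are queued serially behind the provider session lock.

Long-lived Windows processes must use a launch model that explicitly detaches from the current `trans` session:

- Prefer writing the startup parameters to a node-private profile or a `.cmd`/`.ps1` file, then starting through Windows Task Scheduler, a Windows Service, NSSM, a PM2 Windows service, or an equivalent supervisor.
- For a temporary experiment only, use `schtasks /Create ... /TR "<cmd file>" /SC ONCE ... /F` followed by `schtasks /Run ...`, and write stdout/stderr to a fixed log file; verify with separate short commands that read `tasklist`, `Get-CimInstance Win32_Process`, the log tail, and service health.
- Do not use `start` or `Start-Process` inside `D601:win cmd` to directly launch long-lived processes such as `hwlab-gateway`, a serial monitor, Codex app-server, or a Keil job watcher. If this was done by mistake and `trans` hangs, first stop the corresponding Windows PID or the locally stuck trans/tran broker process, then switch to a detached supervisor.
- Do not use `start` or `Start-Process` inside `D601:win cmd` / `D601:win ps` to directly launch long-lived processes such as `hwlab-gateway`, a serial monitor, Codex app-server, or a Keil job watcher. If this was done by mistake and `trans` hangs, first stop the corresponding Windows PID or the locally stuck trans/tran broker process, then switch to a detached supervisor.
- The startup command must avoid entering Windows cmd from a WSL UNC cwd; the `cwd` of a long-lived process should be a Windows drive-letter path such as `F:\Work\HWLAB`, and logs should go to the same node-private `.state` or skill state directory.
- The acceptance criterion for a long-lived process is not "the startup command returned", but that a separate short command can see the PID, the first log line, the health endpoint, or that the cloud-side session/resource/capability is registered.
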
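For the temporary-experiment path in the list above, the two `schtasks` invocations can be assembled as argv arrays so that no nested shell quoting is involved. This is a sketch only: the function `buildDetachedTask`, the task name, and the `.cmd` path used in the usage note are hypothetical, and requiring the command file itself to sit on a drive-letter path is an assumption extrapolated from the cwd rule.

```typescript
interface DetachedTaskCommands {
  create: string[]; // argv for creating the one-shot task
  run: string[];    // argv for starting it immediately
}

// Hypothetical helper: build schtasks argv for a one-shot detached launch.
function buildDetachedTask(taskName: string, cmdFile: string, startTime: string): DetachedTaskCommands {
  // Require a drive-letter path such as F:\... and reject UNC paths such as \\wsl$\...
  if (!/^[A-Za-z]:\\/.test(cmdFile)) {
    throw new Error(`cmd file must be a Windows drive-letter path: ${cmdFile}`);
  }
  // schtasks expects /ST in 24-hour HH:mm form.
  if (!/^\d{2}:\d{2}$/.test(startTime)) {
    throw new Error(`start time must be HH:mm: ${startTime}`);
  }
  return {
    create: ["schtasks", "/Create", "/TN", taskName, "/TR", cmdFile, "/SC", "ONCE", "/ST", startTime, "/F"],
    run: ["schtasks", "/Run", "/TN", taskName],
  };
}
```

A call such as `buildDetachedTask("hwlab-gateway-once", "F:\\Work\\HWLAB\\start-gateway.cmd", "23:59")` yields the two argv arrays; the command file is expected to redirect its own stdout/stderr to a fixed log, and verification still happens through separate short commands.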