docs: converge trans shell examples

Codex
2026-06-15 05:25:59 +00:00
parent e504b7b3b4
commit ceb3fb4627
15 changed files with 65 additions and 65 deletions
+1 -1
@@ -118,7 +118,7 @@ dry-run output exposes the registry probe URL, required labels, target image,
`status` and `health` run through:
```bash
trans D601 argv bash -lc '<readonly script>'
trans D601 sh -- '<readonly shell command>'
```
This is a read-only check of D601 state. The check items include:
+4 -4
@@ -62,14 +62,14 @@ If a manual repair is needed to unblock the platform, the durable fix must be co
Distributed runtime work should prefer structured CLI passthrough over ad-hoc nested shell strings. The standard escalation order is:
1. Use a purpose-built UniDesk route plus operation or helper such as `trans D601:k3s kubectl ...`, `trans D601:k3s script`, `trans D601:k3s:<namespace>:<workload> logs`, `trans D601:k3s:<namespace>:<workload> script`, `trans D601:k3s:<namespace>:<workload>[:<container>] apply-patch --cwd /workspace`, `trans <providerId>:/absolute/workspace apply-patch`, `trans <providerId> py`, `trans <providerId> find`, `trans <providerId> glob` or `trans <providerId> skills`. Use legacy `apply-patch-v1` only when the old remote helper is explicitly required.
1. Use a purpose-built UniDesk route plus operation or helper such as `trans D601:k3s kubectl ...`, `trans D601:k3s sh`, `trans D601:k3s:<namespace>:<workload> logs`, `trans D601:k3s:<namespace>:<workload> sh`, `trans D601:k3s:<namespace>:<workload>[:<container>] apply-patch --cwd /workspace`, `trans <providerId>:/absolute/workspace apply-patch`, `trans <providerId> py`, `trans <providerId> find`, `trans <providerId> glob` or `trans <providerId> skills`. Use legacy `apply-patch-v1` only when the old remote helper is explicitly required.
2. If no helper exists, use `trans <providerId> argv <command> [args...]` so the CLI quotes each argv token once.
3. If shell features such as pipes, redirects, loops or variable expansion are required, use a single quoted heredoc with `trans <providerId> script` or `trans D601:k3s:<namespace>:<workload> script` so the script body travels over stdin instead of through shell command-string arguments.
3. If shell features such as pipes, redirects, loops or variable expansion are required, use a single quoted heredoc with explicit `trans <providerId> sh|bash` or `trans D601:k3s:<namespace>:<workload> sh|bash` so the script body travels over stdin instead of through shell command-string arguments.
4. Treat free-form ssh-like command strings as an interactive compatibility path, not as the default automation surface.
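To picture what "quotes each argv token once" in step 2 means, here is a minimal POSIX shell sketch. `quote_token` and `quote_argv` are hypothetical helpers written only for illustration; the real quoting code inside the `trans` CLI is not shown in this document and may differ.

```bash
# Hypothetical helper: single-quote one token so a POSIX shell reads it back
# as exactly the same string. Trailing newlines in a token are not preserved.
quote_token() {
  # Replace every ' with '\'' and wrap the result in single quotes.
  printf "'%s'" "$(printf '%s' "$1" | sed "s/'/'\\\\''/g")"
}

# Hypothetical helper: quote every argv token exactly once, joined by spaces.
quote_argv() {
  out=''
  for tok in "$@"; do
    out="${out}${out:+ }$(quote_token "$tok")"
  done
  printf '%s\n' "$out"
}

quote_argv grep -c "it's here" 'a b.txt'
# prints: 'grep' '-c' 'it'\''s here' 'a b.txt'
```

Because each token is quoted exactly once, a single remote shell parse restores the original argv. Hand-written nested command strings lose this property as soon as a second quoting layer is added.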
For D601 Kubernetes work, route syntax is preferred over positional shell recipes, but the route must stay a pure locator. `D601:k3s` means the native k3s control plane, and `D601:k3s:<namespace>:<workload>[:container]` means a namespaced workload or pod/container. `:` is the distributed route separator; `/` is only an in-container filesystem cwd, so container selection must use `:<container>` or `--container <container>`, not `pod/<pod>/<container>`. Operations come after the route: `kubectl` runs on the control plane, `logs` reads bounded workload logs, `script` streams a local heredoc/stdin script into the host or target pod, and `apply-patch --cwd /workspace` is the default remote text patch operation for pod workspaces. The route-operation split keeps distributed location and execution behavior independently extensible, fixes `KUBECONFIG=/etc/rancher/k3s/k3s.yaml`, refuses long-follow logs, and assembles common `kubectl exec` / `kubectl logs` / stdin script / pod patch target arguments without adding a provider-gateway protocol change. This prevents the common failure mode where a command crosses local shell, UniDesk SSH broker, remote shell command strings, `kubectl exec`, and container shell quoting layers before reaching the process that should run it.
For D601 Kubernetes work, route syntax is preferred over positional shell recipes, but the route must stay a pure locator. `D601:k3s` means the native k3s control plane, and `D601:k3s:<namespace>:<workload>[:container]` means a namespaced workload or pod/container. `:` is the distributed route separator; `/` is only an in-container filesystem cwd, so container selection must use `:<container>` or `--container <container>`, not `pod/<pod>/<container>`. Operations come after the route: `kubectl` runs on the control plane, `logs` reads bounded workload logs, `sh`/`bash` stream a local heredoc/stdin script into the host or target pod with an explicit shell dialect, and `apply-patch --cwd /workspace` is the default remote text patch operation for pod workspaces. The route-operation split keeps distributed location and execution behavior independently extensible, fixes `KUBECONFIG=/etc/rancher/k3s/k3s.yaml`, refuses long-follow logs, and assembles common `kubectl exec` / `kubectl logs` / stdin shell / pod patch target arguments without adding a provider-gateway protocol change. This prevents the common failure mode where a command crosses local shell, UniDesk SSH broker, remote shell command strings, `kubectl exec`, and container shell quoting layers before reaching the process that should run it.
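The route grammar described above (`:` separates route segments, `/` only appears in a filesystem cwd) can be sketched as a small splitter. `describe_route` is a hypothetical function for illustration, not the real UniDesk route parser, and the namespace, workload and container names in the usage lines are made up.

```bash
# Hypothetical route splitter: shows how a route stays a pure locator.
describe_route() {
  route=$1
  provider=${route%%:*}            # text before the first ':'
  rest=${route#"$provider"}
  rest=${rest#:}
  case $rest in
    '')   printf 'provider=%s target=host\n' "$provider" ;;
    /*)   printf 'provider=%s target=workspace cwd=%s\n' "$provider" "$rest" ;;
    k3s)  printf 'provider=%s target=k3s-control-plane\n' "$provider" ;;
    k3s:*)
      spec=${rest#k3s:}
      namespace=${spec%%:*}
      spec=${spec#"$namespace"}; spec=${spec#:}
      workload=${spec%%:*}
      container=${spec#"$workload"}; container=${container#:}
      printf 'provider=%s target=workload namespace=%s workload=%s container=%s\n' \
        "$provider" "$namespace" "$workload" "${container:-<default>}"
      ;;
    *)    printf 'unrecognized route: %s\n' "$route" >&2; return 1 ;;
  esac
}

describe_route G14:/root/hwlab-v02
describe_route D601:k3s
describe_route D601:k3s:demo-ns:demo-workload:demo-container
```

The operation (`kubectl`, `logs`, `sh`, `apply-patch`) would be whatever follows the route on the command line, which is why the sketch never inspects it.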
Longer scripts should move across stdin (`trans py`, `trans script` or k3s `script` operation), and remote text patches should default to `apply-patch` with a host or pod workspace route. Legacy `apply-patch-v1` remains available as the explicit fallback and uses the injected `sh` helper path instead of assuming target containers have `python3`, `node` or repository-local tools. Avoid heredocs nested inside remote command strings, `python - <<EOF` inside SSH strings, or JSON/Markdown bodies passed through shell arguments. These patterns often bind stdin to the wrong process, strip quotes, or leave a half-open provider SSH session that looks like a platform outage.
Longer scripts should move across stdin (`trans py`, explicit `trans sh|bash`, or k3s `sh|bash` operation), and remote text patches should default to `apply-patch` with a host or pod workspace route. Legacy `apply-patch-v1` remains available as the explicit fallback and uses the injected `sh` helper path instead of assuming target containers have `python3`, `node` or repository-local tools. Avoid heredocs nested inside remote command strings, `python - <<EOF` inside SSH strings, or JSON/Markdown bodies passed through shell arguments. These patterns often bind stdin to the wrong process, strip quotes, or leave a half-open provider SSH session that looks like a platform outage.
When structured passthrough is missing for a recurring workflow, fix the CLI first and then document the durable helper. Do not preserve a growing collection of one-off shell recipes as the long-term runbook.
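The preference for a single quoted heredoc over stdin can be checked locally. In the sketch below, `remote_sh` is a stand-in that simply runs a local `sh` reading from stdin; it assumes, but does not show, that `trans <providerId> sh` forwards stdin in the same way. Only the quoting behavior of the calling shell is demonstrated.

```bash
# Stand-in for `trans <providerId> sh`: a local sh that reads its script from stdin.
remote_sh() { sh; }

GREETING='set on the caller side'

# Quoted delimiter ('SH'): the calling shell does not expand anything in the body,
# so variables and pipes are evaluated by the receiving shell.
quoted_out=$(remote_sh <<'SH'
GREETING='set on the receiving side'
printf '%s\n' "$GREETING" | tr '[:lower:]' '[:upper:]'
SH
)

# Unquoted delimiter (SH): the calling shell expands $GREETING before sending.
unquoted_out=$(remote_sh <<SH
printf '%s\n' "$GREETING"
SH
)

printf 'quoted:   %s\n' "$quoted_out"     # quoted:   SET ON THE RECEIVING SIDE
printf 'unquoted: %s\n' "$unquoted_out"   # unquoted: set on the caller side
```

With the quoted delimiter the body reaches the receiving shell byte for byte, which is the property the stdin route relies on.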
+6 -6
@@ -15,8 +15,8 @@ G14 platform DB is native PostgreSQL on the G14 host OS, not a k3s workload
The G14 platform database is always managed by systemd:
```bash
trans G14 script -- 'systemctl status postgresql'
trans G14 script -- '/usr/local/sbin/g14-platform-db-health'
trans G14 sh -- 'systemctl status postgresql'
trans G14 sh -- '/usr/local/sbin/g14-platform-db-health'
```
PostgreSQL listens only on the G14 host loopback and on the node gateway address reachable from k3s pods:
@@ -76,8 +76,8 @@ The HWLAB v0.3 source truth is `G14:/root/hwlab-v03`, branch `v0.3`. `deploy/
Standard verification:
```bash
trans G14:/root/hwlab-v03 script -- 'npm run gitops:ts:check'
trans G14:/root/hwlab-v03 script -- 'npm run gitops:render -- --lane v03 --out /tmp/hwlab-v03-render-check'
trans G14:/root/hwlab-v03 sh -- 'npm run gitops:ts:check'
trans G14:/root/hwlab-v03 sh -- 'npm run gitops:render -- --lane v03 --out /tmp/hwlab-v03-render-check'
bun scripts/cli.ts hwlab nodes control-plane trigger-current --node G14 --lane v03 --confirm
bun scripts/cli.ts hwlab nodes control-plane status --node G14 --lane v03 --pipeline-run <pipeline-run>
```
@@ -132,8 +132,8 @@ bun scripts/cli.ts hwlab nodes secret cleanup-obsolete --node G14 --lane v03 --n
The backup script lives on the G14 host:
```bash
trans G14 script -- 'systemctl status g14-platform-db-backup.timer'
trans G14 script -- '/usr/local/sbin/g14-platform-db-backup'
trans G14 sh -- 'systemctl status g14-platform-db-backup.timer'
trans G14 sh -- '/usr/local/sbin/g14-platform-db-backup'
```
Backup directory:
+18 -18
@@ -25,7 +25,7 @@ The old `/root/hwlab` workspace on branch `G14` is no longer a default source tr
The standard entry forms are:
```bash
trans G14:/root/hwlab script -- 'git fetch origin G14 && git pull --ff-only origin G14 && git status --short --branch && git remote -v'
trans G14:/root/hwlab sh -- 'git fetch origin G14 && git pull --ff-only origin G14 && git status --short --branch && git remote -v'
trans G14:/root/hwlab apply-patch < patch.diff
trans G14:k3s kubectl get pods -n hwlab-v02
```
@@ -55,7 +55,7 @@ HWLAB `v0.2` is the supported G14 runtime lane for the v0.2 branch. It must not
The fixed `v0.2` source branch is `v0.2`, forked from the current `G14` branch after the G14 long-term reference docs record this decision. The fixed G14 development workspace for that branch is:
```bash
trans G14:/root/hwlab-v02 script -- 'git status --short --branch && git remote -v'
trans G14:/root/hwlab-v02 sh -- 'git status --short --branch && git remote -v'
```
`/root/hwlab-v02` is the long-lived `v0.2` development workspace, not a scratch clone or CI/CD source selector. It must track `origin/v0.2` with `origin git@github.com:pikasTech/HWLAB.git`; local dirty state, stale `HEAD`, and untracked `.worktree/` only affect human development. Do not reuse retired `/root/hwlab` or `/root/hwlab/.worktree/*` as the `v0.2` fixed workspace.
@@ -122,13 +122,13 @@ The generic P2/P3/P4 flow is owned by `$dad-dev`; this section fixes the G14/v0.
Direct-lightweight precheck:
```bash
trans G14:/root/hwlab-v02 script -- 'git fetch origin v0.2 && git pull --ff-only origin v0.2 && git status --short --branch'
trans G14:/root/hwlab-v02 sh -- 'git fetch origin v0.2 && git pull --ff-only origin v0.2 && git status --short --branch'
```
Service workflow setup:
```bash
trans G14:/root/hwlab-v02 script -- 'git worktree add .worktree/<task> -b fix/issue<N>-<short-name> origin/v0.2'
trans G14:/root/hwlab-v02 sh -- 'git worktree add .worktree/<task> -b fix/issue<N>-<short-name> origin/v0.2'
```
The fixed repo at `/root/hwlab-v02` is not a scratch area for service/runtime work, but it is the direct-lightweight source workspace. When a direct-lightweight task sees parallel dirty state in the fixed repo, inspect and include or separate it according to the current user instruction and project Git rules; never discard it silently. Worktree branches for service workflow should follow the `fix/issue<N>-<short-name>` naming so PR titles and merge commits stay scannable. GitHub PR writes, merge, rollout trigger and final original-entry validation follow `$dad-dev` plus the UniDesk CLI control rules in `AGENTS.md`.
@@ -137,17 +137,17 @@ The fixed repo at `/root/hwlab-v02` is not a scratch area for service/runtime wo
Direct-lightweight commits are allowed and do not need recovery. A direct commit on `v0.2` only needs recovery when it changed service/runtime/GitOps/CI/CD/public behavior that should have used the PR/rollout workflow. The recovery is bounded and audit-friendly, but it is also a `git push --force-with-lease` against the protected branch, so it is only acceptable when the unapproved direct commit is the only new content on `v0.2` since the last merged PR:
1. Confirm no parallel worktree was in flight and the commit is the only delta. `trans G14:/root/hwlab-v02 script -- 'git log origin/v0.2..HEAD'` and `git log HEAD..origin/v0.2` must show the direct commit as a single fast-forward candidate.
1. Confirm no parallel worktree was in flight and the commit is the only delta. `trans G14:/root/hwlab-v02 sh -- 'git log origin/v0.2..HEAD'` and `git log HEAD..origin/v0.2` must show the direct commit as a single fast-forward candidate.
2. Capture the commit identity and patch for the recovery record:
```bash
trans G14:/root/hwlab-v02 script -- 'git show <direct-commit-sha> > /tmp/v0.2-recovery.patch'
trans G14:/root/hwlab-v02 sh -- 'git show <direct-commit-sha> > /tmp/v0.2-recovery.patch'
```
3. Roll the fixed repo back to the previous merged PR head. Use `git reset --hard <previous-pr-sha>`; this preserves any autostash (e.g. from a parallel `git checkout` snapshot in another worktree) on the stash list and does not touch the other worktree's working tree.
4. In the pre-existing worktree (e.g. `.worktree/<task>` on `fix/issue<N>-<short-name>`) bring the branch up to the previous PR head with `trans G14:/root/hwlab-v02/.worktree/<task> script -- 'git reset --hard <previous-pr-sha>'`, then `git cherry-pick <direct-commit-sha>` to replay the direct commit on the feature branch. If the worktree branch was already a clean clone of `origin/v0.2` at the previous PR head, the reset is a no-op.
4. In the pre-existing worktree (e.g. `.worktree/<task>` on `fix/issue<N>-<short-name>`) bring the branch up to the previous PR head with `trans G14:/root/hwlab-v02/.worktree/<task> sh -- 'git reset --hard <previous-pr-sha>'`, then `git cherry-pick <direct-commit-sha>` to replay the direct commit on the feature branch. If the worktree branch was already a clean clone of `origin/v0.2` at the previous PR head, the reset is a no-op.
5. Push the feature branch and force-push `v0.2` back to the rolled-back head with `--force-with-lease` (refuses to clobber a concurrent push):
```bash
trans G14:/root/hwlab-v02/.worktree/<task> script -- 'git push -u origin fix/issue<N>-<short-name>'
trans G14:/root/hwlab-v02 script -- 'git push --force-with-lease origin v0.2'
trans G14:/root/hwlab-v02/.worktree/<task> sh -- 'git push -u origin fix/issue<N>-<short-name>'
trans G14:/root/hwlab-v02 sh -- 'git push --force-with-lease origin v0.2'
```
6. Open the PR through UniDesk CLI, squash-merge, then `git pull --ff-only origin v0.2` to bring the fixed repo back in sync. The previous PR's merge commit will not be in the new PR's history; the new PR's diff equals the original direct commit's diff, so the PR trail still contains the exact same bytes.
7. `bun scripts/cli.ts hwlab g14 control-plane status --lane v02` will read the new merge commit; the previously-staged PipelineRun for the direct commit was created on the v0.2 head and `trigger-current` will delete + recreate it for the post-merge head, so no manual PipelineRun cleanup is required.
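Step 1's "only delta" check can be made mechanical. The sketch below assumes `git` is available and wraps the two `git log` range comparisons in a hypothetical helper, `is_single_commit_delta`, using `git rev-list --count`. It is an illustration of the check, not part of the documented recovery tooling.

```bash
# Hypothetical helper: succeed only when <head> is exactly one commit ahead of
# <base> and <base> contains nothing that <head> lacks.
is_single_commit_delta() {
  base=$1
  head=$2
  ahead=$(git rev-list --count "$base..$head") || return 2
  behind=$(git rev-list --count "$head..$base") || return 2
  [ "$ahead" -eq 1 ] && [ "$behind" -eq 0 ]
}

# Possible usage inside the fixed repo, before any reset or force-push:
#   is_single_commit_delta origin/v0.2 HEAD || echo 'stop: not a single-commit delta'
```

A nonzero result means the precondition for the bounded recovery does not hold, so the force-push path should not be taken.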
@@ -160,7 +160,7 @@ Cloud Web layout, status-panel, collapsed-control, and modal issues on `v0.2` ne
Use these surfaces together:
- `trans G14:/root/hwlab-v02/.worktree/<task>/web/hwlab-cloud-web script -- 'bun run check'` for approved static source/layout checks and dist freshness.
- `trans G14:/root/hwlab-v02/.worktree/<task>/web/hwlab-cloud-web sh -- 'bun run check'` for approved static source/layout checks and dist freshness.
- `bun scripts/cli.ts hwlab g14 control-plane status --lane v02` for runtime, Argo, public endpoint, and GitOps alignment. If `origin/v0.2` moved through a parallel PR, use `--pipeline-run` or `--source-commit` and treat same-branch supersession as context rather than failure.
- Public API probes for both `/health/live` and `/v1/live-builds`. `/health/live` proves live service health/revision, but Cloud Web build time, image tag/digest, source metadata, and actual runtime commit/revision should be read from `/v1/live-builds`.
- A bounded browser/DOM probe against `http://74.48.78.17:19666/` that asserts the deployed page state relevant to the issue.
@@ -183,9 +183,9 @@ When a `v0.2` Cloud Web fix removes a button from `index.html` or a field from t
```bash
# 1. Web assets rebuild and the orphan is gone from the dist
trans G14:/root/hwlab-v02/.worktree/<task> script -- 'cd web/hwlab-cloud-web && bun run build'
trans G14:/root/hwlab-v02/.worktree/<task> script -- "grep -c '<removed-field>' web/hwlab-cloud-web/dist/app.js" # must be 0
trans G14:/root/hwlab-v02/.worktree/<task> script -- "grep -c 'id=\"<removed-id>\"' web/hwlab-cloud-web/index.html" # must be 0
trans G14:/root/hwlab-v02/.worktree/<task> sh -- 'cd web/hwlab-cloud-web && bun run build'
trans G14:/root/hwlab-v02/.worktree/<task> sh -- "grep -c '<removed-field>' web/hwlab-cloud-web/dist/app.js" # must be 0
trans G14:/root/hwlab-v02/.worktree/<task> sh -- "grep -c 'id=\"<removed-id>\"' web/hwlab-cloud-web/index.html" # must be 0
# 2. Live 19666/19667 confirms the deployed bundle is the new build
curl -fsS http://74.48.78.17:19666/ | grep -c '<removed-id>' # must be 0
@@ -196,7 +196,7 @@ bun scripts/cli.ts hwlab g14 control-plane status --lane v02
While the PR is open, the author can also run a one-liner to surface any orphan `el.<field>.addEventListener` whose field is not declared in the `el` literal of `app.ts`:
```bash
trans G14:/root/hwlab-v02/.worktree/<task> script -- 'awk "/^const el = /,/^};/" web/hwlab-cloud-web/app.ts | tr -d "," | awk "{print \$1}" | grep -E "^[a-zA-Z]" | sort -u > /tmp/el-fields.txt; grep -nEo "el\\.([A-Za-z_$][A-Za-z0-9_$]*)\\.addEventListener" web/hwlab-cloud-web/*.ts | while read m; do field=$(echo "$m" | sed -E "s/.*el\\.([A-Za-z_$][A-Za-z0-9_$]*)\\.addEventListener.*/\\1/"); if ! grep -q "^$field$" /tmp/el-fields.txt; then echo "ORPHAN: el.$field.addEventListener"; fi; done'
trans G14:/root/hwlab-v02/.worktree/<task> sh -- 'awk "/^const el = /,/^};/" web/hwlab-cloud-web/app.ts | tr -d "," | awk "{print \$1}" | grep -E "^[a-zA-Z]" | sort -u > /tmp/el-fields.txt; grep -nEo "el\\.([A-Za-z_$][A-Za-z0-9_$]*)\\.addEventListener" web/hwlab-cloud-web/*.ts | while read m; do field=$(echo "$m" | sed -E "s/.*el\\.([A-Za-z_$][A-Za-z0-9_$]*)\\.addEventListener.*/\\1/"); if ! grep -q "^$field$" /tmp/el-fields.txt; then echo "ORPHAN: el.$field.addEventListener"; fi; done'
```
Document the explicit `grep` / curl evidence in the issue closeout comment. Tightening the `el` literal with proper TypeScript types is tracked separately and must not be done as part of a runtime fix PR.
@@ -276,10 +276,10 @@ A live `workspace.evidence` / `debug.evidence` / `download evidence` selector th
Device-pod fixes still follow `$dad-dev` and the service/runtime side of the `## v0.2 Source Workflow` route above. The device-pod-specific closeout is the three-layer runtime matrix below; keep these checks because they prove the cloud-api -> executor -> D601 host chain, while generic PR/CI/CD and worktree mechanics stay in `$dad-dev`.
```bash
trans G14:/root/hwlab-v02/.worktree/<task> script -- 'cd tools && bun test device-pod-cli.test.ts'
trans G14:/root/hwlab-v02/.worktree/<task> script -- 'cd cmd/hwlab-device-pod && bun test main.test.ts'
trans G14:/root/hwlab-v02/.worktree/<task> script -- 'cd internal/cloud && bun test access-control.test.ts'
trans G14:/root/hwlab-v02/.worktree/<task> script -- 'node --check skills/device-pod-cli/assets/device-host-cli.mjs'
trans G14:/root/hwlab-v02/.worktree/<task> sh -- 'cd tools && bun test device-pod-cli.test.ts'
trans G14:/root/hwlab-v02/.worktree/<task> sh -- 'cd cmd/hwlab-device-pod && bun test main.test.ts'
trans G14:/root/hwlab-v02/.worktree/<task> sh -- 'cd internal/cloud && bun test access-control.test.ts'
trans G14:/root/hwlab-v02/.worktree/<task> sh -- 'node --check skills/device-pod-cli/assets/device-host-cli.mjs'
```
Treat `access-control.test.ts` workbench failures as pre-existing on the v0.2 base unless the new test list explicitly covers them. After PR merge and `trigger-current --lane v02 --confirm`, the live `http://74.48.78.17:19667/` CLI acceptance check must hit all three layers:
+11 -11
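@@ -214,11 +214,11 @@ Registry reports must distinguish `uniqueBlobBytes` from `sharedBlobBytes`. Multiple rep
G14 space audits are read-only by default. When a report is needed, collect the following summaries first and avoid dumping large JSON in full: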
```bash
trans G14 script -- 'df -h / | tail -1'
trans G14 script -- 'du -xh -d 1 / /var /var/lib /root 2>/dev/null | sort -h | tail -40'
trans G14 script -- 'du -xh -d 2 /var/lib/rancher/k3s /var/lib/containerd /var/log 2>/dev/null | sort -h | tail -80'
trans G14 script -- 'KUBECONFIG=/etc/rancher/k3s/k3s.yaml kubectl get pv,pvc,pod -A -o wide'
trans G14 script -- 'find /var/lib/hwlab/registry/docker/registry/v2/repositories -path "*/_manifests/tags/*/current/link" -type f | wc -l'
trans G14 sh -- 'df -h / | tail -1'
trans G14 sh -- 'du -xh -d 1 / /var /var/lib /root 2>/dev/null | sort -h | tail -40'
trans G14 sh -- 'du -xh -d 2 /var/lib/rancher/k3s /var/lib/containerd /var/log 2>/dev/null | sort -h | tail -80'
trans G14:k3s kubectl get pv,pvc,pod -A -o wide
trans G14 sh -- 'find /var/lib/hwlab/registry/docker/registry/v2/repositories -path "*/_manifests/tags/*/current/link" -type f | wc -l'
```
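When digging into the registry, the report must include at least repo, tag count, manifest revision count, latest tags, protected digest closure, unique blob bytes and shared blob bytes. When digging into the k3s runtime, the report must include at least namespace/PVC, PV host path, owner workload, actual PVC usage, and the total size of k3s containerd snapshots/blobs. Do not simply add `/var/lib/kubelet/pods` and `/var/lib/rancher/k3s/storage` together: kubelet pod directories may contain PVC bind mounts or runtime metadata, so the sum risks double counting.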
@@ -226,8 +226,8 @@ trans G14 script -- 'find /var/lib/hwlab/registry/docker/registry/v2/repositorie
When digging into logs and worktrees, report read-only by default and do not clean up directly:
```bash
trans G14 script -- 'du -xh -d 1 /var/log 2>/dev/null | sort -h | tail -40'
trans G14 script -- 'du -xh -d 2 /root/hwlab-v02/.worktree 2>/dev/null | sort -h | tail -60'
trans G14 sh -- 'du -xh -d 1 /var/log 2>/dev/null | sort -h | tail -40'
trans G14 sh -- 'du -xh -d 2 /root/hwlab-v02/.worktree 2>/dev/null | sort -h | tail -60'
```
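rsyslog file logs are not among the objects that `gc remote` may change by default. If `/var/log/syslog*`, `/var/log/kern.log*` or similar files become the last gap to the 50% target, first add a controlled logrotate/compress/truncate CLI whose output discloses the retained tail, the compressed objects, the estimated space freed and the failure recovery path; running `truncate` directly or deleting log files is forbidden as a long-term procedure. `/root/hwlab-v02/.worktree` may be cleaned only after owner, branch, dirty state and rebuildability are established, and must not be deleted based on directory size alone.
@@ -237,10 +237,10 @@ rsyslog file logs are not among the objects that `gc remote` may change by default. If `/va
After G14 GC, verify: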
```bash
trans G14 script -- 'df -h / | tail -1'
trans G14 script -- 'curl -fsS http://127.0.0.1:5000/v2/ >/dev/null && echo ok'
trans G14 script -- 'KUBECONFIG=/etc/rancher/k3s/k3s.yaml kubectl -n hwlab-ci get deploy hwlab-registry'
trans G14 script -- 'KUBECONFIG=/etc/rancher/k3s/k3s.yaml kubectl -n hwlab-ci get cronjob hwlab-g14-branch-poller -o custom-columns=NAME:.metadata.name,SUSPEND:.spec.suspend --no-headers && ! KUBECONFIG=/etc/rancher/k3s/k3s.yaml kubectl -n hwlab-ci get cronjob hwlab-v02-branch-poller >/dev/null 2>&1'
trans G14 sh -- 'df -h / | tail -1'
trans G14 sh -- 'curl -fsS http://127.0.0.1:5000/v2/ >/dev/null && echo ok'
trans G14:k3s kubectl -n hwlab-ci get deploy hwlab-registry
trans G14:k3s sh -- 'kubectl -n hwlab-ci get cronjob hwlab-g14-branch-poller -o custom-columns=NAME:.metadata.name,SUSPEND:.spec.suspend --no-headers && ! kubectl -n hwlab-ci get cronjob hwlab-v02-branch-poller >/dev/null 2>&1'
```
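DEV workload verification should check whether workloads with nonzero replicas are ready; an explicitly disabled `0/0` deployment must not be misreported as an incident. The registry tag count is only supporting evidence and cannot replace workload ref protection and registry API health.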
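The readiness rule for deployments can be sketched as a filter over rows in the `NAME READY ...` layout that `kubectl get deploy --no-headers` prints. `not_ready_deployments` is a hypothetical helper, and the sample rows are made up for illustration rather than captured from G14.

```bash
# Hypothetical filter: print deployments whose desired replica count is nonzero
# but not fully ready; explicitly disabled 0/0 deployments are ignored.
not_ready_deployments() {
  awk '{
    split($2, r, "/")                      # r[1] = ready, r[2] = desired
    if (r[2] + 0 > 0 && r[1] + 0 < r[2] + 0) print $1
  }'
}

# Made-up sample rows (NAME READY UP-TO-DATE AVAILABLE AGE), not real cluster output.
sample='demo-api 2/2 2 2 10d
demo-web 1/3 3 1 10d
demo-paused 0/0 0 0 10d'

printf '%s\n' "$sample" | not_ready_deployments
# prints: demo-web
```

An empty result means every deployment with nonzero replicas is fully ready, while `0/0` rows never appear in the output.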
+1 -1
@@ -86,7 +86,7 @@ Sanitizer rules: recursively scans `ResponsesRequest.input`, repairs tool-call `
## MiniMax Apply-Patch Operations
MiniMax-backed sessions must use the same UniDesk remote text patch contract as other agents: route first, operation second, and `apply-patch` v2 by default. The stable write shape is `trans <provider>:/absolute/workspace apply-patch < patch.diff`; read-only inspection may use `trans <provider>:/absolute/workspace script -- 'nl -ba file'` or equivalent bounded commands.
MiniMax-backed sessions must use the same UniDesk remote text patch contract as other agents: route first, operation second, and `apply-patch` v2 by default. The stable write shape is `trans <provider>:/absolute/workspace apply-patch < patch.diff`; read-only inspection may use `trans <provider>:/absolute/workspace nl -ba file` or equivalent bounded commands.
- If `apply-patch` reports `failed to find expected lines`, first read the exact current target block, then retry with a smaller `Update File` hunk, an `@@ <unique anchor>` hint, or multiple small hunks. This is normal stale-context recovery, not a reason to switch tools.
- Do not recover text patch failures by using `download` / `upload`, remote Python/Perl/sed heredocs, `cat >` / `tee` whole-file rewrites, or `apply-patch-v1`, unless `apply-patch` itself is unavailable or the target is non-text / bulk mechanical generated content.
+2 -2
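@@ -42,7 +42,7 @@ UniDesk user services are mounted on the UniDesk core services and are user-facing
Business repositories are maintained by the business systems themselves, including source code, Dockerfile, docker-compose, configuration templates and business tests. UniDesk only references the business repository URL, commit id, Dockerfile/docker-compose paths and running container names; the full business code must not be copied into `src/components/microservices/`, which would create dual maintenance. `src/components/microservices/` may hold only generic examples or UniDesk's own examples and is not a mirror of business repositories.
The Code Queue runner is also a distributed development execution surface. The runner image must ship with `tran` built in so that, while executing tasks, the runner can reach D601, G14, host workspaces, the k3s control plane and target pods through the public frontend control plane. Inside the runner, prefer structured commands such as `tran <provider> argv ...`, `tran <provider>:k3s kubectl ...` and `tran <provider>:k3s:<namespace>:<workload> argv ...`; the stdin-based `script`, `apply-patch` and `py` operations likewise run over the frontend `/ws/ssh` streaming channel and must not fall back to `/api/dispatch` task polling. This boundary avoids making provider tokens, backend-core internal DNS or multi-layer quoting of long commands a precondition for runner availability, and also prevents large stdout from being truncated by task JSON compact.
The Code Queue runner is also a distributed development execution surface. The runner image must ship with `tran` built in so that, while executing tasks, the runner can reach D601, G14, host workspaces, the k3s control plane and target pods through the public frontend control plane. Inside the runner, prefer structured commands such as `tran <provider> argv ...`, `tran <provider>:k3s kubectl ...` and `tran <provider>:k3s:<namespace>:<workload> argv ...`; the stdin-based `sh`/`bash`, `apply-patch` and `py` operations likewise run over the frontend `/ws/ssh` streaming channel and must not fall back to `/api/dispatch` task polling. This boundary avoids making provider tokens, backend-core internal DNS or multi-layer quoting of long commands a precondition for runner availability, and also prevents large stdout from being truncated by task JSON compact.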
## Main Server User Services
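@@ -227,7 +227,7 @@ On D601 the native k3s kubeconfig must be used explicitly: `KUBECONFIG=/etc/rancher/k3s/k
- Skill injection boundary: DEV Code Queue scheduler/read/write Pods must mount the host `/home/ubuntu/.agents/skills` read-only at `/root/.agents/skills` in the container and set `UNIDESK_SKILLS_PATH=/root/.agents/skills`, so that executing tasks can read skills such as `cli-spec`. Only the skill directory itself may be mounted; the host `~/.agents`, `~/.codex`, tokens, auth JSON or other private configuration must not be exposed wholesale to task containers. `/health` and `/api/dev-ready` must expose the non-sensitive `skills` status: path, exists, available, readonly, skillCount, `cliSpecAvailable` and repair suggestions; the CLI `codex dev-ready` can read this summary. The current delivery only requires read-only skill injection for the DEV manifest and the legacy direct Compose diagnostic path; before a PROD Code Queue release the isolation level must be reviewed separately, and the DEV bridge mode must not be promoted directly to the production default.
- Develop-ready image: the Code Queue image must have the tools needed for UniDesk/Pipeline debugging preinstalled before startup, including at least `codex`, `bun`, `node`, `npm`/`npx`, `git`, `rg`, `curl`, `python3`/`pip3`, `docker`, `docker compose`, `docker-compose`, `jq`, `ssh`, `rsync`, `make`, `gcc`/`g++`, `iptables`, `tar`, `gzip` and `unzip`; it must not rely on running `apt-get install` for this base environment during Codex task runtime.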
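- Remote development containers and task execution Provider: Code Queue must be able to start development containers on compute nodes such as D601 through the live API. The entry point is `POST /api/dev-containers/<providerId>/start`, and the default Provider is `D601`. In this flow Code Queue calls the UniDesk SSH maintenance bridge to create `unidesk-codex-dev-<providerId>` on the target node; the manual entry point is written as `trans <providerId>`, and internal service calls still reuse the same route parser and broker. An `ssh -w` TUN point-to-point link is established between the node hosting Code Queue and the development container; the node hosting the service applies NAT/MASQUERADE to the development container's TUN source address, and the development container's default route and DNS switch to that TUN, so that `ping google.com`, DNS, HTTP(S) and other outbound traffic all go through the main server's global proxy instead of depending on D601's local network. Submitting a Code Queue task must support choosing the execution Provider: `D601` executes locally in the active Code Queue scheduler/runner Pod on D601's native k3s, with default working directory `/workspace`, and `/workspace` must map to `/home/ubuntu` on the D601 WSL host; the same hostPath must also be mounted at `/home/ubuntu` inside the container so that absolute symlinks in the WSL home (for example `/workspace/cq-deploy -> /home/ubuntu/unidesk-code-queue-deploy`) resolve inside tasks, rather than showing only the symlink name with no way to enter the target directory. `/root/unidesk` and `/app` must separately map `/home/ubuntu/cq-deploy` as the service deployment repository; other Providers execute in the corresponding `unidesk-codex-dev-<providerId>` container with default working directory `/home/ubuntu`, and `cwd` can be overridden per task. Before a remote task starts, the Provider's development container must be automatically reused or started, the Codex configuration and the allowed runtime provider environment variables must be synced, and outbound traffic must use the same master TUN/NAT link; when the target host has `/mnt`, the development container must mount host `/mnt:/mnt`, so that Windows drive paths such as `/mnt/f/Work/ConStart` on WSL nodes like D601 are visible inside the task container and the agent does not find unrelated projects for lack of the real workspace. TUN setup must handle stale state idempotently: before startup, clean up the old `tun<id>`, default route, old tunnel SSH processes and old OUTPUT jumps; a missing old device must not cause failure, and cold-start runtime preparation needs a bounded but sufficient timeout. After the TUN is established, a `UD-CQ-EGRESS-<provider>` OUTPUT chain must be created whose rules allow only loopback, established connections, egress through `tun<id>`, and the SSH tunnel control connection to the master server, and then reject all other IPv4/IPv6 outbound packets. This network-layer seal is the authoritative external-network boundary for development/execution containers and cannot be replaced by the `HTTP_PROXY`/`NO_PROXY` environment variables. The container image must likewise be the uniquely resolved `unidesk-code-queue:<provider>` or an explicit `image`; if it is missing, fail immediately, and provider-gateway image, `latest` or any other implicit image fallback is forbidden. Acceptance must retain three kinds of logs: `ping google.com` succeeds after the container builds the tunnel; a direct external connection forced through the original Docker interface is blocked with `sealed_direct_ping=blocked_expected`; and the counters of the corresponding `UNIDESK-CODEX-DEV-<providerId>` NAT chain or `tun<id>` on the node hosting the service increase between before and after the ping. For tasks involving a WSL workspace, the target `/mnt/...` path must also be verified as readable inside the development container. `GET /api/dev-containers/<providerId>/status` must show the default route, `route_8_8_8_8`, `egressFirewallChain` and the OUTPUT chain jump. Development container proxy keys are generated only under `.state/code-queue/dev-proxy/` and the target node's user directory and must not be committed to the repository.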
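- Remote maintenance bridge calls: after Code Queue migrated to D601, the Code Queue backend Pod no longer contains the main server's `unidesk-backend-core` container, so `trans ...` can no longer be implemented as a local `docker exec unidesk-backend-core`. Provider maintenance commands issued by the Code Queue runner must enter the backend-core SSH bridge through the main server frontend's authenticated `/ws/ssh` streaming proxy, and the target provider-gateway then executes Host SSH/WSL SSH; stdout/stderr stream directly back to the runner and must not pass through `/api/dispatch` task polling or JSON compact. Passing scripts, `py` or `apply-patch` also uses the same stdin streaming channel, to avoid reverting to a local Docker broker, manual base64 chunked upload, interactive shell fallback or multi-layer quoting.
- Remote maintenance bridge calls: after Code Queue migrated to D601, the Code Queue backend Pod no longer contains the main server's `unidesk-backend-core` container, so `trans ...` can no longer be implemented as a local `docker exec unidesk-backend-core`. Provider maintenance commands issued by the Code Queue runner must enter the backend-core SSH bridge through the main server frontend's authenticated `/ws/ssh` streaming proxy, and the target provider-gateway then executes Host SSH/WSL SSH; stdout/stderr stream directly back to the runner and must not pass through `/api/dispatch` task polling or JSON compact. Passing a `sh`/`bash` stdin shell body, `py` or `apply-patch` also uses the same stdin streaming channel, to avoid reverting to a local Docker broker, manual base64 chunked upload, interactive shell fallback or multi-layer quoting.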
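- Remote Provider preparation must not block the control plane: in request handling, queue scheduling, remote development container preparation, Host SSH/WSL SSH passthrough, Codex/OpenCode startup and log export paths, Code Queue must not use synchronous child-process calls that occupy the Bun event loop for a long time, such as `spawnSync`, `execSync` or `execFileSync` against a remote Provider. Remote commands must run through asynchronous child processes with an explicit timeout, kill on timeout, stdout/stderr limits and task output progress records. A remote preparation failure may only move the corresponding task to failed or retry; it must not make control-plane requests such as `POST /api/tasks`, SSE `/api/events`, `/health`, overview or the frontend/core user service proxy wait for the remote SSH to finish. For any change to D601/remote Provider preparation, `api/dev-containers/*`, task enqueue/startup or paths such as `runCodeQueueSsh`, acceptance must verify concurrently, while a remote SSH/status/start probe is running, that direct container requests to `/health` and `/api/tasks/overview` still return within 1s, proving that a remote timeout will not recur as a site-wide refresh hang.
- OpenCode remote execution: when the two parallel configurations `minimax-m3` and `minimax-m2.7` go through the OpenCode JSON event port, both local and remote commands must explicitly run `opencode run ...`; remote Docker exec must not degrade to `exec run ...`, otherwise it becomes `bash: exec: run: not found` inside the target container. The terminal-state decision for the OpenCode JSON stream is based on the current process exit code plus the final assistant response of the current attempt: with `exit=0` and a non-empty final reply produced by the current attempt, the run should be treated as a normal terminal state even if upstream sent no `step_finish` event; only a nonzero exit, no current final reply, or a closed transport leads to retry. Each attempt's `finalResponse` must come only from the current OpenCode/Codex turn; falling back to the task's previous `finalResponse` when the current turn has produced no final reply is forbidden, because old task content would then be misjudged as this round's completion.
- Codex control: the service internally starts `codex app-server --listen stdio://`, calls `thread/start`, `turn/start`, `turn/steer` and `turn/interrupt` over JSON-RPC, and listens for notifications such as `turn/completed`, assistant delta, reasoning delta, command output delta and file diff delta to build a transcript that the frontend can poll.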
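The egress seal described in the remote development container bullet can be pictured as a dry-run renderer that only prints `iptables` commands. This is a sketch under stated assumptions: `render_egress_rules` is a hypothetical helper, the rule order and SSH port 22 are assumed, `192.0.2.10` is a documentation address standing in for the master server, and the real Code Queue implementation is not shown in this document.

```bash
# Hypothetical dry-run renderer: prints the rules, never applies them.
# The chain name follows the UD-CQ-EGRESS-<provider> pattern from the text.
render_egress_rules() {
  provider=$1
  tun_dev=$2
  master_ip=$3
  chain="UD-CQ-EGRESS-$provider"
  cat <<EOF
iptables -N $chain
iptables -A $chain -o lo -j ACCEPT
iptables -A $chain -m conntrack --ctstate ESTABLISHED,RELATED -j ACCEPT
iptables -A $chain -o $tun_dev -j ACCEPT
iptables -A $chain -p tcp -d $master_ip --dport 22 -j ACCEPT
iptables -A $chain -j REJECT
iptables -I OUTPUT -j $chain
EOF
}

# Assumed values for illustration only.
render_egress_rules D601 tun0 192.0.2.10
```

The final `REJECT` is what makes the chain a network-layer boundary rather than a proxy preference. An IPv6 seal would need an equivalent `ip6tables` chain, which this sketch omits.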
+11 -11
@@ -8,18 +8,18 @@ Use UniDesk SSH passthrough for PK01 host operations:
```bash
trans PK01 argv hostname
trans PK01 script <<'SCRIPT'
trans PK01 sh <<'SH'
df -h /
docker ps --format 'table {{.Names}}\t{{.Image}}\t{{.Status}}'
SCRIPT
SH
```
Before closing an operation, verify both the provider channel and host workload state:
```bash
bun scripts/cli.ts debug health
trans PK01 argv bash -lc 'docker inspect --format "name={{.Name}} restart={{.HostConfig.RestartPolicy.Name}} pid={{.HostConfig.PidMode}} state={{.State.Status}} image={{.Config.Image}}" unidesk-provider-gateway-pk01'
trans PK01 argv bash -lc 'docker ps --format "table {{.Names}}\t{{.Image}}\t{{.Status}}"'
trans PK01 sh -- 'docker inspect --format "name={{.Name}} restart={{.HostConfig.RestartPolicy.Name}} pid={{.HostConfig.PidMode}} state={{.State.Status}} image={{.Config.Image}}" unidesk-provider-gateway-pk01'
trans PK01 sh -- 'docker ps --format "table {{.Names}}\t{{.Image}}\t{{.Status}}"'
```
PK01 has no k3s control plane. `trans PK01:k3s ...` is not an operating truth. If a future PK01 k3s lane is introduced, it must get a separate runtime-lane reference and must not reuse the current pikanode host-data policy as a Kubernetes retention policy.
@@ -130,9 +130,9 @@ PK01 has node-local retention controls installed so that pikanode temp output an
Operational checks:
```bash
trans PK01 argv bash -lc 'systemctl status unidesk-pk01-pikanode-temp-gc.timer --no-pager'
trans PK01 argv bash -lc 'sudo systemctl start unidesk-pk01-pikanode-temp-gc.service && tail -n 40 /var/log/unidesk-pk01/pikanode-temp-gc.log'
trans PK01 argv bash -lc 'sudo logrotate -d /etc/logrotate.d/unidesk-pk01-pikanode'
trans PK01 sh -- 'systemctl status unidesk-pk01-pikanode-temp-gc.timer --no-pager'
trans PK01 sh -- 'sudo systemctl start unidesk-pk01-pikanode-temp-gc.service && tail -n 40 /var/log/unidesk-pk01/pikanode-temp-gc.log'
trans PK01 sh -- 'sudo logrotate -d /etc/logrotate.d/unidesk-pk01-pikanode'
```
The timer and logrotate configuration are node-local operational state. If a future UniDesk CLI subcommand manages PK01 retention centrally, it must first render a dry-run plan, show the same protected paths, and then install/update these node-local files through a confirmed operation.
@@ -142,10 +142,10 @@ The timer and logrotate configuration are node-local operational state. If a fut
PK01 space attribution should use short, bounded commands. Recommended probes:
```bash
trans PK01 argv bash -lc 'df -h / && df -i /'
trans PK01 argv bash -lc 'sudo timeout 20 du -xhd1 /var /home/ubuntu/pikanode /home/ubuntu/.vscode-server /var/lib/docker /var/log 2>/dev/null | sort -h | tail -80'
trans PK01 argv bash -lc 'docker system df -v | sed -n "1,220p"'
trans PK01 argv bash -lc 'sudo find /home/ubuntu/pikanode/html/temp -xdev -mindepth 1 -maxdepth 1 -printf "%TY-%Tm-%Td %TH:%TM %p\n" | sort | tail -40'
trans PK01 sh -- 'df -h / && df -i /'
trans PK01 sh -- 'sudo timeout 20 du -xhd1 /var /home/ubuntu/pikanode /home/ubuntu/.vscode-server /var/lib/docker /var/log 2>/dev/null | sort -h | tail -80'
trans PK01 sh -- 'docker system df -v | sed -n "1,220p"'
trans PK01 sh -- 'sudo find /home/ubuntu/pikanode/html/temp -xdev -mindepth 1 -maxdepth 1 -printf "%TY-%Tm-%Td %TH:%TM %p\n" | sort | tail -40'
```
Interpretation guide:
+1 -1
@@ -167,7 +167,7 @@ When target-level `egressProxy.enabled=true`, the D601 target renders an in-clus
Adding, removing, exposing, validating, and configuring local Codex consumers are daily operations covered by `$unidesk-sub2api`. The development rule is that ordinary pool membership changes stay YAML-only and do not add code or CI/CD. Code changes are only appropriate when UniDesk needs to render or validate a Sub2API capability that already exists upstream, such as account-level WebSocket mode or per-account upstream User-Agent. If Sub2API itself does not support a desired behavior, do not magic-patch it through UniDesk scripts, Kubernetes hotfixes, local forks, or hidden compatibility paths; either leave the behavior unsupported or pursue it upstream as an explicit Sub2API feature.
`codex-pool sync --confirm` and `codex-pool validate` are runtime operations that may need more than one SSH short-connection window because they log in to Sub2API, reconcile accounts, inspect recent logs, and run gateway smoke requests. The formal entry remains the UniDesk CLI, which must use a submit-and-short-poll control shape or an equivalent remote job wrapper instead of one long `trans G14:k3s script` call. If these commands fail with `UNIDESK_SSH_RUNTIME_TIMEOUT` while the remote operation may still be running, treat it as a control-plane visibility gap first: improve or use the CLI's job/poll path, then rerun `sync` or `validate`. Do not replace it with raw `kubectl`, manual Sub2API admin API patches, repeated blind full loops, or Sub2API source modifications.
`codex-pool sync --confirm` and `codex-pool validate` are runtime operations that may need more than one SSH short-connection window because they log in to Sub2API, reconcile accounts, inspect recent logs, and run gateway smoke requests. The formal entry remains the UniDesk CLI, which must use a submit-and-short-poll control shape or an equivalent remote job wrapper instead of one long `trans G14:k3s sh` call. If these commands fail with `UNIDESK_SSH_RUNTIME_TIMEOUT` while the remote operation may still be running, treat it as a control-plane visibility gap first: improve or use the CLI's job/poll path, then rerun `sync` or `validate`. Do not replace it with raw `kubectl`, manual Sub2API admin API patches, repeated blind full loops, or Sub2API source modifications.
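A submit-and-short-poll control shape can be sketched generically as a bounded polling loop. This is a minimal sketch, not the CLI's actual job path: `poll_until` is a hypothetical helper, and the status probe passed to it stands in for whatever short check the real job wrapper exposes.

```bash
# poll_until MAX_ATTEMPTS INTERVAL_SECONDS CMD [ARGS...]
# Runs CMD as a short probe up to MAX_ATTEMPTS times, sleeping between tries.
# Returns 0 as soon as the probe succeeds, 1 if every attempt fails.
poll_until() {
  local max_attempts="$1" interval="$2" attempt=1
  shift 2
  while [ "$attempt" -le "$max_attempts" ]; do
    # Each probe is its own short call, so no single call stays open long.
    if "$@"; then
      return 0
    fi
    attempt=$((attempt + 1))
    if [ "$attempt" -le "$max_attempts" ]; then
      sleep "$interval"
    fi
  done
  return 1
}
```

The point of the shape is that the long-running work happens remotely after submission, while the caller only issues probes that each fit inside one SSH short-connection window.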
After `codex-pool configure-local --confirm`, the default `~/.codex/config.toml` / `auth.json` pair must remain the unified Sub2API consumer and must not be reused as an upstream account profile. Keep every upstream source profile in suffixed files such as `config.toml.<profile>` / `auth.json.<profile>` and register it through YAML `profiles.entries`.
+3 -3
@@ -46,7 +46,7 @@ Provider WebSocket 是注册、heartbeat、dispatch、`provider.upgrade` 和短
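The long-term direction for the TCP pool is to make `trans`/`tran` a truly concurrent short-connection tool rather than stacking more queues onto a single Provider WebSocket. backend-core uses the provider WebSocket only to send control frames such as open/dispatch/exit; stdin/stdout/stderr data frames must travel over a pre-warmed TCP channel. Each SSH session, script, file transfer, or Windows passthrough command holds one channel exclusively and releases it back to the pool when it ends. Pool exhaustion, a lost channel, an unreachable data port, and an outdated provider version must all be structured fast failures; parking a request in an application-layer queue and leaving it unanswered for a long time is forbidden.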
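The current default pool size is 10 channels. The design prioritizes high-frequency short SSH, concurrent small files, and the case where a single large file must not block other requests. The verified target state is: on a WSL provider such as D601, 10 concurrent `trans ... argv bash -lc 'sleep 2'` calls no longer produce `provider ssh tcp data pool has no idle channel`, stderr is empty, every call's stdout contains the command's start and end output, and afterwards the labels return to `ready=desired` and `claimed=0`. A fixed end-to-end overhead remains, so the wall-clock time of 10 concurrent short commands can be noticeably higher than the remote command's own runtime. That belongs to later performance work on connection setup, broker scheduling, WSL SSH spawn, or the provider startup path, and must not be masked with queues, gating, or hidden retries.
The current default pool size is 10 channels. The design prioritizes high-frequency short SSH, concurrent small files, and the case where a single large file must not block other requests. The verified target state is: on a WSL provider such as D601, 10 concurrent `trans ... sh -- 'sleep 2'` calls no longer produce `provider ssh tcp data pool has no idle channel`, stderr is empty, every call's stdout contains the command's start and end output, and afterwards the labels return to `ready=desired` and `claimed=0`. A fixed end-to-end overhead remains, so the wall-clock time of 10 concurrent short commands can be noticeably higher than the remote command's own runtime. That belongs to later performance work on connection setup, broker scheduling, WSL SSH spawn, or the provider startup path, and must not be masked with queues, gating, or hidden retries.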
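A concurrency check of this kind could be scripted roughly as follows. This is a sketch under assumptions: `run_concurrent` is a hypothetical helper, it only checks exit codes and empty stderr, and it does not inspect stdout content or provider labels.

```bash
# run_concurrent N CMD [ARGS...]
# Starts N copies of CMD in parallel and prints how many of them failed.
# A copy counts as failed if it exits non-zero or writes anything to stderr.
# The body runs in a subshell so its variables do not leak into the caller.
run_concurrent() (
  n="$1"; shift
  dir="$(mktemp -d)"
  pids=""
  i=0
  while [ "$i" -lt "$n" ]; do
    "$@" >"$dir/$i.out" 2>"$dir/$i.err" &
    pids="$pids $!"
    i=$((i + 1))
  done
  failed=0
  i=0
  for pid in $pids; do
    # wait returns the exit status of that specific background job.
    if ! wait "$pid" || [ -s "$dir/$i.err" ]; then
      failed=$((failed + 1))
    fi
    i=$((i + 1))
  done
  rm -rf "$dir"
  echo "$failed"
)
```

Against a real provider, the command would be something like `run_concurrent 10 trans D601 sh -- 'sleep 2'`, with a printed `0` as the expected result when the pool is healthy.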
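The easiest pitfall during development is mistaking "the dependency layer is online" for "the data plane is usable". `host.ssh` only proves that the provider can run maintenance SSH; only `host.ssh.tcp-pool`, `providerGatewaySshDataPoolReady`, `providerGatewaySshDataPoolClaimed`, and `providerGatewaySshDataPoolLastError` prove the state of the TCP data pool. The second pitfall is losing the tail of the output: after receiving `ssh.data`, the backend-core broker must write and flush stdout/stderr before handling `ssh.exit`; otherwise a short command can return rc=0 while the last chunk of stdout never reaches the caller. The third pitfall is session release: the `ssh.exit`, error, and timeout paths must all release the claimed channel so that the next batch of concurrent requests does not see a false pool exhaustion. The fourth pitfall is pool-state drift between core and provider: if the provider returns `host_ssh_error` over the control WebSocket with the message `requested ssh tcp data channel is not ready`, the channel claimed on the core side is no longer recognized by the provider, and backend-core must drop that `providerId + dataChannelId` instead of releasing it back to the idle pool and claiming it again.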
@@ -162,7 +162,7 @@ backend-core 可以通过真实 WebSocket 调度向在线 provider 下发 `provi
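`bun scripts/cli.ts provider triage <PROVIDER_ID>` is the read-only, multi-signal adjudication entry point for provider runtime state. The output must include `decision`, `retryable`, `healthyScopes`, `failedScopes`, `degradedScopes`, `blockingDisposition`, `rationale`, `signals`, and `recommendedCrossChecks`. The long-term semantics of `decision` are: `global-offline` means that several independent critical surfaces, such as provider heartbeat, Host SSH, k3s, or the scheduler, fail at the same time with no healthy cross-evidence; `service-degraded` means that the registry, the service proxy, or a single user service is locally degraded while provider-level health signals still exist; `retryable-transient` means that a single runner-local, SSH, proxy, or API timeout is insufficient evidence and calls for a retry or additional cross-checks; `healthy` means that no failure or degradation signal was observed.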
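`recommendedCrossChecks` must keep the argv-form Host SSH self-check: `trans <PROVIDER_ID> argv true`. This command proves that the non-interactive maintenance bridge is still usable. If the free ssh-like form hits a timeout, `kex_exchange_identification`, or `Connection closed by remote host`, first follow the `UNIDESK_SSH_HINT` printed by the CLI and retest with `trans D601 argv bash -lc '<command>'`, then combine the result with `provider triage` to decide whether it really is a provider-level failure.
`recommendedCrossChecks` must keep the argv-form Host SSH self-check: `trans <PROVIDER_ID> argv true`. This command proves that the non-interactive maintenance bridge is still usable. If the free ssh-like form hits a timeout, `kex_exchange_identification`, or `Connection closed by remote host`, first follow the `UNIDESK_SSH_HINT` printed by the CLI and retest with `trans D601 sh -- '<command>'` or `trans D601 bash -- '<bash command>'`, then combine the result with `provider triage` to decide whether it really is a provider-level failure.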
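A long-lived WSL provider such as D601 must not be declared globally offline because a single path failed. A typical local degradation is the artifact registry's `unidesk-artifact-registry.service` being inactive while the registry container is still running, the listener is still bound to loopback, and `http://127.0.0.1:5000/v2/` returns 200. This state should show as degraded within the registry scope and resolve to `decision=service-degraded` in provider triage, prompting only a fix for the systemd drift without blocking every Code Queue, k3sctl-adapter, or business API judgment on D601.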
@@ -186,6 +186,6 @@ WSL provider 需要调用 Windows-only 工具链时,优先在 WSL 用户的 `~
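The maintenance bridge is exposed as the `host.ssh` command through real WebSocket dispatch. The default payload uses `mode: "probe"`: the remote end runs a single short command and returns `UNIDESK_SSH_TEST user=... host=... bridge=host.ssh cwd=...`. For manual diagnosis, `mode: "exec"` can be used explicitly together with the `command` field to run a bounded command. Every `host.ssh` execution must have a timeout, and stdout/stderr are shown truncated in the task result. Automatic upgrades and ordinary tasks must still use the Docker socket and `provider.upgrade`; the WSL SSH maintenance bridge must not be used as a scheduling channel.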
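The human-facing terminal entry point is `trans <PROVIDER_ID> [ssh-like args...]`. With no further arguments it opens a remote login shell; with further arguments it runs the remote command and returns the remote exit code. On the client side this entry point still connects to the backend-core internal `/ws/ssh` broker. core uses the provider WebSocket only to send open/dispatch control messages, and the terminal stdin/stdout/stderr data plane must go through the `host.ssh.tcp-pool` TCP warm pool over connections that the provider initiates to the main server; this adds no inbound requirement on compute nodes and keeps no legacy WebSocket data fallback. Traditional ssh transport parameters are controlled uniformly by provider-gateway environment variables; the CLI only passes through the remote command that follows the Provider ID and the terminal stdin/stdout/stderr. Non-interactive remote commands should prefer the argv entry point: `trans D601 argv true`, or `trans D601 argv bash -lc '<command>'` when shell features are needed. When a WSL node needs a clear view of both the Linux/WSL and the Windows skill sets, use `trans <PROVIDER_ID> skills`; this command only reads `SKILL.md` metadata through the established maintenance bridge and does not require provider-gateway to add any business API.
The human-facing terminal entry point is `trans <PROVIDER_ID> [ssh-like args...]`. With no further arguments it opens a remote login shell; with further arguments it runs the remote command and returns the remote exit code. On the client side this entry point still connects to the backend-core internal `/ws/ssh` broker. core uses the provider WebSocket only to send open/dispatch control messages, and the terminal stdin/stdout/stderr data plane must go through the `host.ssh.tcp-pool` TCP warm pool over connections that the provider initiates to the main server; this adds no inbound requirement on compute nodes and keeps no legacy WebSocket data fallback. Traditional ssh transport parameters are controlled uniformly by provider-gateway environment variables; the CLI only passes through the remote command that follows the Provider ID and the terminal stdin/stdout/stderr. Non-interactive single-process remote commands should prefer the argv entry point: `trans D601 argv true`. When shell features are needed, write `sh` or `bash` explicitly in the operation position, for example `trans D601 sh -- '<command>'` or `trans D601 bash -- '<bash command>'`. When a WSL node needs a clear view of both the Linux/WSL and the Windows skill sets, use `trans <PROVIDER_ID> skills`; this command only reads `SKILL.md` metadata through the established maintenance bridge and does not require provider-gateway to add any business API.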
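To verify the WSL SSH bridge, first start sshd in the target WSL and make sure the maintenance public key is written to the target user's `authorized_keys`, then confirm that `unideskCapabilities` in the target provider's registration labels includes `host.ssh`. After running `bun scripts/cli.ts debug dispatch <PROVIDER_ID> host.ssh --wait-ms 15000`, the result in `debug task latest` or the frontend task history should show `status: succeeded`, a `probeLine` with `UNIDESK_SSH_TEST`, and `exitCode: 0`, and `hostSshKeyPresent` must be true in the target node's labels. Then run `trans <PROVIDER_ID> argv true` to verify the non-interactive argv maintenance command, and `trans <PROVIDER_ID> hostname` to verify the near-native ssh remote command experience. When self-testing on the compute node itself, pass the same set of commands through the remote CLI: `bun scripts/cli.ts --main-server-ip 74.48.78.17 debug health`, `bun scripts/cli.ts --main-server-ip 74.48.78.17 debug dispatch <PROVIDER_ID> host.ssh --wait-ms 15000`, `bun scripts/cli.ts --main-server-ip 74.48.78.17 ssh <PROVIDER_ID> argv true`, and `bun scripts/cli.ts --main-server-ip 74.48.78.17 ssh <PROVIDER_ID> hostname`. By default the remote CLI uses the public frontend login session and does not need the main server SSH key. The health check must show that the Provider is online, `hostSshConfigured=true`, `hostSshKeyPresent=true`, `hostSshTarget` is correct, and `unideskCapabilities` includes `host.ssh`; the probe must return `UNIDESK_SSH_TEST`; and `ssh <PROVIDER_ID> argv true` and `ssh <PROVIDER_ID> hostname` must exit with code 0. If a WSL node such as D518 has no public SSH entry, verification must still go through this provider-gateway self-connected maintenance bridge rather than requiring the main server to connect directly to port 22 on the node's public address. When an old provider does not declare `host.ssh`, provider-gateway must be upgraded first; otherwise core rejects SSH passthrough.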
+1 -1
@@ -63,7 +63,7 @@
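**Before every piece of formal work (entering scheduling, re-reading the Todo Note, writing back decisions, arranging time blocks), the secretary must fetch the latest Beijing time itself and use it as the reference time.** It must not rely on the date at the top of the conversation, a "now" found in context, historical timestamps, or a "today" inferred from the previous session. A time stated by the user (for example "it is 14:57 now") serves as a cross-check, not as the reference. How to fetch the time (verified 2026-06-01):
- **Local call on the main server** (the secretary's own machine): `TZ='Asia/Shanghai' date '+%Y-%m-%d %H:%M:%S %Z'`; a result such as 15:39 CST is the real Beijing time, with no ssh route needed. Advantages: low friction, a single command, no dependencies.
- **Cross-node / remote time check** (when the remote host clock needs confirming): `trans <route> script -- 'date "+%Y-%m-%d %H:%M:%S %Z"'`, with the route naming the target (for example `G14` / `D601`).
- **Cross-node / remote time check** (when the remote host clock needs confirming): `trans <route> sh -- 'date "+%Y-%m-%d %H:%M:%S %Z"'`, with the route naming the target (for example `G14` / `D601`).
State the result explicitly in the first sentence of the schedule ("the time I see right now is X, and I am rescheduling from that") so that the user can spot a time mismatch immediately. The reason: the user may rest or step away between two conversations, and a schedule built on a cached time drifts from where the user actually is, the "done / not done + blocker" feedback format stops lining up, and the errors accumulate across the schedule.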
+2 -2
@@ -32,13 +32,13 @@ trans <PROVIDER_ID>:win skills --limit 20
## Windows Long-Lived Process Detach
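`trans <PROVIDER_ID>:win cmd ...` and `trans <PROVIDER_ID> script -- powershell.exe ...` are suited to short commands, read-only probes, and bounded skill calls, not to directly starting long-lived Windows processes. Windows `cmd start`, `cmd /c ... &`, PowerShell `Start-Process -PassThru`, or child processes with stdout/stderr redirection may still be waited on by the provider-gateway/SSH broker because of the child process tree or inherited handles. The remote process has in fact started, but `trans` keeps holding the provider session, and subsequent high-frequency D601/G14 calls queue up serially behind the provider session lock.
`trans <PROVIDER_ID>:win cmd ...` and `trans <PROVIDER_ID>:win ps ...` are suited to short commands, read-only probes, and bounded skill calls, not to directly starting long-lived Windows processes. Windows `cmd start`, `cmd /c ... &`, PowerShell `Start-Process -PassThru`, or child processes with stdout/stderr redirection may still be waited on by the provider-gateway/SSH broker because of the child process tree or inherited handles. The remote process has in fact started, but `trans` keeps holding the provider session, and subsequent high-frequency D601/G14 calls queue up serially behind the provider session lock.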
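Long-lived Windows processes must use a launch model that explicitly detaches from the current `trans` session:
- Prefer writing the launch parameters into a node-private profile or a `.cmd`/`.ps1` file, then start it through Windows Task Scheduler, a Windows Service, NSSM, a PM2 Windows service, or an equivalent supervisor.
- For a temporary experiment, use `schtasks /Create ... /TR "<cmd file>" /SC ONCE ... /F` followed by `schtasks /Run ...`, and write stdout/stderr to a fixed log file; verify with independent short commands that read `tasklist`, `Get-CimInstance Win32_Process`, the log tail, and service health.
- Do not use `start` or `Start-Process` inside `D601:win cmd` to directly launch long-lived processes such as `hwlab-gateway`, a serial monitor, Codex app-server, or a Keil job watcher. If this was done by mistake and `trans` hangs, first stop the corresponding Windows PID or the local stuck trans/tran broker process, then switch to a detached supervisor.
- Do not use `start` or `Start-Process` inside `D601:win cmd` / `D601:win ps` to directly launch long-lived processes such as `hwlab-gateway`, a serial monitor, Codex app-server, or a Keil job watcher. If this was done by mistake and `trans` hangs, first stop the corresponding Windows PID or the local stuck trans/tran broker process, then switch to a detached supervisor.
- The launch command must avoid entering Windows cmd from a WSL UNC cwd; the long-lived process's `cwd` should be a Windows drive-letter path such as `F:\Work\HWLAB`, and logs should go to the same node-private `.state` or skill state directory.
- The acceptance criterion for a long-lived process is not "the launch command returned" but that independent short commands can see the PID, the first log line, or the health endpoint, or that the cloud-side session/resource/capability has registered.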