fix: cap tran runtime and remove local lock
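@@ -114,11 +114,11 @@ GitHub issue/PR write operations must prefer `bun scripts/cli.ts gh issue|pr ...

 `bun scripts/cli.ts ssh --help` and `bun scripts/cli.ts ssh <providerId> --help` are local JSON help commands and must return quickly; they must not parse `--help` as a Provider ID, open an interactive shell, or wait for a provider session.

-The main server always provides the `tran` shorthand, equivalent to `bun /root/unidesk/scripts/cli.ts ssh`. Both entry points must be kept: interactive bash uses `alias tran='bun /root/unidesk/scripts/cli.ts ssh'` in `~/.bashrc`; Codex `exec`, scripts, and other non-interactive shells do not expand aliases automatically, so an executable wrapper at `/root/.local/bin/tran` is also required, with exactly this content:
+The main server always provides the `tran` shorthand, equivalent to the controlled UniDesk SSH passthrough entry point. Both entry points must be kept: interactive bash uses `alias tran='/root/.local/bin/tran'` in `~/.bashrc`; Codex `exec`, scripts, and other non-interactive shells do not expand aliases automatically, so an executable wrapper at `/root/.local/bin/tran` is also required, whose fixed content delegates to the versioned script inside the repo:

 ```sh
 #!/bin/sh
-exec bun /root/unidesk/scripts/cli.ts ssh "$@"
+exec /root/unidesk/scripts/tran "$@"
 ```

 Manual and Codex distributed agile operations on the main server must write `tran ...` directly; do not fall back to the full `bun scripts/cli.ts ssh ...` prefix in Codex tool calls. Examples: `tran D601:/home/ubuntu/workspace/hwlab-dev git status --short --branch`, `tran D601:k3s kubectl get pods -n hwlab-dev`, or `tran D601:k3s:hwlab-dev:hwlab-cloud-web/tmp pwd`. The CLI command reference and scripts that must be copied across machines may keep the full `bun scripts/cli.ts ssh ...` form to document the stable entry point; `tran` is an operating discipline local to the main server and is not a prerequisite for remote providers or CI/CD.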
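@@ -127,15 +127,17 @@ exec bun /root/unidesk/scripts/cli.ts ssh "$@"

 Local shell operators are not something `tran` can intercept. `tran G14:/root/hwlab sed -n '1,20p' AGENTS.md && sed -n '1,20p' docs/reference/g14.md` is first split into two commands by the master server's local shell: only the first `sed` reaches G14, and the second `sed` runs in the master server's current directory. To run both commands on the target node, write `tran G14:/root/hwlab shell 'sed -n "1,20p" AGENTS.md && sed -n "1,20p" docs/reference/g14.md'`, or use `tran G14:/root/hwlab script <<'SCRIPT'` to send a multi-line script to the remote side.

-Before opening a provider SSH session, the `tran` wrapper serializes non-interactive calls to the same provider/plane with a local file lock. The lock covers only short commands of the form `tran <route> <operation> ...`, not the interactive shell `tran <route>`; its purpose is to keep concurrent Codex file reads or concurrent small commands from hitting the same provider's session allocator at once and making every call time out while opening the `provider session`. The lock directory defaults to `/tmp/unidesk-tran-locks` and can be changed with `UNIDESK_TRAN_LOCK_DIR`; waiting longer than `UNIDESK_TRAN_LOCK_NOTICE_SECONDS` prints a queueing notice on stderr, longer than `UNIDESK_TRAN_LOCK_WARNING_SECONDS` prints a warning that high-frequency distributed calls are queued, and longer than `UNIDESK_TRAN_LOCK_TIMEOUT_SECONDS` fails. Setting `UNIDESK_TRAN_SESSION_LOCK=0` temporarily is allowed only when debugging the lock itself or verifying the underlying concurrency; ordinary distributed development must not bypass the lock.
+`tran` does not take a local provider/plane serialization lock; a local directory lock is not a business coordination mechanism for G14's native k3s/Tekton/GitOps, and a stale lock blocks every subsequent short query. Do not restore a local lock in the `tran` wrapper. Business concurrency, release mutual exclusion, and rollout coordination must be left to native runtime mechanisms such as k8s/Tekton/Argo/Lease; if the provider session allocator needs rate limiting, implement a queue or lease with a TTL on the server side rather than adding a directory lock on the client.
+
+Non-interactive `tran`/`ssh` has an outermost hard runtime timeout whose default and maximum are both 60 seconds; `UNIDESK_TRAN_RUNTIME_TIMEOUT_SECONDS`, `UNIDESK_TRAN_RUNTIME_TIMEOUT_MS`, or `UNIDESK_SSH_RUNTIME_TIMEOUT_MS` can only lower the timeout, never raise it above 60 seconds. When the limit is reached, the wrapper, the backend-core broker, or the frontend websocket path disconnects and writes `UNIDESK_TRAN_TIMEOUT_HINT` or `UNIDESK_SSH_RUNTIME_TIMEOUT` to stderr, advising a switch to short queries plus polling. Long-running CI/CD, Tekton/Argo observation, trace/result, log tailing, build downloads, and hardware tasks must all be split into multiple `tran` calls with submit-and-poll/short-query semantics; a single `tran` must not hang waiting for final completion.

 `bun scripts/cli.ts ssh D518` should behave as a login shell on D518 WSL; `bun scripts/cli.ts ssh D518 hostname` should, like `ssh D518 hostname`, print only the remote command's output and return the remote exit code. Target selection ahead of the Provider ID is decided by the UniDesk node inventory; traditional ssh transport options such as `-p`, `-i`, `-l`, and `-o` are managed centrally by the provider-gateway deployment configuration, and the CLI consumes them for compatibility without overriding the maintenance bridge configuration on the node side. Commanders, CI preflight checks, and other non-interactive flows must not rely on ssh-like free-form concatenation; the standard form for a single process is `bun scripts/cli.ts ssh D601 argv true`, and the standard form for multi-line shell logic is a single-step call with a quoted heredoc, `bun scripts/cli.ts ssh D601 script <<'SCRIPT'`.

 core only allows providers that declare the `host.ssh` capability to use `ssh` passthrough or `host.ssh` dispatch; when an old provider does not support the capability, it must fail fast with an error and must not misreport an unknown command as a successful `echo`.

-The local broker waits 60000ms by default for the provider SSH session to open, so that a maintenance session can still be established while the target node is running many microservice.http tasks; to diagnose slow connections, raise it temporarily with `UNIDESK_SSH_OPEN_TIMEOUT_MS=<ms>`, but the minimum effective value is fixed at 15000ms, to avoid mistaking a genuinely offline node for a long block.
+The local broker waits 60000ms by default for the provider SSH session to open, so that a maintenance session can still be established while the target node is running many microservice.http tasks; to diagnose slow connections, raise it temporarily with `UNIDESK_SSH_OPEN_TIMEOUT_MS=<ms>`, but the minimum effective value is fixed at 15000ms, to avoid mistaking a genuinely offline node for a long block. Note that the open timeout governs only the session-open phase and cannot bypass the 60-second outermost hard runtime timeout.

-If an ssh-like remote command hits `kex_exchange_identification`, `Connection closed by remote host`, a provider session timeout, or exit code 255, the CLI appends one `UNIDESK_SSH_HINT { ... }` line after the original stderr. The JSON does not echo the original remote command and contains only `code=ssh-like-command-friction`, `trigger`, `try`, and `triage`; `try` always points to the stdin script form, so that one instance of ssh-like parsing/handshake friction is not misread as D601 SSH being entirely unavailable. `ssh`/`tran` appends one `UNIDESK_SSH_TIMING { ... }` line to stderr, with `level=warning`, only when the run takes longer than the default 10000ms; normal short calls emit no timing noise. Slow successful commands must keep the warning too, because it is an important monitoring signal for provider sessions, remote command cost, helper bootstrap, and `tran`/`apply-patch` performance regressions. The warning contains `elapsedMs`, `elapsedSeconds`, `transport`, `invocationKind`, and `exitCode`, and suggests checking first for provider/session latency, the remote command's own duration, helper bootstrap, or tool-layer regressions. The threshold can be adjusted temporarily with `UNIDESK_SSH_SLOW_WARNING_MS=<ms>`; this hint likewise does not echo the original remote command.
+If an ssh-like remote command hits `kex_exchange_identification`, `Connection closed by remote host`, a provider session timeout, or exit code 255, the CLI appends one `UNIDESK_SSH_HINT { ... }` line after the original stderr. The JSON does not echo the original remote command and contains only `code=ssh-like-command-friction`, `trigger`, `try`, and `triage`; `try` always points to the stdin script form, so that one instance of ssh-like parsing/handshake friction is not misread as D601 SSH being entirely unavailable. A hard runtime timeout of `ssh`/`tran` writes `UNIDESK_SSH_RUNTIME_TIMEOUT { ... }` or the wrapper-level `UNIDESK_TRAN_TIMEOUT_HINT { ... }`; this is not a remote business failure but a sign that the caller needs to switch to short queries/polling. `ssh`/`tran` appends one `UNIDESK_SSH_TIMING { ... }` line to stderr, with `level=warning`, only when the run takes longer than the default 10000ms; normal short calls emit no timing noise. Slow successful commands must keep the warning too, because it is an important monitoring signal for provider sessions, remote command cost, helper bootstrap, and `tran`/`apply-patch` performance regressions. The warning contains `elapsedMs`, `elapsedSeconds`, `transport`, `invocationKind`, and `exitCode`, and suggests checking first for provider/session latency, the remote command's own duration, helper bootstrap, or tool-layer regressions. The threshold can be adjusted temporarily with `UNIDESK_SSH_SLOW_WARNING_MS=<ms>`; this hint likewise does not echo the original remote command.

 `ssh <providerId>` injects `/tmp/unidesk-ssh-tools` only when the current operation needs a helper; ordinary paths such as `argv`, `script`, `kubectl`, and `logs` must not transfer unrelated tool source. `apply-patch` injects only `apply_patch`; `glob` injects only `glob`; `skills`/`skill discover` injects only `skill-discover`. `apply_patch` accepts the standard `*** Begin Patch` / `*** End Patch` patch format, which makes it easy to edit remote repository files through SSH passthrough; when `perl` exists on the remote side, it must take the fast exact-match path so that large-file hunks are not slowed to tens of seconds by sh pattern matching, and it falls back to the sh-only implementation only when `perl` is missing. `glob` and `skill-discover` require `python3` on the remote side. Injected tools write only to `/tmp/unidesk-ssh-tools` and do not modify the target repository.
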
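Review note on the local-shell-operator rule in the documentation hunk: the point is that `&&` only reaches the remote node when the whole remote script travels as one shell argument. The following is a minimal sketch of that quoting step. `shQuote` and `tranShellCommandLine` are hypothetical helpers that do not exist in the repo; only the `tran <route> shell '<script>'` form is taken from the documentation, and the route is assumed to contain no characters that need quoting.

```typescript
// Hypothetical helpers for illustration only; not part of the UniDesk repo.

// Quote one argument for a POSIX shell: wrap it in single quotes and
// rewrite every embedded single quote as '\'' .
function shQuote(arg: string): string {
  return `'${arg.replace(/'/g, `'\\''`)}'`;
}

// Build a command line in which the remote script is ONE argument, so the
// local shell never sees a bare && to split on.
// Assumption: the route itself needs no quoting.
function tranShellCommandLine(route: string, remoteScript: string): string {
  return ["tran", route, "shell", shQuote(remoteScript)].join(" ");
}

const line = tranShellCommandLine(
  "G14:/root/hwlab",
  `sed -n "1,20p" AGENTS.md && sed -n "1,20p" docs/reference/g14.md`,
);
console.log(line);
```

With the inputs above, the printed command line matches the `shell '...'` form that the documentation recommends.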
+2
-1
@@ -185,8 +185,9 @@ export function sshHelp(): unknown {
       "Do not put operation names in any colon route segment, including nested k3s namespace/workload/container segments.",
       "Do not use post-provider shorthand such as `ssh G14 k3s ...`; write `ssh G14:k3s ...` so location and operation stay separated.",
       "If an ssh-like remote command fails with timeout/kex/exit-255 friction, stderr includes one low-noise UNIDESK_SSH_HINT JSON line with the argv retry command.",
+      "Non-interactive ssh/tran operations have a hard top-level runtime timeout capped at 60s. Timeout writes UNIDESK_SSH_RUNTIME_TIMEOUT or UNIDESK_TRAN_TIMEOUT_HINT and disconnects the broker; long CI/CD, trace, logs, build, or hardware work must use submit-and-poll / short query loops instead of keeping tran open.",
       "Only slow ssh/tran runtime writes UNIDESK_SSH_TIMING JSON to stderr; operations over 10s are marked level=warning even when they succeed, because slow successful calls are a distributed performance monitoring signal. Check provider latency, remote command cost, helper bootstrap, or tran/apply-patch optimization before repeating high-frequency work. Routine short calls do not emit timing noise.",
-      "The local tran wrapper serializes non-interactive calls per provider/plane before opening provider SSH sessions, so parallel Codex file reads do not stampede the provider session allocator; set UNIDESK_TRAN_SESSION_LOCK=0 only for explicit diagnostics.",
+      "The local tran wrapper must not add provider/plane directory locks; rely on k8s/Tekton/Argo/Lease or server-side TTL queues for coordination.",
       "Use -- before a remote command that intentionally starts with a dash.",
     ],
   };

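Review note on the submit-and-poll guidance in the help text: the help string names the discipline but not its shape. A minimal sketch follows, assuming a hypothetical `pollUntil` helper and an injected `query` callback that stands in for one short `tran` call. Neither name exists in the repo, and no real `tran` process is started here.

```typescript
// Sketch only: `query` stands in for ONE short tran call (hypothetical).
type PollResult<T> = { done: true; value: T } | { done: false };

async function pollUntil<T>(
  query: () => Promise<PollResult<T>>,
  options: { intervalMs: number; maxAttempts: number },
): Promise<T> {
  for (let attempt = 1; attempt <= options.maxAttempts; attempt += 1) {
    // Each iteration is an independent short query, so no single call has
    // to stay open until the long-running job finishes.
    const result = await query();
    if (result.done) return result.value;
    if (attempt < options.maxAttempts) {
      await new Promise<void>((resolve) => setTimeout(resolve, options.intervalMs));
    }
  }
  throw new Error(`not finished after ${options.maxAttempts} short queries`);
}

// Demo with a fake job that reports completion on the third query.
let calls = 0;
const fakeQuery = async (): Promise<PollResult<string>> => {
  calls += 1;
  return calls >= 3 ? { done: true, value: "Succeeded" } : { done: false };
};

const demo = pollUntil(fakeQuery, { intervalMs: 10, maxAttempts: 5 });
demo.then((status) => console.log(status, calls));
```

The design choice worth noting is that the overall wait is bounded by `maxAttempts` and `intervalMs` on the caller side, while each individual query stays well under the 60-second cap.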
+33
-2
@@ -6,7 +6,18 @@ import { type UniDeskConfig } from "./config";
 import { type DebugDispatchCommand, isDebugDispatchCommand } from "./debug";
 import { summarizeMicroserviceHealthResponse, summarizeMicroserviceObservation, summarizeMicroserviceProxyResponse } from "./microservices";
 import { parseNetworkPerfOptions, runNetworkPerf } from "./network-perf";
-import { formatSshFailureHint, formatSshRuntimeTimingHint, parseSshInvocation, sshFailureHint, sshRoutePayloadCwd, sshRuntimeTimingHint, wrapSshRemoteCommand } from "./ssh";
+import {
+  formatSshFailureHint,
+  formatSshRuntimeTimeoutHint,
+  formatSshRuntimeTimingHint,
+  parseSshInvocation,
+  sshFailureHint,
+  sshRoutePayloadCwd,
+  sshRuntimeTimeoutHint,
+  sshRuntimeTimeoutMs,
+  sshRuntimeTimingHint,
+  wrapSshRemoteCommand,
+} from "./ssh";
 import { codexJudgeQueryAsync, codexOutputQueryAsync, codexPrPreflightQueryAsync, codexQueuesQueryAsync, codexTaskQueryAsync, codexTasksQueryAsync, codexUnreadTriageAsync } from "./code-queue";
 import { runDecisionCenterCommandAsync } from "./decision-center";
 import {
@@ -898,6 +909,7 @@ async function runRemoteSshWebSocket(
     rows: Number(process.stdout.rows) > 0 ? Number(process.stdout.rows) : 30,
   };
   const openTimeoutMs = Math.max(15000, Number(process.env.UNIDESK_SSH_OPEN_TIMEOUT_MS || 60000));
+  const runtimeTimeoutMs = sshRuntimeTimeoutMs();
   const payload = {
     providerId: invocation.providerId,
     command: wrapSshRemoteCommand(parsed.remoteCommand, parsed.requiredHelpers),
@@ -905,6 +917,7 @@ async function runRemoteSshWebSocket(
     tty: parsed.remoteCommand === null,
     stdinEotOnEnd: parsed.remoteCommand !== null,
     openTimeoutMs,
+    runtimeTimeoutMs,
     cols: size.cols,
     rows: size.rows,
   };
@@ -945,6 +958,7 @@ async function runRemoteSshWebSocket(

   return await new Promise<number>((resolve) => {
     const rawMode = parsed.remoteCommand === null && process.stdin.isTTY && typeof process.stdin.setRawMode === "function";
+    let timedOut = false;
     const openTimer = setTimeout(() => {
       if (sessionReady || settled) return;
       process.stderr.write("unidesk remote frontend ssh bridge timed out waiting for provider session\n");
@@ -955,9 +969,26 @@ async function runRemoteSshWebSocket(
         // Ignore close failures while resolving the timeout path.
       }
     }, openTimeoutMs);
+    const runtimeTimer = setTimeout(() => {
+      if (settled) return;
+      timedOut = true;
+      exitCode = 124;
+      process.stderr.write(formatSshRuntimeTimeoutHint(sshRuntimeTimeoutHint({
+        invocation,
+        transport: "frontend-websocket",
+        timeoutMs: runtimeTimeoutMs,
+      })));
+      try {
+        ws.close();
+      } catch {
+        // Ignore close failures while resolving the timeout path.
+      }
+      finish(124);
+    }, runtimeTimeoutMs);

     const restore = (): void => {
       clearTimeout(openTimer);
+      clearTimeout(runtimeTimer);
       process.stdin.off("data", onStdinData);
       process.stdin.off("end", onStdinEnd);
       if (rawMode) process.stdin.setRawMode(false);
@@ -966,7 +997,7 @@ async function runRemoteSshWebSocket(
       if (settled) return;
       settled = true;
       restore();
-      const hint = sshFailureHint(invocation.providerId, parsed, code, "");
+      const hint = timedOut ? null : sshFailureHint(invocation.providerId, parsed, code, "");
       if (hint !== null) process.stderr.write(formatSshFailureHint(hint));
       const timingHint = formatSshRuntimeTimingHint(sshRuntimeTimingHint({
         invocation,

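Review note on the runtime timer added to `runRemoteSshWebSocket`: the pattern is one timer, a `settled` guard so that only the first outcome wins, and exit code 124 on expiry. The sketch below distills that pattern into a standalone function. `runWithRuntimeLimit` and its `start` callback are hypothetical and simplified; the real code also closes the websocket and writes a structured hint.

```typescript
// Hypothetical distillation of the runtime-timer pattern; not the repo code.
const TIMEOUT_EXIT_CODE = 124; // the diff uses 124 for runtime timeouts

function runWithRuntimeLimit(
  // `start` begins the work and returns a cancel function (assumed shape).
  start: (finish: (code: number) => void) => () => void,
  limitMs: number,
): Promise<{ exitCode: number; timedOut: boolean }> {
  return new Promise((resolve) => {
    let settled = false;
    let timedOut = false;
    let cancel: () => void = () => {};

    // First caller wins; later calls are ignored by the settled guard.
    function finish(code: number): void {
      if (settled) return;
      settled = true;
      clearTimeout(timer);
      resolve({ exitCode: code, timedOut });
    }

    const timer = setTimeout(() => {
      if (settled) return;
      timedOut = true;
      cancel(); // tear down the work before reporting the timeout
      finish(TIMEOUT_EXIT_CODE);
    }, limitMs);

    cancel = start(finish);
  });
}

// Demo: one job finishes inside the limit, the other exceeds it.
const fast = runWithRuntimeLimit((finish) => {
  const t = setTimeout(() => finish(0), 10);
  return () => clearTimeout(t);
}, 200);
const slow = runWithRuntimeLimit((finish) => {
  const t = setTimeout(() => finish(0), 500);
  return () => clearTimeout(t);
}, 50);

Promise.all([fast, slow]).then(([a, b]) => console.log(a, b));
```

Recording `timedOut` separately from the exit code mirrors the diff, where the flag is used to suppress the generic failure hint once the timeout hint has already been written.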
+96
-1
@@ -57,12 +57,28 @@ export interface SshRuntimeTimingHint {
   note: string;
 }

+export interface SshRuntimeTimeoutHint {
+  code: "ssh-runtime-timeout";
+  level: "warning";
+  providerId: string;
+  route: string;
+  transport: "backend-core-broker" | "frontend-websocket";
+  invocationKind: SshInvocationKind;
+  timeoutMs: number;
+  timeoutSeconds: number;
+  message: string;
+  action: string;
+  note: string;
+}
+
 const argvQuotedSshSubcommands = new Set(["git", "rg", "grep", "sed", "nl", "stat", "du", "ls", "cat", "head", "tail", "wc", "pwd"]);
 const nativeK3sKubeconfig = "/etc/rancher/k3s/k3s.yaml";
 const windowsBridgeCwd = "/mnt/c/Windows";
 const windowsPowerShellExePath = "/mnt/c/Windows/System32/WindowsPowerShell/v1.0/powershell.exe";
 const windowsCmdExeNativePath = "C:\\Windows\\System32\\cmd.exe";
 const defaultSshSlowWarningMs = 10_000;
+const defaultSshRuntimeTimeoutMs = 60_000;
+const maxSshRuntimeTimeoutMs = 60_000;
 const k3sResourceKindAliases = new Set(["pod", "po", "pods", "deployment", "deploy", "deployments", "statefulset", "sts", "daemonset", "ds", "job", "jobs"]);
 const legacyK3sOperationRouteSegments = new Set([
   "guard",
@@ -1842,6 +1858,38 @@ export function formatSshRuntimeTimingHint(hint: SshRuntimeTimingHint): string {
   return `UNIDESK_SSH_TIMING ${JSON.stringify(hint)}\n`;
 }

+export function sshRuntimeTimeoutMs(env: NodeJS.ProcessEnv = process.env): number {
+  const raw = env.UNIDESK_SSH_RUNTIME_TIMEOUT_MS ?? env.UNIDESK_TRAN_RUNTIME_TIMEOUT_MS;
+  const parsed = raw === undefined ? NaN : Number(raw);
+  if (!Number.isFinite(parsed) || parsed <= 0) return defaultSshRuntimeTimeoutMs;
+  return Math.min(maxSshRuntimeTimeoutMs, Math.max(1000, Math.trunc(parsed)));
+}
+
+export function sshRuntimeTimeoutHint(options: {
+  invocation: ParsedSshInvocation;
+  transport: SshRuntimeTimeoutHint["transport"];
+  timeoutMs: number;
+}): SshRuntimeTimeoutHint {
+  const timeoutSeconds = Number((options.timeoutMs / 1000).toFixed(3));
+  return {
+    code: "ssh-runtime-timeout",
+    level: "warning",
+    providerId: safeProviderId(options.invocation.providerId),
+    route: options.invocation.route.raw,
+    transport: options.transport,
+    invocationKind: options.invocation.parsed.invocationKind,
+    timeoutMs: options.timeoutMs,
+    timeoutSeconds,
+    message: `ssh/tran operation exceeded the ${timeoutSeconds}s top-level runtime limit and was disconnected.`,
+    action: "Use short query plus poll semantics; do not keep tran open waiting for long CI/CD, trace, logs, or build progress.",
+    note: "Timeout hint is written to stderr and intentionally does not echo the original remote command.",
+  };
+}
+
+export function formatSshRuntimeTimeoutHint(hint: SshRuntimeTimeoutHint): string {
+  return `UNIDESK_SSH_RUNTIME_TIMEOUT ${JSON.stringify(hint)}\n`;
+}
+
 function brokerSource(): string {
   return String.raw`
 const open = JSON.parse(process.argv[2] || process.argv[1] || "{}");
@@ -1861,6 +1909,13 @@ const openTimer = setTimeout(() => {
   try { ws.close(); } catch {}
   process.exit(255);
 }, Number(open.openTimeoutMs || 15000));
+const runtimeTimeoutMs = Number(open.runtimeTimeoutMs || 60000);
+const runtimeTimer = setTimeout(() => {
+  process.stderr.write("unidesk ssh bridge runtime timeout; use short query plus poll semantics instead of keeping tran open\n");
+  exitCode = 124;
+  try { ws.close(); } catch {}
+  setTimeout(() => process.exit(124), 250).unref?.();
+}, runtimeTimeoutMs);

 function send(value) {
   const text = JSON.stringify(value);
@@ -1933,6 +1988,7 @@ ws.addEventListener("message", (event) => {
   }
   if (message.type === "ssh.error") {
     clearTimeout(openTimer);
+    clearTimeout(runtimeTimer);
     process.stderr.write(String(message.message || "ssh bridge error") + "\n");
     exitCode = 255;
     ws.close();
@@ -1940,12 +1996,15 @@ ws.addEventListener("message", (event) => {
   }
   if (message.type === "ssh.exit") {
     clearTimeout(openTimer);
+    clearTimeout(runtimeTimer);
     exitCode = Number.isInteger(message.exitCode) ? message.exitCode : 255;
     ws.close();
   }
 });
+
 ws.addEventListener("close", () => {
   clearTimeout(openTimer);
+  clearTimeout(runtimeTimer);
   process.exit(exitCode);
 });

@@ -1979,6 +2038,7 @@ export async function runSsh(config: UniDeskConfig, providerId: string, args: st
   const startedAtMs = Date.now();
   const size = terminalSize();
   const openTimeoutMs = Math.max(15000, Number(process.env.UNIDESK_SSH_OPEN_TIMEOUT_MS || 60000));
+  const runtimeTimeoutMs = sshRuntimeTimeoutMs();
   const payload = {
     providerId: invocation.providerId,
     command: wrapSshRemoteCommand(parsed.remoteCommand, parsed.requiredHelpers),
@@ -1986,6 +2046,7 @@ export async function runSsh(config: UniDeskConfig, providerId: string, args: st
     tty: parsed.remoteCommand === null,
     stdinEotOnEnd: parsed.remoteCommand !== null,
     openTimeoutMs,
+    runtimeTimeoutMs,
     cols: size.cols,
     rows: size.rows,
   };
@@ -2039,7 +2100,41 @@ export async function runSsh(config: UniDeskConfig, providerId: string, args: st

   return await new Promise<number>((resolve) => {
     let settled = false;
+    let timedOut = false;
+    let killTimer: NodeJS.Timeout | null = null;
+    const runtimeTimer = setTimeout(() => {
+      if (settled) return;
+      timedOut = true;
+      const hint = sshRuntimeTimeoutHint({
+        invocation,
+        transport: "backend-core-broker",
+        timeoutMs: runtimeTimeoutMs,
+      });
+      const formatted = formatSshRuntimeTimeoutHint(hint);
+      appendStderrTail(formatted);
+      process.stderr.write(formatted);
+      try {
+        child.stdin.destroy();
+      } catch {
+        // Ignore stdin teardown failures on the timeout path.
+      }
+      try {
+        child.kill("SIGTERM");
+      } catch {
+        // Ignore kill failures and fall through to finish.
+      }
+      killTimer = setTimeout(() => {
+        try {
+          child.kill("SIGKILL");
+        } catch {
+          // Process may have already exited.
+        }
+      }, 2000);
+      finish(124);
+    }, runtimeTimeoutMs);
     const restore = (): void => {
+      clearTimeout(runtimeTimer);
+      if (killTimer && !timedOut) clearTimeout(killTimer);
       process.stdin.unpipe(child.stdin);
       if (rawMode) process.stdin.setRawMode(false);
     };
@@ -2047,7 +2142,7 @@ export async function runSsh(config: UniDeskConfig, providerId: string, args: st
       if (settled) return;
       settled = true;
       restore();
-      const hint = sshFailureHint(invocation.providerId, parsed, exitCode, stderrTail);
+      const hint = timedOut ? null : sshFailureHint(invocation.providerId, parsed, exitCode, stderrTail);
       if (hint !== null) process.stderr.write(formatSshFailureHint(hint));
       const timingHint = formatSshRuntimeTimingHint(sshRuntimeTimingHint({
         invocation,

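Review note on the structured stderr lines: every hint in this change is written as `PREFIX {json}` on its own line, which makes it possible for a caller to separate hints from ordinary remote stderr. The sketch below shows one way a caller might do that. `parseHints` is a hypothetical consumer-side helper that is not in the repo; the four prefixes are the ones named in this diff, and the sample payload is abbreviated.

```typescript
// Hypothetical consumer-side parser; only the prefixes come from the diff.
const HINT_PREFIXES = [
  "UNIDESK_SSH_RUNTIME_TIMEOUT",
  "UNIDESK_TRAN_TIMEOUT_HINT",
  "UNIDESK_SSH_TIMING",
  "UNIDESK_SSH_HINT",
] as const;

type HintPrefix = (typeof HINT_PREFIXES)[number];

interface ParsedHint {
  prefix: HintPrefix;
  payload: Record<string, unknown>;
}

function parseHints(stderr: string): ParsedHint[] {
  const hints: ParsedHint[] = [];
  for (const line of stderr.split("\n")) {
    // Require the separating space so one prefix cannot match another.
    const prefix = HINT_PREFIXES.find((p) => line.startsWith(`${p} `));
    if (prefix === undefined) continue;
    try {
      const payload: unknown = JSON.parse(line.slice(prefix.length + 1));
      if (typeof payload === "object" && payload !== null && !Array.isArray(payload)) {
        hints.push({ prefix, payload: payload as Record<string, unknown> });
      }
    } catch {
      // Not valid JSON after the prefix: treat it as ordinary stderr text.
    }
  }
  return hints;
}

// Abbreviated sample; a real hint carries more fields.
const sample = [
  "some remote output on stderr",
  'UNIDESK_SSH_RUNTIME_TIMEOUT {"code":"ssh-runtime-timeout","timeoutMs":60000}',
].join("\n");

const parsed = parseHints(sample);
console.log(parsed.length, parsed[0]?.prefix, parsed[0]?.payload.timeoutMs);
```

A caller that finds a runtime-timeout hint can then switch to short queries instead of treating exit code 124 as a remote business failure.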
@@ -5,7 +5,18 @@ import path from "node:path";
 import { sshHelp } from "./src/help";
 import { providerTriageRecommendedCrossChecks } from "./src/provider-triage";
 import { extractRemoteCliOptions, remoteSshFrontendPlanForTest } from "./src/remote";
-import { formatSshFailureHint, formatSshRuntimeTimingHint, parseSshArgs, parseSshInvocation, remoteApplyPatchSource, sshFailureHint, sshRuntimeTimingHint } from "./src/ssh";
+import {
+  formatSshFailureHint,
+  formatSshRuntimeTimeoutHint,
+  formatSshRuntimeTimingHint,
+  parseSshArgs,
+  parseSshInvocation,
+  remoteApplyPatchSource,
+  sshFailureHint,
+  sshRuntimeTimeoutHint,
+  sshRuntimeTimeoutMs,
+  sshRuntimeTimingHint,
+} from "./src/ssh";

 type JsonRecord = Record<string, unknown>;

@@ -60,57 +71,6 @@ function applyPatchFixture(args: string[], patch: string, files: Record<string,
   }
 }

-function tranConcurrentLockFixture(): { status: number | null; stdout: string; stderr: string } {
-  const root = mkdtempSync(path.join(os.tmpdir(), "unidesk-tran-lock-contract-"));
-  try {
-    const fakeRepo = path.join(root, "repo");
-    const fakeScripts = path.join(fakeRepo, "scripts");
-    const fakeBin = path.join(root, "bin");
-    mkdirSync(fakeScripts, { recursive: true });
-    mkdirSync(fakeBin, { recursive: true });
-    writeFileSync(path.join(fakeScripts, "cli.ts"), "// fake cli entry for tran wrapper contract\n", "utf8");
-    const fakeBun = path.join(fakeBin, "bun");
-    writeFileSync(fakeBun, [
-      "#!/bin/sh",
-      "if mkdir \"$FAKE_BUN_RUN_LOCK\" 2>/dev/null; then",
-      " sleep 1",
-      " rmdir \"$FAKE_BUN_RUN_LOCK\"",
-      " exit 0",
-      "fi",
-      "echo fake bun observed overlapping tran execution >&2",
-      "exit 42",
-      "",
-    ].join("\n"), "utf8");
-    chmodSync(fakeBun, 0o755);
-    const tranPath = path.resolve("scripts/tran");
-    return spawnSync("sh", ["-c", [
-      `"${tranPath}" D601:/tmp pwd >/tmp/unidesk-tran-lock-one.out 2>/tmp/unidesk-tran-lock-one.err &`,
-      "p1=$!",
-      `"${tranPath}" D601:/tmp pwd >/tmp/unidesk-tran-lock-two.out 2>/tmp/unidesk-tran-lock-two.err &`,
-      "p2=$!",
-      "wait $p1; s1=$?",
-      "wait $p2; s2=$?",
-      "cat /tmp/unidesk-tran-lock-one.err /tmp/unidesk-tran-lock-two.err >&2",
-      "printf '%s %s\\n' \"$s1\" \"$s2\"",
-    ].join("\n")], {
-      cwd: path.resolve("."),
-      env: {
-        ...process.env,
-        PATH: `${fakeBin}${path.delimiter}${process.env.PATH ?? ""}`,
-        UNIDESK_TRAN_REPO_ROOT: fakeRepo,
-        UNIDESK_TRAN_LOCK_DIR: path.join(root, "locks"),
-        UNIDESK_TRAN_LOCK_NOTICE_SECONDS: "0",
-        UNIDESK_TRAN_LOCK_TIMEOUT_SECONDS: "10",
-        FAKE_BUN_RUN_LOCK: path.join(root, "fake-bun-running"),
-      },
-      encoding: "utf8",
-      timeout: 10_000,
-    });
-  } finally {
-    rmSync(root, { recursive: true, force: true });
-  }
-}
-
 export function runSshArgvGuidanceContract(): JsonRecord {
   const argv = parseSshArgs(["argv", "true"]);
   assertCondition(argv.invocationKind === "argv", "argv subcommand must be classified as argv", argv);
@@ -405,6 +365,17 @@ export function runSshArgvGuidanceContract(): JsonRecord {

   const timeoutHint = sshFailureHint("D601", sshLike, 255, "unidesk ssh bridge timed out waiting for provider session");
   assertCondition(timeoutHint?.trigger === "timeout-or-kex", "provider session timeout must map to timeout-or-kex", timeoutHint);
+  assertCondition(sshRuntimeTimeoutMs({ UNIDESK_SSH_RUNTIME_TIMEOUT_MS: "120000" } as NodeJS.ProcessEnv) === 60_000, "ssh runtime timeout must cap at 60s", {});
+  assertCondition(sshRuntimeTimeoutMs({ UNIDESK_TRAN_RUNTIME_TIMEOUT_MS: "2500" } as NodeJS.ProcessEnv) === 2500, "ssh runtime timeout must accept smaller explicit limits", {});
+  const runtimeTimeout = sshRuntimeTimeoutHint({
+    invocation: parseSshInvocation("G14:k3s", ["script"]),
+    transport: "backend-core-broker",
+    timeoutMs: 60_000,
+  });
+  const formattedRuntimeTimeout = formatSshRuntimeTimeoutHint(runtimeTimeout);
+  assertCondition(formattedRuntimeTimeout.startsWith("UNIDESK_SSH_RUNTIME_TIMEOUT "), "runtime timeout hint must have structured prefix", formattedRuntimeTimeout);
+  assertCondition(formattedRuntimeTimeout.includes("short query plus poll semantics"), "runtime timeout hint must point to short polling", formattedRuntimeTimeout);
+  assertCondition(!formattedRuntimeTimeout.includes("kubectl"), "runtime timeout hint must not echo remote command text", formattedRuntimeTimeout);

   const helpText = JSON.stringify(sshHelp());
   assertCondition(helpText.includes("ssh <providerId> script [--shell sh|bash] [script-args...] <<'SCRIPT'"), "ssh help must recommend stdin script passthrough for shell scripts", helpText);
@@ -423,8 +394,9 @@ export function runSshArgvGuidanceContract(): JsonRecord {
   assertCondition(helpText.includes("apply-patch [--allow-loose]") && helpText.includes("low-context update hunks"), "ssh help must document apply-patch loose-context guard", helpText);
   assertCondition(helpText.includes("ssh D601:k3s:hwlab-dev:hwlab-cloud-api script <<'SCRIPT'"), "ssh help must document k3s script operation", helpText);
   assertCondition(helpText.includes("UNIDESK_SSH_HINT"), "ssh help must document structured failure hint", helpText);
+  assertCondition(helpText.includes("UNIDESK_SSH_RUNTIME_TIMEOUT") && helpText.includes("UNIDESK_TRAN_TIMEOUT_HINT") && helpText.includes("60s") && helpText.includes("submit-and-poll"), "ssh help must document top-level runtime timeout and short polling discipline", helpText);
   assertCondition(helpText.includes("UNIDESK_SSH_TIMING") && helpText.includes("10s") && helpText.includes("slow successful calls are a distributed performance monitoring signal") && helpText.includes("Routine short calls do not emit timing noise"), "ssh help must document slow-only runtime timing hints", helpText);
-  assertCondition(helpText.includes("UNIDESK_TRAN_SESSION_LOCK=0") && helpText.includes("provider session allocator"), "ssh help must document tran provider session serialization", helpText);
+  assertCondition(helpText.includes("must not add provider/plane directory locks") && helpText.includes("k8s/Tekton/Argo/Lease"), "ssh help must document tran's no-local-lock boundary", helpText);

   const crossChecks = providerTriageRecommendedCrossChecks("D601");
   assertCondition(crossChecks.includes("bun scripts/cli.ts ssh D601 argv true"), "provider triage cross-checks must keep argv true", crossChecks);
@@ -455,12 +427,7 @@ export function runSshArgvGuidanceContract(): JsonRecord {
   const tranScript = readFileSync(new URL("./tran", import.meta.url), "utf8");
   assertCondition(tranScript.includes("CODE_QUEUE_DEV_CONTAINER_MASTER_HOST") && tranScript.includes("--main-server-ip"), "tran wrapper must auto-select frontend transport inside Code Queue runner pods", tranScript);
   assertCondition(tranScript.includes("UNIDESK_TRAN_LOCAL"), "tran wrapper must keep an explicit local override for diagnostics", tranScript);
-  assertCondition(tranScript.includes("tran_lock_scope") && tranScript.includes("UNIDESK_TRAN_LOCK_DIR"), "tran wrapper must serialize concurrent provider session opens with a local sh lock", tranScript);
-  assertCondition(tranScript.includes("*:win|*:win/*) plane=win"), "tran wrapper must lock win route calls separately from host/k3s calls", tranScript);
-  const tranLock = tranConcurrentLockFixture();
-  assertCondition(tranLock.status === 0, "tran lock fixture shell should complete", tranLock);
-  assertCondition(tranLock.stdout.trim() === "0 0", "parallel tran invocations for one provider must serialize instead of overlapping fake bun", tranLock);
-  assertCondition(!tranLock.stderr.includes("overlapping tran execution"), "tran provider lock must prevent overlapping provider session allocation", tranLock);
+  assertCondition(!tranScript.includes("tran_lock_scope") && !tranScript.includes("UNIDESK_TRAN_LOCK_DIR") && !tranScript.includes("mkdir \"$lock_path\""), "tran wrapper must not add local provider/plane directory locks", tranScript);

   const remoteSource = readFileSync(new URL("./src/remote.ts", import.meta.url), "utf8");
   assertCondition(remoteSource.includes("UNIDESK_REMOTE_HTTP_CLIENT") && remoteSource.includes("isCodeQueueRunnerEnv(env) ? \"curl\" : \"fetch\""), "remote frontend transport must default to curl HTTP in Code Queue runner environments", remoteSource);
@@ -506,7 +473,7 @@ export function runSshArgvGuidanceContract(): JsonRecord {
       "host apply-patch bootstraps only the apply_patch helper and uses a Perl fast path for large files",
       "remote frontend ssh uses authenticated /ws/ssh streaming instead of host.ssh dispatch task polling",
       "Code Queue runner image installs the tran wrapper and runner tran auto-selects remote frontend transport",
-      "tran serializes concurrent non-interactive calls per provider/plane before opening provider SSH sessions",
+      "tran does not add local provider/plane directory locks and leaves coordination to k8s/Tekton/Argo/Lease",
       "Code Queue runner remote frontend HTTP uses curl by default for non-ssh API calls to avoid Bun response-body native crashes",
     ],
   };

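Review note on the two clamps exercised by this change: the contract test checks the millisecond clamp in the CLI (120000 becomes 60000, 2500 stays 2500), while the shell wrapper converts a millisecond variable to whole seconds by rounding up. The sketch below re-implements both rules side by side for comparison. `clampRuntimeMs` and `wrapperSecondsFromMs` are hypothetical names written for this note; they mirror the logic visible in the diff and are not imports from the repo.

```typescript
const DEFAULT_MS = 60_000;
const MAX_MS = 60_000;

// Mirrors the CLI clamp shown in the diff: missing, invalid, or non-positive
// input falls back to the default; valid input is truncated, floored at
// 1000 ms, and capped at 60000 ms.
function clampRuntimeMs(raw: string | undefined): number {
  const parsed = raw === undefined ? NaN : Number(raw);
  if (!Number.isFinite(parsed) || parsed <= 0) return DEFAULT_MS;
  return Math.min(MAX_MS, Math.max(1000, Math.trunc(parsed)));
}

// Mirrors the wrapper's integer arithmetic for a millisecond variable:
// digits only, round up to whole seconds, and fall back to 60 when the
// result is not within 1..60.
function wrapperSecondsFromMs(raw: string): number {
  if (!/^[0-9]+$/.test(raw)) return 60;
  const seconds = Math.floor((Number(raw) + 999) / 1000);
  if (seconds <= 0 || seconds > 60) return 60;
  return seconds;
}

console.log(clampRuntimeMs("120000"), clampRuntimeMs("2500"), clampRuntimeMs("250"));
console.log(wrapperSecondsFromMs("2500"), wrapperSecondsFromMs("120000"));
```

Under these rules a 2500 ms setting yields a 2500 ms inner limit and a 3 s outer guard, so the outer `timeout` is never tighter than the inner timer for the same setting.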
+35
-55
@@ -7,67 +7,47 @@ if [ ! -f "$repo/scripts/cli.ts" ]; then
   repo=$(CDPATH= cd -- "$self_dir/.." && pwd)
 fi

+tran_timeout_seconds() {
+  raw=${UNIDESK_TRAN_RUNTIME_TIMEOUT_SECONDS:-}
+  if [ -z "$raw" ] && [ -n "${UNIDESK_TRAN_RUNTIME_TIMEOUT_MS:-}" ]; then
+    case "$UNIDESK_TRAN_RUNTIME_TIMEOUT_MS" in
+      ''|*[!0-9]*) raw=60 ;;
+      *) raw=$(((${UNIDESK_TRAN_RUNTIME_TIMEOUT_MS} + 999) / 1000)) ;;
+    esac
+  fi
+  if [ -z "$raw" ] && [ -n "${UNIDESK_SSH_RUNTIME_TIMEOUT_MS:-}" ]; then
+    case "$UNIDESK_SSH_RUNTIME_TIMEOUT_MS" in
+      ''|*[!0-9]*) raw=60 ;;
+      *) raw=$(((${UNIDESK_SSH_RUNTIME_TIMEOUT_MS} + 999) / 1000)) ;;
+    esac
+  fi
+  raw=${raw:-60}
+  case "${raw:-60}" in
+    ''|*[!0-9]*) raw=60 ;;
+  esac
+  [ "$raw" -gt 0 ] || raw=60
+  [ "$raw" -le 60 ] || raw=60
+  printf '%s\n' "$raw"
+}
+
+if [ "${UNIDESK_TRAN_TIMEOUT_GUARD:-0}" != "1" ] && command -v timeout >/dev/null 2>&1; then
+  timeout_seconds=$(tran_timeout_seconds)
+  set +e
+  UNIDESK_TRAN_TIMEOUT_GUARD=1 timeout -s TERM -k 2s "${timeout_seconds}s" "$0" "$@"
+  rc=$?
+  set -e
+  if [ "$rc" = 124 ] || [ "$rc" = 137 ] || [ "$rc" = 143 ]; then
+    printf 'UNIDESK_TRAN_TIMEOUT_HINT {"code":"tran-top-level-timeout","level":"warning","timeoutSeconds":%s,"message":"tran exceeded the top-level runtime limit and was disconnected.","action":"Use short query plus poll semantics; do not keep tran open waiting for long CI/CD, trace, logs, or build progress."}\n' "$timeout_seconds" >&2
+  fi
+  exit "$rc"
+fi
+
 host=${UNIDESK_MAIN_SERVER_IP:-${UNIDESK_MAIN_SERVER_HOST:-${CODE_QUEUE_DEV_CONTAINER_MASTER_HOST:-}}}
 runner_env=0
 if [ -n "${CODE_QUEUE_SERVICE_ROLE:-}" ] || [ -n "${CODE_QUEUE_INSTANCE_ID:-}" ] || [ -n "${KUBERNETES_SERVICE_HOST:-}" ]; then
   runner_env=1
 fi

-tran_lock_scope() {
-  [ "$#" -ge 2 ] || return 1
-  case "${UNIDESK_TRAN_SESSION_LOCK:-1}" in
-    0|false|FALSE|no|NO|off|OFF) return 1 ;;
-  esac
-  route=$1
-  case "$route" in
-    ""|-*) return 1 ;;
-  esac
-  provider=${route%%:*}
-  [ -n "$provider" ] || return 1
-  plane=host
-  case "$route" in
-    *:win|*:win/*) plane=win ;;
-    *:k3s*) plane=k3s ;;
-  esac
-  printf '%s\n' "$provider-$plane"
-}
-
-tran_acquire_lock() {
-  scope=$1
-  lock_root=${UNIDESK_TRAN_LOCK_DIR:-/tmp/unidesk-tran-locks}
-  lock_name=$(printf '%s' "$scope" | tr -c 'A-Za-z0-9_.-' '_')
-  lock_path=$lock_root/$lock_name.lock
-  notice_seconds=${UNIDESK_TRAN_LOCK_NOTICE_SECONDS:-3}
-  warning_seconds=${UNIDESK_TRAN_LOCK_WARNING_SECONDS:-10}
-  timeout_seconds=${UNIDESK_TRAN_LOCK_TIMEOUT_SECONDS:-120}
-  mkdir -p "$lock_root"
-  started=$(date +%s)
-  noticed=0
-  warned=0
-  while ! mkdir "$lock_path" 2>/dev/null; do
-    now=$(date +%s)
-    waited=$((now - started))
-    if [ "$noticed" = 0 ] && [ "$waited" -ge "$notice_seconds" ]; then
-      printf 'tran provider session lock waiting scope=%s waited=%ss; serializing concurrent opens to avoid provider session allocation timeouts\n' "$scope" "$waited" >&2
-      noticed=1
-    fi
-    if [ "$warned" = 0 ] && [ "$waited" -ge "$warning_seconds" ]; then
-      printf 'tran provider session lock warning scope=%s waited=%ss; high-frequency distributed calls are queued behind another tran, consider batching reads or checking stuck sessions if this repeats\n' "$scope" "$waited" >&2
-      warned=1
-    fi
-    if [ "$waited" -ge "$timeout_seconds" ]; then
-      printf 'tran provider session lock timeout scope=%s waited=%ss lock=%s\n' "$scope" "$waited" "$lock_path" >&2
-      exit 255
-    fi
-    sleep 1
-  done
-  trap 'rmdir "$lock_path" 2>/dev/null || true' EXIT
-}
-
-if scope=$(tran_lock_scope "$@"); then
-  tran_acquire_lock "$scope"
-fi
-
 if [ "$runner_env" = 1 ] && [ -n "$host" ] && [ "${UNIDESK_TRAN_LOCAL:-}" != "1" ]; then
   bun "$repo/scripts/cli.ts" --main-server-ip "$host" ssh "$@"
   exit $?

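Review note on the removed directory lock: the `mkdir` lock had no expiry, so a holder that died before its `trap` ran could block later callers until the lock timeout. The documentation in this commit recommends a server-side queue or lease with a TTL instead. The sketch below illustrates why a TTL removes the stale-lock problem. `LeaseTable` is a hypothetical in-memory structure written for this note, not a design taken from the repo; a real service would also need persistence and authentication.

```typescript
// Hypothetical in-memory lease table, for illustration only.
interface Lease {
  holder: string;
  expiresAtMs: number;
}

class LeaseTable {
  private readonly leases = new Map<string, Lease>();
  private readonly ttlMs: number;

  constructor(ttlMs: number) {
    this.ttlMs = ttlMs;
  }

  // Grant the lease when it is free, expired, or already held by the caller.
  // The clock is passed in so the behavior is easy to test.
  tryAcquire(scope: string, holder: string, nowMs: number): boolean {
    const current = this.leases.get(scope);
    if (current !== undefined && current.expiresAtMs > nowMs && current.holder !== holder) {
      return false;
    }
    this.leases.set(scope, { holder, expiresAtMs: nowMs + this.ttlMs });
    return true;
  }

  // Only the current holder can release early.
  release(scope: string, holder: string): void {
    if (this.leases.get(scope)?.holder === holder) this.leases.delete(scope);
  }
}

// Demo: holder "a" crashes without releasing; "b" gets in once the TTL passes.
const table = new LeaseTable(1000);
const first = table.tryAcquire("D601-k3s", "a", 0);
const blocked = table.tryAcquire("D601-k3s", "b", 500);
const afterExpiry = table.tryAcquire("D601-k3s", "b", 1500);
console.log(first, blocked, afterExpiry);
```

The scope string in the demo reuses the `provider-plane` naming from the removed wrapper code purely as an example key.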