Add provider triage health contract

Codex
2026-05-20 11:59:44 +00:00
parent 34e0dca884
commit 9259fce80f
7 changed files with 539 additions and 10 deletions
+1 -1
@@ -33,7 +33,7 @@ UniDesk is a distributed work platform with the main server as its single entry point; this document
 - `bun scripts/cli.ts server swap status|ensure [--path /swapfile] [--size 2GiB] [--dry-run]`: inspects as JSON, or idempotently creates, the main-server swapfile; `ensure` reports before/after, actions, persistence status, and degraded/failed details. Rules: see `docs/reference/deployment.md`.
 - `bun scripts/cli.ts server logs [--tail-bytes N]`: returns paginated tails of file logs and Docker logs with truncation metadata. Logging rules: see `docs/reference/observability.md`.
 - `bun scripts/cli.ts server rebuild <backend-core|frontend|dev-frontend-proxy|provider-gateway|todo-note|code-queue-mgr|project-manager|baidu-netdisk|oa-event-flow>`: rebuilds a single service in the main-server Compose project as an asynchronous job using build-first, a Compose lock, no-deps force-recreate, and post-up validation; returns a structured `unsupported-server-rebuild` for database, File Browser, the Code Queue execution plane, k3sctl-adapter, or unknown targets. Rules: see `docs/reference/deployment.md` and `docs/reference/cicd-standardization.md`.
-- `bun scripts/cli.ts provider attach <providerId> [--master-server URL] [--up] [--force]`: generates a two-setting provider-gateway attach bundle on a newly added compute node; by default only the main server URL (default `http://74.48.78.17/`) and a unique Provider ID are needed. The generated Compose file pins the Docker socket, `pid: "host"`, `restart: always`, a read-only `/workspace`, the SSH maintenance private-key mount, and the loopback egress proxy port. Rules: see `docs/reference/provider-gateway.md`.
+- `bun scripts/cli.ts provider attach <providerId> [--master-server URL] [--up] [--force]` / `bun scripts/cli.ts provider triage <providerId> [--observed-error text] [--observed-scope scope] [--microservice id ...]`: the former generates a two-setting provider-gateway attach bundle on a newly added compute node; the latter is the read-only multi-signal health adjudication entry point, used to classify a single-path `provider is not online`, an SSH timeout, a registry failure, or a proxy failure as runner-local, service-degraded, provider-degraded, or global-blocker. Rules: see `docs/reference/provider-gateway.md` and `docs/reference/code-queue-supervision.md`.
 - `bun scripts/cli.ts ssh <providerId> [ssh-like args...]`: opens a near-native ssh interactive session or remote command through the provider-gateway Host SSH / WSL SSH maintenance bridge and injects `apply_patch`, `glob`, and `skill-discover` into the remote PATH; the `apply-patch`, `py`, `skills`, and structured `find`, `glob`, and `argv` subcommands avoid nested-escaping problems with remote patches, Python stdin, skill discovery, and common read-only commands. Usage rules: see `docs/reference/cli.md` and `docs/reference/provider-gateway.md`.
 - `bun scripts/cli.ts microservice list/status/health/diagnostics/tunnel-self-test/proxy`: manages and verifies user services mounted on the main server, compute-node Docker, or the k3s control plane; `proxy` supports a controlled JSON body. Rules for OA Event Flow/Todo Note/Baidu Netdisk/Code Queue Manager on main-server and for k3s Control/Code Queue execution plane/MDTODO/Decision Center/FindJob/Pipeline/MET Nonlinear on D601: see `docs/reference/microservices.md`.
 - `bun scripts/cli.ts microservice health/diagnostics/proxy code-agent-sandbox`: verifies the standalone Code Agent Sandbox's health, read-only diagnostics, trace, and adapter/mode/credential boundary contract. Rules: see `docs/reference/code-agent-sandbox.md`.
+1 -1
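@@ -17,7 +17,7 @@ The CLI may evolve quickly from `master`, but must remain compatible with the CI pinned by `deploy.json`
 - `server swap status|ensure [--path /swapfile] [--size 2GiB] [--dry-run]` is the main-server swap management entry point. `status` only reads `/proc/meminfo`, `/proc/swaps`, and `/etc/fstab` and returns JSON; `ensure` reports a no-op when any active swap already exists, and when there is none it creates a fixed swapfile, runs `chmod 600`, `mkswap`, and `swapon`, and writes `/etc/fstab` on a best-effort basis. The output must include `before`, `after`, total memory, active swap, persistence status, key actions, and error details; if swap is enabled but the fstab write fails, the status is `degraded` and the caller must repair persistence according to the returned detail.
 - `server logs` returns the tails of the file logs under `logs/` and of Docker container logs, limiting output size by default to avoid log blow-up. The implementation must read only the trailing bytes of a file and must not load a huge log fully into CLI memory just to tail it.
 - `server rebuild <backend-core|frontend|dev-frontend-proxy|provider-gateway|todo-note|code-queue-mgr|project-manager|baidu-netdisk|oa-event-flow>` creates an asynchronous job that first builds the target service image, then, serialized under `.state/locks/server-compose.lock`, replaces the target service with `--no-deps --force-recreate` and waits for the container to become `healthy/running`. The command replaces the fallback workflow of manually deleting containers: `dev-frontend-proxy` only updates the thin proxy for the main-server dev entry point, and `todo-note`, `code-queue-mgr`, `project-manager`, `baidu-netdisk`, and `oa-event-flow` only rebuild the corresponding backends hosted on the main server, without rebuilding or deleting the database named volume. The D601 Code Queue execution plane is not managed by `server rebuild`, and Rust backend-core iteration must not use `server rebuild backend-core` to compile on the master server. Rules: see `docs/reference/dev-environment.md`.
-- `provider attach <providerId> [--master-server URL] [--up] [--force]` generates a two-setting provider-gateway attach bundle on a new compute node: `.state/provider-<ID>.env` contains only `UNIDESK_MASTER_SERVER` and `PROVIDER_ID` by default, and `provider-<ID>.yml` pins the Docker socket, `pid: "host"`, `restart: always`, a read-only `/workspace`, and the SSH maintenance private-key mount; `--up` immediately runs the generated `docker compose up -d --build`.
+- `provider attach <providerId> [--master-server URL] [--up] [--force]` generates a two-setting provider-gateway attach bundle on a new compute node: `.state/provider-<ID>.env` contains only `UNIDESK_MASTER_SERVER` and `PROVIDER_ID` by default, and `provider-<ID>.yml` pins the Docker socket, `pid: "host"`, `restart: always`, a read-only `/workspace`, and the SSH maintenance private-key mount; `--up` immediately runs the generated `docker compose up -d --build`. `provider triage <providerId> [--observed-error text] [--observed-scope scope] [--microservice id ...]` is the read-only multi-signal health adjudication entry point. It classifies a single-path `provider is not online`, SSH timeouts, registry failures, and service proxy failures as `runner-local-observation-gap`, `service-degraded`, `provider-degraded`, or `global-blocker`, and by default offers `debug health`, `debug dispatch <providerId> host.ssh --wait-ms 15000`, `ssh <providerId> argv true`, `artifact-registry health --provider-id <providerId>`, `microservice health k3sctl-adapter`, `microservice health code-queue`, and `codex tasks --view supervisor --limit 20` as recommended cross-check commands.
 - `ssh <providerId> [ssh-like args...]` connects to the target node through the backend-core internal WebSocket broker and the provider-gateway Host SSH / WSL SSH maintenance bridge; with no further arguments it enters a remote login shell, and with further arguments it runs them like an ssh remote command and returns the remote exit code.
 - `ssh <providerId> apply-patch [tool args...] < patch.diff` directly invokes the remotely injected `apply_patch` tool and passes the standard `*** Begin Patch` / `*** End Patch` patch stream from local stdin through to the target node.
 - `ssh <providerId> py [script-args...] < script.py` writes local stdin to a temporary remote `.py` file, runs it with `python3 -u`, and cleans it up automatically, so there is no need to hand-write `'python3 -'`, heredocs, or nested quoting; `script-args` are passed safely to the remote script as argv.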
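The `provider triage` usage string above can be assembled programmatically. The following is a minimal sketch, not part of this commit: `TriageOptions` and `buildTriageArgv` are hypothetical names, and only the flag names are taken from the documented usage. Building an argv array, rather than a shell string, avoids quoting problems when the observed error text contains spaces.

```typescript
// Hypothetical option bag; only the flag names come from the documented usage string.
interface TriageOptions {
  observedError?: string;
  observedScope?: string;
  microservices?: string[];
}

// Build the argv for `bun scripts/cli.ts provider triage ...` without going through a shell.
function buildTriageArgv(providerId: string, options: TriageOptions = {}): string[] {
  const argv = ["bun", "scripts/cli.ts", "provider", "triage", providerId];
  if (options.observedError) argv.push("--observed-error", options.observedError);
  if (options.observedScope) argv.push("--observed-scope", options.observedScope);
  // `--microservice` is repeatable, one flag per service id.
  for (const id of options.microservices ?? []) argv.push("--microservice", id);
  return argv;
}

const triageArgv = buildTriageArgv("D601", {
  observedError: "provider is not online",
  microservices: ["todo-note"],
});
console.log(JSON.stringify(triageArgv));
```

The resulting array can be handed to any process-spawning API that accepts an argument vector.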
+2
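@@ -95,6 +95,8 @@ PR support is itself part of Code Queue's capabilities. The UniDesk CLI currently supports `gh
 A problem may be judged a global blocker only when multiple independent observation surfaces fail at the same time, or when the same critical path fails continuously within a clearly defined time window. Otherwise it should be recorded as transient or as a runner-local observation gap, with priority given to retrying, steering the task back on course, or splitting out an infrastructure follow-up; a business worker must not treat a single local failure as the final blocker. The CLI and runtime should subsequently structure error output as `scope=runner-local|provider-gateway|ssh|registry|k3s|scheduler|service-proxy`, `observedAt`, `retryable`, and suggested cross-check commands.
+In the UniDesk CLI, `bun scripts/cli.ts provider triage <providerId>` is the read-only multi-signal adjudication entry point, suited as the unified health check that workers and the commander run first. It must preserve at least these contracts: a single-path failure such as `provider is not online` may only land on `runner-local-observation-gap` and must not directly produce `global-blocker`; `global-blocker` is allowed only when multiple independent critical paths such as provider-gateway/SSH/k3s/scheduler fail simultaneously; when the registry or a single service proxy fails while heartbeat, SSH, or the node view remains healthy, the output should be `service-degraded` or `provider-degraded`; and `recommendedCrossChecks` must include `debug health`, `debug dispatch <providerId> host.ssh --wait-ms 15000`, `ssh <providerId> argv true`, `artifact-registry health --provider-id <providerId>`, `microservice health k3sctl-adapter`, `microservice health code-queue`, and `codex tasks --view supervisor --limit 20`.
 For long-running tasks whose trace or heartbeat is fresh, the default should be to keep them running. Polling once every few minutes is better than repeated interrupt/retry.
 Rate limiting and short-lived unavailability of external token providers, model APIs, or upstream services are expected and should not be automatically escalated to Code Queue infrastructure defects. Typical symptoms include `429 Too Many Requests`, provider transient errors, upstream timeouts, or short-lived model-service failures. As long as the Code Queue state machine is still applying automatic exponential backoff, the task heartbeat or scheduler heartbeat is fresh, and the task can still return from `retry_wait` to `running`, the commander should wait for the external provider to recover on its own: no extra repair issue, no re-dispatch of duplicate tasks, and no recording of the symptom as a blocker. Intervene as a Code Queue infrastructure problem only when the backoff mechanism fails, the task is lost, the heartbeat expires, the state machine is stuck, or retries are exhausted into an unrecoverable terminal state.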
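The wait-versus-intervene rule for upstream rate limits can be sketched as a small pure function. This is an illustration under stated assumptions, not code from the commit: `TaskSnapshot`, its field names, the `terminal-failed` and `lost` state labels, and the five-minute heartbeat TTL are all hypothetical; only `retry_wait` and `running` come from the text.

```typescript
// Hypothetical snapshot of one Code Queue task; every field name here is an assumption.
interface TaskSnapshot {
  state: "running" | "retry_wait" | "terminal-failed" | "lost";
  lastHeartbeatMs: number; // epoch ms of the freshest task or scheduler heartbeat
  nextRetryAtMs: number | null; // null means no backoff retry is currently scheduled
}

type SupervisorAction = "wait" | "intervene";

// Assumed freshness window; the document does not specify a concrete TTL.
const ASSUMED_HEARTBEAT_TTL_MS = 5 * 60_000;

function decideOnUpstreamError(
  task: TaskSnapshot,
  nowMs: number,
  heartbeatTtlMs = ASSUMED_HEARTBEAT_TTL_MS,
): SupervisorAction {
  // Task lost, or retries exhausted into an unrecoverable terminal state.
  if (task.state === "lost" || task.state === "terminal-failed") return "intervene";
  // Heartbeat expired.
  if (nowMs - task.lastHeartbeatMs > heartbeatTtlMs) return "intervene";
  // Backoff mechanism failed: waiting for a retry that was never scheduled.
  if (task.state === "retry_wait" && task.nextRetryAtMs === null) return "intervene";
  // Backoff is active and the heartbeat is fresh: let the external provider recover.
  return "wait";
}

const now = 1_000_000_000;
const backingOff = decideOnUpstreamError(
  { state: "retry_wait", lastHeartbeatMs: now - 30_000, nextRetryAtMs: now + 60_000 },
  now,
);
const staleHeartbeat = decideOnUpstreamError(
  { state: "retry_wait", lastHeartbeatMs: now - 10 * 60_000, nextRetryAtMs: now + 60_000 },
  now,
);
console.log(backingOff, staleHeartbeat);
```

Keeping the rule a pure function of a snapshot and a clock value makes it easy to unit-test without a live queue.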
+11 -7
@@ -15,7 +15,7 @@ export function rootHelp(): unknown {
       { command: "server swap status|ensure [--path /swapfile] [--size 2GiB] [--dry-run]", description: "Inspect or idempotently create host swap for low-memory main-server operation." },
       { command: "server logs [--tail-bytes N]", description: "Return bounded tails from file logs and docker logs." },
       { command: "server rebuild <backend-core|frontend|dev-frontend-proxy|provider-gateway|todo-note|code-queue-mgr|project-manager|baidu-netdisk|oa-event-flow>", description: "Maintenance-only local Compose rebuild for reviewed main-server services; frontend standard release must use CI artifact plus deploy apply dev/prod artifact consumers." },
-      { command: "provider attach <providerId> [--master-server URL] [--up] [--force]", description: "Generate the minimal external provider-gateway env/compose bundle; only master server URL and provider id are required." },
+      { command: "provider attach <providerId> [--master-server URL] [--up] [--force] | provider triage <providerId> [--observed-error text] [--observed-scope scope] [--microservice id ...]", description: "Generate the minimal external provider-gateway env/compose bundle or run the read-only provider health triage contract." },
       { command: "ssh <providerId> [ssh-like args...]", description: "Open a Host SSH / WSL SSH maintenance session through the provider-gateway bridge with built-in remote helper tools in PATH." },
       { command: "ssh <providerId> apply-patch [tool args...] < patch.diff", description: "Invoke the injected remote apply_patch helper directly over SSH passthrough and stream the patch from local stdin." },
       { command: "ssh <providerId> py [script-args...] < script.py", description: "Run remote Python from local stdin through SSH passthrough without nested shell quoting; extra args become script argv." },
@@ -45,7 +45,7 @@ export function rootHelp(): unknown {
       { command: "artifact-registry plan|render|status|health|install|deploy-backend-core|deploy-service", description: "Manage the D601 host-managed CNCF Distribution registry and run pull-only artifact CD for supported services, including D601 direct, k3s-managed, and code-queue dev-only consumers." },
       { command: "gh auth|issue|pr", description: "Run safe GitHub issue and PR list/view/create/comment operations through REST with body-file support, token diagnostics, escape scanning, and merge blocked." },
       { command: "code-agent-sandbox", description: "Independent Code Agent Sandbox service skeleton for adapter, mode, and credential-boundary diagnostics." },
-      { command: "schedule list|get|runs|run|delete", description: "Manage backend-core scheduled tasks and run history; schedule run <id> supports --wait-ms N." },
+      { command: "schedule list|get|runs|run|retry-run|delete", description: "Manage backend-core scheduled tasks and run history; schedule run <id> supports --wait-ms N and retry-run reuses the failed run's schedule." },
       { command: "schedule upsert-pgdata-backup [--time HH:MM] [--remote-base /SERVER_DATA/UNIDESK_PG_DATA]", description: "Create or update the daily PGDATA physical backup task that uploads monthly rotated archives to Baidu Netdisk." },
       { command: "codex deploy <commitId> [--provider-id D601] [--timeout-ms N]", description: "Disabled legacy Code Queue deploy path; use the dev-only artifact consumer instead." },
       { command: "codex submit [prompt] [--prompt-file path|--prompt-stdin] [--queue queueId] [--provider-id id] [--cwd path] [--model model] [--execution-mode mode] [--max-attempts N] [--reference-task-id id] [--dry-run]", description: "Submit a Code Queue task through backend-core -> code-queue proxy; --dry-run shows the structured request without enqueueing." },
@@ -174,22 +174,26 @@ function decisionHelp(): unknown {
 function providerHelp(): unknown {
   return {
-    command: "provider attach",
+    command: "provider attach|triage",
     output: "json",
-    usage: "bun scripts/cli.ts provider attach <providerId> [--master-server URL] [--up] [--force]",
-    description: "Generate the minimal provider-gateway attach env/compose bundle for a new compute node.",
+    usage: [
+      "bun scripts/cli.ts provider attach <providerId> [--master-server URL] [--up] [--force]",
+      "bun scripts/cli.ts provider triage <providerId> [--observed-error text] [--observed-scope scope] [--microservice id ...]",
+    ],
+    description: "Generate the minimal provider-gateway attach env/compose bundle or run the read-only provider health triage contract.",
   };
 }
 
 function scheduleHelp(): unknown {
   return {
-    command: "schedule list|get|runs|run|delete|upsert-pgdata-backup",
+    command: "schedule list|get|runs|run|retry-run|delete|upsert-pgdata-backup",
     output: "json",
     usage: [
       "bun scripts/cli.ts schedule list",
       "bun scripts/cli.ts schedule get <id>",
-      "bun scripts/cli.ts schedule runs [id] [--limit N]",
+      "bun scripts/cli.ts schedule runs [scheduleId] [--limit N]",
       "bun scripts/cli.ts schedule run <id> [--wait-ms N]",
+      "bun scripts/cli.ts schedule retry-run <failedRunId>",
       "bun scripts/cli.ts schedule delete <id>",
       "bun scripts/cli.ts schedule upsert-pgdata-backup [--time HH:MM] [--remote-base path]",
     ],
+9 -1
@@ -2,6 +2,7 @@ import { existsSync, mkdirSync, readFileSync, writeFileSync } from "node:fs";
 import { dirname, join } from "node:path";
 import { type UniDeskConfig, repoRoot, rootPath } from "./config";
 import { runCommand } from "./command";
+import { runProviderTriage } from "./provider-triage";
 
 interface ProviderAttachOptions {
   providerId: string;
@@ -177,7 +178,14 @@ function inspectAttachedContainer(options: ProviderAttachOptions): ProviderAttac
 export async function runProviderCommand(config: UniDeskConfig, args: string[]): Promise<unknown> {
   const [sub] = args;
   if (sub !== "attach") {
-    throw new Error("provider requires subcommand: attach");
+    if (sub === "triage") {
+      const providerId = args[1];
+      if (providerId === undefined || providerId.length === 0) {
+        throw new Error("provider triage requires provider id, for example: bun scripts/cli.ts provider triage D601");
+      }
+      return runProviderTriage(config, providerId, args.slice(2));
+    }
+    throw new Error("provider requires subcommand: attach|triage");
   }
   const options = parseAttachOptions(config, args.slice(1));
   mkdirSync(options.logDir, { recursive: true });
+59
@@ -0,0 +1,59 @@
import { describe, expect, test } from "bun:test";
import { buildProviderTriageResult, type ProviderTriageSignal } from "./provider-triage";

function signal(
  id: string,
  scope: ProviderTriageSignal["scope"],
  status: ProviderTriageSignal["status"],
): ProviderTriageSignal {
  return {
    id,
    scope,
    status,
    independentPath: true,
    observedAt: "2026-05-20T00:00:00.000Z",
    summary: `${id}:${scope}:${status}`,
  };
}

describe("provider triage contract", () => {
  test("single path provider offline stays non-global blocker", () => {
    const result = buildProviderTriageResult("D601", [
      signal("runner-local", "runner-local", "failed"),
      signal("backend-core-node", "provider-gateway", "ok"),
      signal("host-ssh-probe", "ssh", "ok"),
    ], "2026-05-20T00:00:00.000Z");

    expect(result.blockingDisposition).toBe("runner-local-observation-gap");
    expect(result.retryable).toBe(true);
    expect(result.scope).toBe("runner-local");
    expect(result.contract.singlePathProviderOfflineIsGlobalBlocker).toBe(false);
  });

  test("multiple independent critical failures can global block", () => {
    const result = buildProviderTriageResult("D601", [
      signal("backend-core-node", "provider-gateway", "failed"),
      signal("host-ssh-probe", "ssh", "failed"),
      signal("code-queue-health", "scheduler", "failed"),
    ], "2026-05-20T00:00:00.000Z");

    expect(result.blockingDisposition).toBe("global-blocker");
    expect(result.retryable).toBe(false);
    expect(result.failedIndependentScopes).toContain("provider-gateway");
    expect(result.failedIndependentScopes).toContain("ssh");
  });

  test("registry failure with healthy heartbeat and ssh is degraded, not global blocker", () => {
    const result = buildProviderTriageResult("D601", [
      signal("backend-core-node", "provider-gateway", "ok"),
      signal("host-ssh-probe", "ssh", "ok"),
      signal("artifact-registry-health", "registry", "failed"),
      signal("k3sctl-adapter-health", "k3s", "ok"),
    ], "2026-05-20T00:00:00.000Z");

    expect(result.blockingDisposition).toBe("service-degraded");
    expect(result.retryable).toBe(true);
    expect(result.healthyIndependentScopes).toContain("provider-gateway");
    expect(result.healthyIndependentScopes).toContain("ssh");
  });
});
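The contract tests above pin down which `blockingDisposition` each signal mix yields. How a caller might act on that disposition is sketched below. This is an illustration only: `WorkerAction` and `decideWorkerAction` are hypothetical names, and the mapping is one reading of the supervision guidance (retry first, cross-check local observations, file a follow-up for degraded services, escalate only global blockers). The disposition union is copied from the diff so that the block stays self-contained.

```typescript
// Copied from provider-triage.ts in this commit so the sketch is self-contained.
type ProviderBlockingDisposition =
  | "transient"
  | "runner-local-observation-gap"
  | "provider-degraded"
  | "service-degraded"
  | "global-blocker";

// Hypothetical action vocabulary; not defined anywhere in the commit.
type WorkerAction = "retry" | "cross-check-then-retry" | "file-follow-up" | "escalate";

function decideWorkerAction(disposition: ProviderBlockingDisposition): WorkerAction {
  switch (disposition) {
    case "transient":
      return "retry";
    case "runner-local-observation-gap":
      // A single local observation is not evidence of an outage: verify through other paths first.
      return "cross-check-then-retry";
    case "service-degraded":
    case "provider-degraded":
      // Keep working where possible and split out an infrastructure follow-up.
      return "file-follow-up";
    case "global-blocker":
      return "escalate";
  }
}

const actions = (["transient", "runner-local-observation-gap", "service-degraded", "global-blocker"] as const)
  .map((disposition) => `${disposition} -> ${decideWorkerAction(disposition)}`);
console.log(actions.join("\n"));
```

Because the `switch` covers every member of the union, adding a new disposition later would surface as a compile-time error in strict mode rather than a silent fallthrough.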
+456
@@ -0,0 +1,456 @@
import { type UniDeskConfig } from "./config";
import { coreInternalFetch } from "./microservices";
import { debugDispatch, debugHealth } from "./debug";
import { runArtifactRegistryCommand } from "./artifact-registry";
import { runCodeQueueCommand } from "./code-queue";

export type ProviderSignalScope =
  | "runner-local"
  | "provider-gateway"
  | "ssh"
  | "registry"
  | "k3s"
  | "scheduler"
  | "service-proxy"
  | "microservice"
  | "unknown";

export type ProviderSignalStatus = "ok" | "degraded" | "failed" | "unknown";

export type ProviderBlockingDisposition =
  | "transient"
  | "runner-local-observation-gap"
  | "provider-degraded"
  | "service-degraded"
  | "global-blocker";

export interface ProviderTriageSignal {
  id: string;
  scope: ProviderSignalScope;
  status: ProviderSignalStatus;
  independentPath: boolean;
  observedAt: string;
  summary: string;
  evidence?: unknown;
}

export interface ProviderTriageClassification {
  scope: ProviderSignalScope;
  observedAt: string;
  retryable: boolean;
  recommendedCrossChecks: string[];
  blockingDisposition: ProviderBlockingDisposition;
  rationale: string[];
  failedIndependentScopes: ProviderSignalScope[];
  healthyIndependentScopes: ProviderSignalScope[];
}

export interface ProviderTriageResult extends ProviderTriageClassification {
  ok: boolean;
  providerId: string;
  signals: ProviderTriageSignal[];
  contract: {
    singlePathProviderOfflineIsGlobalBlocker: false;
    globalBlockerRequiresIndependentCriticalFailures: true;
  };
}

type JsonRecord = Record<string, unknown>;

// Scopes whose simultaneous failure can justify a global blocker.
const criticalScopes = new Set<ProviderSignalScope>(["provider-gateway", "ssh", "scheduler", "k3s"]);
const commandPrefix = "bun scripts/cli.ts";

function asRecord(value: unknown): JsonRecord | null {
  return typeof value === "object" && value !== null && !Array.isArray(value) ? value as JsonRecord : null;
}

function asArray(value: unknown): unknown[] {
  return Array.isArray(value) ? value : [];
}

function text(value: unknown): string {
  return typeof value === "string" ? value : "";
}

function bool(value: unknown): boolean {
  return value === true;
}

function lower(value: unknown): string {
  return String(value ?? "").toLowerCase();
}

function isoNow(): string {
  return new Date().toISOString();
}

function signal(
  id: string,
  scope: ProviderSignalScope,
  status: ProviderSignalStatus,
  summary: string,
  evidence?: unknown,
  independentPath = true,
): ProviderTriageSignal {
  return { id, scope, status, independentPath, observedAt: isoNow(), summary, evidence };
}

function isOkEnvelope(value: unknown): boolean {
  const record = asRecord(value);
  if (record === null) return false;
  return record.ok === true;
}

function bodyOf(value: unknown): JsonRecord | null {
  return asRecord(asRecord(value)?.body);
}

function findByProvider(items: unknown, providerId: string): JsonRecord | null {
  return asArray(items)
    .map(asRecord)
    .find((item): item is JsonRecord => item !== null && item.providerId === providerId) ?? null;
}

function providerGatewaySignal(debug: unknown, providerId: string): ProviderTriageSignal {
  const nodes = asArray(bodyOf(asRecord(debug)?.nodesInternal)?.nodes);
  const node = findByProvider(nodes, providerId);
  if (node === null) {
    return signal("backend-core-node", "provider-gateway", "unknown", `backend-core node view has no provider ${providerId}`, {
      nodesInternal: asRecord(debug)?.nodesInternal,
    });
  }
  const labels = asRecord(node.labels) ?? {};
  const capabilities = asArray(labels.unideskCapabilities).map((item) => String(item));
  const online = node.status === "online";
  const hasHeartbeat = typeof node.lastHeartbeat === "string" && node.lastHeartbeat.length > 0;
  // online + heartbeat => ok; online without heartbeat => degraded; anything else => failed.
  const status: ProviderSignalStatus = online && hasHeartbeat ? "ok" : online ? "degraded" : "failed";
  return signal("backend-core-node", "provider-gateway", status, `backend-core node status=${node.status ?? "unknown"} lastHeartbeat=${node.lastHeartbeat ?? "null"}`, {
    providerId: node.providerId,
    name: node.name,
    status: node.status,
    connectedAt: node.connectedAt,
    lastHeartbeat: node.lastHeartbeat,
    providerGatewayVersion: labels.providerGatewayVersion ?? null,
    hostSshConfigured: labels.hostSshConfigured ?? null,
    hostSshKeyPresent: labels.hostSshKeyPresent ?? null,
    capabilities,
  });
}

function systemStatusSignal(debug: unknown, providerId: string): ProviderTriageSignal {
  const items = asArray(bodyOf(asRecord(debug)?.systemStatusInternal)?.systemStatuses);
  const item = findByProvider(items, providerId);
  if (item === null) return signal("backend-core-system-status", "provider-gateway", "unknown", `no system status sample for ${providerId}`);
  const current = asRecord(item.current);
  const currentOk = current === null ? null : current.ok;
  const status: ProviderSignalStatus = current === null ? "unknown" : currentOk === false ? "degraded" : "ok";
  return signal("backend-core-system-status", "provider-gateway", status, `system status current.ok=${String(currentOk)} updatedAt=${item.updatedAt ?? "null"}`, {
    providerId: item.providerId,
    nodeStatus: item.nodeStatus,
    updatedAt: item.updatedAt,
    current: current === null ? null : {
      ok: current.ok,
      collectedAt: current.collectedAt,
      cpu: current.cpu,
      memory: current.memory,
      disk: current.disk,
    },
    historyCount: item.historyCount ?? null,
  });
}

function sshSignal(result: unknown, providerId: string): ProviderTriageSignal {
  const record = asRecord(result);
  const waitTask = asRecord(asRecord(asRecord(record?.wait)?.task)?.result);
  const dispatchBody = bodyOf(record?.dispatch);
  const dispatchOk = isOkEnvelope(record?.dispatch) && dispatchBody?.taskId !== undefined;
  const wait = asRecord(record?.wait);
  const task = asRecord(wait?.task);
  const taskStatus = text(task?.status);
  const exitCode = waitTask === null ? null : waitTask.exitCode;
  if (taskStatus === "succeeded" && (exitCode === 0 || exitCode === null)) {
    return signal("host-ssh-probe", "ssh", "ok", "host.ssh short probe succeeded", {
      taskId: dispatchBody?.taskId ?? null,
      taskStatus,
      exitCode,
      probeLine: waitTask?.probeLine ?? null,
      stdoutPreview: text(waitTask?.stdout).slice(0, 500),
    });
  }
  // Dispatch was accepted but the wait timed out: inconclusive, so report unknown rather than failed.
  if (dispatchOk && wait?.ok === false) {
    return signal("host-ssh-probe", "ssh", "unknown", "host.ssh dispatch accepted but wait did not reach terminal state", {
      providerId,
      taskId: dispatchBody?.taskId ?? null,
      wait,
    });
  }
  return signal("host-ssh-probe", "ssh", "failed", "host.ssh short probe failed", {
    providerId,
    result,
  });
}

function registrySignal(result: unknown): ProviderTriageSignal {
  const record = asRecord(result);
  if (record === null) return signal("artifact-registry-health", "registry", "unknown", "artifact registry health returned non-object", result);
  const status: ProviderSignalStatus = record.ok === true && record.healthy !== false ? "ok" : record.ok === false ? "failed" : "degraded";
  return signal("artifact-registry-health", "registry", status, `artifact registry health ok=${String(record.ok)} healthy=${String(record.healthy)}`, {
    ok: record.ok,
    installed: record.installed ?? null,
    healthy: record.healthy ?? null,
    checks: record.checks ?? null,
    observed: record.observed ?? null,
    command: record.command ?? null,
  });
}

function microserviceHealthSignal(serviceId: string, scope: ProviderSignalScope, response: unknown): ProviderTriageSignal {
  const body = bodyOf(response);
  const record = asRecord(response);
  const status: ProviderSignalStatus = record?.ok === true && body?.ok !== false ? "ok" : record?.ok === false ? "failed" : "degraded";
  const upstreamStatus = record?.status ?? null;
  return signal(`${serviceId}-health`, scope, status, `${serviceId} health upstream ok=${String(record?.ok)} status=${String(upstreamStatus)} body.ok=${String(body?.ok)}`, {
    upstream: { ok: record?.ok ?? null, status: upstreamStatus },
    body,
    fallback: {
      exitCode: record?.exitCode ?? null,
      stderrTail: record?.stderrTail ?? null,
      stdoutTail: record?.stdoutTail ?? null,
    },
  });
}

function codeQueueSchedulerSignal(response: unknown): ProviderTriageSignal {
  const record = asRecord(response);
  if (record === null) return signal("code-queue-health", "scheduler", "unknown", "Code Queue health returned non-object", response);
  const devReady = asRecord(record.devReady);
  const status: ProviderSignalStatus = record.upstream !== undefined && devReady?.ok !== false ? "ok" : devReady?.ok === false ? "degraded" : "unknown";
  return signal("code-queue-health", "scheduler", status, `code-queue dev-ready ok=${String(devReady?.ok)} missingTools=${JSON.stringify(devReady?.missingTools ?? [])}`, {
    upstream: record.upstream ?? null,
    devReady,
    commands: record.commands ?? null,
  });
}

function codeQueueTasksSignal(response: unknown): ProviderTriageSignal {
  const body = asRecord(asRecord(response)?.supervisor);
  const diagnostics = asRecord(body?.executionDiagnostics);
  if (diagnostics === null) return signal("code-queue-task-heartbeat", "scheduler", "unknown", "Code Queue task heartbeat diagnostics unavailable", response);
  const effectiveLiveness = text(diagnostics.effectiveLiveness);
  const status: ProviderSignalStatus = effectiveLiveness === "healthy" || effectiveLiveness === "live" ? "ok" : effectiveLiveness === "at-risk" ? "degraded" : "unknown";
  return signal("code-queue-task-heartbeat", "scheduler", status, `Code Queue executionDiagnostics effectiveLiveness=${effectiveLiveness || "unknown"}`, {
    executionDiagnostics: diagnostics,
    commands: asRecord(body?.commands) ?? null,
  });
}

// First matching pattern wins, so the order of these checks is significant.
function classifyErrorMessage(message: string): ProviderSignalScope {
  const normalized = message.toLowerCase();
  if (/provider is not online|provider .*offline|provider .*not online/u.test(normalized)) return "runner-local";
  if (/ssh|host\.ssh/u.test(normalized)) return "ssh";
  if (/registry|artifact/u.test(normalized)) return "registry";
  if (/k3s|kubectl|kubernetes/u.test(normalized)) return "k3s";
  if (/scheduler|code queue|codex/u.test(normalized)) return "scheduler";
  if (/proxy|tunnel|microservice\.http/u.test(normalized)) return "service-proxy";
  if (/microservice|service health/u.test(normalized)) return "microservice";
  return "unknown";
}

// A runner-local observation is never counted as an independent path.
function observedErrorSignal(message: string, scope: ProviderSignalScope): ProviderTriageSignal {
  return signal("observed-error", scope, "failed", message, { message }, scope !== "runner-local");
}

export function providerTriageRecommendedCrossChecks(providerId: string): string[] {
  return [
    `${commandPrefix} provider triage ${providerId}`,
    `${commandPrefix} debug health`,
    `${commandPrefix} debug dispatch ${providerId} host.ssh --wait-ms 15000`,
    `${commandPrefix} ssh ${providerId} argv true`,
    `${commandPrefix} artifact-registry health --provider-id ${providerId}`,
    `${commandPrefix} microservice health k3sctl-adapter`,
    `${commandPrefix} microservice health code-queue`,
    `${commandPrefix} codex tasks --view supervisor --limit 20`,
  ];
}

// Distinct, sorted scopes of independent-path signals whose status is in `statuses`.
function uniqueScopes(signals: ProviderTriageSignal[], statuses: ProviderSignalStatus[]): ProviderSignalScope[] {
  return Array.from(new Set(signals
    .filter((item) => item.independentPath)
    .filter((item) => statuses.includes(item.status))
    .map((item) => item.scope)))
    .sort();
}

function primaryScope(signals: ProviderTriageSignal[]): ProviderSignalScope {
  const failed = uniqueScopes(signals, ["failed"]);
  // Prefer a failed critical scope; otherwise fall back to the first failed scope.
  if (failed.length > 0) return failed.find((scope) => criticalScopes.has(scope)) ?? failed[0] ?? "unknown";
  const degraded = uniqueScopes(signals, ["degraded"]);
  return degraded[0] ?? "unknown";
}

export function classifyProviderTriage(providerId: string, signals: ProviderTriageSignal[], observedAt = isoNow()): ProviderTriageClassification {
  const failedScopes = uniqueScopes(signals, ["failed"]);
  const degradedScopes = uniqueScopes(signals, ["degraded"]);
  const healthyScopes = uniqueScopes(signals, ["ok"]);
  // runner-local failures never count as independent evidence, even when flagged independentPath.
  const independentFailedScopes = failedScopes.filter((scope) => scope !== "runner-local");
  const failedCriticalScopes = independentFailedScopes.filter((scope) => criticalScopes.has(scope));
  const runnerLocalObservedFailure = signals.some((item) => item.scope === "runner-local" && item.status === "failed");
  const serviceOnlyFailure = independentFailedScopes.length > 0 && independentFailedScopes.every((scope) => scope === "registry" || scope === "service-proxy" || scope === "microservice" || scope === "k3s");
  const hasIndependentHealthy = healthyScopes.length > 0;
  const rationale: string[] = [];
  let blockingDisposition: ProviderBlockingDisposition;

  if (runnerLocalObservedFailure && independentFailedScopes.length === 0) {
    blockingDisposition = "runner-local-observation-gap";
    rationale.push("single runner-local provider offline observation is not sufficient evidence for global D601 outage");
  } else if (failedCriticalScopes.length >= 2 && healthyScopes.length === 0) {
    blockingDisposition = "global-blocker";
    rationale.push("multiple independent critical provider paths failed and no independent healthy path was observed");
  } else if (serviceOnlyFailure && hasIndependentHealthy) {
    blockingDisposition = "service-degraded";
    rationale.push("service-scoped path failed while at least one provider-level path remains healthy");
  } else if (failedCriticalScopes.length > 0 || degradedScopes.some((scope) => criticalScopes.has(scope))) {
    blockingDisposition = hasIndependentHealthy ? "provider-degraded" : "transient";
    rationale.push(hasIndependentHealthy
      ? "provider-critical path is degraded but cross-checks still show independent healthy evidence"
      : "critical path issue lacks enough independent failed evidence for global blocker");
  } else if (failedScopes.length > 0 || degradedScopes.length > 0) {
    blockingDisposition = "service-degraded";
    rationale.push("only non-provider-global service paths are failed or degraded");
  } else {
    blockingDisposition = "transient";
    rationale.push("no failed independent path was observed");
  }

  if (runnerLocalObservedFailure) rationale.push("runner-local observation failed but is not counted as an independent global blocker by contract");
  if (hasIndependentHealthy) rationale.push(`healthy independent scopes: ${healthyScopes.join(", ")}`);
  // Report the same list that is returned as failedIndependentScopes, so runner-local never appears here.
  if (independentFailedScopes.length > 0) rationale.push(`failed independent scopes: ${independentFailedScopes.join(", ")}`);

  return {
    scope: runnerLocalObservedFailure && failedScopes.length === 0 && degradedScopes.length === 0 ? "runner-local" : primaryScope(signals),
    observedAt,
    retryable: blockingDisposition !== "global-blocker",
    recommendedCrossChecks: providerTriageRecommendedCrossChecks(providerId),
    blockingDisposition,
    rationale,
    failedIndependentScopes: independentFailedScopes,
    healthyIndependentScopes: healthyScopes,
  };
}

export function buildProviderTriageResult(providerId: string, signals: ProviderTriageSignal[], observedAt = isoNow()): ProviderTriageResult {
  const classification = classifyProviderTriage(providerId, signals, observedAt);
  return {
    ok: classification.blockingDisposition !== "global-blocker",
    providerId,
    ...classification,
    signals,
    contract: {
      singlePathProviderOfflineIsGlobalBlocker: false,
      globalBlockerRequiresIndependentCriticalFailures: true,
    },
  };
}

// Collect service ids from repeatable --microservice/--service flags and comma-separated --microservices.
function parseServiceList(args: string[]): string[] {
  const services: string[] = [];
  for (let index = 0; index < args.length; index += 1) {
    const arg = args[index] ?? "";
    if (arg === "--microservice" || arg === "--service") {
      const value = args[index + 1];
      if (value === undefined || value.length === 0) throw new Error(`${arg} requires a service id`);
      services.push(value);
      index += 1;
    }
    if (arg === "--microservices") {
      const value = args[index + 1];
      if (value === undefined || value.length === 0) throw new Error(`${arg} requires a comma-separated service list`);
      services.push(...value.split(",").map((item) => item.trim()).filter(Boolean));
      index += 1;
    }
  }
  return Array.from(new Set(services));
}

function optionValue(args: string[], name: string): string | undefined {
  const index = args.indexOf(name);
  if (index === -1) return undefined;
  const raw = args[index + 1];
  if (raw === undefined || raw.length === 0) throw new Error(`${name} requires a non-empty value`);
  return raw;
}

// Reject unknown flags and flags without a value before any probe runs.
function assertKnownOptions(args: string[]): void {
  const valueOptions = new Set(["--observed-error", "--observed-scope", "--microservice", "--service", "--microservices"]);
  for (let index = 0; index < args.length; index += 1) {
    const arg = args[index] ?? "";
    if (!arg.startsWith("--")) continue;
    if (!valueOptions.has(arg)) throw new Error(`unsupported provider triage option: ${arg}`);
    const value = args[index + 1];
    if (value === undefined || value.startsWith("--")) throw new Error(`${arg} requires a value`);
    index += 1;
  }
}

export async function runProviderTriage(config: UniDeskConfig, providerId: string, args: string[] = []): Promise<ProviderTriageResult> {
  if (!/^[A-Za-z0-9_.-]{1,64}$/u.test(providerId)) throw new Error("provider triage requires a safe provider id such as D601");
  assertKnownOptions(args);
  const observedAt = isoNow();
  const signals: ProviderTriageSignal[] = [];
  const observedError = optionValue(args, "--observed-error");
  const observedScope = optionValue(args, "--observed-scope") as ProviderSignalScope | undefined;
  if (observedError !== undefined) signals.push(observedErrorSignal(observedError, observedScope ?? classifyErrorMessage(observedError)));

  const debug = await debugHealth(config);
  signals.push(providerGatewaySignal(debug, providerId));
  signals.push(systemStatusSignal(debug, providerId));

  // Each probe below is isolated so that one failing path cannot abort the whole triage.
  try {
    signals.push(sshSignal(await debugDispatch(config, providerId, "host.ssh", { source: "provider-triage", mode: "probe", timeoutMs: 8000 }, 15_000), providerId));
  } catch (error) {
    signals.push(signal("host-ssh-probe", "ssh", "failed", error instanceof Error ? error.message : String(error), { error: String(error) }));
  }

  try {
    signals.push(registrySignal(await runArtifactRegistryCommand(["health", "--provider-id", providerId])));
  } catch (error) {
    signals.push(signal("artifact-registry-health", "registry", "failed", error instanceof Error ? error.message : String(error), { error: String(error) }));
  }

  // `await` is harmless if coreInternalFetch is synchronous and required if it returns a Promise;
  // without it a rejected Promise would bypass the surrounding try/catch.
  try {
    signals.push(microserviceHealthSignal("k3sctl-adapter", "k3s", await coreInternalFetch("/api/microservices/k3sctl-adapter/health")));
  } catch (error) {
    signals.push(signal("k3sctl-adapter-health", "k3s", "failed", error instanceof Error ? error.message : String(error), { error: String(error) }));
  }

  try {
    signals.push(microserviceHealthSignal("code-queue", "scheduler", await coreInternalFetch("/api/microservices/code-queue/health")));
  } catch (error) {
    signals.push(signal("code-queue-microservice-health", "scheduler", "failed", error instanceof Error ? error.message : String(error), { error: String(error) }));
  }

  try {
    signals.push(codeQueueSchedulerSignal(await runCodeQueueCommand(config, ["dev-ready"])));
  } catch (error) {
    signals.push(signal("code-queue-health", "scheduler", "unknown", error instanceof Error ? error.message : String(error), { error: String(error) }));
  }

  try {
    signals.push(codeQueueTasksSignal(await runCodeQueueCommand(config, ["tasks", "--view", "supervisor", "--limit", "20"])));
  } catch (error) {
    signals.push(signal("code-queue-task-heartbeat", "scheduler", "unknown", error instanceof Error ? error.message : String(error), { error: String(error) }));
  }

  for (const serviceId of parseServiceList(args)) {
    try {
      signals.push(microserviceHealthSignal(serviceId, "microservice", await coreInternalFetch(`/api/microservices/${encodeURIComponent(serviceId)}/health`)));
    } catch (error) {
      signals.push(signal(`${serviceId}-health`, "microservice", "failed", error instanceof Error ? error.message : String(error), { error: String(error) }));
    }
  }

  return buildProviderTriageResult(providerId, signals, observedAt);
}
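When `--observed-scope` is omitted, the observed error text is mapped to a scope by an ordered pattern ladder in which the first match wins. The sketch below restates that ladder as a data table to make the ordering visible. The regular expressions are copied from `classifyErrorMessage` in this commit, while `scopePatterns`, `classifyObservedError`, and the sample messages are illustrative names and inputs, not part of the CLI.

```typescript
type Scope =
  | "runner-local" | "ssh" | "registry" | "k3s"
  | "scheduler" | "service-proxy" | "microservice" | "unknown";

// Ordered ladder: earlier entries take precedence over later ones.
const scopePatterns: Array<[RegExp, Scope]> = [
  [/provider is not online|provider .*offline|provider .*not online/u, "runner-local"],
  [/ssh|host\.ssh/u, "ssh"],
  [/registry|artifact/u, "registry"],
  [/k3s|kubectl|kubernetes/u, "k3s"],
  [/scheduler|code queue|codex/u, "scheduler"],
  [/proxy|tunnel|microservice\.http/u, "service-proxy"],
  [/microservice|service health/u, "microservice"],
];

function classifyObservedError(message: string): Scope {
  const normalized = message.toLowerCase();
  for (const [pattern, scope] of scopePatterns) {
    if (pattern.test(normalized)) return scope;
  }
  return "unknown";
}

// Illustrative messages; real error strings may differ.
const samples = [
  "Provider is not online",
  "host.ssh dispatch timed out",
  "artifact registry proxy timeout",
  "kubectl get pods timed out",
  "disk quota exceeded",
];
for (const sample of samples) {
  console.log(`${sample} => ${classifyObservedError(sample)}`);
}
```

Note that a message mentioning both a registry and a proxy is classified as `registry`, because that pattern comes first; passing `--observed-scope` explicitly sidesteps the ambiguity.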