Codex b6637d2b71 docs: record hwlab opencode otel triage

2026-06-30 08:20:03 +00:00

name: unidesk-otel
description: Operations skill for UniDesk OpenTelemetry/Tempo distributed tracing. Use it when the user mentions OTel, OpenTelemetry, Tempo, the trace backend, platform-infra observability, distributed tracing, looking up spans by traceId, provider-stream-disconnected, or cross-service tracing across Code Agent/AgentRun/HWLAB, or asks to "check with otel" or "improve otel".

UniDesk OTel

Skill(cli-spec)

UniDesk's OTel runtime lives in the platform-infra namespace: the OTel Collector receives OTLP traces and Tempo serves queries. All operations go through the UniDesk YAML-first CLI; do not use kubectl port-forward, hand-written Tempo API calls, or bare curl directly.

Fixed entry point: cd /root/unidesk && bun scripts/cli.ts platform-infra observability ...

Basic Status

bun scripts/cli.ts platform-infra observability status --target D601
bun scripts/cli.ts platform-infra observability validate --target D601

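status checks the platform-infra namespace, the otel-collector and tempo Deployment/Service/Pod, and their readiness probes.
validate generates a test trace, writes it to Tempo through the Collector, then queries it back through the controlled service proxy, proving that the ingest-and-query loop works end to end.
--full is only for expanding remote stdout/stderr or the complete status payload; default output must stay low-noise.

Querying a Trace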

bun scripts/cli.ts platform-infra observability trace \
  --target D601 \
  --trace-id <otelTraceId>

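--trace-id is the 32-character hex OpenTelemetry trace id, not the business traceId. Default output returns only a bounded summary:

spanCount, serviceCount, services
businessTraceIds
errorSpanCount
spanNameCounts
errorSpans
deduplicated key spans
next-step drill-down commands

Default output must not expand the full Tempo JSON; use --raw only when the raw response is needed.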
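Because --trace-id accepts only the 32-character hex OTel trace id, it can help to check what kind of id you are holding before running the command. The following is a minimal sketch, not the CLI's own validation: classifyTraceId is a hypothetical helper, and the trc_ prefix check assumes the business traceId shape this document describes for HWLAB/Code Agent.

```typescript
// Hypothetical helper: tells an OTel trace id apart from a business traceId.
type TraceIdKind = "otel" | "business" | "unknown";

function classifyTraceId(value: string): TraceIdKind {
  const id = value.trim();
  // An OTel trace id is 16 bytes rendered as 32 hex characters; all zeros is invalid.
  if (/^[0-9a-f]{32}$/i.test(id) && !/^0{32}$/.test(id)) return "otel";
  // Assumption: business traceIds carry the "trc_" prefix.
  if (id.startsWith("trc_")) return "business";
  return "unknown";
}

console.log(classifyTraceId("4bf92f3577b34da6a3ce929d0e0e4736")); // otel
console.log(classifyTraceId("trc_example")); // business
```

A "business" result means the id must first be mapped to an OTel trace id before trace --trace-id can be used.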

When you only have an error message and no OTel trace id, first use search to look up candidates among recent Tempo traces, then move on to trace:

bun scripts/cli.ts platform-infra observability search \
  --target D601 \
  --grep 'no rollout found' \
  --lookback-minutes 360 \
  --candidate-limit 80 \
  --limit 20

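search calls Tempo /api/search through the controlled service proxy to fetch candidate traces, then pulls each trace and builds a local grep summary; by default it outputs only matching traces, services, business traceIds, error spans, and next-step commands. Widening the time window or the candidate count requires passing --lookback-minutes / --candidate-limit explicitly, so that large trace output does not flood the context.

Noise Suppression

When locating by error message, span name, failureKind, or key attributes, prefer --grep: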

bun scripts/cli.ts platform-infra observability trace \
  --target D601 \
  --trace-id <otelTraceId> \
  --grep provider-stream-disconnected \
  --limit 20

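--grep <text> filters over the summary JSON of span names, status messages, and key attributes.
--limit <N> caps the number of spans returned so that a large trace does not flood the context.
--full expands the complete span summary but still does not output the Tempo raw body.
--raw is only for troubleshooting the Tempo response structure, the CLI parser, or the backend response itself.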
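The combined effect of --grep and --limit can be pictured as a bounded text filter over span summaries. The sketch below is illustrative only: SpanSummary, grepSpans, the sample span names, and the case-insensitive matching are assumptions, not the CLI's actual implementation.

```typescript
// Hypothetical span summary shape; the real CLI summary may differ.
interface SpanSummary {
  name: string;
  statusMessage?: string;
  attributes: Record<string, string | number | boolean>;
}

// Return spans whose name, status message, or attributes contain `text`,
// stopping once `limit` matches have been collected.
function grepSpans(spans: SpanSummary[], text: string, limit: number): SpanSummary[] {
  const needle = text.toLowerCase(); // assumption: case-insensitive match
  const matched: SpanSummary[] = [];
  for (const span of spans) {
    if (matched.length >= limit) break;
    const haystack = [span.name, span.statusMessage ?? "", JSON.stringify(span.attributes)]
      .join("\n")
      .toLowerCase();
    if (haystack.includes(needle)) matched.push(span);
  }
  return matched;
}

// Sample data with made-up span names.
const sampleSpans: SpanSummary[] = [
  { name: "runner.turn", attributes: { failureKind: "provider-stream-disconnected", willRetry: true } },
  { name: "manager.dispatch", attributes: { runId: "run_1" } },
  { name: "runner.turn", statusMessage: "provider-stream-disconnected", attributes: { willRetry: false } },
];
console.log(grepSpans(sampleSpans, "provider-stream-disconnected", 20).length); // 2
console.log(grepSpans(sampleSpans, "runner", 1).length); // 1
```

Serializing the attributes keeps attribute keys searchable as well as values, which matches how this document greps for field names such as runnerJobId.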

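Business Trace Mapping

HWLAB/Code Agent business traceIds usually look like trc_.... When the OTel trace id is known, query directly with trace --trace-id; when only the business traceId is known, first obtain the corresponding OTel trace id from the HWLAB trace/result, the Code Agent result, or recorded issue evidence. Do not print Secrets, Authorization headers, full DSNs, or raw runtime transcripts just to find an OTel trace id.

Common business correlation attributes inside an OTel trace:

traceId: HWLAB business traceId
otel.trace_id: OTel trace id
runId / commandId: AgentRun run/command
sessionId / turnId / threadId: HWLAB/AgentRun session and turn correlation
failureKind / willRetry / terminalStatus: error and terminal-state determination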
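The correlation attributes listed above can be rolled up into a small per-trace identity summary. This is a sketch under assumptions: SpanAttributes, distinctValues, and collectCorrelation are hypothetical names, and the attribute values are assumed to be already decoded into plain strings.

```typescript
// Attribute bag of one span; values are assumed to be already decoded from OTLP.
type SpanAttributes = Record<string, string | number | boolean | undefined>;

interface Correlation {
  businessTraceIds: string[];
  runIds: string[];
  commandIds: string[];
  sessionIds: string[];
}

// Collect the distinct, sorted string values of one attribute key across all spans.
function distinctValues(spans: SpanAttributes[], key: string): string[] {
  const seen = new Set<string>();
  for (const attrs of spans) {
    const value = attrs[key];
    if (typeof value === "string" && value.length > 0) seen.add(value);
  }
  return Array.from(seen).sort();
}

function collectCorrelation(spans: SpanAttributes[]): Correlation {
  return {
    businessTraceIds: distinctValues(spans, "traceId"),
    runIds: distinctValues(spans, "runId"),
    commandIds: distinctValues(spans, "commandId"),
    sessionIds: distinctValues(spans, "sessionId"),
  };
}

const correlation = collectCorrelation([
  { traceId: "trc_a", runId: "run_1", commandId: "cmd_1" },
  { traceId: "trc_a", runId: "run_1", commandId: "cmd_2", sessionId: "ses_1" },
]);
console.log(JSON.stringify(correlation));
```

More than one business traceId or runId in the same OTel trace is worth a second look before drawing conclusions from it.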

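HWLAB OpenCode /global/event Troubleshooting

When an OpenCode conversation shows Thinking for a long time, no assistant text appears after the user sends a message, or you suspect the iframe/live state is not consuming events, first establish the state of each of three layers separately: the provider, the Cloud Web proxy, and the browser event stream. The provider span may already be 200 while the UI still fails to converge because the /global/event directory, ticket, or live state is inconsistent.

Prefer a TraceQL query with an explicit service to query the provider, so that a plain grep does not miss new span names or attributes: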

bun scripts/cli.ts platform-infra observability search \
  --target <node> \
  --query '{ resource.service.name = "opencode-provider-proxy" }' \
  --lookback-minutes 30 \
  --candidate-limit 100 \
  --limit 20

On the provider side, check at least the following for /v1/chat/completions: HTTP status, duration, opencode.provider.sse.content_chunks, opencode.provider.sse.content_chars, opencode.provider.sse.reasoning_only_choices_dropped, output data lines, done lines, and JSON error count. Only when all of these look normal should focus shift to the Cloud Web proxy or the UI.
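The provider-side checklist can be written down as a small gate that must come back empty before attention moves to the proxy or UI. This is a sketch only: ProviderSpanFacts and providerConcerns are hypothetical, the field names for output data lines, done lines, and JSON errors are placeholders because their attribute keys are not given here, and the thresholds are assumptions. Duration is left to human judgment.

```typescript
// Facts read off one /v1/chat/completions provider span (hypothetical shape).
interface ProviderSpanFacts {
  httpStatus: number;
  contentChunks: number; // opencode.provider.sse.content_chunks
  contentChars: number; // opencode.provider.sse.content_chars
  reasoningOnlyChoicesDropped: number; // opencode.provider.sse.reasoning_only_choices_dropped
  outputDataLines: number; // placeholder name
  doneLines: number; // placeholder name
  jsonErrors: number; // placeholder name
}

// Return human-readable concerns; an empty list means the provider looks normal.
function providerConcerns(f: ProviderSpanFacts): string[] {
  const concerns: string[] = [];
  if (f.httpStatus !== 200) concerns.push(`http status ${f.httpStatus}`);
  if (f.contentChunks === 0 || f.contentChars === 0) {
    // Assumption: dropped reasoning-only choices may explain an empty stream.
    concerns.push(`no content streamed (reasoning-only dropped: ${f.reasoningOnlyChoicesDropped})`);
  }
  if (f.outputDataLines === 0) concerns.push("no output data lines");
  if (f.doneLines === 0) concerns.push("no done line");
  if (f.jsonErrors > 0) concerns.push(`${f.jsonErrors} JSON error(s)`);
  return concerns;
}

const healthyProvider: ProviderSpanFacts = {
  httpStatus: 200, contentChunks: 12, contentChars: 340,
  reasoningOnlyChoicesDropped: 0, outputDataLines: 14, doneLines: 1, jsonErrors: 0,
};
console.log(providerConcerns(healthyProvider).length); // 0
```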

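On the Cloud Web side, look for the start span of the /global/event long-lived connection. The connection may not close within the investigation window, so you must rely on stream-start visibility rather than waiting only for a completion span: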

bun scripts/cli.ts platform-infra observability search \
  --target <node> \
  --query '{ resource.service.name = "hwlab-cloud-web" && span.http.route = "/global/event" }' \
  --lookback-minutes 30 \
  --candidate-limit 300 \
  --limit 20

Expect to see opencode.proxy.stream.start, and check opencode.proxy.sse.directory_rewrite_enabled, opencode.proxy.sse.directory_rewrite_from, opencode.proxy.sse.directory_rewrite_to, opencode.proxy.ticket_accepted, span.http.route=/global/event, and the streaming marker. from=/workspace with to=/ means the HWLAB iframe/public route and the OpenCode server workspace route are aligned; if that rewrite is absent or the ticket was not accepted, a healthy provider does not imply a healthy UI.
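The rewrite and ticket checks on the opencode.proxy.stream.start span can be expressed as a short predicate. A minimal sketch, assuming the attributes have been read into a flat map and that boolean attributes may arrive either as booleans or as "true"/"false" strings; ProxyAttributes, isTrue, and proxyStreamConcerns are hypothetical names.

```typescript
type ProxyAttributes = Record<string, string | boolean | undefined>;

// Assumption: boolean attributes may arrive as booleans or as "true"/"false" strings.
function isTrue(value: string | boolean | undefined): boolean {
  return value === true || value === "true";
}

// Return concerns about the /global/event stream start span; empty means aligned.
function proxyStreamConcerns(attrs: ProxyAttributes): string[] {
  const concerns: string[] = [];
  if (!isTrue(attrs["opencode.proxy.sse.directory_rewrite_enabled"])) {
    concerns.push("directory rewrite is not enabled");
  }
  const from = attrs["opencode.proxy.sse.directory_rewrite_from"];
  const to = attrs["opencode.proxy.sse.directory_rewrite_to"];
  if (from !== "/workspace" || to !== "/") {
    concerns.push("directory rewrite is not /workspace -> /");
  }
  if (!isTrue(attrs["opencode.proxy.ticket_accepted"])) {
    concerns.push("ticket was not accepted");
  }
  return concerns;
}

const alignedStream: ProxyAttributes = {
  "opencode.proxy.sse.directory_rewrite_enabled": true,
  "opencode.proxy.sse.directory_rewrite_from": "/workspace",
  "opencode.proxy.sse.directory_rewrite_to": "/",
  "opencode.proxy.ticket_accepted": "true",
};
console.log(proxyStreamConcerns(alignedStream).length); // 0
```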

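When closing an issue of the "OpenCode UI stuck" kind, OTel can only prove the state of the trace path; the final step is web-probe DOM/event evidence confirming that the browser shows assistant text, no Thinking indicator remains, and the EventSource received events such as message.part.updated, step-finish, and session.idle. If --grep does not find a new attribute or span name, switch to the TraceQL query above, then use trace --raw only on a small set of traces for bounded attribute drill-down.

Code Agent / AgentRun Troubleshooting

When chasing the "Code Agent 代理暂时无法连接上游" error (Code Agent proxy temporarily unable to reach upstream), provider-stream-disconnected, Workbench loading/spinning, turn idle errors, or AgentRun command terminal status:

Prefer a single diagnostic command that consolidates the business trace, the OTel trace, service trace-through, AgentRun terminal state, the HWLAB read model, and HTTP 403/401/5xx root causes: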

bun scripts/cli.ts platform-infra observability diagnose-code-agent \
  --target D601 \
  --business-trace-id <trc_...>

Default output must stay bounded and low-noise. Focus on:

mapping.businessTraceId / mapping.otelTraceId
whether servicePath reaches hwlab-cloud-api, agentrun-manager, and agentrun-runner
runId, commandId, sessionId, runnerJobId, runnerId, backendProfile, and sourceCommit in identity
agentrun.terminalStatus, terminalEventType, runnerProviderClassification
hwlabReadModel.sourceEventCount, requestedSinceSeq, turnStatusCounts
http.problemCounts and projectionLag.status
summary.rootCause and rootCauseCandidates sorted by confidence

Use --full only when span details need expanding; use --raw only when troubleshooting the Tempo raw response or the CLI parser. Default output must not contain Secrets, Authorization headers, DSNs, copyable credentials, or full run transcripts.

If there is no business traceId, or the diagnosis still needs drill-down, fall back to the lower-level trace/search:

First confirm the OTel backend is ready: bun scripts/cli.ts platform-infra observability status --target D601
Look up the OTel trace for the business trace: bun scripts/cli.ts platform-infra observability trace --target D601 --trace-id <otelTraceId>
Filter by error keyword: bun scripts/cli.ts platform-infra observability trace --target D601 --trace-id <otelTraceId> --grep <failureKind-or-message> --limit 20
Compare errorSpanCount, matchedSpanCount, terminalStatus, willRetry, runId, and commandId to decide whether this is a terminal failure, a retryable transient, or an old trace missing instrumentation.
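The final comparison step can be sketched as a rough decision function. This is a heuristic under stated assumptions, not the CLI's logic: TraceEvidence, Verdict, and classifyEvidence are hypothetical, terminalStatus is assumed to be read from the matched error span, and a trace with no error span is reported as "no-error-span-captured" because these fields alone cannot tell a healthy run from an old trace that lacks instrumentation.

```typescript
interface TraceEvidence {
  errorSpanCount: number;
  matchedSpanCount: number;
  terminalStatus?: string; // assumed to come from the matched error span
  willRetry?: boolean;
}

type Verdict =
  | "retryable-transient"
  | "terminal-failure"
  | "no-error-span-captured"
  | "undetermined";

function classifyEvidence(e: TraceEvidence): Verdict {
  // Nothing matched: either no error happened or the event was never captured.
  if (e.errorSpanCount === 0 && e.matchedSpanCount === 0) return "no-error-span-captured";
  if (e.willRetry === true) return "retryable-transient";
  if (e.willRetry === false && e.terminalStatus !== undefined) return "terminal-failure";
  return "undetermined";
}

console.log(classifyEvidence({ errorSpanCount: 1, matchedSpanCount: 1, willRetry: true }));
// retryable-transient
```

An "undetermined" or "no-error-span-captured" result is a prompt to check a newer canary or real trace rather than a conclusion in itself.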

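Old traces are not backfilled when instrumentation is fixed later. If an old trace has no error span but a new canary or real trace shows the same kind of runner_error.* span, record the conclusion for the old trace as "the event was not captured at the time"; do not infer that no error occurred in the runtime.

AgentRun codex-stdio Trace-Through Check

When chasing HWLAB Workbench turn idle, waitingFor=code-agent, no terminal after a tool call, provider-stream-disconnected, or a user suspicion that AgentRun/codex-stdio is still running but OTel did not trace it, hwlab-cloud-api, agentrun-manager, and agentrun-runner must all appear in the same OTel trace. Seeing only the manager dispatch or the HWLAB business trace does not count as tracing through to the runner.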
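The trace-through requirement reduces to a set check on the services list from the trace summary. A minimal sketch; REQUIRED_SERVICES and missingServices are illustrative names, not part of the CLI.

```typescript
// The three services that must all appear in the same OTel trace.
const REQUIRED_SERVICES: string[] = ["hwlab-cloud-api", "agentrun-manager", "agentrun-runner"];

// Return the required services that the trace did not reach, in a fixed order.
function missingServices(services: string[]): string[] {
  const present = new Set<string>(services);
  return REQUIRED_SERVICES.filter((name) => !present.has(name));
}

console.log(missingServices(["hwlab-cloud-api", "agentrun-manager"]).join(","));
// agentrun-runner
```

An empty result means the trace reached the runner; any other result names the hop where trace-through stops.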

Prefer obtaining the OTel trace id through the business entry point, then filter by codex spans:

bun scripts/cli.ts platform-infra observability trace \
  --target D601 \
  --trace-id <otelTraceId> \
  --grep codex_stdio \
  --limit 120 \
  --full

Passing evidence should show codex_app_server.starting, codex_app_server.started, codex_app_server.exit when process exit is required, codex_stdio.thread_start.* or thread_resume.*, codex_stdio.turn_start.*, codex_stdio.tool_call.started|completed|failed, codex_stdio.turn_completed, and the problem-relevant idle_warning, idle_timeout, provider_stream_disconnected, or missing_terminal_after_tool. These spans should carry runId, commandId, runnerJobId, runnerId, sessionId, backendProfile, sourceCommit, traceId, otel.trace_id, and valuesPrinted=false. If runnerJobId does not appear in the default summary or grep, first rerun platform-infra observability trace --grep runnerJobId --full with the current UniDesk CLI to recheck the summarizer output, and use --raw only if needed to inspect the Tempo/CLI parsing structure. Go back to AgentRun runner-side instrumentation only if raw or the updated summary still lacks runnerJobId, or the same trace has no agentrun-runner service.

To confirm normalization of tool-call started/completed, the canary should require exactly one read-only shell tool call. A Codex notification of item/started + status=inProgress should be recorded as codex_stdio.tool_call.started and must not be mixed with completed. For the long-term rules, see docs/reference/agentrun.md#agentrun--hwlab-otel-追踪口径.
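The normalization rule for tool-call notifications can be illustrated with a small mapping function. Only the item/started + status=inProgress case comes from this document; the item/completed method name and the "failed" status value are assumptions made for illustration, and CodexNotification and toolCallSpanName are hypothetical names.

```typescript
type ToolCallSpan =
  | "codex_stdio.tool_call.started"
  | "codex_stdio.tool_call.completed"
  | "codex_stdio.tool_call.failed";

// Hypothetical, simplified notification shape.
interface CodexNotification {
  method: string;
  status?: string;
}

function toolCallSpanName(n: CodexNotification): ToolCallSpan | undefined {
  // Rule from the text: item/started + inProgress is a start, never a completion.
  if (n.method === "item/started" && n.status === "inProgress") {
    return "codex_stdio.tool_call.started";
  }
  // Assumption: completion arrives as item/completed, with "failed" marking failure.
  if (n.method === "item/completed") {
    return n.status === "failed"
      ? "codex_stdio.tool_call.failed"
      : "codex_stdio.tool_call.completed";
  }
  return undefined;
}

console.log(toolCallSpanName({ method: "item/started", status: "inProgress" }));
// codex_stdio.tool_call.started
```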

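Trace Read Window and Out-of-Order Investigation

When investigating HWLAB/Workbench trace ordering problems, pagination gaps, --after-seq/--tail not taking effect, old traces returning only partial events, or whether the read model is complete, check the trace_events_read span first: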

bun scripts/cli.ts platform-infra observability trace \
  --target D601 \
  --trace-id <otelTraceId> \
  --grep trace_events_read \
  --limit 20 \
  --full

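In the summary, focus on returnedEvents, sinceSeq, limit, fromSeq, toSeq, totalEvents, hasMore, fullTraceLoaded, rawEventCount, maxSeq, traceLastSeq, endSeq, and commandFiltered. These fields tell you whether the query window reached the backend correctly, whether the backend returned only partial events, and whether the read model has loaded the full trace. If errorSpanCount=0 but the user-visible timeline is still out of order, record the conclusion as a display/projection/renderer investigation issue first; do not classify it as a backend error span.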
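The three questions these fields answer can be sketched as checks over one trace_events_read span. The field semantics below are assumptions inferred from the field names (for example, that sinceSeq echoes the requested window and that returnedEvents below totalEvents means a partial read); TraceEventsRead and readWindowFindings are hypothetical names.

```typescript
// Subset of the trace_events_read fields used by this sketch.
interface TraceEventsRead {
  returnedEvents: number;
  totalEvents: number;
  hasMore: boolean;
  fullTraceLoaded: boolean;
  sinceSeq?: number;
}

// Return findings about the read window; empty means the window looks complete.
function readWindowFindings(r: TraceEventsRead, requestedSinceSeq?: number): string[] {
  const findings: string[] = [];
  // Assumption: sinceSeq on the span echoes what the caller asked for.
  if (requestedSinceSeq !== undefined && r.sinceSeq !== requestedSinceSeq) {
    findings.push("query window did not reach the backend as requested");
  }
  if (r.hasMore || r.returnedEvents < r.totalEvents) {
    findings.push("backend returned only part of the events");
  }
  if (!r.fullTraceLoaded) {
    findings.push("read model has not loaded the full trace");
  }
  return findings;
}

const completeRead: TraceEventsRead = {
  returnedEvents: 40, totalEvents: 40, hasMore: false, fullTraceLoaded: true, sinceSeq: 0,
};
console.log(readWindowFindings(completeRead, 0).length); // 0
```

An empty result alongside a timeline that still looks out of order points toward display, projection, or renderer investigation, in line with the guidance above.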

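After a runtime restart or retention policy, an old business trace may retain only partial events; OTel can only prove the read window, the spans, and the fields observed at the time, and cannot restore the business event stream. To verify new instrumentation, use a new canary or a real trace that can still be read in full.

When to Improve OTel First

In the following situations, fix the OTel CLI or instrumentation first, then continue business troubleshooting:

The trace command can only return raw/tail output and cannot produce a readable span summary.
Large trace output floods the context, with no --grep, --limit, error span rollup, or next-step drill-down.
Key runner/backend/projection events exist only in the business event stream and never reach OTel.
Error spans lack locating fields such as failureKind, willRetry, terminalStatus, runId, and commandId.
The CLI outputs Secrets, Authorization headers, DSNs, or other sensitive values by default.

After the improvement, prove with one canary or real trace that the new span/summary is queryable before resuming the original business problem.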

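Delivery Boundaries

The source of truth for OTel platform configuration is config/platform-infra/observability.yaml.
The OTel CLI is implemented in scripts/src/platform-infra-observability.ts; the help entry is exposed by scripts/src/platform-infra.ts.
Changes to the OTel CLI are lightweight UniDesk CLI changes: by default do only a syntax check, command-shape verification, and a real trace query, and do not add contract tests.
Changes to AgentRun/HWLAB instrumentation are code changes in the corresponding repo/runtime and must follow the target repo's source truth, PR/CD, and original-entry acceptance rules.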
