From 79e419ce98460e8d58f478ff608c088a7a10b8fe Mon Sep 17 00:00:00 2001
From: Codex
Date: Wed, 1 Jul 2026 10:13:15 +0000
Subject: [PATCH] docs: record sentinel findings closeout lessons

---
 .agents/skills/unidesk-cicd/references/full.md    |  2 ++
 .agents/skills/unidesk-monitor/SKILL.md           |  2 ++
 .agents/skills/unidesk-monitor/references/full.md | 11 +++++++++++
 docs/reference/observability.md                   |  2 ++
 4 files changed, 17 insertions(+)

diff --git a/.agents/skills/unidesk-cicd/references/full.md b/.agents/skills/unidesk-cicd/references/full.md
index b778421a..693d5c28 100644
--- a/.agents/skills/unidesk-cicd/references/full.md
+++ b/.agents/skills/unidesk-cicd/references/full.md
@@ -109,6 +109,8 @@ bun scripts/cli.ts hwlab nodes control-plane sync --node D601 --lane v03 --confi

 `hwlab nodes control-plane trigger-current --node --lane --confirm --wait` is the one-step node/lane CI/CD entry point: it resolves the source head from YAML, runs git-mirror pre-sync/pre-flush, refreshes the control-plane, creates or reuses a commit-pinned PipelineRun, waits for the PipelineRun to reach a terminal state, and runs post-flush once that terminal state is successful. The default output must be a low-noise CICD table summary; the full JSON is expanded only through `--full` or `--raw`. 120 seconds is the severe-timeout threshold: when the PipelineRun wait or the `trigger-current` total elapsed exceeds 120 seconds, even if the final status=ok/completed, the CLI must emit, and the closeout must record, the `node-runtime-trigger-over-120s` warning, total elapsed, pipeline wait and git mirror status, and the investigation should start from env-reuse and the git-mirror/control-plane path; when the terminal state has not been reached, the CLI returns a `pending` warning, does not keep blocking for a long time, and does not misreport a still-running pipeline as a build failure. When a small-scope PR trips the 120s threshold, check `affectedServices/buildServices/reusedServices` in `plan-artifacts`: if the source diff is small yet every envreuse service appears in `buildServices` and `reusedServices=[]`, first suspect that the current GitOps artifact catalog was not hydrated into the source plan stage, rather than blindly rerunning the PipelineRun.

+Web sentinel `trigger-current --confirm --wait` can exhaust its wait budget while the Tekton publish continues in the background and the top-level summary still says `source-fetch`. Do not immediately rerun or patch the workload.
First run `web-probe sentinel control-plane status --node --lane --sentinel --full`: if source, registry and GitOps have advanced to the expected source commit/digest but runtime still points at the previous digest, continue with the controlled `web-probe sentinel control-plane apply --confirm --wait` path and then recheck status. If status shows the expected source object or registry digest is still absent, inspect the reported PipelineRun logs/status drill-down and track it as a CI/CD visibility or publish defect. Closeout must record the source commit, registry digest, GitOps revision, Argo revision and runtime digest separately; a wait timeout alone is not proof that publish failed.
+
 ### G14 v0.3 runtime base image

 ```bash
diff --git a/.agents/skills/unidesk-monitor/SKILL.md b/.agents/skills/unidesk-monitor/SKILL.md
index 9bc63bf5..4b729e8e 100644
--- a/.agents/skills/unidesk-monitor/SKILL.md
+++ b/.agents/skills/unidesk-monitor/SKILL.md
@@ -60,6 +60,8 @@ bun scripts/cli.ts web-probe observe analyze
 8. Browser memory/responsiveness/CDP red findings may include `rootCauseSignals` such as session list reads, trace event reads, web-performance beacon failures, EventSource failures and requestfailed/http TopN. Use those fields as first-line root-cause evidence for refresh storms before manually grepping JSONL artifacts.
 9. `web-probe sentinel report --raw` is the bounded issue-evidence JSON view. It should include run/report SHA, compact findings, artifact summary and `rootCauseSignalFindings` when available. Use `--full` only when the complete indexed service payload is explicitly needed.
 10. Quick-verify classification is separate from CI/CD health: `/health` proves deployment readiness, while `quick-verify-no-business-turn` or red analyzer findings prove post-deploy target validation is blocked and should remain visible in the bounded report.
+11.
If a run appears to have only WBC-003, compare public `/api/report?view=findings&run=` with CLI `web-probe sentinel report --run --view findings --raw`. `artifactSummary.reason=analysis-report-json-missing-or-invalid` means the service index cannot read that old artifact, not that analyzer findings are absent; reindex/backfill the existing run instead of starting a new observe run.
+12. Any new analyzer finding id emitted by quick verify must be registered in the selected check catalog before rollout. A missing catalog entry can make `/api/health` return 503 and leave the new runner pod unhealthy even when the image is otherwise correct.

 ## Architecture Preference

diff --git a/.agents/skills/unidesk-monitor/references/full.md b/.agents/skills/unidesk-monitor/references/full.md
index 008263f7..13cf591e 100644
--- a/.agents/skills/unidesk-monitor/references/full.md
+++ b/.agents/skills/unidesk-monitor/references/full.md
@@ -59,6 +59,15 @@ bun scripts/cli.ts web-probe sentinel report --node --lane --senti
 bun scripts/cli.ts web-probe sentinel report --node --lane --sentinel --run --view trace-frame
 ```

+When a findings view shows only `quick-verify-no-business-turn` / WBC-003, do not conclude that the target produced no analyzer findings until the existing artifact has been checked. Compare the public runner API with the CLI raw view for the same run:
+
+```bash
+curl -fsS 'https://monitor.pikapython.com/api/report?view=findings&run='
+bun scripts/cli.ts web-probe sentinel report --node --lane --sentinel --run --view findings --raw
+```
+
+If the public API returns `findingCount=1` but the CLI raw view shows non-empty `artifactSummary.findings`, the service index is stale or cannot read the old `stateDir/analysis/report.json`. Reindex or backfill the existing run through the runner service's controlled record path, preserving existing report views; do not start a new observe run just to make the old finding list visible.
If both views lack analyzer findings, then investigate the analyzer artifact and original observe run. If `artifactSummary.reason=analysis-report-json-missing-or-invalid`, treat it as an index/artifact visibility gap, not as proof that WBC-003 is the only finding.
+
 Public dashboard paths:

 - `https://monitor.pikapython.com/`
@@ -131,6 +140,8 @@ For a Web sentinel fix, closeout needs four independent evidence surfaces:

 Long `quick-verify` or CI/CD waits should be bounded by the YAML-declared budget and the operator's outer timeout. If a wait would exceed about two minutes during rollout, first inspect the visible stage and either optimize the slow path, defer the expensive quick verify to manual validation, or record it as a non-blocking timing warning; do not dead-wait without new evidence.

+After adding a new quick-verify or analyzer finding id, run the sentinel plan before rollout and verify the selected check catalog contains that id. A missing catalog row is a runtime health defect: the service can start, report `config.ok=false`, return `/api/health` 503 and leave the new pod in CrashLoopBackOff while an older ready pod continues serving. Fix the catalog through YAML/source control, redeploy, and only then validate the report/dashboard path.
+
 If `origin/master` advances while rolling out a sentinel, first classify the new commits. For unrelated parallel changes, finish the current bounded check, wait briefly for the branch head to stabilize, then perform one final rollout/status pass against the stable head. Do not loop forever chasing every concurrent merge, but do not call a rollout complete while source truth, internal mirror and runtime image point at different commits.

 Source mirror readiness must be proven by the internal mirror object/read probe for the expected commit. A GitHub/source head check alone is not sufficient evidence to skip source sync, because it does not prove the k3s publish job can fetch the object from the node-local mirror.
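diff --git a/docs/reference/observability.md b/docs/reference/observability.md
index 954c118e..7b5427e5 100644
--- a/docs/reference/observability.md
+++ b/docs/reference/observability.md
@@ -25,6 +25,8 @@ Web/Workbench trace、Web 哨兵和 `web-probe observe` 的人工判定入口以

 For Web sentinel dashboard/API display problems, the primary source of truth is the sentinel runner's `/api/overview`, `/api/runs`, `/api/runs/{id}`, `/api/findings` and the remote-browser evidence from `web-probe sentinel dashboard verify|screenshot`. When OTel/Tempo cannot find a `hwlab-web-probe-sentinel` service span or a specific `sentinel-run-*` id, that only shows that the current instrumentation or retention window does not cover this dashboard/API path; it must not be used to declare a UI/API reporting discrepancy fully traced, nor to block a fix already located through API/DOM evidence. When the runner's internal chain needs further tracing, register the missing Web sentinel span as an instrumentation problem in the corresponding governance issue.

+Web sentinel findings visibility must be checked against both the runner API and the existing observe artifact. If the public `/api/report?view=findings&run=` for a run shows only WBC-003, but `web-probe sentinel report --run --view findings --raw` can read red/amber analyzer findings from `analysis/report.json`, the root cause is that the index or artifact visibility is masking them, not that the target produced no warning/error. In that case, backfill or rebuild the report index for that existing run and preserve the original report views; do not start a new sentinel run to explain an old record.
+
 ## Workbench Request Storm And Freeze

 Root-cause investigation of Workbench request storms and browser unresponsiveness must use OTel, web-probe artifacts and frontend runtime diagnostics together; it is not enough to check whether the provider succeeded or whether a single REST route returned 200. The minimum evidence should include the `traceId/sessionId/turnId` or a redacted scoped key for the same user action, request family counts, SSE transport state, recovery action, refresh queue/single-flight state, browser memory/freeze samples, observer/run/report SHA, and whether the user's page is still operable. When one of these observation surfaces is missing, first add the observation or record the instrumentation gap, then state the root-cause conclusion.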
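
Reviewer note: the two decision rules this patch documents (what to do after a sentinel wait timeout, and how to read a findings view that shows only WBC-003) could be sketched as pure functions. This is a minimal illustration under stated assumptions, not code from the repository: every type, field and action name below is hypothetical and does not reflect the real CLI or runner API schema, and the GitOps-lag branch is not covered by the patch text.

```typescript
// Hypothetical shapes; field names are illustrative only.
interface RolloutStatus {
  expectedCommit: string;
  expectedDigest: string;
  sourceCommit: string | null;   // commit visible in the source mirror
  registryDigest: string | null; // digest published to the registry
  gitopsDigest: string | null;   // digest recorded in GitOps
  runtimeDigest: string | null;  // digest the workload is running
}

type RolloutAction =
  | "inspect-pipelinerun" // expected source object or registry digest absent
  | "controlled-apply"    // source/registry/GitOps advanced, runtime lagging
  | "recheck-status"      // assumption: case not described in the patch
  | "complete";

function nextRolloutAction(s: RolloutStatus): RolloutAction {
  if (s.sourceCommit !== s.expectedCommit || s.registryDigest !== s.expectedDigest) {
    return "inspect-pipelinerun";
  }
  if (s.gitopsDigest !== s.expectedDigest) {
    return "recheck-status";
  }
  return s.runtimeDigest === s.expectedDigest ? "complete" : "controlled-apply";
}

interface FindingsViews {
  publicFindingCount: number;      // findingCount from the public API
  cliArtifactFindingCount: number; // entries in artifactSummary.findings (CLI raw view)
  artifactReason?: string;         // artifactSummary.reason, when present
}

type FindingsAction =
  | "reindex-existing-run"
  | "investigate-analyzer-artifact"
  | "views-agree";

function triageFindings(v: FindingsViews): FindingsAction {
  // An unreadable artifact is a visibility gap, not proof of absent findings.
  if (v.artifactReason === "analysis-report-json-missing-or-invalid") {
    return "reindex-existing-run";
  }
  // Public view shows only the single WBC-003 style finding.
  if (v.publicFindingCount <= 1) {
    return v.cliArtifactFindingCount > 0
      ? "reindex-existing-run"
      : "investigate-analyzer-artifact";
  }
  return "views-agree";
}
```

Keeping both rules free of I/O means the inputs can be filled from whatever the status and report commands actually return, and neither rule ever recommends a rerun or a new observe run, which matches the intent of the added guidance.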