From d8761ed26ff4b4e7e4fcde0ab797d21dd0f50c48 Mon Sep 17 00:00:00 2001 From: Codex Date: Sun, 28 Jun 2026 08:55:07 +0000 Subject: [PATCH] docs: capture sentinel dashboard scope rules --- .agents/skills/unidesk-monitor/SKILL.md | 2 ++ .agents/skills/unidesk-monitor/references/full.md | 12 ++++++++---- .agents/skills/unidesk-webdev/references/full.md | 2 +- docs/reference/observability.md | 2 ++ 4 files changed, 13 insertions(+), 5 deletions(-) diff --git a/.agents/skills/unidesk-monitor/SKILL.md b/.agents/skills/unidesk-monitor/SKILL.md index d03c9734..280f2bf4 100644 --- a/.agents/skills/unidesk-monitor/SKILL.md +++ b/.agents/skills/unidesk-monitor/SKILL.md @@ -18,6 +18,7 @@ description: UniDesk monitoring and Web sentinel operations. Use when working on - HWLAB Web 哨兵 cadence 调度必须落在目标 node/lane 的 k3s CronJob/GitOps 中;不要用本机或远端 systemd timer 承载周期巡检。systemd 只可用于明确标注的历史/非 k3s legacy 排查。 - 诊断可用 `curl` 或一次性 `web-probe script` 采证,但重复 dashboard 验证必须沉淀为受控 `web-probe sentinel dashboard verify|screenshot` 或等价入口。 - `web-probe sentinel dashboard screenshot` 必须作为远程浏览器截图入口使用,PNG 默认下载到调用者 `/tmp`;issue/PR 证据引用 `localPath`、`sha256`、HTTP status、DOM 摘要和 overflow 结果。 +- monitor-web 的“监测项”默认必须跟随选中 run;曲线点、运行详情和监测项摘要必须区分类型数与样本数,历史聚合只能作为明确标注的历史口径展示。 ## Quick Commands @@ -52,6 +53,7 @@ bun scripts/cli.ts web-probe observe analyze 3. Separate service rollout and target validation: Argo/runtime green only proves哨兵自身可用;HWLAB business recovery must come from observe/analyze report. 4. Separate single-sentinel and multi-sentinel: root registry shows all sentinels; each runner owns independent Pod/PVC/Service/report. A single monitor-web aggregation layer is a separate responsibility. 5. Separate timing alerts and blockers: YAML-configured elapsed/timeout warnings are non-blocking unless the turn fails to complete, breaks Code Agent multi-round continuity, loses samples, or makes auth/submit/report unavailable. +6. 
Separate check type counts and sample counts: `findingCount`/`findingTypeCount` is a type count, while `severityCounts` and finding `count` are sample counts. ## Architecture Preference diff --git a/.agents/skills/unidesk-monitor/references/full.md b/.agents/skills/unidesk-monitor/references/full.md index da8b00b3..44f84140 100644 --- a/.agents/skills/unidesk-monitor/references/full.md +++ b/.agents/skills/unidesk-monitor/references/full.md @@ -46,6 +46,8 @@ bun scripts/cli.ts web-probe sentinel dashboard screenshot --node D601 --lane v0 The screenshot command runs through the selected node/lane remote browser and downloads the PNG artifact to the caller's `/tmp` by default. Closeout evidence should cite `localPath`, `sha256`, page HTTP status, selected DOM summary fields and `layout.horizontalOverflow` / `overflowCount`; do not replace this with a local browser screenshot or ad-hoc `web-probe script` when the sentinel command can cover the page. +`dashboard verify` is the canonical monitor-web DOM contract. It should assert that trend latest-point counts match the latest `/api/runs?sort=updated` run, and that the monitor-web check panel is scoped to the selected run by default. For a selected run, check type count, alert type count, error samples, warning samples and alert samples must match `/api/runs/{id}.findings`; historical `/api/findings` aggregates may be available only behind an explicit time-window scope and must be labeled as historical. + Use the freshness-only `--dry-run` scheduler command when the question is only "how old is the latest run?". It reads cadence, latest age, due status and latest run id without starting a new browser observation. If an enabled sentinel is `due` or stale while other sentinels are fresh, treat it as a sentinel-specific cadence or runtime issue and record the sentinel id, cadence, latest age and run id before starting a repair loop. Report drill-down: @@ -100,9 +102,10 @@ Classify a monitor page issue in this order: 1. 
HTML shell loads: root status, asset links, `data-sentinel-id`, `data-base-path`, contract version. 2. API returns data: `/api/overview`, `/api/runs`, `/api/findings`, selected `/api/runs/{id}`. -3. Browser render executes: page errors, console errors, DOM rows, status summary, findings count. -4. Runtime service health: `/api/health`, `/metrics`, scheduler heartbeat, SQLite/PVC write probe. -5. Control-plane health: source, registry image, git-mirror, GitOps, Argo, Deployment/Service/PVC. +3. Count scopes are coherent: trend latest point is latest-run sample count; selected run detail exposes type count and sample count separately; monitor-web checks default to selected-run scope and only show historical aggregates behind an explicit history/time-window selector. +4. Browser render executes: page errors, console errors, DOM rows, status summary, check cards, scope labels and overflow state. +5. Runtime service health: `/api/health`, `/metrics`, scheduler heartbeat, SQLite/PVC write probe. +6. Control-plane health: source, registry image, git-mirror, GitOps, Argo, Deployment/Service/PVC. Do not treat public root/CSS/JS 200 as dashboard success. Browser console and DOM render evidence are required. @@ -114,6 +117,7 @@ For a Web sentinel fix, closeout needs four independent evidence surfaces: 2. `web-probe sentinel validate --node --lane --sentinel ` must pass `/api/health`, `/metrics`, indexed recent report, public exposure and public dashboard probes. 3. `web-probe sentinel dashboard screenshot --node --lane --sentinel ` must pass through the remote browser and save a verified PNG to the caller `/tmp`; record the local path, hash and layout overflow result. 4. `web-probe sentinel report --node --lane --sentinel --latest --view summary` must show the latest business run, report hash, samples and finding severity. Auth/login failures, submit failures, missing samples, absent recent reports and Code Agent multi-round continuity breaks are blockers. 
YAML elapsed-budget or total-turn timing alerts are non-blocking warnings unless they coincide with a failed turn, broken continuity, missing report or unavailable user path. +5. For monitor-web count fixes, `dashboard verify` must include the latest run id, trend sample counts, selected-run check counts and any targeted run filter evidence. A pass requires check samples to match the selected run detail, not just chart arithmetic. Long `quick-verify` or CI/CD waits should be bounded by the YAML-declared budget and the operator's outer timeout. If a wait would exceed about two minutes during rollout, first inspect the visible stage and either optimize the slow path, defer the expensive quick verify to manual validation, or record it as a non-blocking timing warning; do not dead-wait without new evidence. @@ -121,7 +125,7 @@ If `origin/master` advances while rolling out a sentinel, first classify the new Source mirror readiness must be proven by the internal mirror object/read probe for the expected commit. A GitHub/source head check alone is not sufficient evidence to skip source sync, because it does not prove the k3s publish job can fetch the object from the node-local mirror. -Dashboard aggregate counters may include historical runs. When they disagree with the latest selected run, closeout should name the latest `runId` and report hash as the acceptance source, and track the aggregate-labeling UI improvement separately instead of treating historical aggregate red counts as the latest run's blocker. +Dashboard aggregate counters may include historical runs only when the UI labels that scope explicitly. They must not sit beside a latest-run chart or selected-run check list without a scope label. If trend, run detail and check list disagree, first identify whether each number is a type count, sample count or historical aggregate before changing code. 
For Code Agent multi-round quick-verify, accept the latest run's `turn-summary` / `trace-frame` plus `blockingFindingCount=0` and `controlFindingCount=0`. Analyzer red findings about hydration, API-to-DOM lag or timing drift are investigation evidence unless they coincide with missing durable turns/final responses, failed submit/login/auth, broken continuity, absent report or unavailable user path. diff --git a/.agents/skills/unidesk-webdev/references/full.md b/.agents/skills/unidesk-webdev/references/full.md index 8548fbbd..bb605ad6 100644 --- a/.agents/skills/unidesk-webdev/references/full.md +++ b/.agents/skills/unidesk-webdev/references/full.md @@ -155,7 +155,7 @@ MDTODO 或 Project Management 的 Web 重写/布局 closeout 不能只引用组 - `observe start/status/command/collect/analyze` 默认输出包含 `Wrapper contract` 区块;该区块证明 Web 哨兵只能 wrap 现有 observe CLI verb、现有 runner/analyzer 和既有 artifact contract,不新增第二套 Playwright runner、analyzer、状态机或私有 web-probe API。 - `web-probe sentinel plan|status` 只读取 `observability.webProbe.sentinel.enabled/configRefs` 和 owning YAML,渲染 redacted 配置引用图、文件 hash、缺失字段和跨 ref 冲突;`web-probe sentinel image|control-plane` 继续从 owning YAML 渲染 image、GitOps、Argo 和 manifest 计划,并在远端 publish job 接通前拒绝报告部署 mutation。它不启动浏览器、不读取 Secret 值、不保存采样结果,也不是第二套 runner/analyzer。真正的采样和判定仍以 `observe start|command|collect|analyze` artifacts 为准。 - Web 哨兵 public dashboard/origin 必须以 issue/SPEC/YAML 既定计划为准;当前 P6 计划沿用 `monitor.pikapython.com`,不要未经明确变更改成 `hwlab-monitor.pikapython.com` 或其他新域名。验证 report 时记录 `publicOrigin`,但不要把域名硬编码到 runner/analyzer 逻辑里。 -- 验证 sentinel public dashboard **页面 DOM/截图**(区别于 CI/CD `sentinel status|plan|image|control-plane`)时,`web-probe script` 的默认 origin 是 lane 的 HWLAB Cloud Web(`gotoStable('/')` 进入 hwlab workbench),不是 sentinel dashboard origin;必须用显式 `page.goto('/')`(`publicBaseUrl` 来自 `config/hwlab-web-probe-sentinel/public-exposure.*.yaml#sentinel.publicExposure.publicBaseUrl`,如 `https://monitor.pikapython.com`)进入 dashboard 页面,再用 `safeEvaluate`/`screenshot` 采样,不能用 
`gotoStable('/')` 期望落到 sentinel dashboard。这是 web-probe 缺少 sentinel dashboard DOM 验证受控入口的临时绕过,根因见 [pikasTech/unidesk#1030](https://github.com/pikasTech/unidesk/issues/1030);dashboard DOM 验证脚本不要复用 hwlab workbench 的 `waitWorkbenchReady`/session helper 作为通过条件。 +- 验证 sentinel public dashboard **页面 DOM/截图**(区别于 CI/CD `sentinel status|plan|image|control-plane`)时,优先使用 `web-probe sentinel dashboard verify|screenshot --node --lane --sentinel `;该受控入口会从 YAML public exposure 进入 dashboard 页面,使用目标 node/lane 远程浏览器,并输出 bounded DOM/截图证据。`web-probe script` 的默认 origin 仍是 lane 的 HWLAB Cloud Web,不是 sentinel dashboard origin;只有受控 dashboard 命令暂时覆盖不到的一次性探索才用显式 `page.goto('/')`,同类动作第二次出现时必须沉淀回 sentinel dashboard 命令。 - `scripts/web-probe-sentinel-service.ts` 是 Web 哨兵 Pod entrypoint;`--once` 只做 config/PVC/SQLite/scheduler/analyzer-command health 快照,`--scheduler-disabled` 仅用于本地服务健康冒烟,不能作为生产运行参数。HTTP 服务只提供 `/api/health`、`/api/status`、`/api/runs`、`/api/maintenance`、`/metrics` 和 redacted dashboard 外壳,底层采样仍只能经 observe CLI adapter。 - `trace-frame` 出现 `(无 trace rows;这是 blocker...)` 时,必须先看同一输出中的 `TRACE DIAGNOSTIC`:记录 pageRole/pageId、traceRows/turns/messages 数量、sampleTraceIds、尾部 traceRow/turn/message 归属。若目标 trace 的 turn/message/final 存在但 traceRows 全部属于旧 trace,应按 Workbench read model authority 分裂登记到架构/业务 issue(例:HWLAB #2124),不得把旧 traceRows 当作新 turn 通过证据,也不得让 analyzer 的聚合计数压过 CLI trace 视图。 - analyzer finding 不得压过 CLI `trace-frame` 人工视图。尤其 `trace-assistant-message-duplicates-final-response` 只有在 `trace-frame` 中同一 completed turn 可见多条相同 assistant final rows 时才按业务 bug 处理;如果 `trace-frame` 只有一条 assistant final row、后面固定 `Final Response` 区块正确且 API messages/turns 对齐,该 amber 归类为 analyzer 精度问题,应登记/修工具,不得阻止业务 closeout。 diff --git a/docs/reference/observability.md b/docs/reference/observability.md index cac335b3..e5dbeb60 100644 --- a/docs/reference/observability.md +++ b/docs/reference/observability.md @@ -23,6 +23,8 @@ UniDesk 的可观测性优先级高于静默成功。CLI、服务日志、Docker Web/Workbench trace、Web 哨兵和 `web-probe observe` 的人工判定入口以 
`$unidesk-webdev` 为准:先用采样器保存的 artifact 渲染 `turn-summary` 和 `trace-frame` CLI 视图,再解释 analyzer finding。自动判别器、聚合计数或额外截图保存源不能压过同一采样帧的 CLI trace 视图;若二者冲突,应登记 analyzer/tooling 精度问题或上游投影问题,而不是用 fallback 视图修业务结论。 +Web 哨兵 dashboard/API 展示问题的第一事实源是 sentinel runner 的 `/api/overview`、`/api/runs`、`/api/runs/{id}`、`/api/findings` 和 `web-probe sentinel dashboard verify|screenshot` 远程浏览器证据。OTel/Tempo 查询不到 `hwlab-web-probe-sentinel` service span 或具体 `sentinel-run-*` id 时,只能说明当前 instrumentation 或保留窗口没有覆盖这条 dashboard/API 路径;不得因此把 UI/API 口径问题判为已追穿,也不得阻塞已由 API/DOM 证据定位的修复。需要继续追 runner 内部链路时,应把缺少 Web 哨兵 span 作为 instrumentation 问题登记到对应治理 issue。 + ## CLI Logs 异步 job 的 stdout 和 stderr 位于 `.state/jobs/`。`job|jobs list` 默认只返回最新 50 条摘要,并为已知异步工作流返回轻量 `progress.summary`;`job status ` 与兼容别名 `jobs get/read ` 会返回结构化 `progress` 与有限尾部,避免输出爆炸,同时保留完整日志文件路径便于继续排查。实现必须只读取日志尾部字节,不得先把完整 job 日志读入 CLI 内存;长时命令的阶段、关键对象名和下一步查询命令应优先沉淀到 `progress`,不能要求调用者先阅读完整日志才能知道是否卡在提交、构建、发布或观测阶段。
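
The tail-only rule in the CLI Logs paragraph above (read only the trailing bytes of a job log, never the whole file) can be sketched as follows. This is a minimal illustration, not the UniDesk implementation: `readLogTail`, `maxBytes`, and the demo log file are hypothetical names introduced here, and the only assumption is a Node-compatible `node:fs` API such as the one Bun provides.

```typescript
import { closeSync, fstatSync, mkdtempSync, openSync, readSync, writeFileSync } from "node:fs";
import { tmpdir } from "node:os";
import { join } from "node:path";

// Hypothetical helper: return at most `maxBytes` from the end of a log file
// without loading the rest of the file into memory.
function readLogTail(path: string, maxBytes: number): string {
  const fd = openSync(path, "r");
  try {
    const size = fstatSync(fd).size;
    const length = Math.min(size, maxBytes);
    const start = size - length;
    const buffer = Buffer.alloc(length);

    // Positional reads: only the last `length` bytes are ever touched.
    let filled = 0;
    while (filled < length) {
      const n = readSync(fd, buffer, filled, length - filled, start + filled);
      if (n === 0) break; // file shrank while reading
      filled += n;
    }
    let tail = buffer.subarray(0, filled);

    // If the window starts mid-file, drop the leading partial line so the
    // result never begins with a cut line or a split UTF-8 sequence.
    if (start > 0) {
      const firstNewline = tail.indexOf(0x0a);
      if (firstNewline >= 0) tail = tail.subarray(firstNewline + 1);
    }
    return tail.toString("utf8");
  } finally {
    closeSync(fd);
  }
}

// Usage: write a small demo log, then read only its last 16 bytes.
const demoPath = join(mkdtempSync(join(tmpdir(), "job-log-")), "stdout.log");
writeFileSync(demoPath, "stage: submit\nstage: build\nstage: publish\n");
const tail = readLogTail(demoPath, 16);
console.log(JSON.stringify(tail));
```

Dropping the leading partial line keeps the bounded tail readable when the byte window starts in the middle of a line. A real CLI would presumably pair this with the structured `progress` summary rather than rely on the tail alone.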