| name | description |
|---|---|
| unidesk-monitor | UniDesk monitoring and Web sentinel operations. Use when working on monitor.pikapython.com, the HWLAB Web sentinel (Web哨兵), web-probe sentinel status/report/dashboard, periodic, scheduled, or sentinel inspection (定期巡检/周期巡检/哨兵巡检), creating or adjusting Web sentinel YAML, monitor/minitor requests, Prometheus/OTel monitoring, multi-sentinel runtime visibility, or monitoring-related issue triage and rollout evidence. |
UniDesk Monitor
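This skill is the entry point for the UniDesk monitoring and Web sentinel operations surface. It does not replace $unidesk-webdev, $unidesk-cicd, $unidesk-ymalops, $unidesk-gh, or $unidesk-otel; when the work falls in their scope, load those skills as well.
Load this skill whenever the user mentions the Web sentinel (Web 哨兵), web-probe sentinel, monitor.pikapython.com, periodic or scheduled inspection (定期/周期巡检), creating a new inspection (新建巡检), the inspection dashboard/report/status, or the misspelling minitor.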
Boundaries
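- A Web sentinel only wraps the existing web-probe observe start/status/command/collect/analyze; do not add a second Playwright runner, sampler, reporter, or analyzer.
- A sentinel observes the HWLAB Web user entry points and business E2E paths; it is not meant to have one sentinel observe another sentinel.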
- YAML is the source of truth. node/lane, sentinel id, Deployment, Service, PVC, route prefix, cadence, Secret sourceRef, dashboard public URL, and report views must all enter the controlled CLI through YAML/configRef.
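- Formal reads and writes of GitHub issues/PRs go through $unidesk-gh; deployment, Argo, git-mirror, PipelineRun, and runtime status go through $unidesk-cicd and the controlled CLI; YAML normalization goes through $unidesk-ymalops.
- HWLAB Web sentinel cadence scheduling must live in the k3s CronJob/GitOps of the target node/lane; do not use a local or remote systemd timer to carry periodic inspection. systemd may be used only for explicitly labeled historical/non-k3s legacy troubleshooting.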
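- For diagnosis, curl or a one-off web-probe script may be used to collect evidence, but repeated dashboard verification must be consolidated into the controlled web-probe sentinel dashboard verify|screenshot or an equivalent entry point.
- web-probe sentinel dashboard screenshot must be used as the remote-browser screenshot entry point; the PNG is downloaded to the caller's /tmp by default, and issue/PR evidence cites localPath, sha256, HTTP status, the DOM summary, and the overflow result. VERIFIED=true only proves that the PNG was returned and passed hash verification; before closing, still open the screenshot or use the DOM summary to confirm that it is not a Chrome network error page, a login page, or an empty shell page.
- The monitor-web "monitored items" (监测项) must follow the selected run by default; curve points, run details, and the monitored-item summary must distinguish type counts from sample counts, and historical aggregates may be shown only as an explicitly labeled historical scope.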
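- Web sentinel check codes must be semantically single and deterministic: one code/id may express only one disposition path. If the same finding could indicate several root causes or states, split it into several fixed codes/ids instead of distinguishing them by a dynamic title or summary under one code.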
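The single-meaning rule for check codes can be sketched as a pure classifier in which every distinguishable state maps to its own fixed code. This is a minimal illustration under stated assumptions, not the analyzer's actual implementation: the TurnState fields and every code string below are hypothetical placeholders.

```typescript
// Hypothetical facts an analyzer might know about one business turn.
interface TurnState {
  hasBusinessTurn: boolean;
  traceId?: string;
  traceRowCount: number;
  finalResponse: string;
  status: "running" | "cancelled" | "failed" | "terminated" | "completed";
}

// Hypothetical fixed codes: one code per disposition path, never reused
// with a dynamic title to mean something else.
type CheckCode =
  | "NO_BUSINESS_TURN"
  | "TURN_MISSING_TRACE_ID"
  | "TRACE_ROWS_MISSING"
  | "FINAL_RESPONSE_EMPTY_RUNNING_OR_CANCELLED"
  | "FINAL_RESPONSE_EMPTY_FAILED_OR_TERMINATED";

// Returns the single code that applies, or null when none of the sketched
// conditions holds. Checks run from the most fundamental gap to the least.
function classifyTurn(t: TurnState): CheckCode | null {
  if (!t.hasBusinessTurn) return "NO_BUSINESS_TURN";
  if (!t.traceId) return "TURN_MISSING_TRACE_ID";
  if (t.traceRowCount === 0) return "TRACE_ROWS_MISSING";
  if (t.finalResponse.trim() === "") {
    if (t.status === "running" || t.status === "cancelled") {
      return "FINAL_RESPONSE_EMPTY_RUNNING_OR_CANCELLED";
    }
    if (t.status === "failed" || t.status === "terminated") {
      return "FINAL_RESPONSE_EMPTY_FAILED_OR_TERMINATED";
    }
    // An empty response on a completed turn is outside this sketch.
  }
  return null;
}
```

Because the function returns a closed union of codes, adding a new state forces a new code name rather than a reworded summary under an existing one.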
Quick Commands
bun scripts/cli.ts web-probe sentinel status --node <node> --lane <lane>
bun scripts/cli.ts web-probe sentinel status --node <node> --lane <lane> --sentinel <id>
bun scripts/cli.ts web-probe sentinel control-plane status --node <node> --lane <lane> --sentinel <id>
bun scripts/cli.ts web-probe sentinel validate --node <node> --lane <lane> --sentinel <id>
bun scripts/cli.ts web-probe sentinel dashboard verify --node <node> --lane <lane> --sentinel <id>
bun scripts/cli.ts web-probe sentinel dashboard screenshot --node <node> --lane <lane> --sentinel <id>
bun scripts/cli.ts web-probe sentinel report --node <node> --lane <lane> --sentinel <id> --latest --view summary
bun scripts/cli.ts web-probe sentinel report --node <node> --lane <lane> --sentinel <id> --latest --view summary --raw
bun scripts/cli.ts web-probe sentinel report --node <node> --lane <lane> --sentinel <id> --latest --view summary --full
bun scripts/cli.ts web-probe sentinel dashboard trigger --node <node> --lane <lane> --sentinel <id>
bun scripts/cli.ts web-probe sentinel control-plane trigger-current --node <node> --lane <lane> --sentinel <id> --confirm
trans <node>:k3s kubectl -n <namespace> get cronjob -l app.kubernetes.io/component=cadence-scheduler
trans <node>:k3s kubectl -n <namespace> create job --from=cronjob/<quick-verify-cronjob> <manual-job-name>
For WebUI manual validation, use web-probe sentinel dashboard trigger so the remote browser clicks the monitor-web button. Direct kubectl create job --from=cronjob/... is a last-resort diagnostic only, not acceptance evidence. For k3s cadence validation, first use the controlled control-plane status/trigger commands, then inspect the rendered CronJob in the target k3s namespace. Persistent cadence changes must be made through YAML/GitOps and redeployed.
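The sentinel commands above share one flag shape, so a wrapper script can build them from a single target description instead of repeating strings. The sketch below is only an illustration: the SentinelTarget type, the sentinelArgs helper, and the sentinel id "example-sentinel" are hypothetical, while the subcommands and flags are the ones listed in this section.

```typescript
// Hypothetical description of the sentinel a command should address.
// `sentinel` is optional because `status` also works for a whole node/lane.
interface SentinelTarget {
  node: string;
  lane: string;
  sentinel?: string;
}

// Builds the argv for `bun scripts/cli.ts web-probe sentinel ...`.
function sentinelArgs(
  subcommand: string[],
  target: SentinelTarget,
  extra: string[] = [],
): string[] {
  const args = [
    "scripts/cli.ts", "web-probe", "sentinel", ...subcommand,
    "--node", target.node,
    "--lane", target.lane,
  ];
  if (target.sentinel) args.push("--sentinel", target.sentinel);
  return [...args, ...extra];
}

// Example target; the sentinel id is a placeholder, not a real sentinel.
const exampleTarget: SentinelTarget = { node: "D601", lane: "v03", sentinel: "example-sentinel" };

// One possible evidence-gathering order (an assumption, not a mandated sequence).
const examplePlan: string[][] = [
  sentinelArgs(["status"], exampleTarget),
  sentinelArgs(["dashboard", "verify"], exampleTarget),
  sentinelArgs(["report"], exampleTarget, ["--latest", "--view", "summary", "--raw"]),
];

for (const argv of examplePlan) {
  console.log(["bun", ...argv].join(" "));
}
```

Keeping the argv as an array rather than a joined string avoids shell-quoting problems when the list is later passed to a process spawner.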
For long Workbench/user-path evidence, use the normal Web probe surface:
bun scripts/cli.ts web-probe observe start --node D601 --lane v03 --target-path /workbench
bun scripts/cli.ts web-probe observe command <observerId> --type <command>
bun scripts/cli.ts web-probe observe collect <observerId> --view <view>
bun scripts/cli.ts web-probe observe analyze <observerId>
Triage Shape
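- When the user asks for an "offline investigation", read only existing sentinel runs/reports, the runner API/index, existing OTel traces, observe artifacts, and source code; do not start a new web-probe, trigger a dashboard/manual quick verify, or change the runtime surface. If report/analyze cannot provide bounded root-cause evidence, first improve report/analyzer/CLI visibility, then continue toward a conclusion.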
- Separate shell/API/render: check public HTML/CSS/JS, /api/overview, /api/runs, then browser console/DOM render evidence.
- Separate runner and web: runner Pod/PVC/API/report health is not the same as monitor-web rendering health.
- Separate service rollout and target validation: Argo/runtime green only proves that the sentinel itself is available; HWLAB business recovery must come from the observe/analyze report.
- Separate single-sentinel and multi-sentinel: root registry shows all sentinels; each runner owns independent Pod/PVC/Service/report. A single monitor-web aggregation layer is a separate responsibility.
- Separate timing alerts and blockers: YAML-configured elapsed/timeout warnings are non-blocking unless the turn fails to complete, breaks Code Agent multi-round continuity, loses samples, or makes auth/submit/report unavailable.
- Separate check type counts and sample counts: findingCount/findingTypeCount is a type count, while severityCounts and finding count are sample counts.
- Trace-frame reports should prefer the latest terminal/completed samples. If a report shows an early running/non-terminal sample, check whether the frame reports a later terminal sample and rerun with that --sample-seq before concluding the business turn is still running.
- Browser memory/responsiveness/CDP red findings may include rootCauseSignals such as session list reads, trace event reads, web-performance beacon failures, EventSource failures, and requestfailed/http TopN. Use those fields as first-line root-cause evidence for refresh storms before manually grepping JSONL artifacts.
- web-probe sentinel report --raw is the bounded issue-evidence JSON view. It should include run/report SHA, compact findings, artifact summary, and rootCauseSignalFindings when available. Use --full only when the complete indexed service payload is explicitly needed.
- Quick-verify classification is separate from CI/CD health: /health proves deployment readiness, while quick-verify-no-business-turn or red analyzer findings prove post-deploy target validation is blocked and should remain visible in the bounded report.
- If a run appears to have only WBC-003, compare public /api/report?view=findings&run=<id> with CLI web-probe sentinel report --run <id> --view findings --raw. artifactSummary.reason=analysis-report-json-missing-or-invalid means the service index cannot read that old artifact, not that analyzer findings are absent; reindex/backfill the existing run instead of starting a new observe run.
- Any new analyzer finding id emitted by quick verify must be registered in the selected check catalog before rollout. A missing catalog entry can make /api/health return 503 and leave the new runner pod unhealthy even when the image is otherwise correct.
- If a dashboard screenshot artifact is small or visually shows ERR_NETWORK_CHANGED/browser error chrome while CLI status is otherwise pass, discard it as evidence and rerun after checking the public URL/API status. Treat this as a web-probe evidence-quality issue if repeated; do not close visibility issues from such a screenshot alone.
- Request-rate curve acceptance uses /api/runs/{id}.requestRate plus dashboard screenshot/DOM evidence that the request chart is above the memory chart with an aligned time axis. Until dashboard verify exposes request-rate-specific fields, do not treat legacy API_PAGES/API_SAMPLES columns as request curve counts; see docs/reference/observability.md.
- Check code design must split semantics before implementing the display: for example, "no business turn", "target turn lacks traceId", "trace rows/projection missing", "Final Response empty and still running/cancelled", and "Final Response empty and already failed/terminated" should be distinct fixed codes, not dynamic explanations under a single WBC-003.
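The difference between type counts and sample counts in the triage list above can be made concrete with a small aggregation. This is a sketch under assumptions rather than the report's real schema: the Finding shape, the severity values, and the code EXAMPLE-001 are hypothetical; only the names findingTypeCount, severityCounts, and WBC-003 come from this document.

```typescript
// Hypothetical finding shape: `code` identifies the check type and
// `count` is the number of samples that hit it.
interface Finding {
  code: string;
  severity: "red" | "yellow" | "info";
  count: number;
}

interface FindingSummary {
  findingTypeCount: number;               // distinct check codes (type count)
  sampleCount: number;                    // total samples across findings
  severityCounts: Record<string, number>; // samples per severity
}

function summarizeFindings(findings: Finding[]): FindingSummary {
  const codes = new Set<string>();
  const severityCounts: Record<string, number> = {};
  let sampleCount = 0;
  for (const f of findings) {
    codes.add(f.code);
    sampleCount += f.count;
    severityCounts[f.severity] = (severityCounts[f.severity] ?? 0) + f.count;
  }
  return { findingTypeCount: codes.size, sampleCount, severityCounts };
}

// Two check types, six samples in total.
const exampleSummary = summarizeFindings([
  { code: "WBC-003", severity: "red", count: 4 },
  { code: "EXAMPLE-001", severity: "yellow", count: 2 },
]);
console.log(exampleSummary.findingTypeCount, exampleSummary.sampleCount);
```

Reporting both numbers side by side keeps a run with one noisy check (one type, many samples) from being read as a run with many distinct problems.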
Architecture Preference
Prefer Kubernetes-native discovery and isolation before inventing a custom control plane:
- Labels/selectors identify sentinel runners.
- ClusterIP Services expose runner APIs.
- EndpointSlice or Service list/watch lets a monitor-web discover runner endpoints.
- PVC per runner preserves local report index.
- Deployment per runner is appropriate for long-lived observe sessions; CronJob is appropriate only for short stateless periodic probes.
- ConfigMap/Secret carry non-secret config and sourceRef-derived runtime material; output remains redacted.
- Prometheus/ServiceMonitor may scrape /metrics when the namespace already has that stack, but report drill-down should stay on runner HTTP APIs or a declared shared store.
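Label-based discovery of runner APIs can be sketched as a pure function over Service objects that a list/watch would return. This is an illustration under assumptions: the label value sentinel-runner, the Service names, and the namespace are hypothetical, and ServiceLike models only the few Service fields the sketch reads. The `<service>.<namespace>.svc` host form is the standard in-cluster DNS name for a ClusterIP Service.

```typescript
// Minimal subset of a Kubernetes Service object used by this sketch.
interface ServiceLike {
  metadata: { name: string; namespace: string; labels?: Record<string, string> };
  spec: { ports: { name?: string; port: number }[] };
}

// Hypothetical selector marking a Service as a sentinel runner API.
const RUNNER_SELECTOR: Record<string, string> = {
  "app.kubernetes.io/component": "sentinel-runner",
};

// Equality-based label matching: every selector key must be present
// on the object with the same value.
function matchesSelector(
  labels: Record<string, string> | undefined,
  selector: Record<string, string>,
): boolean {
  return Object.entries(selector).every(([key, value]) => labels?.[key] === value);
}

// Maps matching Services to in-cluster base URLs using their first port.
function runnerBaseUrls(services: ServiceLike[]): string[] {
  return services.flatMap((svc) => {
    if (!matchesSelector(svc.metadata.labels, RUNNER_SELECTOR)) return [];
    const first = svc.spec.ports[0];
    if (!first) return [];
    return [`http://${svc.metadata.name}.${svc.metadata.namespace}.svc:${first.port}`];
  });
}

// Example input with hypothetical names.
const exampleServices: ServiceLike[] = [
  {
    metadata: { name: "runner-a", namespace: "monitor", labels: { "app.kubernetes.io/component": "sentinel-runner" } },
    spec: { ports: [{ name: "http", port: 8080 }] },
  },
  {
    metadata: { name: "monitor-web", namespace: "monitor", labels: { "app.kubernetes.io/component": "web" } },
    spec: { ports: [{ port: 80 }] },
  },
];
console.log(runnerBaseUrls(exampleServices));
```

Keeping discovery as a function of labels means adding a sentinel only requires a correctly labeled Service; the aggregation layer needs no per-sentinel configuration.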
Read references/full.md for the current D601/v03 Web sentinel command matrix, dashboard triage checklist, and multi-sentinel target architecture.