pikasTech-unidesk/.agents/skills/unidesk-monitor/SKILL.md
2026-07-02 07:33:19 +00:00


name: unidesk-monitor
description: UniDesk monitoring and Web sentinel operations. Use when working on monitor.pikapython.com, the HWLAB Web sentinel (Web 哨兵), web-probe sentinel status/report/dashboard, scheduled/periodic/sentinel inspection (定期巡检/周期巡检/哨兵巡检), creating or adjusting Web sentinel YAML, monitor/minitor requests, Prometheus/OTel monitoring, multi-sentinel runtime visibility, or monitoring-related issue triage and rollout evidence.

UniDesk Monitor

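This skill is the entry point for the UniDesk monitoring and Web sentinel operations surface. It does not replace $unidesk-webdev, $unidesk-cicd, $unidesk-ymalops, $unidesk-gh, or $unidesk-otel; load those skills as well whenever the work falls within their scope.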

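Load this skill whenever the user mentions the Web sentinel (Web 哨兵), web-probe sentinel, monitor.pikapython.com, scheduled or periodic inspection (定期/周期巡检), creating a new inspection (新建巡检), the inspection dashboard/report/status, or the misspelling minitor.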

Boundaries

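  • The Web sentinel only wraps the existing web-probe observe start/status/command/collect/analyze. Do not add a second Playwright runner, sampler, reporter, or analyzer.
  • The sentinel observes the HWLAB Web user entry points and the business E2E paths; it is not meant for one sentinel to observe another.
  • YAML is the source of truth. The node/lane, sentinel id, Deployment, Service, PVC, route prefix, cadence, Secret sourceRef, dashboard public URL, and report views must all reach the controlled CLI through YAML/configRef.
  • Formal reads and writes of GitHub issues/PRs go through $unidesk-gh; deployment, Argo, git-mirror, PipelineRun, and runtime status go through $unidesk-cicd and the controlled CLI; YAML normalization goes through $unidesk-ymalops.
  • HWLAB Web sentinel cadence scheduling must live in the k3s CronJob/GitOps of the target node/lane; do not use a local or remote systemd timer to carry periodic inspections. systemd may be used only for explicitly labeled historical/non-k3s legacy troubleshooting.
  • For diagnosis, curl or a one-off web-probe script may be used to gather evidence, but repeated dashboard verification must be consolidated into the controlled web-probe sentinel dashboard verify|screenshot or an equivalent entry point.
  • web-probe sentinel dashboard screenshot must be used as the remote-browser screenshot entry point. The PNG downloads to the caller's /tmp by default; issue/PR evidence should cite localPath, sha256, HTTP status, the DOM summary, and the overflow result. VERIFIED=true only proves that the PNG was returned and passed hash verification; before closing out, you must still open the screenshot or use the DOM summary to confirm that it is not a Chrome network error page, a login page, or an empty shell page.
  • The "monitored items" (监测项) in monitor-web must follow the selected run by default; curve points, run details, and the monitored-item summary must distinguish type counts from sample counts, and historical aggregates may be shown only as explicitly labeled historical figures.
  • Web sentinel check codes must be semantically single and deterministic: one code/id may express only one handling path. If the same finding could indicate multiple root causes or states, split it into multiple fixed codes/ids instead of distinguishing them with a dynamic title or summary under one code.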
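
The screenshot rule above (VERIFIED=true is necessary but not sufficient) could be encoded as a small acceptance check. The following is a minimal sketch, not the CLI's actual behavior: the `ScreenshotEvidence` field names, the `SUSPECT_MARKERS` list, and the `acceptScreenshot` helper are all assumptions made for illustration, and the caller is assumed to have hashed the downloaded PNG itself.

```typescript
// Hypothetical shape of the evidence attached to an issue/PR; the real CLI
// output may use different field names.
interface ScreenshotEvidence {
  verified: boolean;   // mirrors VERIFIED=true (PNG returned, hash checked)
  sha256: string;      // hash reported alongside the PNG
  httpStatus: number;  // HTTP status of the dashboard page
  domSummary: string;  // short text summary of the rendered DOM
}

// Illustrative markers for browser error pages and login pages; not exhaustive.
const SUSPECT_MARKERS = ["ERR_NETWORK_CHANGED", "ERR_CONNECTION", "Sign in", "Login"];

// Returns the list of reasons the screenshot should NOT be used as evidence.
// An empty list means the basic checks passed; a human look at the PNG is
// still advisable before closing an issue.
function acceptScreenshot(localSha256: string, ev: ScreenshotEvidence): string[] {
  const problems: string[] = [];
  if (!ev.verified) problems.push("not verified");
  if (localSha256.toLowerCase() !== ev.sha256.toLowerCase()) problems.push("sha256 mismatch");
  if (ev.httpStatus !== 200) problems.push(`http status ${ev.httpStatus}`);
  const summary = ev.domSummary.trim();
  if (summary.length === 0) problems.push("empty DOM summary");
  for (const marker of SUSPECT_MARKERS) {
    if (summary.includes(marker)) problems.push(`suspect marker: ${marker}`);
  }
  return problems;
}
```

The design choice here is to return reasons rather than a boolean, so the rejected screenshot can still be described in the issue without being counted as acceptance evidence.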

Quick Commands

bun scripts/cli.ts web-probe sentinel status --node <node> --lane <lane>
bun scripts/cli.ts web-probe sentinel status --node <node> --lane <lane> --sentinel <id>
bun scripts/cli.ts web-probe sentinel control-plane status --node <node> --lane <lane> --sentinel <id>
bun scripts/cli.ts web-probe sentinel validate --node <node> --lane <lane> --sentinel <id>
bun scripts/cli.ts web-probe sentinel dashboard verify --node <node> --lane <lane> --sentinel <id>
bun scripts/cli.ts web-probe sentinel dashboard screenshot --node <node> --lane <lane> --sentinel <id>
bun scripts/cli.ts web-probe sentinel report --node <node> --lane <lane> --sentinel <id> --latest --view summary
bun scripts/cli.ts web-probe sentinel report --node <node> --lane <lane> --sentinel <id> --latest --view summary --raw
bun scripts/cli.ts web-probe sentinel report --node <node> --lane <lane> --sentinel <id> --latest --view summary --full
bun scripts/cli.ts web-probe sentinel dashboard trigger --node <node> --lane <lane> --sentinel <id>
bun scripts/cli.ts web-probe sentinel control-plane trigger-current --node <node> --lane <lane> --sentinel <id> --confirm
trans <node>:k3s kubectl -n <namespace> get cronjob -l app.kubernetes.io/component=cadence-scheduler
trans <node>:k3s kubectl -n <namespace> create job --from=cronjob/<quick-verify-cronjob> <manual-job-name>

For WebUI manual validation, use web-probe sentinel dashboard trigger so the remote browser clicks the monitor-web button. Direct kubectl create job --from=cronjob/... is a last-resort diagnostic only, not acceptance evidence. For k3s cadence validation, first use the controlled control-plane status/trigger commands, then inspect the rendered CronJob in the target k3s namespace. Persistent cadence changes must be made through YAML/GitOps and redeployed.
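
To keep node/lane/sentinel values coming from one place rather than being retyped per command, the invocations above can be assembled from a single target object. This is a minimal sketch that only uses subcommands and flags already listed in Quick Commands; the `SentinelTarget` type, the `sentinelArgv` helper, and the sentinel id `example-sentinel` are hypothetical, and in practice the target values would be read from YAML/configRef.

```typescript
// Target coordinates; assumed to be loaded from YAML/configRef, not typed by hand.
interface SentinelTarget {
  node: string;
  lane: string;
  sentinel?: string;
}

// Builds the argv for `bun scripts/cli.ts web-probe sentinel <subcommand...>`.
function sentinelArgv(subcommand: string[], t: SentinelTarget, extra: string[] = []): string[] {
  const argv = [
    "bun", "scripts/cli.ts", "web-probe", "sentinel",
    ...subcommand,
    "--node", t.node,
    "--lane", t.lane,
  ];
  if (t.sentinel) argv.push("--sentinel", t.sentinel);
  return [...argv, ...extra];
}

// Illustrative manual-validation sequence: check the control plane, let the
// remote browser click the monitor-web button, then read the bounded report.
const target: SentinelTarget = { node: "D601", lane: "v03", sentinel: "example-sentinel" };
const steps = [
  sentinelArgv(["control-plane", "status"], target),
  sentinelArgv(["dashboard", "trigger"], target),
  sentinelArgv(["report"], target, ["--latest", "--view", "summary", "--raw"]),
];
for (const step of steps) console.log(step.join(" "));
```

The sketch only prints the commands; running them (for example with a process-spawning API) is left to the caller so that the sequence can be reviewed first.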

For long Workbench/user-path evidence, use the normal Web probe surface:

bun scripts/cli.ts web-probe observe start --node D601 --lane v03 --target-path /workbench
bun scripts/cli.ts web-probe observe command <observerId> --type <command>
bun scripts/cli.ts web-probe observe collect <observerId> --view <view>
bun scripts/cli.ts web-probe observe analyze <observerId>

Triage Shape

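  1. When the user asks for an "offline investigation", read only existing sentinel runs/reports, the runner API/index, existing OTel traces, observe artifacts, and source code; do not start a new web-probe, trigger a dashboard/manual quick verify, or change the runtime. If report/analyze cannot produce bounded root-cause evidence, improve report/analyzer/CLI visibility first, then continue toward a conclusion.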
  2. Separate shell/API/render: check public HTML/CSS/JS, /api/overview, /api/runs, then browser console/DOM render evidence.
  3. Separate runner and web: runner Pod/PVC/API/report health is not the same as monitor-web rendering health.
  4. Separate service rollout and target validation: Argo/runtime green only proves that the sentinel itself is available; HWLAB business recovery must come from the observe/analyze report.
  5. Separate single-sentinel and multi-sentinel: root registry shows all sentinels; each runner owns independent Pod/PVC/Service/report. A single monitor-web aggregation layer is a separate responsibility.
  6. Separate timing alerts and blockers: YAML-configured elapsed/timeout warnings are non-blocking unless the turn fails to complete, breaks Code Agent multi-round continuity, loses samples, or makes auth/submit/report unavailable.
  7. Separate check type counts and sample counts: findingCount/findingTypeCount is a type count, while severityCounts and finding count are sample counts.
  8. Trace-frame reports should prefer latest terminal/completed samples. If a report shows an early running/non-terminal sample, check whether the frame reports a later terminal sample and rerun with that --sample-seq before concluding the business turn is still running.
  9. Browser memory/responsiveness/CDP red findings may include rootCauseSignals such as session list reads, trace event reads, web-performance beacon failures, EventSource failures and requestfailed/http TopN. Use those fields as first-line root-cause evidence for refresh storms before manually grepping JSONL artifacts.
  10. web-probe sentinel report --raw is the bounded issue-evidence JSON view. It should include run/report SHA, compact findings, artifact summary and rootCauseSignalFindings when available. Use --full only when the complete indexed service payload is explicitly needed.
  11. Quick-verify classification is separate from CI/CD health: /health proves deployment readiness, while quick-verify-no-business-turn or red analyzer findings prove post-deploy target validation is blocked and should remain visible in the bounded report.
  12. If a run appears to have only WBC-003, compare public /api/report?view=findings&run=<id> with CLI web-probe sentinel report --run <id> --view findings --raw. artifactSummary.reason=analysis-report-json-missing-or-invalid means the service index cannot read that old artifact, not that analyzer findings are absent; reindex/backfill the existing run instead of starting a new observe run.
  13. Any new analyzer finding id emitted by quick verify must be registered in the selected check catalog before rollout. A missing catalog entry can make /api/health return 503 and leave the new runner pod unhealthy even when the image is otherwise correct.
  14. If a dashboard screenshot artifact is small or visually shows ERR_NETWORK_CHANGED/browser error chrome while CLI status is otherwise pass, discard it as evidence and rerun after checking the public URL/API status. Treat this as a web-probe evidence-quality issue if repeated; do not close visibility issues from such a screenshot alone.
  15. Request-rate curve acceptance uses /api/runs/{id}.requestRate plus dashboard screenshot/DOM evidence that the request chart is above the memory chart with aligned time axis. Until dashboard verify exposes request-rate-specific fields, do not treat legacy API_PAGES / API_SAMPLES columns as request curve counts; see docs/reference/observability.md.
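  16. Check code design must split semantics before implementing display: for example, "no business turn", "target turn lacks a traceId", "trace rows/projection missing", "Final Response empty while still running/cancelled", and "Final Response empty after failure/termination" should be different fixed codes, not dynamic explanations under a single WBC-003.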
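
The distinction between type counts and sample counts (item 7) can be made concrete with a small tally. This is a minimal sketch under stated assumptions: the `FindingSample` shape, the severity labels, and the code `EX-001` are hypothetical and are not taken from the analyzer's real output; only `WBC-003` appears in this document.

```typescript
// Hypothetical severity labels; the real analyzer may use different ones.
type Severity = "red" | "yellow" | "info";

// One emitted finding sample. Each fixed code stands for exactly one
// handling path, so distinct codes can be counted as distinct types.
interface FindingSample {
  code: string;
  severity: Severity;
}

interface FindingCounts {
  findingTypeCount: number;                  // number of distinct codes
  sampleCount: number;                       // number of samples
  severityCounts: Record<Severity, number>;  // per-sample severity tally
}

function countFindings(samples: FindingSample[]): FindingCounts {
  const severityCounts: Record<Severity, number> = { red: 0, yellow: 0, info: 0 };
  const codes = new Set<string>();
  for (const sample of samples) {
    codes.add(sample.code);
    severityCounts[sample.severity] += 1;
  }
  return { findingTypeCount: codes.size, sampleCount: samples.length, severityCounts };
}

// Three samples but only two types: the same code fired twice in this run.
const counts = countFindings([
  { code: "WBC-003", severity: "red" },
  { code: "WBC-003", severity: "red" },
  { code: "EX-001", severity: "yellow" },
]);
console.log(counts.findingTypeCount, counts.sampleCount);
```

Counting types by distinct code only works if each code is semantically single, which is why the fixed-code rule and the type/sample split belong together.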

Architecture Preference

Prefer Kubernetes-native discovery and isolation before inventing a custom control plane:

  • Labels/selectors identify sentinel runners.
  • ClusterIP Services expose runner APIs.
  • EndpointSlice or Service list/watch lets a monitor-web discover runner endpoints.
  • PVC per runner preserves local report index.
  • Deployment per runner is appropriate for long-lived observe sessions; CronJob is appropriate only for short stateless periodic probes.
  • ConfigMap/Secret carry non-secret config and sourceRef-derived runtime material; output remains redacted.
  • Prometheus/ServiceMonitor may scrape /metrics when the namespace already has that stack, but report drill-down should stay on runner HTTP APIs or a declared shared store.
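
As an illustration of label-based discovery, a monitor-web aggregation layer could derive runner API URLs from an already-fetched list of Services. This is a minimal sketch, not the project's implementation: the `ServiceLike` type is a trimmed subset of a Kubernetes Service, the label value `sentinel-runner` is a hypothetical convention, and fetching the list (list/watch against the API server) is out of scope here.

```typescript
// Trimmed subset of a Kubernetes Service object, enough for discovery.
interface ServiceLike {
  metadata: { name: string; namespace: string; labels?: Record<string, string> };
  spec: { ports?: { name?: string; port: number }[] };
}

// Hypothetical label convention for sentinel runner Services.
const RUNNER_SELECTOR: Record<string, string> = {
  "app.kubernetes.io/component": "sentinel-runner",
};

// Equality-based selector match: every selector key must be present with the same value.
function matchesSelector(
  labels: Record<string, string> | undefined,
  selector: Record<string, string>,
): boolean {
  return Object.entries(selector).every(([key, value]) => labels?.[key] === value);
}

// Maps matching Services to in-cluster URLs using the first declared port.
function runnerEndpoints(
  services: ServiceLike[],
  selector: Record<string, string> = RUNNER_SELECTOR,
): string[] {
  return services
    .filter((svc) => matchesSelector(svc.metadata.labels, selector))
    .flatMap((svc) =>
      (svc.spec.ports ?? [])
        .slice(0, 1)
        .map((p) => `http://${svc.metadata.name}.${svc.metadata.namespace}.svc:${p.port}`),
    );
}
```

Because each runner keeps its own Service and PVC, the aggregation layer only needs the selector and the Service list; it does not need a custom registry of runner addresses.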

Read references/full.md for the current D601/v03 Web sentinel command matrix, dashboard triage checklist, and multi-sentinel target architecture.