fix: improve hwlab probe diagnostics

2026-06-20 03:59:48 +00:00
parent 7696fddccf
commit 6a88f38ca9
6 changed files with 317 additions and 5 deletions
@@ -24,12 +24,16 @@ The G14/D601 v03 bootstrap admin password is HWLAB runtime Secret lifecycle

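 `hwlab nodes web-probe run|script --node <node> --lane <lane>` is the controlled entry point for online DOM/Playwright acceptance of HWLAB Cloud Web. The CLI resolves the workspace, public URL, and bootstrap admin sourceRef from YAML, and outputs only the redacted credential status, artifact path/hash, readiness, `probe.summary`, and the failure classification. `run` uses the repo-owned standard DOM probe; `script` does not run the default probe, and the caller's script must be supplied through a stdin heredoc or `--script-file <path>`. When `run --message ...` is invoked without explicit trace parameters, it performs lightweight trace sampling. The `script` helper can use `recordStep` / `safeFetchJson` / `fetchApiMatrix` to preserve structured partial evidence from before a failure, and the full redacted report is expanded through `reportPath`/`reportSha256`. Web development specifics, fake-server Playwright, fixture redaction, the `web-probe script` helper, screenshots, and the Workbench/Performance judgment criteria are all documented in `$unidesk-webdev`; this CLI reference no longer maintains a second set of operating instructions.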

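+For `web-probe script` issue closeout, cite the top-level `issueEvidence` or `summary.issueEvidence` first; it contains the redacted `result`, the most recent steps, lastStep, the script/report SHA, and a screenshot summary. Expand `probe.script.result`, `probe.steps`, or `reportPath` only for a full investigation, so that the same evidence is not pasted repeatedly across the stdout, summary, and report layers.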
+
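The lookup order described above can be sketched as follows. This is a minimal illustration, not part of the commit: it assumes the CLI output has already been parsed into a Python dict, and the function name `pick_issue_evidence` is hypothetical. Only the field names `issueEvidence` and `summary.issueEvidence` come from the text.

```python
from typing import Any, Optional


def pick_issue_evidence(output: dict[str, Any]) -> Optional[dict[str, Any]]:
    """Return the evidence block to cite in an issue closeout.

    Assumption: `output` is the parsed JSON of a `web-probe script` run.
    The top-level `issueEvidence` wins; `summary.issueEvidence` is the
    fallback. None means neither is present and a full investigation
    (probe.script.result, probe.steps, reportPath) is needed instead.
    """
    top = output.get("issueEvidence")
    if isinstance(top, dict):
        return top
    summary = output.get("summary")
    if isinstance(summary, dict):
        nested = summary.get("issueEvidence")
        if isinstance(nested, dict):
            return nested
    return None
```

Returning `None` rather than falling through to the larger report fields keeps the closeout short by default, which is the intent of the paragraph above.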
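 `hwlab nodes control-plane infra plan|status|apply --node D601 --lane v03` is the YAML-driven entry point for the prerequisite control plane of the D601 HWLAB v03 node: node-local k3s, CI/CD, and git-mirror. The configuration source of truth is `config/hwlab-node-control-plane.yaml`. `plan` is read-only and shows the YAML target, a summary of the host k3s node config, and the control-plane objects that will be rendered. `status` is read-only and observes the k3s systemd drop-in and node `capacity/allocatable.pods`, plus the readiness of D601 Tekton, the CI namespace, git-mirror, Argo, the node-local registry, and the tools image. `apply --dry-run` outputs only the manifest and a host config summary. `apply --confirm` converges the D601 host k3s drop-in and the control-plane bootstrap objects to the YAML; it restarts k3s only when the host k3s configuration or live pod capacity has not converged, does not trigger an HWLAB runtime rollout, does not create the PK01 DB, and does not modify Caddy/FRP. Host-side k3s pre-start fixes on D601 must also be written as YAML `execStartPre` argv, not applied as manual systemd hot edits. When the kube API is already unavailable, `apply` can restore the systemd drop-in by running the host script rendered from the same YAML through the node-local tools image/Docker fallback; the output still gives only object names, SHAs, exit codes, and summaries. Tunable values such as k3s pod capacity are governed by the YAML alone, and long-term references do not copy concrete values. The node-local registry address of the tools image may only be an output artifact; the input base image must be declared in YAML as coming from a public registry. A missing output image should be reflected in `status.next.blockers` rather than treating an existing node-local image as the input base image.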

 `hwlab nodes git-mirror status|sync|flush --node <node> --lane <lane>` is the Git mirror maintenance entry point for a node-scoped runtime lane. In `status`, `githubSource` / `githubGitops` come from `refs/mirror-stage/...` in the local mirror cache, not from the live GitHub API; `refSources.githubFieldsAreMirrorStageCache=true` and `refSources.cacheRefresh` in the output state this origin and the refresh command. When the k3s Job of `sync --confirm --wait` hits a GitHub SSH transient, it should pull GitHub source/gitops through the target workspace fallback and write them back to the node-local mirror; the output discloses only the commit, the mirror write URL, and the fallback status. If `flush --confirm --wait` has already pushed the GitOps ref to GitHub but the post-push fetch/recheck fails on a transient SSH error and cannot refresh mirror-stage, it marks `partialSuccess=push-succeeded-fetch-failed`. The CLI should then automatically run one controlled sync to refresh mirror-stage: if `pendingFlush=false` and `githubInSync=true` after recovery, the result should be `ok=true` with `partialSuccessRecovered` / `postPushRecovery` in the output; only otherwise does it keep `degradedReason=node-runtime-git-mirror-flush-post-push-fetch-failed` and the next step `sync --confirm --wait`. Do not read this partial success as a reason to flush blindly and repeatedly. `hwlab nodes control-plane trigger-current --node <node> --lane <lane> --confirm --wait` automatically runs any necessary pre-flush after source sync and any necessary post-flush after the PipelineRun reaches a terminal state. Progress events must explicitly output, for `git-mirror-pre-flush` / `git-mirror-post-flush`, executed/skipped, jobName, local/github source, local/github GitOps, `pendingFlush`, and `githubInSync`, and a recovered partial success must not cause the top-level trigger-current to false-fail. `control-plane status` remains a read-only entry point: it exposes only a compact `gitMirror` summary and the next flush command, and does not implicitly perform write operations.
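The post-push recovery rule above can be sketched as a small decision function. This is an illustration only, not the CLI's implementation: the function name `resolve_partial_success` and the shape of the returned dict are assumptions, and only the quoted field names and string values come from the text.

```python
from typing import Any

# Value quoted in the text; how it is stored in the real CLI is not shown here.
DEGRADED_REASON = "node-runtime-git-mirror-flush-post-push-fetch-failed"


def resolve_partial_success(pending_flush: bool, github_in_sync: bool) -> dict[str, Any]:
    """Decide the flush result after the single controlled recovery sync.

    Assumption: the caller has already seen
    partialSuccess=push-succeeded-fetch-failed and has run one sync;
    the two arguments are the values observed after that sync.
    """
    if not pending_flush and github_in_sync:
        # Recovered: the push landed and mirror-stage is fresh again.
        # The boolean value of partialSuccessRecovered is an assumption.
        return {"ok": True, "partialSuccessRecovered": True}
    # Not recovered: keep the degraded reason and point at one more sync,
    # not at another flush.
    return {"degradedReason": DEGRADED_REASON, "next": "sync --confirm --wait"}
```

The unrecovered branch deliberately suggests `sync`, since the text warns against repeated blind flushes when the push itself has already succeeded.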

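 When a node-scoped runtime lane is triggered after a PR merge, `control-plane status --pipeline-run <name>` is the pinned observation entry point for one specific PipelineRun, but `sourceHead` / `summary.sourceCommit` in the same output may still reflect the latest head of the current branch. If later PRs are merged after the trigger, the current head may no longer be the short SHA in that PipelineRun's name. Closeout evidence must state all of the following: the PR merge commit, the name and status of the pinned PipelineRun, the final runtime/GitOps revision, the current branch tip, and whether the current branch tip contains this PR's merge commit. Do not infer the source identity of an older PipelineRun from `summary.sourceCommit` alone.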
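The closeout checklist above can be captured as a record with a completeness check. This is a sketch under stated assumptions: the class `CloseoutEvidence`, its field names, and `missing_fields` are all hypothetical and do not correspond to any CLI output schema. They only mirror the five items the paragraph lists.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class CloseoutEvidence:
    """Hypothetical container for the items a closeout must state."""
    pr_merge_commit: str = ""
    pipeline_run_name: str = ""
    pipeline_run_status: str = ""
    final_revision: str = ""  # final runtime/GitOps revision
    branch_tip: str = ""
    # None means "not checked yet"; False is a valid, recorded answer.
    tip_contains_merge_commit: Optional[bool] = None


def missing_fields(evidence: CloseoutEvidence) -> list[str]:
    """List the field names that are still empty or unchecked."""
    missing = []
    for f in fields(evidence):
        value = getattr(evidence, f.name)
        if value is None or value == "":
            missing.append(f.name)
    return missing
```

Keeping `tip_contains_merge_commit` as a separate explicit field reflects the point of the paragraph: the branch tip and the PipelineRun's source commit are recorded independently instead of being inferred from each other.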

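+In `hwlab nodes control-plane status`, `publicProbe.ready` means that the control plane successfully reached the YAML-declared public Web/API through the public user entry point; `publicProbe.targetHost` only reports the diagnostic result of the target node host itself accessing the same public URL. If `publicProbe.ready=true` and `publicProbe.diagnostic.kind=target-host-public-egress-mismatch`, closeout still relies on `publicProbe` and the `web-probe` user-entry evidence, and the host-side `hwlab-cli` failure to reach the public network should be tracked separately as a target host egress/hairpin issue.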
+
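The interpretation rule above can be sketched as a classifier over the `publicProbe` object. This is a minimal illustration: it assumes `publicProbe` has been parsed into a Python dict, and the function name `closeout_basis` and its return labels are hypothetical. Only `ready`, `diagnostic.kind`, and the `target-host-public-egress-mismatch` value come from the text.

```python
from typing import Any


def closeout_basis(public_probe: dict[str, Any]) -> str:
    """Classify a publicProbe result for closeout purposes.

    Return labels are illustrative, not CLI output values.
    """
    ready = public_probe.get("ready") is True
    diagnostic = public_probe.get("diagnostic") or {}
    kind = diagnostic.get("kind")
    if ready and kind == "target-host-public-egress-mismatch":
        # The user entry point works; the host-side failure is a separate
        # egress/hairpin issue and does not block closeout.
        return "user-entry-ok-track-host-egress-separately"
    if ready:
        return "user-entry-ok"
    return "user-entry-failed"
```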
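 When a PipelineRun fails or stays unfinished for a long time, first locate the failing TaskRun/Pod/container with the pinned `control-plane status --pipeline-run <name>` and bounded read-only k3s diagnostics. For env-reuse service builds, the common failure point is the `step-publish` log of `build-<service>`: downloads of external dependencies such as apt, npm, and Go modules may hit transient 502s, resets, or timeouts through the egress proxy injected by the lane YAML. First use `platform-infra sub2api status|validate` to distinguish an overall failure of the shared proxy from a transient on a single upstream. When the proxy is healthy but a single dependency download is transient, a controlled `trigger-current --rerun` is acceptable; for repeated failures, add bounded retries to the corresponding `artifact-publish`/envRecipe download step, then re-merge and release. Do not substitute raw `kubectl delete/patch`, in-pod hot patches, or blind full reruns for a persistent recipe fix.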
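The "bounded retries" idea above can be sketched generically. The actual `artifact-publish`/envRecipe steps are not shown in this document and may well be shell rather than Python, so the helper below is an assumption-laden illustration of the pattern only: the name `retry_bounded`, its parameters, and the linear backoff are all hypothetical.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_bounded(
    action: Callable[[], T],
    attempts: int = 3,
    delay_s: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run `action` at most `attempts` times, retrying only on OSError.

    OSError covers connection resets and timeouts raised by the standard
    library; the final failure is re-raised so the step still fails loudly.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except OSError:
            if attempt == attempts:
                raise
            # Linear backoff: delay_s, 2 * delay_s, ...
            sleep(delay_s * attempt)
    raise AssertionError("unreachable")
```

The hard cap on attempts is the important part: a transient 502 or reset gets a second chance, while a genuinely broken upstream still surfaces as a failed step instead of a hung build.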

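 `hwlab nodes control-plane infra tools-image status|build|logs --node D601 --lane v03` is the controlled entry point for the D601 tools image. The Dockerfile must be declared in `tekton.toolsImage.dockerfileInline` of `config/hwlab-node-control-plane.yaml`, input images must be listed in `publicBaseImages`, and build arguments and the network mode also come from YAML. A confirmed build only builds asynchronously in the background on D601 and pushes to the node-local registry, returning the status/logs polling commands. `hwlab nodes control-plane infra argo status|apply|logs --node D601 --lane v03` is the declarative installation entry point for D601 Argo CD. The Argo version, official manifest URL, image rewrite/preload, field manager, imagePullPolicy, CRD list, expected Deployments/StatefulSets, and the generated AppProject/Application must all come from the same YAML. `argo apply --confirm` performs only repeatable server-side apply and background polling; raw `kubectl apply`, the manual Argo CLI, or ad hoc manifests are not treated as formal installation paths.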