fix: manage D601 k3s pod capacity via YAML

2026-06-19 14:52:36 +00:00
parent bc95f373e0
commit b77b01ed72
4 changed files with 478 additions and 16 deletions
@@ -24,7 +24,7 @@ The bootstrap admin password of G14/D601 v03 is HWLAB runtime Secret lifecycle

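 `hwlab nodes web-probe run|script --node <node> --lane <lane>` is the controlled entry point for online DOM/Playwright acceptance of HWLAB Cloud Web. The CLI resolves the workspace, public URL, and bootstrap admin sourceRef from YAML, and outputs only redacted credential status, artifact path/hash, readiness, `probe.summary`, and failure classification. `run` uses the repo-owned standard DOM probe; `script` does not run the default probe, and the caller's script must be supplied via a stdin heredoc or `--script-file <path>`. When `run --message ...` is invoked without explicit trace parameters, it performs lightweight trace sampling; `script` helpers can use `recordStep` / `safeFetchJson` / `fetchApiMatrix` to preserve structured partial evidence from before a failure, and the full redacted report is expanded through `reportPath`/`reportSha256`. For Web development specifics, fake-server Playwright, fixture redaction, `web-probe script` helpers, screenshots, and Workbench/Performance judgment criteria, see `$unidesk-webdev`; this CLI reference no longer maintains a second set of operating instructions.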

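-`hwlab nodes control-plane infra plan|status|apply --node D601 --lane v03` is the YAML-driven entry point for the D601 HWLAB v03 node-local CI/CD and git-mirror prerequisite control plane; the configuration source of truth is `config/hwlab-node-control-plane.yaml`. `plan` is a read-only display of the YAML target and the control-plane objects that will be rendered; `status` is a read-only observation of D601 Tekton, the CI namespace, git-mirror, Argo, the node-local registry, and tools image readiness; `apply --dry-run` outputs only a manifest summary; `apply --confirm` converges only the D601 control-plane bootstrap objects, and does not trigger an HWLAB runtime rollout, create the PK01 DB, or modify Caddy/FRP. The tools image's node-local registry address may serve only as an output artifact; the input base image must be declared in YAML as coming from a public registry, and a missing output image should surface in `status.next.blockers` rather than the existing node-local image being treated as the input base image.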
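+`hwlab nodes control-plane infra plan|status|apply --node D601 --lane v03` is the YAML-driven entry point for the D601 HWLAB v03 node-local k3s, CI/CD, and git-mirror prerequisite control plane; the configuration source of truth is `config/hwlab-node-control-plane.yaml`. `plan` is a read-only display of the YAML target, a summary of the host k3s node config, and the control-plane objects that will be rendered; `status` is a read-only observation of the k3s systemd drop-in and node `capacity/allocatable.pods`, D601 Tekton, the CI namespace, git-mirror, Argo, the node-local registry, and tools image readiness; `apply --dry-run` outputs only manifest and host config summaries; `apply --confirm` converges the D601 host k3s drop-in and the control-plane bootstrap objects according to YAML, restarts k3s only when the host k3s config or live pod capacity has not converged, and does not trigger an HWLAB runtime rollout, create the PK01 DB, or modify Caddy/FRP. D601 host-side k3s pre-start fixes must also be written as YAML `execStartPre` argv, with no manual systemd hot edits; when the kube API is already unavailable, `apply` can use a host script rendered from the same YAML, run through the node-local tools image/Docker fallback, to restore the systemd drop-in, and the output still contains only object names, SHA, exit code, and a summary. Tunable values such as k3s pod capacity are governed solely by YAML, and this long-term reference does not duplicate the specific values. The tools image's node-local registry address may serve only as an output artifact; the input base image must be declared in YAML as coming from a public registry, and a missing output image should surface in `status.next.blockers` rather than the existing node-local image being treated as the input base image.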

 `hwlab nodes git-mirror status|sync|flush --node <node> --lane <lane>` is the Git mirror maintenance entry point for the node-scoped runtime lane. In `status`, `githubSource` / `githubGitops` come from `refs/mirror-stage/...` in the local mirror cache, not from the live GitHub API; `refSources.githubFieldsAreMirrorStageCache=true` and `refSources.cacheRefresh` in the output give this origin and the refresh command. When the k3s Job for `sync --confirm --wait` hits a GitHub SSH transient, it should pull the GitHub source/gitops through the target workspace fallback and write them back to the node-local mirror; the output discloses only the commit, mirror write URL, and fallback status. If `flush --confirm --wait` has already pushed the GitOps ref to GitHub but the post-push fetch/recheck fails on a transient SSH error and cannot refresh mirror-stage, it marks `partialSuccess=push-succeeded-fetch-failed`; the CLI should automatically run one controlled sync to refresh mirror-stage. If `pendingFlush=false` and `githubInSync=true` after recovery, the result should be `ok=true` with `partialSuccessRecovered` / `postPushRecovery` in the output; only otherwise does it keep `degradedReason=node-runtime-git-mirror-flush-post-push-fetch-failed` and the next step `sync --confirm --wait`. Do not read this partial success as a reason to flush blindly and repeatedly. `hwlab nodes control-plane trigger-current --node <node> --lane <lane> --confirm --wait` automatically runs any necessary pre-flush after source sync and any necessary post-flush after the PipelineRun reaches a terminal state; progress events must explicitly output, for `git-mirror-pre-flush` / `git-mirror-post-flush`, executed/skipped, jobName, local/github source, local/github GitOps, `pendingFlush`, and `githubInSync`, and a recovered partial success must not cause the top-level trigger-current to false-fail. `control-plane status` remains a read-only entry point that exposes only a compact `gitMirror` summary and the next flush command, and performs no implicit writes.
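
The `flush --confirm --wait` outcome rules in the git-mirror paragraph can be read as a small decision table. The sketch below is illustrative only and is not the CLI's actual implementation: `MirrorState`, `classify_flush`, the boolean value of `partialSuccessRecovered`, and the `next` key are hypothetical names and shapes chosen for this example. Only the field names `partialSuccess`, `partialSuccessRecovered`, `degradedReason`, `pendingFlush`, and `githubInSync` and their quoted string values come from the reference text.

```python
from dataclasses import dataclass
from typing import Optional

# String values quoted from the reference text.
PARTIAL_SUCCESS = "push-succeeded-fetch-failed"
DEGRADED_REASON = "node-runtime-git-mirror-flush-post-push-fetch-failed"


@dataclass
class MirrorState:
    """Hypothetical snapshot of mirror state after the one controlled recovery sync."""
    pending_flush: bool
    github_in_sync: bool


def classify_flush(push_ok: bool, fetch_ok: bool,
                   after_sync: Optional[MirrorState]) -> dict:
    """Classify a flush result (hypothetical helper, assumed result shape).

    after_sync is None when the recovery sync could not produce a state.
    """
    if not push_ok:
        # Nothing reached GitHub: a plain failure, not a partial success.
        return {"ok": False}
    if fetch_ok:
        # Push and post-push recheck both succeeded.
        return {"ok": True}

    # Push landed but the post-push fetch/recheck failed.
    result = {"partialSuccess": PARTIAL_SUCCESS}
    recovered = (
        after_sync is not None
        and not after_sync.pending_flush
        and after_sync.github_in_sync
    )
    if recovered:
        # One controlled sync refreshed mirror-stage: report overall success.
        result.update(ok=True, partialSuccessRecovered=True)
    else:
        # Still degraded: point at a sync, not at another flush.
        result.update(ok=False, degradedReason=DEGRADED_REASON,
                      next="sync --confirm --wait")
    return result
```

In this reading, a recovered partial success is indistinguishable from a clean flush at the `ok` level, which is why a caller such as `trigger-current` should not treat it as a failure, and why the degraded branch recommends a sync and never a repeated flush.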