Codex 0920d39c7c feat: retire legacy HWLAB G14 runtime

2026-06-08 17:57:41 +00:00


G14 Provider Node

G14 is the current HWLAB runtime-lane node and a UniDesk provider node for staging other infrastructure workloads. Legacy HWLAB G14 DEV/PROD (G14/G14-gitops, hwlab-dev/hwlab-prod, ports 17666/17667 and 18666/18667, Argo Applications hwlab-g14-dev/hwlab-g14-prod) is retired and must not be used as current source, release, validation, rollback or support truth. G14's UniDesk provider id is G14; the local UniDesk worktree is /root/unidesk, and the native k3s kubeconfig is /etc/rancher/k3s/k3s.yaml.

G14's long-lived k3s control bridge is k3sctl-adapter-g14, a UniDesk direct service outside the k3s fault domain. It listens on the G14 host loopback port 127.0.0.1:4266 and is registered separately from the D601 k3sctl-adapter, so G14 infrastructure services can be built and tested without taking over user services that still run on D601.

For Code Queue and non-HWLAB CI/CD migration preparation, G14 uses native k3s labels unidesk.ai/node-id=G14 and unidesk.ai/provider-id=G14. The G14 Code Queue manifests src/components/microservices/k3sctl-adapter/k3s/code-queue.g14.k8s.yaml and src/components/microservices/k3sctl-adapter/k3s/code-queue.g14.k3s.json are candidate staging artifacts only until an explicit production cutover is approved. Non-HWLAB production Code Queue, CI/CD and user-service execution must remain on D601 while D601 is carrying those services.

Legacy DEV/PROD Retirement

Legacy G14 DEV/PROD is a retired runtime, not a stable baseline. The controlled UniDesk entry is:

bun scripts/cli.ts hwlab g14 retirement status
bun scripts/cli.ts hwlab g14 retirement plan
bun scripts/cli.ts hwlab g14 retirement execute --confirm

status reports the legacy Argo Applications, legacy namespaces, bounded resource previews, protected v0.2/v0.3 Applications and namespaces, and the local marker at .state/hwlab-g14/legacy-g14-retirement.json. plan is dry-run only and lists the exact destructive targets. execute --confirm deletes only argocd/hwlab-g14-dev, argocd/hwlab-g14-prod, namespaces hwlab-dev, hwlab-prod, and active local hwlab_g14_pr_monitor jobs; it must not touch hwlab-g14-v02, hwlab-node-v03, hwlab-v02, or hwlab-v03. The legacy base=G14 PR monitor is blocked by this retirement contract even in a fresh checkout; the marker is execution evidence, not the source of the policy.
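The allowlist and protected-list rule above can be sketched as a small guard. This is an illustration only, not the real CLI code: the function name, the `kind/name` target spelling for namespaces, and the return shape are assumptions, and the local hwlab_g14_pr_monitor job cleanup is left out.

```typescript
// Hypothetical guard mirroring the retirement contract described above.
// The "namespace/<name>" spelling is an assumption made for this sketch.
const LEGACY_TARGETS: readonly string[] = [
  "argocd/hwlab-g14-dev",
  "argocd/hwlab-g14-prod",
  "namespace/hwlab-dev",
  "namespace/hwlab-prod",
];

// Names that execute --confirm must never touch.
const PROTECTED_NAMES: readonly string[] = [
  "hwlab-g14-v02",
  "hwlab-node-v03",
  "hwlab-v02",
  "hwlab-v03",
];

type TargetDecision = { target: string; allowed: boolean; reason: string };

function checkRetirementTarget(target: string): TargetDecision {
  // Strip the "kind/" prefix so the protected check works on the bare name.
  const name = target.slice(target.indexOf("/") + 1);
  if (PROTECTED_NAMES.indexOf(name) !== -1) {
    return { target, allowed: false, reason: "protected v0.2/v0.3 resource" };
  }
  if (LEGACY_TARGETS.indexOf(target) === -1) {
    return { target, allowed: false, reason: "not in the legacy allowlist" };
  }
  return { target, allowed: true, reason: "legacy retirement target" };
}
```

Checking the protected list before the allowlist means a protected name is refused even if it were ever added to the allowlist by mistake.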

The old /root/hwlab workspace on branch G14 is no longer the default source of truth. Use it only for explicit legacy archaeology or a user-authorized retirement follow-up, and only after fast-forwarding it and reading the HWLAB repo rules. Current source, render, CI/CD and validation work must use the target runtime lane workspace, such as G14:/root/hwlab-v02 for v0.2, or the node-scoped lane configuration for v0.3+.

The standard entry forms are:

trans G14:/root/hwlab script -- 'git fetch origin G14 && git pull --ff-only origin G14 && git status --short --branch && git remote -v'
trans G14:/root/hwlab apply-patch < patch.diff
trans G14:k3s kubectl get pods -n hwlab-v02

G14:k3s is the only supported k3s route form. Do not use ssh G14 k3s ...; the first token must locate the distributed target, and the following tokens must be the operation.
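To make the "first token locates the target" rule concrete, here is a minimal sketch of how such a route token could be split. It is not the actual trans parser; the type and function names are hypothetical, and only the two forms used in this document (`<provider>:k3s` and `<provider>:/absolute/path`) are recognised.

```typescript
// Hypothetical route model for illustration; the real trans parser may differ.
type TransRoute =
  | { kind: "k3s"; provider: string }
  | { kind: "workspace"; provider: string; path: string };

function parseTransRoute(token: string): TransRoute {
  const sep = token.indexOf(":");
  if (sep <= 0) {
    // Covers forms like "ssh" where the first token is an operation, not a target.
    throw new Error(`route must look like <provider>:<target>, got "${token}"`);
  }
  const provider = token.slice(0, sep);
  const target = token.slice(sep + 1);
  if (target === "k3s") {
    return { kind: "k3s", provider };
  }
  if (target.charAt(0) === "/") {
    return { kind: "workspace", provider, path: target };
  }
  throw new Error(`unsupported target "${target}" for provider ${provider}`);
}
```

Under this sketch `G14:k3s` and `G14:/root/hwlab-v02` both resolve to provider G14, while a leading `ssh` token is rejected because it does not name a target.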

If /root/hwlab has unrelated local changes when this sync starts, first determine whether they can be quickly merged with the latest origin/G14. Merge them immediately when they are mergeable; do not default to stash, discard or a behind worktree. Only when the changes cannot be automatically merged should they be isolated and the operation stopped for human decision. A behind fixed workspace is not a valid basis for precheck, new worktree creation, render, polling, deployment or runtime validation.

The retired G14 DEV/PROD boundary is:

Legacy public endpoints http://74.48.78.17:17666/, http://74.48.78.17:17667/health/live, http://74.48.78.17:18666/, and http://74.48.78.17:18667/health/live are not current validation targets.
Keep HWLAB Services as ClusterIP unless a repo-owned G14 GitOps rule explicitly exposes them. Public exposure should stay in the approved G14 edge/proxy path, not ad hoc NodePort or local port-forward.
Use runtime-lane local PostgreSQL and runtime-lane Secrets for cloud-api durable runtime tests. Do not copy D601 database credentials.
Use only G14-local Codex auth material and k8s Secrets authorized for HWLAB on G14; do not copy D601 or production auth material by hand.
Set HWLAB_CLOUD_API_PORT=6667 explicitly in the G14 cloud-api Deployment. Kubernetes otherwise injects a HWLAB_CLOUD_API_PORT=tcp://... Service environment variable that breaks the Node port parser.
HWLAB_PUBLIC_ENDPOINT and health/live evidence must describe the active runtime lane endpoint, not retired G14 DEV/PROD or old D601 production endpoints.
Do not run HWLAB repository check, Playwright/browser smoke, image builds or other heavy validation on the master server. Run those through the target G14 runtime lane workspace, G14 k3s/Tekton, or another explicitly approved external execution plane.
Manual device-agent experiments for real hardware must be standalone resources in the active runtime lane namespace and must not patch existing HWLAB Deployments, Services, ArgoCD Applications, FRP, CD desired-state or public frontend routing unless a separate HWLAB change authorizes it.
A D601 Windows hwlab-gateway may connect outbound to the active G14 runtime lane cloud-api as an external host bridge for Keil/serial/workspace access. That bridge does not make D601 the HWLAB runtime truth; it is only a hardware access provider behind the G14 device-agent/cloud-api path.
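The HWLAB_CLOUD_API_PORT rule in the list above comes from Kubernetes injecting Docker-link-style `<SERVICE>_PORT=tcp://<ip>:<port>` variables for Services that exist when a pod starts. The sketch below shows why such a value breaks a numeric port parser and how a strict parser can fail loudly instead. The function name and the example address are hypothetical; the 6667 default is taken from the rule above.

```typescript
// Hypothetical strict parser; the real cloud-api port handling may differ.
function parseCloudApiPort(raw: string | undefined, fallback: number = 6667): number {
  if (raw === undefined || raw.trim() === "") {
    return fallback;
  }
  const value = raw.trim();
  // A Service-injected value such as "tcp://10.43.0.12:6667" fails this check.
  if (!/^\d+$/.test(value)) {
    throw new Error(`HWLAB_CLOUD_API_PORT must be a plain port number, got "${value}"`);
  }
  const port = Number(value);
  if (port < 1 || port > 65535) {
    throw new Error(`HWLAB_CLOUD_API_PORT out of range: ${port}`);
  }
  return port;
}
```

Setting the variable explicitly in the Deployment overrides the injected value, so the strict check is only a second line of defence.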

Healthy G14 HWLAB runtime means the active runtime lane's main Deployments and StatefulSets are Ready, cloud-api and edge-proxy return /health/live with status=ok, durable runtime checks pass, and the public lane endpoints report the expected revision. For a device-agent smoke, health also requires the standalone device-agent Service to answer in-cluster and the D601 Windows gateway session/resource/capability to be visible through the active G14 cloud-api.

HWLAB v0.2 Expansion Line

HWLAB v0.2 is the current supported G14 runtime lane for the v0.2 branch. It must not recreate, depend on, or roll back to the retired legacy G14 DEV/PROD runtime. Legacy G14/G14-gitops, hwlab-dev/hwlab-prod, DEV public ports 17666/17667, and PROD public ports 18666/18667 are not the stability baseline.

The fixed v0.2 source branch is v0.2, forked from the current G14 branch after the G14 long-term reference docs record this decision. The fixed G14 development workspace for that branch is:

trans G14:/root/hwlab-v02 script -- 'git status --short --branch && git remote -v'

/root/hwlab-v02 is the long-lived v0.2 development workspace, not a scratch clone or CI/CD source selector. It must track origin/v0.2 with origin git@github.com:pikasTech/HWLAB.git; local dirty state, stale HEAD, and untracked .worktree/ only affect human development. Do not reuse retired /root/hwlab or /root/hwlab/.worktree/* as the v0.2 fixed workspace.

v0.2 CI/CD source selection is isolated in the dedicated bare repo G14:/root/hwlab-v02-cicd.git. UniDesk control-plane commands must fetch origin/v0.2 into that repo and render from a commit-pinned detached worktree; they must not read the source commit from /root/hwlab-v02 checkout state.

The fixed v0.2 runtime namespace is hwlab-v02. The intended public FRP allocation is:

Cloud Web browser entry: http://74.48.78.17:19666/.
API/edge entry and live health: http://74.48.78.17:19667/health/live.

Master-side FRP server maintenance for HWLAB public ports is documented in docs/reference/hwlab.md#hwlab-frp-维护; keep the detailed allowlist, restart boundary and verification sequence there instead of duplicating another runbook in this G14 node reference.

The v0.2 CI/CD integration owns a dedicated CI/CD source repo, devops-infra git mirror/relay, GitOps desired-state lane, Argo CD Application, namespace resources, artifact catalog, and a deploy/deploy.yaml lane config only when they target v0.2/hwlab-v02 explicitly. Do not add or revive a legacy G14 branch poller, DEV/PROD Argo Applications, DEV/PROD runtime paths, or existing namespace resources to bootstrap v0.2.

For current G14/v0.2, deploy/deploy.yaml is the single human-authored deploy/runtime config source. deploy/deploy.json is not a v0.2 compatibility source and must not be recreated for this lane. YAML and JSON parsing are centralized in the HWLAB repo's format-agnostic config layer: scripts/src/structured-config.mjs handles file format parsing/writing, while scripts/src/deploy-config.mjs owns deploy-config defaults and shape. Renderers, planners, smoke scripts and CLIs should consume that layer through readStructuredFile / writeStructuredFile / readDeployConfig or equivalent helpers; do not scatter direct YAML parser imports or ad hoc readFile + YAML.parse calls. Legacy D601 HWLAB CD still has its own deploy/deploy.json desired-state documented in docs/reference/hwlab.md; that legacy path is not the G14/v0.2 authority.
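As a rough illustration of what "format-agnostic" means here, the sketch below picks a parser from the file extension and keeps the YAML and JSON parsers behind one entry point. It does not reproduce scripts/src/structured-config.mjs: the names `detectStructuredFormat` and `parseStructuredText` and the injected-parser shape are assumptions made so the example stays self-contained.

```typescript
// Hypothetical helper names; the real readStructuredFile signature is not shown in this document.
type StructuredFormat = "yaml" | "json";
type StructuredParsers = Record<StructuredFormat, (text: string) => unknown>;

function detectStructuredFormat(filePath: string): StructuredFormat {
  const lower = filePath.toLowerCase();
  if (/\.(yaml|yml)$/.test(lower)) return "yaml";
  if (/\.json$/.test(lower)) return "json";
  throw new Error(`unsupported structured config extension: ${filePath}`);
}

// Callers never import a YAML library themselves; they go through this one function.
function parseStructuredText(filePath: string, text: string, parsers: StructuredParsers): unknown {
  return parsers[detectStructuredFormat(filePath)](text);
}
```

Keeping the parser choice in one place is what lets renderers and smoke scripts avoid ad hoc readFile + YAML.parse calls.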

On the UniDesk control-plane side, HWLAB G14 runtime lane expansion is sourced from /root/unidesk/config/hwlab-node-lanes.yaml. That YAML owns nodes, lanes, networkProfiles and downloadProfiles for the UniDesk trigger/status/apply layer: lanes.v03.node points at nodes.G14, while proxy URLs, NO_PROXY, git/npm/pip/docker/curl retry/download defaults and Docker build proxy settings live under profile objects. G14 must remain a node config value, not a hardcoded code path such as g14Proxy or g14GitOps; future v0.4+ lanes should be added by adding YAML entries and then consuming the generated spec. hyueapi.com and .hyueapi.com are required NO_PROXY entries and must remain present in effective runtime and Docker build proxy environments.
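The two invariants in this paragraph (a lane resolves to a node through config, and the hyueapi.com entries always survive in NO_PROXY) can be sketched as follows. Only `nodes`, `lanes` and `lanes.<id>.node` come from the description above; the function names and the exact config typing are hypothetical.

```typescript
// Minimal typing assumed for this sketch; the real YAML carries more fields.
type NodeLanesConfig = {
  nodes: Record<string, unknown>;
  lanes: Record<string, { node: string }>;
};

const REQUIRED_NO_PROXY: readonly string[] = ["hyueapi.com", ".hyueapi.com"];

function hasOwn(obj: object, key: string): boolean {
  return Object.prototype.hasOwnProperty.call(obj, key);
}

// Resolve a lane to its node id from config instead of hardcoding "G14" in code.
function resolveLaneNode(config: NodeLanesConfig, laneId: string): string {
  if (!hasOwn(config.lanes, laneId)) {
    throw new Error(`unknown lane: ${laneId}`);
  }
  const nodeId = config.lanes[laneId].node;
  if (!hasOwn(config.nodes, nodeId)) {
    throw new Error(`lane ${laneId} points at unknown node ${nodeId}`);
  }
  return nodeId;
}

// Append any required NO_PROXY entry that is missing; keep existing order.
function ensureRequiredNoProxy(noProxy: string): string {
  const entries = noProxy
    .split(",")
    .map((entry) => entry.trim())
    .filter((entry) => entry.length > 0);
  for (const required of REQUIRED_NO_PROXY) {
    if (entries.indexOf(required) === -1) {
      entries.push(required);
    }
  }
  return entries.join(",");
}
```

Adding a future lane in this model is a config change only: a new `lanes` entry pointing at an existing `nodes` key.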

The devops-infra git mirror/relay remains manual and CLI-controlled, not CronJob-driven. The standard v0.2 delivery trigger is bun scripts/cli.ts hwlab g14 monitor-prs --lane v02: it watches base=v0.2 PRs, waits for GitHub preflight/CI readiness, auto-merges only ready and non-conflicting PRs, then drives the same controlled CD path and comments pending/blocked/succeeded/failed/timeout state back to the PR. The lower-level bun scripts/cli.ts hwlab g14 control-plane trigger-current --lane v02 --confirm remains the manual recovery or diagnosis entry; it must fetch /root/hwlab-v02-cicd.git, resolve the current origin/v0.2 source commit, check the mirror's localV02 ref before creating the PipelineRun, run one bounded manual git-mirror sync Job when the mirror is stale, and only continue after the mirror ref matches the current source commit. Use hwlab g14 git-mirror sync --confirm directly only for explicit mirror maintenance or diagnosis.
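The ordering that trigger-current must follow (resolve the source commit, check the mirror ref, run at most one sync, continue only on a match) can be sketched with injected dependencies. This is a simplified synchronous model, not the real implementation, which would be asynchronous; every name in it is hypothetical.

```typescript
// Hypothetical dependency seams so the control flow can be read in isolation.
type TriggerDeps = {
  resolveSourceCommit: () => string;
  readMirrorRef: () => string;
  runMirrorSyncJob: () => void;
  createPipelineRun: (sourceCommit: string) => string;
};

function triggerCurrentSketch(deps: TriggerDeps): string {
  const sourceCommit = deps.resolveSourceCommit();
  if (deps.readMirrorRef() !== sourceCommit) {
    // Mirror is stale: run exactly one bounded sync, never a retry loop.
    deps.runMirrorSyncJob();
    if (deps.readMirrorRef() !== sourceCommit) {
      throw new Error(`mirror ref still does not match ${sourceCommit} after one sync`);
    }
  }
  // Only reached when the mirror ref equals the current source commit.
  return deps.createPipelineRun(sourceCommit);
}
```

The single-sync bound keeps the trigger from turning into a hidden polling loop; a mirror that stays stale surfaces as an error for the operator.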

After a v0.2 PipelineRun completes, treat runtime rollout and remote GitOps persistence as two separate checks. hwlab g14 control-plane status --lane v02 is the runtime check: it must show the expected source commit, PipelineRun completed, Argo Synced/Healthy, public 19666/19667 probes passing, and Cloud Web asset probes such as /app.js readable. hwlab g14 git-mirror status is the persistence check: cache.summary.pendingFlush must be false and cache.summary.githubInSync true before declaring GitOps fully flushed back to GitHub. The PR monitor performs this flush automatically for its own merged PRs and records the result in the PR comment. Manual operators should run bun scripts/cli.ts hwlab g14 git-mirror flush --confirm and poll the returned job with bun scripts/cli.ts job status <jobId> --tail-bytes 12000 only when they used lower-level manual trigger/status paths or when the monitor reports a flush failure; do not replace this with raw kubectl, native git push, or a long SSH wait.
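The persistence half of that split reduces to two flags. A minimal sketch, assuming the git-mirror status JSON exposes them exactly at cache.summary.pendingFlush and cache.summary.githubInSync as described above; the type and function names are hypothetical.

```typescript
// Only the two documented flags are modelled; other status fields are omitted.
type GitMirrorStatus = {
  cache: { summary: { pendingFlush: boolean; githubInSync: boolean } };
};

// True only when nothing is waiting to be flushed and GitHub matches the mirror.
function isGitOpsFlushed(status: GitMirrorStatus): boolean {
  const { pendingFlush, githubInSync } = status.cache.summary;
  return pendingFlush === false && githubInSync === true;
}
```

A run can pass the runtime check and still fail this one, which is why the two results are reported separately.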

When closing an issue against a specific completed v0.2 PipelineRun, use targeted status instead of the latest-head status if origin/v0.2 has already advanced through a parallel task:

bun scripts/cli.ts hwlab g14 control-plane status --lane v02 --pipeline-run hwlab-v02-ci-poll-<short-sha>
bun scripts/cli.ts hwlab g14 control-plane status --lane v02 --source-commit <full-sha>

Targeted status must expose statusTarget.mode and targetValidation. targetValidation.state=passed means the requested PipelineRun/source commit reached a succeeded PipelineRun, Argo Synced/Healthy, public web/API probes, flushed Git mirror, and matching runtime source commits for the services listed in that run's planArtifacts.rolloutServices; services listed in planArtifacts.reusedServices remain visible as runtime/provenance evidence but must not be forced to the target source commit. targetValidation.state=superseded means the requested PipelineRun succeeded but no longer owns runtime: either it was replaced by a newer succeeded v0.2 PipelineRun, or latest-only promotion observed that origin/v0.2 had advanced before GitOps/runtime writeback and closed the historical run as no-op. This is valid closure evidence for the requested run when the newer commit is on the same branch lineage. In both states, commitAlignment.staleReasons may still mention later origin/v0.2 or CI/CD source head movement; that is parallel-head context, not a failure of the requested run. falseGreenGuard is a current-runtime guard and should report not-applicable/superseded for such historical targets instead of turning later runtime movement into a false failure. Default status without a target remains strict for the latest source head.
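One way to read the two targetValidation states when deciding whether a requested run can close an issue is sketched below. The `newerCommitOnSameLineage` field is an assumption introduced for the example (the document only states the lineage condition in prose), and treating every other state as non-closing is this sketch's reading, not a documented rule.

```typescript
// Hypothetical shape; only targetValidation.state is named in the document.
type TargetValidation = { state: string; newerCommitOnSameLineage?: boolean };
type ClosureVerdict = { closable: boolean; note: string };

function judgeTargetValidation(tv: TargetValidation): ClosureVerdict {
  if (tv.state === "passed") {
    return { closable: true, note: "requested run reached and owns the validated runtime" };
  }
  if (tv.state === "superseded") {
    // Superseded is valid evidence only when the newer commit continues the same branch lineage.
    return tv.newerCommitOnSameLineage === true
      ? { closable: true, note: "requested run succeeded and was replaced on the same lineage" }
      : { closable: false, note: "superseded, but same-branch lineage is not confirmed" };
  }
  return { closable: false, note: `targetValidation.state=${tv.state} is not closure evidence` };
}
```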

For HWLAB user-feedback, CLI, Cloud Web, AgentRun, device-pod, public API, or runtime workflow issues, source-level validation is not enough to close the issue. Unit tests, contract tests, git diff --check, targeted build checks, PR merge metadata, and source commit rollout evidence are supporting evidence only. The issue may be closed only after the affected user entry or original entry has been exercised against the target runtime. For CLI issues, that means running the relevant hwlab-cli or UniDesk-controlled CLI command from the G14 v0.2 workspace or approved execution plane against the intended lane/URL/namespace and proving the observed behavior, not just proving the helper code compiles. For Cloud Web or public API issues, use the public endpoint or a bounded API/asset smoke that reaches the deployed runtime. For AgentRun or device-pod issues, capture the trace/session/thread/run/job/device evidence that proves the specific continuation or hardware workflow reached the live backend.

For Cloud Web Workbench and Code Agent issues, the closeout validation must use the same dispatch entry as the browser flow, or a CLI command that calls that same Cloud Web/Cloud API dispatcher path. A hand-written dispatchHwlabAgentRun() canary, direct AgentRun manager command, or runner job created outside the Web dispatcher is only infrastructure evidence; it cannot prove that the browser path requested the current AgentRun runtime assembly, tool credentials, transient env, conversation/session/thread binding, or runtime lane. The current HWLAB v0.2 resource assembly contract is ResourceBundleRef.kind="gitbundle" with bundles[] materializing repo tools/ and skills/; the authority for those fields is HWLAB's docs/reference/agentrun-code-agent-dispatch.md and AgentRun's docs/reference/spec-v01-runtime-assembly.md. If no CLI can exercise the Web-equivalent path, improve the CLI first and keep the issue open until the Web-equivalent CLI or browser trace proves the deployed behavior.

Provider profile configuration and credential write rules for Code Agent are owned by docs/reference/hwlab.md#code-agent-provider-profile-配置与验收. This G14 reference only defines runtime lane and closeout evidence; do not duplicate profile Secret, config.toml, auth.json or CLI credential semantics here.

For Cloud Web Workbench Code Agent response or trace-rendering bugs, the minimum Web-equivalent CLI proof is a fresh hwlab-cli client agent send --wait against the deployed public Web origin, followed by hwlab-cli client agent trace <traceId> --render web against the same origin. The submit proof must show the browser dispatcher family, normally POST /v1/agent/chat, result polling through /v1/agent/chat/result/<traceId>, continuation.webEquivalent=true, shortConnection=true, and explicit sessionId / conversationId / threadId binding when those values affect the bug. The result proof must show the final assistant text from assistantText or reply.content; placeholder status text, result summaries, terminal status messages, and AgentRun completion boilerplate are not acceptable substitutes for the assistant final response.

For persisted final-response display regressions, a fresh turn alone is not enough when the user report identifies an existing conversation, session, or trace. Re-read the original record on the deployed v0.2 runtime with locked lane env and the correct projectId; the default session list project may differ from the affected Workbench project. The minimum proof is client session list --project-id <projectId> --limit <N> --full, client session inspect <conversationId> --full, and client agent result <traceId> --full. Passing evidence must show that list and inspect surface the same latest agent traceId as lastTraceId, the latest agent text matches the terminal result reply.content or equivalent final assistant text, and known fallback text such as Code Agent 仍在处理，可以继续 steer 或等待 trace 完成。 is absent from list, inspect, and result output. When the repair is lazy-on-read, run the read path again or capture the exposed repair source/updated marker so the evidence proves persisted conversation state was repaired, not merely synthesized for one response. client agent trace <traceId> --render web remains required for trace-rendering bugs; for persisted conversation-display bugs it is supporting evidence unless it returns rendered assistant rows from the same original trace.

The --render web proof must inspect the rendered body, not only the raw event count. Passing evidence should include body.render=web, the shared renderer identity when exposed, status=completed, rendered/returned row counts, noise/omitted counts when available, at least one rendered assistant row containing the final assistant text, and an explicit absence check for known non-user boilerplate such as AgentRun terminal status completed, AgentRun result is ready, and Code Agent 仍在处理. If the trace API returns status=missing, sourceEventCount=0, or no rows for a historical issue trace, treat that trace as expired or unavailable; do not use it as closure evidence. Generate a fresh equivalent turn on the current v0.2 runtime and validate that trace instead.
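The checks in this paragraph can be collected into one validator that returns a list of failures, where an empty list means the proof passes. The `rows`, `role` and `text` field names are assumptions about the rendered body shape; `render`, `status` and `sourceEventCount` follow the wording above.

```typescript
// Hypothetical rendered-trace shape used only for this sketch.
type RenderedRow = { role: string; text: string };
type TraceRenderBody = {
  render: string;
  status: string;
  sourceEventCount?: number;
  rows: RenderedRow[];
};

const KNOWN_BOILERPLATE: readonly string[] = [
  "AgentRun terminal status completed",
  "AgentRun result is ready",
  "Code Agent 仍在处理",
];

function checkRenderWebProof(body: TraceRenderBody, finalAssistantText: string): string[] {
  // Expired or empty traces are never closure evidence.
  if (body.status === "missing" || body.sourceEventCount === 0 || body.rows.length === 0) {
    return ["trace expired or unavailable: validate a fresh equivalent turn instead"];
  }
  const failures: string[] = [];
  if (body.render !== "web") failures.push("body.render is not web");
  if (body.status !== "completed") failures.push("status is not completed");
  const hasFinalText = body.rows.some(
    (row) => row.role === "assistant" && row.text.indexOf(finalAssistantText) !== -1,
  );
  if (!hasFinalText) failures.push("no rendered assistant row contains the final assistant text");
  for (const phrase of KNOWN_BOILERPLATE) {
    if (body.rows.some((row) => row.text.indexOf(phrase) !== -1)) {
      failures.push(`boilerplate present: ${phrase}`);
    }
  }
  return failures;
}
```

Callers should pass a non-empty finalAssistantText; an empty string would match every row.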

CLI/Web-equivalent trace evidence does not replace browser UI evidence for visual, layout, copy-to-clipboard, collapsed-panel or removed-control bugs. Those require a bounded browser or DOM smoke against http://74.48.78.17:19666/ after rollout, with assertions on the deployed page text, DOM state, or control behavior that the user reported. A local bundle smoke can support regression coverage, but the closeout still needs the deployed public endpoint unless the browser entry is unavailable and the issue comment records the blocker. Missing Playwright browser binaries or declared test dependencies are not a valid skip; install the repository-declared runner/browser or use an approved system browser executable and record that choice in the validation evidence.

The closing comment for these issues must be semantic natural language before it lists evidence: state what the user-visible problem was, what changed, where it rolled out, and what original entry was rechecked. It must include the actual command or entry path, target lane or endpoint, relevant trace/session/thread/PipelineRun/run/device ids, and the pass/fail result. If the original entry cannot be verified because rollout has not happened, credentials are unavailable, the target runtime is down, or the required CLI capability is missing, keep the issue open and record the blocker. Do not close the issue on the strength of PR merge, targeted tests, or "will be verified after rollout" wording. If an issue was closed before this real CLI/user-entry validation, reopen it and add a correction comment before continuing.

For HWLAB v0.2 Code Agent context-loss or multi-turn continuity issues, the minimum closeout is a real hwlab-cli client agent two-turn E2E from G14:/root/hwlab-v02 or another approved G14 execution plane with locked runtime namespace/lane env. Submit the first turn, poll its result to completed, submit the second turn with the same explicit conversationId/sessionId/threadId, then capture trace/inspect evidence. Passing evidence must show the second turn used prior-turn context, and should include context attachment or run reuse labels such as conversation-context:attached, agentrun:run:reused, agentrun:runner-job:reused, plus the relevant run/command ids. Long verification evidence belongs in a separate gh issue comment create --body-file comment; lifecycle close comments stay short, as defined in docs/reference/cli.md.

/health/live revision is owned by hwlab-cloud-api; it can legitimately differ from the source commit for a Cloud Web-only change. Do not call that difference a failed Cloud Web rollout when webAssets.checks.htmlOk, webAssets.checks.appJsOk, CSS probes, Argo health, and hwlab-cloud-web Deployment readiness have passed. For Cloud Web behavior changes, the public JS asset probe or a bounded browser/DOM check is stronger evidence than cloud-api apiRevision.

Do not turn v0.2 expansion governance into a stack of broad compatibility gates. The stable control points are branch, dedicated CI/CD source repo, git mirror/relay refs, GitOps branch, namespace, runtime path, Argo Application, FRP ports and generated-output ownership. Legacy DEV/D601/main preflights that block the v0.2 lane should be removed from that lane, not patched with fallback or legacy modes. Naming, RBAC scope, cleanup policy, resource quota and rollback order are design decisions or runbook entries unless they protect a concrete high-value risk that cannot be enforced by the fixed boundaries above.

v0.2 Source Workflow

The generic P2/P3/P4 flow is owned by $dad-dev; this section fixes the G14/v0.2 source route, branch and lane. v0.2 now has two source workflows:

direct-lightweight: CaseRun, case registry aggregation, trace rendering, short-connection CLI/helper, docs/reference, schema-free config reader/writer, and deploy-config cleanup that does not change cloud-api, web, gateway, GitOps, k3s runtime or other long-running services. Use the fixed workspace directly, run the relevant CLI/render/test validation on G14, commit to v0.2, and push origin v0.2. Do not open a PR, trigger CI/CD, rollout, or reintroduce legacy gates for this class.
pr-rollout / service workflow: cloud-api, Cloud Web UI, gateway, AgentRun dispatch integration, device-pod runtime, GitOps/Tekton/Argo, k3s manifests, Secrets/RBAC, public endpoint behavior, or any change that must reach live runtime. Use a task-scoped worktree on a feature branch, merge through a PR, then use the controlled v0.2 CD/status path.

Direct-lightweight precheck:

trans G14:/root/hwlab-v02 script -- 'git fetch origin v0.2 && git pull --ff-only origin v0.2 && git status --short --branch'

Service workflow setup:

trans G14:/root/hwlab-v02 script -- 'git worktree add .worktree/<task> -b fix/issue<N>-<short-name> origin/v0.2'

The fixed repo at /root/hwlab-v02 is not a scratch area for service/runtime work, but it is the direct-lightweight source workspace. When a direct-lightweight task sees parallel dirty state in the fixed repo, inspect and include or separate it according to the current user instruction and project Git rules; never discard it silently. Worktree branches for service workflow should follow the fix/issue<N>-<short-name> naming so PR titles and merge commits stay scannable. GitHub PR writes, merge, rollout trigger and final original-entry validation follow $dad-dev plus the UniDesk CLI control rules in AGENTS.md.

Recovery From an Unapproved Direct Commit To v0.2

Direct-lightweight commits are allowed and do not need recovery. A direct commit on v0.2 only needs recovery when it changed service/runtime/GitOps/CI/CD/public behavior that should have used the PR/rollout workflow. The recovery is bounded and audit-friendly, but it is also a git push --force-with-lease against the protected branch, so it is only acceptable when the unapproved direct commit is the only new content on v0.2 since the last merged PR:

Confirm no parallel worktree was in flight and the commit is the only delta. trans G14:/root/hwlab-v02 script -- 'git log origin/v0.2..HEAD' and git log HEAD..origin/v0.2 must, between them, list exactly one commit, the direct commit, so that it is the single fast-forward candidate and nothing else differs between the fixed repo and origin/v0.2.

Capture the commit identity and patch for the recovery record:

trans G14:/root/hwlab-v02 script -- 'git show <direct-commit-sha> > /tmp/v0.2-recovery.patch'

Roll the fixed repo back to the previous merged PR head. Use git reset --hard <previous-pr-sha>; this preserves any autostash (e.g. from a parallel git checkout snapshot in another worktree) on the stash list and does not touch the other worktree's working tree.
In the pre-existing worktree (e.g. .worktree/<task> on fix/issue<N>-<short-name>) bring the branch up to the previous PR head with trans G14:/root/hwlab-v02/.worktree/<task> script -- 'git reset --hard <previous-pr-sha>', then git cherry-pick <direct-commit-sha> to replay the direct commit on the feature branch. If the worktree branch was already a clean clone of origin/v0.2 at the previous PR head, the reset is a no-op.

Push the feature branch and force-push v0.2 back to the rolled-back head with --force-with-lease (refuses to clobber a concurrent push):

trans G14:/root/hwlab-v02/.worktree/<task> script -- 'git push -u origin fix/issue<N>-<short-name>'
trans G14:/root/hwlab-v02 script -- 'git push --force-with-lease origin v0.2'

Open the PR through UniDesk CLI, squash-merge, then git pull --ff-only origin v0.2 to bring the fixed repo back in sync. The previous PR's merge commit will not be in the new PR's history; the new PR's diff equals the original direct commit's diff, so the PR trail still contains the exact same bytes.
bun scripts/cli.ts hwlab g14 control-plane status --lane v02 will read the new merge commit. The PipelineRun previously staged for the direct commit was created on the old v0.2 head, and trigger-current deletes and recreates it for the post-merge head, so no manual PipelineRun cleanup is required.

The recovery is auditable: the original git show patch and the cherry-pick SHA both land in the PR diff, so the issue/PR trail still contains the exact same bytes that were first committed directly. Recurring unapproved service/runtime direct commits on v0.2 are a workflow regression and must be called out in the relevant issue or PR; direct-lightweight commits are not a regression.

v0.2 Cloud Web Runtime Layout Validation

Cloud Web layout, status-panel, collapsed-control, and modal issues on v0.2 need deployed browser evidence. Source checks and control-plane rollout are supporting evidence; they do not prove that the public 19666 page renders the fixed DOM.

Use these surfaces together:

trans G14:/root/hwlab-v02/.worktree/<task>/web/hwlab-cloud-web script -- 'bun run check' for static unit/contract/layout checks and dist freshness.
bun scripts/cli.ts hwlab g14 control-plane status --lane v02 for runtime, Argo, public endpoint, and GitOps alignment. If origin/v0.2 moved through a parallel PR, use --pipeline-run or --source-commit and treat same-branch supersession as context rather than failure.
Public API probes for both /health/live and /v1/live-builds. /health/live proves live service health/revision, but Cloud Web build time, image tag/digest, source metadata, and actual runtime commit/revision should be read from /v1/live-builds.
A bounded browser/DOM probe against http://74.48.78.17:19666/ that asserts the deployed page state relevant to the issue.

Cloud Web frontend regressions still use the two-layer validation rule. Deterministic client behavior, such as scroll-follow state machines, Markdown/HTML escaping, shared renderer output, persisted view mapping and DOM class/attribute decisions, should be reproduced first in source-level unit or contract tests; those tests may mock DOM nodes, API responses or renderer input because they are the fast regression guard. The deployed browser or Web-equivalent CLI layer must not mock the user entry, and should prove only the live integration that unit tests cannot prove: the public bundle is deployed, the real page dispatch path creates the expected DOM state, and the user-visible control behaves on the target lane. Do not move every frontend bug into CLI/browser smoke just because it is user-facing.

Cloud Web message Markdown must go through a single shared React renderer component. Do not maintain a hand-written Markdown parser or a dangerouslySetInnerHTML message path for normal chat/workbench messages. The shared renderer's fast tests should cover at least GFM table rendering, inline/fenced code, emphasis/strong text and raw HTML escaping. Browser closeout should assert rendered DOM shape, such as table/code/strong counts and absence of injected script nodes or executed script flags, instead of comparing the full rendered HTML string.

For Workbench status/build panels, the minimum DOM proof should check the topbar chip, absence of full status cards in the right sidebar, hidden collapsed lists actually absent from layout, bounded scroll ownership on the right content area, and a details dialog that contains environment image metadata, actual live commit/revision, and source/build-time fields when available.

/v1/live-builds.latest is global across services and can legitimately point at hwlab-cloud-api when API rolled after Web. Inspect the hwlab-cloud-web service row before deciding whether a Web build field is missing or stale.

For #workspace or other scroll-owner fixes, closeout evidence should include numeric scroll metrics before and after the interaction: scrollHeight, clientHeight, scrollTop, distanceFromBottom, computed overflowY, and the page's follow/detached state attribute when one exists. Passing evidence for follow-tail behavior must show that new content keeps the view at bottom while already following, manual upward scroll detaches, and scrolling back to the bottom re-attaches. If the issue is specifically about final assistant response persistence or trace rendering, the browser/CLI proof must wait for the final agent/trace result as described above. If the issue is a frontend-only renderer or scroll-container regression and the same component/path renders user and agent messages, a real #command-input submission that creates a long user message is sufficient to exercise the deployed renderer/scroll path; do not block closure on an unrelated slow external model turn.
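The follow-tail expectations above can be written as a small state function over the same numeric metrics. This is a sketch for reasoning about evidence, not the Cloud Web implementation: the state names, the `userScrolled` flag and the 4-pixel threshold are assumptions.

```typescript
type ScrollMetrics = { scrollHeight: number; clientHeight: number; scrollTop: number };
type FollowState = "following" | "detached";

// Pixels between the bottom of the viewport and the bottom of the content.
function distanceFromBottom(m: ScrollMetrics): number {
  return m.scrollHeight - m.clientHeight - m.scrollTop;
}

// userScrolled marks a manual scroll; content growth alone never changes the state.
function nextFollowState(
  prev: FollowState,
  m: ScrollMetrics,
  userScrolled: boolean,
  threshold: number = 4, // hypothetical tolerance for sub-pixel rounding
): FollowState {
  if (!userScrolled) {
    return prev;
  }
  return distanceFromBottom(m) <= threshold ? "following" : "detached";
}
```

While the state is "following", the page is expected to re-pin scrollTop after new content so that distanceFromBottom returns to roughly zero; that re-pin is the behaviour the before/after metrics should demonstrate.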

Generic layout smoke can be used only when it is bounded in the current transport. A Playwright smoke that runs through trans with no output for the SSH idle timeout, leaves preview/browser processes behind, or never writes an exit/report file is not closure evidence. Run it as an async remote job with explicit report and cleanup, or use a smaller issue-specific DOM probe that emits one JSON result and exits. The stable remote-probe shape is: create a fresh Workbench session through the UI when prior session state may be failed, start the browser script as a target-side job, write a PID/log/result JSON/screenshot on G14, poll those files with short trans queries, and cancel any running live turn through the UI before exit when the probe submitted a real prompt. Missing Playwright-managed browser binaries are not a skip; use an approved system browser executable on G14 or install the declared browser dependency, and record the choice. When staging a Node probe outside the repo workspace, make package resolution explicit by running from the workspace or importing packages through the workspace's node_modules; do not treat MODULE_NOT_FOUND from a /tmp script as an application failure.

v0.2 Cloud Web Button/JS Sync Rule (HWLAB #748)

When a v0.2 Cloud Web fix removes a button from index.html or a field from the el literal in web/hwlab-cloud-web/app.ts, every el.<removed-field>.addEventListener(...) (or .requestSubmit() / .showModal() / etc.) binding must be removed from the matching init* function in the same commit. The static web:check does not catch this orphan listener class because the TypeScript build is Bun.build transpile-only (no tsc --noEmit), and the runtime crash only surfaces as Cannot read properties of undefined (reading 'addEventListener') on first init. The minimal closeout checks for the v0.2 lane are:

# 1. Web assets rebuild and the orphan is gone from the dist
trans G14:/root/hwlab-v02/.worktree/<task> script -- 'cd web/hwlab-cloud-web && bun run build'
trans G14:/root/hwlab-v02/.worktree/<task> script -- "grep -c '<removed-field>' web/hwlab-cloud-web/dist/app.js"   # must be 0
trans G14:/root/hwlab-v02/.worktree/<task> script -- "grep -c 'id=\"<removed-id>\"' web/hwlab-cloud-web/index.html" # must be 0

# 2. Live 19666/19667 confirms the deployed bundle is the new build
curl -fsS http://74.48.78.17:19666/ | grep -c '<removed-id>'                                          # must be 0
curl -fsS http://74.48.78.17:19666/app.js | grep -c '<removed-field>'                                 # must be 0
bun scripts/cli.ts hwlab g14 control-plane status --lane v02                                          # webAssets.checks.appJsOk = true, sourceCommit = merge commit

While the PR is open, the author can also run a one-liner to surface any orphan el.<field>.addEventListener whose field is not declared in the el literal of app.ts:

trans G14:/root/hwlab-v02/.worktree/<task> script -- 'awk "/^const el = /,/^};/" web/hwlab-cloud-web/app.ts | tr ",:" "  " | awk "{print \$1}" | grep -E "^[a-zA-Z]" | sort -u > /tmp/el-fields.txt; grep -nEo "el\\.([A-Za-z_$][A-Za-z0-9_$]*)\\.addEventListener" web/hwlab-cloud-web/*.ts | while read -r m; do field=$(echo "$m" | sed -E "s/.*el\\.([A-Za-z_$][A-Za-z0-9_$]*)\\.addEventListener.*/\\1/"); if ! grep -qxF "$field" /tmp/el-fields.txt; then echo "ORPHAN: el.$field.addEventListener"; fi; done'

Document the explicit grep / curl evidence in the issue closeout comment. Tightening the el literal with proper TypeScript types is tracked separately and must not be done as part of a runtime fix PR.
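The same orphan check can be written as a small script, which is easier to read than the shell one-liner. This is a sketch under the same assumptions the one-liner makes: the literal starts on a line beginning with `const el = ` and ends on a line beginning with `};`, with one field per line. `declaredElFields` and `findOrphanListeners` are hypothetical helper names, not existing repository code.

```typescript
// Collect field names declared in the `el` object literal (one field per line).
function declaredElFields(source: string): Set<string> {
  const fields = new Set<string>();
  let inside = false;
  for (const line of source.split("\n")) {
    if (!inside) {
      if (/^const el = /.test(line)) inside = true;
      continue;
    }
    if (/^\};/.test(line)) break;
    // Matches `name: value,`, shorthand `name,` and a trailing bare `name`.
    const m = line.match(/^\s*([A-Za-z_$][A-Za-z0-9_$]*)\s*(?:[:,]|$)/);
    if (m) fields.add(m[1]);
  }
  return fields;
}

// Report every `el.<field>.addEventListener` whose field is not declared.
function findOrphanListeners(source: string, fields: Set<string>): string[] {
  const pattern = /\bel\.([A-Za-z_$][A-Za-z0-9_$]*)\.addEventListener/g;
  const orphans = new Set<string>();
  let m: RegExpExecArray | null;
  while ((m = pattern.exec(source)) !== null) {
    if (!fields.has(m[1])) orphans.add(m[1]);
  }
  return Array.from(orphans).sort();
}
```

Like the one-liner, this only covers `addEventListener`; bindings such as `.requestSubmit()` or `.showModal()` would need additional patterns. A multi-line field value can register a spurious field name, which can hide an orphan but never invents one.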

Node-Local VPN Proxy

G14 has a node-local VPN/proxy stack for infrastructure bootstrap and recovery downloads:

Primary mixed HTTP/SOCKS proxy: 127.0.0.1:10808.
Backup Hysteria2 HTTP proxy: 127.0.0.1:11809.
Backup Hysteria2 SOCKS5 proxy: 127.0.0.1:11808.
Operator-only local details remain on G14 under /root/docs/vpn-proxy-ops.md; subscription URLs, node credentials and GUI database contents must not be copied into the UniDesk repository.

The G14 host persists this proxy configuration in these local files:

/etc/profile.d/unidesk-g14-proxy.sh exports HTTP_PROXY, HTTPS_PROXY, ALL_PROXY, lowercase aliases and NO_PROXY for new login shells. Set UNIDESK_G14_DISABLE_PROXY=1 before shell startup to opt out.
/root/.npmrc pins npm proxy, https-proxy, noproxy and retry settings for root-side bootstrap commands.
/root/.gitconfig pins root Git HTTP/HTTPS proxy settings.
/root/.docker/config.json pins Docker client proxy settings for commands and build contexts that honor Docker client proxy configuration.
/etc/systemd/system/docker.service.d/proxy.conf pins Docker daemon pull proxy settings. Updating this drop-in requires systemctl daemon-reload and a Docker restart before the active daemon sees the new NO_PROXY; do not restart Docker while G14 provider-gateway, k3s bootstrap or image builds are in flight unless that interruption is intentional.

The NO_PROXY list must include localhost, the main server, private LAN ranges, k3s pod/service CIDRs, Kubernetes service domains and the loopback registry so that k3s, 127.0.0.1:5000, Kubernetes API access and UniDesk control paths do not route through the VPN proxy.
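A quick presence check can catch a NO_PROXY value that dropped a required entry after an edit. The sketch below only compares literal entries. The `REQUIRED_EXAMPLE` list is an assumption for illustration (it uses the k3s default pod and service CIDRs) and is not G14's actual list.

```typescript
// Split a NO_PROXY value into normalized entries.
function parseNoProxy(value: string): Set<string> {
  return new Set(
    value
      .split(",")
      .map((entry) => entry.trim().toLowerCase())
      .filter((entry) => entry.length > 0),
  );
}

// Return the required entries that are not literally present.
function missingNoProxyEntries(noProxy: string, required: string[]): string[] {
  const present = parseNoProxy(noProxy);
  return required.filter((entry) => !present.has(entry.trim().toLowerCase()));
}

// Illustrative only: adjust to the ranges and domains G14 really uses.
const REQUIRED_EXAMPLE = [
  "localhost",
  "127.0.0.1",
  "10.42.0.0/16", // k3s default pod CIDR
  "10.43.0.0/16", // k3s default service CIDR
  ".svc",
  ".cluster.local",
];
```

This verifies presence, not coverage: not every client interprets CIDR entries in NO_PROXY, so a passing check does not prove that a given tool bypasses the proxy for those ranges.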

The primary proxy can be used for G14 target-side image bootstrap when Docker Hub, npm, GitHub or Playwright downloads are unreliable through direct network or provider-gateway WS egress. For Docker build steps that use 127.0.0.1, build with host networking so the build container reaches the host proxy:

docker build --network host \
  --build-arg HTTP_PROXY=http://127.0.0.1:10808 \
  --build-arg HTTPS_PROXY=http://127.0.0.1:10808 \
  --build-arg ALL_PROXY=socks5h://127.0.0.1:10808 \
  --build-arg http_proxy=http://127.0.0.1:10808 \
  --build-arg https_proxy=http://127.0.0.1:10808 \
  --build-arg all_proxy=socks5h://127.0.0.1:10808 \
  ...

127.0.0.1:10808 is a G14 host loopback endpoint. Inside an ordinary k3s Pod, 127.0.0.1 is the Pod network namespace, not the node proxy. Do not set long-lived workload proxy env to http://127.0.0.1:10808 unless that workload is intentionally hostNetwork and the port conflict/DEV-PROD blast radius has been reviewed. Temporary hostNetwork debug Pods may use the node-local proxy only for bounded bootstrap proof or cache prewarm; they must not become GitOps desired state just to make external downloads work.

The backup proxy uses HTTP_PROXY=http://127.0.0.1:11809, HTTPS_PROXY=http://127.0.0.1:11809 and ALL_PROXY=socks5h://127.0.0.1:11808.
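When a bootstrap script has to switch between the two proxies, deriving the variables from one table avoids mixing primary and backup ports. This is a minimal sketch: the endpoints are the ones listed above, while `proxyEnv` and `dockerBuildArgs` are hypothetical helper names.

```typescript
type ProxyProfile = "primary" | "backup";

// Endpoints as documented above; the backup uses separate HTTP and SOCKS5 ports.
const PROXY_ENDPOINTS: Record<ProxyProfile, { http: string; socks: string }> = {
  primary: { http: "http://127.0.0.1:10808", socks: "socks5h://127.0.0.1:10808" },
  backup: { http: "http://127.0.0.1:11809", socks: "socks5h://127.0.0.1:11808" },
};

// Build upper-case variables plus their lower-case aliases.
function proxyEnv(profile: ProxyProfile): Record<string, string> {
  const { http, socks } = PROXY_ENDPOINTS[profile];
  const upper: Array<[string, string]> = [
    ["HTTP_PROXY", http],
    ["HTTPS_PROXY", http],
    ["ALL_PROXY", socks],
  ];
  const env: Record<string, string> = {};
  for (const [name, value] of upper) {
    env[name] = value;
    env[name.toLowerCase()] = value;
  }
  return env;
}

// Render the same variables as `docker build` arguments.
function dockerBuildArgs(profile: ProxyProfile): string[] {
  const args: string[] = [];
  const env = proxyEnv(profile);
  for (const name of Object.keys(env)) {
    args.push("--build-arg", `${name}=${env[name]}`);
  }
  return args;
}
```

The helpers do not set NO_PROXY; that list stays in the host files described above.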

This proxy is not a replacement for UniDesk runtime egress. k3s workloads such as Code Queue must still use the cataloged g14-provider-egress-proxy Kubernetes Service and g14-tcp-egress-gateway for normal runtime access to PostgreSQL, OA Event Flow and external APIs. The node-local VPN proxy is allowed only for G14 host-side bootstrap, image build, cache prewarm or recovery steps, and those steps should record the proxy choice in issue or deployment evidence.

v0.2 device-pod cloud-api architecture

v0.2 device-pod integration is the cloud-api → executor → D601 Windows device-host-cli.mjs chain implemented in internal/cloud/access-control.ts, cmd/hwlab-device-pod/main.ts and the host-side F:\Work\ConStart\tools\device-host-cli.mjs. PR #765 (selector cheat sheet + fail-fast) and PR #778 (output.text propagation + evidence selector + read-only sub-action --reason exemption) are the two anchor PRs; PR #779 tracks the still-open host-side ops work. Earlier work used raw MUTATING_INTENTS.has(intent) && !reason and a single-pass textOr(output.text, …) extractor; both are obsolete and must not be re-introduced.

Intent / sub-action / reason matrix

DEVICE_JOB_INTENTS (cloud-api) enumerates the full supported surface; MUTATING_INTENTS is the strict subset whose default sub-action is mutating. Only workspace.build and debug.download carry a structured sub-action (start / status / output / wait / cancel / evidence) and are listed in DEVICE_JOB_ACTIONABLE_INTENTS; for those two, _deviceJobRequiresReason(intent, args, reason) returns false when reason is provided OR when args.action is in DEVICE_JOB_READ_ONLY_SUB_ACTIONS. Any other mutating intent (workspace.apply-patch, workspace.put, debug.reset, io.uart.write, io.uart.jsonrpc, io.uart.read-after-launch-flash, etc.) still always requires a non-empty reason. Adding a new actionable mutating intent requires extending both MUTATING_INTENTS and DEVICE_JOB_ACTIONABLE_INTENTS together; adding a new read-only sub-action requires only the DEVICE_JOB_READ_ONLY_SUB_ACTIONS set.
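The matrix above can be read as a single predicate. The sketch below is an illustration of that rule, not the code in access-control.ts: the set contents are assumptions assembled from the intents named in this section, and treating status / output / wait / evidence as the read-only sub-actions (with start and cancel requiring a reason) is a guess that the real DEVICE_JOB_READ_ONLY_SUB_ACTIONS set may not match.

```typescript
// Assumed membership, based only on the intents named in this section.
const MUTATING_INTENTS = new Set<string>([
  "workspace.build",
  "debug.download",
  "workspace.apply-patch",
  "workspace.put",
  "debug.reset",
  "io.uart.write",
  "io.uart.jsonrpc",
  "io.uart.read-after-launch-flash",
]);

const DEVICE_JOB_ACTIONABLE_INTENTS = new Set<string>(["workspace.build", "debug.download"]);

// Assumption: which sub-actions count as read-only is not spelled out here.
const DEVICE_JOB_READ_ONLY_SUB_ACTIONS = new Set<string>(["status", "output", "wait", "evidence"]);

function deviceJobRequiresReason(
  intent: string,
  args: { action?: string },
  reason?: string,
): boolean {
  // Non-mutating intents never need a reason.
  if (!MUTATING_INTENTS.has(intent)) return false;
  // A non-empty reason always satisfies the requirement.
  if (typeof reason === "string" && reason.trim() !== "") return false;
  // Only actionable intents get the read-only sub-action exemption.
  if (
    DEVICE_JOB_ACTIONABLE_INTENTS.has(intent) &&
    args.action !== undefined &&
    DEVICE_JOB_READ_ONLY_SUB_ACTIONS.has(args.action)
  ) {
    return false;
  }
  return true;
}
```

Because the exemption is gated on DEVICE_JOB_ACTIONABLE_INTENTS, passing `action=status` to a non-actionable mutating intent such as debug.reset does not bypass the reason requirement.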

The evidence sub-action on workspace.evidence / debug.evidence is a first-class intent, not a workspace.build sub-action. Code Agent sees <pod>:workspace:/ build evidence [jobId] and <pod>:debug-probe download evidence [jobId]; cloud-api maps to a new device-pod executor job, the executor maps to deviceHostArgs = ["workspace", "evidence", kind, ...], and the host-side device-host-cli.mjs dispatches via if (command === "evidence") at the top level of main() (not nested under if (command === "build")). workspace.evidence kind=build → keil-build job; debug.evidence kind=download → keil-download job; the kind sub-arg must be build / download and the optional jobId selects a specific past job.

Output text propagation chain

body.output.text flows through three layers in order; each layer tries more fields and only falls back when earlier sources are empty:

host device-host-cli.mjs returns a JSON envelope that already contains stdout / stderr / summary / logTail / buildSummary for build/download ops; workspace.ls / workspace.cat / workspace.rg are inline and include a JSON body.
executor cmd/hwlab-device-pod/main.ts gatewayDispatchText(result, dispatch) walks result.stdout → result.stderr → dispatch.stdout → dispatch.stderr → result.evidence.{text,logTail,summary} → dispatch.message (only when dispatchStatus === "completed") → dispatch.summary → result.summary → dispatch.buildSummary → result.text → JSON.stringify(result). The executor stores this as job.output and exposes it via boundedOutput() which clips at DEVICE_JOB_OUTPUT_MAX_BYTES (12000) and drops executor / nested output when truncated.
cloud-api executorOutputPayload(body, httpStatus) wraps what the executor sent and exposes body.text / body.output / body.bytes / body.truncation to the /v1/device-pods/{id}/jobs/{jobId}/output endpoint. text is firstString(body?.text, output.text, nestedOutput.text, output.summary, nestedOutput.summary, evidence.text, evidence.logTail, evidence.summary); the matched key is recorded by caller convention. The executor payload stays on the response so callers can read dispatch.exitCode / dispatch.message / dispatch.stdout even when text is empty.

The evidence.* and *summary lookups exist so a dispatcher that already includes host logTail / buildSummary becomes visible without a separate bootsharp re-run on the Code Agent side. The summary lookups also keep error messages (dispatch.message) in the response even when dispatchStatus is not completed; this is the reason body.error.message always has something to show for failed host dispatches.
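The cloud-api fallback order can be illustrated with a small first-non-empty helper. This is a sketch of the lookup order quoted above, not the shipped executorOutputPayload: the `TextSources` shape is hypothetical, and treating empty strings as absent and returning "" when nothing matches are assumptions.

```typescript
// Return the first candidate that is a non-empty string, or "" if none is.
function firstString(...candidates: unknown[]): string {
  for (const candidate of candidates) {
    if (typeof candidate === "string" && candidate.length > 0) return candidate;
  }
  return "";
}

// Hypothetical, already-extracted pieces of the executor response.
interface TextSources {
  bodyText?: string;
  output?: { text?: string; summary?: string };
  nestedOutput?: { text?: string; summary?: string };
  evidence?: { text?: string; logTail?: string; summary?: string };
}

// Same order as the firstString(...) call described for cloud-api.
function resolveOutputText(sources: TextSources): string {
  const { bodyText, output = {}, nestedOutput = {}, evidence = {} } = sources;
  return firstString(
    bodyText,
    output.text,
    nestedOutput.text,
    output.summary,
    nestedOutput.summary,
    evidence.text,
    evidence.logTail,
    evidence.summary,
  );
}
```

With this ordering a host logTail only becomes the visible text when every text and summary field ahead of it is empty, which is why a stale or minimal dispatcher response still yields something readable.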

Cloud-api vs host-side boundary

/root/hwlab-v02/skills/device-pod-cli/assets/device-host-cli.mjs is the v0.2-shipped copy of the host-side CLI. The actual hardware host runs a separate F:\Work\ConStart\tools\device-host-cli.mjs that is not a deployment of the v0.2 repo; it is a D601 ops-side copy that must be synced manually when the v0.2 repo changes host-side behavior. The two-step contract is:

v0.2 cloud-api / executor changes are valid once PipelineRun Succeeded + git mirror flush complete; runtime revision is commit.id from /health/live and source commit can be forced to match via the next trigger-current.
v0.2 host-side device-host-cli.mjs changes are NOT visible until someone replaces F:\Work\ConStart\tools\device-host-cli.mjs on the D601 Windows host; cloud-api body.text will faithfully surface the "unsupported command" JSON error from the stale host binary, which proves the cloud-api propagation chain works but the host side is stale.

A live workspace.evidence / debug.evidence / download evidence selector that returns the host logTail end-to-end therefore requires both (a) the v0.2 PR merged and rolled, and (b) the D601 host binary replaced; missing either half is a known gap tracked in #779.

v0.2 device-pod closeout checks

Device-pod fixes still follow $dad-dev and the service/runtime side of the v0.2 Source Workflow route above. The device-pod-specific closeout is the three-layer runtime matrix below; keep these checks because they prove the cloud-api -> executor -> D601 host chain, while generic PR/CI/CD and worktree mechanics stay in $dad-dev.

trans G14:/root/hwlab-v02/.worktree/<task> script -- 'cd tools && bun test device-pod-cli.test.ts'
trans G14:/root/hwlab-v02/.worktree/<task> script -- 'cd cmd/hwlab-device-pod && bun test main.test.ts'
trans G14:/root/hwlab-v02/.worktree/<task> script -- 'cd internal/cloud && bun test access-control.test.ts'
trans G14:/root/hwlab-v02/.worktree/<task> script -- 'node --check skills/device-pod-cli/assets/device-host-cli.mjs'

Treat access-control.test.ts workbench failures as pre-existing on the v0.2 base unless the new test list explicitly covers them. After PR merge and trigger-current --lane v02 --confirm, the live http://74.48.78.17:19667/ CLI acceptance check must hit all three layers:

body.output.text non-empty for at least one happy-path intent (workspace.ls / workspace.cat are the cheapest ones to verify propagation without needing a real D601 build).
workspace.evidence kind=build / kind=download accepted by cloud-api, dispatched to executor, executor blocker === null and job.reason === "".
<mutating intent> action=status accepted without --reason while the same intent with action=start is still rejected with device_job_reason_required.

There is no separate device-pod doc; this section is the single authoritative reference for the architecture, and the AGENTS.md index points to it.
