fix: restore web-probe severe timeout threshold

Also records instruction hygiene, YAML-first config split guidance, and Sub2API D601 recovery notes from the recovered worktree state.
2026-06-26 09:34:04 +00:00
parent ab71273867
commit 2a8f279575
11 changed files with 120 additions and 11 deletions
@@ -12,6 +12,7 @@ GitHub issue/PR 正式读写必须走 `bun scripts/cli.ts gh ...` 或 `trans gh:
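 - Issue/PR bodies, comments, close/reopen notes, PR descriptions and merge closeouts default to Chinese.
 - A new issue body must include `目标合并分支: <repo branch/lane>` (target merge branch); when no merge is needed, write `目标合并分支: 不适用` (not applicable).
 - Large plans, follow-up phases and independent improvement directions get a new issue; comments on an existing issue carry only short progress, evidence, blockers and links.
+- Planning-type, multi-phase and architecture/API/platform-ops issues must start with `P0 SPEC 先行` (P0 SPEC first) as phase one; see the `多阶段 Issue 与 SPEC-First` section of [references/full.md](references/full.md) for details.
 - Multi-line bodies use a quoted heredoc: `--body-stdin <<'EOF'`; do not stuff long Markdown into shell arguments.
 - PR merges go only through the guarded `gh pr merge`; the Next step printed by `gh pr create` defaults to `--merge --delete-branch`, and `--squash` is passed explicitly only after confirming that the ancestry can be discarded.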

@@ -42,6 +42,13 @@ CLI 自检使用 `bun scripts/cli.ts check --syntax-only`、针对被改模块
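 - If the investigation uncovers an independent improvement direction, first create a new issue with `gh issue create --body-stdin`, stating the target merge branch/lane, background, plan and acceptance criteria in the title and body; then add a 1-3 sentence comment on the original issue saying the work was split out, with a link to the new issue.
 - Update the current issue body only when the user explicitly asks for the plan to be written back into it, or when the current issue is itself the sole dedicated plan issue; even then, keep comments short and do not copy the whole plan.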

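+## Multi-Phase Issues and SPEC-First (多阶段 Issue 与 SPEC-First)
+
+- When a planning issue covers multi-phase implementation, cross-module architecture, a new capability, a long-lived API/data model, a platform-ops capability or a user-visible workflow, its first phase must be `P0 SPEC 先行` (P0 SPEC first), handled under the SPEC management model of `$unidesk-oa`.
+- `P0 SPEC 先行` must list in the issue body: the SPEC number, the SPEC document path, the parent spec, related specs, the implementation reference version, the completion items for the target architecture diagram, data-flow diagram and key sequence diagrams, and the source-file header annotation rule `SPEC: <编号> <短名> <实现引用版本>` (number, short name, implementation reference version).
+- The issue body may carry only the execution plan, phase status and an evidence index; it cannot replace the long-lived SPEC text under `project-management/PJ2026-01/specs/`. If stable requirements, data flows, interfaces or acceptance criteria change, update the SPEC first, then the issue's phase plan.
+- Until P0 is complete, code implementation, deployment, CI/CD, test additions and acceptance closeout must not be listed as executable phases; they may appear only as later P1+ phases.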
+
 ---

 ## Authentication Probing
@@ -24,10 +24,11 @@ bun scripts/cli.ts platform-infra sub2api apply --target D601 --dry-run
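 - Secrets are reported only as object name, key name, presence, fingerprint or redacted prefix; printing a full token/key is forbidden.
 - D601 is the default active target; targets such as D518/G14 follow the YAML and the target stated in the issue.
 - Codex pool, the unified API key, master `~/.codex` configuration, FRP/Caddy exposure and account add/remove must all go through this skill's controlled CLI.
+- For a D601 public 502 or an `api.pikapython.com` anomaly, first distinguish the edge from the app endpoint, then recover layer by layer with `status`, `validate`, `apply --confirm` and `codex-pool validate`; the full steps are in the troubleshooting (排障) section of [references/full.md](references/full.md).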

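 ## When to Read the Reference

 - Adding/removing upstreams, protected account proxies, group bindings: read the account management section of [references/full.md](references/full.md).
 - Deployment/status/image upgrade/FRP exposure: read the deployment, image and FRP sections.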
- master Codex 消费端、`/v1/models`、Codex pool 验收：读 Codex Pool 和验收口径段。
+- master Codex消费端、`/v1/models`、Codex pool 验收：读 Codex Pool 和验收口径段。
 - When troubleshooting or unsure about prohibited actions, read the troubleshooting and prohibited-actions sections.
@@ -246,6 +246,8 @@ bun scripts/cli.ts platform-infra sub2api codex-pool configure-local --confirm

 ## Troubleshooting (排障)

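+- When `api.pikapython.com` or the D601 public exposure returns 502, first determine whether the fault is at the edge or the app endpoint: run `sub2api status --target D601` and `sub2api validate --target D601`. If `sub2api`, `sub2api-frpc`, `sub2api-redis` or `sub2api-egress-proxy` shows `0/1`, or validate reports `no endpoints available for service "sub2api"` or a terminated app Pod, first re-converge the YAML resources with `bun scripts/cli.ts platform-infra sub2api apply --target D601 --confirm`, poll using the returned `job status`, then run `status`, `validate` and `codex-pool validate --target D601` again. Do not start by changing the account pool, sentinel state, Secrets or Caddy.
+- After the D601 fast recovery completes, close out with layered evidence: `https://api.pikapython.com/health` should return 200; `codex-pool validate --target D601` should prove that the internal `GET /v1/models` and a minimal `POST /v1/responses` smoke succeed; if the public OpenAI-compatible API must be proven, use `trans D601:k3s sh` to read `platform-infra/sub2api-codex-pool-api-key.API_KEY` into a temporary shell variable only, request the public `/v1/models` and a minimal `/v1/responses` marker, and output only the HTTP status, model count and marker without printing the key. Do not run `configure-local --confirm` for public verification, because it rewrites the local `~/.codex`; a 401 from the local default `auth.json` key only shows that the local configuration and the public unified key differ, and is not evidence that the service is unavailable.
 - When the Codex pool sentinel, account freeze/restore, marker-only judgments or the probe cycle are unclear: first run `bun scripts/cli.ts platform-infra sub2api codex-pool sentinel-report`. This report is the primary observation surface; go on to `--raw`, CronJob logs, the state ConfigMap or the Sub2API admin UI only when the report lacks a field or lower-level evidence is needed. If you see “临时不可调度状态” (temporarily unschedulable state) together with a rule index/matched keyword, check the Sub2API `account_temp_unschedulable` log and the account's `temp_unschedulable_*` fields; the sentinel explains only active quarantine with `schedulable=false`, not this built-in temporary cooldown.
 - To strengthen monitoring without letting the sentinel freeze accounts automatically, set YAML `sentinel.actions.enabled=false` and then run `codex-pool sync --confirm`. The marker probe and gateway failure monitor still record `would-freeze` / observe-only evidence, but no longer write `schedulable=false` through the Sub2API admin; `codex.remote_compact.failed` on `/responses/compact` and compact upstream 5xx failover are recorded only as `gateway-compact-*` observation events, not as triggers for automatic sentinel switching.
 - When a single request id reports 502/503/interruption/no automatic account switch: first run `bun scripts/cli.ts platform-infra sub2api codex-pool trace --request-id <requestId>`. Look at `outcome`, `reason`, `FAILOVER`, `SELECT-FAILED`, `ACCOUNT SIGNALS` and `WINDOW STATS` first; add `--show-lines` or `--raw` only when the trace report lacks a field or the raw logs need auditing. `reason=failover-attempted-no-candidate` means the switch was attempted but the scheduler had no usable candidate after excluding the failed account; continue with `sentinel-report` and `validate --full` to distinguish sentinel quarantine, request-path temp-unschedulable, account status or capacity exhaustion.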
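The first troubleshooting bullet in this hunk (public 502 on D601) describes a decision: re-converge with `apply --confirm` first when a listed workload is below readiness or validate reports no endpoints. A minimal TypeScript sketch of that decision follows. The `WorkloadStatus` shape and the `shouldReconvergeFirst` name are hypothetical; the real `status` and `validate` output formats are not shown in this diff.

```typescript
// Hypothetical, simplified view of one `status` row; not the real CLI output shape.
interface WorkloadStatus {
  readonly name: string;
  readonly ready: number;
  readonly desired: number;
}

// Workload names as listed in the troubleshooting bullet.
const SUB2API_WORKLOADS = new Set<string>([
  "sub2api",
  "sub2api-frpc",
  "sub2api-redis",
  "sub2api-egress-proxy",
]);

// True when the evidence points at the workloads themselves, so the YAML
// resources should be re-converged before touching accounts, Secrets or Caddy.
function shouldReconvergeFirst(workloads: readonly WorkloadStatus[], validateOutput: string): boolean {
  const degraded = workloads.some(
    (workload) => SUB2API_WORKLOADS.has(workload.name) && workload.ready < workload.desired,
  );
  const noEndpoints = validateOutput.includes('no endpoints available for service "sub2api"');
  return degraded || noEndpoints;
}
```

The function only classifies evidence; running `apply`, polling the job and re-validating remain operator steps through the controlled CLI.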
@@ -1,6 +1,6 @@
 ---
 name: unidesk-ymalops
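-description: UniDesk YAML-first ops normalization skill. Use when the user mentions ymal-first/YAML-first normalization, YAML ops, ops configuration responsibilities, platform-infra config refactoring, Secret sourceRef, publicExposure, target/lane/node, ops helper extraction, removing hardcoded defaults/special cases, or the historical convergence issues pikasTech/unidesk#390/#398/#401.
+description: UniDesk YAML-first ops normalization skill. Use when the user mentions ymal-first/YAML-first normalization, YAML ops, splitting ops configuration responsibilities, configRef/path references, platform-infra config refactoring, Secret sourceRef, publicExposure, target/lane/node, ops helper extraction, removing hardcoded defaults/special cases, or the historical convergence issues pikasTech/unidesk#390/#398/#401.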
 ---

 # UniDesk YAML Ops
@@ -26,6 +26,9 @@ description: UniDesk YAML-first 运维正规化技能。用户提到 ymal-first/
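 - Source, configuration and deployment normalization work is done in a separate `.worktree/<task>` by default; lightweight skill/docs/reference convergence may be done directly in the main worktree per project rules.
 - YAML is the source of truth. Do not add hidden code defaults, hard numeric limits in schemas, contract tests or test-hardcoded policy.
 - Code validation only ensures that fields can be read and rendered correctly: types, required fields, enum key names and reference existence. Values such as version numbers, namespaces, endpoints, capacity, cooldown times and rollback windows are governed by the YAML.
+- Avoid a "super config". When one capability spans distinct responsibilities such as target/lane, runtime, scenario, prompt, report, publicExposure, Secret and CI/CD, split it by responsibility into the owning YAML files; the root YAML keeps only ownership and `configRefs`/path references and does not carry all the detail.
+- Cross-YAML references should use a stable `path/to/file.yaml#object.path` form or an equivalent syntax explicitly supported by the current domain parser. The parser only resolves references and validates existence, type, shape and conflicts; it does not generate hidden defaults, and it does not write the merged object back as a new source of truth.
+- CLI `plan/status` should print a redacted configuration reference graph: for each ref, its file, path, presence, digest hash, missing fields and the next drill-down command. Do not dump the fully expanded YAML or Secrets by default.
 - Secrets may be declared only through YAML `sourceRef`/`targetKey` and delivered by the controlled CLI; reverse-deriving, decoding or backfilling local credentials from runtime Secrets, pod env, logs or database state is forbidden.
 - Controlled CLI output may disclose only object names, key names, sourceRef, targetKey, missing items, fingerprints, byte counts and an execution summary; it must not print base64 payloads, decoded values, full DSNs, API keys or copy-pastable credentials.
 - Do not build a new global orchestrator. Prefer keeping the domain CLIs, extract shared capabilities into ops helpers, and let each domain CLI express only domain actions.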
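The `path/to/file.yaml#object.path` reference form added in this hunk can be illustrated with a small TypeScript sketch. `ConfigRef`, `parseConfigRef` and `resolveConfigRef` are hypothetical names, and the sketch assumes the YAML document has already been loaded into a plain object; the actual domain parser may support a different or richer syntax.

```typescript
// Hypothetical shape for a parsed cross-YAML reference; not taken from the repository.
interface ConfigRef {
  readonly file: string;
  readonly objectPath: readonly string[];
}

// Split "<file>.yaml#<object.path>" into its parts and fail loudly on a bad shape,
// rather than guessing a default.
function parseConfigRef(ref: string): ConfigRef {
  const hash = ref.indexOf("#");
  const file = hash > 0 ? ref.slice(0, hash) : "";
  const objectPath = hash > 0 ? ref.slice(hash + 1).split(".") : [];
  const wellFormed = file.endsWith(".yaml") && objectPath.length > 0 && objectPath.every((segment) => segment.length > 0);
  if (!wellFormed) {
    throw new Error(`invalid configRef "${ref}": expected <file>.yaml#<object.path>`);
  }
  return { file, objectPath };
}

// Walk an already-loaded document and report the exact missing path.
// No default value is ever substituted for a missing segment.
function resolveConfigRef(ref: ConfigRef, document: unknown): unknown {
  let current: unknown = document;
  const walked: string[] = [];
  for (const segment of ref.objectPath) {
    walked.push(segment);
    if (typeof current !== "object" || current === null || !(segment in current)) {
      throw new Error(`${ref.file}#${walked.join(".")} is missing`);
    }
    current = (current as Record<string, unknown>)[segment];
  }
  return current;
}
```

Reporting the walked path in the error lets a CLI name the missing field without dumping the surrounding YAML.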
@@ -36,13 +39,14 @@ description: UniDesk YAML-first 运维正规化技能。用户提到 ymal-first/

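 1. Inventory the target surface: list the YAML files, CLI entry points, helpers, Secret bindings, runtime objects and existing hardcodes involved.
 2. Confirm ownership: every fact must have exactly one owning YAML; code, runtime Secrets, pod env and docs must not become the configuration truth in reverse.
-3. Normalize the YAML envelope: for self-owned ops configuration, fill in the necessary structure such as `version`, `kind`, `metadata`, `defaults` and `targets/services/lane` first, but do not force a wrapper onto external tool formats.
-4. Migrate values: move tunables such as namespace, serviceName, secretName, endpoint, image/tag, node/lane, probe, NO_PROXY, capacity and rollback window from code into YAML.
-5. Slim the parser: the parser performs only structural and type validation, hides no business policy and provides no long-lived defaults. Missing items should make the CLI report the YAML path and field name.
-6. Extract shared ops primitives: before adding a new service branch, reuse or extend the shared helpers first.
-7. Keep domain CLIs thin: entry points such as `platform-infra`, `server`, `gc`, `agentrun` and `hwlab` only compose YAML, helpers and execution actions, and do not duplicate low-level Kubernetes/FRP/Caddy/Secret logic.
-8. Verify the original entry point: CLI/config changes by default run only syntax, help/command shape, plan/dry-run or the corresponding sync/validate; closeout that touches the real runtime must run the original CLI entry point, without adding contract tests.
-9. Bounded closeout: once an issue has frozen its phases, finishing the current phase updates only the parent issue's progress and the next fixed phase; close the umbrella issue when all fixed phases are done, and do not turn candidate scan results into a new Round.
+3. Split configuration responsibilities: move facts with different lifecycles or different owners into their own owning YAML files, linked through root `configRefs`/path references; only fields with the same owner, lifecycle and command model belong in the same object.
+4. Normalize the YAML envelope: for self-owned ops configuration, fill in the necessary structure such as `version`, `kind`, `metadata`, `defaults` and `targets/services/lane` first, but do not force a wrapper onto external tool formats.
+5. Migrate values: move tunables such as namespace, serviceName, secretName, endpoint, image/tag, node/lane, probe, NO_PROXY, capacity and rollback window from code into YAML.
+6. Slim the parser: the parser performs only structural and type validation, hides no business policy and provides no long-lived defaults. Missing items should make the CLI report the YAML path and field name; a duplicate declaration of the same fact or a reference conflict should fail and name the conflicting path.
+7. Extract shared ops primitives: before adding a new service branch, reuse or extend the shared helpers first.
+8. Keep domain CLIs thin: entry points such as `platform-infra`, `server`, `gc`, `agentrun` and `hwlab` only compose YAML, helpers and execution actions, and do not duplicate low-level Kubernetes/FRP/Caddy/Secret logic.
+9. Verify the original entry point: CLI/config changes by default run only syntax, help/command shape, plan/dry-run or the corresponding sync/validate; closeout that touches the real runtime must run the original CLI entry point, without adding contract tests.
+10. Bounded closeout: once an issue has frozen its phases, finishing the current phase updates only the parent issue's progress and the next fixed phase; close the umbrella issue when all fixed phases are done, and do not turn candidate scan results into a new Round.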
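The parser step in the list above now asks for a failure that names the conflicting path when the same fact is declared twice. One possible way to detect that is sketched below in TypeScript. `FactDeclaration` and `findOwnershipConflicts` are hypothetical names, and the fact keys and YAML paths in the usage are invented for illustration.

```typescript
// Hypothetical record of which YAML path claims to own which fact.
interface FactDeclaration {
  readonly fact: string;     // logical fact name, e.g. a service namespace
  readonly yamlPath: string; // where it is declared, in file.yaml#object.path form
}

// Return one message per fact that has more than one declaring path.
// An empty result means every fact has exactly one owner.
function findOwnershipConflicts(declarations: readonly FactDeclaration[]): string[] {
  const owners = new Map<string, string[]>();
  for (const { fact, yamlPath } of declarations) {
    const paths = owners.get(fact) ?? [];
    paths.push(yamlPath);
    owners.set(fact, paths);
  }
  const conflicts: string[] = [];
  owners.forEach((paths, fact) => {
    if (paths.length > 1) {
      conflicts.push(`${fact} is declared more than once: ${paths.join(", ")}`);
    }
  });
  return conflicts;
}
```

A caller would treat a non-empty result as a hard failure and print the messages, instead of picking a winner or merging the declarations.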

 ## Common Refactor Targets

@@ -185,6 +185,7 @@ lanes:
            longLivedStreamOpenSlowMs: 10000
            visibleLoadingSlowMs: 10000
            turnTimingSampleSlackSeconds: 3
+            turnElapsedSevereTimeoutSeconds: 120
            uncommandedStateChangeCommandWindowMs: 10000
            scrollJumpCommandWindowMs: 8000
            scrollJumpFromY: 250
@@ -0,0 +1,83 @@
+# Agent Instruction Hygiene
+
+This document is the long-term reference for keeping always-loaded agent instruction files small, navigable and stable. It applies to local and remote `AGENTS.md`, `CLAUDE.md`-style aliases and any repo-level instruction file that is automatically injected into an agent context.
+
+## Size Budget
+
+`AGENTS.md` is an index, not a knowledge base. The hard size budget for any local or remote `AGENTS.md` is 10 KiB, measured as bytes with `wc -c AGENTS.md`.
+
+When an `AGENTS.md` is already over 10 KiB, do not append more detailed rules to it. Split first, then add only a one-line index entry back to `AGENTS.md`.
+
+When editing an `AGENTS.md` would push it over 10 KiB, route the new content to a skill or a `docs/reference/*.md` document and keep `AGENTS.md` as a short pointer.
+
+If loading or printing `AGENTS.md` triggers a CLI output dump or a context blow-up, treat that as both a visibility bug and an instruction-hygiene bug. The fix is to split the file, not to increase output limits or ask agents to read around the dump.
+
+## What Belongs In AGENTS.md
+
+Keep only always-needed routing information in `AGENTS.md`:
+
+- Project identity and source-of-truth boundaries.
+- P0 one-line rules that prevent immediate damage.
+- Links to the authoritative long-term reference document for each domain.
+- Skill names that must be loaded for common workflows.
+- Short warnings about secrets, destructive commands, target workspaces and build bans.
+
+Do not put long examples, command transcripts, JSON output, issue timelines, architecture essays, provider-specific debugging logs or one-off incident analysis in `AGENTS.md`.
+
+## Where Overflow Content Goes
+
+Use this routing order when splitting content out of `AGENTS.md`:
+
+- Reusable workflow behavior belongs in a skill `SKILL.md`, for example `$dad-dev`, `$unidesk-cicd`, `$unidesk-gh`, `$unidesk-trans`, `$unidesk-otel`, `$unidesk-webdev` or `$unidesk-ymalops`.
+- Stable project constraints, workspace rules, architecture boundaries and validation criteria belong in `docs/reference/*.md`.
+- CLI shape, output style, route syntax and operator ergonomics belong in `docs/reference/cli.md` unless a narrower reference already owns them.
+- Deployment hygiene, fixed repo boundaries and source-of-truth rules belong in `docs/reference/devops-hygiene.md`.
+- Node/lane-specific HWLAB rules belong in `docs/reference/hwlab.md` and the target repo's own reference docs.
+- AgentRun source-truth and deployment-lane rules belong in `docs/reference/agentrun.md`.
+- Platform-infra and YAML-first operations belong in `docs/reference/platform-infra.md` and `docs/reference/yaml-first-ops.md`.
+- Process notes, temporary findings and dated investigation logs belong in GitHub issues, PR comments or process notes; they must be distilled before entering long-term reference.
+
+If a rule is both reusable across projects and specific to UniDesk's current directories or services, put the reusable workflow in the skill and put UniDesk-specific paths, lane names and validation boundaries in `docs/reference/*.md`, then cross-reference both.
+
+## Split Procedure
+
+When an agent sees a local or remote `AGENTS.md` over 10 KiB:
+
+1. Identify the detailed section being changed or expanded.
+2. Move the detailed content to the owning skill or `docs/reference/*.md` document.
+3. Replace the original section with one concise bullet and a link to the authoritative location.
+4. Preserve P0 damage-prevention warnings in `AGENTS.md`, but compress them to one-line routing rules.
+5. Do not create a single giant overflow archive as the normal solution. A temporary migration note is acceptable only if it immediately points to the domain documents that must absorb it.
+6. Do not add tests, guards or preflight checks just to enforce the size budget unless the user explicitly asks. The default control is documentation hygiene plus concise review.
+
+For large legacy files, split incrementally by domain. Each new edit should leave the touched domain smaller and better referenced than before.
+
+## Cross-Reference Requirements
+
+Every `AGENTS.md` index entry that points out of the file must name the authoritative target. Prefer direct paths such as `docs/reference/hwlab.md` or skill names such as `$unidesk-cicd`.
+
+Avoid duplicated full rules between `AGENTS.md`, skills and long-term reference docs. `AGENTS.md` may summarize; the reference owns the detail. If two references conflict, update the narrower domain reference and keep only one authoritative version.
+
+## Secrets And Output Hygiene
+
+Instruction files must not contain secrets, full API keys, full DSNs, base64 payloads, bearer tokens, SSH private keys or copy-pastable credentials.
+
+Do not paste large CLI output, OTel trace dumps, JSON arrays or browser transcripts into `AGENTS.md`. If a large output demonstrates a durable rule, summarize the rule and link to the issue or reference that owns the conclusion.
+
+## Current UniDesk Routing Map
+
+The current top-level routing map is:
+
+- CLI behavior and output: `docs/reference/cli.md`.
+- YAML-first configuration: `docs/reference/yaml-first-ops.md` and `$unidesk-ymalops`.
+- Platform infrastructure: `docs/reference/platform-infra.md` and `$unidesk-sub2api` when Sub2API is involved.
+- Distributed field repair: `$dad-dev` plus `docs/reference/devops-hygiene.md`.
+- CI/CD and rollout: `$unidesk-cicd` plus `docs/reference/cli.md`.
+- GitHub issue and PR writes: `$unidesk-gh`.
+- Trans/remote patch transport: `$unidesk-trans` plus `docs/reference/cli.md`.
+- Web UI, Workbench and web-probe: `$unidesk-webdev`.
+- OpenTelemetry and Tempo: `$unidesk-otel` plus `docs/reference/observability.md`.
+- HWLAB node/lane operation: `docs/reference/hwlab.md`.
+- AgentRun: `docs/reference/agentrun.md`.
+- Master/D601 development environment: `docs/reference/dev-environment.md`.
+- Secretary work: `docs/reference/secretary-reference.md`.
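The 10 KiB budget in the Size Budget section of the new document is defined in bytes, as `wc -c` reports them, so non-ASCII text counts for more than one byte per character. The TypeScript sketch below shows that measurement for ad-hoc use during review. It is not meant as an enforced guard, since the Split Procedure discourages adding one by default; the function names are hypothetical.

```typescript
const AGENTS_MD_BUDGET_BYTES = 10 * 1024; // 10 KiB, as stated in the Size Budget section

// Count UTF-8 bytes, which matches `wc -c` for a UTF-8 encoded file.
function utf8ByteLength(text: string): number {
  return new TextEncoder().encode(text).length;
}

// Strictly over the budget; a file of exactly 10240 bytes is still within it.
function isOverBudget(text: string): boolean {
  return utf8ByteLength(text) > AGENTS_MD_BUDGET_BYTES;
}
```

Measuring bytes rather than characters matters for instruction files that mix English and Chinese, because each CJK character in the Basic Multilingual Plane takes three bytes in UTF-8.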
@@ -33,7 +33,7 @@

 The single Workbench projection converges AgentRun execution facts into HWLAB-owned durable Workbench facts, so that Web, CLI, REST, SSE, fake-server and browser regression all consume the same recoverable, pageable and diagnosable user-facing session facts.

-The target state of this workstream is: AgentRun run, command, event and result serve only as execution-fact inputs; in HWLAB only `WorkbenchProjectionWriter` and `WorkbenchProjectionFinalizer` write Workbench facts; the cloud-api boot/background scheduler is the authority for startup/resume, scanning durable open checkpoints and running/projecting turns and catching them up; every `GET /v1/workbench/*` and compatibility read path reads only through `WorkbenchReadModel` and does not advance facts during a read by calling AgentRun, the Code Agent manager, trace polling, result polling or workspace repair. Workbench must satisfy 0repair: pages, GET, SSE, fake-server and web-probe must not use reload, session switching, `sessionRepair`, `realignFreshSession`, localStorage truth or read-through repair to patch an already split route/session/message/trace state into something that merely looks correct.
+The target state of this workstream is: AgentRun run, command, event and result serve only as execution-fact inputs; in HWLAB only `WorkbenchProjectionWriter` and `WorkbenchProjectionFinalizer` write Workbench facts; the cloud-api boot/background scheduler is the authority for startup/resume, scanning durable open checkpoints and running/projecting turns and catching them up; projection may be advanced only by an upstream source event/result, the projection writer/finalizer, the background scheduler/reconciler or an explicit controlled checkpoint replay/reprojection. Every `GET /v1/workbench/*` and compatibility read path reads only through `WorkbenchReadModel` and does not advance facts during a read by calling AgentRun, the Code Agent manager, trace polling, result polling, `hydrateRealtimeGap` or workspace repair. Workbench must satisfy 0repair: pages, GET, SSE, fake-server and web-probe must not use reload, session switching, `sessionRepair`, `realignFreshSession`, localStorage truth, SSE gap repair, visibility gap refresh or read-through repair to patch an already split route/session/message/trace state into something that merely looks correct.

 The Workbench aggregate event stream is the commit spine of this single projection. Every admission, AgentRun event/result, cancel, replay/reprojection and diagnostic transition is first normalized into an append-only event carrying `eventSeq`, `aggregateId`, `aggregateSeq`, `turnId`, `traceId`, `sourceRunId`, `sourceCommandId` and a source idempotency key, and is then written by the same projector into message, part, turn, trace, checkpoint, outbox and read model. Web, SSE, CLI, fake-server and probe may consume only the projection results of that event stream or a cursor replay; they must not each re-order from trace tail, result envelope, message cache, session list and browser-local state.
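The append-only event stream described in the paragraph above can be pictured with a small in-memory TypeScript sketch. Only the field names `eventSeq`, `aggregateId`, `aggregateSeq` and `turnId` come from the document; `InMemoryEventStream`, `idempotencyKey`, `kind` and the method names are hypothetical, and a real implementation would commit to durable storage rather than an array.

```typescript
// Subset of the event fields named in the paragraph above, plus hypothetical extras.
interface WorkbenchEvent {
  readonly eventSeq: number;       // global commit order
  readonly aggregateId: string;
  readonly aggregateSeq: number;   // order within one aggregate
  readonly turnId: string;
  readonly idempotencyKey: string; // assumed name for the source idempotency key
  readonly kind: string;
}

type NewEvent = Omit<WorkbenchEvent, "eventSeq" | "aggregateSeq">;

class InMemoryEventStream {
  private readonly events: WorkbenchEvent[] = [];
  private readonly seenKeys = new Set<string>();
  private readonly aggregateSeqs = new Map<string, number>();

  // Append once per idempotency key; a replayed source event returns undefined
  // and leaves the stream unchanged.
  append(input: NewEvent): WorkbenchEvent | undefined {
    if (this.seenKeys.has(input.idempotencyKey)) return undefined;
    const aggregateSeq = (this.aggregateSeqs.get(input.aggregateId) ?? 0) + 1;
    const event: WorkbenchEvent = { ...input, eventSeq: this.events.length + 1, aggregateSeq };
    this.seenKeys.add(input.idempotencyKey);
    this.aggregateSeqs.set(input.aggregateId, aggregateSeq);
    this.events.push(event);
    return event;
  }

  // Cursor replay: consumers read committed events after a cursor, in commit order.
  replayAfter(cursor: number): readonly WorkbenchEvent[] {
    return this.events.filter((event) => event.eventSeq > cursor);
  }
}
```

Because consumers only ever call `replayAfter`, every consumer sees the same order, which is the property the paragraph asks of Web, SSE, CLI, fake-server and probe.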

@@ -83,6 +83,7 @@ D601 v0.3 可以在 `hwlab-v03` namespace 内为 `hwlab-workbench-runtime` 使
 | Workbench aggregate event stream | The Workbench projection's own append-only event stream, carrying admission, source event/result, terminal, diagnostic, replay/reprojection and cancel transitions; it is the sole commit order for trace/timeline ordering, the SSE replay cursor and the projector revision. |
 | message/part authority | The authority for messages and parts in the Workbench timeline. Every user/assistant/tool/diagnostic/final response part has a stable `messageId`, `partId`, `order`, `status` and sealed state; trace-level text, DOM rows, the result envelope or an analyzer cannot choose the final content in its place. |
 | WorkbenchReadModel | The only read model that reads Workbench facts and assembles the session rail, session detail, message page, turn snapshot, trace event page and projection diagnostics. |
+| Upstream projection advancement | Workbench facts may be advanced only by the projection writer/finalizer, the background scheduler/reconciler, the AgentRun source event/result outbox or an explicit controlled checkpoint replay/reprojection; REST GET, Web pages, SSE consumers, SSE open/error/visibility handlers, web-probe, fake-server and the CLI renderer may only observe or display the projection and its diagnostics, and must not trigger write-side projection through gap hydration, trace/result polling, reload or read-through sync. |
 | checkpoint replay/reprojection | A controlled admin entry point that replays the projection logic against a persisted sourceRun/sourceCommand/checkpoint to recover from projection lag or blockage; it may call only the same finalizer/writer, is never triggered by GET, Web pages, SSE subscriptions or test helpers, and must not change the active session or route. |
 | projection commit | One idempotent persisted commit by the writer/finalizer of a set of message, part, turn, trace, session summary and checkpoint records; a terminal commit must keep user-visible facts consistent. |
 | terminal commit | The projection commit that marks the end of a turn; it must atomically update the assistant final text, message/part status, turn terminal, trace terminal event, session running=false, summary and SSE cursor. |
@@ -200,6 +201,8 @@ flowchart LR

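 The target architecture requires a clear division of labour among route/auth, adapter, projection writer/finalizer, facts store, read model, SSE publisher and compat wrapper. No route, GET handler, trace polling, result polling, workspace snapshot or front-end reducer may bypass the writer/finalizer to change Workbench facts directly.

+The target architecture also requires that projection advances only on the upstream write side. The SSE publisher/handler only publishes and replays durable outbox commits; Web/CLI/fake-server/SSE consumers only consume the REST/SSE projection. SSE open/error, visibility change, route hydrate, Trace detail hydration, web-probe observation, GET refresh and observer reload must not trigger `hydrateRealtimeGap`, read-through sync, result sync or trace polling to push the projection forward; a gap may surface only as a projection diagnostic/blocker and is caught up by the scheduler/reconciler/finalizer or by the source event/outbox.
+
 The target architecture also requires a complete ban on read-side inference. `turn.status`, `message.status`, `session.running`, `trace terminal`, `finalResponse` and `projectionStatus` must be fields that the projection writer/finalizer has already written into durable facts; the read model, REST route, SSE consumer, compat wrapper, Web reducer, CLI renderer, fake-server and tests may only read and replay those fields. AgentRun facts, trace events, message parts, the result envelope, session summary, list rows and the workspace snapshot may serve only as writer/finalizer inputs or diagnostic fields, and must not be used on the read path to derive lifecycle facts through priority, fallback, last event, empty text, timeout or UI heuristics.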
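The ban on read-side inference in the paragraph above can be made concrete with a TypeScript sketch of a read function that returns stored fields verbatim and reports a gap as a diagnostic. `TurnFact`, `TurnView`, `readTurn` and the status value sets are hypothetical; only the field names `status`, `projectionStatus` and `finalResponse` echo the document.

```typescript
// Hypothetical durable fact as already written by the writer/finalizer.
interface TurnFact {
  readonly turnId: string;
  readonly status: "running" | "projecting" | "completed" | "failed";
  readonly projectionStatus: "caught-up" | "lagging" | "blocked";
  readonly finalResponse: string | null;
}

interface TurnView {
  readonly turn: TurnFact | null;
  readonly diagnostic: string | null;
}

// Read path: takes a read-only map, returns stored fields as-is, and never
// calls a writer, poller or repair hook. A gap becomes a diagnostic string.
function readTurn(facts: ReadonlyMap<string, TurnFact>, turnId: string): TurnView {
  const turn = facts.get(turnId);
  if (turn === undefined) return { turn: null, diagnostic: "turn-not-projected" };
  if (turn.projectionStatus !== "caught-up") {
    return { turn, diagnostic: `projection-${turn.projectionStatus}` };
  }
  return { turn, diagnostic: null };
}
```

Note what the function does not do: a `running` turn with an empty `finalResponse` stays `running`, however old it is, because deciding otherwise would be a read-side heuristic.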

 ### 5.3 Target Data-Flow Diagram
@@ -276,7 +279,7 @@ sequenceDiagram
  Store-->>Read: caught-up facts or explicit lag/blocker
 ```

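-Restart recovery requires that the finalizer not rely on in-process 90s polling as its only advancement mechanism. After an in-process task is lost, cloud-api restarts or a slow task exceeds the short polling budget, the boot/background scheduler must be able to recover the sourceRun/sourceCommand that needs catching up from durable checkpoints and running/projecting turns, and either commit through the same writer logic or record a blocker. GET, SSE, Web pages, fake-server and probe may only observe recovery state and must not trigger recovery.
+Restart recovery requires that the finalizer not rely on in-process 90s polling as its only advancement mechanism. After an in-process task is lost, cloud-api restarts or a slow task exceeds the short polling budget, the boot/background scheduler must be able to recover the sourceRun/sourceCommand that needs catching up from durable checkpoints and running/projecting turns, and either commit through the same writer logic or record a blocker. GET, SSE, Web pages, visibility handlers, fake-server, web-probe and the CLI renderer may only observe recovery state and must not trigger recovery, gap hydration or read-through sync.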

 ### 5.6 Durable Workbench Facts Object Model

@@ -155,6 +155,7 @@ export interface HwlabRuntimeWebProbeAlertThresholdsSpec {
  readonly longLivedStreamOpenSlowMs: number;
  readonly visibleLoadingSlowMs: number;
  readonly turnTimingSampleSlackSeconds: number;
+  readonly turnElapsedSevereTimeoutSeconds: number;
  readonly uncommandedStateChangeCommandWindowMs: number;
  readonly scrollJumpCommandWindowMs: number;
  readonly scrollJumpFromY: number;
@@ -839,6 +840,7 @@ function webProbeAlertThresholdsConfig(value: unknown, path: string): HwlabRunti
    longLivedStreamOpenSlowMs: positiveNumberField(raw, "longLivedStreamOpenSlowMs", path),
    visibleLoadingSlowMs: positiveNumberField(raw, "visibleLoadingSlowMs", path),
    turnTimingSampleSlackSeconds: positiveNumberField(raw, "turnTimingSampleSlackSeconds", path),
+    turnElapsedSevereTimeoutSeconds: positiveNumberField(raw, "turnElapsedSevereTimeoutSeconds", path),
    uncommandedStateChangeCommandWindowMs: positiveNumberField(raw, "uncommandedStateChangeCommandWindowMs", path),
    scrollJumpCommandWindowMs: positiveNumberField(raw, "scrollJumpCommandWindowMs", path),
    scrollJumpFromY: positiveNumberField(raw, "scrollJumpFromY", path),
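The hunk above routes the new `turnElapsedSevereTimeoutSeconds` field through `positiveNumberField`, whose body is not part of this diff. The TypeScript sketch below shows what such a helper might look like, under the assumption that it rejects missing, non-numeric, non-finite and non-positive values and names the YAML path in the error. It is named `positiveNumberFieldSketch` to make clear it is not the repository's function.

```typescript
// Assumed behaviour only: the real `positiveNumberField` is not shown in this diff.
function positiveNumberFieldSketch(raw: Record<string, unknown>, key: string, path: string): number {
  const value = raw[key];
  // No fallback: a missing or invalid threshold is an error naming the YAML
  // path, so the YAML stays the only source of the number.
  if (typeof value !== "number" || !Number.isFinite(value) || value <= 0) {
    throw new Error(`${path}.${key} must be a positive number`);
  }
  return value;
}
```

Under this assumption, a YAML file that predates the commit and lacks `turnElapsedSevereTimeoutSeconds` would fail parsing rather than silently run without the threshold, which is consistent with the YAML-first rule against hidden defaults.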
@@ -533,6 +533,7 @@ function parseAlertThresholds(value) {
    longLivedStreamOpenSlowMs: requiredPositiveThreshold(raw, "longLivedStreamOpenSlowMs"),
    visibleLoadingSlowMs: requiredPositiveThreshold(raw, "visibleLoadingSlowMs"),
    turnTimingSampleSlackSeconds: requiredPositiveThreshold(raw, "turnTimingSampleSlackSeconds"),
+    turnElapsedSevereTimeoutSeconds: requiredPositiveThreshold(raw, "turnElapsedSevereTimeoutSeconds"),
    uncommandedStateChangeCommandWindowMs: requiredPositiveThreshold(raw, "uncommandedStateChangeCommandWindowMs"),
    scrollJumpCommandWindowMs: requiredPositiveThreshold(raw, "scrollJumpCommandWindowMs"),
    scrollJumpFromY: requiredPositiveThreshold(raw, "scrollJumpFromY"),
@@ -1383,6 +1384,9 @@ function buildFindings(samples, control, network, errors, sampleMetrics, promptN
      ? sampleMetrics.turnTimingNonMonotonic.filter((item) => item.metric === "recentUpdateSeconds" && item.anomaly === "jump")
      : [];
  if (recentUpdateSawtoothJumps.length > 0) findings.push({ id: "turn-timing-recent-update-sawtooth-jump", severity: "amber", summary: "最近更新 value jumped faster than sample interval; expected sawtooth increase-or-reset", count: recentUpdateSawtoothJumps.length, samples: recentUpdateSawtoothJumps.slice(0, 20) });
+  const severeTimeoutRounds = Array.isArray(sampleMetrics?.rounds) ? sampleMetrics.rounds.filter((item) => Number(item.maxTotalElapsedSeconds) > alertThresholds.turnElapsedSevereTimeoutSeconds) : [];
+  const severeTimeoutSamples = Array.isArray(sampleMetrics?.timeline) ? sampleMetrics.timeline.filter((item) => Number(item.totalElapsedSeconds) > alertThresholds.turnElapsedSevereTimeoutSeconds) : [];
+  if (severeTimeoutRounds.length > 0 || severeTimeoutSamples.length > 0) findings.push({ id: "turn-elapsed-severe-timeout", severity: "red", summary: "turn total elapsed exceeded YAML-configured severe timeout; investigate Workbench/AgentRun progress instead of treating the turn as healthy", thresholdSeconds: alertThresholds.turnElapsedSevereTimeoutSeconds, count: Math.max(severeTimeoutRounds.length, severeTimeoutSamples.length), rounds: severeTimeoutRounds.slice(0, 20), samples: severeTimeoutSamples.slice(0, 20) });
  const loadingSummary = sampleMetrics?.loading?.summary || {};
  const visibleLoadingSlowSeconds = alertThresholds.visibleLoadingSlowMs / 1000;
  if (Number(loadingSummary.longestContinuousSeconds ?? 0) > visibleLoadingSlowSeconds) findings.push({ id: "page-loading-visible-over-budget", severity: "red", summary: "visible 加载中 stayed on screen longer than configured YAML budget; fix real loading latency instead of revealing incomplete content early", count: loadingSummary.overBudgetSegmentCount ?? loadingSummary.overFiveSecondSegmentCount ?? 1, longestContinuousSeconds: loadingSummary.longestContinuousSeconds, budgetSeconds: visibleLoadingSlowSeconds, segments: sampleMetrics.loading.segments.slice(0, 20), owners: sampleMetrics.loading.owners.slice(0, 20) });
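The three added lines in the hunk above compute the `turn-elapsed-severe-timeout` finding inline. The same logic is restated below as a standalone TypeScript function so the comparison and count are easier to follow. The input field names and the finding fields mirror the diff; the function name and the reduced finding shape (no `summary`, `rounds` or `samples`) are simplifications for illustration.

```typescript
interface RoundMetric { readonly maxTotalElapsedSeconds: number }
interface TimelineSample { readonly totalElapsedSeconds: number }

interface SevereTimeoutFinding {
  readonly id: "turn-elapsed-severe-timeout";
  readonly severity: "red";
  readonly thresholdSeconds: number;
  readonly count: number;
}

// Flag rounds or samples whose elapsed time is strictly greater than the
// YAML-configured threshold; return null when nothing exceeds it.
function severeTimeoutFinding(
  rounds: readonly RoundMetric[],
  timeline: readonly TimelineSample[],
  thresholdSeconds: number,
): SevereTimeoutFinding | null {
  const slowRounds = rounds.filter((item) => Number(item.maxTotalElapsedSeconds) > thresholdSeconds);
  const slowSamples = timeline.filter((item) => Number(item.totalElapsedSeconds) > thresholdSeconds);
  if (slowRounds.length === 0 && slowSamples.length === 0) return null;
  return {
    id: "turn-elapsed-severe-timeout",
    severity: "red",
    thresholdSeconds,
    // Same rule as the diff: report the larger of the two counts, not their sum.
    count: Math.max(slowRounds.length, slowSamples.length),
  };
}
```

With the threshold of 120 from the YAML hunk, a turn that takes exactly 120 seconds is not flagged, because the comparison is strict.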
@@ -2921,6 +2921,7 @@ function parseAlertThresholds(value) {
    longLivedStreamOpenSlowMs: requiredPositiveThreshold(raw, "longLivedStreamOpenSlowMs"),
    visibleLoadingSlowMs: requiredPositiveThreshold(raw, "visibleLoadingSlowMs"),
    turnTimingSampleSlackSeconds: requiredPositiveThreshold(raw, "turnTimingSampleSlackSeconds"),
+    turnElapsedSevereTimeoutSeconds: requiredPositiveThreshold(raw, "turnElapsedSevereTimeoutSeconds"),
    uncommandedStateChangeCommandWindowMs: requiredPositiveThreshold(raw, "uncommandedStateChangeCommandWindowMs"),
    scrollJumpCommandWindowMs: requiredPositiveThreshold(raw, "scrollJumpCommandWindowMs"),
    scrollJumpFromY: requiredPositiveThreshold(raw, "scrollJumpFromY"),