fix: restore web-probe severe timeout threshold

Also records instruction hygiene, YAML-first config split guidance, and Sub2API D601 recovery notes from the recovered worktree state.
Codex
2026-06-26 09:34:04 +00:00
parent ab71273867
commit 2a8f279575
11 changed files with 120 additions and 11 deletions
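@@ -12,6 +12,7 @@ Official GitHub issue/PR reads and writes must go through `bun scripts/cli.ts gh ...` or `trans gh:
- Issue/PR bodies, comments, close/reopen, PR descriptions and merge closeouts default to Chinese.
- A new issue body must include `目标合并分支: <repo branch/lane>` (target merge branch); when no merge is needed, write `目标合并分支: 不适用` (not applicable).
- Create new issues for large plans, follow-up phases and independent improvement directions; comments on existing issues carry only short progress, evidence, blockers and links.
- For planning-type, multi-phase, or architecture/API/platform-ops issues, the first phase must be `P0 SPEC 先行` (P0 SPEC-first); see the "Multi-Phase Issues and SPEC-First" section of [references/full.md](references/full.md) for details.
- For multi-line bodies, use a quoted heredoc: `--body-stdin <<'EOF'`; do not stuff long Markdown into shell arguments.
- PR merges go only through the guarded `gh pr merge`; the Next hint from `gh pr create` defaults to `--merge --delete-branch`, and `--squash` is passed explicitly only once the ancestry is confirmed to be disposable.
@@ -42,6 +42,13 @@ CLI self-checks use `bun scripts/cli.ts check --syntax-only`, targeted at the changed modules
- If an investigation turns up an independent improvement direction, first create a new issue with `gh issue create --body-stdin`, with the title and body stating the target merge branch/lane, background, plan and acceptance criteria; then note in 1-3 sentences in a comment on the original issue that it was split out, and link to the new issue.
- Update the current issue body only when the user explicitly asks for the plan to be written back into it, or when the current issue is itself the sole dedicated planning issue; even then, keep comments short and do not copy the whole plan.
## Multi-Phase Issues and SPEC-First
- When forming a planning-type issue that covers multi-phase implementation, cross-module architecture, a new capability, a long-lived API/data model, a platform-ops capability or a user-visible workflow, the first phase must be `P0 SPEC 先行`, handled under the SPEC management model of `$unidesk-oa`.
- `P0 SPEC 先行` must list, in the issue body: the SPEC number, the SPEC document path, the parent spec, related specs, the implementation reference version, the completion items for the target architecture diagram/data-flow diagram/key sequence diagrams, and the rule for the source-file header annotation `SPEC: <number> <short name> <implementation reference version>`.
- The issue body may carry only the execution plan, phase status and evidence index; it cannot replace the long-lived SPEC text under `project-management/PJ2026-01/specs/`. If stable requirements, data flow, interfaces or acceptance criteria change, update the SPEC first and then the issue's phase plan.
- Until P0 is complete, code implementation, deployment, CI/CD, test additions and acceptance closeout must not be listed as executable phases; they may appear only as later P1+ phases.
---
## Auth Probing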
@@ -24,10 +24,11 @@ bun scripts/cli.ts platform-infra sub2api apply --target D601 --dry-run
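- Secrets output only object names, key names, presence, fingerprint or a redacted prefix; never print a full token/key.
- D601 is the default active target; for D518/G14 and other targets, follow the target stated explicitly in YAML and the issue.
- The Codex pool, the unified API key, master `~/.codex` configuration, FRP/Caddy exposure and account add/remove must all go through this skill's controlled CLI.
- For a D601 public 502 or an `api.pikapython.com` anomaly, first distinguish edge from app endpoint, then recover in layers with `status`, `validate`, `apply --confirm` and `codex-pool validate`; the full steps are in the troubleshooting section of [references/full.md](references/full.md).
## When to Read the Reference
- Adding/removing upstreams, protected account proxies, group bindings: read the account-management section of [references/full.md](references/full.md).
- Deployment/status/image upgrade/FRP exposure: read the deployment, image and FRP sections.
- master Codex consumer side, `/v1/models`, Codex pool acceptance: read the Codex Pool and acceptance-criteria sections.
- master Codex consumer side, `/v1/models`, Codex pool acceptance: read the Codex Pool and acceptance-criteria sections.
- When unsure about troubleshooting or prohibited actions, read the troubleshooting and prohibited-actions sections.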
@@ -246,6 +246,8 @@ bun scripts/cli.ts platform-infra sub2api codex-pool configure-local --confirm
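## Troubleshooting
- When `api.pikapython.com` or the D601 public exposure returns 502, first determine whether the edge or the app endpoint is at fault: run `sub2api status --target D601` and `sub2api validate --target D601`. If `sub2api`, `sub2api-frpc`, `sub2api-redis` or `sub2api-egress-proxy` shows `0/1`, or validate reports `no endpoints available for service "sub2api"` / the app Pod has terminated, first run `bun scripts/cli.ts platform-infra sub2api apply --target D601 --confirm` to reconverge the YAML resources, poll according to the returned `job status`, then run `status`, `validate` and `codex-pool validate --target D601`. Do not start by changing the account pool, sentinel state, Secrets or Caddy.
- After D601 quick recovery completes, close out with layered evidence: `https://api.pikapython.com/health` should return 200; `codex-pool validate --target D601` should prove that the internal `GET /v1/models` and a minimal `POST /v1/responses` smoke succeed; if the public OpenAI-compatible API must be proven, use `trans D601:k3s sh` to read `platform-infra/sub2api-codex-pool-api-key.API_KEY` read-only into a temporary shell variable, then request the public `/v1/models` and a minimal `/v1/responses` marker, printing only the HTTP status, model count and marker, never the key. Do not run `configure-local --confirm` for public verification, since it rewrites the local `~/.codex`; a 401 from the local default `auth.json` key only shows that the local config and the public unified key differ, and is not evidence that the service is unavailable.
- When the Codex pool sentinel, account freeze/restore, marker-only judgment or probe cycle is unclear: first run `bun scripts/cli.ts platform-infra sub2api codex-pool sentinel-report`. This report is the primary observation surface; only when the report lacks fields or lower-level evidence is needed, go on to `--raw`, the CronJob log, the state ConfigMap or the Sub2API admin UI. If you see a "临时不可调度状态" (temporarily unschedulable state) that includes a rule index/matched keyword, check the Sub2API `account_temp_unschedulable` log and the account's `temp_unschedulable_*` fields; the sentinel explains only active quarantine with `schedulable=false`, not this kind of built-in temporary cooldown.
- To strengthen monitoring without letting the sentinel freeze accounts automatically, set YAML `sentinel.actions.enabled=false` and then run `codex-pool sync --confirm`. The marker probe and gateway failure monitor still record `would-freeze` / observe-only evidence, but do not write `schedulable=false` through Sub2API admin. `/responses/compact`, `codex.remote_compact.failed` and compact upstream 5xx failover are recorded only as `gateway-compact-*` observation events, not as triggers for sentinel auto-switching.
- When a single request id reports 502/503/interruption/no automatic account switch: first run `bun scripts/cli.ts platform-infra sub2api codex-pool trace --request-id <requestId>`. Look first at `outcome`, `reason`, `FAILOVER`, `SELECT-FAILED`, `ACCOUNT SIGNALS` and `WINDOW STATS`; add `--show-lines` or `--raw` only when the trace report lacks fields or the raw logs need auditing. If `reason=failover-attempted-no-candidate`, the switch was attempted but the scheduler had no available candidate after excluding the failed account; continue with `sentinel-report` and `validate --full` to distinguish sentinel quarantine, request-path temp-unschedulable, account status or capacity exhaustion.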
@@ -1,6 +1,6 @@
---
name: unidesk-ymalops
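description: UniDesk YAML-first ops normalization skill. Use when the user mentions ymal-first/YAML-first normalization, YAML ops, ops configuration responsibilities, platform-infra config refactoring, Secret sourceRef, publicExposure, target/lane/node, ops helper extraction, removing hardcoded defaults/special cases, or the historical convergence issues pikasTech/unidesk#390/#398/#401.
description: UniDesk YAML-first ops normalization skill. Use when the user mentions ymal-first/YAML-first normalization, YAML ops, splitting ops configuration responsibilities, configRef/path references, platform-infra config refactoring, Secret sourceRef, publicExposure, target/lane/node, ops helper extraction, removing hardcoded defaults/special cases, or the historical convergence issues pikasTech/unidesk#390/#398/#401.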
---
# UniDesk YAML Ops
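@@ -26,6 +26,9 @@ description: UniDesk YAML-first ops normalization skill. Use when the user mentions ymal-first/
- Source, configuration and deployment normalization is done in a dedicated `.worktree/<task>` by default; lightweight skill/docs/reference convergence may be done directly in the main worktree per project rules.
- YAML is the source of truth. Do not add hidden code defaults, hard numeric limits in schemas, contract tests or hard-coded test policies.
- Code validation only guarantees that fields can be read and rendered correctly: types, required fields, enum key names, reference existence. Values such as version numbers, namespace, endpoint, capacity, cooldown time and rollback window are governed by YAML.
- Avoid a "super config". When one capability spans different responsibilities such as target/lane, runtime, scenario, prompt, report, publicExposure, Secret and CI/CD, split by responsibility into the owning YAML; the root YAML holds only ownership and `configRefs`/path references and does not carry all the details.
- Cross-YAML references should use a stable `path/to/file.yaml#object.path` or an equivalent syntax explicitly supported by the current domain parser. The parser only resolves references and validates existence/type/shape and conflicts; it does not generate hidden defaults, nor does it write the merged large object out as a new source of truth.
- CLI `plan/status` should output a redacted config reference graph: for each ref, the file, path, presence, digest hash, missing fields and the next drill-down command. Do not dump the fully expanded YAML or Secrets by default.
- Secrets may be declared only via YAML `sourceRef`/`targetKey` and delivered through the controlled CLI; do not reverse-derive, decode or backfill local credentials from runtime Secrets, pod env, logs or database state.
- Controlled CLI output may disclose only object names, key names, sourceRef, targetKey, missing items, fingerprint, byte count and an execution summary; it must not print base64 payloads, decoded values, full DSNs, API keys or copy-pastable credentials.
- Do not build a new global mega-orchestrator. Prefer keeping the domain CLIs and extract shared capabilities into ops helpers; a domain CLI expresses only domain actions.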
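The cross-YAML reference rule above (a stable `path/to/file.yaml#object.path`, with the parser only resolving and validating) could be sketched roughly as follows. `parseConfigRef`, `resolveConfigRef`, the `loadedFiles` map and the `config/root.yaml` path are hypothetical names for illustration, not the actual domain parser; YAML loading is assumed to have happened elsewhere.

```javascript
// Hypothetical sketch: split "path/to/file.yaml#object.path" into its parts.
function parseConfigRef(ref) {
  const hash = ref.indexOf("#");
  if (hash <= 0 || hash === ref.length - 1) {
    throw new Error(`invalid configRef "${ref}": expected path/to/file.yaml#object.path`);
  }
  return { file: ref.slice(0, hash), objectPath: ref.slice(hash + 1).split(".") };
}

// Resolve a ref against already-loaded YAML documents (plain objects keyed by file path).
// Missing files or keys fail loudly with the exact path; no default value is ever substituted.
function resolveConfigRef(ref, loadedFiles) {
  const { file, objectPath } = parseConfigRef(ref);
  const has = (obj, key) => Object.prototype.hasOwnProperty.call(obj, key);
  if (!has(loadedFiles, file)) {
    throw new Error(`configRef ${ref}: file not loaded: ${file}`);
  }
  let node = loadedFiles[file];
  const walked = [];
  for (const key of objectPath) {
    walked.push(key);
    if (node === null || typeof node !== "object" || !has(node, key)) {
      throw new Error(`configRef ${ref}: missing ${file}#${walked.join(".")}`);
    }
    node = node[key];
  }
  return node;
}

// Usage with an assumed, already-parsed root YAML.
const exampleFiles = {
  "config/root.yaml": { targets: { D601: { namespace: "platform-infra" } } },
};
const exampleNamespace = resolveConfigRef("config/root.yaml#targets.D601.namespace", exampleFiles);
```

The resolver returns the referenced node unchanged rather than a merged object, which keeps the owning YAML as the only source of truth.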
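@@ -36,13 +39,14 @@ description: UniDesk YAML-first ops normalization skill. Use when the user mentions ymal-first/
1. Inventory the target surface: list the YAML files, CLI entry points, helpers, Secret bindings, runtime objects and existing hardcodes involved.
2. Confirm ownership: every fact must have exactly one owning YAML; code, runtime Secrets, pod env and docs must not become the configuration truth in reverse.
3. Normalize the YAML envelope: for self-owned ops configs, first fill in necessary structure such as `version`, `kind`, `metadata`, `defaults` and `targets/services/lane`, but do not force a wrapper onto external tool formats.
4. Migrate values: move tunables such as namespace, serviceName, secretName, endpoint, image/tag, node/lane, probe, NO_PROXY, capacity and rollback window from code into YAML.
5. Slim the parser: the parser does only structural and type validation, hides no business policy and provides no long-lived defaults. For missing items, the CLI should report the YAML path and field name.
6. Extract shared ops primitives: before adding a new service branch, prefer reusing or extending a shared helper.
7. Keep domain CLIs thin: entry points such as `platform-infra`, `server`, `gc`, `agentrun` and `hwlab` only compose YAML, helpers and execution actions, and do not duplicate low-level Kubernetes/FRP/Caddy/Secret logic.
8. Verify the original entry point: CLI/config changes by default run only syntax, help/command shape, plan/dry-run or the corresponding sync/validate; closeout that touches the real runtime must run the original CLI entry point, without adding contract tests.
9. Bounded closeout: once an issue has frozen its phases, after completing the current phase update only the parent issue's progress and next fixed phase; close the umbrella issue when all fixed phases are done, and do not turn candidate scan results into a new Round.
3. Split configuration responsibilities: move facts with different lifecycles or different owners into their respective owning YAML, linked via root `configRefs`/path references; only fields with the same owner, the same lifecycle and the same command model belong in the same object.
4. Normalize the YAML envelope: for self-owned ops configs, first fill in necessary structure such as `version`, `kind`, `metadata`, `defaults` and `targets/services/lane`, but do not force a wrapper onto external tool formats.
5. Migrate values: move tunables such as namespace, serviceName, secretName, endpoint, image/tag, node/lane, probe, NO_PROXY, capacity and rollback window from code into YAML.
6. Slim the parser: the parser does only structural and type validation, hides no business policy and provides no long-lived defaults. For missing items, the CLI should report the YAML path and field name; duplicate declarations of the same fact or conflicting references should fail and point to the conflicting path.
7. Extract shared ops primitives: before adding a new service branch, prefer reusing or extending a shared helper.
8. Keep domain CLIs thin: entry points such as `platform-infra`, `server`, `gc`, `agentrun` and `hwlab` only compose YAML, helpers and execution actions, and do not duplicate low-level Kubernetes/FRP/Caddy/Secret logic.
9. Verify the original entry point: CLI/config changes by default run only syntax, help/command shape, plan/dry-run or the corresponding sync/validate; closeout that touches the real runtime must run the original CLI entry point, without adding contract tests.
10. Bounded closeout: once an issue has frozen its phases, after completing the current phase update only the parent issue's progress and next fixed phase; close the umbrella issue when all fixed phases are done, and do not turn candidate scan results into a new Round.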
## Common Refactor Targets
@@ -185,6 +185,7 @@ lanes:
longLivedStreamOpenSlowMs: 10000
visibleLoadingSlowMs: 10000
turnTimingSampleSlackSeconds: 3
turnElapsedSevereTimeoutSeconds: 120
uncommandedStateChangeCommandWindowMs: 10000
scrollJumpCommandWindowMs: 8000
scrollJumpFromY: 250
@@ -0,0 +1,83 @@
# Agent Instruction Hygiene
This document is the long-term reference for keeping always-loaded agent instruction files small, navigable and stable. It applies to local and remote `AGENTS.md`, `CLAUDE.md`-style aliases and any repo-level instruction file that is automatically injected into an agent context.
## Size Budget
`AGENTS.md` is an index, not a knowledge base. The hard size budget for any local or remote `AGENTS.md` is 10 KiB, measured as bytes with `wc -c AGENTS.md`.
When an `AGENTS.md` is already over 10 KiB, do not append more detailed rules to it. Split first, then add only a one-line index entry back to `AGENTS.md`.
When editing an `AGENTS.md` would push it over 10 KiB, route the new content to a skill or a `docs/reference/*.md` document and keep `AGENTS.md` as a short pointer.
If loading or printing `AGENTS.md` triggers CLI output dump or context blow-up, treat that as a visibility bug and an instruction-hygiene bug. The fix is to split the file, not to increase output limits or ask agents to read around the dump.
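For an ad-hoc look at the budget during review (not as a guard or preflight check), the byte count that `wc -c` reports can be reproduced as in this minimal sketch. It assumes the file is UTF-8 text already read into a string; `instructionFileBudget` and `AGENTS_MD_BUDGET_BYTES` are illustrative names, not existing tooling.

```javascript
// 10 KiB hard budget, measured in bytes rather than characters.
const AGENTS_MD_BUDGET_BYTES = 10 * 1024;

// Report the UTF-8 byte size of an instruction file's text and whether it exceeds the budget.
function instructionFileBudget(text, budgetBytes = AGENTS_MD_BUDGET_BYTES) {
  const bytes = new TextEncoder().encode(text).length;
  return { bytes, budgetBytes, overBudget: bytes > budgetBytes };
}
```

Counting bytes matters for files with non-ASCII content, where the character count understates the size that actually lands in the agent context.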
## What Belongs In AGENTS.md
Keep only always-needed routing information in `AGENTS.md`:
- Project identity and source-of-truth boundaries.
- P0 one-line rules that prevent immediate damage.
- Links to the authoritative long-term reference document for each domain.
- Skill names that must be loaded for common workflows.
- Short warnings about secrets, destructive commands, target workspaces and build bans.
Do not put long examples, command transcripts, JSON output, issue timelines, architecture essays, provider-specific debugging logs or one-off incident analysis in `AGENTS.md`.
## Where Overflow Content Goes
Use this routing order when splitting content out of `AGENTS.md`:
- Reusable workflow behavior belongs in a skill `SKILL.md`, for example `$dad-dev`, `$unidesk-cicd`, `$unidesk-gh`, `$unidesk-trans`, `$unidesk-otel`, `$unidesk-webdev` or `$unidesk-ymalops`.
- Stable project constraints, workspace rules, architecture boundaries and validation criteria belong in `docs/reference/*.md`.
- CLI shape, output style, route syntax and operator ergonomics belong in `docs/reference/cli.md` unless a narrower reference already owns them.
- Deployment hygiene, fixed repo boundaries and source-of-truth rules belong in `docs/reference/devops-hygiene.md`.
- Node/lane-specific HWLAB rules belong in `docs/reference/hwlab.md` and the target repo's own reference docs.
- AgentRun source-truth and deployment-lane rules belong in `docs/reference/agentrun.md`.
- Platform-infra and YAML-first operations belong in `docs/reference/platform-infra.md` and `docs/reference/yaml-first-ops.md`.
- Process notes, temporary findings and dated investigation logs belong in GitHub issues, PR comments or process notes; they must be distilled before entering long-term reference.
If a rule is both reusable across projects and specific to UniDesk's current directories or services, put the reusable workflow in the skill and put UniDesk-specific paths, lane names and validation boundaries in `docs/reference/*.md`, then cross-reference both.
## Split Procedure
When an agent sees a local or remote `AGENTS.md` over 10 KiB:
1. Identify the detailed section being changed or expanded.
2. Move the detailed content to the owning skill or `docs/reference/*.md` document.
3. Replace the original section with one concise bullet and a link to the authoritative location.
4. Preserve P0 damage-prevention warnings in `AGENTS.md`, but compress them to one-line routing rules.
5. Do not create a single giant overflow archive as the normal solution. A temporary migration note is acceptable only if it immediately points to the domain documents that must absorb it.
6. Do not add tests, guards or preflight checks just to enforce the size budget unless the user explicitly asks. The default control is documentation hygiene plus concise review.
For large legacy files, split incrementally by domain. Each new edit should leave the touched domain smaller and better referenced than before.
## Cross-Reference Requirements
Every `AGENTS.md` index entry that points out of the file must name the authoritative target. Prefer direct paths such as `docs/reference/hwlab.md` or skill names such as `$unidesk-cicd`.
Avoid duplicated full rules between `AGENTS.md`, skills and long-term reference docs. `AGENTS.md` may summarize; the reference owns the detail. If two references conflict, update the narrower domain reference and keep only one authoritative version.
## Secrets And Output Hygiene
Instruction files must not contain secrets, full API keys, full DSNs, base64 payloads, bearer tokens, SSH private keys or copy-pastable credentials.
Do not paste large CLI output, OTel trace dumps, JSON arrays or browser transcripts into `AGENTS.md`. If a large output demonstrates a durable rule, summarize the rule and link to the issue or reference that owns the conclusion.
## Current UniDesk Routing Map
The current top-level routing map is:
- CLI behavior and output: `docs/reference/cli.md`.
- YAML-first configuration: `docs/reference/yaml-first-ops.md` and `$unidesk-ymalops`.
- Platform infrastructure: `docs/reference/platform-infra.md` and `$unidesk-sub2api` when Sub2API is involved.
- Distributed field repair: `$dad-dev` plus `docs/reference/devops-hygiene.md`.
- CI/CD and rollout: `$unidesk-cicd` plus `docs/reference/cli.md`.
- GitHub issue and PR writes: `$unidesk-gh`.
- Trans/remote patch transport: `$unidesk-trans` plus `docs/reference/cli.md`.
- Web UI, Workbench and web-probe: `$unidesk-webdev`.
- OpenTelemetry and Tempo: `$unidesk-otel` plus `docs/reference/observability.md`.
- HWLAB node/lane operation: `docs/reference/hwlab.md`.
- AgentRun: `docs/reference/agentrun.md`.
- Master/D601 development environment: `docs/reference/dev-environment.md`.
- Secretary work: `docs/reference/secretary-reference.md`.
@@ -33,7 +33,7 @@
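The single Workbench projection is responsible for converging AgentRun execution facts into HWLAB-owned durable Workbench facts, so that Web, CLI, REST, SSE, fake-server and browser regression all consume the same recoverable, pageable and diagnosable user-facing session facts.
The target state of this workstream is: AgentRun run, command, event and result serve only as execution-fact inputs; within HWLAB only `WorkbenchProjectionWriter` and `WorkbenchProjectionFinalizer` write Workbench facts; the cloud-api boot/background scheduler is the authority for startup/resume, responsible for scanning durable open checkpoints and running/projecting turns and recovering them until caught up; all `GET /v1/workbench/*` and compatibility read paths read only through `WorkbenchReadModel`, and do not call AgentRun, Code Agent manager, trace polling, result polling or workspace repair at read time to advance facts. Workbench must satisfy 0repair: the page, GET, SSE, fake-server and web-probe must not use reload, session switching, `sessionRepair`, `realignFreshSession`, localStorage truth or read-through repair to patch an already split route/session/message/trace state into something that merely looks correct.
The target state of this workstream is: AgentRun run, command, event and result serve only as execution-fact inputs; within HWLAB only `WorkbenchProjectionWriter` and `WorkbenchProjectionFinalizer` write Workbench facts; the cloud-api boot/background scheduler is the authority for startup/resume, responsible for scanning durable open checkpoints and running/projecting turns and recovering them until caught up; projection advancement may be triggered only by upstream source event/result, the projection writer/finalizer, the background scheduler/reconciler, or an explicit controlled checkpoint replay/reprojection. All `GET /v1/workbench/*` and compatibility read paths read only through `WorkbenchReadModel`, and do not call AgentRun, Code Agent manager, trace polling, result polling, `hydrateRealtimeGap` or workspace repair at read time to advance facts. Workbench must satisfy 0repair: the page, GET, SSE, fake-server and web-probe must not use reload, session switching, `sessionRepair`, `realignFreshSession`, localStorage truth, SSE gap repair, visibility gap refresh or read-through repair to patch an already split route/session/message/trace state into something that merely looks correct.
The Workbench aggregate event stream is the commit spine of the single projection described above. All admission, AgentRun event/result, cancel, replay/reprojection and diagnostic transitions are first normalized into append-only events carrying `eventSeq`, `aggregateId`, `aggregateSeq`, `turnId`, `traceId`, `sourceRunId`, `sourceCommandId` and a source idempotency key, and are then written by the same projector into message, part, turn, trace, checkpoint, outbox and the read model. Web, SSE, CLI, fake-server and probe may consume only the projected results of this event stream or a cursor replay; they must not each re-order from trace tail, result envelope, message cache, session list and browser-local state.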
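As a rough illustration of the append-only, idempotent event stream described above, the following in-memory sketch assigns a global `eventSeq` and a per-aggregate `aggregateSeq`, drops duplicates by idempotency key, and supports cursor replay. `AggregateEventLog`, its method names and the `idempotencyKey` field name are assumptions for illustration; the real store is durable and carries more fields (`turnId`, `traceId`, `sourceRunId`, `sourceCommandId`).

```javascript
// Hypothetical in-memory model of an append-only aggregate event stream.
class AggregateEventLog {
  constructor() {
    this.events = [];               // committed events in eventSeq order
    this.seenKeys = new Set();      // source idempotency keys already committed
    this.aggregateSeqs = new Map(); // aggregateId -> last aggregateSeq
  }

  // Append a normalized event; returns null when the idempotency key was already committed.
  append({ aggregateId, idempotencyKey, type, payload }) {
    if (this.seenKeys.has(idempotencyKey)) return null;
    this.seenKeys.add(idempotencyKey);
    const aggregateSeq = (this.aggregateSeqs.get(aggregateId) ?? 0) + 1;
    this.aggregateSeqs.set(aggregateId, aggregateSeq);
    const event = Object.freeze({
      eventSeq: this.events.length + 1,
      aggregateId,
      aggregateSeq,
      idempotencyKey,
      type,
      payload,
    });
    this.events.push(event);
    return event;
  }

  // Cursor replay: consumers read committed events after their last seen eventSeq.
  replayAfter(cursor) {
    return this.events.filter((event) => event.eventSeq > cursor);
  }
}
```

Consumers in this model only call `replayAfter`; nothing on the read side can reorder or append, which mirrors the rule that ordering comes from the stream alone.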
@@ -83,6 +83,7 @@ Within the `hwlab-v03` namespace, D601 v0.3 may, for `hwlab-workbench-runtime`,
| Workbench aggregate event stream | The Workbench projection's own append-only event stream, carrying admission, source event/result, terminal, diagnostic, replay/reprojection and cancel transitions; it is the sole commit order for trace/timeline ordering, the SSE replay cursor and the projector revision. |
| message/part authority | The authority for messages and parts on the Workbench timeline. Every user/assistant/tool/diagnostic/final response part has a stable `messageId`, `partId`, `order`, `status` and sealed state; trace-level text, DOM rows, the result envelope or an analyzer cannot select the final content in its place. |
| WorkbenchReadModel | The only read model that reads Workbench facts and assembles the session rail, session detail, message page, turn snapshot, trace event page and projection diagnostics. |
| Upstream projection advancement | Workbench facts may be advanced only by the projection writer/finalizer, the background scheduler/reconciler, the AgentRun source event/result outbox, or an explicit controlled checkpoint replay/reprojection; REST GET, the Web page, SSE consumers, SSE open/error/visibility handlers, web-probe, fake-server and the CLI renderer may only observe or display projections and diagnostics, and must not trigger write-side projection via gap hydration, trace/result polling, reload or read-through sync. |
| checkpoint replay/reprojection | A controlled admin entry point that replays the projection logic against the persisted sourceRun/sourceCommand/checkpoint to recover from projection lag or blockage; it may call only the same finalizer/writer, is not triggered by GET, the Web page, SSE subscription or test helpers, and must not change the active session or route. |
| projection commit | One idempotent persisted commit by the writer/finalizer of a set of message, part, turn, trace, session summary and checkpoint; a terminal commit must keep user-visible facts consistent. |
| terminal commit | The projection commit that marks the end of a turn; it must atomically update the assistant final text, message/part status, turn terminal, trace terminal event, session running=false, summary and SSE cursor. |
@@ -200,6 +201,8 @@ flowchart LR
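The target architecture requires a clear division of labor among route/auth, adapter, projection writer/finalizer, facts store, read model, SSE publisher and compat wrapper. No route, GET handler, trace polling, result polling, workspace snapshot or front-end reducer may bypass the writer/finalizer to change Workbench facts directly.
The target architecture also requires that projection advancement happen only on the upstream write side. The SSE publisher/handler only publishes and replays durable outbox commits; Web/CLI/fake-server/SSE consumers only consume the REST/SSE projection. SSE open/error, visibility change, route hydrate, Trace detail hydration, web-probe observation, GET refresh and observer reload must not trigger `hydrateRealtimeGap`, read-through sync, result sync or trace polling to push the projection forward; a gap may surface only as a projection diagnostic/blocker, to be caught up by the scheduler/reconciler/finalizer or by source event/outbox.
The target architecture also requires a complete ban on read-side inference. `turn.status`, `message.status`, `session.running`, `trace terminal`, `finalResponse` and `projectionStatus` must be fields the projection writer/finalizer has already written into durable facts; the read model, REST routes, SSE consumers, compat wrapper, Web reducer, CLI renderer, fake-server and tests may only read and replay these fields. AgentRun facts, trace events, message parts, the result envelope, session summary, list rows and workspace snapshots may serve only as writer/finalizer inputs or diagnostic fields, and must not be used on the read path to derive lifecycle facts via priority, fallback, last event, empty text, timeout or UI heuristics.
### 5.3 Target Data-Flow Diagram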
@@ -276,7 +279,7 @@ sequenceDiagram
Store-->>Read: caught-up facts or explicit lag/blocker
```
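Restart recovery requires that the finalizer not rely on in-process 90s polling as its only advancement mechanism. After an in-process task is lost, cloud-api restarts, or a slow task exceeds the short polling budget, the boot/background scheduler must be able to recover, from durable checkpoints and running/projecting turns, the sourceRun/sourceCommand that need catching up, and commit or record a blocker using the same writer logic. GET, SSE, the Web page, fake-server and probe may only observe recovery state and must not trigger recovery.
Restart recovery requires that the finalizer not rely on in-process 90s polling as its only advancement mechanism. After an in-process task is lost, cloud-api restarts, or a slow task exceeds the short polling budget, the boot/background scheduler must be able to recover, from durable checkpoints and running/projecting turns, the sourceRun/sourceCommand that need catching up, and commit or record a blocker using the same writer logic. GET, SSE, the Web page, visibility handlers, fake-server, web-probe and the CLI renderer may only observe recovery state and must not trigger recovery, gap hydration or read-through sync.
### 5.6 Durable Workbench Facts Object Model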
@@ -155,6 +155,7 @@ export interface HwlabRuntimeWebProbeAlertThresholdsSpec {
readonly longLivedStreamOpenSlowMs: number;
readonly visibleLoadingSlowMs: number;
readonly turnTimingSampleSlackSeconds: number;
readonly turnElapsedSevereTimeoutSeconds: number;
readonly uncommandedStateChangeCommandWindowMs: number;
readonly scrollJumpCommandWindowMs: number;
readonly scrollJumpFromY: number;
@@ -839,6 +840,7 @@ function webProbeAlertThresholdsConfig(value: unknown, path: string): HwlabRunti
longLivedStreamOpenSlowMs: positiveNumberField(raw, "longLivedStreamOpenSlowMs", path),
visibleLoadingSlowMs: positiveNumberField(raw, "visibleLoadingSlowMs", path),
turnTimingSampleSlackSeconds: positiveNumberField(raw, "turnTimingSampleSlackSeconds", path),
turnElapsedSevereTimeoutSeconds: positiveNumberField(raw, "turnElapsedSevereTimeoutSeconds", path),
uncommandedStateChangeCommandWindowMs: positiveNumberField(raw, "uncommandedStateChangeCommandWindowMs", path),
scrollJumpCommandWindowMs: positiveNumberField(raw, "scrollJumpCommandWindowMs", path),
scrollJumpFromY: positiveNumberField(raw, "scrollJumpFromY", path),
@@ -533,6 +533,7 @@ function parseAlertThresholds(value) {
longLivedStreamOpenSlowMs: requiredPositiveThreshold(raw, "longLivedStreamOpenSlowMs"),
visibleLoadingSlowMs: requiredPositiveThreshold(raw, "visibleLoadingSlowMs"),
turnTimingSampleSlackSeconds: requiredPositiveThreshold(raw, "turnTimingSampleSlackSeconds"),
turnElapsedSevereTimeoutSeconds: requiredPositiveThreshold(raw, "turnElapsedSevereTimeoutSeconds"),
uncommandedStateChangeCommandWindowMs: requiredPositiveThreshold(raw, "uncommandedStateChangeCommandWindowMs"),
scrollJumpCommandWindowMs: requiredPositiveThreshold(raw, "scrollJumpCommandWindowMs"),
scrollJumpFromY: requiredPositiveThreshold(raw, "scrollJumpFromY"),
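The body of `requiredPositiveThreshold` is not shown in this diff. A validator in the same YAML-first spirit (required, positive, no fallback default) might look like the sketch below; `requirePositiveNumber` and its error wording are assumptions for illustration, not the repository's implementation.

```javascript
// Hypothetical stand-in for a required-positive-threshold check.
// A missing or non-positive value fails with the YAML field path instead of falling back to a default.
function requirePositiveNumber(raw, key, path = "alertThresholds") {
  const value = raw?.[key];
  if (typeof value !== "number" || !Number.isFinite(value) || value <= 0) {
    throw new Error(`${path}.${key} must be a positive number in YAML; got ${JSON.stringify(value)}`);
  }
  return value;
}
```

With a check of this shape, removing `turnElapsedSevereTimeoutSeconds` from the YAML surfaces as a configuration error naming the field, rather than silently disabling the finding.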
@@ -1383,6 +1384,9 @@ function buildFindings(samples, control, network, errors, sampleMetrics, promptN
? sampleMetrics.turnTimingNonMonotonic.filter((item) => item.metric === "recentUpdateSeconds" && item.anomaly === "jump")
: [];
if (recentUpdateSawtoothJumps.length > 0) findings.push({ id: "turn-timing-recent-update-sawtooth-jump", severity: "amber", summary: "最近更新 value jumped faster than sample interval; expected sawtooth increase-or-reset", count: recentUpdateSawtoothJumps.length, samples: recentUpdateSawtoothJumps.slice(0, 20) });
const severeTimeoutRounds = Array.isArray(sampleMetrics?.rounds) ? sampleMetrics.rounds.filter((item) => Number(item.maxTotalElapsedSeconds) > alertThresholds.turnElapsedSevereTimeoutSeconds) : [];
const severeTimeoutSamples = Array.isArray(sampleMetrics?.timeline) ? sampleMetrics.timeline.filter((item) => Number(item.totalElapsedSeconds) > alertThresholds.turnElapsedSevereTimeoutSeconds) : [];
if (severeTimeoutRounds.length > 0 || severeTimeoutSamples.length > 0) findings.push({ id: "turn-elapsed-severe-timeout", severity: "red", summary: "turn total elapsed exceeded YAML-configured severe timeout; investigate Workbench/AgentRun progress instead of treating the turn as healthy", thresholdSeconds: alertThresholds.turnElapsedSevereTimeoutSeconds, count: Math.max(severeTimeoutRounds.length, severeTimeoutSamples.length), rounds: severeTimeoutRounds.slice(0, 20), samples: severeTimeoutSamples.slice(0, 20) });
const loadingSummary = sampleMetrics?.loading?.summary || {};
const visibleLoadingSlowSeconds = alertThresholds.visibleLoadingSlowMs / 1000;
if (Number(loadingSummary.longestContinuousSeconds ?? 0) > visibleLoadingSlowSeconds) findings.push({ id: "page-loading-visible-over-budget", severity: "red", summary: "visible 加载中 stayed on screen longer than configured YAML budget; fix real loading latency instead of revealing incomplete content early", count: loadingSummary.overBudgetSegmentCount ?? loadingSummary.overFiveSecondSegmentCount ?? 1, longestContinuousSeconds: loadingSummary.longestContinuousSeconds, budgetSeconds: visibleLoadingSlowSeconds, segments: sampleMetrics.loading.segments.slice(0, 20), owners: sampleMetrics.loading.owners.slice(0, 20) });
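The restored severe-timeout check in the hunk above can be read in isolation as the following sketch. It mirrors the added lines but is wrapped in a standalone function (`severeTimeoutFinding`, a name introduced here for illustration) and omits the `summary` text; the sample data in the usage is made up.

```javascript
// Standalone restatement of the "turn-elapsed-severe-timeout" finding.
// Returns null when no round or timeline sample exceeds the YAML-configured threshold.
function severeTimeoutFinding(sampleMetrics, alertThresholds) {
  const limit = alertThresholds.turnElapsedSevereTimeoutSeconds;
  const rounds = Array.isArray(sampleMetrics?.rounds)
    ? sampleMetrics.rounds.filter((item) => Number(item.maxTotalElapsedSeconds) > limit)
    : [];
  const samples = Array.isArray(sampleMetrics?.timeline)
    ? sampleMetrics.timeline.filter((item) => Number(item.totalElapsedSeconds) > limit)
    : [];
  if (rounds.length === 0 && samples.length === 0) return null;
  return {
    id: "turn-elapsed-severe-timeout",
    severity: "red",
    thresholdSeconds: limit,
    count: Math.max(rounds.length, samples.length),
    rounds: rounds.slice(0, 20),   // cap evidence to keep the report bounded
    samples: samples.slice(0, 20),
  };
}

// Usage with made-up metrics and the 120-second threshold from the YAML hunk.
const exampleFinding = severeTimeoutFinding(
  {
    rounds: [{ maxTotalElapsedSeconds: 90 }, { maxTotalElapsedSeconds: 150 }],
    timeline: [{ totalElapsedSeconds: 120 }, { totalElapsedSeconds: 121 }],
  },
  { turnElapsedSevereTimeoutSeconds: 120 },
);
```

The comparison is strict, so a turn at exactly the threshold is not flagged, and a missing elapsed value becomes `NaN` and is ignored rather than treated as a timeout.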
@@ -2921,6 +2921,7 @@ function parseAlertThresholds(value) {
longLivedStreamOpenSlowMs: requiredPositiveThreshold(raw, "longLivedStreamOpenSlowMs"),
visibleLoadingSlowMs: requiredPositiveThreshold(raw, "visibleLoadingSlowMs"),
turnTimingSampleSlackSeconds: requiredPositiveThreshold(raw, "turnTimingSampleSlackSeconds"),
turnElapsedSevereTimeoutSeconds: requiredPositiveThreshold(raw, "turnElapsedSevereTimeoutSeconds"),
uncommandedStateChangeCommandWindowMs: requiredPositiveThreshold(raw, "uncommandedStateChangeCommandWindowMs"),
scrollJumpCommandWindowMs: requiredPositiveThreshold(raw, "scrollJumpCommandWindowMs"),
scrollJumpFromY: requiredPositiveThreshold(raw, "scrollJumpFromY"),