diff --git a/.agents/skills/unidesk-gh/SKILL.md b/.agents/skills/unidesk-gh/SKILL.md index 4245af8f..615e6b15 100644 --- a/.agents/skills/unidesk-gh/SKILL.md +++ b/.agents/skills/unidesk-gh/SKILL.md @@ -12,6 +12,7 @@ GitHub issue/PR 正式读写必须走 `bun scripts/cli.ts gh ...` 或 `trans gh: - Issue/PR 正文、评论、关闭/重开、PR 描述和 merge closeout 默认中文。 - 新 issue 正文必须包含 `目标合并分支: `;不需要合并时写 `目标合并分支: 不适用`。 - 大计划、后续阶段和独立改进方向创建新 issue;已有 issue 评论只写短进展、证据、阻塞和链接。 +- 规划型、多阶段、架构/API/平台运维类 issue 第一阶段必须 `P0 SPEC 先行`;细则见 [references/full.md](references/full.md) 的 `多阶段 Issue 与 SPEC-First`。 - 多行正文使用 quoted heredoc:`--body-stdin <<'EOF'`;不要把长 Markdown 塞进 shell 参数。 - PR merge 只走 guarded `gh pr merge`;`gh pr create` 的 Next 默认是 `--merge --delete-branch`,只有确认 ancestry 可丢弃时才显式 `--squash`。 diff --git a/.agents/skills/unidesk-gh/references/full.md b/.agents/skills/unidesk-gh/references/full.md index a4b459ab..4c9bd924 100644 --- a/.agents/skills/unidesk-gh/references/full.md +++ b/.agents/skills/unidesk-gh/references/full.md @@ -42,6 +42,13 @@ CLI 自检使用 `bun scripts/cli.ts check --syntax-only`、针对被改模块 - 如果调查中发现了独立改进方向,应先用 `gh issue create --body-stdin` 创建新 issue,标题和正文写清目标合并分支/lane、背景、计划、验收标准;然后在原 issue 评论中用 1-3 句说明已拆出,并链接到新 issue。 - 只有用户明确要求把计划写回当前 issue 正文,或当前 issue 本身就是唯一的专题计划 issue,才允许更新当前 issue 正文;即便如此,评论仍保持短小,不复制整篇计划。 +## 多阶段 Issue 与 SPEC-First + +- 形成多阶段实施、跨模块架构、新能力、长期 API/数据模型、平台运维能力或用户可见工作流的规划型 issue 时,第一阶段必须是 `P0 SPEC 先行`,并按 `$unidesk-oa` 的 SPEC 管理模式处理。 +- `P0 SPEC 先行` 必须在 issue 正文列出 SPEC 编号、SPEC 文档路径、上级规格、关联规格、实现引用版本、目标架构图/数据流图/关键时序图完成项,以及源码文件头部 `SPEC: <编号> <短名> <实现引用版本>` 标注规则。 +- issue 正文只能承载执行计划、阶段状态和证据索引,不能替代 `project-management/PJ2026-01/specs/` 中的长期 SPEC 正文。若稳定需求、数据流、接口或验收口径变化,先更新 SPEC,再更新 issue 阶段计划。 +- P0 未完成前,不得把代码实现、部署、CI/CD、测试补充或验收收口列为已可执行阶段;这些只能作为后续 P1+ 阶段。 + --- ## 认证探测 diff --git a/.agents/skills/unidesk-sub2api/SKILL.md b/.agents/skills/unidesk-sub2api/SKILL.md index 7e7a0730..edf5700e 100644 --- a/.agents/skills/unidesk-sub2api/SKILL.md +++ b/.agents/skills/unidesk-sub2api/SKILL.md @@ 
-24,10 +24,11 @@ bun scripts/cli.ts platform-infra sub2api apply --target D601 --dry-run - Secret 只输出对象名、key 名、presence、fingerprint 或 redacted prefix;禁止打印完整 token/key。 - D601 是默认 active target;D518/G14 等 target 以 YAML 和 issue 明确目标为准。 - Codex pool、统一 API key、master `~/.codex` 配置、FRP/Caddy 暴露、账号增删都必须走本技能的受控 CLI。 +- D601 public 502 或 `api.pikapython.com` 异常先区分 edge/app endpoint,并用 `status`、`validate`、`apply --confirm`、`codex-pool validate` 做分层恢复;完整步骤见 [references/full.md](references/full.md) 的排障段。 ## 何时读取 reference - 添加/删除上游、受保护账号代理、分组绑定:读 [references/full.md](references/full.md) 的账号管理段。 - 部署/状态/镜像升级/FRP 暴露:读部署、镜像、FRP 段。 -- master Codex 消费端、`/v1/models`、Codex pool 验收:读 Codex Pool 和验收口径段。 +- master Codex消费端、`/v1/models`、Codex pool 验收:读 Codex Pool 和验收口径段。 - 排障或禁止事项不确定时,读排障和禁止事项段。 diff --git a/.agents/skills/unidesk-sub2api/references/full.md b/.agents/skills/unidesk-sub2api/references/full.md index fd827d3a..476dda92 100644 --- a/.agents/skills/unidesk-sub2api/references/full.md +++ b/.agents/skills/unidesk-sub2api/references/full.md @@ -246,6 +246,8 @@ bun scripts/cli.ts platform-infra sub2api codex-pool configure-local --confirm ## 排障 +- `api.pikapython.com` 或 D601 public exposure 返回 502 时,先判定是 edge 还是 app endpoint:跑 `sub2api status --target D601` 和 `sub2api validate --target D601`。若 `sub2api`、`sub2api-frpc`、`sub2api-redis` 或 `sub2api-egress-proxy` 出现 `0/1`,或 validate 显示 `no endpoints available for service "sub2api"` / app Pod 已终止,先用 `bun scripts/cli.ts platform-infra sub2api apply --target D601 --confirm` 重新收敛 YAML 资源,按返回的 `job status` 轮询,再跑 `status`、`validate` 和 `codex-pool validate --target D601`。不要先改账号池、哨兵状态、Secret 或 Caddy。 +- D601 快速恢复完成后,用分层证据 closeout:`https://api.pikapython.com/health` 应返回 200;`codex-pool validate --target D601` 应证明内部 `GET /v1/models` 和最小 `POST /v1/responses` smoke 成功;若需要证明公网 OpenAI-compatible API,用 `trans D601:k3s sh` 从 `platform-infra/sub2api-codex-pool-api-key.API_KEY` 只读到临时 shell 变量后请求 public `/v1/models` 和最小 `/v1/responses` marker,只输出 
HTTP status、模型数量和 marker,不打印 key。不要为了公网验证运行 `configure-local --confirm`,它会重写本机 `~/.codex`;本机默认 `auth.json` key 返回 401 只能说明本机配置和公网统一 key 不一致,不能当作服务不可用证据。 - Codex pool 哨兵、账号冻结/恢复、marker-only 判断或 probe 周期看不清:第一步跑 `bun scripts/cli.ts platform-infra sub2api codex-pool sentinel-report`。这个报表是主观察面;只有报表缺字段或需要底层证据时,才继续看 `--raw`、CronJob log、state ConfigMap 或 Sub2API 管理 UI。若看到“临时不可调度状态”且包含规则序号/匹配关键词,检查 Sub2API `account_temp_unschedulable` 日志和账号 `temp_unschedulable_*` 字段;sentinel 只解释 `schedulable=false` 的 active quarantine,不解释这类内置临时冷却。 - 只加强监控、不让哨兵自动冻结账号时,把 YAML `sentinel.actions.enabled=false` 后 `codex-pool sync --confirm`。此时 marker probe 和 gateway failure monitor 仍记录 `would-freeze` / observe-only 证据,但不会通过 Sub2API admin 写 `schedulable=false`;`/responses/compact` 的 `codex.remote_compact.failed` 和 compact 上游 5xx failover 只作为 `gateway-compact-*` 观察事件记录,不作为哨兵自动切换触发器。 - 单个 request id 报 502/503/中断/没有自动切号:第一步跑 `bun scripts/cli.ts platform-infra sub2api codex-pool trace --request-id `。先看 `outcome`、`reason`、`FAILOVER`、`SELECT-FAILED`、`ACCOUNT SIGNALS` 和 `WINDOW STATS`;只有 trace 报表缺字段或需要审计原始日志时,才加 `--show-lines` 或 `--raw`。若 `reason=failover-attempted-no-candidate`,说明切号动作已发生,但 scheduler 在排除失败账号后没有可用候选;继续用 `sentinel-report` 和 `validate --full` 区分 sentinel quarantine、request-path temp-unschedulable、账号 status 或容量耗尽。 diff --git a/.agents/skills/unidesk-ymalops/SKILL.md b/.agents/skills/unidesk-ymalops/SKILL.md index 96d79ae4..02dc5bb7 100644 --- a/.agents/skills/unidesk-ymalops/SKILL.md +++ b/.agents/skills/unidesk-ymalops/SKILL.md @@ -1,6 +1,6 @@ --- name: unidesk-ymalops -description: UniDesk YAML-first 运维正规化技能。用户提到 ymal-first/YAML-first 正规化、YAML ops、运维配置职责、platform-infra 配置重构、Secret sourceRef、publicExposure、target/lane/node、ops helper 抽取、删除 hardcoded defaults/特例,或历史收敛 issue pikasTech/unidesk#390/#398/#401 时使用。 +description: UniDesk YAML-first 运维正规化技能。用户提到 ymal-first/YAML-first 正规化、YAML ops、运维配置职责拆分、configRef/path 引用、platform-infra 配置重构、Secret sourceRef、publicExposure、target/lane/node、ops 
helper 抽取、删除 hardcoded defaults/特例,或历史收敛 issue pikasTech/unidesk#390/#398/#401 时使用。 --- # UniDesk YAML Ops @@ -26,6 +26,9 @@ description: UniDesk YAML-first 运维正规化技能。用户提到 ymal-first/ - 源码、配置、部署类正规化默认在独立 `.worktree/` 中做;轻量 skill/docs/reference 收敛可按项目规则直接在主 worktree 做。 - YAML 是 source of truth。不得新增隐藏代码默认值、schema 数值硬限制、合同测试或测试硬编码策略。 - 代码校验只保证字段能被正确读取和渲染:类型、必填、枚举键名、引用存在性。版本号、namespace、endpoint、容量、冷却时间、回退窗口等数值以 YAML 为准。 +- 避免“超级配置”。当一个能力同时涉及 target/lane、runtime、scenario、prompt、report、publicExposure、Secret、CI/CD 等不同职责时,按职责拆分到 owning YAML;root YAML 只保存归属和 `configRefs`/path 引用,不承载全部细节。 +- 跨 YAML 引用应使用稳定的 `path/to/file.yaml#object.path` 或当前 domain parser 明确支持的等价语法。parser 只解析引用、校验存在性/类型/形状和冲突,不生成隐藏默认值,也不把合并后的大对象写成新的 source of truth。 +- CLI `plan/status` 应输出 redacted 配置引用图:每个 ref 的文件、path、presence、摘要 hash、缺失字段和下一步 drill-down 命令。不要默认 dump 展开后的完整 YAML 或 Secret。 - Secret 只能通过 YAML 的 `sourceRef`/`targetKey` 声明和受控 CLI 下发;禁止从运行面 Secret、pod env、日志或数据库状态反推、解码、回填本地凭据。 - 受控 CLI 输出只能披露对象名、key 名、sourceRef、targetKey、缺失项、fingerprint、字节数和执行摘要;不得打印 base64 payload、解码值、完整 DSN、API key 或可复制凭据。 - 不做新的全局大 orchestrator。优先保留 domain CLI,把公共能力抽到 ops helper,domain CLI 只表达领域动作。 @@ -36,13 +39,14 @@ description: UniDesk YAML-first 运维正规化技能。用户提到 ymal-first/ 1. 盘点目标面:列出涉及的 YAML、CLI 入口、helper、Secret 绑定、运行面对象和现有 hardcode。 2. 确认归属:每个事实必须有唯一 owning YAML;代码、运行面 Secret、pod env 和 docs 不能反向成为配置真相。 -3. 归一 YAML envelope:自有运维配置优先补齐 `version`、`kind`、`metadata`、`defaults`、`targets/services/lane` 等必要结构,但不要为外部工具格式强行套壳。 -4. 搬迁数值:把 namespace、serviceName、secretName、endpoint、image/tag、node/lane、probe、NO_PROXY、容量、回退窗口等可调项从代码迁到 YAML。 -5. 精简 parser:parser 只做结构和类型校验,不藏业务策略,不提供长期默认值。缺项应让 CLI 报出 YAML 路径和字段名。 -6. 抽公共 ops primitives:在增加新 service 分支前,优先复用或扩展公共 helper。 -7. 保持 domain CLI 薄:`platform-infra`、`server`、`gc`、`agentrun`、`hwlab` 等入口只组合 YAML、helper 和执行动作,不复制底层 Kubernetes/FRP/Caddy/Secret 逻辑。 -8. 验证原入口:CLI/config 改动默认只跑语法、help/命令形态、plan/dry-run 或对应 sync/validate;涉及真实运行面的收口要跑原 CLI 入口,不新增合同测试。 -9. 
有限收口:当 issue 已经冻结阶段,完成当前阶段后只更新父 issue 的进展和下一固定阶段;固定阶段全部完成后关闭总 issue,不把候选扫描结果转成新的 Round。 +3. 拆分配置职责:把不同生命周期或不同 owner 的事实拆到各自 owning YAML,用 root `configRefs`/path 引用串联;只有同 owner、同生命周期、同命令模型的字段才放在同一个对象中。 +4. 归一 YAML envelope:自有运维配置优先补齐 `version`、`kind`、`metadata`、`defaults`、`targets/services/lane` 等必要结构,但不要为外部工具格式强行套壳。 +5. 搬迁数值:把 namespace、serviceName、secretName、endpoint、image/tag、node/lane、probe、NO_PROXY、容量、回退窗口等可调项从代码迁到 YAML。 +6. 精简 parser:parser 只做结构和类型校验,不藏业务策略,不提供长期默认值。缺项应让 CLI 报出 YAML 路径和字段名;重复声明同一事实或引用冲突时应失败并指出冲突路径。 +7. 抽公共 ops primitives:在增加新 service 分支前,优先复用或扩展公共 helper。 +8. 保持 domain CLI 薄:`platform-infra`、`server`、`gc`、`agentrun`、`hwlab` 等入口只组合 YAML、helper 和执行动作,不复制底层 Kubernetes/FRP/Caddy/Secret 逻辑。 +9. 验证原入口:CLI/config 改动默认只跑语法、help/命令形态、plan/dry-run 或对应 sync/validate;涉及真实运行面的收口要跑原 CLI 入口,不新增合同测试。 +10. 有限收口:当 issue 已经冻结阶段,完成当前阶段后只更新父 issue 的进展和下一固定阶段;固定阶段全部完成后关闭总 issue,不把候选扫描结果转成新的 Round。 ## Common Refactor Targets diff --git a/config/hwlab-node-lanes.yaml b/config/hwlab-node-lanes.yaml index 9874e6de..3c711b7f 100644 --- a/config/hwlab-node-lanes.yaml +++ b/config/hwlab-node-lanes.yaml @@ -185,6 +185,7 @@ lanes: longLivedStreamOpenSlowMs: 10000 visibleLoadingSlowMs: 10000 turnTimingSampleSlackSeconds: 3 + turnElapsedSevereTimeoutSeconds: 120 uncommandedStateChangeCommandWindowMs: 10000 scrollJumpCommandWindowMs: 8000 scrollJumpFromY: 250 diff --git a/docs/reference/agent-instruction-hygiene.md b/docs/reference/agent-instruction-hygiene.md new file mode 100644 index 00000000..17f6fae8 --- /dev/null +++ b/docs/reference/agent-instruction-hygiene.md @@ -0,0 +1,83 @@ +# Agent Instruction Hygiene + +This document is the long-term reference for keeping always-loaded agent instruction files small, navigable and stable. It applies to local and remote `AGENTS.md`, `CLAUDE.md`-style aliases and any repo-level instruction file that is automatically injected into an agent context. + +## Size Budget + +`AGENTS.md` is an index, not a knowledge base. 
The hard size budget for any local or remote `AGENTS.md` is 10 KiB, measured as bytes with `wc -c AGENTS.md`. + +When an `AGENTS.md` is already over 10 KiB, do not append more detailed rules to it. Split first, then add only a one-line index entry back to `AGENTS.md`. + +When editing an `AGENTS.md` would push it over 10 KiB, route the new content to a skill or a `docs/reference/*.md` document and keep `AGENTS.md` as a short pointer. + +If loading or printing `AGENTS.md` triggers CLI output dump or context blow-up, treat that as a visibility bug and an instruction-hygiene bug. The fix is to split the file, not to increase output limits or ask agents to read around the dump. + +## What Belongs In AGENTS.md + +Keep only always-needed routing information in `AGENTS.md`: + +- Project identity and source-of-truth boundaries. +- P0 one-line rules that prevent immediate damage. +- Links to the authoritative long-term reference document for each domain. +- Skill names that must be loaded for common workflows. +- Short warnings about secrets, destructive commands, target workspaces and build bans. + +Do not put long examples, command transcripts, JSON output, issue timelines, architecture essays, provider-specific debugging logs or one-off incident analysis in `AGENTS.md`. + +## Where Overflow Content Goes + +Use this routing order when splitting content out of `AGENTS.md`: + +- Reusable workflow behavior belongs in a skill `SKILL.md`, for example `$dad-dev`, `$unidesk-cicd`, `$unidesk-gh`, `$unidesk-trans`, `$unidesk-otel`, `$unidesk-webdev` or `$unidesk-ymalops`. +- Stable project constraints, workspace rules, architecture boundaries and validation criteria belong in `docs/reference/*.md`. +- CLI shape, output style, route syntax and operator ergonomics belong in `docs/reference/cli.md` unless a narrower reference already owns them. +- Deployment hygiene, fixed repo boundaries and source-of-truth rules belong in `docs/reference/devops-hygiene.md`. 
+- Node/lane-specific HWLAB rules belong in `docs/reference/hwlab.md` and the target repo's own reference docs. +- AgentRun source-truth and deployment-lane rules belong in `docs/reference/agentrun.md`. +- Platform-infra and YAML-first operations belong in `docs/reference/platform-infra.md` and `docs/reference/yaml-first-ops.md`. +- Process notes, temporary findings and dated investigation logs belong in GitHub issues, PR comments or process notes; they must be distilled before entering long-term reference. + +If a rule is both reusable across projects and specific to UniDesk's current directories or services, put the reusable workflow in the skill and put UniDesk-specific paths, lane names and validation boundaries in `docs/reference/*.md`, then cross-reference both. + +## Split Procedure + +When an agent sees a local or remote `AGENTS.md` over 10 KiB: + +1. Identify the detailed section being changed or expanded. +2. Move the detailed content to the owning skill or `docs/reference/*.md` document. +3. Replace the original section with one concise bullet and a link to the authoritative location. +4. Preserve P0 damage-prevention warnings in `AGENTS.md`, but compress them to one-line routing rules. +5. Do not create a single giant overflow archive as the normal solution. A temporary migration note is acceptable only if it immediately points to the domain documents that must absorb it. +6. Do not add tests, guards or preflight checks just to enforce the size budget unless the user explicitly asks. The default control is documentation hygiene plus concise review. + +For large legacy files, split incrementally by domain. Each new edit should leave the touched domain smaller and better referenced than before. + +## Cross-Reference Requirements + +Every `AGENTS.md` index entry that points out of the file must name the authoritative target. Prefer direct paths such as `docs/reference/hwlab.md` or skill names such as `$unidesk-cicd`. 
+ +Avoid duplicated full rules between `AGENTS.md`, skills and long-term reference docs. `AGENTS.md` may summarize; the reference owns the detail. If two references conflict, update the narrower domain reference and keep only one authoritative version. + +## Secrets And Output Hygiene + +Instruction files must not contain secrets, full API keys, full DSNs, base64 payloads, bearer tokens, SSH private keys or copy-pastable credentials. + +Do not paste large CLI output, OTel trace dumps, JSON arrays or browser transcripts into `AGENTS.md`. If a large output demonstrates a durable rule, summarize the rule and link to the issue or reference that owns the conclusion. + +## Current UniDesk Routing Map + +The current top-level routing map is: + +- CLI behavior and output: `docs/reference/cli.md`. +- YAML-first configuration: `docs/reference/yaml-first-ops.md` and `$unidesk-ymalops`. +- Platform infrastructure: `docs/reference/platform-infra.md` and `$unidesk-sub2api` when Sub2API is involved. +- Distributed field repair: `$dad-dev` plus `docs/reference/devops-hygiene.md`. +- CI/CD and rollout: `$unidesk-cicd` plus `docs/reference/cli.md`. +- GitHub issue and PR writes: `$unidesk-gh`. +- Trans/remote patch transport: `$unidesk-trans` plus `docs/reference/cli.md`. +- Web UI, Workbench and web-probe: `$unidesk-webdev`. +- OpenTelemetry and Tempo: `$unidesk-otel` plus `docs/reference/observability.md`. +- HWLAB node/lane operation: `docs/reference/hwlab.md`. +- AgentRun: `docs/reference/agentrun.md`. +- Master/D601 development environment: `docs/reference/dev-environment.md`. +- Secretary work: `docs/reference/secretary-reference.md`. 
diff --git a/project-management/PJ2026-01/specs/PJ2026-0104010803-workbench-unique-projection.md b/project-management/PJ2026-01/specs/PJ2026-0104010803-workbench-unique-projection.md index 4162f5c2..f897bfe1 100644 --- a/project-management/PJ2026-01/specs/PJ2026-0104010803-workbench-unique-projection.md +++ b/project-management/PJ2026-01/specs/PJ2026-0104010803-workbench-unique-projection.md @@ -33,7 +33,7 @@ Workbench唯一投影负责把 AgentRun 执行事实收敛成 HWLAB 自有的 durable Workbench facts,使 Web、CLI、REST、SSE、fake-server 和浏览器回归都消费同一份可恢复、可分页、可诊断的用户态会话事实。 -本专项的目标状态是:AgentRun run、command、event 和 result 只作为执行事实输入;HWLAB 只有 `WorkbenchProjectionWriter`、`WorkbenchProjectionFinalizer` 写入 Workbench facts;cloud-api boot/background scheduler 是 startup/resume 的 authority,负责扫描 durable open checkpoint 和 running/projecting turn 并恢复追平;所有 `GET /v1/workbench/*` 和兼容读路径都只通过 `WorkbenchReadModel` 读取,不在读取时调用 AgentRun、Code Agent manager、trace polling、result polling 或 workspace repair 推进事实。Workbench 必须满足 0repair:页面、GET、SSE、fake-server 和 web-probe 都不得通过 reload、切换 session、`sessionRepair`、`realignFreshSession`、localStorage truth 或 read-through repair 把已经分裂的 route/session/message/trace 状态补成看起来正确。 +本专项的目标状态是:AgentRun run、command、event 和 result 只作为执行事实输入;HWLAB 只有 `WorkbenchProjectionWriter`、`WorkbenchProjectionFinalizer` 写入 Workbench facts;cloud-api boot/background scheduler 是 startup/resume 的 authority,负责扫描 durable open checkpoint 和 running/projecting turn 并恢复追平;投影推进只能由上游 source event/result、projection writer/finalizer、background scheduler/reconciler 或显式受控 checkpoint replay/reprojection 触发。所有 `GET /v1/workbench/*` 和兼容读路径都只通过 `WorkbenchReadModel` 读取,不在读取时调用 AgentRun、Code Agent manager、trace polling、result polling、`hydrateRealtimeGap` 或 workspace repair 推进事实。Workbench 必须满足 0repair:页面、GET、SSE、fake-server 和 web-probe 都不得通过 reload、切换 session、`sessionRepair`、`realignFreshSession`、localStorage truth、SSE gap repair、visibility gap refresh 或 read-through repair 把已经分裂的 route/session/message/trace 状态补成看起来正确。 Workbench 
aggregate event stream 是上述唯一投影的提交脊柱。所有 admission、AgentRun event/result、cancel、replay/reprojection 和 diagnostic transition 先归一化为带 `eventSeq`、`aggregateId`、`aggregateSeq`、`turnId`、`traceId`、`sourceRunId`、`sourceCommandId` 和来源幂等键的 append-only 事件,再由同一 projector 写入 message、part、turn、trace、checkpoint、outbox 和 read model。Web、SSE、CLI、fake-server 和 probe 只能消费该 event stream 的投影结果或 cursor replay,不能分别从 trace tail、result envelope、message cache、session list 和浏览器本地状态重新排序。 @@ -83,6 +83,7 @@ D601 v0.3 可以在 `hwlab-v03` namespace 内为 `hwlab-workbench-runtime` 使 | Workbench aggregate event stream | Workbench 投影自己的 append-only 事件流,承载 admission、source event/result、terminal、diagnostic、replay/reprojection 和 cancel transition;它是 trace/timeline 顺序、SSE replay cursor 和 projector revision 的唯一提交序。 | | message/part authority | Workbench timeline 的消息和片段权威。每个 user/assistant/tool/diagnostic/final response 片段都有稳定 `messageId`、`partId`、`order`、`status` 和 sealed 状态;trace-level 文本、DOM 行、result envelope 或 analyzer 不能替代它选择最终内容。 | | WorkbenchReadModel | 唯一读取 Workbench facts 并组装 session rail、session detail、message page、turn snapshot、trace event page 和 projection diagnostics 的读模型。 | +| 上游投影推进 | Workbench facts 只能由 projection writer/finalizer、background scheduler/reconciler、AgentRun source event/result outbox 或显式受控 checkpoint replay/reprojection 推进;REST GET、Web 页面、SSE consumer、SSE open/error/visibility handler、web-probe、fake-server 和 CLI renderer 只能观察或展示投影与诊断,不能以 gap hydration、trace/result polling、reload 或 read-through sync 触发写侧投影。 | | checkpoint replay/reprojection | 受控管理入口按已持久化 sourceRun/sourceCommand/checkpoint 重放投影逻辑,用于恢复投影 lag 或阻塞;它只能调用同一 finalizer/writer,不由 GET、Web 页面、SSE 订阅或测试 helper 触发,也不得改变 active session 或 route。 | | projection commit | writer/finalizer 对一组 message、part、turn、trace、session summary 和 checkpoint 的一次幂等持久化提交;terminal commit 必须保持用户可见事实一致。 | | terminal commit | 标记同一 turn 结束的 projection commit,必须原子更新 assistant final text、message/part status、turn terminal、trace terminal event、session 
running=false、summary 和 SSE cursor。 | @@ -200,6 +201,8 @@ flowchart LR 目标架构要求 route/auth、adapter、projection writer/finalizer、facts store、read model、SSE publisher 和 compat wrapper 分工清晰。任何 route、GET handler、trace polling、result polling、workspace snapshot 或 front-end reducer 都不能绕过 writer/finalizer 直接改变 Workbench facts。 +目标架构还要求投影推进只发生在上游写侧。SSE publisher/handler 只发布和 replay durable outbox commit;Web/CLI/fake-server/SSE consumer 只消费 REST/SSE projection。SSE open/error、visibility change、route hydrate、Trace detail hydration、web-probe 观察、GET refresh 和 observer reload 不能触发 `hydrateRealtimeGap`、read-through sync、result sync 或 trace polling 来推动 projection;缺口只能表现为 projection diagnostic/blocker,并由 scheduler/reconciler/finalizer 或 source event/outbox 追平。 + 目标架构还要求彻底禁止读侧推理。`turn.status`、`message.status`、`session.running`、`trace terminal`、`finalResponse` 和 `projectionStatus` 必须是 projection writer/finalizer 已经写入 durable facts 的字段;read model、REST route、SSE consumer、compat wrapper、Web reducer、CLI renderer、fake-server 和测试只能读取和重放这些字段。AgentRun facts、trace events、message parts、result envelope、session summary、list row 和 workspace snapshot 只能作为 writer/finalizer 输入或诊断字段,不得在读取链路中通过优先级、fallback、最后事件、空文本、超时或 UI heuristic 生成生命周期事实。 ### 5.3 目标数据流图 @@ -276,7 +279,7 @@ sequenceDiagram Store-->>Read: caught-up facts or explicit lag/blocker ``` -重启恢复要求 finalizer 不依赖进程内 90s 轮询作为唯一推进机制。进程内任务丢失、cloud-api 重启或慢任务超过短轮询预算后,boot/background scheduler 必须能从 durable checkpoint 和 running/projecting turn 找回需要追平的 sourceRun/sourceCommand,并以同一 writer 逻辑提交或记录 blocker。GET、SSE、Web 页面、fake-server 和 probe 只能观察恢复状态,不能触发恢复。 +重启恢复要求 finalizer 不依赖进程内 90s 轮询作为唯一推进机制。进程内任务丢失、cloud-api 重启或慢任务超过短轮询预算后,boot/background scheduler 必须能从 durable checkpoint 和 running/projecting turn 找回需要追平的 sourceRun/sourceCommand,并以同一 writer 逻辑提交或记录 blocker。GET、SSE、Web 页面、visibility handler、fake-server、web-probe 和 CLI renderer 只能观察恢复状态,不能触发恢复、gap hydration 或 read-through sync。 ### 5.6 durable Workbench facts 对象模型 diff --git 
a/scripts/src/hwlab-node-lanes.ts b/scripts/src/hwlab-node-lanes.ts index 342ab246..9d19017f 100644 --- a/scripts/src/hwlab-node-lanes.ts +++ b/scripts/src/hwlab-node-lanes.ts @@ -155,6 +155,7 @@ export interface HwlabRuntimeWebProbeAlertThresholdsSpec { readonly longLivedStreamOpenSlowMs: number; readonly visibleLoadingSlowMs: number; readonly turnTimingSampleSlackSeconds: number; + readonly turnElapsedSevereTimeoutSeconds: number; readonly uncommandedStateChangeCommandWindowMs: number; readonly scrollJumpCommandWindowMs: number; readonly scrollJumpFromY: number; @@ -839,6 +840,7 @@ function webProbeAlertThresholdsConfig(value: unknown, path: string): HwlabRunti longLivedStreamOpenSlowMs: positiveNumberField(raw, "longLivedStreamOpenSlowMs", path), visibleLoadingSlowMs: positiveNumberField(raw, "visibleLoadingSlowMs", path), turnTimingSampleSlackSeconds: positiveNumberField(raw, "turnTimingSampleSlackSeconds", path), + turnElapsedSevereTimeoutSeconds: positiveNumberField(raw, "turnElapsedSevereTimeoutSeconds", path), uncommandedStateChangeCommandWindowMs: positiveNumberField(raw, "uncommandedStateChangeCommandWindowMs", path), scrollJumpCommandWindowMs: positiveNumberField(raw, "scrollJumpCommandWindowMs", path), scrollJumpFromY: positiveNumberField(raw, "scrollJumpFromY", path), diff --git a/scripts/src/hwlab-node-web-observe-analyzer-source.ts b/scripts/src/hwlab-node-web-observe-analyzer-source.ts index 04d5f9e6..81dc6b11 100644 --- a/scripts/src/hwlab-node-web-observe-analyzer-source.ts +++ b/scripts/src/hwlab-node-web-observe-analyzer-source.ts @@ -533,6 +533,7 @@ function parseAlertThresholds(value) { longLivedStreamOpenSlowMs: requiredPositiveThreshold(raw, "longLivedStreamOpenSlowMs"), visibleLoadingSlowMs: requiredPositiveThreshold(raw, "visibleLoadingSlowMs"), turnTimingSampleSlackSeconds: requiredPositiveThreshold(raw, "turnTimingSampleSlackSeconds"), + turnElapsedSevereTimeoutSeconds: requiredPositiveThreshold(raw, "turnElapsedSevereTimeoutSeconds"), 
uncommandedStateChangeCommandWindowMs: requiredPositiveThreshold(raw, "uncommandedStateChangeCommandWindowMs"), scrollJumpCommandWindowMs: requiredPositiveThreshold(raw, "scrollJumpCommandWindowMs"), scrollJumpFromY: requiredPositiveThreshold(raw, "scrollJumpFromY"), @@ -1383,6 +1384,9 @@ function buildFindings(samples, control, network, errors, sampleMetrics, promptN ? sampleMetrics.turnTimingNonMonotonic.filter((item) => item.metric === "recentUpdateSeconds" && item.anomaly === "jump") : []; if (recentUpdateSawtoothJumps.length > 0) findings.push({ id: "turn-timing-recent-update-sawtooth-jump", severity: "amber", summary: "最近更新 value jumped faster than sample interval; expected sawtooth increase-or-reset", count: recentUpdateSawtoothJumps.length, samples: recentUpdateSawtoothJumps.slice(0, 20) }); + const severeTimeoutRounds = Array.isArray(sampleMetrics?.rounds) ? sampleMetrics.rounds.filter((item) => Number(item.maxTotalElapsedSeconds) > alertThresholds.turnElapsedSevereTimeoutSeconds) : []; + const severeTimeoutSamples = Array.isArray(sampleMetrics?.timeline) ? sampleMetrics.timeline.filter((item) => Number(item.totalElapsedSeconds) > alertThresholds.turnElapsedSevereTimeoutSeconds) : []; + if (severeTimeoutRounds.length > 0 || severeTimeoutSamples.length > 0) findings.push({ id: "turn-elapsed-severe-timeout", severity: "red", summary: "turn total elapsed exceeded YAML-configured severe timeout; investigate Workbench/AgentRun progress instead of treating the turn as healthy", thresholdSeconds: alertThresholds.turnElapsedSevereTimeoutSeconds, count: Math.max(severeTimeoutRounds.length, severeTimeoutSamples.length), rounds: severeTimeoutRounds.slice(0, 20), samples: severeTimeoutSamples.slice(0, 20) }); const loadingSummary = sampleMetrics?.loading?.summary || {}; const visibleLoadingSlowSeconds = alertThresholds.visibleLoadingSlowMs / 1000; if (Number(loadingSummary.longestContinuousSeconds ?? 
0) > visibleLoadingSlowSeconds) findings.push({ id: "page-loading-visible-over-budget", severity: "red", summary: "visible 加载中 stayed on screen longer than configured YAML budget; fix real loading latency instead of revealing incomplete content early", count: loadingSummary.overBudgetSegmentCount ?? loadingSummary.overFiveSecondSegmentCount ?? 1, longestContinuousSeconds: loadingSummary.longestContinuousSeconds, budgetSeconds: visibleLoadingSlowSeconds, segments: sampleMetrics.loading.segments.slice(0, 20), owners: sampleMetrics.loading.owners.slice(0, 20) }); diff --git a/scripts/src/hwlab-node-web-observe-runner-source.ts b/scripts/src/hwlab-node-web-observe-runner-source.ts index e930f862..089cfedc 100644 --- a/scripts/src/hwlab-node-web-observe-runner-source.ts +++ b/scripts/src/hwlab-node-web-observe-runner-source.ts @@ -2921,6 +2921,7 @@ function parseAlertThresholds(value) { longLivedStreamOpenSlowMs: requiredPositiveThreshold(raw, "longLivedStreamOpenSlowMs"), visibleLoadingSlowMs: requiredPositiveThreshold(raw, "visibleLoadingSlowMs"), turnTimingSampleSlackSeconds: requiredPositiveThreshold(raw, "turnTimingSampleSlackSeconds"), + turnElapsedSevereTimeoutSeconds: requiredPositiveThreshold(raw, "turnElapsedSevereTimeoutSeconds"), uncommandedStateChangeCommandWindowMs: requiredPositiveThreshold(raw, "uncommandedStateChangeCommandWindowMs"), scrollJumpCommandWindowMs: requiredPositiveThreshold(raw, "scrollJumpCommandWindowMs"), scrollJumpFromY: requiredPositiveThreshold(raw, "scrollJumpFromY"),
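Reviewer note: the `turnElapsedSevereTimeoutSeconds` change above threads one YAML value through the lane config parser, both `parseAlertThresholds` copies, and a new `turn-elapsed-severe-timeout` finding. The following is a minimal standalone sketch of that flow, not the repository's implementation. The interfaces `RoundMetric`, `TimelineSample`, `SampleMetrics` and `Finding`, and the helpers `requiredPositive` and `severeTimeoutFinding`, are hypothetical names introduced for illustration; only the field names, the finding id, the severity, the strict `>` comparison and the `slice(0, 20)` caps are taken from the diff.

```typescript
// Hypothetical, simplified shapes; the analyzer's real types are not shown in the diff.
interface RoundMetric { readonly id: string; readonly maxTotalElapsedSeconds: number; }
interface TimelineSample { readonly at: string; readonly totalElapsedSeconds: number; }
interface SampleMetrics {
  readonly rounds?: readonly RoundMetric[];
  readonly timeline?: readonly TimelineSample[];
}
interface Finding {
  readonly id: string;
  readonly severity: "red" | "amber";
  readonly thresholdSeconds: number;
  readonly count: number;
  readonly rounds: readonly RoundMetric[];
  readonly samples: readonly TimelineSample[];
}

// Assumed stand-in for the diff's required-positive helpers: the value must come
// from YAML, so a missing or non-positive field fails instead of getting a default.
function requiredPositive(raw: Record<string, unknown>, key: string): number {
  const value = raw[key];
  if (typeof value !== "number" || !Number.isFinite(value) || value <= 0) {
    throw new Error(`alertThresholds.${key} must be a positive number`);
  }
  return value;
}

// Sketch of the new finding: flag rounds or timeline samples whose total elapsed
// time is strictly greater than the configured threshold.
function severeTimeoutFinding(metrics: SampleMetrics, thresholdSeconds: number): Finding | null {
  const rounds = (metrics.rounds ?? []).filter((r) => r.maxTotalElapsedSeconds > thresholdSeconds);
  const samples = (metrics.timeline ?? []).filter((s) => s.totalElapsedSeconds > thresholdSeconds);
  if (rounds.length === 0 && samples.length === 0) return null;
  return {
    id: "turn-elapsed-severe-timeout",
    severity: "red",
    thresholdSeconds,
    // As in the diff, count is the larger of the two match lists, not their sum.
    count: Math.max(rounds.length, samples.length),
    rounds: rounds.slice(0, 20),
    samples: samples.slice(0, 20),
  };
}
```

Two behaviors are worth confirming in review. A turn that sits exactly at the threshold (120 s in `config/hwlab-node-lanes.yaml`) does not raise the finding, and any lane YAML that omits the new field fails parsing rather than silently falling back to a code default.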