fix: restore web-probe severe timeout threshold
Also records instruction hygiene, YAML-first config split guidance, and Sub2API D601 recovery notes from the recovered worktree state.
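@@ -12,6 +12,7 @@ Official GitHub issue/PR reads and writes must go through `bun scripts/cli.ts gh ...` or `trans gh:
- Issue/PR bodies, comments, close/reopen actions, PR descriptions and merge closeouts default to Chinese.
- A new issue body must contain `目标合并分支: <repo branch/lane>` (target merge branch); when no merge is needed, write `目标合并分支: 不适用` (not applicable).
- Large plans, follow-up phases and independent improvement directions get a new issue; comments on an existing issue carry only short progress, evidence, blockers and links.
- For planning-type, multi-phase, architecture/API/platform-ops issues, phase one must be `P0 SPEC 先行` (P0 SPEC-first); details are in the `多阶段 Issue 与 SPEC-First` (Multi-Phase Issues and SPEC-First) section of [references/full.md](references/full.md).
- Multi-line bodies use a quoted heredoc: `--body-stdin <<'EOF'`; do not stuff long Markdown into shell arguments.
- PR merges go only through the guarded `gh pr merge`; the default `Next` step of `gh pr create` is `--merge --delete-branch`, and `--squash` is passed explicitly only after confirming that the ancestry can be discarded.
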
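@@ -42,6 +42,13 @@ CLI self-checks use `bun scripts/cli.ts check --syntax-only`, targeting the changed module
- If an independent improvement direction turns up during an investigation, first create a new issue with `gh issue create --body-stdin`, with the title and body stating the target merge branch/lane, background, plan and acceptance criteria; then add a 1-3 sentence comment on the original issue saying the work was split out, with a link to the new issue.
- Update the current issue body only when the user explicitly asks for the plan to be written back to it, or when the current issue is itself the sole dedicated plan issue; even then, keep comments short and do not copy the whole plan.

## Multi-Phase Issues and SPEC-First

- When a planning-type issue covers multi-phase implementation, cross-module architecture, new capabilities, long-lived API/data models, platform-ops capabilities or user-visible workflows, phase one must be `P0 SPEC 先行` (P0 SPEC-first) and follow the SPEC management mode of `$unidesk-oa`.
- `P0 SPEC 先行` must list in the issue body the SPEC number, SPEC document path, parent spec, related specs, implementation reference version, the completion items for the target architecture diagram/data-flow diagram/key sequence diagrams, and the rule for the source file header annotation `SPEC: <number> <short-name> <implementation-reference-version>`.
- The issue body may carry only the execution plan, phase status and evidence index; it cannot replace the long-term SPEC text under `project-management/PJ2026-01/specs/`. If stable requirements, data flows, interfaces or acceptance criteria change, update the SPEC first, then the issue phase plan.
- Until P0 is complete, code implementation, deployment, CI/CD, test additions and acceptance closeout must not be listed as executable phases; they may appear only as later P1+ phases.

---

## Auth Probe
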
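@@ -24,6 +24,7 @@ bun scripts/cli.ts platform-infra sub2api apply --target D601 --dry-run
- Secrets are reported only as object name, key name, presence, fingerprint or redacted prefix; printing a full token/key is forbidden.
- D601 is the default active target; targets such as D518/G14 follow the YAML and the target explicitly named in the issue.
- Codex pool, the unified API key, master `~/.codex` configuration, FRP/Caddy exposure and account add/remove must all go through this skill's controlled CLI.
- For a D601 public 502 or an `api.pikapython.com` anomaly, first distinguish the edge from the app endpoint, then recover layer by layer with `status`, `validate`, `apply --confirm` and `codex-pool validate`; the full steps are in the troubleshooting section of [references/full.md](references/full.md).
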
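The Secret output rule above can be sketched as a small redaction helper. This is a minimal illustration under stated assumptions, not the skill's actual CLI code: `SecretDisclosure`, `describeSecret` and `utf8ByteLength` are hypothetical names, and the sample value is a placeholder rather than a real key.

```typescript
// Hypothetical disclosure shape; the real CLI's output fields are not shown in this diff.
interface SecretDisclosure {
  readonly objectName: string;
  readonly keyName: string;
  readonly present: boolean;
  readonly byteCount: number;
  readonly redactedPrefix: string;
}

// UTF-8 byte length computed from code points, so no runtime-specific API is needed.
function utf8ByteLength(text: string): number {
  let bytes = 0;
  for (let i = 0; i < text.length; i++) {
    const cp = text.codePointAt(i) ?? 0;
    if (cp > 0xffff) i++; // skip the low surrogate of a surrogate pair
    bytes += cp <= 0x7f ? 1 : cp <= 0x7ff ? 2 : cp <= 0xffff ? 3 : 4;
  }
  return bytes;
}

// Describe a secret without ever returning its full value.
function describeSecret(
  objectName: string,
  keyName: string,
  value: string | undefined,
  visiblePrefix = 4,
): SecretDisclosure {
  if (value === undefined || value.length === 0) {
    return { objectName, keyName, present: false, byteCount: 0, redactedPrefix: "" };
  }
  // Show at most a quarter of the value, so short secrets reveal nothing.
  const shown = Math.min(visiblePrefix, Math.floor(value.length / 4));
  return {
    objectName,
    keyName,
    present: true,
    byteCount: utf8ByteLength(value),
    redactedPrefix: value.slice(0, shown) + "***",
  };
}

const disclosure = describeSecret(
  "platform-infra/sub2api-codex-pool-api-key",
  "API_KEY",
  "example-not-a-real-key", // placeholder value for illustration only
);
```

A fingerprint field would normally be added with a cryptographic hash; it is left out here so the sketch stays free of runtime-specific imports.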
## When to Read the Reference

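@@ -246,6 +246,8 @@ bun scripts/cli.ts platform-infra sub2api codex-pool configure-local --confirm

## Troubleshooting

- When `api.pikapython.com` or the D601 public exposure returns 502, first determine whether the fault is at the edge or the app endpoint: run `sub2api status --target D601` and `sub2api validate --target D601`. If `sub2api`, `sub2api-frpc`, `sub2api-redis` or `sub2api-egress-proxy` shows `0/1`, or validate reports `no endpoints available for service "sub2api"` or a terminated app Pod, first run `bun scripts/cli.ts platform-infra sub2api apply --target D601 --confirm` to reconverge the YAML resources, poll using the returned `job status`, then rerun `status`, `validate` and `codex-pool validate --target D601`. Do not start by changing the account pool, sentinel state, Secrets or Caddy.
- After the D601 quick recovery, close out with layered evidence: `https://api.pikapython.com/health` should return 200; `codex-pool validate --target D601` should prove that the internal `GET /v1/models` and the minimal `POST /v1/responses` smoke succeed; if the public OpenAI-compatible API must be proven, use `trans D601:k3s sh` to read `platform-infra/sub2api-codex-pool-api-key.API_KEY` into a temporary shell variable only, then request the public `/v1/models` and a minimal `/v1/responses` marker, printing only the HTTP status, model count and marker, never the key. Do not run `configure-local --confirm` for public validation, because it rewrites the local `~/.codex`. A 401 from the local default `auth.json` key only shows that the local configuration and the public unified key differ; it is not evidence that the service is unavailable.
- When Codex pool sentinel behavior, account freeze/restore, marker-only decisions or the probe cycle are unclear, the first step is `bun scripts/cli.ts platform-infra sub2api codex-pool sentinel-report`. This report is the primary observation surface; move on to `--raw`, CronJob logs, the state ConfigMap or the Sub2API admin UI only when the report lacks fields or lower-level evidence is needed. If you see "临时不可调度状态" (temporarily unschedulable state) together with a rule number/matched keyword, check the Sub2API `account_temp_unschedulable` log and the account's `temp_unschedulable_*` fields; the sentinel explains only active quarantine with `schedulable=false`, not this built-in temporary cooldown.
- To strengthen monitoring without letting the sentinel freeze accounts automatically, set YAML `sentinel.actions.enabled=false` and then run `codex-pool sync --confirm`. The marker probe and gateway failure monitor still record `would-freeze` / observe-only evidence but do not write `schedulable=false` through the Sub2API admin; `codex.remote_compact.failed` on `/responses/compact` and compact upstream 5xx failover are recorded only as `gateway-compact-*` observation events, not as triggers for automatic sentinel switching.
- When a single request id reports 502/503/interruption/no automatic account switch, the first step is `bun scripts/cli.ts platform-infra sub2api codex-pool trace --request-id <requestId>`. Look first at `outcome`, `reason`, `FAILOVER`, `SELECT-FAILED`, `ACCOUNT SIGNALS` and `WINDOW STATS`; add `--show-lines` or `--raw` only when the trace report lacks fields or the raw logs must be audited. `reason=failover-attempted-no-candidate` means the switch was attempted but the scheduler had no usable candidate after excluding the failed account; continue with `sentinel-report` and `validate --full` to distinguish sentinel quarantine, request-path temp-unschedulable, account status and capacity exhaustion.
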
@@ -1,6 +1,6 @@
---
name: unidesk-ymalops
description: UniDesk YAML-first ops normalization skill. Use when the user mentions ymal-first/YAML-first normalization, YAML ops, ops configuration responsibilities, platform-infra configuration refactoring, Secret sourceRef, publicExposure, target/lane/node, ops helper extraction, removing hardcoded defaults/special cases, or the historical convergence issues pikasTech/unidesk#390/#398/#401.
description: UniDesk YAML-first ops normalization skill. Use when the user mentions ymal-first/YAML-first normalization, YAML ops, splitting ops configuration responsibilities, configRef/path references, platform-infra configuration refactoring, Secret sourceRef, publicExposure, target/lane/node, ops helper extraction, removing hardcoded defaults/special cases, or the historical convergence issues pikasTech/unidesk#390/#398/#401.
---

# UniDesk YAML Ops
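@@ -26,6 +26,9 @@ description: UniDesk YAML-first ops normalization skill. When the user mentions ymal-first/
- Normalization of source code, configuration and deployment defaults to a separate `.worktree/<task>`; lightweight skill/docs/reference convergence may be done directly in the main worktree per project rules.
- YAML is the source of truth. Do not add hidden code defaults, hard numeric limits in schemas, contract tests or hardcoded test policies.
- Code validation only guarantees that fields are read and rendered correctly: types, required fields, enum key names and reference existence. Values such as version numbers, namespaces, endpoints, capacity, cooldown times and rollback windows are governed by YAML.
- Avoid a "super config". When one capability spans different responsibilities such as target/lane, runtime, scenario, prompt, report, publicExposure, Secret and CI/CD, split it by responsibility into the owning YAML; the root YAML holds only ownership and `configRefs`/path references, not all the detail.
- Cross-YAML references should use the stable form `path/to/file.yaml#object.path` or an equivalent syntax explicitly supported by the current domain parser. The parser only resolves references and validates existence/type/shape and conflicts; it does not generate hidden defaults or write the merged large object back as a new source of truth.
- CLI `plan/status` should output a redacted config reference graph: for each ref, the file, path, presence, digest hash, missing fields and the next drill-down command. Do not dump the fully expanded YAML or Secrets by default.
- Secrets may be declared only through YAML `sourceRef`/`targetKey` and delivered by the controlled CLI; reverse-deriving, decoding or backfilling local credentials from runtime Secrets, pod env, logs or database state is forbidden.
- Controlled CLI output may disclose only object names, key names, sourceRef, targetKey, missing items, fingerprints, byte counts and execution summaries; it must not print base64 payloads, decoded values, full DSNs, API keys or copy-pastable credentials.
- Do not build a new global mega-orchestrator. Prefer keeping domain CLIs and extracting shared capabilities into ops helpers; a domain CLI expresses only domain actions.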
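The `path/to/file.yaml#object.path` reference form above can be sketched as a resolver that only checks existence and never substitutes a default. This is a minimal sketch under stated assumptions: `parseConfigRef`, `resolveConfigRef`, the in-memory `Documents` map and the `config/lanes.yaml` file name are hypothetical, and the actual domain parser may differ.

```typescript
// Hypothetical parsed form of a "path/to/file.yaml#object.path" reference.
interface ConfigRef {
  readonly file: string;
  readonly objectPath: readonly string[];
}

function parseConfigRef(ref: string): ConfigRef {
  const hash = ref.indexOf("#");
  if (hash <= 0 || hash === ref.length - 1) {
    throw new Error(`invalid config ref "${ref}": expected path/to/file.yaml#object.path`);
  }
  const objectPath = ref.slice(hash + 1).split(".");
  if (objectPath.some((segment) => segment.length === 0)) {
    throw new Error(`invalid config ref "${ref}": empty path segment`);
  }
  return { file: ref.slice(0, hash), objectPath };
}

// Already-parsed YAML documents keyed by file path (assumed to be loaded elsewhere).
type Documents = Readonly<Record<string, unknown>>;

// Resolve a ref, failing with the exact missing path instead of inventing a default.
function resolveConfigRef(ref: string, documents: Documents): unknown {
  const { file, objectPath } = parseConfigRef(ref);
  if (!Object.prototype.hasOwnProperty.call(documents, file)) {
    throw new Error(`${ref}: file is not loaded`);
  }
  let node: unknown = documents[file];
  const walked: string[] = [];
  for (const segment of objectPath) {
    walked.push(segment);
    if (typeof node !== "object" || node === null || !Object.prototype.hasOwnProperty.call(node, segment)) {
      throw new Error(`${file}#${walked.join(".")}: referenced field is missing`);
    }
    node = (node as Record<string, unknown>)[segment];
  }
  return node;
}

const documents: Documents = {
  "config/lanes.yaml": { lanes: { v03: { namespace: "hwlab-v03" } } },
};
const namespace = resolveConfigRef("config/lanes.yaml#lanes.v03.namespace", documents);
```

Throwing with the walked path keeps the failure actionable: the operator sees which YAML file and field to fix, and the code never becomes a second source of truth.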
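@@ -36,13 +39,14 @@ description: UniDesk YAML-first ops normalization skill. When the user mentions ymal-first/

1. Inventory the target surface: list the YAML files, CLI entry points, helpers, Secret bindings, runtime objects and existing hardcodes involved.
2. Confirm ownership: every fact must have exactly one owning YAML; code, runtime Secrets, pod env and docs must not become configuration truth in reverse.
3. Normalize the YAML envelope: for self-owned ops configuration, first fill in the necessary structure such as `version`, `kind`, `metadata`, `defaults` and `targets/services/lane`, but do not force a wrapper onto external tool formats.
4. Migrate values: move tunables such as namespace, serviceName, secretName, endpoint, image/tag, node/lane, probe, NO_PROXY, capacity and rollback window from code into YAML.
5. Slim the parser: the parser performs only structural and type validation, hides no business policy and provides no long-term defaults. For missing items, the CLI should report the YAML path and field name.
6. Extract shared ops primitives: before adding a new service branch, prefer reusing or extending a shared helper.
7. Keep domain CLIs thin: entry points such as `platform-infra`, `server`, `gc`, `agentrun` and `hwlab` only compose YAML, helpers and execution actions; they do not duplicate low-level Kubernetes/FRP/Caddy/Secret logic.
8. Validate through the original entry point: CLI/config changes by default run only syntax checks, help/command shape, plan/dry-run or the matching sync/validate; closeout that touches the real runtime must run the original CLI entry point, without adding contract tests.
9. Bounded closeout: once an issue has frozen its phases, after finishing the current phase update only the parent issue's progress and next fixed phase; close the umbrella issue when all fixed phases are done, and do not turn candidate scan results into a new Round.
3. Split configuration responsibilities: move facts with different lifecycles or different owners into their own owning YAML and link them through root `configRefs`/path references; only fields with the same owner, the same lifecycle and the same command model belong in one object.
4. Normalize the YAML envelope: for self-owned ops configuration, first fill in the necessary structure such as `version`, `kind`, `metadata`, `defaults` and `targets/services/lane`, but do not force a wrapper onto external tool formats.
5. Migrate values: move tunables such as namespace, serviceName, secretName, endpoint, image/tag, node/lane, probe, NO_PROXY, capacity and rollback window from code into YAML.
6. Slim the parser: the parser performs only structural and type validation, hides no business policy and provides no long-term defaults. For missing items, the CLI should report the YAML path and field name; when the same fact is declared twice or references conflict, it should fail and point out the conflicting paths.
7. Extract shared ops primitives: before adding a new service branch, prefer reusing or extending a shared helper.
8. Keep domain CLIs thin: entry points such as `platform-infra`, `server`, `gc`, `agentrun` and `hwlab` only compose YAML, helpers and execution actions; they do not duplicate low-level Kubernetes/FRP/Caddy/Secret logic.
9. Validate through the original entry point: CLI/config changes by default run only syntax checks, help/command shape, plan/dry-run or the matching sync/validate; closeout that touches the real runtime must run the original CLI entry point, without adding contract tests.
10. Bounded closeout: once an issue has frozen its phases, after finishing the current phase update only the parent issue's progress and next fixed phase; close the umbrella issue when all fixed phases are done, and do not turn candidate scan results into a new Round.

## Common Refactor Targets
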
@@ -185,6 +185,7 @@ lanes:
longLivedStreamOpenSlowMs: 10000
visibleLoadingSlowMs: 10000
turnTimingSampleSlackSeconds: 3
turnElapsedSevereTimeoutSeconds: 120
uncommandedStateChangeCommandWindowMs: 10000
scrollJumpCommandWindowMs: 8000
scrollJumpFromY: 250
@@ -0,0 +1,83 @@
# Agent Instruction Hygiene

This document is the long-term reference for keeping always-loaded agent instruction files small, navigable and stable. It applies to local and remote `AGENTS.md`, `CLAUDE.md`-style aliases and any repo-level instruction file that is automatically injected into an agent context.

## Size Budget

`AGENTS.md` is an index, not a knowledge base. The hard size budget for any local or remote `AGENTS.md` is 10 KiB, measured as bytes with `wc -c AGENTS.md`.

When an `AGENTS.md` is already over 10 KiB, do not append more detailed rules to it. Split first, then add only a one-line index entry back to `AGENTS.md`.

When editing an `AGENTS.md` would push it over 10 KiB, route the new content to a skill or a `docs/reference/*.md` document and keep `AGENTS.md` as a short pointer.

If loading or printing `AGENTS.md` triggers a CLI output dump or context blow-up, treat that as a visibility bug and an instruction-hygiene bug. The fix is to split the file, not to increase output limits or ask agents to read around the dump.

## What Belongs In AGENTS.md

Keep only always-needed routing information in `AGENTS.md`:

- Project identity and source-of-truth boundaries.
- P0 one-line rules that prevent immediate damage.
- Links to the authoritative long-term reference document for each domain.
- Skill names that must be loaded for common workflows.
- Short warnings about secrets, destructive commands, target workspaces and build bans.

Do not put long examples, command transcripts, JSON output, issue timelines, architecture essays, provider-specific debugging logs or one-off incident analysis in `AGENTS.md`.

## Where Overflow Content Goes

Use this routing order when splitting content out of `AGENTS.md`:

- Reusable workflow behavior belongs in a skill `SKILL.md`, for example `$dad-dev`, `$unidesk-cicd`, `$unidesk-gh`, `$unidesk-trans`, `$unidesk-otel`, `$unidesk-webdev` or `$unidesk-ymalops`.
- Stable project constraints, workspace rules, architecture boundaries and validation criteria belong in `docs/reference/*.md`.
- CLI shape, output style, route syntax and operator ergonomics belong in `docs/reference/cli.md` unless a narrower reference already owns them.
- Deployment hygiene, fixed repo boundaries and source-of-truth rules belong in `docs/reference/devops-hygiene.md`.
- Node/lane-specific HWLAB rules belong in `docs/reference/hwlab.md` and the target repo's own reference docs.
- AgentRun source-truth and deployment-lane rules belong in `docs/reference/agentrun.md`.
- Platform-infra and YAML-first operations belong in `docs/reference/platform-infra.md` and `docs/reference/yaml-first-ops.md`.
- Process notes, temporary findings and dated investigation logs belong in GitHub issues, PR comments or process notes; they must be distilled before entering long-term reference.

If a rule is both reusable across projects and specific to UniDesk's current directories or services, put the reusable workflow in the skill and put UniDesk-specific paths, lane names and validation boundaries in `docs/reference/*.md`, then cross-reference both.

## Split Procedure

When an agent sees a local or remote `AGENTS.md` over 10 KiB:

1. Identify the detailed section being changed or expanded.
2. Move the detailed content to the owning skill or `docs/reference/*.md` document.
3. Replace the original section with one concise bullet and a link to the authoritative location.
4. Preserve P0 damage-prevention warnings in `AGENTS.md`, but compress them to one-line routing rules.
5. Do not create a single giant overflow archive as the normal solution. A temporary migration note is acceptable only if it immediately points to the domain documents that must absorb it.
6. Do not add tests, guards or preflight checks just to enforce the size budget unless the user explicitly asks. The default control is documentation hygiene plus concise review.

For large legacy files, split incrementally by domain. Each new edit should leave the touched domain smaller and better referenced than before.

## Cross-Reference Requirements

Every `AGENTS.md` index entry that points out of the file must name the authoritative target. Prefer direct paths such as `docs/reference/hwlab.md` or skill names such as `$unidesk-cicd`.

Avoid duplicated full rules between `AGENTS.md`, skills and long-term reference docs. `AGENTS.md` may summarize; the reference owns the detail. If two references conflict, update the narrower domain reference and keep only one authoritative version.

## Secrets And Output Hygiene

Instruction files must not contain secrets, full API keys, full DSNs, base64 payloads, bearer tokens, SSH private keys or copy-pastable credentials.

Do not paste large CLI output, OTel trace dumps, JSON arrays or browser transcripts into `AGENTS.md`. If a large output demonstrates a durable rule, summarize the rule and link to the issue or reference that owns the conclusion.

## Current UniDesk Routing Map

The current top-level routing map is:

- CLI behavior and output: `docs/reference/cli.md`.
- YAML-first configuration: `docs/reference/yaml-first-ops.md` and `$unidesk-ymalops`.
- Platform infrastructure: `docs/reference/platform-infra.md` and `$unidesk-sub2api` when Sub2API is involved.
- Distributed field repair: `$dad-dev` plus `docs/reference/devops-hygiene.md`.
- CI/CD and rollout: `$unidesk-cicd` plus `docs/reference/cli.md`.
- GitHub issue and PR writes: `$unidesk-gh`.
- Trans/remote patch transport: `$unidesk-trans` plus `docs/reference/cli.md`.
- Web UI, Workbench and web-probe: `$unidesk-webdev`.
- OpenTelemetry and Tempo: `$unidesk-otel` plus `docs/reference/observability.md`.
- HWLAB node/lane operation: `docs/reference/hwlab.md`.
- AgentRun: `docs/reference/agentrun.md`.
- Master/D601 development environment: `docs/reference/dev-environment.md`.
- Secretary work: `docs/reference/secretary-reference.md`.
@@ -33,7 +33,7 @@

The Workbench single projection converges AgentRun execution facts into HWLAB-owned durable Workbench facts, so that Web, CLI, REST, SSE, fake-server and browser regression all consume the same recoverable, pageable and diagnosable user-facing session facts.

The target state of this workstream is: AgentRun run, command, event and result serve only as execution-fact inputs; within HWLAB only `WorkbenchProjectionWriter` and `WorkbenchProjectionFinalizer` write Workbench facts; the cloud-api boot/background scheduler is the authority for startup/resume, responsible for scanning durable open checkpoints and running/projecting turns and bringing them back to a caught-up state; all `GET /v1/workbench/*` and compatibility read paths read only through `WorkbenchReadModel` and do not call AgentRun, Code Agent manager, trace polling, result polling or workspace repair during reads to advance facts. Workbench must satisfy 0repair: pages, GET, SSE, fake-server and web-probe must not use reload, session switching, `sessionRepair`, `realignFreshSession`, localStorage truth or read-through repair to patch already-diverged route/session/message/trace state into looking correct.

The target state of this workstream is: AgentRun run, command, event and result serve only as execution-fact inputs; within HWLAB only `WorkbenchProjectionWriter` and `WorkbenchProjectionFinalizer` write Workbench facts; the cloud-api boot/background scheduler is the authority for startup/resume, responsible for scanning durable open checkpoints and running/projecting turns and bringing them back to a caught-up state; projection advancement may be triggered only by an upstream source event/result, the projection writer/finalizer, the background scheduler/reconciler or an explicit controlled checkpoint replay/reprojection. All `GET /v1/workbench/*` and compatibility read paths read only through `WorkbenchReadModel` and do not call AgentRun, Code Agent manager, trace polling, result polling, `hydrateRealtimeGap` or workspace repair during reads to advance facts. Workbench must satisfy 0repair: pages, GET, SSE, fake-server and web-probe must not use reload, session switching, `sessionRepair`, `realignFreshSession`, localStorage truth, SSE gap repair, visibility gap refresh or read-through repair to patch already-diverged route/session/message/trace state into looking correct.

The Workbench aggregate event stream is the commit spine of the single projection described above. All admission, AgentRun event/result, cancel, replay/reprojection and diagnostic transitions are first normalized into append-only events carrying `eventSeq`, `aggregateId`, `aggregateSeq`, `turnId`, `traceId`, `sourceRunId`, `sourceCommandId` and a source idempotency key, and are then written by the same projector into message, part, turn, trace, checkpoint, outbox and read model. Web, SSE, CLI, fake-server and probe may consume only the projection results of this event stream or a cursor replay; they must not each re-order events from the trace tail, result envelope, message cache, session list and browser-local state.

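The append-only, idempotent commit order described above can be sketched in memory as follows. This is an illustration under stated assumptions rather than the HWLAB implementation: `AggregateEventStream`, `NewEvent`, the `kind` field and the sample ids are hypothetical, and a real store would persist events durably.

```typescript
// Subset of the event fields named in the spec; `kind` is a hypothetical discriminator.
interface WorkbenchEvent {
  readonly eventSeq: number;
  readonly aggregateId: string;
  readonly aggregateSeq: number;
  readonly turnId: string;
  readonly idempotencyKey: string;
  readonly kind: string;
}

type NewEvent = Omit<WorkbenchEvent, "eventSeq" | "aggregateSeq">;

class AggregateEventStream {
  private readonly events: WorkbenchEvent[] = [];
  private readonly byKey = new Map<string, WorkbenchEvent>();
  private readonly lastAggregateSeq = new Map<string, number>();

  // Append once per idempotency key; a redelivered source event returns the original commit.
  append(input: NewEvent): WorkbenchEvent {
    const existing = this.byKey.get(input.idempotencyKey);
    if (existing !== undefined) return existing;
    const aggregateSeq = (this.lastAggregateSeq.get(input.aggregateId) ?? 0) + 1;
    const event: WorkbenchEvent = { ...input, eventSeq: this.events.length + 1, aggregateSeq };
    this.events.push(event);
    this.byKey.set(input.idempotencyKey, event);
    this.lastAggregateSeq.set(input.aggregateId, aggregateSeq);
    return event;
  }

  // Cursor replay: consumers read committed order only and never re-sort on their own.
  replayAfter(cursor: number): readonly WorkbenchEvent[] {
    return this.events.filter((event) => event.eventSeq > cursor);
  }
}

const stream = new AggregateEventStream();
const first = stream.append({ aggregateId: "session-1", turnId: "turn-1", idempotencyKey: "run-1:admission", kind: "admission" });
const duplicate = stream.append({ aggregateId: "session-1", turnId: "turn-1", idempotencyKey: "run-1:admission", kind: "admission" });
const terminal = stream.append({ aggregateId: "session-1", turnId: "turn-1", idempotencyKey: "run-1:result", kind: "terminal" });
```

Because ordering is assigned once at append time, an SSE consumer that reconnects with its last `eventSeq` receives exactly the commits it missed.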
@@ -83,6 +83,7 @@ D601 v0.3 may, within the `hwlab-v03` namespace, for `hwlab-workbench-runtime` use
| Workbench aggregate event stream | The Workbench projection's own append-only event stream, carrying admission, source event/result, terminal, diagnostic, replay/reprojection and cancel transitions; it is the sole commit order for trace/timeline ordering, the SSE replay cursor and the projector revision. |
| message/part authority | The authority for messages and parts in the Workbench timeline. Every user/assistant/tool/diagnostic/final response part has a stable `messageId`, `partId`, `order`, `status` and sealed state; trace-level text, DOM rows, the result envelope or an analyzer cannot replace it in choosing the final content. |
| WorkbenchReadModel | The only read model that reads Workbench facts and assembles the session rail, session detail, message page, turn snapshot, trace event page and projection diagnostics. |
| Upstream projection advancement | Workbench facts may be advanced only by the projection writer/finalizer, the background scheduler/reconciler, the AgentRun source event/result outbox or an explicit controlled checkpoint replay/reprojection; REST GET, Web pages, SSE consumers, SSE open/error/visibility handlers, web-probe, fake-server and CLI renderers may only observe or display projections and diagnostics, and must not trigger write-side projection through gap hydration, trace/result polling, reload or read-through sync. |
| checkpoint replay/reprojection | A controlled admin entry point that replays projection logic from the persisted sourceRun/sourceCommand/checkpoint to recover from projection lag or blockage; it may call only the same finalizer/writer, is not triggered by GET, Web pages, SSE subscriptions or test helpers, and must not change the active session or route. |
| projection commit | One idempotent persisted commit by the writer/finalizer covering a set of message, part, turn, trace, session summary and checkpoint; a terminal commit must keep user-visible facts consistent. |
| terminal commit | The projection commit that marks the end of a turn; it must atomically update the assistant final text, message/part status, turn terminal, trace terminal event, session running=false, summary and SSE cursor. |
@@ -200,6 +201,8 @@ flowchart LR

The target architecture requires a clear division of labor among route/auth, adapter, projection writer/finalizer, facts store, read model, SSE publisher and compat wrapper. No route, GET handler, trace polling, result polling, workspace snapshot or front-end reducer may bypass the writer/finalizer to change Workbench facts directly.

The target architecture also requires projection advancement to happen only on the upstream write side. The SSE publisher/handler only publishes and replays durable outbox commits; Web/CLI/fake-server/SSE consumers only consume the REST/SSE projection. SSE open/error, visibility change, route hydrate, Trace detail hydration, web-probe observation, GET refresh and observer reload must not trigger `hydrateRealtimeGap`, read-through sync, result sync or trace polling to advance the projection; a gap may surface only as a projection diagnostic/blocker and is caught up by the scheduler/reconciler/finalizer or by the source event/outbox.

The target architecture also requires read-side inference to be banned entirely. `turn.status`, `message.status`, `session.running`, `trace terminal`, `finalResponse` and `projectionStatus` must be fields that the projection writer/finalizer has already written into durable facts; the read model, REST route, SSE consumer, compat wrapper, Web reducer, CLI renderer, fake-server and tests may only read and replay these fields. AgentRun facts, trace events, message parts, the result envelope, session summary, list row and workspace snapshot may serve only as writer/finalizer inputs or diagnostic fields; lifecycle facts must not be generated on the read path through priority, fallback, last event, empty text, timeout or UI heuristics.

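The ban on read-side inference can be illustrated with a read function that returns only persisted fields and reports a blocker when a lifecycle field is absent. This is a sketch under stated assumptions: `DurableTurnFacts`, `TurnView`, `readTurn`, the status values and the `projection-status-missing` reason are hypothetical and are not taken from `WorkbenchReadModel`.

```typescript
// Hypothetical lifecycle values; the real set is owned by the projection writer/finalizer.
type TurnStatus = "running" | "projecting" | "completed" | "failed";

// Durable facts as persisted by the write side; fields may be absent while projection lags.
interface DurableTurnFacts {
  readonly turnId: string;
  readonly status?: TurnStatus;
  readonly finalResponse?: string;
}

type TurnView =
  | { readonly kind: "facts"; readonly turnId: string; readonly status: TurnStatus; readonly finalResponse: string | null }
  | { readonly kind: "blocker"; readonly turnId: string; readonly reason: string };

// Read-only view: never derives `status` from other fields such as a non-empty final response.
function readTurn(facts: DurableTurnFacts): TurnView {
  if (facts.status === undefined) {
    return { kind: "blocker", turnId: facts.turnId, reason: "projection-status-missing" };
  }
  return {
    kind: "facts",
    turnId: facts.turnId,
    status: facts.status,
    finalResponse: facts.finalResponse ?? null,
  };
}

// A turn with final text but no persisted status is reported as a blocker, not as "completed".
const lagging = readTurn({ turnId: "turn-1", finalResponse: "done" });
const running = readTurn({ turnId: "turn-2", status: "running" });
```

The tempting shortcut here would be to treat a present `finalResponse` as proof of completion; surfacing a blocker instead leaves the catch-up to the write side.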
### 5.3 Target Data Flow Diagram
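@@ -276,7 +279,7 @@ sequenceDiagram
    Store-->>Read: caught-up facts or explicit lag/blocker
```

Restart recovery requires that the finalizer not rely on in-process 90s polling as its only advancement mechanism. After in-process tasks are lost, cloud-api restarts or a slow task exceeds the short polling budget, the boot/background scheduler must be able to recover, from durable checkpoints and running/projecting turns, the sourceRun/sourceCommand that still need catching up, and commit or record a blocker using the same writer logic. GET, SSE, Web pages, fake-server and probe may only observe recovery state and must not trigger recovery.

Restart recovery requires that the finalizer not rely on in-process 90s polling as its only advancement mechanism. After in-process tasks are lost, cloud-api restarts or a slow task exceeds the short polling budget, the boot/background scheduler must be able to recover, from durable checkpoints and running/projecting turns, the sourceRun/sourceCommand that still need catching up, and commit or record a blocker using the same writer logic. GET, SSE, Web pages, visibility handlers, fake-server, web-probe and CLI renderers may only observe recovery state and must not trigger recovery, gap hydration or read-through sync.

### 5.6 Durable Workbench Facts Object Model
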
@@ -155,6 +155,7 @@ export interface HwlabRuntimeWebProbeAlertThresholdsSpec {
  readonly longLivedStreamOpenSlowMs: number;
  readonly visibleLoadingSlowMs: number;
  readonly turnTimingSampleSlackSeconds: number;
  readonly turnElapsedSevereTimeoutSeconds: number;
  readonly uncommandedStateChangeCommandWindowMs: number;
  readonly scrollJumpCommandWindowMs: number;
  readonly scrollJumpFromY: number;
@@ -839,6 +840,7 @@ function webProbeAlertThresholdsConfig(value: unknown, path: string): HwlabRunti
    longLivedStreamOpenSlowMs: positiveNumberField(raw, "longLivedStreamOpenSlowMs", path),
    visibleLoadingSlowMs: positiveNumberField(raw, "visibleLoadingSlowMs", path),
    turnTimingSampleSlackSeconds: positiveNumberField(raw, "turnTimingSampleSlackSeconds", path),
    turnElapsedSevereTimeoutSeconds: positiveNumberField(raw, "turnElapsedSevereTimeoutSeconds", path),
    uncommandedStateChangeCommandWindowMs: positiveNumberField(raw, "uncommandedStateChangeCommandWindowMs", path),
    scrollJumpCommandWindowMs: positiveNumberField(raw, "scrollJumpCommandWindowMs", path),
    scrollJumpFromY: positiveNumberField(raw, "scrollJumpFromY", path),
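The body of `positiveNumberField` is not part of this diff. A minimal sketch of how such a validator could behave, assuming it rejects missing or non-positive values with the YAML path in the message and never falls back to a default, is shown below; the `lanes.v03.webProbe.alertThresholds` path is a hypothetical example.

```typescript
// Assumed behavior only: the repository's real positiveNumberField may differ.
function positiveNumberField(raw: Record<string, unknown>, field: string, path: string): number {
  const value = raw[field];
  if (value === undefined) {
    // No hidden default: a missing threshold is a YAML error, reported with its full path.
    throw new Error(`${path}.${field} is required`);
  }
  if (typeof value !== "number" || !Number.isFinite(value) || value <= 0) {
    throw new Error(`${path}.${field} must be a positive finite number`);
  }
  return value;
}

// Hypothetical YAML path, used only to show the shape of the error message.
const thresholdsPath = "lanes.v03.webProbe.alertThresholds";
const severeTimeout = positiveNumberField(
  { turnElapsedSevereTimeoutSeconds: 120 },
  "turnElapsedSevereTimeoutSeconds",
  thresholdsPath,
);
```

Failing loudly on a missing key is what makes a dropped threshold such as `turnElapsedSevereTimeoutSeconds` visible at parse time instead of silently disabling the check.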
@@ -533,6 +533,7 @@ function parseAlertThresholds(value) {
    longLivedStreamOpenSlowMs: requiredPositiveThreshold(raw, "longLivedStreamOpenSlowMs"),
    visibleLoadingSlowMs: requiredPositiveThreshold(raw, "visibleLoadingSlowMs"),
    turnTimingSampleSlackSeconds: requiredPositiveThreshold(raw, "turnTimingSampleSlackSeconds"),
    turnElapsedSevereTimeoutSeconds: requiredPositiveThreshold(raw, "turnElapsedSevereTimeoutSeconds"),
    uncommandedStateChangeCommandWindowMs: requiredPositiveThreshold(raw, "uncommandedStateChangeCommandWindowMs"),
    scrollJumpCommandWindowMs: requiredPositiveThreshold(raw, "scrollJumpCommandWindowMs"),
    scrollJumpFromY: requiredPositiveThreshold(raw, "scrollJumpFromY"),
@@ -1383,6 +1384,9 @@ function buildFindings(samples, control, network, errors, sampleMetrics, promptN
    ? sampleMetrics.turnTimingNonMonotonic.filter((item) => item.metric === "recentUpdateSeconds" && item.anomaly === "jump")
    : [];
  if (recentUpdateSawtoothJumps.length > 0) findings.push({ id: "turn-timing-recent-update-sawtooth-jump", severity: "amber", summary: "最近更新 value jumped faster than sample interval; expected sawtooth increase-or-reset", count: recentUpdateSawtoothJumps.length, samples: recentUpdateSawtoothJumps.slice(0, 20) });
  const severeTimeoutRounds = Array.isArray(sampleMetrics?.rounds) ? sampleMetrics.rounds.filter((item) => Number(item.maxTotalElapsedSeconds) > alertThresholds.turnElapsedSevereTimeoutSeconds) : [];
  const severeTimeoutSamples = Array.isArray(sampleMetrics?.timeline) ? sampleMetrics.timeline.filter((item) => Number(item.totalElapsedSeconds) > alertThresholds.turnElapsedSevereTimeoutSeconds) : [];
  if (severeTimeoutRounds.length > 0 || severeTimeoutSamples.length > 0) findings.push({ id: "turn-elapsed-severe-timeout", severity: "red", summary: "turn total elapsed exceeded YAML-configured severe timeout; investigate Workbench/AgentRun progress instead of treating the turn as healthy", thresholdSeconds: alertThresholds.turnElapsedSevereTimeoutSeconds, count: Math.max(severeTimeoutRounds.length, severeTimeoutSamples.length), rounds: severeTimeoutRounds.slice(0, 20), samples: severeTimeoutSamples.slice(0, 20) });
  const loadingSummary = sampleMetrics?.loading?.summary || {};
  const visibleLoadingSlowSeconds = alertThresholds.visibleLoadingSlowMs / 1000;
  if (Number(loadingSummary.longestContinuousSeconds ?? 0) > visibleLoadingSlowSeconds) findings.push({ id: "page-loading-visible-over-budget", severity: "red", summary: "visible 加载中 stayed on screen longer than configured YAML budget; fix real loading latency instead of revealing incomplete content early", count: loadingSummary.overBudgetSegmentCount ?? loadingSummary.overFiveSecondSegmentCount ?? 1, longestContinuousSeconds: loadingSummary.longestContinuousSeconds, budgetSeconds: visibleLoadingSlowSeconds, segments: sampleMetrics.loading.segments.slice(0, 20), owners: sampleMetrics.loading.owners.slice(0, 20) });
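The severe-timeout check restored in this hunk can be read in isolation as the following typed sketch. It mirrors the comparison and the `count` rule from the diff, but `severeTimeoutFinding`, `RoundMetric`, `TimelineSample` and the sample numbers are hypothetical, and the surrounding `buildFindings` plumbing is omitted.

```typescript
interface RoundMetric { readonly maxTotalElapsedSeconds: number }
interface TimelineSample { readonly totalElapsedSeconds: number }

interface SevereTimeoutFinding {
  readonly id: "turn-elapsed-severe-timeout";
  readonly severity: "red";
  readonly thresholdSeconds: number;
  readonly count: number;
  readonly rounds: readonly RoundMetric[];
  readonly samples: readonly TimelineSample[];
}

// Returns a red finding when any round or sample strictly exceeds the YAML threshold.
function severeTimeoutFinding(
  rounds: readonly RoundMetric[],
  timeline: readonly TimelineSample[],
  thresholdSeconds: number,
): SevereTimeoutFinding | undefined {
  const slowRounds = rounds.filter((item) => Number(item.maxTotalElapsedSeconds) > thresholdSeconds);
  const slowSamples = timeline.filter((item) => Number(item.totalElapsedSeconds) > thresholdSeconds);
  if (slowRounds.length === 0 && slowSamples.length === 0) return undefined;
  return {
    id: "turn-elapsed-severe-timeout",
    severity: "red",
    thresholdSeconds,
    // Same rule as the diff: report the larger of the two hit counts.
    count: Math.max(slowRounds.length, slowSamples.length),
    // Evidence is capped at 20 entries per list to keep the report small.
    rounds: slowRounds.slice(0, 20),
    samples: slowSamples.slice(0, 20),
  };
}

// With the YAML value of 120 seconds, a sample at exactly 120 does not trigger the finding.
const finding = severeTimeoutFinding(
  [{ maxTotalElapsedSeconds: 90 }, { maxTotalElapsedSeconds: 121 }],
  [{ totalElapsedSeconds: 120 }, { totalElapsedSeconds: 130 }, { totalElapsedSeconds: 150 }],
  120,
);
const healthy = severeTimeoutFinding([{ maxTotalElapsedSeconds: 60 }], [{ totalElapsedSeconds: 119 }], 120);
```

Passing the threshold in as a parameter keeps the numeric policy in YAML, in line with the YAML-first rule elsewhere in this commit.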
@@ -2921,6 +2921,7 @@ function parseAlertThresholds(value) {
    longLivedStreamOpenSlowMs: requiredPositiveThreshold(raw, "longLivedStreamOpenSlowMs"),
    visibleLoadingSlowMs: requiredPositiveThreshold(raw, "visibleLoadingSlowMs"),
    turnTimingSampleSlackSeconds: requiredPositiveThreshold(raw, "turnTimingSampleSlackSeconds"),
    turnElapsedSevereTimeoutSeconds: requiredPositiveThreshold(raw, "turnElapsedSevereTimeoutSeconds"),
    uncommandedStateChangeCommandWindowMs: requiredPositiveThreshold(raw, "uncommandedStateChangeCommandWindowMs"),
    scrollJumpCommandWindowMs: requiredPositiveThreshold(raw, "scrollJumpCommandWindowMs"),
    scrollJumpFromY: requiredPositiveThreshold(raw, "scrollJumpFromY"),