
Codex 43c47d3fa9 fix: improve liveness visibility for long turns

2026-06-10 12:05:21 +08:00


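v0.1 agentrun-mgr Service Specification

agentrun-mgr is the long-running management service of AgentRun v0.1. It is the authority for the public RESTful API, durable facts, the tenant policy boundary, runner claims, event appends, and terminal status; business clients, the CLI, and runners must not bypass it to write to Postgres directly.

Responsibilities within the system

Provide the Manager public API: create and query runs, submit commands, read events with pagination, and query backend capabilities.
Provide the manual dispatch API: explicitly create a Kubernetes runner Job for an existing run/command and return quickly with the job identity, attempt, and polling entry point.
Provide the Runner private API: runner register, claim run, lease heartbeat, poll commands, append events, ack command, and report status.
Provide the Provider Profile management API: server-side querying of profile status, writing API keys, updating Secret/config, and triggering canaries; the detailed contract is in spec-v01-provider-profile-management.md.
Validate and persist tenantId, projectId, workspaceRef, providerId, backendProfile, executionPolicy, and traceSink.
Enforce a minimal tenant policy boundary: only schema, allowlist, idempotency, secret scope, and executionPolicy range checks; UniDesk/HWLAB business authorization is not built in.
Use Postgres to store runs, commands, events, runners, backends, leases, and the migration ledger.
Emit structured health/readiness, failureKind, redacted SecretRef, and trace correlation.
Observability may only locate and verify state; it cannot substitute for a missing capability. If the final reply, command result, multi-turn runner, SessionRef, or cancel capability required by the HWLAB canary is missing, the manager must add the durable API/state machine rather than only patching trace text.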

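Internal architecture

agentrun-mgr runs in the agentrun-v01 namespace; the long-running Deployment/Service is named agentrun-mgr. For v0.1, in-house service implementations prefer Bun + TypeScript; the specific HTTP framework is not a spec boundary, but the source must remain runnable, testable, and packageable with Bun. The service must maintain the following boundaries:

The HTTP JSON API is the stable cross-service boundary; v0.1 does not rely on SSE, WebSocket, long-polling, or long synchronous turn requests as required capabilities.
The Postgres adapter is the only durable store adapter; file, sqlite, JSONL, or in-memory state may be used only for self-tests.
Migrations must complete before readiness or fail fast explicitly; the service must not start silently with an empty schema.
Plaintext provider credentials, Codex auth/config, and Postgres DSNs must not enter the database, events, traces, logs, or CLI output.
The Manager may store a SecretRef and a credential source reference, but must not read a provider Secret value and store it in the database.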
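
The rule that a plaintext DSN must never reach logs or diagnostics can be illustrated with a small sketch. The helper name `redactDsn` and the `***` placeholder are assumptions made for illustration; the spec does not prescribe how redaction is implemented.

```typescript
// Hypothetical helper (not taken from the agentrun-mgr source): masks the
// password part of a DSN before it is logged or returned by a health endpoint.
function redactDsn(dsn: string): string {
  // Matches "scheme://user:password@" and rewrites only the password segment.
  return dsn.replace(/^([a-z][a-z0-9+.-]*:\/\/[^:@\/]+):[^@]*@/i, "$1:***@");
}

// A DSN without credentials is returned unchanged.
console.log(redactDsn("postgres://agentrun:s3cret@db.internal:5432/agentrun"));
console.log(redactDsn("postgres://db.internal:5432/agentrun"));
```

A regex is used here only to keep the sketch dependency-free; a real implementation would probably redact at the structured-logging layer so that no call site can forget it.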

API reference

Public API scope for v0.1:

GET  /health
GET  /health/live
GET  /health/readiness
POST /api/v1/runs
GET  /api/v1/runs/:runId
GET  /api/v1/runs/:runId/events?afterSeq=0&limit=100
GET  /api/v1/runs/:runId/result?commandId=<commandId>
POST /api/v1/runs/:runId/cancel
POST /api/v1/runs/:runId/commands
GET  /api/v1/runs/:runId/commands/:commandId
GET  /api/v1/runs/:runId/commands/:commandId/result
POST /api/v1/runs/:runId/runner-jobs
GET  /api/v1/runs/:runId/runner-jobs?commandId=<commandId>
GET  /api/v1/runs/:runId/runner-jobs/:runnerJobId
POST /api/v1/commands/:commandId/cancel
GET  /api/v1/sessions?state=default&readerId=cli&backendProfile=<profile>&cursor=<cursor>&limit=50
GET  /api/v1/sessions/:sessionId?readerId=cli
GET  /api/v1/sessions/:sessionId/trace?afterSeq=0&limit=100&runId=<runId>
GET  /api/v1/sessions/:sessionId/output?afterSeq=0&limit=100&runId=<runId>
POST /api/v1/sessions/:sessionId/read
POST /api/v1/sessions/:sessionId/control
GET  /api/v1/backends
GET  /api/v1/provider-profiles
GET  /api/v1/provider-profiles/:profile
DELETE /api/v1/provider-profiles/:profile
PUT  /api/v1/provider-profiles/:profile/credential
POST /api/v1/provider-profiles/:profile/validate
GET  /api/v1/provider-profiles/:profile/validations/:validationId

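The Session API is a lightweight control plane for asynchronous subagents. state=default must return only running and unread sessions; only state=all returns historical read sessions. After a command/run reaches terminal, the owning session's projection must enter executionState=terminal and bump its version so that readers who have not yet read it see it in ps/default; after POST /read writes the reader cursor, the session no longer appears in that reader's default list. trace/output only page through the events of the owning run and do not proxy the Queue summary; control action=cancel cancels the active command or active run.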
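
The default-list semantics can be sketched as a pure filter. This is a minimal illustration, assuming the reader cursor is stored as the projection version last acknowledged by POST /read; the `SessionProjection` shape and the `listSessions` name are hypothetical.

```typescript
// Hypothetical projection shape; only executionState and the version bump are
// named in the spec, the remaining names are assumptions.
interface SessionProjection {
  sessionId: string;
  executionState: "running" | "terminal";
  version: number; // bumped when a command/run reaches terminal
}

// readVersions maps sessionId to the version recorded by POST /read for one reader.
function listSessions(
  sessions: SessionProjection[],
  readVersions: Map<string, number>,
  state: "default" | "all",
): SessionProjection[] {
  if (state === "all") return sessions;
  return sessions.filter((s) => {
    if (s.executionState === "running") return true;
    const readVersion = readVersions.get(s.sessionId) ?? -1;
    return s.version > readVersion; // terminal and not yet read by this reader
  });
}
```

Because the comparison is per reader, one reader marking a session as read does not hide it from another reader's default list.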

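The manual dispatch API target for the HWLAB v0.2 canary is described in spec-v01-hwlab-manual-dispatch.md. runner-jobs only explicitly starts the runner Job for the current run/command; it does not scan the pending queue and does not wait for a full model turn. An automatic scheduler remains a deferred capability. The follow-up durable cancel API must connect to the same run/command state machine; HWLAB must not delete Kubernetes Jobs directly as its formal cancel semantics.

The Provider Profile management API is a server-side delegation surface, not a browser-facing user API. HWLAB hwlab-cloud-api calls these endpoints after completing its own authentication and authorization; AgentRun only validates the caller origin, profile allowlist, SecretRef scope, payload schema, and redaction, and does not read or evaluate the HWLAB Web session, user roles, or OpenFGA relations.

Runner private API scope for v0.1: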

POST  /api/v1/runners/register
POST  /api/v1/runs/:runId/claim
PATCH /api/v1/runs/:runId/lease
GET   /api/v1/runs/:runId/commands?afterSeq=0&limit=20
POST  /api/v1/runs/:runId/events
PATCH /api/v1/runs/:runId/status
POST  /api/v1/commands/:commandId/ack
PATCH /api/v1/commands/:commandId/status

All API success and failure responses must be JSON. A failure response contains at least failureKind, message, and trace correlation; an empty stdout or an empty response must never be mistaken for success.
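
A client-side guard shows what "empty response is never success" can look like in practice. This is a sketch under stated assumptions: the `traceId` field name, the `parseApiBody` helper, and the choice of `infra-failed` / `schema-invalid` for unparseable bodies are illustrative, not mandated by the spec.

```typescript
type ApiResult =
  | { ok: true; data: Record<string, unknown> }
  | { ok: false; failureKind: string; message: string; traceId: string };

// Hypothetical guard: any body that is empty, not JSON, or not a JSON object
// is reported as a structured failure instead of being treated as success.
function parseApiBody(body: string, traceId: string): ApiResult {
  if (body.trim() === "") {
    return { ok: false, failureKind: "infra-failed", message: "empty response body", traceId };
  }
  let parsed: unknown;
  try {
    parsed = JSON.parse(body);
  } catch {
    return { ok: false, failureKind: "schema-invalid", message: "response is not JSON", traceId };
  }
  if (typeof parsed !== "object" || parsed === null || Array.isArray(parsed)) {
    return { ok: false, failureKind: "schema-invalid", message: "response is not a JSON object", traceId };
  }
  const record = parsed as Record<string, unknown>;
  if (typeof record.failureKind === "string") {
    const message = typeof record.message === "string" ? record.message : "";
    return { ok: false, failureKind: record.failureKind, message, traceId };
  }
  return { ok: true, data: record };
}
```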

v0.1.1 Session state storage API

After P1 SessionRef persistence is upgraded to "per-session RWO PVC direct mount", the manager must provide the following controlled APIs to manage the session PVC lifecycle:

POST   /api/v1/sessions                                 # create session + synchronously create PVC
GET    /api/v1/sessions/:id/storage                     # query PVC summary (content is not returned)
DELETE /api/v1/sessions/:id/storage                     # delete PVC + mark storage_kind=evicted
POST   /api/v1/sessions/:id/storage/refresh             # runner reports PVC summary

Boundaries:

POST /api/v1/sessions synchronously creates the agentrun-v01-session-<sessionId> PVC (1Gi RWO, StorageClass taken from AGENTRUN_SESSION_STORAGE_CLASS), then returns the session record.
GET /api/v1/sessions/:id/storage returns only the summary fields pvcName / pvcPhase / storage_size_bytes / storage_files_count / storage_sha256 / storage_updated_at.
DELETE /api/v1/sessions/:id/storage deletes the PVC and writes back storage_kind='evicted' and storage_evicted_at.
POST /api/v1/sessions/:id/storage/refresh writes the runner-reported PVC summary back to the storage_* columns of the agentrun_sessions table.
The GC loop (default 5 min, adjustable via AGENTRUN_SESSION_GC_INTERVAL_MS) deletes the PVC of any session whose expires_at has passed and that has no active run.
Before rendering a run, the manager gets the PVC: if it does not exist and storage_kind='pvc', the manager short-circuits with session-store-evicted and does not create a runner Job.
The mgr ServiceAccount (SA) must have persistentvolumeclaims: [create, get, list, watch, delete] permissions (RBAC is supplied by the deploy template).
The failureKind matrix gains session-store-evicted, used only when AGENTRUN_SESSION_PVC_NAME is set and codex reports no rollout found for thread id; other thread/resume failures follow T7 and use thread-resume-failed.
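
The pre-render check and the GC eligibility rule above can be sketched as two pure decisions. The function names and the `PreRenderDecision` shape are hypothetical; the actual Kubernetes lookup that produces `pvcExists` is left out on purpose.

```typescript
type StorageKind = "pvc" | "evicted";

// PVC naming follows the agentrun-v01-session-<sessionId> pattern from the spec.
function sessionPvcName(sessionId: string): string {
  return `agentrun-v01-session-${sessionId}`;
}

type PreRenderDecision =
  | { action: "create-runner-job"; pvcName: string }
  | { action: "short-circuit"; failureKind: "session-store-evicted"; pvcName: string };

// Decision taken before a runner Job is rendered. Only the combination named in
// the spec (PVC missing while storage_kind='pvc') short-circuits; every other
// combination is passed through unchanged in this sketch.
function decideBeforeRender(sessionId: string, storageKind: StorageKind, pvcExists: boolean): PreRenderDecision {
  const pvcName = sessionPvcName(sessionId);
  if (!pvcExists && storageKind === "pvc") {
    return { action: "short-circuit", failureKind: "session-store-evicted", pvcName };
  }
  return { action: "create-runner-job", pvcName };
}

// GC eligibility: expires_at has passed and the session has no active run.
function isGcEligible(expiresAtMs: number, hasActiveRun: boolean, nowMs: number): boolean {
  return expiresAtMs < nowMs && !hasActiveRun;
}
```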

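HWLAB v0.2 baseline adoption

The Manager takes over only the generic execution facts of the HWLAB v0.2 Code Agent; it does not take over HWLAB user authentication, HWPOD authorization, or the Workbench schema. The full table of capabilities absorbed from HWLAB is in spec-v01-hwlab-manual-dispatch.md. This service must fix the following capabilities as AgentRun's own contract:

Existing HWLAB capability	Manager contract	Out of scope for the Manager
`/v1/agent/chat` short-connection submit followed by background execution	`POST /api/v1/runs`, `POST /api/v1/runs/:runId/commands`, and `POST /api/v1/runs/:runId/runner-jobs` all return JSON quickly and persist the run/command/job identity	HWLAB HTTP routes, browser same-origin proxy, user login state
Polling result to determine the terminal state	`GET /api/v1/runs/:runId/result` and the command result must aggregate terminal status, reply, failureKind, blocker, event cursor, and attempt	HWLAB result schema and user-visible copy
Trace polling/replay	`events` are append-only durable facts; `seq` is monotonic within a run, and pagination via `afterSeq/limit` is supported	Workbench trace UI, HWLAB event display grouping
Cancel the current turn	run/command cancel is idempotent; a pending cancel blocks new runner jobs, a running cancel converges through runner/backend interrupt, and after terminal the current terminal state is returned	HWLAB cancel authorization by owner/admin
provider/backend failure classification	The manager stores and returns failureKind uniformly; missing Secret, provider auth failure, provider unavailable, backend failure, infra failure, and cancelled must be distinguishable	HWLAB business blocker classification or device authorization failures
runner/job location evidence	The runner job creation response and later queries can answer at least `attemptId`, `jobName`, namespace, runnerId, pod/log identity, and the current terminal summary	Exposing Kubernetes control directly to business clients

Run and Command contract

POST /api/v1/runs must persist the following fields:

Field	v0.1 rule
`tenantId`	Required; only allowlist/schema validation is applied; default candidates are `unidesk` and `hwlab`.
`projectId`	Required, for example `pikasTech/unidesk` or `pikasTech/HWLAB`.
`workspaceRef`	Required; describes the source/worktree/workspace and is not guessed by the runner.
`providerId`	Required, for example `G14` or `D601`; identifies only the target provider and grants no business permission directly.
`backendProfile`	Required; must be a lowercase slug. `codex`, `deepseek`, `minimax-m3`, and `dsflash-go` are built-in profiles; dynamic slugs take effect through the `agentrun-v01-provider-<profile>` SecretRef. Currently these profiles share the Codex stdio backend kind.
`executionPolicy`	Required, or defaults are filled in explicitly by the manager; contains at least sandbox, approval, timeout, network, and secretScope.
`traceSink`	The field must be present; it may be `null` or an explicit sink.

POST /api/v1/runs/:runId/commands must support an idempotency key. The same key with the same payload hash should return the existing command; the same key with a different payload hash must fail with a structured error. type=turn is an ordinary conversation command. type=steer is an in-flight steering command aimed at the active turn of the same run; its payload must contain a non-empty prompt, message, or text, and an ordinary runner poll must not execute it as a new turn. type=interrupt retains only durable command semantics; business cancel remains governed by the run/command cancel API.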
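
The idempotency rule can be sketched with an in-memory store. Everything here is an illustration under assumptions: the canonical JSON string stands in for the payload hash, `idempotency-conflict` is a hypothetical failureKind value, and the real service would keep this state in Postgres rather than in a Map.

```typescript
type Json = null | boolean | number | string | Json[] | { [key: string]: Json };

// Key-order-independent serialization, used here in place of a real hash.
function canonicalJson(value: Json): string {
  if (value === null || typeof value !== "object") return JSON.stringify(value);
  if (Array.isArray(value)) return `[${value.map((v) => canonicalJson(v)).join(",")}]`;
  const keys = Object.keys(value).sort();
  return `{${keys.map((k) => `${JSON.stringify(k)}:${canonicalJson(value[k] ?? null)}`).join(",")}}`;
}

interface CommandRecord { commandId: string; type: string; payloadHash: string }

type SubmitResult =
  | { ok: true; command: CommandRecord; created: boolean }
  | { ok: false; failureKind: "idempotency-conflict"; message: string };

class CommandStore {
  private byKey = new Map<string, CommandRecord>();

  submit(runId: string, idempotencyKey: string, type: string, payload: Json): SubmitResult {
    const key = `${runId}:${idempotencyKey}`;
    const payloadHash = canonicalJson({ type, payload });
    const existing = this.byKey.get(key);
    if (existing !== undefined) {
      // Same key and same payload: return the existing command, create nothing.
      if (existing.payloadHash === payloadHash) return { ok: true, command: existing, created: false };
      return { ok: false, failureKind: "idempotency-conflict", message: "same idempotency key, different payload" };
    }
    const command: CommandRecord = { commandId: `cmd-${this.byKey.size + 1}`, type, payloadHash };
    this.byKey.set(key, command);
    return { ok: true, command, created: true };
  }
}
```

Scoping the key by run keeps two runs from colliding on a client-generated key; whether the real service scopes it this way is not stated in the spec.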

Tenant Policy Boundary

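v0.1 does not implement a standalone policy engine. The Manager only performs basic boundary enforcement:

Validate that tenant/project/provider/backendProfile fall within the v0.1 allowed range. backendProfile currently allows built-in profiles and dynamic profiles that match the lowercase slug rule; a dynamic profile must have a matching provider credential SecretRef and cannot fall back to another profile.
Validate that workspaceRef is well-formed and consistent with the tenant request; do not decide on the tenant's behalf whether a repo operation is business-authorized.
Validate that executionPolicy does not widen sandbox, network, approval, timeout, or secretScope.
Validate that secretScope references only SecretRefs allowed in spec-v01-secret-distribution.md and that a provider credential with the same name as backendProfile exists; the manager validates only the reference shape and does not read Secret values.
For business authorization such as HWLAB live device mutation, UniDesk production deploy, or GitHub issue/PR writes, the Manager only records fields and audit events and does not hard-code business rules into a generic gate.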
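
The "does not widen" check can be illustrated by comparing a requested policy against a ceiling. This sketch covers only three of the five dimensions, and the field shapes (`network` as a boolean, `secretScope` as a list of SecretRef names) plus the notion of a per-tenant ceiling object are assumptions; only `timeoutMs` is named elsewhere in this document.

```typescript
interface ExecutionPolicy {
  timeoutMs: number;
  network: boolean; // assumed shape: true means outbound network is allowed
  secretScope: string[]; // SecretRef names only, never Secret values
}

// Returns the names of the fields where the request exceeds the ceiling.
// An empty array means the request stays inside the boundary.
function wideningViolations(requested: ExecutionPolicy, ceiling: ExecutionPolicy): string[] {
  const violations: string[] = [];
  if (requested.timeoutMs > ceiling.timeoutMs) violations.push("timeoutMs");
  if (requested.network && !ceiling.network) violations.push("network");
  const allowed = new Set(ceiling.secretScope);
  if (requested.secretScope.some((ref) => !allowed.has(ref))) violations.push("secretScope");
  return violations;
}
```

A non-empty result would map to a structured `tenant-policy-denied` failure, with the violating field names carried in the message.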

Minimal Observability contract

events are append-only; seq increases monotonically within a run.
Command terminal and run terminal must be separate. An ordinary turn completed terminates only its own command, and the run may stay claimed/running to accept later commands; every command result must derive its authoritative terminal from the command record and command-scoped events. The run-level terminal_status is used only for run cancel, an unrecoverable runner-level failure, or an explicit run terminal; assistant partial, stdout, transport close, or idle timeout cannot substitute for terminal completed.
failureKind must distinguish at least schema-invalid, tenant-policy-denied, secret-unavailable, runner-lease-conflict, backend-failed, provider-auth-failed, provider-unavailable, infra-failed, and cancelled. runner-lease-conflict is a rejection reason for an ordinary concurrent runner; for a replacement runner it is a recoverable transient state, and the manager response must include the current owner, the lease expiry, or an equivalent basis for waiting, so that the runner can wait out the stale lease and retry.
health/readiness must return Postgres reachability, schema migration readiness, SecretRef redacted status, and build/source metadata.
Logs, events, traces, health, and diagnostics must not output provider credentials, Codex auth/config content, DSN passwords, tokens, or URL credentials.
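
The append-only, monotonic-seq rule and the `afterSeq/limit` pagination can be sketched together. This is an in-memory illustration only: the `EventLog` class and the `RunEvent` shape are hypothetical, and the real store is Postgres.

```typescript
interface RunEvent { seq: number; type: string; commandId: string | null }

class EventLog {
  private events: RunEvent[] = [];

  // Append-only: seq is assigned by the log, never supplied by the caller.
  append(type: string, commandId: string | null = null): RunEvent {
    const event: RunEvent = { seq: this.events.length + 1, type, commandId };
    this.events.push(event);
    return event;
  }

  // Returns events with seq > afterSeq, at most `limit` of them, plus the
  // cursor to pass as afterSeq on the next call.
  page(afterSeq: number, limit: number): { events: RunEvent[]; nextAfterSeq: number } {
    const events = this.events.filter((e) => e.seq > afterSeq).slice(0, limit);
    let nextAfterSeq = afterSeq;
    for (const e of events) nextAfterSeq = e.seq;
    return { events, nextAfterSeq };
  }
}
```

A caller that keeps passing the returned `nextAfterSeq` sees every event exactly once, which is the property T4 checks for.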

Result envelope

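As the baseline for taking over the existing HWLAB v0.2 Code Agent, the manager's run/command result envelope contains at least:

Field	Rule
`status`	Current aggregate state of the run/command; may be derived only from command state and terminal_status.
`terminalStatus`	`completed`, `failed`, `blocked`, or `cancelled`; `null` or an equivalent running state when there is no terminal event.
`completed` / `terminalSource`	`completed=true` may come only from terminal completed; `terminalSource` indicates whether it came from a `terminal_status` event, the run record, or no terminal yet.
`reply` / `finalResponse`	The final user-visible text aggregated from `assistant_message`. If any `assistant_message` has `replyAuthority=true` or `final=true`, the last such message must be taken as the authoritative reply. Without an authoritative final, result may fall back to the last non-empty assistant text before terminal, but `finalResponse` must expose `seq`, `source`, `replyAuthority`, `final`, `textTruncated`, and `outputTruncated` so that consumers know it is a visibility fallback rather than backend final authority. Without terminal completed, a completed reply must not be fabricated.
`finalAssistantSeq` / `finalAssistantSource`	Must point to the assistant event selected by this result; in long-trace, steer, or progress-snapshot scenarios an early assistant row must not keep posing as the final summary.
`finalAssistantTextTruncated` / `finalAssistantOutputTruncated`	Must expose the truncation flags of the selected assistant event verbatim; when the selected final summary is truncated, consumers should continue reading events or trace instead of having the truncation hidden as a complete final.
`failureKind` / `blocker`	Structured failure classification and summary; must be redacted.
`lastSeq` / `eventCount` / `eventsCapped` / `nextAfterSeq`	Support incremental polling and result/trace reconciliation by callers. Result aggregation must page through events; it must not read only the first page and silently return a stale `reply` or `lastSeq`. If the server-side aggregation cap is reached, it must set `eventsCapped=true` and provide a continuation cursor.
`scopedLastSeq` / `scopedEventCount`	When `commandId` is specified, expose the command-scoped aggregation range so that upper layers can distinguish the run-global cursor from the current command's event range.
`runId` / `commandId` / `attemptId`	Support durable correlation and troubleshooting by callers.
`artifactSummary`	In the first phase, holds only bounded summaries, byte counts, truncation flags, and necessary references; large stdout/stderr is not inlined.
`toolCallSummary`	A bounded, redacted summary of tool call status, containing at least `count`, `statusCounts`, `exitCodeCounts`, and the `method/toolName/type/status/exitCode/command` of the most recent `items`. Consumers must use it to distinguish the AgentRun command terminal, the agent's internal tool execution, and post-hoc diagnostics, and must not let a single `hwpodExitCode` override an AgentRun successful terminal state.
`liveness`	A supervisor liveness snapshot derived at query time and not written to durable events. It must expose `phase`, `active`, `lastSeq`, `lastEventAgeMs`, `lastActivity`/`lastCommandActivity`, `timeoutBudget`, a lease/heartbeat summary, and actionable recovery actions. `lastActivity` must contain `sourceSeq`, `eventId`, `activityKind`, and `ageMs` for drill-down by id/seq; by default only a bounded summary is given, without expanding stdout, runnerTrace, the full tool command, or the raw event. `timeoutBudget` must be based on `executionPolicy.timeoutMs` and expose `elapsedMs`, `remainingMs`, and `state` (such as `within-budget`, `approaching-hard-timeout`, `timed-out`). `phase` distinguishes at least `waiting-runner`, `waiting-model`, `waiting-model-output`, `waiting-tool`, `waiting-tool-output`, `idle-after-tool`, `runner-stdio-inactive`, `transport-disconnected`, `runner-heartbeat-stale`, and `terminal`, so that callers are not left guessing the backend state from an outer timeout. On terminal failure/blocked, recovery actions such as inspect result, resume session, or split task must still be retained instead of returning an empty array.
`steerDelivery`	Present only when querying a `type=steer` command result. It must state whether the steer has been acked by the runner, whether it has been forwarded to and accepted by the backend `turn/steer` RPC, the target `targetCommandId`, whether subsequent events of the target command have been observed, and the semantics that "steer command completed does not mean the target turn has produced subsequent assistant/tool output".

None of the following can by itself move result to completed: an assistant_message partial, the presence of command_output, non-empty stdout, backend transport close, or idle timeout.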
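
The reply-selection and completed rules can be sketched as a single pass over command-scoped events. This is a simplified illustration: the event union, the `ResultView` shape, and the `authoritative` / `fallback` source labels are assumptions, events are assumed to arrive sorted by `seq`, and truncation flags are omitted.

```typescript
type Ev =
  | { seq: number; type: "assistant_message"; text: string; final?: boolean; replyAuthority?: boolean }
  | { seq: number; type: "terminal_status"; status: "completed" | "failed" | "blocked" | "cancelled" }
  | { seq: number; type: "command_output"; text: string };

type AssistantEv = Extract<Ev, { type: "assistant_message" }>;

interface ResultView {
  completed: boolean;
  terminalStatus: string | null;
  reply: string | null;
  finalAssistantSeq: number | null;
  finalAssistantSource: "authoritative" | "fallback" | null;
}

function aggregate(events: Ev[]): ResultView {
  let terminalStatus: string | null = null;
  let terminalSeq = Infinity;
  for (const e of events) {
    if (e.type === "terminal_status") { terminalStatus = e.status; terminalSeq = e.seq; }
  }
  let authoritative: AssistantEv | null = null;
  let fallback: AssistantEv | null = null;
  for (const e of events) {
    if (e.type !== "assistant_message" || e.seq > terminalSeq) continue;
    if (e.final === true || e.replyAuthority === true) authoritative = e; // last one wins
    else if (e.text.trim() !== "") fallback = e; // last non-empty partial
  }
  // Without a terminal event nothing is reported as a reply in this sketch.
  const chosen = terminalStatus === null ? null : authoritative ?? fallback;
  return {
    completed: terminalStatus === "completed", // only a terminal event can set this
    terminalStatus,
    reply: chosen === null ? null : chosen.text,
    finalAssistantSeq: chosen === null ? null : chosen.seq,
    finalAssistantSource: chosen === null ? null : chosen === authoritative ? "authoritative" : "fallback",
  };
}
```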

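GET /api/v1/sessions/:sessionId is the session status entry point and, when an active/last run exists, must surface the same liveness and supervisor summary; the summary is an observational aid and cannot replace the command terminal, run terminal, or raw events as the source of truth.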
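
The `timeoutBudget` part of the liveness snapshot is simple arithmetic over `executionPolicy.timeoutMs`, sketched below. The `warnRatio` threshold is an assumption: the spec names the `approaching-hard-timeout` state but does not say when it begins.

```typescript
type BudgetState = "within-budget" | "approaching-hard-timeout" | "timed-out";

interface TimeoutBudget {
  timeoutMs: number;
  elapsedMs: number;
  remainingMs: number;
  state: BudgetState;
}

// Derived at query time from the command start time; nothing is persisted.
// warnRatio (default 0.8) is a hypothetical threshold for the "approaching" state.
function timeoutBudget(startedAtMs: number, nowMs: number, timeoutMs: number, warnRatio = 0.8): TimeoutBudget {
  const elapsedMs = Math.max(0, nowMs - startedAtMs);
  const remainingMs = Math.max(0, timeoutMs - elapsedMs);
  let state: BudgetState = "within-budget";
  if (elapsedMs >= timeoutMs) state = "timed-out";
  else if (elapsedMs >= timeoutMs * warnRatio) state = "approaching-hard-timeout";
  return { timeoutMs, elapsedMs, remainingMs, state };
}
```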

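When commandId is specified, the result envelope must aggregate only that command's assistant/output/error/terminal events; replies of other commands in the same run must not leak into the current command result. When commandId is not specified, the latest command may be selected by default.

The acceptance criterion for long-trace / steer scenarios is as follows: once raw events contain a terminal seq, the lastSeq and eventCount of the commands result must cover the same terminal event range, finalAssistantSeq must point to the last usable assistant before terminal or to the authoritative final, and silent first-page truncation is always treated as a result contract failure.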

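Test specification

T1 Manager health/readiness

Read AGENTS.md, this document, and spec-v01-postgres.md, then manually test /health/live and /health/readiness of agentrun-mgr via the RESTful API. Confirm that the response is JSON and contains serviceId, Postgres readiness, migration status, source commit, and SecretRef redacted status; no Secret value may be output.

T2 Run schema and tenant boundary

Read this document and spec-v01-services.md, then call POST /api/v1/runs to create a run containing the tenant/project/workspace/provider/backend/execution/trace fields. Confirm that a missing field, an invalid tenant, an invalid backend, backendProfile=deepseek or backendProfile=minimax-m3 without a matching provider credential, or a widened secretScope each return a structured failureKind, and that a valid request is persisted and can be queried via GET /api/v1/runs/:runId.

T3 Command idempotency

Read this document, then submit the same command twice to the same run with the same idempotency key, followed by a third submission with a different payload hash. Confirm that the first two return the same command, the third fails with a structured error, and all responses are JSON.

T4 Runner claim and event pagination

Read this document and spec-v01-agentrun-runner.md, then have two real runners try to claim the same run. Confirm that only one owner succeeds and that the other receives runner-lease-conflict or an equivalent failureKind with enough structured fields to determine the owner/lease expiry; then page through events and confirm that seq is monotonic with no duplicates and no gaps. Next, simulate a replacement runner after the old runner pod is deleted, and confirm that the replacement runner can take over once the old lease expires and that the claim waiting/recovered events are visible.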
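
The claim behaviour exercised by T4 can be sketched as a lease table with explicit timestamps. This is an in-memory illustration; the `LeaseTable` class and the conflict fields `currentOwner`, `leaseExpiresAtMs`, and `retryAfterMs` are hypothetical names for the "current owner, lease expiry, or equivalent" information the spec requires.

```typescript
interface Lease { runnerId: string; expiresAtMs: number }

type ClaimResult =
  | { ok: true; lease: Lease; recovered: boolean }
  | { ok: false; failureKind: "runner-lease-conflict"; currentOwner: string; leaseExpiresAtMs: number; retryAfterMs: number };

class LeaseTable {
  private leases = new Map<string, Lease>();

  claim(runId: string, runnerId: string, nowMs: number, ttlMs: number): ClaimResult {
    const current = this.leases.get(runId);
    // A live lease held by another runner rejects the claim but tells the
    // caller who owns it and how long to wait before retrying.
    if (current !== undefined && current.runnerId !== runnerId && current.expiresAtMs > nowMs) {
      return {
        ok: false,
        failureKind: "runner-lease-conflict",
        currentOwner: current.runnerId,
        leaseExpiresAtMs: current.expiresAtMs,
        retryAfterMs: current.expiresAtMs - nowMs,
      };
    }
    // No lease, own lease (heartbeat), or a stale lease: grant and record it.
    const lease: Lease = { runnerId, expiresAtMs: nowMs + ttlMs };
    this.leases.set(runId, lease);
    const recovered = current !== undefined && current.runnerId !== runnerId;
    return { ok: true, lease, recovered };
  }
}
```

Passing `nowMs` in explicitly keeps the sketch deterministic; in Postgres the same comparison would have to happen inside a single transaction or conditional update so that two runners cannot both pass the check.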

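T5 Manual runner Job dispatch API

Read this document and spec-v01-hwlab-manual-dispatch.md, then use the RESTful API to create a run with tenantId=hwlab, submit a command, and call POST /api/v1/runs/:runId/runner-jobs. Confirm that the response returns JSON quickly and contains runId, commandId, attemptId, jobName, namespace, log/pod identity, and the subsequent poll entry point; a repeated idempotency key does not create a duplicate job.

T6 command/run terminal separation

Read this document and spec-v01-agentrun-runner.md, then have two commands complete in sequence within the same run. Confirm that the first command completing does not mark the run terminal, that GET /commands/:id/result returns only the reply/terminal of the corresponding command, and that only run cancel converges the run and its unfinished commands to cancelled together.

Implementation status of the specification

Spec item	Status	Notes
`agentrun-mgr` service specification	Defined	This document is the v0.1 manager authority.
Manager REST API	Implemented / main closed loop passed	HTTP JSON APIs exist for run, command, event, backends, runner register, claim, lease heartbeat, poll, ack, status, runner Job creation, and health/readiness; the real runtime has passed the main closed loop via the RESTful API.
Manual runner Job API	Implemented	`POST /api/v1/runs/:runId/runner-jobs` can create a Kubernetes runner Job, with idempotency, a durable runner job record, the response schema, and cancel pre-checks fixed in place.
runner Job status query	Implemented	`GET /api/v1/runs/:runId/runner-jobs` and `GET /api/v1/runs/:runId/runner-jobs/:runnerJobId` return attempt/job/log/phase/terminal summaries, so business clients need no direct Kubernetes access for minimal troubleshooting.
Session control plane API	Implemented / Q3	`list/show/trace/output/read/control(cancel)` are provided; the session projection stores running/terminal, active run/command, last event seq, and read cursor, for use by CLI `ps/unread`.
command/run terminal separation	Minimal closed loop implemented	`PATCH /api/v1/commands/:commandId/status` terminates the command and updates SessionRef; an ordinary turn completed does not terminate the run, and run status is terminated only by run cancel or an unrecoverable runner-level failure.
Tenant policy boundary	Minimal boundary implemented	v0.1 performs schema, tenant/backend allowlist, executionPolicy, and secretScope structural validation; business authorization is still decided by UniDesk/HWLAB themselves.
`deepseek` backendProfile allowlist	Implemented / main closed loop passed	Manager validation, backend capability, and matching SecretRef validation support `deepseek`; the real runtime has been released through CI/CD, with Postgres migration `002_v01_backend_profiles` confirmed as applied.
`minimax-m3` backendProfile allowlist	Implemented / retested through the original HWLAB v0.2 entry point	Manager validation, backend capability, and matching SecretRef validation support `minimax-m3`; the real runtime passed a retest through the original HWLAB explicit-session CLI entry point.
Postgres durable adapter	Implemented / main closed loop passed	The live runtime uses the Postgres durable store via `DATABASE_URL`; the memory store is only for explicit self-test/dev. See spec-v01-postgres.md.
Minimal Observability contract	Main path implemented	Append-only events, command-scoped terminal status, failureKind, health/readiness store status, runner claim/lease/backend events, and Secret/DSN redaction are in the manager; centralized trace and deployment-level observability remain future work.
durable cancel API	Minimal closed loop implemented	run/command cancel APIs are provided; a pending command cancel blocks new runner Jobs, a running runner polls for cancel and aborts the Codex stdio backend, and the terminal state is `cancelled`.
stale lease recovery	Implemented / retested through the original HWLAB v0.2 entry point	A replacement runner that meets an old lease waits out the stale lease and retries the claim, then continues the same SessionRef/PVC/thread after taking over; ordinary concurrent runners still receive `runner-lease-conflict`.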
