pikasTech-agentrun/docs/reference/spec-v01-agentrun-mgr.md

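# v0.1 agentrun-mgr Service Specification

`agentrun-mgr` is the long-running management service of AgentRun `v0.1`. It is the authority for the public RESTful API, durable facts, the tenant policy boundary, runner claims, event appends, and terminal status; business clients, the CLI, and runners must not bypass it to write to Postgres directly.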

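## Responsibilities within the system

- Provides the Manager public API: create and query runs, submit commands, read events page by page, and query backend capabilities.
- Provides the manual dispatch API: explicitly creates a Kubernetes runner Job for an existing run/command and returns quickly with the job identity, attempt, and polling entry point.
- Provides the Runner private API: runner register, claim run, lease heartbeat, poll commands, append events, ack command, and report status.
- Provides the Provider Profile management API: server-side profile status queries, API key writes, Secret/config updates, and canary triggers; the detailed contract is in [spec-v01-provider-profile-management.md](spec-v01-provider-profile-management.md).
- Validates and persists `tenantId`, `projectId`, `workspaceRef`, `providerId`, `backendProfile`, `executionPolicy`, and `traceSink`.
- Enforces a minimal tenant policy boundary: only schema, allowlist, idempotency, secret scope, and executionPolicy range checks; it does not build in business authorization for UniDesk/HWLAB.
- Uses Postgres to store runs, commands, events, runners, backends, leases, and the migration ledger.
- Emits structured health/readiness, failureKind, redacted SecretRef, and trace correlation.
- Observability may only locate and verify state; it cannot stand in for a missing capability. If the final reply, command result, runner multi-turn, SessionRef, or cancel capability that the HWLAB canary needs is missing, the manager must add the durable API/state machine rather than only patching trace text.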

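## Internal architecture

`agentrun-mgr` runs in the `agentrun-v01` namespace; the long-running Deployment/Service is named `agentrun-mgr`. For `v0.1`, in-house service implementations prefer Bun + TypeScript. The specific HTTP framework is not a spec boundary, but the source must remain runnable, testable, and bundleable with Bun. The service must keep the following boundaries:

- The HTTP JSON API is the stable cross-service boundary; `v0.1` does not rely on SSE, WebSocket, long-polling, or long synchronous `turn` requests as required capabilities.
- The Postgres adapter is the only durable store adapter; file, sqlite, JSONL, or in-memory state may only be used for self-tests.
- Migrations must complete before readiness or explicitly fail fast; the service must not start silently on an empty schema.
- Plaintext provider credentials, Codex auth/config, and the Postgres DSN must not enter the database, events, traces, logs, or CLI output.
- The Manager may store SecretRefs and credential source references, but must not read a provider Secret value and then store it in the database.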

## API reference

Public API scope for `v0.1`:

```http
GET  /health
GET  /health/live
GET  /health/readiness
POST /api/v1/runs
GET  /api/v1/runs/:runId
GET  /api/v1/runs/:runId/events?afterSeq=0&limit=100
GET  /api/v1/runs/:runId/result?commandId=<commandId>
POST /api/v1/runs/:runId/cancel
POST /api/v1/runs/:runId/commands
GET  /api/v1/runs/:runId/commands/:commandId
GET  /api/v1/runs/:runId/commands/:commandId/result
POST /api/v1/runs/:runId/runner-jobs
GET  /api/v1/runs/:runId/runner-jobs?commandId=<commandId>
GET  /api/v1/runs/:runId/runner-jobs/:runnerJobId
POST /api/v1/commands/:commandId/cancel
GET  /api/v1/sessions?state=default&readerId=cli&backendProfile=<profile>&cursor=<cursor>&limit=50
GET  /api/v1/sessions/:sessionId?readerId=cli
GET  /api/v1/sessions/:sessionId/trace?afterSeq=0&limit=100&runId=<runId>
GET  /api/v1/sessions/:sessionId/output?afterSeq=0&limit=100&runId=<runId>
POST /api/v1/sessions/:sessionId/read
POST /api/v1/sessions/:sessionId/control
GET  /api/v1/backends
GET  /api/v1/provider-profiles
GET  /api/v1/provider-profiles/:profile
DELETE /api/v1/provider-profiles/:profile
PUT  /api/v1/provider-profiles/:profile/credential
POST /api/v1/provider-profiles/:profile/validate
GET  /api/v1/provider-profiles/:profile/validations/:validationId
```

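The Session API is a lightweight control plane for asynchronous subagents. `state=default` must return only running and unread sessions; only `state=all` returns historical read sessions. Once a command/run reaches terminal, the projection of its session must move to `executionState=terminal` and bump its version, so that readers who have not read it see it in `ps/default`; after `POST /read` writes the reader cursor, the session no longer appears in that reader's default list. `trace/output` only pages through the events of the owning run and does not proxy the Queue summary; `control action=cancel` cancels the active command or active run.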
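The default-list rule can be illustrated with a minimal sketch. The `SessionProjection` shape, the per-reader cursor map, and the helper names are assumptions made for illustration; only `executionState`, the version bump, and the default/all semantics come from the text above.

```typescript
// Hypothetical projection shape; only executionState and the version bump are from the spec.
interface SessionProjection {
  sessionId: string;
  executionState: "running" | "terminal";
  version: number; // bumped when the session reaches terminal
}

// readCursors maps sessionId -> last version this reader acknowledged via POST /read (assumed model).
function listSessions(
  sessions: SessionProjection[],
  readCursors: Map<string, number>,
  state: "default" | "all",
): SessionProjection[] {
  if (state === "all") return sessions; // includes historical read sessions
  return sessions.filter((s) => {
    if (s.executionState === "running") return true;
    const readVersion = readCursors.get(s.sessionId) ?? -1;
    return readVersion < s.version; // terminal but not yet read by this reader
  });
}

// Models the effect of POST /read: the reader cursor catches up to the current version.
function markRead(session: SessionProjection, readCursors: Map<string, number>): void {
  readCursors.set(session.sessionId, session.version);
}
```

Keeping the cursor per reader means one reader marking a session as read does not hide it from another reader's default list.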

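The manual dispatch API goals for the HWLAB v0.2 canary are described in [spec-v01-hwlab-manual-dispatch.md](spec-v01-hwlab-manual-dispatch.md). `runner-jobs` only explicitly starts the runner Job for the current run/command; it does not scan the pending queue and does not wait for a full model turn. An automatic scheduler remains a deferred capability. The later durable cancel API must connect to the same run/command state machine; HWLAB must not delete Kubernetes Jobs directly as the formal cancel semantics.

The Provider Profile management API is a server-side delegation surface, not a browser user API. HWLAB `hwlab-cloud-api` calls these endpoints after completing its own authentication and authorization; AgentRun only validates the call origin, profile allowlist, SecretRef scope, payload schema, and redaction, and does not read or evaluate HWLAB Web sessions, user roles, or OpenFGA relations.

Runner private API scope for `v0.1`: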

```http
POST  /api/v1/runners/register
POST  /api/v1/runs/:runId/claim
PATCH /api/v1/runs/:runId/lease
GET   /api/v1/runs/:runId/commands?afterSeq=0&limit=20
POST  /api/v1/runs/:runId/events
PATCH /api/v1/runs/:runId/status
POST  /api/v1/commands/:commandId/ack
PATCH /api/v1/commands/:commandId/status
```

All API responses, success or failure, must be JSON. A failure response contains at least `failureKind`, `message`, and trace correlation; an empty stdout or empty response must never be misread as success.
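A client-side guard for this rule might look like the sketch below. The `traceId` field name, the `ApiResult` type, and the choice of `infra-failed` for empty or non-JSON bodies are assumptions for illustration, not part of the contract.

```typescript
interface FailureBody {
  failureKind: string;
  message: string;
  traceId: string; // assumed name for the trace correlation field
}

type ApiResult = { ok: true; data: unknown } | { ok: false; failure: FailureBody };

// Interpret a raw HTTP status + body; an empty or non-JSON body is never a success.
function interpretResponse(status: number, bodyText: string, traceId: string): ApiResult {
  if (bodyText.trim() === "") {
    // Assumption: classify an empty body as infra-failed on the client side.
    return { ok: false, failure: { failureKind: "infra-failed", message: `empty response body (HTTP ${status})`, traceId } };
  }
  let parsed: unknown;
  try {
    parsed = JSON.parse(bodyText);
  } catch {
    return { ok: false, failure: { failureKind: "infra-failed", message: "response body is not JSON", traceId } };
  }
  if (status >= 200 && status < 300) return { ok: true, data: parsed };
  const body = (parsed !== null && typeof parsed === "object" ? parsed : {}) as Partial<FailureBody>;
  return {
    ok: false,
    failure: {
      failureKind: typeof body.failureKind === "string" ? body.failureKind : "infra-failed",
      message: typeof body.message === "string" ? body.message : `HTTP ${status}`,
      traceId: typeof body.traceId === "string" ? body.traceId : traceId,
    },
  };
}
```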

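### API authentication boundary

`/health`, `/health/live`, and `/health/readiness` are public health probes and require no authentication. At runtime, `/api/v1/**` must require `Authorization: Bearer <token>`; the server reads the expected token only from `AGENTRUN_API_KEY` or `AGENTRUN_API_KEY_FILE`. Without a server token the service starts in a permissive local/self-test mode, but the runtime Deployment must inject `AGENTRUN_API_KEY` through `managerApiKeyEnv`. Authentication failures return JSON: when the server token is missing and the runtime requires authentication, the response is `failureKind=auth-missing` / HTTP 503; when the client sends no token or a wrong token, it is `failureKind=auth-failed` / HTTP 401.

UniDesk or other clients may follow HWLAB's key discovery style and map the local `HWLAB_API_KEY` to the AgentRun REST bearer token, but this is only a convention for where the client credential comes from; it does not mean AgentRun depends on the HWLAB runtime, HWLAB backend-core, the HWLAB frontend proxy, or HWLAB user sessions. The AgentRun manager only checks whether the bearer token equals its own `AGENTRUN_API_KEY` and does not read HWLAB authentication state.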
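The authentication matrix above can be sketched as a pure function. The function names, the `requireAuth` flag, and the constant-time comparison are illustrative assumptions; the status codes and `failureKind` values are the ones stated in this section.

```typescript
type AuthResult =
  | { ok: true }
  | { ok: false; status: 401 | 503; failureKind: "auth-failed" | "auth-missing" };

// Health probes are public; everything under /api/v1/ requires a bearer token.
function isPublicPath(path: string): boolean {
  return path === "/health" || path === "/health/live" || path === "/health/readiness";
}

// Compare without short-circuiting on the first differing character (illustrative hardening).
function constantTimeEqual(a: string, b: string): boolean {
  if (a.length !== b.length) return false;
  let diff = 0;
  for (let i = 0; i < a.length; i++) diff |= a.charCodeAt(i) ^ b.charCodeAt(i);
  return diff === 0;
}

// serverToken models the value read from AGENTRUN_API_KEY / AGENTRUN_API_KEY_FILE.
// requireAuth is a hypothetical flag: true in the runtime Deployment, false in local/self-test mode.
function checkBearer(
  authorization: string | undefined,
  serverToken: string | undefined,
  requireAuth: boolean,
): AuthResult {
  if (!serverToken) {
    return requireAuth ? { ok: false, status: 503, failureKind: "auth-missing" } : { ok: true };
  }
  const prefix = "Bearer ";
  if (!authorization || !authorization.startsWith(prefix)) {
    return { ok: false, status: 401, failureKind: "auth-failed" };
  }
  const presented = authorization.slice(prefix.length);
  return constantTimeEqual(presented, serverToken)
    ? { ok: true }
    : { ok: false, status: 401, failureKind: "auth-failed" };
}
```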

### v0.1.1 Session state storage API

After `P1 SessionRef persistence` is upgraded to "per-session RWO PVC mounted directly", the manager must provide the following controlled API to manage the session PVC lifecycle:

```http
POST   /api/v1/sessions                                 # create session + synchronously create PVC
GET    /api/v1/sessions/:id/storage                     # query PVC summary (no content returned)
DELETE /api/v1/sessions/:id/storage                     # delete PVC + mark storage_kind=evicted
POST   /api/v1/sessions/:id/storage/refresh             # runner reports PVC summary
```

Boundaries:

- `POST /api/v1/sessions` synchronously creates the `agentrun-v01-session-<sessionId>` PVC (1Gi RWO, StorageClass taken from `AGENTRUN_SESSION_STORAGE_CLASS`) and then returns the session record.
- `GET /api/v1/sessions/:id/storage` returns only the summary fields `pvcName` / `pvcPhase` / `storage_size_bytes` / `storage_files_count` / `storage_sha256` / `storage_updated_at`.
- `DELETE /api/v1/sessions/:id/storage` deletes the PVC and writes back `storage_kind='evicted'` and `storage_evicted_at`.
- `POST /api/v1/sessions/:id/storage/refresh` writes the PVC summary reported by the runner into the `storage_*` columns of the `agentrun_sessions` table.
- The GC loop (default 5 min, tunable via `AGENTRUN_SESSION_GC_INTERVAL_MS`) deletes the PVC of any session whose `expires_at` has passed and that has no active run.
- Before rendering a run, the manager does a `get pvc`: if the PVC does not exist and `storage_kind='pvc'`, it short-circuits with `session-store-evicted` and does not create a runner Job.
- The mgr SA must have `persistentvolumeclaims: [create, get, list, watch, delete]` permissions (RBAC is provided by the deploy template).
- The failureKind matrix adds `session-store-evicted`, used only when `AGENTRUN_SESSION_PVC_NAME` is set and codex reports `no rollout found for thread id`; other `thread/resume` failures follow T7 and use `thread-resume-failed`.

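## Carrying over the HWLAB v0.2 baseline

The Manager takes over only the generic execution facts of the HWLAB v0.2 Code Agent; it does not take over HWLAB user authentication, HWPOD authorization, or the Workbench schema. The full table of capabilities absorbed from HWLAB is in [spec-v01-hwlab-manual-dispatch.md](spec-v01-hwlab-manual-dispatch.md). This service must fix the following capabilities as AgentRun's own contract:

| Existing HWLAB capability | Contract taken over by the Manager | Out of scope for the Manager |
| --- | --- | --- |
| `/v1/agent/chat` short-connection submit, then background run | `POST /api/v1/runs`, `POST /api/v1/runs/:runId/commands`, and `POST /api/v1/runs/:runId/runner-jobs` all return JSON quickly and persist the run/command/job identity | HWLAB HTTP routes, browser same-origin proxy, user login state |
| Result polling to detect the terminal state | `GET /api/v1/runs/:runId/result` and the command result must aggregate terminal status, reply, failureKind, blocker, event cursor, and attempt | HWLAB result schema and user-visible copy |
| Trace polling/replay | `events` are append-only durable facts; `seq` is monotonic within a run, with `afterSeq/limit` pagination | Workbench trace UI, HWLAB event display grouping |
| Cancel the current turn | run/command cancel is idempotent; a pending cancel blocks new runner jobs, a running cancel converges through runner/backend interrupt, and after terminal the current terminal state is returned | HWLAB cancel authorization by owner/admin |
| provider/backend failure classification | The manager stores and returns failureKind uniformly; missing Secret, provider auth failure, provider unavailable, backend failure, infra failure, and cancelled must be distinguishable | HWLAB business blocker classification or device authorization failures |
| runner/job locating evidence | The runner job creation response and later queries can answer at least `attemptId`, `jobName`, namespace, runnerId, pod/log identity, and the current terminal summary | Exposing Kubernetes control directly to business clients |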

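## Run and Command contract

`POST /api/v1/runs` must persist the following fields:

| Field | v0.1 rule |
| --- | --- |
| `tenantId` | Required; only allowlist/schema validation is applied. Default candidates are `unidesk` and `hwlab`. |
| `projectId` | Required, for example `pikasTech/unidesk`, `pikasTech/HWLAB`. |
| `workspaceRef` | Required; describes the source/worktree/workspace and is not guessed by the runner. |
| `providerId` | Required, for example `G14`, `D601`; it only names the target provider and grants no business permission by itself. |
| `backendProfile` | Required; must be a lowercase slug. `codex`, `deepseek`, `minimax-m3`, and `dsflash-go` are built-in profiles; dynamic slugs take effect through the `agentrun-v01-provider-<profile>` SecretRef. These profiles currently share the Codex stdio backend kind. |
| `executionPolicy` | Required, or the manager explicitly fills in defaults; contains at least sandbox, approval, timeout, network, and secretScope. |
| `traceSink` | The field must be present; it may be `null` or an explicit sink. |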
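The `backendProfile` rule can be sketched as follows. The exact slug regular expression, the `knownSecretRefs` lookup, and the mapping of a missing credential to `secret-unavailable` are assumptions for illustration; the built-in profile names and the `agentrun-v01-provider-<profile>` naming come from the table above.

```typescript
const BUILT_IN_PROFILES = new Set(["codex", "deepseek", "minimax-m3", "dsflash-go"]);

// Assumed slug rule: lowercase letters and digits, separated by single hyphens.
const SLUG_PATTERN = /^[a-z0-9]+(?:-[a-z0-9]+)*$/;

function providerSecretRef(profile: string): string {
  return `agentrun-v01-provider-${profile}`;
}

type ProfileCheck =
  | { ok: true; builtIn: boolean; secretRef: string }
  | { ok: false; failureKind: "schema-invalid" | "secret-unavailable"; message: string };

// knownSecretRefs holds SecretRef names only; Secret values are never read here.
function checkBackendProfile(profile: string, knownSecretRefs: Set<string>): ProfileCheck {
  if (!SLUG_PATTERN.test(profile)) {
    return { ok: false, failureKind: "schema-invalid", message: "backendProfile must be a lowercase slug" };
  }
  const secretRef = providerSecretRef(profile);
  if (!knownSecretRefs.has(secretRef)) {
    // No fallback to another profile: a missing matching credential is a structured failure.
    return { ok: false, failureKind: "secret-unavailable", message: `missing SecretRef ${secretRef}` };
  }
  return { ok: true, builtIn: BUILT_IN_PROFILES.has(profile), secretRef };
}
```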

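`POST /api/v1/sessions/:sessionId/send` is the only REST entry point for user-level Session continuation. The client submits only the prompt/payload, an optional run base, and a runner job override; the manager must read the durable session/run/command state and then decide the internal behavior automatically. It creates a `type=steer` command only when the active `turn` command has been acked by a runner, the run is claimed/running, and the lease has not expired. In every other case (pending/waiting-runner, stale lease, terminal, or no active command) it must create a new run and a `type=turn` command, and create a runner job as requested. The response must expose `decision`, `internalCommandType`, the run/command/runnerJob summary, activeBefore, and `valuesPrinted=false`. With `dryRun=true` it returns only a non-mutating plan and must not create a session, PVC, run, command, or runner job.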
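The steer-versus-new-run decision can be sketched as a pure function over the durable state. The `SendContext` and `SendPlan` shapes and the `decision` label strings are hypothetical; only the decision rule itself is taken from the paragraph above.

```typescript
// Field names are illustrative; only the decision rule comes from the spec text.
interface SendContext {
  activeCommandType: "turn" | "steer" | "interrupt" | null;
  activeCommandAcked: boolean; // has a runner acked the active command?
  runState: "pending" | "claimed" | "running" | "terminal";
  leaseExpiresAt: number | null; // epoch milliseconds
}

interface SendPlan {
  decision: "steer-active-turn" | "start-new-run"; // hypothetical decision labels
  internalCommandType: "steer" | "turn";
  createsRunnerJob: boolean;
  mutates: boolean; // false for dryRun plans
}

function planSend(ctx: SendContext, now: number, wantsRunnerJob: boolean, dryRun = false): SendPlan {
  const leaseLive = ctx.leaseExpiresAt !== null && ctx.leaseExpiresAt > now;
  const canSteer =
    ctx.activeCommandType === "turn" &&
    ctx.activeCommandAcked &&
    (ctx.runState === "claimed" || ctx.runState === "running") &&
    leaseLive;
  if (canSteer) {
    return { decision: "steer-active-turn", internalCommandType: "steer", createsRunnerJob: false, mutates: !dryRun };
  }
  // pending/waiting-runner, stale lease, terminal, or no active command: start a fresh run + turn.
  return { decision: "start-new-run", internalCommandType: "turn", createsRunnerJob: wantsRunnerJob, mutates: !dryRun };
}
```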

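`POST /api/v1/runs/:runId/commands` must support an idempotency key. The same key with the same payload hash should return the existing command; the same key with a different payload hash must fail with a structured error. `type=turn` is an ordinary conversation command. `type=steer` is an in-flight steering command aimed at the active turn of the same run; its payload must contain a non-empty `prompt`, `message`, or `text`, and an ordinary runner poll must not execute it as a new turn. `type=interrupt` keeps only durable command semantics; business cancel still treats the run/command cancel API as authoritative. `turn` / `steer` are manager-internal command types and low-level diagnostic resources, not a user-level CLI fork.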
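The idempotency rule can be sketched with an in-memory store. This is an illustration only: the real service persists commands in Postgres, the FNV-1a function stands in for whatever payload hash the implementation uses, and the `idempotency-conflict` failureKind is a hypothetical name not defined by this specification.

```typescript
interface StoredCommand { commandId: string; type: string; payloadHash: string; }

type SubmitResult =
  | { ok: true; command: StoredCommand; created: boolean }
  | { ok: false; failureKind: string; message: string };

// Canonical JSON with sorted keys, so key order does not change the hash.
function stableStringify(value: unknown): string {
  if (Array.isArray(value)) return `[${value.map(stableStringify).join(",")}]`;
  if (value !== null && typeof value === "object") {
    const obj = value as Record<string, unknown>;
    const keys = Object.keys(obj).sort();
    return `{${keys.map((k) => `${JSON.stringify(k)}:${stableStringify(obj[k])}`).join(",")}}`;
  }
  return JSON.stringify(value);
}

// 32-bit FNV-1a; a stand-in for a real cryptographic hash such as SHA-256.
function fnv1a(text: string): string {
  let hash = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash.toString(16).padStart(8, "0");
}

class CommandStore {
  private byKey = new Map<string, StoredCommand>();
  private nextId = 1;

  submit(runId: string, idempotencyKey: string, type: string, payload: unknown): SubmitResult {
    const payloadHash = fnv1a(stableStringify({ type, payload }));
    const scopedKey = `${runId}:${idempotencyKey}`;
    const existing = this.byKey.get(scopedKey);
    if (existing) {
      if (existing.payloadHash === payloadHash) return { ok: true, command: existing, created: false };
      // Hypothetical failureKind name for a key reused with a different payload.
      return { ok: false, failureKind: "idempotency-conflict", message: "idempotency key reused with a different payload" };
    }
    const command: StoredCommand = { commandId: `cmd-${this.nextId++}`, type, payloadHash };
    this.byKey.set(scopedKey, command);
    return { ok: true, command, created: true };
  }
}
```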

## Tenant Policy Boundary

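`v0.1` does not implement a standalone policy engine. The Manager only enforces basic boundaries:

- Validates that tenant/project/provider/backendProfile fall within the range allowed by `v0.1`. backendProfile currently allows built-in profiles and dynamic profiles that follow the lowercase slug rule; a dynamic profile must have a matching provider credential SecretRef and must not fall back to another profile.
- Validates that the workspaceRef shape exists and is consistent with the tenant request; it does not decide on the tenant's behalf whether a repo operation is business-authorized.
- Validates that executionPolicy does not widen sandbox, network, approval, timeout, or secretScope.
- Validates that secretScope references only SecretRefs allowed in [spec-v01-secret-distribution.md](spec-v01-secret-distribution.md) and that a provider credential with the same name as `backendProfile` exists; the manager validates only the reference shape and does not read Secret values.
- For business authorization such as HWLAB live device mutation, UniDesk production deploy, or GitHub issue/PR writes, the Manager only records fields and audit events and does not hard-code business rules as a generic gate.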
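The "executionPolicy must not widen" check can be sketched as a comparison between a requested policy and a ceiling. The sandbox and approval level names and their ordering are hypothetical, since this specification does not enumerate them; the five compared dimensions are the ones listed above.

```typescript
// Hypothetical orderings, from narrowest to widest; the spec does not enumerate these values.
const SANDBOX_LEVELS = ["read-only", "workspace-write", "full-access"];
const APPROVAL_LEVELS = ["always", "on-request", "never"];

interface ExecutionPolicy {
  sandbox: string;
  approval: string;
  network: boolean;
  timeoutMs: number;
  secretScope: string[]; // SecretRef names only, never Secret values
}

function widerThan(levels: string[], requested: string, ceiling: string): boolean {
  const r = levels.indexOf(requested);
  const c = levels.indexOf(ceiling);
  return r === -1 || c === -1 || r > c; // unknown values are treated as widening
}

// Returns the names of every dimension the request tries to widen; empty means allowed.
function policyViolations(requested: ExecutionPolicy, ceiling: ExecutionPolicy): string[] {
  const violations: string[] = [];
  if (widerThan(SANDBOX_LEVELS, requested.sandbox, ceiling.sandbox)) violations.push("sandbox");
  if (widerThan(APPROVAL_LEVELS, requested.approval, ceiling.approval)) violations.push("approval");
  if (requested.network && !ceiling.network) violations.push("network");
  if (requested.timeoutMs > ceiling.timeoutMs) violations.push("timeout");
  const allowed = new Set(ceiling.secretScope);
  if (requested.secretScope.some((ref) => !allowed.has(ref))) violations.push("secretScope");
  return violations;
}
```

A non-empty result would map to a structured `tenant-policy-denied` failure in the manager response.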

## Minimal Observability contract

- events are append-only; `seq` increases monotonically within a run.
- Command terminal and run terminal must be separate. An ordinary turn completing terminates only its command; the run may stay `claimed/running` to keep receiving later commands. Every command result must yield an authoritative terminal from the command record and command-scoped events. The run-level `terminal_status` is used only for run cancel, unrecoverable runner-level failure, or an explicit run terminal; an assistant partial, stdout, transport close, or idle timeout cannot substitute for terminal completed.
- failureKind must distinguish at least `schema-invalid`, `tenant-policy-denied`, `secret-unavailable`, `runner-lease-conflict`, `backend-failed`, `provider-auth-failed`, `provider-unavailable`, `infra-failed`, and `cancelled`. `runner-lease-conflict` is a rejection reason for ordinary concurrent runners; for a replacement runner it is a recoverable transient state, and the manager response must include the current owner, lease expiry, or an equivalent basis for waiting, so the runner can wait out the stale lease and retry.
- health/readiness must return Postgres reachable, schema migration ready, SecretRef redacted status, and build/source metadata.
- Logs, events, traces, health, and diagnostics must not output provider credentials, Codex auth/config content, DSN passwords, tokens, or URL credentials.
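The `afterSeq/limit` pagination over an append-only event log can be sketched as below. The `EventPage` shape and the in-memory log are assumptions for illustration; the sketch also assumes the log is already ordered by `seq` and that `seq` starts above 0.

```typescript
interface RunEvent { seq: number; type: string; }
interface EventPage { items: RunEvent[]; nextAfterSeq: number; }

// Server side: return events with seq > afterSeq, at most `limit` of them.
function readEventsPage(log: RunEvent[], afterSeq: number, limit: number): EventPage {
  const items = log.filter((e) => e.seq > afterSeq).slice(0, limit);
  const last = items[items.length - 1];
  return { items, nextAfterSeq: last ? last.seq : afterSeq };
}

// Client side: drain every page and verify that seq strictly increases.
function drainEvents(fetchPage: (afterSeq: number, limit: number) => EventPage, limit = 100): RunEvent[] {
  const all: RunEvent[] = [];
  let cursor = 0;
  for (;;) {
    const page = fetchPage(cursor, limit);
    if (page.items.length === 0) return all;
    for (const e of page.items) {
      if (e.seq <= cursor) throw new Error(`non-monotonic seq ${e.seq} after ${cursor}`);
      cursor = e.seq;
      all.push(e);
    }
  }
}
```

Because the cursor is the last `seq` seen rather than an offset, a reader that resumes after new events were appended neither skips nor repeats entries.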

### Result envelope

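As the baseline for taking over the existing HWLAB v0.2 Code Agent, the manager's run/command result envelope contains at least:

| Field | Rule |
| --- | --- |
| `status` | Current aggregate state of the run/command; derived only from command state and terminal_status. |
| `terminalStatus` | `completed`, `failed`, `blocked`, or `cancelled`; `null` or an equivalent running state when there is no terminal event. |
| `completed` / `terminalSource` | `completed=true` can come only from terminal completed; `terminalSource` states whether it came from a `terminal_status` event, the run record, or no terminal yet. |
| `reply` / `finalResponse` | The final user-visible text aggregated from `assistant_message`. If any `assistant_message` has `replyAuthority=true` or `final=true`, the last such message must be taken as the authoritative reply. Without an authoritative final, the result may fall back to the last non-empty assistant text before terminal, but `finalResponse` must expose `seq`, `source`, `replyAuthority`, `final`, `textTruncated`, and `outputTruncated` so consumers know it is a visibility fallback, not backend final authority. Without terminal completed, a completed reply must not be fabricated. |
| `finalResponseAuthority` / `finalResponseFallback` / `needsContinuation` / `completionEvidence` | The authority of the final reply must be exposed at the top level of the result. `finalResponseAuthority` can only be `authoritative`, `fallback`, or `missing`; when terminal is completed but there is no authoritative final, `needsContinuation=true` and `completionEvidence` must explain why and give the same-session `sessions send` recovery entry point. |
| `finalAssistantSeq` / `finalAssistantSource` | Must point to the assistant event selected by this result; in long-trace, steer, or progress-snapshot scenarios an early assistant row must not keep posing as the final summary. |
| `finalAssistantTextTruncated` / `finalAssistantOutputTruncated` | Must expose the truncation flags of the selected assistant event verbatim; when the selected final summary is truncated, consumers should keep reading events or trace, and the truncation must not be hidden as a complete final. |
| `failureKind` / `blocker` | Structured failure classification and summary; must be redacted. |
| `lastSeq` / `eventCount` / `eventsCapped` / `nextAfterSeq` | Supports incremental polling and result/trace reconciliation by the caller. Result aggregation must page through events and must not read only the first page and silently return a stale `reply` or `lastSeq`; if the server-side aggregation cap is reached, it must set `eventsCapped=true` and provide a continuation cursor. |
| `scopedLastSeq` / `scopedEventCount` | When `commandId` is specified, exposes the command-scoped aggregation range so upper layers can tell the run-global cursor from the current command's event range. |
| `runId` / `commandId` / `attemptId` | Supports durable correlation and troubleshooting by the caller. |
| `artifactSummary` | In the first phase, holds only bounded summaries, byte counts, truncation flags, and necessary references; large stdout/stderr is not inlined. |
| `toolCallSummary` | A bounded, redacted summary of tool call status, containing at least `count`, `statusCounts`, `exitCodeCounts`, and the `method/toolName/type/status/exitCode/command` of the most recent few `items`. Consumers must use it to distinguish the AgentRun command terminal, the agent's internal tool execution, and post-hoc diagnostics, and must not let a single `hwpodExitCode` override an AgentRun successful terminal. |
| `liveness` | A supervisor liveness snapshot derived at query time and not written to durable events. It must expose `phase`, `active`, `lastSeq`, `lastEventAgeMs`, `lastActivity`/`lastCommandActivity`, `timeoutBudget`, a lease/heartbeat summary, and actionable recovery actions. `lastActivity` must contain `sourceSeq`, `eventId`, `activityKind`, `observedAt`, and `ageMs` for drill-down by id/seq; by default it gives only a bounded summary and does not expand stdout, runnerTrace, the full tool command, or raw events. `timeoutBudget` must be computed from unresponsive idle time: `executionPolicy.timeoutMs` is an idle budget, not a turn wall-clock hard timeout; as long as backend notifications, assistant/tool/event activity, command output, or equivalent activity keeps refreshing, the idle start must be reset and waiting continues. The object must expose `timeoutKind="idle"`, `hardTimeout=false`, `idleStartedAt`, `idleElapsedMs`, `lastActivityAt`, `lastActivitySeq`, `elapsedMs`, `remainingMs`, and `state` (such as `within-budget`, `approaching-idle-timeout`, `idle-timed-out`, `terminal`). `phase` distinguishes at least `waiting-runner`, `waiting-model`, `waiting-model-output`, `waiting-tool`, `waiting-tool-output`, `idle-after-tool`, `runner-stdio-inactive`, `transport-disconnected`, `runner-heartbeat-stale`, and `terminal`, so callers are not left guessing backend state from an outer timeout. On terminal failure/blocked, recovery actions must still be kept (for example inspect result, read events/trace, continue same session, split task) rather than returning an empty array. |
| `steerDelivery` | Present only when querying a `type=steer` command result. It must state whether the steer has been acked by the runner, whether it has been forwarded and accepted by the backend `turn/steer` RPC, the target `targetCommandId`, whether later events on the target command were observed, and the semantics that "steer command completed does not mean the target turn has produced further assistant/tool output". |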

An `assistant_message` partial, the presence of `command_output`, non-empty stdout, backend transport close, or idle timeout cannot by itself move a result to `completed`.
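The reply selection and completion rules of the result envelope can be sketched as follows. The `AssistantMessage` and `ReplySelection` shapes are illustrative, and returning `reply: null` whenever the terminal status is not `completed` is a conservative assumption of this sketch.

```typescript
type TerminalStatus = "completed" | "failed" | "blocked" | "cancelled" | null;

interface AssistantMessage {
  seq: number;
  text: string;
  replyAuthority?: boolean;
  final?: boolean;
}

interface ReplySelection {
  completed: boolean;
  reply: string | null;
  finalAssistantSeq: number | null;
  finalResponseAuthority: "authoritative" | "fallback" | "missing";
  needsContinuation: boolean;
}

function selectReply(messages: AssistantMessage[], terminalStatus: TerminalStatus): ReplySelection {
  // Only a terminal completed status counts; partial text or stdout never does.
  const completed = terminalStatus === "completed";
  const ordered = [...messages].sort((a, b) => a.seq - b.seq);
  // Last message flagged replyAuthority/final wins as the authoritative reply.
  const authoritative = ordered.filter((m) => m.replyAuthority === true || m.final === true).pop();
  // Otherwise fall back to the last non-empty assistant text.
  const fallback = ordered.filter((m) => m.text.trim() !== "").pop();
  const chosen = authoritative ?? fallback;
  const authority = authoritative ? "authoritative" : fallback ? "fallback" : "missing";
  return {
    completed,
    reply: completed && chosen ? chosen.text : null, // assumption: no reply is surfaced before terminal completed
    finalAssistantSeq: chosen ? chosen.seq : null,
    finalResponseAuthority: authority,
    needsContinuation: completed && authority !== "authoritative",
  };
}
```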

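`GET /api/v1/sessions/:sessionId`, as the session status entry point, must surface the same `liveness` and `supervisor` summary whenever an active/last run exists; the summary is an observation aid and cannot replace the command terminal, run terminal, or raw events as the source of truth.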
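The idle-budget semantics of `timeoutBudget` inside `liveness` can be sketched as a pure function. The input shape, the use of epoch milliseconds, and the 20% threshold for `approaching-idle-timeout` are assumptions for illustration; the output field names are the ones listed in the result envelope table.

```typescript
type BudgetState = "within-budget" | "approaching-idle-timeout" | "idle-timed-out" | "terminal";

interface TimeoutBudget {
  timeoutKind: "idle";
  hardTimeout: false;
  idleStartedAt: number; // epoch ms; reset by every activity
  idleElapsedMs: number;
  lastActivityAt: number;
  lastActivitySeq: number;
  elapsedMs: number; // wall-clock time since the command started, informational only
  remainingMs: number;
  state: BudgetState;
}

interface BudgetInput {
  timeoutMs: number; // executionPolicy.timeoutMs, interpreted as an idle budget
  commandStartedAt: number;
  lastActivityAt: number;
  lastActivitySeq: number;
  now: number;
  terminal: boolean;
}

function computeTimeoutBudget(input: BudgetInput): TimeoutBudget {
  const idleElapsedMs = Math.max(0, input.now - input.lastActivityAt);
  const remainingMs = Math.max(0, input.timeoutMs - idleElapsedMs);
  let state: BudgetState;
  if (input.terminal) state = "terminal";
  else if (remainingMs === 0) state = "idle-timed-out";
  else if (remainingMs <= input.timeoutMs * 0.2) state = "approaching-idle-timeout"; // assumed threshold
  else state = "within-budget";
  return {
    timeoutKind: "idle",
    hardTimeout: false,
    idleStartedAt: input.lastActivityAt,
    idleElapsedMs,
    lastActivityAt: input.lastActivityAt,
    lastActivitySeq: input.lastActivitySeq,
    elapsedMs: Math.max(0, input.now - input.commandStartedAt),
    remainingMs,
    state,
  };
}
```

Note that a turn whose wall-clock `elapsedMs` already exceeds `timeoutMs` still reports `within-budget` as long as activity keeps resetting the idle start.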

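When a command fails in a non-business terminal state because of idle timeout, provider stream disconnect, runner stdio inactive, completed-without-authoritative-final, or similar causes, the manager's recovery advice must be addressed to the commander rather than asking the worker to read the trace itself. The commander should first read `result`, `events`, or `sessions/:id/trace` to confirm the last effective activity, the completed modifications, and the sticking point. If the run/task has a resumable `sessionRef`, the follow-up prompt must continue on the same AgentRun session through `sessions send <sessionId>` and include the next step that the supervising party derived from the trace. A new task carrying the supervising party's summary is created only when the old task has no `sessionRef`, the session has been evicted, or the same session has been proven unrecoverable.

When `commandId` is specified, the result envelope must aggregate only that command's assistant/output/error/terminal events; replies from other commands in the same run must not leak into the current command result. When `commandId` is not specified, the latest command may be selected by default.

The acceptance criteria for long-trace / steer scenarios are: when raw events already contain a terminal seq, the `lastSeq` and `eventCount` of `commands result` must cover the same terminal event range, `finalAssistantSeq` must point to the last usable assistant message before terminal or to the authoritative final, and silent first-page truncation always counts as a result contract failure.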

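## Test specifications

### T1 Manager health/readiness

Read `AGENTS.md`, this document, and [spec-v01-postgres.md](spec-v01-postgres.md), then manually test `/health/live` and `/health/readiness` of `agentrun-mgr` through the RESTful API. Confirm the response is JSON and contains serviceId, Postgres readiness, migration status, source commit, and SecretRef redacted status; no Secret value may be output.

### T2 Run schema and tenant boundary

Read this document and [spec-v01-services.md](spec-v01-services.md), then call `POST /api/v1/runs` to create a run with tenant/project/workspace/provider/backend/execution/trace fields. Confirm that a missing field, an invalid tenant, an invalid backend, `backendProfile=deepseek` or `backendProfile=minimax-m3` without a matching provider credential, or a widened secretScope each return a structured failureKind, and that a valid request can be queried with `GET /api/v1/runs/:runId` after it is persisted.

### T3 Command idempotency

Read this document, then submit the same command twice against one run with the same idempotency key, followed by a third submission with a different payload hash. Confirm the first two return the same command, the third fails with a structured error, and all responses are JSON.

### T4 Runner claim and event pagination

Read this document and [spec-v01-agentrun-runner.md](spec-v01-agentrun-runner.md), then have two real runners try to claim the same run. Confirm only one owner succeeds and the other returns `runner-lease-conflict` or an equivalent failureKind, with enough structured fields to determine the owner/lease expiry; then page through events and confirm `seq` is monotonic with no duplicates and no missing events. Next simulate a replacement runner after the old runner pod is deleted, and confirm the replacement runner can take over once the old lease expires and that the claim waiting/recovered events are visible.

### T5 Manual runner Job dispatch API

Read this document and [spec-v01-hwlab-manual-dispatch.md](spec-v01-hwlab-manual-dispatch.md), then use the RESTful API to create a run with `tenantId=hwlab`, submit a command, and call `POST /api/v1/runs/:runId/runner-jobs`. Confirm the response returns JSON quickly and contains `runId`, `commandId`, `attemptId`, `jobName`, namespace, log/pod identity, and the follow-up poll entry point; a repeated idempotency key does not create a duplicate job.

### T6 command/run terminal separation

Read this document and [spec-v01-agentrun-runner.md](spec-v01-agentrun-runner.md), then let two commands complete in sequence within the same run. Confirm the first command completing does not mark the run terminal, `GET /commands/:id/result` returns only the reply/terminal of its own command, and only run cancel converges the run and unfinished commands to cancelled together.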

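## Implementation status

| Spec item | Status | Notes |
| --- | --- | --- |
| `agentrun-mgr` service specification | Defined | This document is the authority for the v0.1 manager. |
| Manager REST API | Implemented / passed main end-to-end loop | HTTP JSON APIs exist for run, command, event, backends, runner register, claim, lease heartbeat, poll, ack, status, runner Job creation, and health/readiness; the real runtime has passed the main end-to-end loop through the RESTful API. |
| Manual runner Job API | Implemented | `POST /api/v1/runs/:runId/runner-jobs` can create a Kubernetes runner Job and fixes idempotency, the durable runner job record, the response schema, and the cancel precondition check. |
| runner Job status query | Implemented | `GET /api/v1/runs/:runId/runner-jobs` and `GET /api/v1/runs/:runId/runner-jobs/:runnerJobId` return the attempt/job/log/phase/terminal summary, so business clients need no direct Kubernetes connection for minimal locating. |
| Session control plane API | Implemented / Q3 | `list/show/trace/output/read/control(cancel)` are provided; the session projection stores running/terminal, active run/command, last event seq, and read cursor for the CLI `ps/unread`. |
| command/run terminal separation | Minimal loop implemented | `PATCH /api/v1/commands/:commandId/status` terminates the command and updates the SessionRef; an ordinary turn completing does not terminate the run, and run status is terminated only by run cancel or an unrecoverable runner-level failure. |
| Tenant policy boundary | Minimal boundary implemented | v0.1 performs schema, tenant/backend allowlist, executionPolicy, and secretScope structural validation; business authorization is still decided by UniDesk/HWLAB themselves. |
| `deepseek` backendProfile allowlist | Implemented / passed main end-to-end loop | Manager validation, backend capability, and matching SecretRef validation support `deepseek`; the real runtime has been released through CI/CD and Postgres migration `002_v01_backend_profiles` is confirmed applied. |
| `minimax-m3` backendProfile allowlist | Implemented / passed HWLAB v0.2 original-entry retest | Manager validation, backend capability, and matching SecretRef validation support `minimax-m3`; the real runtime has passed a retest through the original HWLAB explicit session CLI entry. |
| Postgres durable adapter | Implemented / passed main end-to-end loop | The live runtime uses the Postgres durable store through `DATABASE_URL`; the memory store is only for explicit self-test/dev. See [spec-v01-postgres.md](spec-v01-postgres.md). |
| Minimal Observability contract | Main path implemented | events append-only, command-scoped terminal status, failureKind, health/readiness store status, runner claim/lease/backend events, and Secret/DSN redaction are in the manager; centralized trace and deployment-level observability remain future work. |
| durable cancel API | Minimal loop implemented | run/command cancel APIs are provided; pending command cancel blocks new runner Jobs, a running runner polls for cancel and aborts the Codex stdio backend, and the terminal state uses `cancelled`. |
| stale lease recovery | Implemented / passed HWLAB v0.2 original-entry retest | A replacement runner that meets an old lease waits for the stale lease and retries the claim, then continues the same SessionRef/PVC/thread after taking over; ordinary concurrent runners still get `runner-lease-conflict`. |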