diff --git a/.agents/skills/unidesk-sub2api/SKILL.md b/.agents/skills/unidesk-sub2api/SKILL.md
index 5d684004..ada9b662 100644
--- a/.agents/skills/unidesk-sub2api/SKILL.md
+++ b/.agents/skills/unidesk-sub2api/SKILL.md
@@ -16,7 +16,7 @@ bun scripts/cli.ts platform-infra sub2api validate --target PK01
 bun scripts/cli.ts platform-infra sub2api plan --target PK01
 ```
 
-先看报表和状态，再做计划或变更。部署、PK01 host-Docker、D601/D518 k3s target、egress proxy、镜像升级、Codex pool、账号代理、FRP/Caddy 暴露、master Codex 消费端配置和排障细节见 [references/full.md](references/full.md)。
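+Check reports and status first, then plan or make changes. Detailed rules are split by responsibility under `references/`; do not add `full.md`, `all.md`, `guide.md`, or any similar catch-all file in disguise.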
 
 ## Boundaries
 
@@ -24,11 +24,17 @@ bun scripts/cli.ts platform-infra sub2api plan --target PK01
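 - For Secrets, output only the object name, key name, presence, fingerprint, or redacted prefix; never print a full token/key.
 - The default active target is determined by YAML `defaults.targetId` and the target role; `api.pikapython.com` currently maps to the PK01 host-Docker target.
 - Codex pool, the unified API key, master `~/.codex` configuration, FRP/Caddy exposure, and adding or removing accounts must all go through this skill's controlled CLI.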
-- `api.pikapython.com` 异常先按 YAML target 区分 PK01 local edge/app、k3s FRP target 和账号池调度；用 `status`、`validate`、受控 apply/sync 以及最小 `/v1/responses` smoke 做分层恢复。完整步骤见 [references/full.md](references/full.md) 的排障段。
+- When `api.pikapython.com` misbehaves, first use the YAML target to distinguish among the PK01 local edge/app, the k3s FRP target, and account-pool scheduling; recover layer by layer with `status`, `validate`, controlled apply/sync, and a minimal `/v1/responses` smoke. For the full procedure, see [references/troubleshooting.md](references/troubleshooting.md) and [references/public-exposure.md](references/public-exposure.md).
 
 ## When to read a reference
 
-- 添加/删除上游、受保护账号代理、分组绑定：读 [references/full.md](references/full.md) 的账号管理段。
-- 部署/状态/镜像升级/FRP 暴露：读部署、镜像、FRP 段。
-- master Codex消费端、`/v1/models`、Codex pool 验收：读 Codex Pool 和验收口径段。
-- 排障或禁止事项不确定时，读排障和禁止事项段。
+- Deployment, status, target boundaries, PK01 host-Docker, k3s targets, egress proxy, image upgrades: read [references/operations.md](references/operations.md).
+- Codex pool, unified key, trace, account temp-unschedulable, `codex-pool sync|validate`: read [references/codex-pool.md](references/codex-pool.md).
+- Sentinel, marker-only decisions, account freeze/recovery, `sentinel-report|sentinel-probe|sentinel-image`: read [references/sentinel.md](references/sentinel.md).
+- Protected manual account proxy, group binding, WebUI account test: read [references/manual-accounts.md](references/manual-accounts.md).
+- Adding or removing an upstream profile/account: read [references/upstreams.md](references/upstreams.md).
+- FRP/Caddy, the PK01 shared Caddy managed block, public URL exposure: read [references/public-exposure.md](references/public-exposure.md).
+- Unified consumer configuration for master `~/.codex`: read [references/local-codex-consumer.md](references/local-codex-consumer.md).
+- Closeout acceptance and minimal smoke: read [references/validation.md](references/validation.md).
+- Troubleshooting entry point: read [references/troubleshooting.md](references/troubleshooting.md) first, then the runtime, account-pool, or public-exposure topic that matches the failing layer.
+- Prohibited actions and out-of-scope judgments: read [references/guardrails.md](references/guardrails.md).
diff --git a/.agents/skills/unidesk-sub2api/references/codex-pool.md b/.agents/skills/unidesk-sub2api/references/codex-pool.md
new file mode 100644
index 00000000..2a91dec9
--- /dev/null
+++ b/.agents/skills/unidesk-sub2api/references/codex-pool.md
@@ -0,0 +1,51 @@
+## Codex Pool
+
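+The current codex-pool sync/validate/report/trace adapters mainly cover k3s targets. If the YAML default target is PK01 host-Docker, do not treat a codex-pool command without `--target` as the acceptance entry point; first use `sub2api status --target PK01`, `sub2api validate --target PK01`, and a minimal public `/v1/responses` smoke. Until the host-Docker codex-pool adapter is complete, k3s account-pool operations must explicitly select a k3s target: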
+
+```bash
+bun scripts/cli.ts platform-infra sub2api codex-pool plan --target D601
+bun scripts/cli.ts platform-infra sub2api codex-pool sync --target D601 --confirm
+bun scripts/cli.ts platform-infra sub2api codex-pool validate --target D601
+bun scripts/cli.ts platform-infra sub2api codex-pool trace --request-id <requestId>
+bun scripts/cli.ts platform-infra sub2api codex-pool sentinel-image status --target D601
+bun scripts/cli.ts platform-infra sub2api codex-pool sentinel-image build --target D601 --confirm
+bun scripts/cli.ts platform-infra sub2api codex-pool sentinel-probe --target D601 --account unidesk-codex-hy --confirm
+bun scripts/cli.ts platform-infra sub2api codex-pool sentinel-report --target D601
+bun scripts/cli.ts platform-infra sub2api codex-pool cleanup-probes --target D601 --confirm
+```
+
+`config/platform-infra/sub2api-codex-pool.yaml` controls:
+
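+- `pool.groupName`: the Sub2API group name.
+- `pool.apiKeySecretName` / `pool.apiKeySecretKey`: the k3s Secret location of the unified consumer API key; defaults to `platform-infra/sub2api-codex-pool-api-key.API_KEY`.
+- `pool.minOwnerBalanceUsd`: minimum balance for the pool key owner; sync/validate tops it up to this minimum.
+- `pool.minOwnerConcurrency`: optional minimum concurrency for the owner of the unified consumer API key; when omitted, the CLI automatically uses the sum of all resolved account capacities, and sync/validate raises the owner to that level. An explicit YAML value acts only as an override and must still be no less than the sum of account capacities; accounts without an explicit `profiles.entries[].capacity` contribute `pool.defaultAccountCapacity` to the sum. Do not raise a single provider's capacity to mask a WS 1013 at the user-concurrency layer.
+- `pool.defaultTempUnschedulable`: the switch and YAML rule list for Sub2API's built-in request-path temporary unschedulable mechanism. The current requirement is to enable the general rules as declared in YAML; sync renders `temp_unschedulable_enabled` / `temp_unschedulable_rules` onto managed accounts so that matching 400/5xx/timeout/model-routing/encrypted-content errors briefly cool down the current account and trigger failover within the same group.
+- `pool.defaultTempUnschedulable` and the external `sentinel.*` are configured separately and do not drive each other. The built-in rules handle near-real-time request-path cooling/failover; the sentinel handles marker health, account-level isolation/recovery, and probe backoff.
+- The external sentinel's only permitted write surface is freezing/recovering accounts through the Sub2API admin `schedulable` interface; it must not write, clear, or indirectly clear Sub2API request-path runtime state such as `temp_unschedulable_until` / `temp_unschedulable_reason`, rate-limit, overload, or model-rate-limit, and it must not call `recover-state` as a recovery action. When the UI shows a temporary unschedulable state with "trigger time / release time / rule index / matched keyword", attribute it by default to Sub2API's built-in request-path temp-unschedulable, not to the sentinel.
+- YAML only selects and configures Codex upstreams; it does not declare a long-lived `schedulable` field, and `codex-pool sync --confirm` is not responsible for restoring existing accounts to `schedulable=true`. An existing account with `schedulable=false` must be recovered by the sentinel, which first syncs the Sub2API runtime state and then restores the account after a marker probe hit.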
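+- `profiles.entries`: selects upstream profiles from master `~/.codex/` and maps them to Sub2API accounts.
+- `profiles.entries[].capacity`: optional per-account concurrency override; when omitted, `pool.defaultAccountCapacity` is used. Concrete values are authoritative only in `config/platform-infra/sub2api-codex-pool.yaml`; the skill and long-lived references describe only the rules and do not repeat current values.
+- `profiles.entries[].loadFactor`: optional per-account Sub2API `load_factor` override; when omitted, `pool.defaultAccountLoadFactor` is used. Concrete values are authoritative only in `config/platform-infra/sub2api-codex-pool.yaml`; after changing it, run `codex-pool sync --confirm` and `codex-pool validate`.
+- `profiles.entries[].trustUpstream`: optional account-level sentinel trust flag; defaults to `false`. Trusted accounts use `sentinel.cadence.trustedSuccessMaxIntervalMinutes` as the maximum probe backoff after consecutive successes; untrusted accounts use `sentinel.cadence.untrustedSuccessMaxIntervalMinutes`. It affects only sentinel probe frequency and state visibility, and does not change Sub2API account priority/capacity/loadFactor.
+- `pool.defaultSentinelProtect`: default account-level sentinel protection policy; whether it is enabled, the number of consecutive confirmations, the initial delay, the maximum delay, and the backoff multiplier are all defined only in YAML. Before a marker probe or gateway failure triggers a freeze, the sentinel performs consecutive marker confirmations under this policy, and the account enters the freeze state machine only if all of them fail.
+- `profiles.entries[].sentinelProtect`: optional account-level sentinel protection override; use it only to deviate deliberately from the pool default policy. It affects only sentinel freeze decisions and `sentinel-report` visibility, and does not change Sub2API account priority/capacity/loadFactor.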
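+- Unless the user explicitly asks for a configuration change, do not change account membership, priority, capacity, loadFactor, WebSocket mode, or any other scheduling policy on inference alone; keep the YAML as is, trace provenance and runtime evidence, write the conclusion back to the relevant issue or runbook, and only then propose a change.
+- Sub2API is a controlled component whose source UniDesk can read and whose runtime it can observe; when investigating Sub2API scheduling, failover, error propagation, temporary unschedulable state, or account selection, read the current Sub2API source implementation first, then verify with real request ids, Sub2API logs, and traffic through the original entry point. Do not use a mock upstream, a temporary probe account, or test stubs as the default source of conclusions; such probes are at most an explicit debugging aid and cannot replace the source-code path and real runtime evidence.
+- `profiles.entries[].tempUnschedulable`: optional per-account override of Sub2API's built-in temporary unschedulable rules; use it only to deviate deliberately from the pool default rules, not to give an account special priority or to bypass general failover temporarily.
+- `profiles.entries[].openaiResponsesWebSocketsV2Mode`: set only for upstreams that need Responses WebSocket v2; the value is `off`, `ctx_pool`, or `passthrough`.
+- `profiles.entries[].upstreamUserAgent`: set only for the few upstreams that require the Codex CLI User-Agent; it must not contain newlines.
+- `manualAccounts.protected`: accounts that were created and are maintained manually in Sub2API and must be excluded from UniDesk-managed Codex pool credentials and sentinel control. By default their credentials/status/schedulable/priority/capacity/loadFactor must not be changed; only when `proxyBinding` is explicitly declared may `sync --confirm` align the account's `proxy_id` to the egress proxy of the YAML target; only when `groupBinding.source: pool-group` is explicitly declared may the account be added to the pool group used by the unified consumer API key.
+- For sentinel configuration, marker-only decisions, the image, report/probe, and remote job/poll boundaries, see [sentinel.md](sentinel.md).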
+
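+For supported k3s targets, `sync --confirm` logs in to Sub2API admin, creates/updates the group, creates/updates the `unidesk-codex-*` accounts declared in YAML, creates/reuses the unified API key Secret, and deploys/updates sentinel resources; it does not directly restore an existing managed account to `schedulable=true`. Recovery is performed only by the sentinel, which triggers a recovery probe after reading `schedulable=false` from the Sub2API runtime and restores the account when the marker hits. By default `sync` does not delete managed accounts that are absent from YAML. Use `sync --confirm --prune-removed` only when explicitly retiring an upstream; it deletes absent `unidesk-codex-*` accounts that have `extra.unidesk_managed=true`. For `manualAccounts.protected`, `sync` performs only the narrow sync that YAML explicitly allows; the currently allowed items are creating/updating the Sub2API internal proxy record from the target `egressProxy` and binding `proxy_id`, and adding the protected manual account to the current `pool.groupName`. It still does not take over the account's credentials, status, schedulable, priority/capacity/loadFactor, or sentinel state. The PK01 host-Docker target does not have this full sync path until the codex-pool adapter is complete.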
+
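+`trace --request-id <requestId>` is a read-only request trace report; it does not trigger probes or modify accounts. By default it outputs the request start/final status, failover, `account_select_failed`, in-window `account_temp_unschedulable`, the admin schedulable write count, and the current account snapshot; `reason=failover-attempted-no-candidate` means Sub2API entered automatic account switching but found no available candidate after excluding the currently failing account. Use `--raw` for machine processing, and add `--show-lines` when the raw matched lines are needed.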
+
+For supported k3s targets, `sync --confirm`, `validate`, and sentinel operations may exceed a single short-lived SSH/runtime connection window; for the remote job/poll boundaries, see [sentinel.md](sentinel.md).
+
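+Do not enable Sub2API `pool_mode` for UniDesk-managed Codex accounts. The failover UniDesk expects is to mark the failing account temporarily unschedulable so that other accounts in the same group take over; `pool_mode` retries the same account path.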
+
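+WebSocket v2 is a set of account capabilities, not a scheduling pin. `openaiResponsesWebSocketsV2Mode` only declares that the account can carry the Codex Responses WSv2 path; `codex-pool validate` requires at least one `webSocketsV2.schedulableEnabled` account only when `localCodex.supportsWebSockets=true` / `localCodex.responsesWebSocketsV2=true`. Real availability is still determined by a direct Codex WSv2 probe, Sub2API logs, and a Codex smoke through the original entry point.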
+
+When Codex repeatedly shows WebSocket reconnect, HTTPS fallback, or `websocket closed by server before response.completed` at startup, or Sub2API logs show `openai.websocket_proxy_failed` / `openai.websocket_account_select_failed` / upstream WS handshake 4xx/5xx, first use runtime evidence to pinpoint the specific account and transport. If an account's WSv2 handshake fails, prefer narrowing only that account's `openaiResponsesWebSocketsV2Mode` to `off` in YAML; if no direct Codex WSv2 probe passes at all, also set `localCodex.supportsWebSockets` and `localCodex.responsesWebSocketsV2` to `false`, then run `codex-pool sync --confirm`. Do not incidentally change membership, priority, capacity, Secrets, or code fallback.
diff --git a/.agents/skills/unidesk-sub2api/references/full.md b/.agents/skills/unidesk-sub2api/references/full.md
deleted file mode 100644
index 6ce696a0..00000000
--- a/.agents/skills/unidesk-sub2api/references/full.md
+++ /dev/null
@@ -1,296 +0,0 @@
----
-name: unidesk-sub2api
-description: UniDesk Sub2API 平台运维技能。用户提到 Sub2API、sub2api、platform-infra sub2api、Codex pool、统一 API key、Sub2API FRP 暴露、Sub2API 管理 UI、配置 master ~/.codex 走 Sub2API、添加/删除 Codex 上游账号、校验 Sub2API /v1/models 时使用。
----
-
-# UniDesk Sub2API
-
-UniDesk 通过 `platform-infra sub2api` 运维 YAML 选中的 Sub2API target。当前 active target 以 `config/platform-infra/sub2api.yaml` 为准；PK01 可作为 host-Docker active target 并通过 PK01 Caddy 本地反代提供 `api.pikapython.com`，D518/D601 等 k3s target 仍可按 YAML 声明为 external-active 或 retired，G14 由同一 YAML/CLI 控制为 standby predeploy。日常操作统一使用 UniDesk CLI，不直接写 Kubernetes 资源或手工调用 Sub2API 管理 API。
-
-**固定入口**: `cd /root/unidesk && bun scripts/cli.ts platform-infra sub2api ...`
-
-## 先看报表
-
-查 Codex pool 哨兵状态、账号冻结/恢复、marker 命中、下一次 probe、最近 CronJob run、token/cost 账本时，优先使用这个低噪声报表入口，不要先翻 ConfigMap、CronJob 日志或 Sub2API UI：
-
-```bash
-bun scripts/cli.ts platform-infra sub2api codex-pool sentinel-report
-```
-
-需要机器处理或完整字段时再加 `--raw`；需要更多最近运行记录时加 `--events N`。
-
-追溯某个 Codex/Sub2API request id 的中断、上游账号、切号、临时不可调度、账号选择失败和同窗口账号池信号时，优先使用低噪声 trace 报表，不要先手写 `kubectl logs | grep`：
-
-```bash
-bun scripts/cli.ts platform-infra sub2api codex-pool trace --request-id <requestId>
-```
-
-默认输出类似 k8s/ps 的短表；机器处理用 `--raw` 读取 `.data.trace.*`；需要审计原始匹配日志时加 `--show-lines`；需要扩大搜索范围时使用 `--since 24h --tail 50000`。该命令只读：读取 Sub2API 日志、账号快照和 admin API 元数据，不改 `schedulable`、不清 runtime backoff、不中断请求。
-
-## 先读边界
-
-- 仓库长期开发边界见 `docs/reference/platform-infra.md`，本 skill 承担日常操作手册。
-- 配置真相是 YAML：`config/platform-infra/sub2api.yaml` 和 `config/platform-infra/sub2api-codex-pool.yaml`。
-- 业务策略和具体数值只以 YAML 为准。已有字段的数值调整只改 YAML 并跑 `plan` / `sync --confirm` / `validate`；不要自动补代码硬编码、schema 硬范围、合同测试、单元测试或长期参考文档。配置校验只校验格式、类型、必填和可渲染性，不判断数值策略是否“合理”。
-- 本 skill 目录下若存在 `agents/*.yaml`，只作为 skill/agent 展示与调用元数据，不是 Sub2API 或 Codex pool 运行配置；不要在 skill 目录维护第二份账号、capacity、priority、endpoint 或 Secret 配置。
-- Runtime target 由 `config/platform-infra/sub2api.yaml` 声明；默认 target 来自 YAML `defaults.targetId`，当前 `api.pikapython.com` 使用 PK01 host-Docker target。`D518:k3s`、`D601:k3s` 这类 k3s target 必须通过显式 `--target` 选择并按 YAML role 判定 active/retired/standby，`G14:k3s` 是 standby target。master server 只是控制端和消费者，不部署 Sub2API/PostgreSQL/Redis。
-- Standby target 不部署本地 PostgreSQL，不运行 sentinel、FRP 管理入口或 HTTPS egress proxy；只能预部署 namespace、NetworkPolicy、Service，以及 replicas=0 的 Sub2API/Redis Deployment。Redis 激活后也只允许 ephemeral cache。External-active target 仍不部署本地 PostgreSQL，必须直连 YAML 声明的外置 DB，使用本地 ephemeral Redis，并且只有在 YAML 启用时才运行 frpc、egress proxy 和目标级 sentinel；多个 external-active target 可以并存，不得把一个 target 的 public exposure 当作另一个 target 的替代或回退。
-- Secret、`~/.codex/config.toml*`、`~/.codex/auth.json*` 是运行时输入或本地状态，不提交。
-- 默认 `~/.codex/config.toml` 和 `~/.codex/auth.json` 只作为统一 Sub2API consumer 使用；`config.toml` 必须指向 YAML-selected active target 的 consumer URL，`auth.json` 必须使用统一 pool API key。新增上游账号不得覆盖这两个默认文件，只能新增 `config.toml.<profile>` / `auth.json.<profile>` 并在 YAML 里声明。
-- 输出只能包含 Secret 路径、key 名、presence、fingerprint 和 `valuesPrinted=false`；禁止打印完整 API key、admin password、JWT secret、TOTP key、base64 payload 或可复制的 preview。
-
-## 部署与状态
-
-```bash
-bun scripts/cli.ts platform-infra sub2api plan
-bun scripts/cli.ts platform-infra sub2api plan --target G14
-bun scripts/cli.ts platform-infra sub2api apply --dry-run
-bun scripts/cli.ts platform-infra sub2api apply --target G14 --dry-run
-bun scripts/cli.ts platform-infra sub2api apply --confirm
-bun scripts/cli.ts platform-infra sub2api apply --target G14 --confirm
-bun scripts/cli.ts platform-infra sub2api status
-bun scripts/cli.ts platform-infra sub2api status --target G14
-bun scripts/cli.ts platform-infra sub2api validate
-bun scripts/cli.ts platform-infra sub2api validate --target G14
-```
-
-- `plan` 读取 `config/platform-infra/sub2api.yaml`，渲染 `src/components/platform-infra/sub2api/sub2api.k8s.yaml`，检查 no Ingress/NodePort/LoadBalancer/hostPort/hostNetwork/resource limits，并要求 `NetworkPolicy/allow-all` 随 manifest 受控创建。
-- `apply --confirm` 默认创建异步 job；按返回的 `job status` 命令轮询，再跑 `status` 和 `validate`。
-- `status --full|--raw` 只在需要展开远端 stdout/stderr 或原始 JSON 时使用。
-- `validate` 是按需验收，不是连续可用性探针。对 standby target，`validate --target <id>` 验证预部署形态，不要求外置 DB 当前可连接；对 external-active target，必须验证外置 DB、ephemeral Redis、Sub2API service、YAML egress proxy 和目标级 public exposure。
-
-## PK01 host-Docker target
-
-PK01 host-Docker target 由 `config/platform-infra/sub2api.yaml` 的 target `runtimeMode: host-docker` 控制。`api.pikapython.com` 的当前路径是 `client -> PK01 Caddy -> 127.0.0.1:<YAML local upstream port> -> PK01 host-Docker Sub2API`，不是 D601 FRP 路径。优先用以下受控入口分层判断：
-
-```bash
-bun scripts/cli.ts platform-infra sub2api status --target PK01
-bun scripts/cli.ts platform-infra sub2api validate --target PK01
-```
-
-PK01 没有 k3s control plane。当前 `codex-pool sync`、`codex-pool validate`、`sentinel-report` 和 `trace` 的部分实现仍依赖 k8s/kubectl 远端脚本；在 PK01 host-Docker target 上看到 `kubectl` 缺失时，应归类为 CLI host-Docker adapter 缺口，不要误判为 Sub2API app、Caddy、上游或账号池故障。正式修复应补 host-Docker 版 codex-pool sync/validate/report/trace；临时排障只能做只读 admin API、DB join 表和最小公网 `/v1/responses` smoke，并且不得打印 admin password、API key 或账号凭据。
-
-## D601 Egress Proxy
-
-D601 的目标级 `egressProxy` 完全由 `config/platform-infra/sub2api.yaml` 控制。当前成熟形态是 master Docker `shadowsocks-rust` 作为加密出站源，D601 k3s 内 `sing-box` 暴露 HTTP/mixed ClusterIP proxy 给 Sub2API 和按 YAML 启用的 sentinel 使用。不要把 endpoint、端口、密码、健康探针或镜像 tag 写进 skill；只以 YAML 和 `config/platform-infra/sub2api-master-egress-proxy.compose.yaml` 为准。
-
-master 侧 proxy 由 UniDesk checkout 内的 compose 文件管理：
-
-```bash
-docker compose -f config/platform-infra/sub2api-master-egress-proxy.compose.yaml up -d --force-recreate
-bun scripts/cli.ts platform-infra sub2api apply --target D601 --confirm
-bun scripts/cli.ts platform-infra sub2api validate --target D601
-bun scripts/cli.ts platform-infra sub2api codex-pool sync --target D601 --confirm
-bun scripts/cli.ts platform-infra sub2api codex-pool validate --target D601
-```
-
-proxy secret/config 文件只允许放在受控 Secret/state 路径，输出只能披露路径、presence、fingerprint 或摘要，不能打印密码、完整订阅或生成配置。若 D601 到上游的 TLS/SNI 路径被 reset，不要用临时 JS 或简陋 HTTP CONNECT proxy 作为最终方案；通过 YAML/compose 更换或修复成熟加密 proxy source，再跑上面的 apply/validate/sync/validate 闭环。
-
-## 镜像升级
-
-1. 修改 `config/platform-infra/sub2api.yaml` 的 `image.repository`、`image.tag` 或 `pullPolicy`。
-2. 执行 `sub2api plan`，确认策略检查通过。
-3. 执行 `sub2api apply --confirm`，轮询 job 完成。
-4. 执行 `sub2api status`，确认运行镜像等于 YAML 声明。
-5. 执行 `sub2api validate` 或 `codex-pool validate` 做入口验收。
-
-不要把镜像版本写进脚本常量、JSON 或 manifest 模板。
-
-## Codex Pool
-
-当前 codex-pool sync/validate/report/trace 适配器主要覆盖 k3s target。若 YAML 默认 target 是 PK01 host-Docker，不要直接把无 `--target` 的 codex-pool 命令当成验收入口；先使用 `sub2api status --target PK01`、`sub2api validate --target PK01` 和最小 public `/v1/responses` smoke。host-Docker codex-pool adapter 补齐前，k3s 账号池操作必须显式选择 k3s target：
-
-```bash
-bun scripts/cli.ts platform-infra sub2api codex-pool plan --target D601
-bun scripts/cli.ts platform-infra sub2api codex-pool sync --target D601 --confirm
-bun scripts/cli.ts platform-infra sub2api codex-pool validate --target D601
-bun scripts/cli.ts platform-infra sub2api codex-pool trace --request-id <requestId>
-bun scripts/cli.ts platform-infra sub2api codex-pool sentinel-image status --target D601
-bun scripts/cli.ts platform-infra sub2api codex-pool sentinel-image build --target D601 --confirm
-bun scripts/cli.ts platform-infra sub2api codex-pool sentinel-probe --target D601 --account unidesk-codex-hy --confirm
-bun scripts/cli.ts platform-infra sub2api codex-pool sentinel-report --target D601
-bun scripts/cli.ts platform-infra sub2api codex-pool cleanup-probes --target D601 --confirm
-```
-
-`config/platform-infra/sub2api-codex-pool.yaml` 控制：
-
-- `pool.groupName`: Sub2API group 名称。
-- `pool.apiKeySecretName` / `pool.apiKeySecretKey`: 统一消费 API key 的 k3s Secret 位置，默认 `platform-infra/sub2api-codex-pool-api-key.API_KEY`。
-- `pool.minOwnerBalanceUsd`: pool key owner 最低余额，sync/validate 会补齐。
-- `pool.minOwnerConcurrency`: 可选统一消费 API key owner 最低并发；省略时 CLI 自动使用所有已解析账号 capacity 的总和，sync/validate 会补齐。显式 YAML 值只作为 override，仍必须不小于账号 capacity 总和；未显式写 `profiles.entries[].capacity` 的账号会使用 `pool.defaultAccountCapacity` 参与求和，不要用提高某个 provider capacity 来掩盖用户并发层 WS 1013。
-- `pool.defaultTempUnschedulable`: Sub2API 内置请求路径临时不可调度开关和 YAML 规则列表。当前要求是按 YAML 开启通用规则；sync 把 `temp_unschedulable_enabled` / `temp_unschedulable_rules` 渲染到 managed accounts，让匹配的 400/5xx/超时/模型路由/加密内容错误短暂冷却当前账号并触发同组 failover。
-- `pool.defaultTempUnschedulable` 与外部 `sentinel.*` 分开配置、互不驱动。内置规则负责 near-real-time request-path cooling/failover；哨兵负责 marker health、账号级隔离/恢复和 probe 退避。
-- 外部 sentinel 的写入面只允许通过 Sub2API admin `schedulable` 接口冻结/恢复账号；不能写入、清理或间接清理 `temp_unschedulable_until` / `temp_unschedulable_reason`、rate-limit、overload、model-rate-limit 等 Sub2API 请求路径 runtime 状态，也不能调用 `recover-state` 作为恢复动作。看到 UI 里的“触发时间/解除时间/规则序号/匹配关键词”临时不可调度状态时，默认先归因到 Sub2API 内置 request-path temp-unschedulable，而不是 sentinel。
-- YAML 只选择和配置 Codex 上游，不声明 `schedulable` 长期字段；`codex-pool sync --confirm` 不负责把既有账号恢复为 `schedulable=true`。既有账号的 `schedulable=false` 必须由哨兵先同步 Sub2API runtime 状态，再在 marker probe 命中后恢复。
-- `profiles.entries`: 从 master `~/.codex/` 选择上游 profile 并映射到 Sub2API account。
-- `profiles.entries[].capacity`: 可选 per-account concurrency override；不写则使用 `pool.defaultAccountCapacity`。具体数值只以 `config/platform-infra/sub2api-codex-pool.yaml` 为准，skill 和长期参考只描述规则，不重复写当前值。
-- `profiles.entries[].loadFactor`: 可选 per-account Sub2API `load_factor` override；不写则使用 `pool.defaultAccountLoadFactor`。具体数值只以 `config/platform-infra/sub2api-codex-pool.yaml` 为准，修改后必须 `codex-pool sync --confirm` 和 `codex-pool validate`。
-- `profiles.entries[].trustUpstream`: 可选账号级哨兵信任标记；默认 `false`。可信账号使用 `sentinel.cadence.trustedSuccessMaxIntervalMinutes` 作为连续成功后的最大探测退避，不可信账号使用 `sentinel.cadence.untrustedSuccessMaxIntervalMinutes`。它只影响哨兵探测频率和状态可见性，不改变 Sub2API account priority/capacity/loadFactor。
-- `pool.defaultSentinelProtect`: 账号级哨兵保护默认策略；是否启用、连续确认次数、初始延迟、最大延迟和退避倍率都只以 YAML 为准。marker probe 或 gateway failure 触发冻结前会先按该策略做连续 marker 确认，只有全部失败才进入冻结状态机。
-- `profiles.entries[].sentinelProtect`: 可选账号级哨兵保护覆盖；只用于明确偏离 pool 默认策略。它只影响哨兵冻结判定和 `sentinel-report` 可见性，不改变 Sub2API account priority/capacity/loadFactor。
-- 除非用户明确要求修改配置，不要仅凭推断改账号 membership、priority、capacity、loadFactor、WebSocket mode 或其他调度策略；先保留 YAML，完成 provenance/runtime evidence 溯源，并把结论写回相关 issue 或 runbook 后再提出变更。
-- Sub2API 是 UniDesk 可读源码和可观测运行面的受控组件；排查 Sub2API 调度、failover、错误传播、临时不可调度或 account selection 时，默认先读当前 Sub2API 源码实现，再用真实 request id、Sub2API 日志和原入口流量验证。不要用 mock upstream、临时 probe account 或测试桩作为默认结论来源；这类探针最多是显式 debug 辅助，不能替代源码链路和真实运行证据。
-- `profiles.entries[].tempUnschedulable`: 可选 per-account Sub2API 内置临时不可调度覆盖；只用于明确偏离 pool 默认规则，不用它给某个账号特殊优先级或临时绕过通用 failover。
-- `profiles.entries[].openaiResponsesWebSocketsV2Mode`: 需要 Responses WebSocket v2 的上游才设置，值为 `off`、`ctx_pool` 或 `passthrough`。
-- `profiles.entries[].upstreamUserAgent`: 少数要求 Codex CLI User-Agent 的上游才设置，不能含换行。
-- `manualAccounts.protected`: 已在 Sub2API 手动创建/维护、且必须排除在 UniDesk-managed Codex pool credentials 和 sentinel 控制之外的账号。默认不得改 credentials/status/schedulable/priority/capacity/loadFactor；只有显式声明 `proxyBinding` 时，`sync --confirm` 才允许把该账号的 `proxy_id` 对齐到 YAML 目标的 egress proxy；只有显式声明 `groupBinding.source: pool-group` 时，才允许把该账号加入统一消费 API key 使用的 pool group。
-- `sentinel.monitor.enabled`: 账号级 marker 哨兵监控开关；开启后 `codex-pool sync --confirm` 会在 `platform-infra` 创建/更新 k8s CronJob、ConfigMap、Secret、ServiceAccount、Role 和 RoleBinding。CronJob 直打 YAML-managed 上游账号的 OpenAI Responses `gpt-5.5`，用确定 marker 作为唯一健康标准，并在独立 state ConfigMap 中记录 token/cost 账本。
-- `sentinel.actions.enabled`: 账号级哨兵冻结/恢复动作开关；当前 marker-only guard 要求开启。动作关闭时只记录 `would-freeze`，不会调用 Sub2API admin API 改 `schedulable`。动作开启后，只要不满足 marker match，不论是 HTTP 200 私货、4xx/5xx、非 JSON、连接错误还是空输出，都进入同一个冻结/恢复状态机。
-- `sentinel.sdk.openaiPythonVersion`: 哨兵容器使用的 OpenAI Python SDK 固定版本；模型请求必须通过标准 SDK `responses.create`，不要手工拼 `/v1/responses` 请求体或手写响应解析。后续升级 SDK 只改 YAML 并 `sync --confirm`。
-- `sentinel.probe.maxOutputTokens`: 哨兵本地流式 delta 收集上限，必须保持小值；它不作为上游 `max_output_tokens` 字段发送，以保持与 Sub2API WebUI 默认账号连接测试的 Responses SSE 请求形态一致。哨兵不限制并发和每轮账号数，所有到期账号会在同一轮并发探测。
-- `sentinel.probe.userAgent`: 哨兵 direct upstream probe 的默认 User-Agent，通过 OpenAI SDK `extra_headers` 传递；默认贴近 Sub2API `net/http` 账号连接测试形态，个别账号仍可用 `profiles.entries[].upstreamUserAgent` 覆盖。
-- `sentinel.cadence`: 成功信任指数退避配置。当前口径是从 1 分钟开始，连续成功后按账号 `trustUpstream` 选择可信/不可信最大退避；任意非 marker match 清零成功信任并进入冻结退避。可信/不可信最大退避数值只写 YAML。
-- `sentinel.freeze`: 失败冻结 TTL 指数退避配置。当前口径是初始 1 分钟，失败后 `1m -> 2m -> 4m -> 8m -> 10m`，最大 10 分钟；失败 probe 基本不消耗有效输出 token，因此冻结窗口保持短周期。冻结到期后只做恢复 probe，通过才自动恢复，不能仅靠 TTL 到期解封。
-- `sentinel.pricing`: 直打上游时哨兵自己的 token/cost 估算价格。因为 direct upstream probe 不经过 Sub2API 普通用量账本，哨兵必须自己记录全局与 per-account token/cost；这些账本只用于观察，不作为跳过探测的预算门禁。
-
-对已支持的 k3s target，`sync --confirm` 会登录 Sub2API admin、创建/更新 group、创建/更新 YAML 中的 `unidesk-codex-*` accounts、创建/复用统一 API key Secret，并部署/更新哨兵资源；它不把既有 managed account 直接恢复为 `schedulable=true`。恢复只由哨兵在读取 Sub2API runtime `schedulable=false` 后触发 recovery probe，并在 marker 命中时执行。`sync` 默认不删除 YAML 中缺席的 managed account。只有明确退役上游时才使用 `sync --confirm --prune-removed` 删除缺席且 `extra.unidesk_managed=true` 的 `unidesk-codex-*` account。对 `manualAccounts.protected`，`sync` 只执行 YAML 显式允许的窄同步；当前允许项是从目标 `egressProxy` 创建/更新 Sub2API internal proxy 记录并绑定 `proxy_id`，以及把受保护手动账号加入当前 `pool.groupName`。它仍不接管该账号凭据、status、schedulable、priority/capacity/loadFactor 或哨兵状态。PK01 host-Docker target 在 codex-pool adapter 补齐前不具备这条完整 sync 路径。
-
-`sentinel-image status|build` 管理哨兵 Python 运行环境镜像。镜像由 YAML 的 `sentinel.image` 基础镜像和 `sentinel.sdk.openaiPythonVersion` 派生，发布到目标 runtime 的本地 registry；`build --confirm` 会先检查 registry tag，存在则快速复用，不存在才在目标 host 构建并 push。CronJob 启动时只校验 SDK 版本，不在运行时 `pip install`。目标是否启用哨兵以 `config/platform-infra/sub2api.yaml` 的 `sentinel.enabledOnTargets` 为准；未启用的 target 在 `sync`/`validate` 中应显示 `skipped-target-disabled`，不得要求镜像构建、CronJob、Secret 或 state ConfigMap 存在。
-
-`sync --confirm` 同时会按 YAML 渲染账号级哨兵资源，并在 monitor 开启时先确保可复用哨兵镜像存在。当前目标是 `sentinel.monitor.enabled=true` + `sentinel.actions.enabled=true` 的 marker-only 自动冻结/恢复；不要手工 patch CronJob、Secret 或 Sub2API account。若 YAML 新增账号或修改 profile/base URL/API key fingerprint/upstream User-Agent/Responses WebSocket mode，sync 会从变更前 runtime state 写入 pending probe 记录并立即安排 sentinel probe，但不会把既有账号直接恢复为可调度；只有 sentinel 读取到 Sub2API runtime `schedulable=false` 后执行 recovery probe，且 marker 命中，才恢复 `schedulable=true`。sentinel 冻结/恢复只改 `schedulable=false|true`，不得顺手调用 Sub2API `recover-state` 清除请求路径临时不可调度或其他 runtime backoff。无关账号的既有成功/失败退避不能被重置。若 YAML 下调失败冻结最大窗口，sync 会把仍 active 的旧冻结状态迁移到当前最大窗口内并立即安排 recovery probe，但不会直接解冻。若怀疑某个账号被误判，先用 `codex-pool sentinel-probe --account <accountName> --confirm` 立即触发该账号测量；该命令从现有 CronJob 模板派生一次性 Job，复用同一份 Secret、ConfigMap、OpenAI SDK probe、token/cost 账本和冻结/恢复状态机。
-
-`trace --request-id <requestId>` 是只读 request 追溯报表，不触发 probe、不修改账号。默认输出请求开始/最终状态、failover、`account_select_failed`、窗口内 `account_temp_unschedulable`、admin schedulable 写入计数和当前账号快照；`reason=failover-attempted-no-candidate` 表示 Sub2API 已进入自动切号，但排除当前失败账号后没有可用候选。需要机器处理时使用 `--raw`，需要原始匹配行时加 `--show-lines`。
-
-`sentinel-report` 是只读低噪声报表，不触发 probe、不修改账号。默认输出类似 `ps` 的文本表，展示每个账号的探测次数、Sub2API runtime `schedulable`、最近 marker/HTTP/动作、冻结 TTL、成功退避、下一次 probe 和最近 run 事件；`SCH` 展示 Sub2API runtime schedulable，`PROT` 展示账号级保护阈值，`P_FAIL` 展示最近一次保护确认中的失败次数/阈值；需要机器处理时使用 `sentinel-report --raw`。
-
-对已支持的 k3s target，`sync --confirm` 和 `validate` 可能超过单次 SSH/runtime 短连接窗口。必须继续使用 `bun scripts/cli.ts platform-infra sub2api codex-pool ... --target <k3s-target>`，由 CLI 在目标远端提交作业并短轮询状态；不要改用裸 `trans <target>:k3s sh` 等一个长连接等待完整结果。若看到 `UNIDESK_SSH_RUNTIME_TIMEOUT`，先按 `docs/reference/platform-infra.md` 的规则处理为控制面可见性问题，修 CLI/job/poll 或重跑受控命令，不要手工 patch Sub2API credentials 或源码。
-
-不要给 UniDesk-managed Codex accounts 开 Sub2API `pool_mode`。UniDesk 期望的 failover 是把失败账号临时标记为 unschedulable，让同组其他账号接手；`pool_mode` 会重试同一个 account path。
-
-WebSocket v2 是账号能力集合，不是调度 pin。`openaiResponsesWebSocketsV2Mode` 只声明该账号可承担 Codex Responses WSv2 链路；只有 `localCodex.supportsWebSockets=true` / `localCodex.responsesWebSocketsV2=true` 时，`codex-pool validate` 才必须看到至少一个 `webSocketsV2.schedulableEnabled` 账号。真实可用性仍以 direct Codex WSv2 probe、Sub2API 日志和原入口 Codex smoke 为准。
-
-Codex 启动时反复出现 WebSocket reconnect、HTTPS fallback、`websocket closed by server before response.completed`，或 Sub2API 日志出现 `openai.websocket_proxy_failed` / `openai.websocket_account_select_failed` / 上游 WS handshake 4xx/5xx 时，先按运行证据定位具体 account 和 transport。若账号的 WSv2 握手失败，优先只在 YAML 中把该账号的 `openaiResponsesWebSocketsV2Mode` 收敛为 `off`；若没有任何 direct Codex WSv2 probe 通过，则同时把 `localCodex.supportsWebSockets` 与 `localCodex.responsesWebSocketsV2` 收敛为 `false`，再 `codex-pool sync --confirm`。不要顺手改 membership、priority、capacity、Secret 或代码 fallback。
-
-## 受保护手动账号代理与分组绑定
-
-Sub2API 管理 UI 的账号连接测试使用账号级 `ProxyID` / proxy URL 配置上游 HTTP transport；账号没有绑定 proxy 时会直接出站，即使 Sub2API Pod 已经有 `HTTP_PROXY` / `HTTPS_PROXY` 环境变量。看到 WebUI 账号测试连 `chatgpt.com` 超时、但 Pod 内显式走目标 proxy 可通时，先检查该账号是否属于 `manualAccounts.protected` 并声明了 `proxyBinding`。如果同一账号用 `gpt-5.2-pro` 返回 ChatGPT OAuth 不支持 Codex 的模型能力错误，但默认/受支持模型能完成 `hi` 或 `/v1/responses` smoke，这不是代理失败；按模型映射/账号能力另行处理。
-
-如果 WebUI 账号连接测试显示 `proxyconnect tcp: dial tcp 127.0.0.1:<port>: connect: connection refused`，先确认该 proxy URL 是账号级 loopback 配置：在 k3s target 内，`127.0.0.1` 是 Sub2API Pod 自己，不是节点或 PK01。不要先改账号凭据、PK01 Caddy、`api.pikapython.com` 或统一 key；应在目标 `config/platform-infra/sub2api.yaml` 声明 `targets[].accountLocalProxy`，由 `platform-infra sub2api apply --target <id>` 渲染同 Pod sidecar 和 Secret，再用 `validate --target <id>` 的 `accountLocalProxy` 探针验证 `http://127.0.0.1:<port>`。输出仍只允许披露 sourceRef、fingerprint、secretName 和 proxyUrl，不打印 proxy 密码或生成配置。
-
-WebUI 账号连接测试也不经过统一消费 API key 的 pool group 选择器；账号测试正常不代表 PC Codex 客户端能选中该账号。看到 WebUI 账号测试正常、但 `/responses` 或 `/v1/responses` 以 `account-select-failed` / `no available accounts` 返回 503 时，先检查该手动账号是否声明了 `groupBinding.source: pool-group`，并确认 Sub2API `account_groups` join 里存在该账号与当前统一 API key `group_id` 的绑定。对已支持的 k3s target，通过 `sync --confirm` 加入当前 `pool.groupName`；对 PK01 host-Docker target，在 host-Docker codex-pool sync adapter 补齐前，只能用最小 admin API 写入 `group_ids` 做运行面恢复，且必须只输出 account id、group id、presence/fingerprint 和 smoke 状态，不打印密钥。
-
-受保护手动账号仍由人工在 Sub2API UI 维护 credentials/status 等字段；UniDesk 只允许通过 YAML 做代理和分组窄绑定：
-
-```bash
-bun scripts/cli.ts platform-infra sub2api codex-pool plan --target D601
-bun scripts/cli.ts platform-infra sub2api codex-pool sync --target D601 --confirm
-bun scripts/cli.ts platform-infra sub2api codex-pool validate --target D601
-```
-
-`sync` 输出应显示 `manualAccounts.ok=true`、`proxySync.ok=true`、`groupSync.ok=true`，且该账号的 proxy/group `bindingAligned=true`。`sentinel-probe --account <manual-account> --confirm` 对受保护手动账号必须继续拒绝，通常返回 `account-protected-manual`；不要为了测试而把该账号移入 `profiles.entries` 或取消保护。需要证明 WebUI 同款账号测试恢复时，用 Sub2API admin account test 原入口测最小 `hi` 和默认/受支持模型，并只记录 account id、proxy id、event types、HTTP status 和短 output preview，不记录 OAuth token 或 Secret 明文。若指定模型返回 “model is not supported when using Codex with a ChatGPT account” 一类能力错误，先归因到模型能力/映射，而不是 proxy。
-
-## 添加上游
-
-1. 在 master `~/.codex/` 准备带后缀的上游 profile 文件，例如 `config.toml.<profile>` 和 `auth.json.<profile>`；禁止覆盖默认 `config.toml` / `auth.json`。
-2. 在 `config/platform-infra/sub2api-codex-pool.yaml` 添加 `profiles.entries` 项，指定 `profile`、`accountName`、`configFile`、`authFile`。
-3. 如需要，给该项加 `priority`、`capacity`、`loadFactor`、`trustUpstream`、`sentinelProtect`、`openaiResponsesWebSocketsV2Mode` 或 `upstreamUserAgent`；capacity/loadFactor/信任退避/保护阈值的具体数值只写在 YAML。只有显式恢复 Sub2API 内置临时不可调度时才添加 per-account `tempUnschedulable`。
-4. 如果新增账号会提高声明 capacity 总和，默认让省略的 `pool.minOwnerConcurrency` 继续按 capacity 总和自动解析；只有 YAML 已经显式写了该 override 时，才同步提高到不低于总 capacity，或删除 override 回到自动解析。
-5. 跑 `codex-pool plan`，确认 profile 可读、`base_url` 和 API key 来源有效，且 stdout 未泄露完整 key。
-6. 跑 `codex-pool sync --confirm`。
-7. 跑 `codex-pool validate`。
-
-普通新增上游是 YAML 操作，不走 CI/CD，不改代码。只有需要渲染或校验上游 Sub2API 已经存在的可复用能力时才修改 `scripts/src/platform-infra-sub2api-codex.ts`；Sub2API 本身不支持的能力不在 UniDesk 侧魔改实现。
-
-## 删除上游
-
-删除上游只用于明确退役、凭据所有权变更或用户明确要求移除 provider；不能作为上游 5xx、compact 失败、限流、模型路由失败或哨兵隔离/恢复问题的处理手段。
-
-1. 从 `config/platform-infra/sub2api-codex-pool.yaml` 删除对应 `profiles.entries` 项。
-2. 跑 `codex-pool plan` 检查 desired 列表。
-3. 跑 `codex-pool sync --confirm --prune-removed`。
-4. 确认输出 `accounts.pruned` 只包含期望删除项。
-5. 跑 `codex-pool validate`。
-
-CLI 默认保留缺席账号，避免把可用性问题误处理成删除；只有显式 `--prune-removed` 才会 prune `name` 以 `unidesk-codex-` 开头且 `extra.unidesk_managed=true` 的缺席账号。
-
-## FRP 暴露
-
-```bash
-bun scripts/cli.ts platform-infra sub2api codex-pool expose
-bun scripts/cli.ts platform-infra sub2api codex-pool expose --confirm
-```
-
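-- Controlled by YAML `publicExposure`. The default public endpoint of the Codex pool is the target's `publicBaseUrl`; a host-Docker target can use `mode: pk01-local` so that PK01 Caddy reverse-proxies the local loopback app directly, and a k3s external-active target can use an FRP remotePort. Do not infer one target's exposure mode as the default for other targets.
-- `expose --confirm` only adds the master `frps` allow port for the YAML-specified `remotePort` and creates/updates `sub2api-frpc` on G14.
-- The master Caddy site is also rendered from `publicExposure.masterCaddy`; `responseHeaderTimeoutSeconds` must be long enough to cover long Codex `/responses/compact` requests, so that Caddy does not return 504 first while Sub2API actually succeeds in the background slightly later. Change the value only in `config/platform-infra/sub2api-codex-pool.yaml`, run `codex-pool expose --confirm` afterwards, then check the rendered `response_header_timeout` in the Caddyfile.
-- The master Caddy short-window edge retry is rendered from `publicExposure.masterCaddy.edgeRetry`; it absorbs 502s where the request has not yet reliably reached Sub2API, such as a briefly closed FRP remotePort, `connect: connection refused`, EOF, or connection reset. Write the retry duration, interval, and `retryMatch` scope only in YAML, run `codex-pool expose --confirm` afterwards, then check the rendered `lb_try_duration`, `lb_try_interval`, and `lb_retry_match` in the Caddyfile. Do not hand-patch `/etc/caddy/Caddyfile`.
-- PK01 `/etc/caddy/Caddyfile` is an edge artifact shared by multiple YAML sources, including Sub2API, LangBot, n8n, and HWLAB. Sub2API apply/expose may only update its own managed block and must preserve the other blocks; when the same Sub2API service exposes multiple targets, D601 keeps the existing `# BEGIN unidesk managed sub2api`, and non-default targets must use a target-scoped owner (for example `sub2api-d518`) so that `api.pikapython.com` and `api2.pikapython.com` do not overwrite each other. If the apply output shows an abnormal managed block count, stop the closeout first and check the PK01 Caddy merge and validation results; do not overwrite the whole file by hand.
-- Round-trip retry for non-idempotent POSTs must be narrowed to the safe paths declared in YAML `retryMatch`; upstream account errors on ordinary `/responses` still belong to Sub2API failover / temp-unschedulable / sentinel handling, and Caddy must not replay a whole inference request to mask account-pool problems.
-- The same public entry exposes both the OpenAI-compatible API and the Sub2API admin UI `/login`. FRP targets use the same FRP TCP entry; the PK01 local target uses the managed block from PK01 Caddy to the local app. Do not open a second admin port unless YAML explicitly declares a new exposure decision.
-- The Sub2API Kubernetes Service of a k3s target stays ClusterIP.
-- The public path of a k3s external-active target is `client -> PK01 Caddy -> PK01 frps remotePort -> target frpc -> Sub2API`; the public path of the PK01 host-Docker target is `client -> PK01 Caddy -> 127.0.0.1:<local upstream port> -> Sub2API`. Neither goes through pikanode or a master server reverse proxy. PK01 Caddy downloads must use the proxy specified by YAML `publicExposure.pk01.caddyDownloadProxyUrl`; if the Caddy download is slow, first confirm that the apply output shows `downloadProxy.mode=curl-proxy`. The target domain must resolve to the PK01 public address declared in YAML before HTTPS can serve as final verification; `api.pikapython.com` and `api2.pikapython.com` should each be accepted against their own YAML target.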
-
-## Configuring the master Codex consumer
-
-```bash
-bun scripts/cli.ts platform-infra sub2api codex-pool configure-local
-bun scripts/cli.ts platform-infra sub2api codex-pool configure-local --confirm
-```
-
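-`configure-local --confirm` will:
-
-- Read the unified API key from `platform-infra/<apiKeySecretName>.<apiKeySecretKey>`.
-- Back up the current `~/.codex/config.toml` and `~/.codex/auth.json` as `.<backupSuffix>`, default `.pre-sub2api`.
-- Rewrite the default `~/.codex` consumer so it always points to `https://sub2api.74-48-78-17.nip.io/`; the provider name and wire API come from `localCodex`.
-- Write `model_context_window` / `model_auto_compact_token_limit` from `localCodex.modelContextWindow` / `localCodex.modelAutoCompactTokenLimit`, which centrally controls the Codex auto compact trigger window and keeps GPT-5.5 consumers from generating oversized long `/responses/compact` requests.
-- Write the Codex transport flags from `localCodex`: `supports_websockets` and `[features] responses_websockets_v2` must be enabled or disabled together. Enable them only when at least one upstream passes the direct Codex WSv2 probe; otherwise stay on HTTP Responses, so that the original entry does not go through a futile WS reconnect every time.
-- Run one gateway verification with the unified key.
-
-Anti-recursion rule: the default `config.toml` / `auth.json` are the Sub2API consumer and must not be imported back into the pool as an upstream account; upstream accounts must use suffixed profile files and be added or removed through `profiles.entries` in `config/platform-infra/sub2api-codex-pool.yaml`.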
-
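-## Acceptance criteria
-
-A deployment closeout includes at least:
-
-- `sub2api status`: Deployment/StatefulSet/Service/Secret/NetworkPolicy are visible, the running image matches YAML, and `NetworkPolicy/allow-all` has `podSelector: {}` with Ingress/Egress fully allowed.
-- `sub2api validate`: the app, PostgreSQL, Redis, service proxy, `NetworkPolicy/allow-all`, and temporary cross-Pod PostgreSQL/Redis connectivity checks pass.
-- `codex-pool validate`: `GET /v1/models` with the unified key succeeds, and a small `POST /v1/responses` smoke runs with `localCodex.responsesSmokeModel`; owner balance / owner concurrency meet the YAML minimums, and capacity, WebSocket v2, the Sub2API built-in temporary-unschedulable switch/rules, and sentinel runtime state align with YAML; `validation.gatewayResponsesRecent` summarizes the last 6 hours of failover, forward failure, final 4xx/5xx, slow final error, and `context canceled` evidence for ordinary `/responses` and `/v1/responses`, and `validation.gatewayCompactRecent` summarizes `/responses/compact` evidence separately. If the current Responses smoke has `ok=true` but the recent field has `degraded=true`, first determine whether it is residue from the historical window or new request ids currently failing; see `docs/reference/platform-infra.md` for the long-term criteria.
-- If `publicExposure.enabled=true`, confirm that the public path declared in YAML works. For FRP targets check the FRP path; for the PK01 local target check the PK01 Caddy managed block and the loopback upstream. A 401 from public `/v1/models` without a key only proves the gateway is reachable, not that the account pool is schedulable.
-- When multiple targets have public exposure enabled, verify each target's root, `/health`, keyless `/v1/models` 401, and its own `codex-pool validate --target <id>` separately; one working domain does not substitute for another domain's acceptance.
-- If the target declares `egressProxy.enabled=true`, confirm that the proxy Deployment/Service is ready, the Sub2API and sentinel env align with YAML, and the outbound proxy probe passes through the health URL declared in YAML.
-
-To prove that real model requests work, use a minimal `/v1/responses` or an equivalent Codex smoke. Do not interpret a group-level `/v1/models` success as every upstream account being healthy.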
-
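-## Troubleshooting
-
-- When `api.pikapython.com` returns 502/503, first determine the target and failure layer from YAML. For the PK01 host-Docker target, run `sub2api status --target PK01` and `sub2api validate --target PK01` first, then check the PK01 Caddy managed block, loopback app health, Docker container health, admin account availability, and a minimal public `/v1/responses` smoke, one at a time. For a k3s/FRP target, run the corresponding `sub2api status --target <id>` and `validate --target <id>` first; if `sub2api`, `sub2api-frpc`, `sub2api-redis`, or `sub2api-egress-proxy` shows `0/1`, or validate shows `no endpoints available for service "sub2api"` / the app Pod has terminated, first reconverge the YAML resources with `bun scripts/cli.ts platform-infra sub2api apply --target <id> --confirm`, poll using the returned `job status`, then run `status`, `validate`, and whatever Codex-pool verification is available. Do not start by changing the account pool, sentinel state, Secrets, or Caddy.
-- After a quick recovery, close out with layered evidence: the target's public `/health` should return 200; a minimal public `/v1/responses` marker should return 200 using the unified key or an explicit user key; output only the HTTP status, model count, marker, account id/group id, and key fingerprint, never the key. Do not run `configure-local --confirm` for public verification, because it rewrites the local `~/.codex`; a 401 for the key in the local default `auth.json` only shows that the local config and the public unified key disagree and is not evidence that the service is unavailable.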
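-- When the Codex pool sentinel, account freeze/recovery, marker-only decisions, or the probe cadence are unclear: the first step is to run `bun scripts/cli.ts platform-infra sub2api codex-pool sentinel-report`. This report is the primary observation surface; only when the report lacks a field or lower-level evidence is needed should you go on to `--raw`, the CronJob log, the state ConfigMap, or the Sub2API admin UI. If you see a "temporarily unschedulable state" that includes a rule number/matched keyword, check the Sub2API `account_temp_unschedulable` log and the account's `temp_unschedulable_*` fields; the sentinel only explains active quarantine with `schedulable=false`, not this kind of built-in temporary cooldown.
-- To strengthen monitoring only, without letting the sentinel freeze accounts automatically, set YAML `sentinel.actions.enabled=false` and then run `codex-pool sync --confirm`. The marker probe and gateway failure monitor then still record `would-freeze` / observe-only evidence but do not write `schedulable=false` through Sub2API admin; `codex.remote_compact.failed` on `/responses/compact` and compact upstream 5xx failover are recorded only as `gateway-compact-*` observation events and are not triggers for automatic sentinel switching.
-- A single request id reports 502/503/an interruption/no automatic account switch: the first step is to run `bun scripts/cli.ts platform-infra sub2api codex-pool trace --request-id <requestId>`. Look at `outcome`, `reason`, `FAILOVER`, `SELECT-FAILED`, `ACCOUNT SIGNALS`, and `WINDOW STATS` first; add `--show-lines` or `--raw` only when the trace report lacks a field or the raw logs need auditing. If `reason=failover-attempted-no-candidate`, the switch was attempted, but the scheduler had no available candidate after excluding the failed account; continue with `sentinel-report` and `validate --full` to distinguish sentinel quarantine, request-path temp-unschedulable, account status, and capacity exhaustion.
-- profile invalid: first fix `base_url`, `wire_api`, or `model` in `~/.codex/config.toml.<profile>`, or the API key in `auth.json.<profile>`; do not write secrets in YAML.
-- The WebUI account test for a manual OAuth/API-key account times out connecting to `chatgpt.com`, but an explicit HTTP proxy probe from the same Pod gets through: do not look only at the Pod's `HTTP_PROXY` env; follow the "Protected manual account proxy and group binding" section to confirm `manualAccounts.protected[].proxyBinding`, run `codex-pool sync --target D601 --confirm`, then retest with the original account test. If the retest no longer resets/times out but a specified model such as `gpt-5.2-pro` returns a capability error saying ChatGPT OAuth Codex does not support it, verify the proxy with a default/supported model or a unified-key smoke; do not treat the model error as the proxy still being broken.
-- The WebUI account test for a manual OAuth/API-key account passes, but the PC Codex client gets 503 on `/responses` through the unified key and the trace shows `account-select-failed` / `no available accounts`: follow the "Protected manual account proxy and group binding" section to confirm that the account is bound to the pool group used by the unified key. The WebUI group list and account details are not necessarily enough to prove the scheduler can schedule it; when necessary, check admin account availability and the `account_groups` join. For k3s targets, run `codex-pool sync --target <id> --confirm` and then retest the unified key with `codex-pool validate --target <id> --full`; for PK01 host-Docker, until the sync/validate adapter is filled in, recover with minimal admin API/DB evidence and accept with a public `/v1/responses` smoke.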
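-- Sub2API is stuck in `wait-postgres` / `wait-redis`, or the service shows many `context deadline exceeded` errors: first run `sub2api status` and check `networkPolicy.ok`, then run `sub2api validate` and check `postgresCrossPodPgIsReady` / `redisCrossPodPing`; if they are missing or abnormal, restore the controlled `NetworkPolicy/allow-all` with `sub2api apply --confirm`, and do not keep a manual iptables bypass as a long-term fix.
-- pool key 401: run `codex-pool sync --confirm` to rebuild the binding between the Sub2API key and the k3s Secret, then run `codex-pool validate`.
-- If a pool key, admin password, or k8s Secret `.data` is printed to stdout, logs, an issue, or a local transcript, treat it as a leak: revoke the corresponding Sub2API key or token, delete/recreate the affected target Secret, redistribute through `codex-pool sync --target <id> --confirm` or the corresponding YAML sourceRef, then use fingerprint, presence, and `valuesPrinted=false` as closeout evidence; do not repeat the old or new value.
-- Leftover validation probes from earlier runs: clean up `unidesk-probe-*` temporary resources only with `codex-pool cleanup-probes --confirm`; do not treat deleting a real managed account as probe cleanup or availability recovery.
-- FRP is unreachable: first look at `masterFrps`, `masterCaddy`, `sub2api-frpc`, and the public 401 probe in the `codex-pool expose --confirm` output; when lower-level evidence is needed, use only `trans G14:k3s` for bounded queries.
-- The public URL of a k3s external-active target is unreachable: first separate DNS/TLS/Caddy/FRP/Sub2API. While DNS does not resolve to the PK01 address declared in YAML, Caddy ACME fails and HTTPS cannot count as complete; the PK01 loopback FRP port and the PK01 public remotePort can prove the FRP data path, but in the end you still have to wait for DNS to take effect and rerun HTTPS health, `/v1/models`, and `/v1/responses`. The PK01 host-Docker local target does not use FRP, so an FRP port probe cannot replace local loopback/Caddy/app verification.
-- Other PK01 HTTPS services disappear after a D601 external-active apply: first suspect a failed merge of the shared Caddy managed blocks or a recurrence of the old whole-file write path. Gather evidence from the controlled Sub2API apply output and the PK01 Caddy managed block markers, then recover through each service's own YAML apply/public-exposure entry; do not hand-copy some Caddyfile as a long-term fix.
-- Caddy download is slow or fails: first confirm that `config/platform-infra/sub2api.yaml` sets `publicExposure.pk01.caddyDownloadProxyUrl` for the target, then rerun `sub2api apply --target <id> --confirm` and look for `downloadProxy.mode=curl-proxy` in the PK01 apply summary. Do not repeatedly connect to the GitHub release directly.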
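-- When `/responses/compact` returns 504 after a fixed duration close to the master Caddy `response_header_timeout`, or the Sub2API log records `codex.remote_compact.succeeded` slightly later, first check whether the master Caddy `response_header_timeout` is rendered from YAML `publicExposure.masterCaddy.responseHeaderTimeoutSeconds`, fix it, and run `codex-pool expose --confirm`; this kind of edge-proxy timeout does not trigger account-level temporary removal in Sub2API. Compact requests already in flight before the reload may still end with the old timeout, so judge whether the fix took effect only from requests started after the reload.
-- When `/responses/compact` or an ordinary public URL returns 502 within a window of a few seconds, the Caddy log shows `dial tcp 127.0.0.1:<remotePort>: connect: connection refused`, `EOF`, or `connection reset by peer`, and the frps log shows `platform-infra-sub2api proxy closing` / `listener is closed` / `new proxy ... success`, the failure is at the edge layer between master Caddy and the FRP remotePort, and Sub2API and the sentinel may not see it at all. First confirm that `publicExposure.masterCaddy.edgeRetry` is rendered from YAML and took effect through `codex-pool expose --confirm`; if it still happens frequently, go on to check the stability of the control connection from G14 `sub2api-frpc` to master `frps`. Do not misjudge this kind of edge 502 as an account-pool upstream error, and do not recover by disabling accounts.
-- default profile recursion: check whether the YAML default entry uses the `*.pre-sub2api` backup files; if necessary, restore the backup and rerun `configure-local --confirm`.
-- An upstream needs WebSocket v2: run the direct Codex WSv2 probe first; only after it passes, set `openaiResponsesWebSocketsV2Mode: ctx_pool|passthrough` on that profile and run `sync --confirm`; treat it as a capability candidate, with capacity still governed by `capacity` in YAML or the default.
-- Codex falls back from WebSocket at startup: reproduce with a Codex smoke through the original entry, then confirm the account with bounded Sub2API logs; for accounts with WS handshake 4xx/5xx, `openai.websocket_account_select_failed`, or close-before-`response.completed`, turn off the WSv2 capability in YAML and sync. If no WSv2-capable account remains, turn off `localCodex.supportsWebSockets` and `localCodex.responsesWebSocketsV2` together, and do not write an inference about temporary availability into the scheduling config.
-- An upstream requires a Codex User-Agent: set `upstreamUserAgent` on that profile only, then run `sync --confirm`.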
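-- An upstream reports capacity/rate-limit/overload/Bad Gateway/Gateway Timeout and is then not quarantined, or keeps failing first and then recovering: first look at the marker, action, freeze TTL, and next probe in `codex-pool sentinel-report`, and at the recent gateway failover/forward failure evidence in `codex-pool validate --full`; also compare against the actual propagation path through the `/v1/responses` handler, `Forward`, `shouldFailoverOpenAIUpstreamResponse`, and `handleOpenAIAccountUpstreamError` in the current Sub2API source. Do not manually disable accounts, delete accounts, change membership/priority/capacity/loadFactor, or remove the problem account from YAML as a substitute for generic failover and sentinel quarantine/recovery.
-- `codex-pool sync --confirm` or `codex-pool validate` times out: first distinguish a CLI transport timeout from a Sub2API runtime failure. The controlled CLI should return remote job progress and a stdout/stderr tail; if it is only a low-level `trans` 60s timeout, you cannot conclude from it that Sub2API failover is not working. Switch to, or fix, the CLI's remote job/poll path, rerun, and use the final structured result as evidence.
-- Codex reports weekly-limit, `less than 10% of your weekly limit left`, `Run /status for a breakdown`, or similar account-status/soft-quota hints and asks for an account switch: do not restore availability by writing new keywords into the Sub2API built-in temporary-unschedulable policy; let the marker-only sentinel freeze uniformly on non-marker responses, and verify with `sentinel-report` / `sentinel-probe`.
-- An upstream 400/503 response body contains `invalid_encrypted_content`, `bad_response_status_code`, `invalid_request_error` plus stable unsupported-model wording, unsupported-model, `暂不支持` / `可用模型`, `model_not_found`, `No available channel for model ...`, or a similar stable model-routing / Responses encrypted-content compatibility failure: handle it through generic temp-unschedulable/failover plus sentinel marker evidence, and do not mask this error family with account membership, priority, capacity, loadFactor, WebSocket mode, User-Agent, or provider pinning.
-- Upstream errors keep triggering: for stable wrapper messages such as `invalid_encrypted_content`, unsupported-model, `Recovered upstream error ...`, `Bad Gateway`, `Gateway Timeout`, Cloudflare `524`, Codex-facing `Upstream request failed`, `Unknown error`, `context deadline exceeded`, `context canceled`, `model_not_found`, `No available channel for model`, large-context `413`, and `openai_error`, first confirm that the YAML temp-unschedulable config is synced, that the Sub2API source propagates the error family as `UpstreamFailoverError`, and that the runtime log shows `openai.upstream_failover_switching`. If only `openai.forward_failed` appears after a rule matches, the root cause is that Sub2API HTTP `/responses` does not propagate the error as `UpstreamFailoverError`; fix the Sub2API failover classifier/error propagation rather than hard-coding accounts or giving `only` special privileges.
-- Context is lost after Codex auto compact: first confirm whether YAML `localCodex` declares WSv2 enabled; if it does, confirm that the local `~/.codex/config.toml` has `supports_websockets = true` and `responses_websockets_v2 = true`, and look at the WSv2 candidate in `codex-pool validate` and `transport=responses_websockets_v2` in the Sub2API log. If YAML currently disables WSv2, troubleshoot HTTP Responses stability and do not treat the old WS criteria as an acceptance requirement.
-- Codex smoke shows reconnect/1013: this is an upstream concurrency/availability problem and is handled separately from HTTP-only compact context loss; record session/log evidence and link it to a dedicated issue, and do not override YAML capacity with runtime hand patches.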
-
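-## Prohibitions
-
-- Do not use native `kubectl apply/delete/patch` as the formal operation entry.
-- Do not deploy or run Sub2API/PostgreSQL/Redis on the master server.
-- Do not add Ingress, NodePort, LoadBalancer, hostPort, hostNetwork, or wide FRP port ranges.
-- Do not overwrite the shared PK01 Caddyfile wholesale with Sub2API's YAML-rendered output; update only Sub2API's own block through managed block merge.
-- Do not add CPU/memory limits to the Sub2API manifest unless there is a new explicit decision expressed in YAML.
-- Do not print complete API keys, admin passwords, or Secret plaintext.
-- Do not turn ordinary upstream additions or removals into code changes, CI/CD, feature flags, or dual compatibility paths.
-- Do not treat manually disabling accounts, deleting accounts, removing YAML entries, lowering membership, temporarily changing priority/capacity/loadFactor, provider pinning, or privileging a particular account as a fix for generic failover / sentinel quarantine and recovery problems.
-- Do not hack Sub2API: capabilities that Sub2API itself does not support are not implemented, and UniDesk scripts, in-place k8s hot patches, local forks, fake YAML declarations, or hidden fallbacks must not stand in for an upstream implementation.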
diff --git a/.agents/skills/unidesk-sub2api/references/guardrails.md b/.agents/skills/unidesk-sub2api/references/guardrails.md
new file mode 100644
index 00000000..a3a620f7
--- /dev/null
+++ b/.agents/skills/unidesk-sub2api/references/guardrails.md
@@ -0,0 +1,11 @@
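+## Prohibitions
+
+- Do not use native `kubectl apply/delete/patch` as the formal operation entry.
+- Do not deploy or run Sub2API/PostgreSQL/Redis on the master server.
+- Do not add Ingress, NodePort, LoadBalancer, hostPort, hostNetwork, or wide FRP port ranges.
+- Do not overwrite the shared PK01 Caddyfile wholesale with Sub2API's YAML-rendered output; update only Sub2API's own block through managed block merge.
+- Do not add CPU/memory limits to the Sub2API manifest unless there is a new explicit decision expressed in YAML.
+- Do not print complete API keys, admin passwords, or Secret plaintext.
+- Do not turn ordinary upstream additions or removals into code changes, CI/CD, feature flags, or dual compatibility paths.
+- Do not treat manually disabling accounts, deleting accounts, removing YAML entries, lowering membership, temporarily changing priority/capacity/loadFactor, provider pinning, or privileging a particular account as a fix for generic failover / sentinel quarantine and recovery problems.
+- Do not hack Sub2API: capabilities that Sub2API itself does not support are not implemented, and UniDesk scripts, in-place k8s hot patches, local forks, fake YAML declarations, or hidden fallbacks must not stand in for an upstream implementation.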
diff --git a/.agents/skills/unidesk-sub2api/references/local-codex-consumer.md b/.agents/skills/unidesk-sub2api/references/local-codex-consumer.md
new file mode 100644
index 00000000..30975dc1
--- /dev/null
+++ b/.agents/skills/unidesk-sub2api/references/local-codex-consumer.md
@@ -0,0 +1,17 @@
+## Configuring the master Codex consumer
+
+```bash
+bun scripts/cli.ts platform-infra sub2api codex-pool configure-local
+bun scripts/cli.ts platform-infra sub2api codex-pool configure-local --confirm
+```
+
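+`configure-local --confirm` will:
+
+- Read the unified API key from `platform-infra/<apiKeySecretName>.<apiKeySecretKey>`.
+- Back up the current `~/.codex/config.toml` and `~/.codex/auth.json` as `.<backupSuffix>`, default `.pre-sub2api`.
+- Rewrite the default `~/.codex` consumer so it always points to `https://sub2api.74-48-78-17.nip.io/`; the provider name and wire API come from `localCodex`.
+- Write `model_context_window` / `model_auto_compact_token_limit` from `localCodex.modelContextWindow` / `localCodex.modelAutoCompactTokenLimit`, which centrally controls the Codex auto compact trigger window and keeps GPT-5.5 consumers from generating oversized long `/responses/compact` requests.
+- Write the Codex transport flags from `localCodex`: `supports_websockets` and `[features] responses_websockets_v2` must be enabled or disabled together. Enable them only when at least one upstream passes the direct Codex WSv2 probe; otherwise stay on HTTP Responses, so that the original entry does not go through a futile WS reconnect every time.
+- Run one gateway verification with the unified key.
+
+Anti-recursion rule: the default `config.toml` / `auth.json` are the Sub2API consumer and must not be imported back into the pool as an upstream account; upstream accounts must use suffixed profile files and be added or removed through `profiles.entries` in `config/platform-infra/sub2api-codex-pool.yaml`.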
diff --git a/.agents/skills/unidesk-sub2api/references/manual-accounts.md b/.agents/skills/unidesk-sub2api/references/manual-accounts.md
new file mode 100644
index 00000000..4c65a3e6
--- /dev/null
+++ b/.agents/skills/unidesk-sub2api/references/manual-accounts.md
@@ -0,0 +1,17 @@
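+## Protected manual account proxy and group binding
+
+The account connection test in the Sub2API admin UI configures the upstream HTTP transport from the account-level `ProxyID` / proxy URL; an account with no bound proxy goes out directly, even if the Sub2API Pod already has `HTTP_PROXY` / `HTTPS_PROXY` environment variables. When the WebUI account test times out connecting to `chatgpt.com` but an explicit request through the target proxy from inside the Pod gets through, first check whether the account belongs to `manualAccounts.protected` and declares `proxyBinding`. If the same account with `gpt-5.2-pro` returns a model capability error saying ChatGPT OAuth does not support it with Codex, but a default/supported model completes `hi` or a `/v1/responses` smoke, this is not a proxy failure; handle it separately as a model mapping/account capability matter.
+
+If the WebUI account connection test shows `proxyconnect tcp: dial tcp 127.0.0.1:<port>: connect: connection refused`, first confirm that the proxy URL is an account-level loopback configuration: inside a k3s target, `127.0.0.1` is the Sub2API Pod itself, not the node or PK01. Do not start by changing account credentials, PK01 Caddy, `api.pikapython.com`, or the unified key; declare `targets[].accountLocalProxy` in the target's `config/platform-infra/sub2api.yaml`, let `platform-infra sub2api apply --target <id>` render the same-Pod sidecar and Secret, then verify `http://127.0.0.1:<port>` with the `accountLocalProxy` probe of `validate --target <id>`. Output may still disclose only sourceRef, fingerprint, secretName, and proxyUrl, never the proxy password or the generated config.
+
+The WebUI account connection test also bypasses the pool group selector of the unified consumer API key; a passing account test does not mean the PC Codex client can select that account. When the WebUI account test passes but `/responses` or `/v1/responses` returns 503 with `account-select-failed` / `no available accounts`, first check whether the manual account declares `groupBinding.source: pool-group`, and confirm that the Sub2API `account_groups` join contains a binding between the account and the `group_id` of the current unified API key. For supported k3s targets, join the current `pool.groupName` through `sync --confirm`; for the PK01 host-Docker target, until the host-Docker codex-pool sync adapter is filled in, the only option is a minimal admin API write of `group_ids` as a runtime recovery, and it must output only account id, group id, presence/fingerprint, and smoke status, never secrets.
+
+Credentials, status, and similar fields of protected manual accounts are still maintained by hand in the Sub2API UI; UniDesk may only make narrow proxy and group bindings through YAML: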
+
+```bash
+bun scripts/cli.ts platform-infra sub2api codex-pool plan --target D601
+bun scripts/cli.ts platform-infra sub2api codex-pool sync --target D601 --confirm
+bun scripts/cli.ts platform-infra sub2api codex-pool validate --target D601
+```
+
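+The `sync` output should show `manualAccounts.ok=true`, `proxySync.ok=true`, and `groupSync.ok=true`, with `bindingAligned=true` for the account's proxy/group. `sentinel-probe --account <manual-account> --confirm` must keep refusing protected manual accounts, usually returning `account-protected-manual`; do not move the account into `profiles.entries` or remove its protection just to test. To prove that the same account test as in the WebUI has recovered, use the original Sub2API admin account test entry to test a minimal `hi` with a default/supported model, and record only account id, proxy id, event types, HTTP status, and a short output preview, never OAuth tokens or Secret plaintext. If a specified model returns a capability error such as "model is not supported when using Codex with a ChatGPT account", attribute it first to model capability/mapping, not the proxy.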
diff --git a/.agents/skills/unidesk-sub2api/references/operations.md b/.agents/skills/unidesk-sub2api/references/operations.md
new file mode 100644
index 00000000..23f8d2cd
--- /dev/null
+++ b/.agents/skills/unidesk-sub2api/references/operations.md
@@ -0,0 +1,92 @@
+# UniDesk Sub2API
+
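+UniDesk operates the YAML-selected Sub2API target through `platform-infra sub2api`. The current active target is whatever `config/platform-infra/sub2api.yaml` declares; PK01 can act as the host-Docker active target and serve `api.pikapython.com` through the PK01 Caddy local reverse proxy, k3s targets such as D518/D601 can still be declared external-active or retired in YAML, and G14 is controlled by the same YAML/CLI as a standby predeploy. Day-to-day operations always go through the UniDesk CLI; do not write Kubernetes resources directly or call the Sub2API admin API by hand.
+
+**Fixed entry point**: `cd /root/unidesk && bun scripts/cli.ts platform-infra sub2api ...`
+
+## Read the reports first
+
+When checking Codex pool sentinel state, account freeze/recovery, marker hits, the next probe, recent CronJob runs, or the token/cost ledger, use this low-noise report entry first, rather than digging through the ConfigMap, CronJob logs, or the Sub2API UI: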
+
+```bash
+bun scripts/cli.ts platform-infra sub2api codex-pool sentinel-report
+```
+
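+Add `--raw` when machine processing or full fields are needed; add `--events N` when more recent run records are needed.
+
+When tracing the interruption, upstream account, account switch, temporary unschedulable state, account selection failure, and same-window account-pool signals of a Codex/Sub2API request id, use the low-noise trace report first, rather than hand-writing `kubectl logs | grep`: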
+
+```bash
+bun scripts/cli.ts platform-infra sub2api codex-pool trace --request-id <requestId>
+```
+
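+The default output is a short table similar to k8s/ps; for machine processing use `--raw` and read `.data.trace.*`; add `--show-lines` to audit the raw matched log lines; use `--since 24h --tail 50000` to widen the search. The command is read-only: it reads Sub2API logs, account snapshots, and admin API metadata, and does not change `schedulable`, clear runtime backoff, or interrupt requests.
+
+## Read the boundaries first
+
+- The repository's long-term development boundaries are in `docs/reference/platform-infra.md`; this skill serves as the day-to-day operations manual.
+- YAML is the configuration source of truth: `config/platform-infra/sub2api.yaml` and `config/platform-infra/sub2api-codex-pool.yaml`.
+- Business policy and concrete values are governed by YAML alone. To adjust the value of an existing field, change only YAML and run `plan` / `sync --confirm` / `validate`; do not automatically add hard-coded values in code, hard ranges in schemas, contract tests, unit tests, or long-term reference docs. Config validation checks only format, types, required fields, and renderability, not whether a value policy is "reasonable".
+- If `agents/*.yaml` exists in this skill directory, it is only display and invocation metadata for the skill/agent, not Sub2API or Codex pool runtime config; do not maintain a second copy of account, capacity, priority, endpoint, or Secret config in the skill directory.
+- Runtime targets are declared in `config/platform-infra/sub2api.yaml`; the default target comes from YAML `defaults.targetId`, and `api.pikapython.com` currently uses the PK01 host-Docker target. k3s targets such as `D518:k3s` and `D601:k3s` must be selected with an explicit `--target` and judged active/retired/standby by their YAML role; `G14:k3s` is a standby target. The master server is only the control end and a consumer and does not deploy Sub2API/PostgreSQL/Redis.
+- A standby target does not deploy local PostgreSQL and does not run the sentinel, the FRP admin entry, or the HTTPS egress proxy; it may only predeploy the namespace, NetworkPolicy, Service, and Sub2API/Redis Deployments with replicas=0. After activation Redis is still allowed only as an ephemeral cache. An external-active target likewise deploys no local PostgreSQL, must connect directly to the external DB declared in YAML, uses local ephemeral Redis, and runs frpc, the egress proxy, and the target-level sentinel only when YAML enables them; multiple external-active targets can coexist, and one target's public exposure must not be treated as a substitute or fallback for another.
+- Secrets, `~/.codex/config.toml*`, and `~/.codex/auth.json*` are runtime inputs or local state and are not committed.
+- The default `~/.codex/config.toml` and `~/.codex/auth.json` serve only as the unified Sub2API consumer; `config.toml` must point to the consumer URL of the YAML-selected active target, and `auth.json` must use the unified pool API key. Adding an upstream account must not overwrite these two default files; add `config.toml.<profile>` / `auth.json.<profile>` and declare them in YAML instead.
+- Output may contain only the Secret path, key name, presence, fingerprint, and `valuesPrinted=false`; never print a complete API key, admin password, JWT secret, TOTP key, base64 payload, or a copyable preview.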
+
+## Deployment and status
+
+```bash
+bun scripts/cli.ts platform-infra sub2api plan
+bun scripts/cli.ts platform-infra sub2api plan --target G14
+bun scripts/cli.ts platform-infra sub2api apply --dry-run
+bun scripts/cli.ts platform-infra sub2api apply --target G14 --dry-run
+bun scripts/cli.ts platform-infra sub2api apply --confirm
+bun scripts/cli.ts platform-infra sub2api apply --target G14 --confirm
+bun scripts/cli.ts platform-infra sub2api status
+bun scripts/cli.ts platform-infra sub2api status --target G14
+bun scripts/cli.ts platform-infra sub2api validate
+bun scripts/cli.ts platform-infra sub2api validate --target G14
+```
+
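+- `plan` reads `config/platform-infra/sub2api.yaml`, renders `src/components/platform-infra/sub2api/sub2api.k8s.yaml`, checks for no Ingress/NodePort/LoadBalancer/hostPort/hostNetwork/resource limits, and requires `NetworkPolicy/allow-all` to be created under control together with the manifest.
+- `apply --confirm` creates an asynchronous job by default; poll with the returned `job status` command, then run `status` and `validate`.
+- Use `status --full|--raw` only when you need to expand remote stdout/stderr or raw JSON.
+- `validate` is an on-demand acceptance check, not a continuous availability probe. For a standby target, `validate --target <id>` verifies the predeployed shape and does not require the external DB to be reachable at the moment; for an external-active target it must verify the external DB, ephemeral Redis, the Sub2API service, the YAML egress proxy, and the target-level public exposure.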
+
+## PK01 host-Docker target
+
+The PK01 host-Docker target is controlled by the target's `runtimeMode: host-docker` in `config/platform-infra/sub2api.yaml`. The current path for `api.pikapython.com` is `client -> PK01 Caddy -> 127.0.0.1:<YAML local upstream port> -> PK01 host-Docker Sub2API`, not the D601 FRP path. Use the following controlled entries first to diagnose layer by layer:
+
+```bash
+bun scripts/cli.ts platform-infra sub2api status --target PK01
+bun scripts/cli.ts platform-infra sub2api validate --target PK01
+```
+
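+PK01 has no k3s control plane. Parts of the current `codex-pool sync`, `codex-pool validate`, `sentinel-report`, and `trace` implementations still depend on k8s/kubectl remote scripts; when you see `kubectl` missing on the PK01 host-Docker target, classify it as a gap in the CLI host-Docker adapter, not as a failure of the Sub2API app, Caddy, the upstream, or the account pool. The proper fix is to add host-Docker versions of codex-pool sync/validate/report/trace; interim troubleshooting is limited to the read-only admin API, DB join tables, and a minimal public `/v1/responses` smoke, and must not print the admin password, API key, or account credentials.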
+
+## D601 Egress Proxy
+
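+The D601 target-level `egressProxy` is controlled entirely by `config/platform-infra/sub2api.yaml`. The current mature form has master Docker `shadowsocks-rust` as the encrypted outbound source and `sing-box` inside D601 k3s exposing an HTTP/mixed ClusterIP proxy to Sub2API and to the sentinel when YAML enables it. Do not write the endpoint, port, password, health probe, or image tag into the skill; YAML and `config/platform-infra/sub2api-master-egress-proxy.compose.yaml` are the only source.
+
+The master-side proxy is managed by a compose file inside the UniDesk checkout: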
+
+```bash
+docker compose -f config/platform-infra/sub2api-master-egress-proxy.compose.yaml up -d --force-recreate
+bun scripts/cli.ts platform-infra sub2api apply --target D601 --confirm
+bun scripts/cli.ts platform-infra sub2api validate --target D601
+bun scripts/cli.ts platform-infra sub2api codex-pool sync --target D601 --confirm
+bun scripts/cli.ts platform-infra sub2api codex-pool validate --target D601
+```
+
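+Proxy secret/config files may live only under controlled Secret/state paths; output may disclose only the path, presence, fingerprint, or a digest, never the password, full subscription, or generated config. If the TLS/SNI path from D601 to the upstream gets reset, do not use temporary JS or a crude HTTP CONNECT proxy as the final solution; replace or repair the mature encrypted proxy source through YAML/compose, then rerun the apply/validate/sync/validate loop above.
+
+## Image upgrade
+
+1. Edit `image.repository`, `image.tag`, or `pullPolicy` in `config/platform-infra/sub2api.yaml`.
+2. Run `sub2api plan` and confirm the policy checks pass.
+3. Run `sub2api apply --confirm` and poll until the job completes.
+4. Run `sub2api status` and confirm the running image equals the one declared in YAML.
+5. Run `sub2api validate` or `codex-pool validate` for entry acceptance.
+
+Do not write the image version into script constants, JSON, or manifest templates.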
diff --git a/.agents/skills/unidesk-sub2api/references/public-exposure.md b/.agents/skills/unidesk-sub2api/references/public-exposure.md
new file mode 100644
index 00000000..1cd7ddd1
--- /dev/null
+++ b/.agents/skills/unidesk-sub2api/references/public-exposure.md
@@ -0,0 +1,16 @@
+## FRP exposure
+
+```bash
+bun scripts/cli.ts platform-infra sub2api codex-pool expose
+bun scripts/cli.ts platform-infra sub2api codex-pool expose --confirm
+```
+
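+- Controlled by YAML `publicExposure`. The default public endpoint of the Codex pool is the target's `publicBaseUrl`; a host-Docker target can use `mode: pk01-local` so that PK01 Caddy reverse-proxies the local loopback app directly, and a k3s external-active target can use an FRP remotePort. Do not infer one target's exposure mode as the default for other targets.
+- `expose --confirm` only adds the master `frps` allow port for the YAML-specified `remotePort` and creates/updates `sub2api-frpc` on G14.
+- The master Caddy site is also rendered from `publicExposure.masterCaddy`; `responseHeaderTimeoutSeconds` must be long enough to cover long Codex `/responses/compact` requests, so that Caddy does not return 504 first while Sub2API actually succeeds in the background slightly later. Change the value only in `config/platform-infra/sub2api-codex-pool.yaml`, run `codex-pool expose --confirm` afterwards, then check the rendered `response_header_timeout` in the Caddyfile.
+- The master Caddy short-window edge retry is rendered from `publicExposure.masterCaddy.edgeRetry`; it absorbs 502s where the request has not yet reliably reached Sub2API, such as a briefly closed FRP remotePort, `connect: connection refused`, EOF, or connection reset. Write the retry duration, interval, and `retryMatch` scope only in YAML, run `codex-pool expose --confirm` afterwards, then check the rendered `lb_try_duration`, `lb_try_interval`, and `lb_retry_match` in the Caddyfile. Do not hand-patch `/etc/caddy/Caddyfile`.
+- PK01 `/etc/caddy/Caddyfile` is an edge artifact shared by multiple YAML sources, including Sub2API, LangBot, n8n, and HWLAB. Sub2API apply/expose may only update its own managed block and must preserve the other blocks; when the same Sub2API service exposes multiple targets, D601 keeps the existing `# BEGIN unidesk managed sub2api`, and non-default targets must use a target-scoped owner (for example `sub2api-d518`) so that `api.pikapython.com` and `api2.pikapython.com` do not overwrite each other. If the apply output shows an abnormal managed block count, stop the closeout first and check the PK01 Caddy merge and validation results; do not overwrite the whole file by hand.
+- Round-trip retry for non-idempotent POSTs must be narrowed to the safe paths declared in YAML `retryMatch`; upstream account errors on ordinary `/responses` still belong to Sub2API failover / temp-unschedulable / sentinel handling, and Caddy must not replay a whole inference request to mask account-pool problems.
+- The same public entry exposes both the OpenAI-compatible API and the Sub2API admin UI `/login`. FRP targets use the same FRP TCP entry; the PK01 local target uses the managed block from PK01 Caddy to the local app. Do not open a second admin port unless YAML explicitly declares a new exposure decision.
+- The Sub2API Kubernetes Service of a k3s target stays ClusterIP.
+- The public path of a k3s external-active target is `client -> PK01 Caddy -> PK01 frps remotePort -> target frpc -> Sub2API`; the public path of the PK01 host-Docker target is `client -> PK01 Caddy -> 127.0.0.1:<local upstream port> -> Sub2API`. Neither goes through pikanode or a master server reverse proxy. PK01 Caddy downloads must use the proxy specified by YAML `publicExposure.pk01.caddyDownloadProxyUrl`; if the Caddy download is slow, first confirm that the apply output shows `downloadProxy.mode=curl-proxy`. The target domain must resolve to the PK01 public address declared in YAML before HTTPS can serve as final verification; `api.pikapython.com` and `api2.pikapython.com` should each be accepted against their own YAML target.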
diff --git a/.agents/skills/unidesk-sub2api/references/sentinel.md b/.agents/skills/unidesk-sub2api/references/sentinel.md
new file mode 100644
index 00000000..817426ec
--- /dev/null
+++ b/.agents/skills/unidesk-sub2api/references/sentinel.md
@@ -0,0 +1,22 @@
+## Sentinel
+
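+- `sentinel.monitor.enabled`: switch for account-level marker sentinel monitoring; when on, `codex-pool sync --confirm` creates/updates the k8s CronJob, ConfigMap, Secret, ServiceAccount, Role, and RoleBinding in `platform-infra`. The CronJob calls OpenAI Responses `gpt-5.5` directly on the YAML-managed upstream accounts, uses a deterministic marker as the sole health criterion, and records the token/cost ledger in a separate state ConfigMap.
+- `sentinel.actions.enabled`: switch for account-level sentinel freeze/recovery actions; the current marker-only guard requires it to be on. With actions off, only `would-freeze` is recorded and the Sub2API admin API is not called to change `schedulable`. With actions on, anything that does not satisfy the marker match enters the same freeze/recovery state machine, whether it is an HTTP 200 carrying extraneous content, a 4xx/5xx, non-JSON, a connection error, or empty output.
+- `sentinel.sdk.openaiPythonVersion`: the pinned OpenAI Python SDK version used by the sentinel container; model requests must go through the standard SDK `responses.create`, so do not hand-build the `/v1/responses` request body or hand-write response parsing. Later SDK upgrades change only YAML followed by `sync --confirm`.
+- `sentinel.probe.maxOutputTokens`: cap on the sentinel's local streaming delta collection, which must stay small; it is not sent as the upstream `max_output_tokens` field, so that the request keeps the same Responses SSE shape as the default account connection test in the Sub2API WebUI. The sentinel does not limit concurrency or the number of accounts per round; all due accounts are probed concurrently in the same round.
+- `sentinel.probe.userAgent`: default User-Agent for the sentinel's direct upstream probe, passed through the OpenAI SDK `extra_headers`; the default stays close to the shape of the Sub2API `net/http` account connection test, and individual accounts can still override it with `profiles.entries[].upstreamUserAgent`.
+- `sentinel.cadence`: exponential backoff config for success trust. The current rule starts at 1 minute and, after consecutive successes, picks the trusted/untrusted maximum backoff by the account's `trustUpstream`; any non-marker match resets success trust to zero and enters freeze backoff. The trusted/untrusted maximum backoff values are written only in YAML.
+- `sentinel.freeze`: exponential backoff config for the failure freeze TTL. The current rule starts at 1 minute and goes `1m -> 2m -> 4m -> 8m -> 10m` on failures, capped at 10 minutes; a failed probe consumes almost no effective output tokens, so the freeze window stays short. After a freeze expires only a recovery probe is run, and the account recovers automatically only if it passes; TTL expiry alone does not unfreeze it.
+- `sentinel.pricing`: the sentinel's own token/cost price estimate for direct upstream calls. Because the direct upstream probe bypasses the ordinary Sub2API usage ledger, the sentinel must record global and per-account token/cost itself; these ledgers are for observation only and are not a budget gate for skipping probes.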
+
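+For supported k3s targets, `sync --confirm` logs in to Sub2API admin, creates/updates the group, creates/updates the `unidesk-codex-*` accounts in YAML, creates/reuses the unified API key Secret, and deploys/updates the sentinel resources; it does not restore existing managed accounts directly to `schedulable=true`. Recovery is triggered only by the sentinel running a recovery probe after it reads Sub2API runtime `schedulable=false`, and is carried out only on a marker hit. By default `sync` does not delete managed accounts absent from YAML. Use `sync --confirm --prune-removed` to delete absent `unidesk-codex-*` accounts with `extra.unidesk_managed=true` only when an upstream is explicitly retired. For `manualAccounts.protected`, `sync` performs only the narrow sync that YAML explicitly allows; currently that means creating/updating the Sub2API internal proxy record from the target `egressProxy` and binding `proxy_id`, and adding the protected manual account to the current `pool.groupName`. It still does not take over the account's credentials, status, schedulable, priority/capacity/loadFactor, or sentinel state. The PK01 host-Docker target lacks this full sync path until the codex-pool adapter is filled in.
+
+`sentinel-image status|build` manages the image of the sentinel's Python runtime environment. The image is derived from the YAML `sentinel.image` base image and `sentinel.sdk.openaiPythonVersion` and published to the target runtime's local registry; `build --confirm` first checks the registry tag, reuses it quickly if present, and only otherwise builds on the target host and pushes. At startup the CronJob only verifies the SDK version and does not run `pip install` at runtime. Whether a target enables the sentinel is governed by `sentinel.enabledOnTargets` in `config/platform-infra/sub2api.yaml`; a target without it should show `skipped-target-disabled` in `sync`/`validate` and must not be required to have the image build, CronJob, Secret, or state ConfigMap.
+
+`sync --confirm` also renders the account-level sentinel resources from YAML and, when monitor is on, first ensures that a reusable sentinel image exists. The current goal is marker-only automatic freeze/recovery with `sentinel.monitor.enabled=true` + `sentinel.actions.enabled=true`; do not hand-patch the CronJob, Secret, or Sub2API accounts. If YAML adds an account or changes the profile/base URL/API key fingerprint/upstream User-Agent/Responses WebSocket mode, sync writes a pending probe record from the pre-change runtime state and schedules a sentinel probe immediately, but does not restore existing accounts directly to schedulable; `schedulable=true` is restored only after the sentinel reads Sub2API runtime `schedulable=false`, runs a recovery probe, and the marker hits. Sentinel freeze/recovery changes only `schedulable=false|true` and must not also call Sub2API `recover-state` to clear request-path temporary unschedulable state or other runtime backoff. Existing success/failure backoff of unrelated accounts must not be reset. If YAML lowers the maximum failure freeze window, sync migrates still-active old freeze state into the current maximum window and schedules a recovery probe immediately, but does not unfreeze directly. If you suspect an account was misjudged, first trigger an immediate measurement of it with `codex-pool sentinel-probe --account <accountName> --confirm`; the command derives a one-off Job from the existing CronJob template and reuses the same Secret, ConfigMap, OpenAI SDK probe, token/cost ledger, and freeze/recovery state machine.
+
+`trace --request-id <requestId>` is a read-only request tracing report; it triggers no probe and modifies no account. By default it outputs the request start/final status, failover, `account_select_failed`, in-window `account_temp_unschedulable`, the admin schedulable write count, and the current account snapshot; `reason=failover-attempted-no-candidate` means Sub2API did enter automatic account switching, but no candidate was available after excluding the currently failed account. Use `--raw` for machine processing and add `--show-lines` for the raw matched lines.
+
+`sentinel-report` is a read-only low-noise report; it triggers no probe and modifies no account. By default it outputs a `ps`-like text table showing each account's probe count, Sub2API runtime `schedulable`, latest marker/HTTP/action, freeze TTL, success backoff, next probe, and recent run events; `SCH` shows the Sub2API runtime schedulable, `PROT` shows the account-level protection threshold, and `P_FAIL` shows the failure count/threshold in the latest protection confirmation; use `sentinel-report --raw` for machine processing.
+
+For supported k3s targets, `sync --confirm` and `validate` may exceed a single short SSH/runtime connection window. Keep using `bun scripts/cli.ts platform-infra sub2api codex-pool ... --target <k3s-target>`, so that the CLI submits the job on the target remote and short-polls its status; do not switch to a bare `trans <target>:k3s sh` that waits on one long connection for the full result. If you see `UNIDESK_SSH_RUNTIME_TIMEOUT`, first treat it as a control-plane visibility problem per the rules in `docs/reference/platform-infra.md` and fix the CLI/job/poll or rerun the controlled command; do not hand-patch Sub2API credentials or source code.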
diff --git a/.agents/skills/unidesk-sub2api/references/troubleshooting-accounts.md b/.agents/skills/unidesk-sub2api/references/troubleshooting-accounts.md
new file mode 100644
index 00000000..2338459f
--- /dev/null
+++ b/.agents/skills/unidesk-sub2api/references/troubleshooting-accounts.md
@@ -0,0 +1,21 @@
+## Account Pool Troubleshooting
+
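+- When the Codex pool sentinel, account freeze/recovery, marker-only decisions, or the probe cycle are unclear: first run `bun scripts/cli.ts platform-infra sub2api codex-pool sentinel-report`. This report is the primary observation surface; go on to `--raw`, the CronJob log, the state ConfigMap, or the Sub2API admin UI only when the report lacks fields or lower-level evidence is needed. If you see a "temporarily unschedulable status" (“临时不可调度状态”) that includes a rule index or matched keyword, check the Sub2API `account_temp_unschedulable` log and the account's `temp_unschedulable_*` fields; the sentinel only accounts for active quarantine with `schedulable=false`, not this built-in temporary cooldown.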
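+- To strengthen monitoring only, without letting the sentinel freeze accounts automatically, set YAML `sentinel.actions.enabled=false` and then run `codex-pool sync --confirm`. In this mode the marker probe and gateway failure monitor still record `would-freeze` / observe-only evidence, but they do not write `schedulable=false` through Sub2API admin; `codex.remote_compact.failed` on `/responses/compact` and compact upstream 5xx failover are recorded only as `gateway-compact-*` observation events, not as triggers for sentinel automatic switching.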
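+- A single request id reports 502/503, an interruption, or no automatic account switch: first run `bun scripts/cli.ts platform-infra sub2api codex-pool trace --request-id <requestId>`. Look at `outcome`, `reason`, `FAILOVER`, `SELECT-FAILED`, `ACCOUNT SIGNALS`, and `WINDOW STATS` first; add `--show-lines` or `--raw` only when the trace report lacks fields or the raw logs need auditing. If `reason=failover-attempted-no-candidate`, the switch was attempted, but the scheduler had no available candidate after excluding the failed account; continue with `sentinel-report` and `validate --full` to distinguish sentinel quarantine, request-path temp-unschedulable, account status, or capacity exhaustion.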
+- Profile invalid: first fix `base_url`, `wire_api`, or `model` in `~/.codex/config.toml.<profile>`, or the API key in `auth.json.<profile>`; do not write secrets in YAML.
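+- A manual OAuth/API-key account's WebUI account test times out connecting to `chatgpt.com`, but an explicit HTTP proxy probe from the same Pod succeeds: do not look only at the Pod's `HTTP_PROXY` env. Confirm `manualAccounts.protected[].proxyBinding` per the "Protected manual account proxy and group binding" (“受保护手动账号代理与分组绑定”) subsection, run `codex-pool sync --target D601 --confirm`, then retest with the original account test. If the retest no longer resets or times out, but a specified model such as `gpt-5.2-pro` returns an error saying the capability is not supported by ChatGPT OAuth Codex, verify the proxy with a default/supported model or a unified-key smoke; do not treat the model error as proof that the proxy is still broken.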
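+- A manual OAuth/API-key account passes the WebUI account test, but the PC Codex client gets 503 from `/responses` through the unified key and the trace shows `account-select-failed` / `no available accounts`: confirm per the "Protected manual account proxy and group binding" (“受保护手动账号代理与分组绑定”) subsection that the account is bound to the pool group used by the unified key. The WebUI group list and account details are not necessarily sufficient to prove that the scheduler can schedule the account; when necessary, check admin account availability and the `account_groups` join. For k3s targets, run `codex-pool sync --target <id> --confirm` and then retest the unified key with `codex-pool validate --target <id> --full`; for PK01 host-Docker, until the sync/validate adapter is completed, recover using minimal admin API/DB evidence and accept with a public `/v1/responses` smoke.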
+- Pool key returns 401: run `codex-pool sync --confirm` to rebuild the binding between the Sub2API key and the k3s Secret, then run `codex-pool validate`.
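+- If a pool key, admin password, or k8s Secret `.data` is printed to stdout, logs, an issue, or a local transcript, treat it as a leak: revoke the corresponding Sub2API key or token, delete/recreate the affected target Secret, redistribute it through `codex-pool sync --target <id> --confirm` or the corresponding YAML sourceRef, and then use fingerprint, presence, and `valuesPrinted=false` as closeout evidence; do not repeat the old or new value.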
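+- Leftover resources from past validation probes in the runtime: clean up `unidesk-probe-*` temporary resources only with `codex-pool cleanup-probes --confirm`; do not treat deleting a real managed account as probe cleanup or availability recovery.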
+- Default profile recursion: check whether the YAML default entry uses the `*.pre-sub2api` backup files; if necessary, restore the backup and then rerun `configure-local --confirm`.
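+- Upstream requires WebSocket v2: run a direct Codex WSv2 probe first; only after it passes, set `openaiResponsesWebSocketsV2Mode: ctx_pool|passthrough` on that profile and run `sync --confirm`; treat it as a capability candidate, with capacity still governed by `capacity` in YAML or the default value.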
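+- Codex falls back from WebSocket at startup: reproduce with a Codex smoke through the original entry point, then use bounded Sub2API logs to identify the account; for accounts with WS handshake 4xx/5xx, `openai.websocket_account_select_failed`, or close-before-`response.completed`, disable the WSv2 capability in YAML and then sync. If no WSv2-capable account remains, turn off `localCodex.supportsWebSockets` and `localCodex.responsesWebSocketsV2` together; do not write an inference from temporary availability into scheduling configuration.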
+- Upstream requires a Codex User-Agent: set `upstreamUserAgent` only on that profile and run `sync --confirm`.
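+- After the upstream reports capacity/rate-limit/overload/Bad Gateway/Gateway Timeout, there is no isolation, or requests frequently fail first and then recover: first check the marker, action, freeze TTL, and next probe in `codex-pool sentinel-report`, and also the recent gateway failover/forward failure evidence in `codex-pool validate --full`; at the same time, compare against the actual propagation path in the current Sub2API source code through the `/v1/responses` handler, `Forward`, `shouldFailoverOpenAIUpstreamResponse`, and `handleOpenAIAccountUpstreamError`. Do not manually disable accounts, delete accounts, change membership/priority/capacity/loadFactor, or remove the problem account from YAML as a substitute for generic failover and sentinel isolation/recovery.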
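+- Codex reports weekly-limit, `less than 10% of your weekly limit left`, `Run /status for a breakdown`, or similar account-status/soft-quota hints, and an account switch is requested: do not restore availability by writing the new keywords into a Sub2API built-in temporary-unschedulable policy; let the marker-only sentinel freeze uniformly on non-marker responses, and verify with `sentinel-report` / `sentinel-probe`.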
+- An upstream 400/503 response body contains `invalid_encrypted_content`, `bad_response_status_code`, `invalid_request_error` plus stable unsupported-model wording, unsupported-model, `暂不支持` / `可用模型` ("not yet supported" / "available models"), `model_not_found`, `No available channel for model ...`, or a similar stable model-routing / Responses encrypted-content compatibility failure: handle it with generic temp-unschedulable/failover plus sentinel marker evidence; do not mask this error family with account membership, priority, capacity, loadFactor, WebSocket mode, User-Agent, or provider pinning.
+- Upstream errors trigger repeatedly: for `invalid_encrypted_content`, unsupported-model, `Recovered upstream error ...`, `Bad Gateway`, `Gateway Timeout`, Cloudflare `524`, Codex-facing `Upstream request failed`, `Unknown error`, `context deadline exceeded`, `context canceled`, `model_not_found`, `No available channel for model`, large-context `413`, and stable wrapper wording such as `openai_error`, first confirm that the YAML temp-unschedulable config is synced, that the Sub2API source propagates this error family as `UpstreamFailoverError`, and that the runtime logs show `openai.upstream_failover_switching`. If you still see only `openai.forward_failed` after the rule matches, the root cause is that Sub2API HTTP `/responses` did not propagate the error as `UpstreamFailoverError`; fix the Sub2API failover classifier/error propagation instead of hardcoding accounts or granting `only` privileges.
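+- Context is lost after Codex auto compact: first confirm whether YAML `localCodex` declares WSv2 enabled; if it does, then confirm whether local `~/.codex/config.toml` has `supports_websockets = true` and `responses_websockets_v2 = true`, and check the WSv2 candidate in `codex-pool validate` and `transport=responses_websockets_v2` in the Sub2API logs. If YAML currently disables WSv2, troubleshoot HTTP Responses stability instead; do not treat the old WS criteria as an acceptance requirement.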
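+- Codex smoke shows reconnect/1013: this is an upstream concurrency/availability problem and is handled separately from HTTP-only compact context loss; record session/log evidence and link it to the dedicated issue; do not override YAML capacity with manual runtime patches.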
diff --git a/.agents/skills/unidesk-sub2api/references/troubleshooting-public.md b/.agents/skills/unidesk-sub2api/references/troubleshooting-public.md
new file mode 100644
index 00000000..44d0f801
--- /dev/null
+++ b/.agents/skills/unidesk-sub2api/references/troubleshooting-public.md
@@ -0,0 +1,8 @@
+## Public Exposure Troubleshooting
+
+- FRP is unreachable: first check `masterFrps`, `masterCaddy`, `sub2api-frpc`, and the public 401 probe in the output of `codex-pool expose --confirm`; when lower-level evidence is needed, use only `trans G14:k3s` for bounded queries.
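+- The public URL of a k3s external-active target is unreachable: first distinguish DNS/TLS/Caddy/FRP/Sub2API. When DNS does not resolve to the PK01 address declared in YAML, Caddy ACME fails and HTTPS cannot be counted as complete; the PK01 loopback FRP port and the PK01 public remotePort can prove the FRP data path, but you must still wait for DNS to take effect and then rerun HTTPS health, `/v1/models`, and `/v1/responses`. The PK01 host-Docker local target does not go through FRP, so an FRP port probe cannot replace local loopback/Caddy/app verification.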
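+- Other PK01 HTTPS services disappear after a D601 external-active apply: first suspect a failed merge of the shared Caddy managed block or a recurrence of the old whole-file write path. Collect evidence from the controlled Sub2API apply output and the PK01 Caddy managed block markers, then restore through each service's own YAML apply/public-exposure entry point; do not hand-copy one Caddyfile as a long-term fix.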
+- Caddy download is slow or fails: first confirm that `config/platform-infra/sub2api.yaml` sets `publicExposure.pk01.caddyDownloadProxyUrl` for the target, then rerun `sub2api apply --target <id> --confirm` and check for `downloadProxy.mode=curl-proxy` in the PK01 apply summary. Do not repeatedly connect directly to GitHub releases.
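+- When `/responses/compact` returns 504 after a fixed duration close to the master Caddy `response_header_timeout`, or the Sub2API log later records `codex.remote_compact.succeeded`, first check whether the master Caddy `response_header_timeout` is rendered from YAML `publicExposure.masterCaddy.responseHeaderTimeoutSeconds`; after correcting it, run `codex-pool expose --confirm`. This kind of edge-proxy timeout does not trigger account-level temporary offlining in Sub2API. Compact requests already in flight before the reload may still end under the old timeout, so judge whether the fix took effect only from new requests started after the reload.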
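+- `/responses/compact` or an ordinary public URL returns 502 within a window of a few seconds, the Caddy log shows `dial tcp 127.0.0.1:<remotePort>: connect: connection refused`, `EOF`, or `connection reset by peer`, and the frps log shows `platform-infra-sub2api proxy closing` / `listener is closed` / `new proxy ... success`: the failure is at the edge layer between master Caddy and the FRP remotePort, and Sub2API and the sentinel may not see it at all. First confirm that `publicExposure.masterCaddy.edgeRetry` is rendered from YAML and has taken effect via `codex-pool expose --confirm`; if it still happens frequently, go on to investigate the stability of the control connection from G14 `sub2api-frpc` to master `frps`. Do not misjudge this kind of edge 502 as an account-pool upstream error, and do not recover by disabling accounts.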
diff --git a/.agents/skills/unidesk-sub2api/references/troubleshooting-runtime.md b/.agents/skills/unidesk-sub2api/references/troubleshooting-runtime.md
new file mode 100644
index 00000000..bc5889f7
--- /dev/null
+++ b/.agents/skills/unidesk-sub2api/references/troubleshooting-runtime.md
@@ -0,0 +1,6 @@
+## Runtime Troubleshooting
+
+- When `api.pikapython.com` returns 502/503, first determine the target and failure layer from YAML. For the PK01 host-Docker target, first run `sub2api status --target PK01` and `sub2api validate --target PK01`, then check in turn the PK01 Caddy managed block, loopback app health, Docker container health, admin account availability, and a minimal public `/v1/responses` smoke. For k3s/FRP targets, first run the corresponding `sub2api status --target <id>` and `validate --target <id>`; if `sub2api`, `sub2api-frpc`, `sub2api-redis`, or `sub2api-egress-proxy` shows `0/1`, or validate shows `no endpoints available for service "sub2api"` / the app Pod has terminated, first reconverge the YAML resources with `bun scripts/cli.ts platform-infra sub2api apply --target <id> --confirm`, poll using the returned `job status`, and then run `status`, `validate`, and whatever Codex-pool validation is available. Do not start by changing the account pool, sentinel state, Secrets, or Caddy.
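+- After a quick recovery, close out with layered evidence: the target's public `/health` should return 200; a minimal public `/v1/responses` marker request should return 200 using the unified key or an explicitly designated user key; output only the HTTP status, model count, marker, account id/group id, and key fingerprint, and do not print the key. Do not run `configure-local --confirm` for the sake of public verification, because it rewrites the local `~/.codex`; a 401 for the local default `auth.json` key only shows that the local configuration and the public unified key are inconsistent, and cannot be used as evidence that the service is unavailable.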
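+- Sub2API is stuck in `wait-postgres` / `wait-redis`, or the service shows many `context deadline exceeded` errors: first run `sub2api status` and check `networkPolicy.ok`, then run `sub2api validate` and check `postgresCrossPodPgIsReady` / `redisCrossPodPing`; if they are missing or abnormal, restore the controlled `NetworkPolicy/allow-all` with `sub2api apply --confirm`, and do not keep a manual iptables bypass as a long-term fix.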
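+- `codex-pool sync --confirm` or `codex-pool validate` times out: first distinguish a CLI transport timeout from a Sub2API runtime failure. The controlled CLI should return remote job progress and stdout/stderr tails; if it is only a low-level `trans` 60s timeout, you cannot conclude from it that Sub2API failover is not working. Switch to, or fix, the CLI's remote job/poll path, rerun, and use the final structured result as evidence.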
diff --git a/.agents/skills/unidesk-sub2api/references/troubleshooting.md b/.agents/skills/unidesk-sub2api/references/troubleshooting.md
new file mode 100644
index 00000000..8e7ac7a4
--- /dev/null
+++ b/.agents/skills/unidesk-sub2api/references/troubleshooting.md
@@ -0,0 +1,9 @@
+## Troubleshooting Entry Point
+
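+Triage Sub2API problems by failure layer first; do not mix the public edge, runtime health, account-pool scheduling, and local Codex configuration into a single handling surface.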
+
+- `api.pikapython.com` 502/503, Pod/container health, NetworkPolicy, wait-postgres, controlled apply/sync timeouts: read [troubleshooting-runtime.md](troubleshooting-runtime.md).
+- Codex pool, sentinel, account freeze/recovery, WebUI account test, unified key, upstream error classification, WSv2, local Codex profile: read [troubleshooting-accounts.md](troubleshooting-accounts.md).
+- FRP, Caddy, PK01 shared Caddy managed block, public URL, edge timeout/502: read [troubleshooting-public.md](troubleshooting-public.md).
+
+Layered closeout must output only the HTTP status, model count, marker, account id/group id, key fingerprint, object names, and presence; it must not print a complete key, token, password, or Secret `.data`.
diff --git a/.agents/skills/unidesk-sub2api/references/upstreams.md b/.agents/skills/unidesk-sub2api/references/upstreams.md
new file mode 100644
index 00000000..73797f26
--- /dev/null
+++ b/.agents/skills/unidesk-sub2api/references/upstreams.md
@@ -0,0 +1,23 @@
+## Adding an Upstream
+
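+1. In master `~/.codex/`, prepare suffixed upstream profile files, for example `config.toml.<profile>` and `auth.json.<profile>`; never overwrite the default `config.toml` / `auth.json`.
+2. In `config/platform-infra/sub2api-codex-pool.yaml`, add a `profiles.entries` item that specifies `profile`, `accountName`, `configFile`, and `authFile`.
+3. If needed, add `priority`, `capacity`, `loadFactor`, `trustUpstream`, `sentinelProtect`, `openaiResponsesWebSocketsV2Mode`, or `upstreamUserAgent` to the item; the concrete values for capacity/loadFactor/trust backoff/protection threshold are written only in YAML. Add per-account `tempUnschedulable` only when explicitly restoring Sub2API's built-in temporary unschedulable behavior.
+4. If the new account raises the sum of declared capacity, by default leave an omitted `pool.minOwnerConcurrency` to keep resolving automatically from the capacity sum; only when YAML already sets that override explicitly should you raise it to at least the total capacity, or delete the override to return to automatic resolution.
+5. Run `codex-pool plan` and confirm that the profile is readable, that the `base_url` and API key source are valid, and that stdout does not leak the complete key.
+6. Run `codex-pool sync --confirm`.
+7. Run `codex-pool validate`.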
+
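+Adding an ordinary upstream is a YAML operation; it does not go through CI/CD and does not change code. Modify `scripts/src/platform-infra-sub2api-codex.ts` only when you need to render or validate a reusable capability that already exists in upstream Sub2API; capabilities that Sub2API itself does not support are not hacked into the UniDesk side.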
+
+## Removing an Upstream
+
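+Removing an upstream is only for explicit retirement, a change of credential ownership, or an explicit user request to remove the provider; it must not be used as a way to handle upstream 5xx, compact failures, rate limiting, model-routing failures, or sentinel isolation/recovery problems.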
+
+1. Delete the corresponding `profiles.entries` item from `config/platform-infra/sub2api-codex-pool.yaml`.
+2. Run `codex-pool plan` and check the desired list.
+3. Run `codex-pool sync --confirm --prune-removed`.
+4. Confirm that `accounts.pruned` in the output contains only the expected deletions.
+5. Run `codex-pool validate`.
+
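+By default the CLI keeps absent accounts, to avoid mishandling an availability problem as a deletion; only an explicit `--prune-removed` prunes absent accounts whose `name` starts with `unidesk-codex-` and that have `extra.unidesk_managed=true`.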
diff --git a/.agents/skills/unidesk-sub2api/references/validation.md b/.agents/skills/unidesk-sub2api/references/validation.md
new file mode 100644
index 00000000..65357b15
--- /dev/null
+++ b/.agents/skills/unidesk-sub2api/references/validation.md
@@ -0,0 +1,12 @@
+## Acceptance Criteria
+
+Deployment closeout includes at least:
+
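+- `sub2api status`: Deployment/StatefulSet/Service/Secret/NetworkPolicy are visible, the running image matches YAML, and `NetworkPolicy/allow-all` has `podSelector: {}` with Ingress/Egress fully allowed.
+- `sub2api validate`: checks pass for the app, PostgreSQL, Redis, service proxy, `NetworkPolicy/allow-all`, and temporary cross-Pod PostgreSQL/Redis connectivity.
+- `codex-pool validate`: `GET /v1/models` with the unified key succeeds, and one small `POST /v1/responses` smoke runs with `localCodex.responsesSmokeModel`; owner balance / owner concurrency meet the YAML minimums, and capacity, WebSocket v2, the Sub2API built-in temporary-unschedulable switch/rules, and sentinel runtime state align with YAML; `validation.gatewayResponsesRecent` summarizes the last 6 hours of evidence for ordinary `/responses` and `/v1/responses` (failover, forward failure, final 4xx/5xx, slow final errors, and `context canceled`), while `validation.gatewayCompactRecent` summarizes `/responses/compact` evidence separately. If the current Responses smoke has `ok=true` but the recent field has `degraded=true`, first distinguish residue from the historical window from a new request id that is currently failing; for long-term criteria see `docs/reference/platform-infra.md`.
+- If `publicExposure.enabled=true`, confirm that the public path declared in YAML is usable. For FRP targets, check the FRP path; for the PK01 local target, check the PK01 Caddy managed block and loopback upstream. A 401 from public `/v1/models` without a key only proves that the gateway is reachable; it does not prove that the account pool is schedulable.
+- When multiple targets enable public exposure at the same time, each target's root, `/health`, keyless `/v1/models` 401, and its own `codex-pool validate --target <id>` must be verified separately; one domain being available cannot substitute for acceptance of another domain.
+- If the target declares `egressProxy.enabled=true`, confirm that the proxy Deployment/Service is ready, that the Sub2API and sentinel env align with YAML, and that the proxy egress probe passes via the health URL declared in YAML.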
+
+To prove that real model requests work, use a minimal `/v1/responses` or an equivalent Codex smoke. Do not interpret a group-level `/v1/models` success as every upstream account being healthy.
diff --git a/docs/reference/observability.md b/docs/reference/observability.md
index fe0f4e92..06b7fed5 100644
--- a/docs/reference/observability.md
+++ b/docs/reference/observability.md
@@ -57,7 +57,7 @@ OA Event Flow 的高频 trace 统计不得把每个 `trace-stats-updated` 投影
 
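 The global stdout guard must not return only dump metadata. For known high-frequency long-output commands, the bounded wrapper's `summary` must retain command-specific fields that can be used directly for closeout, while leaving the full payload in the dump/raw drill-down. For example, when `debug dispatch ... provider.upgrade` exceeds the threshold, it should retain the dispatch task, wait task, plan host root, current/target gateway version, scheduler result, and the final promoted container's `version`, `restartPolicy`, `pidMode`, and `heartbeatTimestamp`; when `provider triage --full` exceeds the threshold, it should retain `decision`, `scope`, `retryable`, failed/degraded/healthy scopes, signal counts, recommended cross-checks, and a preview of problem signals. When adding a diagnostic command that will consistently exceed the threshold, prefer adding a command-specific compact summary over raising the global stdout threshold.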
 
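-When a local or remote `AGENTS.md`, `CLAUDE.md`, or similar agent entry document exceeds `10 KiB`, exceeds the YAML dump threshold, or triggers an automatic dump when read through CLI/SSH/trans, the dump file path must not simply be treated as the normal entry point for continuing work. This indicates that the entry document is already too long and must be split into a short index per `docs-spec`: keep only the P0 rule summary, key command entry points, and links to authoritative documents; move specific procedures, background, decision criteria, and long-form constraints into the corresponding skill's `SKILL.md` or the `docs/reference/` long-term reference. After the split, the entry document, skill, and long-term reference must cross-reference each other, so that the same rule is not expanded repeatedly in multiple places and no second source of truth arises.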
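+When a local or remote `AGENTS.md`, `CLAUDE.md`, `SKILL.md`, or similar agent entry document exceeds `10 KiB`, exceeds the YAML dump threshold, or triggers an automatic dump when read through CLI/SSH/trans, the dump file path must not simply be treated as the normal entry point for continuing work. This indicates that the entry document is already too long and must be split into a short index per `docs-spec`: keep only the P0 rule summary, key command entry points, and links to authoritative documents; move specific procedures, background, decision criteria, and long-form constraints into the files responsible for them. After `SKILL.md` is split into `references/`, it is forbidden to pile the content back into `references/full.md`, `all.md`, `guide.md`, or any other disguised super Markdown file; it must be split by responsibility, lifecycle, and reading scenario into multiple selectable references, and `SKILL.md` must state clearly "when to read which file". After the split, the entry document, skill, and long-term reference must cross-reference each other, so that the same rule is not expanded repeatedly in multiple places and no second source of truth arises.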
 
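 When the CLI hits `EPIPE` while writing to stdout/stderr because the downstream pipe closed, it must exit quietly and must not print a Bun stack trace. A common verification command is `set -o pipefail; bun scripts/cli.ts server status | head -1`, which should show only the first line of JSON with no extra error noise.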