merge: sync master before issue 270 push

Codex
2026-06-11 16:31:55 +00:00
14 changed files with 396 additions and 147 deletions
+4 -3
@@ -56,6 +56,7 @@ bun scripts/cli.ts agentrun ack session/<sessionId>
bun scripts/cli.ts agentrun dispatch task/<taskId>
bun scripts/cli.ts agentrun send session/<sessionId> --aipod Artificer --prompt-stdin
bun scripts/cli.ts agentrun cancel session/<sessionId> --reason <text> --dry-run
bun scripts/cli.ts agentrun cancel command/<commandId> --run <runId> --reason <text> --dry-run
```
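For routine task manifests, prefer a YAML heredoc with `agentrun apply -f -`; for single-prompt dispatch, prefer `agentrun create task --aipod Artificer --prompt-stdin`; to continue within the same session, use only `agentrun send session/<sessionId>`. The UniDesk client connects directly to the AgentRun REST API according to `config/agentrun.yaml`, without going through the HWLAB runtime, the SSH official CLI, or the old bridge wrapper. `send` is the only user-level write entry point for session follow-up; the server decides automatically between an internal `steer` and a new `turn` based on durable session/run/command state, and the old CLI `turn/steer` paths are not kept for compatibility. `--json-file`, `--prompt-file`, and `--runner-json-file` are only client-side input sources, intended for reviewed, reusable controlled files. This is not the old Code Queue adapter: it does not dual-write and does not migrate old history.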
@@ -71,9 +72,9 @@ The AgentRun queue lifecycle is not a single `queue lifecycle` command, but
1. For the default overview, use `get tasks --queue commander --limit 20` and look only at task state, queue/lane, run/cmd/rjob/session refs, age, and attention.
2. For a single task, use `describe task/<taskId>` and read `latestAttempt.runId`, `commandId`, `runnerJobId`, `sessionId/sessionPath`, and the small number of `Next:` entries.
3. For run-level status, use `events run/<runId>` and `result run/<runId> --command <commandId>` to determine terminalClassification, failureKind, provider interruption, timeoutBudget, and recoveryActions.
4. For command-level status, use `describe command/<commandId> --run <runId>` and `result command/<commandId> --run <runId>` to confirm command state, ack, terminal status, and the result summary.
4. For command-level status, use `describe command/<commandId> --run <runId>` and `result command/<commandId> --run <runId>` to confirm command state, ack, terminal status, and the result summary; once a single active command is confirmed stuck, clear it with `cancel command/<commandId> --run <runId> --reason <text>`, keep the same session, and then continue with `send session/<sessionId>`.
5. For read-only runner job status, use `describe runnerjob/<runnerJobId> --run <runId>` to confirm env image reuse, jobName, namespace, phase, exitCode, retention, and `valuesPrinted=false`. Do not call `trans G14:k3s kubectl ...` by hand just to get these fields.
6. For session trace/output, use `logs|ack|send|cancel session/<sessionId>` only when `describe task` or the result contains an actual `sessionId`; when `sessionRef=null`, do not guess session commands.
6. For session trace/output, use `logs|ack|send|cancel session/<sessionId>` only when `describe task` or the result contains an actual `sessionId`; when `sessionRef=null`, do not guess session commands. For user-level follow-up, always use `send session/<sessionId>`; do not fall back to the old `turn/steer` or `sessions ...` compatibility paths.
7. For a task that has been created but not yet run, dispatch it with `dispatch task/<taskId>`; do not fall back to the old bridge `queue dispatch`.
The default view must be low-noise and must not be a JSON envelope; only `-o json|yaml` emits the stable machine-readable structure, and only `--raw` keeps the direct AgentRun REST envelope. Next steps in command output should prefer `bun scripts/cli.ts agentrun ...` resource primitives and must not present manual k8s queries as a routine next step.
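The lifecycle steps above can be sketched as a small function that maps the refs found on a task to the next resource commands. This is only an illustration: the `TaskRefs` shape, the `state` values, and the `nextCommands` helper are hypothetical and are not taken from the actual CLI; only the command strings come from the steps above.

```typescript
// Hypothetical shape: field names mirror the refs named in the steps above,
// but the real `describe task` output may differ.
interface TaskRefs {
  taskId: string;
  state: "created" | "running" | "terminal"; // assumed state set
  runId?: string;
  commandId?: string;
  runnerJobId?: string;
  sessionId?: string | null;
}

const CLI = "bun scripts/cli.ts agentrun";

// Suggest follow-up commands using only refs that are actually present.
function nextCommands(t: TaskRefs): string[] {
  const out: string[] = [`${CLI} describe task/${t.taskId}`];
  if (t.state === "created") {
    // Created but not yet run: dispatch instead of guessing run/session refs.
    out.push(`${CLI} dispatch task/${t.taskId}`);
    return out;
  }
  if (t.runId) {
    out.push(`${CLI} events run/${t.runId}`);
    if (t.commandId) {
      out.push(`${CLI} describe command/${t.commandId} --run ${t.runId}`);
      out.push(`${CLI} result command/${t.commandId} --run ${t.runId}`);
    }
    if (t.runnerJobId) {
      out.push(`${CLI} describe runnerjob/${t.runnerJobId} --run ${t.runId}`);
    }
  }
  // Session commands appear only when a real sessionId exists.
  if (t.sessionId) {
    out.push(`${CLI} logs session/${t.sessionId}`);
  }
  return out;
}
```

The point of the sketch is the guard on `sessionId`: when the session ref is null, no session command is suggested at all.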
@@ -188,7 +189,7 @@ bun scripts/cli.ts codex interrupt <taskId>
bun scripts/cli.ts codex cancel <taskId>
```
Use only to stop leftover tasks from the old Code Queue; for new AgentRun sessions use `bun scripts/cli.ts agentrun cancel session/<sessionId>`.
Use only to stop leftover tasks from the old Code Queue; for new AgentRun sessions use `bun scripts/cli.ts agentrun cancel session/<sessionId>`, and for a single stuck command use `bun scripts/cli.ts agentrun cancel command/<commandId> --run <runId>`.
---
+4 -2
@@ -59,9 +59,10 @@ bun scripts/cli.ts server logs
```bash
bun scripts/cli.ts server cleanup plan [--min-age-hours 24] [--limit N]
bun scripts/cli.ts server cleanup run --confirm [--min-age-hours 24] [--limit N]
```
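Generates only a dry-run plan and performs no deletion. Conservative allowlist: keep images of running/stopped containers, deploy.json/CI.json commit-pinned artifacts, and Compose stable images. `docker system prune`, `docker volume rm`, and `docker compose down -v` are prohibited.
`plan` generates only a dry-run plan; `run --confirm` deletes only the stale Docker images selected by the same classifier. Conservative allowlist: keep images of running/stopped containers, deploy.json/CI.json commit-pinned artifacts, and Compose stable images. `docker system prune`, `docker image prune`, `docker volume rm`, `docker compose down -v`, and database cleanup are prohibited. High-risk candidates are executed only when `--include-high-risk` is additionally passed explicitly.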
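As a rough illustration of the "same classifier" rule, the sketch below selects stale images once and lets both a plan and a confirmed run consume the same result. The types, the `risk` levels, and `selectStaleImages` are assumptions made for this example; the real classifier in `scripts/cli.ts` is not shown in this change and may differ.

```typescript
// Hypothetical image record; the real inventory fields are not shown in the diff.
interface ImageInfo {
  id: string;
  createdAtMs: number;
  risk: "low" | "high"; // assumed risk levels
}

interface CleanupOptions {
  nowMs: number;
  minAgeHours: number; // mirrors --min-age-hours (default 24)
  includeHighRisk: boolean; // mirrors --include-high-risk
  limit?: number; // mirrors --limit
}

// One classifier for both `plan` and `run --confirm`: protected images are
// never candidates, young images are never stale, and high-risk candidates
// are skipped unless explicitly included.
function selectStaleImages(
  images: ImageInfo[],
  protectedIds: Set<string>,
  opts: CleanupOptions,
): { remove: ImageInfo[]; skipped: ImageInfo[] } {
  const minAgeMs = opts.minAgeHours * 3600 * 1000;
  const candidates = images.filter(
    (img) => !protectedIds.has(img.id) && opts.nowMs - img.createdAtMs >= minAgeMs,
  );
  const limited = opts.limit === undefined ? candidates : candidates.slice(0, opts.limit);
  const remove = limited.filter((img) => img.risk === "low" || opts.includeHighRisk);
  const skipped = limited.filter((img) => img.risk === "high" && !opts.includeHighRisk);
  return { remove, skipped };
}
```

Under these assumptions a plan would print `remove` and `skipped` without touching anything, while a confirmed run would delete exactly the `remove` list.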
---
@@ -82,12 +83,13 @@ bun scripts/cli.ts gc remote <providerId> [--target-use-percent N] [--dry-run|--
```bash
bun scripts/cli.ts gc plan --target-use-percent 69 \
--include-tool-caches \
--include-stale-tmp \
--include-vscode-stale-servers \
--include-vscode-stale-extensions \
--include-baidu-staging
```
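`--target-use-percent` estimates the shortfall using the usage figure reported by `df`. Tool caches, old VS Code server/extension versions, and old PGDATA tarballs in Baidu staging are all disabled by default; they become candidates only after an explicit include flag, and execution is still guarded by allowlist path assertions. Default GC does not touch PGDATA, Docker volumes/images, Codex sessions/auth state, or the Baidu staging root directory.
`--target-use-percent` estimates the shortfall using the usage figure reported by `df`. Tool caches, non-allowlisted direct children of `/tmp`, old VS Code server/extension versions, and old PGDATA tarballs in Baidu staging are all disabled by default; they become candidates only after an explicit include flag, and execution is still guarded by path assertions. The stale `/tmp` scan enumerates candidates in a bounded way according to `--limit`, so that estimating the whole temp directory does not cause a long period without output. Default GC does not touch PGDATA, Docker volumes/images, Codex sessions/auth state, or the Baidu staging root directory.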
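The shortfall arithmetic behind `--target-use-percent` can be sketched as follows. This is a minimal illustration that assumes the `df` convention of `Use% = used / (used + available)` rounded up; the interface and function names are hypothetical, and the safe-stop decision is deliberately not modeled.

```typescript
// Hypothetical snapshot of one filesystem as reported by `df` (in bytes).
interface DfSnapshot {
  usedBytes: number;
  availBytes: number;
}

interface TargetSummary {
  requiredBytes: number; // bytes that must be freed to reach the target
  candidateBytes: number; // bytes the current candidates are estimated to free
  projectedUsePercent: number; // estimated Use% after removing the candidates
  shortfallBytes: number; // what is still missing after the candidates
}

// df-style percentage: used / (used + available), rounded up.
function usePercent(df: DfSnapshot): number {
  const total = df.usedBytes + df.availBytes;
  return Math.ceil((df.usedBytes * 100) / total);
}

function targetSummary(df: DfSnapshot, targetUsePercent: number, candidateBytes: number): TargetSummary {
  const total = df.usedBytes + df.availBytes;
  // Largest "used" value that still satisfies the target percentage.
  const allowedUsed = Math.floor((total * targetUsePercent) / 100);
  const requiredBytes = Math.max(0, df.usedBytes - allowedUsed);
  // Freed bytes move from "used" to "available"; the total stays constant.
  const freed = Math.min(candidateBytes, df.usedBytes);
  const projectedUsePercent = usePercent({
    usedBytes: df.usedBytes - freed,
    availBytes: df.availBytes + freed,
  });
  const shortfallBytes = Math.max(0, requiredBytes - candidateBytes);
  return { requiredBytes, candidateBytes, projectedUsePercent, shortfallBytes };
}
```

A non-zero `shortfallBytes` is the signal that the default candidates are not enough and that an explicit include flag would have to be considered.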
---
+12 -13
@@ -74,16 +74,15 @@ bun scripts/cli.ts platform-infra sub2api codex-pool cleanup-probes --confirm
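- `pool.apiKeySecretName` / `pool.apiKeySecretKey`: location of the k3s Secret holding the unified consumer API key; default `platform-infra/sub2api-codex-pool-api-key.API_KEY`.
- `pool.minOwnerBalanceUsd`: minimum balance of the pool key owner; sync/validate top it up.
- `pool.minOwnerConcurrency`: optional minimum concurrency for the owner of the unified consumer API key; when omitted, the CLI automatically uses the sum of all resolved account capacities, and sync/validate top it up. An explicit YAML value is only an override and must still be no less than the account capacity sum; accounts without an explicit `profiles.entries[].capacity` contribute `pool.defaultAccountCapacity` to the sum. Do not raise one provider's capacity to mask WS 1013 at the user-concurrency layer.
- `pool.defaultTempUnschedulable`: default account-level temporary take-offline rules; declare only error-path capabilities that Sub2API already supports, so that when the upstream returns capacity, rate-limit, overload, service unavailable, gateway timeout, stable model-routing errors, or abnormal authentication state, Sub2API cools down that account and switches to another account in the same group. Do not use YAML, the UniDesk CLI, k8s hot patches, or a local fork to hack in behavior that Sub2API does not support.
- Business values such as `durationMinutes` in `pool.defaultTempUnschedulable` are read only from YAML and synced to Sub2API; do not write a second copy of upper bounds, lower bounds, or tiering policy in TypeScript defaults, schemas, contract tests, or documentation prose.
- When automatic freezing/account switching fails, fix the `temp_unschedulable` and failover mechanism itself and use runtime evidence to prove that the failing account was temporarily frozen and that requests moved to other schedulable accounts; do not substitute for automatic recovery by manually disabling accounts, deleting accounts, removing YAML entries, lowering membership, or temporarily changing scheduling policy. Only an explicit upstream retirement or ownership change goes through the delete/disable upstream procedure.
- YAML only selects and configures Codex upstreams and does not declare a long-lived `schedulable` field; `schedulable=true` may only be restored as the process-control baseline of `codex-pool sync --confirm`. Automatic freezing must show up as `temp_unschedulable_until` / `temp_unschedulable_reason`, so that permanent unschedulability is not mistaken for automatic freezing.
- `pool.defaultTempUnschedulable`: the Sub2API built-in temporary-unschedulable switch and its YAML rule list. The current requirement is `enabled=false`; YAML keeps the rules for explicit restoration later. Sync follows the WebUI switch-off semantics and deletes the runtime `temp_unschedulable_enabled` / `temp_unschedulable_rules` credentials fields, so Sub2API built-in rules take no part in scheduling.
- `pool.defaultTempUnschedulable` and the external `sentinel.*` are configured separately and do not drive each other. Turning the built-in switch off does not affect the sentinel, and sentinel configuration changes must not implicitly turn the built-in rules on.
- YAML only selects and configures Codex upstreams and does not declare a long-lived `schedulable` field; `schedulable=true` may only be restored as the process-control baseline of `codex-pool sync --confirm` for accounts that are not under sentinel quarantine.
- `profiles.entries`: selects upstream profiles from master `~/.codex/` and maps them to Sub2API accounts.
- `profiles.entries[].capacity`: optional per-account concurrency override; when absent, `pool.defaultAccountCapacity` is used. Concrete values are authoritative only in `config/platform-infra/sub2api-codex-pool.yaml`; the skill and long-term references describe rules only and do not repeat current values.
- `profiles.entries[].loadFactor`: optional per-account Sub2API `load_factor` override; when absent, `pool.defaultAccountLoadFactor` is used. Concrete values are authoritative only in `config/platform-infra/sub2api-codex-pool.yaml`; after a change you must run `codex-pool sync --confirm` and `codex-pool validate`.
- `profiles.entries[].trustUpstream`: optional account-level sentinel trust flag; default `false`. Trusted accounts use `sentinel.cadence.trustedSuccessMaxIntervalMinutes` as the maximum probe backoff after consecutive successes; untrusted accounts use `sentinel.cadence.untrustedSuccessMaxIntervalMinutes`. It affects only sentinel probe frequency and status visibility and does not change Sub2API account priority/capacity/loadFactor.
- Unless the user explicitly asks for a configuration change, do not change account membership, priority, capacity, loadFactor, WebSocket mode, or other scheduling policy from inference alone; first keep the YAML as is, complete provenance/runtime evidence tracing, and write the conclusion back to the relevant issue or runbook before proposing a change.
- `profiles.entries[].tempUnschedulable`: optional per-account override of the temporary take-offline rules; field semantics are authoritative in `docs/reference/platform-infra.md`. Do not declare here any success-body classification, scheduling policy, or account cooldown behavior that upstream Sub2API does not support.
- `profiles.entries[].tempUnschedulable`: optional per-account override of the Sub2API built-in temporary-unschedulable setting; it should currently also keep the switch off, with rules kept only in YAML and not used as a scheduling health mechanism.
- `profiles.entries[].openaiResponsesWebSocketsV2Mode`: set only for upstreams that need Responses WebSocket v2; the value is `off`, `ctx_pool`, or `passthrough`.
- `profiles.entries[].upstreamUserAgent`: set only for the few upstreams that require a Codex CLI User-Agent; must not contain newlines.
- `sentinel.monitor.enabled`: switch for account-level marker sentinel monitoring; when enabled, `codex-pool sync --confirm` creates/updates the k8s CronJob, ConfigMap, Secret, ServiceAccount, Role, and RoleBinding in `platform-infra`. The CronJob calls OpenAI Responses `gpt-5.5` on YAML-managed upstream accounts directly, uses a deterministic marker as the sole health criterion, and records a token/cost ledger in a separate state ConfigMap.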
@@ -115,7 +114,7 @@ When Codex startup repeatedly shows WebSocket reconnect, HTTPS fallback, `websocket cl
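1. Prepare suffixed upstream profile files in master `~/.codex/`, for example `config.toml.<profile>` and `auth.json.<profile>`; overwriting the default `config.toml` / `auth.json` is prohibited.
2. Add a `profiles.entries` item in `config/platform-infra/sub2api-codex-pool.yaml`, specifying `profile`, `accountName`, `configFile`, and `authFile`.
3. If needed, add `priority`, `capacity`, `loadFactor`, `trustUpstream`, `tempUnschedulable`, `openaiResponsesWebSocketsV2Mode`, or `upstreamUserAgent` to the item; concrete values for capacity/loadFactor/trust backoff are written only in YAML.
3. If needed, add `priority`, `capacity`, `loadFactor`, `trustUpstream`, `openaiResponsesWebSocketsV2Mode`, or `upstreamUserAgent` to the item; concrete values for capacity/loadFactor/trust backoff are written only in YAML. Add a per-account `tempUnschedulable` only when explicitly restoring the Sub2API built-in temporary-unschedulable behavior.
4. If the new account raises the declared capacity sum, by default let an omitted `pool.minOwnerConcurrency` continue to resolve automatically from the capacity sum; only when YAML already sets this override explicitly should you raise it to no less than the total capacity, or delete the override to return to automatic resolution.
5. Run `codex-pool plan` and confirm that the profile is readable, that the `base_url` and API key sources are valid, and that stdout does not leak the full key.
6. Run `codex-pool sync --confirm`.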
@@ -125,7 +124,7 @@ When Codex startup repeatedly shows WebSocket reconnect, HTTPS fallback, `websocket cl
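## Removing an upstream
Removing an upstream is only for explicit retirement, a change of credential ownership, or an explicit user request to remove the provider; it must not be used as a recovery measure for upstream 5xx, compact failures, rate limiting, model-routing failures, or defects in automatic freezing/account switching.
Removing an upstream is only for explicit retirement, a change of credential ownership, or an explicit user request to remove the provider; it must not be used to handle upstream 5xx, compact failures, rate limiting, model-routing failures, or sentinel quarantine/restore problems.
1. Delete the corresponding `profiles.entries` item from `config/platform-infra/sub2api-codex-pool.yaml`.
2. Run `codex-pool plan` and check the desired list.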
@@ -173,7 +172,7 @@ bun scripts/cli.ts platform-infra sub2api codex-pool configure-local --confirm
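- `sub2api status`: Deployment/StatefulSet/Service/Secret/NetworkPolicy are visible, the running image matches YAML, and `NetworkPolicy/allow-all` has `podSelector: {}` with all Ingress/Egress allowed.
- `sub2api validate`: checks of the app, PostgreSQL, Redis, service proxy, `NetworkPolicy/allow-all`, and temporary cross-Pod PostgreSQL/Redis connectivity pass.
- `codex-pool validate`: `GET /v1/models` with the unified key succeeds, and a small `POST /v1/responses` smoke is run with `localCodex.responsesSmokeModel`; owner balance / owner concurrency meet the YAML minimums, and capacity, WebSocket v2, and temporary-unschedulable runtime state are aligned with YAML. `validation.gatewayResponsesRecent` summarizes evidence from the last 6 hours of ordinary `/responses` and `/v1/responses` traffic (failover, forward failure, final 4xx/5xx, slow final error, and `context canceled`), and `validation.gatewayCompactRecent` summarizes `/responses/compact` evidence separately. If the current Responses smoke has `ok=true` but the recent field has `degraded=true`, first distinguish leftover history in the window from new request ids that are currently failing; for long-term criteria see `docs/reference/platform-infra.md`.
- `codex-pool validate`: `GET /v1/models` with the unified key succeeds, and a small `POST /v1/responses` smoke is run with `localCodex.responsesSmokeModel`; owner balance / owner concurrency meet the YAML minimums, and capacity, WebSocket v2, the Sub2API built-in temporary-unschedulable switch/rules, and sentinel runtime state are aligned with YAML. `validation.gatewayResponsesRecent` summarizes evidence from the last 6 hours of ordinary `/responses` and `/v1/responses` traffic (failover, forward failure, final 4xx/5xx, slow final error, and `context canceled`), and `validation.gatewayCompactRecent` summarizes `/responses/compact` evidence separately. If the current Responses smoke has `ok=true` but the recent field has `degraded=true`, first distinguish leftover history in the window from new request ids that are currently failing; for long-term criteria see `docs/reference/platform-infra.md`.
- If `publicExposure.enabled=true`, confirm that the FRP path works; `expose --confirm` uses a 401 from the public `/v1/models` without a key as the gateway reachability probe.
To prove that real model requests work, use a minimal `/v1/responses` request or an equivalent Codex smoke. Do not interpret a group-level `/v1/models` success as every upstream account being healthy.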
@@ -192,11 +191,11 @@ bun scripts/cli.ts platform-infra sub2api codex-pool configure-local --confirm
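- Upstream needs WebSocket v2: run a direct Codex WSv2 probe first; only after it passes, set `openaiResponsesWebSocketsV2Mode: ctx_pool|passthrough` on that profile and run `sync --confirm`; treat it as a capability candidate, with capacity still determined by the YAML `capacity` or the default.
- Codex startup WebSocket fallback: reproduce with a Codex smoke through the original entry point, then confirm the account from bounded Sub2API logs; for accounts showing WS handshake 4xx/5xx, `openai.websocket_account_select_failed`, or close-before-`response.completed`, turn off the YAML WSv2 capability and sync. If no WSv2-capable account remains, turn off `localCodex.supportsWebSockets` and `localCodex.responsesWebSocketsV2` together; do not write inferences about temporary availability into scheduling configuration.
- Upstream requires a Codex User-Agent: set `upstreamUserAgent` only on that profile and run `sync --confirm`.
- No account switch, or frequent fail-then-recover, after the upstream reports capacity/rate-limit/overload/Bad Gateway/Gateway Timeout: first confirm in `codex-pool validate` that `tempUnschedulable.ok=true`, that the target account has `runtimeEnabled=true`, and that the rule count matches YAML; then look at the account/upstream status in `validation.gatewayResponses.evidence.failovers`. On mismatch, run `codex-pool sync --confirm`; if runtime rules are aligned but the account still is not frozen or switched, keep fixing the Sub2API automatic freezing/failover capability and retest. Do not patch Sub2API credentials by hand, and do not manually disable, delete, or remove the problem account from YAML to work around the mechanism defect.
- No quarantine, or frequent fail-then-recover, after the upstream reports capacity/rate-limit/overload/Bad Gateway/Gateway Timeout: first check the marker, action, freeze TTL, and next probe in `codex-pool sentinel-report`; if needed, measure immediately with `codex-pool sentinel-probe --account <accountName> --confirm`. Do not substitute for sentinel quarantine/restore by enabling the Sub2API built-in temporary-unschedulable feature, manually disabling accounts, deleting accounts, or removing the problem account from YAML.
- `codex-pool sync --confirm` or `codex-pool validate` times out: first distinguish a CLI transport timeout from a Sub2API runtime failure. The controlled CLI should return remote job progress and stdout/stderr tails; if it is only the low-level `trans` 60s timeout, you cannot conclude from it that Sub2API failover is not working. Switch to, or fix, the CLI's remote job/poll path, rerun, and use the final structured result as evidence.
- Codex reports account-status/soft-quota messages such as weekly-limit, `less than 10% of your weekly limit left`, or `Run /status for a breakdown` and an account switch is wanted: if the upstream returns them with an error status such as 403/429, put the stable body keywords into the corresponding rule of `pool.defaultTempUnschedulable`, run `codex-pool sync --confirm`, and use `codex-pool validate` to confirm that every managed account's runtime rules contain these keywords. If the text is HTTP 200 success content, do not write a native Sub2API YAML 200 rule, do not hot-patch Sub2API, and do not bypass sync; when the account-level sentinel is enabled, the marker-only sentinel applies uniform exponential freezing to non-marker responses.
- Upstream 400/503 response body contains `invalid_encrypted_content`, `bad_response_status_code`, `invalid_request_error` plus stable unsupported-model text, unsupported-model, `暂不支持` / `可用模型`, `model_not_found`, `No available channel for model ...`, or a similar stable model-routing / Responses encrypted-content compatibility failure: put the stable body keywords into the corresponding 400/503 rule of `pool.defaultTempUnschedulable`, run `codex-pool sync --confirm`, and use `codex-pool validate` to confirm that the target account's runtime rule contains these keywords; do not mask this error family with changes to account membership, priority, capacity, loadFactor, WebSocket mode, or User-Agent.
- Upstream errors trigger repeatedly: stable wrapper messages such as `invalid_encrypted_content`, unsupported-model, `Recovered upstream error ...`, `Bad Gateway`, `Gateway Timeout`, Cloudflare `524`, Codex-facing `Upstream request failed`, `Unknown error`, `context deadline exceeded`, `context canceled`, `model_not_found`, `No available channel for model`, large-context `413`, and `openai_error` should all stay in the corresponding YAML cooldown policy, especially because upstream compatibility errors or 524 on the ordinary `/responses` and compact paths may finally surface as client 502/504 + `Unknown error`. Concrete values such as cooldown duration are authoritative only in YAML; after a change only `codex-pool plan`, `codex-pool sync --confirm`, and `codex-pool validate` are needed. Do not add contract tests, hard-coded ranges, or long-term reference figures for value adjustments. For long-term criteria see `docs/reference/platform-infra.md`.
- Codex reports account-status/soft-quota messages such as weekly-limit, `less than 10% of your weekly limit left`, or `Run /status for a breakdown` and an account switch is wanted: do not write new keywords into the Sub2API built-in temporary-unschedulable policy to restore availability; the marker-only sentinel freezes uniformly on non-marker responses, verified with `sentinel-report` / `sentinel-probe`.
- Upstream 400/503 response body contains `invalid_encrypted_content`, `bad_response_status_code`, `invalid_request_error` plus stable unsupported-model text, unsupported-model, `暂不支持` / `可用模型`, `model_not_found`, `No available channel for model ...`, or a similar stable model-routing / Responses encrypted-content compatibility failure: treat it as a sentinel marker failure; do not mask this error family with changes to account membership, priority, capacity, loadFactor, WebSocket mode, User-Agent, or the Sub2API built-in temporary-unschedulable feature.
- Upstream errors trigger repeatedly: stable wrapper messages such as `invalid_encrypted_content`, unsupported-model, `Recovered upstream error ...`, `Bad Gateway`, `Gateway Timeout`, Cloudflare `524`, Codex-facing `Upstream request failed`, `Unknown error`, `context deadline exceeded`, `context canceled`, `model_not_found`, `No available channel for model`, large-context `413`, and `openai_error` are all handled by the external sentinel and runtime log evidence; the built-in temporary-unschedulable rules are kept but disabled by default and are not the current recovery path. For long-term criteria see `docs/reference/platform-infra.md`.
- Context lost after Codex auto compact: first confirm whether YAML `localCodex` declares WSv2 enabled; if so, confirm that the local `~/.codex/config.toml` has `supports_websockets = true` and `responses_websockets_v2 = true`, and look at the WSv2 candidate in `codex-pool validate` and `transport=responses_websockets_v2` in the Sub2API logs. If YAML currently disables WSv2, troubleshoot HTTP Responses stability instead and do not treat the old WS criteria as an acceptance requirement.
- Codex smoke shows reconnect/1013: this is an upstream concurrency/availability problem and is handled separately from HTTP-only compact context-loss; record session/log evidence and link a dedicated issue. Do not override YAML capacity with manual runtime patches.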
@@ -208,5 +207,5 @@ bun scripts/cli.ts platform-infra sub2api codex-pool configure-local --confirm
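- Do not add CPU/memory limits to the Sub2API manifest unless there is a new, explicit decision expressed in YAML.
- Do not print full API keys, admin passwords, or Secret plaintext.
- Do not turn ordinary upstream additions/removals into code changes, CI/CD, feature flags, or compatibility dual paths.
- Do not treat manually disabling accounts, deleting accounts, removing YAML entries, lowering membership, or temporarily changing priority/capacity/loadFactor as a fix for automatic freezing/account-switching failures.
- Do not treat manually disabling accounts, deleting accounts, removing YAML entries, lowering membership, temporarily changing priority/capacity/loadFactor, or turning on the Sub2API built-in temporary-unschedulable feature as a fix for sentinel quarantine/restore problems.
- Do not hack Sub2API: if Sub2API itself does not support a capability, do not build it; do not substitute UniDesk scripts, in-place k8s hot patches, a local fork, pseudo-declarations in YAML, or hidden fallbacks for an upstream implementation.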
+2 -2
@@ -222,8 +222,8 @@ UniDesk is a distributed work platform with the main server as its unified entry point; this document
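- `bun scripts/cli.ts server status`: queries fixed ports, the swap summary, container status, health checks, and access URLs, covering the production frontend, dev frontend proxy, and provider ingress; for criteria see `docs/reference/deployment.md` and `docs/reference/dev-environment.md`.
- `bun scripts/cli.ts server swap status|ensure [--path /swapfile] [--size 2GiB] [--dry-run]`: views, or idempotently creates, the main server swapfile, reporting as JSON; `ensure` outputs before/after, actions, persistence status, and degraded/failed details; for rules see `docs/reference/deployment.md`.
- `bun scripts/cli.ts server logs [--tail-bytes N]`: returns paginated tails of file logs and Docker logs with truncation metadata; for logging rules see `docs/reference/observability.md`.
- `bun scripts/cli.ts server cleanup plan [--min-age-hours N] [--limit N]`: generates the main server Docker image cleanup plan as a read-only dry run; by default it lists only non-protected images created at least 24 hours ago, and outputs active/protected images, stale candidates, estimated reclaimable space, risk level, and the `docker image rm` commands that require manual confirmation; default deletion is prohibited, prune is prohibited, and touching database volumes, registry storage, or Baidu Netdisk state is prohibited.
- `bun scripts/cli.ts gc plan|run|db-trace|policy|remote`: entry point for one-off mitigation of disk high-water marks and low-risk bloat prevention on the main server or controlled providers; covers logs, journald, Docker BuildKit cache, allowlisted `/tmp` diagnostic directories, restricted core dumps, explicit trace telemetry retention, and systemd timer policy; for rules see `docs/reference/gc.md`.
- `bun scripts/cli.ts server cleanup plan|run --confirm [--min-age-hours N] [--limit N]`: generates the main server Docker image cleanup dry-run plan and, after explicit confirmation, deletes only the stale images selected by the same classifier; prune is prohibited, and touching database volumes, registry storage, or Baidu Netdisk state is prohibited; for rules see `docs/reference/cli.md` and `docs/reference/deployment.md`.
- `bun scripts/cli.ts gc plan|run|db-trace|policy|remote`: entry point for one-off mitigation of disk high-water marks and low-risk bloat prevention on the main server or controlled providers; covers logs, journald, Docker BuildKit cache, allowlisted `/tmp` diagnostic directories, explicitly opted-in stale direct children of `/tmp`, restricted core dumps, explicit trace telemetry retention, and systemd timer policy; for rules see `docs/reference/gc.md`.
- `bun scripts/cli.ts server rebuild <backend-core|frontend|dev-frontend-proxy|provider-gateway|todo-note|code-queue-mgr|project-manager|baidu-netdisk|oa-event-flow>`: rebuilds a single service within the main server Compose stack as an asynchronous job using build-first, the Compose lock, no-deps force-recreate, and post-up validation; returns a structured `unsupported-server-rebuild` for database, File Browser, the Code Queue execution plane, k3sctl-adapter, or unknown targets; for rules see `docs/reference/deployment.md` and `docs/reference/cicd-standardization.md`.
- `bun scripts/cli.ts provider attach <providerId> [--master-server URL] [--up] [--force]` / `bun scripts/cli.ts provider triage <providerId> [--observed-error text] [--observed-scope scope] [--microservice id ...] [--full|--raw]`: the former generates a two-setting provider-gateway attach bundle on a newly added compute node; the latter is a read-only multi-signal health adjudication entry point that by default emits low-noise `decision`, `healthyScopes`, `failedScopes`, `retryable`, and a summary of abnormal signals, used to classify a single-path `provider is not online`, SSH timeout, registry failure, or proxy failure as `retryable-transient`, `service-degraded`, or `global-offline`; full evidence requires an explicit `--full|--raw`; for rules see `docs/reference/provider-gateway.md` and `docs/reference/code-queue-supervision.md`.
- `trans <route> [operation args...]` / `tran <route> [operation args...]`: enters a provider, host workspace, Windows cmd route, k3s control plane, or pod workspace through provider-gateway's Host SSH / WSL SSH maintenance bridge, and provides `upload`/`download` file transfer with SHA-256 verification; manual/Codex distributed operations on the main server must prefer the local `trans` wrapper, with `tran` only as a compatibility entry point; for details see `docs/reference/cli.md`, `docs/reference/windows-passthrough.md`, and `docs/reference/provider-gateway.md`.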
@@ -8,7 +8,7 @@ pool:
  defaultAccountCapacity: 10
  defaultAccountLoadFactor: 10
  defaultTempUnschedulable:
    enabled: true
    enabled: false
    rules:
      - statusCode: 400
        keywords: [invalid_encrypted_content, encrypted content, could not be verified, could not be decrypted, bad_response_status_code, model_not_found, no available channel for model, unsupported, not supported, not support, 暂不支持, 可用模型]
+2 -2
@@ -32,8 +32,8 @@ CI/CD, GitOps, rollout, artifact publishing, runtime lane rolling after PR merge
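- `server status` queries public ports, restricted host ports, internal ports, the host swap summary, Compose containers, core/frontend/dev-frontend/provider/database health checks, and access URLs; the PostgreSQL/OA Event Flow host mappings used by D601 Code Queue must appear under restricted host ports rather than as unconditionally public entry points. On a low-memory main server, when `swap.warning` is non-empty, first run `server swap status` or `server swap ensure`.
- `server swap status|ensure [--path /swapfile] [--size 2GiB] [--dry-run]` is the main server swap management entry point. `status` only reads `/proc/meminfo`, `/proc/swaps`, and `/etc/fstab` and returns JSON; `ensure` reports a no-op when any active swap already exists, and when there is no active swap it creates a fixed swapfile, runs `chmod 600`, `mkswap`, and `swapon`, and makes a best-effort write to `/etc/fstab`. Output must include `before`, `after`, total memory, active swap, persistence status, key actions, and error details; if swap is enabled but the fstab write fails, the status is `degraded` and the caller must repair persistence following the returned detail.
- `server logs` returns the tails of file logs under `logs/` and of Docker container logs, with output size limited by default to avoid log explosions. The implementation must read only the trailing bytes of each file and must not load a huge log entirely into CLI memory just to tail it.
- `server cleanup plan [--min-age-hours N] [--limit N]` only generates the main server Docker image cleanup dry-run plan and performs no deletion; the default is `--min-age-hours 24`, to avoid listing freshly published or freshly verified images as stale. Output must include `dryRun=true`, `mutation=false`, `policy.deletionExecuted=false`, active containers/images, protected images, candidate stale images, estimated reclaimable space, risk level, `commandsToReview`, and a manual approval checklist. The plan must use a conservative allowlist: keep image IDs used by running containers, keep image IDs referenced by stopped containers until a human has reviewed those containers, and keep the current commit-pinned artifacts in `deploy.json`/`CI.json`, Compose stable images, upstream digest pins, and the provider-gateway runner image; `protectedStorage` must explicitly list the PostgreSQL named volume, Baidu Netdisk `.state`, D601 registry storage, and the Docker volumes/host data policy. This entry point must not generate or execute `docker system prune`, `docker image prune`, `docker builder prune`, `docker volume rm`, `docker compose down -v`, database cleanup, or host data `rm` commands; if real deletion is added in the future, it must use a separate explicit approval parameter and the dry-run output must be reviewed first.
- `gc plan|run --confirm|db-trace|policy|remote` is the entry point for one-off mitigation of disk high-water marks and long-term bloat prevention on the main server and controlled providers. `plan` is read-only and outputs candidates, risk, estimated gains, and protected objects; `run` requires an explicit `--confirm`; `gc remote <providerId> ...` runs remote GC through UniDesk SSH passthrough; `--target-use-percent N` reports in `summary.target` the amount to free for the target level, the candidate estimate, the projected level, the shortfall, and the safe-stop decision. The authoritative rules for G14/HWLAB registry retention, restricted core dumps, protected objects, the safe-stop line, and the long-term gains table are in `docs/reference/gc.md`.
- `server cleanup plan|run --confirm [--min-age-hours N] [--limit N]` is the main server Docker image high-water-mark governance entry point. `plan` generates a dry-run plan and performs no deletion; `run --confirm` deletes only the stale Docker images selected by the same classifier, and high-risk candidates are executed only with an additional `--include-high-risk`. The default is `--min-age-hours 24`, to avoid listing freshly published or freshly verified images as stale. Output must include active containers/images, protected images, candidate stale images, estimated reclaimable space, risk level, executed/skipped results, and manual approval hints. The plan must use a conservative allowlist: keep image IDs used by running containers, keep image IDs referenced by stopped containers until a human has reviewed those containers, and keep the current commit-pinned artifacts in `deploy.json`/`CI.json`, Compose stable images, upstream digest pins, and the provider-gateway runner image; `protectedStorage` must explicitly list the PostgreSQL named volume, Baidu Netdisk `.state`, D601 registry storage, and the Docker volumes/host data policy. This entry point prohibits `docker system prune`, `docker image prune`, `docker builder prune`, `docker volume rm`, `docker compose down -v`, database cleanup, and host data `rm` commands.
- `gc plan|run --confirm|db-trace|policy|remote` is the entry point for one-off mitigation of disk high-water marks and long-term bloat prevention on the main server and controlled providers. `plan` is read-only and outputs candidates, risk, estimated gains, and protected objects; `run` requires an explicit `--confirm`; `gc remote <providerId> ...` runs remote GC through UniDesk SSH passthrough; `--target-use-percent N` reports in `summary.target` the amount to free for the target level, the candidate estimate, the projected level, the shortfall, and the safe-stop decision. By default only allowlisted `/tmp` diagnostic directories are included; non-allowlisted stale direct children of `/tmp` require an explicit `--include-stale-tmp`, and only first-level children of `/tmp` may be deleted, avoiding system socket/session prefixes. The authoritative rules for G14/HWLAB registry retention, restricted core dumps, protected objects, the safe-stop line, and the long-term gains table are in `docs/reference/gc.md`.
- `server rebuild <backend-core|frontend|dev-frontend-proxy|provider-gateway|todo-note|code-queue-mgr|project-manager|baidu-netdisk|oa-event-flow>` creates an asynchronous job that first builds the target service image and then, under the serialization protection of `.state/locks/server-compose.lock`, replaces the target service with `--no-deps --force-recreate` and waits for the container to become `healthy/running`; this command replaces the fallback procedure of deleting containers by hand. `dev-frontend-proxy` only updates the thin proxy for the main server dev entry point, and `todo-note`, `code-queue-mgr`, `project-manager`, `baidu-netdisk`, and `oa-event-flow` only rebuild the corresponding backend hosted on the main server, without rebuilding or deleting the database named volume. The D601 Code Queue execution plane is not managed by `server rebuild`; routine Rust backend-core iteration must not use this command to compile on the master server, and only an explicit backend-core main-server go-live exception may run it, with rate limiting, asynchronous polling, and health evidence; for rules see `docs/reference/dev-environment.md`.
- `provider attach <providerId> [--master-server URL] [--up] [--force]` generates a two-setting provider-gateway attach bundle on a new compute node: `.state/provider-<ID>.env` contains only `UNIDESK_MASTER_SERVER` and `PROVIDER_ID` by default, and `provider-<ID>.yml` pins the Docker socket, `pid: "host"`, `restart: always`, a read-only `/workspace`, and the SSH maintenance private key mount; `--up` immediately runs the generated `docker compose up -d --build`. `provider triage <providerId> [--observed-error text] [--observed-scope scope] [--microservice id ...] [--full|--raw]` is a read-only multi-signal health adjudication entry point that classifies a single-path `provider is not online`, SSH timeout, registry failure, or service proxy failure as `runner-local-observation-gap`, `service-degraded`, `provider-degraded`, or `global-blocker`. Default output returns only the decision, scope, failed/degraded/unknown signals, and a bounded evidence summary; full evidence requires an explicit `--full` or `--raw`. Recommended cross-check commands still include `debug health`, `debug dispatch <providerId> host.ssh --wait-ms 15000`, `trans <providerId> argv true`, `artifact-registry health --provider-id <providerId>`, `microservice health k3sctl-adapter`, `microservice health code-queue`, and `codex tasks --view supervisor --limit 20`.
- `trans <route> [operation args...]` / `tran <route> [operation args...]` connects to the target node through the backend-core intranet WebSocket broker and provider-gateway's Host SSH / WSL SSH maintenance bridge. The basic form of `route` is a provider id such as `D601` or `G14`; it can be extended to a pure locator path `provider:plane[:namespace:resource[:container]]`, for example `D601:win`, `D601:win/c/test`, `G14:k3s`, `D601:k3s`, or `G14:k3s:<namespace>:<workload>`. The Windows plane of a WSL provider is always `win`, never `win32`. Windows operations must be distinguished explicitly: `ps` runs a Windows PowerShell heredoc or a one-line PowerShell command, `cmd` runs cmd.exe/batch, and `skills` discovers the Windows skill directory. When a Windows cwd is needed, use `trans D601:win/c/test ps` or `trans D601:win/c/test cmd cd`; the CLI sets UTF-8/Python encoding defaults automatically, and `cmd` additionally sets `chcp 65001`. For non-interactive remote commands prefer `trans <providerId> argv ...`; when a POSIX shell script, pipes, variables, or loops are needed, prefer a single-step quoted heredoc, for example `trans G14 script <<'SCRIPT'`, `trans G14:k3s script <<'SCRIPT'`, or `trans G14:k3s:<namespace>:<workload> script <<'SCRIPT'`, which passes the script via stdin. `script` means only the host/k3s POSIX shell, not Windows PowerShell; Windows PowerShell must be written as `trans <provider>:win ps <<'PS'`. `script -- '<single string>'` is a remote POSIX shell one-liner that needs no stdin, for example `trans G14:/root/hwlab script -- 'cd /root/hwlab && git status --short --branch'`; only `script -- <multiple argv>` is direct argv, suited to single-process commands with leading dashes such as `trans D601:/path script -- sed -n '1,20p' file`. The top-level remote option parser must preserve any `--` that appears after the command has started and must not swallow it as the end-of-global-options marker. To modify a remote text file, prefer `<route> apply-patch < patch.diff` by default; to transfer non-text or whole files reliably, use `<route> upload <local-file> <remote-file>` or `<route> download <remote-file> <local-file>`. The CLI verifies byte count and SHA-256 automatically and switches its client-side chunking strategy under provider-gateway stdin/argv limits; when the old helper is needed, use `<provider>:k3s:<namespace>:<workload> apply-patch-v1` or `<providerId> apply-patch-v1` explicitly. When an ssh-like command fails with a timeout/kex/255-type error, the CLI appends one line of `UNIDESK_SSH_HINT` JSON to stderr, suggesting a stdin script/argv retry and provider triage cross-checks.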
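The `server swap ensure` rules in the list above (no-op when any swap is active, `degraded` when swap is on but the fstab write fails) can be sketched as a small decision function. The `EnsureActions` callbacks and the `noop`/`created` status names are assumptions for illustration; only `degraded` and `failed` are named in the documentation.

```typescript
// "degraded" and "failed" come from the docs; "noop" and "created" are assumed names.
type EnsureStatus = "noop" | "created" | "degraded" | "failed";

interface SwapState {
  activeSwapBytes: number; // total of active swap, e.g. parsed from /proc/swaps
}

// Hypothetical wrappers around the privileged steps; not the real CLI internals.
interface EnsureActions {
  createAndEnable(): boolean; // create swapfile, chmod 600, mkswap, swapon
  persistToFstab(): boolean; // best-effort /etc/fstab entry
}

function ensureSwap(before: SwapState, actions: EnsureActions): { status: EnsureStatus; steps: string[] } {
  if (before.activeSwapBytes > 0) {
    // Any active swap means nothing is created or modified.
    return { status: "noop", steps: [] };
  }
  if (!actions.createAndEnable()) {
    return { status: "failed", steps: ["create-and-enable:error"] };
  }
  if (!actions.persistToFstab()) {
    // Swap is active now, but it will not survive a reboot.
    return { status: "degraded", steps: ["create-and-enable:ok", "fstab:error"] };
  }
  return { status: "created", steps: ["create-and-enable:ok", "fstab:ok"] };
}
```

Keeping the side effects behind callbacks makes the idempotency rule easy to check without touching a real host.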
+1 -1
@@ -52,7 +52,7 @@ MiniMax-M3 配置必须保持 profile/provider 级隔离。当前 `mxcx` 的稳
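Swap management must not be forced into every hot path. `server start/status` may surface a warning or summary but never creates swap automatically; when the host swap needs to change, run `server swap ensure` explicitly and use the returned `before`/`after` and `fstab.persisted` as the acceptance record.
Governance of Docker image high-water marks on the root partition must first go through the read-only dry run of `bun scripts/cli.ts server cleanup plan`. The plan targets only the Docker image inventory: by default only images created more than 24 hours ago and not in the protected set are listed as stale, and it outputs active containers/images, protected images, candidate stale images, risk, estimated reclaimable space, and commands for manual review, but does not delete, prune, modify containers, or touch volumes. Candidates must exclude the allowlisted protected set: running container image IDs, image IDs referenced by stopped containers, Compose stable images, the current commit artifacts in `deploy.json`/`CI.json`, upstream digest pins, and the provider-gateway runner image. The plan must also explicitly protect the PostgreSQL named volume, Baidu Netdisk `.state`/staging, D601 registry storage, and all Docker volume/host data directories. Any real cleanup must be implemented as a future explicitly authorized operation and must not replace dry-run approval with `docker system prune`, `docker image prune`, `docker builder prune`, or database cleanup; before any database cleanup, a usable backup must be confirmed first.
Governance of Docker image high-water marks on the root partition must first go through the dry run of `bun scripts/cli.ts server cleanup plan`, and then use `bun scripts/cli.ts server cleanup run --confirm` to delete the stale images selected by the same classifier. This entry point targets only the Docker image inventory: by default only images created more than 24 hours ago and not in the protected set are listed as stale, and it outputs active containers/images, protected images, candidate stale images, risk, estimated reclaimable space, commands for manual review, and execution results; high-risk candidates are executed only when `--include-high-risk` is additionally passed explicitly. Candidates must exclude the allowlisted protected set: running container image IDs, image IDs referenced by stopped containers, Compose stable images, the current commit artifacts in `deploy.json`/`CI.json`, upstream digest pins, and the provider-gateway runner image. The plan must also explicitly protect the PostgreSQL named volume, Baidu Netdisk `.state`/staging, D601 registry storage, and all Docker volume/host data directories. This entry point must not replace dry-run approval with `docker system prune`, `docker image prune`, `docker builder prune`, volume cleanup, or database cleanup; before any database cleanup, a usable backup must be confirmed first.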
## Start And Stop
+2
@@ -12,6 +12,8 @@ UniDesk's disk governance entry point is `bun scripts/cli.ts gc ...`. This entry point is used for
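All success and failure output must be JSON. `plan` must mark `dryRun=true` and `mutation=false`; `run` must require `--confirm` and report `diskBefore`, `diskAfter`, `summary`, `results`, and `protected`. Remote GC can use `--target-use-percent N` to state the target root-disk level explicitly; `summary.target` must give the amount to free for the target, the candidate estimate, the projected level, the shortfall, and the `safeStop` decision, so that nobody has to judge by mental arithmetic whether to keep widening the cleanup scope.
Default `/tmp` GC includes only allowlisted diagnostic directories and known low-risk paths. Non-allowlisted stale first-level children of `/tmp` become candidates only with an explicit `--include-stale-tmp`; the scan enumerates candidates in a bounded way according to `--limit`, and execution still allows deleting only direct children of `/tmp` while avoiding system/session prefixes such as X11/ICE/font sockets, systemd private, tmux, ssh, and vscode. This entry point must not expand recursively into a general-purpose `/tmp` wiper, and must not block for a long time just to estimate the whole temp directory.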
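A minimal sketch of the stale `/tmp` candidate filter described above: only direct children of `/tmp`, system/session prefixes excluded, and enumeration stopped once the limit is reached. The exact prefix strings, the `TmpEntry` shape, and the age threshold parameter are assumptions for illustration and are not the CLI's actual list.

```typescript
// Illustrative prefixes only; the real exclusion list is not shown in this change.
const PROTECTED_PREFIXES = [
  ".X11-unix",
  ".ICE-unix",
  ".font-unix",
  "systemd-private-",
  "tmux-",
  "ssh-",
  "vscode-",
];

interface TmpEntry {
  path: string;
  ageHours: number;
}

// Return at most `limit` stale direct children of /tmp, skipping protected prefixes.
function staleTmpCandidates(entries: TmpEntry[], minAgeHours: number, limit: number): string[] {
  const out: string[] = [];
  for (const entry of entries) {
    if (out.length >= limit) break; // bounded enumeration, no full scan
    const match = /^\/tmp\/([^\/]+)$/.exec(entry.path);
    if (!match) continue; // not a first-level child of /tmp
    const name = match[1];
    if (PROTECTED_PREFIXES.some((p) => name.indexOf(p) === 0)) continue;
    if (entry.ageHours < minAgeHours) continue;
    out.push(entry.path);
  }
  return out;
}
```

The same predicate would be re-checked at execution time, in line with the rule that deletion is limited to direct children of `/tmp`.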
## Protected Data
Default GC must not delete or prune the following objects:
+9 -12
@@ -26,9 +26,9 @@
- `pool.groupName` names the Sub2API group that represents the pool.
- `pool.apiKeySecretName` and `pool.apiKeySecretKey` name the k3s Secret that stores the single consumer API key.
- `pool.minOwnerConcurrency` is optional; when omitted, the CLI automatically uses the sum of all resolved account capacities as the minimum concurrency for the Sub2API user that owns the unified consumer API key. A YAML value is only an explicit override and must still be at least that capacity sum, so the shared key does not fail requests or WS sessions at the user-concurrency layer. "Resolved" means each account's explicit `profiles.entries[].capacity` or, when omitted, `pool.defaultAccountCapacity`. Do not compensate for owner-concurrency 1013 errors by pinning capacity to one provider.
- `pool.defaultTempUnschedulable` declares Sub2API account-level temporary unschedulable rules for capabilities that Sub2API itself already supports. Keep 429/overload/capacity, service-unavailable, gateway timeout, and stable model-routing failures in this YAML policy so the scheduler can cool down a failing account and choose another candidate instead of hard-pinning one provider. Do not declare unsupported Sub2API behavior in YAML as a promise that UniDesk code or runtime patches should emulate.
- When a managed upstream repeatedly causes `/v1/responses` or `/responses/compact` failures, the required fix path is to make automatic temporary-unschedulable and failover work, then verify it with runtime evidence. Do not restore availability by manually disabling an account, deleting a managed account, removing its YAML entry, lowering membership, or otherwise changing routing policy merely to avoid the failing upstream; those actions are allowed only for an explicit upstream retirement or ownership change.
- Codex accounts selected by YAML do not declare `schedulable` as durable configuration. `schedulable=true` is a `codex-pool sync --confirm` process-control baseline for UniDesk-managed accounts, not a YAML field. Account cooling must be represented by `temp_unschedulable_until` / `temp_unschedulable_reason`, so validation can distinguish real automatic cooldown from stale manual unschedulable state.
- `pool.defaultTempUnschedulable` is the Sub2API built-in temporary-unschedulable switch plus its YAML rule list. UniDesk keeps this built-in switch disabled by default while preserving the rule list in YAML for explicit future recovery; sync follows the WebUI close-switch behavior by omitting the runtime `temp_unschedulable_enabled` and `temp_unschedulable_rules` credential fields. The external account-level sentinel is the active account health and freeze/restore mechanism.
- The built-in temporary-unschedulable configuration and external `sentinel.*` configuration are separate control surfaces. Changing `pool.defaultTempUnschedulable.enabled` or `profiles.entries[].tempUnschedulable` must not change sentinel cadence, marker health semantics, or sentinel quarantine state; changing sentinel settings must not implicitly enable Sub2API built-in temporary-unschedulable rules.
- Codex accounts selected by YAML do not declare `schedulable` as durable configuration. `schedulable=true` is a `codex-pool sync --confirm` process-control baseline for UniDesk-managed accounts that are not under sentinel quarantine, not a YAML field.
- `codex-pool sync --confirm` preserves UniDesk-managed accounts that are absent from YAML by default; explicit upstream retirement requires `codex-pool sync --confirm --prune-removed`. This keeps account deletion out of the normal availability-recovery path and prevents temporary YAML edits from becoming destructive runtime changes.
- `profiles.entries` selects local Codex profile files from `~/.codex/` and maps them to Sub2API account names.
- The unsuffixed master `~/.codex/config.toml` and `~/.codex/auth.json` are reserved for the unified Sub2API consumer. `config.toml` must keep `base_url = "https://sub2api.74-48-78-17.nip.io/"`, and `auth.json` must contain the unified pool API key from `pool.apiKeySecretName` / `pool.apiKeySecretKey`. Do not replace these two files with direct upstream account credentials.
@@ -36,9 +36,8 @@
- `profiles.entries[].capacity` optionally overrides `pool.defaultAccountCapacity` for one account. Capacity is a YAML-controlled routing input; concrete current values belong only in `config/platform-infra/sub2api-codex-pool.yaml` and runtime validation output, not in long-term reference prose. Code constants, Secrets, ad-hoc runtime patches, or stale tests must not override YAML source of truth.
- `profiles.entries[].loadFactor` optionally overrides `pool.defaultAccountLoadFactor` for one account and is rendered to Sub2API `load_factor`. Treat it as routing policy: values belong in YAML and `codex-pool validate` output, not code constants, Secrets, or ad-hoc runtime patches.
- Do not change account membership, priority, capacity, load factor, WebSocket mode, or other routing policy from inference alone. Unless the user explicitly asks for a configuration change, first preserve the current YAML, collect provenance and runtime evidence, and write the finding to the relevant issue or runbook before proposing a change.
- `profiles.entries[].tempUnschedulable` may override the pool default for one account. The CLI renders it into Sub2API credentials as `temp_unschedulable_enabled` and `temp_unschedulable_rules`; rules match HTTP status plus response-body keywords and place only that account into a temporary unschedulable cooldown.
- Codex account-state or quota prompts that stop a task and ask the operator to switch accounts belong in `pool.defaultTempUnschedulable`, not in account membership, priority, capacity, load factor, WebSocket mode, or `pool_mode`. Keep stable body phrases such as weekly-limit and `/status` prompts in both the 403 account-state rule and the 429 quota/rate-limit rule, then run `codex-pool sync --confirm` and `codex-pool validate`. The validation evidence must include runtime temporary-unschedulable alignment for each managed account, not only successful group-level `/v1/models` or `/v1/responses` smoke output.
- Upstream model-routing and Responses compatibility failures that surface as 400 responses, such as `invalid_encrypted_content`, `bad_response_status_code`, `invalid_request_error` with a stable unsupported-model message, unsupported-model wrappers, or stable "available models" messages, belong in `pool.defaultTempUnschedulable` when another account can handle the same Codex request. Upstream model-routing failures that surface as 503 responses, such as `model_not_found` or "no available channel for model" wrappers, also belong there. Gateway and timeout failures that surface as 502, 504, or 524 responses, including `Gateway Timeout`, `Unknown error`, `Upstream request failed`, `context deadline exceeded`, `context canceled`, or recovered upstream-error wrappers, belong in the same YAML policy. This is especially important for compact and long `/responses` requests, where an upstream Cloudflare 524 or account-specific compatibility failure may eventually reach Codex as a 502/504 unknown-error wrapper after failover or client cancellation. They are not membership, priority, capacity, load factor, WebSocket mode, or User-Agent decisions by themselves. After adding stable body phrases, run `codex-pool sync --confirm` and `codex-pool validate`, and verify the affected account's runtime status-specific rule includes the new keywords.
- `profiles.entries[].tempUnschedulable` may override the pool default for one account. When enabled, the CLI renders it into Sub2API credentials as `temp_unschedulable_enabled` and `temp_unschedulable_rules`; when disabled, runtime credentials omit both fields and the YAML rule list remains only source-side configuration.
- Codex account-state, quota prompts, model-routing failures, gateway wrappers, and timeout-like upstream errors are handled by the external marker-only sentinel unless the Sub2API built-in temporary-unschedulable switch is explicitly re-enabled. Do not change membership, priority, capacity, load factor, WebSocket mode, or `pool_mode` merely to work around those errors.
- `profiles.entries[].openaiResponsesWebSocketsV2Mode` is the account-level Responses WebSocket v2 switch for OpenAI-compatible upstreams that require WebSocket transport. Allowed values are `off`, `ctx_pool`, and `passthrough`; omit the field unless that upstream needs it.
- `profiles.entries[].upstreamUserAgent` is an optional account-level upstream request User-Agent override. Use it only for upstreams that require a Codex CLI compatible User-Agent; keep the value YAML-controlled and newline-free.
- `publicExposure` controls the optional FRP bridge from master server to the G14 ClusterIP service.
@@ -52,11 +51,9 @@ When Codex startup repeatedly reports WebSocket reconnects or HTTPS fallback, pr
Do not encode current availability assumptions in long-term reference prose. If an account needs a higher concurrency or load factor than the pool default, make that a deliberate YAML override and verify it with `codex-pool validate`; the reference document should describe the rule, not repeat the current numeric value.
Do not enable Sub2API `pool_mode` for UniDesk-managed Codex accounts. `pool_mode` retries the same selected account path, while UniDesk's desired failover behavior is to mark the failing account temporarily unschedulable and let Sub2API choose another account from the group. `codex-pool validate` reports each managed account's temporary-unschedulable runtime alignment and should be used after `codex-pool sync --confirm`. Generic 502/503/504 bodies such as `Recovered upstream error 502`, `Bad Gateway`, `Gateway Timeout`, Codex-facing `Upstream request failed`, `Unknown error`, context-deadline/canceled wrappers, stable 400 `invalid_encrypted_content` / unsupported-model wrappers, and stable `model_not_found` / "no available channel for model" wrappers must stay in the YAML cooldown policy so an intermittently bad account is cooled down instead of repeatedly adding latency at the next compact or Responses request. Exact current cooldown values and any business-policy grouping belong only in YAML and runtime validation output; do not repeat those values here, encode them as code/schema hard limits, or require contract tests for value changes.
Do not enable Sub2API `pool_mode` for UniDesk-managed Codex accounts. `pool_mode` retries the same selected account path and does not replace sentinel quarantine. The current failover and recovery model is: the external marker-only sentinel freezes or restores account schedulability, while Sub2API routes among currently schedulable accounts in the group.
Sub2API temporary-unschedulable rules require both an HTTP status match and a response-body keyword match in the upstream failure/error path. Do not treat them as a general successful-response content filter, and do not add a YAML 200 cooldown rule, patch Sub2API in place, fork Sub2API behavior in UniDesk, or bypass `codex-pool sync` to make the native pool pretend that HTTP 200 content cooling exists. HTTP 200 private content, maintenance text, quota prompts, ads, and similar semantic failures are handled by the external account-level sentinel when that sentinel is enabled, not by Sub2API native `temp_unschedulable_rules`.
If automatic cooling or same-request failover does not happen for an error that the YAML policy declares, treat that as a Sub2API capability or integration defect. The closeout must show the failing account being marked temporarily unschedulable and the next request or same request selecting another schedulable account; a manually disabled, deleted, or pruned account is not valid evidence for this class of fix.
Sub2API temporary-unschedulable rules require both an HTTP status match and a response-body keyword match in the upstream failure/error path when the built-in switch is enabled. UniDesk currently keeps that switch disabled and does not use built-in rules as a successful-response content filter. HTTP 200 private content, maintenance text, quota prompts, ads, and similar semantic failures are handled by the external account-level sentinel.
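
The two-part match described above (HTTP status plus response-body keyword, and only while the built-in switch is enabled) can be sketched as follows. This is a minimal illustration, not Sub2API's implementation: the type names, the function `matchTempUnschedulableRule`, and the case-insensitive substring comparison are assumptions; only the field names `statusCode`, `keywords`, and `durationMinutes` come from the policy shape in this change.

```typescript
// Hypothetical types for illustration; field names mirror the YAML policy shape.
interface TempUnschedulableRule {
  statusCode: number;
  keywords: string[];
  durationMinutes: number;
}

interface TempUnschedulablePolicy {
  enabled: boolean;
  rules: TempUnschedulableRule[];
}

// Returns the first rule whose status code equals the upstream status AND whose
// keyword list contains a substring of the body; returns null otherwise.
// Assumption: keywords are compared case-insensitively as substrings.
function matchTempUnschedulableRule(
  policy: TempUnschedulablePolicy,
  status: number,
  body: string,
): TempUnschedulableRule | null {
  // Switch closed: the rule list is source-side configuration only.
  if (!policy.enabled) return null;
  const haystack = body.toLowerCase();
  for (const rule of policy.rules) {
    if (rule.statusCode !== status) continue;
    if (rule.keywords.some((keyword) => haystack.includes(keyword.toLowerCase()))) return rule;
  }
  return null;
}
```

Under these assumptions an HTTP 200 response never matches, because a status mismatch short-circuits before any keyword is inspected. That is why semantic failures carried in 200 bodies are left to the external sentinel.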
## Sub2API Account Test Semantics
@@ -76,7 +73,7 @@ The UniDesk account-level sentinel uses marker-only health semantics. A probe is
The sentinel must not maintain separate classifiers for "private content", "maintenance", "quota", "ads", or provider-specific body phrases as health gates. The only recovery condition is a later recovery probe that matches the marker. Freeze TTL expiry only schedules the next recovery probe; it does not restore an account by itself. Repeated non-marker results use a short exponential freeze backoff because failed marker probes produce little or no useful output token usage; repeated marker-matching results use the configured success cadence backoff. This contract applies equally to OpenAI Responses `gpt-5.5` direct account probes and manual `codex-pool sentinel-probe --account ... --confirm` measurements.
`profiles.entries[].trustUpstream` is the durable account-level trust marker for sentinel success cadence, and the absence of the field means untrusted. Trusted and untrusted accounts use separate YAML cadence maximums after marker-matching probes; the values belong only in `config/platform-infra/sub2api-codex-pool.yaml`. This field must not change Sub2API scheduler priority, capacity, load factor, membership, native temporary-unschedulable rules, or the marker-only health contract. Its purpose is to keep intermittently unreliable 200-success providers under more frequent direct probes without adding provider-specific content classifiers.
`profiles.entries[].trustUpstream` is the durable account-level trust marker for sentinel success cadence, and the absence of the field means untrusted. Trusted and untrusted accounts use separate YAML cadence maximums after marker-matching probes; the values belong only in `config/platform-infra/sub2api-codex-pool.yaml`. This field must not change Sub2API scheduler priority, capacity, load factor, membership, built-in temporary-unschedulable settings, or the marker-only health contract. Its purpose is to keep intermittently unreliable 200-success providers under more frequent direct probes without adding provider-specific content classifiers.
When `codex-pool sync --confirm` creates a YAML-managed account or changes direct-probe-relevant account inputs such as the profile mapping, upstream base URL, API key fingerprint, upstream User-Agent, Responses WebSocket mode, or `trustUpstream`, only that account must be default-frozen before it can enter the scheduler. Sync first records a pending sentinel quality gate from the pre-mutation runtime state, then updates the account, then schedules the account probe immediately. This ordering prevents a new or changed account from being written to Sub2API without a matching sentinel quarantine record if sync fails midway. Passing the marker clears the quality gate and restores schedulability; any non-marker result continues the failure freeze backoff. Unchanged accounts must not have their existing success or failure backoff reset by unrelated YAML syncs.
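
The marker-only contract and the failure freeze backoff described in this section can be sketched as a small state transition. Everything named here (`SentinelState`, `BackoffConfig`, `applyProbeResult`, and the doubling schedule) is a hypothetical illustration; real cadence values live only in `config/platform-infra/sub2api-codex-pool.yaml`, and the success cadence is simplified to a fixed interval.

```typescript
// Hypothetical shapes; not the actual sentinel data model.
interface SentinelState {
  frozen: boolean;
  consecutiveFailures: number;
  nextProbeAtMs: number;
}

interface BackoffConfig {
  failureBaseMs: number; // first freeze delay after a non-marker result
  failureMaxMs: number; // cap for the exponential freeze backoff
  successIntervalMs: number; // simplified success cadence
}

// The only health input is whether the probe output matched the marker.
// No body-phrase classifiers are consulted.
function applyProbeResult(
  state: SentinelState,
  markerMatched: boolean,
  nowMs: number,
  config: BackoffConfig,
): SentinelState {
  if (markerMatched) {
    // A marker match is the only path that restores schedulability.
    return { frozen: false, consecutiveFailures: 0, nextProbeAtMs: nowMs + config.successIntervalMs };
  }
  const failures = state.consecutiveFailures + 1;
  // Assumed doubling schedule, capped at failureMaxMs.
  const delayMs = Math.min(config.failureBaseMs * 2 ** (failures - 1), config.failureMaxMs);
  // Reaching nextProbeAtMs only triggers another probe; the account stays frozen.
  return { frozen: true, consecutiveFailures: failures, nextProbeAtMs: nowMs + delayMs };
}
```

In this sketch a newly created or changed account would start as `{ frozen: true, consecutiveFailures: 0, nextProbeAtMs: now }`, which models the default-frozen quality gate: it enters the scheduler only after a marker-matching probe.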
@@ -106,7 +103,7 @@ When `publicExposure.enabled` is true, the same FRP TCP bridge exposes both Open
The public management UI is an operations endpoint. Keep Sub2API itself in `platform-infra`, keep the Kubernetes Service as ClusterIP, and treat FRP as the only public bridge unless a later decision explicitly changes the exposure model.
The public bridge has two separate failure classes. Sub2API upstream/account failures are visible in Sub2API logs and should be handled by temporary-unschedulable rules, sentinel quarantine, or Sub2API failover. Edge failures between master Caddy and the FRP remotePort are not visible to Sub2API; symptoms include Caddy `connect: connection refused`, EOF, connection reset, or short 502 bursts while frps closes and reopens the configured remotePort. Those failures must be diagnosed from Caddy and frps/frpc evidence and mitigated through YAML-controlled Caddy edge retry or FRP stability fixes, not by disabling accounts or changing pool membership.
The public bridge has two separate failure classes. Sub2API upstream/account failures are visible in Sub2API logs and currently belong to sentinel quarantine plus normal Sub2API routing among schedulable accounts. Edge failures between master Caddy and the FRP remotePort are not visible to Sub2API; symptoms include Caddy `connect: connection refused`, EOF, connection reset, or short 502 bursts while frps closes and reopens the configured remotePort. Those failures must be diagnosed from Caddy and frps/frpc evidence and mitigated through YAML-controlled Caddy edge retry or FRP stability fixes, not by disabling accounts or changing pool membership.
## Availability And Probes
+31 -6
@@ -198,7 +198,7 @@ function agentRunHelpText(args: string[]): string {
return "Usage: bun scripts/cli.ts agentrun ack task/<taskId>|session/<sessionId> [--reader-id cli]";
}
if (verb === "cancel") {
return "Usage: bun scripts/cli.ts agentrun cancel task/<taskId>|session/<sessionId> --reason <text> [--dry-run]";
return "Usage: bun scripts/cli.ts agentrun cancel task/<taskId>|session/<sessionId>|run/<runId>|command/<commandId> --reason <text> [--run <runId>] [--dry-run]";
}
if (verb === "dispatch") {
return "Usage: bun scripts/cli.ts agentrun dispatch task/<taskId>";
@@ -576,16 +576,41 @@ async function resourceAck(config: UniDeskConfig | null, command: string, action
async function resourceCancel(config: UniDeskConfig | null, command: string, action: string | undefined, args: string[], options: AgentRunResourceOptions): Promise<RenderedCliResult> {
const ref = parseResourceRef(action, args, "task");
const cancelArgs = ref.kind === "task" ? ["cancel", ref.name] : ref.kind === "session" ? ["cancel", ref.name] : null;
if (cancelArgs === null) throw new Error("cancel supports task/<taskId> or session/<sessionId>");
if (options.reason !== null && ref.kind === "task") cancelArgs.push("--reason", options.reason);
if (options.dryRun) cancelArgs.push("--dry-run");
const cancelArgs = ["cancel", ref.name];
if (options.reason !== null) cancelArgs.push("--reason", options.reason);
if (ref.kind === "command") cancelArgs.push("--run-id", options.runId ?? requiredContext("command cancel", "--run <runId>"));
if (options.dryRun) {
const result = agentRunResourceCancelDryRunPlan(ref, options, rerunWithoutDryRun(command));
return renderMutationSummary(command, result, options, `Planned cancel ${ref.kind}/${shortId(ref.name)}`, [rerunWithoutDryRun(command)]);
}
const result = ref.kind === "task"
? await runAgentRunRestCommand(config, "queue", cancelArgs)
: await runAgentRunRestCommand(config, "sessions", cancelArgs);
: ref.kind === "session"
? await runAgentRunRestCommand(config, "sessions", cancelArgs)
: ref.kind === "run"
? await runAgentRunRestCommand(config, "runs", cancelArgs)
: ref.kind === "command"
? await runAgentRunRestCommand(config, "commands", cancelArgs)
: null;
if (result === null) throw new Error("cancel supports task/<taskId>, session/<sessionId>, run/<runId>, or command/<commandId>");
return renderMutationSummary(command, result, options, `${options.dryRun ? "Planned cancel" : "Cancel requested"} ${ref.kind}/${shortId(ref.name)}`, options.dryRun ? [rerunWithoutDryRun(command)] : undefined);
}
function agentRunResourceCancelDryRunPlan(ref: AgentRunResourceRef, options: AgentRunResourceOptions, confirmCommand: string): Record<string, unknown> {
const body: Record<string, unknown> = {};
if (options.reason !== null) body.reason = options.reason;
if (ref.kind === "task") return agentRunDryRunPlan("task-cancel", `/api/v1/queue/tasks/${encodeURIComponent(ref.name)}/cancel`, body, confirmCommand);
if (ref.kind === "session") return agentRunDryRunPlan("session-cancel", `/api/v1/sessions/${encodeURIComponent(ref.name)}/control`, { action: "cancel", ...body }, confirmCommand);
if (ref.kind === "run") return agentRunDryRunPlan("run-cancel", `/api/v1/runs/${encodeURIComponent(ref.name)}/cancel`, body, confirmCommand);
if (ref.kind === "command") {
const runId = options.runId ?? requiredContext("command cancel", "--run <runId>");
return agentRunDryRunPlan("command-cancel", `/api/v1/commands/${encodeURIComponent(ref.name)}/cancel`, body, confirmCommand, "POST", {
commandRef: { runId, commandId: ref.name, valuesPrinted: false },
});
}
throw new Error("cancel supports task/<taskId>, session/<sessionId>, run/<runId>, or command/<commandId>");
}
async function resourceDispatch(config: UniDeskConfig | null, command: string, action: string | undefined, args: string[], options: AgentRunResourceOptions): Promise<RenderedCliResult> {
const ref = parseResourceRef(action, args, "task");
if (ref.kind !== "task") throw new Error("dispatch supports task/<taskId>");
-1
@@ -73,7 +73,6 @@ const agentRunQueueReplacementCommands = {
runEvents: "bun scripts/cli.ts agentrun events run/<runId> --after-seq 0 --limit 100",
runResult: "bun scripts/cli.ts agentrun result run/<runId> --command <commandId>",
sessionLogs: "bun scripts/cli.ts agentrun logs session/<sessionId> --tail 100",
sessionSteer: "bun scripts/cli.ts agentrun steer session/<sessionId> --prompt-stdin",
sessionSend: "bun scripts/cli.ts agentrun send session/<sessionId> --aipod Artificer --prompt-stdin",
taskAck: "bun scripts/cli.ts agentrun ack task/<taskId>",
taskCancel: "bun scripts/cli.ts agentrun cancel task/<taskId> --reason <text>",
+103 -11
View File
@@ -1,5 +1,5 @@
import { spawnSync } from "node:child_process";
import { closeSync, existsSync, ftruncateSync, lstatSync, mkdirSync, openSync, readdirSync, readSync, rmSync, statSync, unlinkSync, writeFileSync, writeSync } from "node:fs";
import { closeSync, existsSync, ftruncateSync, lstatSync, mkdirSync, opendirSync, openSync, readdirSync, readSync, rmSync, statSync, unlinkSync, writeFileSync, writeSync } from "node:fs";
import { basename, join, resolve } from "node:path";
import { type UniDeskConfig, repoRoot, rootPath } from "./config";
@@ -13,6 +13,7 @@ type GcItemKind =
| "journal-vacuum"
| "docker-build-cache-prune"
| "tmp-path-delete"
| "stale-tmp-path-delete"
| "browser-cache-delete"
| "tool-cache-delete"
| "vscode-server-delete"
@@ -33,6 +34,7 @@ interface GcOptions {
buildCacheAll: boolean;
tmp: boolean;
tmpMinAgeHours: number;
staleTmp: boolean;
browserCache: boolean;
toolCaches: boolean;
vscodeStaleServers: boolean;
@@ -169,6 +171,7 @@ const DEFAULT_OPTIONS: GcOptions = {
buildCacheAll: false,
tmp: true,
tmpMinAgeHours: 24,
staleTmp: false,
browserCache: false,
toolCaches: false,
vscodeStaleServers: false,
@@ -225,6 +228,17 @@ const TMP_EXACT_PROTECT = new Set([
"/tmp/tmux-0",
]);
const STALE_TMP_PROTECTED_PREFIXES = [
".X",
".ICE",
".font-unix",
".Test-unix",
"systemd-private-",
"tmux-",
"ssh-",
"vscode-",
];
const TOOL_CACHE_ALLOWLIST = [
{
id: "npm-cacache",
@@ -281,6 +295,9 @@ const TOOL_CACHE_ALLOWLIST = [
const VSCODE_SERVER_ROOT = "/root/.vscode-server/cli/servers";
const VSCODE_EXTENSION_ROOT = "/root/.vscode-server/extensions";
const BAIDU_STAGING_RELATIVE_ROOT = [".state", "baidu-netdisk", "staging"];
const DEFAULT_PATH_SIZE_TIMEOUT_MS = 5_000;
const STALE_TMP_PATH_SIZE_TIMEOUT_MS = 1_500;
const STALE_TMP_MAX_CANDIDATES = 1_000;
export async function runGcCommand(config: UniDeskConfig, args: string[]): Promise<unknown> {
const [action = "plan", ...rest] = args;
@@ -374,6 +391,10 @@ export function gcPlan(config: UniDeskConfig, options: GcOptions = DEFAULT_OPTIO
if (options.tmp) {
candidates.push(...collectTmpCandidates(options, observedAt));
}
if (options.staleTmp) {
const alreadySelected = new Set(candidates.map((candidate) => candidate.path).filter((path): path is string => path !== undefined));
candidates.push(...collectStaleTmpCandidates(options, observedAt, alreadySelected));
}
if (options.browserCache) {
const item = collectBrowserCacheCandidate();
if (item !== null) candidates.push(item);
@@ -459,7 +480,7 @@ export function gcPlan(config: UniDeskConfig, options: GcOptions = DEFAULT_OPTIO
notes: [
"gc run only executes listed one-time cleanup actions after --confirm.",
options.full ? "Full candidate output requested." : `Default output is capped to ${options.limit} candidates; use --full or --limit N for broader disclosure.`,
"Tool caches, stale VS Code server versions and stale VS Code extension versions are opt-in and require explicit include flags.",
"Tool caches, stale /tmp direct children, stale VS Code server versions and stale VS Code extension versions are opt-in and require explicit include flags.",
"Baidu Netdisk staging cleanup is opt-in and only selects old PGDATA backup tarballs under server-data/unidesk-pg-data.",
"Database event retention is diagnostic-only in this command; cleanups for oa_events require a backup and a separate schema/retention change.",
"Docker image cleanup stays under server cleanup plan; gc does not run docker system prune or docker image prune.",
@@ -540,6 +561,10 @@ function parseGcOptions(args: string[]): GcOptions {
options.buildCacheAll = true;
} else if (arg === "--tmp-min-age-hours") {
options.tmpMinAgeHours = parseNonNegativeNumber(arg, args[++index]);
} else if (arg === "--include-stale-tmp") {
options.staleTmp = true;
} else if (arg === "--no-stale-tmp") {
options.staleTmp = false;
} else if (arg === "--include-browser-cache") {
options.browserCache = true;
} else if (arg === "--no-browser-cache") {
@@ -705,6 +730,7 @@ function publicOptions(options: GcOptions): Record<string, unknown> {
buildCacheAll: options.buildCacheAll,
tmp: options.tmp,
tmpMinAgeHours: options.tmpMinAgeHours,
staleTmp: options.staleTmp,
browserCache: options.browserCache,
toolCaches: options.toolCaches,
vscodeStaleServers: options.vscodeStaleServers,
@@ -834,7 +860,7 @@ function collectTmpCandidates(options: GcOptions, observedAt: string): GcCandida
continue;
}
if (stat.mtimeMs >= cutoffMs) continue;
const sizeBytes = safePathSize(path);
const sizeBytes = safePathSize(path, STALE_TMP_PATH_SIZE_TIMEOUT_MS);
if (sizeBytes <= 0) continue;
result.push({
id: `tmp:${path}`,
@@ -850,6 +876,52 @@ function collectTmpCandidates(options: GcOptions, observedAt: string): GcCandida
return result.sort((left, right) => right.estimatedReclaimBytes - left.estimatedReclaimBytes);
}
function collectStaleTmpCandidates(options: GcOptions, observedAt: string, alreadySelected: Set<string>): GcCandidate[] {
const root = "/tmp";
if (!existsSync(root)) return [];
const cutoffMs = new Date(observedAt).getTime() - options.tmpMinAgeHours * 60 * 60 * 1000;
const candidateLimit = Math.min(options.limit, STALE_TMP_MAX_CANDIDATES);
const result: GcCandidate[] = [];
const dir = opendirSync(root);
try {
let entry;
while ((entry = dir.readSync()) !== null && result.length < candidateLimit) {
const name = entry.name;
const path = join(root, name);
if (alreadySelected.has(path)) continue;
if (TMP_EXACT_PROTECT.has(path)) continue;
if (isStaleTmpProtectedName(name)) continue;
if (!entry.isDirectory() && !entry.isFile() && !entry.isSymbolicLink()) continue;
let stat;
try {
stat = lstatSync(path);
} catch {
continue;
}
if (stat.mtimeMs >= cutoffMs) continue;
const sizeBytes = safePathSize(path, STALE_TMP_PATH_SIZE_TIMEOUT_MS);
if (sizeBytes <= 0) continue;
result.push({
id: `stale-tmp:${path}`,
kind: "stale-tmp-path-delete",
risk: "medium",
description: `Delete one bounded direct /tmp child older than ${options.tmpMinAgeHours} hours`,
path,
sizeBytes,
estimatedReclaimBytes: sizeBytes,
action: { op: "rm-recursive", allowlist: "tmp-direct-stale", minAgeHours: options.tmpMinAgeHours, boundedByLimit: candidateLimit },
});
}
} finally {
dir.closeSync();
}
return result.sort((left, right) => right.estimatedReclaimBytes - left.estimatedReclaimBytes);
}
function isStaleTmpProtectedName(name: string): boolean {
return STALE_TMP_PROTECTED_PREFIXES.some((prefix) => name.startsWith(prefix));
}
function collectBrowserCacheCandidate(): GcCandidate | null {
const path = rootPath(".state", "playwright-browsers");
if (!existsSync(path)) return null;
@@ -1361,6 +1433,12 @@ function executeCandidate(candidate: GcCandidate, options: GcOptions): { reclaim
rmSync(candidate.path, { recursive: true, force: true });
return { reclaimedBytes: before };
}
if (candidate.kind === "stale-tmp-path-delete" && candidate.path !== undefined) {
assertStaleTmpCandidatePath(candidate.path);
const before = safePathSize(candidate.path);
rmSync(candidate.path, { recursive: true, force: true });
return { reclaimedBytes: before };
}
if (candidate.kind === "browser-cache-delete" && candidate.path !== undefined) {
const expected = rootPath(".state", "playwright-browsers");
if (resolve(candidate.path) !== resolve(expected)) throw new Error(`refusing to remove unexpected browser cache path: ${candidate.path}`);
@@ -1447,6 +1525,16 @@ function assertTmpCandidatePath(path: string): void {
}
}
function assertStaleTmpCandidatePath(path: string): void {
const resolved = resolve(path);
if (!resolved.startsWith("/tmp/")) throw new Error(`refusing to remove non-/tmp path: ${path}`);
if (TMP_EXACT_PROTECT.has(resolved)) throw new Error(`refusing to remove protected tmp path: ${path}`);
const relativePath = resolved.slice("/tmp/".length);
if (relativePath.length === 0 || relativePath.includes("/")) throw new Error(`refusing to remove nested tmp path: ${path}`);
const name = basename(resolved);
if (isStaleTmpProtectedName(name)) throw new Error(`refusing to remove protected stale tmp path: ${path}`);
}
function assertToolCacheCandidatePath(path: string): void {
const resolved = resolve(path);
const allowed = TOOL_CACHE_ALLOWLIST.some((item) => resolve(item.path) === resolved);
@@ -1560,19 +1648,23 @@ function collectFiles(root: string): Array<{ path: string; sizeBytes: number; mt
return result;
}
function safePathSize(path: string): number {
function safePathSize(path: string, timeoutMs = DEFAULT_PATH_SIZE_TIMEOUT_MS): number {
return pathSizeFromDu(path, timeoutMs) ?? 0;
}
function pathSizeFromDu(path: string, timeoutMs: number): number | null {
try {
const stat = lstatSync(path);
if (stat.isFile() || stat.isSymbolicLink()) return stat.size;
if (!stat.isDirectory()) return 0;
let total = 0;
for (const entry of readdirSync(path)) {
total += safePathSize(join(path, entry));
}
return total;
if (!stat.isDirectory()) return null;
} catch {
return 0;
return null;
}
const result = command(["du", "-sb", "--one-file-system", "--", path], timeoutMs);
if (result.exitCode !== 0 || result.timedOut) return null;
const rawSize = result.stdout.trim().split(/\s+/u)[0] ?? "";
const sizeBytes = Number(rawSize);
return Number.isFinite(sizeBytes) && sizeBytes >= 0 ? sizeBytes : null;
}
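
The `du -sb` parsing step above can be isolated as a pure function for testing. This is a sketch under assumptions: `parseDuBytes` is a hypothetical helper, not part of this change, and it deliberately differs from the inline code in one respect. In JavaScript `Number("")` evaluates to `0`, so the inline version reports empty output as a zero-byte size, whereas this sketch returns `null`.

```typescript
// Hypothetical helper mirroring the parsing inside pathSizeFromDu.
// `du -sb` prints "<bytes><TAB><path>"; only the first whitespace-delimited token is read.
function parseDuBytes(stdout: string): number | null {
  const rawSize = stdout.trim().split(/\s+/u)[0] ?? "";
  // Number("") is 0 in JavaScript, so reject empty output explicitly.
  if (rawSize.length === 0) return null;
  const sizeBytes = Number(rawSize);
  return Number.isFinite(sizeBytes) && sizeBytes >= 0 ? sizeBytes : null;
}
```

Either behavior leads to the same outcome in the collectors, because `safePathSize` maps `null` to `0` and candidates with `sizeBytes <= 0` are skipped.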
function safeFileSize(path: string): number {
+24 -87
@@ -1041,75 +1041,8 @@ function defaultCodexPoolConfig(): CodexPoolConfig {
export function defaultCodexTempUnschedulablePolicy(): CodexTempUnschedulablePolicy {
return {
enabled: true,
rules: [
{
statusCode: 400,
keywords: ["invalid_encrypted_content", "encrypted content", "could not be verified", "could not be decrypted", "bad_response_status_code", "model_not_found", "no available channel for model", "unsupported", "not supported", "not support", "暂不支持", "可用模型"],
durationMinutes: 120,
description: "Stable upstream 400 model-routing or Responses encrypted-content compatibility failures should use another account.",
},
{
statusCode: 401,
keywords: ["unauthorized", "invalid api key", "invalid_api_key", "authentication", "recovered upstream error"],
durationMinutes: 120,
description: "Credential/auth failures should use the longest cooldown.",
},
{
statusCode: 403,
keywords: ["forbidden", "access denied", "quota", "billing", "capacity", "weekly limit", "less than 10% of your weekly limit left", "run /status for a breakdown", "recovered upstream error"],
durationMinutes: 120,
description: "Permission, quota, or account-state failures should use the longest cooldown.",
},
{
statusCode: 429,
keywords: ["capacity", "rate limit", "rate_limit", "quota", "weekly limit", "less than 10% of your weekly limit left", "run /status for a breakdown", "too many requests", "overloaded", "resource_exhausted", "recovered upstream error"],
durationMinutes: 10,
description: "Capacity and rate-limit responses are often temporary; start with a ten-minute cooldown and use another account.",
},
{
statusCode: 500,
keywords: ["capacity", "overloaded", "temporarily unavailable", "temporary", "upstream", "recovered upstream error"],
durationMinutes: 10,
description: "Transient upstream server failures should start with a ten-minute cooldown and prefer another account.",
},
{
statusCode: 502,
keywords: ["capacity", "overloaded", "temporarily unavailable", "temporary", "upstream", "bad gateway", "upstream request failed", "unknown error", "context deadline exceeded", "context canceled", "websocket dial", "handshake response", "recovered upstream error"],
durationMinutes: 30,
description: "Gateway upstream failures, including recovered upstream error wrappers, should cool down longer.",
},
{
statusCode: 413,
keywords: ["openai_error", "payload too large", "request too large", "context length", "context window", "maximum context"],
durationMinutes: 30,
description: "Large-context upstream failures should cool down the selected account so a larger-context channel can handle the request.",
},
{
statusCode: 503,
keywords: ["capacity", "overloaded", "temporarily unavailable", "temporary", "upstream", "recovered upstream error", "model_not_found", "no available channel for model"],
durationMinutes: 30,
description: "Service unavailable and upstream model-routing failures should cool down longer than one-off transient failures.",
},
{
statusCode: 504,
keywords: ["gateway timeout", "timeout", "upstream", "upstream request failed", "unknown error", "context deadline exceeded", "context canceled", "recovered upstream error"],
durationMinutes: 30,
description: "Gateway timeout responses should cool down the selected account so another account can handle the next request.",
},
{
statusCode: 524,
keywords: ["timeout", "a timeout occurred", "cloudflare", "gateway timeout", "upstream", "upstream request failed", "unknown error", "context deadline exceeded", "context canceled", "recovered upstream error"],
durationMinutes: 30,
description: "Cloudflare 524 timeout responses should cool down the selected account so another account can handle the next request.",
},
{
statusCode: 529,
keywords: ["capacity", "overloaded", "temporarily unavailable", "temporary", "recovered upstream error"],
durationMinutes: 30,
description: "Provider overloaded responses should cool down longer than generic transient failures and use another account.",
},
],
enabled: false,
rules: [],
};
}
@@ -1542,8 +1475,8 @@ function compactProfile(profile: CodexProfile): Record<string, unknown> {
trustUpstream: profile.trustUpstream,
capacity: profile.capacity,
loadFactor: profile.loadFactor,
tempUnschedulableEnabled: profile.tempUnschedulable.enabled && profile.tempUnschedulable.rules.length > 0,
tempUnschedulableRuleCount: profile.tempUnschedulable.enabled ? profile.tempUnschedulable.rules.length : 0,
tempUnschedulableEnabled: profile.tempUnschedulable.enabled,
tempUnschedulableRuleCount: profile.tempUnschedulable.rules.length,
apiKeyPresent: profile.apiKey !== null && profile.apiKey.length > 0,
ok: profile.ok,
error: profile.error,
@@ -1552,24 +1485,23 @@ function compactProfile(profile: CodexProfile): Record<string, unknown> {
}
export function renderSub2ApiTempUnschedulableCredentials(policy: CodexTempUnschedulablePolicy): Record<string, unknown> {
if (!policy.enabled) return {};
return {
temp_unschedulable_enabled: policy.enabled && policy.rules.length > 0,
temp_unschedulable_rules: policy.enabled
? policy.rules.map((rule) => ({
error_code: rule.statusCode,
keywords: [...rule.keywords],
duration_minutes: rule.durationMinutes,
description: rule.description ?? "",
}))
: [],
temp_unschedulable_enabled: policy.enabled,
temp_unschedulable_rules: policy.rules.map((rule) => ({
error_code: rule.statusCode,
keywords: [...rule.keywords],
duration_minutes: rule.durationMinutes,
description: rule.description ?? "",
})),
};
}
function tempUnschedulableSummary(policy: CodexTempUnschedulablePolicy): Record<string, unknown> {
return {
enabled: policy.enabled && policy.rules.length > 0,
ruleCount: policy.enabled ? policy.rules.length : 0,
statusCodes: policy.enabled ? policy.rules.map((rule) => rule.statusCode) : [],
enabled: policy.enabled,
ruleCount: policy.rules.length,
statusCodes: policy.rules.map((rule) => rule.statusCode),
};
}
@@ -3711,8 +3643,9 @@ def account_payload(profile, group_id):
if upstream_user_agent:
credentials["user_agent"] = upstream_user_agent
temp_unschedulable = temp_unschedulable_credentials(profile)
credentials["temp_unschedulable_enabled"] = temp_unschedulable["enabled"]
credentials["temp_unschedulable_rules"] = temp_unschedulable["rules"]
if temp_unschedulable["enabled"]:
credentials["temp_unschedulable_enabled"] = True
credentials["temp_unschedulable_rules"] = temp_unschedulable["rules"]
return {
"name": profile["accountName"],
"notes": f"UniDesk-managed Codex profile {profile['profile']} from {profile['configFile']} and {profile['authFile']}; secret source={profile['apiKeySource']}; fingerprint={profile['apiKeyFingerprint']}.",
@@ -4803,8 +4736,8 @@ def normalize_temp_unschedulable_credentials(credentials):
"description": description,
})
return {
"enabled": enabled and len(rules) > 0,
"rules": rules if enabled else [],
"enabled": enabled,
"rules": rules,
}
def summarize_temp_unschedulable_rules(rules):
@@ -4819,6 +4752,8 @@ def summarize_temp_unschedulable_rules(rules):
def success_body_reclassification_requirement():
for name in sorted(EXPECTED_ACCOUNT_TEMP_UNSCHEDULABLE):
expected = normalize_temp_unschedulable_credentials(EXPECTED_ACCOUNT_TEMP_UNSCHEDULABLE[name])
if expected["enabled"] is not True:
continue
for rule in expected["rules"]:
error_code = rule.get("error_code")
keywords = rule.get("keywords") or []
@@ -4844,6 +4779,8 @@ def model_routing_400_failover_requirement():
preferred = ["暂不支持", "可用模型", "unsupported model", "model not supported", "does not support", "not supported", "model_not_found", "no available channel for model"]
for name in sorted(EXPECTED_ACCOUNT_TEMP_UNSCHEDULABLE):
expected = normalize_temp_unschedulable_credentials(EXPECTED_ACCOUNT_TEMP_UNSCHEDULABLE[name])
if expected["enabled"] is not True:
continue
for rule in expected["rules"]:
error_code = rule.get("error_code")
keywords = rule.get("keywords") or []
+201 -6
@@ -56,6 +56,8 @@ export interface DockerCleanupInventory {
export interface ServerCleanupPlanOptions {
minAgeHours: number;
limit: number;
confirm: boolean;
includeHighRisk: boolean;
}
export interface CleanupImageSummary {
@@ -92,6 +94,15 @@ export interface CleanupCommandReview {
reviewChecklist: string[];
}
interface DiskSnapshot {
filesystem: string;
sizeBytes: number;
usedBytes: number;
availableBytes: number;
usePercent: number;
mount: string;
}
export interface ServerCleanupPlan {
ok: boolean;
dryRun: true;
@@ -107,7 +118,7 @@ export interface ServerCleanupPlan {
dockerVolumesTouched: false;
dataDirectoriesTouched: false;
databaseCleanupIncluded: false;
liveCleanupImplemented: false;
liveCleanupImplemented: boolean;
note: string;
};
inventory: {
@@ -147,6 +158,49 @@ export interface ServerCleanupPlan {
prohibitedCommands: string[];
}
interface ServerCleanupRun {
ok: boolean;
dryRun: false;
mutation: true;
action: "server cleanup run";
scope: "docker-images-only";
observedAt: string;
options: ServerCleanupPlanOptions;
diskBefore: DiskSnapshot | null;
diskAfter: DiskSnapshot | null;
summary: {
plannedCandidateCount: number;
attemptedCount: number;
succeededCount: number;
failedCount: number;
skippedHighRiskCount: number;
estimatedReclaimBytes: number;
actualDiskReclaimBytes: number | null;
};
results: Array<{
imageId: string;
shortId: string;
repoTags: string[];
repoDigests: string[];
risk: Exclude<RiskLevel, "blocked">;
estimatedReclaimBytes: number;
command: string[];
status: "succeeded" | "failed" | "skipped";
reason?: string;
exitCode?: number | null;
stdoutTail?: string;
stderrTail?: string;
}>;
policy: {
deletionExecuted: true;
dockerPruneUsed: false;
dockerVolumesTouched: false;
dataDirectoriesTouched: false;
databaseCleanupIncluded: false;
highRiskRequiresIncludeFlag: true;
};
}
interface DeployServiceCommit {
environment: string;
serviceId: string;
@@ -156,6 +210,8 @@ interface DeployServiceCommit {
const defaultOptions: ServerCleanupPlanOptions = {
minAgeHours: 24,
limit: 200,
confirm: false,
includeHighRisk: false,
};
export function parseServerCleanupOptions(args: string[]): ServerCleanupPlanOptions {
@@ -174,8 +230,16 @@ export function parseServerCleanupOptions(args: string[]): ServerCleanupPlanOpti
if (!Number.isInteger(value) || value <= 0) throw new Error("--limit must be a positive integer");
options.limit = Math.min(value, 1000);
index += 1;
} else if (arg === "--confirm") {
options.confirm = true;
} else if (arg === "--dry-run") {
options.confirm = false;
} else if (arg === "--include-high-risk") {
options.includeHighRisk = true;
} else if (arg === "--no-high-risk") {
options.includeHighRisk = false;
} else {
throw new Error(`unknown server cleanup plan option: ${arg}`);
throw new Error(`unknown server cleanup option: ${arg}`);
}
}
return options;
@@ -186,14 +250,31 @@ export async function runServerCleanupCommand(config: UniDeskConfig, args: strin
if (action === "plan" || action === "dry-run") {
return serverCleanupPlan(config, parseServerCleanupOptions(rest));
}
if (action === "run") {
const options = parseServerCleanupOptions(rest);
if (!options.confirm) {
return {
ok: false,
error: "server-cleanup-run-requires-confirm",
dryRun: true,
mutation: false,
next: {
plan: `bun scripts/cli.ts server cleanup plan --min-age-hours ${options.minAgeHours} --limit ${options.limit}`,
confirm: `bun scripts/cli.ts server cleanup run --confirm --min-age-hours ${options.minAgeHours} --limit ${options.limit}${options.includeHighRisk ? " --include-high-risk" : ""}`,
},
policy: "server cleanup run removes only listed stale Docker images; high-risk images require --include-high-risk.",
};
}
return serverCleanupRun(config, options);
}
return {
ok: false,
error: "unsupported-server-cleanup-action",
action,
supportedActions: ["plan"],
supportedActions: ["plan", "run"],
dryRunOnly: true,
mutation: false,
policy: "This task implements only server cleanup plan. Real image deletion is intentionally not implemented and must require a future explicit approval parameter.",
policy: "server cleanup run requires --confirm and removes only stale Docker images selected by the same plan classifier.",
};
}
@@ -307,8 +388,8 @@ export function buildDockerCleanupPlan(inventory: DockerCleanupInventory, option
dockerVolumesTouched: false,
dataDirectoriesTouched: false,
databaseCleanupIncluded: false,
liveCleanupImplemented: false,
note: "This command only inventories Docker images and builds a dry-run review plan. It never runs docker image rm, docker prune, docker volume rm, rm, or database cleanup.",
liveCleanupImplemented: true,
note: "This command inventories Docker images and builds a dry-run review plan. Confirmed execution is available through server cleanup run --confirm, which removes only the listed stale images; neither command runs docker prune, docker volume rm, rm, or database cleanup.",
},
inventory: {
dockerAvailable: inventory.collection.dockerAvailable,
@@ -369,6 +450,91 @@ export function buildDockerCleanupPlan(inventory: DockerCleanupInventory, option
};
}
export function serverCleanupRun(config: UniDeskConfig, options: ServerCleanupPlanOptions = defaultOptions): ServerCleanupRun {
const diskBefore = rootDiskSnapshot();
const plan = serverCleanupPlan(config, options);
const selected = plan.candidateStaleImages;
const results: ServerCleanupRun["results"] = [];
for (const image of selected) {
const command = image.commandsToReview[0] ?? imageRemoveCommand({
id: image.id,
repoTags: image.repoTags,
repoDigests: image.repoDigests,
sizeBytes: image.sizeBytes,
createdAt: image.createdAt,
labels: {},
});
if (image.risk === "high" && !options.includeHighRisk) {
results.push({
imageId: image.id,
shortId: image.shortId,
repoTags: image.repoTags,
repoDigests: image.repoDigests,
risk: image.risk,
estimatedReclaimBytes: image.sizeBytes,
command,
status: "skipped",
reason: "high-risk-requires-include-high-risk",
});
continue;
}
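const remove = runCommand(command, repoRoot, { timeoutMs: 60_000 });
// A non-zero exit does not always mean the image is still there, so re-check presence before reporting a failure.
const presentAfterRemove = remove.exitCode === 0 ? false : dockerImagePresent(image.id);
// presentAfterRemove is null when the inspect result is inconclusive; that case is reported as failed.
const succeeded = remove.exitCode === 0 || presentAfterRemove === false;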
results.push({
imageId: image.id,
shortId: image.shortId,
repoTags: image.repoTags,
repoDigests: image.repoDigests,
risk: image.risk,
estimatedReclaimBytes: image.sizeBytes,
command,
status: succeeded ? "succeeded" : "failed",
reason: remove.exitCode !== 0 && presentAfterRemove === false ? "image-absent-after-remove" : undefined,
exitCode: remove.exitCode,
stdoutTail: tailText(remove.stdout, 4000),
stderrTail: tailText(remove.stderr, 4000),
});
}
const diskAfter = rootDiskSnapshot();
const failedCount = results.filter((item) => item.status === "failed").length;
const succeededCount = results.filter((item) => item.status === "succeeded").length;
const skippedHighRiskCount = results.filter((item) => item.status === "skipped").length;
const attempted = results.filter((item) => item.status !== "skipped");
const actualDiskReclaimBytes = diskBefore !== null && diskAfter !== null ? diskAfter.availableBytes - diskBefore.availableBytes : null;
return {
ok: plan.ok && failedCount === 0,
dryRun: false,
mutation: true,
action: "server cleanup run",
scope: "docker-images-only",
observedAt: new Date().toISOString(),
options,
diskBefore,
diskAfter,
summary: {
plannedCandidateCount: selected.length,
attemptedCount: attempted.length,
succeededCount,
failedCount,
skippedHighRiskCount,
estimatedReclaimBytes: attempted.reduce((sum, item) => sum + item.estimatedReclaimBytes, 0),
actualDiskReclaimBytes,
},
results,
policy: {
deletionExecuted: true,
dockerPruneUsed: false,
dockerVolumesTouched: false,
dataDirectoriesTouched: false,
databaseCleanupIncluded: false,
highRiskRequiresIncludeFlag: true,
},
};
}
function collectDockerCleanupInventory(config: UniDeskConfig): DockerCleanupInventory {
const observedAt = new Date().toISOString();
const desired = collectDesiredImagePolicy(config);
@@ -829,6 +995,35 @@ function commandError(command: string[], message: string, exitCode: number | nul
return { command, message, exitCode, stderrTail: stderr.slice(-1200) };
}
function dockerImagePresent(imageId: string): boolean | null {
const inspect = runCommand(["docker", "image", "inspect", imageId], repoRoot, { timeoutMs: 15_000 });
if (inspect.exitCode === 0) return true;
const text = `${inspect.stdout}\n${inspect.stderr}`;
if (/no such image|no such object/i.test(text)) return false;
return null;
}
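The tri-state result of `dockerImagePresent` (`true`, `false`, `null`) feeds the success check in `serverCleanupRun`. The following is a minimal sketch of that decision as a pure function. `classifyRemoval` and `RemovalOutcome` are hypothetical names used only for illustration and do not exist in the diff.

```typescript
// Hypothetical helper, not part of the diff: mirrors the success rule used in serverCleanupRun.
type RemovalStatus = "succeeded" | "failed";

interface RemovalOutcome {
  status: RemovalStatus;
  reason?: string;
}

// exitCode: exit code of the image-remove command (null if the process did not exit normally).
// presentAfterRemove: true = still present, false = confirmed absent, null = inconclusive.
function classifyRemoval(exitCode: number | null, presentAfterRemove: boolean | null): RemovalOutcome {
  // A clean exit is trusted without a follow-up inspect.
  if (exitCode === 0) return { status: "succeeded" };
  // The remove failed, but the image is confirmed gone: report success with a reason.
  if (presentAfterRemove === false) return { status: "succeeded", reason: "image-absent-after-remove" };
  // Still present, or presence could not be determined: report failure.
  return { status: "failed" };
}
```

Keeping the rule in a pure function like this would make the inconclusive (`null`) case easy to cover in a unit test without invoking Docker.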
function rootDiskSnapshot(): DiskSnapshot | null {
const result = runCommand(["df", "-B1", "-P", "/"], repoRoot, { timeoutMs: 5000 });
if (result.exitCode !== 0) return null;
const line = result.stdout.trim().split(/\r?\n/u)[1];
if (!line) return null;
const parts = line.trim().split(/\s+/u);
if (parts.length < 6) return null;
return {
filesystem: parts[0] ?? "",
sizeBytes: Number(parts[1]),
usedBytes: Number(parts[2]),
availableBytes: Number(parts[3]),
usePercent: Number((parts[4] ?? "0").replace("%", "")),
mount: parts[5] ?? "/",
};
}
function tailText(value: string, maxChars: number): string {
return value.length <= maxChars ? value : value.slice(-maxChars);
}
function shortContainerId(id: string): string {
return id.slice(0, 12);
}
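The parsing step of `rootDiskSnapshot` and the `actualDiskReclaimBytes` calculation can be sketched without spawning `df`. This is an illustration under stated assumptions: `parseDfOutput`, `reclaimedBytes`, and `DiskSnapshotSketch` are hypothetical names, and the finite-number guard is an addition that the diff itself does not include.

```typescript
// Hypothetical pure variant of the parsing done in rootDiskSnapshot, for illustration only.
interface DiskSnapshotSketch {
  filesystem: string;
  sizeBytes: number;
  usedBytes: number;
  availableBytes: number;
  usePercent: number;
  mount: string;
}

// Expects the text printed by `df -B1 -P /`: one header row, then one data row.
function parseDfOutput(stdout: string): DiskSnapshotSketch | null {
  const line = stdout.trim().split(/\r?\n/u)[1];
  if (!line) return null;
  const parts = line.trim().split(/\s+/u);
  if (parts.length < 6) return null;
  const snapshot: DiskSnapshotSketch = {
    filesystem: parts[0] ?? "",
    sizeBytes: Number(parts[1]),
    usedBytes: Number(parts[2]),
    availableBytes: Number(parts[3]),
    usePercent: Number((parts[4] ?? "0").replace("%", "")),
    mount: parts[5] ?? "/",
  };
  // Extra guard (an assumption, not in the diff): reject rows whose numeric columns fail to parse.
  const numeric = [snapshot.sizeBytes, snapshot.usedBytes, snapshot.availableBytes, snapshot.usePercent];
  return numeric.every((value) => Number.isFinite(value)) ? snapshot : null;
}

// Mirrors actualDiskReclaimBytes: growth in available bytes, or null if either snapshot is missing.
function reclaimedBytes(before: DiskSnapshotSketch | null, after: DiskSnapshotSketch | null): number | null {
  return before !== null && after !== null ? after.availableBytes - before.availableBytes : null;
}
```

The reclaimed figure can be negative when other processes write to the root filesystem during the run, so it is best read as an observation rather than an exact measure of what image removal freed.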