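From 8008a4977dd05887fe941ec810d300f709fbb280 Mon Sep 17 00:00:00 2001
From: Codex
Date: Sat, 13 Jun 2026 09:46:30 +0000
Subject: [PATCH] docs: clarify sub2api sentinel restore authority

---
 .agents/skills/unidesk-sub2api/SKILL.md | 8 ++++----
 docs/reference/platform-infra.md        | 6 +++---
 2 files changed, 7 insertions(+), 7 deletions(-)

diff --git a/.agents/skills/unidesk-sub2api/SKILL.md b/.agents/skills/unidesk-sub2api/SKILL.md
index a73093ed..97613df3 100644
--- a/.agents/skills/unidesk-sub2api/SKILL.md
+++ b/.agents/skills/unidesk-sub2api/SKILL.md
@@ -100,7 +100,7 @@ bun scripts/cli.ts platform-infra sub2api codex-pool cleanup-probes --target D60
 - `pool.defaultTempUnschedulable`: the Sub2API built-in request-path temporary-unschedulable switch and its YAML rule list. The current requirement is to enable the generic rules as configured in YAML; sync renders `temp_unschedulable_enabled` / `temp_unschedulable_rules` into managed accounts so that matching 400/5xx/timeout/model-routing/encrypted-content errors briefly cool the current account and trigger failover within the same group.
 - `pool.defaultTempUnschedulable` and the external `sentinel.*` settings are configured separately and do not drive each other. The built-in rules handle near-real-time request-path cooling/failover; the sentinel handles marker health, account-level quarantine/restore, and probe backoff.
 - The external sentinel's write surface may only freeze/restore accounts through the Sub2API admin `schedulable` endpoint; it must not write, clear, or indirectly clear Sub2API request-path runtime state such as `temp_unschedulable_until` / `temp_unschedulable_reason`, rate-limit, overload, or model-rate-limit, and it must not call `recover-state` as a restore action. When the UI shows a temporary-unschedulable state with "trigger time / release time / rule index / matched keyword", attribute it to Sub2API's built-in request-path temp-unschedulable by default, not to the sentinel.
-- YAML only selects and configures Codex upstreams and does not declare `schedulable` as a durable field; `schedulable=true` may only be restored by `codex-pool sync --confirm` as a process-control baseline for accounts that are not under sentinel quarantine.
+- YAML only selects and configures Codex upstreams and does not declare `schedulable` as a durable field; `codex-pool sync --confirm` is not responsible for restoring existing accounts to `schedulable=true`. For an existing account with `schedulable=false`, the sentinel must first sync the Sub2API runtime state and then restore the account only after a marker probe matches.
 - `profiles.entries`: selects upstream profiles from the master `~/.codex/` and maps them to Sub2API accounts.
 - `profiles.entries[].capacity`: optional per-account concurrency override; when omitted, `pool.defaultAccountCapacity` applies. The concrete values are authoritative only in `config/platform-infra/sub2api-codex-pool.yaml`; the skill and long-term reference describe the rules only and do not repeat the current values.
 - `profiles.entries[].loadFactor`: optional per-account Sub2API `load_factor` override; when omitted, `pool.defaultAccountLoadFactor` applies. The concrete values are authoritative only in `config/platform-infra/sub2api-codex-pool.yaml`; after changing them, run `codex-pool sync --confirm` and `codex-pool validate`.
@@ -121,15 +121,15 @@ bun scripts/cli.ts platform-infra sub2api codex-pool cleanup-probes --target D60
 - `sentinel.freeze`: exponential-backoff configuration for the failure-freeze TTL. The current policy is an initial 1 minute, then `1m -> 2m -> 4m -> 8m -> 10m` after failures, capped at 10 minutes; failed probes consume almost no effective output tokens, so freeze windows stay short. When a freeze expires, only a recovery probe runs, and the account is restored automatically only if that probe passes; TTL expiry alone must not unfreeze it.
 - `sentinel.pricing`: the sentinel's own token/cost estimation prices for direct upstream calls. Because direct upstream probes bypass Sub2API's regular usage ledger, the sentinel must record global and per-account token/cost itself; these ledgers are for observation only and are not a budget gate for skipping probes.

-`sync --confirm` logs in to Sub2API admin, creates/updates the group, creates/updates the `unidesk-codex-*` accounts listed in YAML, creates/reuses the unified API key Secret, and restores `schedulable=true` as the process-control baseline for managed accounts that are not under an active sentinel quarantine; by default it does not delete managed accounts that are absent from YAML. Only when an upstream is explicitly retired should `sync --confirm --prune-removed` be used to delete `unidesk-codex-*` accounts that are absent and have `extra.unidesk_managed=true`.
+`sync --confirm` logs in to Sub2API admin, creates/updates the group, creates/updates the `unidesk-codex-*` accounts listed in YAML, creates/reuses the unified API key Secret, and deploys/updates the sentinel resources; it does not directly restore existing managed accounts to `schedulable=true`. Restore is performed only by the sentinel: after reading `schedulable=false` from the Sub2API runtime it triggers a recovery probe, and it restores the account when the marker matches. By default `sync` does not delete managed accounts that are absent from YAML. Only when an upstream is explicitly retired should `sync --confirm --prune-removed` be used to delete `unidesk-codex-*` accounts that are absent and have `extra.unidesk_managed=true`.

 `sentinel-image status|build` manages the sentinel's Python runtime image. The image is derived from the YAML `sentinel.image` base image and `sentinel.sdk.openaiPythonVersion`, and is published to the target runtime's local registry; `build --confirm` first checks the registry tag, reuses it immediately if it exists, and only otherwise builds on the target host and pushes. At startup the CronJob only verifies the SDK version and does not `pip install` at run time.

-`sync --confirm` also renders the account-level sentinel resources from YAML and, when the monitor is enabled, first ensures that a reusable sentinel image exists. The current target is marker-only automatic freeze/restore with `sentinel.monitor.enabled=true` + `sentinel.actions.enabled=true`; do not hand-patch the CronJob, Secret, or Sub2API accounts. If YAML adds an account or changes the profile/base URL/API key fingerprint/upstream User-Agent/Responses WebSocket mode, sync writes a pending probe record from the pre-change runtime state and schedules a sentinel probe immediately, but by default keeps the account schedulable; only an actual marker probe miss or an existing active quarantine freezes the account. Sentinel freeze/restore changes only `schedulable=false|true` and must not incidentally call Sub2API `recover-state` to clear request-path temporary unschedulability or other runtime backoff. Existing success/failure backoff on unrelated accounts must not be reset. If YAML lowers the maximum failure-freeze window, sync migrates still-active older freeze state into the current maximum window and schedules a recovery probe immediately, but does not unfreeze directly. If you suspect an account has been misjudged, first use `codex-pool sentinel-probe --account --confirm` to trigger a measurement of that account immediately; the command derives a one-off Job from the existing CronJob template and reuses the same Secret, ConfigMap, OpenAI SDK probe, token/cost ledger, and freeze/restore state machine.
+`sync --confirm` also renders the account-level sentinel resources from YAML and, when the monitor is enabled, first ensures that a reusable sentinel image exists. The current target is marker-only automatic freeze/restore with `sentinel.monitor.enabled=true` + `sentinel.actions.enabled=true`; do not hand-patch the CronJob, Secret, or Sub2API accounts. If YAML adds an account or changes the profile/base URL/API key fingerprint/upstream User-Agent/Responses WebSocket mode, sync writes a pending probe record from the pre-change runtime state and schedules a sentinel probe immediately, but does not directly restore existing accounts to schedulable; `schedulable=true` is restored only after the sentinel reads `schedulable=false` from the Sub2API runtime, runs a recovery probe, and the marker matches. Sentinel freeze/restore changes only `schedulable=false|true` and must not incidentally call Sub2API `recover-state` to clear request-path temporary unschedulability or other runtime backoff. Existing success/failure backoff on unrelated accounts must not be reset. If YAML lowers the maximum failure-freeze window, sync migrates still-active older freeze state into the current maximum window and schedules a recovery probe immediately, but does not unfreeze directly. If you suspect an account has been misjudged, first use `codex-pool sentinel-probe --account --confirm` to trigger a measurement of that account immediately; the command derives a one-off Job from the existing CronJob template and reuses the same Secret, ConfigMap, OpenAI SDK probe, token/cost ledger, and freeze/restore state machine.

 `trace --request-id ` is a read-only request trace report; it does not trigger probes or modify accounts. By default it outputs the request start/final status, failover, `account_select_failed`, in-window `account_temp_unschedulable`, the admin schedulable write count, and a snapshot of the current accounts; `reason=failover-attempted-no-candidate` means Sub2API entered automatic account switching but had no available candidate after excluding the currently failing account. Use `--raw` for machine processing and add `--show-lines` for the raw matched lines.

-`sentinel-report` is a read-only, low-noise report; it does not trigger probes or modify accounts. By default it outputs a `ps`-like text table showing, for each account, the probe count, latest marker/HTTP/action, freeze TTL, success backoff, next probe, and recent run events; `PROT` shows the account-level protect threshold and `P_FAIL` shows the failure count/threshold from the latest protect confirmation; use `sentinel-report --raw` for machine processing.
+`sentinel-report` is a read-only, low-noise report; it does not trigger probes or modify accounts. By default it outputs a `ps`-like text table showing, for each account, the probe count, Sub2API runtime `schedulable`, latest marker/HTTP/action, freeze TTL, success backoff, next probe, and recent run events; `SCH` shows the Sub2API runtime schedulable state, `PROT` shows the account-level protect threshold, and `P_FAIL` shows the failure count/threshold from the latest protect confirmation; use `sentinel-report --raw` for machine processing.

 `sync --confirm` and `validate` may exceed a single short SSH/runtime connection window. Keep using `bun scripts/cli.ts platform-infra sub2api codex-pool ...`, which lets the CLI submit the job on the G14 remote and short-poll its status; do not switch to a bare `trans G14:k3s script` that waits on one long connection for the complete result. If you see `UNIDESK_SSH_RUNTIME_TIMEOUT`, first treat it as a control-plane visibility issue per the rules in `docs/reference/platform-infra.md`: fix the CLI/job/poll or rerun the controlled command, and do not hand-patch Sub2API credentials or source code.

diff --git a/docs/reference/platform-infra.md b/docs/reference/platform-infra.md
index 9db3a599..c54ca9ad 100644
--- a/docs/reference/platform-infra.md
+++ b/docs/reference/platform-infra.md
@@ -82,7 +82,7 @@
 - `pool.defaultTempUnschedulable` is the Sub2API built-in request-path temporary-unschedulable switch plus its YAML rule list. When enabled, `codex-pool sync --confirm` renders `temp_unschedulable_enabled` and `temp_unschedulable_rules` into every managed account unless an account-level override says otherwise. This is the generic same-request recovery path for selected-account upstream failures: a matching upstream error briefly cools the selected account so Sub2API's existing failover loop can select another account in the same group.
 - The built-in temporary-unschedulable configuration and external `sentinel.*` configuration are separate control surfaces. `pool.defaultTempUnschedulable` handles near-real-time request-path cooling and failover; `sentinel.*` handles account-level marker health, quarantine, restore, and probe cadence. Changing one surface must not silently rewrite the other surface's cadence, marker semantics, quarantine state, or rule list.
 - The external sentinel write surface is intentionally limited to the Sub2API admin `schedulable` action. Sentinel freeze/restore may set `schedulable=false|true`, but must not write, clear, or indirectly clear Sub2API request-path runtime state such as `temp_unschedulable_until`, `temp_unschedulable_reason`, rate-limit, overload, or model-rate-limit state. In particular, sentinel restore must not call Sub2API `recover-state`, because that endpoint is a broader runtime-state recovery operation rather than a pure schedulability restore.
-- Codex accounts selected by YAML do not declare `schedulable` as durable configuration. `schedulable=true` is a `codex-pool sync --confirm` process-control baseline for UniDesk-managed accounts that are not under sentinel quarantine, not a YAML field.
+- Codex accounts selected by YAML do not declare `schedulable` as durable configuration. `codex-pool sync --confirm` must not restore existing account schedulability merely because YAML selects the account or sentinel state lacks an active quarantine. Existing `schedulable=false` is runtime state: the sentinel first reads Sub2API's actual account state, schedules a recovery probe for unschedulable managed accounts, and restores `schedulable=true` only after the marker probe matches.
 - `codex-pool sync --confirm` preserves UniDesk-managed accounts that are absent from YAML by default; explicit upstream retirement requires `codex-pool sync --confirm --prune-removed`. This keeps account deletion out of the normal availability-recovery path and prevents temporary YAML edits from becoming destructive runtime changes.
 - `profiles.entries` selects local Codex profile files from `~/.codex/` and maps them to Sub2API account names.
 - The unsuffixed master `~/.codex/config.toml` and `~/.codex/auth.json` are reserved for the unified Sub2API consumer. `config.toml` must keep the YAML-selected consumer base URL written by `codex-pool configure-local --target --confirm`, and `auth.json` must contain the unified pool API key from `pool.apiKeySecretName` / `pool.apiKeySecretKey` on that active target. Do not replace these two files with direct upstream account credentials.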
@@ -136,13 +136,13 @@ The sentinel must not maintain separate classifiers for "private content", "main

 `pool.defaultSentinelProtect` is the default protection policy for sentinel freeze decisions, and `profiles.entries[].sentinelProtect` may override it for a specific account. For protected accounts, the marker-only health contract still applies, but the sentinel must exhaust the configured consecutive marker confirmation attempts before treating the account as failed and entering the freeze state machine. The retry count, initial delay, maximum delay, and backoff multiplier are YAML values; long-term reference prose must not duplicate the current numbers. This policy exists only to absorb occasional marker/probe or gateway-failure confirmation jitter. It must not change Sub2API scheduler priority, capacity, load factor, membership, built-in temporary-unschedulable settings, or the recovery condition.

-When `codex-pool sync --confirm` creates a YAML-managed account or changes direct-probe-relevant account inputs such as the profile mapping, upstream base URL, API key fingerprint, upstream User-Agent, Responses WebSocket mode, `trustUpstream`, pool/profile `sentinelProtect`, sync records a pending sentinel probe from the pre-mutation runtime state, updates the account, restores `schedulable=true` unless an active sentinel quarantine already exists, and schedules the account probe immediately. New or changed accounts are not default-frozen; only an actual non-marker probe result or an existing active quarantine may remove an account from the scheduler. This avoids zero-available windows during sync while still ensuring that later marker failures enter the normal freeze/restore state machine. Unchanged accounts must not have their existing success or failure backoff reset by unrelated YAML syncs.
+When `codex-pool sync --confirm` creates a YAML-managed account or changes direct-probe-relevant account inputs such as the profile mapping, upstream base URL, API key fingerprint, upstream User-Agent, Responses WebSocket mode, `trustUpstream`, pool/profile `sentinelProtect`, sync records a pending sentinel probe from the pre-mutation runtime state, updates the account, and schedules the account probe immediately. It does not restore existing accounts to `schedulable=true`; restoration belongs to the marker-only sentinel after it has synced Sub2API runtime state and observed a marker-matching probe. New or changed accounts are not default-frozen; only an actual non-marker probe result or an existing active quarantine may remove an account from the scheduler. This avoids zero-available windows during sync while still ensuring that later marker failures enter the normal freeze/restore state machine. Unchanged accounts must not have their existing success or failure backoff reset by unrelated YAML syncs.

 If the YAML failure freeze maximum is lowered, `codex-pool sync --confirm` may migrate only currently active sentinel quarantines whose stored interval or next recovery time exceeds the current maximum. The migration keeps the account frozen, marks the next recovery probe due immediately, and lets the next marker result decide restore versus the new shorter failure backoff. It must not clear quarantine or restore schedulability merely because an older TTL has expired.

 If the YAML success cadence maximum is lowered or an account changes trust class, `codex-pool sync --confirm` may clamp existing successful account state so the next probe is due under the current YAML policy instead of waiting for an older, longer success window to expire. This clamp only affects sentinel state and probe timing; it does not by itself restore a quarantined account or bypass the next marker result.

-Operational observation for this sentinel should use the read-only `codex-pool sentinel-report` table or its `--raw` form. It is the canonical low-noise view for per-account probe count, trust class, protect threshold and latest protect confirmation result, marker result, HTTP/error diagnostics, freeze TTL, success cadence, success cadence maximum, next probe time, and recent CronJob runs; raw ConfigMap dumps and ad hoc log scraping are fallback diagnostics, not the primary state surface.
+Operational observation for this sentinel should use the read-only `codex-pool sentinel-report` table or its `--raw` form. It is the canonical low-noise view for per-account probe count, trust class, Sub2API runtime schedulability, protect threshold and latest protect confirmation result, marker result, HTTP/error diagnostics, freeze TTL, success cadence, success cadence maximum, next probe time, and recent CronJob runs; raw ConfigMap dumps and ad hoc log scraping are fallback diagnostics, not the primary state surface.

 The active Codex-pool request path follows the YAML-selected active target: