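diff --git a/.agents/skills/unidesk-sub2api/references/manual-accounts.md b/.agents/skills/unidesk-sub2api/references/manual-accounts.md
index 202af73a..03993a41 100644
--- a/.agents/skills/unidesk-sub2api/references/manual-accounts.md
+++ b/.agents/skills/unidesk-sub2api/references/manual-accounts.md
@@ -4,14 +4,14 @@ The Sub2API admin UI account connection test uses the account-level `ProxyID` / proxy URL config
 If the WebUI account connection test shows `proxyconnect tcp: dial tcp 127.0.0.1:: connect: connection refused`, first confirm that the proxy URL is an account-level loopback configuration: inside a k3s target, `127.0.0.1` is the Sub2API Pod itself, not the node or PK01. Do not start by changing account credentials, PK01 Caddy, `api.pikapython.com`, or the unified key; instead declare `targets[].accountLocalProxy` in the target's `config/platform-infra/sub2api.yaml`, let `platform-infra sub2api apply --target ` render the same-Pod sidecar and Secret, then verify `http://127.0.0.1:` with the `accountLocalProxy` probe of `validate --target `. Output may still disclose only sourceRef, fingerprint, secretName, and proxyUrl; it must not print the proxy password or the generated configuration.
-The WebUI account connection test also bypasses the pool group selector of the unified consumer API key; a passing account test does not mean the PC Codex client can select that account. When the WebUI account test passes but `/responses` or `/v1/responses` returns 503 with `account-select-failed` / `no available accounts`, first check whether the manual account declares `groupBinding.source: pool-group`, and confirm that the Sub2API `account_groups` join contains a binding between that account and the `group_id` of the current unified API key. For supported k3s targets, add the account to the current `pool.groupName` via `sync --confirm`; for the PK01 host-Docker target, until the host-Docker codex-pool sync adapter is complete, runtime recovery is possible only by writing `group_ids` through the minimal admin API, and the output must contain only account id, group id, presence/fingerprint, and smoke status, never secrets.
+The WebUI account connection test also bypasses the pool group selector of the unified consumer API key; a passing account test does not mean the PC Codex client can select that account. When the WebUI account test passes but `/responses` or `/v1/responses` returns 503 with `account-select-failed` / `no available accounts`, first check whether the manual account declares `groupBinding.source: pool-group`, and confirm that the Sub2API `account_groups` join contains a binding between that account and the `group_id` of the current unified API key. For both k3s targets and the PK01 host-Docker target, add the protected manual account to the current `pool.groupName` via `codex-pool sync --target --confirm`; PK01 goes through the host-Docker adapter and does not use a k8s Secret/CronJob. Evidence output contains only account id, group id, presence/fingerprint, and smoke status, never secrets.
 Protected manual accounts are still maintained by hand in the Sub2API UI for credentials/status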
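and similar fields; UniDesk may only apply narrow proxy and group bindings through YAML:
 ```bash
-bun scripts/cli.ts platform-infra sub2api codex-pool plan --target D601
-bun scripts/cli.ts platform-infra sub2api codex-pool sync --target D601 --confirm
-bun scripts/cli.ts platform-infra sub2api codex-pool validate --target D601
+bun scripts/cli.ts platform-infra sub2api codex-pool plan --target
+bun scripts/cli.ts platform-infra sub2api codex-pool sync --target --confirm
+bun scripts/cli.ts platform-infra sub2api codex-pool validate --target
 ```
 `manualAccounts.protected[].targetIds` is the target scope for account protection and narrow sync. When omitted, the manual account is protected on all targets; when set to, for example, `[PK01]`, `codex-pool sync --confirm`, `validate`, and `sentinel-probe` on other targets such as JD01/D601 no longer include this manual account in the current runtime requirements. Do not resolve drift by automatically deleting accounts that are absent from YAML; only add or modify YAML-controlled accounts, and accounts not controlled by the current target's YAML remain under manual ownership.
diff --git a/.agents/skills/unidesk-sub2api/references/sentinel.md b/.agents/skills/unidesk-sub2api/references/sentinel.md
index c1e5b729..ad75f9fc 100644
--- a/.agents/skills/unidesk-sub2api/references/sentinel.md
+++ b/.agents/skills/unidesk-sub2api/references/sentinel.md
@@ -9,7 +9,7 @@
 - `sentinel.freeze`: exponential backoff configuration for the failure freeze TTL. The current policy starts at 1 minute and, on repeated failure, steps through `1m -> 2m -> 4m -> 8m -> 10m`, capped at 10 minutes; a failed probe consumes essentially no effective output tokens, so the freeze window stays short. When a freeze expires, only a recovery probe runs, and the account is restored automatically only if the probe passes; TTL expiry alone must not lift the freeze.
 - `sentinel.pricing`: the sentinel's own token/cost estimation prices for direct upstream calls. Because a direct upstream probe bypasses the regular Sub2API usage ledger, the sentinel must record global and per-account token/cost itself; these ledgers are for observation only and do not act as a budget gate for skipping probes.
-For supported k3s targets, `sync --confirm` logs in to Sub2API admin, creates/updates the group, creates/updates the `unidesk-codex-*` accounts declared in YAML, creates/reuses the unified API key Secret, and deploys/updates the sentinel resources; it does not directly restore an existing managed account to `schedulable=true`. Recovery happens only when the sentinel reads `schedulable=false` from the Sub2API runtime, triggers a recovery probe, and the marker matches. By default `sync` does not delete managed accounts that are absent from YAML. Use `sync --confirm --prune-removed` only when an upstream is explicitly retired, to delete `unidesk-codex-*` accounts that are absent and have `extra.unidesk_managed=true`. For `manualAccounts.protected`, `sync` performs only the narrow sync that YAML explicitly allows; the currently allowed items are creating/updating the Sub2API internal proxy record from the target's `egressProxy` and binding `proxy_id`, and adding the protected manual account to the current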
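`pool.groupName`. It still does not take over that account's credentials, status, schedulable, priority/capacity/loadFactor, or sentinel state. Until the codex-pool adapter is complete, the PK01 host-Docker target does not have this full sync path.
+`codex-pool sync --confirm` logs in to Sub2API admin, creates/updates the group, creates/updates the `unidesk-codex-*` accounts declared in YAML, and creates/reuses the unified API key. k3s targets additionally write the unified API key Secret and deploy/update the sentinel resources; the PK01 host-Docker target only reconciles the Sub2API runtime and the YAML-declared host env key source, does not create a k8s Secret/CronJob, and does not require a sentinel runtime to exist. `sync` does not directly restore an existing managed account to `schedulable=true`. Recovery happens only when the sentinel reads `schedulable=false` from the Sub2API runtime, triggers a recovery probe, and the marker matches. By default `sync` does not delete managed accounts that are absent from YAML. Use `sync --confirm --prune-removed` only when an upstream is explicitly retired, to delete `unidesk-codex-*` accounts that are absent and have `extra.unidesk_managed=true`. For `manualAccounts.protected`, `sync` performs only the narrow sync that YAML explicitly allows; the currently allowed items are creating/updating the Sub2API internal proxy record from the target's `egressProxy` and binding `proxy_id`, and adding the protected manual account to the current `pool.groupName`. It still does not take over that account's credentials, status, schedulable, priority/capacity/loadFactor, or sentinel state.
 `sentinel-image status|build` manages the sentinel's Python runtime image. The image is derived from the YAML `sentinel.image` base image and `sentinel.sdk.openaiPythonVersion`, and is published to the target runtime's local registry; `build --confirm` first checks the registry tag, reuses it quickly if it exists, and only builds and pushes on the target host if it does not. The CronJob on k3s targets must use this derived runtime image; do not fall back to the base Python image and run `pip install openai` at container startup. When the official base image or Python wheels must be pulled from the external network, the build script should read the proxy env provided by the target YAML/host-proxy (for example JD01 `/etc/unidesk/proxy.env`) and use host network so that the `127.0.0.1` hostproxy takes effect; do not add a new image mirror for this path. The CronJob startup script only validates `OPENAI_PYTHON_VERSION` and installs as a fallback only when the version does not match. Whether a target enables the sentinel is determined by `sentinel.enabledOnTargets` in `config/platform-infra/sub2api.yaml`; targets where it is not enabled should show `skipped-target-disabled` in `sync`/`validate`, and must not require the image build, CronJob, Secret, or state ConfigMap to exist.
diff --git a/.agents/skills/unidesk-sub2api/references/troubleshooting-accounts.md b/.agents/skills/unidesk-sub2api/references/troubleshooting-accounts.md
index 2338459f..eeca3c0b 100644
--- a/.agents/skills/unidesk-sub2api/references/troubleshooting-accounts.md
+++ b/.agents/skills/unidesk-sub2api/references/troubleshooting-accounts.md
@@ -4,8 +4,8 @@
 - 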
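To strengthen monitoring without letting the sentinel freeze accounts automatically, set `sentinel.actions.enabled=false` in YAML and then run `codex-pool sync --confirm`. The marker probe and gateway failure monitor still record `would-freeze` / observe-only evidence, but do not write `schedulable=false` through Sub2API admin; `codex.remote_compact.failed` on `/responses/compact` and compact upstream 5xx failover are recorded only as `gateway-compact-*` observation events, not as triggers for automatic sentinel switching.
 - A single request id reports 502/503/interruption/no automatic account switch: first run `bun scripts/cli.ts platform-infra sub2api codex-pool trace --request-id `. Look at `outcome`, `reason`, `FAILOVER`, `SELECT-FAILED`, `ACCOUNT SIGNALS`, and `WINDOW STATS` first; add `--show-lines` or `--raw` only when the trace report is missing fields or the raw logs need auditing. If `reason=failover-attempted-no-candidate`, the switch was attempted but the scheduler had no usable candidate after excluding the failed account; continue with `sentinel-report` and `validate --full` to distinguish sentinel quarantine, request-path temp-unschedulable, account status, or capacity exhaustion.
 - profile invalid: first fix `base_url`, `wire_api`, or `model` in `~/.codex/config.toml.`, or the API key in `auth.json.`; do not write secrets in YAML.
-- The WebUI account test for a manual OAuth/API-key account times out connecting to `chatgpt.com`, but an explicit HTTP proxy probe from the same Pod succeeds: do not look only at the Pod `HTTP_PROXY` env; follow the "Protected manual account proxy and group binding" section to confirm `manualAccounts.protected[].proxyBinding`, run `codex-pool sync --target D601 --confirm`, then retest with the original account test. If the retest no longer resets or times out, and instead a specified model such as `gpt-5.2-pro` returns an unsupported-capability error from ChatGPT OAuth Codex, verify the proxy with a default/supported model or a unified-key smoke test; do not treat the model error as evidence that the proxy is still broken.
-- The WebUI account test for a manual OAuth/API-key account passes, but the PC Codex client gets 503 from `/responses` through the unified key and the trace shows `account-select-failed` / `no available accounts`: follow the "Protected manual account proxy and group binding" section to confirm the account is bound to the pool group used by the unified key. The WebUI group list and account details are not necessarily enough to prove the scheduler can schedule the account; when needed, check admin account availability and the `account_groups` join. On k3s targets, run `codex-pool sync --target --confirm` and then retest the unified key with `codex-pool validate --target --full`; on PK01 host-Docker, until the sync/validate adapter is complete, recover using minimal admin API/DB evidence and accept with a public `/v1/responses` smoke test.
+- The WebUI account test for a manual OAuth/API-key account times out connecting to `chatgpt.com`, but an explicit HTTP proxy probe on the target runtime succeeds: do not look only at Pod or container environment variables; follow the "Protected manual account proxy and group binding" section to confirm `manualAccounts.protected[].proxyBinding`, run `codex-pool sync --target --confirm`, then retest with the original account test. If the retest no longer resets or times out, and instead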
`gpt-5.2-pro` or a similarly specified model returns an unsupported-capability error from ChatGPT OAuth Codex, verify the proxy with a default/supported model or a unified-key smoke test; do not treat the model error as evidence that the proxy is still broken.
+- The WebUI account test for a manual OAuth/API-key account passes, but the PC Codex client gets 503 from `/responses` through the unified key and the trace shows `account-select-failed` / `no available accounts`: follow the "Protected manual account proxy and group binding" section to confirm the account is bound to the pool group used by the unified key. The WebUI group list and account details are not necessarily enough to prove the scheduler can schedule the account; when needed, check admin account availability and the `account_groups` join. On both k3s targets and the PK01 host-Docker target, run `codex-pool sync --target --confirm` and then retest the unified key with `codex-pool validate --target --full`; if `MANUAL=N` points to a missing protected manual account, first restore or retire that manual account according to account ownership, and do not conflate it with a failure of the YAML-managed pool accounts.
 - pool key 401: run `codex-pool sync --confirm` to rebuild the binding between the Sub2API key and the k3s Secret, then run `codex-pool validate`.
 - If a pool key, admin password, or k8s Secret `.data` is printed to stdout, logs, an issue, or a local transcript, treat it as a leak: revoke the corresponding Sub2API key or token, delete/recreate the affected target Secret, redistribute it through `codex-pool sync --target --confirm` or the corresponding YAML sourceRef, then use fingerprint, presence, and `valuesPrinted=false` as closeout evidence; do not repeat the old or new values.
 - Leftover validation probes from earlier runs: clean up `unidesk-probe-*` temporary resources only with `codex-pool cleanup-probes --confirm`; do not treat deleting real managed accounts as probe cleanup or availability recovery.
diff --git a/docs/reference/platform-infra.md b/docs/reference/platform-infra.md
index 787249e7..22c2cb09 100644
--- a/docs/reference/platform-infra.md
+++ b/docs/reference/platform-infra.md
@@ -30,7 +30,7 @@
 - k3s Sub2API targets should stay ClusterIP-only by default. Host-Docker targets should bind app ports to loopback or a YAML-declared host interface and use a managed edge such as PK01 Caddy for public HTTPS. Do not add Ingress, NodePort, LoadBalancer, hostPort, or broad FRP exposure unless a YAML-controlled public exposure decision exists.
 - Sub2API currently has no resource limits by design. Do not add CPU or memory limits unless a later explicit decision changes that policy and stores the new policy in YAML.
 - Master server is a consumer/control host, not the runtime location. Do not deploy Sub2API, PostgreSQL, Redis, or heavy validation loops on master server.
-- Sub2API active/standby placement is selected by YAML, not by ad hoc runtime patches. A standby target must render without a local PostgreSQL StatefulSet, keep the Sub2API app and local Redis cache scaled to zero, use only ephemeral Redis storage if Redis is later activated, and omit public exposure, HTTPS egress proxy, and account sentinel resources unless YAML explicitly promotes that target. An externally backed active target connects directly to the YAML-declared external PostgreSQL endpoint with `sslmode=require`, keeps durable app state outside the runtime node, and uses local Redis only as ephemeral cache. Host-Docker active targets such as PK01 are still Sub2API platform targets, but k3s-only Codex-pool helper paths must not be assumed to work there until the CLI implements host-Docker adapters. Multiple externally backed active targets may coexist when YAML declares distinct target ids, host routes, public URLs, FRP remote ports or local edge bindings, and Secret sources; target-scoped operations must use `--target ` and must not treat one target's URL or Secret as a fallback for another. Promotion or failback must be applied by editing `config/platform-infra/sub2api.yaml` and running the same `platform-infra sub2api --target ` CLI path.
+- Sub2API active/standby placement is selected by YAML, not by ad hoc runtime patches. A standby target must render without a local PostgreSQL StatefulSet, keep the Sub2API app and local Redis cache scaled to zero, use only ephemeral Redis storage if Redis is later activated, and omit public exposure, HTTPS egress proxy, and account sentinel resources unless YAML explicitly promotes that target. An externally backed active target connects directly to the YAML-declared external PostgreSQL endpoint with `sslmode=require`, keeps durable app state outside the runtime node, and uses local Redis only as ephemeral cache. 
Host-Docker active targets such as PK01 are still Sub2API platform targets; `codex-pool plan|sync|validate` has a host-Docker adapter for PK01, while sentinel image/report/probe and parts of trace remain target-capability-specific and must not be treated as PK01 runtime health failures when they require k3s resources. Multiple externally backed active targets may coexist when YAML declares distinct target ids, host routes, public URLs, FRP remote ports or local edge bindings, and Secret sources; target-scoped operations must use `--target ` and must not treat one target's URL or Secret as a fallback for another. Promotion or failback must be applied by editing `config/platform-infra/sub2api.yaml` and running the same `platform-infra sub2api --target ` CLI path.
 - External platform PostgreSQL endpoints for Sub2API are produced by the platform DB YAML and its `platform-db postgres` CLI. Cross-node Sub2API consumers connect directly to that endpoint; the master server is not a PostgreSQL data-plane relay. DNS aliases are optional when the exported `DATABASE_URL` uses a reachable IP with `sslmode=require`; current PK01-specific rules live in `docs/reference/pk01.md`.
 - Sub2API account sentinel, public exposure, and HTTPS egress proxy are target-scoped YAML decisions. The active target may run them when YAML enables them; the standby G14 target must stay deployed but inactive until YAML promotion. `sentinel.enabledOnTargets` is the authority for where Codex-pool sentinel image, CronJob, Secret and state resources are expected; disabled targets should report sentinel validation as skipped instead of failing on missing runtime sentinel objects. Do not create a second sentinel, FRP client, public management surface, or edge proxy by hand; enable or move those resources only through the target YAML and the `platform-infra sub2api` / `codex-pool --target` CLI paths.
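
Review note: the last bullet's rule (a target missing from `sentinel.enabledOnTargets` reports sentinel validation as skipped rather than failing on absent objects) can be sketched roughly as below. This is only an illustration under stated assumptions, not the CLI's actual implementation: the function name, the `REQUIRED_KINDS` labels, and the `ok` / `missing:` status strings are hypothetical, and treating JD01 as sentinel-enabled is assumed purely for the example. Only `skipped-target-disabled` comes from the docs in this diff.

```python
from typing import Iterable, Mapping

# Status string taken from the docs in this diff; "ok" and "missing:..." below
# are hypothetical placeholders for illustration.
SKIPPED = "skipped-target-disabled"

# Illustrative labels for the resource kinds the docs list for an enabled
# target (image, CronJob, Secret, state); not real object names.
REQUIRED_KINDS = ("image", "cronjob", "secret", "state")


def sentinel_validation_status(
    target: str,
    enabled_on_targets: Iterable[str],
    present: Mapping[str, bool],
) -> str:
    """Return a validation status for one target's sentinel resources."""
    if target not in set(enabled_on_targets):
        # Disabled targets are skipped; missing objects are not a failure.
        return SKIPPED
    missing = [kind for kind in REQUIRED_KINDS if not present.get(kind, False)]
    if missing:
        return "missing:" + ",".join(missing)
    return "ok"


if __name__ == "__main__":
    enabled = ["JD01"]  # assumed value of sentinel.enabledOnTargets
    # PK01 is not in the list, so nothing is required of it.
    print(sentinel_validation_status("PK01", enabled, {}))
    # JD01 is enabled but only part of its resources exist.
    print(sentinel_validation_status("JD01", enabled, {"image": True, "cronjob": True}))
```

The point of the sketch is the ordering: the enablement check runs before any resource lookup, so a host-Docker target such as PK01 never fails validation for lacking k3s sentinel objects.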