docs: record sub2api 504 and D518 recovery runbooks

2026-06-09 13:31:41 +00:00
parent 2c2852d0b1
commit 6454d52014
3 changed files with 8 additions and 6 deletions
@@ -26,14 +26,14 @@
 - `pool.groupName` names the Sub2API group that represents the pool.
 - `pool.apiKeySecretName` and `pool.apiKeySecretKey` name the k3s Secret that stores the single consumer API key.
 - `pool.minOwnerConcurrency` declares the minimum concurrency for the Sub2API user that owns the unified consumer API key. Keep it high enough to cover the declared account capacity set, so the shared key does not fail WS sessions at the user-concurrency layer. Do not compensate for owner-concurrency 1013 errors by pinning capacity to one provider.
- `pool.defaultTempUnschedulable` declares Sub2API account-level temporary unschedulable rules. Keep 429/overload/capacity, service-unavailable, and stable model-routing failures in this YAML policy so the scheduler can cool down a failing account and choose another candidate instead of hard-pinning one provider.
+- `pool.defaultTempUnschedulable` declares Sub2API account-level temporary unschedulable rules. Keep 429/overload/capacity, service-unavailable, gateway timeout, and stable model-routing failures in this YAML policy so the scheduler can cool down a failing account and choose another candidate instead of hard-pinning one provider.
 - `profiles.entries` selects local Codex profile files from `~/.codex/` and maps them to Sub2API account names.
 - `profiles.entries[].capacity` optionally overrides `pool.defaultAccountCapacity` for one account. Capacity is a YAML-controlled routing input; concrete current values belong only in `config/platform-infra/sub2api-codex-pool.yaml` and runtime validation output, not in long-term reference prose. Code constants, Secrets, ad-hoc runtime patches, or stale tests must not override YAML source of truth.
 - `profiles.entries[].loadFactor` optionally overrides `pool.defaultAccountLoadFactor` for one account and is rendered to Sub2API `load_factor`. Treat it as routing policy: values belong in YAML and `codex-pool validate` output, not code constants, Secrets, or ad-hoc runtime patches.
 - Do not change account membership, priority, capacity, load factor, WebSocket mode, or other routing policy from inference alone. Unless the user explicitly asks for a configuration change, first preserve the current YAML, collect provenance and runtime evidence, and write the finding to the relevant issue or runbook before proposing a change.
 - `profiles.entries[].tempUnschedulable` may override the pool default for one account. The CLI renders it into Sub2API credentials as `temp_unschedulable_enabled` and `temp_unschedulable_rules`; rules match HTTP status plus response-body keywords and place only that account into a temporary unschedulable cooldown.
 - Codex account-state or quota prompts that stop a task and ask the operator to switch accounts belong in `pool.defaultTempUnschedulable`, not in account membership, priority, capacity, load factor, WebSocket mode, or `pool_mode`. Keep stable body phrases such as weekly-limit and `/status` prompts in both the 403 account-state rule and the 429 quota/rate-limit rule, then run `codex-pool sync --confirm` and `codex-pool validate`. The validation evidence must include runtime temporary-unschedulable alignment for each managed account, not only successful group-level `/v1/models` or `/v1/responses` smoke output.
- Upstream model-routing failures that surface as 503 responses, such as `model_not_found` or "no available channel for model" wrappers, also belong in `pool.defaultTempUnschedulable`. They are not membership, priority, capacity, load factor, WebSocket mode, or User-Agent decisions by themselves. After adding stable body phrases, run `codex-pool sync --confirm` and `codex-pool validate`, and verify the affected account's runtime 503 rule includes the new keywords.
+- Upstream model-routing failures that surface as 503 responses, such as `model_not_found` or "no available channel for model" wrappers, also belong in `pool.defaultTempUnschedulable`. Gateway timeout failures that surface as 504 responses, including `Gateway Timeout`, `Unknown error`, `Upstream request failed`, `context deadline exceeded`, `context canceled`, or recovered upstream-error wrappers, belong in the same YAML policy. They are not membership, priority, capacity, load factor, WebSocket mode, or User-Agent decisions by themselves. After adding stable body phrases, run `codex-pool sync --confirm` and `codex-pool validate`, and verify the affected account's runtime status-specific rule includes the new keywords.
 - `profiles.entries[].openaiResponsesWebSocketsV2Mode` is the account-level Responses WebSocket v2 switch for OpenAI-compatible upstreams that require WebSocket transport. Allowed values are `off`, `ctx_pool`, and `passthrough`; omit the field unless that upstream needs it.
 - `profiles.entries[].upstreamUserAgent` is an optional account-level upstream request User-Agent override. Use it only for upstreams that require a Codex CLI compatible User-Agent; keep the value YAML-controlled and newline-free.
 - `publicExposure` controls the optional FRP bridge from master server to the G14 ClusterIP service.
@@ -45,7 +45,7 @@ When Codex startup repeatedly reports WebSocket reconnects or HTTPS fallback, pr

 Do not encode current availability assumptions in long-term reference prose. If an account needs a higher concurrency or load factor than the pool default, make that a deliberate YAML override and verify it with `codex-pool validate`; the reference document should describe the rule, not repeat the current numeric value.

-Do not enable Sub2API `pool_mode` for UniDesk-managed Codex accounts. `pool_mode` retries the same selected account path, while UniDesk's desired failover behavior is to mark the failing account temporarily unschedulable and let Sub2API choose another account from the group. `codex-pool validate` reports each managed account's temporary-unschedulable runtime alignment and should be used after `codex-pool sync --confirm`. Generic 502/503 bodies such as `Recovered upstream error 502`, `Bad Gateway`, Codex-facing `Upstream request failed`, and stable `model_not_found` / "no available channel for model" wrappers must stay in the YAML cooldown policy so an intermittently bad account is cooled down instead of repeatedly adding latency at the next compact or Responses request. The Codex pool default error cooldown is severity-tiered: temporary signals can start at ten minutes, gateway/service/overload/model-routing failures should cool down longer, and credential, permission, quota, or account-state failures should use the longest cooldown. Exact current values belong in YAML and runtime validation output.
+Do not enable Sub2API `pool_mode` for UniDesk-managed Codex accounts. `pool_mode` retries the same selected account path, while UniDesk's desired failover behavior is to mark the failing account temporarily unschedulable and let Sub2API choose another account from the group. `codex-pool validate` reports each managed account's temporary-unschedulable runtime alignment and should be used after `codex-pool sync --confirm`. Generic 502/503/504 bodies such as `Recovered upstream error 502`, `Bad Gateway`, `Gateway Timeout`, Codex-facing `Upstream request failed`, `Unknown error`, context-deadline/canceled wrappers, and stable `model_not_found` / "no available channel for model" wrappers must stay in the YAML cooldown policy so an intermittently bad account is cooled down instead of repeatedly adding latency at the next compact or Responses request. The Codex pool default error cooldown is severity-tiered: temporary signals can start at ten minutes, gateway/service/overload/model-routing failures should cool down longer, and credential, permission, quota, or account-state failures should use the longest cooldown. Exact current values belong in YAML and runtime validation output.

 Sub2API temporary-unschedulable rules require both an HTTP status match and a response-body keyword match. Do not treat them as a general successful-response content filter. If an upstream returns a quota warning as normal HTTP 200 assistant content, track that as a separate response-classification capability issue instead of claiming the YAML cooldown policy has covered it.

@@ -170,6 +170,8 @@ D601 这类长期 WSL provider 不得因为单一路径失败被直接写成全

 手动升级只用于把旧节点 bootstrap 到支持 always-enabled 远程升级的版本；bootstrap 完成后，常规重建/升级必须回到 `provider.upgrade mode=schedule`，不得再用 SSH 透传同步重建 `provider-gateway`。节点侧维护步骤是：进入节点本地 UniDesk 仓库，确认 GitHub 访问走本机 provider-gateway WS egress proxy，例如 `git config --local http.proxy http://127.0.0.1:18789` 和 `git config --local https.proxy http://127.0.0.1:18789` 后再执行 `git pull --ff-only` 获取主 server 已推送版本；不得让 provider 侧 Git 拉取退回直连公网、SSH SOCKS 或公开 master proxy。随后确认 `.state/provider-<ID>.env` 中存在 `PROVIDER_SERVER_URL=ws://74.48.78.17:18082/ws/provider`、`PROVIDER_ID=<ID>`、`PROVIDER_NAME=<ID>`、`PROVIDER_TOKEN`、`PROVIDER_LABELS_JSON`、`PROVIDER_UPGRADE_HOST_PROJECT_ROOT=/home/ubuntu/unidesk`、`PROVIDER_UPGRADE_WORKSPACE_PATH=/workspace`、`PROVIDER_UPGRADE_COMPOSE_FILE`、`PROVIDER_UPGRADE_ENV_FILE`、`PROVIDER_UPGRADE_COMPOSE_PROJECT`、`PROVIDER_UPGRADE_SERVICE=provider-gateway`、`PROVIDER_UPGRADE_RUNNER_IMAGE=unidesk_provider-gateway:<id>`、`DOCKER_SOCKET_PATH=/var/run/docker.sock`、`MONITOR_DISK_PATH=/`、心跳和重连参数。旧 env 文件中如果还残留 `PROVIDER_UPGRADE_ENABLED`，新版 provider-gateway 会忽略它；长期文档和新部署不得再依赖这个键。

+如果节点固定 UniDesk 仓库存在并行未提交修改或落后太多，不能为了 provider-gateway 恢复而 stash、reset 或把脏工作区作为构建真相。应在同一节点的固定仓库下从最新 remote 创建独立 clean runtime worktree，例如 `/home/ubuntu/unidesk/.worktree/provider-gateway-<ID>-runtime`，把节点本地 `.state/provider-<ID>.env`、`provider-<ID>.yml` 和必要的本地 Dockerfile/Compose 运行态指向该 worktree，并把 `PROVIDER_UPGRADE_HOST_PROJECT_ROOT` 改为该 clean worktree。bootstrap 成功后仍必须立刻执行标准 `provider.upgrade mode=schedule` 和 Host SSH/`trans` 原入口验收；clean runtime worktree 是恢复固定 repo 隔离性的手段，不是新的长期 source truth。
+
 如果节点已有专用 Compose，优先用节点本地 Compose 手动重建一次：`docker compose --env-file .state/provider-<ID>.env -f <compose-file> -p <compose-project> up -d --no-deps --build --force-recreate provider-gateway`。这条命令必须在节点本地终端、节点自有 Web terminal、系统计划任务或 detached shell 中执行；不得通过正在被重建的 UniDesk provider-gateway 自己提供的 SSH 透传同步执行，否则旧 provider 容器停止时会切断 SSH client，可能导致重建中断在旧容器已停、新容器未起的状态。若只能通过 UniDesk 触达该节点，必须使用 `provider.upgrade mode=schedule` 的 detached updater，或先用节点本地 `nohup`/systemd 启动一个不依赖当前 provider 容器生命周期的重建脚本。老版 `docker-compose` 可能在重建已存在容器时因为 `ContainerConfig` 兼容问题失败；此时只能移除目标 provider-gateway 容器后重新 `up -d --no-deps provider-gateway`，不得执行 `down -v`、`docker volume rm` 或任何会影响 database 命名卷的命令。如果节点当前只有 `docker run` 部署，则先构建镜像 `docker build -f src/components/provider-gateway/Dockerfile -t unidesk_provider-gateway:<id> .`，再以固定容器名重建：使用 `--restart always --pid host`，挂载 `/var/run/docker.sock:/var/run/docker.sock`、`/home/ubuntu/unidesk:/workspace:ro`、节点日志目录到 `/var/log/unidesk`，如需 WSL SSH 维护桥还要把只读私钥目录挂载到 `/run/host-ssh`，并使用同一个 `.state/provider-<ID>.env` 启动。无论 Compose 还是 `docker run`，容器名和镜像 tag 都必须带 Provider ID，便于 Docker 状态页、进程资源表、任务历史和节点本地排障互相对应。

 手动升级完成后的判定标准固定为主 server 可观测结果，而不是节点容器 `running`：访问公网 frontend `http://74.48.78.17:18081/`，确认该 Provider 在线；随后在任意装有本仓库且 `config.json` 含正确 frontend 登录凭据的计算节点上执行 `bun scripts/cli.ts --main-server-ip 74.48.78.17 debug dispatch <PROVIDER_ID> provider.upgrade --mode schedule --wait-ms 300000`，确认任务 `succeeded` 且 result 包含最终 Compose 容器 `containerId`、目标 gateway `version`、`restartPolicy=always`、`pidMode=host` 和新的 `heartbeatTimestamp`；最后再次查看 frontend 或执行 `bun scripts/cli.ts --main-server-ip 74.48.78.17 debug health`，确认节点重连、指标恢复、labels 中 `host.ssh` 能力存在。每个 provider-gateway 手动升级后都必须用 remote CLI 再执行 `bun scripts/cli.ts --main-server-ip 74.48.78.17 debug dispatch <PROVIDER_ID> host.ssh --wait-ms 15000` 和 `bun scripts/cli.ts --main-server-ip 74.48.78.17 ssh <PROVIDER_ID> hostname`，验证维护桥没有在升级后丢失；该 remote CLI 默认走公网 frontend，不需要指定 `--main-server-key`。