From 88516cec6acd10cd70616016017337b6c72d6138 Mon Sep 17 00:00:00 2001 From: Codex Date: Sat, 16 May 2026 04:10:02 +0000 Subject: [PATCH] feat: add code queue deploy cli --- AGENTS.md | 2 + TEST.md | 2 +- docs/reference/cli.md | 3 +- docs/reference/codex-deploy.md | 50 ++ docs/reference/deployment.md | 4 +- docs/reference/microservices.md | 6 +- scripts/cli.ts | 9 + scripts/src/check.ts | 1 + scripts/src/codex-deploy.ts | 601 ++++++++++++++++++ .../v3sctl-adapter/v3s/code-queue.k8s.yaml | 225 ++++++- 10 files changed, 888 insertions(+), 15 deletions(-) create mode 100644 docs/reference/codex-deploy.md create mode 100644 scripts/src/codex-deploy.ts diff --git a/AGENTS.md b/AGENTS.md index 6dc9f02c..3fcf9ede 100644 --- a/AGENTS.md +++ b/AGENTS.md @@ -28,6 +28,7 @@ UniDesk 是一个以主 server 为统一入口的分布式工作平台;本文 - `bun scripts/cli.ts provider attach [--master-server URL] [--up] [--force]`:在新增计算节点上生成两项配置的 provider-gateway 挂载包;默认只需要主 server URL(默认 `http://74.48.78.17/`)和唯一 Provider ID,生成的 Compose 固定 Docker socket、`pid: "host"`、`restart: always`、只读 `/workspace`、SSH 维护私钥挂载和 loopback egress proxy 端口,规则见 `docs/reference/provider-gateway.md`。 - `bun scripts/cli.ts ssh [ssh-like args...]`:通过 provider-gateway 的 Host SSH / WSL SSH 维护桥打开近似原生 ssh 的交互会话或远端命令,并在远端 PATH 注入 `apply_patch`、`glob` 与 `skill-discover`;`apply-patch`、`py`、`skills`、结构化 `find`、`glob` 和 `argv` 子命令用于避免远端补丁、Python stdin、skill 发现与常用只读命令的嵌套转义问题,使用规则见 `docs/reference/cli.md` 和 `docs/reference/provider-gateway.md`。 - `bun scripts/cli.ts microservice list/status/health/proxy`:管理和验证挂载在主 server、计算节点 Docker 或 v3s 控制面上的用户服务,OA Event Flow/Todo Note/Baidu Netdisk on main-server、V3S Control/Code Queue/MDTODO/FindJob/Pipeline/MET Nonlinear on D601 的规则见 `docs/reference/microservices.md`。 +- `bun scripts/cli.ts codex deploy `:按已 push 到 remote 的 UniDesk commit 部署 D601 v3s/k8s Code Queue,自动 fetch/export、同步 `/home/ubuntu/cq-deploy`、构建镜像、导入 k3s、apply manifest、rollout 和健康验证;规则见 `docs/reference/codex-deploy.md`。 - `bun scripts/cli.ts codex task `:按 Code Queue 任务 ID 查询初始 prompt、最后 assistant message、工具调用摘要、attempt/judge/error 和耗时,便于新任务引用历史 session。 - `bun scripts/cli.ts codex judge --attempt [--dry-run]`:按指定 task/attempt 用与队列 worker 相同的上下文构建和 MiniMax judge 调用路径单步复现完成判定;`--dry-run` 只输出 prompt/payload 诊断。 - `bun scripts/cli.ts server stop`:以异步 job 停止固定 Compose 项目中的全部 UniDesk 服务,停止后用 `server status` 复核。 @@ -57,4 +58,5 @@ UniDesk 是一个以主 server 为统一入口的分布式工作平台;本文 - `docs/reference/oa-event-flow.md`:统一 OA 事件流微服务、事件表、tag 订阅、Trace/STEP 统计中心和前端可见性规则。 - `docs/reference/pipeline-oa-event-flow.md`:Pipeline/OA 事件流、审核/无审核流转、单步调试、甘特图渲染和最终去残留规则。 - `docs/reference/pipeline-model-proxy.md`:Pipeline v2 model proxy 链路架构、D601 宿主 proxy 服务部署、harness token 注入规则和 smoke test 验证流程。 +- `docs/reference/codex-deploy.md`:D601 Code Queue `codex deploy ` 异步部署管线、路径约定和验证入口。 - `reference`:兼容旧路径的符号链接,指向 `docs/reference/`。 diff --git a/TEST.md b/TEST.md index 956772d3..fedb6037 100644 --- a/TEST.md +++ b/TEST.md @@ -99,7 +99,7 @@ ## T23 D601 Code Queue User Service -阅读 `AGENTS.md`(本项目 `AGENTS.md` 同时承担 `SKILL.md` 对 `scripts/cli.ts` 的解释职责),然后用 cli 手动测试以下内容:运行 `bun scripts/cli.ts microservice list`,确认 `code-queue` 显示为 `providerId=D601`、`public=false`、`frontendOnly=true`、仓库 URL `https://github.com/pikasTech/unidesk`、v3s/k8s `v3s://unidesk/code-queue:4222` 逻辑服务映射、`deployment.mode=v3sctl-managed`、`runtime.orchestrator=v3sctl` 且无业务直连容器摘要;在 D601 v3s/k8s 控制面中使用 `src/components/microservices/v3sctl-adapter/v3s/code-queue.k8s.yaml` 重建/启动 Code Queue,并确认主 server 根目录 `docker-compose.yml` 中不再存在 `code-queue` service。运行 `bun scripts/cli.ts microservice health code-queue`、`bun scripts/cli.ts microservice proxy code-queue /api/dev-ready --raw`、`bun scripts/cli.ts microservice proxy code-queue '/api/tasks/overview?limit=5&transcriptLimit=1&compact=1&afterSeq=0&preferId='` 和 `bun scripts/cli.ts codex task <已有taskId>`,确认链路通过 backend-core、v3sctl-adapter、Kubernetes API service proxy 和 D601 active Code Queue Service,且 task id 查询返回初始 prompt、最后 assistant message、工具调用摘要、attempt/judge/error 和耗时,`queue.storage.primary=postgres`、`queue.storage.postgresReady=true`、`queue.devReady.missingTools=[]`、`queue.devReady.docker.versionOk=true`、`queue.devReady.docker.composeOk=true`;`queue.devReady.ssh.ready` 只在需要跨 Provider SSH/Windows-native 任务时作为强制项。在 D601 `code-queue-backend` 容器内验证主 PostgreSQL 端口映射可执行 `select 1`,主 OA Event Flow 端口映射 `/health` 可访问,本机 ClaudeQQ `http://host.docker.internal:3290/health` 可访问;这些映射不得成为任意公网入口。 +阅读 `AGENTS.md`(本项目 `AGENTS.md` 同时承担 `SKILL.md` 对 `scripts/cli.ts` 的解释职责),然后用 cli 手动测试以下内容:运行 `bun scripts/cli.ts microservice list`,确认 `code-queue` 显示为 `providerId=D601`、`public=false`、`frontendOnly=true`、仓库 URL `https://github.com/pikasTech/unidesk`、v3s/k8s `v3s://unidesk/code-queue:4222` 逻辑服务映射、`deployment.mode=v3sctl-managed`、`runtime.orchestrator=v3sctl` 且无业务直连容器摘要;使用 `bun scripts/cli.ts codex deploy <已push的commitId>` 重建/启动 D601 Code Queue,确认命令立即返回异步 job id,`bun scripts/cli.ts job status --tail-bytes 30000` 能看到 fetch/export、rsync、Docker build、k3s image import、kubectl apply、rollout 和 health 验证进度,并确认主 server 根目录 `docker-compose.yml` 中不再存在 `code-queue` service。运行 `bun scripts/cli.ts microservice health code-queue`、`bun scripts/cli.ts microservice proxy code-queue /api/dev-ready --raw`、`bun scripts/cli.ts microservice proxy code-queue '/api/tasks/overview?limit=5&transcriptLimit=1&compact=1&afterSeq=0&preferId='` 和 `bun scripts/cli.ts codex task <已有taskId>`,确认链路通过 backend-core、v3sctl-adapter、Kubernetes API service proxy 和 D601 active Code Queue Service,且 task id 查询返回初始 prompt、最后 assistant message、工具调用摘要、attempt/judge/error 和耗时,`queue.storage.primary=postgres`、`queue.storage.postgresReady=true`、`queue.devReady.missingTools=[]`、`queue.devReady.docker.versionOk=true`、`queue.devReady.docker.composeOk=true`;`queue.devReady.ssh.ready` 只在需要跨 Provider SSH/Windows-native 任务时作为强制项。在 D601 `code-queue-backend` 容器内验证主 PostgreSQL 端口映射可执行 `select 1`,主 OA Event Flow 端口映射 `/health` 可访问,本机 ClaudeQQ `http://host.docker.internal:3290/health` 可访问;这些映射不得成为任意公网入口。 随后登录公网 frontend `http://74.48.78.17:18081/`,进入 `用户服务 / Code Queue`,确认页面显示默认模型 `gpt-5.5`、默认执行 Provider `D601`、默认工作目录 `/workspace`、模型下拉菜单包含 `gpt-5.4-mini`/`gpt-5.4`/`gpt-5.5`、入队份数、队列指标、任务 ID、复制任务 ID、引用按钮、任务耗时、引用任务 ID、清空输入、创建成功提示、任务提交表单、Trace 输出、attempt 表、MiniMax/fallback judge 状态、追加 prompt、打断和重试控件;通过页面提交一个小任务,确认任务进入 queued/running/succeeded 或可解释的 failed 状态,并且输出区能看到运行中的 Codex 消息。批量验收时设置 `入队份数=5` 或用 `---` 分隔 5 段 prompt,一次性入队 5 条任务,确认 5 条任务按顺序运行并全部进入 succeeded 或可解释的非成功终态,不能只运行第一条后停止;其中任一任务被 judge 判定 `fail` 时只能把当前任务标为 failed,后续 queued 任务仍必须继续推进。测试异常中断时可以提交长任务后点击 `打断`,确认任务变为 canceled 或被 judge 标记为非成功终态;自动重试只应在服务端/传输异常、任务正常结束但 execution record 显示未完成、或 judge 判定 retry 时发生;retry 必须复用已有 Codex thread 并 append 继续执行 prompt,只有当前任务 complete 后才推进队列中的下一个任务。MiniMax judge 必须能处理 Markdown fence/夹杂文本等 JSON 去噪;若去噪后仍失败,必须把解析错误和上一轮去噪前原始回答反馈给 MiniMax 修复后重试,日志中应出现 `judge_json_parse_retry`,且 repair 成功时仍以 `source=minimax` 返回。Codex provider key 只能通过 `OPENAI_API_KEY`、`CRS_OAI_KEY` 这类运行时环境透传,MiniMax API key 只能通过 D601 env-file 运行时环境传入,禁止写入 `config.json`、Dockerfile、源码或测试文档。 diff --git a/docs/reference/cli.md b/docs/reference/cli.md index 0379e6d4..d567b5d7 100644 --- a/docs/reference/cli.md +++ b/docs/reference/cli.md @@ -19,6 +19,7 @@ UniDesk 的统一 CLI 入口是根目录 `scripts/cli.ts`,运行方式固定 - `ssh py [script-args...] < script.py` 把本地 stdin 落到远端临时 `.py` 文件后再以 `python3 -u` 执行并自动清理,避免再手写 `'python3 -'`、heredoc 或多层引号;`script-args` 会按 argv 安全透传给远端脚本。 - `ssh skills [--scope all|wsl|windows] [--limit N]` 发现目标节点上的 WSL/Linux skill 根目录;当 provider 是 WSL 时同一次调用还会扫描 Windows 用户目录下的 `.agents/skills` 与 `.codex/skills`。 - `microservice list/status/health/proxy` 通过 backend-core 内网 API 管理挂载在计算节点 Docker 中的用户服务(底层命令名仍为 microservice);`health` 和 `proxy` 会走真实 backend-core -> provider-gateway -> 节点本机后端链路,`proxy` 对超大 body 默认输出有界预览,规则见 `docs/reference/microservices.md`。 +- `codex deploy ` 创建异步 job,将已 push 到 remote 的 UniDesk commit 部署为 D601 v3s/k8s Code Queue:fetch/export tracked files、同步 `/home/ubuntu/cq-deploy`、构建 `unidesk-code-queue:d601`、导入 k3s containerd、apply manifest、rollout restart 和 live health 验证;详细规则见 `docs/reference/codex-deploy.md`。 - `codex task ` 通过 Code Queue 私有代理按任务 ID 查询结构化执行摘要;默认只返回有界 prompt/response 预览、执行 Provider、工作目录、最后 assistant message、最近工具调用摘要、attempt、judge、错误、耗时和 trace 翻页提示,适合在新队列任务中引用历史 session 且避免噪声爆炸。 - `codex task --trace --tail|--from-start|--after-seq N|--before-seq N --limit N` 按页拉取 Code Queue 的逻辑 trace;响应会返回 `nextAfterSeq`、`previousBeforeSeq`、`hasMore`、`hasBefore` 和下一页/上一页命令,默认 `--trace` 取最新一页,需要完整 prompt/最后 response 时加 `--full`。 - `codex output --tail|--from-start|--after-seq N|--before-seq N --limit N [--full-text]` 按原始 output seq 分页读取底层记录;当 trace 行提示 `commandOmittedLines`、`bodyOmittedLines` 或 `rawSeqs` 时,用该命令按 seq 补取完整信息,默认仍有单条文本预览上限,显式 `--full-text` 才返回该页全文。 @@ -32,7 +33,7 @@ UniDesk 的统一 CLI 入口是根目录 `scripts/cli.ts`,运行方式固定 长时操作采用 Fire-and-Forget 模式:CLI 创建 `.state/jobs/{jobId}.json`,后台进程执行真实命令,并将 stdout、stderr 分别写入 `.state/jobs/{jobId}.stdout.log` 与 `.state/jobs/{jobId}.stderr.log`。调用者通过 `bun scripts/cli.ts job status ` 查询进度和尾部输出。 -`server rebuild` 与 `server start`、`server stop` 一样必须通过返回的 job id 确认结果;不要把连续 `server rebuild` 命令理解成“前一个重建已完成”,因为两个命令只是在快速创建异步 job。重建 frontend 的标准流程是运行 `bun scripts/cli.ts server rebuild frontend`,随后轮询 `bun scripts/cli.ts job status ` 到 `succeeded`,再用 `server status` 或 `e2e run` 验证公网 frontend;重建 Todo Note 后端使用 `bun scripts/cli.ts server rebuild todo-note`,随后用 `microservice health todo-note` 和 `microservice proxy todo-note /api/instances` 验证;重建 Project Manager 后端使用 `bun scripts/cli.ts server rebuild project-manager`,随后用 `microservice health project-manager` 和 `microservice proxy project-manager /api/projects` 验证;重建 Baidu Netdisk 后端使用 `bun scripts/cli.ts server rebuild baidu-netdisk`,随后用 `microservice health baidu-netdisk` 和 `microservice proxy baidu-netdisk /api/transfers` 验证;重建 OA Event Flow 后端使用 `bun scripts/cli.ts server rebuild oa-event-flow`,随后用 `microservice health oa-event-flow` 和 `microservice proxy oa-event-flow /api/diagnostics` 验证。Code Queue 后端由 D601 v3s/k8s 控制面代管,用 `src/components/microservices/v3sctl-adapter/v3s/code-queue.k8s.yaml` 重建,再用 `microservice health code-queue` 和 `microservice proxy code-queue /api/tasks/overview` 验证。不得把 `docker rm` 手工兜底当成正式交付步骤。 +`server rebuild` 与 `server start`、`server stop` 一样必须通过返回的 job id 确认结果;不要把连续 `server rebuild` 命令理解成“前一个重建已完成”,因为两个命令只是在快速创建异步 job。重建 frontend 的标准流程是运行 `bun scripts/cli.ts server rebuild frontend`,随后轮询 `bun scripts/cli.ts job status ` 到 `succeeded`,再用 `server status` 或 `e2e run` 验证公网 frontend;重建 Todo Note 后端使用 `bun scripts/cli.ts server rebuild todo-note`,随后用 `microservice health todo-note` 和 `microservice proxy todo-note /api/instances` 验证;重建 Project Manager 后端使用 `bun scripts/cli.ts server rebuild project-manager`,随后用 `microservice health project-manager` 和 `microservice proxy project-manager /api/projects` 验证;重建 Baidu Netdisk 后端使用 `bun scripts/cli.ts server rebuild baidu-netdisk`,随后用 `microservice health baidu-netdisk` 和 `microservice proxy baidu-netdisk /api/transfers` 验证;重建 OA Event Flow 后端使用 `bun scripts/cli.ts server rebuild oa-event-flow`,随后用 `microservice health oa-event-flow` 和 `microservice proxy oa-event-flow /api/diagnostics` 验证。Code Queue 后端由 D601 v3s/k8s 控制面代管,必须使用 `bun scripts/cli.ts codex deploy ` 部署已 push 的 remote commit,再用 `microservice health code-queue` 和 `microservice proxy code-queue /api/tasks/overview` 验证。不得把 `docker rm` 手工兜底当成正式交付步骤。 ## Output Contract diff --git a/docs/reference/codex-deploy.md b/docs/reference/codex-deploy.md new file mode 100644 index 00000000..3076216b --- /dev/null +++ b/docs/reference/codex-deploy.md @@ -0,0 +1,50 @@ +# Code Queue Deploy + +`bun scripts/cli.ts codex deploy ` 是 D601 Code Queue 的正式部署入口。命令只在主 server 工作区执行;它会立即返回异步 job id,后台 job 通过 backend-core 的 `host.ssh` dispatch 在 D601 完成实际部署。 + +## Command + +```bash +bun scripts/cli.ts codex deploy +bun scripts/cli.ts job status --tail-bytes 30000 +``` + +- `commitId` 必须是已经 push 到 remote 的 7-40 位 hex commit SHA。 +- `--provider-id D601` 是默认值;当前部署路径只支持 D601 active instance。 +- `--timeout-ms N` 控制后台部署总超时,默认 `900000`。 +- `--skip-build` 只用于已确认目标镜像已在 D601 Docker 和 k3s containerd 中存在的诊断场景;正式部署默认必须构建镜像。 + +## Pipeline + +部署 job 的步骤固定为: + +1. 在 D601 `/home/ubuntu/unidesk` 中 `git fetch` remote,并用 `git archive ` 导出 tracked files 到 `/tmp/unidesk-codex-deploy-src-*`。 +2. 用 `rsync --delete` 同步导出的 repo 到 `/home/ubuntu/cq-deploy`,保留 `.state/`、`logs/`、`.git/`、`node_modules/` 和 `dist/`。 +3. 在 `/home/ubuntu/cq-deploy` 构建 `unidesk-code-queue:d601`。 +4. `docker save` 镜像并导入 k3s containerd:`docker exec -i unidesk-v8s-server ctr -n k8s.io images import -`。 +5. `kubectl apply -f src/components/microservices/v3sctl-adapter/v3s/code-queue.k8s.yaml`,其中包含 Code Queue 和 `d601-tcp-egress-gateway`。 +6. `kubectl -n unidesk rollout restart deployment/d601-tcp-egress-gateway deployment/code-queue` 并等待 rollout 完成。 +7. 通过 `bun scripts/cli.ts microservice health code-queue` 等价的 backend-core 路径验证 live health。 + +## Observability + +`codex deploy` 本身不阻塞等待部署结束。返回 JSON 中的 `statusCommand` 和 `tailCommand` 是唯一状态入口。后台 job 的 stderr 是 JSONL progress,每个长步骤会记录远端 `/tmp/unidesk-codex-deploy-*.log` 和 sentinel 文件;失败时 `job status` 会显示最后日志尾部。 + +`job status` 到 `succeeded` 后,还要用以下命令做 live 验证: + +```bash +bun scripts/cli.ts microservice health code-queue +bun scripts/cli.ts microservice proxy code-queue '/api/tasks/overview?limit=5&transcriptLimit=1&compact=1&afterSeq=0&preferId=' +``` + +## Boundaries + +Code Queue 由 D601 v3s/k8s 控制面代管,不再通过 `server rebuild` 或手工 `docker compose up` 作为正式部署路径。`codex deploy` 可以在 Code Queue 自身正在执行任务时运行;服务重启后由 restart-recovery 恢复任务状态,不能等待当前 Code Queue task 退出后再部署。 + +## TCP Egress Gateway + +D601 k3s Pod 不能依赖主 server 公开 `15432`/`4255` 作为直连 TCP 入口;PostgreSQL 也不会自动使用 HTTP proxy 环境变量。`code-queue.k8s.yaml` 因此提供一个通用 `d601-tcp-egress-gateway`: + +- Pod 内业务只访问集群内 Service:`d601-tcp-egress-gateway.unidesk.svc.cluster.local:15432` 和 `:4255`。 +- gateway 通过 D601 provider-gateway egress proxy 的 HTTP CONNECT 转发到底层 TCP 目标。 +- 新增 TCP 依赖时只扩展 `TCP_EGRESS_ROUTES`,不要在业务容器里散落公网直连地址或 ad hoc 隧道脚本。 diff --git a/docs/reference/deployment.md b/docs/reference/deployment.md index 13d921d6..d3ea7196 100644 --- a/docs/reference/deployment.md +++ b/docs/reference/deployment.md @@ -32,7 +32,7 @@ Compose v2 安装后仍然必须遵守 UniDesk 的服务控制入口:全栈生 ## Single Service Rebuild -前端、backend-core、本机 provider-gateway 或主 server 承载的 Todo Note/Project Manager/Baidu Netdisk/OA Event Flow 用户服务需要重建时,统一使用 `bun scripts/cli.ts server rebuild `,其中 `` 只能是 `backend-core`、`frontend`、`provider-gateway`、`todo-note`、`project-manager`、`baidu-netdisk` 或 `oa-event-flow`。Code Queue、File Browser、FindJob、Pipeline、MET Nonlinear 和 ClaudeQQ 部署在计算节点,不属于主 server Compose 可重建服务。该命令先执行目标服务镜像构建,构建成功后才通过 `up -d --no-deps --force-recreate ` 替换目标容器,避免构建失败导致运行中的服务被提前停掉。 +前端、backend-core、本机 provider-gateway 或主 server 承载的 Todo Note/Project Manager/Baidu Netdisk/OA Event Flow 用户服务需要重建时,统一使用 `bun scripts/cli.ts server rebuild `,其中 `` 只能是 `backend-core`、`frontend`、`provider-gateway`、`todo-note`、`project-manager`、`baidu-netdisk` 或 `oa-event-flow`。Code Queue、File Browser、FindJob、Pipeline、MET Nonlinear 和 ClaudeQQ 部署在计算节点,不属于主 server Compose 可重建服务;其中 D601 Code Queue 的正式入口是 `bun scripts/cli.ts codex deploy `。该命令先执行目标服务镜像构建,构建成功后才通过 `up -d --no-deps --force-recreate ` 替换目标容器,避免构建失败导致运行中的服务被提前停掉。 frontend 改动必须明确上线到公网:修改 `src/components/frontend/src/`、`src/components/frontend/public/style.css`、frontend 使用的共享 TSX/TS 模块或 WebUI 导航后,必须在同一变更集中执行 `bun scripts/cli.ts server rebuild frontend`,并等待 job 成功。公网 WebUI 的 `/app.js` 是 `unidesk-frontend` 容器启动时从镜像内源码转译生成的运行时 bundle;只改工作区文件、只跑 `bun run check`、只跑 `Bun.build` 或只刷新浏览器都不会替换已经运行的容器。 @@ -42,7 +42,7 @@ frontend 的 Docker 上线顺序为:先运行必要的本地校验,例如 `b 紧急灾备或数据迁移期间如需手工启动单个 Compose service,也必须保持与 CLI 相同的隔离语义:使用固定 `--env-file .state/docker-compose.env` 和 `up -d --no-deps `,只启动目标容器;如果需要刷新 backend-core 的服务目录或环境变量,应把 `backend-core` 作为显式目标单独重建/替换,不能依赖 `up` 的依赖解析顺手重建 database、backend-core 或其他服务。 -正式流程不得依赖人工 `docker rm` 兜底;手工删除旧容器后若 job、Docker client 或 daemon 在 `up` 前中断,会直接造成用户服务代理失败。`server rebuild ` 必须是可观测 job:build-first、Compose lock、no-deps force-recreate、post-up validation、保留 named volume。Code Queue 等计算节点长任务服务即使被重建也必须依赖服务自身 restart-recovery 恢复任务,不能用“避免重建”掩盖恢复缺陷。 +正式流程不得依赖人工 `docker rm` 兜底;手工删除旧容器后若 job、Docker client 或 daemon 在 `up` 前中断,会直接造成用户服务代理失败。`server rebuild ` 和 `codex deploy ` 都必须是可观测 job:build-first、受控替换、post-up validation、保留命名卷或 `.state` 运行态目录。Code Queue 等计算节点长任务服务即使被重建也必须依赖服务自身 restart-recovery 恢复任务,不能用“避免重建”掩盖恢复缺陷。 ## User Service Restart Recovery diff --git a/docs/reference/microservices.md b/docs/reference/microservices.md index cbf89aac..36f3aaa6 100644 --- a/docs/reference/microservices.md +++ b/docs/reference/microservices.md @@ -146,12 +146,12 @@ Baidu Netdisk 在 UniDesk 语境中按纯后端服务管理:不得暴露百度 - Direct path ban:`code-queue` 不得再登记 `http://code-queue:4222`、`http://host.docker.internal:4222`、NodePort 或 provider-gateway `microservice.http` 作为业务代理目标;frontend 也不得使用旧 `/api/code-queue-direct` 兼容别名作为 Code Queue 页面数据源。provider-gateway 只允许用于维护 D601/D518、部署 adapter、部署 v3s/k8s 节点或诊断节点本机容器。 - 实例语义:D601 是默认 active/single-writer 实例,`CODE_QUEUE_INSTANCE_ID=D601` 且 `CODE_QUEUE_SCHEDULER_ENABLED=true`;D518 是 standby 实例,必须设置 `CODE_QUEUE_INSTANCE_ID=D518`、`CODE_QUEUE_SCHEDULER_ENABLED=false` 和 `CODE_QUEUE_STARTUP_OA_BACKFILL_ENABLED=false`,避免两个实例同时消费同一 PostgreSQL 队列或重复回放 OA 统计。D601 active 也默认关闭 `CODE_QUEUE_STARTUP_OA_BACKFILL_ENABLED`;历史 OA Trace/STEP 回填必须通过显式 `/api/oa/backfill` 运维动作触发,不能在每次 Pod 重启时自动批量发布旧事件。 - 部署引用:Code Queue 镜像仍复用 `src/components/microservices/code-queue/Dockerfile`,Kubernetes 运行清单为 `src/components/microservices/v3sctl-adapter/v3s/code-queue.k8s.yaml`,`config.json` 对外记录 v3s manifest `src/components/microservices/v3sctl-adapter/v3s/code-queue.v3s.json`;主 server 根目录 `docker-compose.yml` 不包含 `code-queue` service,旧 D601 direct Compose 文件只作为迁移/本地诊断参考,不是正式运行入口。 -- 主服务依赖映射:Code Queue 仍以主 PostgreSQL 为权威数据库,`DATABASE_URL` 必须指向主 server 受限端口映射;`OA_EVENT_FLOW_BASE_URL` 必须指向主 server OA Event Flow 受限端口映射;D601 active 实例的 `CODE_QUEUE_NOTIFY_CLAUDEQQ_BASE_URL` 直接使用本机 ClaudeQQ 映射 `http://host.docker.internal:3290`。这些端口映射只服务受控节点运行时,必须用防火墙或等价策略限制来源,不得成为浏览器或任意公网客户端入口。 +- 主服务依赖映射:Code Queue 仍以主 PostgreSQL 为权威数据库,但 D601 k3s Pod 不能依赖公网直连 `74.48.78.17:15432/4255`。Pod 内 `DATABASE_URL` 和 `OA_EVENT_FLOW_BASE_URL` 必须指向集群内 `d601-tcp-egress-gateway` Service,再由该 gateway 通过 D601 provider-gateway egress proxy 的 HTTP CONNECT 转发到主 PostgreSQL 和 OA Event Flow;新增 TCP 依赖时扩展 `TCP_EGRESS_ROUTES`,不得在业务容器里新增一次性公网直连或 ad hoc 隧道。D601 active 实例的 `CODE_QUEUE_NOTIFY_CLAUDEQQ_BASE_URL` 直接使用本机 ClaudeQQ 映射 `http://host.docker.internal:3290`。这些端口映射只服务受控节点运行时,必须用防火墙或等价策略限制来源,不得成为浏览器或任意公网客户端入口。 - K8s 探针与启动维护:Kubernetes liveness/startup probe 必须使用轻量 `/live`,readiness 和用户服务健康使用 `/health`;`/health` 不得执行全量任务聚合、历史回填或长事务索引维护,历史任务总览应由 `/api/tasks/overview` 读取 PostgreSQL。启动时允许后台执行队列元数据 flush、通知 outbox 读取、任务表索引维护和 overview warmup,但这些维护不得阻塞 Bun server、readiness endpoint 或 frontend overview;通知表索引和大批量 OA backfill 不得作为默认启动副作用。 - MiniMax/OpenCode 并发:`minimax-m2.7` 通过 OpenCode JSON 事件端口运行;每个 Code Queue task 必须使用独立的 OpenCode XDG data/config/cache/state 目录,禁止多队列并发任务共享同一个 OpenCode SQLite/WAL 状态目录,否则并发 smoke 会触发 `PRAGMA journal_mode = WAL` 之类的数据库锁或初始化错误。用于验证 v3s/k8s 链路的 MiniMax smoke 以“至少 4 个任务、分布到 2 个 queue、至少 2 个终态成功”为链路验收线;剩余失败如果是 OpenCode 最终回复捕获、业务任务判定或模型限流,应作为 Code Queue 执行可靠性问题单独排查,不能反推 v3s 代理链路失败。 - 默认出网代理:D601 active Code Queue Pod 必须默认把 `HTTP_PROXY`、`HTTPS_PROXY` 和 `ALL_PROXY` 注入给 Codex/OpenCode、`git`、`curl`、`npm` 等任务子进程;当前唯一上游是 D601 provider-gateway egress HTTP CONNECT 代理,并通过 Kubernetes `Service d601-provider-egress-proxy` 暴露给 `unidesk` namespace 内的 Pod。该 Service 的 EndpointSlice 指向 D601 provider-gateway 私有 Docker network endpoint,Pod 内代理 URL 使用 `http://d601-provider-egress-proxy.unidesk.svc.cluster.local:18789`,provider-gateway 宿主端口仍只允许绑定 `127.0.0.1`,不得开放公网;如 provider-gateway 容器 IP 变化,必须同步刷新 EndpointSlice 并用 Code Queue `/health.egressProxy.connected=true` 验证。这里的 provider-gateway 只承担出网代理,不承担 Code Queue 业务 HTTP 代理;业务访问仍只能走 Kubernetes API service proxy。k3s/k8s 原生 egress gateway、service mesh 或 CNI egress policy 只作为后续网络层增强方向,当前交付态不引入第二套出网控制面。远程开发/执行容器不得只依赖这些环境变量,必须在容器网络层用 TUN 默认路由和 OUTPUT 防火墙强制外网流量只能经 master TUN 出口。 - 出网代理无 fallback 纪律:Code Queue 的运行时配置只允许一个默认出网路径,即 provider-gateway egress proxy;不得在代码中同时保留 Code Queue 自建 WebSocket proxy、临时 shell proxy、D601 本地直连公网、主 server direct HTTP proxy 等隐式分支。任何新增网络 fallback 都必须先进入本参考文档并配套 `/health` 可见状态,否则视为残留旧路径。 -- 上线纪律:Code Queue 相关的前端或后端改进必须在同一任务内正式上线并验证公网 frontend 或 live API,不能只停留在源码、构建产物或“后续再上线”。修改 Code Queue 自身时不得等待当前 Code Queue task 结束、等待 queue idle 或等待 `0 running` 后才重启;应通过 v3s 控制面或 D601/D518 维护入口做 build-first 替换,并用 v3s adapter、Code Queue live API 或公网 frontend 证明任务和队列仍可读可继续。 +- 上线纪律:Code Queue 相关的前端或后端改进必须在同一任务内正式上线并验证公网 frontend 或 live API,不能只停留在源码、构建产物或“后续再上线”。修改 Code Queue 自身时不得等待当前 Code Queue task 结束、等待 queue idle 或等待 `0 running` 后才重启;D601 active 实例的正式后端部署入口是 `bun scripts/cli.ts codex deploy `,它按已 push 的 remote commit 做 build-first 镜像替换、k3s image import、manifest apply、rollout 和健康验证,并用 v3s adapter、Code Queue live API 或公网 frontend 证明任务和队列仍可读可继续。 - 更名与灾备恢复:旧版 Codex 队列服务名只允许作为兼容诊断和一次性迁移来源;`code-queue-backend` 容器自身 `/health` 正常但 `microservice health code-queue` 返回 provider 直连错误时,优先判定为 backend-core 仍加载旧 `MICROSERVICES_JSON` 或 adapter manifest 未刷新,必须刷新 `.state/docker-compose.env`、重建/替换 `backend-core` 与 `v3sctl-adapter`,随后用 `microservice list` 验证 `code-queue` 的 `runtime.orchestrator=v3sctl`、`backend.proxyMode=v3sctl-adapter-http` 和无业务容器直连摘要。 - Codex 认证:容器只从 D601 的 `/home/ubuntu/.codex/config.toml` 同步 Codex provider 配置到 D601 `.state/code-queue/codex-home`,并通过 D601 `.state/code-queue-d601.env` 透传 `OPENAI_API_KEY`、`CRS_OAI_KEY` 等 provider 所需变量;这些 provider 环境变量不得写入仓库,必须由 D601 Compose env-file 注入,确保容器重建和重启后不会丢失认证。新增 provider 的 `env_key` 时必须增加同类运行时透传和 Compose env 持久化,禁止把 Codex 或 MiniMax 密钥写入仓库文件。Code Queue 容器必须只读挂载 D601 host 的 SSH 目录到 `/root/.ssh`(默认 `/home/ubuntu/.ssh`),让容器内 `git push`、`ssh -T git@github.com` 与 host 使用同一套 GitHub SSH key/known_hosts;不得把私钥复制进镜像或仓库。 - Develop-ready 镜像:Code Queue 镜像必须在启动前预装 UniDesk/Pipeline 调试所需工具,至少包含 `codex`、`bun`、`node`、`npm`/`npx`、`git`、`rg`、`curl`、`python3`/`pip3`、`docker`、`docker compose`、`docker-compose`、`jq`、`ssh`、`rsync`、`make`、`gcc`/`g++`、`iptables`、`tar`、`gzip` 和 `unzip`;不得依赖 Codex 任务运行时再 `apt-get install` 这些基础环境。 @@ -162,7 +162,7 @@ Baidu Netdisk 在 UniDesk 语境中按纯后端服务管理:不得暴露百度 - Codex 控制:服务内部启动 `codex app-server --listen stdio://`,用 JSON-RPC 调用 `thread/start`、`turn/start`、`turn/steer` 和 `turn/interrupt`,并监听 `turn/completed`、assistant delta、reasoning delta、command output delta、file diff delta 等通知生成前端可轮询的 transcript。 - 用户输入持久化:任务初始 prompt 以 `basePrompt/displayPrompt` 作为结构化来源,运行中追加的 `turn/steer` prompt 必须写入 `promptHistory`;transcript 构建时从这些结构化字段合成 `Submitted prompt` 和 `Steer prompt`,不能只依赖有 600 条上限的 raw output,否则长任务输出增长后会丢失关键人工指令。 - 队列语义:`POST /api/tasks` 或 `/api/tasks/batch` 入队,服务始终只运行一个 Codex turn;当前任务真正终止后才推进下一个任务。`GET /api/tasks` 与 `GET /api/tasks/{id}` 返回队列、attempt、judge 和输出;`GET /api/tasks/{id}/summary` 返回按任务 ID 查询的结构化摘要,包括初始 prompt、最后 assistant message、工具调用摘要、attempt、judge、错误和耗时;CLI 入口是 `bun scripts/cli.ts codex task `。`GET|POST /api/tasks/{id}/judge?attempt=N` 与 CLI `bun scripts/cli.ts codex judge --attempt N` 用于单步复现指定 attempt 的 judge,必须复用真实队列 worker 的上下文构建、prompt 压缩、MiniMax 调用、JSON 去噪/repair 和 fallback 路径;`dryRun=1`/`--dry-run` 只输出 prompt/payload 和重建诊断,不调用 MiniMax。`POST /api/tasks/{id}/steer` 向运行中 turn 推入 prompt;`POST /api/tasks/{id}/interrupt` 或 `DELETE /api/tasks/{id}` 打断/取消;`POST /api/tasks/{id}/retry` 手动重试。队列 worker 必须隔离单个 task 的异常,不能因为某个 app-server、judge 异常或 judge 判定 `fail` 让后续 queued 任务停止;`fail` 只把当前任务标为 failed,随后必须继续扫描并推进下一个 queued/retry_wait 任务。当存在 queued/retry_wait 且 worker 空闲时,watchdog 必须自动重新调度。 -- 稳定性与重启恢复:Code Queue 的第一目标是长期稳定可用;部署修复或运维排障时不得因为担心容器重启会打断任务而拒绝重启、重建或替换 `code-queue-backend`。容器重启、服务进程重启和镜像替换后,队列、`promptHistory`、running/judging/retry_wait 任务和 active session 元数据必须从 PostgreSQL 恢复,并在已有 `codexThreadId` 可用时用 `thread/resume` 和 continuation prompt 无缝继续当前任务;如果原 app-server turn 已丢失,也必须把当前任务恢复到可 retry/continue 的状态,不能错误推进下一个任务或永久卡住。D601 侧重建必须走 `src/components/microservices/code-queue/docker-compose.d601.yml`,并且必须在 build 后执行 force-recreate 与 post-up health validation;禁止先手工 `docker rm` 再依赖后续命令补救,因为中断窗口会让 Pod/容器消失并触发 frontend/core 用户服务代理失败。重启后出现 active task 丢失、手动 steer/interrupt 记录丢失、running 任务卡死、误判完成、跳过当前任务、容器消失或阻塞队列,均属于 Code Queue 的 P0 核心缺陷,必须先修复并补充 restart-recovery 验收,不能把“避免重启”作为交付策略。 +- 稳定性与重启恢复:Code Queue 的第一目标是长期稳定可用;部署修复或运维排障时不得因为担心容器重启会打断任务而拒绝重启、重建或替换 active Pod。容器重启、服务进程重启和镜像替换后,队列、`promptHistory`、running/judging/retry_wait 任务和 active session 元数据必须从 PostgreSQL 恢复,并在已有 `codexThreadId` 可用时用 `thread/resume` 和 continuation prompt 无缝继续当前任务;如果原 app-server turn 已丢失,也必须把当前任务恢复到可 retry/continue 的状态,不能错误推进下一个任务或永久卡住。D601 侧重建必须走 `bun scripts/cli.ts codex deploy `;禁止先手工 `docker rm` 或只手工 `docker compose up` 再依赖后续命令补救,因为中断窗口会让 Pod/容器消失并触发 frontend/core 用户服务代理失败。重启后出现 active task 丢失、手动 steer/interrupt 记录丢失、running 任务卡死、误判完成、跳过当前任务、容器消失或阻塞队列,均属于 Code Queue 的 P0 核心缺陷,必须先修复并补充 restart-recovery 验收,不能把“避免重启”作为交付策略。 - 调度与 active run slot:Code Queue 必须把“queue processor 正在等待/退避/轮询”和“实际占用 Codex/OpenCode 子进程运行槽”分开建模;`CODE_QUEUE_MAX_ACTIVE_QUEUES` 只限制真实 active run slot,不能把 retry backoff、等待内存下降或等待前序任务的 `processingQueues` 计入 active slot,否则设置全局 active slot 上限时,一个空等队列会把其他 runnable queue 永久饿死。多个 queue 同时等待 active slot 时必须显式维护 FIFO waiter 队列,避免某个长 retry/backoff 队列刚释放 slot 就立刻重抢,导致更早进入等待的 `retry_wait` 任务长期饥饿;`/health` 必须同时暴露真实 `activeQueueIds`、`activeRunSlotCount`、等待中的 `processingQueueIds` 和 active slot waiters,排障时以 active run slot 与 waiter 顺序判断是否真的有任务在跑、谁应下一个启动。restart-recovery 后的 `retry_wait` 任务若缺失 `codexThreadId`/OpenCode session id,不得无限拒绝 retry;必须用紧凑 recovery prompt 和原始任务摘要重新开一个 agent thread/session,让任务继续推进并在 Trace 中留下 recovery 证据。任何修改 scheduler、retry backoff、queue move、manual retry、shutdown recovery 或内存等待逻辑时,都必须保留“空等 processor 不占 active run slot”、“等待者 FIFO 不饥饿”和“缺失 thread/session 可恢复”的自测或 live 验证。 - 内存优化过程与防回归:Code Queue 已迁移到 D601,但内存治理仍必须按“PostgreSQL 权威源优先、进程热状态最小化、容器硬上限兜底”的顺序设计。长期可复用的优化路径是:先确认任务、queue、readAt、promptHistory、active session 和通知 outbox 均可从 PostgreSQL 恢复;再把历史任务列表、详情、统计、Trace/output 和 `/health` 的只读查询改为 PostgreSQL 直读或聚合查询;随后只把 `queued`、`running`、`judging`、`retry_wait` 等调度必需任务载入 Bun 堆,并在 PostgreSQL 查询侧裁剪 hot `output`/`events`;最后用 dirty-only flush、append-only 输出归档、Codex SQLite 小批量导出、`bun --smol`、`mem_limit=600m`、`memswap_limit=1536m`、`NODE_OPTIONS=--max-old-space-size=768` 和 cgroup memory watchdog 作为运行时防线。PostgreSQL 到进程的单次读取足够快,不能为了减少 SQL 查询把全部历史 `task_json`、Trace、output 或统计摘要常驻内存;任何新增缓存都必须有默认较小的环境变量上限、明确淘汰策略、可从 PostgreSQL 或 append-only 归档重建,且不得影响重启恢复。新增或修改 `/api/tasks`、overview、stats、summary、transcript、output、trace、health、flush、scheduler 和通知路径时,禁止在常规请求中调用会物化全量历史任务 JSON 的代码,禁止启动后无条件重写全量历史 task JSON,禁止用未设上限的 `Map`/数组保存历史 output/event/Trace,`CODE_QUEUE_MAX_ACTIVE_QUEUES=0` 表示不按 queue 数量设置全局排队上限;如显式设置为正数,必须同时说明内存预算并补充内存压测验收。memory watchdog 必须以 cgroup working set 为主要判断,且在 swap 仍有余量时不得提前杀掉唯一 active run;否则 TypeScript/Playwright 这类短时高内存验证会被错误中断并让 retry 队列反复震荡。 - 列表/详情延迟优化原则:Code Queue 控制面交互的长期目标是常规历史规模下首屏、`GET /api/tasks/overview`、`POST /api/tasks//read` 和分页加载均在 1s 内完成;性能面板出现十几秒级 `core_proxy` 或 Code Queue 用户服务代理慢操作时,必须优先按后端查询形态和前后端通信策略定位,不能把问题归因于 React 渲染后只改 UI。后端优化顺序是:先为 queue、status、updated/created 时间、readAt/terminal unread 和常用筛选条件补齐 PostgreSQL 索引;再用 SQL `COUNT`、`GROUP BY`、条件聚合和分页 ID 查询生成 queue/status/stats/unread 摘要;随后按 ID 轻量加载当前页、selected、active 和 unread priority task,禁止为了列表或已读操作解析完整 Trace、output archive、Codex transcript 或物化全量历史 `task_json`。`read`/`read-all` 这类 mutation 必须是 SQL-only 更新并返回最小 patch/queue 计数,不能触发 overview 全量重算或重载所有任务;启动 warm 只能预热小体积聚合和索引路径,不得把历史任务作为常驻缓存。允许 frontend/backend 代理使用秒级、严格有界、mutation 自动失效的 overview micro-cache 来吸收重复刷新,但 cache 只能作为抖动保护,不能替代数据库索引、聚合查询和分页披露,也不能让 stale readAt/queue/status 状态跨设备可见。 diff --git a/scripts/cli.ts b/scripts/cli.ts index dad6a4ed..0a32e528 100644 --- a/scripts/cli.ts +++ b/scripts/cli.ts @@ -9,6 +9,7 @@ import { runSsh } from "./src/ssh"; import { extractRemoteCliOptions, runRemoteCli } from "./src/remote"; import { runMicroserviceCommand } from "./src/microservices"; import { runCodeQueueCommand } from "./src/code-queue"; +import { runCodexDeployCommand } from "./src/codex-deploy"; import { runProviderCommand } from "./src/provider-attach"; import { runScheduleCommand } from "./src/schedules"; @@ -44,6 +45,7 @@ function help(): unknown { { command: "microservice proxy [--method GET|POST|PUT|PATCH|DELETE] [--raw] [--max-body-bytes N]", description: "Access a private user-service backend path through the same frontend-only proxy used by WebUI; large bodies are summarized unless --raw is set." }, { command: "schedule list|get|runs|run|delete", description: "Manage backend-core scheduled tasks and run history; schedule run supports --wait-ms N." }, { command: "schedule upsert-pgdata-backup [--time HH:MM] [--remote-base /SERVER_DATA/UNIDESK_PG_DATA]", description: "Create or update the daily PGDATA physical backup task that uploads monthly rotated archives to Baidu Netdisk." }, + { command: "codex deploy [--provider-id D601] [--timeout-ms N] [--skip-build]", description: "Start an async D601 v3s/k8s Code Queue deployment job from a specific remote git commit." }, { command: "codex task [--trace --tail|--from-start|--after-seq N|--before-seq N --limit N] [--full]", description: "Fetch a compact Code Queue task summary; trace rows are opt-in and paged with next/previous commands to avoid output explosion." }, { command: "codex output [--tail|--from-start|--after-seq N|--before-seq N --limit N] [--full-text]", description: "Fetch paged raw Code Queue output records by seq when a trace row has omitted command/output text." }, { command: "codex judge --attempt N [--dry-run] [--include-prompt]", description: "Replay one stored Code Queue attempt through the same judge context builder and MiniMax judge call path used by the live queue worker." }, @@ -188,6 +190,13 @@ async function main(): Promise { } if (top === "codex") { + if (sub === "deploy") { + const result = await runCodexDeployCommand(config, args.slice(2)); + const ok = (result as { ok?: unknown }).ok !== false; + emitJson(commandName, result, ok); + if (!ok) process.exitCode = 1; + return; + } emitJson(commandName, await runCodeQueueCommand(config, args.slice(1))); return; } diff --git a/scripts/src/check.ts b/scripts/src/check.ts index ab556240..8c7743c4 100644 --- a/scripts/src/check.ts +++ b/scripts/src/check.ts @@ -70,6 +70,7 @@ export function runChecks(config: UniDeskConfig): { ok: boolean; items: CheckIte fileItem("src/components/microservices/oa-event-flow/src/index.ts"), fileItem("src/components/microservices/v3sctl-adapter/src/index.ts"), fileItem("src/components/microservices/mdtodo/src/index.ts"), + fileItem("scripts/src/codex-deploy.ts"), fileItem("scripts/src/e2e.ts"), unifiedLogRotationItem(), commandItem("bun:version", ["bun", "--version"]), diff --git a/scripts/src/codex-deploy.ts b/scripts/src/codex-deploy.ts new file mode 100644 index 00000000..43bc2761 --- /dev/null +++ b/scripts/src/codex-deploy.ts @@ -0,0 +1,601 @@ +import { startJob, type JobRecord } from "./jobs"; +import { coreInternalFetch } from "./microservices"; +import { type UniDeskConfig, rootPath } from "./config"; + +const defaultProviderId = "D601"; +const defaultTimeoutMs = 900_000; +const pollIntervalMs = 5_000; +const shortDispatchWaitMs = 25_000; +const shortRemoteTimeoutMs = 20_000; + +const defaultSourceRepoDir = "/home/ubuntu/unidesk"; +const defaultDeployDir = "/home/ubuntu/cq-deploy"; +const defaultKubeconfigPath = "/home/ubuntu/cq-deploy/.state/v8s/kubeconfig"; +const defaultK3sContainer = "unidesk-v8s-server"; +const defaultImageTag = "unidesk-code-queue:d601"; +const k8sNamespace = "unidesk"; +const k8sDeployment = "code-queue"; +const tcpEgressDeployment = "d601-tcp-egress-gateway"; +const k8sManifestRelPath = "src/components/microservices/v3sctl-adapter/v3s/code-queue.k8s.yaml"; + +interface DeployCliOptions { + commitId: string; + providerId: string; + repoUrl: string; + sourceRepoDir: string; + deployDir: string; + kubeconfigPath: string; + k3sContainer: string; + imageTag: string; + timeoutMs: number; + skipBuild: boolean; + runNow: boolean; +} + +interface StepResult { + step: string; + ok: boolean; + detail: string; + startedAt: string; + finishedAt: string; + raw?: unknown; +} + +interface DispatchResult { + ok: boolean; + taskId: string | null; + status: string | null; + stdout: string; + stderr: string; + exitCode: number | null; + raw: unknown; +} + +interface BackgroundPoll { + done: boolean; + exitCode: number | null; + logTail: string; + raw: unknown; +} + +function asRecord(value: unknown): Record | null { + return typeof value === "object" && value !== null && !Array.isArray(value) ? value as Record : null; +} + +function asString(value: unknown): string { + return typeof value === "string" ? value : ""; +} + +function optionValue(args: string[], names: string[]): string | undefined { + for (const name of names) { + const index = args.indexOf(name); + if (index === -1) continue; + const raw = args[index + 1]; + if (raw === undefined || raw.length === 0) throw new Error(`${name} requires a non-empty value`); + return raw; + } + return undefined; +} + +function positiveIntegerOption(args: string[], names: string[], defaultValue: number): number { + const raw = optionValue(args, names); + if (raw === undefined) return defaultValue; + const value = Number(raw); + if (!Number.isInteger(value) || value <= 0) throw new Error(`${names[0]} must be a positive integer`); + return value; +} + +function positionalArgs(args: string[]): string[] { + const result: string[] = []; + for (let index = 0; index < args.length; index += 1) { + const value = args[index] ?? ""; + if (value.startsWith("--")) { + if (!["--skip-build", "--run-now"].includes(value)) index += 1; + continue; + } + result.push(value); + } + return result; +} + +function codeQueueRepoUrl(config: UniDeskConfig): string { + const service = config.microservices.find((item) => item.id === "code-queue"); + if (service === undefined) throw new Error("config.json does not contain microservice id=code-queue"); + return service.repository.url; +} + +function parseOptions(config: UniDeskConfig, args: string[]): DeployCliOptions { + const commitId = optionValue(args, ["--commit", "--commit-id"]) ?? positionalArgs(args)[0] ?? ""; + if (commitId.length === 0) { + throw new Error("codex deploy requires a commit ID: codex deploy "); + } + return { + commitId, + providerId: optionValue(args, ["--provider-id", "--provider"]) ?? defaultProviderId, + repoUrl: optionValue(args, ["--repo-url"]) ?? codeQueueRepoUrl(config), + sourceRepoDir: optionValue(args, ["--source-repo-dir"]) ?? defaultSourceRepoDir, + deployDir: optionValue(args, ["--deploy-dir"]) ?? defaultDeployDir, + kubeconfigPath: optionValue(args, ["--kubeconfig"]) ?? defaultKubeconfigPath, + k3sContainer: optionValue(args, ["--k3s-container"]) ?? defaultK3sContainer, + imageTag: optionValue(args, ["--image"]) ?? defaultImageTag, + timeoutMs: positiveIntegerOption(args, ["--timeout-ms"], defaultTimeoutMs), + skipBuild: args.includes("--skip-build"), + runNow: args.includes("--run-now"), + }; +} + +function validateOptions(options: DeployCliOptions): void { + if (!/^[0-9a-f]{7,40}$/iu.test(options.commitId)) { + throw new Error(`commit id must be a 7-40 character hex SHA, got: ${options.commitId}`); + } + if (!/^[A-Za-z0-9_.-]+$/u.test(options.providerId)) throw new Error(`invalid provider id: ${options.providerId}`); + if (options.providerId !== defaultProviderId) { + throw new Error(`codex deploy currently supports only ${defaultProviderId}; got ${options.providerId}`); + } + if (!/^https?:\/\//u.test(options.repoUrl) && !/^[A-Za-z0-9_.-]+@/u.test(options.repoUrl)) { + throw new Error(`repo url must be an http(s) or ssh git URL, got: ${options.repoUrl}`); + } +} + +function shellQuote(value: string): string { + return `'${value.replace(/'/g, `'\\''`)}'`; +} + +function nowIso(): string { + return new Date().toISOString(); +} + +function elapsedMs(startedAt: number): number { + return Math.max(0, Date.now() - startedAt); +} + +function compactTail(text: string, maxChars = 900): string { + return text.length > maxChars ? text.slice(text.length - maxChars) : text; +} + +function progressLine(step: string, message: string, detail?: unknown): void { + const payload = detail === undefined + ? { at: nowIso(), step, message } + : { at: nowIso(), step, message, detail }; + process.stderr.write(`${JSON.stringify(payload)}\n`); +} + +function coreBody(response: unknown): Record | null { + const record = asRecord(response); + return asRecord(record?.body); +} + +async function dispatchSsh( + config: UniDeskConfig, + providerId: string, + command: string, + cwd: string | null, + waitMs = shortDispatchWaitMs, + remoteTimeoutMs = shortRemoteTimeoutMs, +): Promise { + const dispatchResponse = coreInternalFetch("/api/dispatch", { + method: "POST", + body: { + providerId, + command: "host.ssh", + payload: { + source: "codex-deploy", + mode: "exec", + command, + timeoutMs: remoteTimeoutMs, + ...(cwd === null ? {} : { cwd }), + }, + }, + }); + const dispatchBody = coreBody(dispatchResponse); + const taskId = asString(dispatchBody?.taskId); + if (dispatchBody?.ok !== true || taskId.length === 0) { + return { + ok: false, + taskId: taskId || null, + status: null, + stdout: "", + stderr: asString(dispatchBody?.error) || "dispatch did not return a task id", + exitCode: null, + raw: dispatchResponse, + }; + } + + const deadline = Date.now() + waitMs; + let latest: unknown = null; + while (Date.now() < deadline) { + latest = coreInternalFetch(`/api/tasks/${encodeURIComponent(taskId)}`); + const task = asRecord(coreBody(latest)?.task); + const status = asString(task?.status); + if (status === "succeeded" || status === "failed") { + const result = asRecord(task?.result); + const exitCode = typeof result?.exitCode === "number" ? result.exitCode : null; + const stdout = asString(result?.stdout); + const stderr = asString(result?.stderr); + return { + ok: status === "succeeded" && (exitCode === null || exitCode === 0), + taskId, + status, + stdout, + stderr, + exitCode, + raw: task, + }; + } + await Bun.sleep(500); + } + + return { + ok: false, + taskId, + status: "timeout", + stdout: "", + stderr: `host.ssh task ${taskId} did not finish within ${waitMs}ms`, + exitCode: null, + raw: latest, + }; +} + +async function launchBackground( + config: UniDeskConfig, + providerId: string, + shellScript: string, + cwd: string, + logFile: string, + sentinelFile: string, +): Promise<{ ok: boolean; pid: string; raw: unknown; error: string }> { + const wrapped = [ + `bash -lc ${shellQuote(shellScript)}`, + "code=$?", + `printf '%s\\n' "$code" > ${shellQuote(sentinelFile)}`, + "exit \"$code\"", + ].join("; "); + const command = [ + `rm -f ${shellQuote(sentinelFile)} ${shellQuote(logFile)}`, + `nohup bash -lc ${shellQuote(wrapped)} > ${shellQuote(logFile)} 2>&1 < /dev/null & echo $!`, + ].join("; "); + const result = await dispatchSsh(config, providerId, command, cwd, shortDispatchWaitMs, shortRemoteTimeoutMs); + const pid = result.stdout.trim().split("\n").pop()?.trim() ?? ""; + if (!result.ok || !/^\d+$/u.test(pid)) { + return { ok: false, pid: "", raw: result.raw, error: result.stderr || result.stdout || "failed to launch background command" }; + } + return { ok: true, pid, raw: result.raw, error: "" }; +} + +async function pollBackground( + config: UniDeskConfig, + providerId: string, + cwd: string, + logFile: string, + sentinelFile: string, +): Promise { + const command = [ + `if [ -f ${shellQuote(sentinelFile)} ]; then printf 'SENTINEL:%s\\n' "$(cat ${shellQuote(sentinelFile)} 2>/dev/null || true)"; else echo RUNNING; fi`, + `tail -n 80 ${shellQuote(logFile)} 2>/dev/null || true`, + ].join("; "); + const result = await dispatchSsh(config, providerId, command, cwd, shortDispatchWaitMs, shortRemoteTimeoutMs); + const stdout = result.stdout.trimEnd(); + const [head = "", ...rest] = stdout.split("\n"); + if (head.startsWith("SENTINEL:")) { + const rawExitCode = head.slice("SENTINEL:".length).trim(); + const exitCode = /^\d+$/u.test(rawExitCode) ? Number(rawExitCode) : null; + return { done: true, exitCode, logTail: rest.join("\n").trim(), raw: result.raw }; + } + return { done: false, exitCode: null, logTail: rest.join("\n").trim(), raw: result.raw }; +} + +async function backgroundStep( + config: UniDeskConfig, + options: DeployCliOptions, + step: string, + shellScript: string, + timeoutMs: number, + cwd: string | null = options.deployDir, +): Promise { + const startedAt = nowIso(); + const startedMs = Date.now(); + const runId = `${Date.now().toString(36)}-${Math.random().toString(16).slice(2, 8)}`; + const logFile = `/tmp/unidesk-codex-deploy-${step}-${runId}.log`; + const sentinelFile = `/tmp/unidesk-codex-deploy-${step}-${runId}.done`; + progressLine(step, "launching remote background step", { logFile, sentinelFile, timeoutMs }); + const launch = await launchBackground(config, options.providerId, shellScript, cwd ?? "/home/ubuntu", logFile, sentinelFile); + if (!launch.ok) { + return { step, ok: false, detail: launch.error, startedAt, finishedAt: nowIso(), raw: launch.raw }; + } + progressLine(step, "remote background step started", { pid: launch.pid, logFile }); + const deadline = Date.now() + timeoutMs; + let lastTail = ""; + while (Date.now() < deadline) { + await Bun.sleep(pollIntervalMs); + const poll = await pollBackground(config, options.providerId, cwd ?? "/home/ubuntu", logFile, sentinelFile); + const tail = compactTail(poll.logTail, 1200); + if (tail.length > 0 && tail !== lastTail) { + lastTail = tail; + progressLine(step, "remote log tail", { elapsedMs: elapsedMs(startedMs), tail }); + } + if (poll.done) { + const ok = poll.exitCode === 0; + return { + step, + ok, + detail: ok + ? `completed in ${elapsedMs(startedMs)}ms; log=${logFile}` + : `failed with exit ${poll.exitCode}; log=${logFile}; tail=${compactTail(poll.logTail)}`, + startedAt, + finishedAt: nowIso(), + raw: poll.raw, + }; + } + } + return { step, ok: false, detail: `timed out after ${timeoutMs}ms; log=${logFile}`, startedAt, finishedAt: nowIso(), raw: null }; +} + +function bashScriptCommand(script: string): string { + return `bash -lc ${shellQuote(script)}`; +} + +async function directStep( + config: UniDeskConfig, + options: DeployCliOptions, + step: string, + command: string, + cwd: string | null, + waitMs = 60_000, + remoteTimeoutMs = 45_000, +): Promise { + const startedAt = nowIso(); + const startedMs = Date.now(); + progressLine(step, "running remote command"); + const result = await dispatchSsh(config, options.providerId, command, cwd, waitMs, remoteTimeoutMs); + const detail = compactTail([result.stdout, result.stderr].filter(Boolean).join("\n"), 1200); + return { + step, + ok: result.ok, + detail: result.ok ? detail || `completed in ${elapsedMs(startedMs)}ms` : detail || `failed with status=${result.status} exit=${result.exitCode}`, + startedAt, + finishedAt: nowIso(), + raw: result.raw, + }; +} + +function prepareSourceScript(options: DeployCliOptions, exportDir: string): string { + return [ + "set -euo pipefail", + `repo=${shellQuote(options.sourceRepoDir)}`, + `repo_url=${shellQuote(options.repoUrl)}`, + `commit=${shellQuote(options.commitId)}`, + `export_dir=${shellQuote(exportDir)}`, + "mkdir -p \"$(dirname \"$repo\")\"", + "if [ ! -d \"$repo/.git\" ]; then rm -rf \"$repo\"; git clone --no-checkout \"$repo_url\" \"$repo\"; fi", + "cd \"$repo\"", + "git remote get-url origin >/dev/null", + "git fetch --no-tags origin \"$commit\" || git fetch --no-tags origin '+refs/heads/*:refs/remotes/origin/*'", + "resolved=$(git rev-parse --verify \"$commit^{commit}\")", + "rm -rf \"$export_dir\"", + "mkdir -p \"$export_dir\"", + "git archive --format=tar \"$resolved\" | tar -xf - -C \"$export_dir\"", + "printf 'resolved_commit=%s\\nexport_dir=%s\\n' \"$resolved\" \"$export_dir\"", + ].join("\n"); +} + +function syncDeployScript(options: DeployCliOptions, exportDir: string): string { + return [ + "set -euo pipefail", + `export_dir=${shellQuote(exportDir)}`, + `deploy_dir=${shellQuote(options.deployDir)}`, + "test -f \"$export_dir/src/components/microservices/code-queue/Dockerfile\"", + "mkdir -p \"$deploy_dir\"", + [ + "rsync -a --delete", + "--exclude '.git/'", + "--exclude '.state/'", + "--exclude 'logs/'", + "--exclude '**/node_modules/'", + "--exclude '**/dist/'", + "\"$export_dir/\"", + "\"$deploy_dir/\"", + ].join(" "), + "test -f \"$deploy_dir/src/components/microservices/code-queue/Dockerfile\"", + "test -f \"$deploy_dir/src/components/microservices/v3sctl-adapter/v3s/code-queue.k8s.yaml\"", + "printf 'synced deploy tree to %s\\n' \"$deploy_dir\"", + ].join("\n"); +} + +function buildImageScript(options: DeployCliOptions): string { + return [ + "set -euo pipefail", + `deploy_dir=${shellQuote(options.deployDir)}`, + `image=${shellQuote(options.imageTag)}`, + "cd \"$deploy_dir\"", + "docker build -t \"$image\" -f src/components/microservices/code-queue/Dockerfile .", + "docker image inspect \"$image\" --format 'image_id={{.Id}} repo_tags={{json .RepoTags}}'", + ].join("\n"); +} + +function importImageScript(options: DeployCliOptions): string { + return [ + "set -euo pipefail", + `image=${shellQuote(options.imageTag)}`, + `k3s_container=${shellQuote(options.k3sContainer)}`, + "docker image inspect \"$image\" >/dev/null", + "docker ps --format '{{.Names}}' | grep -Fx \"$k3s_container\" >/dev/null", + "docker save \"$image\" | docker exec -i \"$k3s_container\" ctr -n k8s.io images import -", + "docker exec \"$k3s_container\" ctr -n k8s.io images ls | grep -F \"$image\" || true", + ].join("\n"); +} + +function applyManifestCommand(options: DeployCliOptions): string { + const manifest = `${options.deployDir}/${k8sManifestRelPath}`; + return `KUBECONFIG=${shellQuote(options.kubeconfigPath)} kubectl apply -f ${shellQuote(manifest)}`; +} + +function rolloutRestartCommand(options: DeployCliOptions): string { + return `KUBECONFIG=${shellQuote(options.kubeconfigPath)} kubectl -n ${shellQuote(k8sNamespace)} rollout restart deployment/${shellQuote(tcpEgressDeployment)} deployment/${shellQuote(k8sDeployment)}`; +} + +function rolloutStatusScript(options: DeployCliOptions): string { + return [ + "set -euo pipefail", + `kubeconfig=${shellQuote(options.kubeconfigPath)}`, + `namespace=${shellQuote(k8sNamespace)}`, + `deployment=${shellQuote(k8sDeployment)}`, + `tcp_deployment=${shellQuote(tcpEgressDeployment)}`, + "KUBECONFIG=\"$kubeconfig\" kubectl -n \"$namespace\" rollout status \"deployment/$tcp_deployment\" --timeout=120s", + "KUBECONFIG=\"$kubeconfig\" kubectl -n \"$namespace\" rollout status \"deployment/$deployment\" --timeout=180s", + "KUBECONFIG=\"$kubeconfig\" kubectl -n \"$namespace\" get deploy \"$tcp_deployment\" \"$deployment\" -o wide", + "KUBECONFIG=\"$kubeconfig\" kubectl -n \"$namespace\" get pods -l app.kubernetes.io/name=code-queue,unidesk.ai/instance-id=D601 -o wide", + ].join("\n"); +} + +async function microserviceHealthStep(config: UniDeskConfig, timeoutMs: number): Promise { + const step = "unidesk-health"; + const startedAt = nowIso(); + const startedMs = Date.now(); + const deadline = Date.now() + timeoutMs; + let latest: unknown = null; + while (Date.now() < deadline) { + latest = coreInternalFetch("/api/microservices/code-queue/health"); + const record = asRecord(latest); + const body = asRecord(record?.body); + const ok = record?.ok === true && body?.ok !== false; + progressLine(step, "health probe", { ok, status: record?.status ?? null, body }); + if (ok) { + return { + step, + ok: true, + detail: `Code Queue health passed in ${elapsedMs(startedMs)}ms`, + startedAt, + finishedAt: nowIso(), + raw: latest, + }; + } + await Bun.sleep(pollIntervalMs); + } + return { + step, + ok: false, + detail: `Code Queue health did not pass within ${timeoutMs}ms`, + startedAt, + finishedAt: nowIso(), + raw: latest, + }; +} + +function remainingTimeout(deadline: number, fallbackMs: number): number { + return Math.max(30_000, Math.min(fallbackMs, deadline - Date.now())); +} + +export async function codexDeploy(config: UniDeskConfig, options: DeployCliOptions): Promise { + validateOptions(options); + const startedAt = nowIso(); + const deadline = Date.now() + options.timeoutMs; + const exportDir = `/tmp/unidesk-codex-deploy-src-${Date.now().toString(36)}-${Math.random().toString(16).slice(2, 8)}`; + const steps: StepResult[] = []; + const pushStep = (step: StepResult): boolean => { + steps.push(step); + progressLine(step.step, step.ok ? "step succeeded" : "step failed", { detail: step.detail }); + return step.ok; + }; + + progressLine("deploy", "starting Code Queue deployment", { + commitId: options.commitId, + providerId: options.providerId, + repoUrl: options.repoUrl, + sourceRepoDir: options.sourceRepoDir, + deployDir: options.deployDir, + imageTag: options.imageTag, + timeoutMs: options.timeoutMs, + skipBuild: options.skipBuild, + }); + + const prepare = await backgroundStep(config, options, "prepare-source", prepareSourceScript(options, exportDir), remainingTimeout(deadline, 180_000), null); + if (!pushStep(prepare)) return { ok: false, startedAt, finishedAt: nowIso(), options, steps }; + + const sync = await directStep(config, options, "sync-deploy-tree", bashScriptCommand(syncDeployScript(options, exportDir)), null, 60_000, 45_000); + if (!pushStep(sync)) return { ok: false, startedAt, finishedAt: nowIso(), options, steps }; + + if (options.skipBuild) { + steps.push({ step: "docker-build", ok: true, detail: "skipped by --skip-build", startedAt: nowIso(), finishedAt: nowIso() }); + steps.push({ step: "import-k3s-image", ok: true, detail: "skipped by --skip-build", startedAt: nowIso(), finishedAt: nowIso() }); + } else { + const build = await backgroundStep(config, options, "docker-build", buildImageScript(options), remainingTimeout(deadline, 540_000)); + if (!pushStep(build)) return { ok: false, startedAt, finishedAt: nowIso(), options, steps }; + const imageImport = await backgroundStep(config, options, "import-k3s-image", importImageScript(options), remainingTimeout(deadline, 180_000)); + if (!pushStep(imageImport)) return { ok: false, startedAt, finishedAt: nowIso(), options, steps }; + } + + const apply = await directStep(config, options, "kubectl-apply", applyManifestCommand(options), options.deployDir, 60_000, 45_000); + if (!pushStep(apply)) return { ok: false, startedAt, finishedAt: nowIso(), options, steps }; + + const restart = await directStep(config, options, "rollout-restart", rolloutRestartCommand(options), options.deployDir, 60_000, 45_000); + if (!pushStep(restart)) return { ok: false, startedAt, finishedAt: nowIso(), options, steps }; + + const rollout = await backgroundStep(config, options, "rollout-status", rolloutStatusScript(options), remainingTimeout(deadline, 240_000)); + if (!pushStep(rollout)) return { ok: false, startedAt, finishedAt: nowIso(), options, steps }; + + const health = await microserviceHealthStep(config, remainingTimeout(deadline, 90_000)); + pushStep(health); + + return { + ok: health.ok, + startedAt, + finishedAt: nowIso(), + options, + steps, + statusCommands: { + health: "bun scripts/cli.ts microservice health code-queue", + overview: "bun scripts/cli.ts microservice proxy code-queue '/api/tasks/overview?limit=5&transcriptLimit=1&compact=1&afterSeq=0&preferId='", + }, + }; +} + +function deployJobCommand(options: DeployCliOptions): string[] { + const command = [ + process.execPath, + rootPath("scripts", "cli.ts"), + "codex", + "deploy", + options.commitId, + "--run-now", + "--provider-id", + options.providerId, + "--repo-url", + options.repoUrl, + "--source-repo-dir", + options.sourceRepoDir, + "--deploy-dir", + options.deployDir, + "--kubeconfig", + options.kubeconfigPath, + "--k3s-container", + options.k3sContainer, + "--image", + options.imageTag, + "--timeout-ms", + String(options.timeoutMs), + ]; + if (options.skipBuild) command.push("--skip-build"); + return command; +} + +function startDeployJob(options: DeployCliOptions): { ok: true; mode: "async-job"; job: JobRecord; commitId: string; providerId: string; statusCommand: string; tailCommand: string; note: string } { + const command = deployJobCommand(options); + const job = startJob("codex_deploy", command, `Deploy Code Queue ${options.commitId} to ${options.providerId} v3s/k8s`); + return { + ok: true, + mode: "async-job", + job, + commitId: options.commitId, + providerId: options.providerId, + statusCommand: `bun scripts/cli.ts job status ${job.id}`, + tailCommand: `bun scripts/cli.ts job status ${job.id} --tail-bytes 30000`, + note: "Deployment continues in the background: fetch remote commit, export tracked files, sync D601 deploy tree, build/import image, apply k8s manifest, restart rollout, then verify Code Queue health.", + }; +} + +export async function runCodexDeployCommand(config: UniDeskConfig, args: string[]): Promise { + const options = parseOptions(config, args); + validateOptions(options); + if (!options.runNow) return startDeployJob(options); + return codexDeploy(config, options); +} diff --git a/src/components/microservices/v3sctl-adapter/v3s/code-queue.k8s.yaml b/src/components/microservices/v3sctl-adapter/v3s/code-queue.k8s.yaml index 92b7c64c..36929124 100644 --- a/src/components/microservices/v3sctl-adapter/v3s/code-queue.k8s.yaml +++ b/src/components/microservices/v3sctl-adapter/v3s/code-queue.k8s.yaml @@ -42,6 +42,211 @@ endpoints: - addresses: - "172.25.0.3" --- +apiVersion: v1 +kind: ConfigMap +metadata: + name: d601-tcp-egress-gateway + namespace: unidesk + labels: + app.kubernetes.io/name: tcp-egress-gateway + app.kubernetes.io/part-of: unidesk + unidesk.ai/provider-id: D601 +data: + tcp-egress-gateway.js: | + const net = require("node:net"); + + const proxyUrl = new URL(process.env.TCP_EGRESS_HTTP_PROXY || "http://d601-provider-egress-proxy.unidesk.svc.cluster.local:18789"); + const proxyHost = proxyUrl.hostname; + const proxyPort = Number(proxyUrl.port || 80); + const healthPort = Number(process.env.TCP_EGRESS_HEALTH_PORT || 18080); + const rawRoutes = process.env.TCP_EGRESS_ROUTES || ""; + const startedAt = new Date().toISOString(); + + function parseRoute(text) { + const parts = text.trim().split("="); + if (parts.length !== 3) throw new Error(`invalid route: ${text}`); + const [name, listenPortText, target] = parts; + const [targetHost, targetPortText] = target.split(":"); + const listenPort = Number(listenPortText); + const targetPort = Number(targetPortText); + if (!name || !targetHost || !Number.isInteger(listenPort) || !Number.isInteger(targetPort)) throw new Error(`invalid route: ${text}`); + return { name, listenPort, targetHost, targetPort }; + } + + const routes = rawRoutes.split(",").map((item) => item.trim()).filter(Boolean).map(parseRoute); + if (routes.length === 0) throw new Error("TCP_EGRESS_ROUTES must define at least one route"); + + function findHeaderEnd(buffer) { + for (let index = 0; index <= buffer.length - 4; index += 1) { + if (buffer[index] === 13 && buffer[index + 1] === 10 && buffer[index + 2] === 13 && buffer[index + 3] === 10) return index + 4; + } + return -1; + } + + function log(level, message, data = {}) { + console.log(JSON.stringify({ ts: new Date().toISOString(), service: "tcp-egress-gateway", level, message, data })); + } + + function pipeViaHttpConnect(client, route) { + const proxy = net.connect({ host: proxyHost, port: proxyPort }); + const earlyClientChunks = []; + const proxyHeaderChunks = []; + let tunnelReady = false; + let closed = false; + + const closeBoth = () => { + if (closed) return; + closed = true; + client.destroy(); + proxy.destroy(); + }; + + client.on("data", (chunk) => { + if (tunnelReady) proxy.write(chunk); + else earlyClientChunks.push(chunk); + }); + client.on("error", closeBoth); + client.on("close", closeBoth); + + proxy.on("connect", () => { + proxy.write(`CONNECT ${route.targetHost}:${route.targetPort} HTTP/1.1\r\nHost: ${route.targetHost}:${route.targetPort}\r\n\r\n`); + }); + proxy.on("data", (chunk) => { + if (tunnelReady) { + client.write(chunk); + return; + } + proxyHeaderChunks.push(chunk); + const headerBuffer = Buffer.concat(proxyHeaderChunks); + const headerEnd = findHeaderEnd(headerBuffer); + if (headerEnd < 0) return; + const header = headerBuffer.subarray(0, headerEnd).toString("latin1"); + if (!/^HTTP\/1\.[01] 200\b/u.test(header)) { + log("warn", "connect_rejected", { route: route.name, header: header.slice(0, 200) }); + closeBoth(); + return; + } + tunnelReady = true; + const remaining = headerBuffer.subarray(headerEnd); + if (remaining.length > 0) client.write(remaining); + for (const early of earlyClientChunks) proxy.write(early); + earlyClientChunks.length = 0; + }); + proxy.on("error", (error) => { + log("warn", "proxy_error", { route: route.name, error: String(error?.message || error) }); + closeBoth(); + }); + proxy.on("close", closeBoth); + } + + for (const route of routes) { + net.createServer((client) => pipeViaHttpConnect(client, route)).listen(route.listenPort, "0.0.0.0", () => { + log("info", "route_listening", { route: route.name, listenPort: route.listenPort, target: `${route.targetHost}:${route.targetPort}`, proxy: `${proxyHost}:${proxyPort}` }); + }); + } + + Bun.serve({ + hostname: "0.0.0.0", + port: healthPort, + fetch() { + return Response.json({ ok: true, service: "tcp-egress-gateway", startedAt, proxy: `${proxyHost}:${proxyPort}`, routes }); + }, + }); +--- +apiVersion: apps/v1 +kind: Deployment +metadata: + name: d601-tcp-egress-gateway + namespace: unidesk + labels: + app.kubernetes.io/name: tcp-egress-gateway + app.kubernetes.io/part-of: unidesk + unidesk.ai/provider-id: D601 +spec: + replicas: 1 + selector: + matchLabels: + app.kubernetes.io/name: tcp-egress-gateway + unidesk.ai/provider-id: D601 + template: + metadata: + labels: + app.kubernetes.io/name: tcp-egress-gateway + app.kubernetes.io/part-of: unidesk + unidesk.ai/provider-id: D601 + unidesk.ai/node-id: D601 + spec: + nodeSelector: + unidesk.ai/node-id: D601 + containers: + - name: tcp-egress-gateway + image: unidesk-code-queue:d601 + imagePullPolicy: IfNotPresent + command: + - bun + - /etc/unidesk-tcp-egress/tcp-egress-gateway.js + ports: + - name: pg + containerPort: 15432 + - name: oa + containerPort: 4255 + - name: health + containerPort: 18080 + env: + - name: TCP_EGRESS_HTTP_PROXY + value: "http://d601-provider-egress-proxy.unidesk.svc.cluster.local:18789" + - name: TCP_EGRESS_ROUTES + value: "postgres=15432=74.48.78.17:15432,oa-event-flow=4255=74.48.78.17:4255" + - name: TCP_EGRESS_HEALTH_PORT + value: "18080" + volumeMounts: + - name: script + mountPath: /etc/unidesk-tcp-egress + readOnly: true + readinessProbe: + httpGet: + path: /health + port: health + periodSeconds: 5 + timeoutSeconds: 3 + failureThreshold: 12 + livenessProbe: + httpGet: + path: /health + port: health + periodSeconds: 10 + timeoutSeconds: 3 + failureThreshold: 6 + volumes: + - name: script + configMap: + name: d601-tcp-egress-gateway +--- +apiVersion: v1 +kind: Service +metadata: + name: d601-tcp-egress-gateway + namespace: unidesk + labels: + app.kubernetes.io/name: tcp-egress-gateway + app.kubernetes.io/part-of: unidesk + unidesk.ai/provider-id: D601 +spec: + type: ClusterIP + selector: + app.kubernetes.io/name: tcp-egress-gateway + unidesk.ai/provider-id: D601 + ports: + - name: pg + port: 15432 + targetPort: pg + - name: oa + port: 4255 + targetPort: oa + - name: health + port: 18080 + targetPort: health +--- apiVersion: apps/v1 kind: Deployment metadata: @@ -86,6 +291,8 @@ spec: value: "0.0.0.0" - name: PORT value: "4222" + - name: DATABASE_URL + value: "postgres://unidesk:unidesk_dev_password@d601-tcp-egress-gateway.unidesk.svc.cluster.local:15432/unidesk" - name: CODE_QUEUE_INSTANCE_ID value: "D601" - name: CODE_QUEUE_SCHEDULER_ENABLED @@ -139,7 +346,7 @@ spec: - name: CODE_QUEUE_EGRESS_PROXY_URL value: "http://d601-provider-egress-proxy.unidesk.svc.cluster.local:18789" - name: CODE_QUEUE_EGRESS_PROXY_NO_PROXY - value: "localhost,127.0.0.1,::1,host.docker.internal,d601-provider-egress-proxy,d601-provider-egress-proxy.unidesk,d601-provider-egress-proxy.unidesk.svc,d601-provider-egress-proxy.unidesk.svc.cluster.local,172.25.0.3,unidesk-provider-gateway-D601,74.48.78.17,backend-core,oa-event-flow,database" + value: "localhost,127.0.0.1,::1,host.docker.internal,d601-provider-egress-proxy,d601-provider-egress-proxy.unidesk,d601-provider-egress-proxy.unidesk.svc,d601-provider-egress-proxy.unidesk.svc.cluster.local,d601-tcp-egress-gateway,d601-tcp-egress-gateway.unidesk,d601-tcp-egress-gateway.unidesk.svc,d601-tcp-egress-gateway.unidesk.svc.cluster.local,172.25.0.3,unidesk-provider-gateway-D601,backend-core,oa-event-flow,database" - name: HTTP_PROXY value: "http://d601-provider-egress-proxy.unidesk.svc.cluster.local:18789" - name: HTTPS_PROXY @@ -153,11 +360,11 @@ spec: - name: all_proxy value: "http://d601-provider-egress-proxy.unidesk.svc.cluster.local:18789" - name: NO_PROXY - value: "localhost,127.0.0.1,::1,host.docker.internal,d601-provider-egress-proxy,d601-provider-egress-proxy.unidesk,d601-provider-egress-proxy.unidesk.svc,d601-provider-egress-proxy.unidesk.svc.cluster.local,172.25.0.3,unidesk-provider-gateway-D601,74.48.78.17,backend-core,oa-event-flow,database" + value: "localhost,127.0.0.1,::1,host.docker.internal,d601-provider-egress-proxy,d601-provider-egress-proxy.unidesk,d601-provider-egress-proxy.unidesk.svc,d601-provider-egress-proxy.unidesk.svc.cluster.local,d601-tcp-egress-gateway,d601-tcp-egress-gateway.unidesk,d601-tcp-egress-gateway.unidesk.svc,d601-tcp-egress-gateway.unidesk.svc.cluster.local,172.25.0.3,unidesk-provider-gateway-D601,backend-core,oa-event-flow,database" - name: no_proxy - value: "localhost,127.0.0.1,::1,host.docker.internal,d601-provider-egress-proxy,d601-provider-egress-proxy.unidesk,d601-provider-egress-proxy.unidesk.svc,d601-provider-egress-proxy.unidesk.svc.cluster.local,172.25.0.3,unidesk-provider-gateway-D601,74.48.78.17,backend-core,oa-event-flow,database" + value: "localhost,127.0.0.1,::1,host.docker.internal,d601-provider-egress-proxy,d601-provider-egress-proxy.unidesk,d601-provider-egress-proxy.unidesk.svc,d601-provider-egress-proxy.unidesk.svc.cluster.local,d601-tcp-egress-gateway,d601-tcp-egress-gateway.unidesk,d601-tcp-egress-gateway.unidesk.svc,d601-tcp-egress-gateway.unidesk.svc.cluster.local,172.25.0.3,unidesk-provider-gateway-D601,backend-core,oa-event-flow,database" - name: OA_EVENT_FLOW_BASE_URL - value: "http://74.48.78.17:4255" + value: "http://d601-tcp-egress-gateway.unidesk.svc.cluster.local:4255" - name: CODE_QUEUE_NOTIFY_CLAUDEQQ_ENABLED value: "true" - name: CODE_QUEUE_NOTIFY_CLAUDEQQ_BASE_URL @@ -310,6 +517,8 @@ spec: value: "0.0.0.0" - name: PORT value: "4222" + - name: DATABASE_URL + value: "postgres://unidesk:unidesk_dev_password@d601-tcp-egress-gateway.unidesk.svc.cluster.local:15432/unidesk" - name: CODE_QUEUE_INSTANCE_ID value: "D518" - name: CODE_QUEUE_SCHEDULER_ENABLED @@ -363,7 +572,7 @@ spec: - name: CODE_QUEUE_EGRESS_PROXY_URL value: "" - name: CODE_QUEUE_EGRESS_PROXY_NO_PROXY - value: "localhost,127.0.0.1,::1,host.docker.internal,74.48.78.17,backend-core,oa-event-flow,database" + value: "localhost,127.0.0.1,::1,host.docker.internal,d601-tcp-egress-gateway,d601-tcp-egress-gateway.unidesk,d601-tcp-egress-gateway.unidesk.svc,d601-tcp-egress-gateway.unidesk.svc.cluster.local,backend-core,oa-event-flow,database" - name: HTTP_PROXY value: "" - name: HTTPS_PROXY @@ -377,11 +586,11 @@ spec: - name: all_proxy value: "" - name: NO_PROXY - value: "localhost,127.0.0.1,::1,host.docker.internal,74.48.78.17,backend-core,oa-event-flow,database" + value: "localhost,127.0.0.1,::1,host.docker.internal,d601-tcp-egress-gateway,d601-tcp-egress-gateway.unidesk,d601-tcp-egress-gateway.unidesk.svc,d601-tcp-egress-gateway.unidesk.svc.cluster.local,backend-core,oa-event-flow,database" - name: no_proxy - value: "localhost,127.0.0.1,::1,host.docker.internal,74.48.78.17,backend-core,oa-event-flow,database" + value: "localhost,127.0.0.1,::1,host.docker.internal,d601-tcp-egress-gateway,d601-tcp-egress-gateway.unidesk,d601-tcp-egress-gateway.unidesk.svc,d601-tcp-egress-gateway.unidesk.svc.cluster.local,backend-core,oa-event-flow,database" - name: OA_EVENT_FLOW_BASE_URL - value: "http://74.48.78.17:4255" + value: "http://d601-tcp-egress-gateway.unidesk.svc.cluster.local:4255" - name: CODE_QUEUE_NOTIFY_CLAUDEQQ_ENABLED value: "false" - name: CODE_QUEUE_NOTIFY_CLAUDEQQ_BASE_URL