feat(microservices): manage code queue through v3s

Codex
2026-05-15 11:54:41 +00:00
parent c334c4f082
commit 00add260e3
42 changed files with 2010 additions and 354 deletions
+5 -5
@@ -31,17 +31,17 @@ Typical targeted commands:
- `bun scripts/cli.ts e2e run --only frontend --skip frontend:todo-note-integrated-visible,frontend:findjob-integrated-visible`
- `bun scripts/cli.ts e2e run --only network,provider-ingress`
- Public exposure: Docker port summary must not show core REST or Code Queue public host mappings; frontend and provider ingress are the only browser/provider public entries. PostgreSQL `15432` and OA Event Flow `4255` may be host-mapped only for D601 Code Queue and must be protected by the `DOCKER-USER` source restrictions generated from `network.restrictedHostAccess`; E2E treats either an unreachable generic probe or a verified restricted rule as passing. Known private user-service ports such as FindJob `3254`, MET Nonlinear `3288`, Todo Note `4211`, Code Queue host port `14222` and File Browser provider port `4251` probes must fail.
- Public exposure: Docker port summary must not show core REST, Code Queue NodePort, or Code Queue public host mappings; frontend and provider ingress are the only browser/provider public entries. PostgreSQL `15432` and OA Event Flow `4255` may be host-mapped only for controlled Code Queue nodes and must be protected by the `DOCKER-USER` source restrictions generated from `network.restrictedHostAccess`; E2E treats either an unreachable generic probe or a verified restricted rule as passing. Known private user-service ports such as FindJob `3254`, MET Nonlinear `3288`, Todo Note `4211`, legacy Code Queue host ports and File Browser provider port `4251` probes must fail.
- Core API: `docker exec unidesk-backend-core` calls internal `GET /api/overview`, which must report `dbReady: true`, `pgdata.volumeName=unidesk_pgdata_10gb`, a positive PostgreSQL database byte count, and at least one online node; internal `GET /api/performance` must report component request statistics, internal operation statistics, PGDATA usage and Code Queue PostgreSQL storage metadata.
- Provider self-connection: internal `GET /api/nodes` must contain `main-server` with `status: online`, `labels.providerGatewayVersion` equal to `src/components/provider-gateway/package.json`, `labels.providerGatewayUpgradePolicy: "always-enabled"`, `labels.providerGatewayRestartPolicyOk: true`, `labels.providerGatewayPidModeOk: true`, and `labels.providerGatewayRuntimeGuardOk: true`; internal `GET /api/nodes/system-status` must contain CPU/memory/disk samples plus a non-empty process resource list sorted by memory by default; internal `GET /api/nodes/docker-status` must contain a Docker snapshot for `main-server`; every running `provider-gateway` container visible in Docker snapshots must report `restartPolicy: "always"` and `pidMode: "host"`; public provider ingress `/health` must return ok.
- Provider remote control: internal `/api/dispatch` must successfully complete a real `provider.upgrade` task in `mode: "plan"` so the upgrade path is validated without recreating the running gateway during E2E.
- User services: internal `/api/microservices` must include `todo-note` and `oa-event-flow` on `main-server`, canonical `filebrowser` on `D518`, plus `code-queue`, `findjob`, `pipeline`, `met-nonlinear`, `claudeqq` and `filebrowser-d601` on `D601` with `public=false`; `/api/microservices/todo-note/health` must report `storage=postgres`, `/api/microservices/todo-note/proxy/api/instances` must expose the migrated Todo Note lists, and a temporary Todo Note list create/add/toggle/undo/delete cycle must succeed through the real provider-gateway proxy; `/api/microservices/oa-event-flow/health`, `/api/microservices/oa-event-flow/proxy/api/diagnostics`, `/api/microservices/oa-event-flow/proxy/api/events`, `/api/microservices/oa-event-flow/proxy/api/events?tags=service:pipeline` and `/api/microservices/oa-event-flow/proxy/api/stats/trace` must prove the independent OA event table, Pipeline bridge and stats center are reachable through UniDesk proxy; `/api/microservices/code-queue/health` must return a D601 Code Queue summary with default model `gpt-5.5`, and `/api/microservices/code-queue/proxy/api/tasks` must return queue state through the same D601 provider-gateway proxy; `/api/microservices/filebrowser/health`, `/api/microservices/filebrowser-d601/health` and `/api/microservices/filebrowser/proxy/` must prove File Browser health and WebUI access through UniDesk proxy; `/api/microservices/findjob/health` and `/api/microservices/findjob/proxy/api/summary` must succeed through the real provider-gateway proxy; `/api/microservices/findjob/proxy/api/jobs?__unideskArrayLimit=jobs:5` must return a bounded preview with `_unidesk.arrayLimits` metadata; `/api/microservices/pipeline/health`, `/api/microservices/pipeline/proxy/api/snapshot?__unideskArrayLimit=registry.components:8,runs:3` and `/api/microservices/pipeline/proxy/api/oa-event-flow/diagnostics` must return Pipeline health, registry/run previews and OA event-flow evidence; `/api/microservices/met-nonlinear/health`, `/api/microservices/met-nonlinear/proxy/api/queue`, `/api/microservices/met-nonlinear/proxy/api/projects?root=projects&limit=500`, `/api/microservices/met-nonlinear/proxy/api/projects?root=ex_projects&limit=500`, `/api/microservices/met-nonlinear/proxy/api/projects/config?path=<projectPath>` and `/api/microservices/met-nonlinear/proxy/api/images` must return the D601 TS backend health, queue/GPU policy, full project tree inputs, structured project detail and ready `met-nonlinear-ml:tf26` image status.
- User services: internal `/api/microservices` must include `todo-note` and `oa-event-flow` on `main-server`, canonical `filebrowser` on `D518`, plus `v3sctl-adapter`, `code-queue`, `findjob`, `pipeline`, `met-nonlinear`, `claudeqq` and `filebrowser-d601` on `D601` with `public=false`; `/api/microservices/todo-note/health` must report `storage=postgres`, `/api/microservices/todo-note/proxy/api/instances` must expose the migrated Todo Note lists, and a temporary Todo Note list create/add/toggle/undo/delete cycle must succeed through the real provider-gateway proxy; `/api/microservices/oa-event-flow/health`, `/api/microservices/oa-event-flow/proxy/api/diagnostics`, `/api/microservices/oa-event-flow/proxy/api/events`, `/api/microservices/oa-event-flow/proxy/api/events?tags=service:pipeline` and `/api/microservices/oa-event-flow/proxy/api/stats/trace` must prove the independent OA event table, Pipeline bridge and stats center are reachable through UniDesk proxy; `/api/microservices/v3sctl-adapter/health` and `/api/microservices/v3sctl-adapter/proxy/api/control-plane` must expose the D601 v3s/k8s control plane, `kubeApiProxy.mode=kubernetes-api-service-proxy`, D601 active instance `servingHealthy=true`, D518 expected/missing state when D518 has not joined, `status=degraded` for incomplete topology, and `noFallback=true`; `/api/microservices/code-queue/health` must return the active Code Queue backend summary with default model `gpt-5.5`, and `/api/microservices/code-queue/proxy/api/tasks/overview` must return queue state through backend-core -> v3sctl-adapter -> Kubernetes API service proxy -> v3s/k8s Service, not through a `serviceId=code-queue` provider-gateway direct task or `/api/code-queue-direct`; `/api/microservices/filebrowser/health`, `/api/microservices/filebrowser-d601/health` and `/api/microservices/filebrowser/proxy/` must prove File Browser health and WebUI access through UniDesk proxy; `/api/microservices/findjob/health` and `/api/microservices/findjob/proxy/api/summary` must succeed through the real provider-gateway proxy; `/api/microservices/findjob/proxy/api/jobs?__unideskArrayLimit=jobs:5` must return a bounded preview with `_unidesk.arrayLimits` metadata; `/api/microservices/pipeline/health`, `/api/microservices/pipeline/proxy/api/snapshot?__unideskArrayLimit=registry.components:8,runs:3` and `/api/microservices/pipeline/proxy/api/oa-event-flow/diagnostics` must return Pipeline health, registry/run previews and OA event-flow evidence; `/api/microservices/met-nonlinear/health`, `/api/microservices/met-nonlinear/proxy/api/queue`, `/api/microservices/met-nonlinear/proxy/api/projects?root=projects&limit=500`, `/api/microservices/met-nonlinear/proxy/api/projects?root=ex_projects&limit=500`, `/api/microservices/met-nonlinear/proxy/api/projects/config?path=<projectPath>` and `/api/microservices/met-nonlinear/proxy/api/images` must return the D601 TS backend health, queue/GPU policy, full project tree inputs, structured project detail and ready `met-nonlinear-ml:tf26` image status.
- ClaudeQQ availability: `/api/microservices/claudeqq/health` must only pass when `ready=true`, NapCat HTTP and WebSocket are connected, and `napcat.loginState=logged_in`; `/api/microservices/claudeqq/proxy/api/napcat/login` must show the same logged-in account state and `/api/microservices/claudeqq/proxy/api/events/recent` must prove the backend can read the persistent event cache. A QR-code-only or not-logged-in NapCat state must be treated as unhealthy.
- Database: the command writes an `unidesk_e2e_markers` row through `docker exec unidesk-database psql`, confirms provider state is stored in PostgreSQL, and checks Todo Note rows exist in `todo_note_instances` using the same named volume.
- Pipeline OA event flow: `microservice:pipeline-oa-event-flow` must prove both no-audit and monitor-audit runs are driven by OA events end to end. The event stream must show `node-finished` as a neutral fact with `pipeline:{pipelineId}` and `epoch:{runId}` tags, OA policy as the source of downstream/audit decisions, monitor decisions as OA control events, and runner control-result evidence. E2E must fail if delivery still depends on a legacy detail audit policy flag as policy authority, independent legacy audit-request points, a legacy batch completion gate, direct monitor-to-runner calls, or frontend/CLI writes to Pipeline `.state`.
- The same Pipeline OA diagnostics must fail on legacy file-transport residuals. Procedure containers, monitor sessions, UI/Gantt DTO builders and CLI fetches must consume prompt/control/stop/display evidence only from the OA event ledger and normalized HTTP read APIs; `control-prompts.jsonl`, `monitor-prompts.jsonl`, `monitor-control`, `control-events.jsonl`, monitor stop files, `.state/pipeline-runs/{runId}/control/commands/`, `PIPELINE_*_APPEND_FILE`, local JSONL append/read helpers, and monitor `/pipeline-state` mounts are forbidden in runtime source.
- Pipeline live Gantt setup: when `frontend:pipeline-gantt-observation-live-running` is selected, E2E first looks for a current Pipeline run that already contains both a `node-long-running-observation` marker and a still-running execution interval. If no such candidate exists, the E2E setup starts the D601 `monitor-management-behavior-test` pipeline through `bun scripts/cli.ts ssh D601 ...` and polls the private backend proxy until the observation candidate exists; the acceptance assertion itself still opens the public frontend with Playwright and verifies the rendered arrows, absence of observation source pseudo-points, target arrow inset, and live flashing running bar through React DOM controls.
- Frontend: Playwright must open the public frontend URL derived from `network.publicHost`, not localhost or a Docker-internal URL; it logs in with the configured account, waits for `核心在线`, asserts that `main-server` and `Main Server Provider` are visible, verifies desktop sidebar collapse and `PGDATA` overview metric, opens `运行总览 / 性能面板` to verify `Bwebui`, the component summary, recent failed requests, the internal operation summary and recent slow operations, clicks `查看原始JSON` to verify Provider data from the frontend, confirms no raw JSON is visible before that click, opens task history to verify duration and failure diagnostics, opens resource nodes `资源监控` to verify CPU/Memory/Disk curves, the structured process resource table, default memory-desc sorting, sortable CPU column and provider upgrade precheck dispatch, opens `Docker 状态`, switches to `main-server`, and verifies the Docker Desktop-style container view including the database named volume `unidesk_pgdata_10gb`, opens `网关版本` and verifies the provider-gateway version, SSH passthrough availability, remote update availability plus structured remote update records for `provider.upgrade`, then opens `用户服务 / 服务目录`, `用户服务 / Todo Note`, `用户服务 / OA Event Flow`, `用户服务 / Code Queue`, `用户服务 / FindJob`, `用户服务 / Pipeline` and `用户服务 / MET Nonlinear` to verify that the main server Todo Note/OA Event Flow, D601 Code Queue, D601 business services, repository references, private backend mappings, the Todo Note migrated lists and tree tasks, the OA Event Flow event table and Trace stats table, the Code Queue queue/model/output/initial `Submitted prompt`/automatic loading of the full Trace for terminal-state tasks/append-prompt/interrupt controls, FindJob metrics and job preview, the Pipeline component matrix, MiniMax quota card, structured OA event-flow diagnostics panel, React Flow control graph, epoch Gantt chart, Gantt rendered-image export, monitor first-column sorting, long-task observation connectors, absence of observation source pseudo-points, the live flashing execution bar for running nodes and OpenCode Trace, and the MET Nonlinear project library/Fork/pending-start queue/current queue/completed/failure diagnostics/GPU/images are all displayed through React controls. Playwright must also verify deep-link direct routes: for example, the public `http://<publicHost>:<frontendPort>/app/pipeline/` must land directly on the Pipeline page, switching to `资源节点 / Docker 状态` must then update the address bar to `/nodes/docker/`, and the browser history back chain must still return to `/app/pipeline/`; it must also open `/app/code-queue/` directly and verify that the page contains `app-shell`, the left main-module sidebar, the top status bar, the top sub-tabs and `code-queue-page`, so that user-service deep links do not degrade into a shell-less standalone page; non-user-service pages such as `态势总览` should land under their own module prefix, for example `/ops/status/`. Playwright must cover that default visible times are displayed in Beijing time, including at least the top `北京时间` clock, task history/gateway version update times and user service refresh times, and must not drift with the browser's local time zone. Task history and provider upgrade records must not display a real sub-second duration as `0s`; MET Nonlinear running rows must show an ETA derived from backend progress or from `startedAt` plus epoch progress, and queue/completed rows must show training speed as `epoch/h`.
- Frontend: Playwright must open the public frontend URL derived from `network.publicHost`, not localhost or a Docker-internal URL; it logs in with the configured account, waits for `核心在线`, asserts that `main-server` and `Main Server Provider` are visible, verifies desktop sidebar collapse and `PGDATA` overview metric, opens `运行总览 / 性能面板` to verify `Bwebui`, the component summary, recent failed requests, the internal operation summary and recent slow operations, clicks `查看原始JSON` to verify Provider data from the frontend, confirms no raw JSON is visible before that click, opens task history to verify duration and failure diagnostics, opens resource nodes `资源监控` to verify CPU/Memory/Disk curves, the structured process resource table, default memory-desc sorting, sortable CPU column and provider upgrade precheck dispatch, opens `Docker 状态`, switches to `main-server`, and verifies the Docker Desktop-style container view including the database named volume `unidesk_pgdata_10gb`, opens `网关版本` and verifies the provider-gateway version, SSH passthrough availability, remote update availability plus structured remote update records for `provider.upgrade`, then opens `用户服务 / 服务目录`, `用户服务 / Todo Note`, `用户服务 / OA Event Flow`, `用户服务 / V3S Control`, `用户服务 / Code Queue`, `用户服务 / FindJob`, `用户服务 / Pipeline` and `用户服务 / MET Nonlinear` to verify that the main server Todo Note/OA Event Flow, D601 Code Queue, D601 business services, repository references, private backend mappings, the Todo Note migrated lists and tree tasks, the OA Event Flow event table and Trace stats table, the V3S control plane/D601-D518 instances/Kubernetes API service proxy/no-fallback path, the Code Queue queue/model/output/initial `Submitted prompt`/automatic loading of the full Trace for terminal-state tasks/append-prompt/interrupt controls, FindJob metrics and job preview, the Pipeline component matrix, MiniMax quota card, structured OA event-flow diagnostics panel, React Flow control graph, epoch Gantt chart, Gantt rendered-image export, monitor first-column sorting, long-task observation connectors, absence of observation source pseudo-points, the live flashing execution bar for running nodes and OpenCode Trace, and the MET Nonlinear project library/Fork/pending-start queue/current queue/completed/failure diagnostics/GPU/images are all displayed through React controls. Playwright must also verify that all Code Queue page API requests go through `/api/microservices/code-queue/proxy` and that `/api/code-queue-direct` no longer appears. It must also verify deep-link direct routes: for example, the public `http://<publicHost>:<frontendPort>/app/pipeline/` must land directly on the Pipeline page, switching to `资源节点 / Docker 状态` must then update the address bar to `/nodes/docker/`, and the browser history back chain must still return to `/app/pipeline/`; it must also open `/app/code-queue/` directly and verify that the page contains `app-shell`, the left main-module sidebar, the top status bar, the top sub-tabs and `code-queue-page`, so that user-service deep links do not degrade into a shell-less standalone page; non-user-service pages such as `态势总览` should land under their own module prefix, for example `/ops/status/`. Playwright must cover that default visible times are displayed in Beijing time, including at least the top `北京时间` clock, task history/gateway version update times and user service refresh times, and must not drift with the browser's local time zone. Task history and provider upgrade records must not display a real sub-second duration as `0s`; MET Nonlinear running rows must show an ETA derived from backend progress or from `startedAt` plus epoch progress, and queue/completed rows must show training speed as `epoch/h`.
- Frontend dense-layout regression gate: whenever a frontend change touches the Pipeline right sidebar, Trace timeline, detail drawer, Gantt chart coordinates or other high-information-density panels, Playwright acceptance must inspect both total height (`总高度`) and horizontal scrollbars (`横向滚动条`). For Pipeline specifically, the OpenCode Trace session head must carry shared agent/model/session facts and the Trace body must use the same Code Queue `TraceView` styling; Playwright must fail if old `.pipeline-opencode-step`, `.pipeline-opencode-flow`, `.pipeline-step-message-card` or `.pipeline-opencode-part` user-visible styles reappear, if the Trace container introduces an internal horizontal scrollbar, or if `frontend:pipeline-gantt-frontend-y-accuracy` fails to prove the frontend `frontend-y` layout maps ticks, markers and execution bars from timestamps to y coordinates within tolerance.
- OpenCode Trace must use Code Queue Trace styling and must not render the deprecated Pipeline continuous step connector; Playwright should fail if `.pipeline-opencode-flow`, `.pipeline-opencode-step` or any equivalent continuous connector/card returns to the user-visible Trace.
- User service frontend assertions must wait for real backend data, not only the page skeleton. For Todo Note this means the page must show the migrated lists `CONSTAR`, `大论文`, `找工作`, `小论文` and `事务`, support creating a temporary list and task through the frontend, and delete that temporary list afterwards. The temporary list must be selected again by its unique generated name before deletion so E2E never deletes a migrated source list by accident. For FindJob this means the page must show a numeric `岗位总量`, `HEALTH OK`, and a non-empty `PREVIEW` count such as `40/1463 PREVIEW`; for Pipeline this means the page must show `Pipeline v2 工作台`, `Health OK`, a numeric component count, a non-empty React Flow control graph, `控制图`, `Epoch 甘特图`, and after clicking a Gantt execution line it must show `OpenCode Trace` rendered by the shared Code Queue-style Trace component with messages and tool-call groups; for MET Nonlinear this means the page must show `MET Nonlinear 训练编排`, `Health OK`, `Fork Project`, `加入待启动队列`, `启动队列`, `当前队列`, the max concurrency setting, task queue and GPU/image panels, and must not show the removed hard-coded `创建10个10轮任务` frontend entry. The MET Nonlinear project library must render `projects/` and `ex_projects/` as a true path tree with folder Project counts; clicking a project row must open a structured detail panel containing `config.json`, `data/ 训练状态`, `模型参数`, `指标` and a parameter count such as `Total Params`; clicking a completed/current/failed job row must open a structured job detail and both the row and detail must show `epoch/h`. Full MET Nonlinear acceptance is driven by public frontend controls: choose a visible source Project, set batch size, epochs and max concurrency in inputs, fork into `projects/unidesk_forks/`, stage the selected forks, start the queue, and verify completed rows plus automatic `metnl-train-*` container removal; loading placeholders like `--` or empty states are not sufficient for E2E success.
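
The Frontend bullet requires that a real sub-second duration is never displayed as `0s`. A minimal sketch of a formatter that satisfies this rule follows; the function name `formatDuration` and its output format are hypothetical and are not taken from the UniDesk source.

```typescript
// Hypothetical helper: name and output format are illustrative only.
// Rule being illustrated: a real sub-second duration must not render as "0s".
function formatDuration(ms: number): string {
  if (!Number.isFinite(ms) || ms < 0) {
    throw new RangeError(`invalid duration: ${ms}`);
  }
  // Only an exactly-zero duration may show as "0s".
  if (ms === 0) return "0s";
  // Sub-second durations are shown in milliseconds, never rounded down to 0.
  if (ms < 1000) return `${Math.max(1, Math.round(ms))}ms`;
  const totalSeconds = Math.floor(ms / 1000);
  // Below one minute: one decimal place, truncated so 59999 ms stays "59.9s".
  if (totalSeconds < 60) return `${(Math.floor(ms / 100) / 10).toFixed(1)}s`;
  const minutes = Math.floor(totalSeconds / 60);
  const seconds = totalSeconds % 60;
  return `${minutes}m ${seconds}s`;
}
```

An E2E assertion could then compare the rendered text against the backend duration and fail when the backend value is positive but the cell shows `0s`.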
@@ -59,7 +59,7 @@ User service pages are covered by the same rule. `Todo Note` must show lists, ta
## Public Boundary Rule
The public frontend URL and provider ingress URL are the only unrestricted public network interfaces. backend-core REST API remains Docker-internal only; PostgreSQL and OA Event Flow may expose restricted host mappings solely for D601 Code Queue, and E2E must prove those mappings are unreachable to generic clients or protected by explicit source rules.
The public frontend URL and provider ingress URL are the only unrestricted public network interfaces. backend-core REST API remains Docker-internal only; PostgreSQL and OA Event Flow may expose restricted host mappings solely for controlled Code Queue nodes, and E2E must prove those mappings are unreachable to generic clients or protected by explicit source rules.
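
The pass/fail logic of this rule can be sketched as a small predicate. This is an illustration under stated assumptions, not the project's E2E code: the `ProbeResult` type and `probePasses` function are hypothetical, the port numbers come from the Public exposure bullet, and treating every other non-public port as "must be unreachable" is an assumption.

```typescript
// Hypothetical types and names; only the port numbers come from the acceptance text.
type ProbeResult = {
  port: number;
  reachable: boolean;              // a generic client could connect to the port
  restrictedRuleVerified: boolean; // a DOCKER-USER source restriction was confirmed
};

// PostgreSQL 15432 and OA Event Flow 4255 may be host-mapped with restrictions.
const RESTRICTED_HOST_PORTS = new Set<number>([15432, 4255]);

// Intended for non-public ports only (not the frontend or provider ingress ports).
function probePasses(p: ProbeResult): boolean {
  if (RESTRICTED_HOST_PORTS.has(p.port)) {
    // Either an unreachable generic probe or a verified restricted rule passes.
    return !p.reachable || p.restrictedRuleVerified;
  }
  // Assumption: any other private port (for example 3254, 3288, 4211, 4251)
  // passes only when the generic probe cannot reach it.
  return !p.reachable;
}
```

For example, a reachable `15432` with a verified source rule passes, while a reachable FindJob `3254` fails regardless of any rule.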
## Database Persistence Rule
@@ -67,7 +67,7 @@ The PostgreSQL data volume is the named Docker volume `unidesk_pgdata_10gb`. CLI
## User Service Restart-Recovery Rule
Any new user service, service migration, or change to a service's Compose/docker run configuration must prove it can recover after container restart and Docker daemon restart. The delivery evidence must include the service's `config.json` id/provider/container mapping, restart policy, host-bound private port, persistent mounts or PostgreSQL tables, health readiness fields, and at least one post-restart `bun scripts/cli.ts microservice health <id>` plus a representative `microservice proxy` check through the real provider-gateway path.
Any new user service, service migration, or change to a service's Compose/docker run/k8s configuration must prove it can recover after container restart and Docker daemon restart. The delivery evidence must include the service's `config.json` id/provider/container or Kubernetes Service mapping, restart policy or Deployment replica policy, private port or ClusterIP Service, persistent mounts or PostgreSQL tables, health readiness fields, and at least one post-restart `bun scripts/cli.ts microservice health <id>` plus a representative `microservice proxy` check through the real UniDesk path. `v3sctl-managed` services must prove the proxy path through `v3sctl-adapter` and Kubernetes API service proxy, not the provider-gateway direct business path.
D601 services have an extra gate because Windows, WSL and Docker Desktop are separate supervisors: record the Windows scheduled task or equivalent keepalive, run `docker inspect` to confirm `met-nonlinear-ts`, `claudeqq-backend`, `claudeqq-napcat` and any changed service have non-empty restart policies and host bind mounts for durable state, then verify MET Nonlinear queue/image health and ClaudeQQ logged-in NapCat HTTP/WebSocket state after the restart. A service that only becomes `running` but loses login, queue, token, subscription, data directory or pending work is not restart-recovery complete.
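
The `docker inspect` part of this gate could be checked roughly as follows. This is a sketch, not the repository's implementation: `restartRecoveryIssues` is a hypothetical helper that takes one already-parsed element of the `docker inspect` JSON array, and it only covers the restart policy and bind-mount checks, not the post-restart health verification.

```typescript
// Subset of the `docker inspect` JSON shape that this check reads.
interface InspectedContainer {
  Name: string;
  HostConfig: { RestartPolicy: { Name: string } };
  Mounts: { Type: string; Source: string; Destination: string }[];
}

// Hypothetical helper: returns a list of problems, empty when the container passes.
function restartRecoveryIssues(c: InspectedContainer): string[] {
  const issues: string[] = [];
  const policy = c.HostConfig.RestartPolicy.Name;
  // Docker reports "" or "no" when no restart policy is configured.
  if (policy === "" || policy === "no") {
    issues.push(`${c.Name}: restart policy is empty`);
  }
  // Durable state is expected on at least one host bind mount.
  if (!c.Mounts.some((m) => m.Type === "bind")) {
    issues.push(`${c.Name}: no host bind mount for durable state`);
  }
  return issues;
}
```

A caller would run `docker inspect` for `met-nonlinear-ts`, `claudeqq-backend`, `claudeqq-napcat` and any changed service, parse the output, and fail delivery if any container returns a non-empty list.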