diff --git a/AGENTS.md b/AGENTS.md index 2109841a..05860492 100644 --- a/AGENTS.md +++ b/AGENTS.md @@ -15,6 +15,10 @@ UniDesk 是一个以主 server 为统一入口的分布式工作平台;本文 - 计算节点 `provider-gateway` 容器的重建/升级必须走带 sleep-and-validate 回滚保护的 `provider.upgrade mode=schedule` 远程升级路径或前端等价调度;禁止通过 `bun scripts/cli.ts ssh ` 同步执行 `docker compose up --build provider-gateway` 这类自重建命令,权威规则见 `docs/reference/provider-gateway.md`。 - Host SSH / WSL SSH 透传只能用于节点诊断、前置条件修复和升级后验证,不能作为计算节点 `provider-gateway` 自身的重建/升级通道;部署验收必须同时证明远程升级和 SSH 透传可用,测试门禁见 `TEST.md`。 +## Critical Native k3s Runtime Rule + +- 所有计算资源节点上的 k3s server、控制平面、agent 和 worker 必须原生安装在 host OS 或 WSL 发行版内,禁止用 Docker/Compose/`rancher/k3s` 长驻容器承载 k3s;权威规则见 `docs/reference/arch.md`、`docs/reference/microservices.md` 与 `docs/reference/deploy.md`。 + ## CLI - `bun scripts/cli.ts help`:输出所有可用命令的 JSON 索引,详细规范见 `docs/reference/cli.md`。 @@ -27,7 +31,7 @@ UniDesk 是一个以主 server 为统一入口的分布式工作平台;本文 - `bun scripts/cli.ts server rebuild `:以 build-first、Compose lock、no-deps force-recreate 和 post-up validation 的异步 job 重建主 server Compose 内单个服务;Code Queue 部署在 D601,规则见 `docs/reference/deployment.md`。 - `bun scripts/cli.ts provider attach [--master-server URL] [--up] [--force]`:在新增计算节点上生成两项配置的 provider-gateway 挂载包;默认只需要主 server URL(默认 `http://74.48.78.17/`)和唯一 Provider ID,生成的 Compose 固定 Docker socket、`pid: "host"`、`restart: always`、只读 `/workspace`、SSH 维护私钥挂载和 loopback egress proxy 端口,规则见 `docs/reference/provider-gateway.md`。 - `bun scripts/cli.ts ssh [ssh-like args...]`:通过 provider-gateway 的 Host SSH / WSL SSH 维护桥打开近似原生 ssh 的交互会话或远端命令,并在远端 PATH 注入 `apply_patch`、`glob` 与 `skill-discover`;`apply-patch`、`py`、`skills`、结构化 `find`、`glob` 和 `argv` 子命令用于避免远端补丁、Python stdin、skill 发现与常用只读命令的嵌套转义问题,使用规则见 `docs/reference/cli.md` 和 `docs/reference/provider-gateway.md`。 -- `bun scripts/cli.ts microservice list/status/health/proxy`:管理和验证挂载在主 server、计算节点 Docker 或 k3s 控制面上的用户服务,OA Event Flow/Todo Note/Baidu Netdisk on main-server、K3S Control/Code Queue/MDTODO/FindJob/Pipeline/MET Nonlinear on D601 的规则见 `docs/reference/microservices.md`。 +- `bun scripts/cli.ts microservice list/status/health/proxy`:管理和验证挂载在主 server、计算节点 Docker 或 k3s 控制面上的用户服务,OA Event Flow/Todo Note/Baidu Netdisk on main-server、k3s Control/Code Queue/MDTODO/FindJob/Pipeline/MET Nonlinear on D601 的规则见 `docs/reference/microservices.md`。 - `bun scripts/cli.ts deploy check/plan/apply [--file deploy.json] [--service ]`:按根目录 `deploy.json` 的服务 repo 和 commit 期望状态校验或更新用户服务,目标侧自行 fetch、构建、部署和 live commit 验证;规则见 `docs/reference/deploy.md`。 - `bun scripts/cli.ts codex deploy `:Code Queue 兼容部署入口,会生成临时 desired manifest 并调用 `deploy apply --service code-queue` 的同一条 target-side build 与 live commit 验证路径;规则见 `docs/reference/codex-deploy.md`。 - `bun scripts/cli.ts codex task `:按 Code Queue 任务 ID 查询初始 prompt、最后 assistant message、工具调用摘要、attempt/judge/error 和耗时,便于新任务引用历史 session。 @@ -41,7 +45,7 @@ UniDesk 是一个以主 server 为统一入口的分布式工作平台;本文 - `bun`:TypeScript 运行时固定使用 Bun,组件入口和 CLI 都直接运行 `.ts` 文件,约束见 `docs/reference/config.md`。 - `docker-compose.yml`:主 server 统一编排 core、frontend、database、本机 provider gateway、Todo Note 后端、Baidu Netdisk 后端和 OA Event Flow 后端;Code Queue 和 MDTODO 由 D601 k3s/k8s 控制面代管,并经 `k3sctl-adapter` 的 Kubernetes API service proxy 单一路径接入,服务拓扑见 `docs/reference/deployment.md`。 -- `src/components/frontend`:前端源码固定使用 TypeScript + React,`app.tsx` 只做 shell/router,左侧主模块与顶部子标签统一编译为模块前缀路由:`/ops//`、`/nodes//`、`/tasks//`、`/config//`,只有用户服务使用 `/app//` 深链接,运行总览包含通用性能面板,资源监控含曲线和进程资源排序表,Todo Note、FindJob、Pipeline、MET Nonlinear、Baidu Netdisk、Code Queue、MDTODO、OA Event Flow、K3S Control 等业务页必须拆到独立 TSX 模块,界面规则见 `docs/reference/frontend.md`。 +- `src/components/frontend`:前端源码固定使用 TypeScript + React,`app.tsx` 只做 shell/router,左侧主模块与顶部子标签统一编译为模块前缀路由:`/ops//`、`/nodes//`、`/tasks//`、`/config//`,只有用户服务使用 `/app//` 深链接,运行总览包含通用性能面板,资源监控含曲线和进程资源排序表,Todo Note、FindJob、Pipeline、MET Nonlinear、Baidu Netdisk、Code Queue、MDTODO、OA Event Flow、k3s Control 等业务页必须拆到独立 TSX 模块,界面规则见 `docs/reference/frontend.md`。 - `backend-core / frontend performance`:backend-core 暴露 `/api/performance`,frontend 暴露同源 `/api/frontend-performance` 并在 `/ops/performance/` 汇总组件请求、失败请求、内部操作和慢操作,规则见 `docs/reference/observability.md`。backend-core 源码已拆分为 15 个模块,结构见 `docs/reference/repo-tree.md`。 - `Unified OA event flow`:`oa-event-flow` 是独立主 server 用户服务,提供事件表、按 tag 订阅和 Trace/STEP 统计中心,Code Queue 与 Pipeline 都必须接入统一事件流;共享契约见 `docs/reference/oa-event-flow.md`,Pipeline 专有控制流规则见 `docs/reference/pipeline-oa-event-flow.md`。 - `src/components/provider-gateway`:当前主 server `74.48.78.17` 也作为 provider gateway 接入 UniDesk,外部节点通过 `ws://74.48.78.17:18082/ws/provider` 接入,必须以 `restart: always` 部署 always-enabled 远程升级、sleep-and-validate 回滚保护和 Host SSH / WSL SSH 透传并完成自测,部署与 Playwright 公网前端验证方法见 `docs/reference/provider-gateway.md`。 @@ -53,7 +57,7 @@ UniDesk 是一个以主 server 为统一入口的分布式工作平台;本文 - `docs/reference/arch.md`:UniDesk 分布式工作平台的长期架构约束。 - `docs/reference/repo-tree.md`:仓库结构目标与组件边界。 - `docs/reference/observability.md`:服务日志、任务活性、通用性能指标 API 和性能面板的可观测性规则。 -- `docs/reference/microservices.md`:用户服务(兼容命名 `microservice`)的配置、代理、安全边界、unidesk-direct/k3sctl-managed 部署模式、Todo Note/Baidu Netdisk on main-server、K3S Control/Code Queue/MDTODO/FindJob/Pipeline/MET Nonlinear on D601 和验证规则。 +- `docs/reference/microservices.md`:用户服务(兼容命名 `microservice`)的配置、代理、安全边界、unidesk-direct/k3sctl-managed 部署模式、Todo Note/Baidu Netdisk on main-server、k3s Control/Code Queue/MDTODO/FindJob/Pipeline/MET Nonlinear on D601 和验证规则。 - `docs/reference/windows-passthrough.md`:WSL provider 通过 SSH 透传调用 Windows cmd/PowerShell、Keil、COM 串口和 Windows 侧 skill 的长期规则。 - `docs/reference/constar-d601.md`:D601 上 ConStart/constar 固件工作区的 UniDesk SSH 入口、WSL skill wrapper、Keil 编译下载和串口/JSON-RPC 验证简要引导。 - `docs/reference/oa-event-flow.md`:统一 OA 事件流微服务、事件表、tag 订阅、Trace/STEP 统计中心和前端可见性规则。 diff --git a/TEST.md b/TEST.md index edb4ecfd..427f7c0f 100644 --- a/TEST.md +++ b/TEST.md @@ -99,7 +99,7 @@ ## T23 D601 Code Queue User Service -阅读 `AGENTS.md`(本项目 `AGENTS.md` 同时承担 `SKILL.md` 对 `scripts/cli.ts` 的解释职责),然后用 cli 手动测试以下内容:运行 `bun scripts/cli.ts microservice list`,确认 `code-queue` 显示为 `providerId=D601`、`public=false`、`frontendOnly=true`、仓库 URL `https://github.com/pikasTech/unidesk`、k3s/k8s `k3s://unidesk/code-queue:4222` 逻辑服务映射、`deployment.mode=k3sctl-managed`、`runtime.orchestrator=k3sctl` 且无业务直连容器摘要;使用 `bun scripts/cli.ts codex deploy <已push的commitId>` 重建/启动 D601 Code Queue,确认命令立即返回异步 job id,`bun scripts/cli.ts job status --tail-bytes 30000` 能看到 fetch/export、rsync、Docker build、k3s image import、kubectl apply、部署 commit 戳记、rollout 和 health commit 验证进度,并确认 job 最终校验真实 Code Queue `/health` 返回的 `deploy.commit` 精确匹配本次 remote commit,不能由旧服务或旧 Pod 充数;同时确认主 server 根目录 `docker-compose.yml` 中不再存在 `code-queue` service。运行 `bun scripts/cli.ts microservice health code-queue`、`bun scripts/cli.ts microservice proxy code-queue /api/dev-ready --raw`、`bun scripts/cli.ts microservice proxy code-queue '/api/tasks/overview?limit=5&transcriptLimit=1&compact=1&afterSeq=0&preferId='` 和 `bun scripts/cli.ts codex task <已有taskId>`,确认链路通过 backend-core、k3sctl-adapter、Kubernetes API service proxy 和 D601 active Code Queue Service,且 task id 查询返回初始 prompt、最后 assistant message、工具调用摘要、attempt/judge/error 和耗时,`queue.storage.primary=postgres`、`queue.storage.postgresReady=true`、`queue.devReady.missingTools=[]`、`queue.devReady.docker.versionOk=true`、`queue.devReady.docker.composeOk=true`;`queue.devReady.ssh.ready` 只在需要跨 Provider SSH/Windows-native 任务时作为强制项。在 D601 `code-queue-backend` 容器内验证主 PostgreSQL 端口映射可执行 `select 1`,主 OA Event Flow 端口映射 `/health` 可访问,本机 ClaudeQQ `http://host.docker.internal:3290/health` 可访问;这些映射不得成为任意公网入口。 +阅读 `AGENTS.md`(本项目 `AGENTS.md` 同时承担 `SKILL.md` 对 `scripts/cli.ts` 的解释职责),然后用 cli 手动测试以下内容:运行 `bun scripts/cli.ts microservice list`,确认 `code-queue` 显示为 `providerId=D601`、`public=false`、`frontendOnly=true`、仓库 URL `https://github.com/pikasTech/unidesk`、k3s/k8s `k3s://unidesk/code-queue:4222` 逻辑服务映射、`deployment.mode=k3sctl-managed`、`runtime.orchestrator=k3sctl` 且无业务直连容器摘要;使用 `bun scripts/cli.ts codex deploy <已push的commitId>` 重建/启动 D601 Code Queue,确认命令立即返回异步 job id,`bun scripts/cli.ts job status --tail-bytes 30000` 能看到 fetch/export、rsync、Docker build、k3s image import、kubectl apply、部署 commit 戳记、rollout 和 health commit 验证进度,并确认 job 最终校验真实 Code Queue `/health` 返回的 `deploy.commit` 精确匹配本次 remote commit,不能由旧服务或旧 Pod 充数;同时确认主 server 根目录 `docker-compose.yml` 中不再存在 `code-queue` service,并通过 `bun scripts/cli.ts ssh D601 argv bash -lc 'systemctl is-active k3s && KUBECONFIG=/etc/rancher/k3s/k3s.yaml kubectl get nodes -o wide && ! docker ps --format "{{.Names}} {{.Image}}" | grep -q "rancher/k3s"'` 或等价检查证明 D601 k3s 是 WSL 原生 systemd 服务且没有 active `rancher/k3s` 控制面容器。运行 `bun scripts/cli.ts microservice health code-queue`、`bun scripts/cli.ts microservice proxy code-queue /api/dev-ready --raw`、`bun scripts/cli.ts microservice proxy code-queue '/api/tasks/overview?limit=5&transcriptLimit=1&compact=1&afterSeq=0&preferId='` 和 `bun scripts/cli.ts codex task <已有taskId>`,确认链路通过 backend-core、k3sctl-adapter、Kubernetes API service proxy 和 D601 active Code Queue Service,且 task id 查询返回初始 prompt、最后 assistant message、工具调用摘要、attempt/judge/error 和耗时,`queue.storage.primary=postgres`、`queue.storage.postgresReady=true`、`queue.devReady.missingTools=[]`、`queue.devReady.docker.versionOk=true`、`queue.devReady.docker.composeOk=true`;`queue.devReady.ssh.ready` 只在需要跨 Provider SSH/Windows-native 任务时作为强制项。在 D601 `code-queue-backend` 容器内验证主 PostgreSQL 端口映射可执行 `select 1`,主 OA Event Flow 端口映射 `/health` 可访问,本机 ClaudeQQ `http://host.docker.internal:3290/health` 可访问;这些映射不得成为任意公网入口。 随后登录公网 frontend `http://74.48.78.17:18081/`,进入 `用户服务 / Code Queue`,确认页面显示默认模型 `gpt-5.5`、默认执行 Provider `D601`、默认工作目录 `/workspace`、模型下拉菜单包含 `gpt-5.4-mini`/`gpt-5.4`/`gpt-5.5`、入队份数、队列指标、任务 ID、复制任务 ID、引用按钮、任务耗时、引用任务 ID、清空输入、创建成功提示、任务提交表单、Trace 输出、attempt 表、MiniMax/fallback judge 状态、追加 prompt、打断和重试控件;通过页面提交一个小任务,确认任务进入 queued/running/succeeded 或可解释的 failed 状态,并且输出区能看到运行中的 Codex 消息。批量验收时设置 `入队份数=5` 或用 `---` 分隔 5 段 prompt,一次性入队 5 条任务,确认 5 条任务按顺序运行并全部进入 succeeded 或可解释的非成功终态,不能只运行第一条后停止;其中任一任务被 judge 判定 `fail` 时只能把当前任务标为 failed,后续 queued 任务仍必须继续推进。测试异常中断时可以提交长任务后点击 `打断`,确认任务变为 canceled 或被 judge 标记为非成功终态;自动重试只应在服务端/传输异常、任务正常结束但 execution record 显示未完成、或 judge 判定 retry 时发生;retry 必须复用已有 Codex thread 并 append 继续执行 prompt,只有当前任务 complete 后才推进队列中的下一个任务。MiniMax judge 必须能处理 Markdown fence/夹杂文本等 JSON 去噪;若去噪后仍失败,必须把解析错误和上一轮去噪前原始回答反馈给 MiniMax 修复后重试,日志中应出现 `judge_json_parse_retry`,且 repair 成功时仍以 `source=minimax` 返回。Codex provider key 只能通过 `OPENAI_API_KEY`、`CRS_OAI_KEY` 这类运行时环境透传,MiniMax API key 只能通过 D601 env-file 运行时环境传入,禁止写入 `config.json`、Dockerfile、源码或测试文档。 diff --git a/docs/reference/arch.md b/docs/reference/arch.md index f8828b27..21df8fea 100644 --- a/docs/reference/arch.md +++ b/docs/reference/arch.md @@ -54,6 +54,11 @@ - The primary path for automated task dispatching and execution is via the local Docker socket - Access to the local environment via WSL SSH is reserved solely as an auxiliary path for emergency maintenance and troubleshooting, exposed only as bounded `host.ssh` probe/exec tasks - Automating task deployment or dispatching through the WSL SSH channel is forbidden + - Native k3s Runtime + - Any k3s server, control-plane node, agent, or worker running on a computing resource machine must be installed directly on that node's host OS or WSL distro, normally as the native `k3s` systemd service. + - Running k3s itself inside Docker or any long-lived container is forbidden. Docker remains available for provider-gateway, target-side image builds, user workload containers, and temporary artifact extraction, but not as the k3s runtime boundary. + - Kubernetes hostPath, local-path storage, node labels, kubelet, CNI, and `/workspace` semantics must resolve against the real computing node filesystem. For WSL nodes such as D601, a Code Queue task working in `/workspace` must see the WSL host `/home/ubuntu`, not a container-private `/home/ubuntu`. + - If a legacy `rancher/k3s` container is present during migration, it is only an artifact source or rollback reference; it must not remain the active control plane and must be stopped before accepting the node as healthy. - Connection Management - When registering, a node carries an authentication token to verify its identity and declares resources such as GPU/CPU - The authentication token is pre-issued by the main server and configured at Provider Gateway startup diff --git a/docs/reference/deploy.md b/docs/reference/deploy.md index 7af485b2..e791d17a 100644 --- a/docs/reference/deploy.md +++ b/docs/reference/deploy.md @@ -68,7 +68,7 @@ The reconciler selects the executor from `config.json`: - `deployment.mode=unidesk-direct` on `main-server`: build the image on the main server, then use the fixed UniDesk Compose project and `up -d --no-build --no-deps --force-recreate `. - `deployment.mode=unidesk-direct` on a provider: dispatch `host.ssh` to that provider, build on the provider, then use the service's provider-local compose file and project. The executor resolves the actual Compose project, image name, build context, Dockerfile and target from the running container labels and `docker compose config`; it must not guess an image tag that the service will not actually run. -- `deployment.mode=k3sctl-managed`: dispatch to the active control target, build on that target, import the image into k3s/containerd, apply the existing Kubernetes manifest, stamp the Deployment and wait for rollout. +- `deployment.mode=k3sctl-managed`: dispatch to the active control target, build on that target, verify or install native k3s on the host OS/WSL distro, import the image into native k3s/containerd, apply the existing Kubernetes manifest, stamp the Deployment and wait for rollout. The executor must use the native kubeconfig and containerd socket, for example `/etc/rancher/k3s/k3s.yaml` and `/run/k3s/containerd/containerd.sock`; running k3s itself in Docker is forbidden for both control-plane and worker nodes. A `rancher/k3s` image or legacy container may only be used as a temporary artifact source during migration, and any active containerized k3s control plane must be stopped before verification succeeds. Existing service-specific commands such as Code Queue deploy should converge onto this reconciler path instead of keeping a parallel implementation. diff --git a/docs/reference/deployment.md b/docs/reference/deployment.md index 91ec99e8..2bdc64fb 100644 --- a/docs/reference/deployment.md +++ b/docs/reference/deployment.md @@ -52,7 +52,7 @@ frontend 的 Docker 上线顺序为:先运行必要的本地校验,例如 `b ## Health Criteria -服务跑通的最低标准是:backend-core 内网 `/health` 返回 ok,frontend 公网 `/health` 返回 ok,provider ingress 公网 `/health` 返回 ok,database 在容器内 `pg_isready` 可用,Todo Note 后端 `/api/health` 返回 `storage=postgres`,`k3sctl-adapter` `/api/control-plane` 可见 `unidesk-k3s` Kubernetes API service proxy 状态、D601 active serving healthy、D518 standby pod ready、`presentNodeIds=[D601,D518]`、`missingNodeIds=[]` 和 no-fallback 路径,Code Queue `/health` 经 k3s active Service 返回轻量 readiness、默认模型、`queue.storage` 和 `egressProxy.connected=true`,`/api/tasks/overview` 返回 PostgreSQL 队列总览,Project Manager `/health` 返回 `storage.primary=postgres` 和项目数量,backend-core `/api/performance` 返回性能指标,`/api/nodes` 中出现 `main-server`、`D601` 和 `D518` provider 且状态为 `online`,`/api/nodes/system-status` 中出现对应 CPU/内存/硬盘采样,`/api/nodes/docker-status` 中能看到主 server、D601 与 D518 Docker 快照。D518 必须通过 K3S agent 加入 K3S 控制面并运行 `code-queue-d518` standby Pod;不得用 D601->D518 直连、NodePort 或 provider-gateway business HTTP 绕过 Kubernetes service route。交付前还必须运行 `bun scripts/cli.ts e2e run`,并以 `docs/reference/e2e.md` 的门禁作为最终判定。 +服务跑通的最低标准是:backend-core 内网 `/health` 返回 ok,frontend 公网 `/health` 返回 ok,provider ingress 公网 `/health` 返回 ok,database 在容器内 `pg_isready` 可用,Todo Note 后端 `/api/health` 返回 `storage=postgres`,`k3sctl-adapter` `/api/control-plane` 可见 `unidesk-k3s` Kubernetes API service proxy 状态、D601 active serving healthy、D518 standby pod ready、`presentNodeIds=[D601,D518]`、`missingNodeIds=[]` 和 no-fallback 路径,Code Queue `/health` 经 k3s active Service 返回轻量 readiness、默认模型、`queue.storage` 和 `egressProxy.connected=true`,`/api/tasks/overview` 返回 PostgreSQL 队列总览,Project Manager `/health` 返回 `storage.primary=postgres` 和项目数量,backend-core `/api/performance` 返回性能指标,`/api/nodes` 中出现 `main-server`、`D601` 和 `D518` provider 且状态为 `online`,`/api/nodes/system-status` 中出现对应 CPU/内存/硬盘采样,`/api/nodes/docker-status` 中能看到主 server、D601 与 D518 Docker 快照。D518 必须通过原生 k3s agent 加入原生 k3s 控制面并运行 `code-queue-d518` standby Pod;不得用 Docker 化 k3s、D601->D518 直连、NodePort 或 provider-gateway business HTTP 绕过 Kubernetes service route。交付前还必须运行 `bun scripts/cli.ts e2e run`,并以 `docs/reference/e2e.md` 的门禁作为最终判定。 ## Code Queue D601 Resource Budget diff --git a/docs/reference/e2e.md b/docs/reference/e2e.md index 8c5f2c26..0db0ac87 100644 --- a/docs/reference/e2e.md +++ b/docs/reference/e2e.md @@ -41,7 +41,7 @@ Typical targeted commands: - Pipeline OA event flow: `microservice:pipeline-oa-event-flow` must prove both no-audit and monitor-audit runs are driven by OA events end to end. The event stream must show `node-finished` as a neutral fact with `pipeline:{pipelineId}` and `epoch:{runId}` tags, OA policy as the source of downstream/audit decisions, monitor decisions as OA control events, and runner control-result evidence. E2E must fail if delivery still depends on a legacy detail audit policy flag as policy authority, independent legacy audit-request points, a legacy batch completion gate, direct monitor-to-runner calls, or frontend/CLI writes to Pipeline `.state`. - The same Pipeline OA diagnostics must fail on legacy file-transport residuals. Procedure containers, monitor sessions, UI/Gantt DTO builders and CLI fetches must consume prompt/control/stop/display evidence only from the OA event ledger and normalized HTTP read APIs; `control-prompts.jsonl`, `monitor-prompts.jsonl`, `monitor-control`, `control-events.jsonl`, monitor stop files, `.state/pipeline-runs/{runId}/control/commands/`, `PIPELINE_*_APPEND_FILE`, local JSONL append/read helpers, and monitor `/pipeline-state` mounts are forbidden in runtime source. - Pipeline live Gantt setup: when `frontend:pipeline-gantt-observation-live-running` is selected, E2E first looks for a current Pipeline run that already contains both a `node-long-running-observation` marker and a still-running execution interval. If no such candidate exists, the E2E setup starts the D601 `monitor-management-behavior-test` pipeline through `bun scripts/cli.ts ssh D601 ...` and polls the private backend proxy until the observation candidate exists; the acceptance assertion itself still opens the public frontend with Playwright and verifies the rendered arrows, absence of observation source pseudo-points, target arrow inset, and live flashing running bar through React DOM controls. -- Frontend: Playwright must open the public frontend URL derived from `network.publicHost`, not localhost or a Docker-internal URL; it logs in with the configured account, waits for `核心在线`, asserts that `main-server` and `Main Server Provider` are visible, verifies desktop sidebar collapse and `PGDATA` overview metric, opens `运行总览 / 性能面板` to verify `Bwebui`、组件汇总、最近失败请求、内部操作汇总和最近慢操作, clicks `查看原始JSON` to verify Provider data from the frontend, confirms no raw JSON is visible before that click, opens task history to verify duration and failure diagnostics, opens resource nodes `资源监控` to verify CPU/Memory/Disk curves, the structured process resource table, default memory-desc sorting, sortable CPU column and provider upgrade precheck dispatch, opens `Docker 状态`, switches to `main-server`, and verifies the Docker Desktop-style container view including the database named volume `unidesk_pgdata_10gb`, opens `网关版本` and verifies the provider-gateway version, SSH 透传可用性、远程更新可用性 plus structured remote update records for `provider.upgrade`, then opens `用户服务 / 服务目录`、`用户服务 / Todo Note`、`用户服务 / OA Event Flow`、`用户服务 / K3S Control`、`用户服务 / Code Queue`、`用户服务 / FindJob`、`用户服务 / Pipeline` and `用户服务 / MET Nonlinear` to verify 主 server Todo Note/OA Event Flow、D601 Code Queue、D601 业务服务、仓库引用、私有后端映射、Todo Note 迁移清单和树形任务、OA Event Flow 事件表和 Trace stats 表、K3S 控制面/D601-D518 实例/Kubernetes API service proxy/no-fallback 路径、Code Queue 队列/模型/输出/初始 `Submitted prompt`/终态任务自动加载完整 Trace/追加 prompt/打断控件、FindJob 指标和岗位预览、Pipeline 组件矩阵、MiniMax 限额卡片、结构化 OA 事件流诊断面板、React Flow 控制图、epoch 甘特图、甘特图渲染图导出、monitor 首列排序、长任务观察连线、无观察来源伪点、running node 实时闪动执行条和 OpenCode Trace、MET Nonlinear 项目库/Fork/待启动队列/当前队列/已完成/失败诊断/GPU/镜像都通过 React 控件展示。Playwright 还必须验证 Code Queue 页面所有 API 请求走 `/api/microservices/code-queue/proxy`,不得再出现 `/api/code-queue-direct`;深链接直达路由例如公网 `http://:/app/pipeline/` 能直接落到 Pipeline 页面,随后切到 `资源节点 / Docker 状态` 时地址栏更新为 `/nodes/docker/`,并且浏览器 history 返回链路仍能回到 `/app/pipeline/`;还必须直开 `/app/code-queue/` 验证页面存在 `app-shell`、左侧主模块边栏、顶部状态栏、顶部子标签和 `code-queue-page`,防止用户服务 deep link 退化成缺 shell 的 standalone 页面;同时 `态势总览` 这类非用户服务页面应落在自己的模块前缀下,例如 `/ops/status/`。Playwright 必须覆盖默认可见时间按北京时间显示,至少包括顶部 `北京时间` 时钟、任务历史/网关版本更新时间和用户服务刷新时间,不得随浏览器本地时区漂移。Task history and provider upgrade records must not display a real sub-second duration as `0s`; MET Nonlinear running rows must show an ETA derived from backend progress or from `startedAt` plus epoch progress, and queue/completed rows must show training speed as `epoch/h`. +- Frontend: Playwright must open the public frontend URL derived from `network.publicHost`, not localhost or a Docker-internal URL; it logs in with the configured account, waits for `核心在线`, asserts that `main-server` and `Main Server Provider` are visible, verifies desktop sidebar collapse and `PGDATA` overview metric, opens `运行总览 / 性能面板` to verify `Bwebui`、组件汇总、最近失败请求、内部操作汇总和最近慢操作, clicks `查看原始JSON` to verify Provider data from the frontend, confirms no raw JSON is visible before that click, opens task history to verify duration and failure diagnostics, opens resource nodes `资源监控` to verify CPU/Memory/Disk curves, the structured process resource table, default memory-desc sorting, sortable CPU column and provider upgrade precheck dispatch, opens `Docker 状态`, switches to `main-server`, and verifies the Docker Desktop-style container view including the database named volume `unidesk_pgdata_10gb`, opens `网关版本` and verifies the provider-gateway version, SSH 透传可用性、远程更新可用性 plus structured remote update records for `provider.upgrade`, then opens `用户服务 / 服务目录`、`用户服务 / Todo Note`、`用户服务 / OA Event Flow`、`用户服务 / k3s Control`、`用户服务 / Code Queue`、`用户服务 / FindJob`、`用户服务 / Pipeline` and `用户服务 / MET Nonlinear` to verify 主 server Todo Note/OA Event Flow、D601 Code Queue、D601 业务服务、仓库引用、私有后端映射、Todo Note 迁移清单和树形任务、OA Event Flow 事件表和 Trace stats 表、k3s 控制面/D601-D518 实例/Kubernetes API service proxy/no-fallback 路径、Code Queue 队列/模型/输出/初始 `Submitted prompt`/终态任务自动加载完整 Trace/追加 prompt/打断控件、FindJob 指标和岗位预览、Pipeline 组件矩阵、MiniMax 限额卡片、结构化 OA 事件流诊断面板、React Flow 控制图、epoch 甘特图、甘特图渲染图导出、monitor 首列排序、长任务观察连线、无观察来源伪点、running node 实时闪动执行条和 OpenCode Trace、MET Nonlinear 项目库/Fork/待启动队列/当前队列/已完成/失败诊断/GPU/镜像都通过 React 控件展示。Playwright 还必须验证 Code Queue 页面所有 API 请求走 `/api/microservices/code-queue/proxy`,不得再出现 `/api/code-queue-direct`;深链接直达路由例如公网 `http://:/app/pipeline/` 能直接落到 Pipeline 页面,随后切到 `资源节点 / Docker 状态` 时地址栏更新为 `/nodes/docker/`,并且浏览器 history 返回链路仍能回到 `/app/pipeline/`;还必须直开 `/app/code-queue/` 验证页面存在 `app-shell`、左侧主模块边栏、顶部状态栏、顶部子标签和 `code-queue-page`,防止用户服务 deep link 退化成缺 shell 的 standalone 页面;同时 `态势总览` 这类非用户服务页面应落在自己的模块前缀下,例如 `/ops/status/`。Playwright 必须覆盖默认可见时间按北京时间显示,至少包括顶部 `北京时间` 时钟、任务历史/网关版本更新时间和用户服务刷新时间,不得随浏览器本地时区漂移。Task history and provider upgrade records must not display a real sub-second duration as `0s`; MET Nonlinear running rows must show an ETA derived from backend progress or from `startedAt` plus epoch progress, and queue/completed rows must show training speed as `epoch/h`. - Frontend dense-layout regression gate: whenever a frontend change touches Pipeline 右侧边栏、Trace timeline、详情抽屉、甘特图坐标或其他高信息密度面板, Playwright acceptance must inspect both `总高度` and `横向滚动条`. For Pipeline specifically, the OpenCode Trace session head must carry shared agent/model/session facts and the Trace body must use the same Code Queue `TraceView` styling; Playwright must fail if old `.pipeline-opencode-step`, `.pipeline-opencode-flow`, `.pipeline-step-message-card` or `.pipeline-opencode-part` user-visible styles reappear, if the Trace container introduces an internal horizontal scrollbar, or if `frontend:pipeline-gantt-frontend-y-accuracy` fails to prove the frontend `frontend-y` layout maps ticks, markers and execution bars from timestamps to y coordinates within tolerance. - OpenCode Trace must use Code Queue Trace styling and must not render the deprecated Pipeline continuous step connector; Playwright should fail if `.pipeline-opencode-flow`, `.pipeline-opencode-step` or any equivalent continuous connector/card returns to the user-visible Trace. - User service frontend assertions must wait for real backend data, not only the page skeleton. For Todo Note this means the page must show the migrated lists `CONSTAR`、`大论文`、`找工作`、`小论文`、`事务`, support creating a temporary list and task through the frontend, and delete that temporary list afterwards. The temporary list must be selected again by its unique generated name before deletion so E2E never deletes a migrated source list by accident. For FindJob this means the page must show a numeric `岗位总量`, `HEALTH OK`, and a non-empty `PREVIEW` count such as `40/1463 PREVIEW`; for Pipeline this means the page must show `Pipeline v2 工作台`, `Health OK`, a numeric component count, a non-empty React Flow control graph, `控制图`, `Epoch 甘特图`, and after clicking a Gantt execution line it must show `OpenCode Trace` rendered by the shared Code Queue-style Trace component with messages and tool-call groups; for MET Nonlinear this means the page must show `MET Nonlinear 训练编排`, `Health OK`, `Fork Project`, `加入待启动队列`, `启动队列`, `当前队列`, 最大并发设置、task queue and GPU/image panels, and must not show the removed hard-coded `创建10个10轮任务` frontend entry. The MET Nonlinear project library must render `projects/` and `ex_projects/` as a true path tree with folder Project counts; clicking a project row must open a structured detail panel containing `config.json`, `data/ 训练状态`, `模型参数`, `指标` and a parameter count such as `Total Params`; clicking a completed/current/failed job row must open a structured job detail and both the row and detail must show `epoch/h`. Full MET Nonlinear acceptance is driven by public frontend controls: choose a visible source Project, set batch size, epochs and max concurrency in inputs, fork into `projects/unidesk_forks/`, stage the selected forks, start the queue, and verify completed rows plus automatic `metnl-train-*` container removal; loading placeholders like `--` or empty states are not sufficient for E2E success. diff --git a/docs/reference/frontend.md b/docs/reference/frontend.md index 5dd80442..c11688d7 100644 --- a/docs/reference/frontend.md +++ b/docs/reference/frontend.md @@ -6,7 +6,7 @@ UniDesk 前端是 React 组件化工业控制台,不追求展示型大屏效 frontend 应用源码必须使用 TypeScript + React,禁止在 `src/components/frontend` 中维护手写 `.js` / `.jsx` 应用源码。浏览器请求的 `/app.js` 只能由 frontend Bun server 从 `src/components/frontend/src/app.tsx` 及其 TSX imports 转译生成;`public/` 目录只保存 HTML/CSS 等静态资产,不提交手写 `app.js`。 -`src/components/frontend/src/app.tsx` 只承担应用 shell、登录、全局数据加载、主模块/子标签路由和通用控制台页面。用户服务前端必须模块化到独立 TSX 文件,禁止继续把所有业务页面堆进 `app.tsx`。当前长期固定入口为:`todo-note.tsx` 承载 Todo Note 工作台,`findjob.tsx` 承载 FindJob 工作台,`pipeline.tsx` 承载 Pipeline 工作台,`met-nonlinear.tsx` 承载 MET Nonlinear 训练编排工作台,`claudeqq.tsx` 承载 ClaudeQQ QQ 消息网关工作台,`baidu-netdisk.tsx` 承载 Baidu Netdisk 存储工作台,`code-queue.tsx` 承载 Code Queue 控制台,`oa-event-flow.tsx` 承载统一 OA 事件流控制台,`k3sctl.tsx` 承载 K3S Control Plane 页面;新增用户服务也必须按同样规则新增独立页面模块,并由 `app.tsx` 只做导入和路由分发。 +`src/components/frontend/src/app.tsx` 只承担应用 shell、登录、全局数据加载、主模块/子标签路由和通用控制台页面。用户服务前端必须模块化到独立 TSX 文件,禁止继续把所有业务页面堆进 `app.tsx`。当前长期固定入口为:`todo-note.tsx` 承载 Todo Note 工作台,`findjob.tsx` 承载 FindJob 工作台,`pipeline.tsx` 承载 Pipeline 工作台,`met-nonlinear.tsx` 承载 MET Nonlinear 训练编排工作台,`claudeqq.tsx` 承载 ClaudeQQ QQ 消息网关工作台,`baidu-netdisk.tsx` 承载 Baidu Netdisk 存储工作台,`code-queue.tsx` 承载 Code Queue 控制台,`oa-event-flow.tsx` 承载统一 OA 事件流控制台,`k3sctl.tsx` 承载 k3s Control Plane 页面;新增用户服务也必须按同样规则新增独立页面模块,并由 `app.tsx` 只做导入和路由分发。 ## Shared Dialog Component @@ -95,7 +95,7 @@ frontend shell 必须把左侧主模块与顶部子标签编译为统一的 URL - `ClaudeQQ` 子标签必须把 D601 ClaudeQQ 后端渲染为 UniDesk React 控件,包括 NapCat 容器登录二维码、NapCat HTTP/WS 状态、事件缓存、QQ 事件订阅表、订阅创建表单、消息推送表单、主用户私聊账号 `645275593` 标记、最近 QQ 事件、已发送记录和显式原始 JSON 按钮。 - `Baidu Netdisk` 子标签必须把主 server `baidu-netdisk-backend` 后端渲染为 UniDesk React 控件,包括 OAuth 设备码二维码/用户码登录、账号容量、配置工作根文件浏览(当前默认百度网盘根目录 `/`)、staging 目录上传/下载任务、上传/下载自测按钮与 MD5 结果、脱敏安全说明、日志摘要和显式原始 JSON 按钮;不得把 access token、refresh token、dlink 或 staging 文件字节流裸露到浏览器。 - `OA Event Flow` 子标签必须把主 server `oa-event-flow-backend` 后端渲染为 UniDesk React 控件,包括服务健康、事件表、tag 过滤、SSE live 状态、Trace/STEP stats 表、Code Queue/Pipeline 标签入口和显式原始 JSON 按钮;默认页面不得裸铺完整事件 JSON,事件表只展示结构化摘要,完整 envelope/payload 只能通过 `查看原始JSON` 打开。 - - `K3S Control` 子标签必须把 D601 `k3sctl-adapter` 控制面渲染为 UniDesk React 控件,包括 control plane 状态、manifest 列表、D601/D518 节点实例、active instance、single-writer/no-fallback 路径、Kubernetes API service proxy 状态、kubectl/k3s snapshot 摘要和显式原始 JSON 按钮;页面只能通过 `/api/microservices/k3sctl-adapter/proxy/api/control-plane` 取数,不得直接访问 provider-gateway、NodePort、业务容器端口或裸 k3s/kubectl API。 + - `k3s Control` 子标签必须把 D601 `k3sctl-adapter` 控制面渲染为 UniDesk React 控件,包括 control plane 状态、manifest 列表、D601/D518 节点实例、active instance、single-writer/no-fallback 路径、Kubernetes API service proxy 状态、kubectl/k3s snapshot 摘要和显式原始 JSON 按钮;页面只能通过 `/api/microservices/k3sctl-adapter/proxy/api/control-plane` 取数,不得直接访问 provider-gateway、NodePort、业务容器端口或裸 k3s/kubectl API。 - `Code Queue` 子标签必须把 D601 k3s/k8s `code-queue` Service 渲染为 UniDesk React 控件,前端 API 基址只能是 `/api/microservices/code-queue/proxy`,不能继续使用旧 `/api/code-queue-direct` 别名;页面包括多 queue lane、queue 内串行、queue 间并行、queue 合并(点击“合并 queue”后必须用公共 `UniDeskDialog` 打开独立小窗口,用下拉菜单选择源 queue;不得把源 queue 选择控件塞进正常提交任务的 Queue 选择区;合并后自动删除源 queue,只保留合并后的目标 queue,目标 queue 按原 queueEnteredAt/createdAt 时间顺序串行)、任务 ID/复制任务 ID、引用按钮、任务耗时、任务提交/批量提交、引用任务 ID、创建成功提示、清空输入、模型下拉、执行 Provider 下拉、执行模式下拉(默认容器/本机或 `windows-native`)、显式入队份数、默认模型 `gpt-5.5`、MiniMax judge 状态、Codex CLI-like 输出流、attempt 终态、运行中追加 prompt、打断、手动重试和显式原始 JSON 按钮;`windows-native` 模式必须在任务 JSON、卡片和 Trace 头部显示,并要求非本机 WSL Provider 与 `/mnt/` 工作目录;Codex CLI-like 输出流必须始终保留任务的初始 `Submitted prompt` 和运行中 `Steer prompt`;整个 agent loop 消息流统一命名为专有名词 `Trace`,`Trace` 包含 assistant message、user prompt、system event 和 tool call,但非错误 system event 默认只保留在原始输出/数据库中,不在 TraceView 展示;Code Queue 与 Pipeline/OpenCode messages 必须共用 `src/components/frontend/src/trace.tsx` 的 Trace 公共组件、统一 Trace item 接口和 codex/opencode port 适配层;连续 read/edit/run 工具调用只是在 Trace 内折叠为可展开工具调用组,汇总格式至少包含 `xx read, xx edit, xx run`,并展示读取文件、编辑文件、运行命令和耗时摘要;最近 3 个工具调用保持展开,工具调用内容不得自动换行且必须在工具调用块内部横向滚动,工具调用组展开后不得再增加额外左侧缩进;message 与 prompt 必须自动换行,普通 message 不显示左侧项目符号缩进且永不折叠;Trace 首屏可以是摘要预览,但终态任务被选中后必须自动在后台加载完整 Trace,手动“加载完整 Trace”也必须从 Code Queue output archive 分页补齐早期 trace,不得把 preview 的 `hasMore=false` 当成完整历史;即使热状态为控制体积裁剪了早期 raw output,也要从结构化 `basePrompt/displayPrompt/promptHistory` 和 archive 合成完整用户输入与 agent trace,并且初始 prompt 默认显示注入前 prompt 而不是引用注入全文;当初始 prompt 含引用注入时,引用内容必须默认折叠,并只在 Trace 的初始消息中提供可展开的“最终传入 Codex 的真实完整 prompt”,不得再渲染独立 Prompt 全量卡片;多轮引用注入必须按上游/最早上下文在前、直接引用在后的顺序排列,每一轮必须有明确 `Reference Round N/M` 分割线和时间范围,不能用固定 6 轮截断引用链;点击队列引用按钮必须自动把该任务 ID 写入提交表单的引用输入框,引用任务 ID 创建新任务时必须自动注入 `bun scripts/cli.ts codex task ` 的提示;连续执行同一 prompt 应通过入队份数一次性生成多条任务,避免快速连点造成操作员误判。 - `MDTODO` 子标签必须把 D601 k3s `mdtodo` Service 渲染为 UniDesk React 控件,前端 API 基址只能是 `/api/microservices/mdtodo/proxy`;页面包括 TODO Markdown 文件列表、任务树、状态徽标、标题与正文编辑、新增根任务/子任务、删除任务、执行命令生成、hostPath 健康摘要和显式原始 JSON 按钮,不得 iframe 原 VS Code webview、公开 VSIX 旧前端或把完整 Markdown/JSON 默认铺在页面上。 - `Code Queue` 前端改进必须在同一任务内重建并上线公网 frontend,不能只修改源码或本地 bundle;重建 frontend 是无状态 WebUI 替换,不会导致 Code Queue 长期任务失败。已结束未读任务只能在 task card 边角显示类似未读消息的 `codex-unread-badge` 圆点和“标为已读”操作,不得把整张卡片改成红色/琥珀色失败态边框、背景或胶囊标签;状态栏的“结束未读”提示也不得使用失败态红色。 diff --git a/docs/reference/microservices.md b/docs/reference/microservices.md index 62e5e5fa..b7a80b79 100644 --- a/docs/reference/microservices.md +++ b/docs/reference/microservices.md @@ -123,22 +123,23 @@ Baidu Netdisk 在 UniDesk 语境中按纯后端服务管理:不得暴露百度 - provider-gateway 容器必须配置 `extra_hosts: ["host.docker.internal:host-gateway"]`,确保 D601/D518 Linux Docker 容器内能访问 WSL host 侧 `127.0.0.1:4251` 映射。 - UniDesk 前端:`用户服务 / File Browser` React 页面展示 D518 主目标和 D601 备用目标的健康状态、仓库引用、私有后端映射,并以 iframe 嵌入对应 File Browser WebUI;页面必须提供截图导出入口,并对上游 WebUI 注入紧凑布局样式,避免 material icon 字体异常时 `folder` 文本遮挡文件名;File Browser 自身端口不得直接暴露公网。 -### K3S Control Plane Adapter On D601 +### k3s Control Plane Adapter On D601 `k3sctl-adapter` 是 UniDesk 直管的控制面适配微服务;它本身仍按 `deployment.mode=unidesk-direct` 登记在 D601,由 UniDesk 负责健康检查、重建和前端可见性,但它管理的业务服务必须走 k3s 标准服务路由,不得再被 backend-core 直接下发到业务容器所在 provider。 - Provider:`D601`,由 D601 provider-gateway 仅维护和访问 `k3sctl-adapter` 的本机私有端口 `127.0.0.1:4266`;provider-gateway 不再作为 `code-queue` 业务请求的直接代理。 - 代码引用:`https://github.com/pikasTech/unidesk` 与配置中的 `repository.commitId`;服务源码位于 `src/components/microservices/k3sctl-adapter`,属于 UniDesk 自有控制面组件。 - 部署引用:UniDesk 仓库中的 `src/components/microservices/k3sctl-adapter/docker-compose.d601.yml`,Dockerfile 为 `src/components/microservices/k3sctl-adapter/Dockerfile`,容器名为 `k3sctl-adapter`。 -- k3s 实现:当前运行控制面为 D601 上的 `unidesk-k3s` K3S server 和 D518 上的 K3S agent;`k3sctl-managed` 可以落到 k3s、k8s 或等价标准 Kubernetes 控制面,但必须使用 Kubernetes 原生命名空间、Deployment、Service、readiness/liveness probe、Kubernetes API service proxy 等规范对象;不得把裸容器端口、NodePort、SSH curl、provider-gateway `microservice.http` 或 host 直连地址伪装成 k3s 服务路由。 -- K3S 系统组件:`unidesk-k3s` server 必须禁用非必要的 `traefik`、`servicelb` 和 `metrics-server`,只保留业务必需的 API server、CoreDNS 与 local-path provisioner;CoreDNS 和 local-path provisioner 固定运行在 D601 控制面节点,避免 D518 维护隧道限制导致系统 DNS/readiness 抖动。 +- k3s 实现:D601 控制面、D518 或其他计算资源节点上的 k3s agent/worker 都必须原生安装在节点 host OS 或 WSL 发行版内,以 `/usr/local/bin/k3s` 和 systemd `k3s.service`/`k3s-agent.service` 运行;不得用 Docker、Compose、`rancher/k3s` 长驻容器、kind/k3d 或其他容器化方式承载 k3s 控制面或 kubelet。Docker 只允许用于 provider-gateway、业务容器镜像构建、运行用户 workload 或临时提取 k3s 二进制/镜像 artifact,不能成为 k3s runtime 边界。验收时必须证明 `systemctl is-active k3s` 或 agent 服务正常、`kubectl get nodes -o wide` 看到真实节点 OS/内核、k3s containerd socket 位于 `/run/k3s/containerd/containerd.sock`,且不存在 active `rancher/k3s` 控制面容器。 +- k3s 路由对象:`k3sctl-managed` 可以落到 k3s、k8s 或等价标准 Kubernetes 控制面,但必须使用 Kubernetes 原生命名空间、Deployment、Service、readiness/liveness probe、Kubernetes API service proxy 等规范对象;不得把裸容器端口、NodePort、SSH curl、provider-gateway `microservice.http` 或 host 直连地址伪装成 k3s 服务路由。WSL 节点的 hostPath 和 local-path 语义必须解析到 WSL host 文件系统;例如 D601 Code Queue Pod 的 `/workspace` 必须映射 WSL `/home/ubuntu`,不能映射到容器化 k3s 内部的 `/home/ubuntu`。 +- k3s 系统组件:D601 原生 k3s server 必须禁用非必要的 `traefik`、`servicelb` 和 `metrics-server`,只保留业务必需的 API server、CoreDNS 与 local-path provisioner;CoreDNS 和 local-path provisioner 固定运行在 D601 控制面节点,避免 D518 维护隧道限制导致系统 DNS/readiness 抖动。 - manifest:代管服务声明放在 `src/components/microservices/k3sctl-adapter/k3s/*.k3s.json`,adapter 启动时通过 `K3SCTL_MANIFEST_PATHS` 读取;manifest 是 D601/D518 实例、active instance、single writer、expected nodes 和 health policy 的权威来源。`K3SCTL_SERVICES_JSON` 不得承载 static HTTP 服务、不得覆盖同名服务、不得作为隐藏 fallback;如需追加服务也必须提供完整 `ManagedKubernetesService` manifest。 - API:`GET /health` 只表示 adapter 控制面自身可用,并把代管服务 serving 健康作为 `managedServicesHealthy` 字段展示;`GET /api/control-plane` 返回控制面、manifest、kubectl/k3s snapshot 和代管服务状态;`GET /api/services` 返回代管服务列表;`GET|HEAD /api/services//health` 返回该 k3s 服务的 active serving 健康;`/api/services//proxy/*` 是业务请求进入 active service 的唯一代理入口。 -- 代理路径:adapter 访问 active 业务服务的唯一正式路径是 Kubernetes API service proxy:`/api/v1/namespaces//services/:/proxy/...`。D601 与 D518 不要求能彼此直连;D518 通过 K3S agent 加入控制面,控制面连接可以借助节点维护隧道建立,但业务请求不得退化为 provider-gateway 直连 Code Queue HTTP 端口。standby/worker 节点如果受 kubelet/service-proxy 可达性限制,可以在 manifest 中显式使用 `healthMode=pod-ready` 作为拓扑健康探针;这只读取 Kubernetes Pod readiness,不是业务代理路径,也不能替代 active Service proxy。 +- 代理路径:adapter 访问 active 业务服务的唯一正式路径是 Kubernetes API service proxy:`/api/v1/namespaces//services/:/proxy/...`。D601 与 D518 不要求能彼此直连;D518 通过 k3s agent 加入控制面,控制面连接可以借助节点维护隧道建立,但业务请求不得退化为 provider-gateway 直连 Code Queue HTTP 端口。standby/worker 节点如果受 kubelet/service-proxy 可达性限制,可以在 manifest 中显式使用 `healthMode=pod-ready` 作为拓扑健康探针;这只读取 Kubernetes Pod readiness,不是业务代理路径,也不能替代 active Service proxy。 - 拓扑健康:`expectedNodeIds` 负责展示计划内节点;当前 Code Queue 目标拓扑必须同时包含 D601 和 D518,`presentNodeIds` 应为 `["D601","D518"]`、`missingNodeIds=[]`、`topologyComplete=true`、`status=healthy`。D518 未加入只允许作为迁移中的显式 degraded 状态,不能隐藏为 fallback;只有显式 `requireAllInstancesHealthy=true` 的服务才允许把缺失 standby/worker 节点提升为整体不健康。 -- 前端:`用户服务 / K3S Control` React 页面必须只通过 `/api/microservices/k3sctl-adapter/proxy/api/control-plane` 通信,展示控制面状态、manifest、D601/D518 实例、active instance、Kubernetes API service proxy/no-fallback 路径和显式原始 JSON 按钮;页面不得直接访问 provider-gateway、D601/D518 业务容器端口、NodePort 或 raw k3s/kubectl API。 +- 前端:`用户服务 / k3s Control` React 页面必须只通过 `/api/microservices/k3sctl-adapter/proxy/api/control-plane` 通信,展示控制面状态、manifest、D601/D518 实例、active instance、Kubernetes API service proxy/no-fallback 路径和显式原始 JSON 按钮;页面不得直接访问 provider-gateway、D601/D518 业务容器端口、NodePort 或 raw k3s/kubectl API。 -### Code Queue K3S-Managed +### Code Queue k3s-Managed 当前 Code Queue 作为 `id=code-queue` 的 `k3sctl-managed` 用户服务登记在 `config.json`,业务实例由 D601 k3s 控制面代管,并接入统一 `oa-event-flow` 发布 Trace/STEP 事实事件与读取统计中心: @@ -180,7 +181,7 @@ Baidu Netdisk 在 UniDesk 语境中按纯后端服务管理:不得暴露百度 - 代理路径:只允许 `/health`、`/logs` 和 `/api/` 前缀;允许方法为 `GET`、`HEAD`、`POST`、`DELETE`、`PATCH`。Code Queue 只在 Compose 内网暴露 `4222/tcp`,不得映射或开放到公网。 - UniDesk 前端:`用户服务 / Code Queue` React 页面负责展示队列卡片、任务 ID、复制任务 ID、引用按钮、任务耗时、默认模型、模型下拉、执行 Provider 下拉、执行模式下拉、Provider/模式对应默认工作目录、显式入队份数、引用任务 ID、清空输入、创建成功提示、MiniMax judge 状态、Codex CLI-like 输出流、attempt 终态、追加 prompt、打断和手动重试控件;选择 `windows-native` 时应优先切到支持 Windows 原生 Codex 的非主 server Provider,并把工作目录提示切到 `/mnt/` 默认路径;整个 agent loop 消息流统一命名为专有名词 `Trace`,`Trace` 包含 assistant message、user prompt、system event 和 tool call;Code Queue 与 Pipeline/OpenCode messages 必须共用 `src/components/frontend/src/trace.tsx` 的 Trace 公共组件、统一 Trace item 接口和 codex/opencode port 适配层;连续 read/edit/run 工具调用只是在 Trace 内折叠为可展开工具调用组,汇总格式至少包含 `xx read, xx edit, xx run`,并展示读取文件、编辑文件、运行命令和耗时摘要;最近 3 个工具调用保持展开,工具调用内容不得自动换行且必须在工具调用块内部横向滚动,工具调用组展开后不得再增加额外左侧缩进;message 与 prompt 必须自动换行,普通 message 不显示左侧项目符号缩进且永不折叠;点击队列卡片引用按钮必须自动把该任务 ID 写入提交表单的引用任务 ID 输入框;引用任务 ID 创建新任务时必须自动注入 `bun scripts/cli.ts codex task ` 的提示,让 Codex 读取初始 prompt、最后消息和工具摘要后继续;连续执行同一 prompt 应使用 `入队份数` 一次性生成多条队列任务,而不是依赖快速连点按钮;左侧 queue/session 卡片的 `QUEUED` 状态必须显示原因,例如 `QUEUED(PREV TASK)`、`QUEUED(MEM LIMIT)`、`QUEUED(ACTIVE LIMIT)`;原始任务 JSON 只能通过显式 `查看原始JSON` 打开。 -### MDTODO K3S-Managed +### MDTODO k3s-Managed 当前 MDTODO 作为 `id=mdtodo` 的 `k3sctl-managed` 用户服务登记在 `config.json`,用于把 D601 Windows 工作区 `F:\Work\vscode-mdtodo` 从 VS Code 扩展形态拆成 UniDesk 可代理的后端服务: diff --git a/docs/reference/repo-tree.md b/docs/reference/repo-tree.md index 7e394e2c..13a877a5 100644 --- a/docs/reference/repo-tree.md +++ b/docs/reference/repo-tree.md @@ -73,7 +73,7 @@ - src/met-nonlinear.tsx (MET Nonlinear D601 training orchestration React page; do not fold back into `app.tsx`) - src/code-queue.tsx (Code Queue user-service React page; do not fold back into `app.tsx`) - src/oa-event-flow.tsx (Unified OA event flow and Trace/STEP stats React page; do not fold back into `app.tsx`) - - src/k3sctl.tsx (K3S Control Plane React page backed only by `k3sctl-adapter`; do not fold back into `app.tsx`) + - src/k3sctl.tsx (k3s Control Plane React page backed only by `k3sctl-adapter`; do not fold back into `app.tsx`) - public/ (HTML/CSS static assets for the compact industrial console; no handwritten app JS) - provider-gateway/ (Compute node Provider Gateway container) - package.json diff --git a/scripts/src/deploy.ts b/scripts/src/deploy.ts index 00d95bad..2f87c7f6 100644 --- a/scripts/src/deploy.ts +++ b/scripts/src/deploy.ts @@ -83,6 +83,8 @@ const k8sKubeconfig = "/etc/rancher/k3s/k3s.yaml"; const k3sDeployDir = "/home/ubuntu/cq-deploy"; const providerGatewayWsEgressProxyUrl = "http://127.0.0.1:18789"; const nativeK3sInstallVersion = "v1.34.1+k3s1"; +const nativeK3sImage = "rancher/k3s:v1.34.1-k3s1"; +const nativeK3sCtrAddress = "/run/k3s/containerd/containerd.sock"; function nowIso(): string { return new Date().toISOString(); @@ -532,6 +534,7 @@ function rootAccessPrelude(): string[] { function ensureNativeK3sScript(): string { const installCommand = [ + "INSTALL_K3S_SKIP_DOWNLOAD=true", `INSTALL_K3S_VERSION=${nativeK3sInstallVersion}`, "INSTALL_K3S_EXEC=\"server --disable traefik --disable servicelb --disable metrics-server --node-name D601 --node-label unidesk.ai/node-id=D601 --node-label unidesk.ai/provider-id=D601 --tls-san 127.0.0.1 --tls-san host.docker.internal --write-kubeconfig-mode 644\"", "sh /tmp/unidesk-install-k3s.sh", @@ -539,7 +542,60 @@ function ensureNativeK3sScript(): string { return [ "set -euo pipefail", ...rootAccessPrelude(), - "if ! command -v k3s >/dev/null 2>&1; then", + `native_k3s_image=${shellQuote(nativeK3sImage)}`, + `native_ctr_address=${shellQuote(nativeK3sCtrAddress)}`, + "install_native_k3s_binaries() {", + " missing=0", + " for binary in k3s ctr containerd containerd-shim-runc-v2 crictl runc cni flannel bridge host-local loopback portmap bandwidth firewall; do", + " command -v \"$binary\" >/dev/null 2>&1 || missing=1", + " done", + " if [ \"$missing\" = \"0\" ]; then return; fi", + " docker image inspect \"$native_k3s_image\" >/dev/null 2>&1 || docker pull \"$native_k3s_image\"", + " tmp_dir=$(mktemp -d)", + " tmp_container=$(docker create \"$native_k3s_image\" sh -lc true)", + " docker cp \"$tmp_container:/bin/.\" \"$tmp_dir/bin\"", + " docker rm \"$tmp_container\" >/dev/null", + " for binary in k3s ctr containerd containerd-shim-runc-v2 crictl runc cni flannel bridge host-local loopback portmap bandwidth firewall; do", + " if [ -e \"$tmp_dir/bin/$binary\" ]; then", + " root_exec install -m 755 \"$tmp_dir/bin/$binary\" \"/usr/local/bin/$binary\"", + " echo native_k3s_binary_installed=$binary", + " fi", + " done", + " rm -rf \"$tmp_dir\"", + "}", + "unmount_invalid_wsl_kubelet_mounts() {", + " if awk 'NF != 6 && $0 ~ /\\/Docker\\/host/ {found=1} END {exit found ? 0 : 1}' /proc/mounts; then", + " root_exec umount /Docker/host >/dev/null 2>&1 || root_exec umount -l /Docker/host >/dev/null 2>&1 || true", + " echo native_k3s_unmounted_invalid_mount=/Docker/host", + " fi", + " awk 'NF != 6 {print \"native_k3s_invalid_mount_line=\" NR \":\" $0; bad=1} END {exit bad}' /proc/mounts", + "}", + "install_system_images_from_legacy_k3s() {", + " legacy_container=$(docker ps --format '{{.Names}} {{.Image}}' | awk '$2 ~ /^rancher\\/k3s:/ {print $1; exit}')", + " if [ -n \"$legacy_container\" ]; then", + " if docker exec \"$legacy_container\" ctr -n k8s.io images export /tmp/unidesk-k3s-system-images.tar docker.io/rancher/local-path-provisioner:v0.0.32 docker.io/rancher/mirrored-coredns-coredns:1.12.3 >/dev/null 2>&1; then", + " docker cp \"$legacy_container:/tmp/unidesk-k3s-system-images.tar\" /tmp/unidesk-k3s-system-images.tar >/dev/null", + " root_exec ctr --address \"$native_ctr_address\" -n k8s.io images import /tmp/unidesk-k3s-system-images.tar >/dev/null", + " echo native_k3s_imported_legacy_system_images=$legacy_container", + " fi", + " fi", + "}", + "install_pause_image() {", + " if root_exec ctr --address \"$native_ctr_address\" -n k8s.io images ls | grep -q 'docker.io/rancher/mirrored-pause:3.6'; then return; fi", + " docker image inspect rancher/mirrored-pause:3.6 >/dev/null 2>&1 || {", + " docker image inspect \"$native_k3s_image\" >/dev/null 2>&1 || docker pull \"$native_k3s_image\"", + " tmp_dir=$(mktemp -d)", + " printf '%s\\n' \"FROM $native_k3s_image\" 'ENTRYPOINT [\"/bin/sleep\", \"365d\"]' > \"$tmp_dir/Dockerfile\"", + " docker build -q -t rancher/mirrored-pause:3.6 \"$tmp_dir\" >/dev/null", + " rm -rf \"$tmp_dir\"", + " }", + " docker save rancher/mirrored-pause:3.6 -o /tmp/unidesk-k3s-pause.tar", + " root_exec ctr --address \"$native_ctr_address\" -n k8s.io images import /tmp/unidesk-k3s-pause.tar >/dev/null", + " echo native_k3s_imported_pause_image=rancher/mirrored-pause:3.6", + "}", + "install_native_k3s_binaries", + "unmount_invalid_wsl_kubelet_mounts", + "if ! systemctl cat k3s.service >/dev/null 2>&1; then", " curl -fsSL --max-time 60 https://get.k3s.io -o /tmp/unidesk-install-k3s.sh", ` root_shell ${shellQuote(installCommand)}`, "fi", @@ -551,6 +607,11 @@ function ensureNativeK3sScript(): string { " sleep 2", "done", `KUBECONFIG=${shellQuote(k8sKubeconfig)} kubectl get nodes -l unidesk.ai/node-id=D601 --no-headers | grep -q .`, + `KUBECONFIG=${shellQuote(k8sKubeconfig)} kubectl wait --for=condition=Ready node -l unidesk.ai/node-id=D601 --timeout=180s`, + "install_system_images_from_legacy_k3s", + "install_pause_image", + `KUBECONFIG=${shellQuote(k8sKubeconfig)} kubectl -n kube-system delete pod -l k8s-app=kube-dns --ignore-not-found >/dev/null 2>&1 || true`, + `KUBECONFIG=${shellQuote(k8sKubeconfig)} kubectl -n kube-system delete pod -l app=local-path-provisioner --ignore-not-found >/dev/null 2>&1 || true`, "legacy_k3s_containers=$(docker ps --format '{{.Names}} {{.Image}}' | awk '$2 ~ /^rancher\\/k3s:/ {print $1}')", "for container in $legacy_k3s_containers; do", " docker update --restart=no \"$container\" >/dev/null 2>&1 || true", @@ -569,8 +630,8 @@ function importK3sImageScript(service: UniDeskMicroserviceConfig): string { ...rootAccessPrelude(), `image=${shellQuote(image)}`, "docker image inspect \"$image\" >/dev/null", - "docker save \"$image\" | root_exec k3s ctr -n k8s.io images import -", - "root_exec k3s ctr -n k8s.io images ls | grep -F \"$image\" || true", + `docker save "$image" | root_exec ctr --address ${shellQuote(nativeK3sCtrAddress)} -n k8s.io images import -`, + `root_exec ctr --address ${shellQuote(nativeK3sCtrAddress)} -n k8s.io images ls | grep -F "$image" || true`, ].join("\n"); } @@ -1013,7 +1074,7 @@ async function applyOneService(config: UniDeskConfig, service: UniDeskMicroservi if (!pushStep(steps, build)) return { ok: false, serviceId: service.id, startedAt, finishedAt: nowIso(), resolvedCommit, before, steps }; if (service.deployment.mode === "k3sctl-managed") { - const nativeK3s = await step(config, service, "ensure-native-k3s", ensureNativeK3sScript(), "/home/ubuntu", 180_000, true); + const nativeK3s = await step(config, service, "ensure-native-k3s", ensureNativeK3sScript(), "/home/ubuntu", 600_000, true); if (!pushStep(steps, nativeK3s)) return { ok: false, serviceId: service.id, startedAt, finishedAt: nowIso(), resolvedCommit, before, steps }; const imageImport = await step(config, service, "import-k3s-image", importK3sImageScript(service), targetWorkDir(service), 180_000, true); if (!pushStep(steps, imageImport)) return { ok: false, serviceId: service.id, startedAt, finishedAt: nowIso(), resolvedCommit, before, steps };