feat: initialize unidesk platform
@@ -0,0 +1,73 @@
- Requirements
  - Build a distributed work platform covering research, project development, and project management
  - Deploy the main entry point on a server with a public IP, providing a unified interface
  - Multiple computing resource machines join the platform to execute computing tasks
  - The platform must support task scheduling, state monitoring, versioned code distribution, and large file storage
  - Design goals are high availability, high concurrency, centralized state management, and stateless compute nodes
- Key Assumptions
  - The main server has a public IP and can be accessed from the internet
  - Computing resource machines have no public IP and may sit behind NAT or firewalls
  - Computing resource machines have stable outbound network connectivity (to the intranet or the internet)
  - Computing resource machines can run Docker and support WSL (some nodes are Windows workstations)
  - Users interact with the platform only through the main server entry point, never directly with compute nodes
  - The main server is more available than the computing resource machines; compute nodes may go offline frequently due to hardware, network, or human factors
  - Tasks prone to single points of failure are deployed on the main server first, leveraging its high-availability environment to protect the critical path
- UniDesk Distributed Work Platform Architecture
  - Overview
    - The main server hosts all stateless business logic as the unified entry point
    - Computing resource nodes actively connect via lightweight Provider Gateway containers
    - All state is stored centrally in PostgreSQL, never scattered across nodes
    - Code and environments are distributed via GitHub versions; the large file storage solution is to be determined
    - The main server also connects itself to the platform as a compute node, using the exact same method as ordinary compute nodes
    - This design allows the full distributed dispatching flow to be verified on a single main server
  - Main Server Components
    - UniDesk Stateless Services
      - Run all business microservices as Docker containers
      - Includes API gateway, task scheduler, project management, and other stateless modules
      - Instances can scale horizontally; failure recovery requires no state synchronization
    - PostgreSQL Database
      - Deployed as a Docker container with a 10 GB named volume
      - Stores all task metadata, node heartbeats, resource labels, and business state
      - Backed up periodically via `pg_dump`, keeping the last 7 daily snapshots
      - The named volume ensures data survives container recreation or upgrades
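
The retention rule above (keep the last 7 daily `pg_dump` snapshots) could be sketched as a small selection function. The file naming scheme `unidesk-YYYYMMDD.sql` and the function name are assumptions made for illustration; the notes do not say how snapshots are named or pruned.

```typescript
// Hypothetical snapshot naming: "unidesk-YYYYMMDD.sql" (not specified by the notes).
const SNAPSHOT_PATTERN = /^unidesk-(\d{8})\.sql$/;

// Returns the snapshot file names that fall outside the retention window.
// Names that do not match the pattern are ignored and never selected.
function snapshotsToDelete(fileNames: string[], keep: number = 7): string[] {
  const snapshots = fileNames
    .filter((name) => SNAPSHOT_PATTERN.test(name))
    // YYYYMMDD sorts chronologically as a plain string; newest first.
    .sort((a, b) => (a < b ? 1 : a > b ? -1 : 0));
  return snapshots.slice(keep);
}
```

Keeping the selection separate from the actual file deletion makes the rule easy to check without touching the backup directory.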
  - Code and Environment Distribution
    - Code repositories and execution environment definitions may reside in multiple GitHub repositories
    - When dispatching a task, five metadata items must be specified: `code_repo_url`, `code_commit_id`, `env_repo_url`, `env_commit_id`, and `dockerfile_path`
    - A single env repo can contain multiple Dockerfiles defining different execution environments, distinguished by `dockerfile_path`
    - Compute nodes maintain a local Git cache and incrementally fetch only the specified version each time
    - Docker layer caching accelerates environment builds, so builds after the first reuse cached layers and finish much faster
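
The five dispatch metadata items could be typed and checked as follows. This is a minimal sketch: the field names come from the notes above, while the validation rules (non-empty strings, full 40-character commit hashes) and the function name are assumptions.

```typescript
// Field names come from the architecture notes; the validation rules are assumptions.
interface DispatchSource {
  code_repo_url: string;
  code_commit_id: string;
  env_repo_url: string;
  env_commit_id: string;
  dockerfile_path: string;
}

const REQUIRED_FIELDS: (keyof DispatchSource)[] = [
  "code_repo_url",
  "code_commit_id",
  "env_repo_url",
  "env_commit_id",
  "dockerfile_path",
];

// Assumption: commit ids are full 40-character SHA-1 hashes, pinning an exact version.
const FULL_COMMIT_ID = /^[0-9a-f]{40}$/;

// Returns a list of problems; an empty list means the metadata is acceptable.
function validateDispatchSource(input: Partial<DispatchSource>): string[] {
  const problems: string[] = [];
  for (const field of REQUIRED_FIELDS) {
    const value = input[field];
    if (typeof value !== "string" || value.trim() === "") {
      problems.push(`${field} is required`);
    }
  }
  for (const field of ["code_commit_id", "env_commit_id"] as const) {
    const value = input[field];
    if (typeof value === "string" && value.trim() !== "" && !FULL_COMMIT_ID.test(value)) {
      problems.push(`${field} must be a 40-character hex commit id`);
    }
  }
  return problems;
}
```

Rejecting branch names such as `main` keeps a dispatched task reproducible, because a commit id always resolves to the same tree.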
  - Compute Node Connection Scheme
    - Provider Gateway Docker
      - Each computing resource machine runs a Provider Gateway container
      - Acts as the node-side gateway, bridging the main server and the local execution environment
      - The container houses the agent logic, implementing a WebSocket client and local scheduling
    - WebSocket Persistent Connection
      - Provider Gateway actively initiates a WebSocket connection to the main server
      - Commands, heartbeats, and task statuses are exchanged bidirectionally over this persistent connection
      - The main server never initiates connections to nodes, which suits nodes that have no public IP or sit behind NAT
    - Interaction with Local Execution Environment
      - The primary path for automated task dispatching and execution is via the local Docker socket
      - Access to the local environment via WSL SSH is reserved solely as an auxiliary path for emergency maintenance and troubleshooting
      - Automating task deployment or dispatching through the WSL SSH channel is forbidden
    - Connection Management
      - When registering, a node presents an authentication token to verify its identity and declares resources such as GPU/CPU
      - The authentication token is pre-issued by the main server and configured at Provider Gateway startup
      - Heartbeats are sent every 15 seconds; if no heartbeat arrives for 90 seconds, the node is marked offline
      - Nodes reconnect automatically after a disconnect, using exponential backoff to avoid a thundering herd on the main server
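
The heartbeat and reconnection rules above could look like the sketch below. The 15-second interval and 90-second threshold come from the notes; the backoff base, cap, and jitter strategy are assumptions, since the notes only call for exponential backoff.

```typescript
// Values taken from the notes: 15 s heartbeat interval, 90 s offline threshold.
const HEARTBEAT_INTERVAL_MS = 15_000;
const OFFLINE_AFTER_MS = 6 * HEARTBEAT_INTERVAL_MS; // 90 s, i.e. six missed heartbeats

// Assumed backoff parameters; the notes only say "exponential backoff".
const BACKOFF_BASE_MS = 1_000;
const BACKOFF_CAP_MS = 60_000;

// Server side: a node is offline once no heartbeat has arrived for 90 seconds.
function isNodeOffline(lastHeartbeatAtMs: number, nowMs: number): boolean {
  return nowMs - lastHeartbeatAtMs >= OFFLINE_AFTER_MS;
}

// Node side: delay before reconnect attempt number `attempt` (0-based).
// The ceiling doubles per attempt up to the cap, and random jitter spreads
// reconnects so that many nodes do not hit the main server at the same instant.
function reconnectDelayMs(attempt: number, random: () => number = Math.random): number {
  const ceiling = Math.min(BACKOFF_CAP_MS, BACKOFF_BASE_MS * 2 ** attempt);
  const half = ceiling / 2;
  return Math.floor(half + random() * half);
}
```

Passing `random` as a parameter keeps the delay calculation deterministic under test while defaulting to `Math.random` in normal use.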
  - Data Flow and State Management
    - Task commands are delivered over WebSocket and never contain large file content
    - All state changes are reported to the main server in real time by Provider Gateway
    - The main server writes state updates to PostgreSQL, which closes the loop on centralized state
  - Critical Task Deployment Principles
    - Single-point components such as the database, core scheduler logic, and API gateway are deployed on the main server
    - The high-availability environment of the main server keeps the critical scheduling path available
    - Compute nodes are only responsible for task execution; a node going offline does not affect overall platform availability
  - Large File Storage Solution
    - The concrete implementation is to be determined, and must meet the following requirements
      - Support automated pull and upload by compute nodes without human intervention
      - Provide a programmable interface for the scheduler to generate temporary access credentials
      - Have sufficient bandwidth so that concurrent reads/writes never become the bottleneck for training tasks
  - Deployment Notes
    - Use `docker-compose` on the main server to orchestrate all services uniformly
    - PostgreSQL uses a named volume to guarantee data persistence
    - The Provider Gateway image is built uniformly and distributed to all compute nodes in a versioned manner
@@ -0,0 +1,28 @@
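# UniDesk CLI Reference

UniDesk's unified CLI entry point is `scripts/cli.ts` in the repository root, and it is always run as `bun scripts/cli.ts <command>`. The CLI outputs JSON by default. Every success and failure path must write a structured object to stdout, so that no command leaves its state unobservable by printing nothing.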
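
## Command Model

- `help` prints the command index and is suitable as an interactive entry point.
- `config show` reads and validates the root `config.json`; it never silently fills in configuration from environment variables, defaults, or hidden files.
- `check` runs config validation, file-existence checks, the TypeScript check for `scripts/`, the TypeScript check for `src/components/`, and the Docker Compose configuration check.
- `server start` creates an async job that runs the Docker build and startup in the background; the command itself only returns the job id, log paths, and the startup command.
- `server stop` creates an async job that stops all UniDesk services in the fixed Compose project in the background.
- `server status` queries the fixed ports, Compose containers, core/frontend health checks, and access URLs.
- `server logs` returns the tail of the file logs under `logs/` and of the Docker container logs; output size is limited by default to prevent log floods.
- `job list` and `job status` query the filesystem state under `.state/jobs/` and are the observability entry point for async commands.
- `debug health` and `debug dispatch` exercise the real HTTP, WebSocket, database, and provider flows; they are for development debugging only and are not part of the formal acceptance steps in `TEST.md`.
- `e2e run` uses the public URLs derived from publicHost to verify the core API, PostgreSQL, the provider self-connection, and the Playwright frontend page; it is the automated E2E gate before delivery.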

## Async Job State

Long-running operations use a fire-and-forget pattern: the CLI creates `.state/jobs/{jobId}.json`, and a background process runs the real command, writing stdout and stderr to `.state/jobs/{jobId}.stdout.log` and `.state/jobs/{jobId}.stderr.log` respectively. Callers query progress and tail output with `bun scripts/cli.ts job status <jobId>`.
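
The job file layout described above can be sketched as a pair of helpers. The three paths follow the text; the `JobRecord` fields, the `running` and `failed` state names, and the job id character rule are assumptions (only `succeeded` appears elsewhere in these documents).

```typescript
// Paths follow the layout described above; JobRecord fields are assumptions.
type JobState = "running" | "succeeded" | "failed";

interface JobRecord {
  id: string;
  command: string;
  state: JobState;
  startedAt: string; // ISO 8601 timestamp
}

interface JobPaths {
  record: string;
  stdout: string;
  stderr: string;
}

const JOBS_DIR = ".state/jobs";

function jobPaths(jobId: string): JobPaths {
  // Reject ids that could escape the jobs directory.
  if (!/^[A-Za-z0-9_-]+$/.test(jobId)) {
    throw new Error(`invalid job id: ${jobId}`);
  }
  return {
    record: `${JOBS_DIR}/${jobId}.json`,
    stdout: `${JOBS_DIR}/${jobId}.stdout.log`,
    stderr: `${JOBS_DIR}/${jobId}.stderr.log`,
  };
}

// The record the CLI would write before handing off to the background process.
function newJobRecord(id: string, command: string, now: Date = new Date()): JobRecord {
  return { id, command, state: "running", startedAt: now.toISOString() };
}
```

Because the record is written before the background process starts, `job status` always has something to read, even if the process dies immediately.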
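
## Output Contract

The outermost JSON of every command contains `ok`, `command`, and either `data` or `error`. On failure the CLI sets a non-zero exit code but still outputs a JSON error object; the error object should contain `name`, `message`, and `stack` when available.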
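
One way to express this contract in TypeScript is a discriminated union on `ok`. The type and helper names here are hypothetical, and the exit code value `1` is an assumption; the contract only requires a non-zero code on failure.

```typescript
interface CliError {
  name: string;
  message: string;
  stack?: string;
}

// `ok` discriminates between the success and failure shapes.
type CliOutput<T> =
  | { ok: true; command: string; data: T }
  | { ok: false; command: string; error: CliError };

function success<T>(command: string, data: T): CliOutput<T> {
  return { ok: true, command, data };
}

// Converts any thrown value into the structured error object.
function failure(command: string, cause: unknown): CliOutput<never> {
  const err = cause instanceof Error ? cause : new Error(String(cause));
  const error: CliError = { name: err.name, message: err.message };
  if (err.stack !== undefined) {
    error.stack = err.stack;
  }
  return { ok: false, command, error };
}

// Non-zero on failure, as the contract requires (the value 1 is an assumption).
function exitCodeFor(output: CliOutput<unknown>): number {
  return output.ok ? 0 : 1;
}
```

A caller would serialize the result with `JSON.stringify` and write it to stdout on both paths, so that a failed command is still machine-readable.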
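
## Debug Contract

`debug` subcommands must reuse the real modules and real endpoints; maintaining a parallel implementation is forbidden. `debug dispatch` calls core's `/api/dispatch`, and core then delivers the task to the provider gateway over WebSocket, so the command verifies the core scheduling loop end to end.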
@@ -0,0 +1,19 @@
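# UniDesk Configuration Reference

The root `config.json` is the only configuration source for the UniDesk CLI. On startup the CLI must fully validate the configuration structure; if the file cannot be read or a field is invalid, it returns a JSON error immediately. Silent fallback is not allowed.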

## Runtime

The TypeScript runtime is fixed to Bun. The root CLI, backend-core, frontend, and provider-gateway all run their `.ts` entry points directly; Docker images use the `oven/bun` base image, and local commands use `bun scripts/cli.ts`.
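
## Fixed Ports

`config.json` fixes three externally exposed ports: backend-core, frontend, and database. `network.publicHost` must be the main server address reachable by browsers and external clients; public E2E runs must not leave it as `127.0.0.1`. `server start` checks these ports before starting, so that a port conflict cannot leave several service instances of mixed versions running.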
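
A strict check for this part of the configuration might look like the sketch below. Only `network.publicHost` is named in the text; the `ports` field names and the function name are assumptions for illustration.

```typescript
// Hypothetical shape: only `network.publicHost` is named in the text.
interface NetworkConfig {
  publicHost: string;
  ports: { backendCore: number; frontend: number; database: number };
}

const LOOPBACK_HOSTS = new Set(["127.0.0.1", "localhost", "::1"]);

// Throws on the first problem instead of substituting a default value.
function validateNetwork(network: NetworkConfig, requirePublicHost: boolean): void {
  if (network.publicHost.trim() === "") {
    throw new Error("network.publicHost must not be empty");
  }
  if (requirePublicHost && LOOPBACK_HOSTS.has(network.publicHost)) {
    throw new Error("network.publicHost must not be a loopback address for public E2E");
  }
  const entries = Object.entries(network.ports);
  for (const [name, port] of entries) {
    if (!Number.isInteger(port) || port < 1 || port > 65535) {
      throw new Error(`network.ports.${name} must be an integer between 1 and 65535`);
    }
  }
  const distinct = new Set(entries.map(([, port]) => port));
  if (distinct.size !== entries.length) {
    throw new Error("network.ports must not contain duplicate port numbers");
  }
}
```

Throwing rather than repairing the value matches the no-silent-fallback rule: the caller turns the exception into a JSON error and stops.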

## Compose Env Generation

Docker Compose does not read JSON itself, so the CLI generates `.state/docker-compose.env` from `config.json`. This file is derived state and should not be edited by hand; to change ports, tokens, provider labels, or host names, edit `config.json` and rerun the CLI.
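
The derivation step could be sketched as a pure function from config values to env file text. The input shape, the `UNIDESK_*` variable names, and the `http://` scheme are all assumptions; `CORE_PUBLIC_URL` is the only variable name that appears in these documents, and its exact derivation is not specified.

```typescript
// Hypothetical input shape and variable names, used only for illustration.
interface ComposeEnvInput {
  publicHost: string;
  backendCorePort: number;
  frontendPort: number;
  databasePort: number;
  providerId: string;
}

// Renders KEY=value lines in a stable order so the generated file is reproducible.
function renderComposeEnv(input: ComposeEnvInput): string {
  const vars: Record<string, string> = {
    UNIDESK_PUBLIC_HOST: input.publicHost,
    UNIDESK_CORE_PORT: String(input.backendCorePort),
    UNIDESK_FRONTEND_PORT: String(input.frontendPort),
    UNIDESK_DATABASE_PORT: String(input.databasePort),
    UNIDESK_PROVIDER_ID: input.providerId,
    CORE_PUBLIC_URL: `http://${input.publicHost}:${input.backendCorePort}`,
  };
  const lines = Object.keys(vars)
    .sort()
    .map((key) => {
      const value = vars[key];
      // Values with whitespace would need quoting; reject rather than guess.
      if (value === undefined || /\s/.test(value)) {
        throw new Error(`unsupported value for ${key}`);
      }
      return `${key}=${value}`;
    });
  return ["# Generated from config.json. Do not edit by hand.", ...lines].join("\n") + "\n";
}
```

Sorting the keys means two runs with the same `config.json` produce byte-identical files, which makes accidental hand edits easy to spot.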
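
## Secrets

The current configuration targets a development deployment on the main server and contains a development database password and provider token. Before exposing the stack to the public internet, change these values in `config.json` and restart the stack to refresh the derived env file.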
@@ -0,0 +1,22 @@
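# UniDesk Deployment Reference

The main server uses the root `docker-compose.yml` to orchestrate database, backend-core, frontend, and provider-gateway together. The current environment is itself the main server, so provider-gateway also starts on the same machine and connects to core over WebSocket in the same way as an ordinary compute node.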
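
## Services

- `database` uses `postgres:16-alpine`, stores its data in the named volume `unidesk_pgdata_10gb`, and loads its initialization SQL from `src/components/database/init/`.
- `backend-core` is the stateless core service, providing the REST API, the provider WebSocket endpoint, the task scheduling entry point, and the database access layer.
- `frontend` is a standalone web container; the browser reaches core through its public API URL.
- `provider-gateway` is the local compute node agent for the current main server; it actively connects to backend-core over WebSocket and mounts `/var/run/docker.sock` as the primary path for automated task execution.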
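
## Start And Stop

Both `bun scripts/cli.ts server start` and `bun scripts/cli.ts server stop` are async jobs. The start job first cleans up old containers in the fixed Compose project, then rebuilds and starts the stack, so that no old containers or stale image configuration remain on the main server. After starting, use `job status latest` to watch the background command and `server status` to verify ports, containers, and health checks.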

## Health Criteria

The minimum bar for a working stack is: backend-core `/health` returns ok, frontend `/health` returns ok, the database port is listening, `/api/nodes` lists the `main-server` provider with status `online`, and `debug dispatch main-server docker.ps` completes a real task dispatch. Before delivery you must also run `bun scripts/cli.ts e2e run` and treat the gate in `docs/reference/e2e.md` as the final verdict.
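
The five criteria can be folded into a single check that reports which ones failed. This is a sketch under assumptions: the observation field names and the shape of the node list are hypothetical, and gathering the observations (HTTP calls, port probes) is left out.

```typescript
// Observations a status command might collect; the field names are assumptions.
interface HealthObservations {
  coreHealthOk: boolean;
  frontendHealthOk: boolean;
  databasePortListening: boolean;
  nodes: { id: string; status: string }[];
  debugDispatchSucceeded: boolean;
}

// Returns the criteria that failed; an empty array means the minimum bar is met.
function failedHealthCriteria(obs: HealthObservations): string[] {
  const mainServerOnline = obs.nodes.some(
    (node) => node.id === "main-server" && node.status === "online",
  );
  const criteria: [string, boolean][] = [
    ["backend-core /health returns ok", obs.coreHealthOk],
    ["frontend /health returns ok", obs.frontendHealthOk],
    ["database port is listening", obs.databasePortListening],
    ["/api/nodes lists main-server as online", mainServerOnline],
    ["debug dispatch main-server docker.ps succeeds", obs.debugDispatchSucceeded],
  ];
  return criteria.filter(([, passed]) => !passed).map(([label]) => label);
}
```

Returning the failed labels instead of a single boolean keeps the result observable: the caller can print exactly which criterion blocked the stack.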
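
## Database Volume

The architecture requires the database to use a 10 GB named volume; the current implementation names the volume `unidesk_pgdata_10gb` to pin its lifecycle. Docker named volumes do not enforce a capacity limit by default; if a hard quota is needed, configure it at the host storage layer or in the Docker volume driver. CLI server control may only use `down` / `up` flows that do not delete volumes; `down -v` and deleting `unidesk_pgdata_10gb` are forbidden.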
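
A guard in front of every Docker invocation is one way to enforce this rule. The sketch below assumes the CLI passes Docker arguments as a string array (without the leading `docker`); the function names are hypothetical and the check is deliberately conservative rather than a full argument parser.

```typescript
const VOLUME_FLAGS = new Set(["-v", "--volumes"]);

// Returns the reason a command is rejected, or null when it is allowed.
function volumeRemovalReason(dockerArgs: string[]): string | null {
  if (dockerArgs.includes("down") && dockerArgs.some((arg) => VOLUME_FLAGS.has(arg))) {
    return "compose down must not be combined with -v/--volumes";
  }
  if (dockerArgs[0] === "volume" && (dockerArgs[1] === "rm" || dockerArgs[1] === "prune")) {
    return "docker volume rm/prune is not allowed";
  }
  return null;
}

// Call this before spawning docker; it throws instead of running a destructive command.
function assertVolumeSafe(dockerArgs: string[]): void {
  const reason = volumeRemovalReason(dockerArgs);
  if (reason !== null) {
    throw new Error(`refusing to run docker ${dockerArgs.join(" ")}: ${reason}`);
  }
}
```

For example, `assertVolumeSafe(["compose", "down"])` passes, while `assertVolumeSafe(["compose", "down", "-v"])` throws before anything reaches Docker.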
@@ -0,0 +1,36 @@
# UniDesk E2E Reference

UniDesk delivery is not complete until the public frontend, public core API, PostgreSQL database, and local provider-gateway self-connection pass one end-to-end check. The canonical automated command is `bun scripts/cli.ts e2e run`.

## Required Preconditions

- `network.publicHost` in `config.json` must be the externally reachable host name or IP of the main server, not `127.0.0.1`, when validating browser access from outside the server.
- `bunx playwright install chromium` and `bunx playwright install-deps chromium` must have been run on hosts that execute browser E2E tests.
- The Docker stack must be running through `bun scripts/cli.ts server start`, and `bun scripts/cli.ts server status` must report healthy core, frontend, database, and provider-gateway containers.

## Automated E2E Scope

`bun scripts/cli.ts e2e run` validates the following through the public URLs derived from `config.json`:

- Core API: `GET /api/overview` reports `dbReady: true` and at least one online node.
- Provider self-connection: `GET /api/nodes` contains `main-server` with `status: online`.
- Database: the command writes an `unidesk_e2e_markers` row through `docker exec unidesk-database psql`, confirms provider state is stored in PostgreSQL, and probes the public PostgreSQL port with `pg_isready`.
- Frontend: Playwright opens the public frontend URL, waits for the text `核心在线` ("core online"), asserts that `main-server` and `Main Server Provider` are visible, checks the metrics panel, and captures a screenshot under `.state/e2e/`.

## Public Frontend Rule

The frontend must not inject `127.0.0.1` as the browser-facing core API URL for public deployments. If a loopback URL is accidentally injected and the page itself is opened from a non-loopback host, `public/app.js` rewrites the API host to `window.location.hostname` as a safety net. However, the correct fix is still to set `network.publicHost` correctly in `config.json` and restart the stack.
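
The safety-net rewrite can be illustrated with the standard `URL` class. This is a sketch of the rule as described, not the code in `public/app.js`; the function name and the set of hosts treated as loopback are assumptions.

```typescript
// Hostnames treated as loopback; IPv6 hostnames keep their brackets in the URL API.
const LOOPBACK_HOSTNAMES = new Set(["127.0.0.1", "localhost", "[::1]"]);

// Returns the core API URL the browser should use.
// `pageHostname` stands in for window.location.hostname.
function resolveCoreApiUrl(injectedUrl: string, pageHostname: string): string {
  const api = new URL(injectedUrl);
  const apiIsLoopback = LOOPBACK_HOSTNAMES.has(api.hostname);
  const pageIsLoopback = LOOPBACK_HOSTNAMES.has(pageHostname);
  // Only rewrite when a loopback API URL leaked into a page served publicly.
  if (apiIsLoopback && !pageIsLoopback) {
    api.hostname = pageHostname;
  }
  return api.toString();
}
```

The port and path of the injected URL are preserved; only the host changes, which is why a wrong `network.publicHost` still needs to be fixed at the source.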

## Database Persistence Rule

The PostgreSQL data volume is the named Docker volume `unidesk_pgdata_10gb`. CLI server control commands must never use `docker compose down -v`, `docker volume rm`, or any equivalent data-volume removal. To validate persistence, insert a marker row into `unidesk_e2e_markers`, run `bun scripts/cli.ts server start` or a full stop/start cycle, and verify the marker row still exists.

## Delivery Gate

Before claiming delivery, run these checks and keep their JSON output or screenshot path available for review:

1. `bun scripts/cli.ts check`
2. `bun scripts/cli.ts server start`, then `bun scripts/cli.ts job status latest` until `succeeded`
3. `bun scripts/cli.ts server status`
4. `bun scripts/cli.ts e2e run`
5. a database persistence marker check across at least one CLI-controlled restart
@@ -0,0 +1,15 @@
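# UniDesk Frontend Reference

The UniDesk frontend is an industrial-style control console, not a showcase dashboard. Its design goals are high information density, little decoration, small type, and tight spacing, with the scheduling, node, event, and configuration entry points quickly switchable within a single screen.

## Layout

The left sidebar switches between main modules: Runtime Overview (运行总览), Resource Nodes (资源节点), Task Scheduling (任务调度), and System Configuration (系统配置). The top tabs switch between submodules: Overview, Live Nodes, Event Log, and Dispatch. The desktop layout uses a two-column content grid; on mobile the left sidebar collapses into a horizontal module bar.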
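
## Visual Language

The interface uses deep steel blue, charcoal black, amber, and cool cyan as its industrial console palette; fonts combine condensed and monospace faces to reduce wasted horizontal space. Font sizes, table row heights, and panel spacing stay restrained, avoiding large headings and loose cards that lower information density.

## Data Flow

The frontend container only serves static assets and performs lightweight HTML injection; the browser calls the backend-core REST API based on `CORE_PUBLIC_URL`. The dispatch form calls `/api/dispatch`, and the event and node tables refresh by polling.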
@@ -0,0 +1,15 @@
# UniDesk Observability Reference

In UniDesk, observability takes priority over silent success. CLI output, service logs, Docker logs, and database state must all be queryable with short commands.

## CLI Logs

The stdout and stderr of async jobs live under `.state/jobs/`. `job status` returns a bounded tail to prevent output floods, while keeping the full log file paths available for further investigation.
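
A bounded tail can be as simple as the helper below. It is a sketch that limits by line count; the actual limit used by `job status`, and whether it counts lines or bytes, is not stated in the text.

```typescript
interface Tail {
  lines: string[];
  truncated: boolean; // true when earlier lines were dropped
}

// Returns at most `maxLines` lines from the end of `text`.
function tailLines(text: string, maxLines: number): Tail {
  const limit = Math.max(0, Math.floor(maxLines));
  const all = text.split(/\r?\n/);
  // A trailing newline yields one empty final element; drop it.
  if (all.length > 0 && all[all.length - 1] === "") {
    all.pop();
  }
  const truncated = all.length > limit;
  return { lines: truncated ? all.slice(all.length - limit) : all, truncated };
}
```

Reporting `truncated` alongside the lines tells the caller when it is worth opening the full log file whose path `job status` also returns.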

## Service Logs

Service logs live under `logs/{YYYYMMDD}/`, and every `server start` generates a new local-time timestamp prefix. backend-core, frontend, and provider-gateway write JSONL files; database writes to the same directory through the PostgreSQL logging collector.
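
The directory naming can be sketched with plain `Date` getters, which use local time. The `logs/{YYYYMMDD}/` layout comes from the text; the `HHMMSS` prefix format and both function names are assumptions, since the text only says that each start generates a new local-time prefix.

```typescript
function pad2(n: number): string {
  return String(n).padStart(2, "0");
}

// Directory for a given start time, using local time as the text specifies.
function serviceLogDir(startedAt: Date): string {
  const day = `${startedAt.getFullYear()}${pad2(startedAt.getMonth() + 1)}${pad2(startedAt.getDate())}`;
  return `logs/${day}/`;
}

// Hypothetical per-start prefix (HHMMSS); the real format is not specified.
function serviceLogPrefix(startedAt: Date): string {
  return `${pad2(startedAt.getHours())}${pad2(startedAt.getMinutes())}${pad2(startedAt.getSeconds())}`;
}
```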

## Log Access

`bun scripts/cli.ts server logs` reads the tails of both the file logs and the Docker logs. File logs are the first place to look when a service crashes; Docker logs are a secondary source for container startup failures and stdout/stderr.
@@ -0,0 +1,15 @@
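# Provider Gateway Reference

Provider Gateway is the container that runs on the compute node side. It only makes outbound connections to the backend-core WebSocket and does not require the compute node to have a public IP, so it suits machines behind NAT, on intranets, or behind firewalls.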
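
## Main Server Self Provider

The main server also runs a provider-gateway, whose `providerId` always comes from `providerGateway.id` in `config.json`. This lets a single-machine environment verify the full distributed scheduling loop: the frontend submits a task, core writes it to the database and dispatches it over WebSocket, and the provider gateway executes it and reports the status back.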
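
The messages that travel over this loop could be modeled as a discriminated union. Everything in this sketch is an assumption made for illustration: the text does not define the wire format, the message type names, or the status values other than `succeeded`.

```typescript
// Hypothetical wire format; the documents do not define message shapes.
type TaskStatus = "running" | "succeeded" | "failed";

type GatewayToCore =
  | { type: "heartbeat"; providerId: string; sentAt: number }
  | { type: "task_status"; taskId: string; status: TaskStatus };

interface DispatchCommand {
  type: "dispatch";
  taskId: string;
  providerId: string;
  command: string; // for example "docker.ps"
}

const TASK_STATUSES: readonly string[] = ["running", "succeeded", "failed"];

// Core side: validate an incoming frame before persisting anything.
function parseGatewayMessage(raw: string): GatewayToCore {
  const value: unknown = JSON.parse(raw);
  if (typeof value !== "object" || value === null) {
    throw new Error("message must be a JSON object");
  }
  const msg = value as Record<string, unknown>;
  if (msg["type"] === "heartbeat") {
    const providerId = msg["providerId"];
    const sentAt = msg["sentAt"];
    if (typeof providerId === "string" && typeof sentAt === "number") {
      return { type: "heartbeat", providerId, sentAt };
    }
  }
  if (msg["type"] === "task_status") {
    const taskId = msg["taskId"];
    const status = msg["status"];
    if (typeof taskId === "string" && typeof status === "string" && TASK_STATUSES.includes(status)) {
      return { type: "task_status", taskId, status: status as TaskStatus };
    }
  }
  throw new Error("unknown or malformed gateway message");
}

// Core side: serialize a dispatch command for the provider's WebSocket.
function encodeDispatch(command: DispatchCommand): string {
  return JSON.stringify(command);
}
```

Validating frames at the boundary means only well-formed state changes reach the database, which keeps the centralized state trustworthy even when a node misbehaves.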

## Docker Socket Path

Automated task execution may only go through the local Docker socket. Compose mounts `/var/run/docker.sock` into provider-gateway, the provider labels report `dockerSocketPresent`, and the `docker.ps` debug task queries the host's Docker containers through that socket.
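
## Host SSH Maintenance Bridge

Host SSH forwarding is only an auxiliary path for emergency maintenance and is not used for automated task scheduling. The implementation draws on experience from `../web-terminal`: inside the container, a private key mounted read-only is used with `ssh -tt` to connect out to the host sshd, with `StrictHostKeyChecking=accept-new`, `ServerAliveInterval`, and `ServerAliveCountMax` set. This repository keeps `src/components/provider-gateway/scripts/host-ssh-shell.sh` as the maintenance bridge script; the default Compose file does not mount the private key, so that the SSH path cannot be misused as a scheduling channel.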
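
The SSH options named above could be assembled as an argument list like this. It is a TypeScript illustration only, not the contents of `host-ssh-shell.sh`; the option struct, the function name, and the keep-alive defaults (30 seconds, 3 probes) are assumptions.

```typescript
interface HostSshOptions {
  user: string;
  host: string;
  identityFile: string; // path of the read-only mounted private key
  aliveIntervalSeconds?: number;
  aliveCountMax?: number;
}

// Builds the argument list for `ssh`, using the options named in the text.
function hostSshArgs(opts: HostSshOptions): string[] {
  return [
    "-tt", // force a TTY so an interactive shell works from inside the container
    "-i", opts.identityFile,
    "-o", "StrictHostKeyChecking=accept-new",
    "-o", `ServerAliveInterval=${opts.aliveIntervalSeconds ?? 30}`,
    "-o", `ServerAliveCountMax=${opts.aliveCountMax ?? 3}`,
    `${opts.user}@${opts.host}`,
  ];
}
```

With `accept-new`, the first connection records the host key and later connections fail if it changes, which is a reasonable middle ground for a maintenance-only path.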
@@ -0,0 +1,66 @@
- unidesk/ (Repository root: configuration, orchestration, CLI, and documentation)
  - AGENTS.md (Top-level agent index and `scripts/cli.ts` usage guide)
  - TEST.md (Manual CLI test plan following cli-spec expectations)
  - config.json (Single source of truth for ports, tokens, runtime, paths, and provider identity)
  - docker-compose.yml (Main server orchestration for database, backend-core, frontend, provider-gateway)
  - package.json / bun.lock (Root Bun tooling for CLI checks)
  - .gitignore
  - reference -> docs/reference (Compatibility symlink for older references)
  - scripts/ (Unified CLI and implementation modules)
    - cli.ts (Single Bun CLI entry)
    - tsconfig.json (TypeScript check scope for CLI)
    - src/ (CLI business logic modules; `cli.ts` remains a thin router)
      - config.ts (Root config loading and validation)
      - docker.ts (Docker Compose env generation, start/stop/status/logs)
      - jobs.ts (Fire-and-Forget job state under `.state/jobs/`)
      - check.ts (Formal checks)
      - debug.ts (Real-flow debug helpers)
      - command.ts (Bounded command execution helpers)
      - output.ts (JSON output helpers)
      - e2e.ts (Public API, database, provider, and Playwright frontend E2E checks)
  - logs/ (Generated service logs; ignored by git)
  - .state/ (Generated job state and compose env; ignored by git)
  - docs/
    - issue/ (Manual test issue records)
    - reference/ (Long-term reference documents)
      - arch.md (Distributed work platform architecture)
      - repo-tree.md (This repository structure reference)
      - cli.md (CLI command model and async job contract)
      - config.md (Config and runtime rules)
      - deployment.md (Docker stack deployment and health criteria)
      - frontend.md (Frontend layout and design rules)
      - provider-gateway.md (Provider connection and host SSH maintenance bridge)
      - observability.md (Logs and status visibility)
      - e2e.md (Delivery gate, Playwright frontend E2E, and database persistence checks)
  - src/ (TypeScript component monorepo)
    - package.json (Component workspace metadata)
    - bun.lock (Component dependency lockfile)
    - tsconfig.base.json (Project references for component checks)
    - tsconfig.check.json (No-emit TypeScript check scope for all components)
    - components/
      - shared/ (Shared message types and utilities)
        - package.json
        - tsconfig.json
        - src/index.ts
      - backend-core/ (UniDesk stateless core service container)
        - package.json
        - tsconfig.json
        - Dockerfile
        - src/index.ts (REST API, WebSocket provider server, scheduler, database access)
      - frontend/ (Frontend web application container)
        - package.json
        - tsconfig.json
        - Dockerfile
        - src/index.ts (Bun static server and runtime config injection)
        - public/ (HTML/CSS/JS assets for the compact industrial console)
      - provider-gateway/ (Compute node Provider Gateway container)
        - package.json
        - tsconfig.json
        - Dockerfile
        - src/index.ts (WebSocket client, heartbeat, Docker adapter)
        - scripts/host-ssh-shell.sh (Optional maintenance-only SSH bridge)
      - database/ (PostgreSQL initialization and configuration)
        - config/postgresql.conf
        - init/001_unidesk_init.sql
      - microservices/ (Reserved for future stateless microservices)
        - example-service/