pikasTech-unidesk/docs/reference/arch.md at caa80ee5e7856f86927baf71c78c78f537759fed

Codex caa80ee5e7 feat: initialize unidesk platform

2026-05-04 11:09:35 +00:00

Requirements
- Build a distributed work platform covering research, project development, and project management
- Deploy the main entry point on a server with a public IP, providing a unified interface
- Multiple computing resource machines join the platform to execute computing tasks
- The platform must support task scheduling, state monitoring, versioned code distribution, and large file storage
- Design goals are high availability, high concurrency, centralized state management, and stateless compute nodes
Key Assumptions
- The main server has a public IP and can be accessed from the internet
- Computing resource machines have no public IP, possibly behind NAT or firewalls
- Computing resource machines have stable outbound network connectivity, over either the intranet or the internet
- Computing resource machines can run Docker and support WSL (some nodes are Windows workstations)
- Users interact with the platform only through the main server entry point, never directly with compute nodes
- The main server's availability is higher than that of computing resource machines; compute nodes may go offline frequently due to hardware, network, or human factors
- Components that would otherwise be single points of failure are deployed on the main server first, so that the critical path benefits from its higher availability
UniDesk Distributed Work Platform Architecture
- Overview
  - The main server hosts all stateless business logic as the unified entry point
  - Computing resource nodes actively connect via lightweight Provider Gateway containers
  - All state is stored centrally in PostgreSQL, never scattered across nodes
  - Code and environments are distributed as pinned versions from GitHub repositories; the large file storage solution is yet to be determined
  - The main server also connects itself to the platform as a compute node, using the exact same method as ordinary compute nodes
  - This design allows verification of the full distributed dispatching flow on a single main server
- Main Server Components
  - UniDesk Stateless Services
    - Run all business microservices as Docker containers
    - Includes API gateway, task scheduler, project management, and other stateless modules
    - Instances can scale horizontally; failure recovery requires no state synchronization
  - PostgreSQL Database
    - Deployed as a Docker container with a 10 GB named volume
    - Stores all task metadata, node heartbeats, resource labels, and business state
    - Backed up periodically via pg_dump, keeping the last 7 daily snapshots
    - The named volume ensures data survives container recreation or upgrades
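
The 7-day retention rule for the pg_dump snapshots can be sketched as a small pruning helper. This is only an illustration: the function name `snapshots_to_delete` and the idea of identifying each snapshot by its calendar date are assumptions, not part of the platform.

```python
from datetime import date, timedelta


def snapshots_to_delete(snapshot_dates, keep=7):
    """Return the snapshot dates that fall outside the newest `keep` daily snapshots.

    Hypothetical helper: the document only states that the last 7 daily
    snapshots are kept, not how snapshots are named or pruned.
    """
    newest_first = sorted(set(snapshot_dates), reverse=True)
    return sorted(newest_first[keep:])


# Example: ten consecutive daily snapshots, of which the newest seven are kept.
today = date(2026, 5, 4)
existing = [today - timedelta(days=n) for n in range(10)]
expired = snapshots_to_delete(existing)
print([d.isoformat() for d in expired])
```

A backup job could run such a check after each successful dump and delete only the returned dates, so a failed dump never reduces the number of usable snapshots.
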
- Code and Environment Distribution
  - Code repositories and execution environment definitions may reside in multiple GitHub repositories
  - When dispatching a task, five metadata items must be specified: code_repo_url, code_commit_id, env_repo_url, env_commit_id, and dockerfile_path
  - A single env repo can contain multiple Dockerfiles defining different execution environments, distinguished by dockerfile_path
  - Compute nodes maintain a local Git cache and only incrementally fetch the specified version each time
  - Docker layer caching accelerates environment builds: after the first build, rebuilding an unchanged environment reuses cached layers and completes much faster
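
The five metadata items could be carried in a small immutable record that is validated before dispatch. The sketch below is a minimal illustration, not the platform's actual schema: the class name `TaskSpec`, the JSON encoding, the example URLs, and the requirement that commit IDs be full 40-character hexadecimal hashes are all assumptions.

```python
import json
import re
from dataclasses import asdict, dataclass

# Assumption: commit IDs are full 40-character SHA-1 hashes in lowercase hex.
_FULL_SHA = re.compile(r"^[0-9a-f]{40}$")


@dataclass(frozen=True)
class TaskSpec:
    """The five metadata items named in the text (field names from the document)."""

    code_repo_url: str
    code_commit_id: str
    env_repo_url: str
    env_commit_id: str
    dockerfile_path: str

    def validate(self):
        """Return a list of human-readable problems; an empty list means valid."""
        problems = [f"{name} is empty" for name, value in asdict(self).items() if not value]
        for name in ("code_commit_id", "env_commit_id"):
            value = getattr(self, name)
            if value and not _FULL_SHA.match(value):
                problems.append(f"{name} is not a full 40-character commit hash")
        return problems

    def to_message(self):
        """Serialize to JSON (hypothetical wire format for a dispatch command)."""
        return json.dumps(asdict(self), sort_keys=True)


# Hypothetical repositories and commit hashes, for illustration only.
spec = TaskSpec(
    code_repo_url="https://github.com/example/project.git",
    code_commit_id="a" * 40,
    env_repo_url="https://github.com/example/envs.git",
    env_commit_id="b" * 40,
    dockerfile_path="docker/train.Dockerfile",
)
print(spec.validate())
print(spec.to_message())
```

Rejecting branch names in favor of full commit hashes keeps each task reproducible, since a branch can move between dispatch and execution while a commit hash cannot.
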
- Compute Node Connection Scheme
  - Provider Gateway Docker
    - Each computing resource machine runs a Provider Gateway container
    - Acts as the node-side gateway, bridging the main server and the local execution environment
    - The container houses the agent logic, implementing a WebSocket client and local scheduling
  - WebSocket Persistent Connection
    - Provider Gateway actively initiates a WebSocket connection to the main server
    - Commands, heartbeats, and task statuses are exchanged bidirectionally over this persistent connection
    - The main server never initiates connections to nodes, which suits nodes that have no public IP or sit behind NAT
  - Interaction with Local Execution Environment
    - The primary path for automated task dispatching and execution is via the local Docker socket
    - Access to the local environment via WSL SSH is reserved solely as an auxiliary path for emergency maintenance and troubleshooting
    - Automating task deployment or dispatching through the WSL SSH channel is forbidden
  - Connection Management
    - When registering, a node carries an authentication token to verify its identity and declares resources such as GPU/CPU
    - The authentication token is pre-issued by the main server and configured at Provider Gateway startup
    - Heartbeats are sent every 15 seconds; if no heartbeat arrives for 90 seconds, the node is marked offline
    - On disconnect, the Provider Gateway reconnects automatically with exponential backoff, so that many nodes reconnecting at once do not cause a thundering herd on the main server
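
The heartbeat timeout and the reconnect policy can be sketched as two small functions. The 15-second interval and 90-second threshold come from the text; the base delay of 1 second, the 60-second cap, the use of random jitter, and the function names are assumptions chosen for illustration.

```python
import random

HEARTBEAT_INTERVAL_S = 15  # from the text: one heartbeat every 15 seconds
OFFLINE_AFTER_S = 90       # from the text: offline after 90 seconds of silence


def is_offline(last_heartbeat_at, now):
    """Server side: True once no heartbeat has arrived for OFFLINE_AFTER_S seconds."""
    return (now - last_heartbeat_at) >= OFFLINE_AFTER_S


def reconnect_delay(attempt, base=1.0, cap=60.0, rng=random):
    """Node side: exponential backoff with full jitter.

    The ceiling doubles with each failed attempt up to `cap`, and the actual
    delay is drawn uniformly from [0, ceiling], so nodes that lost their
    connection at the same moment spread out their reconnect attempts.
    `base` and `cap` are assumed values, not taken from the document.
    """
    exponent = min(attempt, 16)  # bound the exponent so the power stays small
    ceiling = min(cap, base * (2 ** exponent))
    return rng.uniform(0.0, ceiling)


# Example: the upper bound of the delay for the first eight attempts.
for attempt in range(8):
    print(attempt, min(60.0, 1.0 * 2 ** attempt))
```

With these numbers a node can miss five consecutive heartbeats before the sixth missing one marks it offline, which tolerates short network stalls without hiding a real outage for long.
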
- Data Flow and State Management
  - Task commands are delivered over WebSocket and never contain large file content
  - All state changes are reported to the main server in real time by Provider Gateway
  - The main server writes state updates to PostgreSQL, completing the unified closed loop
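
The write path for state updates can be sketched as an upsert keyed by task. To keep the example self-contained it uses Python's built-in `sqlite3` as a stand-in for PostgreSQL; the table name `task_state`, its columns, the status strings, and the rule that older reports never overwrite newer ones are all assumptions.

```python
import sqlite3


def open_state_store():
    """Create an in-memory table standing in for the central PostgreSQL state."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE task_state ("
        " task_id TEXT PRIMARY KEY,"
        " node_id TEXT NOT NULL,"
        " status TEXT NOT NULL,"
        " reported_at REAL NOT NULL)"
    )
    return conn


def apply_status_report(conn, report):
    """Insert or update one task's state from a Provider Gateway report.

    The WHERE clause ignores reports older than the stored one, so a
    delayed message cannot roll a task back to an earlier status.
    """
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT INTO task_state (task_id, node_id, status, reported_at)"
            " VALUES (:task_id, :node_id, :status, :reported_at)"
            " ON CONFLICT(task_id) DO UPDATE SET"
            " node_id = excluded.node_id,"
            " status = excluded.status,"
            " reported_at = excluded.reported_at"
            " WHERE excluded.reported_at >= task_state.reported_at",
            report,
        )


def current_status(conn, task_id):
    row = conn.execute(
        "SELECT status FROM task_state WHERE task_id = ?", (task_id,)
    ).fetchone()
    return None if row is None else row[0]


# Hypothetical reports; the last one arrives late and is ignored.
store = open_state_store()
apply_status_report(store, {"task_id": "t1", "node_id": "n1", "status": "running", "reported_at": 10.0})
apply_status_report(store, {"task_id": "t1", "node_id": "n1", "status": "succeeded", "reported_at": 20.0})
apply_status_report(store, {"task_id": "t1", "node_id": "n1", "status": "running", "reported_at": 15.0})
print(current_status(store, "t1"))
```

Because the stateless services hold nothing between requests, any instance can apply a report like this and the result is the same.
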
- Critical Task Deployment Principles
  - Single-point components such as the database, core scheduler logic, and API gateway are deployed on the main server
  - The main server's high-availability environment keeps the critical scheduling path running regardless of compute-node failures
  - Compute nodes are only responsible for task execution; their offline status does not affect overall platform availability
- Large File Storage Solution
  - The concrete implementation is yet to be determined, but it must meet the following requirements
  - Support automated pull and upload by compute nodes without human intervention
  - Provide a programmable interface for the scheduler to generate temporary access credentials
  - Have sufficient bandwidth so that concurrent reads/writes never become the bottleneck for training tasks
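
Since the storage backend is undecided, the credential requirement can only be illustrated generically. The sketch below shows one common approach, an HMAC-signed grant with an expiry time; the function names, the payload layout, and the shared secret are assumptions, and a real backend may provide its own signed-URL or token mechanism instead.

```python
import base64
import hashlib
import hmac
import time


def _sign(secret, method, object_key, expires_at):
    """HMAC-SHA256 over the fields that the credential must bind together."""
    payload = f"{method}\n{object_key}\n{expires_at}".encode()
    digest = hmac.new(secret, payload, hashlib.sha256).digest()
    return base64.urlsafe_b64encode(digest).decode()


def issue_credential(secret, object_key, method, ttl_s, now=None):
    """Scheduler side: grant `method` access to `object_key` for `ttl_s` seconds."""
    issued_at = time.time() if now is None else now
    expires_at = int(issued_at + ttl_s)
    return {
        "object_key": object_key,
        "method": method,
        "expires_at": expires_at,
        "signature": _sign(secret, method, object_key, expires_at),
    }


def verify_credential(secret, cred, now=None):
    """Storage side: accept only unexpired credentials with a valid signature."""
    current = time.time() if now is None else now
    if current >= cred["expires_at"]:
        return False
    expected = _sign(secret, cred["method"], cred["object_key"], cred["expires_at"])
    return hmac.compare_digest(expected, cred["signature"])


# Hypothetical secret and object key, with fixed timestamps for a repeatable example.
secret = b"example-shared-secret"
cred = issue_credential(secret, "datasets/sample.tar", "GET", ttl_s=300, now=1000)
print(verify_credential(secret, cred, now=1100))
print(verify_credential(secret, cred, now=1300))
```

Binding the method and object key into the signature means a read grant for one object cannot be reused to write or to fetch a different object.
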
- Deployment Notes
  - Use docker-compose on the main server to orchestrate all services uniformly
  - PostgreSQL uses a named volume to guarantee data persistence
  - The Provider Gateway image is built uniformly and distributed to all compute nodes in a versioned manner
