- Requirements
  - Build a distributed work platform covering research, project development, and project management
  - Deploy the main entry point on a server with a public IP, providing a unified interface
  - Multiple computing resource machines join the platform to execute computing tasks
  - The platform must support task scheduling, state monitoring, versioned code distribution, and large file storage
  - Design goals are high availability, high concurrency, centralized state management, and stateless compute nodes
- Key Assumptions
  - The main server has a public IP and can be accessed from the internet
  - Computing resource machines have no public IP and may sit behind NAT or firewalls
  - Computing resource machines have stable outbound network connectivity (to the intranet or the internet)
  - Computing resource machines can run Docker and, where the node is a Windows workstation, support WSL
  - Users interact with the platform only through the main server entry point, never directly with compute nodes
  - The main server is more available than the computing resource machines; compute nodes may go offline frequently due to hardware, network, or human factors
  - Components that would be single points of failure are deployed on the main server first, so that its high-availability environment protects the critical path
- UniDesk Distributed Work Platform Architecture
- Overview
  - The main server hosts all stateless business logic as the unified entry point
  - Computing resource nodes connect outbound to the main server through lightweight Provider Gateway containers
  - All state is stored centrally in PostgreSQL, never scattered across nodes
  - Code and environments are distributed as pinned GitHub commits; the large file storage solution is to be determined
  - The main server also registers itself with the platform as a compute node, using exactly the same method as ordinary compute nodes
  - This allows the full distributed dispatching flow to be verified on a single main server
- Main Server Components
- UniDesk Stateless Services
  - All user-facing services run as Docker containers mounted onto the UniDesk core; the core can still run without them
  - Includes the frontend gateway, task scheduler, project management, provider ingress, and other stateless modules
  - Instances can scale horizontally; failure recovery requires no state synchronization
  - Only the frontend gateway and provider ingress are public; core REST APIs and PostgreSQL remain on the Docker internal network
- PostgreSQL Database
  - Deployed as a Docker container with a 10 GB named volume
  - Stores all task metadata, node heartbeats, resource labels, and business state
  - Backed up periodically via `pg_dump`, keeping the last 7 daily snapshots
  - The named volume ensures data survives container recreation or upgrades
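
The retention rule above (keep the last 7 daily snapshots) can be sketched as a small pruning helper. This is a minimal sketch, not the platform's actual backup job: the `unidesk-YYYY-MM-DD.dump` file naming and the `snapshots_to_delete` helper are assumptions made for illustration; only the 7-snapshot limit comes from the text.

```python
from datetime import date

KEEP_DAILY = 7  # retention limit stated in the notes above


def snapshots_to_delete(names: list[str], keep: int = KEEP_DAILY) -> list[str]:
    """Return the snapshot file names that fall outside the retention window.

    Assumes (hypothetically) names of the form 'unidesk-YYYY-MM-DD.dump',
    so ordering by the embedded ISO date sorts snapshots oldest to newest.
    """
    def snapshot_date(name: str) -> date:
        stem = name.removeprefix("unidesk-").removesuffix(".dump")
        return date.fromisoformat(stem)

    ordered = sorted(names, key=snapshot_date)
    # Everything except the newest `keep` snapshots is eligible for deletion.
    return ordered[: max(len(ordered) - keep, 0)]


if __name__ == "__main__":
    names = [f"unidesk-2024-01-{day:02d}.dump" for day in range(1, 11)]
    for name in snapshots_to_delete(names):
        print("would delete", name)
```

A backup job could run `pg_dump` first and only prune after the new snapshot has been written successfully, so a failed dump never reduces the number of usable backups.
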
- Code and Environment Distribution
  - Code repositories and execution environment definitions may reside in multiple GitHub repositories
  - When dispatching a task, five metadata items must be specified: `code_repo_url`, `code_commit_id`, `env_repo_url`, `env_commit_id`, and `dockerfile_path`
  - A single env repo can contain multiple Dockerfiles defining different execution environments, distinguished by `dockerfile_path`
  - Compute nodes maintain a local Git cache and fetch only the specified commit incrementally for each task
  - Docker layer caching accelerates environment builds; after the first build, rebuilds whose layers are unchanged are nearly instantaneous
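
The five dispatch metadata items can be modeled as one immutable record. The sketch below is illustrative only: the `TaskSource` class, the requirement that commit ids be full 40-character SHA-1 hashes, and the `unidesk-env:` image tag are assumptions, not part of the design above. The commands it builds use standard `git fetch <remote> <commit>` and `docker build -f <dockerfile> -t <tag> <context>` invocations and are returned as argument lists rather than executed.

```python
import re
from dataclasses import dataclass

_COMMIT_RE = re.compile(r"[0-9a-f]{40}")  # assumption: full SHA-1 commit ids only


@dataclass(frozen=True)
class TaskSource:
    """The five metadata items that pin a task's code and environment."""
    code_repo_url: str
    code_commit_id: str
    env_repo_url: str
    env_commit_id: str
    dockerfile_path: str

    def __post_init__(self) -> None:
        for field_name in ("code_commit_id", "env_commit_id"):
            if not _COMMIT_RE.fullmatch(getattr(self, field_name)):
                raise ValueError(f"{field_name} must be a full 40-character commit SHA")
        parts = self.dockerfile_path.split("/")
        if self.dockerfile_path.startswith("/") or ".." in parts:
            raise ValueError("dockerfile_path must be relative to the env repo root")

    def code_fetch_command(self, remote: str = "origin") -> list[str]:
        """argv that fetches only the pinned code commit into the local Git cache."""
        return ["git", "fetch", remote, self.code_commit_id]

    def build_command(self, env_checkout_dir: str) -> list[str]:
        """argv that builds the environment image from the selected Dockerfile."""
        tag = f"unidesk-env:{self.env_commit_id[:12]}"  # hypothetical tag scheme
        dockerfile = f"{env_checkout_dir}/{self.dockerfile_path}"
        return ["docker", "build", "-f", dockerfile, "-t", tag, env_checkout_dir]
```

Tagging the image by environment commit is one way to let a node skip the build entirely when an image for that commit already exists locally.
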
- Compute Node Connection Scheme
- Provider Gateway Docker
  - Each computing resource machine runs a Provider Gateway container
  - It acts as the node-side gateway, bridging the main server and the local execution environment
  - The container houses the agent logic, implementing a WebSocket client and local scheduling
- WebSocket Persistent Connection
  - Provider Gateway actively initiates a WebSocket connection to the main server
  - Commands, heartbeats, and task statuses are exchanged bidirectionally over this persistent connection
  - The main server never initiates connections to nodes, which suits nodes that have no public IP or sit behind NAT
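
The three kinds of traffic named above (commands, heartbeats, task statuses) could share a single message envelope on the connection. The notes do not specify a wire format, so everything in this sketch is an assumption: JSON encoding, the `Envelope` class, and the `type`, `node_id`, and `payload` field names are hypothetical.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Any

# The message kinds listed in the notes; the string names are assumptions.
MESSAGE_TYPES = {"command", "heartbeat", "task_status"}


@dataclass
class Envelope:
    """One frame exchanged between Provider Gateway and the main server."""
    type: str
    node_id: str
    payload: dict[str, Any] = field(default_factory=dict)

    def __post_init__(self) -> None:
        if self.type not in MESSAGE_TYPES:
            raise ValueError(f"unknown message type: {self.type!r}")

    def to_json(self) -> str:
        """Serialize to a text frame suitable for a WebSocket send."""
        return json.dumps(asdict(self), sort_keys=True)

    @classmethod
    def from_json(cls, raw: str) -> "Envelope":
        """Parse and validate a received text frame."""
        data = json.loads(raw)
        return cls(
            type=data["type"],
            node_id=data["node_id"],
            payload=data.get("payload", {}),
        )
```

Validating the message type on both construction and parsing means an unknown frame fails at the boundary instead of deep inside the scheduler or the gateway.
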
- Interaction with Local Execution Environment
  - The primary path for automated task dispatching and execution is via the local Docker socket
  - Access to the local environment via WSL SSH is reserved solely as an auxiliary path for emergency maintenance and troubleshooting, exposed only as bounded `host.ssh` probe/exec tasks
  - Automating task deployment or dispatching through the WSL SSH channel is forbidden
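
The channel rules above can be expressed as a guard that runs before any task reaches the node. This is a sketch under stated assumptions: the `TaskRequest` shape, the `automated` flag, and the 60-second limit used to make `host.ssh` tasks "bounded" are all hypothetical; the notes only say that `host.ssh` is limited to probe/exec tasks and must not carry automated dispatch.

```python
from dataclasses import dataclass

SSH_ACTIONS = {"probe", "exec"}  # the only host.ssh task kinds named in the notes
MAX_SSH_TIMEOUT_S = 60           # hypothetical bound; the notes give no number


@dataclass(frozen=True)
class TaskRequest:
    channel: str         # "docker" or "host.ssh"
    action: str          # e.g. "run" for docker, "probe" or "exec" for host.ssh
    automated: bool      # True when issued by the scheduler, not an operator
    timeout_s: int = 30


def check_channel_policy(req: TaskRequest) -> None:
    """Raise PermissionError if the request violates the channel rules."""
    if req.channel == "docker":
        return  # primary path: automated dispatch is allowed here
    if req.channel != "host.ssh":
        raise PermissionError(f"unknown channel: {req.channel!r}")
    if req.automated:
        raise PermissionError("automated dispatch over host.ssh is forbidden")
    if req.action not in SSH_ACTIONS:
        raise PermissionError(f"host.ssh only allows {sorted(SSH_ACTIONS)}")
    if not 0 < req.timeout_s <= MAX_SSH_TIMEOUT_S:
        raise PermissionError("host.ssh tasks must have a bounded timeout")
```
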
- Connection Management
  - On registration, a node presents an authentication token to prove its identity and declares its resources, such as GPU and CPU
  - The authentication token is pre-issued by the main server and configured at Provider Gateway startup
  - Heartbeats are sent every 15 seconds; if no heartbeat arrives for 90 seconds, the node is marked offline
  - On disconnect, the gateway reconnects automatically with exponential backoff to avoid a thundering herd on the main server
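
The timing rules above reduce to two small functions: an offline check on the server side and a reconnect delay on the gateway side. The 15-second interval and 90-second threshold come from the notes; the 1-second base, the 60-second cap, and the use of random jitter are assumptions added for this sketch (jitter is a common companion to exponential backoff, since it stops nodes that dropped together from retrying in lockstep).

```python
import random
from typing import Optional

HEARTBEAT_INTERVAL_S = 15  # from the notes
OFFLINE_AFTER_S = 90       # from the notes: six missed heartbeats in a row


def is_offline(last_heartbeat_ts: float, now_ts: float) -> bool:
    """True once no heartbeat has arrived for OFFLINE_AFTER_S seconds."""
    return now_ts - last_heartbeat_ts >= OFFLINE_AFTER_S


def backoff_delay(
    attempt: int,
    base_s: float = 1.0,    # hypothetical base delay
    cap_s: float = 60.0,    # hypothetical upper bound
    rng: Optional[random.Random] = None,
) -> float:
    """Exponential backoff with full jitter.

    The ceiling doubles with each attempt up to cap_s, and the actual delay
    is drawn uniformly from [0, ceiling].
    """
    if rng is None:
        rng = random.Random()
    exponent = min(max(attempt, 0), 30)  # clamp so 2**exponent stays small
    ceiling = min(cap_s, base_s * (2 ** exponent))
    return rng.uniform(0.0, ceiling)
```

A gateway reconnect loop would call `backoff_delay(attempt)` before each retry and reset `attempt` to zero once a connection has been established and the first heartbeat acknowledged.
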
- Data Flow and State Management
  - Task commands are delivered over WebSocket and never contain large file content
  - Provider Gateway reports all state changes to the main server in real time
  - The main server writes state updates to PostgreSQL, closing the loop on centralized state
  - Pipeline workflow control follows the OA event-flow model: OA is the only control bus, factual node events remain policy-neutral, and runner, monitor, frontend, and CLI actions are represented as OA events. Detailed constraints live in `docs/reference/pipeline-oa-event-flow.md`
- Critical Task Deployment Principles
  - Single-point components such as the database, core scheduler logic, and API gateway are deployed on the main server
  - The high-availability environment of the main server keeps the critical scheduling path available
  - Compute nodes are only responsible for task execution; a node going offline does not affect overall platform availability
- Large File Storage Solution
  - The concrete implementation is to be determined, but it must meet the following requirements
  - Support automated pull and upload by compute nodes without human intervention
  - Provide a programmable interface for the scheduler to generate temporary access credentials
  - Have sufficient bandwidth so that concurrent reads and writes do not become the bottleneck for training tasks
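
Because the storage backend is still undecided, the "temporary access credentials" requirement can only be illustrated generically. The sketch below shows one possible shape, an HMAC-signed token that expires, using only the Python standard library. The function names, claim fields, and 15-minute default lifetime are assumptions; a real deployment would more likely rely on whatever signed-URL or scoped-token mechanism the chosen storage system provides.

```python
import base64
import hashlib
import hmac
import json
from typing import Optional


def issue_credential(secret: bytes, node_id: str, object_key: str,
                     mode: str, now: int, ttl_s: int = 900) -> str:
    """Scheduler side: sign a short-lived grant for one object and one mode."""
    if mode not in ("read", "write"):
        raise ValueError("mode must be 'read' or 'write'")
    claims = {"node": node_id, "key": object_key, "mode": mode, "exp": now + ttl_s}
    body = base64.urlsafe_b64encode(json.dumps(claims, sort_keys=True).encode("utf-8"))
    signature = hmac.new(secret, body, hashlib.sha256).hexdigest()
    return f"{body.decode('ascii')}.{signature}"


def verify_credential(secret: bytes, token: str, now: int) -> Optional[dict]:
    """Storage side: return the claims if the token is authentic and unexpired."""
    body, sep, signature = token.rpartition(".")
    if not sep:
        return None
    expected = hmac.new(secret, body.encode("utf-8"), hashlib.sha256).hexdigest()
    # Constant-time comparison avoids leaking how many characters matched.
    if not hmac.compare_digest(signature.encode("utf-8"), expected.encode("ascii")):
        return None
    claims = json.loads(base64.urlsafe_b64decode(body))
    return claims if now < claims["exp"] else None
```

Passing `now` explicitly keeps both functions deterministic; callers would normally supply `int(time.time())`.
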
- Deployment Notes
  - Use `docker-compose` on the main server to orchestrate all services uniformly
  - PostgreSQL uses a named volume to guarantee data persistence
  - The Provider Gateway image is built uniformly and distributed to all compute nodes in a versioned manner