docs: plan d601 k3s dev environment

2026-05-17 12:14:05 +00:00
parent 4da70ca671
commit 5093bec450
1 changed files with 345 additions and 0 deletions
@@ -0,0 +1,345 @@
+# D601 k3s Development Environment Plan
+
+## Goal
+
+Build an isolated UniDesk development environment inside the existing D601 native k3s cluster so LLM-driven development can deploy, break, rebuild, and validate backend-core, frontend, Code Queue, and their database dependencies without interrupting the production main server.
+
+The first version must support deployment by GitHub commit id through environment deploy manifests. The desired long-term control point is GitHub-hosted `deploy.json`: deploying an environment reads the `deploy.json` stored on the matching GitHub environment branch and applies the commit ids declared there.
+
+Initial environment branches:
+
+- `deploy/dev`: desired state for the D601 k3s development environment.
+- `deploy/prod`: desired state for production. Branch protection can be added later; the first implementation must still keep prod deployment commands and credentials separate from dev.
+
+## Non-Goals
+
+- Do not create a second physical k3s control plane in the first version. Use the existing D601 native k3s cluster with namespace-level isolation.
+- Do not move production main server backend-core/frontend into k3s in the first version.
+- Do not let the dev environment share production PostgreSQL tables, provider identity, provider token, Code Queue task state, or deployment worktree paths.
+- Do not make `deploy/dev` or `deploy/prod` aliases for normal source branches. They are environment desired-state branches.
+
+## Target Dev Topology
+
+The first dev environment runs in namespace `unidesk-dev` on D601:
+
+- `postgres-dev`: independent PostgreSQL StatefulSet or equivalent persistent database for dev.
+- `backend-core-dev`: backend-core built from the commit id declared in `deploy/dev:deploy.json`.
+- `frontend-dev`: frontend built from the commit id declared in `deploy/dev:deploy.json`, proxying only to `backend-core-dev`.
+- `code-queue-mgr-dev`: lightweight Code Queue control plane using the dev database.
+- `code-queue-read-dev`, `code-queue-write-dev`, `code-queue-scheduler-dev`: Code Queue k3s execution components using dev database, dev logs, dev state paths, and dev Code Queue settings.
+- Optional first-access path: SSH port-forward or a private D601-hosted ingress. Public exposure is not required for phase 1.
+
+All dev services must report environment identity in `/health`:
+
+- `environment=dev`
+- namespace
+- database name
+- service id
+- GitHub repo and commit id
+- deployment ref, expected to be `origin/deploy/dev`
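+
+As a rough illustration, a `/health` response carrying these identity fields might look like the sketch below. The field names (`namespace`, `database`, `serviceId`, `repo`, `commitId`, `deployRef`) and the database name `unidesk_dev` are assumptions made for this example; the plan fixes what must be reported, not the exact payload shape.
+
+```json
+{
+  "_comment": "Hypothetical payload shape; field names and database name are assumptions.",
+  "status": "ok",
+  "environment": "dev",
+  "namespace": "unidesk-dev",
+  "database": "unidesk_dev",
+  "serviceId": "backend-core",
+  "repo": "https://github.com/pikasTech/unidesk",
+  "commitId": "<commit>",
+  "deployRef": "origin/deploy/dev"
+}
+```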
+
+## Core Isolation Rules
+
+1. Dev services must use `unidesk-dev` namespace only.
+2. Dev services must use a dev PostgreSQL instance or database. They must not connect to production PostgreSQL.
+3. Dev provider identity must be separate, for example `D601-dev`; it must not reuse production `D601` provider id or provider token.
+4. Dev Code Queue tasks, queues, attempts, notifications, and trace state must not be written to production tables unless table names are explicitly namespaced and verified safe. The preferred first version is a separate dev database.
+5. Dev manifests must not mount production deployment roots such as `/root/unidesk` on the main server or production D601 deployment paths unless the mount is read-only and explicitly needed for diagnostics.
+6. Dev Code Queue must use dev work directories, dev log directories, and dev state directories.
+7. Production deploy must not read a local dirty `deploy.json`; production deploy must read the production desired state from the configured GitHub environment ref.
+8. LLM/Code Queue development tasks should only receive dev deploy credentials by default.
+
+## Deploy Manifest Model
+
+Use one schema for environment manifests:
+
+```json
+{
+  "schemaVersion": 1,
+  "environment": "dev",
+  "services": [
+    {
+      "id": "backend-core",
+      "repo": "https://github.com/pikasTech/unidesk",
+      "commitId": "<commit>"
+    },
+    {
+      "id": "frontend",
+      "repo": "https://github.com/pikasTech/unidesk",
+      "commitId": "<commit>"
+    },
+    {
+      "id": "code-queue",
+      "repo": "https://github.com/pikasTech/unidesk",
+      "commitId": "<commit>"
+    },
+    {
+      "id": "code-queue-mgr",
+      "repo": "https://github.com/pikasTech/unidesk",
+      "commitId": "<commit>"
+    }
+  ]
+}
+```
+
+Environment-to-ref mapping must be fixed in code or canonical config:
+
+- `dev` maps to `origin/deploy/dev`.
+- `prod` maps to `origin/deploy/prod`.
+
+For production, the deploy command should accept an environment name, not an arbitrary branch. A debug or admin-only command may inspect arbitrary refs, but normal prod deployment must use the fixed mapping.
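+
+If the fixed mapping lives in canonical config rather than code, it could be as small as the sketch below. The key names (`environments`, `ref`, `manifestPath`) are hypothetical; only the two refs and the `deploy.json` file name come from this plan.
+
+```json
+{
+  "_comment": "Hypothetical canonical config; key names are assumptions.",
+  "environments": {
+    "dev": { "ref": "origin/deploy/dev", "manifestPath": "deploy.json" },
+    "prod": { "ref": "origin/deploy/prod", "manifestPath": "deploy.json" }
+  }
+}
+```
+
+Keeping the mapping in one read-only place means `--env prod` can never be pointed at another branch by a command-line argument.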
+
+## Phase 0: Design And Guardrails
+
+Purpose: make the target behavior explicit before adding a second runtime.
+
+Implementation items:
+
+- Define the environment manifest schema and validation rules.
+- Add `environment` to deploy manifests and reject mismatches.
+- Define fixed environment mappings: `dev -> deploy/dev`, `prod -> deploy/prod`.
+- Document target namespace, database, provider identity, and service ids for dev.
+- Add CLI dry-run planning output that prints:
+  - selected environment
+  - GitHub ref
+  - resolved manifest commit
+  - services and commit ids
+  - target namespace
+  - target database fingerprint
+  - target provider identity
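+
+One possible machine-readable shape for the dry-run plan output is sketched below. Every key name is an assumption, and the placeholders stand in for values the CLI would resolve at plan time; only the set of reported items comes from the list above.
+
+```json
+{
+  "_comment": "Hypothetical plan output; key names are assumptions, placeholders are unresolved.",
+  "environment": "dev",
+  "ref": "origin/deploy/dev",
+  "manifestCommit": "<manifest-commit>",
+  "services": [
+    { "id": "backend-core", "commitId": "<commit>" },
+    { "id": "frontend", "commitId": "<commit>" },
+    { "id": "code-queue", "commitId": "<commit>" },
+    { "id": "code-queue-mgr", "commitId": "<commit>" }
+  ],
+  "namespace": "unidesk-dev",
+  "databaseFingerprint": "<fingerprint>",
+  "providerId": "D601-dev"
+}
+```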
+
+Acceptance criteria:
+
+- `deploy plan --env dev` can read and validate a dev manifest without mutating the cluster.
+- `deploy plan --env prod` can read and validate a prod manifest without using the local worktree `deploy.json`.
+- A manifest with `environment=prod` must be rejected for `--env dev`, and the reverse must also be rejected.
+
+## Phase 1: GitHub Environment Branch Deploy Source
+
+Purpose: make GitHub desired-state refs the deploy source of truth.
+
+Implementation items:
+
+- Create or initialize `deploy/dev` with a valid `deploy.json`.
+- Create or initialize `deploy/prod` with a valid `deploy.json`.
+- Add CLI support to fetch an environment ref and read `deploy.json` from that ref.
+- Keep the existing local `deploy.json` path as a compatibility mode only for explicit local/admin workflows.
+- Ensure commit ids listed in the manifest exist in their declared repos.
+- Ensure dev/prod deploy does not depend on a dirty local working tree.
+
+Acceptance criteria:
+
+- `deploy plan --env dev` reads `origin/deploy/dev:deploy.json`.
+- `deploy plan --env prod` reads `origin/deploy/prod:deploy.json`.
+- Changing local `deploy.json` does not affect `--env dev` or `--env prod`.
+- The plan output includes the Git ref and manifest blob/commit used.
+
+## Phase 2: D601 Dev Namespace And Database
+
+Purpose: create the minimum isolated substrate for dev backend and Code Queue state.
+
+Implementation items:
+
+- Add a k8s manifest for namespace `unidesk-dev`.
+- Add dev PostgreSQL StatefulSet/Service/PVC or an equivalent persistent DB.
+- Add dev DB init and migration flow for backend-core and Code Queue tables.
+- Add dev secrets/config:
+  - database credentials
+  - provider token
+  - auth/session secret
+  - Code Queue model secrets if needed
+- Add resource requests/limits so dev DB cannot starve D601 production k3s workloads.
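+
+A minimal sketch of the namespace plus a namespace-wide resource cap, written as a JSON manifest that `kubectl apply -f` accepts. The quota name, the `environment` label, the annotation key, and all CPU/memory figures are placeholder assumptions, not sizing decisions made by this plan.
+
+```json
+{
+  "apiVersion": "v1",
+  "kind": "List",
+  "items": [
+    {
+      "apiVersion": "v1",
+      "kind": "Namespace",
+      "metadata": {
+        "name": "unidesk-dev",
+        "labels": { "environment": "dev" }
+      }
+    },
+    {
+      "apiVersion": "v1",
+      "kind": "ResourceQuota",
+      "metadata": {
+        "name": "unidesk-dev-quota",
+        "namespace": "unidesk-dev",
+        "annotations": {
+          "unidesk.example/note": "Illustrative limits only; real values must be sized for D601."
+        }
+      },
+      "spec": {
+        "hard": {
+          "requests.cpu": "2",
+          "requests.memory": "4Gi",
+          "limits.cpu": "4",
+          "limits.memory": "8Gi"
+        }
+      }
+    }
+  ]
+}
+```
+
+With a compute quota like this in place, Pods in `unidesk-dev` that omit requests or limits are rejected unless a LimitRange supplies defaults, so the dev manifests would need to declare them explicitly.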
+
+Technical decisions:
+
+- Prefer a separate dev PostgreSQL instance over sharing production PostgreSQL with a different database name. It gives the clearest failure boundary.
+- If a shared PostgreSQL server is temporarily used, the CLI and services must hard-check database name and connection target before startup.
+
+Acceptance criteria:
+
+- `kubectl -n unidesk-dev get pods,svc,pvc` shows the dev DB ready.
+- Dev DB survives Pod restart.
+- A dev service configured with the production database URL fails startup validation instead of connecting.
+
+## Phase 3: backend-core-dev And frontend-dev
+
+Purpose: make a usable UniDesk dev control surface independent from production main server Compose.
+
+Implementation items:
+
+- Add k8s manifests for `backend-core-dev` and `frontend-dev`.
+- Build images from the commit ids declared in `deploy/dev:deploy.json`.
+- Inject dev-only config into backend-core:
+  - `UNIDESK_ENV=dev`
+  - dev `MICROSERVICES_JSON`
+  - dev database URL
+  - dev provider token
+  - dev log paths
+- Inject frontend config so it proxies to `backend-core-dev`, not production backend-core.
+- Add service health and readiness probes.
+- Expose dev frontend through port-forward or a private dev ingress.
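+
+The environment injection and probes above could be expressed roughly as follows for `backend-core-dev`. The container port `8080`, the image reference, the `app` label, the annotation key, and the probe timings are assumptions for illustration; the remaining dev-only variables (database URL, provider token, `MICROSERVICES_JSON`, log paths) would come from the dev secrets and config and are omitted here.
+
+```json
+{
+  "apiVersion": "apps/v1",
+  "kind": "Deployment",
+  "metadata": {
+    "name": "backend-core-dev",
+    "namespace": "unidesk-dev",
+    "annotations": {
+      "unidesk.example/note": "Sketch only; port, image, and probe timings are assumptions."
+    }
+  },
+  "spec": {
+    "replicas": 1,
+    "selector": { "matchLabels": { "app": "backend-core-dev" } },
+    "template": {
+      "metadata": { "labels": { "app": "backend-core-dev" } },
+      "spec": {
+        "containers": [
+          {
+            "name": "backend-core",
+            "image": "<registry>/backend-core:dev-<commit>",
+            "env": [{ "name": "UNIDESK_ENV", "value": "dev" }],
+            "ports": [{ "containerPort": 8080 }],
+            "readinessProbe": {
+              "httpGet": { "path": "/health", "port": 8080 },
+              "periodSeconds": 10
+            },
+            "livenessProbe": {
+              "httpGet": { "path": "/health", "port": 8080 },
+              "initialDelaySeconds": 15,
+              "periodSeconds": 20
+            }
+          }
+        ]
+      }
+    }
+  }
+}
+```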
+
+Technical decisions:
+
+- First version can omit public exposure. Port-forward is acceptable while validating isolation.
+- Dev frontend must have a visible DEV environment marker to avoid operator confusion.
+
+Acceptance criteria:
+
+- Dev backend-core `/health` returns ok and includes `environment=dev`.
+- Dev frontend `/health` returns ok and proxies only to dev backend-core.
+- Production `bun scripts/cli.ts server status` remains healthy while dev backend/frontend are redeployed.
+- Rebuilding dev backend/frontend does not touch main server Docker Compose containers.
+
+## Phase 4: code-queue-mgr-dev
+
+Purpose: provide the dev queue management and submission path without writing production Code Queue tables.
+
+Implementation items:
+
+- Add k8s manifest for `code-queue-mgr-dev`.
+- Configure it to use the dev database only.
+- Configure dev backend-core service catalog so stable dev `code-queue` control/read paths route to `code-queue-mgr-dev`.
+- Ensure `code-queue-mgr-dev` can submit, list, summarize, and update dev queue state.
+- Add health output proving:
+  - role is master-control-plane or dev-control-plane
+  - database is dev
+  - schema is ready
+  - no runner dependencies
+
+Acceptance criteria:
+
+- Dev UI/CLI can submit a dry-run or queued task to the dev DB.
+- Production Code Queue task list is unchanged by dev submissions.
+- `code-queue-mgr-dev` memory footprint stays within the lightweight control-plane budget.
+
+## Phase 5: code-queue-dev Execution Components
+
+Purpose: run dev Code Queue execution inside `unidesk-dev` without interfering with production Code Queue.
+
+Implementation items:
+
+- Add dev variants of Code Queue manifests:
+  - `code-queue-read-dev`
+  - `code-queue-write-dev`
+  - `code-queue-scheduler-dev`
+- Configure all dev components to use dev database, dev logs, and dev state paths.
+- Use dev service names and labels so production k3s adapter does not confuse dev and prod services.
+- Decide whether first version supports real Codex execution or smoke-only execution.
+- If real execution is enabled:
+  - isolate workdir paths
+  - isolate Codex/OpenCode XDG/state paths
+  - isolate notifications
+  - cap concurrency and memory
+  - avoid writing production OA Event Flow unless explicitly configured for dev
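+
+One way to keep dev and prod components distinguishable by label is sketched below for the scheduler. The `app.kubernetes.io/*` keys are the standard Kubernetes recommended labels; the `unidesk/environment` key is a hypothetical label introduced here, and how the production k3s adapter actually selects services is not specified in this plan.
+
+```json
+{
+  "metadata": {
+    "name": "code-queue-scheduler-dev",
+    "namespace": "unidesk-dev",
+    "labels": {
+      "app.kubernetes.io/name": "code-queue-scheduler",
+      "app.kubernetes.io/instance": "code-queue-scheduler-dev",
+      "app.kubernetes.io/part-of": "unidesk",
+      "unidesk/environment": "dev"
+    }
+  }
+}
+```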
+
+Technical decisions:
+
+- First version should default to smoke/dry-run execution unless real task execution is needed immediately.
+- If real task execution is enabled, use a dev-specific queue prefix or dev database and disable production ClaudeQQ notifications by default.
+
+Acceptance criteria:
+
+- Dev Code Queue `/health` returns ok and includes `environment=dev`.
+- Dev scheduler can pick up a dev queued task and move it through to a terminal state.
+- Restarting dev scheduler does not affect production running tasks.
+- Production `code-queue` remains healthy during dev Code Queue rollout.
+
+## Phase 6: Dev Deploy Apply
+
+Purpose: make `deploy/dev:deploy.json` drive the dev environment end to end.
+
+Implementation items:
+
+- Add `deploy apply --env dev`.
+- For each service in the dev manifest:
+  - fetch declared repo and commit
+  - build image on D601 or through the established target-side build path
+  - tag image with environment and commit
+  - apply the dev k8s manifest
+  - wait for rollout
+  - verify live commit from `/health` or Deployment annotation
+- Ensure deployment records include environment, ref, service id, commit id, image tag, namespace, and rollout status.
+- Add `deploy status --env dev` or equivalent drift check.
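+
+A deployment record holding the fields named above might look like this sketch. The key names, the image tag format, and the `rolloutStatus` value are assumptions; the plan only requires that these fields be recorded.
+
+```json
+{
+  "_comment": "Hypothetical record shape; key names, tag format, and status value are assumptions.",
+  "environment": "dev",
+  "ref": "origin/deploy/dev",
+  "serviceId": "backend-core",
+  "commitId": "<commit>",
+  "imageTag": "dev-<commit>",
+  "namespace": "unidesk-dev",
+  "rolloutStatus": "succeeded"
+}
+```
+
+A drift check such as `deploy status --env dev` could then compare `commitId` in the latest record against both the manifest and the live `/health` commit.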
+
+Acceptance criteria:
+
+- Updating `deploy/dev:deploy.json` to a new commit and running `deploy apply --env dev` updates dev backend-core/frontend/code-queue components.
+- Live `/health` commit matches the manifest commit.
+- No production Deployment, Service, Secret, PVC, DB table, or Docker Compose container is mutated by dev deploy.
+
+## Phase 7: Prod Deploy Ref Compatibility
+
+Purpose: let production read desired state from `deploy/prod` while keeping production runtime unchanged.
+
+Implementation items:
+
+- Add `deploy plan --env prod` and `deploy apply --env prod` using `origin/deploy/prod:deploy.json`.
+- Keep production target executors as they are initially:
+  - main server Compose for production backend-core/frontend and direct sidecars
+  - D601 k3s for production Code Queue execution
+- Enforce production command guardrails:
+  - canonical root only
+  - production credentials only on main server
+  - manifest must say `environment=prod`
+  - target namespace and provider identity must match production
+- Branch protection for `deploy/prod` is recommended but can be added after the first version.
+
+Acceptance criteria:
+
+- Production deploy no longer depends on local `deploy.json`.
+- Production deploy reports the exact Git ref and manifest commit used.
+- Production deploy still validates live commit after rollout.
+
+## Phase 8: Operator And LLM Safety
+
+Purpose: reduce environment confusion for LLM agents and humans.
+
+Implementation items:
+
+- Add clear CLI output for every deploy:
+  - environment
+  - ref
+  - namespace
+  - DB fingerprint
+  - provider id
+  - services and commits
+- Add explicit DEV marker in dev frontend.
+- Add hard startup checks:
+  - dev service refuses production DB
+  - dev service refuses production provider id/token
+  - prod service refuses dev namespace/DB
+- Ensure LLM task containers receive dev deploy credentials by default and do not receive prod credentials.
+- Add smoke checks that intentionally try unsafe combinations and verify they fail.
+
+Acceptance criteria:
+
+- Running a dev service with production DB config fails before listening.
+- Running prod deploy from a non-canonical context fails.
+- LLM/Code Queue default environment can deploy dev but cannot deploy prod without the separate production credential path.
+
+## Risks And Mitigations
+
+- Risk: namespace isolation does not isolate node-level CPU, memory, Docker socket, hostPath, or containerd load.
+  - Mitigation: resource requests/limits, separate dev workdirs, no production path mounts, and bounded Code Queue concurrency.
+- Risk: dev Code Queue accidentally writes production task tables.
+  - Mitigation: separate dev DB, startup DB fingerprint checks, and health output showing DB identity.
+- Risk: dev frontend appears to be prod or proxies to prod backend-core.
+  - Mitigation: visible DEV marker, `CORE_INTERNAL_URL` hardwired to dev service, and proxy target health checks.
+- Risk: deploy command accidentally reads local manifest instead of GitHub environment ref.
+  - Mitigation: `--env` mode must read remote ref only and report the ref/blob used.
+- Risk: D601 k3s control plane failure affects both dev and production k3s workloads.
+  - Mitigation: accept this in phase 1; consider a separate physical/node-level dev cluster only after namespace isolation proves insufficient.
+- Risk: branch `deploy/prod` is initially unprotected.
+  - Mitigation: even before branch protection, production deploy should still require canonical main server credentials and should report the ref used for audit.
+
+## Suggested Implementation Order
+
+1. Phase 0 and Phase 1: establish GitHub environment branch desired-state and dry-run planning.
+2. Phase 2 and Phase 3: create dev namespace, dev DB, backend-core-dev, and frontend-dev.
+3. Phase 4 and Phase 5: add dev Code Queue control and execution components.
+4. Phase 6: make `deploy apply --env dev` deploy the full first dev stack by commit id.
+5. Phase 7: migrate production deploy to `deploy/prod`.
+6. Phase 8: harden operator and LLM safety checks.
+
+The first milestone is complete when `deploy apply --env dev` can deploy backend-core, frontend, code-queue-mgr, and Code Queue read/write/scheduler into `unidesk-dev` from commit ids declared in `origin/deploy/dev:deploy.json`, and repeated dev redeploys do not change production main server status or production Code Queue state.