From 236c5c38f6a839060f8d2bcd2dfa74b9c2aa1667 Mon Sep 17 00:00:00 2001
From: Codex <codex@noreply.local>
Date: Sun, 17 May 2026 16:49:18 +0000
Subject: [PATCH] docs: record master workflow and code queue operations

---
 AGENTS.md                       | 3 ++-
 docs/reference/arch.md          | 3 +++
 docs/reference/cli.md           | 2 +-
 docs/reference/codex-deploy.md  | 2 ++
 docs/reference/deployment.md    | 2 +-
 docs/reference/microservices.md | 1 +
 docs/reference/observability.md | 6 ++++++
 7 files changed, 16 insertions(+), 3 deletions(-)

diff --git a/AGENTS.md b/AGENTS.md
index 10a1a4fd..5593d651 100644
--- a/AGENTS.md
+++ b/AGENTS.md
@@ -5,8 +5,9 @@ UniDesk 是一个以主 server 为统一入口的分布式工作平台；本文
 ## Critical Git / Multi-Repo Sync Rule
 
 - UniDesk 同时存在 main server、D601 `~/cq-deploy` 和其他 provider worktree 等多个开发/部署实例；Git remote 是长期 source of truth，本地部署实例只能视为运行副本或缓存。
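+- Before starting any development, documentation, or deployment configuration change, first run `git status` in the current worktree and pull the latest source from the mainline with `git pull --ff-only origin master`; if local parallel changes or remote progress prevent a fast-forward, identify where the divergence comes from and resolve the conflict immediately before continuing.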
 - 任何需要保留的代码、文档或配置变更，在完成必要自测/部署验证后必须立刻按 `git-spec` 提交并 push 到 remote；禁止让未推送的本地修改成为部署真相或后续任务依赖。
-- 提交前必须用 `git status` 和 `git diff` 区分并只提交当前任务相关文件，保留并避开并行任务产生的无关修改；长期规则见 `docs/reference/arch.md`。
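+- Before committing, use `git status` and `git diff` to identify and commit only the files related to the current task, leaving unrelated modifications from parallel tasks untouched; all UniDesk agent changes must be developed on `master` and pushed with `git push origin master`, and creating, switching to, or pushing any other branch is prohibited; see `docs/reference/arch.md` for the long-term rules.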
 
 ## Critical Provider Gateway Upgrade Rule
 
diff --git a/docs/reference/arch.md b/docs/reference/arch.md
index 9c2e0da5..92a00428 100644
--- a/docs/reference/arch.md
+++ b/docs/reference/arch.md
@@ -83,7 +83,10 @@
     - Migration bridges may normalize legacy facts into the authoritative bus or table, but must not become a second source of truth or keep read-time dual-path fallback after the authoritative path is ready.
   - Multi-Repo Deployment Sync
     - The main server repository, D601 deployment tree, provider-local worktrees, and other live copies are working or deployment instances; the Git remote is the long-term project source of truth.
+    - Before any development, documentation, or deployment manifest change, an agent must inspect the current worktree with `git status` and pull the latest source from the only accepted integration branch with `git pull --ff-only origin master`.
+    - If a pull, rebase, commit, or push is blocked by concurrent work, the conflict must be handled immediately in the current worktree by separating the current task's edits from unrelated parallel changes. Do not create a feature branch to postpone the conflict.
     - Any source, document, or persistent configuration change intended to survive the current task must be committed and pushed to the remote promptly after required self-tests or deployment validation, following `git-spec`.
+    - All UniDesk agent changes must be developed on `master` and pushed to `origin master`. Agents must not create, switch to, or push feature/fix branches for UniDesk work.
     - Live deployment should run from a known commit or from a change set that is immediately committed and pushed; local-only hotfixes must not become the implicit dependency for later tasks.
     - Secrets, tokens, generated runtime state, and node-local env files stay outside Git, but their required contract, storage location, and recovery path must be documented so pushing source changes is not blocked by runtime-only data.
   - Critical Task Deployment Principles
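The AGENTS.md and arch.md rules above require separating the current task's files from unrelated parallel edits before committing. The following is a minimal TypeScript sketch of that split, assuming the input is the text printed by `git status --porcelain` (v1 format) and that the caller already knows which paths belong to the task. `partitionPorcelain` and `Partition` are hypothetical names used for illustration only; they are not part of the UniDesk CLI, and quoted paths containing special characters are not handled.

```typescript
// Hypothetical helper: splits `git status --porcelain` (v1) output into
// paths that belong to the current task and paths that must be left alone.
interface Partition {
  stage: string[]; // paths to pass to `git add`
  leave: string[]; // unrelated parallel edits that stay uncommitted
}

function partitionPorcelain(porcelain: string, taskPaths: string[]): Partition {
  const result: Partition = { stage: [], leave: [] };
  const lines = porcelain.split("\n");
  for (let i = 0; i < lines.length; i++) {
    const line = lines[i];
    // Each entry is "XY <path>": two status characters, one space, then the path.
    if (line.length < 4) continue;
    let path = line.slice(3);
    // Renames are printed as "old -> new"; only the new path can be staged.
    const arrow = path.indexOf(" -> ");
    if (arrow !== -1) path = path.slice(arrow + 4);
    if (taskPaths.indexOf(path) !== -1) {
      result.stage.push(path);
    } else {
      result.leave.push(path);
    }
  }
  return result;
}
```

Under these assumptions, the `stage` list is what would be passed to `git add` before the commit and `git push origin master`, while everything in `leave` stays in the worktree for the parallel task that owns it.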
diff --git a/docs/reference/cli.md b/docs/reference/cli.md
index a95de5fc..1c3e92a5 100644
--- a/docs/reference/cli.md
+++ b/docs/reference/cli.md
@@ -23,7 +23,7 @@ UniDesk 的统一 CLI 入口是根目录 `scripts/cli.ts`，运行方式固定
 - `decision upload/list/show/health` 通过 backend-core 用户服务代理访问 D601 k3s Decision Center，用于上传会议记录/决议 Markdown、列出权威记录、查看详情和健康检查；它不得直连 D601 Service、NodePort 或 provider-gateway 业务 HTTP。
 - `deploy check/plan/apply` 从根目录 `deploy.json` 读取服务 repo 与 commit 期望状态，join `config.json` 和现有 manifest 后使用 target-side build 单一路径校验或更新直管服务与 k3s 代管服务；规则见 `docs/reference/deploy.md`。
 - `codex deploy <commitId>` 是 Code Queue 兼容部署入口，会生成临时 desired manifest 并调用 `deploy apply --service code-queue` 的同一条 target-side build、k3s import、rollout 和 live commit 验证路径；详细规则见 `docs/reference/codex-deploy.md`。
-- `codex submit [prompt] [--prompt-file path|--prompt-stdin] [--queue queueId] [--provider-id id] [--cwd path] [--model model] [--reasoning-effort effort] [--execution-mode mode] [--max-attempts N] [--reference-task-id id] [--dry-run]` 通过 backend-core 私有代理向稳定 `code-queue` 用户服务路径提交任务；prompt 必须且只能来自位置参数、文件或 stdin 之一，`--dry-run` 只返回结构化请求与 prompt 预览，不实际入队。backend-core 默认把提交、队列 CRUD、已读状态、历史摘要和轻量 Trace 读取分流到主 server `code-queue-mgr`，由它写入主 PostgreSQL；D601 scheduler 只轮询并执行已入库任务。
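+- `codex submit [prompt] [--prompt-file path|--prompt-stdin] [--queue queueId] [--provider-id id] [--cwd path] [--model model] [--reasoning-effort effort] [--execution-mode mode] [--max-attempts N] [--reference-task-id id] [--dry-run]` submits a task to the stable `code-queue` user-service path through the backend-core private proxy; the prompt must come from exactly one of the positional argument, a file, or stdin, and `--dry-run` only returns the structured request without enqueuing anything. Both the submit confirmation and the dry-run must return the full prompt, its character count, and `truncated=false`; they must not reuse the preview truncation policy of task details, because otherwise a human cannot review and accept a long task prompt. By default backend-core routes submission, queue CRUD, read status, history summaries, and lightweight Trace reads to `code-queue-mgr` on the main server, which writes to the main PostgreSQL; the D601 scheduler only polls and executes tasks that are already stored.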
 - `codex task <taskId>` 通过 Code Queue 私有代理按任务 ID 查询结构化执行摘要；默认只返回有界 prompt/response 预览、执行 Provider、工作目录、最后 assistant message、最近工具调用摘要、attempt、judge、错误、耗时和 trace 翻页提示，适合在新队列任务中引用历史 session 且避免噪声爆炸。该摘要读取默认由主 server `code-queue-mgr` 从 PostgreSQL 返回，不依赖 D601 `code-queue-read` Service 可用。
 - `codex task <taskId> --trace --tail|--from-start|--after-seq N|--before-seq N --limit N` 按页拉取 Code Queue 的逻辑 trace；响应会返回 `nextAfterSeq`、`previousBeforeSeq`、`hasMore`、`hasBefore` 和下一页/上一页命令，默认 `--trace` 取最新一页，需要完整 prompt/最后 response 时加 `--full`。
 - `codex output <taskId> --tail|--from-start|--after-seq N|--before-seq N --limit N [--full-text]` 按原始 output seq 分页读取底层记录；当 trace 行提示 `commandOmittedLines`、`bodyOmittedLines` 或 `rawSeqs` 时，用该命令按 seq 补取完整信息，默认仍有单条文本预览上限，显式 `--full-text` 才返回该页全文。
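The `codex submit` change above distinguishes the submit/dry-run confirmation (full prompt, character count, `truncated=false`) from the bounded preview used by task details. A minimal TypeScript sketch of that contrast follows. Only `truncated` comes from the patch text; the field names `promptChars` and `dryRun`, both function names, and the choice to count characters as UTF-16 code units are assumptions made for illustration.

```typescript
// Hypothetical response shapes; only `truncated` is named in the patch text.
interface SubmitConfirmation {
  prompt: string;      // the complete prompt, never shortened
  promptChars: number; // assumed here to be the UTF-16 length of the prompt
  truncated: false;    // the literal type makes truncation a compile-time error
  dryRun: boolean;
}

interface BoundedPreview {
  text: string;
  truncated: boolean;
}

// Confirmation path: always echoes the whole prompt back.
function buildSubmitConfirmation(prompt: string, dryRun: boolean): SubmitConfirmation {
  return { prompt: prompt, promptChars: prompt.length, truncated: false, dryRun: dryRun };
}

// Task-detail path: keeps at most `maxChars` characters and reports truncation.
function buildBoundedPreview(prompt: string, maxChars: number): BoundedPreview {
  return { text: prompt.slice(0, maxChars), truncated: prompt.length > maxChars };
}
```

Keeping the two builders separate is one way to make sure a later change to the preview limit cannot silently shorten what a reviewer sees at submit time.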
diff --git a/docs/reference/codex-deploy.md b/docs/reference/codex-deploy.md
index 93cc41f2..e32a03c9 100644
--- a/docs/reference/codex-deploy.md
+++ b/docs/reference/codex-deploy.md
@@ -40,6 +40,8 @@ bun scripts/cli.ts microservice health code-queue
 bun scripts/cli.ts microservice proxy code-queue '/api/tasks/overview?limit=5&transcriptLimit=1&compact=1&afterSeq=0&preferId='
 ```
 
+Manual diagnostics against the native k3s on D601 must explicitly use the host kubeconfig: `KUBECONFIG=/etc/rancher/k3s/k3s.yaml kubectl -n unidesk ...`. The default `kubectl` context on D601 may point to Docker Desktop or another local cluster, so it cannot serve as evidence that the UniDesk Code Queue deployment is ready. When querying k3s directly after a deployment, confirm at least that `deployment/code-queue`, `code-queue-read`, `code-queue-write`, `d601-provider-egress-proxy`, and `d601-tcp-egress-gateway` are ready, that `UNIDESK_DEPLOY_REQUESTED_COMMIT`/`CODE_QUEUE_DEPLOY_REQUESTED_COMMIT` in the Pod environment equal the expected commit, and that the scheduler `/health` reports PostgreSQL ready, `storage.lastError=null`, the egress proxy connected, and the MiniMax `NO_PROXY` exception.
+
 ## Boundaries
 
 Code Queue 由 D601 k3s/k8s 控制面代管，不再通过 `server rebuild` 或手工 `docker compose up` 作为正式部署路径。`codex deploy` 可以在 Code Queue 自身正在执行任务时运行；服务重启后由 restart-recovery 恢复任务状态，不能等待当前 Code Queue task 退出后再部署。
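The post-deploy checks added to codex-deploy.md above (required deployments ready, requested-commit environment variables equal to the expected commit) can be sketched as a pure function over the `items` array of `kubectl -n unidesk get deploy -o json`. This is an illustrative TypeScript sketch, not the project's verification code: `findDeploymentProblems` is a hypothetical name, the interface covers only the Kubernetes fields it reads (`spec.replicas`, `status.readyReplicas`, container `env`), and it assumes a deployment that does not carry either commit variable is simply not checked for it.

```typescript
interface EnvVar { name: string; value?: string }
interface Deployment {
  metadata: { name: string };
  spec: { replicas?: number; template: { spec: { containers: { env?: EnvVar[] }[] } } };
  status?: { readyReplicas?: number };
}

// Environment variable names taken from the patch text.
const COMMIT_ENV_NAMES = ["UNIDESK_DEPLOY_REQUESTED_COMMIT", "CODE_QUEUE_DEPLOY_REQUESTED_COMMIT"];

function findByName(items: Deployment[], name: string): Deployment | undefined {
  for (let i = 0; i < items.length; i++) {
    if (items[i].metadata.name === name) return items[i];
  }
  return undefined;
}

// Returns human-readable problems; an empty array means every check passed.
function findDeploymentProblems(items: Deployment[], required: string[], expectedCommit: string): string[] {
  const problems: string[] = [];
  for (let i = 0; i < required.length; i++) {
    const name = required[i];
    const found = findByName(items, name);
    if (!found) {
      problems.push(name + ": missing");
      continue;
    }
    // Kubernetes defaults spec.replicas to 1 when it is omitted.
    const desired = found.spec.replicas === undefined ? 1 : found.spec.replicas;
    const ready = found.status && found.status.readyReplicas ? found.status.readyReplicas : 0;
    if (ready < desired) problems.push(name + ": " + ready + "/" + desired + " ready");
    const containers = found.spec.template.spec.containers;
    for (let c = 0; c < containers.length; c++) {
      const env = containers[c].env || [];
      for (let e = 0; e < env.length; e++) {
        if (COMMIT_ENV_NAMES.indexOf(env[e].name) !== -1 && env[e].value !== expectedCommit) {
          problems.push(name + ": " + env[e].name + "=" + env[e].value + ", expected " + expectedCommit);
        }
      }
    }
  }
  return problems;
}
```

The JSON would have to come from a `kubectl` invocation that sets `KUBECONFIG=/etc/rancher/k3s/k3s.yaml` explicitly; the same function run against the default context would report on the wrong cluster. The scheduler `/health` fields are a separate check and are not covered here.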
diff --git a/docs/reference/deployment.md b/docs/reference/deployment.md
index fc13417c..41b06124 100644
--- a/docs/reference/deployment.md
+++ b/docs/reference/deployment.md
@@ -67,7 +67,7 @@ frontend 的 Docker 上线顺序为：先运行必要的本地校验，例如 `b
 
 主 server `code-queue-mgr` 是低资源控制面，目标常驻内存不超过 100 MB，只允许 PostgreSQL 小连接池、日志和基础 CRUD/摘要逻辑；不得安装或运行 Playwright、Chromium、Codex/OpenCode、Docker socket、dev-container 或执行器。它的 `/health` 必须暴露 `resourceBudget.targetMemoryMb=100`、`noRunnerDependencies=true`、连接池上限和 `role=master-control-plane`，便于在主 server 低内存环境中识别是否越界。
 
-D601 Code Queue 执行面仍必须保持明确的 memory/swap 硬上限，默认 `CODE_QUEUE_MAX_ACTIVE_QUEUES=0` 以恢复 queue 间并行，仍保持 `CODE_QUEUE_IN_MEMORY_OUTPUT_RECORDS=10`、`CODE_QUEUE_IN_MEMORY_EVENT_RECORDS=10` 这类小热窗口；任务历史、队列统计和 Trace/output 读取必须优先从 PostgreSQL 直读或聚合，执行面 `/health` 只做轻量 readiness，不能为了性能便利在 Bun 进程内缓存全量历史。任何提高 Code Queue 热窗口、日志缓冲、Playwright/Codex 子进程常驻规模或容器上限的变更，或把 `CODE_QUEUE_MAX_ACTIVE_QUEUES` 显式改成正数，都必须在同一任务里说明 D601 资源预算来源，并通过 D601 `KUBECONFIG=/home/ubuntu/unidesk-code-queue-deploy/.state/k3s/kubeconfig kubectl -n unidesk get deploy,svc,pod`、`kubectl -n unidesk top pod` 或等价 Docker stats、D601 scheduler health 和对应 E2E 证明未重新引入内存爆炸风险。
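+The D601 Code Queue execution plane must still keep explicit hard memory/swap limits. It defaults to `CODE_QUEUE_MAX_ACTIVE_QUEUES=0` to restore parallelism across queues and keeps small hot windows such as `CODE_QUEUE_IN_MEMORY_OUTPUT_RECORDS=10` and `CODE_QUEUE_IN_MEMORY_EVENT_RECORDS=10`; task history, queue statistics, and Trace/output reads must prefer direct reads or aggregation from PostgreSQL, the execution plane `/health` performs only lightweight readiness checks, and the full history must not be cached inside the Bun process for performance convenience. Any change that raises the Code Queue hot window, log buffers, the resident scale of Playwright/Codex child processes, or container limits, or that explicitly sets `CODE_QUEUE_MAX_ACTIVE_QUEUES` to a positive number, must state the source of the D601 resource budget in the same task and prove through D601 `KUBECONFIG=/etc/rancher/k3s/k3s.yaml kubectl -n unidesk get deploy,svc,pod`, `kubectl -n unidesk top pod` or equivalent Docker stats, D601 scheduler health, and the corresponding E2E that the memory-explosion risk has not been reintroduced. The default `kubectl` context on D601 may not be the UniDesk native k3s and cannot substitute for verification with this kubeconfig.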
 
 ## Database Connection Budget
 
diff --git a/docs/reference/microservices.md b/docs/reference/microservices.md
index c41ffd02..8d39e021 100644
--- a/docs/reference/microservices.md
+++ b/docs/reference/microservices.md
@@ -163,6 +163,7 @@ Baidu Netdisk 在 UniDesk 语境中按纯后端服务管理：不得暴露百度
 - 实例语义：D601 是当前唯一 active 执行节点，`code-queue-scheduler` 以一个 scheduler Pod 承载长生命周期 Codex/OpenCode 子进程并轮询主 PostgreSQL 中由 `code-queue-mgr` 写入的 queued/retry_wait 任务。D518 不属于当前 Code Queue k3s 拓扑；在没有原生 k3s-agent 与稳定 Kubernetes 网络前，不得把 D518 写回 `expectedNodeIds` 或恢复 `code-queue-d518` standby。D601 scheduler 默认关闭 `CODE_QUEUE_STARTUP_OA_BACKFILL_ENABLED`；历史 OA Trace/STEP 回填必须通过显式 `/api/oa/backfill` 运维动作触发，不能在每次 Pod 重启时自动批量发布旧事件。
 - 滚动更新边界：master `code-queue-mgr` 保证 D601 抖动或执行面滚动更新期间普通提交、queue 管理和历史读取仍可用；但当前 D601 scheduler Pod 内仍直接承载正在运行的 agent 子进程，scheduler Pod 被替换时 active task 仍会进入 restart-recovery/retry 语义，不能宣称 running task 零中断。真正的长期目标是继续把调度器和执行器拆开：scheduler 只负责 claim task 并创建 Kubernetes Job/Pod 或独立 worker，runner 把输出、状态、attempt、事件和通知写回 PostgreSQL/OA Event Flow/归档；只有这样 controller/scheduler 滚动更新才不会影响正在执行的任务。
 - Restart recovery：D601 scheduler 启动时必须把没有本地 active run 的 `running`/`judging` 任务恢复为 `retry_wait` 并先写回 PostgreSQL，再开启新一轮 scheduler 轮询；同时必须清理 `queued`/`retry_wait`/terminal 任务残留的 `activeTurnId`，否则 PG 中残留的 running 或旧 turn id 会阻塞队列且不会被执行。health/overview 中的 `activeTaskIds` 只代表当前进程真实持有的 agent run；数据库里仍处于 `running`/`judging` 但没有本地 run 的任务只能作为 scheduler 侧 `orphanedActiveTaskIds` 暴露，不能计入 active run slot。主 server 直管 `code-queue-mgr` 只有 PostgreSQL 视角，不得把数据库中的 `running`/`judging` 误报为真实 active run；只能作为 `databaseActiveTaskIds`/`executionStateSource=postgres-control-plane` 这类控制面状态返回。
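+- Transient dependency recovery: D601 scheduler/read/write reach the main PostgreSQL, OA Event Flow, and model APIs through the provider egress and the TCP gateway, and must treat `CONNECTION_CLOSED`, `CONNECT_TIMEOUT`, stale PostgreSQL clients, transient provider egress failures, and MiniMax judge provider initialization failures as recoverable runtime jitter. The implementation should rotate invalid database clients, retry or degrade judge provider initialization, release the active run slot, and keep scanning subsequent queued/retry_wait tasks; a single closed connection, a single transient judge provider error, or a rolling-update window must not leave the scheduler stalled for long.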
 - 部署引用：Code Queue 镜像仍复用 `src/components/microservices/code-queue/Dockerfile`，Kubernetes 运行清单为 `src/components/microservices/k3sctl-adapter/k3s/code-queue.k8s.yaml`，`config.json` 对外记录 k3s manifest `src/components/microservices/k3sctl-adapter/k3s/code-queue.k3s.json`；主 server 根目录 `docker-compose.yml` 不包含 `code-queue` service，旧 D601 direct Compose 文件只作为迁移/本地诊断参考，不是正式运行入口。
 - 主服务依赖映射：Code Queue 仍以主 PostgreSQL 为权威数据库，但 D601 k3s Pod 不能依赖公网直连 `74.48.78.17:15432/4255`。Pod 内 `DATABASE_URL` 和 `OA_EVENT_FLOW_BASE_URL` 必须指向集群内 `d601-tcp-egress-gateway` Service，再由该 gateway 通过 D601 provider-gateway egress proxy 的 HTTP CONNECT 转发到主 PostgreSQL 和 OA Event Flow；新增 TCP 依赖时扩展 `TCP_EGRESS_ROUTES`，不得在业务容器里新增一次性公网直连或 ad hoc 隧道。D601 active 实例的 `CODE_QUEUE_NOTIFY_CLAUDEQQ_BASE_URL` 必须使用集群内 ClaudeQQ Service `http://claudeqq.unidesk.svc.cluster.local:3290`，并把 `claudeqq`/`claudeqq.unidesk.svc.cluster.local` 加入 `NO_PROXY`，避免任务完成通知被默认出网代理错误转发。旧 `http://host.docker.internal:3290` 只允许作为迁移期诊断，不得作为 Code Queue k3s Pod 的正式通知路径。这些端口映射只服务受控节点运行时，必须用防火墙或等价策略限制来源，不得成为浏览器或任意公网客户端入口。
 - K8s 探针与启动维护：Kubernetes liveness/startup probe 必须使用轻量 `/live`，readiness 和用户服务健康使用 `/health`；`/health` 不得执行全量任务聚合、历史回填或长事务索引维护，历史任务总览应由 `/api/tasks/overview` 读取 PostgreSQL。启动时允许后台执行队列元数据 flush、通知 outbox 读取、任务表索引维护和 overview warmup，但这些维护不得阻塞 Bun server、readiness endpoint 或 frontend overview；通知表索引和大批量 OA backfill 不得作为默认启动副作用。
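The transient dependency recovery rule added above says that a closed connection or connect timeout must release the active run slot and let the scheduler keep scanning later tasks. The sketch below illustrates that control flow in TypeScript under simplifying assumptions: it is synchronous, errors carry a string `code` property, and a non-transient failure moves the task to a `failed` status. The names `isTransient`, `scanOnce`, and the `Task` shape are hypothetical, and client rotation and judge provider retries are not shown.

```typescript
// Error codes named in the patch text; the lookup mechanism is an assumption.
const TRANSIENT_CODES = ["CONNECTION_CLOSED", "CONNECT_TIMEOUT"];

function isTransient(err: unknown): boolean {
  if (typeof err !== "object" || err === null) return false;
  const code = (err as { code?: unknown }).code;
  return typeof code === "string" && TRANSIENT_CODES.indexOf(code) !== -1;
}

type TaskStatus = "queued" | "retry_wait" | "done" | "failed";
interface Task { id: string; status: TaskStatus }

// Runs every runnable task once and returns the number of slots still held,
// which is 0 whenever each slot was released correctly.
function scanOnce(tasks: Task[], runTask: (task: Task) => void): number {
  let activeRuns = 0;
  for (let i = 0; i < tasks.length; i++) {
    const task = tasks[i];
    if (task.status !== "queued" && task.status !== "retry_wait") continue;
    activeRuns++;
    try {
      runTask(task);
      task.status = "done";
    } catch (err) {
      // Transient jitter sends the task back to retry_wait instead of failing it.
      task.status = isTransient(err) ? "retry_wait" : "failed";
    } finally {
      activeRuns--; // the slot is released on success and on every error path
    }
  }
  return activeRuns;
}
```

The point of the `finally` block is that no error path can leave a slot occupied, so one failing task never prevents the loop from reaching the next queued or retry_wait task.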
diff --git a/docs/reference/observability.md b/docs/reference/observability.md
index e2be8653..a9761b9e 100644
--- a/docs/reference/observability.md
+++ b/docs/reference/observability.md
@@ -36,6 +36,12 @@ backend-core 必须提供 `/api/performance`，返回滚动窗口内的 HTTP 组
 
 frontend Bun server 必须提供同源 `/api/frontend-performance`，记录 webui 静态资源、登录/session、API 代理和 frontend->core 代理操作耗时。浏览器中的 `运行总览 / 性能面板` 必须把 frontend 与 backend-core 指标合并展示为 Bwebui 曲线、组件汇总、最近失败请求、内部操作汇总和最近慢操作；完整性能 JSON 只能通过显式 `查看原始JSON` 打开。
 
+## Low-Memory Diagnostics
+
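+The main server is a low-resource, low-jitter control plane, so memory investigations must first distinguish shared memory, container cgroup usage, and process-private usage. The RSS of each PostgreSQL backend process repeatedly counts shared mappings such as `shared_buffers`, so the RSS values of multiple `postgres` processes must not simply be summed and treated as real memory consumption; look first at `docker stats unidesk-database`, cgroup memory, PSS/USS from `/proc/<pid>/smaps_rollup`, the connection count in `pg_stat_activity`, and `shared_buffers`/`work_mem` in `pg_settings`.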
+
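+If the total usage and PSS of the PostgreSQL container are not abnormal, shrinking `shared_buffers` should not be the first response to OOM on the main server. A higher priority is to identify non-core, interactive, and development-type processes, such as web terminals, long-lived agent sessions, one-off log investigations, or large-output CLI runs, and then migrate them to D601, add TTLs or hard limits, or reduce transient memory spikes through the default output limits of `server logs`, `job status`, and `microservice proxy`. Adjust PostgreSQL memory parameters only when the connection pool, real cgroup usage, and slow-query evidence all point to PostgreSQL.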
+
 性能优化必须先用这些指标锁定慢操作名称、路径、耗时和代理层级，再改后端查询或前后端通信策略；不得只凭主观体感改 UI。Code Queue 这类控制面页面出现 `core_proxy`、`GET /api/microservices/code-queue/proxy/api/tasks/overview`、`POST /api/microservices/code-queue/proxy/api/tasks/<id>/read` 等超过 1s 的慢操作时，应保留优化前后的性能面板证据，并同时记录 live API 耗时、容器内存、`/health` 存储摘要和是否仍通过 PostgreSQL/append-only archive 重建历史数据。短 TTL cache、warmup 或页面内存缓存只能作为重复请求抖动保护，性能证据必须证明数据库索引/聚合、分页和渐进式披露本身已把核心路径降到目标内，不能用长缓存遮蔽慢 SQL 或全量 JSON 物化。
 
 当最近失败请求集中出现 frontend `core_proxy` 502/503/504，路径为 `/api/microservices/code-queue/proxy/...` 的 overview、trace 或 summary，且 k3s/k8s Pod 仍在运行时，必须先运行 `bun scripts/cli.ts microservice diagnostics code-queue`，区分 provider-gateway online、WebSocket HTTP tunnel、k3sctl-adapter、Kubernetes API service proxy 和目标 Service 五段状态。provider tunnel 类失败必须记录响应 body/headers 中的 `requestId`、`stage`、`failureReason`、`x-unidesk-request-id` 和 `x-unidesk-tunnel-error`；如需主动验证错误结构，运行 `bun scripts/cli.ts microservice tunnel-self-test code-queue`，该自测应返回预期失败但 `ok=true` 的诊断结果。随后再继续判断“Kubernetes API service proxy 不可达”“Code Queue 进程不可达”和“Code Queue event loop 被热路径同步工作饿死”。如果 `debug health` 或 provider-gateway egress health 显示 `providerGatewayEgressProxyActiveTunnels` 持续偏高、`pendingTunnels` 非零或 `oldestTunnelAgeMs` 长时间增长，应先按 provider-gateway egress tunnel 生命周期排障，确认 `egress_tcp_open`、connect timeout、idle cleanup 与 core socket close 清理是否生效。排障顺序是同时查看 `/api/frontend-performance`、`/api/performance`、`k3sctl-adapter` `/api/control-plane`、Kubernetes Pod `/live`、`/health`、overview/trace-step curl、`kubectl top pod` 或 Docker stats、容器 `RestartCount`/`OOMKilled` 和 Code Queue 日志；如果 Pod 内 `/health` 也超时，应优先检查实时 output 发布、archive 读取、transcript 构建、统计计算、启动维护、历史 OA backfill 和远程 Provider 准备/SSH 子进程是否阻塞 event loop，而不是先调整 frontend 渲染或代理超时。Code Queue 默认不得在启动时自动执行历史 OA backfill 或通知表索引维护；显式 backfill 必须作为运维动作记录，并在运行期间并发证明 `/live`、`/health` 与 `/api/tasks/overview` 仍快速返回。涉及 D601 等远程 Provider 时，还要检查 `runCodeQueueSsh`/开发容器准备是否仍存在同步子进程、无 timeout 的 SSH、无上限 stdout/stderr 或 stale TUN 重建等待；修复后必须在远程准备探针运行期间并发证明 Pod `/health` 与 `/api/tasks/overview` 仍快速返回。
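The Low-Memory Diagnostics section added in this patch warns against summing the RSS of several `postgres` backends, because each RSS value includes the same shared mappings. The TypeScript sketch below shows the difference on the text of `/proc/<pid>/smaps_rollup`: it extracts `Rss` and `Pss`, and derives USS as `Private_Clean + Private_Dirty`. The function names are hypothetical, and reading the files from `/proc` is left to the caller.

```typescript
interface MemoryRollup {
  rssKb: number; // resident set, counts shared pages in full for every process
  pssKb: number; // proportional set, shared pages divided among their users
  ussKb: number; // unique set, pages private to this process
}

// Parses the "Name:   1234 kB" lines of one smaps_rollup file.
function parseSmapsRollup(text: string): MemoryRollup {
  const fields: { [name: string]: number } = {};
  const lines = text.split("\n");
  for (let i = 0; i < lines.length; i++) {
    const match = /^(\w+):\s+(\d+)\s+kB$/.exec(lines[i].trim());
    if (match) fields[match[1]] = parseInt(match[2], 10);
  }
  const get = function (name: string): number { return fields[name] || 0; };
  return {
    rssKb: get("Rss"),
    pssKb: get("Pss"),
    ussKb: get("Private_Clean") + get("Private_Dirty"),
  };
}

// Sums each metric across processes so the RSS overcount becomes visible.
function sumRollups(rollups: MemoryRollup[]): MemoryRollup {
  const total: MemoryRollup = { rssKb: 0, pssKb: 0, ussKb: 0 };
  for (let i = 0; i < rollups.length; i++) {
    total.rssKb += rollups[i].rssKb;
    total.pssKb += rollups[i].pssKb;
    total.ussKb += rollups[i].ussKb;
  }
  return total;
}
```

When several backends map the same `shared_buffers` region, the summed `rssKb` grows with the number of processes while the summed `pssKb` stays close to what the container cgroup actually reports, which is why the section asks for PSS/USS before any PostgreSQL memory parameter is changed.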