diff --git a/docs/reference/platform-infra.md b/docs/reference/platform-infra.md
index e713e93e..4214dd81 100644
--- a/docs/reference/platform-infra.md
+++ b/docs/reference/platform-infra.md
@@ -95,3 +95,37 @@ When an automatic availability probe is added, it should be YAML-controlled and
 4. Optional per-upstream account probes if Sub2API exposes a safe account selection or admin-health mechanism; otherwise document that group-level success does not prove every upstream account is healthy.
 
 Until continuous probing exists, closeout comments must state that validation was on-demand and include the exact CLI/API entrypoints used.
+
+## k3s Network Policy Requirements
+
+G14 k3s runs kube-router as its network policy controller. When any NetworkPolicy resource exists in a namespace, kube-router replaces its default allow-all behavior with explicit iptables/ipset rules that only permit traffic matching declared policies. If a namespace has NetworkPolicy resources but the generated iptables rules miss or incorrectly evaluate a traffic path, pods in that namespace will experience silent connection timeouts (REJECT with `icmp-port-unreachable`) even though `kubectl get networkpolicy` shows the policy and DNS/service resolution works.
+
+The `platform-infra` namespace **must** have a `NetworkPolicy` named `allow-all` (or equivalent) that explicitly permits all ingress and egress within the namespace. Without it, kube-router's default-deny iptables chains block cross-pod traffic, including Sub2API → PostgreSQL and Sub2API → Redis connections, causing Sub2API init containers and background services to hang with `context deadline exceeded` or `no response` errors.
+
+Diagnostic symptoms:
+- Sub2API pod stuck `Init:0/2` with `wait-postgres` logging `sub2api-postgres:5432 - no response` indefinitely
+- `pg_isready` succeeds inside the postgres pod itself but TCP from any other pod times out
+- `kubectl exec` from a different pod or `nc -zv` to the postgres ClusterIP/pod-IP returns `Operation timed out`
+- `iptables -L KUBE-ROUTER-INPUT -n | grep ` shows per-pod FW chains; the chain ends with `REJECT ... mark match ! 0x10000/0x10000`
+
+If kube-router iptables rules become stale after a NetworkPolicy create/update cycle (e.g., ipset references old pod IPs or mark-bit logic fails to match), the fastest recovery is to insert a temporary bypass with `iptables -I FORWARD 1 -s 10.42.0.0/16 -d 10.42.0.0/16 -j ACCEPT`, then recreate the NetworkPolicy or restart kube-router/k3s to force a full iptables sync. After recovery, remove the temporary rule: `iptables -D FORWARD -s 10.42.0.0/16 -d 10.42.0.0/16 -j ACCEPT`.
+
+The manifest for the required `allow-all` policy is:
+```yaml
+apiVersion: networking.k8s.io/v1
+kind: NetworkPolicy
+metadata:
+  name: allow-all
+  namespace: platform-infra
+spec:
+  podSelector: {}
+  policyTypes:
+    - Ingress
+    - Egress
+  ingress:
+    - {}
+  egress:
+    - {}
+```
+
+This policy must be included in the `sub2api plan` / `apply` manifest rendering so that it is created as part of the normal deployment flow, not maintained as a manual one-off.
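
The temporary bypass in the recovery step is easy to leave behind once traffic is flowing again, so it may help to wrap it in a small helper. The following is a minimal sketch and not part of this change: the function names (`bypass_rule_spec`, `has_bypass_rule`, `add_bypass`, `remove_bypass`) and the `POD_CIDR` variable are hypothetical, and it assumes `iptables -S FORWARD` prints the rule in the same `-s ... -d ... -j ACCEPT` form in which it was inserted.

```shell
#!/bin/sh
# Sketch only: helper names and POD_CIDR are illustrative assumptions.
# The default matches the 10.42.0.0/16 range used in the recovery commands.
POD_CIDR="${POD_CIDR:-10.42.0.0/16}"

# Print the rule specification shared by the insert and delete commands.
bypass_rule_spec() {
    printf '%s\n' "-s ${POD_CIDR} -d ${POD_CIDR} -j ACCEPT"
}

# Read `iptables -S FORWARD` output on stdin and succeed if the
# temporary bypass rule is present (fixed-string match, no regex).
has_bypass_rule() {
    grep -qF -- "-A FORWARD $(bypass_rule_spec)"
}

# Insert the bypass at the top of FORWARD unless it is already there.
# Requires root. The unquoted substitution is deliberate: the spec
# must be split into separate iptables arguments.
add_bypass() {
    iptables -S FORWARD | has_bypass_rule || iptables -I FORWARD 1 $(bypass_rule_spec)
}

# Delete every copy of the bypass rule; stop if a delete fails.
remove_bypass() {
    while iptables -S FORWARD | has_bypass_rule; do
        iptables -D FORWARD $(bypass_rule_spec) || return 1
    done
}
```

A typical sequence would be `add_bypass`, then recreating the NetworkPolicy or restarting kube-router/k3s, then `remove_bypass`. Running `iptables -S FORWARD | has_bypass_rule` afterwards gives a quick check that no temporary rule remains.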