# G14 Observability Infrastructure

This document is the long-term specification for cluster-level observability infrastructure on the G14 provider node. Application-specific metrics contracts remain in the owning application repository; this document defines only the shared Kubernetes monitoring control plane, ownership boundary, security posture and validation surface.

## Scope

G14 observability infrastructure is a platform service for native k3s workloads on G14, including HWLAB lanes, AgentRun lanes and future G14-hosted execution services. It is not part of any single HWLAB runtime service matrix and must not be rolled out as a side effect of a `hwlab-v02` application deployment.

The target architecture is Kubernetes-native:

- Prometheus Operator owns Prometheus custom resources and reconciliation.
- Prometheus runs as a cluster infrastructure workload in the existing `devops-infra` namespace unless a future isolation decision explicitly creates a separate monitoring namespace.
- Workload namespaces expose scrape intent through `ServiceMonitor`, `PodMonitor` and `PrometheusRule` objects.
- Metrics are queried through controlled UniDesk CLI or service-proxy surfaces, not by exposing Prometheus directly on public FRP ports.

`devops-infra` is the correct default namespace because G14 already uses it for platform-level Git mirror/relay. Monitoring belongs to the same platform-infra layer, while workload-specific scrape declarations stay in each workload namespace. `hwlab-v02`, `agentrun-v01`, `hwlab-dev` and `hwlab-prod` must not host the shared Prometheus control plane.

## Ownership Boundary

Shared infrastructure:

- Prometheus Operator CRDs and controller.
- Prometheus instance, retention policy, scrape selection policy and query Service.
- Optional Alertmanager, Grafana, `kube-state-metrics` and `node-exporter` when a specific issue or PR defines their scope.
- RBAC that lets Prometheus read only the Kubernetes resources needed for discovery and scrape.

Application-owned declarations:

- `/metrics` endpoint implementation.
- Service labels, named ports and internal scrape paths.
- `ServiceMonitor` / `PodMonitor` objects in the application namespace.
- `PrometheusRule` objects for application SLOs, latency, error-rate and domain-specific health signals.

The shared Prometheus stack may discover application monitors across namespaces by label and namespace selector, but it must not take ownership of application Deployments, Services, Secrets, ConfigMaps or runtime rollout policy.
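The discovery rule above can be illustrated with a minimal TypeScript sketch. It models only `matchLabels`-style selection on both the monitor object and its namespace. The label keys (`example.g14/scrape`, `example.g14/monitored`), the monitor names and the function names are hypothetical; the real selectors are whatever the shared Prometheus resource declares.

```typescript
// Hypothetical types and label keys, for illustration only.
type LabelSet = Record<string, string>;

interface MonitorRef {
  kind: "ServiceMonitor" | "PodMonitor";
  name: string;
  namespace: string;
  labels: LabelSet;
}

// Kubernetes matchLabels semantics: every requested key/value must be present.
// An empty selector therefore matches every object.
function matchesLabels(labels: LabelSet, matchLabels: LabelSet): boolean {
  return Object.keys(matchLabels).every((key) => labels[key] === matchLabels[key]);
}

// A monitor is selected only when its own labels AND its namespace's labels match.
function selectMonitors(
  monitors: MonitorRef[],
  namespaceLabels: Record<string, LabelSet>,
  monitorSelector: LabelSet,
  namespaceSelector: LabelSet,
): MonitorRef[] {
  return monitors.filter(
    (m) =>
      matchesLabels(m.labels, monitorSelector) &&
      matchesLabels(namespaceLabels[m.namespace] ?? {}, namespaceSelector),
  );
}
```

Selection here only decides which monitors are scraped. It grants no ownership of the Deployments, Services or Secrets in the selected namespace, which matches the boundary stated above.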

## GitOps And Control Plane

Monitoring infrastructure must be declared as Git-backed desired state and applied through a controlled UniDesk or G14 GitOps path. A temporary `kubectl apply` may be used only as a `$dad-dev` P2 experiment; it must be followed by a durable source change and GitOps/CLI validation.

Current durable control surface:

- `bun scripts/cli.ts hwlab g14 observability status` reads the G14 monitoring state through the controlled `G14:k3s` route and reports CRDs, Prometheus Operator readiness, Prometheus readiness, selected workload monitors and a bounded `up` query.
- `bun scripts/cli.ts hwlab g14 observability apply --dry-run|--confirm` is the standard write path for the shared stack. It installs Prometheus Operator `v0.91.0`, Prometheus `v3.12.0`, Prometheus RBAC, the `devops-infra` Prometheus instance and the internal query Service.
- `bun scripts/cli.ts hwlab g14 observability query --promql <expr>` is the controlled query path. It uses Kubernetes service proxy to the internal ClusterIP Service and must not expose Prometheus through FRP, NodePort or LoadBalancer.
- Cluster-scoped CRDs and ClusterRole/ClusterRoleBinding resources are owned by the infrastructure path, not by a HWLAB lane Application whose destination is only `hwlab-v02`.
- Runtime workloads in `devops-infra` are labeled with `app.kubernetes.io/part-of=devops-infra` and component labels such as `observability`, `prometheus`, `operator` or `query`.

Future GitOps work may move the same desired state behind a dedicated G14 infrastructure Argo CD Application. Until that exists, the UniDesk CLI source is the stable audited desired-state entry, and direct native `kubectl` remains only an implementation detail inside that CLI.

Do not attach Prometheus Operator CRDs, Prometheus Deployments, Grafana or Alertmanager to `hwlab-g14-v02`. That Argo Application is scoped to the HWLAB v0.2 runtime namespace and must remain a lane-specific application rollout controller.

## Security

Prometheus, Grafana and Alertmanager Services remain `ClusterIP` by default. Public exposure through FRP, NodePort, LoadBalancer or ad hoc port-forward is forbidden unless a separate security review defines authentication, audience, port ownership and rollback.
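As a rough sketch of the `ClusterIP` rule, the following TypeScript check flags any Service manifest whose type is not `ClusterIP`. The `ServiceManifest` shape is a trimmed, hypothetical subset of the Kubernetes Service object, and the function name is an assumption. It covers only the Service type; it cannot detect FRP mappings or ad hoc port-forwards.

```typescript
// Trimmed, hypothetical subset of a Kubernetes Service manifest.
interface ServiceManifest {
  metadata: { name: string; namespace: string };
  // spec.type is optional in Kubernetes and defaults to ClusterIP when omitted.
  spec: { type?: "ClusterIP" | "NodePort" | "LoadBalancer" | "ExternalName" };
}

// Returns "namespace/name" for every Service that is not ClusterIP.
function findNonClusterIpServices(services: ServiceManifest[]): string[] {
  return services
    .filter((s) => (s.spec.type ?? "ClusterIP") !== "ClusterIP")
    .map((s) => `${s.metadata.namespace}/${s.metadata.name}`);
}
```

An empty result is a necessary condition for the default posture, not proof of it; FRP and edge-proxy configuration still need their own checks.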

Metrics must be treated as operational data. They must not expose:

- Secret values, API keys, access tokens or password material.
- User prompt bodies, assistant responses, device output payloads or raw trace content.
- High-cardinality identifiers such as `traceId`, `sessionId`, `conversationId`, `threadId`, `runId`, `commandId`, `jobId`, user IDs or API key IDs as metric labels.

Allowed labels are low-cardinality dimensions such as service, namespace, route template, HTTP method, status class, provider profile, operation family and terminal status. Detailed per-run evidence belongs in trace, inspect, logs or issue comments, not in Prometheus labels.
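The label policy lends itself to a simple lint. The sketch below is one possible form, not an existing tool: it normalizes label keys so that `traceId`, `trace_id` and `TRACE-ID` compare equal, then reports keys that match the identifiers listed above. The spellings `userid` and `apikeyid` are assumed renderings of "user IDs" and "API key IDs", and the function names are hypothetical.

```typescript
// Forbidden identifiers from the list above, lowercased with separators removed.
// "userid" and "apikeyid" are assumed spellings, not names taken from a real metric.
const FORBIDDEN_LABEL_KEYS: readonly string[] = [
  "traceid", "sessionid", "conversationid", "threadid",
  "runid", "commandid", "jobid", "userid", "apikeyid",
];

// Normalize so that camelCase, snake_case and kebab-case variants compare equal.
function normalizeLabelKey(key: string): string {
  return key.toLowerCase().replace(/[^a-z0-9]/g, "");
}

// Returns the original label keys that violate the policy, in input order.
function findForbiddenLabels(labelKeys: string[]): string[] {
  return labelKeys.filter(
    (key) => FORBIDDEN_LABEL_KEYS.indexOf(normalizeLabelKey(key)) !== -1,
  );
}
```

A denylist like this catches only known identifier names. It does not measure cardinality, so series counts observed in Prometheus remain the authoritative signal.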

Prometheus service accounts must not be able to read unrelated application Secrets. The existing `devops-infra` GitHub SSH Secret for Git mirror/relay is unrelated to monitoring and must not be mounted into Prometheus, Grafana, Alertmanager or exporter pods.
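The Secret-access rule can be checked mechanically against RBAC rules. The following sketch assumes a simplified `PolicyRule` shape mirroring the Kubernetes RBAC fields `apiGroups`, `resources`, `verbs` and `nonResourceURLs`; the function name is hypothetical. It treats a wildcard resource or API group as granting Secret access.

```typescript
// Simplified mirror of a Kubernetes RBAC PolicyRule.
interface PolicyRule {
  apiGroups?: string[];
  resources?: string[];
  verbs: string[];
  nonResourceURLs?: string[];
}

// True when any rule covers the core API group ("" or "*") and the
// "secrets" resource (explicitly or through "*") with at least one verb.
function grantsSecretAccess(rules: PolicyRule[]): boolean {
  return rules.some((rule) => {
    const coversCoreGroup = (rule.apiGroups ?? []).some((g) => g === "" || g === "*");
    const coversSecrets = (rule.resources ?? []).some((r) => r === "secrets" || r === "*");
    return coversCoreGroup && coversSecrets && rule.verbs.length > 0;
  });
}
```

A discovery-only rule set, such as `get`/`list`/`watch` on `nodes`, `services`, `endpoints` and `pods` plus `get` on the `/metrics` non-resource URL, passes this check. That rule set is given as a typical example, not as the rules installed on G14.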

## Retention And Capacity

G14 is a single-node native k3s cluster. The first monitoring deployment should stay intentionally small:

- Start with one Prometheus replica unless HA is explicitly required.
- Use a bounded PVC and retention size.
- Prefer 15s or 30s scrape intervals for application metrics.
- Keep default retention short, normally days to a few weeks, which is enough for local troubleshooting; increase storage only when observed usage justifies it.

Capacity planning must be based on actual sample ingestion, series count and PVC usage observed on G14. Broad all-namespace scraping, unbounded labels and public scrape endpoints are regressions.
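For a first-order PVC estimate, the sizing rule of thumb from the Prometheus storage documentation (retention seconds × ingested samples per second × bytes per sample, with roughly 1 to 2 bytes per sample) can be written as a small function. The input values in the usage line are illustrative assumptions, not measurements from G14, and the names are hypothetical.

```typescript
interface CapacityInput {
  activeSeries: number;          // series ingested per scrape cycle
  scrapeIntervalSeconds: number; // e.g. 15 or 30
  retentionDays: number;
  bytesPerSample: number;        // rule of thumb: 1 to 2 bytes after compression
}

// needed bytes = retentionSeconds * (activeSeries / scrapeInterval) * bytesPerSample
function estimateTsdbBytes(input: CapacityInput): number {
  const retentionSeconds = input.retentionDays * 24 * 60 * 60;
  return (input.activeSeries * retentionSeconds * input.bytesPerSample) / input.scrapeIntervalSeconds;
}

// Illustrative inputs only: 10,000 series, 30s interval, 15 days, 2 bytes per sample.
console.log(
  estimateTsdbBytes({ activeSeries: 10000, scrapeIntervalSeconds: 30, retentionDays: 15, bytesPerSample: 2 }),
); // 864000000
```

With these inputs the estimate is 864,000,000 bytes, and halving the interval to 15s doubles it. The estimate excludes WAL and compaction overhead, so it is a starting point to compare against observed PVC usage rather than a replacement for it.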

## Validation

P1 discovery for monitoring work should first collect:

- Existing namespaces and CRDs.
- Existing `devops-infra` workloads and Secrets without printing Secret values.
- Whether `ServiceMonitor`, `PodMonitor`, `Prometheus`, `Alertmanager` and `PrometheusRule` CRDs exist.
- Node capacity, current CPU/memory usage and PVC pressure.

Durable validation after rollout must prove:

- Prometheus Operator CRDs exist and are reconciled.
- The Prometheus workload in `devops-infra` is Ready.
- Prometheus can query its own `up` metric.
- Application `ServiceMonitor` / `PodMonitor` targets are discovered only through approved labels and namespaces.
- No public HWLAB or UniDesk FRP endpoint exposes raw Prometheus, Grafana or application `/metrics`.
- CLI or controlled proxy output can answer a bounded query and report the target namespace/lane/source of the data.

Metrics are supporting observability evidence. They do not replace `$dad-dev` source-level tests, CI/CD provenance or original-entry validation for HWLAB issues.

## Application Closeout Contract

Shared monitoring readiness is necessary but not sufficient for closing an application observability issue. Application closeout must prove both platform scrape readiness and the application-owned health signal.

A durable closeout must include:

- `hwlab g14 observability status` or the equivalent controlled infrastructure status showing CRDs, Prometheus Operator and Prometheus Ready in `devops-infra`.
- Explicit PromQL assertions for the workload namespace, not only the infrastructure status summary. Use `hwlab g14 observability query --promql <expr> --expect-count <N> --expect-value <V>` so the CLI returns `assertion.ok`, actual count, bad values and missing/extra series instead of requiring manual vector inspection.
- For HWLAB v0.2, the current application-owned PromQL checks are `up{namespace="hwlab-v02"}`, `hwlab_service_up{namespace="hwlab-v02"}` and `hwlab_service_health_probe_success{namespace="hwlab-v02"}`. `up=1` proves Prometheus can scrape the sidecar; it does not prove the sidecar can reach the business health endpoint.
- `hwlab g14 observability targets --lane v02` for the high-level target view: discovered service/pod, metrics sidecar readiness and restart count, selected monitor declarations, the latest `up` / `hwlab_service_up` / `hwlab_service_health_probe_success` values, synthetic health/scrape duration summaries and the current CPU/memory resource snapshot from `metrics.k8s.io`.
- `hwlab g14 observability boundary --lane v02` for the namespace and public ingress boundary: the workload namespace may contain only application `ServiceMonitor` / `PodMonitor` / `PrometheusRule` declarations, must not contain shared Prometheus or Alertmanager instances, and `/metrics` on public `19666/19667` must be denied or must return non-Prometheus text.
- `hwlab g14 observability closeout --lane v02` as the standard monitoring closeout summary. It should report semantic fields such as `platformReady`, `scrapeReachable`, `sidecarServing`, `businessHealthProbe`, `resourceSnapshot`, `namespaceControlPlaneBoundary` and `publicMetricsExposure`, plus bounded drill-down evidence and next diagnostic commands on failure. Public `/metrics` denial is represented as `publicMetricsExposure=pass` with `publicMetricsExposureState=denied`.
- CI/CD and GitOps provenance when the workload desired state changed. For HWLAB v0.2 this includes the target source commit, PipelineRun, Argo sync revision and git mirror `pendingFlush=false` / `githubInSync=true`.
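The `--expect-count` / `--expect-value` assertion described in the list above could be evaluated along these lines. This is a sketch, not the CLI's actual implementation: `VectorSample` follows the instant-vector shape returned by the Prometheus HTTP API, while `AssertionResult` and its field names other than `ok` are assumptions, and missing/extra series detection is omitted.

```typescript
// One element of an instant-vector result from the Prometheus HTTP API.
interface VectorSample {
  metric: Record<string, string>;
  value: [number, string]; // [unix timestamp, sample value encoded as a string]
}

// Hypothetical result shape; only the `ok` name is taken from this document.
interface AssertionResult {
  ok: boolean;
  actualCount: number;
  badValues: VectorSample[];
}

// Passes only when the series count matches and every sample equals the expected value.
function assertVector(samples: VectorSample[], expectCount: number, expectValue: number): AssertionResult {
  const badValues = samples.filter((s) => Number(s.value[1]) !== expectValue);
  return {
    ok: samples.length === expectCount && badValues.length === 0,
    actualCount: samples.length,
    badValues,
  };
}
```

Checking count and value together matters: a query that returns fewer series than expected, all with value `1`, would look healthy under a value-only check.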

Issue comments should lead with the semantic conclusion and then list the commands, result counts and target values. A raw metrics dump or a green `status` command alone is not a closeout. When runtime desired state changed, CI/CD provenance still comes from `hwlab g14 control-plane closeout --lane v02 --source-commit <full-sha>` or the equivalent high-level control-plane entry.
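One way to produce the leading semantic conclusion is to reduce the closeout fields to a single line. The sketch below reuses the field names listed in the closeout contract; the `"fail"` state, the output wording and the function name are assumptions, since this document only shows `pass` explicitly.

```typescript
// "pass" appears in this document; "fail" is an assumed counterpart.
type CheckState = "pass" | "fail";

interface CloseoutFields {
  platformReady: CheckState;
  scrapeReachable: CheckState;
  sidecarServing: CheckState;
  businessHealthProbe: CheckState;
  resourceSnapshot: CheckState;
  namespaceControlPlaneBoundary: CheckState;
  publicMetricsExposure: CheckState;
}

// Produces the first line of an issue comment: overall verdict, then failing fields.
function summarizeCloseout(fields: CloseoutFields): string {
  const names = Object.keys(fields) as Array<keyof CloseoutFields>;
  const failed = names.filter((name) => fields[name] === "fail");
  return failed.length === 0
    ? "PASS: all closeout checks passed"
    : "FAIL: " + failed.join(", ");
}
```

The commands, result counts and target values then follow this line as supporting evidence.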

The current HWLAB v0.2 monitoring surface is intentionally split by source. Prometheus provides sidecar availability, business health probe success/status/duration, scrape duration and sidecar uptime; `metrics.k8s.io` provides current pod/container CPU and memory snapshots for the same monitored services. Request throughput, error rate, per-route latency percentiles and business-operation latency are application-owned signals and require HWLAB service instrumentation before Prometheus can answer them.

## Failure Modes

The following regressions are common enough to require explicit checks in future monitoring work:

- Treating Prometheus `up` as application health. `up` only covers scrape availability; application health must be exposed as a separate low-cardinality metric.
- Installing Prometheus Operator, Prometheus, Grafana or Alertmanager in a workload lane namespace such as `hwlab-v02`. The shared stack belongs in `devops-infra` unless a new infrastructure isolation decision supersedes this document.
- Letting workload artifact replacement rewrite infrastructure or helper sidecar images. Application render code must preserve platform sidecar images unless the sidecar itself is part of the service artifact contract.
- Updating a ConfigMap-mounted metrics script without rolling the owning pods. Workload repositories should include a pod-template hash annotation or an equivalent rollout trigger for mounted monitoring code.
- Exposing `/metrics` through public FRP, edge proxy or browser-facing routes while trying to make Prometheus discovery convenient.
- Expanding Prometheus RBAC until it can read unrelated application Secrets. Metrics discovery must remain separate from Secret access.
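The first failure mode in the list above, reading `up` as application health, can be made concrete with a small classifier. It takes the latest `up` value and the latest `hwlab_service_health_probe_success` value for one target. The sketch assumes the probe metric reports `1` on success, and the state names and function name are hypothetical.

```typescript
type TargetHealth = "scrape-unreachable" | "business-unhealthy" | "healthy";

// `up` covers only scrape availability; the probe metric covers business health.
// An undefined value means the series was absent from the query result.
function classifyTarget(up: number | undefined, probeSuccess: number | undefined): TargetHealth {
  if (up !== 1) return "scrape-unreachable";
  if (probeSuccess !== 1) return "business-unhealthy";
  return "healthy";
}
```

A target with `up` equal to 1 and a failed or missing probe value is classified as `business-unhealthy`, which is exactly the case that a check on `up` alone would miss.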