diff --git a/docs/reference/g14-observability-infra.md b/docs/reference/g14-observability-infra.md
index 0662b5a9..ebcd2ab4 100644
--- a/docs/reference/g14-observability-infra.md
+++ b/docs/reference/g14-observability-infra.md
@@ -93,3 +93,29 @@ Durable validation after rollout must prove:
 - CLI or controlled proxy output can answer a bounded query and reports the target namespace/lane/source of the data.
 
 Metrics are supporting observability evidence. They do not replace `$dad-dev` source-level tests, CI/CD provenance or original-entry validation for HWLAB issues.
+
+## Application Closeout Contract
+
+Shared monitoring readiness is necessary but not sufficient for closing an application observability issue. Application closeout must prove both platform scrape readiness and the application-owned health signal.
+
+A durable closeout must include:
+
+- `hwlab g14 observability status` or the equivalent controlled infrastructure status showing CRDs, Prometheus Operator and Prometheus Ready in `devops-infra`.
+- Explicit PromQL queries for the workload namespace, not only the infrastructure status summary. The result count must match the application spec and every terminal health value must be checked.
+- For HWLAB v0.2, the current application-owned PromQL checks are `up{namespace="hwlab-v02"}`, `hwlab_service_up{namespace="hwlab-v02"}` and `hwlab_service_health_probe_success{namespace="hwlab-v02"}`. `up=1` proves Prometheus can scrape the sidecar; it does not prove the sidecar can reach the business health endpoint.
+- A namespace boundary check proving the workload namespace contains application `ServiceMonitor` / `PodMonitor` / `PrometheusRule` declarations only, with no shared Prometheus or Alertmanager control-plane instance.
+- A public ingress negative check proving FRP, edge proxy or public Web/API ports do not expose raw Prometheus text or a Prometheus UI.
+- CI/CD and GitOps provenance when the workload desired state changed. For HWLAB v0.2 this includes the target source commit, PipelineRun, Argo sync revision and git mirror `pendingFlush=false` / `githubInSync=true`.
+
+Issue comments should lead with the semantic conclusion and then list the commands, result counts and target values. A raw metrics dump or a green `status` command alone is not a closeout.
+
+## Failure Modes
+
+The following regressions are common enough to require explicit checks in future monitoring work:
+
+- Treating Prometheus `up` as application health. `up` only covers scrape availability; application health must be exposed as a separate low-cardinality metric.
+- Installing Prometheus Operator, Prometheus, Grafana or Alertmanager in a workload lane namespace such as `hwlab-v02`. The shared stack belongs in `devops-infra` unless a new infrastructure isolation decision supersedes this document.
+- Letting workload artifact replacement rewrite infrastructure or helper sidecar images. Application render code must preserve platform sidecar images unless the sidecar itself is part of the service artifact contract.
+- Updating a ConfigMap-mounted metrics script without rolling the owning pods. Workload repositories should include a pod-template hash annotation or an equivalent rollout trigger for mounted monitoring code.
+- Exposing `/metrics` through public FRP, edge proxy or browser-facing routes while trying to make Prometheus discovery convenient.
+- Expanding Prometheus RBAC until it can read unrelated application Secrets. Metrics discovery must remain separate from Secret access.
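
Reviewer note: the closeout contract's rule that the result count must match the application spec and every terminal health value must be checked could be sketched roughly as below. This is a minimal illustration, not part of the change. It assumes the queries were already run through the Prometheus HTTP API (`/api/v1/query`) and decoded from JSON. The names `CheckResult`, `evaluate_instant_vector` and `closeout_passes` are hypothetical, and the expected series count and sample values are made up for the example.

```python
from dataclasses import dataclass


@dataclass
class CheckResult:
    query: str
    ok: bool
    reason: str


def evaluate_instant_vector(query: str, response: dict, expected_count: int) -> CheckResult:
    """Check one decoded Prometheus instant-query response.

    Passes only if the number of series equals expected_count (assumed to
    come from the application spec) and every sample value equals 1.
    """
    if response.get("status") != "success":
        return CheckResult(query, False, "query did not succeed")
    data = response.get("data", {})
    if data.get("resultType") != "vector":
        return CheckResult(query, False, "expected an instant vector")
    series = data.get("result", [])
    if len(series) != expected_count:
        return CheckResult(query, False, f"expected {expected_count} series, got {len(series)}")
    # Each sample is [timestamp, "value"]; the value arrives as a string.
    unhealthy = [s["metric"] for s in series if float(s["value"][1]) != 1.0]
    if unhealthy:
        return CheckResult(query, False, f"{len(unhealthy)} series not equal to 1")
    return CheckResult(query, True, f"{expected_count} series, all equal to 1")


def closeout_passes(results: list) -> bool:
    """A closeout needs every query to pass; a healthy `up` alone is not enough."""
    return all(r.ok for r in results)


if __name__ == "__main__":
    def vector(values):
        # Synthetic response builder for the demo; the pod names are invented.
        return {
            "status": "success",
            "data": {
                "resultType": "vector",
                "result": [
                    {"metric": {"pod": f"svc-{i}"}, "value": [0, v]}
                    for i, v in enumerate(values)
                ],
            },
        }

    results = [
        evaluate_instant_vector('up{namespace="hwlab-v02"}', vector(["1", "1"]), 2),
        evaluate_instant_vector(
            'hwlab_service_health_probe_success{namespace="hwlab-v02"}',
            vector(["1", "0"]),
            2,
        ),
    ]
    for r in results:
        print(r.query, "PASS" if r.ok else "FAIL", r.reason)
    print("closeout passes:", closeout_passes(results))
```

The demo deliberately pairs a healthy `up` with a failing probe metric, so the overall closeout fails even though scraping works. That is the distinction the first Failure Modes bullet draws.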