# Gerbil Observability Architecture

This document describes the metrics subsystem for Gerbil, explains the design decisions, and shows how to configure each backend.

---

## Architecture Overview

Gerbil's metrics subsystem uses a **pluggable backend** design:

```text
main.go ─── internal/metrics ─── internal/observability ─── backend
            (facade)             (interface)                 Prometheus
                                                             OR OTel/OTLP
                                                             OR Noop (disabled)
```

Application code (main, relay, proxy) calls only the `metrics.Record*` functions in `internal/metrics`. That package delegates to whichever backend was selected at startup via `internal/observability.Backend`.

### Why Prometheus-native and OTel are mutually exclusive

**Exactly one** metrics backend may be active at runtime:

| Mode | What happens |
|------|-------------|
| `prometheus` | Native Prometheus client registers metrics on a dedicated registry and exposes `/metrics`. No OTel SDK is initialised. |
| `otel` | OTel SDK pushes metrics via OTLP/gRPC or OTLP/HTTP to an external collector. No `/metrics` endpoint is exposed. |
| `none` | A safe noop backend is used. All `Record*` calls are discarded. |

Running both simultaneously would mean every metric is recorded twice through two different code paths, with differing semantics (pull vs. push, different naming rules, different cardinality handling). The design enforces a single source of truth.

### Future OTel tracing and logging

The `internal/observability/otel/` package is designed so that tracing and logging support can be added **beside** the existing metrics code without touching the Prometheus-native path:

```text
internal/observability/otel/
  backend.go    ← metrics
  exporter.go   ← OTLP exporter creation
  resource.go   ← OTel resource
  trace.go      ← future: TracerProvider setup
  log.go        ← future: LoggerProvider setup
```

---

## Configuration

### Config precedence

1. CLI flags (highest priority)
2. Environment variables
3. Defaults

### Config struct

```go
type MetricsConfig struct {
	Enabled               bool
	Backend               string // "prometheus" | "otel" | "none"
	Prometheus            PrometheusConfig
	OTel                  OTelConfig
	ServiceName           string
	ServiceVersion        string
	DeploymentEnvironment string
}

type PrometheusConfig struct {
	Path string // default: "/metrics"
}

type OTelConfig struct {
	Protocol       string        // "grpc" (default) or "http"
	Endpoint       string        // default: "localhost:4317"
	Insecure       bool          // default: true
	ExportInterval time.Duration // default: 60s
	Timeout        time.Duration // default: 10s
}
```

### Environment variables

| Variable | Default | Description |
|----------|---------|-------------|
| `METRICS_ENABLED` | `true` | Enable/disable metrics |
| `METRICS_BACKEND` | `prometheus` | Backend: `prometheus`, `otel`, or `none` |
| `METRICS_PATH` | `/metrics` | HTTP path for Prometheus endpoint |
| `OTEL_METRICS_PROTOCOL` | `grpc` | OTLP transport: `grpc` or `http` |
| `OTEL_METRICS_ENDPOINT` | `localhost:4317` | OTLP collector address |
| `OTEL_METRICS_INSECURE` | `true` | Disable TLS for OTLP |
| `OTEL_METRICS_EXPORT_INTERVAL` | `60s` | Push interval (e.g. `10s`, `1m`) |
| `OTEL_METRICS_TIMEOUT` | `10s` | Timeout for OTLP exporter connection setup |
| `DEPLOYMENT_ENVIRONMENT` | _(unset)_ | OTel deployment.environment attribute |

### CLI flags

```text
--metrics-enabled               bool      (default: true)
--metrics-backend               string    (default: prometheus)
--metrics-path                  string    (default: /metrics)
--otel-metrics-protocol         string    (default: grpc)
--otel-metrics-endpoint         string    (default: localhost:4317)
--otel-metrics-insecure         bool      (default: true)
--otel-metrics-export-interval  duration  (default: 60s)
--otel-metrics-timeout          duration  (default: 10s)
```

---

## When to choose each backend

| Criterion | Prometheus | OTel/OTLP |
|-----------|-----------|-----------|
| Existing Prometheus/Grafana stack | ✅ | |
| Pull-based scraping | ✅ | |
| No external collector required | ✅ | |
| Vendor-neutral telemetry | | ✅ |
| Push-based export | | ✅ |
| Grafana Cloud / managed OTLP | | ✅ |
| Future traces + logs via same pipeline | | ✅ |

---

## Enabling Prometheus-native mode

### Environment variables

```bash
METRICS_ENABLED=true
METRICS_BACKEND=prometheus
METRICS_PATH=/metrics
```

### CLI

```bash
./gerbil --metrics-enabled --metrics-backend=prometheus --metrics-path=/metrics \
  --config=/etc/gerbil/config.json
```

The metrics config is supplied separately via env/flags; it is not embedded in the WireGuard config file.

The Prometheus `/metrics` endpoint is registered only when `--metrics-backend=prometheus`. All `gerbil_*` metrics plus Go runtime metrics are available.

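The precedence rule from the Configuration section (CLI flag, then environment variable, then default) can be sketched for a single setting as follows. This is a minimal illustration using only the standard library, not Gerbil's actual loader: the function name `resolveBackend` and the injected `getenv` parameter are assumptions made for the example, and only the `--metrics-backend` flag is registered.

```go
package main

import (
	"flag"
	"fmt"
	"os"
)

// resolveBackend is a hypothetical helper that picks the metrics backend
// using the documented precedence: CLI flag, then env var, then default.
// args excludes the program name; getenv is injected so the function can be
// exercised without touching the real process environment.
func resolveBackend(args []string, getenv func(string) string) (string, error) {
	fs := flag.NewFlagSet("gerbil", flag.ContinueOnError)
	backend := fs.String("metrics-backend", "prometheus", "prometheus | otel | none")
	if err := fs.Parse(args); err != nil {
		return "", err
	}

	// Visit walks only the flags that were explicitly set on the command
	// line, which distinguishes "flag given" from "flag left at default".
	flagGiven := false
	fs.Visit(func(f *flag.Flag) {
		if f.Name == "metrics-backend" {
			flagGiven = true
		}
	})

	value := *backend
	if !flagGiven {
		if env := getenv("METRICS_BACKEND"); env != "" {
			value = env
		}
	}

	// Reject anything outside the three documented modes.
	switch value {
	case "prometheus", "otel", "none":
		return value, nil
	default:
		return "", fmt.Errorf("unknown metrics backend %q", value)
	}
}

func main() {
	// This sketch registers one flag only, so any other flag is an error.
	backend, err := resolveBackend(os.Args[1:], os.Getenv)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(2)
	}
	fmt.Println("metrics backend:", backend)
}
```

With `METRICS_BACKEND=otel` in the environment and `--metrics-backend=none` on the command line, this sketch resolves to `none`, because an explicitly set flag takes priority over the environment.
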

---

## Enabling OTel mode

### Environment variables

```bash
export METRICS_ENABLED=true
export METRICS_BACKEND=otel
export OTEL_METRICS_PROTOCOL=grpc
export OTEL_METRICS_ENDPOINT=otel-collector:4317
export OTEL_METRICS_INSECURE=true
export OTEL_METRICS_EXPORT_INTERVAL=10s
export OTEL_METRICS_TIMEOUT=10s
export DEPLOYMENT_ENVIRONMENT=production
```

### CLI

```bash
./gerbil --metrics-enabled \
  --metrics-backend=otel \
  --otel-metrics-protocol=grpc \
  --otel-metrics-endpoint=otel-collector:4317 \
  --otel-metrics-insecure \
  --otel-metrics-export-interval=10s \
  --otel-metrics-timeout=10s \
  --config=/etc/gerbil/config.json
```

### HTTP mode (OTLP/HTTP)

```bash
export OTEL_METRICS_PROTOCOL=http
export OTEL_METRICS_ENDPOINT=otel-collector:4318
```

---

## Disabling metrics

```bash
export METRICS_ENABLED=false
# or
./gerbil --metrics-enabled=false
# or
./gerbil --metrics-backend=none
```

When disabled, all `Record*` calls are directed to a safe noop backend that discards observations without allocating or locking.

---

## Metric catalog

All metrics use the prefix `gerbil_`.

### WireGuard metrics

| Metric | Type | Labels | Description |
|--------|------|--------|-------------|
| `gerbil_wg_interface_up` | Gauge | `ifname`, `instance` | 1=up, 0=down |
| `gerbil_wg_peers_total` | UpDownCounter | `ifname` | Configured peers |
| `gerbil_wg_peer_connected` | Gauge | `ifname`, `peer` | 1=connected, 0=disconnected |
| `gerbil_wg_bytes_received_total` | Counter | `ifname`, `peer` | Bytes received |
| `gerbil_wg_bytes_transmitted_total` | Counter | `ifname`, `peer` | Bytes transmitted |
| `gerbil_wg_handshakes_total` | Counter | `ifname`, `peer`, `result` | Handshake attempts |
| `gerbil_wg_handshake_latency_seconds` | Histogram | `ifname`, `peer` | Handshake duration |
| `gerbil_wg_peer_rtt_seconds` | Histogram | `ifname`, `peer` | Peer round-trip time |

### Relay metrics

| Metric | Type | Labels |
|--------|------|--------|
| `gerbil_proxy_mapping_active` | UpDownCounter | `ifname` |
| `gerbil_active_sessions` | UpDownCounter | `ifname` |
| `gerbil_udp_packets_total` | Counter | `ifname`, `type`, `direction` |
| `gerbil_hole_punch_events_total` | Counter | `ifname`, `result` |

### SNI proxy metrics

| Metric | Type | Labels |
|--------|------|--------|
| `gerbil_sni_connections_total` | Counter | `result` |
| `gerbil_sni_active_connections` | UpDownCounter | _(none)_ |
| `gerbil_sni_route_cache_hits_total` | Counter | `result` |
| `gerbil_sni_route_api_requests_total` | Counter | `result` |
| `gerbil_proxy_route_lookups_total` | Counter | `result`, `hostname` |

### HTTP metrics

| Metric | Type | Labels |
|--------|------|--------|
| `gerbil_http_requests_total` | Counter | `endpoint`, `method`, `status_code` |
| `gerbil_http_request_duration_seconds` | Histogram | `endpoint`, `method` |

---

## Using Docker Compose

The `docker-compose.metrics.yml` provides a complete observability stack.

**Prometheus mode:**

```bash
METRICS_BACKEND=prometheus docker compose -f docker-compose.metrics.yml up -d
# Scrape at http://localhost:3003/metrics
# Grafana at http://localhost:3000 (admin/admin)
```

**OTel mode:**

```bash
METRICS_BACKEND=otel OTEL_METRICS_ENDPOINT=otel-collector:4317 \
  docker compose -f docker-compose.metrics.yml up -d
```
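
As a closing illustration of the pluggable-backend design from the Architecture Overview, the sketch below shows how a facade function might delegate to a single backend chosen at startup, with a noop backend as the safe default. Everything here is hypothetical: the one-method `Backend` interface, `SetBackend`, `RecordSNIConnection`, and the in-memory backend (a dependency-free stand-in for the Prometheus or OTel implementations) are assumptions made for the example and do not reflect the real method set of `internal/observability.Backend`. Labels are omitted for brevity.

```go
package main

import (
	"fmt"
	"sync"
)

// Backend is a hypothetical, trimmed-down stand-in for the interface in
// internal/observability; the real method set is not shown in this document.
type Backend interface {
	AddCounter(name string, delta float64)
}

// noopBackend discards every observation without allocating or locking.
type noopBackend struct{}

func (noopBackend) AddCounter(string, float64) {}

// memoryBackend keeps counters in a map. It stands in for a real
// Prometheus or OTel backend so the sketch needs no external modules.
type memoryBackend struct {
	mu       sync.Mutex
	counters map[string]float64
}

func (m *memoryBackend) AddCounter(name string, delta float64) {
	m.mu.Lock()
	defer m.mu.Unlock()
	m.counters[name] += delta
}

// active is the single backend in use; it defaults to the noop backend so
// facade calls are always safe, even before configuration has run.
var active Backend = noopBackend{}

// SetBackend installs the backend once at startup (not safe for concurrent
// use with Record* calls). A nil argument falls back to the noop backend.
func SetBackend(b Backend) {
	if b == nil {
		b = noopBackend{}
	}
	active = b
}

// RecordSNIConnection is a hypothetical facade function in the style of
// metrics.Record*; callers never see which backend is active.
func RecordSNIConnection() {
	active.AddCounter("gerbil_sni_connections_total", 1)
}

func main() {
	RecordSNIConnection() // discarded: the noop backend is still active

	mem := &memoryBackend{counters: map[string]float64{}}
	SetBackend(mem)
	RecordSNIConnection()
	RecordSNIConnection()

	fmt.Println(mem.counters["gerbil_sni_connections_total"])
}
```

Because the facade holds exactly one backend, each observation follows a single code path, which is the property the mutual-exclusion rule is meant to guarantee.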