gerbil/docs/observability.md
2026-05-04 00:12:16 +02:00


Gerbil Observability Architecture

This document describes the metrics subsystem for Gerbil, explains the design decisions, and shows how to configure each backend.


Architecture Overview

Gerbil's metrics subsystem uses a pluggable backend design:

main.go  ─── internal/metrics  ─── internal/observability  ─── backend
                 (facade)                 (interface)           Prometheus
                                                           OR   OTel/OTLP
                                                           OR   Noop (disabled)

Application code (main, relay, proxy) calls only the metrics.Record* functions in internal/metrics. That package forwards each call to the backend selected at startup, through the internal/observability.Backend interface.
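The facade pattern can be sketched as follows. All names here are hypothetical: Backend is a trimmed-down stand-in for internal/observability.Backend, whose real method set is not shown in this document. RecordHandshake plays the role of one metrics.Record* function. The facade holds the active backend in an atomic.Value, so a startup call to SetBackend is safe even if other goroutines are already recording.

```go
package main

import "sync/atomic"

// Backend is a hypothetical, trimmed-down stand-in for
// internal/observability.Backend; the real method set may differ.
type Backend interface {
	AddCounter(name string, delta int64, labels map[string]string)
}

// noopBackend is the default until a backend is selected at startup.
type noopBackend struct{}

func (noopBackend) AddCounter(string, int64, map[string]string) {}

// countingBackend is a toy backend that only totals increments.
type countingBackend struct{ n int64 }

func (c *countingBackend) AddCounter(_ string, delta int64, _ map[string]string) {
	atomic.AddInt64(&c.n, delta)
}

// holder keeps the stored concrete type constant, which atomic.Value requires.
type holder struct{ b Backend }

var current atomic.Value

func init() { current.Store(holder{b: noopBackend{}}) }

// SetBackend is called once at startup with the selected backend.
func SetBackend(b Backend) { current.Store(holder{b: b}) }

// RecordHandshake is an example facade function: callers never see which
// backend is active.
func RecordHandshake(ifname, peer, result string) {
	current.Load().(holder).b.AddCounter("gerbil_wg_handshakes_total", 1,
		map[string]string{"ifname": ifname, "peer": peer, "result": result})
}
```

Because callers depend only on the Record* functions, swapping Prometheus for OTel, or for the noop backend, requires no change in relay or proxy code.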

Why Prometheus-native and OTel are mutually exclusive

Exactly one metrics backend may be active at runtime:

| Mode | What happens |
|------|--------------|
| prometheus | The native Prometheus client registers metrics on a dedicated registry and exposes /metrics. No OTel SDK is initialised. |
| otel | The OTel SDK pushes metrics via OTLP/gRPC or OTLP/HTTP to an external collector. No /metrics endpoint is exposed. |
| none | A safe noop backend is used. All Record* calls are discarded. |

Running both simultaneously would mean every metric is recorded twice through two different code paths, with differing semantics (pull vs. push, different naming rules, different cardinality handling). The design enforces a single source of truth.
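The "exactly one backend" rule can be expressed as a single selection function. This is a minimal sketch under assumptions: selectMode and Mode are hypothetical names. Returning an error for an unknown value, rather than silently falling back to a default, is a design choice assumed here.

```go
package main

import "fmt"

// Mode mirrors the three documented values of METRICS_BACKEND / --metrics-backend.
type Mode string

const (
	ModePrometheus Mode = "prometheus"
	ModeOTel       Mode = "otel"
	ModeNone       Mode = "none"
)

// selectMode returns the single active backend mode (hypothetical helper).
// Disabling metrics always wins, so METRICS_ENABLED=false and
// --metrics-backend=none both end up in ModeNone.
func selectMode(enabled bool, backend string) (Mode, error) {
	if !enabled {
		return ModeNone, nil
	}
	switch m := Mode(backend); m {
	case ModePrometheus, ModeOTel, ModeNone:
		return m, nil
	default:
		return "", fmt.Errorf("unknown metrics backend %q (want prometheus, otel or none)", backend)
	}
}
```

The function returns one value, not a set of flags, so no configuration can activate two backends at the same time.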

Future OTel tracing and logging

The internal/observability/otel/ package is designed so that tracing and logging support can be added beside the existing metrics code without touching the Prometheus-native path:

internal/observability/otel/
  backend.go     ← metrics
  exporter.go    ← OTLP exporter creation
  resource.go    ← OTel resource
  trace.go       ← future: TracerProvider setup
  log.go         ← future: LoggerProvider setup

Configuration

Config precedence

  1. CLI flags (highest priority)
  2. Environment variables
  3. Defaults

Config struct

type MetricsConfig struct {
    Enabled               bool
    Backend               string // "prometheus" | "otel" | "none"
    Prometheus            PrometheusConfig
    OTel                  OTelConfig
    ServiceName           string
    ServiceVersion        string
    DeploymentEnvironment string
}

type PrometheusConfig struct {
    Path string // default: "/metrics"
}

type OTelConfig struct {
    Protocol       string        // "grpc" (default) or "http"
    Endpoint       string        // default: "localhost:4317"
    Insecure       bool          // default: true
    ExportInterval time.Duration // default: 60s
    Timeout        time.Duration // default: 10s
}

Environment variables

| Variable | Default | Description |
|----------|---------|-------------|
| METRICS_ENABLED | true | Enable/disable metrics |
| METRICS_BACKEND | prometheus | Backend: prometheus, otel, or none |
| METRICS_PATH | /metrics | HTTP path for the Prometheus endpoint |
| OTEL_METRICS_PROTOCOL | grpc | OTLP transport: grpc or http |
| OTEL_METRICS_ENDPOINT | localhost:4317 | OTLP collector address |
| OTEL_METRICS_INSECURE | true | Disable TLS for OTLP |
| OTEL_METRICS_EXPORT_INTERVAL | 60s | Push interval (e.g. 10s, 1m) |
| OTEL_METRICS_TIMEOUT | 10s | Timeout for OTLP exporter connection setup |
| DEPLOYMENT_ENVIRONMENT | (unset) | OTel deployment.environment attribute |

CLI flags

--metrics-enabled               bool      (default: true)
--metrics-backend               string    (default: prometheus)
--metrics-path                  string    (default: /metrics)
--otel-metrics-protocol         string    (default: grpc)
--otel-metrics-endpoint         string    (default: localhost:4317)
--otel-metrics-insecure         bool      (default: true)
--otel-metrics-export-interval  duration  (default: 60s)
--otel-metrics-timeout          duration  (default: 10s)

When to choose each backend

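| Criterion | Prometheus | OTel/OTLP |
|-----------|:----------:|:---------:|
| Existing Prometheus/Grafana stack | ✓ | |
| Pull-based scraping | ✓ | |
| No external collector required | ✓ | |
| Vendor-neutral telemetry | | ✓ |
| Push-based export | | ✓ |
| Grafana Cloud / managed OTLP | | ✓ |
| Future traces + logs via same pipeline | | ✓ |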

Enabling Prometheus-native mode

Environment variables

export METRICS_ENABLED=true
export METRICS_BACKEND=prometheus
export METRICS_PATH=/metrics

CLI

./gerbil --metrics-enabled --metrics-backend=prometheus --metrics-path=/metrics \
         --config=/etc/gerbil/config.json

Metrics settings come only from environment variables or CLI flags. They are not part of the WireGuard config file passed with --config.

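The Prometheus endpoint is registered only when the backend is prometheus (--metrics-backend=prometheus or METRICS_BACKEND=prometheus). It is served at the configured path (default /metrics) and exposes all gerbil_* metrics plus Go runtime metrics.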


Enabling OTel mode

Environment variables

export METRICS_ENABLED=true
export METRICS_BACKEND=otel
export OTEL_METRICS_PROTOCOL=grpc
export OTEL_METRICS_ENDPOINT=otel-collector:4317
export OTEL_METRICS_INSECURE=true
export OTEL_METRICS_EXPORT_INTERVAL=10s
export OTEL_METRICS_TIMEOUT=10s
export DEPLOYMENT_ENVIRONMENT=production

CLI

./gerbil --metrics-enabled \
         --metrics-backend=otel \
         --otel-metrics-protocol=grpc \
         --otel-metrics-endpoint=otel-collector:4317 \
         --otel-metrics-insecure \
         --otel-metrics-export-interval=10s \
         --otel-metrics-timeout=10s \
         --config=/etc/gerbil/config.json

HTTP mode (OTLP/HTTP)

export OTEL_METRICS_PROTOCOL=http
export OTEL_METRICS_ENDPOINT=otel-collector:4318

Disabling metrics

export METRICS_ENABLED=false
# or
./gerbil --metrics-enabled=false
# or
./gerbil --metrics-backend=none

When disabled, all Record* calls are directed to a safe noop backend that discards observations without allocating or locking.
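The no-allocation claim can be checked with testing.AllocsPerRun, which also works outside _test.go files. In this sketch, Backend and noopBackend are hypothetical, label-free stand-ins for Gerbil's types. A zero-size struct with empty methods has no state to lock. Storing it in an interface does not allocate.

```go
package main

import "testing"

// Backend is a hypothetical, label-free slice of the metrics backend interface.
type Backend interface {
	AddCounter(name string, delta float64)
	ObserveHistogram(name string, value float64)
}

// noopBackend has no fields and empty methods: nothing to allocate or lock.
type noopBackend struct{}

func (noopBackend) AddCounter(string, float64)       {}
func (noopBackend) ObserveHistogram(string, float64) {}

// allocsPerRecord reports average heap allocations per iteration when
// calling through the Backend interface, as a facade would.
func allocsPerRecord(b Backend) float64 {
	return testing.AllocsPerRun(1000, func() {
		b.AddCounter("gerbil_udp_packets_total", 1)
		b.ObserveHistogram("gerbil_wg_peer_rtt_seconds", 0.012)
	})
}
```

One caveat applies if the real signatures carry labels. Passing them as a map literal or a variadic slice through an interface call can force a heap allocation per call, even when the callee ignores them.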


Metric catalog

All metric names start with gerbil_, and most follow the pattern gerbil_<component>_<name>.
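Names that must work on both backends are safest when they stay within the classic Prometheus metric-name pattern [a-zA-Z_:][a-zA-Z0-9_:]*. OTel instrument names also permit characters such as '.' and '-', which a Prometheus exporter would have to rewrite. The sketch below uses metricName, a hypothetical helper, to build and validate names.

```go
package main

import (
	"fmt"
	"regexp"
)

// validName is the classic Prometheus metric-name pattern.
var validName = regexp.MustCompile(`^[a-zA-Z_:][a-zA-Z0-9_:]*$`)

// metricName builds gerbil_<component>_<name> and rejects anything a
// Prometheus registry would refuse (hypothetical helper).
func metricName(component, name string) (string, error) {
	full := "gerbil_" + component + "_" + name
	if !validName.MatchString(full) {
		return "", fmt.Errorf("invalid metric name %q", full)
	}
	return full, nil
}
```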

WireGuard metrics

| Metric | Type | Labels | Description |
|--------|------|--------|-------------|
| gerbil_wg_interface_up | Gauge | ifname, instance | 1 = up, 0 = down |
| gerbil_wg_peers_total | UpDownCounter | ifname | Configured peers |
| gerbil_wg_peer_connected | Gauge | ifname, peer | 1 = connected, 0 = disconnected |
| gerbil_wg_bytes_received_total | Counter | ifname, peer | Bytes received |
| gerbil_wg_bytes_transmitted_total | Counter | ifname, peer | Bytes transmitted |
| gerbil_wg_handshakes_total | Counter | ifname, peer, result | Handshake attempts |
| gerbil_wg_handshake_latency_seconds | Histogram | ifname, peer | Handshake duration |
| gerbil_wg_peer_rtt_seconds | Histogram | ifname, peer | Peer round-trip time |
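WireGuard reports per-peer transfer as running totals (as printed by wg show <ifname> transfer). A Counter, however, must only ever receive non-negative increments. The sketch below converts polled totals into increments. deltaTracker and peerKey are hypothetical; the document does not show how Gerbil polls WireGuard. Adding the full total on first sight and treating a drop as a reset are assumed choices. An OTel observable counter, which reports the cumulative value directly, would be an alternative design.

```go
package main

// peerKey identifies one peer on one interface.
type peerKey struct{ ifname, peer string }

// deltaTracker turns cumulative per-peer byte totals into increments for a
// monotonic Counter. Hypothetical helper, not taken from Gerbil's source.
type deltaTracker struct {
	last map[peerKey]uint64
}

func newDeltaTracker() *deltaTracker {
	return &deltaTracker{last: make(map[peerKey]uint64)}
}

// observe returns the amount to add for this poll. The first sighting adds the
// full total; a total lower than the previous one is treated as a reset
// (interface recreated or peer re-added), so counting restarts from zero.
func (d *deltaTracker) observe(k peerKey, total uint64) uint64 {
	prev, seen := d.last[k]
	d.last[k] = total
	if !seen || total < prev {
		return total
	}
	return total - prev
}
```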

Relay metrics

| Metric | Type | Labels |
|--------|------|--------|
| gerbil_proxy_mapping_active | UpDownCounter | ifname |
| gerbil_active_sessions | UpDownCounter | ifname |
| gerbil_udp_packets_total | Counter | ifname, type, direction |
| gerbil_hole_punch_events_total | Counter | ifname, result |
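UpDownCounters such as gerbil_active_sessions stay correct only if every increment is paired with a decrement. In Go, the usual way to guarantee that is defer. In this sketch, upDown is a hypothetical in-memory stand-in for the backend instrument, and handleSession is a hypothetical wrapper around session handling.

```go
package main

import "sync"

// upDown mimics UpDownCounter semantics (values may rise and fall) per
// ifname. Hypothetical stand-in for the backend's instrument.
type upDown struct {
	mu sync.Mutex
	v  map[string]int64
}

func (u *upDown) Add(ifname string, delta int64) {
	u.mu.Lock()
	defer u.mu.Unlock()
	if u.v == nil {
		u.v = make(map[string]int64)
	}
	u.v[ifname] += delta
}

func (u *upDown) Value(ifname string) int64 {
	u.mu.Lock()
	defer u.mu.Unlock()
	return u.v[ifname]
}

var activeSessions upDown

// handleSession pairs the increment with a deferred decrement, so the count
// returns to its previous value on every exit path, including panics.
func handleSession(ifname string, work func()) {
	activeSessions.Add(ifname, 1)
	defer activeSessions.Add(ifname, -1)
	work()
}
```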

SNI proxy metrics

| Metric | Type | Labels |
|--------|------|--------|
| gerbil_sni_connections_total | Counter | result |
| gerbil_sni_active_connections | UpDownCounter | (none) |
| gerbil_sni_route_cache_hits_total | Counter | result |
| gerbil_sni_route_api_requests_total | Counter | result |
| gerbil_proxy_route_lookups_total | Counter | result, hostname |
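The hostname label on gerbil_proxy_route_lookups_total is client-influenced, so it can produce an unbounded number of series. One common mitigation is to cap the number of distinct values and report the overflow as "other". This is a hypothetical sketch: the document does not say whether Gerbil bounds this label, and hostLabeler is not a Gerbil type.

```go
package main

import "sync"

// hostLabeler bounds the number of distinct hostname label values; hosts
// beyond the limit are reported as "other". Hypothetical helper.
type hostLabeler struct {
	mu    sync.Mutex
	limit int
	seen  map[string]struct{}
}

func newHostLabeler(limit int) *hostLabeler {
	return &hostLabeler{limit: limit, seen: make(map[string]struct{})}
}

// label returns the value to use for the hostname label.
func (h *hostLabeler) label(host string) string {
	h.mu.Lock()
	defer h.mu.Unlock()
	if _, ok := h.seen[host]; ok {
		return host
	}
	if len(h.seen) >= h.limit {
		return "other"
	}
	h.seen[host] = struct{}{}
	return host
}
```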

HTTP metrics

| Metric | Type | Labels |
|--------|------|--------|
| gerbil_http_requests_total | Counter | endpoint, method, status_code |
| gerbil_http_request_duration_seconds | Histogram | endpoint, method |

Using Docker Compose

The docker-compose.metrics.yml file provides a complete observability stack.

Prometheus mode:

METRICS_BACKEND=prometheus docker compose -f docker-compose.metrics.yml up -d
# Scrape at http://localhost:3003/metrics
# Grafana at http://localhost:3000 (admin/admin)

OTel mode:

METRICS_BACKEND=otel OTEL_METRICS_ENDPOINT=otel-collector:4317 \
  docker compose -f docker-compose.metrics.yml up -d