[Feature Request] Implement OpenTelemetry Metrics in Gerbil #6

Open
opened 2025-11-19 07:03:00 -06:00 by GiteaMirror · 0 comments

Originally created by @marcschaeferger on GitHub (Sep 7, 2025).

Add OpenTelemetry-based observability to Gerbil

Reference: https://github.com/fosrl/pangolin/issues/1429

Summary / Goal

Instrument Gerbil with OpenTelemetry Metrics (OTel) following CNCF / industry best practices so that:

  • Metrics are emitted using the OpenTelemetry Go SDK (vendor-neutral API).
  • Metrics are backend-agnostic and exportable to Prometheus‑compatible backends and any OTLP‑supporting system via the OpenTelemetry Collector.
  • Metric names follow semantic conventions and carry base-unit suffixes (_seconds, _bytes).
  • Labels are stable and low-cardinality (e.g., ifname, peer, site_id), avoiding per-request unique values.
  • An out-of-the-box /metrics endpoint (Prometheus exporter) is available for Prometheus scraping, together with an example OTel Collector config for production pipelines.
  • The focus is metrics first; the design should allow adding traces and logs later.
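The naming and label goals above could be checked mechanically. The following is a minimal sketch of such a check, not an official linter: the rule set, the `lintName` helper and the deny-listed label names are assumptions made for illustration.

```go
package main

import (
	"fmt"
	"strings"
)

// Kind is the instrument type of a metric in the catalog.
type Kind int

const (
	Counter Kind = iota
	Gauge
	Histogram
)

// lintName returns the convention problems found for one metric.
// The rules are an illustrative subset chosen for this sketch.
func lintName(name string, kind Kind, labels []string) []string {
	var problems []string
	if !strings.HasPrefix(name, "gerbil_") {
		problems = append(problems, "missing gerbil_ prefix")
	}
	if kind == Counter && !strings.HasSuffix(name, "_total") {
		problems = append(problems, "counter should end in _total")
	}
	if kind == Histogram && !strings.HasSuffix(name, "_seconds") && !strings.HasSuffix(name, "_bytes") {
		problems = append(problems, "histogram should carry a base-unit suffix")
	}
	// Hypothetical deny-list: labels that are unique per request or per
	// packet would create one time series per value.
	for _, l := range labels {
		switch l {
		case "request_id", "timestamp", "src_port":
			problems = append(problems, "high-cardinality label: "+l)
		}
	}
	return problems
}

func main() {
	fmt.Println(lintName("gerbil_wg_handshakes_total", Counter, []string{"ifname", "peer", "result"}))
	fmt.Println(lintName("gerbil_sync_duration", Histogram, []string{"request_id"}))
}
```

A check like this could run as a unit test over the metric catalog so that a new metric with a missing suffix or an unbounded label fails review early.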

Why this is needed

Gerbil manages WireGuard interfaces, keys, and peer state — all of which are critical for connectivity, security and performance. Operator visibility into handshake success/failure, per‑peer traffic, RTT, key rotations, netlink errors and config reloads is essential for:

  • Detecting connectivity regressions and degraded tunnels
  • Tracking authentication / handshake failures and key rotation issues
  • Capacity planning (peer counts, bandwidth)
  • Alerting (interface down, excessive errors, handshake failures)
  • Correlating with Pangolin and other components for end‑to‑end troubleshooting

OpenTelemetry provides a vendor‑neutral way to emit metrics, and the Collector allows flexible export to Prometheus (scrape or remote_write), Grafana Mimir, or other backends.


Recommended Gerbil Metrics

Interface / Peer Metrics

| Metric name | Type | Labels | Description / Units |
|-------------|------|--------|---------------------|
| gerbil_wg_interface_up | Gauge (0/1) | ifname, instance | Interface operational state (1=up, 0=down) |
| gerbil_wg_peers_total | Gauge | ifname | Number of configured peers on interface |
| gerbil_wg_peer_connected | Gauge (0/1) | ifname, peer | Peer connected state (1=connected) |
| gerbil_wg_handshakes_total | Counter | ifname, peer, result | Handshake attempts (result: success/failure) |
| gerbil_wg_handshake_latency_seconds | Histogram | ifname, peer | Handshake latency distribution (seconds) |
| gerbil_wg_peer_rtt_seconds | Histogram | ifname, peer | Observed RTT to peer (seconds) |
| gerbil_wg_bytes_received_total | Counter | ifname, peer | Bytes received from peer |
| gerbil_wg_bytes_transmitted_total | Counter | ifname, peer | Bytes transmitted to peer |
| gerbil_allowed_ips_count | Gauge | ifname, peer | Number of allowed IP entries per peer |
| gerbil_key_rotation_total | Counter | ifname, reason | Key rotation events (manual/auto/expired) |

System Metrics

| Metric name | Type | Labels | Description |
|-------------|------|--------|-------------|
| gerbil_netlink_events_total | Counter | event_type | Netlink events processed (link/addr/rule changes) |
| gerbil_netlink_errors_total | Counter | component, error_type | Netlink or kernel error counts |
| gerbil_sync_duration_seconds | Histogram | component | Duration of reconciliation/sync loops (seconds) |
| gerbil_workqueue_depth | Gauge | queue | Length of internal workqueues |
| gerbil_kernel_module_loads_total | Counter | result | Kernel module load attempts (success/failure) |
| gerbil_firewall_rules_applied_total | Counter | result, chain | IPTables/NFT rules applied count |
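For a duration metric such as gerbil_sync_duration_seconds, the measurement can be wrapped around the sync function so that failed passes are timed too. This is a sketch only: `timeSync` and the `record` callback are hypothetical names, and `record` stands in for whatever histogram call the metrics module ends up exposing.

```go
package main

import (
	"fmt"
	"time"
)

// timeSync runs fn and reports its wall-clock duration in seconds,
// tagged with the component label, whether or not fn fails.
func timeSync(component string, record func(component string, seconds float64), fn func() error) error {
	start := time.Now()
	err := fn()
	record(component, time.Since(start).Seconds())
	return err
}

func main() {
	record := func(component string, seconds float64) {
		fmt.Printf("gerbil_sync_duration_seconds{component=%q} observed %.6f\n", component, seconds)
	}
	_ = timeSync("peers", record, func() error {
		time.Sleep(10 * time.Millisecond) // stand-in for a reconciliation pass
		return nil
	})
}
```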

Operational / Admin / Security

| Metric name | Type | Labels | Description |
|-------------|------|--------|-------------|
| gerbil_config_reloads_total | Counter | result | Config reloads (success/failure) |
| gerbil_restart_count_total | Counter | (none) | Process restarts count |
| gerbil_auth_failures_total | Counter | peer, reason | Auth or peer validation failures |
| gerbil_acl_denied_total | Counter | ifname, peer, policy | Access-control denied events |
| gerbil_certificate_expiry_days | Gauge | cert_name, ifname | Days until certificate expiry (if TLS used) |

Platform / Runtime (recommended alongside OTel)

  • Standard Go runtime/process metrics (goroutines, heap, GC, CPU) should be enabled either via OTel runtime instrumentation or exposed alongside OTel metrics for Prometheus scraping.

Implementation Plan

  1. Dependencies (example packages)

    • Add OpenTelemetry Go modules to go.mod:
      • go.opentelemetry.io/otel
      • go.opentelemetry.io/otel/sdk/metric
      • go.opentelemetry.io/otel/exporters/prometheus
      • go.opentelemetry.io/otel/exporters/otlp/otlpmetric/otlpmetricgrpc (or OTLP/HTTP variant)
      • Optional:
        • go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp (if HTTP APIs are exposed)
        • go.opentelemetry.io/contrib/instrumentation/runtime (for Go runtime metrics)
      • ...
  2. Central metrics module

    • Create internal/metrics/ that:
      • Initializes OTel MeterProvider.
      • Registers Prometheus exporter (when enabled) and exposes the exporter handler on /metrics (or mounts to existing HTTP server route).
      • Optionally registers OTLP exporter when configured via env vars.
      • Defines and pre‑registers all Gerbil metrics instruments:
        • Counters, Histograms, Gauges — with constants for names, descriptions, and label keys.
      • Exposes helper methods:
        • Inc(name string, labels ...attribute.KeyValue)
        • Observe(name string, value float64, labels ...attribute.KeyValue)
        • SetGauge(name string, value float64, labels ...attribute.KeyValue)
      • Provides Shutdown() to flush and close exporters.
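The helper surface described in step 2 can be prototyped without any dependencies to settle its behaviour first. In the sketch below, `KeyValue` is a local stand-in for `attribute.KeyValue`, and `Registry`, `seriesKey` and `Value` are hypothetical. Unlike the signatures listed above, these helpers return an error so that unregistered names are visible. `Observe` is left out for brevity.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
	"sync"
)

// KeyValue is a local stand-in for attribute.KeyValue.
type KeyValue struct{ Key, Value string }

// Registry holds pre-registered metric names and current series values.
type Registry struct {
	mu     sync.Mutex
	known  map[string]bool
	series map[string]float64
}

func NewRegistry(names ...string) *Registry {
	r := &Registry{known: map[string]bool{}, series: map[string]float64{}}
	for _, n := range names {
		r.known[n] = true
	}
	return r
}

// seriesKey builds a stable identity from the name and sorted labels,
// so label order at the call site does not create a new series.
func seriesKey(name string, labels []KeyValue) string {
	parts := make([]string, 0, len(labels))
	for _, l := range labels {
		parts = append(parts, l.Key+"="+l.Value)
	}
	sort.Strings(parts)
	return name + "{" + strings.Join(parts, ",") + "}"
}

func (r *Registry) update(name string, labels []KeyValue, f func(float64) float64) error {
	r.mu.Lock()
	defer r.mu.Unlock()
	if !r.known[name] {
		return fmt.Errorf("metric %q is not pre-registered", name)
	}
	k := seriesKey(name, labels)
	r.series[k] = f(r.series[k])
	return nil
}

// Inc adds 1 to a counter series.
func (r *Registry) Inc(name string, labels ...KeyValue) error {
	return r.update(name, labels, func(v float64) float64 { return v + 1 })
}

// SetGauge overwrites a gauge series with value.
func (r *Registry) SetGauge(name string, value float64, labels ...KeyValue) error {
	return r.update(name, labels, func(float64) float64 { return value })
}

// Value returns the current value of one series (0 if never written).
func (r *Registry) Value(name string, labels ...KeyValue) float64 {
	r.mu.Lock()
	defer r.mu.Unlock()
	return r.series[seriesKey(name, labels)]
}

func main() {
	r := NewRegistry("gerbil_wg_handshakes_total", "gerbil_wg_interface_up")
	_ = r.Inc("gerbil_wg_handshakes_total", KeyValue{"ifname", "wg0"}, KeyValue{"result", "success"})
	_ = r.SetGauge("gerbil_wg_interface_up", 1, KeyValue{"ifname", "wg0"})
	fmt.Println(r.Value("gerbil_wg_interface_up", KeyValue{"ifname", "wg0"}))
	fmt.Println(r.Inc("gerbil_typo_total")) // rejected: not pre-registered
}
```

Rejecting unknown names keeps the exported metric set equal to the documented catalog. In the real module the map lookup would resolve to a pre-created OTel instrument instead of a float.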
  3. Instrumentation approach

    • WireGuard interface management:
      • Gauge for number of interfaces managed.
      • Gauge 0/1 per interface for status (up/down).
      • Counters for RX/TX bytes per interface.
      • Counter for uptime seconds per interface.
    • Peer management:
      • Gauge for peers per interface.
      • Gauge 0/1 for peer connection status.
      • Gauge for seconds since last handshake.
      • Counters for RX/TX bytes per peer.
      • Histogram for handshake latency.
      • Counter for peer connection failures (labelled with reason).
    • Configuration operations:
      • Counters for config reloads, interface add/remove (labelled with result: success/failure).
    • System/runtime:
      • Process uptime counter.
      • Go runtime metrics (goroutines, memory alloc).
    • All metrics should be updated where Gerbil processes WireGuard status or events.
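Two of the peer gauges above, seconds since last handshake and the 0/1 connection status, can both be derived from the last-handshake timestamp. The sketch below assumes a 180-second cut-off for "connected" and uses -1 to mark a peer that has never completed a handshake. Both choices and both function names are assumptions for illustration, not Gerbil behaviour.

```go
package main

import (
	"fmt"
	"time"
)

// connectedThresholdSeconds is an assumed cut-off, not a value from Gerbil.
const connectedThresholdSeconds = 180.0

// handshakeAgeSeconds returns the gauge value "seconds since last handshake".
// A zero lastHandshake means no handshake has completed; -1 marks that case.
func handshakeAgeSeconds(lastHandshake, now time.Time) float64 {
	if lastHandshake.IsZero() {
		return -1
	}
	return now.Sub(lastHandshake).Seconds()
}

// connectedFromAge maps a handshake age onto the 0/1 connection gauge.
func connectedFromAge(ageSeconds float64) int {
	if ageSeconds < 0 || ageSeconds > connectedThresholdSeconds {
		return 0
	}
	return 1
}

func main() {
	now := time.Now()
	samples := []time.Time{now.Add(-30 * time.Second), now.Add(-10 * time.Minute), {}}
	for _, last := range samples {
		age := handshakeAgeSeconds(last, now)
		fmt.Printf("age=%.0fs connected=%d\n", age, connectedFromAge(age))
	}
}
```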
  4. Histograms & buckets

    • Configure histogram buckets per spec:
      • Duration buckets: [0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10, 30]
      • Byte-size buckets: [512, 1024, 4096, 16384, 65536, 262144, 1048576]
    • Use seconds for all durations; bytes for all sizes.
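To make the bucket semantics concrete, the sketch below assigns observations to the duration buckets listed above, treating each boundary as an inclusive upper bound (the same meaning as the Prometheus `le` label). `bucketIndex` and `observe` are hypothetical helpers; the SDK does this internally.

```go
package main

import (
	"fmt"
	"sort"
)

// durationBuckets are the duration boundaries from the plan, in seconds.
var durationBuckets = []float64{0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10, 30}

// bucketIndex returns the index of the first boundary >= v, so upper
// bounds are inclusive. A result of len(bounds) means the value falls
// into the implicit +Inf bucket.
func bucketIndex(bounds []float64, v float64) int {
	return sort.SearchFloat64s(bounds, v)
}

// observe increments the per-bucket count for v.
// counts must have len(bounds)+1 slots; the last one is +Inf.
func observe(bounds []float64, counts []uint64, v float64) {
	counts[bucketIndex(bounds, v)]++
}

func main() {
	counts := make([]uint64, len(durationBuckets)+1)
	for _, v := range []float64{0.003, 0.03, 0.05, 45} {
		observe(durationBuckets, counts, v)
	}
	fmt.Println(counts)
}
```

With 30 seconds as the largest boundary, anything slower lands in +Inf, which is worth keeping in mind when choosing boundaries for sync loops that can stall.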
  5. Exporter configuration (runtime)

    • Environment variables (suggested defaults):
      • GERBIL_METRICS_PROMETHEUS_ENABLED=true
      • GERBIL_METRICS_OTLP_ENABLED=false
      • OTEL_EXPORTER_OTLP_ENDPOINT (when OTLP enabled)
      • OTEL_EXPORTER_OTLP_PROTOCOL (http/protobuf or grpc)
      • OTEL_SERVICE_NAME=gerbil
      • OTEL_RESOURCE_ATTRIBUTES (e.g., service.instance.id=...)
      • OTEL_METRIC_EXPORT_INTERVAL (ms)
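The variables above could be read into a small config struct at startup. The sketch below is one possible shape: `Config` and `loadConfig` are hypothetical, the 60-second fallback follows the OpenTelemetry default for the export interval, and the lookup function is injected so the parsing can be tested without touching the process environment.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"time"
)

// Config is a hypothetical holder for the exporter settings.
type Config struct {
	PrometheusEnabled bool
	OTLPEnabled       bool
	OTLPEndpoint      string
	ExportInterval    time.Duration
}

// loadConfig reads settings through lookup (os.LookupEnv in production)
// and applies the suggested defaults when a variable is unset.
func loadConfig(lookup func(string) (string, bool)) (Config, error) {
	cfg := Config{PrometheusEnabled: true, ExportInterval: 60 * time.Second}
	var err error
	if v, ok := lookup("GERBIL_METRICS_PROMETHEUS_ENABLED"); ok {
		if cfg.PrometheusEnabled, err = strconv.ParseBool(v); err != nil {
			return cfg, fmt.Errorf("GERBIL_METRICS_PROMETHEUS_ENABLED: %w", err)
		}
	}
	if v, ok := lookup("GERBIL_METRICS_OTLP_ENABLED"); ok {
		if cfg.OTLPEnabled, err = strconv.ParseBool(v); err != nil {
			return cfg, fmt.Errorf("GERBIL_METRICS_OTLP_ENABLED: %w", err)
		}
	}
	cfg.OTLPEndpoint, _ = lookup("OTEL_EXPORTER_OTLP_ENDPOINT")
	if v, ok := lookup("OTEL_METRIC_EXPORT_INTERVAL"); ok {
		ms, err := strconv.Atoi(v) // the variable is expressed in milliseconds
		if err != nil || ms <= 0 {
			return cfg, fmt.Errorf("OTEL_METRIC_EXPORT_INTERVAL: want positive milliseconds, got %q", v)
		}
		cfg.ExportInterval = time.Duration(ms) * time.Millisecond
	}
	return cfg, nil
}

func main() {
	cfg, err := loadConfig(os.LookupEnv)
	if err != nil {
		fmt.Println("config error:", err)
		return
	}
	fmt.Printf("%+v\n", cfg)
}
```

The OTEL_* variables are also read by the OpenTelemetry SDK and exporters themselves, so explicit parsing is mainly useful for validation and for logging the effective settings.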
  6. Local testing

    • Provide docker-compose.metrics.yml with:
      • Gerbil
      • OpenTelemetry Collector (example config)
      • Prometheus (scraping /metrics or Collector)
      • Grafana (optional)
    • Validate both direct Prometheus scrape and OTLP → Collector → remote_write flows.
  7. Collector example

    • Include examples/collector.yaml demonstrating:
      • OTLP receiver
      • Transform processor to promote resource attributes (e.g., wg_interface, peer, site_id)
      • Prometheus remote_write exporter (generic endpoint)
      • Notes on:
        • Metric name normalization for Prometheus
        • out_of_order_time_window if sending OTLP to Prometheus
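As background for the note on metric name normalization, the sketch below approximates what happens when an OTel metric name and unit are turned into a Prometheus name: invalid characters become underscores, a unit suffix is appended if the name does not already contain it, and monotonic counters end in _total. It is a simplified illustration only. `promName` is a hypothetical helper, and the real rules depend on the exporter or Collector version and its configuration.

```go
package main

import (
	"fmt"
	"strings"
)

// unitSuffix maps a few OTel unit strings to Prometheus-style suffixes.
var unitSuffix = map[string]string{"s": "seconds", "By": "bytes"}

// hasToken reports whether tok is one of the underscore-separated parts of name.
func hasToken(name, tok string) bool {
	for _, t := range strings.Split(name, "_") {
		if t == tok {
			return true
		}
	}
	return false
}

// promName approximates OTel-to-Prometheus name normalization.
func promName(name, unit string, monotonic bool) string {
	base := strings.Map(func(r rune) rune {
		switch {
		case r >= 'a' && r <= 'z', r >= 'A' && r <= 'Z', r >= '0' && r <= '9', r == '_', r == ':':
			return r
		}
		return '_'
	}, name)
	if monotonic {
		// Remove an existing _total so the unit suffix lands before it.
		base = strings.TrimSuffix(base, "_total")
	}
	if s, ok := unitSuffix[unit]; ok && !hasToken(base, s) {
		base += "_" + s
	}
	if monotonic {
		base += "_total"
	}
	return base
}

func main() {
	fmt.Println(promName("gerbil.wg.handshake.latency", "s", false))
	fmt.Println(promName("gerbil_wg_bytes_received_total", "By", true))
	fmt.Println(promName("gerbil_sync_duration_seconds", "s", false))
}
```

Names that already follow the Prometheus style, like those in the catalog, pass through unchanged under these rules, which is one argument for keeping the catalog names as they are.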
  8. Documentation

    • observability.md:
      • Metric catalog (name, type, labels, units, description)
      • How to enable/disable Prometheus exporter and OTLP exporter via env vars
      • How to run Docker Compose test stack
      • How to add a new metric (naming, labels, buckets)
  9. Testing & validation

    • Manual test: start the Compose stack, generate traffic, curl /metrics, and verify metric names, units, labels and histogram buckets.
    • Include sample /metrics output in the PR.
    • ...
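The manual check of the /metrics output could later be automated. The sketch below scans text in the Prometheus exposition format and reports which expected metric names have no sample line. `missingMetrics` is a hypothetical helper, and the sample text is made up for illustration rather than taken from real Gerbil output.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// missingMetrics returns the expected names that have no sample line in
// the given exposition text. Histogram series (_bucket, _sum, _count)
// are also matched against their base name.
func missingMetrics(exposition string, expected []string) []string {
	seen := map[string]bool{}
	sc := bufio.NewScanner(strings.NewReader(exposition))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if line == "" || strings.HasPrefix(line, "#") {
			continue // skip blank lines and HELP/TYPE comments
		}
		end := strings.IndexAny(line, "{ ")
		if end < 0 {
			end = len(line)
		}
		name := line[:end]
		seen[name] = true
		for _, suffix := range []string{"_bucket", "_sum", "_count"} {
			seen[strings.TrimSuffix(name, suffix)] = true
		}
	}
	var missing []string
	for _, want := range expected {
		if !seen[want] {
			missing = append(missing, want)
		}
	}
	return missing
}

func main() {
	// Made-up sample output, for illustration only.
	sample := `# TYPE gerbil_wg_interface_up gauge
gerbil_wg_interface_up{ifname="wg0"} 1
gerbil_sync_duration_seconds_bucket{component="peers",le="0.005"} 3
`
	expected := []string{"gerbil_wg_interface_up", "gerbil_sync_duration_seconds", "gerbil_config_reloads_total"}
	fmt.Println(missingMetrics(sample, expected))
}
```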

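🔗 References & Best Practices

  • Traefik - Metrics (observability) (https://doc.traefik.io/traefik/reference/install-configuration/observability/metrics/): Traefik metrics configuration and exporter options.
  • OpenTelemetry - Go: Getting Started / Instrumentation Guide (https://opentelemetry.io/docs/languages/go/getting-started/): How to instrument Go applications with OpenTelemetry.
  • OpenTelemetry - Go: Exporters (https://opentelemetry.io/docs/languages/go/exporters/): Exporter options for Go (OTLP, Prometheus, etc.).

Guides & integrations

  • Prometheus - OpenTelemetry guide (https://prometheus.io/docs/guides/opentelemetry/): Guidance for integrating Prometheus with OpenTelemetry.
  • Prometheus blog - Commitment to OpenTelemetry (Mar 2024) (https://prometheus.io/blog/2024/03/14/commitment-to-opentelemetry/): Prometheus project notes and recommended OTLP ingestion patterns.

Practical walkthroughs & blog posts

  • OpenTelemetry blog - Prometheus + OpenTelemetry (2024) (https://opentelemetry.io/blog/2024/prom-and-otel/): Practical notes on combining Prometheus and OpenTelemetry.
  • Grafana Blog - A practical guide to data collection with OpenTelemetry and Prometheus (Jul 2023) (https://grafana.com/blog/2023/07/20/a-practical-guide-to-data-collection-with-opentelemetry-and-prometheus/): Hands-on examples and best practices for OTel + Prometheus.
  • BetterStack - OpenTelemetry for Go (https://betterstack.com/community/guides/observability/opentelemetry-go/): Practical guide for instrumenting Go apps with OpenTelemetry.
  • BetterStack - OpenTelemetry metrics vs Prometheus metrics (https://betterstack.com/community/guides/observability/opentelemetry-metrics-vs-prometheus-metrics/): Comparison and guidance on when to use OTel vs Prometheus metrics.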
