mirror of
https://github.com/fosrl/gerbil.git
synced 2026-03-22 12:54:30 -05:00
[Feature Request] Implement OpenTelemetry Metrics in Gerbil #6
Originally created by @marcschaeferger on GitHub (Sep 7, 2025).
Add OpenTelemetry-based observability to Gerbil
Reference: https://github.com/fosrl/pangolin/issues/1429
Summary / Goal
Instrument Gerbil with OpenTelemetry Metrics (OTel) following CNCF / industry best practices so that:
… ifname, peer, site_id), avoiding per‑request unique values.
Why this is needed
Gerbil manages WireGuard interfaces, keys, and peer state — all of which are critical for connectivity, security and performance. Operator visibility into handshake success/failure, per‑peer traffic, RTT, key rotations, netlink errors and config reloads is essential for:
OpenTelemetry provides a vendor‑neutral way to emit metrics, and the Collector allows flexible export to Prometheus (scrape or remote_write), Grafana Mimir, or other backends.
Recommended Gerbil Metrics
Interface / Peer Metrics
| Metric | Labels |
| --- | --- |
| gerbil_wg_interface_up | ifname, instance |
| gerbil_wg_peers_total | ifname |
| gerbil_wg_peer_connected | ifname, peer |
| gerbil_wg_handshakes_total | ifname, peer, result (success/failure) |
| gerbil_wg_handshake_latency_seconds | ifname, peer |
| gerbil_wg_peer_rtt_seconds | ifname, peer |
| gerbil_wg_bytes_received_total | ifname, peer |
| gerbil_wg_bytes_transmitted_total | ifname, peer |
| gerbil_allowed_ips_count | ifname, peer |
| gerbil_key_rotation_total | ifname, reason |

System Metrics

| Metric | Labels |
| --- | --- |
| gerbil_netlink_events_total | event_type |
| gerbil_netlink_errors_total | component, error_type |
| gerbil_sync_duration_seconds | component |
| gerbil_workqueue_depth | queue |
| gerbil_kernel_module_loads_total | result |
| gerbil_firewall_rules_applied_total | result, chain |

Operational / Admin / Security

| Metric | Labels |
| --- | --- |
| gerbil_config_reloads_total | result |
| gerbil_restart_count_total | (none) |
| gerbil_auth_failures_total | peer, reason |
| gerbil_acl_denied_total | ifname, peer, policy |
| gerbil_certificate_expiry_days | cert_name, ifname |

Platform / Runtime (recommended alongside OTel)
Implementation Plan
Dependencies (example packages)
go.mod:
- go.opentelemetry.io/otel
- go.opentelemetry.io/otel/sdk/metric
- go.opentelemetry.io/otel/exporters/prometheus
- go.opentelemetry.io/otel/exporters/otlp/otlpmetric/otlpmetricgrpc (or OTLP/HTTP variant)
- go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp (if HTTP APIs are exposed)
- go.opentelemetry.io/contrib/instrumentation/runtime (for Go runtime metrics)
Central metrics module
internal/metrics/ that:
- … MeterProvider.
- … /metrics (or mounts to existing HTTP server route).
- Inc(name string, labels ...attribute.KeyValue)
- Observe(name string, value float64, labels ...attribute.KeyValue)
- SetGauge(name string, value float64, labels ...attribute.KeyValue)
- Shutdown() to flush and close exporters.
Instrumentation approach
Histograms & buckets
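To make the explicit bucket boundaries proposed in this section concrete, here is a minimal stdlib-only sketch of how an observation is assigned to a bucket with inclusive upper bounds. It does not use the OpenTelemetry SDK. The names `latencyBounds`, `sizeBounds`, `bucketIndex` and `histogram` are hypothetical, and treating the first list as latency boundaries in seconds and the second as sizes in bytes is an assumption based on the values themselves.

```go
package main

import (
	"fmt"
	"sort"
)

// latencyBounds: boundaries from this issue, assumed to be in seconds.
var latencyBounds = []float64{0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10, 30}

// sizeBounds: boundaries from this issue, assumed to be in bytes.
var sizeBounds = []float64{512, 1024, 4096, 16384, 65536, 262144, 1048576}

// bucketIndex returns the bucket that v falls into, using inclusive upper
// bounds: bucket i holds values with bounds[i-1] < v <= bounds[i].
// A result equal to len(bounds) is the overflow (+Inf) bucket.
func bucketIndex(bounds []float64, v float64) int {
	// SearchFloat64s returns the smallest i with bounds[i] >= v.
	return sort.SearchFloat64s(bounds, v)
}

// histogram is an illustrative in-memory counter per bucket.
type histogram struct {
	bounds []float64
	counts []uint64 // one slot per boundary plus the overflow bucket
}

func newHistogram(bounds []float64) *histogram {
	return &histogram{bounds: bounds, counts: make([]uint64, len(bounds)+1)}
}

// observe increments the count of the bucket that v belongs to.
func (h *histogram) observe(v float64) {
	h.counts[bucketIndex(h.bounds, v)]++
}

func main() {
	h := newHistogram(latencyBounds)
	for _, seconds := range []float64{0.003, 0.03, 0.03, 45} {
		h.observe(seconds)
	}
	fmt.Println(h.counts)
}
```

With these boundaries, handshake latencies above 30 seconds all land in the overflow bucket, so the top boundary should sit above the longest latency worth distinguishing.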
- [0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10, 30]
- [512, 1024, 4096, 16384, 65536, 262144, 1048576]
Exporter configuration (runtime)
- GERBIL_METRICS_PROMETHEUS_ENABLED=true
- GERBIL_METRICS_OTLP_ENABLED=false
- OTEL_EXPORTER_OTLP_ENDPOINT (when OTLP enabled)
- OTEL_EXPORTER_OTLP_PROTOCOL (http/protobuf or grpc)
- OTEL_SERVICE_NAME=gerbil
- OTEL_RESOURCE_ATTRIBUTES (e.g., service.instance.id=...)
- OTEL_METRIC_EXPORT_INTERVAL (ms)
Local testing
docker-compose.metrics.yml with:
- … /metrics or Collector)
Collector example
examples/collector.yaml demonstrating:
- … wg_interface, peer, site_id)
- … out_of_order_time_window if sending OTLP to Prometheus
Documentation
observability.md:
Testing & validation
- … /metrics, verify metric names, units, labels and histogram buckets.
- … /metrics output in the PR.
🔗 References & Best Practices
Guides & integrations
Practical walkthroughs & blog posts