[Feature Request] Implement OpenTelemetry Metrics in Newt #57

Closed
opened 2025-11-19 07:13:00 -06:00 by GiteaMirror · 1 comment

Originally created by @marcschaeferger on GitHub (Sep 7, 2025).

Originally assigned to: @oschwartz10612 on GitHub.

Add OpenTelemetry-based observability to Newt

Reference: https://github.com/fosrl/gerbil/issues/25

Summary / Goal

Instrument Newt with OpenTelemetry Metrics (OTel) following CNCF / industry standards so that:

  • Metrics are emitted using the OpenTelemetry Go SDK (vendor-neutral API).
  • Metrics are backend-agnostic and exportable to Prometheus‑compatible backends and any OTLP‑supporting system via the OpenTelemetry Collector.
  • Semantic conventions, base units (_seconds, _bytes), and low‑cardinality labels are enforced.
  • Focus is metrics first; design should allow adding traces and logs later.
  • Provide an out‑of‑the‑box /metrics endpoint for Prometheus scraping (Prometheus exporter) and example OTel Collector config for production pipelines.

Why OpenTelemetry (OTel)

  • OTel is a CNCF project and the de facto vendor‑neutral standard for multi‑signal observability (metrics, traces, logs).
  • Instrument once, export anywhere (Prometheus, Grafana Mimir, Thanos/Cortex, cloud vendors).
  • OTel Collector enables enrichment, normalization, batching, and flexible export pipelines (OTLP, remote_write).

Requirements & Constraints

  • Use the OpenTelemetry Go SDK (modules) and follow OTel semantic conventions for relevant signals (HTTP, RPC, network).
  • Provide a /metrics endpoint in Prometheus exposition format via the OTel Prometheus exporter.
  • All durations in seconds and sizes in bytes. Metric names should carry units where applicable (_seconds, _bytes) and counters use _total.
  • Labels must be low-cardinality and stable (e.g., site_id, tunnel_id, transport).
  • Exporters configurable at runtime through environment variables (no code change required to switch).
  • Provide an example OTel Collector config demonstrating attribute promotion and remote_write.
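
The naming rules above lend themselves to an automated check, for example in a unit test that walks the metric catalog. The following is a minimal standard-library sketch; the `Kind` type, the `CheckName` function, and the list of rejected unit suffixes are assumptions made for illustration and are not part of Newt or the OpenTelemetry SDK.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// Kind is a hypothetical enum for the instrument types used in this issue.
type Kind int

const (
	Counter Kind = iota
	Gauge
	Histogram
)

// validName matches lowercase snake_case names that carry the newt_ prefix.
var validName = regexp.MustCompile(`^newt_[a-z][a-z0-9_]*$`)

// CheckName returns an error when a metric name breaks the conventions
// listed above: newt_ prefix, snake_case, _total on counters, and base
// units (seconds, bytes) instead of scaled ones.
func CheckName(name string, kind Kind) error {
	if !validName.MatchString(name) {
		return fmt.Errorf("%q: want snake_case with newt_ prefix", name)
	}
	if kind == Counter && !strings.HasSuffix(name, "_total") {
		return fmt.Errorf("%q: counters must end in _total", name)
	}
	// Look at the unit suffix that precedes an optional _total.
	base := strings.TrimSuffix(name, "_total")
	for _, bad := range []string{"_ms", "_millis", "_milliseconds", "_kb", "_mb"} {
		if strings.HasSuffix(base, bad) {
			return fmt.Errorf("%q: use _seconds or _bytes, not %s", name, bad)
		}
	}
	return nil
}

func main() {
	fmt.Println(CheckName("newt_tunnel_bytes_total", Counter))
	fmt.Println(CheckName("newt_tunnel_latency_ms", Histogram))
}
```

The check deliberately does not forbid `_total` on gauges, so it accepts every name in the catalog as proposed.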

Recommended Newt Metrics

| Category | Metric Name | Type | Labels | Units / Notes |
|---|---|---|---|---|
| Site / Registration | newt_site_registrations_total | Counter | site_id, region, result | count |
| | newt_site_online | Gauge | site_id, transport | bool (0/1) |
| | newt_site_last_heartbeat_seconds | Gauge | site_id | seconds since last heartbeat |
| Tunnel / Sessions | newt_tunnel_sessions_total | Gauge | site_id, tunnel_id, transport | active sessions |
| | newt_tunnel_bytes_total | Counter | site_id, tunnel_id, direction | bytes (in/out) |
| | newt_tunnel_latency_seconds | Histogram | site_id, tunnel_id, transport | seconds |
| | newt_tunnel_reconnects_total | Counter | site_id, tunnel_id, reason | count |
| Connection / NAT | newt_connection_attempts_total | Counter | site_id, transport, result | count |
| | newt_connection_errors_total | Counter | site_id, transport, error_type | count |
| | newt_nat_mapping_active | Gauge | site_id, mapping_type | bool/count |
| Peer / Health | newt_peer_heartbeat_latency_seconds | Histogram | site_id, peer_id | seconds |
| | newt_peer_last_handshake_seconds | Gauge | site_id, peer_id | seconds |
| Operational / Ops | newt_config_reloads_total | Counter | result | count |
| | newt_restart_count_total | Counter | | count |
| Runtime | newt_go_goroutines | Gauge | | count |
| | newt_go_mem_alloc_bytes | Gauge | | bytes |

Implementation Plan

  1. Dependencies (example packages)

    • Add OpenTelemetry Go modules to go.mod:
      • go.opentelemetry.io/otel
      • go.opentelemetry.io/otel/sdk/metric
      • go.opentelemetry.io/otel/exporters/prometheus
      • go.opentelemetry.io/otel/exporters/otlp/otlpmetric/otlpmetricgrpc (or OTLP HTTP variant)
      • Optional contrib instrumentation:
        • go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp
        • go.opentelemetry.io/contrib/instrumentation/runtime
      • ...
  2. Central metrics package

    • Create internal/metrics/ that:
      • Initializes OTel MeterProvider.
      • Registers Prometheus exporter (when enabled) and exposes a handler on /metrics (or mounts to existing server route).
      • Optionally registers OTLP exporter when enabled via env vars.
      • Pre-registers all Newt metric instruments with names, descriptions and label keys.
      • Exposes a singleton metrics API with helper functions:
        • Inc(name string, labels ...attribute.KeyValue)
        • Observe(name string, value float64, labels ...attribute.KeyValue)
        • SetGauge(name string, value float64, labels ...attribute.KeyValue)
      • Implements Shutdown(ctx) to flush and stop providers/exporters.
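
To make the shape of the helper API in step 2 concrete, here is a dependency-free sketch of a name-keyed facade. It is only an illustration under stated assumptions: `Label` is a hypothetical stand-in for OTel's `attribute.KeyValue`, `Registry` keeps values in memory where a real implementation would hold instruments created from a `Meter`, and `Observe` and `Shutdown` are left out for brevity.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
	"sync"
)

// Label is a hypothetical stand-in for OTel's attribute.KeyValue so that the
// sketch compiles with the standard library only.
type Label struct{ Key, Value string }

// Registry stores values in memory; a real implementation would delegate to
// pre-registered OTel instruments instead.
type Registry struct {
	mu     sync.Mutex
	known  map[string]bool    // pre-registered metric names
	values map[string]float64 // keyed by name plus sorted labels
}

// NewRegistry pre-registers the given metric names.
func NewRegistry(names ...string) *Registry {
	r := &Registry{known: map[string]bool{}, values: map[string]float64{}}
	for _, n := range names {
		r.known[n] = true
	}
	return r
}

// seriesKey builds a stable key so label order does not create new series.
func seriesKey(name string, labels []Label) string {
	sorted := append([]Label(nil), labels...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i].Key < sorted[j].Key })
	parts := make([]string, 0, len(sorted))
	for _, l := range sorted {
		parts = append(parts, l.Key+"="+l.Value)
	}
	return name + "{" + strings.Join(parts, ",") + "}"
}

// update applies fn to one series and ignores names that were never registered.
func (r *Registry) update(name string, labels []Label, fn func(old float64) float64) {
	r.mu.Lock()
	defer r.mu.Unlock()
	if !r.known[name] {
		return // unknown names are dropped rather than created on the fly
	}
	k := seriesKey(name, labels)
	r.values[k] = fn(r.values[k])
}

// Inc adds one to a counter series.
func (r *Registry) Inc(name string, labels ...Label) {
	r.update(name, labels, func(old float64) float64 { return old + 1 })
}

// SetGauge overwrites a gauge series with v.
func (r *Registry) SetGauge(name string, v float64, labels ...Label) {
	r.update(name, labels, func(float64) float64 { return v })
}

// Value reads a series back; it exists only to make the sketch observable.
func (r *Registry) Value(name string, labels ...Label) float64 {
	r.mu.Lock()
	defer r.mu.Unlock()
	return r.values[seriesKey(name, labels)]
}

func main() {
	m := NewRegistry("newt_site_registrations_total", "newt_site_online")
	site := Label{"site_id", "site-a"}
	m.Inc("newt_site_registrations_total", site, Label{"result", "success"})
	m.SetGauge("newt_site_online", 1, site)
	fmt.Println(m.Value("newt_site_online", site))
}
```

Dropping unregistered names silently is one possible design choice; logging them or returning an error would make typos easier to spot.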
  3. Instrumentation approach

    • Site registration & heartbeats:
      • Increment registration counters and set site_online/site_last_heartbeat.
    • Tunnels & sessions:
      • Update session counts, bytes in/out, latency histograms, reconnect counters.
    • Connection & NAT logic:
      • Record connection attempts, successes/failures, NAT mapping states.
    • Peer health & handshakes:
      • Observe heartbeat latency and last handshake timestamps.
    • Operational flows:
      • Count config reloads and restarts.
    • Runtime metrics:
      • Register basic Go runtime metrics (goroutines, mem) via contrib or runtime package and export them.
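
As a rough illustration of where the values in step 3 come from, the sketch below samples the two runtime gauges from the metric table with the standard `runtime` package and converts an elapsed time to seconds for a latency histogram. The function names `RuntimeGauges` and `SecondsSince` are hypothetical; the contrib runtime instrumentation mentioned above would normally take over the runtime part.

```go
package main

import (
	"fmt"
	"runtime"
	"time"
)

// RuntimeGauges samples the two runtime values named in the metric table.
func RuntimeGauges() map[string]float64 {
	var ms runtime.MemStats
	runtime.ReadMemStats(&ms)
	return map[string]float64{
		"newt_go_goroutines":      float64(runtime.NumGoroutine()),
		"newt_go_mem_alloc_bytes": float64(ms.Alloc),
	}
}

// SecondsSince converts an elapsed time to float seconds, the unit required
// for the *_seconds histograms.
func SecondsSince(start time.Time) float64 {
	return time.Since(start).Seconds()
}

func main() {
	start := time.Now()
	time.Sleep(10 * time.Millisecond) // stands in for a heartbeat round trip
	fmt.Printf("newt_peer_heartbeat_latency_seconds observation: %.3f\n", SecondsSince(start))
	for name, v := range RuntimeGauges() {
		fmt.Printf("%s %.0f\n", name, v)
	}
}
```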
  4. Histograms & buckets

    • Duration buckets: [0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10, 30]
    • Byte-size buckets: [512, 1024, 4096, 16384, 65536, 262144, 1048576]
    • Always use seconds for durations and bytes for sizes.
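
To show how the duration boundaries in step 4 are applied, the sketch below assigns observations to buckets with inclusive upper bounds and then accumulates them the way Prometheus `le` buckets are reported. `BucketIndex` and `Cumulative` are hypothetical helper names; the SDK performs this bookkeeping itself, so the code is only meant for reasoning about bucket choices.

```go
package main

import (
	"fmt"
	"sort"
)

// DurationBuckets are the duration boundaries proposed in step 4, in seconds.
var DurationBuckets = []float64{0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10, 30}

// BucketIndex returns the index of the first bound >= v, i.e. the bucket
// whose inclusive upper bound covers v. A result of len(bounds) means +Inf.
func BucketIndex(bounds []float64, v float64) int {
	return sort.SearchFloat64s(bounds, v)
}

// Cumulative counts observations per upper bound, Prometheus style. The
// final element is the +Inf bucket and equals the total observation count.
func Cumulative(bounds []float64, observations []float64) []uint64 {
	counts := make([]uint64, len(bounds)+1)
	for _, v := range observations {
		counts[BucketIndex(bounds, v)]++
	}
	for i := 1; i < len(counts); i++ {
		counts[i] += counts[i-1]
	}
	return counts
}

func main() {
	// 45 s exceeds the largest bound and lands only in the +Inf bucket.
	obs := []float64{0.003, 0.03, 0.2, 45}
	fmt.Println(Cumulative(DurationBuckets, obs))
}
```

With a largest bound of 30 seconds, anything slower is only visible in the +Inf bucket, which is worth keeping in mind if reconnects can take longer than that.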
  5. Exporter configuration (runtime)

    • Environment variables (suggested defaults):
      • NEWT_METRICS_PROMETHEUS_ENABLED=true
      • NEWT_METRICS_OTLP_ENABLED=false
      • OTEL_EXPORTER_OTLP_ENDPOINT (when OTLP enabled)
      • OTEL_EXPORTER_OTLP_PROTOCOL (http/protobuf or grpc)
      • OTEL_SERVICE_NAME=newt
      • OTEL_RESOURCE_ATTRIBUTES (e.g., service.instance.id=...)
      • OTEL_METRIC_EXPORT_INTERVAL (ms)
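
One way to read the variables in step 5 at startup is sketched below. The `ExporterConfig` struct and `LoadConfig` function are hypothetical names, and the fallbacks for the protocol (`grpc`) and the export interval (60 s) are assumptions of this sketch rather than values stated in the issue. The lookup function is injected so the parsing can be tested without touching the process environment.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"time"
)

// ExporterConfig mirrors the environment variables listed in step 5.
type ExporterConfig struct {
	PrometheusEnabled bool
	OTLPEnabled       bool
	OTLPEndpoint      string
	OTLPProtocol      string
	ServiceName       string
	ExportInterval    time.Duration
}

// LoadConfig reads settings through lookup (os.LookupEnv in production).
// Missing or unparsable values fall back to the defaults.
func LoadConfig(lookup func(string) (string, bool)) ExporterConfig {
	boolVar := func(key string, def bool) bool {
		if raw, ok := lookup(key); ok {
			if v, err := strconv.ParseBool(raw); err == nil {
				return v
			}
		}
		return def
	}
	strVar := func(key, def string) string {
		if raw, ok := lookup(key); ok && raw != "" {
			return raw
		}
		return def
	}
	cfg := ExporterConfig{
		PrometheusEnabled: boolVar("NEWT_METRICS_PROMETHEUS_ENABLED", true),
		OTLPEnabled:       boolVar("NEWT_METRICS_OTLP_ENABLED", false),
		OTLPEndpoint:      strVar("OTEL_EXPORTER_OTLP_ENDPOINT", ""),
		OTLPProtocol:      strVar("OTEL_EXPORTER_OTLP_PROTOCOL", "grpc"), // assumed fallback
		ServiceName:       strVar("OTEL_SERVICE_NAME", "newt"),
		ExportInterval:    60 * time.Second, // assumed fallback
	}
	// OTEL_METRIC_EXPORT_INTERVAL is given in milliseconds.
	if raw, ok := lookup("OTEL_METRIC_EXPORT_INTERVAL"); ok {
		if ms, err := strconv.Atoi(raw); err == nil && ms > 0 {
			cfg.ExportInterval = time.Duration(ms) * time.Millisecond
		}
	}
	return cfg
}

func main() {
	fmt.Printf("%+v\n", LoadConfig(os.LookupEnv))
}
```

Falling back to defaults on malformed input keeps startup robust; failing fast with an error would be an equally defensible choice.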
  6. Local testing

    • Provide docker-compose.metrics.yml with:
      • Newt (local build)
      • OpenTelemetry Collector (example config)
      • Prometheus (scraping /metrics or scraping Collector)
      • Grafana (optional)
    • Validate direct Prometheus scrape and OTLP → Collector → remote_write flows.
  7. Collector example

    • Include examples/collector.yaml demonstrating:
      • OTLP receiver
      • Transform processor to promote resource attributes (e.g., wg_interface, peer, site_id)
      • Prometheus remote_write exporter (generic endpoint)
      • Notes on:
        • Metric name normalization for Prometheus
        • out_of_order_time_window if sending OTLP to Prometheus
  8. Documentation

    • docs/observability.md:
      • Metric catalog (name, type, labels, units, description)
      • How to enable/disable Prometheus exporter and OTLP exporter via env vars
      • How to run Docker Compose test stack
      • How to add a new metric (naming, labels, buckets)
  9. Testing & validation

    • Manual test: start compose, generate traffic, curl /metrics, and verify metric names, units, labels, and histogram buckets.
    • Include sample /metrics output in the PR.
    • ...
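
The manual check in step 9 could be partly automated by scanning the scraped text for the expected sample names. The sketch below is one possible approach using only the standard library; `SampleNames` and `Missing` are hypothetical helpers, the parsing covers only the simple `name{labels} value` line shape, and the sample text in `main` is invented for illustration rather than taken from a real Newt scrape.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// SampleNames returns the distinct sample names found in Prometheus text
// exposition output, skipping comment lines (# HELP, # TYPE) and blanks.
func SampleNames(exposition string) []string {
	seen := map[string]bool{}
	var names []string
	sc := bufio.NewScanner(strings.NewReader(exposition))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if line == "" || strings.HasPrefix(line, "#") {
			continue
		}
		// The name ends at the first '{' (labels) or space (value).
		end := strings.IndexAny(line, "{ ")
		if end < 0 {
			continue
		}
		if name := line[:end]; !seen[name] {
			seen[name] = true
			names = append(names, name)
		}
	}
	return names
}

// Missing lists the expected names that do not appear in the scrape.
func Missing(exposition string, expected ...string) []string {
	have := map[string]bool{}
	for _, n := range SampleNames(exposition) {
		have[n] = true
	}
	var missing []string
	for _, want := range expected {
		if !have[want] {
			missing = append(missing, want)
		}
	}
	return missing
}

// sample is invented output used only to exercise the helpers.
const sample = `# TYPE newt_site_online gauge
newt_site_online{site_id="site-a",transport="udp"} 1
# TYPE newt_restart_count_total counter
newt_restart_count_total 2
`

func main() {
	fmt.Println(SampleNames(sample))
	fmt.Println(Missing(sample, "newt_site_online", "newt_tunnel_bytes_total"))
}
```

Histogram series appear under their `_bucket`, `_sum`, and `_count` sample names, so a check for a histogram would look for those suffixed names.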

Acceptance Criteria

  • /metrics endpoint exposes OTel metrics in Prometheus format with correct naming and units.
  • Newt metrics cover site registration/heartbeats, tunnel sessions/throughput/latency, connections/NAT, peer health, certificates and operational events.
  • Exporter backends can be swapped via environment variables without code changes.
  • Example OTel Collector config provided and tested in local compose flow.
  • docs/observability.md added with metric catalog and run instructions.


@marcschaeferger commented on GitHub (Nov 8, 2025):

Done in PR https://github.com/fosrl/newt/pull/162

Reference: github-starred/newt#57