[GH-ISSUE #11969] Feat Request: GC Pressure & metrics #70007

Closed
opened 2026-05-04 20:02:54 -05:00 by GiteaMirror · 3 comments
Owner

Originally created by @ItsMeForLua on GitHub (Aug 19, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/11969

Since Ollama is written in Go and uses garbage collection, I would REALLY love to be able to see metrics such as GC pressure out of the box rather than having to compile from source.


Adding a flag or metric to measure and monitor GC (Garbage Collection) pressure in Ollama could be beneficial, especially for performance tuning, debugging memory behavior, and optimizing long-running or high-throughput inference.
However, the practical benefit depends on the use case, so the question arises: "Is it worth allocating time and energy to this feature?" I can't answer that for y'all; all I know is that it would be beneficial for me, at least. But I will attempt to go over the potential public-facing benefits:


Since Ollama is written in Go, which uses a concurrent, tri-color mark-and-sweep garbage collector, GC activity can impact:

  • Latency: Brief pauses during GC cycles (though admittedly usually sub-millisecond).
  • Memory usage: GC delay can cause heap growth if allocations outpace collection.
  • Performance consistency: Under heavy load (e.g., multiple concurrent model loads/inferences), GC can contribute to jitter.

Go does provide some built-in tools (GODEBUG=gctrace=1, pprof, etc.), but these are developer-focused and not user-friendly, and I presumably would have to compile from source.
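
For context, the raw numbers already exist inside the Go runtime; surfacing them just takes a bit of Go code. Here's a generic sketch using the standard runtime/metrics package (illustrative only, not actual Ollama code):

```go
// Generic sketch, not Ollama code: reading GC-related figures that the Go
// runtime already tracks, via the standard runtime/metrics package.
package main

import (
	"fmt"
	"runtime/metrics"
)

func main() {
	samples := []metrics.Sample{
		{Name: "/gc/cycles/total:gc-cycles"},         // completed GC cycles
		{Name: "/memory/classes/heap/objects:bytes"}, // bytes occupied by heap objects
	}
	metrics.Read(samples)
	fmt.Printf("GC cycles: %d, heap objects: %d bytes\n",
		samples[0].Value.Uint64(), samples[1].Value.Uint64())
}
```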


Potential Benefits of a --gc-debug or --metrics Flag

| Benefit | Explanation |
|---|---|
| Debugging memory bloat | Helps identify if GC isn't keeping up with large tensor allocations or model unloading. |
| Performance tuning | Users could correlate GC frequency with inference latency spikes. |
| Server monitoring | In production-like deployments (e.g., Ollama as an API server), GC metrics could feed into observability tools. |
| Optimizing model swapping | Ollama frequently loads/unloads models; GC behavior during these transitions could reveal inefficiencies. |

Example:

ollama serve --gc-metrics

Could output:

GC #42: 12MB → 3MB, pause=85µs, duration=410µs, heap=1.2GB
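
Purely as an illustration of where those numbers could come from, here is a minimal sketch that assembles a similar per-GC line from Go's standard runtime and runtime/debug packages (the --gc-metrics flag itself is hypothetical and doesn't exist in Ollama today):

```go
// Illustrative only: one way a hypothetical --gc-metrics flag could assemble
// per-GC log lines from Go's standard runtime and runtime/debug packages.
package main

import (
	"fmt"
	"runtime"
	"runtime/debug"
	"time"
)

func main() {
	var lastSeen int64
	for {
		var gs debug.GCStats
		debug.ReadGCStats(&gs) // pause history, most recent first
		if gs.NumGC > lastSeen {
			var ms runtime.MemStats
			runtime.ReadMemStats(&ms)
			fmt.Printf("GC #%d: pause=%v, heap=%dMB\n",
				gs.NumGC, gs.Pause[0], ms.HeapAlloc/(1<<20))
			lastSeen = gs.NumGC
		}
		time.Sleep(time.Second)
	}
}
```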

But there are indeed limitations to consider:

  1. GC Is Usually Not the Bottleneck

    • In Ollama, VRAM/CPU memory bandwidth and model computation (via llama.cpp, CUDA, Metal, etc.) dominate performance.
    • Go’s runtime handles small object allocations well; most of the heavy lifting is done in the C/C++ backends (llama.cpp), so in that case the main concern would be memory leaks, which aren't directly measurable as far as I'm aware.
  2. Alternatives Exist

    • Use GODEBUG=gctrace=1 or pprof (see the generic pprof sketch after this list)
      # Though requires building from source
      GODEBUG=gctrace=1 ollama serve
      
    • Monitor system memory with top, htop, or prometheus-style exporters.
  3. User-Facing Value Might Be Low

    • Most users care about response time, model load speed, and RAM/VRAM usage, not GC cycles.
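
For completeness, the pprof alternative mentioned in point 2 is the standard net/http/pprof pattern in Go services. A generic sketch (not something Ollama necessarily exposes today):

```go
// Generic Go pattern for exposing pprof profiling endpoints over HTTP.
// Sketch only; Ollama does not necessarily expose anything like this today.
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers /debug/pprof/* handlers on the default mux
)

func main() {
	// Heap and GC-related profiles then become available via, e.g.:
	//   go tool pprof http://localhost:6060/debug/pprof/heap
	log.Fatal(http.ListenAndServe("localhost:6060", nil))
}
```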

Alternative Approach? Opt-In Metrics Endpoint

Instead of a GC-specific flag, a more useful enhancement might be:

ollama serve --metrics

This would expose a /metrics endpoint (Prometheus-style) with:

  • Heap usage
  • GC count & pause times
  • Goroutine count
  • Model load/unload events
  • Inference duration histograms

This would additionally support integration with monitoring tools.

This is already common practice in production Go services (e.g., Kubernetes, Grafana).
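
As a rough illustration of how little plumbing this takes in Go, here is the usual pattern with the community prometheus/client_golang library (a sketch under the assumption that library would be used; the ollama_* metric name is made up for illustration):

```go
// Sketch only: a Prometheus-style /metrics endpoint built with the community
// prometheus/client_golang library. Its default registry already exports the
// Go runtime series (go_memstats_*, go_gc_duration_seconds, go_goroutines);
// Ollama-specific series like the hypothetical counter below would be layered
// on top.
package main

import (
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// Hypothetical metric, purely for illustration.
var modelLoadEvents = prometheus.NewCounterVec(
	prometheus.CounterOpts{
		Name: "ollama_model_load_events_total",
		Help: "Hypothetical count of model load/unload events.",
	},
	[]string{"event"},
)

func main() {
	prometheus.MustRegister(modelLoadEvents)
	http.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe(":9090", nil)) // port chosen arbitrarily
}
```

Prometheus (or any compatible scraper) could then poll that endpoint and alert on GC pause times or heap growth alongside the Ollama-specific series.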


This is just my proposal; hopefully it provides some value.
Thank you.

GiteaMirror added the feature request label 2026-05-04 20:02:54 -05:00
Author
Owner

@rick-github commented on GitHub (Aug 19, 2025):

#3144

Author
Owner

@ItsMeForLua commented on GitHub (Aug 19, 2025):

Ah, perfect, thank you for the hyperlink.

Glad to see I wasn't the only one with the idea!

I look forward to using it :)

Author
Owner

@pdevine commented on GitHub (Aug 19, 2025):

Going to close as a dupe.
