[GH-ISSUE #10323] GPU Memory Utilization and Performance Anomalies #68835

Open
opened 2026-05-04 15:22:58 -05:00 by GiteaMirror · 3 comments

Originally created by @ALLMI78 on GitHub (Apr 17, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/10323

BUG: https://github.com/ollama/ollama/issues/10327

OLD...

Summary of my Experiment:

I developed a tool (using nvidia-smi and ollama ps) to log and visualize LLM/GPU data during inference tests, specifically focusing on batch sizes and their effect on token generation (TG) and prompt processing (PP). The main goal was to understand how different batch sizes influence performance, particularly token generation speed.
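At its core, the logger just polls both tools on an interval. A minimal sketch of the approach (illustrative only, not the actual tool; it assumes nvidia-smi and ollama are on PATH, and the CSV column names are arbitrary):

```python
import csv
import subprocess
import time

def gpu_mem_mb():
    """Query used/total VRAM in MiB via nvidia-smi's CSV mode."""
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=memory.used,memory.total",
         "--format=csv,noheader,nounits"],
        text=True)
    used, total = out.strip().splitlines()[0].split(",")
    return int(used), int(total)

def ollama_ps():
    """Capture the raw `ollama ps` table (SIZE / PROCESSOR columns)."""
    return subprocess.check_output(["ollama", "ps"], text=True).strip()

with open("gpu_log.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["timestamp", "vram_used_mb", "vram_total_mb", "ollama_ps"])
    while True:
        used, total = gpu_mem_mb()
        writer.writerow([time.time(), used, total, ollama_ps()])
        f.flush()
        time.sleep(1.0)  # sampling interval; the real tool may poll faster
```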

However, during testing, I noticed the following points:

  • Batch Size and Token Generation:
    Initially, I expected batch size to directly affect token generation (TG), but batch size seems to impact only prompt processing (PP) speed. Even at high batch sizes (e.g., 512 or 1024), token generation stays constant at around 18 tokens per second, while prompt processing scales strongly with batch size (e.g., 191 tokens per second at batch size 16 vs. 1,352 tokens per second at batch size 512). (A sketch of how such a sweep can be driven follows this list.)

  • Sudden Spike in Memory Usage:
    During the tests, I observed that memory usage grew sharply as batch size increased: for example, from 16 GB to 22 GB, which was quite unexpected. Why does memory usage increase so irregularly and sharply at higher batch sizes? It doesn't seem to follow a natural, linear pattern.

  • Free VRAM Despite High Memory Usage:
    Even though the system reports high memory usage (e.g., 22 GB), there is still over 1 GB of VRAM left free throughout the entire test. At the peak memory usage, 1.1 GB of VRAM remains unutilized. This discrepancy is puzzling because, logically, the system should be fully utilizing the available VRAM. The fact that memory is offloaded to the CPU (which normally results in slower performance) despite VRAM being available suggests some inefficiencies in memory management.
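For reproducibility, here is a minimal sketch of how such a batch-size sweep can be driven through the Ollama HTTP API. This is my reconstruction, not the exact test script: the model name and prompt are placeholders, and num_batch is the Ollama option that maps to llama.cpp's batch size (which is why it moves PP but not TG):

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # default local endpoint

def timed_generate(prompt, num_batch, model="llama3.1"):
    """One non-streaming request with an explicit batch size.

    Model name and prompt are placeholders; num_batch is passed
    through to llama.cpp's n_batch.
    """
    payload = {
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": {"num_batch": num_batch},
    }
    req = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

for batch in (16, 128, 512, 1024):
    r = timed_generate("<long benchmark prompt here>", batch)
    print(batch, r["prompt_eval_duration"], r["eval_duration"])
```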

Key Questions:

  • Why does memory usage increase so irregularly and sharply with higher batch sizes? Shouldn't we expect a more linear correlation?

  • How is it possible that over 1 GB of VRAM remains free even at the highest memory utilization point?

  • Why is memory being offloaded to the CPU when VRAM is still available, leading to slower processing times during token generation?

  • Why are the prompt processing (PP) token rates highest at the point where memory usage spikes significantly and starts being offloaded to the CPU? This seems counterintuitive: offloading should typically slow processing, yet the opposite is observed.

These issues seem to indicate inefficiencies or anomalies in how memory is being utilized, which may be worth investigating further.

upper chart:

  • OPS_SIZE = the SIZE field reported by ollama ps
  • OPS_GPU = the GPU usage field reported by ollama ps

lower chart:

  • inp_tps (PP) => prompt processing speed, calculated from Ollama API returns
  • out_tps (TG) => token generation speed, calculated from Ollama API returns (see the calculation sketch below the chart)

![Image](https://github.com/user-attachments/assets/bba1048e-64fb-4d9c-8bad-0a6005b3761a)
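For reference, a minimal sketch of how inp_tps and out_tps can be derived, assuming the standard timing fields in Ollama's /api/generate response (the actual tool's calculation may differ):

```python
def rates(resp):
    """Tokens/sec from an /api/generate JSON response.

    Ollama reports durations in nanoseconds and counts in tokens:
      inp_tps (PP) = prompt_eval_count / prompt_eval_duration
      out_tps (TG) = eval_count / eval_duration
    """
    inp_tps = resp["prompt_eval_count"] / (resp["prompt_eval_duration"] / 1e9)
    out_tps = resp["eval_count"] / (resp["eval_duration"] / 1e9)
    return inp_tps, out_tps
```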

If you need more data, this is what I can give you:
![Image](https://github.com/user-attachments/assets/b7aa6e2e-edc9-415c-bc4a-184da984fed4)

Regards ;)


@ALLMI78 commented on GitHub (Apr 17, 2025):

Here you can see it better: Q3_K_M, with 2.5 GB of VRAM free but not used...?

ollama ps SIZE goes from 14 to 17 GB of usage, but nvidia-smi reports at most 13.5 GB used, with 2.5 GB free.

Even with free, unused VRAM, the load shifts from GPU to CPU: from 100% GPU load down to 89% GPU load.

![Image](https://github.com/user-attachments/assets/942480af-7636-408e-aa9f-95fdcedd2f48)


@ALLMI78 commented on GitHub (Apr 17, 2025):

I think the problem is that ollama ps reports wrong values:

nvidia-smi and the Windows 10 Task Manager show the correct values; there is no CPU/GPU splitting...?

![Image](https://github.com/user-attachments/assets/ab58197e-b3e9-4dca-82c6-edea7c464f20)
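For anyone who wants to reproduce the comparison, a rough sketch (best-effort: the ollama ps table layout may vary between versions, and note the two tools report in GB vs. GiB):

```python
import re
import subprocess

def ollama_ps_size_gb():
    """Best-effort parse of the SIZE column from `ollama ps`.

    The table layout can vary between Ollama versions, so this just
    grabs the first '<number> GB' token it finds.
    """
    out = subprocess.check_output(["ollama", "ps"], text=True)
    m = re.search(r"(\d+(?:\.\d+)?)\s*GB", out)
    return float(m.group(1)) if m else None

def nvidia_smi_used_gb():
    """Actual VRAM usage (GiB) as reported by the driver."""
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=memory.used",
         "--format=csv,noheader,nounits"],
        text=True)
    return int(out.strip().splitlines()[0]) / 1024  # MiB -> GiB

print(f"ollama ps SIZE:  {ollama_ps_size_gb()} GB (estimate)")
print(f"nvidia-smi used: {nvidia_smi_used_gb():.1f} GiB (measured)")
```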


@ALLMI78 commented on GitHub (Apr 17, 2025):

https://github.com/ollama/ollama/issues/10327


Reference: github-starred/ollama#68835