[GH-ISSUE #10092] Using split memory (RAM+VRAM) should never happen #32376

Closed
opened 2026-04-22 13:34:56 -05:00 by GiteaMirror · 9 comments

Originally created by @Mugane on GitHub (Apr 2, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/10092

What is the issue?

Ollama splits models that are too big for VRAM across VRAM and RAM. This should essentially never happen when there is enough system memory, since the bottleneck is always the transfer between VRAM and RAM: the result is model execution an order of magnitude slower than running from RAM alone. It also cannot be interrupted without killing the Ollama server, so never splitting is a really big deal. (A sketch of forcing a pure-CPU run for comparison follows this report.)

Relevant log output

(none provided)

OS

Linux

GPU

Nvidia

CPU

Intel

Ollama version

0.5.11
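
As referenced above, the RAM/VRAM split can be reproduced or avoided per request through the `num_gpu` option of `/api/generate`, which sets how many layers are offloaded to the GPU (the same knob the benchmark script later in this thread sweeps). A minimal sketch, assuming a local server on the default port; `"num_gpu": 0` forces a pure-CPU run to compare against the default split:

```shell
# Force 0 GPU layers (100% RAM) and report tokens/sec for a short generation.
# Model name and prompt are the ones used elsewhere in this thread.
curl -s localhost:11434/api/generate -d '{
  "model": "llama3.3:70b-instruct-q8_0",
  "prompt": "why is the sky blue",
  "stream": false,
  "options": {"num_gpu": 0, "num_predict": 100}
}' | jq '.eval_count/(.eval_duration/1000000000)'
```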

GiteaMirror added the bug label 2026-04-22 13:34:56 -05:00

@rick-github commented on GitHub (Apr 2, 2025):

Can you provide more information on the performance issue you are seeing? I did a quick check with qwen2.5:7b and found that model execution was slowest at 0% VRAM usage (all RAM), increased as the RAM+VRAM split favoured VRAM, and was maximum at 100% VRAM usage.

![Image](https://github.com/user-attachments/assets/cfef8a64-db8b-4f2e-957d-4b1364689220)
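
As an aside, the split actually in effect for a loaded model can be inspected with `ollama ps`; its PROCESSOR column reports the CPU/GPU ratio. A minimal check, assuming a model is currently loaded:

```shell
# List loaded models; the PROCESSOR column shows the split,
# e.g. "100% GPU" or "40%/60% CPU/GPU".
ollama ps
```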


@Mugane commented on GitHub (Apr 2, 2025):

Do it with a model that doesn't fit in VRAM (compare 100% RAM vs. split), for example llama3.3:70b-instruct-q8_0. It might also be affected by your CPU: I ran on an Intel Core i7-10850H (2.7 GHz) with 12 virtual processors and 128 GB DDR4 SODIMM @ 2133 MT/s; the video card is an NVIDIA Quadro RTX 5000 Max-Q with 16 GB VRAM.


@rick-github commented on GitHub (Apr 3, 2025):

llama3.3:70b-instruct-q8_0, i7-13700, 96G DDR5 @ 5200, RTX 4070 12G

![Image](https://github.com/user-attachments/assets/2aba0f55-89bc-456c-9e45-0b220c8ce1c8)


@Mugane commented on GitHub (Apr 8, 2025):

Thanks. May I ask what you're using to run the test? I'll see if I can reproduce it locally, in case there's a discrepancy between real-world use and the test setup, or it's just some local issue on my end.


@rick-github commented on GitHub (Apr 8, 2025):

```shell
model=llama3.3:70b-instruct-q8_0

# Sweep the number of GPU-offloaded layers from 0 (all RAM) to 11.
for i in {0..11}; do
  r="{\"layers\":$i}"

  # For each mmap setting (server default, forced on, forced off),
  # generate 100 tokens and compute tokens/sec from the API response.
  tps=$(for m in default true false; do
    mmap=""
    [ "$m" == "true" ]  && mmap='"use_mmap":true,'
    [ "$m" == "false" ] && mmap='"use_mmap":false,'
    tps=$(curl -s localhost:11434/api/generate \
      -d '{"model":"'$model'","options":{'$mmap'"num_gpu":'$i',"num_predict":100},"stream":false,"prompt":"why is the sky blue","seed":42}' \
      | jq '.eval_count/(.eval_duration/1000000000)')
    echo -n "+{\"tps_mmap_$m\":$tps}"
  done)

  # Current VRAM usage (MiB) reported by nvidia-smi's XML output.
  vram=$(nvidia-smi -x -q | yq -p=xml -o=json \
    | jq -r '.nvidia_smi_log.gpu.processes.process_info.used_memory // "0" | gsub(" MiB";"")')

  # Merge layer count, the three tps measurements, and VRAM into one JSON line.
  jq -cn "$r$tps+{\"vram\":$vram}"
done
```
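
Each loop iteration emits one compact JSON line of the form `{"layers":…,"tps_mmap_default":…,"tps_mmap_true":…,"tps_mmap_false":…,"vram":…}`, i.e. tokens/sec for each mmap mode plus the VRAM in use in MiB; these lines are the data behind the graphs above. The script assumes `curl`, `jq`, `yq`, and `nvidia-smi` are available.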

@Mugane commented on GitHub (Apr 13, 2025):

@rick-github Did you actually run that?


@rick-github commented on GitHub (Apr 13, 2025):

Yes, that's how I got the numbers to create the graph in https://github.com/ollama/ollama/issues/10092#issuecomment-2775193822


@Mugane commented on GitHub (Apr 30, 2025):

This issue is not resolved. I have not had time to debug your script yet, and the problem persists.


@rick-github commented on GitHub (Apr 30, 2025):

What problem are you having with the script?
