[GH-ISSUE #7584] Nvidia fallback memory #66891

Closed
opened 2026-05-04 08:40:37 -05:00 by GiteaMirror · 2 comments

Originally created by @AncientMystic on GitHub (Nov 9, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/7584

Would it be possible to add a feature option such as "ollama_cuda_fallback", or simply detect whether the feature is enabled or disabled, to allow over-allocation of VRAM on Nvidia cards and take advantage of the new system RAM fallback in the 531+ drivers (I believe)? (Detection could be a little complicated, since drivers between 531 and around 541, I believe, do not allow the feature to be configured, and Pascal vGPU support ends at 536 if I remember correctly.)
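
A rough sketch of the kind of detection I mean, just reading the driver version from `nvidia-smi` and comparing it against the version ranges above; the thresholds are the ones I'm guessing at here, not values confirmed against Nvidia documentation, and the function names are only illustrative:

```python
# Rough sketch only: guess whether the driver is new enough for system-memory
# fallback by reading the version from nvidia-smi. The 531/541 thresholds are
# the ones guessed at in this issue, not confirmed by Nvidia documentation.
import subprocess

def nvidia_driver_major() -> int | None:
    """Return the major NVIDIA driver version reported by nvidia-smi, or None."""
    try:
        out = subprocess.run(
            ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
            capture_output=True, text=True, check=True,
        )
        return int(out.stdout.strip().splitlines()[0].split(".")[0])
    except (OSError, subprocess.CalledProcessError, ValueError, IndexError):
        return None

def fallback_probably_supported() -> bool:
    major = nvidia_driver_major()
    # 531+ introduced the fallback; roughly 531-541 reportedly cannot configure it,
    # so in that range we could only assume it is on rather than toggle it.
    return major is not None and major >= 531
```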

As it stands, ollama detects VRAM and manages memory allocation itself in a CPU+GPU hybrid mode.

I suspect that while fallback can decrease speed versus a pure GPU allocation when you have enough VRAM, it should offer a performance boost over the CPU runners (not to mention possibly lower CPU use, eliminating bugs between runners, etc.) if we simply allow the GPU to fall back on system RAM, hooking system RAM so it can be managed directly by the GPU.

The result is effectively a GPU with partly slower RAM, which with this feature should still be faster and better than mixing in CPU runners, even if only slightly.

It seems like it would be at least worth testing.

Edit:
Small, quick test done with LM Studio on a laptop.

Model: Estopianmaid 13B Q5_K_M
(just a random model I downloaded to see how it handled writing styles; this might be better tested on Gemma 2 or Llama 3.2)

Hardware (laptop):
Cpu: i7-8750H 6C 12T
Gpu: gtx 1050 ti 4gb
Ram: 32GB DDR4 2133mhz

Fallback:  
Tokens: 1.1t/s 
Gen time: 41s 
Time to first token: 64.97s 
11-13% cpu usage (system usage, not the model), 60-100% gpu usage

GPU+CPU: 
Tokens: 0.48t/s
Gen time: 146.72s
Time to first token: 17.13s 
100% cpu, 60-100 gpu

CPU: (wrote a much shorter response) 
Tokens: 0.84t/s (regen resulted in 0.4t/s) 
Gen time: 26.7s
Time to first token: 287.4s 
100% cpu 

smaller model Tiamat 7b q2_k that fits in vram (wrote a much longer response): 
tokens: 2.6t/s
gen time: 56.64s
time to first: 14.53s

While it is about 55% slower than pure VRAM, it is about 130% faster than the hybrid runners, in this specific test at least, plus CPU usage is essentially nil and CPU power draw is no longer a factor with this method. (CPU power use would be a decent factor by itself even if performance turned out to be roughly equal overall; since GPUs draw so much power to begin with, it would be nice to at least eliminate some of it.)
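
For reference, a quick check of those percentages from the reported rates (the pure-VRAM figure is from the smaller 7B model above, so it is only a rough comparison):

```python
# Sanity check of the comparison above, using the token rates reported in the test.
fallback, hybrid, pure_vram = 1.1, 0.48, 2.6  # tokens/s; pure_vram is the 7B model

print(f"fallback vs pure VRAM: {(1 - fallback / pure_vram) * 100:.0f}% slower")  # ~58%
print(f"fallback vs GPU+CPU:   {(fallback / hybrid - 1) * 100:.0f}% faster")     # ~129%
```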

GiteaMirror added the feature request label 2026-05-04 08:40:37 -05:00

@rick-github commented on GitHub (Nov 10, 2024):

ollama supports this, or rather llama.cpp has a feature called unified memory. On Windows it's on by default; Linux users need to add `GGML_CUDA_ENABLE_UNIFIED_MEMORY=1` to the server environment. To use it, just override `num_gpu`. I'm not sure how useful it is, though. Stable Diffusion users found it such a problem that Nvidia added the ability to [turn it off](https://nvidia.custhelp.com/app/answers/detail/a_id/5490/~/system-memory-fallback-for-stable-diffusion). In my experience it's good for working around some of the memory miscalculations that ollama does, so it prevents OOMing, but when you start using it for large amounts of RAM allocation, token generation speed plummets.
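
For anyone who wants to try it, a minimal sketch of what that looks like against the local API, assuming the server was started with the environment variable above (the model tag and layer count are placeholders):

```python
# Minimal sketch: with unified memory enabled on the server (Linux: start the
# ollama server with GGML_CUDA_ENABLE_UNIFIED_MEMORY=1 in its environment),
# override num_gpu per request to push more layers onto the GPU than ollama
# estimates will fit. The model tag below is a placeholder.
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"

def generate(model: str, prompt: str, num_gpu: int) -> dict:
    """Send one non-streaming generate request with a num_gpu override."""
    body = json.dumps({
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": {"num_gpu": num_gpu},  # number of layers to offload to the GPU
    }).encode()
    req = urllib.request.Request(
        OLLAMA_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

# Ask for more layers than ollama's own estimate; 41 here is just a placeholder.
result = generate("estopianmaid-13b", "Say hello.", num_gpu=41)
print(result["response"])
```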

For example, I imported Estopianmaid and then ran it with various amounts of the model in unified memory. My test GPU has 12G of VRAM, so I loaded some other junk in there so that the entire model wouldn't fit. In this config, ollama wanted to load 15 out of 41 layers, normally having the remaining 26 layers running on CPU.

![Screenshot from 2024-11-10 12-13-02](https://github.com/user-attachments/assets/0cad280c-85ff-40d7-8a26-0040cb2898fb)

Here we see that using unified memory allows loading more than 15 layers, and performance does continue to increase. But somewhere between 20 and 25 layers, there's some bottleneck that drastically lowers the token generation rate. This might be a function of my test environment - limited PCI bandwidth, card in the wrong slot, etc - so it may work for others, but for me, it's only useful as an OOM preventative.
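
The sweep itself can be scripted along the same lines, taking the rate from the eval fields that `/api/generate` returns; this reuses the `generate()` helper sketched above, and the layer range is just illustrative:

```python
# Illustrative sweep: same prompt, increasing num_gpu, with tokens/s computed
# from the eval_count / eval_duration fields (eval_duration is in nanoseconds).
# Reuses the generate() helper sketched above; the model tag is a placeholder.
for layers in range(15, 42, 5):
    r = generate("estopianmaid-13b", "Write a short story about a lighthouse.", num_gpu=layers)
    tps = r["eval_count"] / (r["eval_duration"] / 1e9)
    print(f"{layers:>2} layers offloaded: {tps:.2f} tokens/s")
```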

Edit 2025/09/01

Adding some updated data for recent model and ollama releases.

Using ollama v0.11.8 and qwen3:32b-q4_K_M, the same results are seen. Ollama estimates 31 of 65 layers will fit in the 12GB of a 4070, but overriding `num_gpu` allows ~38 layers before performance suffers:

<img width="728" height="394" alt="Image" src="https://github.com/user-attachments/assets/e85aa0a2-2858-4f5b-84fa-9b2d355a9d07" />

Interestingly, the performance impact seems less for MoE models. qwen3:30b-a3b-instruct-2507-q4_K_M is approximately the same size as qwen3:32b-q4_K_M, and ollama estimates 28 of 49 layers can be scheduled in VRAM. Overriding `num_gpu` allows ~33 layers to be offloaded, and the performance drop-off is not nearly as severe.

<img width="737" height="381" alt="Image" src="https://github.com/user-attachments/assets/6c528eae-f7f7-4e03-bd13-5d4f176caf92" />

This works well for gpt-oss:20b. Despite the estimation logic calculating that 18 layers fit into VRAM at a context length of 128k, forcing more layers into shared RAM doesn't show the severe tps drop-off seen in dense models, and performance is much better (although still about half that of the model loaded fully into VRAM):

<img width="726" height="386" alt="Image" src="https://github.com/user-attachments/assets/a00dddaa-1fba-4d58-ba16-4ea4286d3a02" />

@AncientMystic commented on GitHub (Nov 11, 2024):

Thank you for such a detailed response, I really appreciate it.

It does not seem to be properly implemented in ollama, unless I am doing something wrong anyway.

Even with it set to force all layers to the GPU, it cuts off at the VRAM amount and forces CPU for the rest.

I also see 100% CPU usage, whereas in LM Studio I see almost no CPU usage when purely using the GPU.
