[GH-ISSUE #7175] Layer-wise Inferencing from ram/low vram mode? #66611

Closed
opened 2026-05-04 07:37:06 -05:00 by GiteaMirror · 4 comments
Owner

Originally created by @AncientMystic on GitHub (Oct 11, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/7175

Would it be possible to add something like layer-wise inferencing to Ollama, similar to AirLLM (but ideally streaming layers from CPU/RAM instead of disk, so it isn't extremely slow)?

It seems like letting Ollama swap layers into VRAM for processing (preferably from memory) could give a nice performance boost; most modern systems should be able to load the requested layers into VRAM from RAM nearly instantly.

This could enable the use of far larger models and make running parallel models within Ollama more efficient. It seems better to always stream a fixed number of layers to the GPU than the current approach, which loads however many of the first layers fit into VRAM and leaves the spillover on the CPU/RAM, where inference is held back waiting for the CPU to process those spillover layers much more slowly. That approach doesn't really fully utilise the GPU unless you have enough VRAM to fit the entire model.

It would be especially useful to be able to specify how many layers can be loaded at a given time. If Ollama were free to load multiple layers at once to VRAM in a batched queue, while keeping them in RAM for nearly instant access, it seems this would be much more efficient on systems without extremely high amounts of VRAM. It would still take a good amount of RAM, but having 32-64 GB+ of RAM is a lot more common and less expensive than having the 24-32 GB+ of VRAM needed for fairly large models. With this, common 2-4 GB VRAM GPUs would become a lot more useful to Ollama.

I feel this would also work nicely with the K/V quantisation PR, since that minimises the VRAM usage of the K/V cache and would allow much larger context sizes.

In combination, this could be a nice way to run extremely large models and large context sizes on low VRAM, at least if you have a lot of RAM.
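Purely as a conceptual sketch of the idea (not Ollama or llama.cpp code; `run_layer_on_gpu` and the surrounding structs are hypothetical placeholders), the streaming loop would look something like this: keep every layer's weights in host RAM, reuse a single VRAM buffer, and copy each layer in just before it is computed:

```cpp
// Hypothetical sketch of layer-wise inference with weights resident in RAM.
// Not actual Ollama/llama.cpp code; run_layer_on_gpu() stands in for the real
// per-layer attention + FFN kernels.
#include <cuda_runtime.h>
#include <vector>

struct LayerWeights {
    const void* host_ptr;  // this layer's weights in (ideally pinned) system RAM
    size_t bytes;
};

// Placeholder for the real per-layer compute.
void run_layer_on_gpu(const void* /*dev_weights*/, float* /*dev_activations*/,
                      cudaStream_t /*stream*/) {
    // a real implementation would launch the layer's kernels on the stream
}

void forward_layerwise(const std::vector<LayerWeights>& layers,
                       float* dev_activations, size_t max_layer_bytes) {
    void* dev_weights = nullptr;
    cudaMalloc(&dev_weights, max_layer_bytes);  // one reusable VRAM slot
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    for (const LayerWeights& l : layers) {
        // RAM -> VRAM copy for just this layer, then compute it. With a second
        // buffer and stream the next copy could overlap the current compute
        // (double buffering), which is what would hide most of the transfer cost.
        cudaMemcpyAsync(dev_weights, l.host_ptr, l.bytes,
                        cudaMemcpyHostToDevice, stream);
        run_layer_on_gpu(dev_weights, dev_activations, stream);
    }

    cudaStreamSynchronize(stream);
    cudaStreamDestroy(stream);
    cudaFree(dev_weights);
}
```

The key point is that VRAM only ever needs to hold one or two layers (plus activations and the K/V cache), so model size is bounded by RAM rather than VRAM, at the cost of a PCIe transfer per layer per token.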

GiteaMirror added the feature request label 2026-05-04 07:37:06 -05:00

@rick-github commented on GitHub (Oct 11, 2024):

This would be a feature request for [llama.cpp](https://github.com/ggerganov/llama.cpp/issues).


@AncientMystic commented on GitHub (Oct 11, 2024):

I will post a feature request there and see what the llama.cpp team has to say about it, thank you.


@AncientMystic commented on GitHub (Oct 11, 2024):

It seems llama.cpp does support a form of layer-wise inference when built with cuBLAS and run with -ngl 0, but I think that streams from disk specifically. I posted a discussion, so we will see what everyone over at llama.cpp has to say about this; hopefully it will be positive, give a performance boost, and not be too complicated to implement in Ollama.
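For anyone who wants to try that path, something along these lines should exercise it (assuming a reasonably recent llama.cpp checkout; older releases used `LLAMA_CUBLAS=1` with make instead of the `GGML_CUDA` CMake option, and the model path below is just a placeholder):

```sh
# Build llama.cpp with CUDA support
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release

# Keep all layers in system RAM (-ngl 0); with a CUDA-enabled build the GPU
# may still be used for large-batch matrix multiplies during prompt processing
./build/bin/llama-cli -m /path/to/model.gguf -ngl 0 -p "Hello"
```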


@rick-github commented on GitHub (Nov 17, 2024):

Original llama.cpp discussion: https://github.com/ggerganov/llama.cpp/discussions/4310
Follow-up discussion: https://github.com/ggerganov/llama.cpp/discussions/9854

Reference: github-starred/ollama#66611