[GH-ISSUE #3509] Can Ollama use both CPU and GPU for inference? #2162

Closed
opened 2026-04-12 12:24:12 -05:00 by GiteaMirror · 3 comments
Owner

Originally created by @OPDEV001 on GitHub (Apr 6, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/3509

What are you trying to do?

May I know whether Ollama supports mixing CPU and GPU when running on Windows? I know my hardware is not powerful enough for Ollama, but I would still like to use what the GPU can contribute.
I checked the parameter information at the link below, but I still cannot mix CPU and GPU; most of the load ends up on the CPU.
https://github.com/ollama/ollama/blob/main/docs/modelfile.md

If I put the whole load on the GPU, it says "out of VRAM", :) you know how it is.

I am guessing that, if this is possible, maybe we could specify that the GPU takes part of the load and the CPU takes most of it?

Thanks,

How should we solve this?

Please see the description above.

What is the impact of not solving this?

If this isn't possible, putting the whole load on the GPU will just crash with an out-of-VRAM error.

Anything else?

Thanks for everything.

GiteaMirror added the question label 2026-04-12 12:24:12 -05:00
Author
Owner

@navr32 commented on GitHub (Apr 8, 2024):

You should provide the log of ollama serve so we can see how Ollama has tried to split the model across both memories.
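
For Windows users who want to check this themselves: per Ollama's troubleshooting docs, the server log is typically written to %LOCALAPPDATA%\Ollama\server.log. The sketch below is only illustrative; the log path and the "offload"/"layers" keywords it filters on are assumptions, not something confirmed in this issue.

```python
# Minimal sketch: print the lines of the Ollama server log that mention layer
# offloading, to see how many layers went to the GPU vs. the CPU.
# Assumptions: the Windows log path (%LOCALAPPDATA%\Ollama\server.log) and the
# "offload"/"layers" keywords are illustrative only.
import os
from pathlib import Path

def show_offload_lines() -> None:
    log_path = Path(os.environ["LOCALAPPDATA"]) / "Ollama" / "server.log"
    with log_path.open(encoding="utf-8", errors="replace") as log:
        for line in log:
            if "offload" in line or "layers" in line:
                print(line.rstrip())

if __name__ == "__main__":
    show_offload_lines()
```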

Author
Owner

@OPDEV001 commented on GitHub (Apr 8, 2024):

I can run ollama serve on the command line; is that the log you want?

Author
Owner

@pdevine commented on GitHub (Apr 12, 2024):

Hi @OPDEV001, this is how Ollama works by default. It will try to offload as many layers of the LLM as possible onto the GPU, and if they don't all fit, the remaining layers will run on the CPU.

I'll go ahead and close the issue, but feel free to keep commenting.
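
For anyone landing here later: if the automatic split still overflows VRAM, the number of layers offloaded to the GPU can be capped explicitly. Below is a minimal sketch against the local REST API, assuming the server is listening on the default port 11434 and that your Ollama version accepts the num_gpu option (number of layers to offload); the model name "llama2" and the layer count of 10 are placeholders.

```python
# Minimal sketch: ask Ollama to offload only a fixed number of layers to the
# GPU and run the rest on the CPU, via the local REST API.
# Assumptions: default endpoint http://localhost:11434, the "num_gpu" option
# (layers to offload), and the model name "llama2" are illustrative.
import json
import urllib.request

payload = {
    "model": "llama2",
    "prompt": "Why is the sky blue?",
    "stream": False,
    "options": {
        "num_gpu": 10,  # cap GPU-offloaded layers; remaining layers run on CPU
    },
}

req = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)

with urllib.request.urlopen(req) as resp:
    print(json.load(resp)["response"])
```

The same value can reportedly also be baked into a model via a Modelfile PARAMETER line, per the modelfile docs linked above, though that depends on the Ollama version in use.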

Reference: github-starred/ollama#2162