[GH-ISSUE #5749] how to run in only GPU mode #65619

Closed
opened 2026-05-03 21:55:11 -05:00 by GiteaMirror · 18 comments

Originally created by @janglichao on GitHub (Jul 17, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/5749

Originally assigned to: @dhiltgen on GitHub.

What is the issue?

My model sometimes runs half on CPU and half on GPU. When I run the `ollama ps` command it shows 49% CPU / 51% GPU. How can I configure it so the model always runs only on the GPU and never on the CPU?
Please help.

OS

Linux

GPU

No response

CPU

No response

Ollama version

No response

GiteaMirror added the bug, nvidia, needs more info labels 2026-05-03 21:55:13 -05:00

@rick-github commented on GitHub (Jul 17, 2024):

How big is your model and how much VRAM do you have? ollama will normally try to fit it all in GPU; if it doesn't, ollama's calculation of available VRAM is telling it that it needs to spill some to CPU RAM. What's the output of `nvidia-smi`?
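
A quick way to check this (assuming an NVIDIA GPU and that `nvidia-smi` is on the PATH):

```sh
# Report total/used/free VRAM per GPU:
nvidia-smi --query-gpu=memory.total,memory.used,memory.free --format=csv

# The default view also lists every process currently holding VRAM:
nvidia-smi
```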


@dhiltgen commented on GitHub (Jul 17, 2024):

@janglichao can you clarify "sometimes"? Are you loading the same model and sometimes it loads 100% GPU and sometimes loads ~50/50 CPU/GPU, or are you loading different models? We'll try to load as much of the model into GPU as we can, but the amount of VRAM on your GPU will limit how large the models can be before we have to overflow to the CPU.

If you are seeing different behavior when loading the same model, a likely explanation may be that you have other applications running that are taking varying amounts of VRAM. You didn't mention what brand of GPU you are using, but if it's nvidia, you can use `nvidia-smi` to see the other apps running on the GPU.

If that doesn't clear it up, and you think there's a bug, please explain your scenario a little more and share your server log and I'll reopen the issue.
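
For anyone unsure where the server log lives: on a standard Linux install ollama runs as a systemd service, so something like the following usually works (the unit name may differ on your system):

```sh
# Dump the most recent server log lines from the ollama systemd unit:
journalctl -u ollama --no-pager | tail -n 200
```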


@janglichao commented on GitHub (Jul 18, 2024):

> @janglichao can you clarify "sometimes"? Are you loading the same model and sometimes it loads 100% GPU and sometimes loads ~50/50 CPU/GPU, or are you loading different models? We'll try to load as much of the model into GPU as we can, but the amount of VRAM on your GPU will limit how large the models can be before we have to overflow to the CPU.
>
> If you are seeing different behavior when loading the same model, a likely explanation may be you have other applications running that are taking varying amount of VRAM. You didn't mention what brand of GPU you are using, but if it's nvidia, you can use `nvidia-smi` to see the other apps running on the GPU.
>
> If that doesn't clear it up, and you think there's a bug, please explain your scenario a little more and share your server log and I'll reopen the issue.

My GPUs: 2× V100.
The model uses 42 GB of GPU memory.
That is enough to load the full model on the GPU. Sometimes it loads 100% on GPU, but sometimes it runs half on CPU and half on GPU. Can I disable CPU use and force it to run only on the GPU?


@dhiltgen commented on GitHub (Jul 18, 2024):

@janglichao please share a server log when you see it only partially loading so I can see why it was unable to load.


@haier-1314 commented on GitHub (Jul 24, 2024):

Is it possible to configure GPU-only use, regardless of VRAM usage? I want to completely disable the CPU.


@rick-github commented on GitHub (Jul 24, 2024):

ollama will use as much of the GPU as it can. If you don't want to use CPU, only load models that will fit on the GPU.
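
A rough way to sanity-check the fit before loading (the model name `llama3` is just an example; note that runtime VRAM use exceeds the on-disk size because of the KV cache and compute buffers):

```sh
# On-disk size of the model:
ollama list | grep llama3

# Free VRAM right now (NVIDIA):
nvidia-smi --query-gpu=memory.free --format=csv

# After loading, confirm where the model ended up:
ollama ps
```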


@dhiltgen commented on GitHub (Oct 15, 2024):

@janglichao the behavior you're describing sounds like a bug, but without a server log, I can't tell what exactly is going wrong.

We don't currently offer a mechanism to force GPU-only and fail the model load if it won't fit.


@rocymp commented on GitHub (Oct 23, 2024):

We also encountered this problem: GPU/CPU splitting occurs when we have dual graphics cards. With the same resources and models, the old version of ollama (v0.1.41) runs 100% on GPU, so we went back to the old version.


@rick-github commented on GitHub (Oct 23, 2024):

0.2.0 introduced parallelism, which can increase resource usage. There are settings to manage resource usage; see the sketch below.
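
For reference, a sketch of the relevant server-side environment variables (set them where the server starts, e.g. in the ollama systemd unit, before `ollama serve`):

```sh
# Limit the extra memory introduced by request parallelism in 0.2.0+:
export OLLAMA_NUM_PARALLEL=1        # one concurrent request per model
export OLLAMA_MAX_LOADED_MODELS=1   # at most one model resident at a time
ollama serve
```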


@dhiltgen commented on GitHub (Nov 5, 2024):

@rocymp can you share server logs from 0.1.41 and a recent version running the same model on the same system so we can see what the difference is? If there's a bug we'd like to fix it, but without server logs it's hard to understand what might be wrong.


@CRYPTO-GENIK commented on GitHub (Nov 14, 2024):

Is there not just a simple answer to this question?
Who cares whether his model fits; how do you just disable the CPU feature? I'd rather get an error than have it run a test to see whether it fits in the GPU or not.


@rick-github commented on GitHub (Nov 14, 2024):

`num_gpu=999`


@DenisBalan commented on GitHub (Jan 26, 2025):

@rick-github can you elaborate on your answer?


@rick-github commented on GitHub (Jan 26, 2025):

https://github.com/ollama/ollama/issues/6950#issuecomment-2373663650

Instead of setting `num_gpu` to 0 as in the post above, set it to 999 to force all layers onto the GPU. This can cause the runner to crash or suffer a [performance penalty](https://github.com/ollama/ollama/issues/7584#issuecomment-2466715900).
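
For readers landing here, `num_gpu` (the number of layers to offload) can be set in a few places; a minimal sketch, with `llama3` standing in for your model:

```sh
# 1. Interactively in the CLI:
#      ollama run llama3
#      >>> /set parameter num_gpu 999

# 2. Baked into a custom model via a Modelfile:
#      FROM llama3
#      PARAMETER num_gpu 999
#    then: ollama create llama3-gpu -f Modelfile

# 3. Per-request over the REST API:
curl http://localhost:11434/api/generate -d '{
  "model": "llama3",
  "prompt": "hello",
  "options": {"num_gpu": 999}
}'
```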


@pcgeek86 commented on GitHub (Jun 22, 2025):

I have an NVIDIA GeForce RTX 4070 Ti SUPER with 16 GB of VRAM + 64 GB of DDR4 memory.

According to Windows 11 Task Manager, I have 48.0 GB of total GPU memory, because 32 GB of DDR4 can be used as "shared" GPU memory.

I would like to force Ollama to load `llama3.3:latest` (45 GB) **only** onto the GPU, since it ought to fit inside the 48 GB of total GPU memory.

Right now, Ollama is splitting the model across CPU / GPU (71% CPU / 29% GPU) and it is yielding very slow Tokens per Second (TPS) results.

How can I force Ollama to load the model only on my NVIDIA GPU + shared memory? I am reasonably confident that using shared memory with the GPU will still be much faster than using the CPU for part of the model.

**Edit:** I tried the `num_gpu = 999` suggestion in a `Modelfile`, but it did **not** have the intended effect, or any effect that I can detect.


@rick-github commented on GitHub (Jun 22, 2025):

If you set `num_gpu` to 999, then logs should show that the model is fully offloaded to the GPU.
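
One way to verify, assuming a Linux/systemd install (on Windows the server log is typically under `%LOCALAPPDATA%\Ollama`):

```sh
# Look for the layer-offload report in the server log:
journalctl -u ollama --no-pager | grep -i offload
# Expect a line along the lines of: "offloaded 33/33 layers to GPU"
```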


@pcgeek86 commented on GitHub (Jul 10, 2025):

Sorry to bug you, @janglichao, but did this get closed because it has been implemented? Would you be able to share how to use it correctly?


@athuljayaram commented on GitHub (Apr 12, 2026):

How do I use only the GPU?

