[GH-ISSUE #9902] one model runs on GPU while another runs on CPU #68540

Closed
opened 2026-05-04 14:23:01 -05:00 by GiteaMirror · 5 comments

Originally created by @ROBODRILL on GitHub (Mar 20, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/9902

What is the issue?

In Docker, I copied deepseek-r1:32b and added two parameters to the Modelfile:

PARAMETER num_ctx 131072
PARAMETER num_predict -1

I named the new model deepseek-r1:32b-max-context.
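
For reference, the Modelfile described above would look something like this (a sketch reconstructed from the parameters quoted in this report, not the author's exact file):

```
FROM deepseek-r1:32b

PARAMETER num_ctx 131072
PARAMETER num_predict -1
```

built with:

```shell
ollama create deepseek-r1:32b-max-context -f Modelfile
```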

Running deepseek-r1:32b-max-context, `ollama ps` shows the processor at 100% CPU.
Running deepseek-r1:32b, `ollama ps` shows the processor at 100% GPU.

Relevant log output


OS

docker

GPU

tesla T4

CPU

No response

Ollama version

0.5.11

GiteaMirror added the bug label 2026-05-04 14:23:01 -05:00

@rick-github commented on GitHub (Mar 20, 2025):

The size of the context is larger than will fit in GPU VRAM, so the model runs on the CPU.
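
To see why, note that the KV cache grows linearly with num_ctx. A rough estimate, assuming the 32B distill follows the Qwen2.5-32B architecture (64 layers, 8 KV heads, head dim 128) with an fp16 cache; these figures are assumptions, not taken from this thread:

```python
# Back-of-the-envelope KV-cache size for various num_ctx values.
# Architecture numbers are assumptions for a Qwen2.5-32B-based model.
layers, kv_heads, head_dim, bytes_per = 64, 8, 128, 2  # fp16 = 2 bytes

per_token = 2 * layers * kv_heads * head_dim * bytes_per  # K and V
print(per_token // 1024, "KiB per token")  # 256 KiB per token

for num_ctx in (8192, 32768, 131072):
    print(f"num_ctx {num_ctx:>6}: ~{per_token * num_ctx / 2**30:.0f} GiB KV cache")
# num_ctx   8192: ~2 GiB
# num_ctx  32768: ~8 GiB
# num_ctx 131072: ~32 GiB
```

Under those assumptions, num_ctx 131072 needs roughly 32 GiB for the KV cache alone, on top of the quantized weights; a single Tesla T4 has 16 GB of VRAM, so the model falls back to CPU.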

@sieveLau commented on GitHub (Mar 21, 2025):

Your context length is too large. Context also takes VRAM to store, so try reducing it. In most cases 32K is enough for a 32B model; longer contexts don't make much difference.
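
If it helps, you can also try smaller contexts without rebuilding the model each time, using the interactive `/set parameter` command (a sketch; the value here is illustrative):

```shell
ollama run deepseek-r1:32b
>>> /set parameter num_ctx 32768
```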

@ROBODRILL commented on GitHub (Mar 25, 2025):

> The size of the context is larger than will fit in GPU VRAM, so the model runs on the CPU.

Thank you. My VRAM is 128 GB, so what context size can I configure?

@ROBODRILL commented on GitHub (Mar 25, 2025):

> Your context length is too large. Context also takes VRAM to store, so try reducing it. In most cases 32K is enough for a 32B model; longer contexts don't make much difference.

Thank you. I set the context to 8K, but I don't know what size to configure to match the VRAM.

@sieveLau commented on GitHub (Mar 25, 2025):

> > Your context length is too large. Context also takes VRAM to store, so try reducing it. In most cases 32K is enough for a 32B model; longer contexts don't make much difference.
>
> Thank you. I set the context to 8K, but I don't know what size to configure to match the VRAM.

I think you can add one more parameter `num_gpu 999` to force Ollama to load the model fully into VRAM, ignoring its estimate. Then increase num_ctx step by step to find where the limit is. But in that case you should monitor your VRAM with `nvidia-smi` instead of `ollama ps`, and watch the ollama service output for warnings or errors.
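
A sketch of that workflow (the container name `ollama` below is an assumption; this report runs in Docker, so `docker logs` stands in for the systemd journal):

```shell
# Watch VRAM usage while the model loads and generates
watch -n 1 nvidia-smi

# Service logs: systemd install...
journalctl -u ollama -f
# ...or a Docker install (container name is an assumption)
docker logs -f ollama
```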
