[GH-ISSUE #11339] How to start GPU support in API mode? #69537

Closed
opened 2026-05-04 18:24:07 -05:00 by GiteaMirror · 17 comments

Originally created by @renliao on GitHub (Jul 9, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/11339

Hello. In `ollama run` mode I can confirm that the GPU is enabled, but how do I enable GPU support in API mode?

`payload = { "model": MODEL_NAME, "images": [image_base64], "prompt": prompt, "stream": False, "options": { "temperature": 0.2, "num_ctx": 4096 } }`

`response = requests.post(f"{OLLAMA_HOST}/api/generate", json=payload, timeout=180)`

GiteaMirror added the question, needs more info labels 2026-05-04 18:24:14 -05:00

@rick-github commented on GitHub (Jul 9, 2025):

GPU use is automatic, it doesn't need to be enabled.


@renliao commented on GitHub (Jul 9, 2025):

> GPU use is automatic, it doesn't need to be enabled.

Thank you for your reply, but I found that when I use API mode the GPU is not enabled automatically, which makes model inference slow. How can I solve this? Thanks.


@rick-github commented on GitHub (Jul 9, 2025):

[Server logs](https://github.com/ollama/ollama/blob/main/docs/troubleshooting.md#how-to-troubleshoot-issues) will aid in debugging.

@TrurlMcByte commented on GitHub (Jul 11, 2025):

> > GPU use is automatic, it doesn't need to be enabled.
>
> Thank you for your reply, but I found that when I use API mode the GPU is not enabled automatically, which makes model inference slow. How can I solve this? Thanks.

Install version 0.8.0; it works for me. Newer versions switch to CPU after attempting GPU (especially on qwen3 models, but not only those).


@rick-github commented on GitHub (Jul 11, 2025):

> Newer versions switch to CPU after attempting GPU

They don't. The server will estimate the required resources and will fit what it can on the GPU. The rest will run in CPU. The only time that the runner will use CPU when it was told to use the GPU is if the installation is broken in some way.
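
One way to check how a loaded model was actually split is to ask the server. The sketch below is not from the thread: it assumes the server listens on the default `http://localhost:11434` and that `/api/ps` reports `size` and `size_vram` for each loaded model. The helper names are made up for illustration.

```python
import json
import urllib.request

OLLAMA_HOST = "http://localhost:11434"  # assumption: default local server address


def gpu_fraction(model: dict) -> float:
    """Return the share of a loaded model's memory that sits in VRAM (0.0 to 1.0)."""
    size = model.get("size", 0)
    if size <= 0:
        return 0.0
    return model.get("size_vram", 0) / size


def loaded_models(host: str = OLLAMA_HOST) -> list[dict]:
    """Fetch the currently loaded models from the server's /api/ps endpoint."""
    with urllib.request.urlopen(f"{host}/api/ps", timeout=10) as resp:
        return json.load(resp).get("models", [])


# Usage, with a model already loaded by an earlier API request:
#   for m in loaded_models():
#       print(f"{m['name']}: {gpu_fraction(m):.0%} in VRAM")
```

A value below 100% would suggest that part of the model was placed in system RAM, which matches the partial-offload behaviour described above.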


@TrurlMcByte commented on GitHub (Jul 11, 2025):

> > Newer versions switch to CPU after attempting GPU
>
> They don't. The server will estimate the required resources and will fit what it can on the GPU. The rest will run in CPU. The only time that the runner will use CPU when it was told to use the GPU is if the installation is broken in some way.

They (v0.9+) load the model onto the GPU, do something on the GPU for a short time, and then start a heavy load on CPU and memory without logging anything (with DEBUG=2).


@rick-github commented on GitHub (Jul 11, 2025):

Likely part of the model has been offloaded to system RAM because it doesn't all fit in the VRAM of the GPU. The parts in the VRAM get processed quickly because the GPU is faster. The part in system RAM gets processed slower because the CPU is slower. As a result, most of the time is spent waiting for the CPU to finish processing its part of the model.
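
A rough back-of-the-envelope illustration of that effect, with made-up numbers: the layer count and per-layer timings below are purely hypothetical and were not measured on any hardware in this thread.

```python
def per_token_ms(total_layers: int, gpu_layers: int,
                 gpu_ms_per_layer: float, cpu_ms_per_layer: float) -> tuple[float, float]:
    """Return (gpu_ms, cpu_ms) spent on one token when layers run one after another."""
    cpu_layers = total_layers - gpu_layers
    return gpu_layers * gpu_ms_per_layer, cpu_layers * cpu_ms_per_layer


# Hypothetical: 36 layers, 30 on the GPU, and the CPU 20x slower per layer.
gpu_ms, cpu_ms = per_token_ms(total_layers=36, gpu_layers=30,
                              gpu_ms_per_layer=0.5, cpu_ms_per_layer=10.0)
cpu_share = cpu_ms / (gpu_ms + cpu_ms)
print(f"GPU: {gpu_ms} ms, CPU: {cpu_ms} ms, CPU share of time: {cpu_share:.0%}")
```

With these assumed figures, only 6 of the 36 layers run on the CPU, yet they account for most of the time per token.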


@TrurlMcByte commented on GitHub (Jul 11, 2025):

> Likely part of the model has been offloaded to system RAM because it doesn't all fit in the VRAM of the GPU. The parts in the VRAM get processed quickly because the GPU is faster. The part in system RAM gets processed slower because the CPU is slower. As a result, most of the time is spent waiting for the CPU to finish processing its part of the model.

But v0.8.0 does everything in VRAM with the same query and the same model (qwen3:8b, for example).


@rick-github commented on GitHub (Jul 11, 2025):

Memory estimation logic has undergone work (and is still undergoing work, [#11090](https://github.com/ollama/ollama/pull/11090)) to minimize OOMs, resulting in more conservative estimates. As a result, some layers have been offloaded to the CPU. Until the memory work is completed, the model can be forced into VRAM by setting [`num_gpu`](https://github.com/ollama/ollama/issues/6950#issuecomment-2373663650).
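
For the API request from the original question, `num_gpu` would go in the `options` object next to `num_ctx`. A minimal sketch, under several assumptions: the host is the default `http://localhost:11434`, the model is `qwen3:8b` (mentioned earlier in the thread), `urllib` stands in for `requests` to keep the example dependency-free, and `99` is used as a layer count high enough to request every layer on the GPU. Forcing layers into VRAM this way bypasses the conservative estimate, so an out-of-memory failure is possible if the model does not actually fit.

```python
import json
import urllib.request

OLLAMA_HOST = "http://localhost:11434"  # assumption: default local server address
MODEL_NAME = "qwen3:8b"                 # assumption: model mentioned in the thread


def build_payload(prompt: str, num_gpu: int) -> dict:
    """Build an /api/generate payload that asks for num_gpu layers on the GPU."""
    return {
        "model": MODEL_NAME,
        "prompt": prompt,
        "stream": False,
        "options": {"temperature": 0.2, "num_ctx": 4096, "num_gpu": num_gpu},
    }


def generate(prompt: str, num_gpu: int = 99, host: str = OLLAMA_HOST) -> str:
    """Send the request and return the generated text."""
    req = urllib.request.Request(
        f"{host}/api/generate",
        data=json.dumps(build_payload(prompt, num_gpu)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=180) as resp:
        return json.load(resp)["response"]


# Usage: print(generate("Why is the sky blue?"))
```

Keeping `num_gpu` and `num_ctx` identical across requests should also avoid unnecessary reloads of the model.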

@TrurlMcByte commented on GitHub (Jul 11, 2025):

> Memory estimation logic has undergone work (and is still undergoing work, [#11090](https://github.com/ollama/ollama/pull/11090)) to minimize OOMs, resulting in more conservative estimates. As a result, some layers have been offloaded to the CPU. Until the memory work is completed, the model can be forced into VRAM by setting [`num_gpu`](https://github.com/ollama/ollama/issues/6950#issuecomment-2373663650).

I have tried many approaches. It looks like in some configurations with older cards, like my NVIDIA GeForce RTX 2080 Ti (compute capability 7.5), ollama-0.9.+ switches to the CPU in every case after loading the model into VRAM (even the small qwen3:0.6b).
nvtop shows load on the GPU in this case. Maybe the model has actually answered already, but the answer is ignored and the work starts again on the CPU.


@rick-github commented on GitHub (Jul 11, 2025):

No, the model does not get switched from GPU to CPU after the runner has started, unless an API request changes parameters (`num_gpu`, `num_ctx`, `use_mmap`, etc).


@TrurlMcByte commented on GitHub (Jul 11, 2025):

> No, the model does not get switched from GPU to CPU after the runner has started, unless an API request changes parameters (`num_gpu`, `num_ctx`, `use_mmap`, etc).

I can run tests for you, but I don't know how to collect GPU and CPU load graphs at the same time (my Prometheus died under this CPU load).
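
A lightweight alternative to a full monitoring stack is to sample both values into a CSV file. This is only a sketch and not something proposed in the thread: it assumes an NVIDIA card with `nvidia-smi` on the `PATH`, a Unix-like system (for `os.getloadavg`), and reads the first GPU only. The function names and CSV column names are made up for illustration.

```python
import csv
import os
import subprocess
import time

NVIDIA_SMI_QUERY = [
    "nvidia-smi",
    "--query-gpu=utilization.gpu,memory.used",
    "--format=csv,noheader,nounits",
]


def parse_gpu_line(line: str) -> tuple[int, int]:
    """Parse one 'utilization, memory' line such as '87, 5120'."""
    util, mem = (field.strip() for field in line.split(","))
    return int(util), int(mem)


def sample() -> tuple[float, int, int, float]:
    """Take one sample: (timestamp, GPU util %, VRAM used MiB, 1-minute load average)."""
    out = subprocess.run(NVIDIA_SMI_QUERY, capture_output=True,
                         text=True, check=True).stdout
    gpu_util, vram_mib = parse_gpu_line(out.splitlines()[0])  # first GPU only
    return time.time(), gpu_util, vram_mib, os.getloadavg()[0]


def record(path: str, seconds: int, interval: float = 1.0) -> None:
    """Append one sample per interval to a CSV file for the given duration."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["timestamp", "gpu_util_pct", "vram_used_mib", "loadavg_1m"])
        end = time.monotonic() + seconds
        while time.monotonic() < end:
            writer.writerow(sample())
            f.flush()  # keep data on disk even if the machine stalls
            time.sleep(interval)


# Usage: start record("load.csv", seconds=120) in one terminal, then send the API request.
```

The resulting file can be plotted afterwards, so nothing heavy has to run while the CPU is saturated.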


@rick-github commented on GitHub (Jul 11, 2025):

[Server logs](https://github.com/ollama/ollama/blob/main/docs/troubleshooting.md#how-to-troubleshoot-issues) will aid in debugging.

@TrurlMcByte commented on GitHub (Jul 11, 2025):

And what should `OLLAMA_DEBUG` be set to?


@rick-github commented on GitHub (Jul 11, 2025):

`OLLAMA_DEBUG=1` should be enough to determine the basic scope of the issue; `OLLAMA_DEBUG=2` can be tried later if there's not enough detail at level 1.


@TrurlMcByte commented on GitHub (Jul 11, 2025):

ollama v0.9.6: `OLLAMA_MODELS="/usr/share/ollama/.ollama/models" OLLAMA_DEBUG=1 OLLAMA_HOST="http://0.0.0.0:11434" ollama serve 2>&1 | tee server9.log` died without answering
ollama v0.8.0: `OLLAMA_MODELS="/usr/share/ollama/.ollama/models" OLLAMA_DEBUG=1 OLLAMA_HOST="http://0.0.0.0:11434" ollama serve 2>&1 | tee server8.log` fast reply

[server9.log](https://github.com/user-attachments/files/21187299/server9.log)
[server8.log](https://github.com/user-attachments/files/21187298/server8.log)

@TrurlMcByte commented on GitHub (Aug 3, 2025):

v0.10.1 works fine with the GPU!

Reference: github-starred/ollama#69537