[GH-ISSUE #10232] Model Randomly splits between CPU and GPU #6714

Closed
opened 2026-04-12 18:27:30 -05:00 by GiteaMirror · 5 comments

Originally created by @EduardDoronin on GitHub (Apr 11, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/10232

What is the issue?

As the title suggests, for some reason my ollama sometimes randomly splits between CPU and GPU, even though I've set it up to only use the GPU.

I have tried setting:
Environment="OLLAMA_ACCELERATOR=cuda"
to force it to use the GPU. That worked with early versions of Ollama, but ever since I upgraded to version 0.5.x or higher I've had the described problem. I can't see a pattern; when it splits, it happens entirely randomly. A restart of Ollama sometimes helps, but it will eventually go back to splitting between CPU and GPU.

![Image](https://github.com/user-attachments/assets/24b6d0ef-907f-4582-8a2e-4397aff3cdee)

![Image](https://github.com/user-attachments/assets/1947ee40-3cea-43ed-9e2a-7b10e2b4d096)

Any ideas?

Relevant log output


OS

No response

GPU

No response

CPU

No response

Ollama version

No response

GiteaMirror added the bug label 2026-04-12 18:27:30 -05:00

@rick-github commented on GitHub (Apr 11, 2025):

`OLLAMA_ACCELERATOR` is not an ollama configuration variable. ollama is splitting between VRAM and RAM because it thinks the model won't fit in VRAM. The memory estimation is sometimes not accurate, particularly if you are using flash attention. You can force ollama to load more layers into VRAM by setting `num_gpu`; see [here](https://github.com/ollama/ollama/issues/6950#issuecomment-2373663650). Choose a value of `num_gpu` that maximizes the amount of VRAM used. Note that depending on your OS/driver, setting it too high can cause [performance issues](https://github.com/ollama/ollama/issues/7584#issuecomment-2466715900).

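For context, a minimal sketch of setting `num_gpu` per request through the ollama REST API; the model name `llama3` and the value 48 are illustrative placeholders, not from the issue:

```shell
# Sketch: override the GPU layer count for a single request via the ollama API.
# "llama3" and num_gpu=48 are example values; pick the largest num_gpu that
# still fits in your VRAM.
curl http://localhost:11434/api/generate -d '{
  "model": "llama3",
  "prompt": "Why is the sky blue?",
  "options": { "num_gpu": 48 }
}'
```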

@EduardDoronin commented on GitHub (Apr 22, 2025):

Ah okay, is there not a way to specifically say "don't use anything but the GPU whatsoever", instead of having to guess the number of usable layers?


@rick-github commented on GitHub (Apr 22, 2025):

> Instead of having to guess the amount of usable layers?

That's what ollama is doing: it's calculating what it thinks will fit in order to get the best performance. If you don't agree, you can override ollama by setting `num_gpu`. If you want to force ollama to use only the GPU, set `num_gpu` to 999. The previous warning applies.

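As a hedged illustration of making that setting persistent, `num_gpu` can be baked into a derived model via a Modelfile; the base model `llama3` and the derived name are assumptions for the example:

```shell
# Sketch: persist num_gpu=999 (attempt to put all layers on GPU) in a
# derived model. "FROM llama3" is a placeholder; use your own model.
cat > Modelfile <<'EOF'
FROM llama3
PARAMETER num_gpu 999
EOF
ollama create llama3-gpu-only -f Modelfile
ollama run llama3-gpu-only
```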

@EduardDoronin commented on GitHub (Apr 22, 2025):

> > Instead of having to guess the amount of usable layers?
>
> That's what ollama is doing: it's calculating what it thinks will fit in order to get the best performance. If you don't agree, you can override ollama by setting `num_gpu`. If you want to force ollama to use only the GPU, set `num_gpu` to 999. The previous warning applies.

Okay, I tried doing that and now I am getting an entirely different error:

`Error: llama runner process has terminated: error loading model: unable to allocate CUDA0 buffer llama_model_load_from_file_impl: failed to load model`

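The allocation failure suggests 999 layers simply don't fit in VRAM. One common way to pick a workable value (a sketch, assuming an NVIDIA GPU with `nvidia-smi` installed) is to watch memory headroom while stepping `num_gpu` down until the model loads:

```shell
# Sketch: poll VRAM usage once per second while experimenting with num_gpu.
nvidia-smi --query-gpu=memory.used,memory.total --format=csv -l 1
```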

@rick-github commented on GitHub (Apr 22, 2025):

> Note that depending on your OS/driver, setting it too high can cause performance issues.
