[GH-ISSUE #14632] Ollama underutilizes available GPU VRAM, causing out of memory #35240

Open
opened 2026-04-22 19:37:30 -05:00 by GiteaMirror · 8 comments

Originally created by @i418c on GitHub (Mar 5, 2026).
Original GitHub issue: https://github.com/ollama/ollama/issues/14632

What is the issue?

I'm trying to run multiple models side by side, but Ollama improperly puts both models on the same GPUs even when others are available. Tinyllama could easily fit in full on GPU0, but Ollama doesn't put it there.

Environment (see the sketch after this list):
- OLLAMA_KEEP_ALIVE="30m"
- OLLAMA_FLASH_ATTENTION=true
- OLLAMA_LOAD_TIMEOUT="15m"
- OLLAMA_CONTEXT_LENGTH=80000
- OLLAMA_NUM_PARALLEL=2
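
For reference, a minimal `docker run` sketch with these variables (image, volume, and port are assumed defaults; the reporter appears to use Compose, where the same values would go under the service's `environment:` key):

```shell
# Hedged sketch of the reported server environment, not the reporter's exact setup.
docker run -d --gpus=all --name ollama-gpu-1 \
  -e OLLAMA_KEEP_ALIVE="30m" \
  -e OLLAMA_FLASH_ATTENTION=true \
  -e OLLAMA_LOAD_TIMEOUT="15m" \
  -e OLLAMA_CONTEXT_LENGTH=80000 \
  -e OLLAMA_NUM_PARALLEL=2 \
  -v ollama:/root/.ollama -p 11434:11434 ollama/ollama
```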

Relevant log output

ollama-gpu-1  | time=2026-03-05T03:53:42.460Z level=INFO source=sched.go:565 msg="loaded runners" count=2
ollama-gpu-1  | time=2026-03-05T03:53:42.460Z level=INFO source=server.go:1350 msg="waiting for llama runner to start responding"
ollama-gpu-1  | time=2026-03-05T03:53:42.460Z level=INFO source=server.go:1388 msg="llama runner started in 2.50 seconds"
ollama-gpu-1  | CUDA error: out of memory
ollama-gpu-1  |   current device: 0, in function ggml_backend_cuda_synchronize at //ml/backend/ggml/ggml/src/ggml-cuda/ggml-cuda.cu:2981
...
ollama-gpu-1  | time=2026-03-05T03:53:42.739Z level=ERROR source=server.go:1610 msg="post predict" error="Post \"http://127.0.0.1:33587/completion\": EOF"


$ nvidia-smi 
Thu Mar  5 03:53:59 2026       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 575.57.08              Driver Version: 575.57.08      CUDA Version: 12.9     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 3090        On  |   00000000:0C:00.0 Off |                  N/A |
| 30%   47C    P8             24W /  350W |     264MiB /  24576MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA GeForce RTX 3090        On  |   00000000:0D:00.0 Off |                  N/A |
| 30%   41C    P8             21W /  350W |   23528MiB /  24576MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   2  NVIDIA GeForce RTX 4090        On  |   00000000:0E:00.0 Off |                  Off |
|  0%   44C    P8             39W /  480W |   24042MiB /  24564MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A            8444      C   /usr/bin/ollama                         254MiB |
|    1   N/A  N/A            8444      C   /usr/bin/ollama                       23518MiB |
|    2   N/A  N/A            8444      C   /usr/bin/ollama                       24032MiB |
+-----------------------------------------------------------------------------------------+
$ docker exec ollama-ollama-gpu-1 ollama ps
NAME                  ID              SIZE      PROCESSOR    CONTEXT    UNTIL               
tinyllama:latest      2644915ede35    985 MB    100% GPU     2048       29 minutes from now    
glm-4.7-flash:q8_0    a035bf4bc812    49 GB     100% GPU     80000      29 minutes from now

OS

Docker

GPU

Nvidia

CPU

AMD

Ollama version

0.17.6

GiteaMirror added the bug label 2026-04-22 19:37:30 -05:00

@rick-github commented on GitHub (Mar 5, 2026):

tinyllama runs on the llama.cpp engine by default, which is not very accurate when it comes to estimating the amount of memory required to run the model. Try setting OLLAMA_NEW_ENGINE=1 in the server environment to use the ollama engine, which has better memory management.
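
A sketch of what that looks like for the containerized setup above (only the extra `-e` flag matters; other flags are placeholders and the server must be restarted for it to take effect):

```shell
# Hedged sketch: add OLLAMA_NEW_ENGINE=1 to the server's environment and restart the container.
docker run -d --gpus=all \
  -e OLLAMA_NEW_ENGINE=1 \
  -v ollama:/root/.ollama -p 11434:11434 ollama/ollama
```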


@i418c commented on GitHub (Mar 5, 2026):

Using the new engine doesn't make a difference. I had tried with qwen3.5:0.8b-q8_0 and granite4:350m-h-q8_0 prior to tinyllama and they both had the same issue.


@rick-github commented on GitHub (Mar 5, 2026):

Can you provide the logs of a failure from both unset OLLAMA_NEW_ENGINE and OLLAMA_NEW_ENGINE=1?


@i418c commented on GitHub (Mar 5, 2026):

[New Engine](https://gist.github.com/i418c/d32e24689a46ac4deee202642615c475)
[Old Engine](https://gist.github.com/i418c/b23311383e3c1fcfb16416813223a630)

The old engine eventually figured out how to allocate enough memory on this run, so the issue isn't entirely deterministic.


@rick-github commented on GitHub (Mar 5, 2026):

It looks like the old engine failed during a completion, while the model failed to complete loading in the new engine. That needs investigation. In the meantime you can try some of the OOM mitigations shown [here](https://github.com/ollama/ollama/issues/8597#issuecomment-2614533288). Probably the easiest would be to set OLLAMA_GPU_OVERHEAD.
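
As a rough illustration of that mitigation (OLLAMA_GPU_OVERHEAD reserves a portion of VRAM per GPU, specified in bytes; the 2 GiB figure below is just an example value to tune):

```shell
# Hedged sketch: reserve roughly 2 GiB of headroom per GPU and restart the server container.
docker run -d --gpus=all \
  -e OLLAMA_GPU_OVERHEAD=2147483648 \
  -v ollama:/root/.ollama -p 11434:11434 ollama/ollama
```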


@i418c commented on GitHub (Mar 5, 2026):

I pumped OLLAMA_GPU_OVERHEAD as high as I could get it without pushing GLM 4.7 onto the 3rd GPU. As far as I can tell, it didn't change how the application decided where to allocate the model. I also tried qwen3.5:0.8b-q8_0 again with OLLAMA_GPU_OVERHEAD set, and it fails completely like tinyllama on the new engine.

[Qwen log](https://gist.github.com/i418c/8c14f75fd7719931902c45de8bf68846)

I also tried setting CUDA_DEVICE_ORDER=PCI_BUS_ID on the container to no effect.


@rick-github commented on GitHub (Mar 5, 2026):

If you load tinyllama first does it work better?


@i418c commented on GitHub (Mar 5, 2026):

Yes, because it pushes GLM 4.7 to the 3rd GPU. That's not great for performance, though, since my GPUs are already limited to PCIe 3.0 x1 speeds.
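
For anyone reproducing the load-order experiment above, one way to control it is to preload the small model first with an empty generate request (container name and model tags taken from this thread; assumes the API is reachable on the default port):

```shell
# Sketch: an empty /api/generate request loads a model without generating anything,
# so sending it for tinyllama first lets the scheduler place it before the larger model.
curl http://localhost:11434/api/generate -d '{"model": "tinyllama:latest"}'
curl http://localhost:11434/api/generate -d '{"model": "glm-4.7-flash:q8_0"}'
```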

Reference: github-starred/ollama#35240