[GH-ISSUE #8285] GPU runs at maximum load with 2 models #67357

Closed
opened 2026-05-04 10:04:22 -05:00 by GiteaMirror · 21 comments
Owner

Originally created by @RomanDrechsel on GitHub (Jan 2, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/8285

What is the issue?

Hi,

I use Ollama as the provider for the Continue extension for VS Code for tab autocompletion.

Since the last update, my GPU runs at maximum load as soon as two models are running at the same time, even if they are only very small models (e.g. nomic-embed-text for embeddings and qwen2.5-coder:0.5b for tab autocomplete).
The load remains at 100% until I stop one of the two models.

Before the last update, I had no problems using larger models (e.g. qwen2.5-coder:3b).

My OS is Manjaro Linux with kernel 6.12.4,
my hardware is an AMD Ryzen 9 9950X processor and an AMD Radeon RX 7900 XTX.

OS

Linux

GPU

AMD

CPU

AMD

Ollama version

0.5.4

GiteaMirror added the bug label 2026-05-04 10:04:22 -05:00

@rick-github commented on GitHub (Jan 2, 2025):

[Server logs](https://github.com/ollama/ollama/blob/main/docs/troubleshooting.md#how-to-troubleshoot-issues) will aid in debugging.


@systemofapwne commented on GitHub (Jan 2, 2025):

I can confirm this bug.
This occasionally happens out of the blue: sometimes everything works fine, and then suddenly the GPU sits at 100% and generation on one backend just stops. It eventually recovers, but quickly locks up again.

Interestingly, I configured my Ollama instance to allow only one model in VRAM, since I "only" have 20 GB and also need the VRAM for other tasks:

```env
# ENVs as set for my docker container
OLLAMA_KEEP_ALIVE=5m
OLLAMA_MAX_LOADED_MODELS=1
```

Yet I noticed that two different models were indeed loaded when this happened. At the time, I was interacting via OpenWebUI and VS Code (Continue), which used two different models.

I have bookmarked this report and will post logs once I have reproduced the issue. I just restarted my Ollama instance, and for now it seems to work fine.
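For reference, a quick sketch of how to check which models are actually resident at any moment, assuming the default port:

```shell
# List the models currently loaded in memory
ollama ps

# The same information via the HTTP API (default port assumed)
curl -s http://localhost:11434/api/ps
```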


@RomanDrechsel commented on GitHub (Jan 4, 2025):

Hi,

sorry for the delay. Here are the logs from when two models are running at the same time.

```
$ ollama ps
NAME                       ID              SIZE      PROCESSOR    UNTIL
qwen2.5-coder:0.5b         d392ed348d5b    2.4 GB    100% GPU     4 minutes from now
nomic-embed-text:latest    0a109f422b47    849 MB    100% GPU     4 minutes from now
```

[ollama.tar.gz](https://github.com/user-attachments/files/18306346/ollama.tar.gz)


@rick-github commented on GitHub (Jan 4, 2025):

Maybe some misunderstanding: the "100%" in the output of `ollama ps` doesn't mean the GPU is running at maximum speed; it means that the model is fully loaded into GPU VRAM. If you want to see the load on the GPU, use `nvtop`; recent releases support AMD GPUs.

If you actually mean that your GPU is running at maximum speed non-stop, there's no evidence of it in the log: the generations finish within a few seconds, and there are periods of inactivity. Note that only one of your GPUs is supported; all the inference is being done on the Radeon. Logs with `OLLAMA_DEBUG=1` set in the server environment may show more details.
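A minimal sketch of enabling this on a systemd-managed install (service name and paths assume the standard Linux install script):

```shell
# Add OLLAMA_DEBUG=1 to the service environment via a drop-in:
sudo systemctl edit ollama
#   -> in the editor, add:
#   [Service]
#   Environment="OLLAMA_DEBUG=1"
sudo systemctl restart ollama

# Follow the server log while reproducing the problem
journalctl -u ollama -f
```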


@RomanDrechsel commented on GitHub (Jan 4, 2025):

Hi,

I know `ollama ps` doesn't show the load of the GPU; I just wanted to show that only two very small models are running.

I started Ollama with `OLLAMA_DEBUG=1`; here are the logs:

[ollama.tar.gz](https://github.com/user-attachments/files/18306941/ollama.tar.gz)

I know my graphics card runs at maximum speed because Conky (my system monitor) shows it. It reads the data from `/sys/class/drm/card1/device/`.

![conky](https://github.com/user-attachments/assets/0025f1dc-f69a-4331-90b1-007d25b14544)
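If anyone wants to check the same value without a monitoring tool, the amdgpu driver exposes it directly in sysfs (a sketch; the card index matches my system and may differ on yours):

```shell
# GPU utilization in percent, as reported by the amdgpu driver
cat /sys/class/drm/card1/device/gpu_busy_percent

# Poll it once per second
watch -n 1 cat /sys/class/drm/card1/device/gpu_busy_percent
```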


@rick-github commented on GitHub (Jan 4, 2025):

Does your CPU also go to 100%?


@RomanDrechsel commented on GitHub (Jan 4, 2025):

No, it does almost nothing.


@rick-github commented on GitHub (Jan 4, 2025):

If you unload the models (`ollama stop nomic-embed-text:latest ; ollama stop qwen2.5-coder:0.5b`) and then load them without a completion (`ollama run nomic-embed-text:latest "" ; ollama run qwen2.5-coder:0.5b ""`), does the GPU go to 100%? If you then do a completion (`ollama run qwen2.5-coder:0.5b hello`), what happens to the GPU? Was your previous version of Ollama 0.5.3 or something earlier?
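For convenience, the same sequence as separate commands (model names taken from the `ollama ps` output above):

```shell
# Unload both models
ollama stop nomic-embed-text:latest
ollama stop qwen2.5-coder:0.5b

# Load them again without triggering a completion
ollama run nomic-embed-text:latest ""
ollama run qwen2.5-coder:0.5b ""

# Now trigger a completion and watch the GPU load
ollama run qwen2.5-coder:0.5b hello
```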


@RomanDrechsel commented on GitHub (Jan 4, 2025):

Without a completion, my GPU stays idle, even with two models loaded.
As soon as I start a completion with `ollama run qwen2.5-coder:0.5b hello`, the GPU goes to 100%.


@rick-github commented on GitHub (Jan 4, 2025):

And if you unload `qwen2.5-coder:0.5b` when the GPU is at 100%, does it drop back to 0? If you unload `nomic-embed-text:latest` when the GPU is at 100% after the qwen completion, does the GPU stay at high load?


@RomanDrechsel commented on GitHub (Jan 4, 2025):

I experimented a little: as long as two or more models are running, the GPU stays at 100%.
It doesn't matter which model I stop; as soon as only one model is running, the GPU goes back to normal.


@RomanDrechsel commented on GitHub (Jan 10, 2025):

Since today, Ollama is no longer usable for me...

The GPU load stays at 100% even if only one model is running,
even without any prompt, just
`ollama run qwen2.5-coder:1.5b`

A few days ago I switched to the Docker version (ROCm), but the problem with the GPU load with only one model started today.

Edit: OK, on the default Linux installation (without Docker), one model at a time seems to work...


@rick-github commented on GitHub (Jan 10, 2025):

This is likely a llama.cpp problem. There was a bug some time ago (https://github.com/ggerganov/llama.cpp/issues/5280) that sounds very similar; a solution that worked for some users was to set `GPU_MAX_HW_QUEUES=1` in the environment.
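If you're running the Docker (ROCm) image instead of the native install, the variable has to be passed into the container; a minimal sketch, assuming the official `ollama/ollama:rocm` image and the usual device flags:

```shell
# Pass the workaround into the container environment (ROCm image assumed)
docker run -d --name ollama \
  --device /dev/kfd --device /dev/dri \
  -v ollama:/root/.ollama \
  -p 11434:11434 \
  -e GPU_MAX_HW_QUEUES=1 \
  ollama/ollama:rocm
```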


@RomanDrechsel commented on GitHub (Jan 10, 2025):

This is already in my `/etc/environment`.
Unfortunately, it did not solve the problem.


@rick-github commented on GitHub (Jan 10, 2025):

Can you confirm that it's in the environment of the server? Try this:

```shell
sudo cat /proc/$(pidof ollama)/environ | tr \\0 \\n
```
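Or, to check for that specific variable (the same command with a grep appended):

```shell
# Check specifically for the workaround variable in the server's environment
sudo cat /proc/$(pidof ollama)/environ | tr '\0' '\n' | grep GPU_MAX_HW_QUEUES
```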

@RomanDrechsel commented on GitHub (Jan 10, 2025):

Well, OK, never mind :)

The `/etc/environment` entry doesn't seem to work with the Ollama service.
In the output of `sudo cat /proc/$(pidof ollama)/environ | tr \\0 \\n` I didn't see the environment variable, so I added
`Environment="GPU_MAX_HW_QUEUES=1"` to `/etc/systemd/system/ollama.service`.

Now I can run multiple models at the same time again.
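For reference, the same change can be made as a systemd drop-in, which survives reinstalls of the unit file (a sketch, assuming the standard service name):

```shell
# Create a drop-in instead of editing ollama.service directly
sudo mkdir -p /etc/systemd/system/ollama.service.d
printf '[Service]\nEnvironment="GPU_MAX_HW_QUEUES=1"\n' \
  | sudo tee /etc/systemd/system/ollama.service.d/override.conf

# Reload units and restart the server
sudo systemctl daemon-reload
sudo systemctl restart ollama
```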

Thanks for your help.


@rick-github commented on GitHub (Jan 11, 2025):

`/etc/environment` is usually for interactive logins (X/console, ssh), not services.

Glad the issue is resolved.


@melroy89 commented on GitHub (Jan 24, 2025):

> This is likely a llama.cpp problem. There was a bug some time ago ([ggerganov/llama.cpp#5280](https://github.com/ggerganov/llama.cpp/issues/5280)) that sounds very similar; a solution that worked for some users was to set `GPU_MAX_HW_QUEUES=1` in the environment.

I also see this bug on my AMD 7900 XTX system using ollama v0.5.7.

If two models are loaded, I also get 100% GPU load in idle:

![Image](https://github.com/user-attachments/assets/d0c5cfd7-8c78-4642-a844-b4fcabb77c77)

You might want to reopen the issue and solve it once and for all. `GPU_MAX_HW_QUEUES` feels like a workaround.


@rick-github commented on GitHub (Jan 24, 2025):

It's a ROCm driver issue; not much we can do about it. Apparently a fix is in ROCm 6.2, and there's a pending [PR](https://github.com/ollama/ollama/pull/6969) to upgrade.


@systemofapwne commented on GitHub (Jan 24, 2025):

> It's a ROCm driver issue; not much we can do about it. Apparently a fix is in ROCm 6.2, and there's a pending [PR](https://github.com/ollama/ollama/pull/6969) to upgrade.

I am not sure whether this is ROCm-only, or whether my problem is unrelated to the one described here but has the same symptoms.
I have an NVIDIA RTX 4000 SFF (Ada), and from time to time (it is extremely hard to reproduce) two models get stuck in VRAM (even when configured for only one persistent model), eating all GPU cycles.


@rick-github commented on GitHub (Jan 24, 2025):

It's unlikely to be the same issue. I suggest creating a new issue, adding logs and info about the system.

Reference: github-starred/ollama#67357