[GH-ISSUE #5455] ollama does not work on ALL GPU automatically #3411

Closed
opened 2026-04-12 14:03:07 -05:00 by GiteaMirror · 5 comments
Owner

Originally created by @HeroSong666 on GitHub (Jul 3, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/5455

What is the issue?

When I use ollama:0.1.38, I start it with the following command:

docker run -d --gpus=all -v /root/ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama

and ollama automatically uses all 4 GPU cards for inference.

When I upgrade to ollama:0.1.48 and use the same command, it only uses 1 GPU for inference:

(screenshot: 1)

I remember the running 'Processes' entry used to be "ollama/ollama" or something similar, not '...unners/cuda_v11/ollama_llama_server'. Why does this happen?

Also, when I use

docker run -d --gpus=all -v /root/ollama:/root/.ollama -p 11434:11434 -e OLLAMA_SCHED_SPREAD=1 --name ollama ollama/ollama:0.1.48

to force it to use all 4 GPUs for inference, I noticed that the combined utilization of the four GPU cards does not reach 100%, let alone 400%.

Here is the GPU usage I monitored:
gpu_usage.csv

I think ollama does not make full use of GPU resources. Why is this?
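For reference, a CSV like the attached one can be produced with nvidia-smi's built-in query mode (these are standard nvidia-smi query fields; the 1-second interval and output filename are arbitrary choices):

```shell
# Log per-GPU utilization and memory use to CSV once per second;
# press Ctrl-C to stop. Requires the NVIDIA driver's nvidia-smi tool.
nvidia-smi \
  --query-gpu=timestamp,index,utilization.gpu,memory.used \
  --format=csv \
  -l 1 > gpu_usage.csv
```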

OS

Linux

GPU

Nvidia

CPU

Intel

Ollama version

0.1.48

GiteaMirror added the bug label 2026-04-12 14:03:07 -05:00
Author
Owner

@mxyng commented on GitHub (Jul 8, 2024):

Can you confirm if the model is fully loaded onto the one GPU? If it is, this is the expected behaviour. Ollama will use a single GPU if the model fits since splitting the model between multiple will incur a performance hit. This frees the other GPUs for other models

If you absolutely want it spread across all GPUs, you can disable this behaviour with the environment variable OLLAMA_SCHED_SPREAD=1
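As a quick way to check this, assuming a recent build that includes the `ollama ps` command, you can confirm where the model landed from inside the container and cross-check with the host's GPU tooling:

```shell
# Show loaded models and their placement; in recent versions the
# PROCESSOR column reads e.g. "100% GPU" when the model fits on GPU.
docker exec -it ollama ollama ps

# Cross-check per-GPU memory use on the host.
nvidia-smi --query-gpu=index,memory.used --format=csv
```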

Author
Owner

@HeroSong666 commented on GitHub (Jul 9, 2024):

Can you confirm if the model is fully loaded onto the one GPU? If it is, this is the expected behaviour. Ollama will use a single GPU if the model fits since splitting the model between multiple will incur a performance hit. This frees the other GPUs for other models

If you absolutely want it spread across all GPUs, you can disable this behaviour with the environment variable OLLAMA_SCHED_SPREAD=1

The model is fully loaded onto the one GPU. In my usage scenario there may be hundreds of people using ollama, so I would like it to use as many GPUs as possible to reduce inference time. But in fact I noticed that if I force the model to spread across all GPUs, the time of a single inference increases (I have not tested it with many users). Do you have any good advice for this situation?
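For a many-user deployment, one knob worth knowing about (not mentioned in this thread, so treat it as a pointer to the docs rather than the maintainers' advice) is OLLAMA_NUM_PARALLEL, which raises the number of simultaneous requests served by one loaded model instead of sharding a single request across GPUs. The value below is illustrative:

```shell
# Sketch: keep the model on one GPU but serve more concurrent
# requests per loaded model. The value 4 is an arbitrary example.
docker run -d --gpus=all \
  -v /root/ollama:/root/.ollama \
  -p 11434:11434 \
  -e OLLAMA_NUM_PARALLEL=4 \
  --name ollama ollama/ollama:0.1.48
```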

Author
Owner

@HeroSong666 commented on GitHub (Jul 9, 2024):

Can you confirm if the model is fully loaded onto the one GPU? If it is, this is the expected behaviour. Ollama will use a single GPU if the model fits since splitting the model between multiple will incur a performance hit. This frees the other GPUs for other models

If you absolutely want it spread across all GPUs, you can disable this behaviour with the environment variable OLLAMA_SCHED_SPREAD=1

Also, when I run a model with a large number of parameters (for example, qwen2-72b), ollama's inference speed is quite slow, yet the combined usage of the 4 GPUs is far from 400% (at most about 120%). Why does this happen? Can you give me some optimization suggestions?

Author
Owner

@dhiltgen commented on GitHub (Jul 24, 2024):

We'll get the docs improved, but in short, use OLLAMA_SCHED_SPREAD if you want to force it to spread over all your GPUs.

Author
Owner

@EvelynBai commented on GitHub (Nov 8, 2024):

Can you confirm if the model is fully loaded onto the one GPU? If it is, this is the expected behaviour. Ollama will use a single GPU if the model fits since splitting the model between multiple will incur a performance hit. This frees the other GPUs for other models
If you absolutely want it spread across all GPUs, you can disable this behaviour with the environment variable OLLAMA_SCHED_SPREAD=1

Also, when I run a model with a large number of parameters (for example, qwen2-72b), ollama's inference speed is quite slow, yet the combined usage of the 4 GPUs is far from 400% (at most about 120%). Why does this happen? Can you give me some optimization suggestions?

Same problem here. Did you solve the problem?
(screenshot: 截屏2024-11-08 下午4 27 46)

Reference: github-starred/ollama#3411