[GH-ISSUE #4044] Problems with more GPUs using v0.1.33-rc5 #49020

Closed
opened 2026-04-28 10:36:33 -05:00 by GiteaMirror · 3 comments
Owner

Originally created by @cBrainAI on GitHub (Apr 30, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/4044

Originally assigned to: @dhiltgen on GitHub.

What is the issue?

I am testing the fantastic(!) new features with OLLAMA_NUM_PARALLEL and OLLAMA_MAX_LOADED_MODELS in v0.1.33-rc5.
I am running ollama using docker on a machine with two RTX 4090 cards.

Unfortunately, it seems that ollama does not use both graphics cards with v0.1.33-rc5. It worked perfectly with previous versions (I have just tested with v0.1.32).

It does not matter whether I leave the environment variables unset, set them to 1, or set them to e.g. 4.

As you can see in the log below, ollama detects the 2 GPUs:

ollama  | time=2024-04-30T09:37:32.070Z level=INFO source=images.go:821 msg="total blobs: 68"
ollama  | time=2024-04-30T09:37:32.071Z level=INFO source=images.go:828 msg="total unused blobs removed: 0"
ollama  | time=2024-04-30T09:37:32.071Z level=INFO source=routes.go:1074 msg="Listening on [::]:11434 (version 0.1.33-rc5)"
ollama  | time=2024-04-30T09:37:32.072Z level=INFO source=payload.go:30 msg="extracting embedded files" dir=/tmp/ollama3211222627/runners
ollama  | time=2024-04-30T09:37:34.328Z level=INFO source=payload.go:44 msg="Dynamic LLM libraries [cpu_avx cpu_avx2 cuda_v11 rocm_v60002 cpu]"
ollama  | time=2024-04-30T09:37:34.328Z level=INFO source=gpu.go:96 msg="Detecting GPUs"
ollama  | time=2024-04-30T09:37:34.367Z level=INFO source=gpu.go:101 msg="detected GPUs" library=/tmp/ollama3211222627/runners/cuda_v11/libcudart.so.11.0 count=2
ollama  | time=2024-04-30T09:37:34.367Z level=INFO source=cpu_common.go:11 msg="CPU has AVX2"

But I can see (using nvtop) that only one GPU is used during prompt evaluation.

OS

Docker

GPU

Nvidia

CPU

AMD

Ollama version

v0.1.33-rc5

GiteaMirror added the gpu, bug, amd labels 2026-04-28 10:36:33 -05:00
Author
Owner

@dhiltgen commented on GitHub (May 1, 2024):

@cBrainAI the algorithm in this release tries to load each model onto a single card if it fits, whereas in prior releases we always spread the model over all available GPUs even when it wasn't necessary. To leverage both cards in your setup, you'll either need to load 1 model that is larger than a single card can fit, or load 2 models. If you've tried that and are still seeing only 1 GPU used, can you share a few more details about your scenario? (nvidia-smi output might also help show where it's loaded)
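
The placement behaviour described in this comment can be illustrated with a minimal sketch. This is not Ollama's scheduler code: `place_model` is a hypothetical helper, memory is modelled as a plain byte count, and picking the card with the most free memory is an assumption made only for the illustration.

```python
from typing import List, Sequence


def place_model(model_bytes: int, free_bytes: Sequence[int]) -> List[int]:
    """Return the indices of the GPUs a model would be placed on.

    Hypothetical helper: prefer a single card when the model fits,
    otherwise spread over all cards, otherwise report no GPU placement.
    """
    # Candidate single card: the one with the most free memory
    # (an assumption for this sketch, ties go to the lowest index).
    best = max(range(len(free_bytes)), key=lambda i: free_bytes[i], default=None)
    if best is not None and free_bytes[best] >= model_bytes:
        return [best]

    # The model does not fit on one card: spread it over every card
    # if their combined free memory is sufficient.
    if sum(free_bytes) >= model_bytes:
        return list(range(len(free_bytes)))

    # Not enough GPU memory in total.
    return []


GiB = 1024 ** 3
two_cards = [24 * GiB, 24 * GiB]      # e.g. two 24 GiB cards
small = place_model(8 * GiB, two_cards)    # fits on one card
large = place_model(40 * GiB, two_cards)   # needs both cards
```

Under these assumptions a model that fits on one card never touches the second one, which matches the single-GPU usage reported above; only a second, different model or an oversized model brings the other card into play.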

Author
Owner

@cBrainAI commented on GitHub (May 2, 2024):

@dhiltgen I can confirm that this is exactly how it is working :-)
I ran a test with 2 batches of requests in parallel with the same model (one that could fit on one card), and it used just one GPU. Changing the model for one of the batches made both GPUs work.
I would (of course) prefer that both GPUs were working at the same time, even with the same model.
(Just a simple "round-robin" where each request is sent to the next GPU in line.)
My setup is a service, that uses the same model for a lot of users - where I try to get the most throughput out of my hardware.

Thank you for a fast and precise response to my issue :-)
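
The "round-robin" idea in this comment can be sketched on the client side. Nothing here is an Ollama feature: the `RoundRobin` class is hypothetical, and the two endpoint URLs assume a setup not described in the thread, namely two separate server instances, each serving the same model from its own GPU.

```python
import itertools
import threading
from typing import Iterable


class RoundRobin:
    """Hand out targets in a fixed rotating order (thread-safe)."""

    def __init__(self, targets: Iterable[str]) -> None:
        items = list(targets)
        if not items:
            raise ValueError("at least one target is required")
        self._cycle = itertools.cycle(items)
        self._lock = threading.Lock()

    def next(self) -> str:
        # The lock keeps the rotation consistent when several
        # request-handling threads ask for a target at once.
        with self._lock:
            return next(self._cycle)


# Hypothetical endpoints, one server instance per GPU.
picker = RoundRobin(["http://localhost:11434", "http://localhost:11435"])
order = [picker.next() for _ in range(4)]
```

Each incoming request would call `picker.next()` and be sent to the returned endpoint, so consecutive requests alternate between the two instances.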

Author
Owner

@dhiltgen commented on GitHub (May 2, 2024):

Loading the same model multiple times isn't currently supported, but it is something we may add in the future. As a workaround, you can use a Modelfile (https://github.com/ollama/ollama/blob/main/docs/modelfile.md) to create 2 variations that are functionally equivalent but slightly different.
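
A minimal sketch of the Modelfile workaround, under stated assumptions: the base model name `llama3`, the variant names, and the choice of a slightly different `temperature` as the distinguishing line are all hypothetical. The thread does not say which difference is enough for the two variants to be loaded separately, so treat this as a starting point to test, not a confirmed recipe.

```python
from typing import List


def variant_modelfile(base_model: str, temperature: float) -> str:
    """Build Modelfile text for one variant of a base model."""
    # FROM and PARAMETER are standard Modelfile instructions; the
    # temperature value is only there to make the variants differ.
    return f"FROM {base_model}\nPARAMETER temperature {temperature}\n"


def create_command(name: str, modelfile_path: str) -> List[str]:
    """Return the `ollama create` invocation for a variant (not executed here)."""
    return ["ollama", "create", name, "-f", modelfile_path]


# Two near-identical variants of a hypothetical base model.
variant_a = variant_modelfile("llama3", 0.7)
variant_b = variant_modelfile("llama3", 0.71)
cmd_a = create_command("llama3-a", "Modelfile.a")
```

After writing each text to its own file and running the corresponding command, one batch of requests would target `llama3-a` and the other the second variant.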
Reference: github-starred/ollama#49020