[GH-ISSUE #7574] LLaMa 3.2 90B on multi GPU crashes #66883

Closed
opened 2026-05-04 08:36:24 -05:00 by GiteaMirror · 6 comments
Owner

Originally created by @BBOBDI on GitHub (Nov 8, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/7574

Originally assigned to: @dhiltgen on GitHub.

What is the issue?

Hello!

My problem may be similar to issue 7568: I think there is a problem with distributing the LLaMa 3.2 90B model across multiple GPUs. When the model runs on a single GPU (quantized), it works, but when it runs on multiple GPUs, it crashes.

On my server running Debian Bookworm with 4x Nvidia H100 NVL GPUs, I can easily run a 4-bit quantized version of the LLaMa 3.2 90B model. Here is the ollama run session in that case:

$ ollama run llama3.2-vision:90b-instruct-q4_K_M

>>> What do you see in this picture ? ./TeddyBear.jpg
Added image './TeddyBear.jpg'
The image shows a teddy bear.

And here is the ollama serve 0.4.0 output:
OLLAMA_SERVE_llama3.2-vision_90b-instruct-q4_K_M.log (https://github.com/user-attachments/files/17676883/OLLAMA_SERVE_llama3.2-vision_90b-instruct-q4_K_M.log)

But when I run an 8-bit quantized version of the model (distributed across my 4x Nvidia H100 GPUs), here is what I get:

$ ollama run llama3.2-vision:90b-instruct-q8_0

>>> What do you see in this picture ? ./TeddyBear.jpg
Added image './TeddyBear.jpg'
Error: POST predict: Post "http://127.0.0.1:40459/completion": EOF

And here is the ollama serve 0.4.0 output I get:
OLLAMA_SERVE_llama3.2-vision_90b-instruct-q8_0.log (https://github.com/user-attachments/files/17676927/OLLAMA_SERVE_llama3.2-vision_90b-instruct-q8_0.log)

The program crashes with a segmentation fault (as in ticket 7568 mentioned earlier). Can you take a look at this, please? Thanks in advance!
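A minimal sketch of how single-GPU and multi-GPU behaviour can be compared (the device index is illustrative, and the q8_0 weights may not fit on a single H100 NVL, so this only checks whether the crash is tied to the multi-GPU split):

$ # restart the server pinned to one GPU
$ CUDA_VISIBLE_DEVICES=0 ollama serve
$ # in another shell: check how the model is split across GPUs/CPU, then retry
$ ollama ps
$ ollama run llama3.2-vision:90b-instruct-q8_0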

OS

Linux

GPU

Nvidia

CPU

Intel

Ollama version

0.4.0

GiteaMirror added the bug, nvidia labels 2026-05-04 08:36:24 -05:00
Author
Owner

@BBOBDI commented on GitHub (Nov 8, 2024):

Oh dear... In all this log mess, I completely missed the following line:

time=2024-11-08T10:00:39.539Z level=WARN source=sched.go:137 msg="multimodal models don't support parallel requests yet"

Should I upgrade my issue to a feature request?

Author
Owner

@rick-github commented on GitHub (Nov 8, 2024):

No, this is still a bug. The warning just means the model is being loaded with OLLAMA_NUM_PARALLEL=1, overriding the value of 2 that you have configured.
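For reference, a minimal sketch of how OLLAMA_NUM_PARALLEL is typically set when starting the server (the value of 2 mirrors the configuration mentioned above; for multimodal models the scheduler currently forces it back to 1, which is what the warning reports):

$ OLLAMA_NUM_PARALLEL=2 ollama serve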

Author
Owner

@rick-github commented on GitHub (Nov 8, 2024):

#7568 has a similar CUDA error: unspecified launch failure when running llama3.2-vision on multiple GPUs.

Author
Owner

@dhiltgen commented on GitHub (Nov 8, 2024):

There was a CUDA version linking bug in 0.4.0 that will be fixed in 0.4.1, which should hopefully resolve this.
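A minimal sketch of upgrading once 0.4.1 is released, assuming the standard Linux install script (which replaces the installed binary in place):

$ curl -fsSL https://ollama.com/install.sh | sh
$ ollama --version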

Author
Owner

@dhiltgen commented on GitHub (Nov 8, 2024):

Quick update: this was not due to the link failure. There's a bug relating to the cross-attention implementation on CUDA with multiple GPUs.

Author
Owner

@dhiltgen commented on GitHub (Nov 8, 2024):

We'll track this under #7558.

Reference: github-starred/ollama#66883