[GH-ISSUE #5476] Scheduler attempts to load model split over cuda + rocm GPUs #3423

Closed
opened 2026-04-12 14:04:37 -05:00 by GiteaMirror · 5 comments

Originally created by @sksonic on GitHub (Jul 4, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/5476

Originally assigned to: @dhiltgen on GitHub.

### What is the issue?

I have two GPUs of mixed brands: NVIDIA (P40) + AMD (RX 7900 XTX).

I am able to load smaller models; these go to the P40 first. When loading a model larger than the P40 can fit, the malloc operation seems to try to allocate the full model size on the first GPU, even though the "offload to cuda" log shows the layers split across the two GPUs.

Error is: allocating 39979.48 MiB on device 0: cudaMalloc failed: out of memory

```
ollama[8945]: time=2024-07-03T23:34:38.947+04:00 level=INFO source=memory.go:309 msg="offload to cuda" layers.requested=-1 layers.model=81 layers.offload=81 layers.split=41,40 memory.available="[23.7 GiB 23.5 GiB]" memory.required.full="44.4 GiB" memory.required.partial="44.4 GiB" memory.required.kv="1.2 GiB" memory.required.allocations="[22.6 GiB 21.8 GiB]" memory.weights.total="39.5 GiB" memory.weights.repeating="38.7 GiB" memory.weights.nonrepeating="822.0 MiB" memory.graph.full="1.1 GiB" memory.graph.partial="1.1 GiB"
ollama[8945]: time=2024-07-03T23:34:38.947+04:00 level=INFO source=server.go:368 msg="starting llama server" cmd="/tmp/ollama3425135909/runners/cuda_v11/ollama_llama_server --model /usr/share/ollama/.ollama/models/blobs/sha256-6baa2a027ec7595d421d151fec74dd338a15acebb83e52510a67e08fa4dd7b71 --ctx-size 4000 --batch-size 512 --embedding --log-disable --n-gpu-layers 81 --parallel 1 --tensor-split 41,40 --tensor-split 41,40 --port 43421
...
ggml_backend_cuda_buffer_type_alloc_buffer: allocating 39979.48 MiB on device 0: cudaMalloc failed: out of memory
```
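
For context, the `layers.split=41,40` in that log is consistent with dividing the 81 model layers in proportion to the free VRAM reported for each card. Below is a minimal Go sketch of that arithmetic, illustrative only and assuming simple proportional rounding; it is not the actual memory.go scheduling code, and the function name is made up for this example.

```go
package main

import "fmt"

// splitLayers divides nLayers across GPUs in proportion to each GPU's free
// VRAM, rounding the earlier entries and giving the remainder to the last
// GPU. Illustrative sketch only; not the actual memory.go logic.
func splitLayers(nLayers int, freeGiB []float64) []int {
	var total float64
	for _, f := range freeGiB {
		total += f
	}
	split := make([]int, len(freeGiB))
	assigned := 0
	for i, f := range freeGiB {
		if i == len(freeGiB)-1 {
			split[i] = nLayers - assigned // remainder goes to the last GPU
			break
		}
		split[i] = int(float64(nLayers)*f/total + 0.5) // round to nearest
		assigned += split[i]
	}
	return split
}

func main() {
	// Values from the "offload to cuda" log line above: 81 layers,
	// 23.7 GiB and 23.5 GiB reported free on the two cards.
	fmt.Println(splitLayers(81, []float64{23.7, 23.5})) // prints [41 40]
}
```

Running it with the values from the log prints `[41 40]`, matching the split the scheduler chose before the allocation failed on device 0.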

### OS

Linux

### GPU

Nvidia, AMD

### CPU

AMD

### Ollama version

0.1.48 (the issue also occurred on a previous version)
GiteaMirror added the gpu, nvidia, amd, bug labels 2026-04-12 14:04:37 -05:00

@jmorganca commented on GitHub (Jul 4, 2024):

Sorry about this - will work on fixing it


@dhiltgen commented on GitHub (Jul 5, 2024):

Can you share a server log with OLLAMA_DEBUG=1 set when it tries to load across both cards? (It's not supposed to, but it sounds like there's a logic error someplace leading it to make a mistake during scheduling)


@sksonic commented on GitHub (Jul 6, 2024):

Attached the server log: [server.log](https://github.com/user-attachments/files/16115182/server.log)


@sksonic commented on GitHub (Jul 10, 2024):

This is still happening on the latest release.


@dhiltgen commented on GitHub (Jul 22, 2024):

There's a logic flaw somewhere in the scheduler where we're accidentally trying to split a model over mixed GPU types. We can only spread a model across GPUs of the same brand. A mixed-brand setup like this can be used to run different models, but not to split a single model.
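
To illustrate the constraint being described, here is a minimal Go sketch that buckets GPUs by backend library before any single-model split is considered. The type and field names are assumptions made for this example; this is not Ollama's actual scheduler code or the eventual fix.

```go
package main

import "fmt"

// gpuInfo is a simplified stand-in for a per-GPU record; the field names are
// assumptions for illustration, not Ollama's actual types.
type gpuInfo struct {
	id      string
	library string // backend library, e.g. "cuda" or "rocm"
	freeGiB float64
}

// groupByLibrary buckets GPUs by backend library so that a single model is
// only ever split within one bucket, never across cuda and rocm together.
func groupByLibrary(gpus []gpuInfo) map[string][]gpuInfo {
	groups := make(map[string][]gpuInfo)
	for _, g := range gpus {
		groups[g.library] = append(groups[g.library], g)
	}
	return groups
}

func main() {
	gpus := []gpuInfo{
		{id: "GPU-0", library: "cuda", freeGiB: 23.7}, // Tesla P40
		{id: "GPU-1", library: "rocm", freeGiB: 23.5}, // RX 7900 XTX
	}
	for lib, group := range groupByLibrary(gpus) {
		fmt.Printf("%s: %d GPU(s) available for a single-model split\n", lib, len(group))
	}
}
```

With the P40 in a cuda bucket and the 7900 XTX in a rocm bucket, a single model would only ever be split within one bucket, so the 44.4 GiB model in this report could not be scheduled across both cards together.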

Reference: github-starred/ollama#3423