[GH-ISSUE #10318] How to force all layers to load to GPU instead of only partly? #68831

Closed
opened 2026-05-04 15:22:41 -05:00 by GiteaMirror · 5 comments
Owner

Originally created by @ddrcrow on GitHub (Apr 17, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/10318

What is the issue?

I am using two RTX 4090s (2×24 GB) to run qwq:32b, but it is very slow even with prompts far smaller than what the model supports. I checked the logs and found that not all layers were assigned to the GPUs; about half went to the CPU. I suspect this is why Ollama is running slowly. Am I right? If so, is it possible to force all layers onto the GPUs when using `ollama serve` or `ollama run`?

load_tensors: loading model tensors, this can take a while... (mmap = false)
load_tensors: layer 0 assigned to device CPU
load_tensors: layer 1 assigned to device CPU
load_tensors: layer 2 assigned to device CPU
load_tensors: layer 3 assigned to device CPU
load_tensors: layer 4 assigned to device CPU
load_tensors: layer 5 assigned to device CPU
load_tensors: layer 6 assigned to device CPU
load_tensors: layer 7 assigned to device CPU
load_tensors: layer 8 assigned to device CPU
load_tensors: layer 9 assigned to device CPU
load_tensors: layer 10 assigned to device CPU
load_tensors: layer 11 assigned to device CPU
load_tensors: layer 12 assigned to device CPU
load_tensors: layer 13 assigned to device CPU
load_tensors: layer 14 assigned to device CPU
load_tensors: layer 15 assigned to device CPU
load_tensors: layer 16 assigned to device CPU
load_tensors: layer 17 assigned to device CPU
load_tensors: layer 18 assigned to device CPU
load_tensors: layer 19 assigned to device CPU
load_tensors: layer 20 assigned to device CPU
load_tensors: layer 21 assigned to device CPU
load_tensors: layer 22 assigned to device CPU
load_tensors: layer 23 assigned to device CPU
load_tensors: layer 24 assigned to device CPU
load_tensors: layer 25 assigned to device CPU
load_tensors: layer 26 assigned to device CUDA0
load_tensors: layer 27 assigned to device CUDA0
load_tensors: layer 28 assigned to device CUDA0
load_tensors: layer 29 assigned to device CUDA0
load_tensors: layer 30 assigned to device CUDA0
load_tensors: layer 31 assigned to device CUDA0
load_tensors: layer 32 assigned to device CUDA0
load_tensors: layer 33 assigned to device CUDA0
load_tensors: layer 34 assigned to device CUDA0
load_tensors: layer 35 assigned to device CUDA0
load_tensors: layer 36 assigned to device CUDA0
load_tensors: layer 37 assigned to device CUDA0
load_tensors: layer 38 assigned to device CUDA0
load_tensors: layer 39 assigned to device CUDA0
load_tensors: layer 40 assigned to device CUDA0
load_tensors: layer 41 assigned to device CUDA0
load_tensors: layer 42 assigned to device CUDA0
load_tensors: layer 43 assigned to device CUDA0
load_tensors: layer 44 assigned to device CUDA0
load_tensors: layer 45 assigned to device CUDA1
load_tensors: layer 46 assigned to device CUDA1
load_tensors: layer 47 assigned to device CUDA1
load_tensors: layer 48 assigned to device CUDA1
load_tensors: layer 49 assigned to device CUDA1
load_tensors: layer 50 assigned to device CUDA1
load_tensors: layer 51 assigned to device CUDA1
load_tensors: layer 52 assigned to device CUDA1
load_tensors: layer 53 assigned to device CUDA1
load_tensors: layer 54 assigned to device CUDA1
load_tensors: layer 55 assigned to device CUDA1
load_tensors: layer 56 assigned to device CUDA1
load_tensors: layer 57 assigned to device CUDA1
load_tensors: layer 58 assigned to device CUDA1
load_tensors: layer 59 assigned to device CUDA1
load_tensors: layer 60 assigned to device CUDA1
load_tensors: layer 61 assigned to device CUDA1
load_tensors: layer 62 assigned to device CUDA1
load_tensors: layer 63 assigned to device CUDA1
load_tensors: layer 64 assigned to device CPU
load_tensors: tensor 'token_embd.weight' (q4_K) (and 0 others) cannot be used with preferred buffer type CUDA_Host, using CPU instead

Relevant log output


OS

No response

GPU

No response

CPU

No response

Ollama version

No response

GiteaMirror added the bug label 2026-05-04 15:22:41 -05:00
Author
Owner

@rick-github commented on GitHub (Apr 17, 2025):

ollama has estimated that only 37 layers will fit in the available VRAM. [Server logs](https://github.com/ollama/ollama/blob/main/docs/troubleshooting.md#how-to-troubleshoot-issues) will show the estimation logic.

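To see that estimation in practice, the server logs can be searched for the layer-offload lines. A sketch, assuming a Linux install managed by systemd (log location and exact wording vary by platform and Ollama version):

```shell
# Show Ollama server logs on a Linux systemd install
journalctl -u ollama --no-pager | grep -i "offload"

# Lines of interest typically look like (wording varies by version):
#   "offloaded 37/65 layers to GPU"
# along with per-GPU memory estimates that explain the layer split.
```

On macOS the logs live under `~/.ollama/logs/`, and on Windows in `%LOCALAPPDATA%\Ollama`, per the troubleshooting docs linked above.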
Author
Owner

@ddrcrow commented on GitHub (Apr 18, 2025):

Is it possible to change the number?
Is 37 a magic number, both for a single GPU and for multiple GPUs?

Author
Owner

@rick-github commented on GitHub (Apr 18, 2025):

You can override ollama by setting [`num_gpu`](https://github.com/ollama/ollama/issues/6950#issuecomment-2373663650).

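For reference, `num_gpu` (the number of model layers to offload to GPU) can be set in a few ways. A sketch, assuming a model named `qwq` and the default server address; the value 65 here matches the layer count in the log above:

```shell
# Interactively, inside an `ollama run` session:
#   /set parameter num_gpu 65

# Per-request, via the REST API options field:
curl http://localhost:11434/api/generate -d '{
  "model": "qwq",
  "prompt": "hello",
  "options": { "num_gpu": 65 }
}'

# Persistently, via a Modelfile:
#   FROM qwq
#   PARAMETER num_gpu 65
# then: ollama create qwq-gpu -f Modelfile
```

Note that forcing more layers than actually fit in VRAM can cause out-of-memory errors at load or inference time, which is why Ollama estimates a safe split by default.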
Author
Owner

@ddrcrow commented on GitHub (Apr 19, 2025):

@rick-github thank you, will try soon

Author
Owner

@ddrcrow commented on GitHub (Apr 21, 2025):

it worked for me, thank you

Reference: github-starred/ollama#68831