[GH-ISSUE #8143] Codellama 34b runs on CPU instead of GPU #51708

Closed
opened 2026-04-28 20:46:55 -05:00 by GiteaMirror · 3 comments
Owner

Originally created by @gregory-lebl on GitHub (Dec 17, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/8143

What is the issue?

Hello,
When I try to use Codellama:34b, it doesn't run on the GPU but on the CPU. The 7B and 13B models run on the GPU, but the 34B and 70B do not.

Any idea?

OS

WSL2

GPU

Nvidia

CPU

Intel

Ollama version

0.4.7

GiteaMirror added the needs more info and bug labels 2026-04-28 20:46:55 -05:00
Author
Owner

@rick-github commented on GitHub (Dec 17, 2024):

[Server logs](https://github.com/ollama/ollama/blob/main/docs/troubleshooting.md#how-to-troubleshoot-issues) will aid in debugging.

Author
Owner

@felixniemeyer commented on GitHub (Dec 18, 2024):

maybe ollama goes for CPU because the models don't fit in your VRAM
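One way to check this hypothesis is to ask the running server how much of a loaded model actually resides in VRAM. Below is a minimal sketch, assuming a default local install on port 11434 and the `size` / `size_vram` fields that recent Ollama releases report from `/api/ps`; it is an illustration, not part of the original comment:

```python
import json
import urllib.request

# Ask the local Ollama server which models are loaded and how much of each
# sits in VRAM (default port assumed; run this while the model is loaded).
with urllib.request.urlopen("http://localhost:11434/api/ps") as resp:
    data = json.load(resp)

for m in data.get("models", []):
    name = m.get("name") or m.get("model", "?")
    size = m.get("size", 0)
    size_vram = m.get("size_vram", 0)
    pct = 100 * size_vram / size if size else 0
    print(f"{name}: {size / 2**30:.1f} GiB total, "
          f"{size_vram / 2**30:.1f} GiB in VRAM ({pct:.0f}% on GPU)")
```

If `size_vram` is well below `size`, the model is split between GPU and CPU, which matches the partial offload shown in the logs below.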

Author
Owner

@gregory-lebl commented on GitHub (Dec 26, 2024):

> maybe ollama goes for CPU because the models don't fit in your VRAM

That would be an explanation. I see this log when using the 34b model:

Dec 26 13:59:52 DESKTOP-KO76CAD ollama[110]: time=2024-12-26T13:59:52.591+01:00 level=INFO source=memory.go:343 msg="offload to cuda" layers.requested=-1 layers.model=49 layers.offload=26 layers.split="" memory.available="[10.8 GiB]" memory.gpu_overhead="0 B" memory.required.full="19.1 GiB" memory.required.partial="10.8 GiB" memory.required.kv="384.0 MiB" memory.required.allocations="[10.8 GiB]" memory.weights.total="17.8 GiB" memory.weights.repeating="17.6 GiB" memory.weights.nonrepeating="205.1 MiB" memory.graph.full="324.0 MiB" memory.graph.partial="348.0 MiB"

And this log for the 7b model:

Dec 26 14:04:40 DESKTOP-KO76CAD ollama[110]: time=2024-12-26T14:04:40.296+01:00 level=INFO source=memory.go:343 msg="offload to cuda" layers.requested=-1 layers.model=33 layers.offload=33 layers.split="" memory.available="[10.8 GiB]" memory.gpu_overhead="0 B" memory.required.full="8.7 GiB" memory.required.partial="8.7 GiB" memory.required.kv="4.0 GiB" memory.required.allocations="[8.7 GiB]" memory.weights.total="7.4 GiB" memory.weights.repeating="7.3 GiB" memory.weights.nonrepeating="102.6 MiB" memory.graph.full="560.0 MiB" memory.graph.partial="681.0 MiB"

The 34B model needs 19.1 GiB of memory and the 7B needs 8.7 GiB. My GPU has only 10.8 GiB available.
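For rough intuition, the offload count in that log can be reproduced from the same numbers: reserve the fixed costs (non-repeating weights, KV cache, compute graph), then divide the remaining VRAM by the average per-layer weight size. The sketch below is back-of-the-envelope only, not Ollama's actual estimator in memory.go, which accounts for more detail; the function name and the 48-repeating-layer split are assumptions for illustration:

```python
def estimate_offload_layers(available_gib, weights_repeating_gib, repeating_layers,
                            nonrepeating_gib, kv_gib, graph_partial_gib):
    """Rough estimate of how many transformer layers fit in VRAM.

    Reserve the fixed costs first, then divide what is left by the average
    per-layer weight size. This mirrors the idea behind the "offload to cuda"
    log line, not Ollama's exact accounting.
    """
    per_layer_gib = weights_repeating_gib / repeating_layers
    fixed_gib = nonrepeating_gib + kv_gib + graph_partial_gib
    free_gib = available_gib - fixed_gib
    return max(0, min(repeating_layers, int(free_gib / per_layer_gib)))

# Values from the codellama:34b log line above (49 model layers, assumed to be
# 48 repeating layers plus the non-repeating embedding/output part).
layers = estimate_offload_layers(
    available_gib=10.8,
    weights_repeating_gib=17.6,
    repeating_layers=48,
    nonrepeating_gib=205.1 / 1024,
    kv_gib=384.0 / 1024,
    graph_partial_gib=348.0 / 1024,
)
print(layers)  # 26 -- matches layers.offload=26 in the log
```

With roughly 10.8 GiB usable, only about half of the 34B model's layers fit on the GPU, so the remainder runs on the CPU; the 70B model is even further from fitting.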
