[GH-ISSUE #11640] gemma3n:e2b no offload on NVIDIA GeForce RTX 3050 Laptop GPU #69752

Closed
opened 2026-05-04 19:06:35 -05:00 by GiteaMirror · 1 comment

Originally created by @gianlen on GitHub (Aug 2, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/11640

What is the issue?

[server.log](https://github.com/user-attachments/files/21558621/server.log)

Using gemma3n, it seems there is no offloading to the GPU, while gemma3 works fine.

Relevant log output

```shell
Device 0: NVIDIA GeForce RTX 3050 Laptop GPU, compute capability 8.6, VMM: yes
load_backend: loaded CUDA backend from C:\Users\gianl\AppData\Local\Programs\Ollama\lib\ollama\ggml-cuda.dll
load_backend: loaded CPU backend from C:\Users\gianl\AppData\Local\Programs\Ollama\lib\ollama\ggml-cpu-alderlake.dll
time=2025-08-02T12:32:35.697+02:00 level=INFO source=ggml.go:104 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX_VNNI=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.BMI2=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 CUDA.0.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.0.USE_GRAPHS=1 CUDA.0.PEER_MAX_BATCH_SIZE=128 compiler=cgo(clang)
time=2025-08-02T12:32:35.763+02:00 level=INFO source=ggml.go:365 msg="offloading 0 repeating layers to GPU"
time=2025-08-02T12:32:35.763+02:00 level=INFO source=ggml.go:369 msg="offloading output layer to CPU"
time=2025-08-02T12:32:35.763+02:00 level=INFO source=ggml.go:376 msg="offloaded 0/31 layers to GPU"
time=2025-08-02T12:32:35.763+02:00 level=INFO source=ggml.go:379 msg="model weights" buffer=CPU size="5.6 GiB"
time=2025-08-02T12:32:35.782+02:00 level=INFO source=ggml.go:668 msg="compute graph" backend=CUDA0 buffer_type=CUDA0 size="0 B"
time=2025-08-02T12:32:35.782+02:00 level=INFO source=ggml.go:668 msg="compute graph" backend=CPU buffer_type=CPU size="123.0 MiB"
```
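A quick way to confirm the CPU/GPU split for a loaded model is `ollama ps`, run after sending a request (a sketch, assuming a local Ollama server and the `gemma3n:e2b` tag from the issue title):

```shell
# Load the model, then check how it was split between CPU and GPU.
ollama run gemma3n:e2b "hello" > /dev/null
ollama ps   # the PROCESSOR column shows e.g. "100% CPU" when nothing is offloaded
```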

OS

Windows

GPU

Nvidia

CPU

Intel

Ollama version

0.10.0

GiteaMirror added the bug label 2026-05-04 19:06:35 -05:00

@rick-github commented on GitHub (Aug 2, 2025):

```shell
source=server.go:175 msg=offload library=cuda layers.requested=-1 layers.model=31 layers.offload=0 layers.split="" memory.available="[3.2 GiB]" memory.gpu_overhead="0 B" memory.required.full="1.7 GiB" memory.required.partial="0 B" memory.required.kv="240.0 MiB" memory.required.allocations="[0 B]" memory.weights.total="1.4 GiB" memory.weights.repeating="1.0 GiB" memory.weights.nonrepeating="420.4 MiB" memory.graph.full="2.0 GiB" memory.graph.partial="3.7 GiB"
```

Insufficient VRAM to offload any layers.
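The log line above already contains the arithmetic: partial offload requires the partial compute graph (3.7 GiB) to fit before any layer weights do, and it alone exceeds the 3.2 GiB available on this 4 GB card. A minimal sketch of that accounting (an illustration using the logged figures, not Ollama's actual scheduler code):

```python
# Figures taken from the offload log line above (values in GiB).
available = 3.2          # memory.available reported for the RTX 3050 Laptop GPU
graph_partial = 3.7      # memory.graph.partial: compute buffer needed for partial offload
repeating_weights = 1.0  # memory.weights.repeating, spread across the repeating layers
repeating_layers = 30    # layers.model=31 minus the non-repeating output layer
per_layer = repeating_weights / repeating_layers

# To offload even one layer, the GPU must hold the partial compute graph plus
# that layer's weights; here the partial graph alone exceeds available VRAM.
budget = available - graph_partial
layers = min(repeating_layers, int(budget / per_layer)) if budget > 0 else 0
print(layers)  # 0 -> matches "offloaded 0/31 layers to GPU"
```

Given these numbers, even forcing a layer count with the `num_gpu` option would likely fail at allocation time; a smaller quantization or context size would be needed to fit anything on this GPU.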


Reference: github-starred/ollama#69752