[GH-ISSUE #11722] GPT OSS 120b only uses 14gb of vram out of 24gb #33520

Closed
opened 2026-04-22 16:18:03 -05:00 by GiteaMirror · 5 comments
Owner

Originally created by @iChristGit on GitHub (Aug 6, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/11722

What is the issue?

(screenshot: https://github.com/user-attachments/assets/692ac1aa-39cf-407b-986b-60e18b4af73e)

It could be faster if more is loaded into VRAM, right?

Relevant log output

(none provided)
OS

Windows

GPU

Nvidia

CPU

Intel

Ollama version

ollama version is 0.11.2

GiteaMirror added the bug label 2026-04-22 16:18:03 -05:00
Author
Owner

@rick-github commented on GitHub (Aug 6, 2025):

Server logs (https://github.com/ollama/ollama/blob/main/docs/troubleshooting.md#how-to-troubleshoot-issues) will help in debugging.
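If you're not sure where to find them, here is a minimal sketch of collecting the logs, assuming a standard systemd install on Linux and the default log location on Windows:

```shell
# Linux (systemd service): follow the server log while reproducing the load
journalctl -u ollama --no-pager --follow

# Windows: the server writes its log under %LOCALAPPDATA%\Ollama
# In PowerShell, for example:
#   Get-Content "$env:LOCALAPPDATA\Ollama\server.log" -Tail 200 -Wait
```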

Author
Owner

@wiwikuan commented on GitHub (Aug 6, 2025):

I'm having the same issue with my dual GPU setup.

My System

  • Linux Mint 22.1
  • AMD Ryzen 9 7900X3D, 128 GB RAM
  • 2x NVIDIA RTX 4000 (total 40GB VRAM)
  • Running gpt-oss:20b model
  • Ollama version 0.11.2

The Problem

I observed this by monitoring `journalctl -u ollama --no-pager --follow --pager-end` and `nvidia-smi`:

  • Ollama only using about 14GB out of 40GB VRAM available
  • Only 5 out of 25 model layers are loaded on GPU
  • GPU utilization is very low (2% on GPU0, 30% on GPU1)
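
A quick way to see the same split without digging through the logs is `ollama ps`; the output below is only an illustration of the PROCESSOR column, not a capture from this machine:

```shell
# Reports each loaded model and how its weights are split between CPU and GPU
ollama ps
# NAME          ID       SIZE     PROCESSOR          UNTIL
# gpt-oss:20b   <id>     16 GB    55%/45% CPU/GPU    4 minutes from now
```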

Testing Other Models

I tested aya-expanse:32b and qwen2.5:32b models, and they work perfectly fine. Both can use around 36GB VRAM and load the entire model on GPU as expected.

Logs

journalctl -u ollama --no-pager --follow --pager-end:

 8月 06 17:23:21 wiwi-nvidia ollama[1126164]: time=2025-08-06T17:23:21.675+08:00 level=INFO source=types.go:130 msg="inference compute" id=GPU-40272e29-fac5-4952-576f-cc6ad6f62ed2 library=cuda variant=v12 compute=8.9 driver=12.4 name="NVIDIA RTX 4000 SFF Ada Generation" total="19.7 GiB" available="19.5 GiB"
 8月 06 17:23:21 wiwi-nvidia ollama[1126164]: time=2025-08-06T17:23:21.675+08:00 level=INFO source=types.go:130 msg="inference compute" id=GPU-e5b8d85d-027e-5a79-f9b1-556b5dd5da70 library=cuda variant=v12 compute=8.9 driver=12.4 name="NVIDIA RTX 4000 Ada Generation" total="19.7 GiB" available="16.5 GiB"
 8月 06 17:23:44 wiwi-nvidia ollama[1126164]: time=2025-08-06T17:23:44.109+08:00 level=INFO source=server.go:135 msg="system memory" total="124.9 GiB" free="114.3 GiB" free_swap="69.8 MiB"
 8月 06 17:23:44 wiwi-nvidia ollama[1126164]: time=2025-08-06T17:23:44.110+08:00 level=INFO source=server.go:175 msg=offload library=cuda layers.requested=-1 layers.model=25 layers.offload=5 layers.split=5,0 memory.available="[19.5 GiB 16.5 GiB]" memory.gpu_overhead="0 B" memory.required.full="29.5 GiB" memory.required.partial="19.3 GiB" memory.required.kv="876.0 MiB" memory.required.allocations="[19.3 GiB 0 B]" memory.weights.total="11.7 GiB" memory.weights.repeating="10.7 GiB" memory.weights.nonrepeating="1.1 GiB" memory.graph.full="16.0 GiB" memory.graph.partial="16.0 GiB"
 8月 06 17:23:44 wiwi-nvidia ollama[1126164]: time=2025-08-06T17:23:44.152+08:00 level=INFO source=server.go:438 msg="starting llama server" cmd="/usr/local/bin/ollama runner --ollama-engine --model /usr/share/ollama/.ollama/models/blobs/sha256-b112e727c6f18875636c56a779790a590d705aec9e1c0eb5a97d51fc2a778583 --ctx-size 32768 --batch-size 512 --n-gpu-layers 5 --threads 12 --parallel 1 --tensor-split 5,0 --port 37455"
 8月 06 17:23:44 wiwi-nvidia ollama[1126164]: time=2025-08-06T17:23:44.153+08:00 level=INFO source=sched.go:481 msg="loaded runners" count=1
 8月 06 17:23:44 wiwi-nvidia ollama[1126164]: time=2025-08-06T17:23:44.153+08:00 level=INFO source=server.go:598 msg="waiting for llama runner to start responding"
 8月 06 17:23:44 wiwi-nvidia ollama[1126164]: time=2025-08-06T17:23:44.153+08:00 level=INFO source=server.go:632 msg="waiting for server to become available" status="llm server not responding"
 8月 06 17:23:44 wiwi-nvidia ollama[1126164]: time=2025-08-06T17:23:44.161+08:00 level=INFO source=runner.go:925 msg="starting ollama engine"
 8月 06 17:23:44 wiwi-nvidia ollama[1126164]: time=2025-08-06T17:23:44.161+08:00 level=INFO source=runner.go:983 msg="Server listening on 127.0.0.1:37455"
 8月 06 17:23:44 wiwi-nvidia ollama[1126164]: time=2025-08-06T17:23:44.202+08:00 level=INFO source=ggml.go:92 msg="" architecture=gptoss file_type=MXFP4 name="" description="" num_tensors=315 num_key_values=30
 8月 06 17:23:44 wiwi-nvidia ollama[1126164]: ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
 8月 06 17:23:44 wiwi-nvidia ollama[1126164]: ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
 8月 06 17:23:44 wiwi-nvidia ollama[1126164]: ggml_cuda_init: found 2 CUDA devices:
 8月 06 17:23:44 wiwi-nvidia ollama[1126164]:   Device 0: NVIDIA RTX 4000 SFF Ada Generation, compute capability 8.9, VMM: yes
 8月 06 17:23:44 wiwi-nvidia ollama[1126164]:   Device 1: NVIDIA RTX 4000 Ada Generation, compute capability 8.9, VMM: yes
 8月 06 17:23:44 wiwi-nvidia ollama[1126164]: load_backend: loaded CUDA backend from /usr/local/lib/ollama/libggml-cuda.so
 8月 06 17:23:44 wiwi-nvidia ollama[1126164]: load_backend: loaded CPU backend from /usr/local/lib/ollama/libggml-cpu-icelake.so
 8月 06 17:23:44 wiwi-nvidia ollama[1126164]: time=2025-08-06T17:23:44.310+08:00 level=INFO source=ggml.go:104 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.BMI2=1 CPU.0.AVX512=1 CPU.0.AVX512_VBMI=1 CPU.0.AVX512_VNNI=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 CUDA.0.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.0.USE_GRAPHS=1 CUDA.0.PEER_MAX_BATCH_SIZE=128 CUDA.1.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.1.USE_GRAPHS=1 CUDA.1.PEER_MAX_BATCH_SIZE=128 compiler=cgo(gcc)
 8月 06 17:23:44 wiwi-nvidia ollama[1126164]: time=2025-08-06T17:23:44.353+08:00 level=INFO source=ggml.go:367 msg="offloading 5 repeating layers to GPU"
 8月 06 17:23:44 wiwi-nvidia ollama[1126164]: time=2025-08-06T17:23:44.353+08:00 level=INFO source=ggml.go:371 msg="offloading output layer to CPU"
 8月 06 17:23:44 wiwi-nvidia ollama[1126164]: time=2025-08-06T17:23:44.353+08:00 level=INFO source=ggml.go:378 msg="offloaded 5/25 layers to GPU"
 8月 06 17:23:44 wiwi-nvidia ollama[1126164]: time=2025-08-06T17:23:44.353+08:00 level=INFO source=ggml.go:381 msg="model weights" buffer=CPU size="10.6 GiB"
 8月 06 17:23:44 wiwi-nvidia ollama[1126164]: time=2025-08-06T17:23:44.353+08:00 level=INFO source=ggml.go:381 msg="model weights" buffer=CUDA0 size="2.2 GiB"
 8月 06 17:23:44 wiwi-nvidia ollama[1126164]: time=2025-08-06T17:23:44.403+08:00 level=INFO source=server.go:632 msg="waiting for server to become available" status="llm server loading model"
 8月 06 17:23:44 wiwi-nvidia ollama[1126164]: time=2025-08-06T17:23:44.599+08:00 level=INFO source=ggml.go:672 msg="compute graph" backend=CUDA0 buffer_type=CUDA0 size="8.1 GiB"
 8月 06 17:23:44 wiwi-nvidia ollama[1126164]: time=2025-08-06T17:23:44.599+08:00 level=INFO source=ggml.go:672 msg="compute graph" backend=CUDA1 buffer_type=CUDA1 size="0 B"
 8月 06 17:23:44 wiwi-nvidia ollama[1126164]: time=2025-08-06T17:23:44.599+08:00 level=INFO source=ggml.go:672 msg="compute graph" backend=CPU buffer_type=CPU size="8.0 GiB"
 8月 06 17:23:45 wiwi-nvidia ollama[1126164]: time=2025-08-06T17:23:45.916+08:00 level=INFO source=server.go:637 msg="llama runner started in 1.76 seconds"

nvidia-smi:

 +-----------------------------------------------------------------------------------------+
 | NVIDIA-SMI 550.144.03             Driver Version: 550.144.03     CUDA Version: 12.4     |
 |-----------------------------------------+------------------------+----------------------+
 | GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
 | Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
 |                                         |                        |               MIG M. |
 |=========================================+========================+======================|
 |   0  NVIDIA RTX 4000 SFF Ada ...    Off |   00000000:01:00.0 Off |                  Off |
 | 33%   61C    P2             27W /   70W |   11095MiB /  20475MiB |      2%      Default |
 |                                         |                        |                  N/A |
 +-----------------------------------------+------------------------+----------------------+
 |   1  NVIDIA RTX 4000 Ada Gene...    Off |   00000000:0D:00.0  On |                  Off |
 | 30%   40C    P5             21W /  130W |    3281MiB /  20475MiB |     30%      Default |
 |                                         |                        |                  N/A |
 +--------------------------
Author
Owner

@rick-github commented on GitHub (Aug 6, 2025):

Reduce the size of the context. A 32k context results in a 16GB memory graph, which doesn't fit on one GPU and leaves room for only 5 layers on the other. You could also try upgrading Ollama; gpt-oss is a new model and there are ongoing adjustments to the memory estimation logic for it.
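
A couple of ways to shrink the context, as a sketch (8192 below is just an example value, not a recommendation):

```shell
# Per session, from the interactive CLI
ollama run gpt-oss:20b
>>> /set parameter num_ctx 8192

# Per request, via the REST API
curl http://localhost:11434/api/generate -d '{
  "model": "gpt-oss:20b",
  "prompt": "Why is the sky blue?",
  "options": { "num_ctx": 8192 }
}'
```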

Author
Owner

@azomDev commented on GitHub (Aug 10, 2025):

I think this might be related to #11676, but I'm not completely sure in this case.

Author
Owner

@rick-github commented on GitHub (Sep 1, 2025):

Recent releases of Ollama have reduced the memory footprint for gpt-oss. Upgrade, and if the problem persists, add a comment.
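
On Linux, re-running the install script upgrades an existing install in place; on Windows, installing the latest build from ollama.com does the same. A sketch:

```shell
# Linux: the official install script also upgrades an existing install
curl -fsSL https://ollama.com/install.sh | sh

# Confirm the version afterwards
ollama -v
```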

Reference: github-starred/ollama#33520