[GH-ISSUE #11769] Low GPU Utilization with gpt-oss-20b Model #69857

Closed
opened 2026-05-04 19:35:00 -05:00 by GiteaMirror · 16 comments

Originally created by @songjiagui on GitHub (Aug 7, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/11769

What is the issue?

I noticed that when running the gpt-oss-20b model, the GPU utilization is quite low—at most around 25%—while the CPU usage is very high. Other models are able to utilize the GPU properly. Is there a specific setting or configuration I need to adjust to ensure the model uses the GPU correctly?

Screenshots: https://github.com/user-attachments/assets/9b554b73-b2c9-47c6-81fa-abbfd76e4a73 and https://github.com/user-attachments/assets/6ce2e3cb-df90-4f0d-adf3-74d39a3c31d0

Relevant log output


OS

Windows

GPU

Nvidia

CPU

Intel

Ollama version

No response

GiteaMirror added the bug label 2026-05-04 19:35:00 -05:00

@includewins0ck2 commented on GitHub (Aug 7, 2025):

I have the same issue—same Windows system, and the problem persists from 0.11.2 to the latest 0.11.3.


@andrescaroc commented on GitHub (Aug 7, 2025):

I have the same issue

  • GPU memory usage about 65%
  • GPU compute usage 0%
  • CPU compute usage 100%
Screenshot: https://github.com/user-attachments/assets/07637d92-49c4-4ee0-a145-cc223a035526
  • OS - Manjaro Linux
  • GPU - NVIDIA
  • CPU - Intel
  • Ollama Version 0.11.3

@rick-github commented on GitHub (Aug 7, 2025):

Server logs (https://github.com/ollama/ollama/blob/main/docs/troubleshooting.md#how-to-troubleshoot-issues) will help in debugging.

It's likely that the model is too big to fit in the available VRAM and part of it has been loaded in system RAM, where the slower CPU does inference.
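
For anyone else hitting this, a quick way to confirm such a CPU/GPU split (a sketch; the exact `ollama ps` output format varies by version, and the container name `ollama` is assumed):

```shell
# Show loaded models; a PROCESSOR value like "52%/48% CPU/GPU" means part
# of the model sits in system RAM and token generation will be CPU-bound.
ollama ps

# Collect server logs for a bug report (pick the variant for your install):
journalctl -u ollama --no-pager | tail -n 200   # Linux, systemd service
docker logs ollama                              # Docker container
```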


@andrescaroc commented on GitHub (Aug 7, 2025):

$ docker logs ollama

time=2025-08-07T04:45:15.445Z level=INFO source=server.go:637 msg="llama runner started in 1.01 seconds"
[GIN] 2025/08/07 - 04:45:31 | 200 | 23.196625137s |      172.17.0.1 | POST     "/v1/chat/completions"
[GIN] 2025/08/07 - 05:05:47 | 200 |      55.459µs |      172.17.0.1 | GET      "/api/version"
[GIN] 2025/08/07 - 05:06:50 | 200 |      33.137µs |      172.17.0.1 | HEAD     "/"
[GIN] 2025/08/07 - 05:06:50 | 200 |   83.665294ms |      172.17.0.1 | POST     "/api/show"
time=2025-08-07T05:06:50.793Z level=INFO source=server.go:135 msg="system memory" total="31.0 GiB" free="26.0 GiB" free_swap="0 B"
time=2025-08-07T05:06:50.793Z level=INFO source=server.go:175 msg=offload library=cuda layers.requested=-1 layers.model=25 layers.offload=1 layers.split="" memory.available="[3.6 GiB]" memory.gpu_overhead="0 B" memory.required.full="14.9 GiB" memory.required.partial="3.4 GiB" memory.required.kv="300.0 MiB" memory.required.allocations="[3.4 GiB]" memory.weights.total="11.7 GiB" memory.weights.repeating="10.7 GiB" memory.weights.nonrepeating="1.1 GiB" memory.graph.full="2.0 GiB" memory.graph.partial="2.0 GiB"
time=2025-08-07T05:06:50.833Z level=INFO source=server.go:438 msg="starting llama server" cmd="/usr/bin/ollama runner --ollama-engine --model /root/.ollama/models/blobs/sha256-b112e727c6f18875636c56a779790a590d705aec9e1c0eb5a97d51fc2a778583 --ctx-size 8192 --batch-size 512 --n-gpu-layers 1 --threads 6 --parallel 1 --port 36181"
time=2025-08-07T05:06:50.834Z level=INFO source=sched.go:481 msg="loaded runners" count=1
time=2025-08-07T05:06:50.834Z level=INFO source=server.go:598 msg="waiting for llama runner to start responding"
time=2025-08-07T05:06:50.834Z level=INFO source=server.go:632 msg="waiting for server to become available" status="llm server not responding"
time=2025-08-07T05:06:50.848Z level=INFO source=runner.go:925 msg="starting ollama engine"
time=2025-08-07T05:06:50.848Z level=INFO source=runner.go:983 msg="Server listening on 127.0.0.1:36181"
time=2025-08-07T05:06:50.897Z level=INFO source=ggml.go:92 msg="" architecture=gptoss file_type=MXFP4 name="" description="" num_tensors=315 num_key_values=30
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA RTX A500 Laptop GPU, compute capability 8.6, VMM: yes
load_backend: loaded CUDA backend from /usr/lib/ollama/libggml-cuda.so
load_backend: loaded CPU backend from /usr/lib/ollama/libggml-cpu-alderlake.so
time=2025-08-07T05:06:50.936Z level=INFO source=ggml.go:104 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX_VNNI=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.BMI2=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 CUDA.0.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.0.USE_GRAPHS=1 CUDA.0.PEER_MAX_BATCH_SIZE=128 compiler=cgo(gcc)
time=2025-08-07T05:06:51.013Z level=INFO source=ggml.go:367 msg="offloading 1 repeating layers to GPU"
time=2025-08-07T05:06:51.013Z level=INFO source=ggml.go:371 msg="offloading output layer to CPU"
time=2025-08-07T05:06:51.013Z level=INFO source=ggml.go:378 msg="offloaded 1/25 layers to GPU"
time=2025-08-07T05:06:51.013Z level=INFO source=ggml.go:381 msg="model weights" buffer=CPU size="12.4 GiB"
time=2025-08-07T05:06:51.013Z level=INFO source=ggml.go:381 msg="model weights" buffer=CUDA0 size="455.0 MiB"
time=2025-08-07T05:06:51.070Z level=INFO source=ggml.go:672 msg="compute graph" backend=CUDA0 buffer_type=CUDA0 size="2.1 GiB"
time=2025-08-07T05:06:51.070Z level=INFO source=ggml.go:672 msg="compute graph" backend=CPU buffer_type=CPU size="2.0 GiB"
time=2025-08-07T05:06:51.089Z level=INFO source=server.go:632 msg="waiting for server to become available" status="llm server loading model"
time=2025-08-07T05:06:53.116Z level=INFO source=server.go:637 msg="llama runner started in 2.28 seconds"
[GIN] 2025/08/07 - 05:09:12 | 200 |         2m22s |      172.17.0.1 | POST     "/api/generate"
time=2025-08-07T05:14:17.915Z level=WARN source=sched.go:685 msg="gpu VRAM usage didn't recover within timeout" seconds=5.125811551 runner.size="14.9 GiB" runner.vram="3.4 GiB" runner.parallel=1 runner.pid=7928 runner.model=/root/.ollama/models/blobs/sha256-b112e727c6f18875636c56a779790a590d705aec9e1c0eb5a97d51fc2a778583
time=2025-08-07T05:14:18.165Z level=WARN source=sched.go:685 msg="gpu VRAM usage didn't recover within timeout" seconds=5.376055911 runner.size="14.9 GiB" runner.vram="3.4 GiB" runner.parallel=1 runner.pid=7928 runner.model=/root/.ollama/models/blobs/sha256-b112e727c6f18875636c56a779790a590d705aec9e1c0eb5a97d51fc2a778583
time=2025-08-07T05:14:18.415Z level=WARN source=sched.go:685 msg="gpu VRAM usage didn't recover within timeout" seconds=5.625676648 runner.size="14.9 GiB" runner.vram="3.4 GiB" runner.parallel=1 runner.pid=7928 runner.model=/root/.ollama/models/blobs/sha256-b112e727c6f18875636c56a779790a590d705aec9e1c0eb5a97d51fc2a778583
[GIN] 2025/08/07 - 05:15:29 | 200 |       48.78µs |      172.17.0.1 | GET      "/api/version"

@rick-github commented on GitHub (Aug 7, 2025):

time=2025-08-07T05:06:50.793Z level=INFO source=server.go:175 msg=offload library=cuda layers.requested=-1 layers.model=25
 layers.offload=1 layers.split="" memory.available="[3.6 GiB]" memory.gpu_overhead="0 B" memory.required.full="14.9 GiB"
 memory.required.partial="3.4 GiB" memory.required.kv="300.0 MiB" memory.required.allocations="[3.4 GiB]"
 memory.weights.total="11.7 GiB" memory.weights.repeating="10.7 GiB" memory.weights.nonrepeating="1.1 GiB"
 memory.graph.full="2.0 GiB" memory.graph.partial="2.0 GiB"

3.6 GiB of available memory is only enough to hold 1 layer. The model is too big to fit in the available VRAM and part of it has been loaded in system RAM, where the slower CPU does inference.
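
A rough reading of that offload line (back-of-envelope arithmetic, not an exact model of Ollama's allocator):

```shell
# memory.weights.repeating = 10.7 GiB across 24 repeating layers
#   => roughly 0.45 GiB of weights per layer
# one layer (~0.45 GiB) + partial compute graph (2.0 GiB) + KV cache (0.3 GiB)
#   => about 2.8 GiB, plus overhead, close to memory.required.partial = 3.4 GiB
# With only 3.6 GiB of VRAM available, a single layer is all that fits;
# the other 24 layers stay in system RAM and run on the CPU.
```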


@Windsage63 commented on GitHub (Aug 8, 2025):

Don't feel bad, it won't load completely on an RTX 5090 either:

time=2025-08-07T19:50:29.744-05:00 level=INFO source=server.go:135 msg="system memory" total="63.4 GiB" free="49.7 GiB" free_swap="56.2 GiB"
time=2025-08-07T19:50:29.761-05:00 level=INFO source=server.go:175 msg=offload library=cuda layers.requested=-1 layers.model=25 layers.offload=24 layers.split="" memory.available="[29.3 GiB]" memory.gpu_overhead="0 B" memory.required.full="30.2 GiB" memory.required.partial="29.2 GiB" memory.required.kv="1.6 GiB" memory.required.allocations="[29.2 GiB]" memory.weights.total="11.7 GiB" memory.weights.repeating="10.7 GiB" memory.weights.nonrepeating="1.1 GiB" memory.graph.full="16.0 GiB" memory.graph.partial="16.0 GiB"
time=2025-08-07T19:50:29.761-05:00 level=WARN source=server.go:211 msg="flash attention enabled but not supported by model"
time=2025-08-07T19:50:29.761-05:00 level=WARN source=server.go:229 msg="quantized kv cache requested but flash attention disabled" type=q8_0

EDIT: I figured this out. It turns out that the LM Studio team and OpenAI implemented a new type of flash attention that Ollama doesn't have yet, so even this 13 GB model loads huge attention layers into VRAM. I lowered the context to 32k and it ran fine.
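
One way to apply the same workaround (a sketch; `num_ctx` is the standard Ollama context parameter, the 32768 value is just what worked here, and `gpt-oss-32k` is an arbitrary name):

```shell
# Interactively, inside an `ollama run` session:
#   >>> /set parameter num_ctx 32768

# Or bake the smaller context into a derived model with a Modelfile:
cat > Modelfile <<'EOF'
FROM gpt-oss:20b
PARAMETER num_ctx 32768
EOF
ollama create gpt-oss-32k -f Modelfile
ollama run gpt-oss-32k
```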


@AantCoder commented on GitHub (Aug 8, 2025):

The GPU is not used. Please don't reply here if you don't have the problem.

time=2025-08-08T07:17:17.018+04:00 level=INFO source=server.go:135 msg="system memory" total="63.8 GiB" free="46.2 GiB" free_swap="44.4 GiB"
time=2025-08-08T07:17:17.019+04:00 level=INFO source=server.go:175 msg=offload library=cuda layers.requested=-1 layers.model=25 layers.offload=0 layers.split="" memory.available="[13.1 GiB]" memory.gpu_overhead="0 B" memory.required.full="14.8 GiB" memory.required.partial="0 B" memory.required.kv="3.1 GiB" memory.required.allocations="[0 B]" memory.weights.total="11.7 GiB" memory.weights.repeating="10.7 GiB" memory.weights.nonrepeating="1.1 GiB" memory.graph.full="32.0 GiB" memory.graph.partial="32.0 GiB"
time=2025-08-08T07:17:17.055+04:00 level=INFO source=server.go:438 msg="starting llama server" cmd="C:\\Users\\User\\AppData\\Local\\Programs\\Ollama\\ollama.exe runner --ollama-engine --model E:\\AI\\Ollama\\blobs\\sha256-b112e727c6f18875636c56a779790a590d705aec9e1c0eb5a97d51fc2a778583 --ctx-size 131072 --batch-size 512 --threads 8 --no-mmap --parallel 1 --port 51452"
time=2025-08-08T07:17:17.057+04:00 level=INFO source=sched.go:481 msg="loaded runners" count=1
time=2025-08-08T07:17:17.057+04:00 level=INFO source=server.go:598 msg="waiting for llama runner to start responding"
time=2025-08-08T07:17:17.058+04:00 level=INFO source=server.go:632 msg="waiting for server to become available" status="llm server error"
time=2025-08-08T07:17:17.083+04:00 level=INFO source=runner.go:925 msg="starting ollama engine"
time=2025-08-08T07:17:17.085+04:00 level=INFO source=runner.go:983 msg="Server listening on 127.0.0.1:51452"
time=2025-08-08T07:17:17.123+04:00 level=INFO source=ggml.go:92 msg="" architecture=gptoss file_type=MXFP4 name="" description="" num_tensors=315 num_key_values=30
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 4080 SUPER, compute capability 8.9, VMM: yes
load_backend: loaded CUDA backend from C:\Users\User\AppData\Local\Programs\Ollama\lib\ollama\ggml-cuda.dll
load_backend: loaded CPU backend from C:\Users\User\AppData\Local\Programs\Ollama\lib\ollama\ggml-cpu-alderlake.dll
time=2025-08-08T07:17:17.196+04:00 level=INFO source=ggml.go:104 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX_VNNI=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.BMI2=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 CUDA.0.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.0.USE_GRAPHS=1 CUDA.0.PEER_MAX_BATCH_SIZE=128 compiler=cgo(clang)
time=2025-08-08T07:17:17.278+04:00 level=INFO source=ggml.go:367 msg="offloading 0 repeating layers to GPU"
time=2025-08-08T07:17:17.278+04:00 level=INFO source=ggml.go:371 msg="offloading output layer to CPU"
time=2025-08-08T07:17:17.278+04:00 level=INFO source=ggml.go:378 msg="offloaded 0/25 layers to GPU"
time=2025-08-08T07:17:17.278+04:00 level=INFO source=ggml.go:381 msg="model weights" buffer=CPU size="12.8 GiB"
time=2025-08-08T07:17:17.309+04:00 level=INFO source=server.go:632 msg="waiting for server to become available" status="llm server loading model"
time=2025-08-08T07:17:17.922+04:00 level=INFO source=ggml.go:672 msg="compute graph" backend=CUDA0 buffer_type=CUDA0 size="0 B"
time=2025-08-08T07:17:17.922+04:00 level=INFO source=ggml.go:672 msg="compute graph" backend=CPU buffer_type=CPU size="32.0 GiB"

@AantCoder commented on GitHub (Aug 8, 2025):

It seems that if the context size is reduced to 32768, the GPU starts being used. If the model doesn't quite fit in video memory, the partial split into layers doesn't happen for MXFP4, and execution falls back entirely to the CPU.
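
The same workaround can also be applied per request over the HTTP API (a sketch using the standard `/api/generate` endpoint on the default port; the prompt and context value are illustrative):

```shell
curl http://localhost:11434/api/generate -d '{
  "model": "gpt-oss:20b",
  "prompt": "Why is the sky blue?",
  "options": { "num_ctx": 32768 }
}'
```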


@Windsage63 commented on GitHub (Aug 8, 2025):

I was able to more or less recreate the issue in LM Studio by disabling the new flash attention scheme that was added. With it disabled, I went from running a 131k cache in 21 GB to overflowing at 65k. So this is likely just a matter of time and a code patch for Ollama to catch up.


@mizuikk commented on GitHub (Aug 9, 2025):

Same here. GPU utilization is low even with all layers loaded to the GPU.
Screenshots: https://github.com/user-attachments/assets/39dbb23f-c2d4-4e70-b0cc-a63b1a2be884 and https://github.com/user-attachments/assets/bcd27445-16fa-4cf0-8aca-8fb1ee26ae1d

@kha84 commented on GitHub (Aug 9, 2025):

On a 4090 with ollama version 0.11.2 (running with DEBUG=2), when the gpt-oss:20b model is loaded I can see this:

Aug 09 14:18:38 ollama[4001649]: time=2025-08-09T14:18:38.075+03:00 level=WARN source=server.go:211 msg="flash attention enabled but not supported by model"
Aug 09 14:18:38 ollama[4001649]: time=2025-08-09T14:18:38.075+03:00 level=WARN source=server.go:229 msg="quantized kv cache requested but flash attention disabled" type=q8_0

For other models (qwen3, mistral) flash attention works.
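
For reference, the settings behind those two warnings (both are real Ollama environment variables; per the log above, flash attention was simply not supported for this model at that version, so setting them would not help until support landed):

```shell
# Request flash attention globally; ignored when the runner reports
# "flash attention enabled but not supported by model".
export OLLAMA_FLASH_ATTENTION=1
# A quantized KV cache only takes effect when flash attention is active.
export OLLAMA_KV_CACHE_TYPE=q8_0
ollama serve
```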


@Master-Pr0grammer commented on GitHub (Aug 9, 2025):

I am also having the same issue. I only get 3 T/s generation and 4 T/s prompt processing, despite having pretty decent hardware.

GPU utilization is only at 3-5%, practically idle, while VRAM usage is at 80% (and the majority of the model is loaded in VRAM).

In comparison, I get much more GPU usage on the larger qwen3 30b model, which is also MoE. I get 5 times better performance on a model that is 50% larger (and both have 3B active parameters).

I'm not sure this is purely a low GPU utilization problem, but the performance is absolutely terrible. Something is definitely going wrong here.
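
For comparable numbers when reporting speeds like these, the CLI's `--verbose` flag prints timing after each response:

```shell
ollama run gpt-oss:20b --verbose
# After each reply the CLI prints statistics such as:
#   prompt eval rate:  ... tokens/s
#   eval rate:         ... tokens/s
```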


@rick-github commented on GitHub (Aug 9, 2025):

Server logs (https://github.com/ollama/ollama/blob/main/docs/troubleshooting.md#how-to-troubleshoot-issues) will help in debugging.


@azomDev commented on GitHub (Aug 10, 2025):

This appears to be related to #11676


@linuxlite commented on GitHub (Aug 13, 2025):

Running 2 x P6000s with the context set to 32768 and OLLAMA_NUM_PARALLEL=1 did the trick here.
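
A sketch of where those settings live (the service and image names are the usual defaults; OLLAMA_CONTEXT_LENGTH sets the default context length in recent Ollama versions, otherwise set num_ctx per model as shown earlier in the thread):

```shell
# systemd (Linux): add environment overrides, then restart the service
sudo systemctl edit ollama
#   [Service]
#   Environment="OLLAMA_NUM_PARALLEL=1"
#   Environment="OLLAMA_CONTEXT_LENGTH=32768"
sudo systemctl restart ollama

# Docker: pass the same variables to the container
docker run -d --gpus all -p 11434:11434 \
  -e OLLAMA_NUM_PARALLEL=1 -e OLLAMA_CONTEXT_LENGTH=32768 \
  ollama/ollama
```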

Screenshots: https://github.com/user-attachments/assets/9fafb27b-d13f-4316-ba83-fd91fd1b880d, https://github.com/user-attachments/assets/fbb06fcc-caf3-4029-9802-bedfbd30fdeb, and https://github.com/user-attachments/assets/54ce24f6-2761-4f23-a7f1-550b6451eb97

@rick-github commented on GitHub (Sep 1, 2025):

Recent releases of ollama have reduced the memory footprint of gpt-oss. Upgrade and add a comment if the issue persists.
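
On Linux, the standard install script also upgrades an existing installation in place (Windows and macOS use the downloaded installer):

```shell
curl -fsSL https://ollama.com/install.sh | sh
ollama -v   # confirm the new version
```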


Reference: github-starred/ollama#69857