[GH-ISSUE #11939] gpt-oss Uses 100% CPU #69985

Closed
opened 2026-05-04 19:59:07 -05:00 by GiteaMirror · 10 comments

Originally created by @lzlrd on GitHub (Aug 16, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/11939

Originally assigned to: @jessegross on GitHub.

What is the issue?

For some reason, gpt-oss won't use the GPU at all even though larger models do:

┌──────[Diab Neiroukh@XXX]─[C:\WINDOWS\system32]
└── $ ollama ps
NAME           ID              SIZE     PROCESSOR    CONTEXT    UNTIL
gpt-oss:20b    aa4295ac10c3    15 GB    100% CPU     131072     4 minutes from now

┌──────[Diab Neiroukh@XXX]─[C:\WINDOWS\system32]
└── $ ollama ps
NAME                                     ID              SIZE     PROCESSOR          CONTEXT    UNTIL
huihui_ai/dolphin3-r1-abliterated:24b    05a0492884c8    19 GB    23%/77% CPU/GPU    131072     4 minutes from now

Here are the env vars I've set:

GGML_CUDA_ENABLE_UNIFIED_ME... 1
OLLAMA_CONTEXT_LENGTH          131072
OLLAMA_FLASH_ATTENTION         1
OLLAMA_HOST                    172.16.0.3
OLLAMA_KV_CACHE_TYPE           q8_0
OLLAMA_MODELS                  D:\Tools\Ollama
OLLAMA_NEW_ENGINE              1
OLLAMA_NEW_ESTIMATES           1

GGML_CUDA_ENABLE_UNIFIED_MEMORY is the full environment key; the truncated name above is just output from gci env:* | sort-object name, which cuts long names off.
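
For reference, a quick PowerShell sketch that prints the key and value without truncation (any matching pattern works):

# Format-List does not truncate long names the way the default table view does
gci env:GGML* | Format-List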

Relevant log output

N/A

OS

Windows

GPU

Nvidia

CPU

AMD

Ollama version

0.11.4

GiteaMirror added the bug label 2026-05-04 19:59:07 -05:00

@lzlrd commented on GitHub (Aug 16, 2025):

When setting context to 4096, I do see it attempt to use the GPU but the runner fails:

┌──────[Diab Neiroukh@XXX]─[C:\WINDOWS\system32]
└── $ ollama ps
NAME           ID              SIZE     PROCESSOR    CONTEXT    UNTIL
gpt-oss:20b    aa4295ac10c3    14 GB    100% GPU     4096       4 minutes from now

@lzlrd commented on GitHub (Aug 16, 2025):

With OLLAMA_FLASH_ATTENTION=0, OLLAMA_NEW_*=0, and a context length of 4096:

gpt-oss:20b    aa4295ac10c3    14 GB    100% GPU     4096       4 minutes from now
time=2025-08-16T14:46:51.704+01:00 level=INFO source=ggml.go:92 msg="" architecture=gptoss file_type=MXFP4 name="" description="" num_tensors=315 num_key_values=30
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 4080 SUPER, compute capability 8.9, VMM: yes
load_backend: loaded CUDA backend from C:\Users\Diab Neiroukh\AppData\Local\Programs\Ollama\lib\ollama\ggml-cuda.dll
load_backend: loaded CPU backend from C:\Users\Diab Neiroukh\AppData\Local\Programs\Ollama\lib\ollama\ggml-cpu-icelake.dll
time=2025-08-16T14:46:51.801+01:00 level=INFO source=ggml.go:104 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.BMI2=1 CPU.0.AVX512=1 CPU.0.AVX512_VBMI=1 CPU.0.AVX512_VNNI=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 CUDA.0.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.0.USE_GRAPHS=1 CUDA.0.PEER_MAX_BATCH_SIZE=128 compiler=cgo(clang)
time=2025-08-16T14:46:51.883+01:00 level=INFO source=server.go:632 msg="waiting for server to become available" status="llm server loading model"
time=2025-08-16T14:46:53.333+01:00 level=INFO source=ggml.go:365 msg="offloading 24 repeating layers to GPU"
time=2025-08-16T14:46:53.333+01:00 level=INFO source=ggml.go:371 msg="offloading output layer to GPU"
time=2025-08-16T14:46:53.333+01:00 level=INFO source=ggml.go:376 msg="offloaded 25/25 layers to GPU"
time=2025-08-16T14:46:53.333+01:00 level=INFO source=ggml.go:379 msg="model weights" buffer=CUDA0 size="11.7 GiB"
time=2025-08-16T14:46:53.333+01:00 level=INFO source=ggml.go:379 msg="model weights" buffer=CPU size="1.1 GiB"
time=2025-08-16T14:46:53.515+01:00 level=INFO source=ggml.go:668 msg="compute graph" backend=CUDA0 buffer_type=CUDA0 size="1.2 GiB"
time=2025-08-16T14:46:53.515+01:00 level=INFO source=ggml.go:668 msg="compute graph" backend=CPU buffer_type=CPU size="5.6 MiB"
[GIN] 2025/08/16 - 14:46:55 | 200 |            0s |      172.16.0.3 | HEAD     "/"
[GIN] 2025/08/16 - 14:46:55 | 200 |            0s |      172.16.0.3 | GET      "/api/ps"
time=2025-08-16T14:46:55.638+01:00 level=INFO source=server.go:637 msg="llama runner started in 4.01 seconds"
CUDA error: the resource allocation failed
  current device: 0, in function cublas_handle at C:/a/ollama/ollama/ml/backend/ggml/ggml/src\ggml-cuda/common.cuh:823
  cublasCreate_v2(&cublas_handles[device])
C:\a\ollama\ollama\ml\backend\ggml\ggml\src\ggml-cuda\ggml-cuda.cu:77: CUDA error
time=2025-08-16T14:48:46.449+01:00 level=ERROR source=server.go:807 msg="post predict" error="Post \"http://127.0.0.1:51039/completion\": read tcp 127.0.0.1:51041->127.0.0.1:51039: wsarecv: An existing connection was forcibly closed by the remote host."
[GIN] 2025/08/16 - 14:48:46 | 200 |         1m55s |      172.16.0.3 | POST     "/api/chat"
time=2025-08-16T14:48:46.609+01:00 level=ERROR source=server.go:464 msg="llama runner terminated" error="exit status 0xc0000409"

With OLLAMA_FLASH_ATTENTION=1, OLLAMA_NEW_*=1, and a context length of 4096:

...the PC crashes.

time=2025-08-16T14:54:35.117+01:00 level=INFO source=ggml.go:92 msg="" architecture=gptoss file_type=MXFP4 name="" description="" num_tensors=315 num_key_values=30
time=2025-08-16T14:54:35.291+01:00 level=INFO source=server.go:632 msg="waiting for server to become available" status="llm server loading model"
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 4080 SUPER, compute capability 8.9, VMM: yes
load_backend: loaded CUDA backend from C:\Users\Diab Neiroukh\AppData\Local\Programs\Ollama\lib\ollama\ggml-cuda.dll
load_backend: loaded CPU backend from C:\Users\Diab Neiroukh\AppData\Local\Programs\Ollama\lib\ollama\ggml-cpu-icelake.dll
time=2025-08-16T14:54:35.472+01:00 level=INFO source=ggml.go:104 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.BMI2=1 CPU.0.AVX512=1 CPU.0.AVX512_VBMI=1 CPU.0.AVX512_VNNI=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 CUDA.0.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.0.USE_GRAPHS=1 CUDA.0.PEER_MAX_BATCH_SIZE=128 compiler=cgo(clang)
time=2025-08-16T14:54:36.998+01:00 level=INFO source=ggml.go:365 msg="offloading 24 repeating layers to GPU"
time=2025-08-16T14:54:36.998+01:00 level=INFO source=ggml.go:371 msg="offloading output layer to GPU"
time=2025-08-16T14:54:36.998+01:00 level=INFO source=ggml.go:376 msg="offloaded 25/25 layers to GPU"
time=2025-08-16T14:54:36.998+01:00 level=INFO source=ggml.go:379 msg="model weights" buffer=CUDA0 size="11.7 GiB"
time=2025-08-16T14:54:36.998+01:00 level=INFO source=ggml.go:379 msg="model weights" buffer=CPU size="1.1 GiB"
time=2025-08-16T14:54:37.237+01:00 level=INFO source=ggml.go:668 msg="compute graph" backend=CUDA0 buffer_type=CUDA0 size="1.2 GiB"
time=2025-08-16T14:54:37.237+01:00 level=INFO source=ggml.go:668 msg="compute graph" backend=CPU buffer_type=CPU size="5.6 MiB"
[GIN] 2025/08/16 - 14:54:37 | 200 |            0s |      172.16.0.3 | HEAD     "/"
[GIN] 2025/08/16 - 14:54:37 | 200 |            0s |      172.16.0.3 | GET      "/api/ps"
time=2025-08-16T14:54:40.046+01:00 level=INFO source=server.go:637 msg="llama runner started in 5.01 seconds"
CUDA error: the resource allocation failed
  current device: 0, in function cublas_handle at C:/a/ollama/ollama/ml/backend/ggml/ggml/src\ggml-cuda/common.cuh:823
  cublasCreate_v2(&cublas_handles[device])
C:\a\ollama\ollama\ml\backend\ggml\ggml\src\ggml-cuda\ggml-cuda.cu:77: CUDA error

@lzlrd commented on GitHub (Aug 16, 2025):

Note: I have CUDA - Sysmem Fallback Policy set to Prefer No Sysmem Fallback as Default and Prefer Sysmem Fallback for Ollama (both ollama.exe and ollama app.exe) in NVCPL.

@rick-github commented on GitHub (Aug 16, 2025):

GGML_CUDA_ENABLE_UNIFIED_MEMORY is not necessary for Windows, it should be the default.

ollama doesn't use the GPU when OLLAMA_CONTEXT_LENGTH=131072 because the memory graph won't fit in the available VRAM. ollama can be forced to load all layers on the GPU by setting the layer count (num_gpu) as described in https://github.com/ollama/ollama/issues/6950#issuecomment-2373663650. This should result in the layers overflowing into system RAM, which may cause performance issues (https://github.com/ollama/ollama/issues/7584#issuecomment-2466715900).
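
For example, num_gpu can be set for a session from the interactive CLI (a sketch; 999 is just an arbitrarily large value meaning "offload as many layers as possible", and the same option can be passed per request in the API's options field):

# interactive session against the running server
ollama run gpt-oss:20b
>>> /set parameter num_gpu 999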

@lzlrd commented on GitHub (Aug 16, 2025):

> GGML_CUDA_ENABLE_UNIFIED_MEMORY is not necessary for Windows, it should be the default.

Thanks.

> ollama doesn't use the GPU when OLLAMA_CONTEXT_LENGTH=131072 because the memory graph won't fit in the available VRAM.

It seems to do so for other, larger models though:

$ ollama ps
NAME                                     ID              SIZE     PROCESSOR          CONTEXT    UNTIL
huihui_ai/dolphin3-r1-abliterated:24b    05a0492884c8    19 GB    27%/73% CPU/GPU    131072     4 minutes from now

Then there's the problem of it crashing when it does use the GPU for gpt-oss @ 4096 CL. It seems to be a "worse" crash (as in, the system freezes indefinitely rather than for a few minutes) with flash attention, the new engine and new estimates enabled.

> This should result in the layers overflowing into system RAM, which may cause performance issues (https://github.com/ollama/ollama/issues/7584#issuecomment-2466715900).

I do expect that, though I'm having [surprisingly] decent toks/s on 100% CPU with gpt-oss so my goal here is to see if I could improve that.

@rick-github commented on GitHub (Aug 16, 2025):

It's not the size of the model that matters, it's the size of the memory graph.

| ctx    | dolphin3 kv | dolphin3 graph | gpt-oss kv | gpt-oss graph |
| ------ | ----------- | -------------- | ---------- | ------------- |
| 4096   | 0.64G       | 0.8G           | -          | -             |
| 8192   | 1.2G        | 0.8G           | 0.3G       | 2.0G          |
| 16384  | 2.5G        | 1.1G           | 0.492G     | 4.0G          |
| 32768  | 5.0G        | 2.3G           | 0.876G     | 8.0G          |
| 65536  | -           | -              | 1.6G       | 16.0G         |
| 131072 | -           | -              | 3.1G       | 32.0G         |

dolphin3 was trained with a max context size of 32768, so while you can set num_ctx higher, it is capped at the max by ollama. So when you are running dolphin3 at 128k, you only need enough room to hold a 2.3G graph. gpt-oss on the other hand allows the full 128k context size, and the resulting graph of 32G will not fit in available VRAM along with any layers or ancillary data structures.

Note that this is for 0.11.4, which has some issues with the allocations and flash attention for gpt-oss. 0.11.5 (currently at rc2: https://github.com/ollama/ollama/releases/tag/v0.11.5-rc2) fixes these issues (and adds a faster implementation of MXFP4), so running 0.11.5 with FA and a KV quant of q8_0 will result in a graph of around 8G, leaving room for loading layers. Note, however, that KV quant for gpt-oss results in slow inference. With the new memory management layer (OLLAMA_NEW_ESTIMATES, only in 0.11.5) the graph reduction is even greater, although I don't have data on how much yet.
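
For example, a PowerShell sketch of that configuration (session-scoped; it assumes the server is then started from the same shell rather than as the background app, otherwise the variables need to be set system-wide and the service restarted):

# flash attention + q8_0 KV cache + new estimates, as described above for 0.11.5
$env:OLLAMA_FLASH_ATTENTION = "1"
$env:OLLAMA_KV_CACHE_TYPE = "q8_0"
$env:OLLAMA_NEW_ESTIMATES = "1"
ollama serve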

@alienatedsec commented on GitHub (Aug 17, 2025):

@lzlrd This could be helpful https://github.com/ollama/ollama/issues/11676#issuecomment-3193972390

@lzlrd commented on GitHub (Aug 17, 2025):

@rick-github, so I've tested 0.11.5 with the new estimates engine and q8_0 KV quant and I do see GPU usage:

$ ollama ps
NAME           ID              SIZE     PROCESSOR    CONTEXT    UNTIL
gpt-oss:20b    aa4295ac10c3    15 GB    100% GPU     131072     4 minutes from now

...but I'm still running into the runner failing:

time=2025-08-17T16:53:24.859+01:00 level=INFO source=server.go:166 msg="enabling new memory estimates"
time=2025-08-17T16:53:24.871+01:00 level=INFO source=server.go:211 msg="enabling flash attention"
time=2025-08-17T16:53:24.873+01:00 level=INFO source=server.go:383 msg="starting runner" cmd="C:\\Users\\Diab Neiroukh\\AppData\\Local\\Programs\\Ollama\\ollama.exe runner --ollama-engine --model D:\\Tools\\Ollama\\blobs\\sha256-b112e727c6f18875636c56a779790a590d705aec9e1c0eb5a97d51fc2a778583 --port 50514"
time=2025-08-17T16:53:24.877+01:00 level=INFO source=server.go:657 msg="loading model" "model layers"=25 requested=-1
time=2025-08-17T16:53:24.886+01:00 level=INFO source=server.go:663 msg="system memory" total="63.7 GiB" free="35.1 GiB" free_swap="36.9 GiB"
time=2025-08-17T16:53:24.886+01:00 level=INFO source=server.go:667 msg="gpu memory" id=GPU-e6f4866f-d905-e919-2700-f58f99f7d63d available="13.9 GiB" free="14.4 GiB" minimum="457.0 MiB" overhead="0 B"
time=2025-08-17T16:53:24.908+01:00 level=INFO source=runner.go:1006 msg="starting ollama engine"
time=2025-08-17T16:53:24.909+01:00 level=INFO source=runner.go:1043 msg="Server listening on 127.0.0.1:50514"
time=2025-08-17T16:53:24.918+01:00 level=INFO source=runner.go:925 msg=load request="{Operation:fit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:true KvSize:131072 KvCacheType:q8_0 NumThreads:16 GPULayers:25[ID:GPU-e6f4866f-d905-e919-2700-f58f99f7d63d Layers:25(0..24)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2025-08-17T16:53:24.956+01:00 level=INFO source=ggml.go:130 msg="" architecture=gptoss file_type=MXFP4 name="" description="" num_tensors=315 num_key_values=30
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 4080 SUPER, compute capability 8.9, VMM: yes, ID: GPU-e6f4866f-d905-e919-2700-f58f99f7d63d
load_backend: loaded CUDA backend from C:\Users\Diab Neiroukh\AppData\Local\Programs\Ollama\lib\ollama\ggml-cuda.dll
load_backend: loaded CPU backend from C:\Users\Diab Neiroukh\AppData\Local\Programs\Ollama\lib\ollama\ggml-cpu-icelake.dll
time=2025-08-17T16:53:25.036+01:00 level=INFO source=ggml.go:104 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.BMI2=1 CPU.0.AVX512=1 CPU.0.AVX512_VBMI=1 CPU.0.AVX512_VNNI=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 CUDA.0.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.0.USE_GRAPHS=1 CUDA.0.PEER_MAX_BATCH_SIZE=128 compiler=cgo(clang)
time=2025-08-17T16:53:25.123+01:00 level=INFO source=runner.go:925 msg=load request="{Operation:alloc LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:true KvSize:131072 KvCacheType:q8_0 NumThreads:16 GPULayers:25[ID:GPU-e6f4866f-d905-e919-2700-f58f99f7d63d Layers:25(0..24)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2025-08-17T16:53:26.675+01:00 level=INFO source=runner.go:925 msg=load request="{Operation:commit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:true KvSize:131072 KvCacheType:q8_0 NumThreads:16 GPULayers:25[ID:GPU-e6f4866f-d905-e919-2700-f58f99f7d63d Layers:25(0..24)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2025-08-17T16:53:26.675+01:00 level=INFO source=ggml.go:486 msg="offloading 24 repeating layers to GPU"
time=2025-08-17T16:53:26.675+01:00 level=INFO source=ggml.go:492 msg="offloading output layer to GPU"
time=2025-08-17T16:53:26.675+01:00 level=INFO source=ggml.go:497 msg="offloaded 25/25 layers to GPU"
time=2025-08-17T16:53:26.676+01:00 level=INFO source=backend.go:310 msg="model weights" device=CUDA0 size="11.8 GiB"
time=2025-08-17T16:53:26.676+01:00 level=INFO source=backend.go:315 msg="model weights" device=CPU size="1.1 GiB"
time=2025-08-17T16:53:26.676+01:00 level=INFO source=backend.go:321 msg="kv cache" device=CUDA0 size="1.6 GiB"
time=2025-08-17T16:53:26.676+01:00 level=INFO source=backend.go:332 msg="compute graph" device=CUDA0 size="109.3 MiB"
time=2025-08-17T16:53:26.676+01:00 level=INFO source=backend.go:337 msg="compute graph" device=CPU size="152.0 MiB"
time=2025-08-17T16:53:26.676+01:00 level=INFO source=backend.go:342 msg="total memory" size="14.7 GiB"
time=2025-08-17T16:53:26.676+01:00 level=INFO source=sched.go:473 msg="loaded runners" count=1
time=2025-08-17T16:53:26.676+01:00 level=INFO source=server.go:1232 msg="waiting for llama runner to start responding"
time=2025-08-17T16:53:26.676+01:00 level=INFO source=server.go:1266 msg="waiting for server to become available" status="llm server loading model"
[GIN] 2025/08/17 - 16:53:26 | 200 |            0s |      172.16.0.3 | HEAD     "/"
[GIN] 2025/08/17 - 16:53:26 | 200 |            0s |      172.16.0.3 | GET      "/api/ps"
time=2025-08-17T16:53:29.181+01:00 level=INFO source=server.go:1270 msg="llama runner started in 4.31 seconds"
CUDA error: the resource allocation failed
  current device: 0, in function cublas_handle at C:/a/ollama/ollama/ml/backend/ggml/ggml/src\ggml-cuda/common.cuh:904
  cublasCreate_v2(&cublas_handles[device])
C:\a\ollama\ollama\ml\backend\ggml\ggml\src\ggml-cuda\ggml-cuda.cu:84: CUDA error
time=2025-08-17T16:53:37.823+01:00 level=ERROR source=server.go:1440 msg="post predict" error="Post \"http://127.0.0.1:50514/completion\": read tcp 127.0.0.1:50518->127.0.0.1:50514: wsarecv: An existing connection was forcibly closed by the remote host."
[GIN] 2025/08/17 - 16:53:37 | 200 |   13.1684693s |      172.16.0.3 | POST     "/api/chat"
time=2025-08-17T16:53:37.932+01:00 level=ERROR source=server.go:409 msg="llama runner terminated" error="exit status 0xc0000409"

@lzlrd commented on GitHub (Aug 17, 2025):

Looking at VRAM usage (with some third-party tools) it's not actually maxing out, either. There's about 0.8GB free before it fails.
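
For reference, the same headroom can be watched without third-party tools (assuming nvidia-smi is on PATH), e.g.:

# report used/total VRAM every second
nvidia-smi --query-gpu=memory.used,memory.total --format=csv -l 1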

@jessegross commented on GitHub (Aug 21, 2025):

I believe that the original issue is fixed, so I'm going to close this. There is a second issue that you are seeing, which is the same as #11753.
