[GH-ISSUE #11676] Ollama not using NVIDIA GPUs with gpt-oss models #7723

Closed
opened 2026-04-12 19:50:08 -05:00 by GiteaMirror · 91 comments

Originally created by @nadamas2000 on GitHub (Aug 5, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/11676

What is the issue?

Hello,

I've noticed an issue with GPU utilization. When running the gpt-oss:20b and gpt-oss:120b models, Ollama seems to be running them entirely on the CPU.

My NVIDIA GPUs (RTX 4070-Ti 16GB and RTX 3060 12GB) remain completely idle according to nvidia-smi and Task Manager, while my CPU usage is maxed out. I would expect these models to be loaded onto the GPUs for accelerated performance.

Key Information:

  • Ollama Version: 0.11.0
  • Operating System: Windows 11
  • Intel(R) Core(TM) Ultra 7 265K (3.90 GHz) 20 cores
  • Nvidia driver: Game Ready 580.88
  • Models affected: gpt-oss:20b, gpt-oss:120b

Steps to Reproduce:

  • ollama run gpt-oss:120b "Write a long story"
  • Observe that GPU utilization is at 0% and CPU is at 100%.
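
To see where the model actually lands while reproducing, something like this helps (a rough sketch; adjust the model tag to taste):

# in one terminal, watch GPU memory and utilization
nvidia-smi -l 2

# in another, load the model, then ask Ollama how it placed it
ollama run gpt-oss:120b "Write a long story"
ollama ps    # the PROCESSOR column shows the CPU/GPU split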

Thanks for your great work on this project. Let me know if you need any more information.

server.log (https://github.com/user-attachments/files/21605611/server.log)

GiteaMirror added the bug label 2026-04-12 19:50:08 -05:00

@russellmm commented on GitHub (Aug 5, 2025):

server.log (https://github.com/user-attachments/files/21605549/server.log)
Can confirm. Same issue for me. Swapped over to qwen3:30b just to be sure and it is using the GPU fine.

@Shawneau commented on GitHub (Aug 5, 2025):

Can confirm it's not working in Docker on an NVIDIA GPU, while other models load fine. Host is Ubuntu 22.something
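
For comparison, a minimal sketch of a GPU-enabled Docker setup (assumes the NVIDIA Container Toolkit is installed on the host; the actual compose file here may differ):

docker run -d --gpus all -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama
docker exec -it ollama ollama run gpt-oss:20b

Other models started this way do use the GPU, so the container can see it; only gpt-oss falls back to CPU.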

@jessegross commented on GitHub (Aug 5, 2025):

Can you please post the server logs (https://github.com/ollama/ollama/blob/main/docs/troubleshooting.md#how-to-troubleshoot-issues)?

@av commented on GitHub (Aug 5, 2025):

@jessegross, sorry for the extra pull logs in the middle

Details

harbor.ollama  | time=2025-08-05T18:39:37.230Z level=INFO source=gpu.go:217 msg="looking for compatible GPUs"
harbor.ollama  | time=2025-08-05T18:39:37.380Z level=INFO source=types.go:130 msg="inference compute" id=GPU-4f549573-5491-abe4-bcf8-8804171f6b2b library=cuda variant=v12 compute=8.9 driver=12.9 name="NVIDIA GeForce RTX 4090 Laptop GPU" total="15.6 GiB" available="15.3 GiB"
harbor.ollama  | [GIN] 2025/08/05 - 18:39:38 | 200 |      80.352µs |      172.22.0.3 | HEAD     "/"
harbor.ollama  | [GIN] 2025/08/05 - 18:39:39 | 200 |  650.019494ms |      172.22.0.3 | POST     "/api/pull"
harbor.ollama  | [GIN] 2025/08/05 - 18:39:40 | 200 |      25.962µs |      172.22.0.4 | HEAD     "/"
harbor.ollama  | time=2025-08-05T18:39:40.835Z level=INFO source=download.go:177 msg="downloading b112e727c6f1 in 16 861 MB part(s)"
harbor.ollama  | time=2025-08-05T18:41:21.300Z level=INFO source=download.go:295 msg="b112e727c6f1 part 13 attempt 0 failed: unexpected EOF, retrying in 1s"
harbor.ollama  | time=2025-08-05T18:41:50.892Z level=INFO source=download.go:295 msg="b112e727c6f1 part 6 attempt 0 failed: unexpected EOF, retrying in 1s"
harbor.ollama  | [GIN] 2025/08/05 - 18:43:29 | 200 |      14.745µs |      172.22.0.5 | HEAD     "/"
harbor.ollama  | [GIN] 2025/08/05 - 18:43:35 | 200 |  6.075061495s |      172.22.0.5 | POST     "/api/pull"
harbor.ollama  | time=2025-08-05T18:44:12.210Z level=INFO source=download.go:177 msg="downloading 51468a0fd901 in 1 7.4 KB part(s)"
harbor.ollama  | time=2025-08-05T18:44:13.589Z level=INFO source=download.go:177 msg="downloading d8ba2f9a17b3 in 1 18 B part(s)"
harbor.ollama  | time=2025-08-05T18:44:14.979Z level=INFO source=download.go:177 msg="downloading fcaef9305bb6 in 1 415 B part(s)"
harbor.ollama  | [GIN] 2025/08/05 - 18:44:22 | 200 |         4m42s |      172.22.0.4 | POST     "/api/pull"
harbor.ollama  | [GIN] 2025/08/05 - 18:44:28 | 200 |    27.91755ms |      172.22.0.3 | GET      "/api/tags"
harbor.ollama  | [GIN] 2025/08/05 - 18:44:28 | 200 |      84.839µs |      172.22.0.3 | GET      "/api/ps"
harbor.ollama  | [GIN] 2025/08/05 - 18:44:28 | 200 |      35.083µs |      172.22.0.3 | GET      "/api/version"
harbor.ollama  | time=2025-08-05T18:44:31.245Z level=INFO source=server.go:135 msg="system memory" total="62.4 GiB" free="52.0 GiB" free_swap="20.0 GiB"
harbor.ollama  | time=2025-08-05T18:44:31.245Z level=INFO source=server.go:175 msg=offload library=cuda layers.requested=-1 layers.model=25 layers.offload=0 layers.split="" memory.available="[15.3 GiB]" memory.gpu_overhead="0 B" memory.required.full="12.9 GiB" memory.required.partial="0 B" memory.required.kv="1.1 GiB" memory.required.allocations="[0 B]" memory.weights.total="11.7 GiB" memory.weights.repeating="10.7 GiB" memory.weights.nonrepeating="1.1 GiB" memory.graph.full="8.0 GiB" memory.graph.partial="16.0 GiB"
harbor.ollama  | time=2025-08-05T18:44:31.273Z level=INFO source=server.go:438 msg="starting llama server" cmd="/usr/bin/ollama runner --ollama-engine --model /root/.ollama/models/blobs/sha256-b112e727c6f18875636c56a779790a590d705aec9e1c0eb5a97d51fc2a778583 --ctx-size 32768 --batch-size 512 --threads 8 --no-mmap --parallel 4 --port 35337"
harbor.ollama  | time=2025-08-05T18:44:31.273Z level=INFO source=sched.go:481 msg="loaded runners" count=1
harbor.ollama  | time=2025-08-05T18:44:31.274Z level=INFO source=server.go:598 msg="waiting for llama runner to start responding"
harbor.ollama  | time=2025-08-05T18:44:31.274Z level=INFO source=server.go:632 msg="waiting for server to become available" status="llm server not responding"
harbor.ollama  | time=2025-08-05T18:44:31.282Z level=INFO source=runner.go:925 msg="starting ollama engine"
harbor.ollama  | time=2025-08-05T18:44:31.282Z level=INFO source=runner.go:983 msg="Server listening on 127.0.0.1:35337"
harbor.ollama  | time=2025-08-05T18:44:31.314Z level=INFO source=ggml.go:92 msg="" architecture=gptoss file_type=MXFP4 name="" description="" num_tensors=315 num_key_values=30
harbor.ollama  | ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
harbor.ollama  | ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
harbor.ollama  | ggml_cuda_init: found 1 CUDA devices:
harbor.ollama  |   Device 0: NVIDIA GeForce RTX 4090 Laptop GPU, compute capability 8.9, VMM: yes
harbor.ollama  | load_backend: loaded CUDA backend from /usr/lib/ollama/libggml-cuda.so
harbor.ollama  | load_backend: loaded CPU backend from /usr/lib/ollama/libggml-cpu-alderlake.so
harbor.ollama  | time=2025-08-05T18:44:31.361Z level=INFO source=ggml.go:104 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX_VNNI=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.BMI2=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 CUDA.0.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.0.USE_GRAPHS=1 CUDA.0.PEER_MAX_BATCH_SIZE=128 compiler=cgo(gcc)
harbor.ollama  | time=2025-08-05T18:44:31.422Z level=INFO source=ggml.go:367 msg="offloading 0 repeating layers to GPU"
harbor.ollama  | time=2025-08-05T18:44:31.422Z level=INFO source=ggml.go:371 msg="offloading output layer to CPU"
harbor.ollama  | time=2025-08-05T18:44:31.422Z level=INFO source=ggml.go:378 msg="offloaded 0/25 layers to GPU"
harbor.ollama  | time=2025-08-05T18:44:31.422Z level=INFO source=ggml.go:381 msg="model weights" buffer=CPU size="12.8 GiB"
harbor.ollama  | time=2025-08-05T18:44:31.525Z level=INFO source=server.go:632 msg="waiting for server to become available" status="llm server loading model"
harbor.ollama  | time=2025-08-05T18:44:31.698Z level=INFO source=ggml.go:672 msg="compute graph" backend=CUDA0 buffer_type=CUDA0 size="0 B"
harbor.ollama  | time=2025-08-05T18:44:31.698Z level=INFO source=ggml.go:672 msg="compute graph" backend=CPU buffer_type=CPU size="8.0 GiB"
harbor.ollama  | time=2025-08-05T18:44:32.787Z level=INFO source=server.go:637 msg="llama runner started in 1.51 seconds"

Similar setup, Ollama v0.11 + Docker, other models use GPU as expected

@hrz6976 commented on GitHub (Aug 5, 2025):

Same here on 4xL40s.

time=2025-08-05T18:48:36.321Z level=INFO source=sched.go:546 msg="updated VRAM based on existing loaded models" gpu=GPU-464c9e2c-5e57-838c-e947-f75970e572bd library=cuda total="44.4 GiB" available="43.6 GiB"
time=2025-08-05T18:48:36.321Z level=INFO source=sched.go:546 msg="updated VRAM based on existing loaded models" gpu=GPU-b41cc73c-5bc1-2795-95ad-ec87002c38e2 library=cuda total="44.4 GiB" available="43.6 GiB"
time=2025-08-05T18:48:36.321Z level=INFO source=sched.go:546 msg="updated VRAM based on existing loaded models" gpu=GPU-b7ad3d9e-25dc-58ee-8b74-aaa56e955517 library=cuda total="44.4 GiB" available="43.6 GiB"
time=2025-08-05T18:48:36.321Z level=INFO source=sched.go:546 msg="updated VRAM based on existing loaded models" gpu=GPU-81971a31-64fe-a071-c6a5-de5dc026e0f7 library=cuda total="44.4 GiB" available="43.6 GiB"
time=2025-08-05T18:48:41.318Z level=INFO source=server.go:135 msg="system memory" total="755.5 GiB" free="688.2 GiB" free_swap="60.0 GiB"
time=2025-08-05T18:48:41.318Z level=INFO source=server.go:175 msg=offload library=cuda layers.requested=-1 layers.model=37 layers.offload=0 layers.split="" memory.available="[44.0 GiB 44.0 GiB 44.0 GiB 44.0 GiB]" memory.gpu_overhead="0 B" memory.required.full="86.7 GiB" memory.required.partial="0 B" memory.required.kv="27.0 GiB" memory.required.allocations="[0 B 0 B 0 B 0 B]" memory.weights.total="59.7 GiB" memory.weights.repeating="58.6 GiB" memory.weights.nonrepeating="1.1 GiB" memory.graph.full="384.0 GiB" memory.graph.partial="384.0 GiB"
time=2025-08-05T18:48:41.361Z level=INFO source=server.go:438 msg="starting llama server" cmd="/usr/bin/ollama runner --ollama-engine --model /root/.ollama/models/blobs/sha256-90a618fe6ff21b09ca968df959104eb650658b0bef0faef785c18c2795d993e3 --ctx-size 524288 --batch-size 512 --threads 112 --no-mmap --parallel 64 --port 32909"
time=2025-08-05T18:48:41.362Z level=INFO source=sched.go:481 msg="loaded runners" count=1
time=2025-08-05T18:48:41.362Z level=INFO source=server.go:598 msg="waiting for llama runner to start responding"
time=2025-08-05T18:48:41.362Z level=INFO source=server.go:632 msg="waiting for server to become available" status="llm server not responding"
time=2025-08-05T18:48:41.378Z level=INFO source=runner.go:925 msg="starting ollama engine"
time=2025-08-05T18:48:41.379Z level=INFO source=runner.go:983 msg="Server listening on 127.0.0.1:32909"
time=2025-08-05T18:48:41.451Z level=INFO source=ggml.go:92 msg="" architecture=gptoss file_type=MXFP4 name="" description="" num_tensors=471 num_key_values=30
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 4 CUDA devices:
  Device 0: NVIDIA L40S, compute capability 8.9, VMM: yes
  Device 1: NVIDIA L40S, compute capability 8.9, VMM: yes
  Device 2: NVIDIA L40S, compute capability 8.9, VMM: yes
  Device 3: NVIDIA L40S, compute capability 8.9, VMM: yes
time=2025-08-05T18:48:41.614Z level=INFO source=server.go:632 msg="waiting for server to become available" status="llm server loading model"
load_backend: loaded CUDA backend from /usr/lib/ollama/libggml-cuda.so
load_backend: loaded CPU backend from /usr/lib/ollama/libggml-cpu-icelake.so
time=2025-08-05T18:48:41.849Z level=INFO source=ggml.go:104 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.BMI2=1 CPU.0.AVX512=1 CPU.0.AVX512_VBMI=1 CPU.0.AVX512_VNNI=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 CUDA.0.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.0.USE_GRAPHS=1 CUDA.0.PEER_MAX_BATCH_SIZE=128 CUDA.1.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.1.USE_GRAPHS=1 CUDA.1.PEER_MAX_BATCH_SIZE=128 CUDA.2.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.2.USE_GRAPHS=1 CUDA.2.PEER_MAX_BATCH_SIZE=128 CUDA.3.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.3.USE_GRAPHS=1 CUDA.3.PEER_MAX_BATCH_SIZE=128 compiler=cgo(gcc)
time=2025-08-05T18:48:41.955Z level=INFO source=ggml.go:367 msg="offloading 0 repeating layers to GPU"
time=2025-08-05T18:48:41.955Z level=INFO source=ggml.go:371 msg="offloading output layer to CPU"
time=2025-08-05T18:48:41.955Z level=INFO source=ggml.go:378 msg="offloaded 0/37 layers to GPU"
time=2025-08-05T18:48:41.955Z level=INFO source=ggml.go:381 msg="model weights" buffer=CPU size="60.8 GiB"
time=2025-08-05T18:48:53.769Z level=INFO source=ggml.go:672 msg="compute graph" backend=CUDA0 buffer_type=CUDA0 size="0 B"
time=2025-08-05T18:48:53.769Z level=INFO source=ggml.go:672 msg="compute graph" backend=CUDA1 buffer_type=CUDA1 size="0 B"
time=2025-08-05T18:48:53.769Z level=INFO source=ggml.go:672 msg="compute graph" backend=CUDA2 buffer_type=CUDA2 size="0 B"
time=2025-08-05T18:48:53.769Z level=INFO source=ggml.go:672 msg="compute graph" backend=CUDA3 buffer_type=CUDA3 size="0 B"
time=2025-08-05T18:48:53.769Z level=INFO source=ggml.go:672 msg="compute graph" backend=CPU buffer_type=CPU size="128.0 GiB"
time=2025-08-05T18:48:56.185Z level=INFO source=server.go:637 msg="llama runner started in 14.82 seconds"
@jessegross commented on GitHub (Aug 5, 2025):

@av @hrz6976

It looks like you both increased OLLAMA_NUM_PARALLEL. I would recommend leaving it at the default setting, as higher values use more VRAM and reduce the ability to offload.
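
If it isn't obvious where the setting is coming from, a sketch of how to check and clear it on the standard Linux/systemd install (Docker users would instead drop the -e OLLAMA_NUM_PARALLEL=... flag when starting the container):

systemctl show ollama --property=Environment   # see what the service currently has set
sudo systemctl edit ollama                      # remove the OLLAMA_NUM_PARALLEL line from the drop-in
sudo systemctl restart ollama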

@nadamas2000 commented on GitHub (Aug 5, 2025):

Thanks for the suggestion. I've confirmed that I'm using OLLAMA_NUM_PARALLEL=1. I have updated the issue description with the latest logs.

@Shawneau commented on GitHub (Aug 5, 2025):

I have a feeling everyone experiencing this is hitting Ollama via Open WebUI? Command-line Ollama works with 100% GPU, but hitting it through Open WebUI goes 100% CPU.

@av commented on GitHub (Aug 5, 2025):

Understandable!

With OLLAMA_NUM_PARALLEL=1 the split is now:

NAME           ID              SIZE     PROCESSOR          CONTEXT    UNTIL              
gpt-oss:20b    05afbac4bad6    18 GB    12%/88% CPU/GPU    8192       4 minutes from now    

With OLLAMA_NUM_PARALLEL=4, it looks like:

NAME           ID              SIZE     PROCESSOR    CONTEXT    UNTIL              
gpt-oss:20b    05afbac4bad6    13 GB    100% CPU     8192       4 minutes from now    

So, possibly something is off with either ps or the estimator, as clearly batching should allocate more memory.

In both instances it only uses ~12.9 GB of VRAM, leaving some space unallocated; I hope there's some way to use that and improve the performance a bit.

@jessegross commented on GitHub (Aug 5, 2025):

@nadamas2000 It looks like you increased the context length; this has a similar effect to increasing NUM_PARALLEL. You'll need to use a lower value or the default.
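
There are a few equivalent ways to bring the context back down, depending on how Ollama is being driven (a sketch, using 8192 as an example value):

# server-wide default, e.g. in the systemd drop-in or via docker -e
OLLAMA_CONTEXT_LENGTH=8192

# per session, from the interactive CLI
ollama run gpt-oss:120b
>>> /set parameter num_ctx 8192

# per request, through the API
curl http://localhost:11434/api/generate -d '{"model": "gpt-oss:120b", "prompt": "hello", "options": {"num_ctx": 8192}}'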

@nadamas2000 commented on GitHub (Aug 5, 2025):

OK, in my case the GPUs run well with a 4k context.
Thanks.

@hrz6976 commented on GitHub (Aug 5, 2025):

Thanks for spotting this! I misunderstood how OLLAMA_NUM_PARALLEL works (related: https://github.com/ollama/ollama/issues/4170). It worked after removing OLLAMA_NUM_PARALLEL from the environment variables. 😄
P.S. Is there a way for Ollama itself to calculate how many requests it can handle before falling back to CPU? I can't find an optimal OLLAMA_NUM_PARALLEL, as it applies to all models and I sometimes need to run different models in parallel.

@HuChundong commented on GitHub (Aug 5, 2025):

I have 4x 2080 Ti 22GB, 88GB in total. gpt-oss:120b is using ~10% CPU with an 8k context. Is 88GB not enough for the 120B model?

@russellmm commented on GitHub (Aug 5, 2025):

It was context size for me. On a 5090, I can set the context window to 32K and the model uses the GPU. Setting it to 64K and it switches to the CPU.

@Shawneau commented on GitHub (Aug 5, 2025):

> It was context size for me. On a 5090, I can set the context window to 32K and the model uses the GPU. Setting it to 64K and it switches to the CPU.

Yeah, that works for me too. What's the context window for the model, though? It's still a bug if it happens over 32K (might not be an Ollama bug; it might be Open WebUI or elsewhere).

@abhinavxd commented on GitHub (Aug 5, 2025):

Yes, it's the context size. It works well with the Ollama UI and CLI (uses GPU).
But when I add this model to GitHub Copilot, the context goes up to 32,768 and it doesn't use the GPU at all.

I got a 4080

@torbwol commented on GitHub (Aug 5, 2025):

It's so weird... With a context of 8192 it utilizes one of my two GPUs and says the size is 22GB. When I increase the context to 16384 it goes 100% CPU and says the size is 13GB. How does this make any sense? Why can't it use both GPUs, and why can't it use the GPUs at all when I increase the context size?

@SierraKiloGulf commented on GitHub (Aug 5, 2025):

Same here, team red: 7900 XTX. It doesn't matter whether I use the CLI, Open WebUI, AnythingLLM, or the like. Windows/Ubuntu.

@thedaveCA commented on GitHub (Aug 6, 2025):

As a datapoint: 0.11.0 ran gpt-oss:20b on CPU for me, 0.11.2 on GPU. 7900XTX w/24GB VRAM, reporting 14.8GiB in use.

@ZYJZYJZYJ0801 commented on GitHub (Aug 6, 2025):

Same for me:

NAME            SIZE     PROCESSOR    CONTEXT    UNTIL
gpt-oss:120b    67 GB    100% CPU     8192       3 minutes from now

How do I fix it? Other models can use 100% GPU.

@coolbirdzik commented on GitHub (Aug 6, 2025):

Same for me too, with 2x A4000.

@n0k0de commented on GitHub (Aug 6, 2025):

In my setup with a 5060 Ti 16GB (Ollama + Open WebUI all on Docker), Ollama only offloads 22 out of 24 layers to the GPU, even though there are still 3GB of VRAM available.

The OLLAMA_NUM_PARALLEL variable is set to 1.

Additionally, even when I set num_ctx to 4096 via Open WebUI, the context remains at 8192. Hard to say whether this issue comes from Ollama or Open WebUI.
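
One way to tell whether the 8192 is coming from Ollama or from Open WebUI is to bypass the UI and pass num_ctx directly, then check what ollama ps reports in the CONTEXT column (a sketch):

curl http://localhost:11434/api/chat -d '{"model": "gpt-oss:20b", "messages": [{"role": "user", "content": "hi"}], "options": {"num_ctx": 4096}}'
ollama ps

If ollama ps now reports 4096 in the CONTEXT column, the option is being honored and the stuck 8192 is coming from the UI side.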

@Ca-rs-on commented on GitHub (Aug 6, 2025):

FWIW I accidentally pulled the wrong Docker image when upgrading to use gpt-oss and it caused this same problem, if you're running NVIDIA don't pull the rocm tag lol.

@ricardofiorani commented on GitHub (Aug 6, 2025):

Same here

@jessegross commented on GitHub (Aug 6, 2025):

There was a bug in 0.11.2 and below where the memory estimate could become too high for gpt-oss if the model needed to be split across GPU and CPU or across multiple GPUs. This would often cause 100% CPU usage once the model overflowed a single GPU.

This is fixed in 0.11.3.
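
For the standard Linux install, upgrading and re-checking is quick (a sketch; Docker users would pull a newer image tag instead):

curl -fsSL https://ollama.com/install.sh | sh   # re-running the installer upgrades in place
ollama -v                                       # should report 0.11.3 or later
ollama run gpt-oss:20b "hello"
ollama ps                                       # check the CPU/GPU split again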

@trdischat commented on GitHub (Aug 6, 2025):

Upgrading to 0.11.3 allowed gpt-oss:20b to load at least partially on the GPU. But the memory consumed by the model more than doubled. With 0.11.2, the model used 13GB of memory, 100% on the CPU. With 0.11.3, the model uses 32GB of memory, split 24%/76% between CPU and GPU. This is just running ollama run gpt-oss at the command line.

I am running Ollama on Ubuntu 20.04 with these environment variable settings:

Environment="OLLAMA_CONTEXT_LENGTH=32000"
Environment="OLLAMA_NUM_PARALLEL=1"
Environment="GGML_CUDA_ENABLE_UNIFIED_MEMORY=1"

The server has 2 RTX 3060 for a total of 24GB of VRAM and 96GB of system RAM. Reducing the context length to 2000 brought the memory used by the model down to 19GB (running 100% on GPU), still way more than in Ollama 0.11.2.

Testing with other models, including llama, mistral, qwen3, etc., reveals that all models seem to be using more RAM in 0.11.3 than they were in 0.11.2.
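
If it helps narrow this down, the scheduler's estimate is written to the msg=offload line in the server log, so the 0.11.2 and 0.11.3 numbers can be compared directly (a sketch, assuming the systemd install):

journalctl -u ollama --no-pager | grep -F "msg=offload" | tail -n 2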

@ZYJZYJZYJ0801 commented on GitHub (Aug 7, 2025):

Ollama version is 0.11.3.

NAME            SIZE      PROCESSOR          CONTEXT    UNTIL
gpt-oss:120b    151 GB    37%/63% CPU/GPU    8192       About a minute from now

GPU: 3x RTX 5000 Ada
Memory: 2x 128GB
It can't use the GPUs fully. How can I fix it?

@azomDev commented on GitHub (Aug 7, 2025):

Similar issue here #11688

@alienatedsec commented on GitHub (Aug 7, 2025):

root@[redacted]:/# nvidia-smi
Thu Aug  7 22:04:07 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.172.08             Driver Version: 570.172.08     CUDA Version: 12.8     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA RTX A4000               On  |   00000000:01:00.0 Off |                  Off |
| 54%   66C    P8             13W /  140W |   12617MiB /  16376MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA GeForce RTX 4060 Ti     On  |   00000000:02:00.0 Off |                  N/A |
| 34%   48C    P8              6W /  165W |   14471MiB /  16380MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   2  NVIDIA RTX A4000               On  |   00000000:03:00.0 Off |                  Off |
| 54%   63C    P8             14W /  140W |   14285MiB /  16376MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   3  NVIDIA RTX A4000               On  |   00000000:04:00.0 Off |                  Off |
| 58%   68C    P8             10W /  140W |   12617MiB /  16376MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   4  NVIDIA RTX A4000               On  |   00000000:05:00.0 Off |                  Off |
| 60%   69C    P8             11W /  140W |   12041MiB /  16376MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A             122      C   /usr/bin/ollama                         320MiB |
|    1   N/A  N/A             122      C   /usr/bin/ollama                         280MiB |
|    2   N/A  N/A             122      C   /usr/bin/ollama                         322MiB |
|    3   N/A  N/A             122      C   /usr/bin/ollama                         320MiB |
|    4   N/A  N/A             122      C   /usr/bin/ollama                         384MiB |
+-----------------------------------------------------------------------------------------+

And to follow up on the Ollama side:

root@[redacted]:/# ollama ps
NAME            ID              SIZE     PROCESSOR          CONTEXT    UNTIL   
gpt-oss:120b    735371f916a9    91 GB    13%/87% CPU/GPU    8192       Forever    
root@[redacted]:/# ollama -v
ollama version is 0.11.3
root@[redacted]:/# 

It seems like there is still some GPU memory left to allocate. Some testing below:

OLLAMA_NUM_PARALLEL=1

root@[redacted]:/# ollama run --verbose gpt-oss:120b
>>> write 3 paragraphs about cryptography
Thinking...
User asks: "write 3 paragraphs about cryptography". Provide three paragraphs, likely descriptive. Ensure good content.
...done thinking.

Cryptography, the art and science of securing information, has evolved from simple substitution ciphers used by ancient civilizations to sophisticated mathematical 
frameworks that underpin modern digital security. At its core, cryptography transforms readable data (plaintext) into an unintelligible form (ciphertext) using 
algorithms and keys, ensuring that only authorized parties can recover the original message. Early techniques, such as the Caesar shift and the Enigma machine, relied on 
mechanical or manual processes, but the advent of computers introduced computational complexity as a cornerstone, allowing for encryption schemes that are practically 
unbreakable given current technology.

The modern landscape of cryptography is divided primarily into two complementary paradigms: symmetric-key and asymmetric-key cryptography. Symmetric algorithms, like AES 
(Advanced Encryption Standard), use a single secret key for both encryption and decryption, offering high speed and efficiency for bulk data protection. Asymmetric 
systems, exemplified by RSA and elliptic‑curve cryptography (ECC), employ a pair of mathematically linked keys—a public key for encryption and a private key for 
decryption—enabling secure key exchange, digital signatures, and authentication without prior secret sharing. Together, these techniques form the backbone of protocols 
such as TLS/SSL, VPNs, and end‑to‑end encrypted messaging apps, safeguarding everything from online banking to personal communications.

Beyond confidentiality, cryptography also addresses integrity, authenticity, and non‑repudiation through tools like hash functions, message authentication codes (MACs), 
and digital signatures. Cryptographic hash functions (e.g., SHA‑256) produce fixed‑size digests that uniquely represent data, making it easy to detect tampering. Digital 
signatures, generated with a private key and verified with the corresponding public key, provide proof that a specific entity authored a message and cannot later deny 
it. As quantum computing looms on the horizon, researchers are already developing post‑quantum algorithms to replace vulnerable schemes, ensuring that the principles of 
cryptography continue to protect information in an increasingly connected and computationally powerful world.

total duration:       24.001994779s
load duration:        342.189318ms
prompt eval count:    74 token(s)
prompt eval duration: 1.409565258s
prompt eval rate:     52.50 tokens/s
eval count:           438 token(s)
eval duration:        22.246255913s
eval rate:            19.69 tokens/s
>>> Send a message (/? for help)
@Jonseed commented on GitHub (Aug 7, 2025):

I'm seeing a similar slowdown on Ollama with my 3060 12GB, where I only get about 4 t/s, which is almost unusable. In LM Studio I'm getting up to 13+ t/s, offloading 20 layers out of 24 (83%). With Ollama, ollama ps shows it is only using 68% of my GPU and offloading the rest to CPU, which could account for the slowdown.
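
If there is VRAM headroom left over, the layer split can be forced per session with the num_gpu option to mirror LM Studio's 20/24 offload (a sketch; if the value is set too high the load will simply fail or spill back to CPU):

ollama run gpt-oss:20b
>>> /set parameter num_gpu 20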

@azomDev commented on GitHub (Aug 10, 2025):

Sorry for the spam; I was trying to find all the similar issues so far and link them here, since this is the earliest issue about what seems to be broadly the same problem.

@jhsmith409 commented on GitHub (Aug 10, 2025):

I had the same issue in #11731 (listed above as well). 5090 + 5070 Ti for a total of 48GB VRAM. It runs 99%+ on CPU and consumes just a small amount of VRAM. When I calculate KV cache size + model + estimated overhead, I think it should easily fit... I pulled the model parameters from the model card directly on HF. My context size is large: 128k. Qwen3:30B fits easily with that context and matches my VRAM calculation for it. So I'm thinking there is either a bug remaining in the implementation, or something different about the model, such that the way I'm calculating KV cache size works for Qwen3 but not for gpt-oss.
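
For the back-of-the-envelope math, the usual full-attention estimate is roughly:

KV bytes ≈ 2 (K and V) × n_layers × n_kv_heads × head_dim × n_ctx × bytes_per_element

gpt-oss complicates this in both directions: it reportedly alternates full-attention and sliding-window layers (which shrinks the cache for the windowed layers), but Ollama also reserves a separate compute-graph buffer that the logs earlier in this thread show growing with context (8.0 GiB at a 32k context, 128.0 GiB at 512k). That graph buffer, not the KV cache itself, is probably where a hand calculation that works for Qwen3 ends up far below what the scheduler demands for gpt-oss.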

@alienatedsec commented on GitHub (Aug 10, 2025):

Not sure what happened recently; there seems to be some wrong reporting with the latest version.

root@[redacted]:/# ollama ps
NAME            ID              SIZE     PROCESSOR    CONTEXT    UNTIL   
gpt-oss:120b    735371f916a9    69 GB    100% CPU     128000     Forever    
root@[redacted]:/# ollama -v
ollama version is 0.11.4
root@[redacted]:~# nvidia-smi
Sun Aug 10 15:02:52 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.172.08             Driver Version: 570.172.08     CUDA Version: 12.8     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA RTX A4000               On  |   00000000:01:00.0 Off |                  Off |
| 42%   65C    P0             55W /  140W |   13131MiB /  16376MiB |     15%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA GeForce RTX 4060 Ti     On  |   00000000:02:00.0 Off |                  N/A |
|  0%   59C    P0             46W /  165W |   15239MiB /  16380MiB |     14%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   2  NVIDIA RTX A4000               On  |   00000000:03:00.0 Off |                  Off |
| 47%   69C    P0             61W /  140W |   15051MiB /  16376MiB |     15%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   3  NVIDIA RTX A4000               On  |   00000000:04:00.0 Off |                  Off |
| 48%   70C    P0             52W /  140W |   13387MiB /  16376MiB |     14%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   4  NVIDIA RTX A4000               On  |   00000000:05:00.0 Off |                  Off |
| 48%   70C    P0             57W /  140W |   12809MiB /  16376MiB |     17%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A            2417      C   /usr/bin/ollama                         320MiB |
|    1   N/A  N/A            1367      C   /usr/local/bin/python3                  222MiB |
|    1   N/A  N/A            2417      C   /usr/bin/ollama                         280MiB |
|    2   N/A  N/A            2417      C   /usr/bin/ollama                         320MiB |
|    3   N/A  N/A            2417      C   /usr/bin/ollama                         320MiB |
|    4   N/A  N/A            2417      C   /usr/bin/ollama                         384MiB |
+-----------------------------------------------------------------------------------------+
<!-- gh-comment-id:3172698087 --> @alienatedsec commented on GitHub (Aug 10, 2025): Not sure what happened recently - some wrong reporting with the latest version ``` root@[redacted]:/# ollama ps NAME ID SIZE PROCESSOR CONTEXT UNTIL gpt-oss:120b 735371f916a9 69 GB 100% CPU 128000 Forever root@[redacted]:/# ollama -v ollama version is 0.11.4 ``` ``` root@[redacted]:~# nvidia-smi Sun Aug 10 15:02:52 2025 +-----------------------------------------------------------------------------------------+ | NVIDIA-SMI 570.172.08 Driver Version: 570.172.08 CUDA Version: 12.8 | |-----------------------------------------+------------------------+----------------------+ | GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC | | Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. | | | | MIG M. | |=========================================+========================+======================| | 0 NVIDIA RTX A4000 On | 00000000:01:00.0 Off | Off | | 42% 65C P0 55W / 140W | 13131MiB / 16376MiB | 15% Default | | | | N/A | +-----------------------------------------+------------------------+----------------------+ | 1 NVIDIA GeForce RTX 4060 Ti On | 00000000:02:00.0 Off | N/A | | 0% 59C P0 46W / 165W | 15239MiB / 16380MiB | 14% Default | | | | N/A | +-----------------------------------------+------------------------+----------------------+ | 2 NVIDIA RTX A4000 On | 00000000:03:00.0 Off | Off | | 47% 69C P0 61W / 140W | 15051MiB / 16376MiB | 15% Default | | | | N/A | +-----------------------------------------+------------------------+----------------------+ | 3 NVIDIA RTX A4000 On | 00000000:04:00.0 Off | Off | | 48% 70C P0 52W / 140W | 13387MiB / 16376MiB | 14% Default | | | | N/A | +-----------------------------------------+------------------------+----------------------+ | 4 NVIDIA RTX A4000 On | 00000000:05:00.0 Off | Off | | 48% 70C P0 57W / 140W | 12809MiB / 16376MiB | 17% Default | | | | N/A | +-----------------------------------------+------------------------+----------------------+ +-----------------------------------------------------------------------------------------+ | Processes: | | GPU GI CI PID Type Process name GPU Memory | | ID ID Usage | |=========================================================================================| | 0 N/A N/A 2417 C /usr/bin/ollama 320MiB | | 1 N/A N/A 1367 C /usr/local/bin/python3 222MiB | | 1 N/A N/A 2417 C /usr/bin/ollama 280MiB | | 2 N/A N/A 2417 C /usr/bin/ollama 320MiB | | 3 N/A N/A 2417 C /usr/bin/ollama 320MiB | | 4 N/A N/A 2417 C /usr/bin/ollama 384MiB | +-----------------------------------------------------------------------------------------+ ```
Author
Owner

@rick-github commented on GitHub (Aug 10, 2025):

> Sorry for the spam, I was trying to find all similar issues so far to link them here since this is the earliest issue about what seems to be generally the same problem

Most of these issues are because the context is too big. Reduce context, reduce VRAM.

<!-- gh-comment-id:3172773038 --> @rick-github commented on GitHub (Aug 10, 2025): > Sorry for the spam, I was trying to find all similar issues so far to link them here since this is the earliest issue about what seems to be generally the same problem Most of these issues are because the context is too big. Reduce context, reduce VRAM.
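For anyone hitting this, a few hedged examples of lowering the context Ollama plans for (the 8192 below is just an example; pick a value that fits your VRAM):

```
# per session, in the CLI
ollama run gpt-oss:120b
>>> /set parameter num_ctx 8192

# server-wide default (this is the OLLAMA_CONTEXT_LENGTH value visible in the server config log)
OLLAMA_CONTEXT_LENGTH=8192 ollama serve

# per request, over the API
curl http://localhost:11434/api/generate \
  -d '{"model": "gpt-oss:120b", "prompt": "hi", "options": {"num_ctx": 8192}}'
```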
Author
Owner

@alienatedsec commented on GitHub (Aug 11, 2025):

Not sure what happened recently - some wrong reporting with the latest version

root@[redacted]:/# ollama ps
NAME            ID              SIZE     PROCESSOR    CONTEXT    UNTIL   
gpt-oss:120b    735371f916a9    69 GB    100% CPU     128000     Forever    
root@[redacted]:/# ollama -v
ollama version is 0.11.4
root@[redacted]:~# nvidia-smi
Sun Aug 10 15:02:52 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.172.08             Driver Version: 570.172.08     CUDA Version: 12.8     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA RTX A4000               On  |   00000000:01:00.0 Off |                  Off |
| 42%   65C    P0             55W /  140W |   13131MiB /  16376MiB |     15%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA GeForce RTX 4060 Ti     On  |   00000000:02:00.0 Off |                  N/A |
|  0%   59C    P0             46W /  165W |   15239MiB /  16380MiB |     14%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   2  NVIDIA RTX A4000               On  |   00000000:03:00.0 Off |                  Off |
| 47%   69C    P0             61W /  140W |   15051MiB /  16376MiB |     15%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   3  NVIDIA RTX A4000               On  |   00000000:04:00.0 Off |                  Off |
| 48%   70C    P0             52W /  140W |   13387MiB /  16376MiB |     14%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   4  NVIDIA RTX A4000               On  |   00000000:05:00.0 Off |                  Off |
| 48%   70C    P0             57W /  140W |   12809MiB /  16376MiB |     17%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A            2417      C   /usr/bin/ollama                         320MiB |
|    1   N/A  N/A            1367      C   /usr/local/bin/python3                  222MiB |
|    1   N/A  N/A            2417      C   /usr/bin/ollama                         280MiB |
|    2   N/A  N/A            2417      C   /usr/bin/ollama                         320MiB |
|    3   N/A  N/A            2417      C   /usr/bin/ollama                         320MiB |
|    4   N/A  N/A            2417      C   /usr/bin/ollama                         384MiB |
+-----------------------------------------------------------------------------------------+

Attaching logs in case they're relevant. These are from today, but the ollama ps output is the same as yesterday's.

_ollama_logs.txt

<!-- gh-comment-id:3175264514 --> @alienatedsec commented on GitHub (Aug 11, 2025): > Not sure what happened recently - some wrong reporting with the latest version > > ``` > root@[redacted]:/# ollama ps > NAME ID SIZE PROCESSOR CONTEXT UNTIL > gpt-oss:120b 735371f916a9 69 GB 100% CPU 128000 Forever > root@[redacted]:/# ollama -v > ollama version is 0.11.4 > ``` > > ``` > root@[redacted]:~# nvidia-smi > Sun Aug 10 15:02:52 2025 > +-----------------------------------------------------------------------------------------+ > | NVIDIA-SMI 570.172.08 Driver Version: 570.172.08 CUDA Version: 12.8 | > |-----------------------------------------+------------------------+----------------------+ > | GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC | > | Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. | > | | | MIG M. | > |=========================================+========================+======================| > | 0 NVIDIA RTX A4000 On | 00000000:01:00.0 Off | Off | > | 42% 65C P0 55W / 140W | 13131MiB / 16376MiB | 15% Default | > | | | N/A | > +-----------------------------------------+------------------------+----------------------+ > | 1 NVIDIA GeForce RTX 4060 Ti On | 00000000:02:00.0 Off | N/A | > | 0% 59C P0 46W / 165W | 15239MiB / 16380MiB | 14% Default | > | | | N/A | > +-----------------------------------------+------------------------+----------------------+ > | 2 NVIDIA RTX A4000 On | 00000000:03:00.0 Off | Off | > | 47% 69C P0 61W / 140W | 15051MiB / 16376MiB | 15% Default | > | | | N/A | > +-----------------------------------------+------------------------+----------------------+ > | 3 NVIDIA RTX A4000 On | 00000000:04:00.0 Off | Off | > | 48% 70C P0 52W / 140W | 13387MiB / 16376MiB | 14% Default | > | | | N/A | > +-----------------------------------------+------------------------+----------------------+ > | 4 NVIDIA RTX A4000 On | 00000000:05:00.0 Off | Off | > | 48% 70C P0 57W / 140W | 12809MiB / 16376MiB | 17% Default | > | | | N/A | > +-----------------------------------------+------------------------+----------------------+ > > +-----------------------------------------------------------------------------------------+ > | Processes: | > | GPU GI CI PID Type Process name GPU Memory | > | ID ID Usage | > |=========================================================================================| > | 0 N/A N/A 2417 C /usr/bin/ollama 320MiB | > | 1 N/A N/A 1367 C /usr/local/bin/python3 222MiB | > | 1 N/A N/A 2417 C /usr/bin/ollama 280MiB | > | 2 N/A N/A 2417 C /usr/bin/ollama 320MiB | > | 3 N/A N/A 2417 C /usr/bin/ollama 320MiB | > | 4 N/A N/A 2417 C /usr/bin/ollama 384MiB | > +-----------------------------------------------------------------------------------------+ > ``` Attaching logs if relevant. These are from today, but the same `ollama ps` output as of yesterday [_ollama_logs.txt](https://github.com/user-attachments/files/21717377/_ollama_logs.txt)
Author
Owner

@rick-github commented on GitHub (Aug 11, 2025):

What is wrong with the reporting?

<!-- gh-comment-id:3175270460 --> @rick-github commented on GitHub (Aug 11, 2025): What is wrong with the reporting?
Author
Owner

@alienatedsec commented on GitHub (Aug 11, 2025):

@rick-github the CPU usage

root@[redacted]:/# ollama ps
NAME            ID              SIZE     PROCESSOR    CONTEXT    UNTIL   
gpt-oss:120b    735371f916a9    69 GB    100% CPU     128000     Forever    
root@[redacted]:/# ollama -v
ollama version is 0.11.4

It seems the below could also be related, as I also use OpenWebUI

> I have a feeling everyone experiencing this is hitting Ollama via OpenWebUI? Command line Ollama works with 100% GPU, but hitting it with open web ui goes 100% CPU

<!-- gh-comment-id:3175278677 --> @alienatedsec commented on GitHub (Aug 11, 2025): @rick-github the CPU usage ``` root@[redacted]:/# ollama ps NAME ID SIZE PROCESSOR CONTEXT UNTIL gpt-oss:120b 735371f916a9 69 GB 100% CPU 128000 Forever root@[redacted]:/# ollama -v ollama version is 0.11.4 ``` Seems the below could also be related, as I also use `OpenWebUI` > I have a feeling everyone experiencing this is hitting Ollama via OpenWebUI? Command line Ollama works with 100% GPU, but hitting it with open web ui goes 100% CPU
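One way to check whether the front end is the variable is to compare what the server actually started in each case; the runner's command line shows the context and GPU layer count the scheduler settled on:

```
# --ctx-size and --n-gpu-layers reflect what the scheduler decided for this load
ps wwh p$(pidof ollama)
# the PROCESSOR / CONTEXT columns come from the same estimate
ollama ps
```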
Author
Owner

@rick-github commented on GitHub (Aug 11, 2025):

The model is loaded 100% in CPU, which is correct.

<!-- gh-comment-id:3175283469 --> @rick-github commented on GitHub (Aug 11, 2025): The model is loaded 100% in CPU, which is correct.
Author
Owner

@alienatedsec commented on GitHub (Aug 11, 2025):

@rick-github it doesn't feel that way, unless I am missing something.

<!-- gh-comment-id:3175300219 --> @alienatedsec commented on GitHub (Aug 11, 2025): @rick-github it doesn't feel that way, unless I am missing something.
Author
Owner

@rick-github commented on GitHub (Aug 11, 2025):

Could you define "feel"?

<!-- gh-comment-id:3175306223 --> @rick-github commented on GitHub (Aug 11, 2025): Could you define "feel"?
Author
Owner

@alienatedsec commented on GitHub (Aug 11, 2025):

@rick-github could you explain this?

root@[redacted]:~# nvidia-smi
Sun Aug 10 15:02:52 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.172.08             Driver Version: 570.172.08     CUDA Version: 12.8     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA RTX A4000               On  |   00000000:01:00.0 Off |                  Off |
| 42%   65C    P0             55W /  140W |   13131MiB /  16376MiB |     15%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA GeForce RTX 4060 Ti     On  |   00000000:02:00.0 Off |                  N/A |
|  0%   59C    P0             46W /  165W |   15239MiB /  16380MiB |     14%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   2  NVIDIA RTX A4000               On  |   00000000:03:00.0 Off |                  Off |
| 47%   69C    P0             61W /  140W |   15051MiB /  16376MiB |     15%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   3  NVIDIA RTX A4000               On  |   00000000:04:00.0 Off |                  Off |
| 48%   70C    P0             52W /  140W |   13387MiB /  16376MiB |     14%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   4  NVIDIA RTX A4000               On  |   00000000:05:00.0 Off |                  Off |
| 48%   70C    P0             57W /  140W |   12809MiB /  16376MiB |     17%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A            2417      C   /usr/bin/ollama                         320MiB |
|    1   N/A  N/A            1367      C   /usr/local/bin/python3                  222MiB |
|    1   N/A  N/A            2417      C   /usr/bin/ollama                         280MiB |
|    2   N/A  N/A            2417      C   /usr/bin/ollama                         320MiB |
|    3   N/A  N/A            2417      C   /usr/bin/ollama                         320MiB |
|    4   N/A  N/A            2417      C   /usr/bin/ollama                         384MiB |
+-----------------------------------------------------------------------------------------+

Edit - I don't understand how the model can be reported in Ollama as 100% CPU while, at the same time, around 80-90% of the GPU VRAM is utilised.

<!-- gh-comment-id:3175311181 --> @alienatedsec commented on GitHub (Aug 11, 2025): @rick-github could explain this? ``` root@[redacted]:~# nvidia-smi Sun Aug 10 15:02:52 2025 +-----------------------------------------------------------------------------------------+ | NVIDIA-SMI 570.172.08 Driver Version: 570.172.08 CUDA Version: 12.8 | |-----------------------------------------+------------------------+----------------------+ | GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC | | Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. | | | | MIG M. | |=========================================+========================+======================| | 0 NVIDIA RTX A4000 On | 00000000:01:00.0 Off | Off | | 42% 65C P0 55W / 140W | 13131MiB / 16376MiB | 15% Default | | | | N/A | +-----------------------------------------+------------------------+----------------------+ | 1 NVIDIA GeForce RTX 4060 Ti On | 00000000:02:00.0 Off | N/A | | 0% 59C P0 46W / 165W | 15239MiB / 16380MiB | 14% Default | | | | N/A | +-----------------------------------------+------------------------+----------------------+ | 2 NVIDIA RTX A4000 On | 00000000:03:00.0 Off | Off | | 47% 69C P0 61W / 140W | 15051MiB / 16376MiB | 15% Default | | | | N/A | +-----------------------------------------+------------------------+----------------------+ | 3 NVIDIA RTX A4000 On | 00000000:04:00.0 Off | Off | | 48% 70C P0 52W / 140W | 13387MiB / 16376MiB | 14% Default | | | | N/A | +-----------------------------------------+------------------------+----------------------+ | 4 NVIDIA RTX A4000 On | 00000000:05:00.0 Off | Off | | 48% 70C P0 57W / 140W | 12809MiB / 16376MiB | 17% Default | | | | N/A | +-----------------------------------------+------------------------+----------------------+ +-----------------------------------------------------------------------------------------+ | Processes: | | GPU GI CI PID Type Process name GPU Memory | | ID ID Usage | |=========================================================================================| | 0 N/A N/A 2417 C /usr/bin/ollama 320MiB | | 1 N/A N/A 1367 C /usr/local/bin/python3 222MiB | | 1 N/A N/A 2417 C /usr/bin/ollama 280MiB | | 2 N/A N/A 2417 C /usr/bin/ollama 320MiB | | 3 N/A N/A 2417 C /usr/bin/ollama 320MiB | | 4 N/A N/A 2417 C /usr/bin/ollama 384MiB | +-----------------------------------------------------------------------------------------+ ``` Edit - I don't understand how the model is reported in Ollama as 100% in CPU and at the same time the GPU VRAM (around 80%-90%) is utilised?
Author
Owner

@rick-github commented on GitHub (Aug 11, 2025):

What's the output of

ps wwh p$(pidof ollama)
<!-- gh-comment-id:3175346934 --> @rick-github commented on GitHub (Aug 11, 2025): What's the output of ``` ps wwh p$(pidof ollama) ```
Author
Owner

@alienatedsec commented on GitHub (Aug 11, 2025):

@rick-github

root@[redacted]:/# ps wwh p$(pidof ollama)
      1 ?        Ssl    0:18 /bin/ollama serve
    125 ?        Rl     1:58 /usr/bin/ollama runner --ollama-engine --model /root/.ollama/models/blobs/sha256-90a618fe6ff21b09ca968df959104eb650658b0bef0faef785c18c2795d993e3 --ctx-size 128000 --batch-size 512 --n-gpu-layers 256 --threads 16 --no-mmap --parallel 1 --port 44643
<!-- gh-comment-id:3175373395 --> @alienatedsec commented on GitHub (Aug 11, 2025): @rick-github ``` root@[redacted]:/# ps wwh p$(pidof ollama) 1 ? Ssl 0:18 /bin/ollama serve 125 ? Rl 1:58 /usr/bin/ollama runner --ollama-engine --model /root/.ollama/models/blobs/sha256-90a618fe6ff21b09ca968df959104eb650658b0bef0faef785c18c2795d993e3 --ctx-size 128000 --batch-size 512 --n-gpu-layers 256 --threads 16 --no-mmap --parallel 1 --port 44643 ```
Author
Owner

@rick-github commented on GitHub (Aug 11, 2025):

What's the output of

ps wwh p2417
<!-- gh-comment-id:3175376325 --> @rick-github commented on GitHub (Aug 11, 2025): What's the output of ``` ps wwh p2417 ```
Author
Owner

@alienatedsec commented on GitHub (Aug 11, 2025):

no output

<!-- gh-comment-id:3175382139 --> @alienatedsec commented on GitHub (Aug 11, 2025): no output
Author
Owner

@rick-github commented on GitHub (Aug 11, 2025):

What's the output of

nvidia-smi
<!-- gh-comment-id:3175383997 --> @rick-github commented on GitHub (Aug 11, 2025): What's the output of ``` nvidia-smi ```
Author
Owner

@alienatedsec commented on GitHub (Aug 11, 2025):

now I understand - which process do you want?

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.172.08             Driver Version: 570.172.08     CUDA Version: 12.8     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA RTX A4000               On  |   00000000:01:00.0 Off |                  Off |
| 60%   68C    P8             13W /  140W |   15149MiB /  16376MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA GeForce RTX 4060 Ti     On  |   00000000:02:00.0 Off |                  N/A |
| 34%   47C    P8              8W /  165W |   15957MiB /  16380MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   2  NVIDIA RTX A4000               On  |   00000000:03:00.0 Off |                  Off |
| 58%   66C    P8             15W /  140W |   15983MiB /  16376MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   3  NVIDIA RTX A4000               On  |   00000000:04:00.0 Off |                  Off |
| 64%   72C    P8             11W /  140W |   15405MiB /  16376MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   4  NVIDIA RTX A4000               On  |   00000000:05:00.0 Off |                  Off |
| 67%   75C    P8             13W /  140W |   14633MiB /  16376MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A            2269      C   /usr/bin/ollama                         674MiB |
|    1   N/A  N/A            1457      C   /usr/local/bin/python3                  222MiB |
|    1   N/A  N/A            2269      C   /usr/bin/ollama                         636MiB |
|    2   N/A  N/A            2269      C   /usr/bin/ollama                         676MiB |
|    3   N/A  N/A            2269      C   /usr/bin/ollama                         674MiB |
|    4   N/A  N/A            2269      C   /usr/bin/ollama                         670MiB |
+-----------------------------------------------------------------------------------------+
<!-- gh-comment-id:3175390301 --> @alienatedsec commented on GitHub (Aug 11, 2025): now I understand - which process you want? ``` +-----------------------------------------------------------------------------------------+ | NVIDIA-SMI 570.172.08 Driver Version: 570.172.08 CUDA Version: 12.8 | |-----------------------------------------+------------------------+----------------------+ | GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC | | Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. | | | | MIG M. | |=========================================+========================+======================| | 0 NVIDIA RTX A4000 On | 00000000:01:00.0 Off | Off | | 60% 68C P8 13W / 140W | 15149MiB / 16376MiB | 0% Default | | | | N/A | +-----------------------------------------+------------------------+----------------------+ | 1 NVIDIA GeForce RTX 4060 Ti On | 00000000:02:00.0 Off | N/A | | 34% 47C P8 8W / 165W | 15957MiB / 16380MiB | 0% Default | | | | N/A | +-----------------------------------------+------------------------+----------------------+ | 2 NVIDIA RTX A4000 On | 00000000:03:00.0 Off | Off | | 58% 66C P8 15W / 140W | 15983MiB / 16376MiB | 0% Default | | | | N/A | +-----------------------------------------+------------------------+----------------------+ | 3 NVIDIA RTX A4000 On | 00000000:04:00.0 Off | Off | | 64% 72C P8 11W / 140W | 15405MiB / 16376MiB | 0% Default | | | | N/A | +-----------------------------------------+------------------------+----------------------+ | 4 NVIDIA RTX A4000 On | 00000000:05:00.0 Off | Off | | 67% 75C P8 13W / 140W | 14633MiB / 16376MiB | 0% Default | | | | N/A | +-----------------------------------------+------------------------+----------------------+ +-----------------------------------------------------------------------------------------+ | Processes: | | GPU GI CI PID Type Process name GPU Memory | | ID ID Usage | |=========================================================================================| | 0 N/A N/A 2269 C /usr/bin/ollama 674MiB | | 1 N/A N/A 1457 C /usr/local/bin/python3 222MiB | | 1 N/A N/A 2269 C /usr/bin/ollama 636MiB | | 2 N/A N/A 2269 C /usr/bin/ollama 676MiB | | 3 N/A N/A 2269 C /usr/bin/ollama 674MiB | | 4 N/A N/A 2269 C /usr/bin/ollama 670MiB | +-----------------------------------------------------------------------------------------+ ```
Author
Owner

@rick-github commented on GitHub (Aug 11, 2025):

So you are running in a container?

<!-- gh-comment-id:3175392654 --> @rick-github commented on GitHub (Aug 11, 2025): So you are running in a container?
Author
Owner

@alienatedsec commented on GitHub (Aug 11, 2025):

> So you are running in a container?

yes

<!-- gh-comment-id:3175395057 --> @alienatedsec commented on GitHub (Aug 11, 2025): > So you are running in a container? yes
Author
Owner

@rick-github commented on GitHub (Aug 11, 2025):

What's the output of the following outside of the container

ps wwh p$(pidof ollama)
<!-- gh-comment-id:3175396500 --> @rick-github commented on GitHub (Aug 11, 2025): What's the output of the following outside of the container ``` ps wwh p$(pidof ollama) ```
Author
Owner

@alienatedsec commented on GitHub (Aug 11, 2025):

Outside the container

root@[redacted]:~# ps wwh p$(pidof ollama)
   1466 ?        Ssl    0:20 /bin/ollama serve
   2269 ?        Sl     8:43 /usr/bin/ollama runner --ollama-engine --model /root/.ollama/models/blobs/sha256-90a618fe6ff21b09ca968df959104eb650658b0bef0faef785c18c2795d993e3 --ctx-size 128000 --batch-size 512 --n-gpu-layers 256 --threads 16 --no-mmap --parallel 1 --port 44643
<!-- gh-comment-id:3175401814 --> @alienatedsec commented on GitHub (Aug 11, 2025): Outside the container ``` root@[redacted]:~# ps wwh p$(pidof ollama) 1466 ? Ssl 0:20 /bin/ollama serve 2269 ? Sl 8:43 /usr/bin/ollama runner --ollama-engine --model /root/.ollama/models/blobs/sha256-90a618fe6ff21b09ca968df959104eb650658b0bef0faef785c18c2795d993e3 --ctx-size 128000 --batch-size 512 --n-gpu-layers 256 --threads 16 --no-mmap --parallel 1 --port 44643 ```
Author
Owner

@rick-github commented on GitHub (Aug 11, 2025):

Dump the logs from the container and attach.

<!-- gh-comment-id:3175408796 --> @rick-github commented on GitHub (Aug 11, 2025): Dump the logs from the container and attach.
Author
Owner

@rick-github commented on GitHub (Aug 11, 2025):

What does ollama ps show?

<!-- gh-comment-id:3175410118 --> @rick-github commented on GitHub (Aug 11, 2025): What does `ollama ps` show?
Author
Owner

@alienatedsec commented on GitHub (Aug 11, 2025):

> Dump the logs from the container and attach.

_ollama_logs.txt

> What does ollama ps show?

root@[redacted]:/# ollama ps
NAME            ID              SIZE     PROCESSOR    CONTEXT    UNTIL   
gpt-oss:120b    735371f916a9    69 GB    100% CPU     128000     Forever    
root@[redacted]:/# 
<!-- gh-comment-id:3175453805 --> @alienatedsec commented on GitHub (Aug 11, 2025): > Dump the logs from the container and attach. [_ollama_logs.txt](https://github.com/user-attachments/files/21718425/_ollama_logs.txt) > What does `ollama ps` show? ``` root@[redacted]:/# ollama ps NAME ID SIZE PROCESSOR CONTEXT UNTIL gpt-oss:120b 735371f916a9 69 GB 100% CPU 128000 Forever root@[redacted]:/# ```
Author
Owner

@rick-github commented on GitHub (Aug 11, 2025):

You have modified the model and set num_gpu=256. Originally, ollama estimated that no layers would fit on the GPU given the size of the memory graph, so the output of ollama ps shows the result of that estimation. When it came time for the runner to allocate layers, the override took precedence and caused the runner to allocate all layers to the GPU. It didn't OOM because you have set GGML_CUDA_ENABLE_UNIFIED_MEMORY, which results in the layers overflowing into system RAM. While this prevents an OOM, there is a potential performance hit.

I would be interested to see the statistics (ollama run gpt-oss:120b --verbose 'why is the sky blue?') of this setup versus one where you load an unmodified version of the model and let it run on CPU.

<!-- gh-comment-id:3175552664 --> @rick-github commented on GitHub (Aug 11, 2025): You have modified the model and set `num_gpu=256`. Originally, ollama estimated that no layers would fit on the GPU given the size of the memory graph, so the output of `ollama ps` shows the result of that estimation. When it came time for the runner to allocate layers, the override took precedence and caused the runner to allocate all layers to the GPU. It didn't OOM because you have set `GGML_CUDA_ENABLE_UNIFIED_MEMORY`, which results in the layers overflowing in to system RAM. While this prevents an OOM, there is a potential [performance hit](https://github.com/ollama/ollama/issues/7584#issuecomment-2466715900). I would be interested to see the statistics (`ollama run gpt-oss:120b --verbose 'why is the sky blue?'`) of this setup verus one where you load an unmodified version of the model and let it run in CPU.
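A hedged sketch of how to check for and back out the override described above, whether it lives in the local Modelfile or in the front end's request options:

```
# look for a "PARAMETER num_gpu ..." line in the local model definition
ollama show --modelfile gpt-oss:120b

# load a clean copy and let the scheduler pick the CPU/GPU split on its own
ollama rm gpt-oss:120b && ollama pull gpt-oss:120b

# start the server without the unified-memory spill-over before comparing timings
unset GGML_CUDA_ENABLE_UNIFIED_MEMORY
```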
Author
Owner

@alienatedsec commented on GitHub (Aug 11, 2025):

> I would be interested to see the statistics (ollama run gpt-oss:120b --verbose 'why is the sky blue?') of this setup versus one where you load an unmodified version of the model and let it run on CPU.

ollama run gpt-oss:120b --verbose 'why is the sky blue?'
Thinking...
The user asks why the sky is blue. Provide explanation of Rayleigh scattering, shorter wavelengths scatter more, human eye sensitivity, etc. Also mention why 
sunrise/sunset appear red, etc. Maybe ask follow-up? Probably just answer. Should be concise but thorough. Use plain language, some details.
...done thinking.

**Short answer:**  
The sky looks blue because molecules and tiny particles in Earth’s atmosphere scatter sunlight. Short‑wavelength (blue and violet) light is scattered much more 
efficiently than longer‑wavelength (red, orange, yellow) light, and our eyes are more sensitive to blue than to violet. The scattered blue light reaches us from every 
direction, giving the sky its characteristic color.

---

## How it works – a step‑by‑step explanation

| Step | What happens | Why it matters |
|------|--------------|----------------|
| **1. Sunlight reaches Earth** | Sunlight is a mixture of all visible colors (plus infrared and ultraviolet). If you split it with a prism you see a continuous spectrum 
from violet (≈380 nm) to red (≈750 nm). | The light that enters the atmosphere already contains blue light. |
| **2. Light meets the atmosphere** | The atmosphere is filled with gases (N₂, O₂, Ar, CO₂) and tiny particles (dust, water droplets, aerosols). These are **much 
smaller** than the wavelength of visible light. | When particles are much smaller than the wavelength, they cause **Rayleigh scattering**. |
| **3. Rayleigh scattering favors short wavelengths** | The scattering intensity \(I\) varies roughly as \(\frac{1}{\lambda^4}\) (the inverse fourth power of 
wavelength). <br>• Blue (~450 nm) is scattered about **10×** more than green (~550 nm). <br>• Violet (~400 nm) is scattered ~**16×** more than red (~650 nm). | This 
strong wavelength dependence means the sky is flooded with scattered blue (and violet) light from every direction. |
| **4. Our eyes see blue, not violet** | The human eye’s photoreceptors (cones) are less sensitive to violet, and some of the violet is absorbed by the upper 
atmosphere’s ozone layer. | The net result is that the sky appears **blue** rather than violet. |
| **5. Direct sunlight still looks white** | The light that travels straight from the Sun to our eyes is only *partially* scattered, so it retains most of its original 
mix of colors and looks white (or slightly yellowish). | This is why the Sun itself isn’t blue even though the surrounding sky is. |
| **6. Sunrise & sunset turn red** | When the Sun is low on the horizon, its light passes through **much more** atmosphere (up to 40 × the thickness at noon). The 
short‑wavelength light gets scattered out of the direct line of sight long before it reaches you, leaving the longer‑wavelength reds and oranges to dominate the direct 
beam. | That’s why sunrises and sunsets are spectacularly red/orange. |

---

## A little math (optional)

The Rayleigh scattering cross‑section for a particle of radius \(a\) (much smaller than wavelength \(\lambda\)) is roughly  

\[
\sigma \propto \frac{a^6}{\lambda^4}\,
\left(\frac{n^2-1}{n^2+2}\right)^2
\]

where \(n\) is the refractive index of the particle. The \(\lambda^{-4}\) term is the key: halve the wavelength and the scattering becomes 16 times stronger.

---

## Common follow‑up questions

| Question | Quick answer |
|----------|--------------|
| **Why isn’t the sky black at night?** | At night there’s no Sun to provide the light that gets scattered. The sky appears black because we’re looking into space, not 
at scattered sunlight. |
| **Does the sky look different on other planets?** | Yes. Mars, with a thin CO₂ atmosphere and a lot of fine dust, has a butterscotch‑orange sky. Titan’s dense 
nitrogen‑methane haze makes its sky appear orange‑brown. |
| **What about the “blue hour” in photography?** | That’s just the period after sunset (or before sunrise) when the Sun is just below the horizon; scattered blue light 
still fills the sky, giving a deep, even‑tone blue. |
| **Why do clouds look white, not blue?** | Cloud droplets are **much larger** than the wavelength of light, so they scatter all colors roughly equally (Mie scattering). 
The mixture of all colors appears white. |

---

### Bottom line

The sky is blue because Earth’s tiny atmospheric molecules scatter short‑wavelength light far more efficiently than long‑wavelength light, and our eyes are tuned to 
perceive the resulting surplus of blue light. The same scattering principle explains why sunsets are red and why other planets can have dramatically different sky colors.

total duration:       7m2.567317799s
load duration:        1m33.40423142s
prompt eval count:    73 token(s)
prompt eval duration: 12.588667124s
prompt eval rate:     5.80 tokens/s
eval count:           1060 token(s)
eval duration:        5m16.571928276s
eval rate:            3.35 tokens/s

vs OpenWebUI

Image
<!-- gh-comment-id:3176056718 --> @alienatedsec commented on GitHub (Aug 11, 2025): > I would be interested to see the statistics (`ollama run gpt-oss:120b --verbose 'why is the sky blue?'`) of this setup verus one where you load an unmodified version of the model and let it run in CPU. ``` ollama run gpt-oss:120b --verbose 'why is the sky blue?' Thinking... The user asks why the sky is blue. Provide explanation of Rayleigh scattering, shorter wavelengths scatter more, human eye sensitivity, etc. Also mention why sunrise/sunset appear red, etc. Maybe ask follow-up? Probably just answer. Should be concise but thorough. Use plain language, some details. ...done thinking. **Short answer:** The sky looks blue because molecules and tiny particles in Earth’s atmosphere scatter sunlight. Short‑wavelength (blue and violet) light is scattered much more efficiently than longer‑wavelength (red, orange, yellow) light, and our eyes are more sensitive to blue than to violet. The scattered blue light reaches us from every direction, giving the sky its characteristic color. --- ## How it works – a step‑by‑step explanation | Step | What happens | Why it matters | |------|--------------|----------------| | **1. Sunlight reaches Earth** | Sunlight is a mixture of all visible colors (plus infrared and ultraviolet). If you split it with a prism you see a continuous spectrum from violet (≈380 nm) to red (≈750 nm). | The light that enters the atmosphere already contains blue light. | | **2. Light meets the atmosphere** | The atmosphere is filled with gases (N₂, O₂, Ar, CO₂) and tiny particles (dust, water droplets, aerosols). These are **much smaller** than the wavelength of visible light. | When particles are much smaller than the wavelength, they cause **Rayleigh scattering**. | | **3. Rayleigh scattering favors short wavelengths** | The scattering intensity \(I\) varies roughly as \(\frac{1}{\lambda^4}\) (the inverse fourth power of wavelength). <br>• Blue (~450 nm) is scattered about **10×** more than green (~550 nm). <br>• Violet (~400 nm) is scattered ~**16×** more than red (~650 nm). | This strong wavelength dependence means the sky is flooded with scattered blue (and violet) light from every direction. | | **4. Our eyes see blue, not violet** | The human eye’s photoreceptors (cones) are less sensitive to violet, and some of the violet is absorbed by the upper atmosphere’s ozone layer. | The net result is that the sky appears **blue** rather than violet. | | **5. Direct sunlight still looks white** | The light that travels straight from the Sun to our eyes is only *partially* scattered, so it retains most of its original mix of colors and looks white (or slightly yellowish). | This is why the Sun itself isn’t blue even though the surrounding sky is. | | **6. Sunrise & sunset turn red** | When the Sun is low on the horizon, its light passes through **much more** atmosphere (up to 40 × the thickness at noon). The short‑wavelength light gets scattered out of the direct line of sight long before it reaches you, leaving the longer‑wavelength reds and oranges to dominate the direct beam. | That’s why sunrises and sunsets are spectacularly red/orange. | --- ## A little math (optional) The Rayleigh scattering cross‑section for a particle of radius \(a\) (much smaller than wavelength \(\lambda\)) is roughly \[ \sigma \propto \frac{a^6}{\lambda^4}\, \left(\frac{n^2-1}{n^2+2}\right)^2 \] where \(n\) is the refractive index of the particle. 
The \(\lambda^{-4}\) term is the key: halve the wavelength and the scattering becomes 16 times stronger. --- ## Common follow‑up questions | Question | Quick answer | |----------|--------------| | **Why isn’t the sky black at night?** | At night there’s no Sun to provide the light that gets scattered. The sky appears black because we’re looking into space, not at scattered sunlight. | | **Does the sky look different on other planets?** | Yes. Mars, with a thin CO₂ atmosphere and a lot of fine dust, has a butterscotch‑orange sky. Titan’s dense nitrogen‑methane haze makes its sky appear orange‑brown. | | **What about the “blue hour” in photography?** | That’s just the period after sunset (or before sunrise) when the Sun is just below the horizon; scattered blue light still fills the sky, giving a deep, even‑tone blue. | | **Why do clouds look white, not blue?** | Cloud droplets are **much larger** than the wavelength of light, so they scatter all colors roughly equally (Mie scattering). The mixture of all colors appears white. | --- ### Bottom line The sky is blue because Earth’s tiny atmospheric molecules scatter short‑wavelength light far more efficiently than long‑wavelength light, and our eyes are tuned to perceive the resulting surplus of blue light. The same scattering principle explains why sunsets are red and why other planets can have dramatically different sky colors. total duration: 7m2.567317799s load duration: 1m33.40423142s prompt eval count: 73 token(s) prompt eval duration: 12.588667124s prompt eval rate: 5.80 tokens/s eval count: 1060 token(s) eval duration: 5m16.571928276s eval rate: 3.35 tokens/s ``` vs OpenWebUI <img width="209" height="283" alt="Image" src="https://github.com/user-attachments/assets/54741e49-61c5-4a3c-b687-ce03578ef157" />
Author
Owner

@alienatedsec commented on GitHub (Aug 11, 2025):

Just ran it unmodified from OpenWebUI but left the context size at 128k. Now it's fully loaded onto the CPU and the performance is not great.

ggml_cuda_init: found 5 CUDA devices:
  Device 0: NVIDIA GeForce RTX 4060 Ti, compute capability 8.9, VMM: yes
  Device 1: NVIDIA RTX A4000, compute capability 8.6, VMM: yes
  Device 2: NVIDIA RTX A4000, compute capability 8.6, VMM: yes
  Device 3: NVIDIA RTX A4000, compute capability 8.6, VMM: yes
  Device 4: NVIDIA RTX A4000, compute capability 8.6, VMM: yes
load_backend: loaded CUDA backend from /usr/lib/ollama/libggml-cuda.so
load_backend: loaded CPU backend from /usr/lib/ollama/libggml-cpu-x64.so
time=2025-08-11T17:52:11.790Z level=INFO source=ggml.go:104 msg=system CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 CUDA.0.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.0.USE_GRAPHS=1 CUDA.0.PEER_MAX_BATCH_SIZE=128 CUDA.1.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.1.USE_GRAPHS=1 CUDA.1.PEER_MAX_BATCH_SIZE=128 CUDA.2.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.2.USE_GRAPHS=1 CUDA.2.PEER_MAX_BATCH_SIZE=128 CUDA.3.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.3.USE_GRAPHS=1 CUDA.3.PEER_MAX_BATCH_SIZE=128 CUDA.4.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.4.USE_GRAPHS=1 CUDA.4.PEER_MAX_BATCH_SIZE=128 compiler=cgo(gcc)
time=2025-08-11T17:52:12.061Z level=INFO source=ggml.go:365 msg="offloading 0 repeating layers to GPU"
time=2025-08-11T17:52:12.061Z level=INFO source=ggml.go:369 msg="offloading output layer to CPU"
time=2025-08-11T17:52:12.061Z level=INFO source=ggml.go:376 msg="offloaded 0/37 layers to GPU"
time=2025-08-11T17:52:12.061Z level=INFO source=ggml.go:379 msg="model weights" buffer=CPU size="60.8 GiB"
time=2025-08-11T17:52:18.855Z level=INFO source=ggml.go:668 msg="compute graph" backend=CUDA0 buffer_type=CUDA0 size="0 B"
time=2025-08-11T17:52:18.855Z level=INFO source=ggml.go:668 msg="compute graph" backend=CUDA1 buffer_type=CUDA1 size="0 B"
time=2025-08-11T17:52:18.855Z level=INFO source=ggml.go:668 msg="compute graph" backend=CUDA2 buffer_type=CUDA2 size="0 B"
time=2025-08-11T17:52:18.855Z level=INFO source=ggml.go:668 msg="compute graph" backend=CUDA3 buffer_type=CUDA3 size="0 B"
time=2025-08-11T17:52:18.855Z level=INFO source=ggml.go:668 msg="compute graph" backend=CUDA4 buffer_type=CUDA4 size="0 B"
time=2025-08-11T17:52:18.855Z level=INFO source=ggml.go:668 msg="compute graph" backend=CPU buffer_type=CPU size="31.3 GiB"
time=2025-08-11T17:52:36.078Z level=INFO source=server.go:637 msg="llama runner started in 25.97 seconds"
[GIN] 2025/08/11 - 17:55:11 | 200 |     836.871µs |       127.0.0.1 | HEAD     "/"
[GIN] 2025/08/11 - 17:55:11 | 200 |      53.323µs |       127.0.0.1 | GET      "/api/ps"
<!-- gh-comment-id:3176173168 --> @alienatedsec commented on GitHub (Aug 11, 2025): Just ran it unmodified on OpenWebUI but left the context size at 128k. Now it fully loaded to CPU and the performance is not great. ``` ggml_cuda_init: found 5 CUDA devices: Device 0: NVIDIA GeForce RTX 4060 Ti, compute capability 8.9, VMM: yes Device 1: NVIDIA RTX A4000, compute capability 8.6, VMM: yes Device 2: NVIDIA RTX A4000, compute capability 8.6, VMM: yes Device 3: NVIDIA RTX A4000, compute capability 8.6, VMM: yes Device 4: NVIDIA RTX A4000, compute capability 8.6, VMM: yes load_backend: loaded CUDA backend from /usr/lib/ollama/libggml-cuda.so load_backend: loaded CPU backend from /usr/lib/ollama/libggml-cpu-x64.so time=2025-08-11T17:52:11.790Z level=INFO source=ggml.go:104 msg=system CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 CUDA.0.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.0.USE_GRAPHS=1 CUDA.0.PEER_MAX_BATCH_SIZE=128 CUDA.1.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.1.USE_GRAPHS=1 CUDA.1.PEER_MAX_BATCH_SIZE=128 CUDA.2.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.2.USE_GRAPHS=1 CUDA.2.PEER_MAX_BATCH_SIZE=128 CUDA.3.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.3.USE_GRAPHS=1 CUDA.3.PEER_MAX_BATCH_SIZE=128 CUDA.4.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.4.USE_GRAPHS=1 CUDA.4.PEER_MAX_BATCH_SIZE=128 compiler=cgo(gcc) time=2025-08-11T17:52:12.061Z level=INFO source=ggml.go:365 msg="offloading 0 repeating layers to GPU" time=2025-08-11T17:52:12.061Z level=INFO source=ggml.go:369 msg="offloading output layer to CPU" time=2025-08-11T17:52:12.061Z level=INFO source=ggml.go:376 msg="offloaded 0/37 layers to GPU" time=2025-08-11T17:52:12.061Z level=INFO source=ggml.go:379 msg="model weights" buffer=CPU size="60.8 GiB" time=2025-08-11T17:52:18.855Z level=INFO source=ggml.go:668 msg="compute graph" backend=CUDA0 buffer_type=CUDA0 size="0 B" time=2025-08-11T17:52:18.855Z level=INFO source=ggml.go:668 msg="compute graph" backend=CUDA1 buffer_type=CUDA1 size="0 B" time=2025-08-11T17:52:18.855Z level=INFO source=ggml.go:668 msg="compute graph" backend=CUDA2 buffer_type=CUDA2 size="0 B" time=2025-08-11T17:52:18.855Z level=INFO source=ggml.go:668 msg="compute graph" backend=CUDA3 buffer_type=CUDA3 size="0 B" time=2025-08-11T17:52:18.855Z level=INFO source=ggml.go:668 msg="compute graph" backend=CUDA4 buffer_type=CUDA4 size="0 B" time=2025-08-11T17:52:18.855Z level=INFO source=ggml.go:668 msg="compute graph" backend=CPU buffer_type=CPU size="31.3 GiB" time=2025-08-11T17:52:36.078Z level=INFO source=server.go:637 msg="llama runner started in 25.97 seconds" [GIN] 2025/08/11 - 17:55:11 | 200 | 836.871µs | 127.0.0.1 | HEAD "/" [GIN] 2025/08/11 - 17:55:11 | 200 | 53.323µs | 127.0.0.1 | GET "/api/ps" ```
Author
Owner

@rick-github commented on GitHub (Aug 11, 2025):

> performance is not great.

Which is?

<!-- gh-comment-id:3176201758 --> @rick-github commented on GitHub (Aug 11, 2025): > performance is not great. Which is?
Author
Owner

@alienatedsec commented on GitHub (Aug 11, 2025):

Still running. It's about a word per second. I'll update this comment when complete.

I will need more time to provide the output.

<!-- gh-comment-id:3176219114 --> @alienatedsec commented on GitHub (Aug 11, 2025): Still running. It's about a word per second. I'll update this comment when complete. I will need more time to provide the output.
Author
Owner

@ericcurtin commented on GitHub (Aug 11, 2025):

> offloaded 0/37 layers to GPU

Ollama seems to have turned off GPU for standard GGUFs from Hugging Face; it makes the llama.cpp version of Ollama only use the CPU. My advice: move to something like docker model runner or llama.cpp. I'm willing to assist with docker model runner. The CLI chatbot is just:

docker model run ai/gpt-oss

The OpenAI-compatible server is behind:

http://127.0.0.1:12434/engines/llama.cpp/v1

when we turn on TCP in docker model runner.

<!-- gh-comment-id:3176596130 --> @ericcurtin commented on GitHub (Aug 11, 2025): > offloaded 0/37 layers to GPU Ollama seem to have turned off GPU for standard ggufs from huggingface, it makes the llama.cpp version of Ollama only use CPU, my advice move to something like docker model runner or llama.cpp . I'm willing to assist with docker model runner. CLI chatbot is just: docker model run ai/gpt-oss OpenAI-compatible server is behind: http://127.0.0.1:12434/engines/llama.cpp/v1 when we turn on TCP in docker model runner.
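For completeness, a hedged example against the OpenAI-compatible base URL mentioned above (this assumes TCP host access is enabled in docker model runner and that ai/gpt-oss has been pulled):

```
curl http://127.0.0.1:12434/engines/llama.cpp/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"model": "ai/gpt-oss", "messages": [{"role": "user", "content": "why is the sky blue?"}]}'
```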
Author
Owner

@rick-github commented on GitHub (Aug 11, 2025):

> Ollama seems to have turned off GPU for standard GGUFs from Hugging Face,

This is incorrect.

<!-- gh-comment-id:3176726937 --> @rick-github commented on GitHub (Aug 11, 2025): > Ollama seem to have turned off GPU for standard ggufs from huggingface, This is incorrect.
Author
Owner

@alienatedsec commented on GitHub (Aug 12, 2025):

  • Default Ollama - ollama run gpt-oss:120b --verbose 'why is the sky blue?' - looks like the context size is 8192
total duration:       5m33.640378603s
load duration:        1m4.144910501s
prompt eval count:    73 token(s)
prompt eval duration: 13.035602434s
prompt eval rate:     5.60 tokens/s
eval count:           859 token(s)
eval duration:        4m16.457263845s
eval rate:            3.35 tokens/s
root@[redacted]:/# ollama ps
NAME            ID              SIZE     PROCESSOR          CONTEXT    UNTIL   
gpt-oss:120b    735371f916a9    91 GB    13%/87% CPU/GPU    8192       Forever    
root@[redacted]:/# 
time=2025-08-12T06:59:07.662Z level=INFO source=routes.go:1304 msg="server config" env="map[CUDA_VISIBLE_DEVICES: GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: HTTPS_PROXY: HTTP_PROXY: NO_PROXY: OLLAMA_CONTEXT_LENGTH:4096 OLLAMA_DEBUG:INFO OLLAMA_FLASH_ATTENTION:true OLLAMA_GPU_OVERHEAD:0 OLLAMA_HOST:http://0.0.0.0:11434 OLLAMA_INTEL_GPU:false OLLAMA_KEEP_ALIVE:2562047h47m16.854775807s OLLAMA_KV_CACHE_TYPE: OLLAMA_LLM_LIBRARY: OLLAMA_LOAD_TIMEOUT:5m0s OLLAMA_MAX_LOADED_MODELS:0 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:/root/.ollama/models OLLAMA_MULTIUSER_CACHE:false OLLAMA_NEW_ENGINE:false OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:1 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://* vscode-webview://* vscode-file://*] OLLAMA_SCHED_SPREAD:false ROCR_VISIBLE_DEVICES: http_proxy: https_proxy: no_proxy:]"
time=2025-08-12T06:59:07.694Z level=INFO source=images.go:477 msg="total blobs: 34"
time=2025-08-12T06:59:07.696Z level=INFO source=images.go:484 msg="total unused blobs removed: 0"
time=2025-08-12T06:59:07.698Z level=INFO source=routes.go:1357 msg="Listening on [::]:11434 (version 0.11.4)"
time=2025-08-12T06:59:07.704Z level=INFO source=gpu.go:217 msg="looking for compatible GPUs"
time=2025-08-12T06:59:09.707Z level=INFO source=types.go:130 msg="inference compute" id=GPU-4df21d53-78e5-97fb-c7ee-b18beb63e1a7 library=cuda variant=v12 compute=8.9 driver=12.8 name="NVIDIA GeForce RTX 4060 Ti" total="15.6 GiB" available="15.5 GiB"
time=2025-08-12T06:59:09.707Z level=INFO source=types.go:130 msg="inference compute" id=GPU-f2ded6c5-33e4-31ac-5c61-c9088edaedbd library=cuda variant=v12 compute=8.6 driver=12.8 name="NVIDIA RTX A4000" total="15.6 GiB" available="15.4 GiB"
time=2025-08-12T06:59:09.707Z level=INFO source=types.go:130 msg="inference compute" id=GPU-c95bf02e-0608-db0d-7759-07d27659f5f8 library=cuda variant=v12 compute=8.6 driver=12.8 name="NVIDIA RTX A4000" total="15.6 GiB" available="15.4 GiB"
time=2025-08-12T06:59:09.707Z level=INFO source=types.go:130 msg="inference compute" id=GPU-fc191601-3661-ad41-ed61-2f8a1d5dbf6a library=cuda variant=v12 compute=8.6 driver=12.8 name="NVIDIA RTX A4000" total="15.6 GiB" available="15.4 GiB"
time=2025-08-12T06:59:09.707Z level=INFO source=types.go:130 msg="inference compute" id=GPU-eab804bb-3a67-1954-db1a-ddf62d8f427a library=cuda variant=v12 compute=8.6 driver=12.8 name="NVIDIA RTX A4000" total="15.6 GiB" available="15.4 GiB"
[GIN] 2025/08/12 - 07:01:41 | 200 |    3.459333ms |       127.0.0.1 | HEAD     "/"
[GIN] 2025/08/12 - 07:01:42 | 200 |  420.549483ms |       127.0.0.1 | POST     "/api/show"
time=2025-08-12T07:01:56.610Z level=INFO source=server.go:135 msg="system memory" total="125.8 GiB" free="122.9 GiB" free_swap="0 B"
time=2025-08-12T07:01:58.272Z level=INFO source=server.go:175 msg=offload library=cuda layers.requested=-1 layers.model=37 layers.offload=30 layers.split=6,6,6,6,6 memory.available="[15.2 GiB 15.4 GiB 15.4 GiB 15.4 GiB 15.4 GiB]" memory.gpu_overhead="0 B" memory.required.full="85.6 GiB" memory.required.partial="74.6 GiB" memory.required.kv="450.0 MiB" memory.required.allocations="[14.9 GiB 14.9 GiB 14.9 GiB 14.9 GiB 14.9 GiB]" memory.weights.total="59.7 GiB" memory.weights.repeating="58.6 GiB" memory.weights.nonrepeating="1.1 GiB" memory.graph.full="3.0 GiB" memory.graph.partial="3.0 GiB"
time=2025-08-12T07:01:58.273Z level=WARN source=server.go:211 msg="flash attention enabled but not supported by model"
time=2025-08-12T07:01:58.471Z level=INFO source=server.go:438 msg="starting llama server" cmd="/usr/bin/ollama runner --ollama-engine --model /root/.ollama/models/blobs/sha256-90a618fe6ff21b09ca968df959104eb650658b0bef0faef785c18c2795d993e3 --ctx-size 8192 --batch-size 512 --n-gpu-layers 30 --threads 16 --parallel 1 --tensor-split 6,6,6,6,6 --port 35981"
time=2025-08-12T07:01:58.473Z level=INFO source=sched.go:481 msg="loaded runners" count=1
time=2025-08-12T07:01:58.473Z level=INFO source=server.go:598 msg="waiting for llama runner to start responding"
time=2025-08-12T07:01:58.473Z level=INFO source=server.go:632 msg="waiting for server to become available" status="llm server not responding"
time=2025-08-12T07:01:58.517Z level=INFO source=runner.go:925 msg="starting ollama engine"
time=2025-08-12T07:01:58.518Z level=INFO source=runner.go:983 msg="Server listening on 127.0.0.1:35981"
time=2025-08-12T07:01:58.725Z level=INFO source=server.go:632 msg="waiting for server to become available" status="llm server loading model"
time=2025-08-12T07:01:58.727Z level=INFO source=ggml.go:92 msg="" architecture=gptoss file_type=MXFP4 name="" description="" num_tensors=471 num_key_values=30
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 5 CUDA devices:
  Device 0: NVIDIA GeForce RTX 4060 Ti, compute capability 8.9, VMM: yes
  Device 1: NVIDIA RTX A4000, compute capability 8.6, VMM: yes
  Device 2: NVIDIA RTX A4000, compute capability 8.6, VMM: yes
  Device 3: NVIDIA RTX A4000, compute capability 8.6, VMM: yes
  Device 4: NVIDIA RTX A4000, compute capability 8.6, VMM: yes
load_backend: loaded CUDA backend from /usr/lib/ollama/libggml-cuda.so
load_backend: loaded CPU backend from /usr/lib/ollama/libggml-cpu-x64.so
time=2025-08-12T07:02:00.777Z level=INFO source=ggml.go:104 msg=system CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 CUDA.0.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.0.USE_GRAPHS=1 CUDA.0.PEER_MAX_BATCH_SIZE=128 CUDA.1.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.1.USE_GRAPHS=1 CUDA.1.PEER_MAX_BATCH_SIZE=128 CUDA.2.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.2.USE_GRAPHS=1 CUDA.2.PEER_MAX_BATCH_SIZE=128 CUDA.3.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.3.USE_GRAPHS=1 CUDA.3.PEER_MAX_BATCH_SIZE=128 CUDA.4.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.4.USE_GRAPHS=1 CUDA.4.PEER_MAX_BATCH_SIZE=128 compiler=cgo(gcc)
time=2025-08-12T07:02:01.159Z level=INFO source=ggml.go:365 msg="offloading 30 repeating layers to GPU"
time=2025-08-12T07:02:01.159Z level=INFO source=ggml.go:369 msg="offloading output layer to CPU"
time=2025-08-12T07:02:01.159Z level=INFO source=ggml.go:376 msg="offloaded 30/37 layers to GPU"
time=2025-08-12T07:02:01.159Z level=INFO source=ggml.go:379 msg="model weights" buffer=CUDA4 size="9.8 GiB"
time=2025-08-12T07:02:01.159Z level=INFO source=ggml.go:379 msg="model weights" buffer=CPU size="11.9 GiB"
time=2025-08-12T07:02:01.159Z level=INFO source=ggml.go:379 msg="model weights" buffer=CUDA0 size="9.8 GiB"
time=2025-08-12T07:02:01.159Z level=INFO source=ggml.go:379 msg="model weights" buffer=CUDA1 size="9.8 GiB"
time=2025-08-12T07:02:01.159Z level=INFO source=ggml.go:379 msg="model weights" buffer=CUDA2 size="9.8 GiB"
time=2025-08-12T07:02:01.159Z level=INFO source=ggml.go:379 msg="model weights" buffer=CUDA3 size="9.8 GiB"
time=2025-08-12T07:02:01.423Z level=INFO source=ggml.go:668 msg="compute graph" backend=CUDA0 buffer_type=CUDA0 size="2.1 GiB"
time=2025-08-12T07:02:01.423Z level=INFO source=ggml.go:668 msg="compute graph" backend=CUDA1 buffer_type=CUDA1 size="2.1 GiB"
time=2025-08-12T07:02:01.423Z level=INFO source=ggml.go:668 msg="compute graph" backend=CUDA2 buffer_type=CUDA2 size="2.1 GiB"
time=2025-08-12T07:02:01.423Z level=INFO source=ggml.go:668 msg="compute graph" backend=CUDA3 buffer_type=CUDA3 size="2.1 GiB"
time=2025-08-12T07:02:01.423Z level=INFO source=ggml.go:668 msg="compute graph" backend=CUDA4 buffer_type=CUDA4 size="2.1 GiB"
time=2025-08-12T07:02:01.423Z level=INFO source=ggml.go:668 msg="compute graph" backend=CPU buffer_type=CPU size="2.0 GiB"
time=2025-08-12T07:02:46.541Z level=INFO source=server.go:637 msg="llama runner started in 48.07 seconds"
[GIN] 2025/08/12 - 07:07:16 | 200 |         5m33s |       127.0.0.1 | POST     "/api/generate"

OpenWebUI - Context 128k - default GPU offloading

Image: https://github.com/user-attachments/assets/5d87eb46-433b-43cb-9a3d-dbb79ca00be5
time=2025-08-12T07:11:50.787Z level=INFO source=routes.go:1304 msg="server config" env="map[CUDA_VISIBLE_DEVICES: GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: HTTPS_PROXY: HTTP_PROXY: NO_PROXY: OLLAMA_CONTEXT_LENGTH:4096 OLLAMA_DEBUG:INFO OLLAMA_FLASH_ATTENTION:true OLLAMA_GPU_OVERHEAD:0 OLLAMA_HOST:http://0.0.0.0:11434 OLLAMA_INTEL_GPU:false OLLAMA_KEEP_ALIVE:2562047h47m16.854775807s OLLAMA_KV_CACHE_TYPE: OLLAMA_LLM_LIBRARY: OLLAMA_LOAD_TIMEOUT:5m0s OLLAMA_MAX_LOADED_MODELS:0 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:/root/.ollama/models OLLAMA_MULTIUSER_CACHE:false OLLAMA_NEW_ENGINE:false OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:1 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://* vscode-webview://* vscode-file://*] OLLAMA_SCHED_SPREAD:false ROCR_VISIBLE_DEVICES: http_proxy: https_proxy: no_proxy:]"
time=2025-08-12T07:11:50.796Z level=INFO source=images.go:477 msg="total blobs: 34"
time=2025-08-12T07:11:50.798Z level=INFO source=images.go:484 msg="total unused blobs removed: 0"
time=2025-08-12T07:11:50.799Z level=INFO source=routes.go:1357 msg="Listening on [::]:11434 (version 0.11.4)"
time=2025-08-12T07:11:50.800Z level=INFO source=gpu.go:217 msg="looking for compatible GPUs"
time=2025-08-12T07:11:52.632Z level=INFO source=types.go:130 msg="inference compute" id=GPU-4df21d53-78e5-97fb-c7ee-b18beb63e1a7 library=cuda variant=v12 compute=8.9 driver=12.8 name="NVIDIA GeForce RTX 4060 Ti" total="15.6 GiB" available="15.2 GiB"
time=2025-08-12T07:11:52.632Z level=INFO source=types.go:130 msg="inference compute" id=GPU-f2ded6c5-33e4-31ac-5c61-c9088edaedbd library=cuda variant=v12 compute=8.6 driver=12.8 name="NVIDIA RTX A4000" total="15.6 GiB" available="15.4 GiB"
time=2025-08-12T07:11:52.632Z level=INFO source=types.go:130 msg="inference compute" id=GPU-c95bf02e-0608-db0d-7759-07d27659f5f8 library=cuda variant=v12 compute=8.6 driver=12.8 name="NVIDIA RTX A4000" total="15.6 GiB" available="15.4 GiB"
time=2025-08-12T07:11:52.632Z level=INFO source=types.go:130 msg="inference compute" id=GPU-fc191601-3661-ad41-ed61-2f8a1d5dbf6a library=cuda variant=v12 compute=8.6 driver=12.8 name="NVIDIA RTX A4000" total="15.6 GiB" available="15.4 GiB"
time=2025-08-12T07:11:52.632Z level=INFO source=types.go:130 msg="inference compute" id=GPU-eab804bb-3a67-1954-db1a-ddf62d8f427a library=cuda variant=v12 compute=8.6 driver=12.8 name="NVIDIA RTX A4000" total="15.6 GiB" available="15.4 GiB"
[GIN] 2025/08/12 - 07:12:31 | 200 |    9.252204ms |      172.17.0.1 | GET      "/api/tags"
[GIN] 2025/08/12 - 07:12:31 | 200 |     212.861µs |      172.17.0.1 | GET      "/api/ps"
[GIN] 2025/08/12 - 07:12:33 | 200 |      77.853µs |      172.17.0.1 | GET      "/api/version"
time=2025-08-12T07:17:17.674Z level=INFO source=server.go:135 msg="system memory" total="125.8 GiB" free="122.8 GiB" free_swap="0 B"
time=2025-08-12T07:17:19.329Z level=INFO source=server.go:175 msg=offload library=cuda layers.requested=-1 layers.model=37 layers.offload=0 layers.split="" memory.available="[15.2 GiB 15.4 GiB 15.4 GiB 15.4 GiB 15.4 GiB]" memory.gpu_overhead="0 B" memory.required.full="64.3 GiB" memory.required.partial="0 B" memory.required.kv="4.6 GiB" memory.required.allocations="[0 B 0 B 0 B 0 B 0 B]" memory.weights.total="59.7 GiB" memory.weights.repeating="58.6 GiB" memory.weights.nonrepeating="1.1 GiB" memory.graph.full="46.9 GiB" memory.graph.partial="46.9 GiB"
time=2025-08-12T07:17:19.329Z level=WARN source=server.go:206 msg="flash attention enabled but not supported by gpu"
time=2025-08-12T07:17:19.527Z level=INFO source=server.go:438 msg="starting llama server" cmd="/usr/bin/ollama runner --ollama-engine --model /root/.ollama/models/blobs/sha256-90a618fe6ff21b09ca968df959104eb650658b0bef0faef785c18c2795d993e3 --ctx-size 128000 --batch-size 512 --threads 16 --no-mmap --parallel 1 --port 39935"
time=2025-08-12T07:17:19.528Z level=INFO source=sched.go:481 msg="loaded runners" count=1
time=2025-08-12T07:17:19.529Z level=INFO source=server.go:598 msg="waiting for llama runner to start responding"
time=2025-08-12T07:17:19.529Z level=INFO source=server.go:632 msg="waiting for server to become available" status="llm server not responding"
time=2025-08-12T07:17:19.567Z level=INFO source=runner.go:925 msg="starting ollama engine"
time=2025-08-12T07:17:19.568Z level=INFO source=runner.go:983 msg="Server listening on 127.0.0.1:39935"
time=2025-08-12T07:17:19.780Z level=INFO source=ggml.go:92 msg="" architecture=gptoss file_type=MXFP4 name="" description="" num_tensors=471 num_key_values=30
time=2025-08-12T07:17:19.782Z level=INFO source=server.go:632 msg="waiting for server to become available" status="llm server loading model"
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 5 CUDA devices:
  Device 0: NVIDIA GeForce RTX 4060 Ti, compute capability 8.9, VMM: yes
  Device 1: NVIDIA RTX A4000, compute capability 8.6, VMM: yes
  Device 2: NVIDIA RTX A4000, compute capability 8.6, VMM: yes
  Device 3: NVIDIA RTX A4000, compute capability 8.6, VMM: yes
  Device 4: NVIDIA RTX A4000, compute capability 8.6, VMM: yes
load_backend: loaded CUDA backend from /usr/lib/ollama/libggml-cuda.so
load_backend: loaded CPU backend from /usr/lib/ollama/libggml-cpu-x64.so
time=2025-08-12T07:17:21.180Z level=INFO source=ggml.go:104 msg=system CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 CUDA.0.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.0.USE_GRAPHS=1 CUDA.0.PEER_MAX_BATCH_SIZE=128 CUDA.1.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.1.USE_GRAPHS=1 CUDA.1.PEER_MAX_BATCH_SIZE=128 CUDA.2.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.2.USE_GRAPHS=1 CUDA.2.PEER_MAX_BATCH_SIZE=128 CUDA.3.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.3.USE_GRAPHS=1 CUDA.3.PEER_MAX_BATCH_SIZE=128 CUDA.4.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.4.USE_GRAPHS=1 CUDA.4.PEER_MAX_BATCH_SIZE=128 compiler=cgo(gcc)
time=2025-08-12T07:17:21.538Z level=INFO source=ggml.go:365 msg="offloading 0 repeating layers to GPU"
time=2025-08-12T07:17:21.539Z level=INFO source=ggml.go:369 msg="offloading output layer to CPU"
time=2025-08-12T07:17:21.539Z level=INFO source=ggml.go:376 msg="offloaded 0/37 layers to GPU"
time=2025-08-12T07:17:21.539Z level=INFO source=ggml.go:379 msg="model weights" buffer=CPU size="60.8 GiB"
time=2025-08-12T07:17:28.251Z level=INFO source=ggml.go:668 msg="compute graph" backend=CUDA0 buffer_type=CUDA0 size="0 B"
time=2025-08-12T07:17:28.251Z level=INFO source=ggml.go:668 msg="compute graph" backend=CUDA1 buffer_type=CUDA1 size="0 B"
time=2025-08-12T07:17:28.251Z level=INFO source=ggml.go:668 msg="compute graph" backend=CUDA2 buffer_type=CUDA2 size="0 B"
time=2025-08-12T07:17:28.251Z level=INFO source=ggml.go:668 msg="compute graph" backend=CUDA3 buffer_type=CUDA3 size="0 B"
time=2025-08-12T07:17:28.251Z level=INFO source=ggml.go:668 msg="compute graph" backend=CUDA4 buffer_type=CUDA4 size="0 B"
time=2025-08-12T07:17:28.251Z level=INFO source=ggml.go:668 msg="compute graph" backend=CPU buffer_type=CPU size="31.3 GiB"
time=2025-08-12T07:17:39.466Z level=INFO source=server.go:637 msg="llama runner started in 19.94 seconds"
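
The scheduler's decision in both runs above is captured in the single msg=offload log line: 30/37 layers at the 8192 context, 0/37 at 128k, presumably because the 46.9 GiB compute graph no longer fits on any single 16 GB card. A minimal shell sketch for pulling those fields out of a saved log; the server.log path is an assumption, substitute wherever your log was written:

```
# Summarize the scheduler's offload decision from an Ollama server log.
# Assumes the log was saved as ./server.log; adjust the path as needed.
grep 'msg=offload' server.log \
  | grep -oE '(layers\.(model|offload|split)|memory\.required\.(full|partial)|memory\.graph\.full)=("[^"]*"|[^ ]*)'
```
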
Author
Owner

@alienatedsec commented on GitHub (Aug 12, 2025):

OpenWebUI - 128k context - max GPU offloading

Image: https://github.com/user-attachments/assets/0fef5b46-573b-4eec-802c-6e074bd97e11

Another go - this one from when the model was already loaded

Image: https://github.com/user-attachments/assets/20cd212a-7ea3-49ba-a0bd-649a93d589bf
root@[redacted]:/# ollama ps
NAME            ID              SIZE     PROCESSOR    CONTEXT    UNTIL   
gpt-oss:120b    735371f916a9    69 GB    100% CPU     128000     Forever    
root@[redacted]:/# 
time=2025-08-12T07:41:40.457Z level=INFO source=routes.go:1304 msg="server config" env="map[CUDA_VISIBLE_DEVICES: GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: HTTPS_PROXY: HTTP_PROXY: NO_PROXY: OLLAMA_CONTEXT_LENGTH:4096 OLLAMA_DEBUG:INFO OLLAMA_FLASH_ATTENTION:true OLLAMA_GPU_OVERHEAD:0 OLLAMA_HOST:http://0.0.0.0:11434 OLLAMA_INTEL_GPU:false OLLAMA_KEEP_ALIVE:2562047h47m16.854775807s OLLAMA_KV_CACHE_TYPE: OLLAMA_LLM_LIBRARY: OLLAMA_LOAD_TIMEOUT:5m0s OLLAMA_MAX_LOADED_MODELS:0 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:/root/.ollama/models OLLAMA_MULTIUSER_CACHE:false OLLAMA_NEW_ENGINE:false OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:1 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://* vscode-webview://* vscode-file://*] OLLAMA_SCHED_SPREAD:false ROCR_VISIBLE_DEVICES: http_proxy: https_proxy: no_proxy:]"
time=2025-08-12T07:41:40.465Z level=INFO source=images.go:477 msg="total blobs: 34"
time=2025-08-12T07:41:40.467Z level=INFO source=images.go:484 msg="total unused blobs removed: 0"
time=2025-08-12T07:41:40.468Z level=INFO source=routes.go:1357 msg="Listening on [::]:11434 (version 0.11.4)"
time=2025-08-12T07:41:40.468Z level=INFO source=gpu.go:217 msg="looking for compatible GPUs"
time=2025-08-12T07:41:42.462Z level=INFO source=types.go:130 msg="inference compute" id=GPU-4df21d53-78e5-97fb-c7ee-b18beb63e1a7 library=cuda variant=v12 compute=8.9 driver=12.8 name="NVIDIA GeForce RTX 4060 Ti" total="15.6 GiB" available="15.2 GiB"
time=2025-08-12T07:41:42.462Z level=INFO source=types.go:130 msg="inference compute" id=GPU-f2ded6c5-33e4-31ac-5c61-c9088edaedbd library=cuda variant=v12 compute=8.6 driver=12.8 name="NVIDIA RTX A4000" total="15.6 GiB" available="15.4 GiB"
time=2025-08-12T07:41:42.462Z level=INFO source=types.go:130 msg="inference compute" id=GPU-c95bf02e-0608-db0d-7759-07d27659f5f8 library=cuda variant=v12 compute=8.6 driver=12.8 name="NVIDIA RTX A4000" total="15.6 GiB" available="15.4 GiB"
time=2025-08-12T07:41:42.462Z level=INFO source=types.go:130 msg="inference compute" id=GPU-fc191601-3661-ad41-ed61-2f8a1d5dbf6a library=cuda variant=v12 compute=8.6 driver=12.8 name="NVIDIA RTX A4000" total="15.6 GiB" available="15.4 GiB"
time=2025-08-12T07:41:42.462Z level=INFO source=types.go:130 msg="inference compute" id=GPU-eab804bb-3a67-1954-db1a-ddf62d8f427a library=cuda variant=v12 compute=8.6 driver=12.8 name="NVIDIA RTX A4000" total="15.6 GiB" available="15.4 GiB"
[GIN] 2025/08/12 - 07:42:18 | 200 |     214.344µs |      172.17.0.1 | GET      "/api/version"
[GIN] 2025/08/12 - 07:42:20 | 200 |    3.704388ms |      172.17.0.1 | GET      "/api/tags"
[GIN] 2025/08/12 - 07:42:20 | 200 |     215.682µs |      172.17.0.1 | GET      "/api/ps"
[GIN] 2025/08/12 - 07:42:21 | 200 |      92.611µs |      172.17.0.1 | GET      "/api/version"
time=2025-08-12T07:42:38.386Z level=INFO source=server.go:135 msg="system memory" total="125.8 GiB" free="122.9 GiB" free_swap="0 B"
time=2025-08-12T07:42:40.029Z level=INFO source=server.go:175 msg=offload library=cuda layers.requested=256 layers.model=37 layers.offload=0 layers.split="" memory.available="[15.2 GiB 15.4 GiB 15.4 GiB 15.4 GiB 15.4 GiB]" memory.gpu_overhead="0 B" memory.required.full="64.3 GiB" memory.required.partial="0 B" memory.required.kv="4.6 GiB" memory.required.allocations="[0 B 0 B 0 B 0 B 0 B]" memory.weights.total="59.7 GiB" memory.weights.repeating="58.6 GiB" memory.weights.nonrepeating="1.1 GiB" memory.graph.full="46.9 GiB" memory.graph.partial="46.9 GiB"
time=2025-08-12T07:42:40.029Z level=WARN source=server.go:206 msg="flash attention enabled but not supported by gpu"
time=2025-08-12T07:42:40.219Z level=INFO source=server.go:438 msg="starting llama server" cmd="/usr/bin/ollama runner --ollama-engine --model /root/.ollama/models/blobs/sha256-90a618fe6ff21b09ca968df959104eb650658b0bef0faef785c18c2795d993e3 --ctx-size 128000 --batch-size 512 --n-gpu-layers 256 --threads 16 --no-mmap --parallel 1 --port 43889"
time=2025-08-12T07:42:40.221Z level=INFO source=sched.go:481 msg="loaded runners" count=1
time=2025-08-12T07:42:40.221Z level=INFO source=server.go:598 msg="waiting for llama runner to start responding"
time=2025-08-12T07:42:40.221Z level=INFO source=server.go:632 msg="waiting for server to become available" status="llm server not responding"
time=2025-08-12T07:42:40.261Z level=INFO source=runner.go:925 msg="starting ollama engine"
time=2025-08-12T07:42:40.261Z level=INFO source=runner.go:983 msg="Server listening on 127.0.0.1:43889"
time=2025-08-12T07:42:40.473Z level=INFO source=server.go:632 msg="waiting for server to become available" status="llm server loading model"
time=2025-08-12T07:42:40.475Z level=INFO source=ggml.go:92 msg="" architecture=gptoss file_type=MXFP4 name="" description="" num_tensors=471 num_key_values=30
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 5 CUDA devices:
  Device 0: NVIDIA GeForce RTX 4060 Ti, compute capability 8.9, VMM: yes
  Device 1: NVIDIA RTX A4000, compute capability 8.6, VMM: yes
  Device 2: NVIDIA RTX A4000, compute capability 8.6, VMM: yes
  Device 3: NVIDIA RTX A4000, compute capability 8.6, VMM: yes
  Device 4: NVIDIA RTX A4000, compute capability 8.6, VMM: yes
load_backend: loaded CUDA backend from /usr/lib/ollama/libggml-cuda.so
load_backend: loaded CPU backend from /usr/lib/ollama/libggml-cpu-x64.so
time=2025-08-12T07:42:41.941Z level=INFO source=ggml.go:104 msg=system CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 CUDA.0.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.0.USE_GRAPHS=1 CUDA.0.PEER_MAX_BATCH_SIZE=128 CUDA.1.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.1.USE_GRAPHS=1 CUDA.1.PEER_MAX_BATCH_SIZE=128 CUDA.2.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.2.USE_GRAPHS=1 CUDA.2.PEER_MAX_BATCH_SIZE=128 CUDA.3.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.3.USE_GRAPHS=1 CUDA.3.PEER_MAX_BATCH_SIZE=128 CUDA.4.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.4.USE_GRAPHS=1 CUDA.4.PEER_MAX_BATCH_SIZE=128 compiler=cgo(gcc)
time=2025-08-12T07:42:42.331Z level=INFO source=ggml.go:365 msg="offloading 36 repeating layers to GPU"
time=2025-08-12T07:42:42.331Z level=INFO source=ggml.go:371 msg="offloading output layer to GPU"
time=2025-08-12T07:42:42.332Z level=INFO source=ggml.go:376 msg="offloaded 37/37 layers to GPU"
time=2025-08-12T07:42:42.332Z level=INFO source=ggml.go:379 msg="model weights" buffer=CPU size="1.1 GiB"
time=2025-08-12T07:42:42.332Z level=INFO source=ggml.go:379 msg="model weights" buffer=CUDA0 size="13.0 GiB"
time=2025-08-12T07:42:42.332Z level=INFO source=ggml.go:379 msg="model weights" buffer=CUDA1 size="11.4 GiB"
time=2025-08-12T07:42:42.332Z level=INFO source=ggml.go:379 msg="model weights" buffer=CUDA2 size="13.0 GiB"
time=2025-08-12T07:42:42.332Z level=INFO source=ggml.go:379 msg="model weights" buffer=CUDA3 size="11.4 GiB"
time=2025-08-12T07:42:42.332Z level=INFO source=ggml.go:379 msg="model weights" buffer=CUDA4 size="10.9 GiB"
time=2025-08-12T07:42:43.561Z level=INFO source=ggml.go:668 msg="compute graph" backend=CUDA0 buffer_type=CUDA0 size="31.5 GiB"
time=2025-08-12T07:42:43.561Z level=INFO source=ggml.go:668 msg="compute graph" backend=CUDA1 buffer_type=CUDA1 size="31.5 GiB"
time=2025-08-12T07:42:43.561Z level=INFO source=ggml.go:668 msg="compute graph" backend=CUDA2 buffer_type=CUDA2 size="31.5 GiB"
time=2025-08-12T07:42:43.561Z level=INFO source=ggml.go:668 msg="compute graph" backend=CUDA3 buffer_type=CUDA3 size="31.5 GiB"
time=2025-08-12T07:42:43.561Z level=INFO source=ggml.go:668 msg="compute graph" backend=CUDA4 buffer_type=CUDA4 size="31.5 GiB"
time=2025-08-12T07:42:43.561Z level=INFO source=ggml.go:668 msg="compute graph" backend=CPU buffer_type=CPU size="5.6 MiB"
time=2025-08-12T07:44:16.820Z level=INFO source=server.go:637 msg="llama runner started in 96.60 seconds"
[GIN] 2025/08/12 - 07:45:06 | 200 |         2m42s |      172.17.0.1 | POST     "/api/chat"
[GIN] 2025/08/12 - 07:45:26 | 200 | 20.203185733s |      172.17.0.1 | POST     "/api/chat"
[GIN] 2025/08/12 - 07:45:39 | 200 | 12.833296469s |      172.17.0.1 | POST     "/api/chat"
[GIN] 2025/08/12 - 07:45:58 | 200 | 18.500690622s |      172.17.0.1 | POST     "/api/chat"
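
For what it's worth, the layers.requested=256 / --n-gpu-layers 256 in the log above appears to be what OpenWebUI's max GPU offloading setting translates to, and the same request can be reproduced against the API directly. A minimal sketch, assuming the default local endpoint; the prompt is only a placeholder and 256 simply means "offload as many layers as possible":

```
# Request gpt-oss:120b with a 128k context and maximum GPU offload
# via the Ollama HTTP API (equivalent to the OpenWebUI settings above).
curl http://localhost:11434/api/generate -d '{
  "model": "gpt-oss:120b",
  "prompt": "why is the sky blue?",
  "options": {
    "num_ctx": 128000,
    "num_gpu": 256
  }
}'
```
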
Author
Owner

@alienatedsec commented on GitHub (Aug 12, 2025):

@rick-github

Here is another example for llama4:16x17b, which seems to report correctly on the CPU/GPU split.

root@[redacted]:/# ollama ps
NAME             ID              SIZE      PROCESSOR          CONTEXT    UNTIL   
llama4:16x17b    bf31604e25c2    159 GB    52%/48% CPU/GPU    128000     Forever    
root@[redacted]:/# 
Image: https://github.com/user-attachments/assets/7b55b31e-2e9b-4ffc-89e8-a8ca1b5d56d3
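
The PROCESSOR split that ollama ps prints is derived from how much of the loaded model the server believes is resident in VRAM versus host memory, and the raw numbers behind it are also exposed over the API. A small sketch for reading them directly (assumes jq is installed; the field names follow the /api/ps response):

```
# Show each loaded model's total size and the portion reported as resident in VRAM.
curl -s http://localhost:11434/api/ps \
  | jq -r '.models[] | "\(.name)  size=\(.size)  size_vram=\(.size_vram)"'
```
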
Author
Owner

@alienatedsec commented on GitHub (Aug 12, 2025):

I recently added another GPU to my setup and checked nvidia-smi while the gpt-oss model was loaded:

Image: https://github.com/user-attachments/assets/c770823c-13bd-44c5-a33b-09ce3db25add

Here is the most relevant output:

GPU 0: 11 461 / 16 376 MiB (≈70 % used)
GPU 1: 15 770 / 20 475 MiB (≈77 % used)
GPU 2: 11 421 / 16 380 MiB (≈70 % used)
GPU 3: 11 461 / 16 376 MiB (≈70 % used)
GPU 4: 11 461 / 16 376 MiB (≈70 % used)
GPU 5: 9 225 / 16 376 MiB (≈56 % used)

Overall: 70 839 / 102 359 MiB (≈69 % used, 31 % free)
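
For anyone repeating this check, the per-GPU numbers can also be sampled non-interactively while the model is generating; a minimal sketch (the one-second interval is arbitrary):

```
# Sample per-GPU memory use and utilization once a second.
watch -n 1 'nvidia-smi --query-gpu=index,name,memory.used,memory.total,utilization.gpu --format=csv,noheader'
```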

time=2025-08-12T12:06:06.622Z level=INFO source=server.go:135 msg="system memory" total="125.8 GiB" free="104.1 GiB" free_swap="0 B"
time=2025-08-12T12:06:08.617Z level=INFO source=server.go:175 msg=offload library=cuda layers.requested=256 layers.model=37 layers.offload=0 layers.split="" memory.available="[18.7 GiB 15.5 GiB 15.4 GiB 15.4 GiB 15.4 GiB 15.4 GiB]" memory.gpu_overhead="0 B" memory.required.full="64.3 GiB" memory.required.partial="0 B" memory.required.kv="4.6 GiB" memory.required.allocations="[0 B 0 B 0 B 0 B 0 B 0 B]" memory.weights.total="59.7 GiB" memory.weights.repeating="58.6 GiB" memory.weights.nonrepeating="1.1 GiB" memory.graph.full="46.9 GiB" memory.graph.partial="46.9 GiB"
time=2025-08-12T12:06:08.617Z level=WARN source=server.go:206 msg="flash attention enabled but not supported by gpu"
time=2025-08-12T12:06:08.796Z level=INFO source=server.go:438 msg="starting llama server" cmd="/usr/bin/ollama runner --ollama-engine --model /root/.ollama/models/blobs/sha256-90a618fe6ff21b09ca968df959104eb650658b0bef0faef785c18c2795d993e3 --ctx-size 128000 --batch-size 512 --n-gpu-layers 256 --threads 16 --no-mmap --parallel 1 --port 37283"
time=2025-08-12T12:06:08.797Z level=INFO source=sched.go:481 msg="loaded runners" count=1
time=2025-08-12T12:06:08.797Z level=INFO source=server.go:598 msg="waiting for llama runner to start responding"
time=2025-08-12T12:06:08.797Z level=INFO source=server.go:632 msg="waiting for server to become available" status="llm server not responding"
time=2025-08-12T12:06:08.841Z level=INFO source=runner.go:925 msg="starting ollama engine"
time=2025-08-12T12:06:08.843Z level=INFO source=runner.go:983 msg="Server listening on 127.0.0.1:37283"
time=2025-08-12T12:06:09.050Z level=INFO source=server.go:632 msg="waiting for server to become available" status="llm server loading model"
time=2025-08-12T12:06:09.059Z level=INFO source=ggml.go:92 msg="" architecture=gptoss file_type=MXFP4 name="" description="" num_tensors=471 num_key_values=30
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 6 CUDA devices:
  Device 0: NVIDIA RTX 4000 Ada Generation, compute capability 8.9, VMM: yes
  Device 1: NVIDIA GeForce RTX 4060 Ti, compute capability 8.9, VMM: yes
  Device 2: NVIDIA RTX A4000, compute capability 8.6, VMM: yes
  Device 3: NVIDIA RTX A4000, compute capability 8.6, VMM: yes
  Device 4: NVIDIA RTX A4000, compute capability 8.6, VMM: yes
  Device 5: NVIDIA RTX A4000, compute capability 8.6, VMM: yes
[GIN] 2025/08/12 - 12:06:10 | 200 |      47.379µs |       127.0.0.1 | HEAD     "/"
[GIN] 2025/08/12 - 12:06:10 | 200 |      63.108µs |       127.0.0.1 | GET      "/api/ps"
load_backend: loaded CUDA backend from /usr/lib/ollama/libggml-cuda.so
load_backend: loaded CPU backend from /usr/lib/ollama/libggml-cpu-x64.so
time=2025-08-12T12:06:10.880Z level=INFO source=ggml.go:104 msg=system CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 CUDA.0.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.0.USE_GRAPHS=1 CUDA.0.PEER_MAX_BATCH_SIZE=128 CUDA.1.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.1.USE_GRAPHS=1 CUDA.1.PEER_MAX_BATCH_SIZE=128 CUDA.2.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.2.USE_GRAPHS=1 CUDA.2.PEER_MAX_BATCH_SIZE=128 CUDA.3.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.3.USE_GRAPHS=1 CUDA.3.PEER_MAX_BATCH_SIZE=128 CUDA.4.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.4.USE_GRAPHS=1 CUDA.4.PEER_MAX_BATCH_SIZE=128 CUDA.5.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.5.USE_GRAPHS=1 CUDA.5.PEER_MAX_BATCH_SIZE=128 compiler=cgo(gcc)
time=2025-08-12T12:06:11.211Z level=INFO source=ggml.go:365 msg="offloading 36 repeating layers to GPU"
time=2025-08-12T12:06:11.211Z level=INFO source=ggml.go:371 msg="offloading output layer to GPU"
time=2025-08-12T12:06:11.211Z level=INFO source=ggml.go:376 msg="offloaded 37/37 layers to GPU"
time=2025-08-12T12:06:11.211Z level=INFO source=ggml.go:379 msg="model weights" buffer=CUDA5 size="7.6 GiB"
time=2025-08-12T12:06:11.211Z level=INFO source=ggml.go:379 msg="model weights" buffer=CPU size="1.1 GiB"
time=2025-08-12T12:06:11.211Z level=INFO source=ggml.go:379 msg="model weights" buffer=CUDA0 size="13.0 GiB"
time=2025-08-12T12:06:11.211Z level=INFO source=ggml.go:379 msg="model weights" buffer=CUDA1 size="9.8 GiB"
time=2025-08-12T12:06:11.211Z level=INFO source=ggml.go:379 msg="model weights" buffer=CUDA2 size="9.8 GiB"
time=2025-08-12T12:06:11.211Z level=INFO source=ggml.go:379 msg="model weights" buffer=CUDA3 size="9.8 GiB"
time=2025-08-12T12:06:11.211Z level=INFO source=ggml.go:379 msg="model weights" buffer=CUDA4 size="9.8 GiB"
[GIN] 2025/08/12 - 12:06:11 | 200 |       54.01µs |       127.0.0.1 | HEAD     "/"
[GIN] 2025/08/12 - 12:06:11 | 200 |      67.708µs |       127.0.0.1 | GET      "/api/ps"
time=2025-08-12T12:06:12.460Z level=INFO source=ggml.go:668 msg="compute graph" backend=CUDA0 buffer_type=CUDA0 size="31.5 GiB"
time=2025-08-12T12:06:12.461Z level=INFO source=ggml.go:668 msg="compute graph" backend=CUDA1 buffer_type=CUDA1 size="31.5 GiB"
time=2025-08-12T12:06:12.461Z level=INFO source=ggml.go:668 msg="compute graph" backend=CUDA2 buffer_type=CUDA2 size="31.5 GiB"
time=2025-08-12T12:06:12.461Z level=INFO source=ggml.go:668 msg="compute graph" backend=CUDA3 buffer_type=CUDA3 size="31.5 GiB"
time=2025-08-12T12:06:12.461Z level=INFO source=ggml.go:668 msg="compute graph" backend=CUDA4 buffer_type=CUDA4 size="31.5 GiB"
time=2025-08-12T12:06:12.461Z level=INFO source=ggml.go:668 msg="compute graph" backend=CUDA5 buffer_type=CUDA5 size="31.5 GiB"
time=2025-08-12T12:06:12.461Z level=INFO source=ggml.go:668 msg="compute graph" backend=CPU buffer_type=CPU size="5.6 MiB"
[GIN] 2025/08/12 - 12:06:12 | 200 |      69.882µs |       127.0.0.1 | HEAD     "/"
[GIN] 2025/08/12 - 12:06:12 | 200 |      80.267µs |       127.0.0.1 | GET      "/api/ps"
[GIN] 2025/08/12 - 12:06:13 | 200 |      51.694µs |       127.0.0.1 | HEAD     "/"
[GIN] 2025/08/12 - 12:06:13 | 200 |      71.656µs |       127.0.0.1 | GET      "/api/ps"
[GIN] 2025/08/12 - 12:06:13 | 200 |      56.547µs |       127.0.0.1 | HEAD     "/"
[GIN] 2025/08/12 - 12:06:13 | 200 |      51.079µs |       127.0.0.1 | GET      "/api/ps"
[GIN] 2025/08/12 - 12:06:46 | 200 |       54.03µs |       127.0.0.1 | HEAD     "/"
[GIN] 2025/08/12 - 12:06:46 | 200 |      66.716µs |       127.0.0.1 | GET      "/api/ps"
time=2025-08-12T12:06:46.536Z level=INFO source=server.go:637 msg="llama runner started in 37.74 seconds"
[GIN] 2025/08/12 - 12:06:47 | 200 |      78.643µs |       127.0.0.1 | HEAD     "/"
[GIN] 2025/08/12 - 12:06:47 | 200 |      62.992µs |       127.0.0.1 | GET      "/api/ps"
[GIN] 2025/08/12 - 12:07:19 | 200 |         1m42s |      172.17.0.1 | POST     "/api/chat"
[GIN] 2025/08/12 - 12:07:36 | 200 | 17.421053094s |      172.17.0.1 | POST     "/api/chat"
[GIN] 2025/08/12 - 12:08:25 | 200 |  53.63390453s |      172.17.0.1 | POST     "/api/chat"
[GIN] 2025/08/12 - 12:08:40 | 200 |          1m3s |      172.17.0.1 | POST     "/api/chat"
[GIN] 2025/08/12 - 12:08:58 | 200 | 32.881046065s |      172.17.0.1 | POST     "/api/chat"
[GIN] 2025/08/12 - 12:09:11 | 200 | 31.576654164s |      172.17.0.1 | POST     "/api/chat"
[GIN] 2025/08/12 - 12:09:25 | 200 | 26.374625068s |      172.17.0.1 | POST     "/api/chat"
[GIN] 2025/08/12 - 12:10:18 | 200 |          1m6s |      172.17.0.1 | POST     "/api/chat"
[GIN] 2025/08/12 - 12:10:37 | 200 |         1m12s |      172.17.0.1 | POST     "/api/chat"
[GIN] 2025/08/12 - 12:10:55 | 200 | 37.072014344s |      172.17.0.1 | POST     "/api/chat"
[GIN] 2025/08/12 - 12:11:10 | 200 | 14.175902058s |      172.17.0.1 | POST     "/api/chat"
[GIN] 2025/08/12 - 12:11:27 | 200 | 17.388711625s |      172.17.0.1 | POST     "/api/chat"
[GIN] 2025/08/12 - 12:13:34 | 200 | 34.798147103s |      172.17.0.1 | POST     "/api/chat"
[GIN] 2025/08/12 - 12:13:57 | 200 | 22.514621634s |      172.17.0.1 | POST     "/api/chat"
Author
Owner

@SHU-red commented on GitHub (Aug 14, 2025):

  • ollama in docker
  • model gpt-oss:20b
  • RTX 4080 16GB

Running heavily on CPU.
I also used ollama create to build a

  • num_ctx 32000
  • num_ctx 3000

version of the model.

Tested all 3 models: official, 32k and 3k.

Logs

Logs below were all created with my custom model, with the context window reduced to 32k via a Modelfile:

FROM gpt-oss:20b
PARAMETER num_ctx 32000

ollama create -f Modelfile gpt-oss:20b_ctx32k

root@63ad3d6a32f1:/# ollama ps
NAME                  ID              SIZE     PROCESSOR          CONTEXT    UNTIL
gpt-oss:20b_ctx32k    244276c2a394    22 GB    39%/61% CPU/GPU    32000      4 minutes from now
services:
  ollama:
    volumes:
....
    container_name: ollama
    network_mode: bridge
    ports:
      - 11434:11434
    pull_policy: always
    tty: true
    restart: unless-stopped
    image: ollama/ollama:latest
    privileged: true
    environment:
      - OLLAMA_RUN_PARALLEL=1 # Tested with/without
     # - OLLAMA_CONTEXT_LENGTH=3000 # Tested with/without and also 32000
      - GGML_CUDA_ENABLE_UNIFIED_MEMORY=1 # Tested with/without
    devices:
      - /dev/dri/card0 # passing graphics card
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
        limits:
          cpus: '8' # Limited for such occasions to not slow down my whole server
Image
$ nvidia-smi                                                                  Thu Aug 14 06:59:16 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 575.64.05              Driver Version: 575.64.05      CUDA Version: 12.9     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 4080        Off |   00000000:06:00.0  On |                  N/A |
|  0%   55C    P2             58W /  320W |    7157MiB /  16376MiB |      1%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A           13851      C   frigate.detector.tensorrt               348MiB |
|    0   N/A  N/A           13909      C   /usr/lib/ffmpeg/7.0/bin/ffmpeg          340MiB |
|    0   N/A  N/A           13977      C   python3                                 236MiB |
|    0   N/A  N/A           20591      C   /usr/lib/ffmpeg/7.0/bin/ffmpeg          320MiB |
|    0   N/A  N/A         1046529      C   /usr/bin/ollama                         358MiB |
|    0   N/A  N/A         3321696      C   /opt/venv/bin/python3                   236MiB |
|    0   N/A  N/A         3830441      G   /usr/lib/xorg/Xorg                      149MiB |
|    0   N/A  N/A         3831365      G   xfwm4                                     4MiB |
|    0   N/A  N/A         3831393    C+G   /usr/bin/sunshine                       241MiB |
|    0   N/A  N/A         3831662      G   ...nstallation/ubuntu12_32/steam          4MiB |
|    0   N/A  N/A         3831929      G   ./steamwebhelper                          9MiB |
|    0   N/A  N/A         3831952      G   ...on/ubuntu12_64/steamwebhelper        122MiB |
+-----------------------------------------------------------------------------------------+
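
As a side check (a sketch; it assumes the container is named ollama and that the NVIDIA container toolkit mounts nvidia-smi into it), the same query can be run inside the container to confirm the GPU is exposed to Ollama and not only to the host:

# hypothetical check from the host
sudo docker exec ollama nvidia-smi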
sudo docker logs ollama
$ sudo docker logs ollama
time=2025-08-14T06:57:20.491Z level=INFO source=routes.go:1304 msg="server config" env="map[CUDA_VISIBLE_DEVICES: GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: HTTPS_PROXY: HTTP_PROXY: NO_PROXY: OLLAMA_CONTEXT_LENGTH:4096 OLLAMA_DEBUG:INFO OLLAMA_FLASH_ATTENTION:false OLLAMA_GPU_OVERHEAD:0 OLLAMA_HOST:http://0.0.0.0:11434 OLLAMA_INTEL_GPU:false OLLAMA_KEEP_ALIVE:5m0s OLLAMA_KV_CACHE_TYPE: OLLAMA_LLM_LIBRARY: OLLAMA_LOAD_TIMEOUT:5m0s OLLAMA_MAX_LOADED_MODELS:0 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:/root/.ollama/models OLLAMA_MULTIUSER_CACHE:false OLLAMA_NEW_ENGINE:false OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:1 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://* vscode-webview://* vscode-file://*] OLLAMA_SCHED_SPREAD:false ROCR_VISIBLE_DEVICES: http_proxy: https_proxy: no_proxy:]"
time=2025-08-14T06:57:20.494Z level=INFO source=images.go:477 msg="total blobs: 66"
time=2025-08-14T06:57:20.495Z level=INFO source=images.go:484 msg="total unused blobs removed: 0"
time=2025-08-14T06:57:20.495Z level=INFO source=routes.go:1357 msg="Listening on [::]:11434 (version 0.11.4)"
time=2025-08-14T06:57:20.495Z level=INFO source=gpu.go:217 msg="looking for compatible GPUs"
time=2025-08-14T06:57:20.662Z level=INFO source=types.go:130 msg="inference compute" id=GPU-db50747e-f16a-d05b-fe6a-26ffdb825cc6 library=cuda variant=v12 compute=8.9 driver=12.9 name="NVIDIA GeForce RTX 4080" total="15.6 GiB" available="13.2 GiB"
time=2025-08-14T06:57:20.662Z level=INFO source=routes.go:1398 msg="entering low vram mode" "total vram"="15.6 GiB" threshold="20.0 GiB"
[GIN] 2025/08/14 - 06:57:34 | 200 |     1.82955ms |      172.18.0.1 | GET      "/api/tags"
time=2025-08-14T06:57:45.193Z level=INFO source=server.go:135 msg="system memory" total="62.7 GiB" free="46.0 GiB" free_swap="290.9 MiB"
time=2025-08-14T06:57:45.193Z level=INFO source=server.go:175 msg=offload library=cuda layers.requested=-1 layers.model=25 layers.offload=9 layers.split="" memory.available="[13.2 GiB]" memory.gpu_overhead="0 B" memory.required.full="21.3 GiB" memory.required.partial="13.1 GiB" memory.required.kv="858.0 MiB" memory.required.allocations="[13.1 GiB]" memory.weights.total="11.7 GiB" memory.weights.repeating="10.7 GiB" memory.weights.nonrepeating="1.1 GiB" memory.graph.full="7.8 GiB" memory.graph.partial="7.8 GiB"
time=2025-08-14T06:57:45.240Z level=INFO source=server.go:438 msg="starting llama server" cmd="/usr/bin/ollama runner --ollama-engine --model /root/.ollama/models/blobs/sha256-b112e727c6f18875636c56a779790a590d705aec9e1c0eb5a97d51fc2a778583 --ctx-size 32000 --batch-size 512 --n-gpu-layers 9 --threads 8 --parallel 1 --port 43991"
time=2025-08-14T06:57:45.241Z level=INFO source=sched.go:481 msg="loaded runners" count=1
time=2025-08-14T06:57:45.241Z level=INFO source=server.go:598 msg="waiting for llama runner to start responding"
time=2025-08-14T06:57:45.241Z level=INFO source=server.go:632 msg="waiting for server to become available" status="llm server not responding"
time=2025-08-14T06:57:45.250Z level=INFO source=runner.go:925 msg="starting ollama engine"
time=2025-08-14T06:57:45.251Z level=INFO source=runner.go:983 msg="Server listening on 127.0.0.1:43991"
time=2025-08-14T06:57:45.299Z level=INFO source=ggml.go:92 msg="" architecture=gptoss file_type=MXFP4 name="" description="" num_tensors=315 num_key_values=30
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 4080, compute capability 8.9, VMM: yes
load_backend: loaded CUDA backend from /usr/lib/ollama/libggml-cuda.so
load_backend: loaded CPU backend from /usr/lib/ollama/libggml-cpu-haswell.so
time=2025-08-14T06:57:45.363Z level=INFO source=ggml.go:104 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.BMI2=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 CUDA.0.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.0.USE_GRAPHS=1 CUDA.0.PEER_MAX_BATCH_SIZE=128 compiler=cgo(gcc)
time=2025-08-14T06:57:45.461Z level=INFO source=ggml.go:365 msg="offloading 9 repeating layers to GPU"
time=2025-08-14T06:57:45.461Z level=INFO source=ggml.go:369 msg="offloading output layer to CPU"
time=2025-08-14T06:57:45.461Z level=INFO source=ggml.go:376 msg="offloaded 9/25 layers to GPU"
time=2025-08-14T06:57:45.461Z level=INFO source=ggml.go:379 msg="model weights" buffer=CUDA0 size="4.0 GiB"
time=2025-08-14T06:57:45.461Z level=INFO source=ggml.go:379 msg="model weights" buffer=CPU size="8.8 GiB"
time=2025-08-14T06:57:45.492Z level=INFO source=server.go:632 msg="waiting for server to become available" status="llm server loading model"
time=2025-08-14T06:57:45.760Z level=INFO source=ggml.go:668 msg="compute graph" backend=CUDA0 buffer_type=CUDA0 size="7.9 GiB"
time=2025-08-14T06:57:45.761Z level=INFO source=ggml.go:668 msg="compute graph" backend=CPU buffer_type=CPU size="7.8 GiB"
time=2025-08-14T06:57:48.280Z level=INFO source=server.go:637 msg="llama runner started in 3.04 seconds"
[GIN] 2025/08/14 - 06:58:10 | 200 | 26.056937704s |      172.18.0.1 | POST     "/api/generate"
[GIN] 2025/08/14 - 06:58:40 | 200 |      30.728µs |       127.0.0.1 | HEAD     "/"
[GIN] 2025/08/14 - 06:58:40 | 200 |      70.573µs |       127.0.0.1 | GET      "/api/ps"
[GIN] 2025/08/14 - 06:58:40 | 200 | 55.946239643s |      172.18.0.1 | POST     "/api/generate"
[GIN] 2025/08/14 - 06:59:21 | 200 |         1m36s |      172.18.0.1 | POST     "/api/generate"
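
The key lines in logs like the above are the "low vram mode" notice and the offload decision ("offloaded 9/25 layers to GPU"). A quick way to pull just those out (a sketch; it assumes the container is named ollama):

# filter the Ollama container logs for the offload-related lines
sudo docker logs ollama 2>&1 | grep -E 'low vram|offload|model weights'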
Author
Owner

@alienatedsec commented on GitHub (Aug 14, 2025):

@SHU-red Just a thought that worked for me and was mentioned by @rick-github

As you already have - GGML_CUDA_ENABLE_UNIFIED_MEMORY=1, can you also set --n-gpu-layers 256 instead of the current --n-gpu-layers 9?

Author
Owner

@SHU-red commented on GitHub (Aug 14, 2025):

can you also set --n-gpu-layers 256

Hi @alienatedsec
How do I do this?

Is there an option for docker-compose?

Author
Owner

@alienatedsec commented on GitHub (Aug 14, 2025):

@SHU-red When you run the interactive mode (see https://github.com/ollama/ollama/issues/1855#issuecomment-1881719430):

ollama run gpt-oss:20b_ctx32k
>>> /set parameter num_gpu 256
Set parameter 'num_gpu' to '256'

>>>
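
For services that talk to Ollama over the HTTP API instead of the CLI, the same setting can also be passed per request through the options field (a minimal sketch; the prompt is just a placeholder):

# hypothetical request against the custom model from above
curl http://localhost:11434/api/generate -d '{
  "model": "gpt-oss:20b_ctx32k",
  "prompt": "Hello",
  "options": { "num_gpu": 256, "num_ctx": 32000 }
}'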
Author
Owner

@SHU-red commented on GitHub (Aug 14, 2025):

run the interactive mode

Oh yes! This works! Let me guess: I should have read the above more carefully, and there is no way to set this globally so it applies to all my other services that use Ollama via the API?

Author
Owner

@alienatedsec commented on GitHub (Aug 14, 2025):

I should have read the above more and there is no way to globally set this and use it with all my other services using ollama via api?

https://github.com/ollama/ollama/issues/4850#issuecomment-2176979850

Author
Owner

@SHU-red commented on GitHub (Aug 14, 2025):

#4850 (comment)

Sorry, I saw this one but I don't know what number to set OLLAMA_MAX_VRAM to in order to get the same effect as --n-gpu-layers 256

Author
Owner

@alienatedsec commented on GitHub (Aug 14, 2025):

#4850 (comment)

Sorry, saw this one but do not know what number to set for OLLAMA_MAX_VRAM to have the same effect as --n_gpu_layers 256

Try it without OLLAMA_MAX_VRAM first; only set it if the model fails to load.

Author
Owner

@SHU-red commented on GitHub (Aug 14, 2025):

OK, not sure what you mean.
I guess this is the VRAM of my GPU in bytes, which is 16GB?
I set it in docker-compose to 10000000, which should be 10GB?

Seems to work! But I'm not sure if this is the solution or if it is still set from your "interactive mode" hint.

Thanks anyway

services:
  ollama:
    container_name: ollama
    network_mode: bridge
    ports:
      - 11434:11434
    pull_policy: always
    tty: true
    restart: unless-stopped
    image: ollama/ollama:latest
    privileged: true
    environment:
      - OLLAMA_RUN_PARALLEL=1
      # - OLLAMA_CONTEXT_LENGTH=3000
      - GGML_CUDA_ENABLE_UNIFIED_MEMORY=1
      - OLLAMA_MAX_VRAM 13000000
    ...
Author
Owner

@alienatedsec commented on GitHub (Aug 14, 2025):

I don't believe you need the OLLAMA_MAX_VRAM variable, as it would likely use whatever is available anyway. You also need = to make it a valid env variable: - OLLAMA_MAX_VRAM=13000000

Regardless, what are your stats now?

Author
Owner

@SHU-red commented on GitHub (Aug 14, 2025):

I don't believe you need OLLAMA_MAX_VRAM

OK, sorry, I really don't get what you want me to do.

Regardless, what are your stats now?

  • stopped container
  • started with different combinations of
    environment:
      - OLLAMA_RUN_PARALLEL=1
      - OLLAMA_CONTEXT_LENGTH=32000
      - GGML_CUDA_ENABLE_UNIFIED_MEMORY=1
      - OLLAMA_MAX_VRAM=13000000

Only the CPU is working again.
It seems that setting /set parameter num_gpu 256 once is the only thing that does the trick, and it stays active until I completely shut down and restart the container, after which I would have to set it again, right?

So after my hard stop-and-start it is currently not working again.


$ sudo docker logs ollama; nvidia-smi; sudo docker exec ollama /bin/ollama ps
time=2025-08-14T11:36:27.005Z level=INFO source=routes.go:1304 msg="server config" env="map[CUDA_VISIBLE_DEVICES: GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: HTTPS_PROXY: HTTP_PROXY: NO_PROXY: OLLAMA_CONTEXT_LENGTH:32000 OLLAMA_DEBUG:INFO OLLAMA_FLASH_ATTENTION:false OLLAMA_GPU_OVERHEAD:0 OLLAMA_HOST:http://0.0.0.0:11434 OLLAMA_INTEL_GPU:false OLLAMA_KEEP_ALIVE:5m0s OLLAMA_KV_CACHE_TYPE: OLLAMA_LLM_LIBRARY: OLLAMA_LOAD_TIMEOUT:5m0s OLLAMA_MAX_LOADED_MODELS:0 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:/root/.ollama/models OLLAMA_MULTIUSER_CACHE:false OLLAMA_NEW_ENGINE:false OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:1 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://* vscode-webview://* vscode-file://*] OLLAMA_SCHED_SPREAD:false ROCR_VISIBLE_DEVICES: http_proxy: https_proxy: no_proxy:]"
time=2025-08-14T11:36:27.008Z level=INFO source=images.go:477 msg="total blobs: 66"
time=2025-08-14T11:36:27.009Z level=INFO source=images.go:484 msg="total unused blobs removed: 0"
time=2025-08-14T11:36:27.009Z level=INFO source=routes.go:1357 msg="Listening on [::]:11434 (version 0.11.4)"
time=2025-08-14T11:36:27.010Z level=INFO source=gpu.go:217 msg="looking for compatible GPUs"
time=2025-08-14T11:36:27.170Z level=INFO source=types.go:130 msg="inference compute" id=GPU-db50747e-f16a-d05b-fe6a-26ffdb825cc6 library=cuda variant=v12 compute=8.9 driver=12.9 name="NVIDIA GeForce RTX 4080" total="15.6 GiB" available="13.2 GiB"
time=2025-08-14T11:36:27.170Z level=INFO source=routes.go:1398 msg="entering low vram mode" "total vram"="15.6 GiB" threshold="20.0 GiB"
time=2025-08-14T11:36:30.709Z level=INFO source=server.go:135 msg="system memory" total="62.7 GiB" free="46.4 GiB" free_swap="310.9 MiB"
time=2025-08-14T11:36:30.710Z level=INFO source=server.go:175 msg=offload library=cuda layers.requested=-1 layers.model=25 layers.offload=9 layers.split="" memory.available="[13.2 GiB]" memory.gpu_overhead="0 B" memory.required.full="21.3 GiB" memory.required.partial="13.1 GiB" memory.required.kv="858.0 MiB" memory.required.allocations="[13.1 GiB]" memory.weights.total="11.7 GiB" memory.weights.repeating="10.7 GiB" memory.weights.nonrepeating="1.1 GiB" memory.graph.full="7.8 GiB" memory.graph.partial="7.8 GiB"
time=2025-08-14T11:36:30.752Z level=INFO source=server.go:438 msg="starting llama server" cmd="/usr/bin/ollama runner --ollama-engine --model /root/.ollama/models/blobs/sha256-b112e727c6f18875636c56a779790a590d705aec9e1c0eb5a97d51fc2a778583 --ctx-size 32000 --batch-size 512 --n-gpu-layers 9 --threads 8 --parallel 1 --port 41543"
time=2025-08-14T11:36:30.752Z level=INFO source=sched.go:481 msg="loaded runners" count=1
time=2025-08-14T11:36:30.752Z level=INFO source=server.go:598 msg="waiting for llama runner to start responding"
time=2025-08-14T11:36:30.753Z level=INFO source=server.go:632 msg="waiting for server to become available" status="llm server not responding"
time=2025-08-14T11:36:30.762Z level=INFO source=runner.go:925 msg="starting ollama engine"
time=2025-08-14T11:36:30.763Z level=INFO source=runner.go:983 msg="Server listening on 127.0.0.1:41543"
time=2025-08-14T11:36:30.812Z level=INFO source=ggml.go:92 msg="" architecture=gptoss file_type=MXFP4 name="" description="" num_tensors=315 num_key_values=30
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 4080, compute capability 8.9, VMM: yes
load_backend: loaded CUDA backend from /usr/lib/ollama/libggml-cuda.so
load_backend: loaded CPU backend from /usr/lib/ollama/libggml-cpu-haswell.so
time=2025-08-14T11:36:30.878Z level=INFO source=ggml.go:104 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.BMI2=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 CUDA.0.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.0.USE_GRAPHS=1 CUDA.0.PEER_MAX_BATCH_SIZE=128 compiler=cgo(gcc)
time=2025-08-14T11:36:30.971Z level=INFO source=ggml.go:365 msg="offloading 9 repeating layers to GPU"
time=2025-08-14T11:36:30.971Z level=INFO source=ggml.go:369 msg="offloading output layer to CPU"
time=2025-08-14T11:36:30.971Z level=INFO source=ggml.go:376 msg="offloaded 9/25 layers to GPU"
time=2025-08-14T11:36:30.971Z level=INFO source=ggml.go:379 msg="model weights" buffer=CPU size="8.8 GiB"
time=2025-08-14T11:36:30.971Z level=INFO source=ggml.go:379 msg="model weights" buffer=CUDA0 size="4.0 GiB"
time=2025-08-14T11:36:31.004Z level=INFO source=server.go:632 msg="waiting for server to become available" status="llm server loading model"
time=2025-08-14T11:36:31.264Z level=INFO source=ggml.go:668 msg="compute graph" backend=CUDA0 buffer_type=CUDA0 size="7.9 GiB"
time=2025-08-14T11:36:31.264Z level=INFO source=ggml.go:668 msg="compute graph" backend=CPU buffer_type=CPU size="7.8 GiB"
time=2025-08-14T11:36:33.541Z level=INFO source=server.go:637 msg="llama runner started in 2.79 seconds"
[GIN] 2025/08/14 - 11:39:31 | 200 |      27.592µs |       127.0.0.1 | HEAD     "/"
[GIN] 2025/08/14 - 11:39:31 | 200 |      95.169µs |       127.0.0.1 | GET      "/api/ps"
Thu Aug 14 11:40:39 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 575.64.05              Driver Version: 575.64.05      CUDA Version: 12.9     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 4080        Off |   00000000:06:00.0  On |                  N/A |
|  0%   59C    P2             71W /  320W |    8091MiB /  16376MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A           13851      C   frigate.detector.tensorrt               348MiB |
|    0   N/A  N/A           13909      C   /usr/lib/ffmpeg/7.0/bin/ffmpeg          340MiB |
|    0   N/A  N/A           13977      C   python3                                 236MiB |
|    0   N/A  N/A         1440630      C   /usr/lib/ffmpeg/7.0/bin/ffmpeg          336MiB |
|    0   N/A  N/A         1803736      C   /usr/bin/ollama                         490MiB |
|    0   N/A  N/A         3321696      C   /opt/venv/bin/python3                   236MiB |
|    0   N/A  N/A         3830441      G   /usr/lib/xorg/Xorg                      149MiB |
|    0   N/A  N/A         3831365      G   xfwm4                                     4MiB |
|    0   N/A  N/A         3831393    C+G   /usr/bin/sunshine                       241MiB |
|    0   N/A  N/A         3831662      G   ...nstallation/ubuntu12_32/steam          4MiB |
|    0   N/A  N/A         3831929      G   ./steamwebhelper                          9MiB |
|    0   N/A  N/A         3831952      G   ...on/ubuntu12_64/steamwebhelper        122MiB |
+-----------------------------------------------------------------------------------------+
NAME           ID              SIZE     PROCESSOR          CONTEXT    UNTIL
gpt-oss:20b    f2b8351c629c    22 GB    39%/61% CPU/GPU    32000      4 minutes from now

Author
Owner

@rick-github commented on GitHub (Aug 14, 2025):

OLLAMA_MAX_VRAM is no longer supported; it was a short-term workaround that has since been removed.

If you want a model that forces all layers onto the GPU:

echo FROM gpt-oss:20b > Modelfile
echo PARAMETER num_gpu 256 >> Modelfile
echo PARAMETER num_ctx 32000 >> Modelfile
ollama create gpt-oss:20b_ctx32k_gpu256
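
After creating it, the placement can be verified while the model is loaded (a sketch; the prompt is just a placeholder):

ollama run gpt-oss:20b_ctx32k_gpu256 "hello"
ollama ps
# the PROCESSOR column should now read "100% GPU" instead of a CPU/GPU split like "39%/61% CPU/GPU"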
Author
Owner

@SHU-red commented on GitHub (Aug 14, 2025):

echo FROM gpt-oss:20b > Modelfile
echo PARAMETER num_gpu 256 >> Modelfile
echo PARAMETER num_ctx 32000 >> Modelfile
ollama create gpt-oss:20b_ctx32k_gpu256

Awesome! Thank you!

Author
Owner

@alienatedsec commented on GitHub (Aug 16, 2025):

Good news regarding the latest v0.11.5-rc2:

root@[redacted]:/# ollama ps
NAME            ID              SIZE     PROCESSOR    CONTEXT    UNTIL   
gpt-oss:120b    735371f916a9    71 GB    100% GPU     128000     Forever    
root@[redacted]:/# 
Ollama Docker Logs
time=2025-08-16T23:37:28.946Z level=INFO source=routes.go:1305 msg="server config" env="map[CUDA_VISIBLE_DEVICES: GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: HTTPS_PROXY: HTTP_PROXY: NO_PROXY: OLLAMA_CONTEXT_LENGTH:4096 OLLAMA_DEBUG:INFO OLLAMA_FLASH_ATTENTION:true OLLAMA_GPU_OVERHEAD:0 OLLAMA_HOST:http://0.0.0.0:11434 OLLAMA_INTEL_GPU:false OLLAMA_KEEP_ALIVE:2562047h47m16.854775807s OLLAMA_KV_CACHE_TYPE: OLLAMA_LLM_LIBRARY: OLLAMA_LOAD_TIMEOUT:5m0s OLLAMA_MAX_LOADED_MODELS:0 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:/root/.ollama/models OLLAMA_MULTIUSER_CACHE:false OLLAMA_NEW_ENGINE:false OLLAMA_NEW_ESTIMATES:true OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:1 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://* vscode-webview://* vscode-file://*] OLLAMA_SCHED_SPREAD:false ROCR_VISIBLE_DEVICES: http_proxy: https_proxy: no_proxy:]"
time=2025-08-16T23:37:28.955Z level=INFO source=images.go:477 msg="total blobs: 34"
time=2025-08-16T23:37:28.957Z level=INFO source=images.go:484 msg="total unused blobs removed: 0"
time=2025-08-16T23:37:28.958Z level=INFO source=routes.go:1358 msg="Listening on [::]:11434 (version 0.11.5-rc2)"
time=2025-08-16T23:37:28.959Z level=INFO source=gpu.go:217 msg="looking for compatible GPUs"
time=2025-08-16T23:37:31.293Z level=INFO source=types.go:130 msg="inference compute" id=GPU-3b880d35-8b00-861c-3ac2-f8707baced68 library=cuda variant=v12 compute=8.9 driver=12.8 name="NVIDIA RTX 4000 Ada Generation" total="19.6 GiB" available="19.1 GiB"
time=2025-08-16T23:37:31.294Z level=INFO source=types.go:130 msg="inference compute" id=GPU-4df21d53-78e5-97fb-c7ee-b18beb63e1a7 library=cuda variant=v12 compute=8.9 driver=12.8 name="NVIDIA GeForce RTX 4060 Ti" total="15.6 GiB" available="15.5 GiB"
time=2025-08-16T23:37:31.294Z level=INFO source=types.go:130 msg="inference compute" id=GPU-f2ded6c5-33e4-31ac-5c61-c9088edaedbd library=cuda variant=v12 compute=8.6 driver=12.8 name="NVIDIA RTX A4000" total="15.6 GiB" available="15.4 GiB"
time=2025-08-16T23:37:31.294Z level=INFO source=types.go:130 msg="inference compute" id=GPU-c95bf02e-0608-db0d-7759-07d27659f5f8 library=cuda variant=v12 compute=8.6 driver=12.8 name="NVIDIA RTX A4000" total="15.6 GiB" available="15.4 GiB"
time=2025-08-16T23:37:31.294Z level=INFO source=types.go:130 msg="inference compute" id=GPU-fc191601-3661-ad41-ed61-2f8a1d5dbf6a library=cuda variant=v12 compute=8.6 driver=12.8 name="NVIDIA RTX A4000" total="15.6 GiB" available="15.4 GiB"
time=2025-08-16T23:37:31.294Z level=INFO source=types.go:130 msg="inference compute" id=GPU-eab804bb-3a67-1954-db1a-ddf62d8f427a library=cuda variant=v12 compute=8.6 driver=12.8 name="NVIDIA RTX A4000" total="15.6 GiB" available="15.4 GiB"
[GIN] 2025/08/16 - 23:38:05 | 200 |    9.898517ms |      172.17.0.1 | GET      "/api/tags"
[GIN] 2025/08/16 - 23:38:05 | 200 |     221.709µs |      172.17.0.1 | GET      "/api/ps"
[GIN] 2025/08/16 - 23:38:06 | 200 |      67.121µs |      172.17.0.1 | GET      "/api/version"
[GIN] 2025/08/16 - 23:38:17 | 200 |    3.167459ms |      172.17.0.1 | GET      "/api/tags"
[GIN] 2025/08/16 - 23:38:17 | 200 |      46.448µs |      172.17.0.1 | GET      "/api/ps"
time=2025-08-16T23:38:18.558Z level=INFO source=server.go:166 msg="enabling new memory estimates"
time=2025-08-16T23:38:20.691Z level=INFO source=server.go:211 msg="enabling flash attention"
time=2025-08-16T23:38:20.691Z level=WARN source=server.go:219 msg="kv cache type not supported by model" type=""
time=2025-08-16T23:38:20.692Z level=INFO source=server.go:383 msg="starting runner" cmd="/usr/bin/ollama runner --ollama-engine --model /root/.ollama/models/blobs/sha256-90a618fe6ff21b09ca968df959104eb650658b0bef0faef785c18c2795d993e3 --port 38569"
time=2025-08-16T23:38:20.694Z level=INFO source=server.go:657 msg="loading model" "model layers"=37 requested=256
time=2025-08-16T23:38:20.734Z level=INFO source=runner.go:1006 msg="starting ollama engine"
time=2025-08-16T23:38:20.734Z level=INFO source=runner.go:1043 msg="Server listening on 127.0.0.1:38569"
time=2025-08-16T23:38:22.796Z level=INFO source=server.go:663 msg="system memory" total="125.8 GiB" free="122.7 GiB" free_swap="0 B"
time=2025-08-16T23:38:22.796Z level=INFO source=server.go:667 msg="gpu memory" id=GPU-3b880d35-8b00-861c-3ac2-f8707baced68 available="18.7 GiB" free="19.1 GiB" minimum="457.0 MiB" overhead="0 B"
time=2025-08-16T23:38:22.796Z level=INFO source=server.go:667 msg="gpu memory" id=GPU-4df21d53-78e5-97fb-c7ee-b18beb63e1a7 available="15.0 GiB" free="15.5 GiB" minimum="457.0 MiB" overhead="0 B"
time=2025-08-16T23:38:22.796Z level=INFO source=server.go:667 msg="gpu memory" id=GPU-f2ded6c5-33e4-31ac-5c61-c9088edaedbd available="15.0 GiB" free="15.4 GiB" minimum="457.0 MiB" overhead="0 B"
time=2025-08-16T23:38:22.796Z level=INFO source=server.go:667 msg="gpu memory" id=GPU-c95bf02e-0608-db0d-7759-07d27659f5f8 available="15.0 GiB" free="15.4 GiB" minimum="457.0 MiB" overhead="0 B"
time=2025-08-16T23:38:22.796Z level=INFO source=server.go:667 msg="gpu memory" id=GPU-fc191601-3661-ad41-ed61-2f8a1d5dbf6a available="15.0 GiB" free="15.4 GiB" minimum="457.0 MiB" overhead="0 B"
time=2025-08-16T23:38:22.796Z level=INFO source=server.go:667 msg="gpu memory" id=GPU-eab804bb-3a67-1954-db1a-ddf62d8f427a available="15.0 GiB" free="15.4 GiB" minimum="457.0 MiB" overhead="0 B"
time=2025-08-16T23:38:22.799Z level=INFO source=runner.go:925 msg=load request="{Operation:fit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:true KvSize:128000 KvCacheType: NumThreads:16 GPULayers:37[ID:GPU-3b880d35-8b00-861c-3ac2-f8707baced68 Layers:37(0..36)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2025-08-16T23:38:23.007Z level=INFO source=ggml.go:130 msg="" architecture=gptoss file_type=MXFP4 name="" description="" num_tensors=471 num_key_values=30
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 6 CUDA devices:
  Device 0: NVIDIA RTX 4000 Ada Generation, compute capability 8.9, VMM: yes, ID: GPU-3b880d35-8b00-861c-3ac2-f8707baced68
  Device 1: NVIDIA GeForce RTX 4060 Ti, compute capability 8.9, VMM: yes, ID: GPU-4df21d53-78e5-97fb-c7ee-b18beb63e1a7
  Device 2: NVIDIA RTX A4000, compute capability 8.6, VMM: yes, ID: GPU-f2ded6c5-33e4-31ac-5c61-c9088edaedbd
  Device 3: NVIDIA RTX A4000, compute capability 8.6, VMM: yes, ID: GPU-c95bf02e-0608-db0d-7759-07d27659f5f8
  Device 4: NVIDIA RTX A4000, compute capability 8.6, VMM: yes, ID: GPU-fc191601-3661-ad41-ed61-2f8a1d5dbf6a
  Device 5: NVIDIA RTX A4000, compute capability 8.6, VMM: yes, ID: GPU-eab804bb-3a67-1954-db1a-ddf62d8f427a
load_backend: loaded CUDA backend from /usr/lib/ollama/libggml-cuda.so
load_backend: loaded CPU backend from /usr/lib/ollama/libggml-cpu-x64.so
time=2025-08-16T23:38:24.690Z level=INFO source=ggml.go:104 msg=system CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 CUDA.0.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.0.USE_GRAPHS=1 CUDA.0.PEER_MAX_BATCH_SIZE=128 CUDA.1.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.1.USE_GRAPHS=1 CUDA.1.PEER_MAX_BATCH_SIZE=128 CUDA.2.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.2.USE_GRAPHS=1 CUDA.2.PEER_MAX_BATCH_SIZE=128 CUDA.3.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.3.USE_GRAPHS=1 CUDA.3.PEER_MAX_BATCH_SIZE=128 CUDA.4.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.4.USE_GRAPHS=1 CUDA.4.PEER_MAX_BATCH_SIZE=128 CUDA.5.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.5.USE_GRAPHS=1 CUDA.5.PEER_MAX_BATCH_SIZE=128 compiler=cgo(gcc)
time=2025-08-16T23:38:25.094Z level=INFO source=runner.go:925 msg=load request="{Operation:fit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:true KvSize:128000 KvCacheType: NumThreads:16 GPULayers:37[ID:GPU-3b880d35-8b00-861c-3ac2-f8707baced68 Layers:9(0..8) ID:GPU-4df21d53-78e5-97fb-c7ee-b18beb63e1a7 Layers:7(9..15) ID:GPU-f2ded6c5-33e4-31ac-5c61-c9088edaedbd Layers:7(16..22) ID:GPU-c95bf02e-0608-db0d-7759-07d27659f5f8 Layers:7(23..29) ID:GPU-fc191601-3661-ad41-ed61-2f8a1d5dbf6a Layers:7(30..36)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2025-08-16T23:38:25.357Z level=INFO source=runner.go:925 msg=load request="{Operation:alloc LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:true KvSize:128000 KvCacheType: NumThreads:16 GPULayers:37[ID:GPU-3b880d35-8b00-861c-3ac2-f8707baced68 Layers:9(0..8) ID:GPU-4df21d53-78e5-97fb-c7ee-b18beb63e1a7 Layers:7(9..15) ID:GPU-f2ded6c5-33e4-31ac-5c61-c9088edaedbd Layers:7(16..22) ID:GPU-c95bf02e-0608-db0d-7759-07d27659f5f8 Layers:7(23..29) ID:GPU-fc191601-3661-ad41-ed61-2f8a1d5dbf6a Layers:7(30..36)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2025-08-16T23:38:26.836Z level=INFO source=runner.go:925 msg=load request="{Operation:commit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:true KvSize:128000 KvCacheType: NumThreads:16 GPULayers:37[ID:GPU-3b880d35-8b00-861c-3ac2-f8707baced68 Layers:9(0..8) ID:GPU-4df21d53-78e5-97fb-c7ee-b18beb63e1a7 Layers:7(9..15) ID:GPU-f2ded6c5-33e4-31ac-5c61-c9088edaedbd Layers:7(16..22) ID:GPU-c95bf02e-0608-db0d-7759-07d27659f5f8 Layers:7(23..29) ID:GPU-fc191601-3661-ad41-ed61-2f8a1d5dbf6a Layers:7(30..36)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2025-08-16T23:38:26.836Z level=INFO source=ggml.go:486 msg="offloading 36 repeating layers to GPU"
time=2025-08-16T23:38:26.836Z level=INFO source=ggml.go:492 msg="offloading output layer to GPU"
time=2025-08-16T23:38:26.836Z level=INFO source=ggml.go:497 msg="offloaded 37/37 layers to GPU"
time=2025-08-16T23:38:26.838Z level=INFO source=backend.go:310 msg="model weights" device=CUDA0 size="14.7 GiB"
time=2025-08-16T23:38:26.838Z level=INFO source=backend.go:310 msg="model weights" device=CUDA1 size="11.4 GiB"
time=2025-08-16T23:38:26.838Z level=INFO source=backend.go:310 msg="model weights" device=CUDA2 size="11.4 GiB"
time=2025-08-16T23:38:26.838Z level=INFO source=backend.go:310 msg="model weights" device=CUDA3 size="11.4 GiB"
time=2025-08-16T23:38:26.838Z level=INFO source=backend.go:310 msg="model weights" device=CUDA4 size="10.9 GiB"
time=2025-08-16T23:38:26.838Z level=INFO source=backend.go:315 msg="model weights" device=CPU size="1.1 GiB"
time=2025-08-16T23:38:26.838Z level=INFO source=backend.go:321 msg="kv cache" device=CUDA0 size="1.0 GiB"
time=2025-08-16T23:38:26.838Z level=INFO source=backend.go:321 msg="kv cache" device=CUDA1 size="1.0 GiB"
time=2025-08-16T23:38:26.838Z level=INFO source=backend.go:321 msg="kv cache" device=CUDA2 size="786.0 MiB"
time=2025-08-16T23:38:26.838Z level=INFO source=backend.go:321 msg="kv cache" device=CUDA3 size="1.0 GiB"
time=2025-08-16T23:38:26.838Z level=INFO source=backend.go:321 msg="kv cache" device=CUDA4 size="777.0 MiB"
time=2025-08-16T23:38:26.838Z level=INFO source=backend.go:332 msg="compute graph" device=CUDA0 size="238.8 MiB"
time=2025-08-16T23:38:26.838Z level=INFO source=backend.go:332 msg="compute graph" device=CUDA1 size="231.3 MiB"
time=2025-08-16T23:38:26.838Z level=INFO source=backend.go:332 msg="compute graph" device=CUDA2 size="231.3 MiB"
time=2025-08-16T23:38:26.838Z level=INFO source=backend.go:332 msg="compute graph" device=CUDA3 size="231.3 MiB"
time=2025-08-16T23:38:26.838Z level=INFO source=backend.go:332 msg="compute graph" device=CUDA4 size="231.3 MiB"
time=2025-08-16T23:38:26.838Z level=INFO source=backend.go:337 msg="compute graph" device=CPU size="5.6 MiB"
time=2025-08-16T23:38:26.838Z level=INFO source=backend.go:342 msg="total memory" size="66.6 GiB"
time=2025-08-16T23:38:26.838Z level=INFO source=sched.go:473 msg="loaded runners" count=1
time=2025-08-16T23:38:26.838Z level=INFO source=server.go:1232 msg="waiting for llama runner to start responding"
time=2025-08-16T23:38:26.839Z level=INFO source=server.go:1266 msg="waiting for server to become available" status="llm server loading model"
time=2025-08-16T23:40:04.710Z level=INFO source=server.go:1270 msg="llama runner started in 104.02 seconds"
[GIN] 2025/08/16 - 23:40:33 | 200 |      71.827µs |       127.0.0.1 | HEAD     "/"
[GIN] 2025/08/16 - 23:40:33 | 200 |      50.645µs |       127.0.0.1 | GET      "/api/ps"
[GIN] 2025/08/16 - 23:40:42 | 200 |         2m27s |      172.17.0.1 | POST     "/api/chat"
[GIN] 2025/08/16 - 23:40:53 | 200 |  10.33721764s |      172.17.0.1 | POST     "/api/chat"
[GIN] 2025/08/16 - 23:41:00 | 200 |  7.650597639s |      172.17.0.1 | POST     "/api/chat"
[GIN] 2025/08/16 - 23:41:15 | 200 | 14.206551211s |      172.17.0.1 | POST     "/api/chat"

It's also around 50% quicker: an average of 29 tokens/s now versus 19 tokens/s before.
[image attachment]

total duration:       24.612095582s
load duration:        509.472421ms
prompt eval count:    73 token(s)
prompt eval duration: 259.819606ms
prompt eval rate:     280.96 tokens/s
eval count:           708 token(s)
eval duration:        23.840691078s
eval rate:            29.70 tokens/s
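
These timings are in the format the CLI prints when run with the verbose flag; to reproduce the comparison locally, something like the following should emit the same fields:

  ollama run gpt-oss:120b --verbose "Write a long story"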
nvidia-smi output
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.172.08             Driver Version: 570.172.08     CUDA Version: 12.8     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 4060 Ti     On  |   00000000:01:00.0 Off |                  N/A |
|  0%   51C    P8              7W /  165W |   13105MiB /  16380MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA RTX 4000 Ada Gene...    On  |   00000000:02:00.0 Off |                  Off |
| 30%   47C    P8             11W /  130W |   16753MiB /  20475MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   2  NVIDIA RTX A4000               On  |   00000000:03:00.0 Off |                  Off |
| 41%   43C    P8             10W /  140W |   12947MiB /  16376MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   3  NVIDIA RTX A4000               On  |   00000000:04:00.0 Off |                  Off |
| 41%   54C    P8             11W /  140W |   13201MiB /  16376MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   4  NVIDIA RTX A4000               On  |   00000000:05:00.0 Off |                  Off |
| 41%   50C    P8              7W /  140W |   12805MiB /  16376MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   5  NVIDIA RTX A4000               On  |   00000000:06:00.0 Off |                  Off |
| 41%   45C    P8              7W /  140W |     167MiB /  16376MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A            3209      C   /usr/bin/ollama                         166MiB |
|    1   N/A  N/A            1532      C   /usr/local/bin/python3                  262MiB |
|    1   N/A  N/A            3209      C   /usr/bin/ollama                         218MiB |
|    2   N/A  N/A            3209      C   /usr/bin/ollama                         264MiB |
|    3   N/A  N/A            3209      C   /usr/bin/ollama                         264MiB |
|    4   N/A  N/A            3209      C   /usr/bin/ollama                         396MiB |
|    5   N/A  N/A            3209      C   /usr/bin/ollama                         158MiB |
+-----------------------------------------------------------------------------------------+
Author
Owner

@Queracus commented on GitHub (Oct 6, 2025):

They solved all this in the latest Ollama. You can crank up the context and it uses the GPU. The problem was that OLLAMA_FLASH_ATTENTION wasn't supported for gpt-oss in Ollama, so even on a 24 GB GPU you could only go to roughly 32k context before it switched to the CPU.

Have to say the 20b model is really smart for such a small one, and very fast on a 3090. It only uses approximately 16 GB of VRAM at 256k context.
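
For anyone who wants to verify the same thing on a single card, a minimal sketch (assuming a recent Ollama build; the context value is only an example and should be sized to your VRAM):

  # enable flash attention for the server (recent builds also auto-enable it for models that want it)
  export OLLAMA_FLASH_ATTENTION=1
  # raise the default context window; pick a value your VRAM can hold
  export OLLAMA_CONTEXT_LENGTH=131072
  ollama serve
  # then, from another shell
  ollama run gpt-oss:20b "Write a long story"
  ollama ps        # PROCESSOR should report 100% GPU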

Author
Owner

@SHU-red commented on GitHub (Oct 6, 2025):

I'm not an expert on this, but yeah, sometimes it's working.
I have an RTX 4080 with only 15 GB of memory.
Is there a good setting that lets the model run stably (without occasionally failing to load) and on the GPU, given my lower available memory?

Right now it sometimes can't load ...

Author
Owner

@jessegross commented on GitHub (Oct 6, 2025):

@SHU-red Can you post the log from a time when it doesn't load?

Author
Owner

@SHU-red commented on GitHub (Oct 6, 2025):

@SHU-red Can you post the log from a time when it doesn't load?

Yes, sorry, I should provide more information:

...
    environment:
      - OLLAMA_RUN_PARALLEL=1
      #- OLLAMA_CONTEXT_LENGTH=32000
      - GGML_CUDA_ENABLE_UNIFIED_MEMORY=1
      #- OLLAMA_MAX_VRAM=10000000
...
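
For completeness, the snippet above is only the environment fragment; one common shape for the whole compose service (a sketch, assuming the stock ollama/ollama image and the NVIDIA container toolkit, with illustrative values) is:

  services:
    ollama:
      image: ollama/ollama
      ports:
        - "11434:11434"
      volumes:
        - ollama:/root/.ollama
      environment:
        - OLLAMA_FLASH_ATTENTION=1
        - GGML_CUDA_ENABLE_UNIFIED_MEMORY=1
        # - OLLAMA_CONTEXT_LENGTH=32000
      deploy:
        resources:
          reservations:
            devices:
              - driver: nvidia
                count: all
                capabilities: [gpu]
  volumes:
    ollama: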

Prepared models from the conversations above (a sketch of how such variants are typically created follows the list):

  • gpt-oss:20b
  • gpt-oss:20b_ctx32k
  • gpt-oss:20b_ctx32k_gpu256
  • gpt-oss:20b_ctx3k
  • gpt-oss:20b_gpu256
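
For readers who have not followed the whole thread: variants like these are normally built from a small Modelfile, roughly like the sketch below (the parameter values are read off the tag names; the file name is arbitrary):

  # Modelfile for a 32k-context, 256-GPU-layer variant
  FROM gpt-oss:20b
  PARAMETER num_ctx 32000
  PARAMETER num_gpu 256

  ollama create gpt-oss:20b_ctx32k_gpu256 -f Modelfile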

gpt-oss:20b:

  • worked, but partly ran on the CPU (ollama ps below reports a 16%/84% CPU/GPU split)
time=2025-10-06T17:12:28.388Z level=ERROR source=server.go:1459 msg="post predict" error="Post \"http://127.0.0.1:39695/completion\": EOF"
[GIN] 2025/10/06 - 17:12:28 | 200 |  6.512810209s |      172.18.0.1 | POST     "/api/chat"
time=2025-10-06T17:12:28.414Z level=ERROR source=server.go:425 msg="llama runner terminated" error="exit status 2"
time=2025-10-06T17:17:33.534Z level=WARN source=sched.go:649 msg="gpu VRAM usage didn't recover within timeout" seconds=5.144168176 runner.size="13.8 GiB" runner.vram="13.8 GiB" runner.parallel=1 runner.pid=36294 runner.model=/root/.ollama/models/blobs/sha256-b112e727c6f18875636c56a779790a590d705aec9e1c0eb5a97d51fc2a778583
time=2025-10-06T17:17:33.783Z level=WARN source=sched.go:649 msg="gpu VRAM usage didn't recover within timeout" seconds=5.393604737 runner.size="13.8 GiB" runner.vram="13.8 GiB" runner.parallel=1 runner.pid=36294 runner.model=/root/.ollama/models/blobs/sha256-b112e727c6f18875636c56a779790a590d705aec9e1c0eb5a97d51fc2a778583
time=2025-10-06T17:17:34.034Z level=WARN source=sched.go:649 msg="gpu VRAM usage didn't recover within timeout" seconds=5.644011406 runner.size="13.8 GiB" runner.vram="13.8 GiB" runner.parallel=1 runner.pid=36294 runner.model=/root/.ollama/models/blobs/sha256-b112e727c6f18875636c56a779790a590d705aec9e1c0eb5a97d51fc2a778583
[GIN] 2025/10/06 - 18:13:29 | 200 |      20.147µs |       127.0.0.1 | HEAD     "/"
[GIN] 2025/10/06 - 18:13:29 | 200 |       8.566µs |       127.0.0.1 | GET      "/api/ps"
[GIN] 2025/10/06 - 18:14:08 | 200 |    1.984065ms |      172.18.0.1 | GET      "/api/tags"
[GIN] 2025/10/06 - 18:14:08 | 200 |      14.297µs |      172.18.0.1 | GET      "/api/ps"
[GIN] 2025/10/06 - 18:14:09 | 200 |      32.451µs |      172.18.0.1 | GET      "/api/version"
[GIN] 2025/10/06 - 18:15:32 | 200 |    2.006527ms |      172.18.0.1 | GET      "/api/tags"
time=2025-10-06T18:15:54.240Z level=INFO source=server.go:200 msg="model wants flash attention"
time=2025-10-06T18:15:54.240Z level=INFO source=server.go:217 msg="enabling flash attention"
time=2025-10-06T18:15:54.241Z level=INFO source=server.go:399 msg="starting runner" cmd="/usr/bin/ollama runner --ollama-engine --model /root/.ollama/models/blobs/sha256-b112e727c6f18875636c56a779790a590d705aec9e1c0eb5a97d51fc2a778583 --port 43327"
time=2025-10-06T18:15:54.241Z level=INFO source=server.go:672 msg="loading model" "model layers"=25 requested=-1
time=2025-10-06T18:15:54.253Z level=INFO source=runner.go:1252 msg="starting ollama engine"
time=2025-10-06T18:15:54.253Z level=INFO source=runner.go:1287 msg="Server listening on 127.0.0.1:43327"
time=2025-10-06T18:15:54.383Z level=INFO source=server.go:678 msg="system memory" total="62.7 GiB" free="44.3 GiB" free_swap="126.7 MiB"
time=2025-10-06T18:15:54.383Z level=INFO source=server.go:686 msg="gpu memory" id=GPU-db50747e-f16a-d05b-fe6a-26ffdb825cc6 available="11.7 GiB" free="12.1 GiB" minimum="457.0 MiB" overhead="0 B"
time=2025-10-06T18:15:54.384Z level=INFO source=runner.go:1171 msg=load request="{Operation:fit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:true KvSize:4096 KvCacheType: NumThreads:8 GPULayers:25[ID:GPU-db50747e-f16a-d05b-fe6a-26ffdb825cc6 Layers:25(0..24)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2025-10-06T18:15:54.433Z level=INFO source=ggml.go:131 msg="" architecture=gptoss file_type=MXFP4 name="" description="" num_tensors=315 num_key_values=30
load_backend: loaded CPU backend from /usr/lib/ollama/libggml-cpu-haswell.so
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 4080, compute capability 8.9, VMM: yes, ID: GPU-db50747e-f16a-d05b-fe6a-26ffdb825cc6
load_backend: loaded CUDA backend from /usr/lib/ollama/cuda_v13/libggml-cuda.so
time=2025-10-06T18:15:54.500Z level=INFO source=ggml.go:104 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.BMI2=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 CUDA.0.ARCHS=750,800,860,870,890,900,1000,1100,1200,1210 CUDA.0.USE_GRAPHS=1 CUDA.0.PEER_MAX_BATCH_SIZE=128 compiler=cgo(gcc)
time=2025-10-06T18:15:54.605Z level=INFO source=runner.go:1171 msg=load request="{Operation:fit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:true KvSize:4096 KvCacheType: NumThreads:8 GPULayers:24[ID:GPU-db50747e-f16a-d05b-fe6a-26ffdb825cc6 Layers:24(0..23)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2025-10-06T18:15:54.653Z level=INFO source=runner.go:1171 msg=load request="{Operation:alloc LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:true KvSize:4096 KvCacheType: NumThreads:8 GPULayers:24[ID:GPU-db50747e-f16a-d05b-fe6a-26ffdb825cc6 Layers:24(0..23)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2025-10-06T18:15:54.738Z level=INFO source=runner.go:1171 msg=load request="{Operation:commit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:true KvSize:4096 KvCacheType: NumThreads:8 GPULayers:24[ID:GPU-db50747e-f16a-d05b-fe6a-26ffdb825cc6 Layers:24(0..23)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2025-10-06T18:15:54.738Z level=INFO source=ggml.go:487 msg="offloading 24 repeating layers to GPU"
time=2025-10-06T18:15:54.738Z level=INFO source=ggml.go:491 msg="offloading output layer to CPU"
time=2025-10-06T18:15:54.738Z level=INFO source=ggml.go:498 msg="offloaded 24/25 layers to GPU"
time=2025-10-06T18:15:54.738Z level=INFO source=backend.go:310 msg="model weights" device=CUDA0 size="10.7 GiB"
time=2025-10-06T18:15:54.738Z level=INFO source=backend.go:315 msg="model weights" device=CPU size="2.2 GiB"
time=2025-10-06T18:15:54.738Z level=INFO source=backend.go:321 msg="kv cache" device=CUDA0 size="204.0 MiB"
time=2025-10-06T18:15:54.738Z level=INFO source=backend.go:332 msg="compute graph" device=CUDA0 size="117.8 MiB"
time=2025-10-06T18:15:54.738Z level=INFO source=backend.go:337 msg="compute graph" device=CPU size="5.6 MiB"
time=2025-10-06T18:15:54.738Z level=INFO source=backend.go:342 msg="total memory" size="13.2 GiB"
time=2025-10-06T18:15:54.738Z level=INFO source=sched.go:470 msg="loaded runners" count=1
time=2025-10-06T18:15:54.738Z level=INFO source=server.go:1251 msg="waiting for llama runner to start responding"
time=2025-10-06T18:15:54.740Z level=INFO source=server.go:1285 msg="waiting for server to become available" status="llm server loading model"
time=2025-10-06T18:15:57.998Z level=INFO source=server.go:1289 msg="llama runner started in 3.76 seconds"
[GIN] 2025/10/06 - 18:16:14 | 200 | 20.536099806s |      172.18.0.1 | POST     "/api/chat"
[GIN] 2025/10/06 - 18:16:23 | 200 |  9.119721114s |      172.18.0.1 | POST     "/api/chat"
[GIN] 2025/10/06 - 18:16:33 | 200 |  8.661983665s |      172.18.0.1 | POST     "/api/chat"
[GIN] 2025/10/06 - 18:16:41 | 200 |      18.285µs |       127.0.0.1 | HEAD     "/"
[GIN] 2025/10/06 - 18:16:41 | 200 |      18.405µs |       127.0.0.1 | GET      "/api/ps"
[GIN] 2025/10/06 - 18:16:58 | 200 |      25.197µs |       127.0.0.1 | HEAD     "/"
[GIN] 2025/10/06 - 18:16:58 | 200 |      23.835µs |       127.0.0.1 | GET      "/api/ps"
[GIN] 2025/10/06 - 18:17:19 | 200 |       21.28µs |       127.0.0.1 | HEAD     "/"
[GIN] 2025/10/06 - 18:17:19 | 200 |      24.055µs |       127.0.0.1 | GET      "/api/ps"
[GIN] 2025/10/06 - 18:18:07 | 200 |      19.967µs |       127.0.0.1 | HEAD     "/"
[GIN] 2025/10/06 - 18:18:07 | 200 |      27.441µs |       127.0.0.1 | GET      "/api/ps"
[GIN] 2025/10/06 - 18:19:59 | 200 |  7.556107172s |      172.18.0.1 | POST     "/api/chat"
[GIN] 2025/10/06 - 18:20:15 | 200 | 14.963713581s |      172.18.0.1 | POST     "/api/chat"
Mon Oct  6 18:20:19 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.82.07              Driver Version: 580.82.07      CUDA Version: 13.0     |
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 4080        Off |   00000000:06:00.0  On |                  N/A |
|  0%   54C    P2             54W /  320W |   14867MiB /  16376MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A           48824      C   /opt/venv/bin/python3                   238MiB |
|    0   N/A  N/A           55339      C   python3                                 238MiB |
|    0   N/A  N/A           56644      C   frigate.detector.onnx                   364MiB |
|    0   N/A  N/A           56674      C   frigate.embeddings_manager              958MiB |
|    0   N/A  N/A           56912      C   /usr/lib/ffmpeg/7.0/bin/ffmpeg          330MiB |
|    0   N/A  N/A         2420338      G   /usr/lib/xorg/Xorg                      143MiB |
|    0   N/A  N/A         2421301      G   xfwm4                                     4MiB |
|    0   N/A  N/A         2421306    C+G   /usr/bin/sunshine                       243MiB |
|    0   N/A  N/A         2421624      G   ...nstallation/ubuntu12_32/steam          4MiB |
|    0   N/A  N/A         2421967      G   ./steamwebhelper                          9MiB |
|    0   N/A  N/A         2421990      G   ...on/ubuntu12_64/steamwebhelper         10MiB |
|    0   N/A  N/A         2437305      C   /usr/lib/ffmpeg/7.0/bin/ffmpeg          294MiB |
|    0   N/A  N/A         3265107      C   /usr/lib/ffmpeg/7.0/bin/ffmpeg          322MiB |
|    0   N/A  N/A         3495699      C   /usr/bin/ollama                         320MiB |
+-----------------------------------------------------------------------------------------+
NAME           ID              SIZE     PROCESSOR          CONTEXT    UNTIL
gpt-oss:20b    f2b8351c629c    14 GB    16%/84% CPU/GPU    4096       4 minutes from now

gpt-oss:20b_ctx32k_gpu256:

  • did not work due to resource limitations (the runner crashed; trace below)
        net/http/server.go:3454 +0x485

goroutine 320 gp=0xc0000d7500 m=nil [IO wait]:
runtime.gopark(0x0?, 0x0?, 0x0?, 0x0?, 0xb?)
        runtime/proc.go:435 +0xce fp=0xc00146bdd8 sp=0xc00146bdb8 pc=0x55b7d185b86e
runtime.netpollblock(0x55b7d187ebd8?, 0xd17f4666?, 0xb7?)
        runtime/netpoll.go:575 +0xf7 fp=0xc00146be10 sp=0xc00146bdd8 pc=0x55b7d1820357
internal/poll.runtime_pollWait(0x7ff8a4654cc8, 0x72)
        runtime/netpoll.go:351 +0x85 fp=0xc00146be30 sp=0xc00146be10 pc=0x55b7d185aa85
internal/poll.(*pollDesc).wait(0xc00048f380?, 0xc00036c101?, 0x0)
        internal/poll/fd_poll_runtime.go:84 +0x27 fp=0xc00146be58 sp=0xc00146be30 pc=0x55b7d18e1ec7
internal/poll.(*pollDesc).waitRead(...)
        internal/poll/fd_poll_runtime.go:89
internal/poll.(*FD).Read(0xc00048f380, {0xc00036c101, 0x1, 0x1})
        internal/poll/fd_unix.go:165 +0x27a fp=0xc00146bef0 sp=0xc00146be58 pc=0x55b7d18e31ba
net.(*netFD).Read(0xc00048f380, {0xc00036c101?, 0xc00012f998?, 0xc00146bf70?})
        net/fd_posix.go:55 +0x25 fp=0xc00146bf38 sp=0xc00146bef0 pc=0x55b7d19582a5
net.(*conn).Read(0xc00011c990, {0xc00036c101?, 0x0?, 0x0?})
        net/net.go:194 +0x45 fp=0xc00146bf80 sp=0xc00146bf38 pc=0x55b7d1966665
net/http.(*connReader).backgroundRead(0xc00036c0f0)
        net/http/server.go:690 +0x37 fp=0xc00146bfc8 sp=0xc00146bf80 pc=0x55b7d1b524d7
net/http.(*connReader).startBackgroundRead.gowrap2()
        net/http/server.go:686 +0x25 fp=0xc00146bfe0 sp=0xc00146bfc8 pc=0x55b7d1b52405
runtime.goexit({})
        runtime/asm_amd64.s:1700 +0x1 fp=0xc00146bfe8 sp=0xc00146bfe0 pc=0x55b7d1862fa1
created by net/http.(*connReader).startBackgroundRead in goroutine 8
        net/http/server.go:686 +0xb6

goroutine 373 gp=0xc00104c8c0 m=nil [chan receive]:
runtime.gopark(0x30?, 0x55b7d2c889a0?, 0x1?, 0xd7?, 0xc000096b30?)
        runtime/proc.go:435 +0xce fp=0xc000096ae8 sp=0xc000096ac8 pc=0x55b7d185b86e
runtime.chanrecv(0xc0006612d0, 0x0, 0x1)
        runtime/chan.go:664 +0x445 fp=0xc000096b60 sp=0xc000096ae8 pc=0x55b7d17f7245
runtime.chanrecv1(0x55b7d2874771?, 0x2c?)
        runtime/chan.go:506 +0x12 fp=0xc000096b88 sp=0xc000096b60 pc=0x55b7d17f6dd2
github.com/ollama/ollama/runner/ollamarunner.(*Server).computeBatch(0xc00022b0e0, {0x1, {0x55b7d2d331e0, 0xc001606000}, {0x55b7d2d3de48, 0xc000dc4e58}, {0xc000ec8008, 0x134, 0x25f}, {{0x55b7d2d3de48, ...}, ...}, ...})
        github.com/ollama/ollama/runner/ollamarunner/runner.go:602 +0x185 fp=0xc000096ef0 sp=0xc000096b88 pc=0x55b7d1d69925
github.com/ollama/ollama/runner/ollamarunner.(*Server).run.gowrap1()
        github.com/ollama/ollama/runner/ollamarunner/runner.go:425 +0x58 fp=0xc000096fe0 sp=0xc000096ef0 pc=0x55b7d1d67e38
runtime.goexit({})
        runtime/asm_amd64.s:1700 +0x1 fp=0xc000096fe8 sp=0xc000096fe0 pc=0x55b7d1862fa1
created by github.com/ollama/ollama/runner/ollamarunner.(*Server).run in goroutine 7
        github.com/ollama/ollama/runner/ollamarunner/runner.go:425 +0x2ed

rax    0x0
rbx    0xb775
rcx    0x7ff8eca0fb2c
rdx    0x6
rdi    0xb757
rsi    0xb775
rbp    0x7ff804ffb2e0
rsp    0x7ff804ffb2a0
r8     0x0
r9     0x7
r10    0x8
r11    0x246
r12    0x6
r13    0x7ff858e0d448
r14    0x16
r15    0x7ff47a7a0800
rip    0x7ff8eca0fb2c
rflags 0x246
cs     0x33
fs     0x0
gs     0x0
time=2025-10-06T18:21:28.560Z level=ERROR source=server.go:1459 msg="post predict" error="Post \"http://127.0.0.1:37887/completion\": EOF"
[GIN] 2025/10/06 - 18:21:28 | 200 |  5.899638513s |      172.18.0.1 | POST     "/api/chat"
time=2025-10-06T18:21:28.586Z level=ERROR source=server.go:425 msg="llama runner terminated" error="exit status 2"
Mon Oct  6 18:22:48 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.82.07              Driver Version: 580.82.07      CUDA Version: 13.0     |
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 4080        Off |   00000000:06:00.0  On |                  N/A |
|  0%   56C    P2             55W /  320W |    3276MiB /  16376MiB |      4%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A           48824      C   /opt/venv/bin/python3                   238MiB |
|    0   N/A  N/A           55339      C   python3                                 238MiB |
|    0   N/A  N/A           56644      C   frigate.detector.onnx                   364MiB |
|    0   N/A  N/A           56674      C   frigate.embeddings_manager              958MiB |
|    0   N/A  N/A           56912      C   /usr/lib/ffmpeg/7.0/bin/ffmpeg          330MiB |
|    0   N/A  N/A         2420338      G   /usr/lib/xorg/Xorg                      143MiB |
|    0   N/A  N/A         2421301      G   xfwm4                                     4MiB |
|    0   N/A  N/A         2421306    C+G   /usr/bin/sunshine                       243MiB |
|    0   N/A  N/A         2421624      G   ...nstallation/ubuntu12_32/steam          4MiB |
|    0   N/A  N/A         2421967      G   ./steamwebhelper                          9MiB |
|    0   N/A  N/A         2421990      G   ...on/ubuntu12_64/steamwebhelper         10MiB |
|    0   N/A  N/A         2437305      C   /usr/lib/ffmpeg/7.0/bin/ffmpeg          294MiB |
|    0   N/A  N/A         3265107      C   /usr/lib/ffmpeg/7.0/bin/ffmpeg          322MiB |
+-----------------------------------------------------------------------------------------+
NAME                         ID              SIZE     PROCESSOR    CONTEXT    UNTIL
gpt-oss:20b_ctx32k_gpu256    4faa45587112    14 GB    100% GPU     32000      3 minutes from now

gpt-oss:20b_ctx32k_gpu256:

  • a consecutive try also did not work

gpt-oss:20b_gpu256:

  • failed with a CUDA out-of-memory error:
an error was encountered while running the model: CUDA error: out of memory current device: 0, in function evaluate_and_capture_cuda_graph at //ml/backend/ggml/ggml/src/ggml-cuda/ggml-cuda.cu:3015 cudaGraphInstantiate(&cuda_ctx->cuda_graph->instance, cuda_ctx->cuda_graph->graph, __null, __null, 0) //ml/backend/ggml/ggml/src/ggml-cuda/ggml-cuda.cu:84: CUDA error
goroutine 7 gp=0xc000003dc0 m=nil [GC worker (idle)]:
runtime.gopark(0x101bd584e0a87?, 0x1?, 0x43?, 0xca?, 0x0?)
        runtime/proc.go:435 +0xce fp=0xc000087738 sp=0xc000087718 pc=0x56312433a86e
runtime.gcBgMarkWorker(0xc000111730)
        runtime/mgc.go:1423 +0xe9 fp=0xc0000877c8 sp=0xc000087738 pc=0x5631242e7d69
runtime.gcBgMarkStartWorkers.gowrap1()
        runtime/mgc.go:1339 +0x25 fp=0xc0000877e0 sp=0xc0000877c8 pc=0x5631242e7c45
runtime.goexit({})
        runtime/asm_amd64.s:1700 +0x1 fp=0xc0000877e8 sp=0xc0000877e0 pc=0x563124341fa1
created by runtime.gcBgMarkStartWorkers in goroutine 1
        runtime/mgc.go:1339 +0x105

goroutine 23 gp=0xc000103dc0 m=nil [GC worker (idle)]:
runtime.gopark(0x101bd586f1d3f?, 0x1?, 0x80?, 0xb7?, 0x0?)
        runtime/proc.go:435 +0xce fp=0xc000082738 sp=0xc000082718 pc=0x56312433a86e
runtime.gcBgMarkWorker(0xc000111730)
        runtime/mgc.go:1423 +0xe9 fp=0xc0000827c8 sp=0xc000082738 pc=0x5631242e7d69
runtime.gcBgMarkStartWorkers.gowrap1()
        runtime/mgc.go:1339 +0x25 fp=0xc0000827e0 sp=0xc0000827c8 pc=0x5631242e7c45
runtime.goexit({})
        runtime/asm_amd64.s:1700 +0x1 fp=0xc0000827e8 sp=0xc0000827e0 pc=0x563124341fa1
created by runtime.gcBgMarkStartWorkers in goroutine 1
        runtime/mgc.go:1339 +0x105

goroutine 37 gp=0xc000484540 m=nil [GC worker (idle)]:
runtime.gopark(0x101bd584e927c?, 0x3?, 0xc1?, 0x77?, 0x0?)
        runtime/proc.go:435 +0xce fp=0xc00048bf38 sp=0xc00048bf18 pc=0x56312433a86e
runtime.gcBgMarkWorker(0xc000111730)
        runtime/mgc.go:1423 +0xe9 fp=0xc00048bfc8 sp=0xc00048bf38 pc=0x5631242e7d69
runtime.gcBgMarkStartWorkers.gowrap1()
        runtime/mgc.go:1339 +0x25 fp=0xc00048bfe0 sp=0xc00048bfc8 pc=0x5631242e7c45
runtime.goexit({})
        runtime/asm_amd64.s:1700 +0x1 fp=0xc00048bfe8 sp=0xc00048bfe0 pc=0x563124341fa1
created by runtime.gcBgMarkStartWorkers in goroutine 1
        runtime/mgc.go:1339 +0x105

goroutine 38 gp=0xc000484700 m=nil [GC worker (idle)]:
runtime.gopark(0x101bd584ea473?, 0x3?, 0x91?, 0x82?, 0x0?)
        runtime/proc.go:435 +0xce fp=0xc00048c738 sp=0xc00048c718 pc=0x56312433a86e
runtime.gcBgMarkWorker(0xc000111730)
        runtime/mgc.go:1423 +0xe9 fp=0xc00048c7c8 sp=0xc00048c738 pc=0x5631242e7d69
runtime.gcBgMarkStartWorkers.gowrap1()
        runtime/mgc.go:1339 +0x25 fp=0xc00048c7e0 sp=0xc00048c7c8 pc=0x5631242e7c45
runtime.goexit({})
        runtime/asm_amd64.s:1700 +0x1 fp=0xc00048c7e8 sp=0xc00048c7e0 pc=0x563124341fa1
created by runtime.gcBgMarkStartWorkers in goroutine 1
        runtime/mgc.go:1339 +0x105

goroutine 39 gp=0xc0004848c0 m=nil [GC worker (idle)]:
runtime.gopark(0x101bd584e7ad9?, 0x1?, 0x94?, 0xae?, 0x0?)
        runtime/proc.go:435 +0xce fp=0xc00048cf38 sp=0xc00048cf18 pc=0x56312433a86e
runtime.gcBgMarkWorker(0xc000111730)
        runtime/mgc.go:1423 +0xe9 fp=0xc00048cfc8 sp=0xc00048cf38 pc=0x5631242e7d69
runtime.gcBgMarkStartWorkers.gowrap1()
        runtime/mgc.go:1339 +0x25 fp=0xc00048cfe0 sp=0xc00048cfc8 pc=0x5631242e7c45
runtime.goexit({})
        runtime/asm_amd64.s:1700 +0x1 fp=0xc00048cfe8 sp=0xc00048cfe0 pc=0x563124341fa1
created by runtime.gcBgMarkStartWorkers in goroutine 1
        runtime/mgc.go:1339 +0x105

goroutine 40 gp=0xc000484a80 m=nil [GC worker (idle)]:
runtime.gopark(0x563126167ec0?, 0x1?, 0x13?, 0xfa?, 0x0?)
        runtime/proc.go:435 +0xce fp=0xc00048d738 sp=0xc00048d718 pc=0x56312433a86e
runtime.gcBgMarkWorker(0xc000111730)
        runtime/mgc.go:1423 +0xe9 fp=0xc00048d7c8 sp=0xc00048d738 pc=0x5631242e7d69
runtime.gcBgMarkStartWorkers.gowrap1()
        runtime/mgc.go:1339 +0x25 fp=0xc00048d7e0 sp=0xc00048d7c8 pc=0x5631242e7c45
runtime.goexit({})
        runtime/asm_amd64.s:1700 +0x1 fp=0xc00048d7e8 sp=0xc00048d7e0 pc=0x563124341fa1
created by runtime.gcBgMarkStartWorkers in goroutine 1
        runtime/mgc.go:1339 +0x105

goroutine 41 gp=0xc000484c40 m=nil [GC worker (idle)]:
runtime.gopark(0x101bd584e0082?, 0x3?, 0xe4?, 0x94?, 0x0?)
        runtime/proc.go:435 +0xce fp=0xc00048df38 sp=0xc00048df18 pc=0x56312433a86e
runtime.gcBgMarkWorker(0xc000111730)
        runtime/mgc.go:1423 +0xe9 fp=0xc00048dfc8 sp=0xc00048df38 pc=0x5631242e7d69
runtime.gcBgMarkStartWorkers.gowrap1()
        runtime/mgc.go:1339 +0x25 fp=0xc00048dfe0 sp=0xc00048dfc8 pc=0x5631242e7c45
runtime.goexit({})
        runtime/asm_amd64.s:1700 +0x1 fp=0xc00048dfe8 sp=0xc00048dfe0 pc=0x563124341fa1
created by runtime.gcBgMarkStartWorkers in goroutine 1
        runtime/mgc.go:1339 +0x105

goroutine 50 gp=0xc000584000 m=nil [GC worker (idle)]:
runtime.gopark(0x101bd584e53a2?, 0x3?, 0xdc?, 0xca?, 0x0?)
        runtime/proc.go:435 +0xce fp=0xc000486738 sp=0xc000486718 pc=0x56312433a86e
runtime.gcBgMarkWorker(0xc000111730)
        runtime/mgc.go:1423 +0xe9 fp=0xc0004867c8 sp=0xc000486738 pc=0x5631242e7d69
runtime.gcBgMarkStartWorkers.gowrap1()
        runtime/mgc.go:1339 +0x25 fp=0xc0004867e0 sp=0xc0004867c8 pc=0x5631242e7c45
runtime.goexit({})
        runtime/asm_amd64.s:1700 +0x1 fp=0xc0004867e8 sp=0xc0004867e0 pc=0x563124341fa1
created by runtime.gcBgMarkStartWorkers in goroutine 1
        runtime/mgc.go:1339 +0x105

goroutine 8 gp=0xc000485880 m=nil [chan receive]:
runtime.gopark(0x30?, 0xffffffffffffffff?, 0x32?, 0x0?, 0xc0013f9840?)
        runtime/proc.go:435 +0xce fp=0xc0013f97f8 sp=0xc0013f97d8 pc=0x56312433a86e
runtime.chanrecv(0xc001956000, 0x0, 0x1)
        runtime/chan.go:664 +0x445 fp=0xc0013f9870 sp=0xc0013f97f8 pc=0x5631242d6245
runtime.chanrecv1(0x5631253502d5?, 0x29?)
        runtime/chan.go:506 +0x12 fp=0xc0013f9898 sp=0xc0013f9870 pc=0x5631242d5dd2
github.com/ollama/ollama/runner/ollamarunner.(*Server).forwardBatch(_, {0x3, {0x5631258121e0, 0xc001952000}, {0x56312581ce48, 0xc001608108}, {0xc00011c138, 0x1, 0x1}, {{0x56312581ce48, ...}, ...}, ...})
        github.com/ollama/ollama/runner/ollamarunner/runner.go:440 +0xfa fp=0xc0013f9bf8 sp=0xc0013f9898 pc=0x563124846f5a
github.com/ollama/ollama/runner/ollamarunner.(*Server).run(0xc00022ad20, {0x563125808c00, 0xc0006a1360})
        github.com/ollama/ollama/runner/ollamarunner/runner.go:419 +0x1ac fp=0xc0013f9fb8 sp=0xc0013f9bf8 pc=0x563124846bec
github.com/ollama/ollama/runner/ollamarunner.Execute.gowrap1()
        github.com/ollama/ollama/runner/ollamarunner/runner.go:1266 +0x28 fp=0xc0013f9fe0 sp=0xc0013f9fb8 pc=0x56312484f168
runtime.goexit({})
        runtime/asm_amd64.s:1700 +0x1 fp=0xc0013f9fe8 sp=0xc0013f9fe0 pc=0x563124341fa1
created by github.com/ollama/ollama/runner/ollamarunner.Execute in goroutine 1
        github.com/ollama/ollama/runner/ollamarunner/runner.go:1266 +0x505

goroutine 9 gp=0xc000485a40 m=nil [select]:
runtime.gopark(0xc000049a10?, 0x2?, 0x4?, 0x0?, 0xc000049874?)
        runtime/proc.go:435 +0xce fp=0xc0000496a0 sp=0xc000049680 pc=0x56312433a86e
runtime.selectgo(0xc000049a10, 0xc000049870, 0xc001132600?, 0x0, 0x1?, 0x1)
        runtime/select.go:351 +0x837 fp=0xc0000497d8 sp=0xc0000496a0 pc=0x563124318ed7
github.com/ollama/ollama/runner/ollamarunner.(*Server).completion(0xc00022ad20, {0x563125806868, 0xc0013680e0}, 0xc00051e140)
        github.com/ollama/ollama/runner/ollamarunner/runner.go:869 +0xb90 fp=0xc000049ac0 sp=0xc0000497d8 pc=0x56312484aed0
github.com/ollama/ollama/runner/ollamarunner.(*Server).completion-fm({0x563125806868?, 0xc0013680e0?}, 0xc0013fdb40?)
        <autogenerated>:1 +0x36 fp=0xc000049af0 sp=0xc000049ac0 pc=0x56312484f5d6
net/http.HandlerFunc.ServeHTTP(0xc0000375c0?, {0x563125806868?, 0xc0013680e0?}, 0xc0013fdb60?)
        net/http/server.go:2294 +0x29 fp=0xc000049b18 sp=0xc000049af0 pc=0x563124639109
net/http.(*ServeMux).ServeHTTP(0x5631242ded85?, {0x563125806868, 0xc0013680e0}, 0xc00051e140)
        net/http/server.go:2822 +0x1c4 fp=0xc000049b68 sp=0xc000049b18 pc=0x56312463b004
net/http.serverHandler.ServeHTTP({0x563125802eb0?}, {0x563125806868?, 0xc0013680e0?}, 0x1?)
        net/http/server.go:3301 +0x8e fp=0xc000049b98 sp=0xc000049b68 pc=0x563124658a8e
net/http.(*conn).serve(0xc0000e43f0, {0x563125808bc8, 0xc000223f50})
        net/http/server.go:2102 +0x625 fp=0xc000049fb8 sp=0xc000049b98 pc=0x563124637605
net/http.(*Server).Serve.gowrap3()
        net/http/server.go:3454 +0x28 fp=0xc000049fe0 sp=0xc000049fb8 pc=0x56312463cec8
runtime.goexit({})
        runtime/asm_amd64.s:1700 +0x1 fp=0xc000049fe8 sp=0xc000049fe0 pc=0x563124341fa1
created by net/http.(*Server).Serve in goroutine 1
        net/http/server.go:3454 +0x485

goroutine 113 gp=0xc000602fc0 m=nil [IO wait]:
runtime.gopark(0xb1c298c3b5c2?, 0x91c3bbc492c3a0c4?, 0xc4?, 0xa5?, 0xb?)
        runtime/proc.go:435 +0xce fp=0xc0013735d8 sp=0xc0013735b8 pc=0x56312433a86e
runtime.netpollblock(0x56312435dbd8?, 0x242d3666?, 0x31?)
        runtime/netpoll.go:575 +0xf7 fp=0xc001373610 sp=0xc0013735d8 pc=0x5631242ff357
internal/poll.runtime_pollWait(0x7f12c4a75cc8, 0x72)
        runtime/netpoll.go:351 +0x85 fp=0xc001373630 sp=0xc001373610 pc=0x563124339a85
internal/poll.(*pollDesc).wait(0xc000693380?, 0xc00036c101?, 0x0)
        internal/poll/fd_poll_runtime.go:84 +0x27 fp=0xc001373658 sp=0xc001373630 pc=0x5631243c0ec7
internal/poll.(*pollDesc).waitRead(...)
        internal/poll/fd_poll_runtime.go:89
internal/poll.(*FD).Read(0xc000693380, {0xc00036c101, 0x1, 0x1})
        internal/poll/fd_unix.go:165 +0x27a fp=0xc0013736f0 sp=0xc001373658 pc=0x5631243c21ba
net.(*netFD).Read(0xc000693380, {0xc00036c101?, 0xc00012f998?, 0xc001373770?})
        net/fd_posix.go:55 +0x25 fp=0xc001373738 sp=0xc0013736f0 pc=0x5631244372a5
net.(*conn).Read(0xc00011c970, {0xc00036c101?, 0x746361726143a0c4?, 0x7265?})
        net/net.go:194 +0x45 fp=0xc001373780 sp=0xc001373738 pc=0x563124445665
net/http.(*connReader).backgroundRead(0xc00036c0f0)
        net/http/server.go:690 +0x37 fp=0xc0013737c8 sp=0xc001373780 pc=0x5631246314d7
net/http.(*connReader).startBackgroundRead.gowrap2()
        net/http/server.go:686 +0x25 fp=0xc0013737e0 sp=0xc0013737c8 pc=0x563124631405
runtime.goexit({})
        runtime/asm_amd64.s:1700 +0x1 fp=0xc0013737e8 sp=0xc0013737e0 pc=0x563124341fa1
created by net/http.(*connReader).startBackgroundRead in goroutine 9
        net/http/server.go:686 +0xb6

goroutine 377 gp=0xc000603500 m=nil [chan receive]:
runtime.gopark(0x30?, 0xffffffffffffffff?, 0x33?, 0x0?, 0xc00138cb30?)
        runtime/proc.go:435 +0xce fp=0xc00138cae8 sp=0xc00138cac8 pc=0x56312433a86e
runtime.chanrecv(0xc0012902a0, 0x0, 0x1)
        runtime/chan.go:664 +0x445 fp=0xc00138cb60 sp=0xc00138cae8 pc=0x5631242d6245
runtime.chanrecv1(0x563125353771?, 0x2c?)
        runtime/chan.go:506 +0x12 fp=0xc00138cb88 sp=0xc00138cb60 pc=0x5631242d5dd2
github.com/ollama/ollama/runner/ollamarunner.(*Server).computeBatch(0xc00022ad20, {0x3, {0x5631258121e0, 0xc001952000}, {0x56312581ce48, 0xc001608108}, {0xc00011c138, 0x1, 0x1}, {{0x56312581ce48, ...}, ...}, ...})
        github.com/ollama/ollama/runner/ollamarunner/runner.go:602 +0x185 fp=0xc00138cef0 sp=0xc00138cb88 pc=0x563124848925
github.com/ollama/ollama/runner/ollamarunner.(*Server).run.gowrap1()
        github.com/ollama/ollama/runner/ollamarunner/runner.go:425 +0x58 fp=0xc00138cfe0 sp=0xc00138cef0 pc=0x563124846e38
runtime.goexit({})
        runtime/asm_amd64.s:1700 +0x1 fp=0xc00138cfe8 sp=0xc00138cfe0 pc=0x563124341fa1
created by github.com/ollama/ollama/runner/ollamarunner.(*Server).run in goroutine 8
        github.com/ollama/ollama/runner/ollamarunner/runner.go:425 +0x2ed

rax    0x0
rbx    0xb81e
rcx    0x7f130ce28b2c
rdx    0x6
rdi    0xb80b
rsi    0xb81e
rbp    0x7f12417fd250
rsp    0x7f12417fd210
r8     0x0
r9     0x7
r10    0x8
r11    0x246
r12    0x6
r13    0x7f127ce0d448
r14    0x16
r15    0x52e
rip    0x7f130ce28b2c
rflags 0x246
cs     0x33
fs     0x0
gs     0x0
time=2025-10-06T18:24:40.803Z level=ERROR source=server.go:425 msg="llama runner terminated" error="exit status 2"
[GIN] 2025/10/06 - 18:24:40 | 200 | 10.535247191s |      172.18.0.1 | POST     "/api/chat"
Mon Oct  6 18:25:07 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.82.07              Driver Version: 580.82.07      CUDA Version: 13.0     |
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 4080        Off |   00000000:06:00.0  On |                  N/A |
|  0%   57C    P2             55W /  320W |    3276MiB /  16376MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A           48824      C   /opt/venv/bin/python3                   238MiB |
|    0   N/A  N/A           55339      C   python3                                 238MiB |
|    0   N/A  N/A           56644      C   frigate.detector.onnx                   364MiB |
|    0   N/A  N/A           56674      C   frigate.embeddings_manager              958MiB |
|    0   N/A  N/A           56912      C   /usr/lib/ffmpeg/7.0/bin/ffmpeg          330MiB |
|    0   N/A  N/A         2420338      G   /usr/lib/xorg/Xorg                      143MiB |
|    0   N/A  N/A         2421301      G   xfwm4                                     4MiB |
|    0   N/A  N/A         2421306    C+G   /usr/bin/sunshine                       243MiB |
|    0   N/A  N/A         2421624      G   ...nstallation/ubuntu12_32/steam          4MiB |
|    0   N/A  N/A         2421967      G   ./steamwebhelper                          9MiB |
|    0   N/A  N/A         2421990      G   ...on/ubuntu12_64/steamwebhelper         10MiB |
|    0   N/A  N/A         2437305      C   /usr/lib/ffmpeg/7.0/bin/ffmpeg          294MiB |
|    0   N/A  N/A         3265107      C   /usr/lib/ffmpeg/7.0/bin/ffmpeg          322MiB |
+-----------------------------------------------------------------------------------------+
NAME                  ID              SIZE     PROCESSOR    CONTEXT    UNTIL
gpt-oss:20b_gpu256    ff94d56c88da    14 GB    100% GPU     4096       4 minutes from now

gpt-oss:20b_gpu256:

  • consecutive try same error
Author
Owner

@jessegross commented on GitHub (Oct 6, 2025):

@SHU-red In the first case, 84% of the model is loaded on the GPU, so it is using the GPU for most of the model with the remainder spilling onto the CPU. However, the CPU quickly becomes the bottleneck. You have a lot of other things running at the same time (see your nvidia-smi output) so shutting some of them down may free up enough VRAM to get more of the model loaded onto the GPU.

I assume that gpu256 means that you set num_gpu to 256 in an effort to force more to load on the GPU. However, you don't have enough VRAM for this, which is why it crashed. That would seem to indicate that the memory management logic is working correctly and you should stick with the default settings.
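For readers trying to follow the tags used in this thread, here is a minimal sketch of how a `num_gpu` override like `_gpu256` is typically created and then reverted; the Modelfile contents are an assumption for illustration, not copied from @SHU-red's setup:

```bash
# Hypothetical Modelfile pinning the number of layers offloaded to the GPU.
# gpt-oss:20b has 25 layers (per the logs above), so num_gpu 256 effectively
# means "offload everything", which overflows VRAM on a busy 16 GB card.
cat > Modelfile.gpu256 <<'EOF'
FROM gpt-oss:20b
PARAMETER num_gpu 256
EOF
ollama create gpt-oss:20b_gpu256 -f Modelfile.gpu256

# Default behaviour: run the base tag and let the scheduler place layers.
ollama run gpt-oss:20b "hello"
ollama ps   # the PROCESSOR column shows the resulting CPU/GPU split
```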

Author
Owner

@SHU-red commented on GitHub (Oct 6, 2025):

@jessegross
Thank you for that.
So the bottom line is that my graphics card has too little memory, I guess.

Should I leave the Docker environment variables as they are?

Author
Owner

@Queracus commented on GitHub (Oct 7, 2025):

I don't understand why you're messing with OLLAMA_RUN_PARALLEL=1; I never touched it and it works like a charm.
@SHU-red, what happens if you run it through the Ollama GUI with basic settings? A 32k context should fit into 16 GB of VRAM easily; that isn't even a question, since even the maximum context for the 20b model fits in roughly 16-17 GB.
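For comparison, the stock setup Queracus is describing, with no parallelism or unified-memory overrides, boils down to something like the following; the container name, port, and volume are the usual defaults and purely illustrative:

```bash
# Plain Ollama container with GPU access and no extra tuning variables;
# context length, scheduling, and layer placement stay at their defaults.
docker run -d --gpus=all \
  -v ollama:/root/.ollama \
  -p 11434:11434 \
  --name ollama \
  ollama/ollama
```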

Author
Owner

@SHU-red commented on GitHub (Oct 7, 2025):

@Queracus

  • I do not really know what I'm doing here.
  • Using this env var was left over from the discussions above.
  • Even without it, loading sometimes struggles.
  • I guess there's a difference between people using the GPU for Ollama only and me, using it for my home server, which is constantly doing other work: Frigate transcoding two 4K surveillance camera streams, sometimes Tdarr video transcoding, etc.

I guess it's just me multiplying the memory consumption by running too much in parallel (see the quick VRAM check sketched after the listing below)...


+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A           48824      C   /opt/venv/bin/python3                   238MiB |
|    0   N/A  N/A           55339      C   python3                                 238MiB |
|    0   N/A  N/A           56644      C   frigate.detector.onnx                   364MiB |
|    0   N/A  N/A           56674      C   frigate.embeddings_manager              958MiB |
|    0   N/A  N/A           56912      C   /usr/lib/ffmpeg/7.0/bin/ffmpeg          330MiB |
|    0   N/A  N/A         2420338      G   /usr/lib/xorg/Xorg                      143MiB |
|    0   N/A  N/A         2421301      G   xfwm4                                     4MiB |
|    0   N/A  N/A         2421306    C+G   /usr/bin/sunshine                       243MiB |
|    0   N/A  N/A         2421624      G   ...nstallation/ubuntu12_32/steam          4MiB |
|    0   N/A  N/A         2421967      G   ./steamwebhelper                          9MiB |
|    0   N/A  N/A         2421990      G   ...on/ubuntu12_64/steamwebhelper         10MiB |
|    0   N/A  N/A         2437305      C   /usr/lib/ffmpeg/7.0/bin/ffmpeg          294MiB |
|    0   N/A  N/A         3265107      C   /usr/lib/ffmpeg/7.0/bin/ffmpeg          322MiB |
+-----------------------------------------------------------------------------------------+
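As referenced above, a quick way to see how much VRAM those neighbouring processes actually leave for Ollama, and what split Ollama ends up with afterwards, is a before/after check; this uses standard nvidia-smi query flags and assumes nothing Ollama-specific:

```bash
# Per-GPU VRAM totals in machine-readable form, before loading a model.
nvidia-smi --query-gpu=name,memory.total,memory.used,memory.free --format=csv

# After the model loads, check how Ollama split it between CPU and GPU;
# the PROCESSOR column shows values like "100% GPU" or "16%/84% CPU/GPU".
ollama ps
```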
Author
Owner

@kiliansinger commented on GitHub (Oct 25, 2025):

I have a similar issue after switching from Ollama 0.9.3:

It worked with large models and used the GPU (RTX 4070, 8 GB), at least offloading some of the data. After upgrading it uses only the CPU, so models such as Qwen3-Coder-30B-A3B-Instruct-1M-Unsloth:UD-IQ3_XXS become unusable.

From the comments above it is not clear to me how to get this working with newer Ollama versions.

Author
Owner

@Queracus commented on GitHub (Oct 25, 2025):

> I have a similar issue after switching from Ollama 0.9.3:
>
> It worked with large models and used the GPU (RTX 4070, 8 GB), at least offloading some of the data. After upgrading it uses only the CPU, so models such as Qwen3-Coder-30B-A3B-Instruct-1M-Unsloth:UD-IQ3_XXS become unusable.
>
> From the comments above it is not clear to me how to get this working with newer Ollama versions.

If it doesn't fit in your GPU, it will just load into RAM and run on the CPU. That's about it.

Author
Owner

@kiliansinger commented on GitHub (Oct 25, 2025):

Yes, indeed the model (13 GB) is bigger than 8 GB, so it does not fit completely into the GPU. But on 0.9.3 it was working, partially offloading the computation to the GPU, and it was fast enough to be usable. That behavior stopped working. LM Studio is also able to use the model, but Ollama 0.9.3 was actually about 30% more efficient. It would be sad to lose this.
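While partial offload on newer versions is being looked into, one possible workaround is to request an explicit layer split per call through the REST API's `num_gpu` option. This is only a sketch: the layer count below is a guess that would need tuning for an 8 GB card, and the model tag is taken verbatim from the comment above.

```bash
# Ask for a fixed number of layers on the GPU for this request only.
# "num_gpu" is the same knob as the Modelfile parameter; too high a value
# overflows VRAM, too low leaves the GPU mostly idle.
curl http://localhost:11434/api/generate -d '{
  "model": "Qwen3-Coder-30B-A3B-Instruct-1M-Unsloth:UD-IQ3_XXS",
  "prompt": "Write a hello world program in Go.",
  "options": { "num_gpu": 20 }
}'
```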

Author
Owner

@kiliansinger commented on GitHub (Oct 30, 2025):

I wrote a PR that will probably fix this issue as well: https://github.com/ollama/ollama/pull/12856

Reference: github-starred/ollama#7723