[GH-ISSUE #11676] Ollama not using NVIDIA GPUs with gpt-oss models #7723

Closed
opened 2026-04-12 19:50:08 -05:00 by GiteaMirror · 91 comments

Originally created by @nadamas2000 on GitHub (Aug 5, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/11676

What is the issue?

Hello,

I've noticed an issue with GPU utilization. When running the gpt-oss:20b and gpt-oss:120b models, Ollama seems to be running them entirely on the CPU.

My NVIDIA GPUs (RTX 4070-Ti 16GB and RTX 3060 12GB) remain completely idle according to nvidia-smi and Task Manager, while my CPU usage is maxed out. I would expect these models to be loaded onto the GPUs for accelerated performance.

Key Information:

  • Ollama Version: 0.11.0
  • Operating System: Windows 11
  • Intel(R) Core(TM) Ultra 7 265K (3.90 GHz) 20 cores
  • Nvidia driver: Game Ready 580.88
  • Models affected: gpt-oss:20b, gpt-oss:120b

Steps to Reproduce:

  • ollama run gpt-oss:120b "Write a long story"
  • Observe that GPU utilization is at 0% and CPU is at 100%.
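
To see where the model actually lands while reproducing, something like this helps (a rough sketch; adjust the model tag to taste):

# in one terminal, watch GPU memory and utilization
nvidia-smi -l 2

# in another, load the model, then ask Ollama how it placed it
ollama run gpt-oss:120b "Write a long story"
ollama ps    # the PROCESSOR column shows the CPU/GPU split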

Thanks for your great work on this project. Let me know if you need any more information.

server.log (https://github.com/user-attachments/files/21605611/server.log)

GiteaMirror added the bug label 2026-04-12 19:50:08 -05:00

@russellmm commented on GitHub (Aug 5, 2025):

server.log (https://github.com/user-attachments/files/21605549/server.log)
Can confirm. Same issue for me. Swapped over to qwen3:30b just to be sure and it is using the GPU fine.

@Shawneau commented on GitHub (Aug 5, 2025):

Can confirm it's not working in Docker on an NVIDIA GPU, while other models load fine. Host is Ubuntu 22.something
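
For comparison, a minimal sketch of a GPU-enabled Docker setup (assumes the NVIDIA Container Toolkit is installed on the host; the actual compose file here may differ):

docker run -d --gpus all -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama
docker exec -it ollama ollama run gpt-oss:20b

Other models started this way do use the GPU, so the container can see it; only gpt-oss falls back to CPU.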

@jessegross commented on GitHub (Aug 5, 2025):

Can you please post the server logs (https://github.com/ollama/ollama/blob/main/docs/troubleshooting.md#how-to-troubleshoot-issues)?

@av commented on GitHub (Aug 5, 2025):

@jessegross, sorry for the extra pull logs in the middle

Details

harbor.ollama  | time=2025-08-05T18:39:37.230Z level=INFO source=gpu.go:217 msg="looking for compatible GPUs"
harbor.ollama  | time=2025-08-05T18:39:37.380Z level=INFO source=types.go:130 msg="inference compute" id=GPU-4f549573-5491-abe4-bcf8-8804171f6b2b library=cuda variant=v12 compute=8.9 driver=12.9 name="NVIDIA GeForce RTX 4090 Laptop GPU" total="15.6 GiB" available="15.3 GiB"
harbor.ollama  | [GIN] 2025/08/05 - 18:39:38 | 200 |      80.352µs |      172.22.0.3 | HEAD     "/"
harbor.ollama  | [GIN] 2025/08/05 - 18:39:39 | 200 |  650.019494ms |      172.22.0.3 | POST     "/api/pull"
harbor.ollama  | [GIN] 2025/08/05 - 18:39:40 | 200 |      25.962µs |      172.22.0.4 | HEAD     "/"
harbor.ollama  | time=2025-08-05T18:39:40.835Z level=INFO source=download.go:177 msg="downloading b112e727c6f1 in 16 861 MB part(s)"
harbor.ollama  | time=2025-08-05T18:41:21.300Z level=INFO source=download.go:295 msg="b112e727c6f1 part 13 attempt 0 failed: unexpected EOF, retrying in 1s"
harbor.ollama  | time=2025-08-05T18:41:50.892Z level=INFO source=download.go:295 msg="b112e727c6f1 part 6 attempt 0 failed: unexpected EOF, retrying in 1s"
harbor.ollama  | [GIN] 2025/08/05 - 18:43:29 | 200 |      14.745µs |      172.22.0.5 | HEAD     "/"
harbor.ollama  | [GIN] 2025/08/05 - 18:43:35 | 200 |  6.075061495s |      172.22.0.5 | POST     "/api/pull"
harbor.ollama  | time=2025-08-05T18:44:12.210Z level=INFO source=download.go:177 msg="downloading 51468a0fd901 in 1 7.4 KB part(s)"
harbor.ollama  | time=2025-08-05T18:44:13.589Z level=INFO source=download.go:177 msg="downloading d8ba2f9a17b3 in 1 18 B part(s)"
harbor.ollama  | time=2025-08-05T18:44:14.979Z level=INFO source=download.go:177 msg="downloading fcaef9305bb6 in 1 415 B part(s)"
harbor.ollama  | [GIN] 2025/08/05 - 18:44:22 | 200 |         4m42s |      172.22.0.4 | POST     "/api/pull"
harbor.ollama  | [GIN] 2025/08/05 - 18:44:28 | 200 |    27.91755ms |      172.22.0.3 | GET      "/api/tags"
harbor.ollama  | [GIN] 2025/08/05 - 18:44:28 | 200 |      84.839µs |      172.22.0.3 | GET      "/api/ps"
harbor.ollama  | [GIN] 2025/08/05 - 18:44:28 | 200 |      35.083µs |      172.22.0.3 | GET      "/api/version"
harbor.ollama  | time=2025-08-05T18:44:31.245Z level=INFO source=server.go:135 msg="system memory" total="62.4 GiB" free="52.0 GiB" free_swap="20.0 GiB"
harbor.ollama  | time=2025-08-05T18:44:31.245Z level=INFO source=server.go:175 msg=offload library=cuda layers.requested=-1 layers.model=25 layers.offload=0 layers.split="" memory.available="[15.3 GiB]" memory.gpu_overhead="0 B" memory.required.full="12.9 GiB" memory.required.partial="0 B" memory.required.kv="1.1 GiB" memory.required.allocations="[0 B]" memory.weights.total="11.7 GiB" memory.weights.repeating="10.7 GiB" memory.weights.nonrepeating="1.1 GiB" memory.graph.full="8.0 GiB" memory.graph.partial="16.0 GiB"
harbor.ollama  | time=2025-08-05T18:44:31.273Z level=INFO source=server.go:438 msg="starting llama server" cmd="/usr/bin/ollama runner --ollama-engine --model /root/.ollama/models/blobs/sha256-b112e727c6f18875636c56a779790a590d705aec9e1c0eb5a97d51fc2a778583 --ctx-size 32768 --batch-size 512 --threads 8 --no-mmap --parallel 4 --port 35337"
harbor.ollama  | time=2025-08-05T18:44:31.273Z level=INFO source=sched.go:481 msg="loaded runners" count=1
harbor.ollama  | time=2025-08-05T18:44:31.274Z level=INFO source=server.go:598 msg="waiting for llama runner to start responding"
harbor.ollama  | time=2025-08-05T18:44:31.274Z level=INFO source=server.go:632 msg="waiting for server to become available" status="llm server not responding"
harbor.ollama  | time=2025-08-05T18:44:31.282Z level=INFO source=runner.go:925 msg="starting ollama engine"
harbor.ollama  | time=2025-08-05T18:44:31.282Z level=INFO source=runner.go:983 msg="Server listening on 127.0.0.1:35337"
harbor.ollama  | time=2025-08-05T18:44:31.314Z level=INFO source=ggml.go:92 msg="" architecture=gptoss file_type=MXFP4 name="" description="" num_tensors=315 num_key_values=30
harbor.ollama  | ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
harbor.ollama  | ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
harbor.ollama  | ggml_cuda_init: found 1 CUDA devices:
harbor.ollama  |   Device 0: NVIDIA GeForce RTX 4090 Laptop GPU, compute capability 8.9, VMM: yes
harbor.ollama  | load_backend: loaded CUDA backend from /usr/lib/ollama/libggml-cuda.so
harbor.ollama  | load_backend: loaded CPU backend from /usr/lib/ollama/libggml-cpu-alderlake.so
harbor.ollama  | time=2025-08-05T18:44:31.361Z level=INFO source=ggml.go:104 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX_VNNI=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.BMI2=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 CUDA.0.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.0.USE_GRAPHS=1 CUDA.0.PEER_MAX_BATCH_SIZE=128 compiler=cgo(gcc)
harbor.ollama  | time=2025-08-05T18:44:31.422Z level=INFO source=ggml.go:367 msg="offloading 0 repeating layers to GPU"
harbor.ollama  | time=2025-08-05T18:44:31.422Z level=INFO source=ggml.go:371 msg="offloading output layer to CPU"
harbor.ollama  | time=2025-08-05T18:44:31.422Z level=INFO source=ggml.go:378 msg="offloaded 0/25 layers to GPU"
harbor.ollama  | time=2025-08-05T18:44:31.422Z level=INFO source=ggml.go:381 msg="model weights" buffer=CPU size="12.8 GiB"
harbor.ollama  | time=2025-08-05T18:44:31.525Z level=INFO source=server.go:632 msg="waiting for server to become available" status="llm server loading model"
harbor.ollama  | time=2025-08-05T18:44:31.698Z level=INFO source=ggml.go:672 msg="compute graph" backend=CUDA0 buffer_type=CUDA0 size="0 B"
harbor.ollama  | time=2025-08-05T18:44:31.698Z level=INFO source=ggml.go:672 msg="compute graph" backend=CPU buffer_type=CPU size="8.0 GiB"
harbor.ollama  | time=2025-08-05T18:44:32.787Z level=INFO source=server.go:637 msg="llama runner started in 1.51 seconds"

Similar setup, Ollama v0.11 + Docker, other models use GPU as expected

@hrz6976 commented on GitHub (Aug 5, 2025):

Same here on 4xL40s.

time=2025-08-05T18:48:36.321Z level=INFO source=sched.go:546 msg="updated VRAM based on existing loaded models" gpu=GPU-464c9e2c-5e57-838c-e947-f75970e572bd library=cuda total="44.4 GiB" available="43.6 GiB"
time=2025-08-05T18:48:36.321Z level=INFO source=sched.go:546 msg="updated VRAM based on existing loaded models" gpu=GPU-b41cc73c-5bc1-2795-95ad-ec87002c38e2 library=cuda total="44.4 GiB" available="43.6 GiB"
time=2025-08-05T18:48:36.321Z level=INFO source=sched.go:546 msg="updated VRAM based on existing loaded models" gpu=GPU-b7ad3d9e-25dc-58ee-8b74-aaa56e955517 library=cuda total="44.4 GiB" available="43.6 GiB"
time=2025-08-05T18:48:36.321Z level=INFO source=sched.go:546 msg="updated VRAM based on existing loaded models" gpu=GPU-81971a31-64fe-a071-c6a5-de5dc026e0f7 library=cuda total="44.4 GiB" available="43.6 GiB"
time=2025-08-05T18:48:41.318Z level=INFO source=server.go:135 msg="system memory" total="755.5 GiB" free="688.2 GiB" free_swap="60.0 GiB"
time=2025-08-05T18:48:41.318Z level=INFO source=server.go:175 msg=offload library=cuda layers.requested=-1 layers.model=37 layers.offload=0 layers.split="" memory.available="[44.0 GiB 44.0 GiB 44.0 GiB 44.0 GiB]" memory.gpu_overhead="0 B" memory.required.full="86.7 GiB" memory.required.partial="0 B" memory.required.kv="27.0 GiB" memory.required.allocations="[0 B 0 B 0 B 0 B]" memory.weights.total="59.7 GiB" memory.weights.repeating="58.6 GiB" memory.weights.nonrepeating="1.1 GiB" memory.graph.full="384.0 GiB" memory.graph.partial="384.0 GiB"
time=2025-08-05T18:48:41.361Z level=INFO source=server.go:438 msg="starting llama server" cmd="/usr/bin/ollama runner --ollama-engine --model /root/.ollama/models/blobs/sha256-90a618fe6ff21b09ca968df959104eb650658b0bef0faef785c18c2795d993e3 --ctx-size 524288 --batch-size 512 --threads 112 --no-mmap --parallel 64 --port 32909"
time=2025-08-05T18:48:41.362Z level=INFO source=sched.go:481 msg="loaded runners" count=1
time=2025-08-05T18:48:41.362Z level=INFO source=server.go:598 msg="waiting for llama runner to start responding"
time=2025-08-05T18:48:41.362Z level=INFO source=server.go:632 msg="waiting for server to become available" status="llm server not responding"
time=2025-08-05T18:48:41.378Z level=INFO source=runner.go:925 msg="starting ollama engine"
time=2025-08-05T18:48:41.379Z level=INFO source=runner.go:983 msg="Server listening on 127.0.0.1:32909"
time=2025-08-05T18:48:41.451Z level=INFO source=ggml.go:92 msg="" architecture=gptoss file_type=MXFP4 name="" description="" num_tensors=471 num_key_values=30
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 4 CUDA devices:
  Device 0: NVIDIA L40S, compute capability 8.9, VMM: yes
  Device 1: NVIDIA L40S, compute capability 8.9, VMM: yes
  Device 2: NVIDIA L40S, compute capability 8.9, VMM: yes
  Device 3: NVIDIA L40S, compute capability 8.9, VMM: yes
time=2025-08-05T18:48:41.614Z level=INFO source=server.go:632 msg="waiting for server to become available" status="llm server loading model"
load_backend: loaded CUDA backend from /usr/lib/ollama/libggml-cuda.so
load_backend: loaded CPU backend from /usr/lib/ollama/libggml-cpu-icelake.so
time=2025-08-05T18:48:41.849Z level=INFO source=ggml.go:104 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.BMI2=1 CPU.0.AVX512=1 CPU.0.AVX512_VBMI=1 CPU.0.AVX512_VNNI=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 CUDA.0.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.0.USE_GRAPHS=1 CUDA.0.PEER_MAX_BATCH_SIZE=128 CUDA.1.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.1.USE_GRAPHS=1 CUDA.1.PEER_MAX_BATCH_SIZE=128 CUDA.2.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.2.USE_GRAPHS=1 CUDA.2.PEER_MAX_BATCH_SIZE=128 CUDA.3.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.3.USE_GRAPHS=1 CUDA.3.PEER_MAX_BATCH_SIZE=128 compiler=cgo(gcc)
time=2025-08-05T18:48:41.955Z level=INFO source=ggml.go:367 msg="offloading 0 repeating layers to GPU"
time=2025-08-05T18:48:41.955Z level=INFO source=ggml.go:371 msg="offloading output layer to CPU"
time=2025-08-05T18:48:41.955Z level=INFO source=ggml.go:378 msg="offloaded 0/37 layers to GPU"
time=2025-08-05T18:48:41.955Z level=INFO source=ggml.go:381 msg="model weights" buffer=CPU size="60.8 GiB"
time=2025-08-05T18:48:53.769Z level=INFO source=ggml.go:672 msg="compute graph" backend=CUDA0 buffer_type=CUDA0 size="0 B"
time=2025-08-05T18:48:53.769Z level=INFO source=ggml.go:672 msg="compute graph" backend=CUDA1 buffer_type=CUDA1 size="0 B"
time=2025-08-05T18:48:53.769Z level=INFO source=ggml.go:672 msg="compute graph" backend=CUDA2 buffer_type=CUDA2 size="0 B"
time=2025-08-05T18:48:53.769Z level=INFO source=ggml.go:672 msg="compute graph" backend=CUDA3 buffer_type=CUDA3 size="0 B"
time=2025-08-05T18:48:53.769Z level=INFO source=ggml.go:672 msg="compute graph" backend=CPU buffer_type=CPU size="128.0 GiB"
time=2025-08-05T18:48:56.185Z level=INFO source=server.go:637 msg="llama runner started in 14.82 seconds"
@jessegross commented on GitHub (Aug 5, 2025):

@av @hrz6976

It looks like you both increased OLLAMA_NUM_PARALLEL. I would recommend leaving it at the default setting, as higher values use more VRAM and reduce the ability to offload.
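
If it isn't obvious where the setting is coming from, a sketch of how to check and clear it on the standard Linux/systemd install (Docker users would instead drop the -e OLLAMA_NUM_PARALLEL=... flag when starting the container):

systemctl show ollama --property=Environment   # see what the service currently has set
sudo systemctl edit ollama                      # remove the OLLAMA_NUM_PARALLEL line from the drop-in
sudo systemctl restart ollama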

@nadamas2000 commented on GitHub (Aug 5, 2025):

Thanks for the suggestion. I've confirmed that I'm using OLLAMA_NUM_PARALLEL=1. I have updated the issue description with the latest logs.

@Shawneau commented on GitHub (Aug 5, 2025):

I have a feeling everyone experiencing this is hitting Ollama via Open WebUI? Command-line Ollama works with 100% GPU, but hitting it through Open WebUI goes 100% CPU.

@av commented on GitHub (Aug 5, 2025):

Understandable!

With OLLAMA_NUM_PARALLEL=1 the split is now:

NAME           ID              SIZE     PROCESSOR          CONTEXT    UNTIL              
gpt-oss:20b    05afbac4bad6    18 GB    12%/88% CPU/GPU    8192       4 minutes from now    

With OLLAMA_NUM_PARALLEL=4, it looks like:

NAME           ID              SIZE     PROCESSOR    CONTEXT    UNTIL              
gpt-oss:20b    05afbac4bad6    13 GB    100% CPU     8192       4 minutes from now    

So, possibly something is off with either ps or the estimator, as clearly batching should allocate more memory.

In both instances it only uses ~12.9 GB of VRAM, leaving some space unallocated; I hope there's some way to use that and improve the performance a bit.

@jessegross commented on GitHub (Aug 5, 2025):

@nadamas2000 It looks like you increased the context length; this has a similar effect to increasing NUM_PARALLEL. You'll need to use a lower value or the default.
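
There are a few equivalent ways to bring the context back down, depending on how Ollama is being driven (a sketch, using 8192 as an example value):

# server-wide default, e.g. in the systemd drop-in or via docker -e
OLLAMA_CONTEXT_LENGTH=8192

# per session, from the interactive CLI
ollama run gpt-oss:120b
>>> /set parameter num_ctx 8192

# per request, through the API
curl http://localhost:11434/api/generate -d '{"model": "gpt-oss:120b", "prompt": "hello", "options": {"num_ctx": 8192}}'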

@nadamas2000 commented on GitHub (Aug 5, 2025):

OK, in my case the GPUs run well with a 4k context.
Thanks.

@hrz6976 commented on GitHub (Aug 5, 2025):

Thanks for spotting this! I misunderstood how OLLAMA_NUM_PARALLEL works (related: https://github.com/ollama/ollama/issues/4170). It worked after removing OLLAMA_NUM_PARALLEL from the environment variables. 😄
P.S. Is there a way for Ollama itself to calculate how many requests it can handle before falling back to CPU? I can't find an optimal OLLAMA_NUM_PARALLEL, as it applies to all models and I sometimes need to run different models in parallel.

@HuChundong commented on GitHub (Aug 5, 2025):

I have 4x 2080 Ti 22GB, 88GB in total. gpt-oss:120b is using ~10% CPU with an 8k context. Is 88GB not enough for the 120B model?

@russellmm commented on GitHub (Aug 5, 2025):

It was context size for me. On a 5090, I can set the context window to 32K and the model uses the GPU. Setting it to 64K and it switches to the CPU.

@Shawneau commented on GitHub (Aug 5, 2025):

> It was context size for me. On a 5090, I can set the context window to 32K and the model uses the GPU. Setting it to 64K and it switches to the CPU.

Yeah, that works for me too. What's the context window for the model, though? It's still a bug if it happens over 32K (might not be an Ollama bug; it might be Open WebUI or elsewhere).

@abhinavxd commented on GitHub (Aug 5, 2025):

Yes, it's the context size. It works well with the Ollama UI and CLI (uses GPU).
But when I add this model to GitHub Copilot, the context goes up to 32,768 and it doesn't use the GPU at all.

I got a 4080

@torbwol commented on GitHub (Aug 5, 2025):

It's so weird... With a context of 8192 it utilizes one of my two GPUs and says the size is 22GB. When I increase the context to 16384 it goes 100% CPU and says the size is 13GB. How does this make any sense? Why can't it use both GPUs, and why can't it use the GPUs at all when I increase the context size?

@SierraKiloGulf commented on GitHub (Aug 5, 2025):

Same here, team red: 7900 XTX. It doesn't matter whether I use the CLI, Open WebUI, AnythingLLM, or the like. Windows/Ubuntu.

@thedaveCA commented on GitHub (Aug 6, 2025):

As a datapoint: 0.11.0 ran gpt-oss:20b on CPU for me, 0.11.2 on GPU. 7900XTX w/24GB VRAM, reporting 14.8GiB in use.

@ZYJZYJZYJ0801 commented on GitHub (Aug 6, 2025):

Same for me:

NAME            SIZE     PROCESSOR    CONTEXT    UNTIL
gpt-oss:120b    67 GB    100% CPU     8192       3 minutes from now

How do I fix it? Other models can use 100% GPU.

@coolbirdzik commented on GitHub (Aug 6, 2025):

Same for me too, with 2x A4000.

@n0k0de commented on GitHub (Aug 6, 2025):

In my setup with a 5060 Ti 16GB (Ollama + Open WebUI all on Docker), Ollama only offloads 22 out of 24 layers to the GPU, even though there are still 3GB of VRAM available.

The OLLAMA_NUM_PARALLEL variable is set to 1.

Additionally, even when I set num_ctx to 4096 via Open WebUI, the context remains at 8192. Hard to say whether this issue comes from Ollama or Open WebUI.
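
One way to tell whether the 8192 is coming from Ollama or from Open WebUI is to bypass the UI and pass num_ctx directly, then check what ollama ps reports in the CONTEXT column (a sketch):

curl http://localhost:11434/api/chat -d '{"model": "gpt-oss:20b", "messages": [{"role": "user", "content": "hi"}], "options": {"num_ctx": 4096}}'
ollama ps

If ollama ps now reports 4096 in the CONTEXT column, the option is being honored and the stuck 8192 is coming from the UI side.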

@Ca-rs-on commented on GitHub (Aug 6, 2025):

FWIW I accidentally pulled the wrong Docker image when upgrading to use gpt-oss and it caused this same problem, if you're running NVIDIA don't pull the rocm tag lol.

@ricardofiorani commented on GitHub (Aug 6, 2025):

Same here

@jessegross commented on GitHub (Aug 6, 2025):

There was a bug in 0.11.2 and below where the memory estimate could become too high for gpt-oss if the model needed to be split across GPU and CPU or across multiple GPUs. This would often cause 100% CPU usage once the model overflowed a single GPU.

This is fixed in 0.11.3.
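
For the standard Linux install, upgrading and re-checking is quick (a sketch; Docker users would pull a newer image tag instead):

curl -fsSL https://ollama.com/install.sh | sh   # re-running the installer upgrades in place
ollama -v                                       # should report 0.11.3 or later
ollama run gpt-oss:20b "hello"
ollama ps                                       # check the CPU/GPU split again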

@trdischat commented on GitHub (Aug 6, 2025):

Upgrading to 0.11.3 allowed gpt-oss:20b to load at least partially on the GPU. But the memory consumed by the model more than doubled. With 0.11.2, the model used 13GB of memory, 100% on the CPU. With 0.11.3, the model uses 32GB of memory, split 24%/76% between CPU and GPU. This is just running ollama run gpt-oss at the command line.

I am running Ollama on Ubuntu 20.04 with these environment variable settings:

Environment="OLLAMA_CONTEXT_LENGTH=32000"
Environment="OLLAMA_NUM_PARALLEL=1"
Environment="GGML_CUDA_ENABLE_UNIFIED_MEMORY=1"

The server has 2 RTX 3060 for a total of 24GB of VRAM and 96GB of system RAM. Reducing the context length to 2000 brought the memory used by the model down to 19GB (running 100% on GPU), still way more than in Ollama 0.11.2.

Testing with other models, including llama, mistral, qwen3, etc., reveals that all models seem to be using more RAM in 0.11.3 than they were in 0.11.2.
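
If it helps narrow this down, the scheduler's estimate is written to the msg=offload line in the server log, so the 0.11.2 and 0.11.3 numbers can be compared directly (a sketch, assuming the systemd install):

journalctl -u ollama --no-pager | grep -F "msg=offload" | tail -n 2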

@ZYJZYJZYJ0801 commented on GitHub (Aug 7, 2025):

Ollama version is 0.11.3.

NAME            SIZE      PROCESSOR          CONTEXT    UNTIL
gpt-oss:120b    151 GB    37%/63% CPU/GPU    8192       About a minute from now

GPU: 3x RTX 5000 Ada
Memory: 2x 128GB
It can't use the GPUs fully. How can I fix it?

@azomDev commented on GitHub (Aug 7, 2025):

Similar issue here #11688

@alienatedsec commented on GitHub (Aug 7, 2025):

root@[redacted]:/# nvidia-smi
Thu Aug  7 22:04:07 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.172.08             Driver Version: 570.172.08     CUDA Version: 12.8     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA RTX A4000               On  |   00000000:01:00.0 Off |                  Off |
| 54%   66C    P8             13W /  140W |   12617MiB /  16376MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA GeForce RTX 4060 Ti     On  |   00000000:02:00.0 Off |                  N/A |
| 34%   48C    P8              6W /  165W |   14471MiB /  16380MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   2  NVIDIA RTX A4000               On  |   00000000:03:00.0 Off |                  Off |
| 54%   63C    P8             14W /  140W |   14285MiB /  16376MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   3  NVIDIA RTX A4000               On  |   00000000:04:00.0 Off |                  Off |
| 58%   68C    P8             10W /  140W |   12617MiB /  16376MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   4  NVIDIA RTX A4000               On  |   00000000:05:00.0 Off |                  Off |
| 60%   69C    P8             11W /  140W |   12041MiB /  16376MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A             122      C   /usr/bin/ollama                         320MiB |
|    1   N/A  N/A             122      C   /usr/bin/ollama                         280MiB |
|    2   N/A  N/A             122      C   /usr/bin/ollama                         322MiB |
|    3   N/A  N/A             122      C   /usr/bin/ollama                         320MiB |
|    4   N/A  N/A             122      C   /usr/bin/ollama                         384MiB |
+-----------------------------------------------------------------------------------------+

And to follow up on the Ollama side:

root@[redacted]:/# ollama ps
NAME            ID              SIZE     PROCESSOR          CONTEXT    UNTIL   
gpt-oss:120b    735371f916a9    91 GB    13%/87% CPU/GPU    8192       Forever    
root@[redacted]:/# ollama -v
ollama version is 0.11.3
root@[redacted]:/# 

It seems like there is still some GPU memory left to allocate. Some testing below:

OLLAMA_NUM_PARALLEL=1

root@[redacted]:/# ollama run --verbose gpt-oss:120b
>>> write 3 paragraphs about cryptography
Thinking...
User asks: "write 3 paragraphs about cryptography". Provide three paragraphs, likely descriptive. Ensure good content.
...done thinking.

Cryptography, the art and science of securing information, has evolved from simple substitution ciphers used by ancient civilizations to sophisticated mathematical 
frameworks that underpin modern digital security. At its core, cryptography transforms readable data (plaintext) into an unintelligible form (ciphertext) using 
algorithms and keys, ensuring that only authorized parties can recover the original message. Early techniques, such as the Caesar shift and the Enigma machine, relied on 
mechanical or manual processes, but the advent of computers introduced computational complexity as a cornerstone, allowing for encryption schemes that are practically 
unbreakable given current technology.

The modern landscape of cryptography is divided primarily into two complementary paradigms: symmetric-key and asymmetric-key cryptography. Symmetric algorithms, like AES 
(Advanced Encryption Standard), use a single secret key for both encryption and decryption, offering high speed and efficiency for bulk data protection. Asymmetric 
systems, exemplified by RSA and elliptic‑curve cryptography (ECC), employ a pair of mathematically linked keys—a public key for encryption and a private key for 
decryption—enabling secure key exchange, digital signatures, and authentication without prior secret sharing. Together, these techniques form the backbone of protocols 
such as TLS/SSL, VPNs, and end‑to‑end encrypted messaging apps, safeguarding everything from online banking to personal communications.

Beyond confidentiality, cryptography also addresses integrity, authenticity, and non‑repudiation through tools like hash functions, message authentication codes (MACs), 
and digital signatures. Cryptographic hash functions (e.g., SHA‑256) produce fixed‑size digests that uniquely represent data, making it easy to detect tampering. Digital 
signatures, generated with a private key and verified with the corresponding public key, provide proof that a specific entity authored a message and cannot later deny 
it. As quantum computing looms on the horizon, researchers are already developing post‑quantum algorithms to replace vulnerable schemes, ensuring that the principles of 
cryptography continue to protect information in an increasingly connected and computationally powerful world.

total duration:       24.001994779s
load duration:        342.189318ms
prompt eval count:    74 token(s)
prompt eval duration: 1.409565258s
prompt eval rate:     52.50 tokens/s
eval count:           438 token(s)
eval duration:        22.246255913s
eval rate:            19.69 tokens/s
>>> Send a message (/? for help)
@Jonseed commented on GitHub (Aug 7, 2025):

I'm seeing a similar slowdown on Ollama with my 3060 12GB, where I only get about 4 t/s, which is almost unusable. In LM Studio I'm getting up to 13+ t/s, offloading 20 layers out of 24 (83%). With Ollama, ollama ps shows it is only using 68% of my GPU and offloading the rest to CPU, which could account for the slowdown.
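
If there is VRAM headroom left over, the layer split can be forced per session with the num_gpu option to mirror LM Studio's 20/24 offload (a sketch; if the value is set too high the load will simply fail or spill back to CPU):

ollama run gpt-oss:20b
>>> /set parameter num_gpu 20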

@azomDev commented on GitHub (Aug 10, 2025):

Sorry for the spam; I was trying to find all the similar issues so far and link them here, since this is the earliest issue about what seems to be broadly the same problem.

@jhsmith409 commented on GitHub (Aug 10, 2025):

I had the same issue in #11731 (listed above as well). 5090 + 5070 Ti for a total of 48GB VRAM. It runs 99%+ on CPU and consumes just a small amount of VRAM. When I calculate KV cache size + model + estimated overhead, I think it should easily fit... I pulled the model parameters from the model card directly on HF. My context size is large: 128k. Qwen3:30B fits easily with that context and matches my VRAM calculation for it. So I'm thinking there is either a bug remaining in the implementation, or something different about the model, such that the way I'm calculating KV cache size works for Qwen3 but not for gpt-oss.
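
For the back-of-the-envelope math, the usual full-attention estimate is roughly:

KV bytes ≈ 2 (K and V) × n_layers × n_kv_heads × head_dim × n_ctx × bytes_per_element

gpt-oss complicates this in both directions: it reportedly alternates full-attention and sliding-window layers (which shrinks the cache for the windowed layers), but Ollama also reserves a separate compute-graph buffer that the logs earlier in this thread show growing with context (8.0 GiB at a 32k context, 128.0 GiB at 512k). That graph buffer, not the KV cache itself, is probably where a hand calculation that works for Qwen3 ends up far below what the scheduler demands for gpt-oss.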

@alienatedsec commented on GitHub (Aug 10, 2025):

Not sure what happened recently; there seems to be some wrong reporting with the latest version.

root@[redacted]:/# ollama ps
NAME            ID              SIZE     PROCESSOR    CONTEXT    UNTIL   
gpt-oss:120b    735371f916a9    69 GB    100% CPU     128000     Forever    
root@[redacted]:/# ollama -v
ollama version is 0.11.4
root@[redacted]:~# nvidia-smi
Sun Aug 10 15:02:52 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.172.08             Driver Version: 570.172.08     CUDA Version: 12.8     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA RTX A4000               On  |   00000000:01:00.0 Off |                  Off |
| 42%   65C    P0             55W /  140W |   13131MiB /  16376MiB |     15%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA GeForce RTX 4060 Ti     On  |   00000000:02:00.0 Off |                  N/A |
|  0%   59C    P0             46W /  165W |   15239MiB /  16380MiB |     14%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   2  NVIDIA RTX A4000               On  |   00000000:03:00.0 Off |                  Off |
| 47%   69C    P0             61W /  140W |   15051MiB /  16376MiB |     15%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   3  NVIDIA RTX A4000               On  |   00000000:04:00.0 Off |                  Off |
| 48%   70C    P0             52W /  140W |   13387MiB /  16376MiB |     14%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   4  NVIDIA RTX A4000               On  |   00000000:05:00.0 Off |                  Off |
| 48%   70C    P0             57W /  140W |   12809MiB /  16376MiB |     17%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A            2417      C   /usr/bin/ollama                         320MiB |
|    1   N/A  N/A            1367      C   /usr/local/bin/python3                  222MiB |
|    1   N/A  N/A            2417      C   /usr/bin/ollama                         280MiB |
|    2   N/A  N/A            2417      C   /usr/bin/ollama                         320MiB |
|    3   N/A  N/A            2417      C   /usr/bin/ollama                         320MiB |
|    4   N/A  N/A            2417      C   /usr/bin/ollama                         384MiB |
+-----------------------------------------------------------------------------------------+
<!-- gh-comment-id:3172698087 --> @alienatedsec commented on GitHub (Aug 10, 2025): Not sure what happened recently - some wrong reporting with the latest version ``` root@[redacted]:/# ollama ps NAME ID SIZE PROCESSOR CONTEXT UNTIL gpt-oss:120b 735371f916a9 69 GB 100% CPU 128000 Forever root@[redacted]:/# ollama -v ollama version is 0.11.4 ``` ``` root@[redacted]:~# nvidia-smi Sun Aug 10 15:02:52 2025 +-----------------------------------------------------------------------------------------+ | NVIDIA-SMI 570.172.08 Driver Version: 570.172.08 CUDA Version: 12.8 | |-----------------------------------------+------------------------+----------------------+ | GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC | | Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. | | | | MIG M. | |=========================================+========================+======================| | 0 NVIDIA RTX A4000 On | 00000000:01:00.0 Off | Off | | 42% 65C P0 55W / 140W | 13131MiB / 16376MiB | 15% Default | | | | N/A | +-----------------------------------------+------------------------+----------------------+ | 1 NVIDIA GeForce RTX 4060 Ti On | 00000000:02:00.0 Off | N/A | | 0% 59C P0 46W / 165W | 15239MiB / 16380MiB | 14% Default | | | | N/A | +-----------------------------------------+------------------------+----------------------+ | 2 NVIDIA RTX A4000 On | 00000000:03:00.0 Off | Off | | 47% 69C P0 61W / 140W | 15051MiB / 16376MiB | 15% Default | | | | N/A | +-----------------------------------------+------------------------+----------------------+ | 3 NVIDIA RTX A4000 On | 00000000:04:00.0 Off | Off | | 48% 70C P0 52W / 140W | 13387MiB / 16376MiB | 14% Default | | | | N/A | +-----------------------------------------+------------------------+----------------------+ | 4 NVIDIA RTX A4000 On | 00000000:05:00.0 Off | Off | | 48% 70C P0 57W / 140W | 12809MiB / 16376MiB | 17% Default | | | | N/A | +-----------------------------------------+------------------------+----------------------+ +-----------------------------------------------------------------------------------------+ | Processes: | | GPU GI CI PID Type Process name GPU Memory | | ID ID Usage | |=========================================================================================| | 0 N/A N/A 2417 C /usr/bin/ollama 320MiB | | 1 N/A N/A 1367 C /usr/local/bin/python3 222MiB | | 1 N/A N/A 2417 C /usr/bin/ollama 280MiB | | 2 N/A N/A 2417 C /usr/bin/ollama 320MiB | | 3 N/A N/A 2417 C /usr/bin/ollama 320MiB | | 4 N/A N/A 2417 C /usr/bin/ollama 384MiB | +-----------------------------------------------------------------------------------------+ ```
Author
Owner

@rick-github commented on GitHub (Aug 10, 2025):

> Sorry for the spam, I was trying to find all similar issues so far to link them here since this is the earliest issue about what seems to be generally the same problem

Most of these issues are because the context is too big. Reduce context, reduce VRAM.

<!-- gh-comment-id:3172773038 --> @rick-github commented on GitHub (Aug 10, 2025): > Sorry for the spam, I was trying to find all similar issues so far to link them here since this is the earliest issue about what seems to be generally the same problem Most of these issues are because the context is too big. Reduce context, reduce VRAM.
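For anyone hitting this, a few hedged examples of lowering the context Ollama plans for (the 8192 below is just an example; pick a value that fits your VRAM):

```
# per session, in the CLI
ollama run gpt-oss:120b
>>> /set parameter num_ctx 8192

# server-wide default (this is the OLLAMA_CONTEXT_LENGTH value visible in the server config log)
OLLAMA_CONTEXT_LENGTH=8192 ollama serve

# per request, over the API
curl http://localhost:11434/api/generate \
  -d '{"model": "gpt-oss:120b", "prompt": "hi", "options": {"num_ctx": 8192}}'
```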
Author
Owner

@alienatedsec commented on GitHub (Aug 11, 2025):

Not sure what happened recently - some wrong reporting with the latest version

root@[redacted]:/# ollama ps
NAME            ID              SIZE     PROCESSOR    CONTEXT    UNTIL   
gpt-oss:120b    735371f916a9    69 GB    100% CPU     128000     Forever    
root@[redacted]:/# ollama -v
ollama version is 0.11.4
root@[redacted]:~# nvidia-smi
Sun Aug 10 15:02:52 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.172.08             Driver Version: 570.172.08     CUDA Version: 12.8     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA RTX A4000               On  |   00000000:01:00.0 Off |                  Off |
| 42%   65C    P0             55W /  140W |   13131MiB /  16376MiB |     15%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA GeForce RTX 4060 Ti     On  |   00000000:02:00.0 Off |                  N/A |
|  0%   59C    P0             46W /  165W |   15239MiB /  16380MiB |     14%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   2  NVIDIA RTX A4000               On  |   00000000:03:00.0 Off |                  Off |
| 47%   69C    P0             61W /  140W |   15051MiB /  16376MiB |     15%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   3  NVIDIA RTX A4000               On  |   00000000:04:00.0 Off |                  Off |
| 48%   70C    P0             52W /  140W |   13387MiB /  16376MiB |     14%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   4  NVIDIA RTX A4000               On  |   00000000:05:00.0 Off |                  Off |
| 48%   70C    P0             57W /  140W |   12809MiB /  16376MiB |     17%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A            2417      C   /usr/bin/ollama                         320MiB |
|    1   N/A  N/A            1367      C   /usr/local/bin/python3                  222MiB |
|    1   N/A  N/A            2417      C   /usr/bin/ollama                         280MiB |
|    2   N/A  N/A            2417      C   /usr/bin/ollama                         320MiB |
|    3   N/A  N/A            2417      C   /usr/bin/ollama                         320MiB |
|    4   N/A  N/A            2417      C   /usr/bin/ollama                         384MiB |
+-----------------------------------------------------------------------------------------+

Attaching logs in case they're relevant. These are from today, but the ollama ps output is the same as yesterday's.

_ollama_logs.txt

<!-- gh-comment-id:3175264514 --> @alienatedsec commented on GitHub (Aug 11, 2025): > Not sure what happened recently - some wrong reporting with the latest version > > ``` > root@[redacted]:/# ollama ps > NAME ID SIZE PROCESSOR CONTEXT UNTIL > gpt-oss:120b 735371f916a9 69 GB 100% CPU 128000 Forever > root@[redacted]:/# ollama -v > ollama version is 0.11.4 > ``` > > ``` > root@[redacted]:~# nvidia-smi > Sun Aug 10 15:02:52 2025 > +-----------------------------------------------------------------------------------------+ > | NVIDIA-SMI 570.172.08 Driver Version: 570.172.08 CUDA Version: 12.8 | > |-----------------------------------------+------------------------+----------------------+ > | GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC | > | Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. | > | | | MIG M. | > |=========================================+========================+======================| > | 0 NVIDIA RTX A4000 On | 00000000:01:00.0 Off | Off | > | 42% 65C P0 55W / 140W | 13131MiB / 16376MiB | 15% Default | > | | | N/A | > +-----------------------------------------+------------------------+----------------------+ > | 1 NVIDIA GeForce RTX 4060 Ti On | 00000000:02:00.0 Off | N/A | > | 0% 59C P0 46W / 165W | 15239MiB / 16380MiB | 14% Default | > | | | N/A | > +-----------------------------------------+------------------------+----------------------+ > | 2 NVIDIA RTX A4000 On | 00000000:03:00.0 Off | Off | > | 47% 69C P0 61W / 140W | 15051MiB / 16376MiB | 15% Default | > | | | N/A | > +-----------------------------------------+------------------------+----------------------+ > | 3 NVIDIA RTX A4000 On | 00000000:04:00.0 Off | Off | > | 48% 70C P0 52W / 140W | 13387MiB / 16376MiB | 14% Default | > | | | N/A | > +-----------------------------------------+------------------------+----------------------+ > | 4 NVIDIA RTX A4000 On | 00000000:05:00.0 Off | Off | > | 48% 70C P0 57W / 140W | 12809MiB / 16376MiB | 17% Default | > | | | N/A | > +-----------------------------------------+------------------------+----------------------+ > > +-----------------------------------------------------------------------------------------+ > | Processes: | > | GPU GI CI PID Type Process name GPU Memory | > | ID ID Usage | > |=========================================================================================| > | 0 N/A N/A 2417 C /usr/bin/ollama 320MiB | > | 1 N/A N/A 1367 C /usr/local/bin/python3 222MiB | > | 1 N/A N/A 2417 C /usr/bin/ollama 280MiB | > | 2 N/A N/A 2417 C /usr/bin/ollama 320MiB | > | 3 N/A N/A 2417 C /usr/bin/ollama 320MiB | > | 4 N/A N/A 2417 C /usr/bin/ollama 384MiB | > +-----------------------------------------------------------------------------------------+ > ``` Attaching logs if relevant. These are from today, but the same `ollama ps` output as of yesterday [_ollama_logs.txt](https://github.com/user-attachments/files/21717377/_ollama_logs.txt)
Author
Owner

@rick-github commented on GitHub (Aug 11, 2025):

What is wrong with the reporting?

<!-- gh-comment-id:3175270460 --> @rick-github commented on GitHub (Aug 11, 2025): What is wrong with the reporting?
Author
Owner

@alienatedsec commented on GitHub (Aug 11, 2025):

@rick-github the CPU usage

root@[redacted]:/# ollama ps
NAME            ID              SIZE     PROCESSOR    CONTEXT    UNTIL   
gpt-oss:120b    735371f916a9    69 GB    100% CPU     128000     Forever    
root@[redacted]:/# ollama -v
ollama version is 0.11.4

It seems the below could also be related, as I also use OpenWebUI

> I have a feeling everyone experiencing this is hitting Ollama via OpenWebUI? Command line Ollama works with 100% GPU, but hitting it with open web ui goes 100% CPU

<!-- gh-comment-id:3175278677 --> @alienatedsec commented on GitHub (Aug 11, 2025): @rick-github the CPU usage ``` root@[redacted]:/# ollama ps NAME ID SIZE PROCESSOR CONTEXT UNTIL gpt-oss:120b 735371f916a9 69 GB 100% CPU 128000 Forever root@[redacted]:/# ollama -v ollama version is 0.11.4 ``` Seems the below could also be related, as I also use `OpenWebUI` > I have a feeling everyone experiencing this is hitting Ollama via OpenWebUI? Command line Ollama works with 100% GPU, but hitting it with open web ui goes 100% CPU
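One way to check whether the front end is the variable is to compare what the server actually started in each case; the runner's command line shows the context and GPU layer count the scheduler settled on:

```
# --ctx-size and --n-gpu-layers reflect what the scheduler decided for this load
ps wwh p$(pidof ollama)
# the PROCESSOR / CONTEXT columns come from the same estimate
ollama ps
```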
Author
Owner

@rick-github commented on GitHub (Aug 11, 2025):

The model is loaded 100% in CPU, which is correct.

<!-- gh-comment-id:3175283469 --> @rick-github commented on GitHub (Aug 11, 2025): The model is loaded 100% in CPU, which is correct.
Author
Owner

@alienatedsec commented on GitHub (Aug 11, 2025):

@rick-github it doesn't feel that way, unless I am missing something.

<!-- gh-comment-id:3175300219 --> @alienatedsec commented on GitHub (Aug 11, 2025): @rick-github it doesn't feel that way, unless I am missing something.
Author
Owner

@rick-github commented on GitHub (Aug 11, 2025):

Could you define "feel"?

<!-- gh-comment-id:3175306223 --> @rick-github commented on GitHub (Aug 11, 2025): Could you define "feel"?
Author
Owner

@alienatedsec commented on GitHub (Aug 11, 2025):

@rick-github could you explain this?

root@[redacted]:~# nvidia-smi
Sun Aug 10 15:02:52 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.172.08             Driver Version: 570.172.08     CUDA Version: 12.8     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA RTX A4000               On  |   00000000:01:00.0 Off |                  Off |
| 42%   65C    P0             55W /  140W |   13131MiB /  16376MiB |     15%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA GeForce RTX 4060 Ti     On  |   00000000:02:00.0 Off |                  N/A |
|  0%   59C    P0             46W /  165W |   15239MiB /  16380MiB |     14%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   2  NVIDIA RTX A4000               On  |   00000000:03:00.0 Off |                  Off |
| 47%   69C    P0             61W /  140W |   15051MiB /  16376MiB |     15%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   3  NVIDIA RTX A4000               On  |   00000000:04:00.0 Off |                  Off |
| 48%   70C    P0             52W /  140W |   13387MiB /  16376MiB |     14%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   4  NVIDIA RTX A4000               On  |   00000000:05:00.0 Off |                  Off |
| 48%   70C    P0             57W /  140W |   12809MiB /  16376MiB |     17%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A            2417      C   /usr/bin/ollama                         320MiB |
|    1   N/A  N/A            1367      C   /usr/local/bin/python3                  222MiB |
|    1   N/A  N/A            2417      C   /usr/bin/ollama                         280MiB |
|    2   N/A  N/A            2417      C   /usr/bin/ollama                         320MiB |
|    3   N/A  N/A            2417      C   /usr/bin/ollama                         320MiB |
|    4   N/A  N/A            2417      C   /usr/bin/ollama                         384MiB |
+-----------------------------------------------------------------------------------------+

Edit - I don't understand how the model can be reported in Ollama as 100% CPU while, at the same time, around 80-90% of the GPU VRAM is utilised.

<!-- gh-comment-id:3175311181 --> @alienatedsec commented on GitHub (Aug 11, 2025): @rick-github could explain this? ``` root@[redacted]:~# nvidia-smi Sun Aug 10 15:02:52 2025 +-----------------------------------------------------------------------------------------+ | NVIDIA-SMI 570.172.08 Driver Version: 570.172.08 CUDA Version: 12.8 | |-----------------------------------------+------------------------+----------------------+ | GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC | | Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. | | | | MIG M. | |=========================================+========================+======================| | 0 NVIDIA RTX A4000 On | 00000000:01:00.0 Off | Off | | 42% 65C P0 55W / 140W | 13131MiB / 16376MiB | 15% Default | | | | N/A | +-----------------------------------------+------------------------+----------------------+ | 1 NVIDIA GeForce RTX 4060 Ti On | 00000000:02:00.0 Off | N/A | | 0% 59C P0 46W / 165W | 15239MiB / 16380MiB | 14% Default | | | | N/A | +-----------------------------------------+------------------------+----------------------+ | 2 NVIDIA RTX A4000 On | 00000000:03:00.0 Off | Off | | 47% 69C P0 61W / 140W | 15051MiB / 16376MiB | 15% Default | | | | N/A | +-----------------------------------------+------------------------+----------------------+ | 3 NVIDIA RTX A4000 On | 00000000:04:00.0 Off | Off | | 48% 70C P0 52W / 140W | 13387MiB / 16376MiB | 14% Default | | | | N/A | +-----------------------------------------+------------------------+----------------------+ | 4 NVIDIA RTX A4000 On | 00000000:05:00.0 Off | Off | | 48% 70C P0 57W / 140W | 12809MiB / 16376MiB | 17% Default | | | | N/A | +-----------------------------------------+------------------------+----------------------+ +-----------------------------------------------------------------------------------------+ | Processes: | | GPU GI CI PID Type Process name GPU Memory | | ID ID Usage | |=========================================================================================| | 0 N/A N/A 2417 C /usr/bin/ollama 320MiB | | 1 N/A N/A 1367 C /usr/local/bin/python3 222MiB | | 1 N/A N/A 2417 C /usr/bin/ollama 280MiB | | 2 N/A N/A 2417 C /usr/bin/ollama 320MiB | | 3 N/A N/A 2417 C /usr/bin/ollama 320MiB | | 4 N/A N/A 2417 C /usr/bin/ollama 384MiB | +-----------------------------------------------------------------------------------------+ ``` Edit - I don't understand how the model is reported in Ollama as 100% in CPU and at the same time the GPU VRAM (around 80%-90%) is utilised?
Author
Owner

@rick-github commented on GitHub (Aug 11, 2025):

What's the output of

ps wwh p$(pidof ollama)
<!-- gh-comment-id:3175346934 --> @rick-github commented on GitHub (Aug 11, 2025): What's the output of ``` ps wwh p$(pidof ollama) ```
Author
Owner

@alienatedsec commented on GitHub (Aug 11, 2025):

@rick-github

root@[redacted]:/# ps wwh p$(pidof ollama)
      1 ?        Ssl    0:18 /bin/ollama serve
    125 ?        Rl     1:58 /usr/bin/ollama runner --ollama-engine --model /root/.ollama/models/blobs/sha256-90a618fe6ff21b09ca968df959104eb650658b0bef0faef785c18c2795d993e3 --ctx-size 128000 --batch-size 512 --n-gpu-layers 256 --threads 16 --no-mmap --parallel 1 --port 44643
<!-- gh-comment-id:3175373395 --> @alienatedsec commented on GitHub (Aug 11, 2025): @rick-github ``` root@[redacted]:/# ps wwh p$(pidof ollama) 1 ? Ssl 0:18 /bin/ollama serve 125 ? Rl 1:58 /usr/bin/ollama runner --ollama-engine --model /root/.ollama/models/blobs/sha256-90a618fe6ff21b09ca968df959104eb650658b0bef0faef785c18c2795d993e3 --ctx-size 128000 --batch-size 512 --n-gpu-layers 256 --threads 16 --no-mmap --parallel 1 --port 44643 ```
Author
Owner

@rick-github commented on GitHub (Aug 11, 2025):

What's the output of

ps wwh p2417
<!-- gh-comment-id:3175376325 --> @rick-github commented on GitHub (Aug 11, 2025): What's the output of ``` ps wwh p2417 ```
Author
Owner

@alienatedsec commented on GitHub (Aug 11, 2025):

no output

<!-- gh-comment-id:3175382139 --> @alienatedsec commented on GitHub (Aug 11, 2025): no output
Author
Owner

@rick-github commented on GitHub (Aug 11, 2025):

What's the output of

nvidia-smi
<!-- gh-comment-id:3175383997 --> @rick-github commented on GitHub (Aug 11, 2025): What's the output of ``` nvidia-smi ```
Author
Owner

@alienatedsec commented on GitHub (Aug 11, 2025):

now I understand - which process do you want?

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.172.08             Driver Version: 570.172.08     CUDA Version: 12.8     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA RTX A4000               On  |   00000000:01:00.0 Off |                  Off |
| 60%   68C    P8             13W /  140W |   15149MiB /  16376MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA GeForce RTX 4060 Ti     On  |   00000000:02:00.0 Off |                  N/A |
| 34%   47C    P8              8W /  165W |   15957MiB /  16380MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   2  NVIDIA RTX A4000               On  |   00000000:03:00.0 Off |                  Off |
| 58%   66C    P8             15W /  140W |   15983MiB /  16376MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   3  NVIDIA RTX A4000               On  |   00000000:04:00.0 Off |                  Off |
| 64%   72C    P8             11W /  140W |   15405MiB /  16376MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   4  NVIDIA RTX A4000               On  |   00000000:05:00.0 Off |                  Off |
| 67%   75C    P8             13W /  140W |   14633MiB /  16376MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A            2269      C   /usr/bin/ollama                         674MiB |
|    1   N/A  N/A            1457      C   /usr/local/bin/python3                  222MiB |
|    1   N/A  N/A            2269      C   /usr/bin/ollama                         636MiB |
|    2   N/A  N/A            2269      C   /usr/bin/ollama                         676MiB |
|    3   N/A  N/A            2269      C   /usr/bin/ollama                         674MiB |
|    4   N/A  N/A            2269      C   /usr/bin/ollama                         670MiB |
+-----------------------------------------------------------------------------------------+
<!-- gh-comment-id:3175390301 --> @alienatedsec commented on GitHub (Aug 11, 2025): now I understand - which process you want? ``` +-----------------------------------------------------------------------------------------+ | NVIDIA-SMI 570.172.08 Driver Version: 570.172.08 CUDA Version: 12.8 | |-----------------------------------------+------------------------+----------------------+ | GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC | | Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. | | | | MIG M. | |=========================================+========================+======================| | 0 NVIDIA RTX A4000 On | 00000000:01:00.0 Off | Off | | 60% 68C P8 13W / 140W | 15149MiB / 16376MiB | 0% Default | | | | N/A | +-----------------------------------------+------------------------+----------------------+ | 1 NVIDIA GeForce RTX 4060 Ti On | 00000000:02:00.0 Off | N/A | | 34% 47C P8 8W / 165W | 15957MiB / 16380MiB | 0% Default | | | | N/A | +-----------------------------------------+------------------------+----------------------+ | 2 NVIDIA RTX A4000 On | 00000000:03:00.0 Off | Off | | 58% 66C P8 15W / 140W | 15983MiB / 16376MiB | 0% Default | | | | N/A | +-----------------------------------------+------------------------+----------------------+ | 3 NVIDIA RTX A4000 On | 00000000:04:00.0 Off | Off | | 64% 72C P8 11W / 140W | 15405MiB / 16376MiB | 0% Default | | | | N/A | +-----------------------------------------+------------------------+----------------------+ | 4 NVIDIA RTX A4000 On | 00000000:05:00.0 Off | Off | | 67% 75C P8 13W / 140W | 14633MiB / 16376MiB | 0% Default | | | | N/A | +-----------------------------------------+------------------------+----------------------+ +-----------------------------------------------------------------------------------------+ | Processes: | | GPU GI CI PID Type Process name GPU Memory | | ID ID Usage | |=========================================================================================| | 0 N/A N/A 2269 C /usr/bin/ollama 674MiB | | 1 N/A N/A 1457 C /usr/local/bin/python3 222MiB | | 1 N/A N/A 2269 C /usr/bin/ollama 636MiB | | 2 N/A N/A 2269 C /usr/bin/ollama 676MiB | | 3 N/A N/A 2269 C /usr/bin/ollama 674MiB | | 4 N/A N/A 2269 C /usr/bin/ollama 670MiB | +-----------------------------------------------------------------------------------------+ ```
Author
Owner

@rick-github commented on GitHub (Aug 11, 2025):

So you are running in a container?

<!-- gh-comment-id:3175392654 --> @rick-github commented on GitHub (Aug 11, 2025): So you are running in a container?
Author
Owner

@alienatedsec commented on GitHub (Aug 11, 2025):

> So you are running in a container?

yes

<!-- gh-comment-id:3175395057 --> @alienatedsec commented on GitHub (Aug 11, 2025): > So you are running in a container? yes
Author
Owner

@rick-github commented on GitHub (Aug 11, 2025):

What's the output of the following outside of the container

ps wwh p$(pidof ollama)
<!-- gh-comment-id:3175396500 --> @rick-github commented on GitHub (Aug 11, 2025): What's the output of the following outside of the container ``` ps wwh p$(pidof ollama) ```
Author
Owner

@alienatedsec commented on GitHub (Aug 11, 2025):

Outside the container

root@[redacted]:~# ps wwh p$(pidof ollama)
   1466 ?        Ssl    0:20 /bin/ollama serve
   2269 ?        Sl     8:43 /usr/bin/ollama runner --ollama-engine --model /root/.ollama/models/blobs/sha256-90a618fe6ff21b09ca968df959104eb650658b0bef0faef785c18c2795d993e3 --ctx-size 128000 --batch-size 512 --n-gpu-layers 256 --threads 16 --no-mmap --parallel 1 --port 44643
<!-- gh-comment-id:3175401814 --> @alienatedsec commented on GitHub (Aug 11, 2025): Outside the container ``` root@[redacted]:~# ps wwh p$(pidof ollama) 1466 ? Ssl 0:20 /bin/ollama serve 2269 ? Sl 8:43 /usr/bin/ollama runner --ollama-engine --model /root/.ollama/models/blobs/sha256-90a618fe6ff21b09ca968df959104eb650658b0bef0faef785c18c2795d993e3 --ctx-size 128000 --batch-size 512 --n-gpu-layers 256 --threads 16 --no-mmap --parallel 1 --port 44643 ```
Author
Owner

@rick-github commented on GitHub (Aug 11, 2025):

Dump the logs from the container and attach.

<!-- gh-comment-id:3175408796 --> @rick-github commented on GitHub (Aug 11, 2025): Dump the logs from the container and attach.
Author
Owner

@rick-github commented on GitHub (Aug 11, 2025):

What does ollama ps show?

<!-- gh-comment-id:3175410118 --> @rick-github commented on GitHub (Aug 11, 2025): What does `ollama ps` show?
Author
Owner

@alienatedsec commented on GitHub (Aug 11, 2025):

> Dump the logs from the container and attach.

_ollama_logs.txt

> What does ollama ps show?

root@[redacted]:/# ollama ps
NAME            ID              SIZE     PROCESSOR    CONTEXT    UNTIL   
gpt-oss:120b    735371f916a9    69 GB    100% CPU     128000     Forever    
root@[redacted]:/# 
<!-- gh-comment-id:3175453805 --> @alienatedsec commented on GitHub (Aug 11, 2025): > Dump the logs from the container and attach. [_ollama_logs.txt](https://github.com/user-attachments/files/21718425/_ollama_logs.txt) > What does `ollama ps` show? ``` root@[redacted]:/# ollama ps NAME ID SIZE PROCESSOR CONTEXT UNTIL gpt-oss:120b 735371f916a9 69 GB 100% CPU 128000 Forever root@[redacted]:/# ```
Author
Owner

@rick-github commented on GitHub (Aug 11, 2025):

You have modified the model and set num_gpu=256. Originally, ollama estimated that no layers would fit on the GPU given the size of the memory graph, so the output of ollama ps shows the result of that estimation. When it came time for the runner to allocate layers, the override took precedence and caused the runner to allocate all layers to the GPU. It didn't OOM because you have set GGML_CUDA_ENABLE_UNIFIED_MEMORY, which results in the layers overflowing into system RAM. While this prevents an OOM, there is a potential performance hit.

I would be interested to see the statistics (ollama run gpt-oss:120b --verbose 'why is the sky blue?') of this setup versus one where you load an unmodified version of the model and let it run on CPU.

<!-- gh-comment-id:3175552664 --> @rick-github commented on GitHub (Aug 11, 2025): You have modified the model and set `num_gpu=256`. Originally, ollama estimated that no layers would fit on the GPU given the size of the memory graph, so the output of `ollama ps` shows the result of that estimation. When it came time for the runner to allocate layers, the override took precedence and caused the runner to allocate all layers to the GPU. It didn't OOM because you have set `GGML_CUDA_ENABLE_UNIFIED_MEMORY`, which results in the layers overflowing in to system RAM. While this prevents an OOM, there is a potential [performance hit](https://github.com/ollama/ollama/issues/7584#issuecomment-2466715900). I would be interested to see the statistics (`ollama run gpt-oss:120b --verbose 'why is the sky blue?'`) of this setup verus one where you load an unmodified version of the model and let it run in CPU.
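A hedged sketch of how to check for and back out the override described above, whether it lives in the local Modelfile or in the front end's request options:

```
# look for a "PARAMETER num_gpu ..." line in the local model definition
ollama show --modelfile gpt-oss:120b

# load a clean copy and let the scheduler pick the CPU/GPU split on its own
ollama rm gpt-oss:120b && ollama pull gpt-oss:120b

# start the server without the unified-memory spill-over before comparing timings
unset GGML_CUDA_ENABLE_UNIFIED_MEMORY
```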
Author
Owner

@alienatedsec commented on GitHub (Aug 11, 2025):

> I would be interested to see the statistics (ollama run gpt-oss:120b --verbose 'why is the sky blue?') of this setup versus one where you load an unmodified version of the model and let it run on CPU.

ollama run gpt-oss:120b --verbose 'why is the sky blue?'
Thinking...
The user asks why the sky is blue. Provide explanation of Rayleigh scattering, shorter wavelengths scatter more, human eye sensitivity, etc. Also mention why 
sunrise/sunset appear red, etc. Maybe ask follow-up? Probably just answer. Should be concise but thorough. Use plain language, some details.
...done thinking.

**Short answer:**  
The sky looks blue because molecules and tiny particles in Earth’s atmosphere scatter sunlight. Short‑wavelength (blue and violet) light is scattered much more 
efficiently than longer‑wavelength (red, orange, yellow) light, and our eyes are more sensitive to blue than to violet. The scattered blue light reaches us from every 
direction, giving the sky its characteristic color.

---

## How it works – a step‑by‑step explanation

| Step | What happens | Why it matters |
|------|--------------|----------------|
| **1. Sunlight reaches Earth** | Sunlight is a mixture of all visible colors (plus infrared and ultraviolet). If you split it with a prism you see a continuous spectrum 
from violet (≈380 nm) to red (≈750 nm). | The light that enters the atmosphere already contains blue light. |
| **2. Light meets the atmosphere** | The atmosphere is filled with gases (N₂, O₂, Ar, CO₂) and tiny particles (dust, water droplets, aerosols). These are **much 
smaller** than the wavelength of visible light. | When particles are much smaller than the wavelength, they cause **Rayleigh scattering**. |
| **3. Rayleigh scattering favors short wavelengths** | The scattering intensity \(I\) varies roughly as \(\frac{1}{\lambda^4}\) (the inverse fourth power of 
wavelength). <br>• Blue (~450 nm) is scattered about **10×** more than green (~550 nm). <br>• Violet (~400 nm) is scattered ~**16×** more than red (~650 nm). | This 
strong wavelength dependence means the sky is flooded with scattered blue (and violet) light from every direction. |
| **4. Our eyes see blue, not violet** | The human eye’s photoreceptors (cones) are less sensitive to violet, and some of the violet is absorbed by the upper 
atmosphere’s ozone layer. | The net result is that the sky appears **blue** rather than violet. |
| **5. Direct sunlight still looks white** | The light that travels straight from the Sun to our eyes is only *partially* scattered, so it retains most of its original 
mix of colors and looks white (or slightly yellowish). | This is why the Sun itself isn’t blue even though the surrounding sky is. |
| **6. Sunrise & sunset turn red** | When the Sun is low on the horizon, its light passes through **much more** atmosphere (up to 40 × the thickness at noon). The 
short‑wavelength light gets scattered out of the direct line of sight long before it reaches you, leaving the longer‑wavelength reds and oranges to dominate the direct 
beam. | That’s why sunrises and sunsets are spectacularly red/orange. |

---

## A little math (optional)

The Rayleigh scattering cross‑section for a particle of radius \(a\) (much smaller than wavelength \(\lambda\)) is roughly  

\[
\sigma \propto \frac{a^6}{\lambda^4}\,
\left(\frac{n^2-1}{n^2+2}\right)^2
\]

where \(n\) is the refractive index of the particle. The \(\lambda^{-4}\) term is the key: halve the wavelength and the scattering becomes 16 times stronger.

---

## Common follow‑up questions

| Question | Quick answer |
|----------|--------------|
| **Why isn’t the sky black at night?** | At night there’s no Sun to provide the light that gets scattered. The sky appears black because we’re looking into space, not 
at scattered sunlight. |
| **Does the sky look different on other planets?** | Yes. Mars, with a thin CO₂ atmosphere and a lot of fine dust, has a butterscotch‑orange sky. Titan’s dense 
nitrogen‑methane haze makes its sky appear orange‑brown. |
| **What about the “blue hour” in photography?** | That’s just the period after sunset (or before sunrise) when the Sun is just below the horizon; scattered blue light 
still fills the sky, giving a deep, even‑tone blue. |
| **Why do clouds look white, not blue?** | Cloud droplets are **much larger** than the wavelength of light, so they scatter all colors roughly equally (Mie scattering). 
The mixture of all colors appears white. |

---

### Bottom line

The sky is blue because Earth’s tiny atmospheric molecules scatter short‑wavelength light far more efficiently than long‑wavelength light, and our eyes are tuned to 
perceive the resulting surplus of blue light. The same scattering principle explains why sunsets are red and why other planets can have dramatically different sky colors.

total duration:       7m2.567317799s
load duration:        1m33.40423142s
prompt eval count:    73 token(s)
prompt eval duration: 12.588667124s
prompt eval rate:     5.80 tokens/s
eval count:           1060 token(s)
eval duration:        5m16.571928276s
eval rate:            3.35 tokens/s

vs OpenWebUI

Image
<!-- gh-comment-id:3176056718 --> @alienatedsec commented on GitHub (Aug 11, 2025): > I would be interested to see the statistics (`ollama run gpt-oss:120b --verbose 'why is the sky blue?'`) of this setup verus one where you load an unmodified version of the model and let it run in CPU. ``` ollama run gpt-oss:120b --verbose 'why is the sky blue?' Thinking... The user asks why the sky is blue. Provide explanation of Rayleigh scattering, shorter wavelengths scatter more, human eye sensitivity, etc. Also mention why sunrise/sunset appear red, etc. Maybe ask follow-up? Probably just answer. Should be concise but thorough. Use plain language, some details. ...done thinking. **Short answer:** The sky looks blue because molecules and tiny particles in Earth’s atmosphere scatter sunlight. Short‑wavelength (blue and violet) light is scattered much more efficiently than longer‑wavelength (red, orange, yellow) light, and our eyes are more sensitive to blue than to violet. The scattered blue light reaches us from every direction, giving the sky its characteristic color. --- ## How it works – a step‑by‑step explanation | Step | What happens | Why it matters | |------|--------------|----------------| | **1. Sunlight reaches Earth** | Sunlight is a mixture of all visible colors (plus infrared and ultraviolet). If you split it with a prism you see a continuous spectrum from violet (≈380 nm) to red (≈750 nm). | The light that enters the atmosphere already contains blue light. | | **2. Light meets the atmosphere** | The atmosphere is filled with gases (N₂, O₂, Ar, CO₂) and tiny particles (dust, water droplets, aerosols). These are **much smaller** than the wavelength of visible light. | When particles are much smaller than the wavelength, they cause **Rayleigh scattering**. | | **3. Rayleigh scattering favors short wavelengths** | The scattering intensity \(I\) varies roughly as \(\frac{1}{\lambda^4}\) (the inverse fourth power of wavelength). <br>• Blue (~450 nm) is scattered about **10×** more than green (~550 nm). <br>• Violet (~400 nm) is scattered ~**16×** more than red (~650 nm). | This strong wavelength dependence means the sky is flooded with scattered blue (and violet) light from every direction. | | **4. Our eyes see blue, not violet** | The human eye’s photoreceptors (cones) are less sensitive to violet, and some of the violet is absorbed by the upper atmosphere’s ozone layer. | The net result is that the sky appears **blue** rather than violet. | | **5. Direct sunlight still looks white** | The light that travels straight from the Sun to our eyes is only *partially* scattered, so it retains most of its original mix of colors and looks white (or slightly yellowish). | This is why the Sun itself isn’t blue even though the surrounding sky is. | | **6. Sunrise & sunset turn red** | When the Sun is low on the horizon, its light passes through **much more** atmosphere (up to 40 × the thickness at noon). The short‑wavelength light gets scattered out of the direct line of sight long before it reaches you, leaving the longer‑wavelength reds and oranges to dominate the direct beam. | That’s why sunrises and sunsets are spectacularly red/orange. | --- ## A little math (optional) The Rayleigh scattering cross‑section for a particle of radius \(a\) (much smaller than wavelength \(\lambda\)) is roughly \[ \sigma \propto \frac{a^6}{\lambda^4}\, \left(\frac{n^2-1}{n^2+2}\right)^2 \] where \(n\) is the refractive index of the particle. 
The \(\lambda^{-4}\) term is the key: halve the wavelength and the scattering becomes 16 times stronger. --- ## Common follow‑up questions | Question | Quick answer | |----------|--------------| | **Why isn’t the sky black at night?** | At night there’s no Sun to provide the light that gets scattered. The sky appears black because we’re looking into space, not at scattered sunlight. | | **Does the sky look different on other planets?** | Yes. Mars, with a thin CO₂ atmosphere and a lot of fine dust, has a butterscotch‑orange sky. Titan’s dense nitrogen‑methane haze makes its sky appear orange‑brown. | | **What about the “blue hour” in photography?** | That’s just the period after sunset (or before sunrise) when the Sun is just below the horizon; scattered blue light still fills the sky, giving a deep, even‑tone blue. | | **Why do clouds look white, not blue?** | Cloud droplets are **much larger** than the wavelength of light, so they scatter all colors roughly equally (Mie scattering). The mixture of all colors appears white. | --- ### Bottom line The sky is blue because Earth’s tiny atmospheric molecules scatter short‑wavelength light far more efficiently than long‑wavelength light, and our eyes are tuned to perceive the resulting surplus of blue light. The same scattering principle explains why sunsets are red and why other planets can have dramatically different sky colors. total duration: 7m2.567317799s load duration: 1m33.40423142s prompt eval count: 73 token(s) prompt eval duration: 12.588667124s prompt eval rate: 5.80 tokens/s eval count: 1060 token(s) eval duration: 5m16.571928276s eval rate: 3.35 tokens/s ``` vs OpenWebUI <img width="209" height="283" alt="Image" src="https://github.com/user-attachments/assets/54741e49-61c5-4a3c-b687-ce03578ef157" />
Author
Owner

@alienatedsec commented on GitHub (Aug 11, 2025):

Just ran it unmodified from OpenWebUI but left the context size at 128k. Now it's fully loaded onto the CPU and the performance is not great.

ggml_cuda_init: found 5 CUDA devices:
  Device 0: NVIDIA GeForce RTX 4060 Ti, compute capability 8.9, VMM: yes
  Device 1: NVIDIA RTX A4000, compute capability 8.6, VMM: yes
  Device 2: NVIDIA RTX A4000, compute capability 8.6, VMM: yes
  Device 3: NVIDIA RTX A4000, compute capability 8.6, VMM: yes
  Device 4: NVIDIA RTX A4000, compute capability 8.6, VMM: yes
load_backend: loaded CUDA backend from /usr/lib/ollama/libggml-cuda.so
load_backend: loaded CPU backend from /usr/lib/ollama/libggml-cpu-x64.so
time=2025-08-11T17:52:11.790Z level=INFO source=ggml.go:104 msg=system CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 CUDA.0.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.0.USE_GRAPHS=1 CUDA.0.PEER_MAX_BATCH_SIZE=128 CUDA.1.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.1.USE_GRAPHS=1 CUDA.1.PEER_MAX_BATCH_SIZE=128 CUDA.2.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.2.USE_GRAPHS=1 CUDA.2.PEER_MAX_BATCH_SIZE=128 CUDA.3.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.3.USE_GRAPHS=1 CUDA.3.PEER_MAX_BATCH_SIZE=128 CUDA.4.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.4.USE_GRAPHS=1 CUDA.4.PEER_MAX_BATCH_SIZE=128 compiler=cgo(gcc)
time=2025-08-11T17:52:12.061Z level=INFO source=ggml.go:365 msg="offloading 0 repeating layers to GPU"
time=2025-08-11T17:52:12.061Z level=INFO source=ggml.go:369 msg="offloading output layer to CPU"
time=2025-08-11T17:52:12.061Z level=INFO source=ggml.go:376 msg="offloaded 0/37 layers to GPU"
time=2025-08-11T17:52:12.061Z level=INFO source=ggml.go:379 msg="model weights" buffer=CPU size="60.8 GiB"
time=2025-08-11T17:52:18.855Z level=INFO source=ggml.go:668 msg="compute graph" backend=CUDA0 buffer_type=CUDA0 size="0 B"
time=2025-08-11T17:52:18.855Z level=INFO source=ggml.go:668 msg="compute graph" backend=CUDA1 buffer_type=CUDA1 size="0 B"
time=2025-08-11T17:52:18.855Z level=INFO source=ggml.go:668 msg="compute graph" backend=CUDA2 buffer_type=CUDA2 size="0 B"
time=2025-08-11T17:52:18.855Z level=INFO source=ggml.go:668 msg="compute graph" backend=CUDA3 buffer_type=CUDA3 size="0 B"
time=2025-08-11T17:52:18.855Z level=INFO source=ggml.go:668 msg="compute graph" backend=CUDA4 buffer_type=CUDA4 size="0 B"
time=2025-08-11T17:52:18.855Z level=INFO source=ggml.go:668 msg="compute graph" backend=CPU buffer_type=CPU size="31.3 GiB"
time=2025-08-11T17:52:36.078Z level=INFO source=server.go:637 msg="llama runner started in 25.97 seconds"
[GIN] 2025/08/11 - 17:55:11 | 200 |     836.871µs |       127.0.0.1 | HEAD     "/"
[GIN] 2025/08/11 - 17:55:11 | 200 |      53.323µs |       127.0.0.1 | GET      "/api/ps"
<!-- gh-comment-id:3176173168 --> @alienatedsec commented on GitHub (Aug 11, 2025): Just ran it unmodified on OpenWebUI but left the context size at 128k. Now it fully loaded to CPU and the performance is not great. ``` ggml_cuda_init: found 5 CUDA devices: Device 0: NVIDIA GeForce RTX 4060 Ti, compute capability 8.9, VMM: yes Device 1: NVIDIA RTX A4000, compute capability 8.6, VMM: yes Device 2: NVIDIA RTX A4000, compute capability 8.6, VMM: yes Device 3: NVIDIA RTX A4000, compute capability 8.6, VMM: yes Device 4: NVIDIA RTX A4000, compute capability 8.6, VMM: yes load_backend: loaded CUDA backend from /usr/lib/ollama/libggml-cuda.so load_backend: loaded CPU backend from /usr/lib/ollama/libggml-cpu-x64.so time=2025-08-11T17:52:11.790Z level=INFO source=ggml.go:104 msg=system CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 CUDA.0.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.0.USE_GRAPHS=1 CUDA.0.PEER_MAX_BATCH_SIZE=128 CUDA.1.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.1.USE_GRAPHS=1 CUDA.1.PEER_MAX_BATCH_SIZE=128 CUDA.2.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.2.USE_GRAPHS=1 CUDA.2.PEER_MAX_BATCH_SIZE=128 CUDA.3.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.3.USE_GRAPHS=1 CUDA.3.PEER_MAX_BATCH_SIZE=128 CUDA.4.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.4.USE_GRAPHS=1 CUDA.4.PEER_MAX_BATCH_SIZE=128 compiler=cgo(gcc) time=2025-08-11T17:52:12.061Z level=INFO source=ggml.go:365 msg="offloading 0 repeating layers to GPU" time=2025-08-11T17:52:12.061Z level=INFO source=ggml.go:369 msg="offloading output layer to CPU" time=2025-08-11T17:52:12.061Z level=INFO source=ggml.go:376 msg="offloaded 0/37 layers to GPU" time=2025-08-11T17:52:12.061Z level=INFO source=ggml.go:379 msg="model weights" buffer=CPU size="60.8 GiB" time=2025-08-11T17:52:18.855Z level=INFO source=ggml.go:668 msg="compute graph" backend=CUDA0 buffer_type=CUDA0 size="0 B" time=2025-08-11T17:52:18.855Z level=INFO source=ggml.go:668 msg="compute graph" backend=CUDA1 buffer_type=CUDA1 size="0 B" time=2025-08-11T17:52:18.855Z level=INFO source=ggml.go:668 msg="compute graph" backend=CUDA2 buffer_type=CUDA2 size="0 B" time=2025-08-11T17:52:18.855Z level=INFO source=ggml.go:668 msg="compute graph" backend=CUDA3 buffer_type=CUDA3 size="0 B" time=2025-08-11T17:52:18.855Z level=INFO source=ggml.go:668 msg="compute graph" backend=CUDA4 buffer_type=CUDA4 size="0 B" time=2025-08-11T17:52:18.855Z level=INFO source=ggml.go:668 msg="compute graph" backend=CPU buffer_type=CPU size="31.3 GiB" time=2025-08-11T17:52:36.078Z level=INFO source=server.go:637 msg="llama runner started in 25.97 seconds" [GIN] 2025/08/11 - 17:55:11 | 200 | 836.871µs | 127.0.0.1 | HEAD "/" [GIN] 2025/08/11 - 17:55:11 | 200 | 53.323µs | 127.0.0.1 | GET "/api/ps" ```
Author
Owner

@rick-github commented on GitHub (Aug 11, 2025):

> performance is not great.

Which is?

<!-- gh-comment-id:3176201758 --> @rick-github commented on GitHub (Aug 11, 2025): > performance is not great. Which is?
Author
Owner

@alienatedsec commented on GitHub (Aug 11, 2025):

Still running. It's about a word per second. I'll update this comment when complete.

I will need more time to provide the output.

<!-- gh-comment-id:3176219114 --> @alienatedsec commented on GitHub (Aug 11, 2025): Still running. It's about a word per second. I'll update this comment when complete. I will need more time to provide the output.
Author
Owner

@ericcurtin commented on GitHub (Aug 11, 2025):

> offloaded 0/37 layers to GPU

Ollama seems to have turned off GPU for standard GGUFs from Hugging Face; it makes the llama.cpp version of Ollama only use the CPU. My advice: move to something like docker model runner or llama.cpp. I'm willing to assist with docker model runner. The CLI chatbot is just:

docker model run ai/gpt-oss

The OpenAI-compatible server is behind:

http://127.0.0.1:12434/engines/llama.cpp/v1

when we turn on TCP in docker model runner.

<!-- gh-comment-id:3176596130 --> @ericcurtin commented on GitHub (Aug 11, 2025): > offloaded 0/37 layers to GPU Ollama seem to have turned off GPU for standard ggufs from huggingface, it makes the llama.cpp version of Ollama only use CPU, my advice move to something like docker model runner or llama.cpp . I'm willing to assist with docker model runner. CLI chatbot is just: docker model run ai/gpt-oss OpenAI-compatible server is behind: http://127.0.0.1:12434/engines/llama.cpp/v1 when we turn on TCP in docker model runner.
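For completeness, a hedged example against the OpenAI-compatible base URL mentioned above (this assumes TCP host access is enabled in docker model runner and that ai/gpt-oss has been pulled):

```
curl http://127.0.0.1:12434/engines/llama.cpp/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"model": "ai/gpt-oss", "messages": [{"role": "user", "content": "why is the sky blue?"}]}'
```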
Author
Owner

@rick-github commented on GitHub (Aug 11, 2025):

> Ollama seems to have turned off GPU for standard GGUFs from Hugging Face,

This is incorrect.

<!-- gh-comment-id:3176726937 --> @rick-github commented on GitHub (Aug 11, 2025): > Ollama seem to have turned off GPU for standard ggufs from huggingface, This is incorrect.
Author
Owner

@alienatedsec commented on GitHub (Aug 12, 2025):

  • Default Ollama - ollama run gpt-oss:120b --verbose 'why is the sky blue?' - looks like the context size is 8192
total duration:       5m33.640378603s
load duration:        1m4.144910501s
prompt eval count:    73 token(s)
prompt eval duration: 13.035602434s
prompt eval rate:     5.60 tokens/s
eval count:           859 token(s)
eval duration:        4m16.457263845s
eval rate:            3.35 tokens/s
root@[redacted]:/# ollama ps
NAME            ID              SIZE     PROCESSOR          CONTEXT    UNTIL   
gpt-oss:120b    735371f916a9    91 GB    13%/87% CPU/GPU    8192       Forever    
root@[redacted]:/# 
time=2025-08-12T06:59:07.662Z level=INFO source=routes.go:1304 msg="server config" env="map[CUDA_VISIBLE_DEVICES: GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: HTTPS_PROXY: HTTP_PROXY: NO_PROXY: OLLAMA_CONTEXT_LENGTH:4096 OLLAMA_DEBUG:INFO OLLAMA_FLASH_ATTENTION:true OLLAMA_GPU_OVERHEAD:0 OLLAMA_HOST:http://0.0.0.0:11434 OLLAMA_INTEL_GPU:false OLLAMA_KEEP_ALIVE:2562047h47m16.854775807s OLLAMA_KV_CACHE_TYPE: OLLAMA_LLM_LIBRARY: OLLAMA_LOAD_TIMEOUT:5m0s OLLAMA_MAX_LOADED_MODELS:0 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:/root/.ollama/models OLLAMA_MULTIUSER_CACHE:false OLLAMA_NEW_ENGINE:false OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:1 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://* vscode-webview://* vscode-file://*] OLLAMA_SCHED_SPREAD:false ROCR_VISIBLE_DEVICES: http_proxy: https_proxy: no_proxy:]"
time=2025-08-12T06:59:07.694Z level=INFO source=images.go:477 msg="total blobs: 34"
time=2025-08-12T06:59:07.696Z level=INFO source=images.go:484 msg="total unused blobs removed: 0"
time=2025-08-12T06:59:07.698Z level=INFO source=routes.go:1357 msg="Listening on [::]:11434 (version 0.11.4)"
time=2025-08-12T06:59:07.704Z level=INFO source=gpu.go:217 msg="looking for compatible GPUs"
time=2025-08-12T06:59:09.707Z level=INFO source=types.go:130 msg="inference compute" id=GPU-4df21d53-78e5-97fb-c7ee-b18beb63e1a7 library=cuda variant=v12 compute=8.9 driver=12.8 name="NVIDIA GeForce RTX 4060 Ti" total="15.6 GiB" available="15.5 GiB"
time=2025-08-12T06:59:09.707Z level=INFO source=types.go:130 msg="inference compute" id=GPU-f2ded6c5-33e4-31ac-5c61-c9088edaedbd library=cuda variant=v12 compute=8.6 driver=12.8 name="NVIDIA RTX A4000" total="15.6 GiB" available="15.4 GiB"
time=2025-08-12T06:59:09.707Z level=INFO source=types.go:130 msg="inference compute" id=GPU-c95bf02e-0608-db0d-7759-07d27659f5f8 library=cuda variant=v12 compute=8.6 driver=12.8 name="NVIDIA RTX A4000" total="15.6 GiB" available="15.4 GiB"
time=2025-08-12T06:59:09.707Z level=INFO source=types.go:130 msg="inference compute" id=GPU-fc191601-3661-ad41-ed61-2f8a1d5dbf6a library=cuda variant=v12 compute=8.6 driver=12.8 name="NVIDIA RTX A4000" total="15.6 GiB" available="15.4 GiB"
time=2025-08-12T06:59:09.707Z level=INFO source=types.go:130 msg="inference compute" id=GPU-eab804bb-3a67-1954-db1a-ddf62d8f427a library=cuda variant=v12 compute=8.6 driver=12.8 name="NVIDIA RTX A4000" total="15.6 GiB" available="15.4 GiB"
[GIN] 2025/08/12 - 07:01:41 | 200 |    3.459333ms |       127.0.0.1 | HEAD     "/"
[GIN] 2025/08/12 - 07:01:42 | 200 |  420.549483ms |       127.0.0.1 | POST     "/api/show"
time=2025-08-12T07:01:56.610Z level=INFO source=server.go:135 msg="system memory" total="125.8 GiB" free="122.9 GiB" free_swap="0 B"
time=2025-08-12T07:01:58.272Z level=INFO source=server.go:175 msg=offload library=cuda layers.requested=-1 layers.model=37 layers.offload=30 layers.split=6,6,6,6,6 memory.available="[15.2 GiB 15.4 GiB 15.4 GiB 15.4 GiB 15.4 GiB]" memory.gpu_overhead="0 B" memory.required.full="85.6 GiB" memory.required.partial="74.6 GiB" memory.required.kv="450.0 MiB" memory.required.allocations="[14.9 GiB 14.9 GiB 14.9 GiB 14.9 GiB 14.9 GiB]" memory.weights.total="59.7 GiB" memory.weights.repeating="58.6 GiB" memory.weights.nonrepeating="1.1 GiB" memory.graph.full="3.0 GiB" memory.graph.partial="3.0 GiB"
time=2025-08-12T07:01:58.273Z level=WARN source=server.go:211 msg="flash attention enabled but not supported by model"
time=2025-08-12T07:01:58.471Z level=INFO source=server.go:438 msg="starting llama server" cmd="/usr/bin/ollama runner --ollama-engine --model /root/.ollama/models/blobs/sha256-90a618fe6ff21b09ca968df959104eb650658b0bef0faef785c18c2795d993e3 --ctx-size 8192 --batch-size 512 --n-gpu-layers 30 --threads 16 --parallel 1 --tensor-split 6,6,6,6,6 --port 35981"
time=2025-08-12T07:01:58.473Z level=INFO source=sched.go:481 msg="loaded runners" count=1
time=2025-08-12T07:01:58.473Z level=INFO source=server.go:598 msg="waiting for llama runner to start responding"
time=2025-08-12T07:01:58.473Z level=INFO source=server.go:632 msg="waiting for server to become available" status="llm server not responding"
time=2025-08-12T07:01:58.517Z level=INFO source=runner.go:925 msg="starting ollama engine"
time=2025-08-12T07:01:58.518Z level=INFO source=runner.go:983 msg="Server listening on 127.0.0.1:35981"
time=2025-08-12T07:01:58.725Z level=INFO source=server.go:632 msg="waiting for server to become available" status="llm server loading model"
time=2025-08-12T07:01:58.727Z level=INFO source=ggml.go:92 msg="" architecture=gptoss file_type=MXFP4 name="" description="" num_tensors=471 num_key_values=30
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 5 CUDA devices:
  Device 0: NVIDIA GeForce RTX 4060 Ti, compute capability 8.9, VMM: yes
  Device 1: NVIDIA RTX A4000, compute capability 8.6, VMM: yes
  Device 2: NVIDIA RTX A4000, compute capability 8.6, VMM: yes
  Device 3: NVIDIA RTX A4000, compute capability 8.6, VMM: yes
  Device 4: NVIDIA RTX A4000, compute capability 8.6, VMM: yes
load_backend: loaded CUDA backend from /usr/lib/ollama/libggml-cuda.so
load_backend: loaded CPU backend from /usr/lib/ollama/libggml-cpu-x64.so
time=2025-08-12T07:02:00.777Z level=INFO source=ggml.go:104 msg=system CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 CUDA.0.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.0.USE_GRAPHS=1 CUDA.0.PEER_MAX_BATCH_SIZE=128 CUDA.1.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.1.USE_GRAPHS=1 CUDA.1.PEER_MAX_BATCH_SIZE=128 CUDA.2.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.2.USE_GRAPHS=1 CUDA.2.PEER_MAX_BATCH_SIZE=128 CUDA.3.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.3.USE_GRAPHS=1 CUDA.3.PEER_MAX_BATCH_SIZE=128 CUDA.4.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.4.USE_GRAPHS=1 CUDA.4.PEER_MAX_BATCH_SIZE=128 compiler=cgo(gcc)
time=2025-08-12T07:02:01.159Z level=INFO source=ggml.go:365 msg="offloading 30 repeating layers to GPU"
time=2025-08-12T07:02:01.159Z level=INFO source=ggml.go:369 msg="offloading output layer to CPU"
time=2025-08-12T07:02:01.159Z level=INFO source=ggml.go:376 msg="offloaded 30/37 layers to GPU"
time=2025-08-12T07:02:01.159Z level=INFO source=ggml.go:379 msg="model weights" buffer=CUDA4 size="9.8 GiB"
time=2025-08-12T07:02:01.159Z level=INFO source=ggml.go:379 msg="model weights" buffer=CPU size="11.9 GiB"
time=2025-08-12T07:02:01.159Z level=INFO source=ggml.go:379 msg="model weights" buffer=CUDA0 size="9.8 GiB"
time=2025-08-12T07:02:01.159Z level=INFO source=ggml.go:379 msg="model weights" buffer=CUDA1 size="9.8 GiB"
time=2025-08-12T07:02:01.159Z level=INFO source=ggml.go:379 msg="model weights" buffer=CUDA2 size="9.8 GiB"
time=2025-08-12T07:02:01.159Z level=INFO source=ggml.go:379 msg="model weights" buffer=CUDA3 size="9.8 GiB"
time=2025-08-12T07:02:01.423Z level=INFO source=ggml.go:668 msg="compute graph" backend=CUDA0 buffer_type=CUDA0 size="2.1 GiB"
time=2025-08-12T07:02:01.423Z level=INFO source=ggml.go:668 msg="compute graph" backend=CUDA1 buffer_type=CUDA1 size="2.1 GiB"
time=2025-08-12T07:02:01.423Z level=INFO source=ggml.go:668 msg="compute graph" backend=CUDA2 buffer_type=CUDA2 size="2.1 GiB"
time=2025-08-12T07:02:01.423Z level=INFO source=ggml.go:668 msg="compute graph" backend=CUDA3 buffer_type=CUDA3 size="2.1 GiB"
time=2025-08-12T07:02:01.423Z level=INFO source=ggml.go:668 msg="compute graph" backend=CUDA4 buffer_type=CUDA4 size="2.1 GiB"
time=2025-08-12T07:02:01.423Z level=INFO source=ggml.go:668 msg="compute graph" backend=CPU buffer_type=CPU size="2.0 GiB"
time=2025-08-12T07:02:46.541Z level=INFO source=server.go:637 msg="llama runner started in 48.07 seconds"
[GIN] 2025/08/12 - 07:07:16 | 200 |         5m33s |       127.0.0.1 | POST     "/api/generate"

OpenWebUI - Context 128k - default GPU offloading

Image: https://github.com/user-attachments/assets/5d87eb46-433b-43cb-9a3d-dbb79ca00be5
time=2025-08-12T07:11:50.787Z level=INFO source=routes.go:1304 msg="server config" env="map[CUDA_VISIBLE_DEVICES: GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: HTTPS_PROXY: HTTP_PROXY: NO_PROXY: OLLAMA_CONTEXT_LENGTH:4096 OLLAMA_DEBUG:INFO OLLAMA_FLASH_ATTENTION:true OLLAMA_GPU_OVERHEAD:0 OLLAMA_HOST:http://0.0.0.0:11434 OLLAMA_INTEL_GPU:false OLLAMA_KEEP_ALIVE:2562047h47m16.854775807s OLLAMA_KV_CACHE_TYPE: OLLAMA_LLM_LIBRARY: OLLAMA_LOAD_TIMEOUT:5m0s OLLAMA_MAX_LOADED_MODELS:0 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:/root/.ollama/models OLLAMA_MULTIUSER_CACHE:false OLLAMA_NEW_ENGINE:false OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:1 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://* vscode-webview://* vscode-file://*] OLLAMA_SCHED_SPREAD:false ROCR_VISIBLE_DEVICES: http_proxy: https_proxy: no_proxy:]"
time=2025-08-12T07:11:50.796Z level=INFO source=images.go:477 msg="total blobs: 34"
time=2025-08-12T07:11:50.798Z level=INFO source=images.go:484 msg="total unused blobs removed: 0"
time=2025-08-12T07:11:50.799Z level=INFO source=routes.go:1357 msg="Listening on [::]:11434 (version 0.11.4)"
time=2025-08-12T07:11:50.800Z level=INFO source=gpu.go:217 msg="looking for compatible GPUs"
time=2025-08-12T07:11:52.632Z level=INFO source=types.go:130 msg="inference compute" id=GPU-4df21d53-78e5-97fb-c7ee-b18beb63e1a7 library=cuda variant=v12 compute=8.9 driver=12.8 name="NVIDIA GeForce RTX 4060 Ti" total="15.6 GiB" available="15.2 GiB"
time=2025-08-12T07:11:52.632Z level=INFO source=types.go:130 msg="inference compute" id=GPU-f2ded6c5-33e4-31ac-5c61-c9088edaedbd library=cuda variant=v12 compute=8.6 driver=12.8 name="NVIDIA RTX A4000" total="15.6 GiB" available="15.4 GiB"
time=2025-08-12T07:11:52.632Z level=INFO source=types.go:130 msg="inference compute" id=GPU-c95bf02e-0608-db0d-7759-07d27659f5f8 library=cuda variant=v12 compute=8.6 driver=12.8 name="NVIDIA RTX A4000" total="15.6 GiB" available="15.4 GiB"
time=2025-08-12T07:11:52.632Z level=INFO source=types.go:130 msg="inference compute" id=GPU-fc191601-3661-ad41-ed61-2f8a1d5dbf6a library=cuda variant=v12 compute=8.6 driver=12.8 name="NVIDIA RTX A4000" total="15.6 GiB" available="15.4 GiB"
time=2025-08-12T07:11:52.632Z level=INFO source=types.go:130 msg="inference compute" id=GPU-eab804bb-3a67-1954-db1a-ddf62d8f427a library=cuda variant=v12 compute=8.6 driver=12.8 name="NVIDIA RTX A4000" total="15.6 GiB" available="15.4 GiB"
[GIN] 2025/08/12 - 07:12:31 | 200 |    9.252204ms |      172.17.0.1 | GET      "/api/tags"
[GIN] 2025/08/12 - 07:12:31 | 200 |     212.861µs |      172.17.0.1 | GET      "/api/ps"
[GIN] 2025/08/12 - 07:12:33 | 200 |      77.853µs |      172.17.0.1 | GET      "/api/version"
time=2025-08-12T07:17:17.674Z level=INFO source=server.go:135 msg="system memory" total="125.8 GiB" free="122.8 GiB" free_swap="0 B"
time=2025-08-12T07:17:19.329Z level=INFO source=server.go:175 msg=offload library=cuda layers.requested=-1 layers.model=37 layers.offload=0 layers.split="" memory.available="[15.2 GiB 15.4 GiB 15.4 GiB 15.4 GiB 15.4 GiB]" memory.gpu_overhead="0 B" memory.required.full="64.3 GiB" memory.required.partial="0 B" memory.required.kv="4.6 GiB" memory.required.allocations="[0 B 0 B 0 B 0 B 0 B]" memory.weights.total="59.7 GiB" memory.weights.repeating="58.6 GiB" memory.weights.nonrepeating="1.1 GiB" memory.graph.full="46.9 GiB" memory.graph.partial="46.9 GiB"
time=2025-08-12T07:17:19.329Z level=WARN source=server.go:206 msg="flash attention enabled but not supported by gpu"
time=2025-08-12T07:17:19.527Z level=INFO source=server.go:438 msg="starting llama server" cmd="/usr/bin/ollama runner --ollama-engine --model /root/.ollama/models/blobs/sha256-90a618fe6ff21b09ca968df959104eb650658b0bef0faef785c18c2795d993e3 --ctx-size 128000 --batch-size 512 --threads 16 --no-mmap --parallel 1 --port 39935"
time=2025-08-12T07:17:19.528Z level=INFO source=sched.go:481 msg="loaded runners" count=1
time=2025-08-12T07:17:19.529Z level=INFO source=server.go:598 msg="waiting for llama runner to start responding"
time=2025-08-12T07:17:19.529Z level=INFO source=server.go:632 msg="waiting for server to become available" status="llm server not responding"
time=2025-08-12T07:17:19.567Z level=INFO source=runner.go:925 msg="starting ollama engine"
time=2025-08-12T07:17:19.568Z level=INFO source=runner.go:983 msg="Server listening on 127.0.0.1:39935"
time=2025-08-12T07:17:19.780Z level=INFO source=ggml.go:92 msg="" architecture=gptoss file_type=MXFP4 name="" description="" num_tensors=471 num_key_values=30
time=2025-08-12T07:17:19.782Z level=INFO source=server.go:632 msg="waiting for server to become available" status="llm server loading model"
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 5 CUDA devices:
  Device 0: NVIDIA GeForce RTX 4060 Ti, compute capability 8.9, VMM: yes
  Device 1: NVIDIA RTX A4000, compute capability 8.6, VMM: yes
  Device 2: NVIDIA RTX A4000, compute capability 8.6, VMM: yes
  Device 3: NVIDIA RTX A4000, compute capability 8.6, VMM: yes
  Device 4: NVIDIA RTX A4000, compute capability 8.6, VMM: yes
load_backend: loaded CUDA backend from /usr/lib/ollama/libggml-cuda.so
load_backend: loaded CPU backend from /usr/lib/ollama/libggml-cpu-x64.so
time=2025-08-12T07:17:21.180Z level=INFO source=ggml.go:104 msg=system CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 CUDA.0.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.0.USE_GRAPHS=1 CUDA.0.PEER_MAX_BATCH_SIZE=128 CUDA.1.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.1.USE_GRAPHS=1 CUDA.1.PEER_MAX_BATCH_SIZE=128 CUDA.2.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.2.USE_GRAPHS=1 CUDA.2.PEER_MAX_BATCH_SIZE=128 CUDA.3.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.3.USE_GRAPHS=1 CUDA.3.PEER_MAX_BATCH_SIZE=128 CUDA.4.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.4.USE_GRAPHS=1 CUDA.4.PEER_MAX_BATCH_SIZE=128 compiler=cgo(gcc)
time=2025-08-12T07:17:21.538Z level=INFO source=ggml.go:365 msg="offloading 0 repeating layers to GPU"
time=2025-08-12T07:17:21.539Z level=INFO source=ggml.go:369 msg="offloading output layer to CPU"
time=2025-08-12T07:17:21.539Z level=INFO source=ggml.go:376 msg="offloaded 0/37 layers to GPU"
time=2025-08-12T07:17:21.539Z level=INFO source=ggml.go:379 msg="model weights" buffer=CPU size="60.8 GiB"
time=2025-08-12T07:17:28.251Z level=INFO source=ggml.go:668 msg="compute graph" backend=CUDA0 buffer_type=CUDA0 size="0 B"
time=2025-08-12T07:17:28.251Z level=INFO source=ggml.go:668 msg="compute graph" backend=CUDA1 buffer_type=CUDA1 size="0 B"
time=2025-08-12T07:17:28.251Z level=INFO source=ggml.go:668 msg="compute graph" backend=CUDA2 buffer_type=CUDA2 size="0 B"
time=2025-08-12T07:17:28.251Z level=INFO source=ggml.go:668 msg="compute graph" backend=CUDA3 buffer_type=CUDA3 size="0 B"
time=2025-08-12T07:17:28.251Z level=INFO source=ggml.go:668 msg="compute graph" backend=CUDA4 buffer_type=CUDA4 size="0 B"
time=2025-08-12T07:17:28.251Z level=INFO source=ggml.go:668 msg="compute graph" backend=CPU buffer_type=CPU size="31.3 GiB"
time=2025-08-12T07:17:39.466Z level=INFO source=server.go:637 msg="llama runner started in 19.94 seconds"
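
The scheduler's decision in both runs above is captured in the single msg=offload log line: 30/37 layers at the 8192 context, 0/37 at 128k, presumably because the 46.9 GiB compute graph no longer fits on any single 16 GB card. A minimal shell sketch for pulling those fields out of a saved log; the server.log path is an assumption, substitute wherever your log was written:

```
# Summarize the scheduler's offload decision from an Ollama server log.
# Assumes the log was saved as ./server.log; adjust the path as needed.
grep 'msg=offload' server.log \
  | grep -oE '(layers\.(model|offload|split)|memory\.required\.(full|partial)|memory\.graph\.full)=("[^"]*"|[^ ]*)'
```
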
Author
Owner

@alienatedsec commented on GitHub (Aug 12, 2025):

OpenWebUI - 128k context - max GPU offloading

Image: https://github.com/user-attachments/assets/0fef5b46-573b-4eec-802c-6e074bd97e11

Another go - this one from when the model was already loaded

Image: https://github.com/user-attachments/assets/20cd212a-7ea3-49ba-a0bd-649a93d589bf
root@[redacted]:/# ollama ps
NAME            ID              SIZE     PROCESSOR    CONTEXT    UNTIL   
gpt-oss:120b    735371f916a9    69 GB    100% CPU     128000     Forever    
root@[redacted]:/# 
time=2025-08-12T07:41:40.457Z level=INFO source=routes.go:1304 msg="server config" env="map[CUDA_VISIBLE_DEVICES: GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: HTTPS_PROXY: HTTP_PROXY: NO_PROXY: OLLAMA_CONTEXT_LENGTH:4096 OLLAMA_DEBUG:INFO OLLAMA_FLASH_ATTENTION:true OLLAMA_GPU_OVERHEAD:0 OLLAMA_HOST:http://0.0.0.0:11434 OLLAMA_INTEL_GPU:false OLLAMA_KEEP_ALIVE:2562047h47m16.854775807s OLLAMA_KV_CACHE_TYPE: OLLAMA_LLM_LIBRARY: OLLAMA_LOAD_TIMEOUT:5m0s OLLAMA_MAX_LOADED_MODELS:0 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:/root/.ollama/models OLLAMA_MULTIUSER_CACHE:false OLLAMA_NEW_ENGINE:false OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:1 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://* vscode-webview://* vscode-file://*] OLLAMA_SCHED_SPREAD:false ROCR_VISIBLE_DEVICES: http_proxy: https_proxy: no_proxy:]"
time=2025-08-12T07:41:40.465Z level=INFO source=images.go:477 msg="total blobs: 34"
time=2025-08-12T07:41:40.467Z level=INFO source=images.go:484 msg="total unused blobs removed: 0"
time=2025-08-12T07:41:40.468Z level=INFO source=routes.go:1357 msg="Listening on [::]:11434 (version 0.11.4)"
time=2025-08-12T07:41:40.468Z level=INFO source=gpu.go:217 msg="looking for compatible GPUs"
time=2025-08-12T07:41:42.462Z level=INFO source=types.go:130 msg="inference compute" id=GPU-4df21d53-78e5-97fb-c7ee-b18beb63e1a7 library=cuda variant=v12 compute=8.9 driver=12.8 name="NVIDIA GeForce RTX 4060 Ti" total="15.6 GiB" available="15.2 GiB"
time=2025-08-12T07:41:42.462Z level=INFO source=types.go:130 msg="inference compute" id=GPU-f2ded6c5-33e4-31ac-5c61-c9088edaedbd library=cuda variant=v12 compute=8.6 driver=12.8 name="NVIDIA RTX A4000" total="15.6 GiB" available="15.4 GiB"
time=2025-08-12T07:41:42.462Z level=INFO source=types.go:130 msg="inference compute" id=GPU-c95bf02e-0608-db0d-7759-07d27659f5f8 library=cuda variant=v12 compute=8.6 driver=12.8 name="NVIDIA RTX A4000" total="15.6 GiB" available="15.4 GiB"
time=2025-08-12T07:41:42.462Z level=INFO source=types.go:130 msg="inference compute" id=GPU-fc191601-3661-ad41-ed61-2f8a1d5dbf6a library=cuda variant=v12 compute=8.6 driver=12.8 name="NVIDIA RTX A4000" total="15.6 GiB" available="15.4 GiB"
time=2025-08-12T07:41:42.462Z level=INFO source=types.go:130 msg="inference compute" id=GPU-eab804bb-3a67-1954-db1a-ddf62d8f427a library=cuda variant=v12 compute=8.6 driver=12.8 name="NVIDIA RTX A4000" total="15.6 GiB" available="15.4 GiB"
[GIN] 2025/08/12 - 07:42:18 | 200 |     214.344µs |      172.17.0.1 | GET      "/api/version"
[GIN] 2025/08/12 - 07:42:20 | 200 |    3.704388ms |      172.17.0.1 | GET      "/api/tags"
[GIN] 2025/08/12 - 07:42:20 | 200 |     215.682µs |      172.17.0.1 | GET      "/api/ps"
[GIN] 2025/08/12 - 07:42:21 | 200 |      92.611µs |      172.17.0.1 | GET      "/api/version"
time=2025-08-12T07:42:38.386Z level=INFO source=server.go:135 msg="system memory" total="125.8 GiB" free="122.9 GiB" free_swap="0 B"
time=2025-08-12T07:42:40.029Z level=INFO source=server.go:175 msg=offload library=cuda layers.requested=256 layers.model=37 layers.offload=0 layers.split="" memory.available="[15.2 GiB 15.4 GiB 15.4 GiB 15.4 GiB 15.4 GiB]" memory.gpu_overhead="0 B" memory.required.full="64.3 GiB" memory.required.partial="0 B" memory.required.kv="4.6 GiB" memory.required.allocations="[0 B 0 B 0 B 0 B 0 B]" memory.weights.total="59.7 GiB" memory.weights.repeating="58.6 GiB" memory.weights.nonrepeating="1.1 GiB" memory.graph.full="46.9 GiB" memory.graph.partial="46.9 GiB"
time=2025-08-12T07:42:40.029Z level=WARN source=server.go:206 msg="flash attention enabled but not supported by gpu"
time=2025-08-12T07:42:40.219Z level=INFO source=server.go:438 msg="starting llama server" cmd="/usr/bin/ollama runner --ollama-engine --model /root/.ollama/models/blobs/sha256-90a618fe6ff21b09ca968df959104eb650658b0bef0faef785c18c2795d993e3 --ctx-size 128000 --batch-size 512 --n-gpu-layers 256 --threads 16 --no-mmap --parallel 1 --port 43889"
time=2025-08-12T07:42:40.221Z level=INFO source=sched.go:481 msg="loaded runners" count=1
time=2025-08-12T07:42:40.221Z level=INFO source=server.go:598 msg="waiting for llama runner to start responding"
time=2025-08-12T07:42:40.221Z level=INFO source=server.go:632 msg="waiting for server to become available" status="llm server not responding"
time=2025-08-12T07:42:40.261Z level=INFO source=runner.go:925 msg="starting ollama engine"
time=2025-08-12T07:42:40.261Z level=INFO source=runner.go:983 msg="Server listening on 127.0.0.1:43889"
time=2025-08-12T07:42:40.473Z level=INFO source=server.go:632 msg="waiting for server to become available" status="llm server loading model"
time=2025-08-12T07:42:40.475Z level=INFO source=ggml.go:92 msg="" architecture=gptoss file_type=MXFP4 name="" description="" num_tensors=471 num_key_values=30
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 5 CUDA devices:
  Device 0: NVIDIA GeForce RTX 4060 Ti, compute capability 8.9, VMM: yes
  Device 1: NVIDIA RTX A4000, compute capability 8.6, VMM: yes
  Device 2: NVIDIA RTX A4000, compute capability 8.6, VMM: yes
  Device 3: NVIDIA RTX A4000, compute capability 8.6, VMM: yes
  Device 4: NVIDIA RTX A4000, compute capability 8.6, VMM: yes
load_backend: loaded CUDA backend from /usr/lib/ollama/libggml-cuda.so
load_backend: loaded CPU backend from /usr/lib/ollama/libggml-cpu-x64.so
time=2025-08-12T07:42:41.941Z level=INFO source=ggml.go:104 msg=system CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 CUDA.0.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.0.USE_GRAPHS=1 CUDA.0.PEER_MAX_BATCH_SIZE=128 CUDA.1.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.1.USE_GRAPHS=1 CUDA.1.PEER_MAX_BATCH_SIZE=128 CUDA.2.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.2.USE_GRAPHS=1 CUDA.2.PEER_MAX_BATCH_SIZE=128 CUDA.3.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.3.USE_GRAPHS=1 CUDA.3.PEER_MAX_BATCH_SIZE=128 CUDA.4.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.4.USE_GRAPHS=1 CUDA.4.PEER_MAX_BATCH_SIZE=128 compiler=cgo(gcc)
time=2025-08-12T07:42:42.331Z level=INFO source=ggml.go:365 msg="offloading 36 repeating layers to GPU"
time=2025-08-12T07:42:42.331Z level=INFO source=ggml.go:371 msg="offloading output layer to GPU"
time=2025-08-12T07:42:42.332Z level=INFO source=ggml.go:376 msg="offloaded 37/37 layers to GPU"
time=2025-08-12T07:42:42.332Z level=INFO source=ggml.go:379 msg="model weights" buffer=CPU size="1.1 GiB"
time=2025-08-12T07:42:42.332Z level=INFO source=ggml.go:379 msg="model weights" buffer=CUDA0 size="13.0 GiB"
time=2025-08-12T07:42:42.332Z level=INFO source=ggml.go:379 msg="model weights" buffer=CUDA1 size="11.4 GiB"
time=2025-08-12T07:42:42.332Z level=INFO source=ggml.go:379 msg="model weights" buffer=CUDA2 size="13.0 GiB"
time=2025-08-12T07:42:42.332Z level=INFO source=ggml.go:379 msg="model weights" buffer=CUDA3 size="11.4 GiB"
time=2025-08-12T07:42:42.332Z level=INFO source=ggml.go:379 msg="model weights" buffer=CUDA4 size="10.9 GiB"
time=2025-08-12T07:42:43.561Z level=INFO source=ggml.go:668 msg="compute graph" backend=CUDA0 buffer_type=CUDA0 size="31.5 GiB"
time=2025-08-12T07:42:43.561Z level=INFO source=ggml.go:668 msg="compute graph" backend=CUDA1 buffer_type=CUDA1 size="31.5 GiB"
time=2025-08-12T07:42:43.561Z level=INFO source=ggml.go:668 msg="compute graph" backend=CUDA2 buffer_type=CUDA2 size="31.5 GiB"
time=2025-08-12T07:42:43.561Z level=INFO source=ggml.go:668 msg="compute graph" backend=CUDA3 buffer_type=CUDA3 size="31.5 GiB"
time=2025-08-12T07:42:43.561Z level=INFO source=ggml.go:668 msg="compute graph" backend=CUDA4 buffer_type=CUDA4 size="31.5 GiB"
time=2025-08-12T07:42:43.561Z level=INFO source=ggml.go:668 msg="compute graph" backend=CPU buffer_type=CPU size="5.6 MiB"
time=2025-08-12T07:44:16.820Z level=INFO source=server.go:637 msg="llama runner started in 96.60 seconds"
[GIN] 2025/08/12 - 07:45:06 | 200 |         2m42s |      172.17.0.1 | POST     "/api/chat"
[GIN] 2025/08/12 - 07:45:26 | 200 | 20.203185733s |      172.17.0.1 | POST     "/api/chat"
[GIN] 2025/08/12 - 07:45:39 | 200 | 12.833296469s |      172.17.0.1 | POST     "/api/chat"
[GIN] 2025/08/12 - 07:45:58 | 200 | 18.500690622s |      172.17.0.1 | POST     "/api/chat"
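
For what it's worth, the layers.requested=256 / --n-gpu-layers 256 in the log above appears to be what OpenWebUI's max GPU offloading setting translates to, and the same request can be reproduced against the API directly. A minimal sketch, assuming the default local endpoint; the prompt is only a placeholder and 256 simply means "offload as many layers as possible":

```
# Request gpt-oss:120b with a 128k context and maximum GPU offload
# via the Ollama HTTP API (equivalent to the OpenWebUI settings above).
curl http://localhost:11434/api/generate -d '{
  "model": "gpt-oss:120b",
  "prompt": "why is the sky blue?",
  "options": {
    "num_ctx": 128000,
    "num_gpu": 256
  }
}'
```
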
Author
Owner

@alienatedsec commented on GitHub (Aug 12, 2025):

@rick-github

Here is another example for llama4:16x17b, which seems to report correctly on the CPU/GPU split.

root@[redacted]:/# ollama ps
NAME             ID              SIZE      PROCESSOR          CONTEXT    UNTIL   
llama4:16x17b    bf31604e25c2    159 GB    52%/48% CPU/GPU    128000     Forever    
root@[redacted]:/# 
Image: https://github.com/user-attachments/assets/7b55b31e-2e9b-4ffc-89e8-a8ca1b5d56d3
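
The PROCESSOR split that ollama ps prints is derived from how much of the loaded model the server believes is resident in VRAM versus host memory, and the raw numbers behind it are also exposed over the API. A small sketch for reading them directly (assumes jq is installed; the field names follow the /api/ps response):

```
# Show each loaded model's total size and the portion reported as resident in VRAM.
curl -s http://localhost:11434/api/ps \
  | jq -r '.models[] | "\(.name)  size=\(.size)  size_vram=\(.size_vram)"'
```
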
Author
Owner

@alienatedsec commented on GitHub (Aug 12, 2025):

I recently added another GPU to my setup and checked nvidia-smi while the gpt-oss model was loaded:

Image: https://github.com/user-attachments/assets/c770823c-13bd-44c5-a33b-09ce3db25add

Here is the most relevant output:

GPU 0: 11 461 / 16 376 MiB (≈70 % used)
GPU 1: 15 770 / 20 475 MiB (≈77 % used)
GPU 2: 11 421 / 16 380 MiB (≈70 % used)
GPU 3: 11 461 / 16 376 MiB (≈70 % used)
GPU 4: 11 461 / 16 376 MiB (≈70 % used)
GPU 5: 9 225 / 16 376 MiB (≈56 % used)

Overall: 70 839 / 102 359 MiB (≈69 % used, 31 % free)
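
For anyone repeating this check, the per-GPU numbers can also be sampled non-interactively while the model is generating; a minimal sketch (the one-second interval is arbitrary):

```
# Sample per-GPU memory use and utilization once a second.
watch -n 1 'nvidia-smi --query-gpu=index,name,memory.used,memory.total,utilization.gpu --format=csv,noheader'
```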

time=2025-08-12T12:06:06.622Z level=INFO source=server.go:135 msg="system memory" total="125.8 GiB" free="104.1 GiB" free_swap="0 B"
time=2025-08-12T12:06:08.617Z level=INFO source=server.go:175 msg=offload library=cuda layers.requested=256 layers.model=37 layers.offload=0 layers.split="" memory.available="[18.7 GiB 15.5 GiB 15.4 GiB 15.4 GiB 15.4 GiB 15.4 GiB]" memory.gpu_overhead="0 B" memory.required.full="64.3 GiB" memory.required.partial="0 B" memory.required.kv="4.6 GiB" memory.required.allocations="[0 B 0 B 0 B 0 B 0 B 0 B]" memory.weights.total="59.7 GiB" memory.weights.repeating="58.6 GiB" memory.weights.nonrepeating="1.1 GiB" memory.graph.full="46.9 GiB" memory.graph.partial="46.9 GiB"
time=2025-08-12T12:06:08.617Z level=WARN source=server.go:206 msg="flash attention enabled but not supported by gpu"
time=2025-08-12T12:06:08.796Z level=INFO source=server.go:438 msg="starting llama server" cmd="/usr/bin/ollama runner --ollama-engine --model /root/.ollama/models/blobs/sha256-90a618fe6ff21b09ca968df959104eb650658b0bef0faef785c18c2795d993e3 --ctx-size 128000 --batch-size 512 --n-gpu-layers 256 --threads 16 --no-mmap --parallel 1 --port 37283"
time=2025-08-12T12:06:08.797Z level=INFO source=sched.go:481 msg="loaded runners" count=1
time=2025-08-12T12:06:08.797Z level=INFO source=server.go:598 msg="waiting for llama runner to start responding"
time=2025-08-12T12:06:08.797Z level=INFO source=server.go:632 msg="waiting for server to become available" status="llm server not responding"
time=2025-08-12T12:06:08.841Z level=INFO source=runner.go:925 msg="starting ollama engine"
time=2025-08-12T12:06:08.843Z level=INFO source=runner.go:983 msg="Server listening on 127.0.0.1:37283"
time=2025-08-12T12:06:09.050Z level=INFO source=server.go:632 msg="waiting for server to become available" status="llm server loading model"
time=2025-08-12T12:06:09.059Z level=INFO source=ggml.go:92 msg="" architecture=gptoss file_type=MXFP4 name="" description="" num_tensors=471 num_key_values=30
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 6 CUDA devices:
  Device 0: NVIDIA RTX 4000 Ada Generation, compute capability 8.9, VMM: yes
  Device 1: NVIDIA GeForce RTX 4060 Ti, compute capability 8.9, VMM: yes
  Device 2: NVIDIA RTX A4000, compute capability 8.6, VMM: yes
  Device 3: NVIDIA RTX A4000, compute capability 8.6, VMM: yes
  Device 4: NVIDIA RTX A4000, compute capability 8.6, VMM: yes
  Device 5: NVIDIA RTX A4000, compute capability 8.6, VMM: yes
[GIN] 2025/08/12 - 12:06:10 | 200 |      47.379µs |       127.0.0.1 | HEAD     "/"
[GIN] 2025/08/12 - 12:06:10 | 200 |      63.108µs |       127.0.0.1 | GET      "/api/ps"
load_backend: loaded CUDA backend from /usr/lib/ollama/libggml-cuda.so
load_backend: loaded CPU backend from /usr/lib/ollama/libggml-cpu-x64.so
time=2025-08-12T12:06:10.880Z level=INFO source=ggml.go:104 msg=system CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 CUDA.0.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.0.USE_GRAPHS=1 CUDA.0.PEER_MAX_BATCH_SIZE=128 CUDA.1.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.1.USE_GRAPHS=1 CUDA.1.PEER_MAX_BATCH_SIZE=128 CUDA.2.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.2.USE_GRAPHS=1 CUDA.2.PEER_MAX_BATCH_SIZE=128 CUDA.3.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.3.USE_GRAPHS=1 CUDA.3.PEER_MAX_BATCH_SIZE=128 CUDA.4.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.4.USE_GRAPHS=1 CUDA.4.PEER_MAX_BATCH_SIZE=128 CUDA.5.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.5.USE_GRAPHS=1 CUDA.5.PEER_MAX_BATCH_SIZE=128 compiler=cgo(gcc)
time=2025-08-12T12:06:11.211Z level=INFO source=ggml.go:365 msg="offloading 36 repeating layers to GPU"
time=2025-08-12T12:06:11.211Z level=INFO source=ggml.go:371 msg="offloading output layer to GPU"
time=2025-08-12T12:06:11.211Z level=INFO source=ggml.go:376 msg="offloaded 37/37 layers to GPU"
time=2025-08-12T12:06:11.211Z level=INFO source=ggml.go:379 msg="model weights" buffer=CUDA5 size="7.6 GiB"
time=2025-08-12T12:06:11.211Z level=INFO source=ggml.go:379 msg="model weights" buffer=CPU size="1.1 GiB"
time=2025-08-12T12:06:11.211Z level=INFO source=ggml.go:379 msg="model weights" buffer=CUDA0 size="13.0 GiB"
time=2025-08-12T12:06:11.211Z level=INFO source=ggml.go:379 msg="model weights" buffer=CUDA1 size="9.8 GiB"
time=2025-08-12T12:06:11.211Z level=INFO source=ggml.go:379 msg="model weights" buffer=CUDA2 size="9.8 GiB"
time=2025-08-12T12:06:11.211Z level=INFO source=ggml.go:379 msg="model weights" buffer=CUDA3 size="9.8 GiB"
time=2025-08-12T12:06:11.211Z level=INFO source=ggml.go:379 msg="model weights" buffer=CUDA4 size="9.8 GiB"
[GIN] 2025/08/12 - 12:06:11 | 200 |       54.01µs |       127.0.0.1 | HEAD     "/"
[GIN] 2025/08/12 - 12:06:11 | 200 |      67.708µs |       127.0.0.1 | GET      "/api/ps"
time=2025-08-12T12:06:12.460Z level=INFO source=ggml.go:668 msg="compute graph" backend=CUDA0 buffer_type=CUDA0 size="31.5 GiB"
time=2025-08-12T12:06:12.461Z level=INFO source=ggml.go:668 msg="compute graph" backend=CUDA1 buffer_type=CUDA1 size="31.5 GiB"
time=2025-08-12T12:06:12.461Z level=INFO source=ggml.go:668 msg="compute graph" backend=CUDA2 buffer_type=CUDA2 size="31.5 GiB"
time=2025-08-12T12:06:12.461Z level=INFO source=ggml.go:668 msg="compute graph" backend=CUDA3 buffer_type=CUDA3 size="31.5 GiB"
time=2025-08-12T12:06:12.461Z level=INFO source=ggml.go:668 msg="compute graph" backend=CUDA4 buffer_type=CUDA4 size="31.5 GiB"
time=2025-08-12T12:06:12.461Z level=INFO source=ggml.go:668 msg="compute graph" backend=CUDA5 buffer_type=CUDA5 size="31.5 GiB"
time=2025-08-12T12:06:12.461Z level=INFO source=ggml.go:668 msg="compute graph" backend=CPU buffer_type=CPU size="5.6 MiB"
[GIN] 2025/08/12 - 12:06:12 | 200 |      69.882µs |       127.0.0.1 | HEAD     "/"
[GIN] 2025/08/12 - 12:06:12 | 200 |      80.267µs |       127.0.0.1 | GET      "/api/ps"
[GIN] 2025/08/12 - 12:06:13 | 200 |      51.694µs |       127.0.0.1 | HEAD     "/"
[GIN] 2025/08/12 - 12:06:13 | 200 |      71.656µs |       127.0.0.1 | GET      "/api/ps"
[GIN] 2025/08/12 - 12:06:13 | 200 |      56.547µs |       127.0.0.1 | HEAD     "/"
[GIN] 2025/08/12 - 12:06:13 | 200 |      51.079µs |       127.0.0.1 | GET      "/api/ps"
[GIN] 2025/08/12 - 12:06:46 | 200 |       54.03µs |       127.0.0.1 | HEAD     "/"
[GIN] 2025/08/12 - 12:06:46 | 200 |      66.716µs |       127.0.0.1 | GET      "/api/ps"
time=2025-08-12T12:06:46.536Z level=INFO source=server.go:637 msg="llama runner started in 37.74 seconds"
[GIN] 2025/08/12 - 12:06:47 | 200 |      78.643µs |       127.0.0.1 | HEAD     "/"
[GIN] 2025/08/12 - 12:06:47 | 200 |      62.992µs |       127.0.0.1 | GET      "/api/ps"
[GIN] 2025/08/12 - 12:07:19 | 200 |         1m42s |      172.17.0.1 | POST     "/api/chat"
[GIN] 2025/08/12 - 12:07:36 | 200 | 17.421053094s |      172.17.0.1 | POST     "/api/chat"
[GIN] 2025/08/12 - 12:08:25 | 200 |  53.63390453s |      172.17.0.1 | POST     "/api/chat"
[GIN] 2025/08/12 - 12:08:40 | 200 |          1m3s |      172.17.0.1 | POST     "/api/chat"
[GIN] 2025/08/12 - 12:08:58 | 200 | 32.881046065s |      172.17.0.1 | POST     "/api/chat"
[GIN] 2025/08/12 - 12:09:11 | 200 | 31.576654164s |      172.17.0.1 | POST     "/api/chat"
[GIN] 2025/08/12 - 12:09:25 | 200 | 26.374625068s |      172.17.0.1 | POST     "/api/chat"
[GIN] 2025/08/12 - 12:10:18 | 200 |          1m6s |      172.17.0.1 | POST     "/api/chat"
[GIN] 2025/08/12 - 12:10:37 | 200 |         1m12s |      172.17.0.1 | POST     "/api/chat"
[GIN] 2025/08/12 - 12:10:55 | 200 | 37.072014344s |      172.17.0.1 | POST     "/api/chat"
[GIN] 2025/08/12 - 12:11:10 | 200 | 14.175902058s |      172.17.0.1 | POST     "/api/chat"
[GIN] 2025/08/12 - 12:11:27 | 200 | 17.388711625s |      172.17.0.1 | POST     "/api/chat"
[GIN] 2025/08/12 - 12:13:34 | 200 | 34.798147103s |      172.17.0.1 | POST     "/api/chat"
[GIN] 2025/08/12 - 12:13:57 | 200 | 22.514621634s |      172.17.0.1 | POST     "/api/chat"
Author
Owner

@SHU-red commented on GitHub (Aug 14, 2025):

  • ollama in docker
  • model gpt-oss:20b
  • RTX 4080 16GB

Running heavily on CPU.
I also used ollama create to build a

  • num_ctx 32000
  • num_ctx 3000

version of the model.

Tested all 3 models: official, 32k and 3k.

Logs

Logs below were all created with my custom model, with the context window reduced to 32k via a Modelfile:

FROM gpt-oss:20b
PARAMETER num_ctx 32000

ollama create -f Modelfile gpt-oss:20b_ctx32k

root@63ad3d6a32f1:/# ollama ps
NAME                  ID              SIZE     PROCESSOR          CONTEXT    UNTIL
gpt-oss:20b_ctx32k    244276c2a394    22 GB    39%/61% CPU/GPU    32000      4 minutes from now
services:
  ollama:
    volumes:
....
    container_name: ollama
    network_mode: bridge
    ports:
      - 11434:11434
    pull_policy: always
    tty: true
    restart: unless-stopped
    image: ollama/ollama:latest
    privileged: true
    environment:
      - OLLAMA_RUN_PARALLEL=1 # Tested with/without
     # - OLLAMA_CONTEXT_LENGTH=3000 # Tested with/without and also 32000
      - GGML_CUDA_ENABLE_UNIFIED_MEMORY=1 # Tested with/without
    devices:
      - /dev/dri/card0 # passing graphics card
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
        limits:
          cpus: '8' # Limited for such occasions to not slow down my whole server
Image
$ nvidia-smi                                                                  Thu Aug 14 06:59:16 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 575.64.05              Driver Version: 575.64.05      CUDA Version: 12.9     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 4080        Off |   00000000:06:00.0  On |                  N/A |
|  0%   55C    P2             58W /  320W |    7157MiB /  16376MiB |      1%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A           13851      C   frigate.detector.tensorrt               348MiB |
|    0   N/A  N/A           13909      C   /usr/lib/ffmpeg/7.0/bin/ffmpeg          340MiB |
|    0   N/A  N/A           13977      C   python3                                 236MiB |
|    0   N/A  N/A           20591      C   /usr/lib/ffmpeg/7.0/bin/ffmpeg          320MiB |
|    0   N/A  N/A         1046529      C   /usr/bin/ollama                         358MiB |
|    0   N/A  N/A         3321696      C   /opt/venv/bin/python3                   236MiB |
|    0   N/A  N/A         3830441      G   /usr/lib/xorg/Xorg                      149MiB |
|    0   N/A  N/A         3831365      G   xfwm4                                     4MiB |
|    0   N/A  N/A         3831393    C+G   /usr/bin/sunshine                       241MiB |
|    0   N/A  N/A         3831662      G   ...nstallation/ubuntu12_32/steam          4MiB |
|    0   N/A  N/A         3831929      G   ./steamwebhelper                          9MiB |
|    0   N/A  N/A         3831952      G   ...on/ubuntu12_64/steamwebhelper        122MiB |
+-----------------------------------------------------------------------------------------+
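
As a side check (a sketch; it assumes the container is named ollama and that the NVIDIA container toolkit mounts nvidia-smi into it), the same query can be run inside the container to confirm the GPU is exposed to Ollama and not only to the host:

# hypothetical check from the host
sudo docker exec ollama nvidia-smi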
sudo docker logs ollama
$ sudo docker logs ollama
time=2025-08-14T06:57:20.491Z level=INFO source=routes.go:1304 msg="server config" env="map[CUDA_VISIBLE_DEVICES: GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: HTTPS_PROXY: HTTP_PROXY: NO_PROXY: OLLAMA_CONTEXT_LENGTH:4096 OLLAMA_DEBUG:INFO OLLAMA_FLASH_ATTENTION:false OLLAMA_GPU_OVERHEAD:0 OLLAMA_HOST:http://0.0.0.0:11434 OLLAMA_INTEL_GPU:false OLLAMA_KEEP_ALIVE:5m0s OLLAMA_KV_CACHE_TYPE: OLLAMA_LLM_LIBRARY: OLLAMA_LOAD_TIMEOUT:5m0s OLLAMA_MAX_LOADED_MODELS:0 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:/root/.ollama/models OLLAMA_MULTIUSER_CACHE:false OLLAMA_NEW_ENGINE:false OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:1 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://* vscode-webview://* vscode-file://*] OLLAMA_SCHED_SPREAD:false ROCR_VISIBLE_DEVICES: http_proxy: https_proxy: no_proxy:]"
time=2025-08-14T06:57:20.494Z level=INFO source=images.go:477 msg="total blobs: 66"
time=2025-08-14T06:57:20.495Z level=INFO source=images.go:484 msg="total unused blobs removed: 0"
time=2025-08-14T06:57:20.495Z level=INFO source=routes.go:1357 msg="Listening on [::]:11434 (version 0.11.4)"
time=2025-08-14T06:57:20.495Z level=INFO source=gpu.go:217 msg="looking for compatible GPUs"
time=2025-08-14T06:57:20.662Z level=INFO source=types.go:130 msg="inference compute" id=GPU-db50747e-f16a-d05b-fe6a-26ffdb825cc6 library=cuda variant=v12 compute=8.9 driver=12.9 name="NVIDIA GeForce RTX 4080" total="15.6 GiB" available="13.2 GiB"
time=2025-08-14T06:57:20.662Z level=INFO source=routes.go:1398 msg="entering low vram mode" "total vram"="15.6 GiB" threshold="20.0 GiB"
[GIN] 2025/08/14 - 06:57:34 | 200 |     1.82955ms |      172.18.0.1 | GET      "/api/tags"
time=2025-08-14T06:57:45.193Z level=INFO source=server.go:135 msg="system memory" total="62.7 GiB" free="46.0 GiB" free_swap="290.9 MiB"
time=2025-08-14T06:57:45.193Z level=INFO source=server.go:175 msg=offload library=cuda layers.requested=-1 layers.model=25 layers.offload=9 layers.split="" memory.available="[13.2 GiB]" memory.gpu_overhead="0 B" memory.required.full="21.3 GiB" memory.required.partial="13.1 GiB" memory.required.kv="858.0 MiB" memory.required.allocations="[13.1 GiB]" memory.weights.total="11.7 GiB" memory.weights.repeating="10.7 GiB" memory.weights.nonrepeating="1.1 GiB" memory.graph.full="7.8 GiB" memory.graph.partial="7.8 GiB"
time=2025-08-14T06:57:45.240Z level=INFO source=server.go:438 msg="starting llama server" cmd="/usr/bin/ollama runner --ollama-engine --model /root/.ollama/models/blobs/sha256-b112e727c6f18875636c56a779790a590d705aec9e1c0eb5a97d51fc2a778583 --ctx-size 32000 --batch-size 512 --n-gpu-layers 9 --threads 8 --parallel 1 --port 43991"
time=2025-08-14T06:57:45.241Z level=INFO source=sched.go:481 msg="loaded runners" count=1
time=2025-08-14T06:57:45.241Z level=INFO source=server.go:598 msg="waiting for llama runner to start responding"
time=2025-08-14T06:57:45.241Z level=INFO source=server.go:632 msg="waiting for server to become available" status="llm server not responding"
time=2025-08-14T06:57:45.250Z level=INFO source=runner.go:925 msg="starting ollama engine"
time=2025-08-14T06:57:45.251Z level=INFO source=runner.go:983 msg="Server listening on 127.0.0.1:43991"
time=2025-08-14T06:57:45.299Z level=INFO source=ggml.go:92 msg="" architecture=gptoss file_type=MXFP4 name="" description="" num_tensors=315 num_key_values=30
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 4080, compute capability 8.9, VMM: yes
load_backend: loaded CUDA backend from /usr/lib/ollama/libggml-cuda.so
load_backend: loaded CPU backend from /usr/lib/ollama/libggml-cpu-haswell.so
time=2025-08-14T06:57:45.363Z level=INFO source=ggml.go:104 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.BMI2=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 CUDA.0.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.0.USE_GRAPHS=1 CUDA.0.PEER_MAX_BATCH_SIZE=128 compiler=cgo(gcc)
time=2025-08-14T06:57:45.461Z level=INFO source=ggml.go:365 msg="offloading 9 repeating layers to GPU"
time=2025-08-14T06:57:45.461Z level=INFO source=ggml.go:369 msg="offloading output layer to CPU"
time=2025-08-14T06:57:45.461Z level=INFO source=ggml.go:376 msg="offloaded 9/25 layers to GPU"
time=2025-08-14T06:57:45.461Z level=INFO source=ggml.go:379 msg="model weights" buffer=CUDA0 size="4.0 GiB"
time=2025-08-14T06:57:45.461Z level=INFO source=ggml.go:379 msg="model weights" buffer=CPU size="8.8 GiB"
time=2025-08-14T06:57:45.492Z level=INFO source=server.go:632 msg="waiting for server to become available" status="llm server loading model"
time=2025-08-14T06:57:45.760Z level=INFO source=ggml.go:668 msg="compute graph" backend=CUDA0 buffer_type=CUDA0 size="7.9 GiB"
time=2025-08-14T06:57:45.761Z level=INFO source=ggml.go:668 msg="compute graph" backend=CPU buffer_type=CPU size="7.8 GiB"
time=2025-08-14T06:57:48.280Z level=INFO source=server.go:637 msg="llama runner started in 3.04 seconds"
[GIN] 2025/08/14 - 06:58:10 | 200 | 26.056937704s |      172.18.0.1 | POST     "/api/generate"
[GIN] 2025/08/14 - 06:58:40 | 200 |      30.728µs |       127.0.0.1 | HEAD     "/"
[GIN] 2025/08/14 - 06:58:40 | 200 |      70.573µs |       127.0.0.1 | GET      "/api/ps"
[GIN] 2025/08/14 - 06:58:40 | 200 | 55.946239643s |      172.18.0.1 | POST     "/api/generate"
[GIN] 2025/08/14 - 06:59:21 | 200 |         1m36s |      172.18.0.1 | POST     "/api/generate"
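
The key lines in logs like the above are the "low vram mode" notice and the offload decision ("offloaded 9/25 layers to GPU"). A quick way to pull just those out (a sketch; it assumes the container is named ollama):

# filter the Ollama container logs for the offload-related lines
sudo docker logs ollama 2>&1 | grep -E 'low vram|offload|model weights'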
Author
Owner

@alienatedsec commented on GitHub (Aug 14, 2025):

@SHU-red Just a thought that worked for me and was mentioned by @rick-github

As you already have - GGML_CUDA_ENABLE_UNIFIED_MEMORY=1, can you also set --n-gpu-layers 256 instead of the current --n-gpu-layers 9?

Author
Owner

@SHU-red commented on GitHub (Aug 14, 2025):

can you also set --n-gpu-layers 256

Hi @alienatedsec
How do I do this?

Is there an option for docker-compose?

Author
Owner

@alienatedsec commented on GitHub (Aug 14, 2025):

@SHU-red When you run the interactive mode (see https://github.com/ollama/ollama/issues/1855#issuecomment-1881719430):

ollama run gpt-oss:20b_ctx32k
>>> /set parameter num_gpu 256
Set parameter 'num_gpu' to '256'

>>>
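
For services that talk to Ollama over the HTTP API instead of the CLI, the same setting can also be passed per request through the options field (a minimal sketch; the prompt is just a placeholder):

# hypothetical request against the custom model from above
curl http://localhost:11434/api/generate -d '{
  "model": "gpt-oss:20b_ctx32k",
  "prompt": "Hello",
  "options": { "num_gpu": 256, "num_ctx": 32000 }
}'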
Author
Owner

@SHU-red commented on GitHub (Aug 14, 2025):

run the interactive mode

Oh yes! This works! Let me guess: I should have read the above more carefully, and there is no way to set this globally so it applies to all my other services that use Ollama via the API?

Author
Owner

@alienatedsec commented on GitHub (Aug 14, 2025):

I should have read the above more and there is no way to globally set this and use it with all my other services using ollama via api?

https://github.com/ollama/ollama/issues/4850#issuecomment-2176979850

Author
Owner

@SHU-red commented on GitHub (Aug 14, 2025):

#4850 (comment)

Sorry, I saw this one but I don't know what number to set OLLAMA_MAX_VRAM to in order to get the same effect as --n-gpu-layers 256

Author
Owner

@alienatedsec commented on GitHub (Aug 14, 2025):

#4850 (comment)

Sorry, saw this one but do not know what number to set for OLLAMA_MAX_VRAM to have the same effect as --n_gpu_layers 256

Try it without OLLAMA_MAX_VRAM first; only set it if the model fails to load.

Author
Owner

@SHU-red commented on GitHub (Aug 14, 2025):

OK, not sure what you mean.
I guess this is the VRAM of my GPU in bytes, which is 16GB?
I set it in docker-compose to 10000000, which should be 10GB?

Seems to work! But I'm not sure if this is the solution or if it is still set from your "interactive mode" hint.

Thanks anyway

services:
  ollama:
    container_name: ollama
    network_mode: bridge
    ports:
      - 11434:11434
    pull_policy: always
    tty: true
    restart: unless-stopped
    image: ollama/ollama:latest
    privileged: true
    environment:
      - OLLAMA_RUN_PARALLEL=1
      # - OLLAMA_CONTEXT_LENGTH=3000
      - GGML_CUDA_ENABLE_UNIFIED_MEMORY=1
      - OLLAMA_MAX_VRAM 13000000
    ...
Author
Owner

@alienatedsec commented on GitHub (Aug 14, 2025):

I don't believe you need the OLLAMA_MAX_VRAM variable, as it would likely use whatever is available anyway. You also need = to make it a valid env variable: - OLLAMA_MAX_VRAM=13000000

Regardless, what are your stats now?

Author
Owner

@SHU-red commented on GitHub (Aug 14, 2025):

I don't believe you need OLLAMA_MAX_VRAM

OK, sorry, I really don't get what you want me to do.

Regardless, what are your stats now?

  • stopped container
  • started with different combinations of
    environment:
      - OLLAMA_RUN_PARALLEL=1
      - OLLAMA_CONTEXT_LENGTH=32000
      - GGML_CUDA_ENABLE_UNIFIED_MEMORY=1
      - OLLAMA_MAX_VRAM=13000000

Only the CPU is working again.
It seems that setting /set parameter num_gpu 256 once is the only thing that does the trick, and it stays active until I completely shut down and restart the container, after which I would have to set it again, right?

So after my hard stop-and-start it is currently not working again.


$ sudo docker logs ollama; nvidia-smi; sudo docker exec ollama /bin/ollama ps
time=2025-08-14T11:36:27.005Z level=INFO source=routes.go:1304 msg="server config" env="map[CUDA_VISIBLE_DEVICES: GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: HTTPS_PROXY: HTTP_PROXY: NO_PROXY: OLLAMA_CONTEXT_LENGTH:32000 OLLAMA_DEBUG:INFO OLLAMA_FLASH_ATTENTION:false OLLAMA_GPU_OVERHEAD:0 OLLAMA_HOST:http://0.0.0.0:11434 OLLAMA_INTEL_GPU:false OLLAMA_KEEP_ALIVE:5m0s OLLAMA_KV_CACHE_TYPE: OLLAMA_LLM_LIBRARY: OLLAMA_LOAD_TIMEOUT:5m0s OLLAMA_MAX_LOADED_MODELS:0 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:/root/.ollama/models OLLAMA_MULTIUSER_CACHE:false OLLAMA_NEW_ENGINE:false OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:1 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://* vscode-webview://* vscode-file://*] OLLAMA_SCHED_SPREAD:false ROCR_VISIBLE_DEVICES: http_proxy: https_proxy: no_proxy:]"
time=2025-08-14T11:36:27.008Z level=INFO source=images.go:477 msg="total blobs: 66"
time=2025-08-14T11:36:27.009Z level=INFO source=images.go:484 msg="total unused blobs removed: 0"
time=2025-08-14T11:36:27.009Z level=INFO source=routes.go:1357 msg="Listening on [::]:11434 (version 0.11.4)"
time=2025-08-14T11:36:27.010Z level=INFO source=gpu.go:217 msg="looking for compatible GPUs"
time=2025-08-14T11:36:27.170Z level=INFO source=types.go:130 msg="inference compute" id=GPU-db50747e-f16a-d05b-fe6a-26ffdb825cc6 library=cuda variant=v12 compute=8.9 driver=12.9 name="NVIDIA GeForce RTX 4080" total="15.6 GiB" available="13.2 GiB"
time=2025-08-14T11:36:27.170Z level=INFO source=routes.go:1398 msg="entering low vram mode" "total vram"="15.6 GiB" threshold="20.0 GiB"
time=2025-08-14T11:36:30.709Z level=INFO source=server.go:135 msg="system memory" total="62.7 GiB" free="46.4 GiB" free_swap="310.9 MiB"
time=2025-08-14T11:36:30.710Z level=INFO source=server.go:175 msg=offload library=cuda layers.requested=-1 layers.model=25 layers.offload=9 layers.split="" memory.available="[13.2 GiB]" memory.gpu_overhead="0 B" memory.required.full="21.3 GiB" memory.required.partial="13.1 GiB" memory.required.kv="858.0 MiB" memory.required.allocations="[13.1 GiB]" memory.weights.total="11.7 GiB" memory.weights.repeating="10.7 GiB" memory.weights.nonrepeating="1.1 GiB" memory.graph.full="7.8 GiB" memory.graph.partial="7.8 GiB"
time=2025-08-14T11:36:30.752Z level=INFO source=server.go:438 msg="starting llama server" cmd="/usr/bin/ollama runner --ollama-engine --model /root/.ollama/models/blobs/sha256-b112e727c6f18875636c56a779790a590d705aec9e1c0eb5a97d51fc2a778583 --ctx-size 32000 --batch-size 512 --n-gpu-layers 9 --threads 8 --parallel 1 --port 41543"
time=2025-08-14T11:36:30.752Z level=INFO source=sched.go:481 msg="loaded runners" count=1
time=2025-08-14T11:36:30.752Z level=INFO source=server.go:598 msg="waiting for llama runner to start responding"
time=2025-08-14T11:36:30.753Z level=INFO source=server.go:632 msg="waiting for server to become available" status="llm server not responding"
time=2025-08-14T11:36:30.762Z level=INFO source=runner.go:925 msg="starting ollama engine"
time=2025-08-14T11:36:30.763Z level=INFO source=runner.go:983 msg="Server listening on 127.0.0.1:41543"
time=2025-08-14T11:36:30.812Z level=INFO source=ggml.go:92 msg="" architecture=gptoss file_type=MXFP4 name="" description="" num_tensors=315 num_key_values=30
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 4080, compute capability 8.9, VMM: yes
load_backend: loaded CUDA backend from /usr/lib/ollama/libggml-cuda.so
load_backend: loaded CPU backend from /usr/lib/ollama/libggml-cpu-haswell.so
time=2025-08-14T11:36:30.878Z level=INFO source=ggml.go:104 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.BMI2=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 CUDA.0.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.0.USE_GRAPHS=1 CUDA.0.PEER_MAX_BATCH_SIZE=128 compiler=cgo(gcc)
time=2025-08-14T11:36:30.971Z level=INFO source=ggml.go:365 msg="offloading 9 repeating layers to GPU"
time=2025-08-14T11:36:30.971Z level=INFO source=ggml.go:369 msg="offloading output layer to CPU"
time=2025-08-14T11:36:30.971Z level=INFO source=ggml.go:376 msg="offloaded 9/25 layers to GPU"
time=2025-08-14T11:36:30.971Z level=INFO source=ggml.go:379 msg="model weights" buffer=CPU size="8.8 GiB"
time=2025-08-14T11:36:30.971Z level=INFO source=ggml.go:379 msg="model weights" buffer=CUDA0 size="4.0 GiB"
time=2025-08-14T11:36:31.004Z level=INFO source=server.go:632 msg="waiting for server to become available" status="llm server loading model"
time=2025-08-14T11:36:31.264Z level=INFO source=ggml.go:668 msg="compute graph" backend=CUDA0 buffer_type=CUDA0 size="7.9 GiB"
time=2025-08-14T11:36:31.264Z level=INFO source=ggml.go:668 msg="compute graph" backend=CPU buffer_type=CPU size="7.8 GiB"
time=2025-08-14T11:36:33.541Z level=INFO source=server.go:637 msg="llama runner started in 2.79 seconds"
[GIN] 2025/08/14 - 11:39:31 | 200 |      27.592µs |       127.0.0.1 | HEAD     "/"
[GIN] 2025/08/14 - 11:39:31 | 200 |      95.169µs |       127.0.0.1 | GET      "/api/ps"
Thu Aug 14 11:40:39 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 575.64.05              Driver Version: 575.64.05      CUDA Version: 12.9     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 4080        Off |   00000000:06:00.0  On |                  N/A |
|  0%   59C    P2             71W /  320W |    8091MiB /  16376MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A           13851      C   frigate.detector.tensorrt               348MiB |
|    0   N/A  N/A           13909      C   /usr/lib/ffmpeg/7.0/bin/ffmpeg          340MiB |
|    0   N/A  N/A           13977      C   python3                                 236MiB |
|    0   N/A  N/A         1440630      C   /usr/lib/ffmpeg/7.0/bin/ffmpeg          336MiB |
|    0   N/A  N/A         1803736      C   /usr/bin/ollama                         490MiB |
|    0   N/A  N/A         3321696      C   /opt/venv/bin/python3                   236MiB |
|    0   N/A  N/A         3830441      G   /usr/lib/xorg/Xorg                      149MiB |
|    0   N/A  N/A         3831365      G   xfwm4                                     4MiB |
|    0   N/A  N/A         3831393    C+G   /usr/bin/sunshine                       241MiB |
|    0   N/A  N/A         3831662      G   ...nstallation/ubuntu12_32/steam          4MiB |
|    0   N/A  N/A         3831929      G   ./steamwebhelper                          9MiB |
|    0   N/A  N/A         3831952      G   ...on/ubuntu12_64/steamwebhelper        122MiB |
+-----------------------------------------------------------------------------------------+
NAME           ID              SIZE     PROCESSOR          CONTEXT    UNTIL
gpt-oss:20b    f2b8351c629c    22 GB    39%/61% CPU/GPU    32000      4 minutes from now

Author
Owner

@rick-github commented on GitHub (Aug 14, 2025):

OLLAMA_MAX_VRAM is no longer supported; it was a short-term workaround that has since been removed.

If you want a model that forces all layers onto the GPU:

echo FROM gpt-oss:20b > Modelfile
echo PARAMETER num_gpu 256 >> Modelfile
echo PARAMETER num_ctx 32000 >> Modelfile
ollama create gpt-oss:20b_ctx32k_gpu256
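
After creating it, the placement can be verified while the model is loaded (a sketch; the prompt is just a placeholder):

ollama run gpt-oss:20b_ctx32k_gpu256 "hello"
ollama ps
# the PROCESSOR column should now read "100% GPU" instead of a CPU/GPU split like "39%/61% CPU/GPU"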
Author
Owner

@SHU-red commented on GitHub (Aug 14, 2025):

echo FROM gpt-oss:20b > Modelfile
echo PARAMETER num_gpu 256 >> Modelfile
echo PARAMETER num_ctx 32000 >> Modelfile
ollama create gpt-oss:20b_ctx32k_gpu256

Awesome! Thank you!

Author
Owner

@alienatedsec commented on GitHub (Aug 16, 2025):

Good news regarding the latest v0.11.5-rc2:

root@[redacted]:/# ollama ps
NAME            ID              SIZE     PROCESSOR    CONTEXT    UNTIL   
gpt-oss:120b    735371f916a9    71 GB    100% GPU     128000     Forever    
root@[redacted]:/# 
Ollama Docker Logs
time=2025-08-16T23:37:28.946Z level=INFO source=routes.go:1305 msg="server config" env="map[CUDA_VISIBLE_DEVICES: GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: HTTPS_PROXY: HTTP_PROXY: NO_PROXY: OLLAMA_CONTEXT_LENGTH:4096 OLLAMA_DEBUG:INFO OLLAMA_FLASH_ATTENTION:true OLLAMA_GPU_OVERHEAD:0 OLLAMA_HOST:http://0.0.0.0:11434 OLLAMA_INTEL_GPU:false OLLAMA_KEEP_ALIVE:2562047h47m16.854775807s OLLAMA_KV_CACHE_TYPE: OLLAMA_LLM_LIBRARY: OLLAMA_LOAD_TIMEOUT:5m0s OLLAMA_MAX_LOADED_MODELS:0 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:/root/.ollama/models OLLAMA_MULTIUSER_CACHE:false OLLAMA_NEW_ENGINE:false OLLAMA_NEW_ESTIMATES:true OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:1 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://* vscode-webview://* vscode-file://*] OLLAMA_SCHED_SPREAD:false ROCR_VISIBLE_DEVICES: http_proxy: https_proxy: no_proxy:]"
time=2025-08-16T23:37:28.955Z level=INFO source=images.go:477 msg="total blobs: 34"
time=2025-08-16T23:37:28.957Z level=INFO source=images.go:484 msg="total unused blobs removed: 0"
time=2025-08-16T23:37:28.958Z level=INFO source=routes.go:1358 msg="Listening on [::]:11434 (version 0.11.5-rc2)"
time=2025-08-16T23:37:28.959Z level=INFO source=gpu.go:217 msg="looking for compatible GPUs"
time=2025-08-16T23:37:31.293Z level=INFO source=types.go:130 msg="inference compute" id=GPU-3b880d35-8b00-861c-3ac2-f8707baced68 library=cuda variant=v12 compute=8.9 driver=12.8 name="NVIDIA RTX 4000 Ada Generation" total="19.6 GiB" available="19.1 GiB"
time=2025-08-16T23:37:31.294Z level=INFO source=types.go:130 msg="inference compute" id=GPU-4df21d53-78e5-97fb-c7ee-b18beb63e1a7 library=cuda variant=v12 compute=8.9 driver=12.8 name="NVIDIA GeForce RTX 4060 Ti" total="15.6 GiB" available="15.5 GiB"
time=2025-08-16T23:37:31.294Z level=INFO source=types.go:130 msg="inference compute" id=GPU-f2ded6c5-33e4-31ac-5c61-c9088edaedbd library=cuda variant=v12 compute=8.6 driver=12.8 name="NVIDIA RTX A4000" total="15.6 GiB" available="15.4 GiB"
time=2025-08-16T23:37:31.294Z level=INFO source=types.go:130 msg="inference compute" id=GPU-c95bf02e-0608-db0d-7759-07d27659f5f8 library=cuda variant=v12 compute=8.6 driver=12.8 name="NVIDIA RTX A4000" total="15.6 GiB" available="15.4 GiB"
time=2025-08-16T23:37:31.294Z level=INFO source=types.go:130 msg="inference compute" id=GPU-fc191601-3661-ad41-ed61-2f8a1d5dbf6a library=cuda variant=v12 compute=8.6 driver=12.8 name="NVIDIA RTX A4000" total="15.6 GiB" available="15.4 GiB"
time=2025-08-16T23:37:31.294Z level=INFO source=types.go:130 msg="inference compute" id=GPU-eab804bb-3a67-1954-db1a-ddf62d8f427a library=cuda variant=v12 compute=8.6 driver=12.8 name="NVIDIA RTX A4000" total="15.6 GiB" available="15.4 GiB"
[GIN] 2025/08/16 - 23:38:05 | 200 |    9.898517ms |      172.17.0.1 | GET      "/api/tags"
[GIN] 2025/08/16 - 23:38:05 | 200 |     221.709µs |      172.17.0.1 | GET      "/api/ps"
[GIN] 2025/08/16 - 23:38:06 | 200 |      67.121µs |      172.17.0.1 | GET      "/api/version"
[GIN] 2025/08/16 - 23:38:17 | 200 |    3.167459ms |      172.17.0.1 | GET      "/api/tags"
[GIN] 2025/08/16 - 23:38:17 | 200 |      46.448µs |      172.17.0.1 | GET      "/api/ps"
time=2025-08-16T23:38:18.558Z level=INFO source=server.go:166 msg="enabling new memory estimates"
time=2025-08-16T23:38:20.691Z level=INFO source=server.go:211 msg="enabling flash attention"
time=2025-08-16T23:38:20.691Z level=WARN source=server.go:219 msg="kv cache type not supported by model" type=""
time=2025-08-16T23:38:20.692Z level=INFO source=server.go:383 msg="starting runner" cmd="/usr/bin/ollama runner --ollama-engine --model /root/.ollama/models/blobs/sha256-90a618fe6ff21b09ca968df959104eb650658b0bef0faef785c18c2795d993e3 --port 38569"
time=2025-08-16T23:38:20.694Z level=INFO source=server.go:657 msg="loading model" "model layers"=37 requested=256
time=2025-08-16T23:38:20.734Z level=INFO source=runner.go:1006 msg="starting ollama engine"
time=2025-08-16T23:38:20.734Z level=INFO source=runner.go:1043 msg="Server listening on 127.0.0.1:38569"
time=2025-08-16T23:38:22.796Z level=INFO source=server.go:663 msg="system memory" total="125.8 GiB" free="122.7 GiB" free_swap="0 B"
time=2025-08-16T23:38:22.796Z level=INFO source=server.go:667 msg="gpu memory" id=GPU-3b880d35-8b00-861c-3ac2-f8707baced68 available="18.7 GiB" free="19.1 GiB" minimum="457.0 MiB" overhead="0 B"
time=2025-08-16T23:38:22.796Z level=INFO source=server.go:667 msg="gpu memory" id=GPU-4df21d53-78e5-97fb-c7ee-b18beb63e1a7 available="15.0 GiB" free="15.5 GiB" minimum="457.0 MiB" overhead="0 B"
time=2025-08-16T23:38:22.796Z level=INFO source=server.go:667 msg="gpu memory" id=GPU-f2ded6c5-33e4-31ac-5c61-c9088edaedbd available="15.0 GiB" free="15.4 GiB" minimum="457.0 MiB" overhead="0 B"
time=2025-08-16T23:38:22.796Z level=INFO source=server.go:667 msg="gpu memory" id=GPU-c95bf02e-0608-db0d-7759-07d27659f5f8 available="15.0 GiB" free="15.4 GiB" minimum="457.0 MiB" overhead="0 B"
time=2025-08-16T23:38:22.796Z level=INFO source=server.go:667 msg="gpu memory" id=GPU-fc191601-3661-ad41-ed61-2f8a1d5dbf6a available="15.0 GiB" free="15.4 GiB" minimum="457.0 MiB" overhead="0 B"
time=2025-08-16T23:38:22.796Z level=INFO source=server.go:667 msg="gpu memory" id=GPU-eab804bb-3a67-1954-db1a-ddf62d8f427a available="15.0 GiB" free="15.4 GiB" minimum="457.0 MiB" overhead="0 B"
time=2025-08-16T23:38:22.799Z level=INFO source=runner.go:925 msg=load request="{Operation:fit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:true KvSize:128000 KvCacheType: NumThreads:16 GPULayers:37[ID:GPU-3b880d35-8b00-861c-3ac2-f8707baced68 Layers:37(0..36)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2025-08-16T23:38:23.007Z level=INFO source=ggml.go:130 msg="" architecture=gptoss file_type=MXFP4 name="" description="" num_tensors=471 num_key_values=30
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 6 CUDA devices:
  Device 0: NVIDIA RTX 4000 Ada Generation, compute capability 8.9, VMM: yes, ID: GPU-3b880d35-8b00-861c-3ac2-f8707baced68
  Device 1: NVIDIA GeForce RTX 4060 Ti, compute capability 8.9, VMM: yes, ID: GPU-4df21d53-78e5-97fb-c7ee-b18beb63e1a7
  Device 2: NVIDIA RTX A4000, compute capability 8.6, VMM: yes, ID: GPU-f2ded6c5-33e4-31ac-5c61-c9088edaedbd
  Device 3: NVIDIA RTX A4000, compute capability 8.6, VMM: yes, ID: GPU-c95bf02e-0608-db0d-7759-07d27659f5f8
  Device 4: NVIDIA RTX A4000, compute capability 8.6, VMM: yes, ID: GPU-fc191601-3661-ad41-ed61-2f8a1d5dbf6a
  Device 5: NVIDIA RTX A4000, compute capability 8.6, VMM: yes, ID: GPU-eab804bb-3a67-1954-db1a-ddf62d8f427a
load_backend: loaded CUDA backend from /usr/lib/ollama/libggml-cuda.so
load_backend: loaded CPU backend from /usr/lib/ollama/libggml-cpu-x64.so
time=2025-08-16T23:38:24.690Z level=INFO source=ggml.go:104 msg=system CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 CUDA.0.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.0.USE_GRAPHS=1 CUDA.0.PEER_MAX_BATCH_SIZE=128 CUDA.1.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.1.USE_GRAPHS=1 CUDA.1.PEER_MAX_BATCH_SIZE=128 CUDA.2.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.2.USE_GRAPHS=1 CUDA.2.PEER_MAX_BATCH_SIZE=128 CUDA.3.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.3.USE_GRAPHS=1 CUDA.3.PEER_MAX_BATCH_SIZE=128 CUDA.4.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.4.USE_GRAPHS=1 CUDA.4.PEER_MAX_BATCH_SIZE=128 CUDA.5.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.5.USE_GRAPHS=1 CUDA.5.PEER_MAX_BATCH_SIZE=128 compiler=cgo(gcc)
time=2025-08-16T23:38:25.094Z level=INFO source=runner.go:925 msg=load request="{Operation:fit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:true KvSize:128000 KvCacheType: NumThreads:16 GPULayers:37[ID:GPU-3b880d35-8b00-861c-3ac2-f8707baced68 Layers:9(0..8) ID:GPU-4df21d53-78e5-97fb-c7ee-b18beb63e1a7 Layers:7(9..15) ID:GPU-f2ded6c5-33e4-31ac-5c61-c9088edaedbd Layers:7(16..22) ID:GPU-c95bf02e-0608-db0d-7759-07d27659f5f8 Layers:7(23..29) ID:GPU-fc191601-3661-ad41-ed61-2f8a1d5dbf6a Layers:7(30..36)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2025-08-16T23:38:25.357Z level=INFO source=runner.go:925 msg=load request="{Operation:alloc LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:true KvSize:128000 KvCacheType: NumThreads:16 GPULayers:37[ID:GPU-3b880d35-8b00-861c-3ac2-f8707baced68 Layers:9(0..8) ID:GPU-4df21d53-78e5-97fb-c7ee-b18beb63e1a7 Layers:7(9..15) ID:GPU-f2ded6c5-33e4-31ac-5c61-c9088edaedbd Layers:7(16..22) ID:GPU-c95bf02e-0608-db0d-7759-07d27659f5f8 Layers:7(23..29) ID:GPU-fc191601-3661-ad41-ed61-2f8a1d5dbf6a Layers:7(30..36)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2025-08-16T23:38:26.836Z level=INFO source=runner.go:925 msg=load request="{Operation:commit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:true KvSize:128000 KvCacheType: NumThreads:16 GPULayers:37[ID:GPU-3b880d35-8b00-861c-3ac2-f8707baced68 Layers:9(0..8) ID:GPU-4df21d53-78e5-97fb-c7ee-b18beb63e1a7 Layers:7(9..15) ID:GPU-f2ded6c5-33e4-31ac-5c61-c9088edaedbd Layers:7(16..22) ID:GPU-c95bf02e-0608-db0d-7759-07d27659f5f8 Layers:7(23..29) ID:GPU-fc191601-3661-ad41-ed61-2f8a1d5dbf6a Layers:7(30..36)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2025-08-16T23:38:26.836Z level=INFO source=ggml.go:486 msg="offloading 36 repeating layers to GPU"
time=2025-08-16T23:38:26.836Z level=INFO source=ggml.go:492 msg="offloading output layer to GPU"
time=2025-08-16T23:38:26.836Z level=INFO source=ggml.go:497 msg="offloaded 37/37 layers to GPU"
time=2025-08-16T23:38:26.838Z level=INFO source=backend.go:310 msg="model weights" device=CUDA0 size="14.7 GiB"
time=2025-08-16T23:38:26.838Z level=INFO source=backend.go:310 msg="model weights" device=CUDA1 size="11.4 GiB"
time=2025-08-16T23:38:26.838Z level=INFO source=backend.go:310 msg="model weights" device=CUDA2 size="11.4 GiB"
time=2025-08-16T23:38:26.838Z level=INFO source=backend.go:310 msg="model weights" device=CUDA3 size="11.4 GiB"
time=2025-08-16T23:38:26.838Z level=INFO source=backend.go:310 msg="model weights" device=CUDA4 size="10.9 GiB"
time=2025-08-16T23:38:26.838Z level=INFO source=backend.go:315 msg="model weights" device=CPU size="1.1 GiB"
time=2025-08-16T23:38:26.838Z level=INFO source=backend.go:321 msg="kv cache" device=CUDA0 size="1.0 GiB"
time=2025-08-16T23:38:26.838Z level=INFO source=backend.go:321 msg="kv cache" device=CUDA1 size="1.0 GiB"
time=2025-08-16T23:38:26.838Z level=INFO source=backend.go:321 msg="kv cache" device=CUDA2 size="786.0 MiB"
time=2025-08-16T23:38:26.838Z level=INFO source=backend.go:321 msg="kv cache" device=CUDA3 size="1.0 GiB"
time=2025-08-16T23:38:26.838Z level=INFO source=backend.go:321 msg="kv cache" device=CUDA4 size="777.0 MiB"
time=2025-08-16T23:38:26.838Z level=INFO source=backend.go:332 msg="compute graph" device=CUDA0 size="238.8 MiB"
time=2025-08-16T23:38:26.838Z level=INFO source=backend.go:332 msg="compute graph" device=CUDA1 size="231.3 MiB"
time=2025-08-16T23:38:26.838Z level=INFO source=backend.go:332 msg="compute graph" device=CUDA2 size="231.3 MiB"
time=2025-08-16T23:38:26.838Z level=INFO source=backend.go:332 msg="compute graph" device=CUDA3 size="231.3 MiB"
time=2025-08-16T23:38:26.838Z level=INFO source=backend.go:332 msg="compute graph" device=CUDA4 size="231.3 MiB"
time=2025-08-16T23:38:26.838Z level=INFO source=backend.go:337 msg="compute graph" device=CPU size="5.6 MiB"
time=2025-08-16T23:38:26.838Z level=INFO source=backend.go:342 msg="total memory" size="66.6 GiB"
time=2025-08-16T23:38:26.838Z level=INFO source=sched.go:473 msg="loaded runners" count=1
time=2025-08-16T23:38:26.838Z level=INFO source=server.go:1232 msg="waiting for llama runner to start responding"
time=2025-08-16T23:38:26.839Z level=INFO source=server.go:1266 msg="waiting for server to become available" status="llm server loading model"
time=2025-08-16T23:40:04.710Z level=INFO source=server.go:1270 msg="llama runner started in 104.02 seconds"
[GIN] 2025/08/16 - 23:40:33 | 200 |      71.827µs |       127.0.0.1 | HEAD     "/"
[GIN] 2025/08/16 - 23:40:33 | 200 |      50.645µs |       127.0.0.1 | GET      "/api/ps"
[GIN] 2025/08/16 - 23:40:42 | 200 |         2m27s |      172.17.0.1 | POST     "/api/chat"
[GIN] 2025/08/16 - 23:40:53 | 200 |  10.33721764s |      172.17.0.1 | POST     "/api/chat"
[GIN] 2025/08/16 - 23:41:00 | 200 |  7.650597639s |      172.17.0.1 | POST     "/api/chat"
[GIN] 2025/08/16 - 23:41:15 | 200 | 14.206551211s |      172.17.0.1 | POST     "/api/chat"

It's also around 50% quicker: an average of 29 tokens/s now versus 19 tokens/s before.
[image attachment]

total duration:       24.612095582s
load duration:        509.472421ms
prompt eval count:    73 token(s)
prompt eval duration: 259.819606ms
prompt eval rate:     280.96 tokens/s
eval count:           708 token(s)
eval duration:        23.840691078s
eval rate:            29.70 tokens/s
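
These timings are in the format the CLI prints when run with the verbose flag; to reproduce the comparison locally, something like the following should emit the same fields:

  ollama run gpt-oss:120b --verbose "Write a long story"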
nvidia-smi output
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.172.08             Driver Version: 570.172.08     CUDA Version: 12.8     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 4060 Ti     On  |   00000000:01:00.0 Off |                  N/A |
|  0%   51C    P8              7W /  165W |   13105MiB /  16380MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA RTX 4000 Ada Gene...    On  |   00000000:02:00.0 Off |                  Off |
| 30%   47C    P8             11W /  130W |   16753MiB /  20475MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   2  NVIDIA RTX A4000               On  |   00000000:03:00.0 Off |                  Off |
| 41%   43C    P8             10W /  140W |   12947MiB /  16376MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   3  NVIDIA RTX A4000               On  |   00000000:04:00.0 Off |                  Off |
| 41%   54C    P8             11W /  140W |   13201MiB /  16376MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   4  NVIDIA RTX A4000               On  |   00000000:05:00.0 Off |                  Off |
| 41%   50C    P8              7W /  140W |   12805MiB /  16376MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   5  NVIDIA RTX A4000               On  |   00000000:06:00.0 Off |                  Off |
| 41%   45C    P8              7W /  140W |     167MiB /  16376MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A            3209      C   /usr/bin/ollama                         166MiB |
|    1   N/A  N/A            1532      C   /usr/local/bin/python3                  262MiB |
|    1   N/A  N/A            3209      C   /usr/bin/ollama                         218MiB |
|    2   N/A  N/A            3209      C   /usr/bin/ollama                         264MiB |
|    3   N/A  N/A            3209      C   /usr/bin/ollama                         264MiB |
|    4   N/A  N/A            3209      C   /usr/bin/ollama                         396MiB |
|    5   N/A  N/A            3209      C   /usr/bin/ollama                         158MiB |
+-----------------------------------------------------------------------------------------+
Author
Owner

@Queracus commented on GitHub (Oct 6, 2025):

They solved all this in the latest Ollama. You can crank up the context and it uses the GPU. The problem was that OLLAMA_FLASH_ATTENTION wasn't supported for gpt-oss in Ollama, so even on a 24 GB GPU you could only go to roughly 32k context before it switched to the CPU.

Have to say the 20b model is really smart for such a small one, and very fast on a 3090. It only uses approximately 16 GB of VRAM at 256k context.
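
For anyone who wants to verify the same thing on a single card, a minimal sketch (assuming a recent Ollama build; the context value is only an example and should be sized to your VRAM):

  # enable flash attention for the server (recent builds also auto-enable it for models that want it)
  export OLLAMA_FLASH_ATTENTION=1
  # raise the default context window; pick a value your VRAM can hold
  export OLLAMA_CONTEXT_LENGTH=131072
  ollama serve
  # then, from another shell
  ollama run gpt-oss:20b "Write a long story"
  ollama ps        # PROCESSOR should report 100% GPU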

Author
Owner

@SHU-red commented on GitHub (Oct 6, 2025):

I'm not an expert on this, but yeah, sometimes it's working.
I have an RTX 4080 with only 15 GB of memory.
Is there a good setting that lets the model run stably (without occasionally failing to load) and on the GPU, given my lower available memory?

Right now it sometimes can't load ...

Author
Owner

@jessegross commented on GitHub (Oct 6, 2025):

@SHU-red Can you post the log from a time when it doesn't load?

Author
Owner

@SHU-red commented on GitHub (Oct 6, 2025):

@SHU-red Can you post the log from a time when it doesn't load?

Yes, sorry, I should provide more information:

...
    environment:
      - OLLAMA_RUN_PARALLEL=1
      #- OLLAMA_CONTEXT_LENGTH=32000
      - GGML_CUDA_ENABLE_UNIFIED_MEMORY=1
      #- OLLAMA_MAX_VRAM=10000000
...
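
For completeness, the snippet above is only the environment fragment; one common shape for the whole compose service (a sketch, assuming the stock ollama/ollama image and the NVIDIA container toolkit, with illustrative values) is:

  services:
    ollama:
      image: ollama/ollama
      ports:
        - "11434:11434"
      volumes:
        - ollama:/root/.ollama
      environment:
        - OLLAMA_FLASH_ATTENTION=1
        - GGML_CUDA_ENABLE_UNIFIED_MEMORY=1
        # - OLLAMA_CONTEXT_LENGTH=32000
      deploy:
        resources:
          reservations:
            devices:
              - driver: nvidia
                count: all
                capabilities: [gpu]
  volumes:
    ollama: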

Prepared models from the conversations above (a sketch of how such variants are typically created follows the list):

  • gpt-oss:20b
  • gpt-oss:20b_ctx32k
  • gpt-oss:20b_ctx32k_gpu256
  • gpt-oss:20b_ctx3k
  • gpt-oss:20b_gpu256
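
For readers who have not followed the whole thread: variants like these are normally built from a small Modelfile, roughly like the sketch below (the parameter values are read off the tag names; the file name is arbitrary):

  # Modelfile for a 32k-context, 256-GPU-layer variant
  FROM gpt-oss:20b
  PARAMETER num_ctx 32000
  PARAMETER num_gpu 256

  ollama create gpt-oss:20b_ctx32k_gpu256 -f Modelfile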

gpt-oss:20b:

  • worked, but partly ran on the CPU (ollama ps below reports a 16%/84% CPU/GPU split)
time=2025-10-06T17:12:28.388Z level=ERROR source=server.go:1459 msg="post predict" error="Post \"http://127.0.0.1:39695/completion\": EOF"
[GIN] 2025/10/06 - 17:12:28 | 200 |  6.512810209s |      172.18.0.1 | POST     "/api/chat"
time=2025-10-06T17:12:28.414Z level=ERROR source=server.go:425 msg="llama runner terminated" error="exit status 2"
time=2025-10-06T17:17:33.534Z level=WARN source=sched.go:649 msg="gpu VRAM usage didn't recover within timeout" seconds=5.144168176 runner.size="13.8 GiB" runner.vram="13.8 GiB" runner.parallel=1 runner.pid=36294 runner.model=/root/.ollama/models/blobs/sha256-b112e727c6f18875636c56a779790a590d705aec9e1c0eb5a97d51fc2a778583
time=2025-10-06T17:17:33.783Z level=WARN source=sched.go:649 msg="gpu VRAM usage didn't recover within timeout" seconds=5.393604737 runner.size="13.8 GiB" runner.vram="13.8 GiB" runner.parallel=1 runner.pid=36294 runner.model=/root/.ollama/models/blobs/sha256-b112e727c6f18875636c56a779790a590d705aec9e1c0eb5a97d51fc2a778583
time=2025-10-06T17:17:34.034Z level=WARN source=sched.go:649 msg="gpu VRAM usage didn't recover within timeout" seconds=5.644011406 runner.size="13.8 GiB" runner.vram="13.8 GiB" runner.parallel=1 runner.pid=36294 runner.model=/root/.ollama/models/blobs/sha256-b112e727c6f18875636c56a779790a590d705aec9e1c0eb5a97d51fc2a778583
[GIN] 2025/10/06 - 18:13:29 | 200 |      20.147µs |       127.0.0.1 | HEAD     "/"
[GIN] 2025/10/06 - 18:13:29 | 200 |       8.566µs |       127.0.0.1 | GET      "/api/ps"
[GIN] 2025/10/06 - 18:14:08 | 200 |    1.984065ms |      172.18.0.1 | GET      "/api/tags"
[GIN] 2025/10/06 - 18:14:08 | 200 |      14.297µs |      172.18.0.1 | GET      "/api/ps"
[GIN] 2025/10/06 - 18:14:09 | 200 |      32.451µs |      172.18.0.1 | GET      "/api/version"
[GIN] 2025/10/06 - 18:15:32 | 200 |    2.006527ms |      172.18.0.1 | GET      "/api/tags"
time=2025-10-06T18:15:54.240Z level=INFO source=server.go:200 msg="model wants flash attention"
time=2025-10-06T18:15:54.240Z level=INFO source=server.go:217 msg="enabling flash attention"
time=2025-10-06T18:15:54.241Z level=INFO source=server.go:399 msg="starting runner" cmd="/usr/bin/ollama runner --ollama-engine --model /root/.ollama/models/blobs/sha256-b112e727c6f18875636c56a779790a590d705aec9e1c0eb5a97d51fc2a778583 --port 43327"
time=2025-10-06T18:15:54.241Z level=INFO source=server.go:672 msg="loading model" "model layers"=25 requested=-1
time=2025-10-06T18:15:54.253Z level=INFO source=runner.go:1252 msg="starting ollama engine"
time=2025-10-06T18:15:54.253Z level=INFO source=runner.go:1287 msg="Server listening on 127.0.0.1:43327"
time=2025-10-06T18:15:54.383Z level=INFO source=server.go:678 msg="system memory" total="62.7 GiB" free="44.3 GiB" free_swap="126.7 MiB"
time=2025-10-06T18:15:54.383Z level=INFO source=server.go:686 msg="gpu memory" id=GPU-db50747e-f16a-d05b-fe6a-26ffdb825cc6 available="11.7 GiB" free="12.1 GiB" minimum="457.0 MiB" overhead="0 B"
time=2025-10-06T18:15:54.384Z level=INFO source=runner.go:1171 msg=load request="{Operation:fit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:true KvSize:4096 KvCacheType: NumThreads:8 GPULayers:25[ID:GPU-db50747e-f16a-d05b-fe6a-26ffdb825cc6 Layers:25(0..24)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2025-10-06T18:15:54.433Z level=INFO source=ggml.go:131 msg="" architecture=gptoss file_type=MXFP4 name="" description="" num_tensors=315 num_key_values=30
load_backend: loaded CPU backend from /usr/lib/ollama/libggml-cpu-haswell.so
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 4080, compute capability 8.9, VMM: yes, ID: GPU-db50747e-f16a-d05b-fe6a-26ffdb825cc6
load_backend: loaded CUDA backend from /usr/lib/ollama/cuda_v13/libggml-cuda.so
time=2025-10-06T18:15:54.500Z level=INFO source=ggml.go:104 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.BMI2=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 CUDA.0.ARCHS=750,800,860,870,890,900,1000,1100,1200,1210 CUDA.0.USE_GRAPHS=1 CUDA.0.PEER_MAX_BATCH_SIZE=128 compiler=cgo(gcc)
time=2025-10-06T18:15:54.605Z level=INFO source=runner.go:1171 msg=load request="{Operation:fit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:true KvSize:4096 KvCacheType: NumThreads:8 GPULayers:24[ID:GPU-db50747e-f16a-d05b-fe6a-26ffdb825cc6 Layers:24(0..23)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2025-10-06T18:15:54.653Z level=INFO source=runner.go:1171 msg=load request="{Operation:alloc LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:true KvSize:4096 KvCacheType: NumThreads:8 GPULayers:24[ID:GPU-db50747e-f16a-d05b-fe6a-26ffdb825cc6 Layers:24(0..23)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2025-10-06T18:15:54.738Z level=INFO source=runner.go:1171 msg=load request="{Operation:commit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:true KvSize:4096 KvCacheType: NumThreads:8 GPULayers:24[ID:GPU-db50747e-f16a-d05b-fe6a-26ffdb825cc6 Layers:24(0..23)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2025-10-06T18:15:54.738Z level=INFO source=ggml.go:487 msg="offloading 24 repeating layers to GPU"
time=2025-10-06T18:15:54.738Z level=INFO source=ggml.go:491 msg="offloading output layer to CPU"
time=2025-10-06T18:15:54.738Z level=INFO source=ggml.go:498 msg="offloaded 24/25 layers to GPU"
time=2025-10-06T18:15:54.738Z level=INFO source=backend.go:310 msg="model weights" device=CUDA0 size="10.7 GiB"
time=2025-10-06T18:15:54.738Z level=INFO source=backend.go:315 msg="model weights" device=CPU size="2.2 GiB"
time=2025-10-06T18:15:54.738Z level=INFO source=backend.go:321 msg="kv cache" device=CUDA0 size="204.0 MiB"
time=2025-10-06T18:15:54.738Z level=INFO source=backend.go:332 msg="compute graph" device=CUDA0 size="117.8 MiB"
time=2025-10-06T18:15:54.738Z level=INFO source=backend.go:337 msg="compute graph" device=CPU size="5.6 MiB"
time=2025-10-06T18:15:54.738Z level=INFO source=backend.go:342 msg="total memory" size="13.2 GiB"
time=2025-10-06T18:15:54.738Z level=INFO source=sched.go:470 msg="loaded runners" count=1
time=2025-10-06T18:15:54.738Z level=INFO source=server.go:1251 msg="waiting for llama runner to start responding"
time=2025-10-06T18:15:54.740Z level=INFO source=server.go:1285 msg="waiting for server to become available" status="llm server loading model"
time=2025-10-06T18:15:57.998Z level=INFO source=server.go:1289 msg="llama runner started in 3.76 seconds"
[GIN] 2025/10/06 - 18:16:14 | 200 | 20.536099806s |      172.18.0.1 | POST     "/api/chat"
[GIN] 2025/10/06 - 18:16:23 | 200 |  9.119721114s |      172.18.0.1 | POST     "/api/chat"
[GIN] 2025/10/06 - 18:16:33 | 200 |  8.661983665s |      172.18.0.1 | POST     "/api/chat"
[GIN] 2025/10/06 - 18:16:41 | 200 |      18.285µs |       127.0.0.1 | HEAD     "/"
[GIN] 2025/10/06 - 18:16:41 | 200 |      18.405µs |       127.0.0.1 | GET      "/api/ps"
[GIN] 2025/10/06 - 18:16:58 | 200 |      25.197µs |       127.0.0.1 | HEAD     "/"
[GIN] 2025/10/06 - 18:16:58 | 200 |      23.835µs |       127.0.0.1 | GET      "/api/ps"
[GIN] 2025/10/06 - 18:17:19 | 200 |       21.28µs |       127.0.0.1 | HEAD     "/"
[GIN] 2025/10/06 - 18:17:19 | 200 |      24.055µs |       127.0.0.1 | GET      "/api/ps"
[GIN] 2025/10/06 - 18:18:07 | 200 |      19.967µs |       127.0.0.1 | HEAD     "/"
[GIN] 2025/10/06 - 18:18:07 | 200 |      27.441µs |       127.0.0.1 | GET      "/api/ps"
[GIN] 2025/10/06 - 18:19:59 | 200 |  7.556107172s |      172.18.0.1 | POST     "/api/chat"
[GIN] 2025/10/06 - 18:20:15 | 200 | 14.963713581s |      172.18.0.1 | POST     "/api/chat"
Mon Oct  6 18:20:19 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.82.07              Driver Version: 580.82.07      CUDA Version: 13.0     |
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 4080        Off |   00000000:06:00.0  On |                  N/A |
|  0%   54C    P2             54W /  320W |   14867MiB /  16376MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A           48824      C   /opt/venv/bin/python3                   238MiB |
|    0   N/A  N/A           55339      C   python3                                 238MiB |
|    0   N/A  N/A           56644      C   frigate.detector.onnx                   364MiB |
|    0   N/A  N/A           56674      C   frigate.embeddings_manager              958MiB |
|    0   N/A  N/A           56912      C   /usr/lib/ffmpeg/7.0/bin/ffmpeg          330MiB |
|    0   N/A  N/A         2420338      G   /usr/lib/xorg/Xorg                      143MiB |
|    0   N/A  N/A         2421301      G   xfwm4                                     4MiB |
|    0   N/A  N/A         2421306    C+G   /usr/bin/sunshine                       243MiB |
|    0   N/A  N/A         2421624      G   ...nstallation/ubuntu12_32/steam          4MiB |
|    0   N/A  N/A         2421967      G   ./steamwebhelper                          9MiB |
|    0   N/A  N/A         2421990      G   ...on/ubuntu12_64/steamwebhelper         10MiB |
|    0   N/A  N/A         2437305      C   /usr/lib/ffmpeg/7.0/bin/ffmpeg          294MiB |
|    0   N/A  N/A         3265107      C   /usr/lib/ffmpeg/7.0/bin/ffmpeg          322MiB |
|    0   N/A  N/A         3495699      C   /usr/bin/ollama                         320MiB |
+-----------------------------------------------------------------------------------------+
NAME           ID              SIZE     PROCESSOR          CONTEXT    UNTIL
gpt-oss:20b    f2b8351c629c    14 GB    16%/84% CPU/GPU    4096       4 minutes from now

gpt-oss:20b_ctx32k_gpu256:

  • did not work due to resource limitations (the runner crashed; trace below)
        net/http/server.go:3454 +0x485

goroutine 320 gp=0xc0000d7500 m=nil [IO wait]:
runtime.gopark(0x0?, 0x0?, 0x0?, 0x0?, 0xb?)
        runtime/proc.go:435 +0xce fp=0xc00146bdd8 sp=0xc00146bdb8 pc=0x55b7d185b86e
runtime.netpollblock(0x55b7d187ebd8?, 0xd17f4666?, 0xb7?)
        runtime/netpoll.go:575 +0xf7 fp=0xc00146be10 sp=0xc00146bdd8 pc=0x55b7d1820357
internal/poll.runtime_pollWait(0x7ff8a4654cc8, 0x72)
        runtime/netpoll.go:351 +0x85 fp=0xc00146be30 sp=0xc00146be10 pc=0x55b7d185aa85
internal/poll.(*pollDesc).wait(0xc00048f380?, 0xc00036c101?, 0x0)
        internal/poll/fd_poll_runtime.go:84 +0x27 fp=0xc00146be58 sp=0xc00146be30 pc=0x55b7d18e1ec7
internal/poll.(*pollDesc).waitRead(...)
        internal/poll/fd_poll_runtime.go:89
internal/poll.(*FD).Read(0xc00048f380, {0xc00036c101, 0x1, 0x1})
        internal/poll/fd_unix.go:165 +0x27a fp=0xc00146bef0 sp=0xc00146be58 pc=0x55b7d18e31ba
net.(*netFD).Read(0xc00048f380, {0xc00036c101?, 0xc00012f998?, 0xc00146bf70?})
        net/fd_posix.go:55 +0x25 fp=0xc00146bf38 sp=0xc00146bef0 pc=0x55b7d19582a5
net.(*conn).Read(0xc00011c990, {0xc00036c101?, 0x0?, 0x0?})
        net/net.go:194 +0x45 fp=0xc00146bf80 sp=0xc00146bf38 pc=0x55b7d1966665
net/http.(*connReader).backgroundRead(0xc00036c0f0)
        net/http/server.go:690 +0x37 fp=0xc00146bfc8 sp=0xc00146bf80 pc=0x55b7d1b524d7
net/http.(*connReader).startBackgroundRead.gowrap2()
        net/http/server.go:686 +0x25 fp=0xc00146bfe0 sp=0xc00146bfc8 pc=0x55b7d1b52405
runtime.goexit({})
        runtime/asm_amd64.s:1700 +0x1 fp=0xc00146bfe8 sp=0xc00146bfe0 pc=0x55b7d1862fa1
created by net/http.(*connReader).startBackgroundRead in goroutine 8
        net/http/server.go:686 +0xb6

goroutine 373 gp=0xc00104c8c0 m=nil [chan receive]:
runtime.gopark(0x30?, 0x55b7d2c889a0?, 0x1?, 0xd7?, 0xc000096b30?)
        runtime/proc.go:435 +0xce fp=0xc000096ae8 sp=0xc000096ac8 pc=0x55b7d185b86e
runtime.chanrecv(0xc0006612d0, 0x0, 0x1)
        runtime/chan.go:664 +0x445 fp=0xc000096b60 sp=0xc000096ae8 pc=0x55b7d17f7245
runtime.chanrecv1(0x55b7d2874771?, 0x2c?)
        runtime/chan.go:506 +0x12 fp=0xc000096b88 sp=0xc000096b60 pc=0x55b7d17f6dd2
github.com/ollama/ollama/runner/ollamarunner.(*Server).computeBatch(0xc00022b0e0, {0x1, {0x55b7d2d331e0, 0xc001606000}, {0x55b7d2d3de48, 0xc000dc4e58}, {0xc000ec8008, 0x134, 0x25f}, {{0x55b7d2d3de48, ...}, ...}, ...})
        github.com/ollama/ollama/runner/ollamarunner/runner.go:602 +0x185 fp=0xc000096ef0 sp=0xc000096b88 pc=0x55b7d1d69925
github.com/ollama/ollama/runner/ollamarunner.(*Server).run.gowrap1()
        github.com/ollama/ollama/runner/ollamarunner/runner.go:425 +0x58 fp=0xc000096fe0 sp=0xc000096ef0 pc=0x55b7d1d67e38
runtime.goexit({})
        runtime/asm_amd64.s:1700 +0x1 fp=0xc000096fe8 sp=0xc000096fe0 pc=0x55b7d1862fa1
created by github.com/ollama/ollama/runner/ollamarunner.(*Server).run in goroutine 7
        github.com/ollama/ollama/runner/ollamarunner/runner.go:425 +0x2ed

rax    0x0
rbx    0xb775
rcx    0x7ff8eca0fb2c
rdx    0x6
rdi    0xb757
rsi    0xb775
rbp    0x7ff804ffb2e0
rsp    0x7ff804ffb2a0
r8     0x0
r9     0x7
r10    0x8
r11    0x246
r12    0x6
r13    0x7ff858e0d448
r14    0x16
r15    0x7ff47a7a0800
rip    0x7ff8eca0fb2c
rflags 0x246
cs     0x33
fs     0x0
gs     0x0
time=2025-10-06T18:21:28.560Z level=ERROR source=server.go:1459 msg="post predict" error="Post \"http://127.0.0.1:37887/completion\": EOF"
[GIN] 2025/10/06 - 18:21:28 | 200 |  5.899638513s |      172.18.0.1 | POST     "/api/chat"
time=2025-10-06T18:21:28.586Z level=ERROR source=server.go:425 msg="llama runner terminated" error="exit status 2"
Mon Oct  6 18:22:48 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.82.07              Driver Version: 580.82.07      CUDA Version: 13.0     |
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 4080        Off |   00000000:06:00.0  On |                  N/A |
|  0%   56C    P2             55W /  320W |    3276MiB /  16376MiB |      4%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A           48824      C   /opt/venv/bin/python3                   238MiB |
|    0   N/A  N/A           55339      C   python3                                 238MiB |
|    0   N/A  N/A           56644      C   frigate.detector.onnx                   364MiB |
|    0   N/A  N/A           56674      C   frigate.embeddings_manager              958MiB |
|    0   N/A  N/A           56912      C   /usr/lib/ffmpeg/7.0/bin/ffmpeg          330MiB |
|    0   N/A  N/A         2420338      G   /usr/lib/xorg/Xorg                      143MiB |
|    0   N/A  N/A         2421301      G   xfwm4                                     4MiB |
|    0   N/A  N/A         2421306    C+G   /usr/bin/sunshine                       243MiB |
|    0   N/A  N/A         2421624      G   ...nstallation/ubuntu12_32/steam          4MiB |
|    0   N/A  N/A         2421967      G   ./steamwebhelper                          9MiB |
|    0   N/A  N/A         2421990      G   ...on/ubuntu12_64/steamwebhelper         10MiB |
|    0   N/A  N/A         2437305      C   /usr/lib/ffmpeg/7.0/bin/ffmpeg          294MiB |
|    0   N/A  N/A         3265107      C   /usr/lib/ffmpeg/7.0/bin/ffmpeg          322MiB |
+-----------------------------------------------------------------------------------------+
NAME                         ID              SIZE     PROCESSOR    CONTEXT    UNTIL
gpt-oss:20b_ctx32k_gpu256    4faa45587112    14 GB    100% GPU     32000      3 minutes from now

gpt-oss:20b_ctx32k_gpu256:

  • a consecutive try also did not work

gpt-oss:20b_gpu256:

  • failed with a CUDA out-of-memory error:
an error was encountered while running the model: CUDA error: out of memory current device: 0, in function evaluate_and_capture_cuda_graph at //ml/backend/ggml/ggml/src/ggml-cuda/ggml-cuda.cu:3015 cudaGraphInstantiate(&cuda_ctx->cuda_graph->instance, cuda_ctx->cuda_graph->graph, __null, __null, 0) //ml/backend/ggml/ggml/src/ggml-cuda/ggml-cuda.cu:84: CUDA error
goroutine 7 gp=0xc000003dc0 m=nil [GC worker (idle)]:
runtime.gopark(0x101bd584e0a87?, 0x1?, 0x43?, 0xca?, 0x0?)
        runtime/proc.go:435 +0xce fp=0xc000087738 sp=0xc000087718 pc=0x56312433a86e
runtime.gcBgMarkWorker(0xc000111730)
        runtime/mgc.go:1423 +0xe9 fp=0xc0000877c8 sp=0xc000087738 pc=0x5631242e7d69
runtime.gcBgMarkStartWorkers.gowrap1()
        runtime/mgc.go:1339 +0x25 fp=0xc0000877e0 sp=0xc0000877c8 pc=0x5631242e7c45
runtime.goexit({})
        runtime/asm_amd64.s:1700 +0x1 fp=0xc0000877e8 sp=0xc0000877e0 pc=0x563124341fa1
created by runtime.gcBgMarkStartWorkers in goroutine 1
        runtime/mgc.go:1339 +0x105

goroutine 23 gp=0xc000103dc0 m=nil [GC worker (idle)]:
runtime.gopark(0x101bd586f1d3f?, 0x1?, 0x80?, 0xb7?, 0x0?)
        runtime/proc.go:435 +0xce fp=0xc000082738 sp=0xc000082718 pc=0x56312433a86e
runtime.gcBgMarkWorker(0xc000111730)
        runtime/mgc.go:1423 +0xe9 fp=0xc0000827c8 sp=0xc000082738 pc=0x5631242e7d69
runtime.gcBgMarkStartWorkers.gowrap1()
        runtime/mgc.go:1339 +0x25 fp=0xc0000827e0 sp=0xc0000827c8 pc=0x5631242e7c45
runtime.goexit({})
        runtime/asm_amd64.s:1700 +0x1 fp=0xc0000827e8 sp=0xc0000827e0 pc=0x563124341fa1
created by runtime.gcBgMarkStartWorkers in goroutine 1
        runtime/mgc.go:1339 +0x105

goroutine 37 gp=0xc000484540 m=nil [GC worker (idle)]:
runtime.gopark(0x101bd584e927c?, 0x3?, 0xc1?, 0x77?, 0x0?)
        runtime/proc.go:435 +0xce fp=0xc00048bf38 sp=0xc00048bf18 pc=0x56312433a86e
runtime.gcBgMarkWorker(0xc000111730)
        runtime/mgc.go:1423 +0xe9 fp=0xc00048bfc8 sp=0xc00048bf38 pc=0x5631242e7d69
runtime.gcBgMarkStartWorkers.gowrap1()
        runtime/mgc.go:1339 +0x25 fp=0xc00048bfe0 sp=0xc00048bfc8 pc=0x5631242e7c45
runtime.goexit({})
        runtime/asm_amd64.s:1700 +0x1 fp=0xc00048bfe8 sp=0xc00048bfe0 pc=0x563124341fa1
created by runtime.gcBgMarkStartWorkers in goroutine 1
        runtime/mgc.go:1339 +0x105

goroutine 38 gp=0xc000484700 m=nil [GC worker (idle)]:
runtime.gopark(0x101bd584ea473?, 0x3?, 0x91?, 0x82?, 0x0?)
        runtime/proc.go:435 +0xce fp=0xc00048c738 sp=0xc00048c718 pc=0x56312433a86e
runtime.gcBgMarkWorker(0xc000111730)
        runtime/mgc.go:1423 +0xe9 fp=0xc00048c7c8 sp=0xc00048c738 pc=0x5631242e7d69
runtime.gcBgMarkStartWorkers.gowrap1()
        runtime/mgc.go:1339 +0x25 fp=0xc00048c7e0 sp=0xc00048c7c8 pc=0x5631242e7c45
runtime.goexit({})
        runtime/asm_amd64.s:1700 +0x1 fp=0xc00048c7e8 sp=0xc00048c7e0 pc=0x563124341fa1
created by runtime.gcBgMarkStartWorkers in goroutine 1
        runtime/mgc.go:1339 +0x105

goroutine 39 gp=0xc0004848c0 m=nil [GC worker (idle)]:
runtime.gopark(0x101bd584e7ad9?, 0x1?, 0x94?, 0xae?, 0x0?)
        runtime/proc.go:435 +0xce fp=0xc00048cf38 sp=0xc00048cf18 pc=0x56312433a86e
runtime.gcBgMarkWorker(0xc000111730)
        runtime/mgc.go:1423 +0xe9 fp=0xc00048cfc8 sp=0xc00048cf38 pc=0x5631242e7d69
runtime.gcBgMarkStartWorkers.gowrap1()
        runtime/mgc.go:1339 +0x25 fp=0xc00048cfe0 sp=0xc00048cfc8 pc=0x5631242e7c45
runtime.goexit({})
        runtime/asm_amd64.s:1700 +0x1 fp=0xc00048cfe8 sp=0xc00048cfe0 pc=0x563124341fa1
created by runtime.gcBgMarkStartWorkers in goroutine 1
        runtime/mgc.go:1339 +0x105

goroutine 40 gp=0xc000484a80 m=nil [GC worker (idle)]:
runtime.gopark(0x563126167ec0?, 0x1?, 0x13?, 0xfa?, 0x0?)
        runtime/proc.go:435 +0xce fp=0xc00048d738 sp=0xc00048d718 pc=0x56312433a86e
runtime.gcBgMarkWorker(0xc000111730)
        runtime/mgc.go:1423 +0xe9 fp=0xc00048d7c8 sp=0xc00048d738 pc=0x5631242e7d69
runtime.gcBgMarkStartWorkers.gowrap1()
        runtime/mgc.go:1339 +0x25 fp=0xc00048d7e0 sp=0xc00048d7c8 pc=0x5631242e7c45
runtime.goexit({})
        runtime/asm_amd64.s:1700 +0x1 fp=0xc00048d7e8 sp=0xc00048d7e0 pc=0x563124341fa1
created by runtime.gcBgMarkStartWorkers in goroutine 1
        runtime/mgc.go:1339 +0x105

goroutine 41 gp=0xc000484c40 m=nil [GC worker (idle)]:
runtime.gopark(0x101bd584e0082?, 0x3?, 0xe4?, 0x94?, 0x0?)
        runtime/proc.go:435 +0xce fp=0xc00048df38 sp=0xc00048df18 pc=0x56312433a86e
runtime.gcBgMarkWorker(0xc000111730)
        runtime/mgc.go:1423 +0xe9 fp=0xc00048dfc8 sp=0xc00048df38 pc=0x5631242e7d69
runtime.gcBgMarkStartWorkers.gowrap1()
        runtime/mgc.go:1339 +0x25 fp=0xc00048dfe0 sp=0xc00048dfc8 pc=0x5631242e7c45
runtime.goexit({})
        runtime/asm_amd64.s:1700 +0x1 fp=0xc00048dfe8 sp=0xc00048dfe0 pc=0x563124341fa1
created by runtime.gcBgMarkStartWorkers in goroutine 1
        runtime/mgc.go:1339 +0x105

goroutine 50 gp=0xc000584000 m=nil [GC worker (idle)]:
runtime.gopark(0x101bd584e53a2?, 0x3?, 0xdc?, 0xca?, 0x0?)
        runtime/proc.go:435 +0xce fp=0xc000486738 sp=0xc000486718 pc=0x56312433a86e
runtime.gcBgMarkWorker(0xc000111730)
        runtime/mgc.go:1423 +0xe9 fp=0xc0004867c8 sp=0xc000486738 pc=0x5631242e7d69
runtime.gcBgMarkStartWorkers.gowrap1()
        runtime/mgc.go:1339 +0x25 fp=0xc0004867e0 sp=0xc0004867c8 pc=0x5631242e7c45
runtime.goexit({})
        runtime/asm_amd64.s:1700 +0x1 fp=0xc0004867e8 sp=0xc0004867e0 pc=0x563124341fa1
created by runtime.gcBgMarkStartWorkers in goroutine 1
        runtime/mgc.go:1339 +0x105

goroutine 8 gp=0xc000485880 m=nil [chan receive]:
runtime.gopark(0x30?, 0xffffffffffffffff?, 0x32?, 0x0?, 0xc0013f9840?)
        runtime/proc.go:435 +0xce fp=0xc0013f97f8 sp=0xc0013f97d8 pc=0x56312433a86e
runtime.chanrecv(0xc001956000, 0x0, 0x1)
        runtime/chan.go:664 +0x445 fp=0xc0013f9870 sp=0xc0013f97f8 pc=0x5631242d6245
runtime.chanrecv1(0x5631253502d5?, 0x29?)
        runtime/chan.go:506 +0x12 fp=0xc0013f9898 sp=0xc0013f9870 pc=0x5631242d5dd2
github.com/ollama/ollama/runner/ollamarunner.(*Server).forwardBatch(_, {0x3, {0x5631258121e0, 0xc001952000}, {0x56312581ce48, 0xc001608108}, {0xc00011c138, 0x1, 0x1}, {{0x56312581ce48, ...}, ...}, ...})
        github.com/ollama/ollama/runner/ollamarunner/runner.go:440 +0xfa fp=0xc0013f9bf8 sp=0xc0013f9898 pc=0x563124846f5a
github.com/ollama/ollama/runner/ollamarunner.(*Server).run(0xc00022ad20, {0x563125808c00, 0xc0006a1360})
        github.com/ollama/ollama/runner/ollamarunner/runner.go:419 +0x1ac fp=0xc0013f9fb8 sp=0xc0013f9bf8 pc=0x563124846bec
github.com/ollama/ollama/runner/ollamarunner.Execute.gowrap1()
        github.com/ollama/ollama/runner/ollamarunner/runner.go:1266 +0x28 fp=0xc0013f9fe0 sp=0xc0013f9fb8 pc=0x56312484f168
runtime.goexit({})
        runtime/asm_amd64.s:1700 +0x1 fp=0xc0013f9fe8 sp=0xc0013f9fe0 pc=0x563124341fa1
created by github.com/ollama/ollama/runner/ollamarunner.Execute in goroutine 1
        github.com/ollama/ollama/runner/ollamarunner/runner.go:1266 +0x505

goroutine 9 gp=0xc000485a40 m=nil [select]:
runtime.gopark(0xc000049a10?, 0x2?, 0x4?, 0x0?, 0xc000049874?)
        runtime/proc.go:435 +0xce fp=0xc0000496a0 sp=0xc000049680 pc=0x56312433a86e
runtime.selectgo(0xc000049a10, 0xc000049870, 0xc001132600?, 0x0, 0x1?, 0x1)
        runtime/select.go:351 +0x837 fp=0xc0000497d8 sp=0xc0000496a0 pc=0x563124318ed7
github.com/ollama/ollama/runner/ollamarunner.(*Server).completion(0xc00022ad20, {0x563125806868, 0xc0013680e0}, 0xc00051e140)
        github.com/ollama/ollama/runner/ollamarunner/runner.go:869 +0xb90 fp=0xc000049ac0 sp=0xc0000497d8 pc=0x56312484aed0
github.com/ollama/ollama/runner/ollamarunner.(*Server).completion-fm({0x563125806868?, 0xc0013680e0?}, 0xc0013fdb40?)
        <autogenerated>:1 +0x36 fp=0xc000049af0 sp=0xc000049ac0 pc=0x56312484f5d6
net/http.HandlerFunc.ServeHTTP(0xc0000375c0?, {0x563125806868?, 0xc0013680e0?}, 0xc0013fdb60?)
        net/http/server.go:2294 +0x29 fp=0xc000049b18 sp=0xc000049af0 pc=0x563124639109
net/http.(*ServeMux).ServeHTTP(0x5631242ded85?, {0x563125806868, 0xc0013680e0}, 0xc00051e140)
        net/http/server.go:2822 +0x1c4 fp=0xc000049b68 sp=0xc000049b18 pc=0x56312463b004
net/http.serverHandler.ServeHTTP({0x563125802eb0?}, {0x563125806868?, 0xc0013680e0?}, 0x1?)
        net/http/server.go:3301 +0x8e fp=0xc000049b98 sp=0xc000049b68 pc=0x563124658a8e
net/http.(*conn).serve(0xc0000e43f0, {0x563125808bc8, 0xc000223f50})
        net/http/server.go:2102 +0x625 fp=0xc000049fb8 sp=0xc000049b98 pc=0x563124637605
net/http.(*Server).Serve.gowrap3()
        net/http/server.go:3454 +0x28 fp=0xc000049fe0 sp=0xc000049fb8 pc=0x56312463cec8
runtime.goexit({})
        runtime/asm_amd64.s:1700 +0x1 fp=0xc000049fe8 sp=0xc000049fe0 pc=0x563124341fa1
created by net/http.(*Server).Serve in goroutine 1
        net/http/server.go:3454 +0x485

goroutine 113 gp=0xc000602fc0 m=nil [IO wait]:
runtime.gopark(0xb1c298c3b5c2?, 0x91c3bbc492c3a0c4?, 0xc4?, 0xa5?, 0xb?)
        runtime/proc.go:435 +0xce fp=0xc0013735d8 sp=0xc0013735b8 pc=0x56312433a86e
runtime.netpollblock(0x56312435dbd8?, 0x242d3666?, 0x31?)
        runtime/netpoll.go:575 +0xf7 fp=0xc001373610 sp=0xc0013735d8 pc=0x5631242ff357
internal/poll.runtime_pollWait(0x7f12c4a75cc8, 0x72)
        runtime/netpoll.go:351 +0x85 fp=0xc001373630 sp=0xc001373610 pc=0x563124339a85
internal/poll.(*pollDesc).wait(0xc000693380?, 0xc00036c101?, 0x0)
        internal/poll/fd_poll_runtime.go:84 +0x27 fp=0xc001373658 sp=0xc001373630 pc=0x5631243c0ec7
internal/poll.(*pollDesc).waitRead(...)
        internal/poll/fd_poll_runtime.go:89
internal/poll.(*FD).Read(0xc000693380, {0xc00036c101, 0x1, 0x1})
        internal/poll/fd_unix.go:165 +0x27a fp=0xc0013736f0 sp=0xc001373658 pc=0x5631243c21ba
net.(*netFD).Read(0xc000693380, {0xc00036c101?, 0xc00012f998?, 0xc001373770?})
        net/fd_posix.go:55 +0x25 fp=0xc001373738 sp=0xc0013736f0 pc=0x5631244372a5
net.(*conn).Read(0xc00011c970, {0xc00036c101?, 0x746361726143a0c4?, 0x7265?})
        net/net.go:194 +0x45 fp=0xc001373780 sp=0xc001373738 pc=0x563124445665
net/http.(*connReader).backgroundRead(0xc00036c0f0)
        net/http/server.go:690 +0x37 fp=0xc0013737c8 sp=0xc001373780 pc=0x5631246314d7
net/http.(*connReader).startBackgroundRead.gowrap2()
        net/http/server.go:686 +0x25 fp=0xc0013737e0 sp=0xc0013737c8 pc=0x563124631405
runtime.goexit({})
        runtime/asm_amd64.s:1700 +0x1 fp=0xc0013737e8 sp=0xc0013737e0 pc=0x563124341fa1
created by net/http.(*connReader).startBackgroundRead in goroutine 9
        net/http/server.go:686 +0xb6

goroutine 377 gp=0xc000603500 m=nil [chan receive]:
runtime.gopark(0x30?, 0xffffffffffffffff?, 0x33?, 0x0?, 0xc00138cb30?)
        runtime/proc.go:435 +0xce fp=0xc00138cae8 sp=0xc00138cac8 pc=0x56312433a86e
runtime.chanrecv(0xc0012902a0, 0x0, 0x1)
        runtime/chan.go:664 +0x445 fp=0xc00138cb60 sp=0xc00138cae8 pc=0x5631242d6245
runtime.chanrecv1(0x563125353771?, 0x2c?)
        runtime/chan.go:506 +0x12 fp=0xc00138cb88 sp=0xc00138cb60 pc=0x5631242d5dd2
github.com/ollama/ollama/runner/ollamarunner.(*Server).computeBatch(0xc00022ad20, {0x3, {0x5631258121e0, 0xc001952000}, {0x56312581ce48, 0xc001608108}, {0xc00011c138, 0x1, 0x1}, {{0x56312581ce48, ...}, ...}, ...})
        github.com/ollama/ollama/runner/ollamarunner/runner.go:602 +0x185 fp=0xc00138cef0 sp=0xc00138cb88 pc=0x563124848925
github.com/ollama/ollama/runner/ollamarunner.(*Server).run.gowrap1()
        github.com/ollama/ollama/runner/ollamarunner/runner.go:425 +0x58 fp=0xc00138cfe0 sp=0xc00138cef0 pc=0x563124846e38
runtime.goexit({})
        runtime/asm_amd64.s:1700 +0x1 fp=0xc00138cfe8 sp=0xc00138cfe0 pc=0x563124341fa1
created by github.com/ollama/ollama/runner/ollamarunner.(*Server).run in goroutine 8
        github.com/ollama/ollama/runner/ollamarunner/runner.go:425 +0x2ed

rax    0x0
rbx    0xb81e
rcx    0x7f130ce28b2c
rdx    0x6
rdi    0xb80b
rsi    0xb81e
rbp    0x7f12417fd250
rsp    0x7f12417fd210
r8     0x0
r9     0x7
r10    0x8
r11    0x246
r12    0x6
r13    0x7f127ce0d448
r14    0x16
r15    0x52e
rip    0x7f130ce28b2c
rflags 0x246
cs     0x33
fs     0x0
gs     0x0
time=2025-10-06T18:24:40.803Z level=ERROR source=server.go:425 msg="llama runner terminated" error="exit status 2"
[GIN] 2025/10/06 - 18:24:40 | 200 | 10.535247191s |      172.18.0.1 | POST     "/api/chat"
Mon Oct  6 18:25:07 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.82.07              Driver Version: 580.82.07      CUDA Version: 13.0     |
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 4080        Off |   00000000:06:00.0  On |                  N/A |
|  0%   57C    P2             55W /  320W |    3276MiB /  16376MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A           48824      C   /opt/venv/bin/python3                   238MiB |
|    0   N/A  N/A           55339      C   python3                                 238MiB |
|    0   N/A  N/A           56644      C   frigate.detector.onnx                   364MiB |
|    0   N/A  N/A           56674      C   frigate.embeddings_manager              958MiB |
|    0   N/A  N/A           56912      C   /usr/lib/ffmpeg/7.0/bin/ffmpeg          330MiB |
|    0   N/A  N/A         2420338      G   /usr/lib/xorg/Xorg                      143MiB |
|    0   N/A  N/A         2421301      G   xfwm4                                     4MiB |
|    0   N/A  N/A         2421306    C+G   /usr/bin/sunshine                       243MiB |
|    0   N/A  N/A         2421624      G   ...nstallation/ubuntu12_32/steam          4MiB |
|    0   N/A  N/A         2421967      G   ./steamwebhelper                          9MiB |
|    0   N/A  N/A         2421990      G   ...on/ubuntu12_64/steamwebhelper         10MiB |
|    0   N/A  N/A         2437305      C   /usr/lib/ffmpeg/7.0/bin/ffmpeg          294MiB |
|    0   N/A  N/A         3265107      C   /usr/lib/ffmpeg/7.0/bin/ffmpeg          322MiB |
+-----------------------------------------------------------------------------------------+
NAME                  ID              SIZE     PROCESSOR    CONTEXT    UNTIL
gpt-oss:20b_gpu256    ff94d56c88da    14 GB    100% GPU     4096       4 minutes from now

gpt-oss:20b_gpu256:

  • consecutive try same error
Author
Owner

@jessegross commented on GitHub (Oct 6, 2025):

@SHU-red In the first case, 84% of the model is loaded on the GPU, so it is using the GPU for most of the model with the remainder spilling onto the CPU. However, the CPU quickly becomes the bottleneck. You have a lot of other things running at the same time (see your nvidia-smi output) so shutting some of them down may free up enough VRAM to get more of the model loaded onto the GPU.

I assume that gpu256 means that you set num_gpu to 256 in an effort to force more to load on the GPU. However, you don't have enough VRAM for this, which is why it crashed. That would seem to indicate that the memory management logic is working correctly and you should stick with the default settings.
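For readers trying to follow the tags used in this thread, here is a minimal sketch of how a `num_gpu` override like `_gpu256` is typically created and then reverted; the Modelfile contents are an assumption for illustration, not copied from @SHU-red's setup:

```bash
# Hypothetical Modelfile pinning the number of layers offloaded to the GPU.
# gpt-oss:20b has 25 layers (per the logs above), so num_gpu 256 effectively
# means "offload everything", which overflows VRAM on a busy 16 GB card.
cat > Modelfile.gpu256 <<'EOF'
FROM gpt-oss:20b
PARAMETER num_gpu 256
EOF
ollama create gpt-oss:20b_gpu256 -f Modelfile.gpu256

# Default behaviour: run the base tag and let the scheduler place layers.
ollama run gpt-oss:20b "hello"
ollama ps   # the PROCESSOR column shows the resulting CPU/GPU split
```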

Author
Owner

@SHU-red commented on GitHub (Oct 6, 2025):

@jessegross
Thank you for that.
So the bottom line is that my graphics card has too little memory, I guess.

Should I leave the Docker environment variables as they are?

Author
Owner

@Queracus commented on GitHub (Oct 7, 2025):

I don't understand why you're messing with OLLAMA_RUN_PARALLEL=1; I never touched it and it works like a charm.
@SHU-red, what happens if you run it through the Ollama GUI with basic settings? A 32k context should fit into 16 GB of VRAM easily; that isn't even a question, since even the maximum context for the 20b model fits in roughly 16-17 GB.
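For comparison, the stock setup Queracus is describing, with no parallelism or unified-memory overrides, boils down to something like the following; the container name, port, and volume are the usual defaults and purely illustrative:

```bash
# Plain Ollama container with GPU access and no extra tuning variables;
# context length, scheduling, and layer placement stay at their defaults.
docker run -d --gpus=all \
  -v ollama:/root/.ollama \
  -p 11434:11434 \
  --name ollama \
  ollama/ollama
```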

Author
Owner

@SHU-red commented on GitHub (Oct 7, 2025):

@Queracus

  • I do not really know what I'm doing here.
  • Using this env var was left over from the discussions above.
  • Even without it, loading sometimes struggles.
  • I guess there's a difference between people using the GPU for Ollama only and me, using it for my home server, which is constantly doing other work: Frigate transcoding two 4K surveillance camera streams, sometimes Tdarr video transcoding, etc.

I guess it's just me multiplying the memory consumption by running too much in parallel (see the quick VRAM check sketched after the listing below)...


+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A           48824      C   /opt/venv/bin/python3                   238MiB |
|    0   N/A  N/A           55339      C   python3                                 238MiB |
|    0   N/A  N/A           56644      C   frigate.detector.onnx                   364MiB |
|    0   N/A  N/A           56674      C   frigate.embeddings_manager              958MiB |
|    0   N/A  N/A           56912      C   /usr/lib/ffmpeg/7.0/bin/ffmpeg          330MiB |
|    0   N/A  N/A         2420338      G   /usr/lib/xorg/Xorg                      143MiB |
|    0   N/A  N/A         2421301      G   xfwm4                                     4MiB |
|    0   N/A  N/A         2421306    C+G   /usr/bin/sunshine                       243MiB |
|    0   N/A  N/A         2421624      G   ...nstallation/ubuntu12_32/steam          4MiB |
|    0   N/A  N/A         2421967      G   ./steamwebhelper                          9MiB |
|    0   N/A  N/A         2421990      G   ...on/ubuntu12_64/steamwebhelper         10MiB |
|    0   N/A  N/A         2437305      C   /usr/lib/ffmpeg/7.0/bin/ffmpeg          294MiB |
|    0   N/A  N/A         3265107      C   /usr/lib/ffmpeg/7.0/bin/ffmpeg          322MiB |
+-----------------------------------------------------------------------------------------+
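As referenced above, a quick way to see how much VRAM those neighbouring processes actually leave for Ollama, and what split Ollama ends up with afterwards, is a before/after check; this uses standard nvidia-smi query flags and assumes nothing Ollama-specific:

```bash
# Per-GPU VRAM totals in machine-readable form, before loading a model.
nvidia-smi --query-gpu=name,memory.total,memory.used,memory.free --format=csv

# After the model loads, check how Ollama split it between CPU and GPU;
# the PROCESSOR column shows values like "100% GPU" or "16%/84% CPU/GPU".
ollama ps
```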
Author
Owner

@kiliansinger commented on GitHub (Oct 25, 2025):

I have a similar issue after switching from Ollama 0.9.3:

It worked with large models and used the GPU (RTX 4070, 8 GB), at least offloading some of the data. After upgrading it uses only the CPU, so models such as Qwen3-Coder-30B-A3B-Instruct-1M-Unsloth:UD-IQ3_XXS become unusable.

From the comments above it is not clear to me how to get this working with newer Ollama versions.

Author
Owner

@Queracus commented on GitHub (Oct 25, 2025):

> I have a similar issue after switching from Ollama 0.9.3:
>
> It worked with large models and used the GPU (RTX 4070, 8 GB), at least offloading some of the data. After upgrading it uses only the CPU, so models such as Qwen3-Coder-30B-A3B-Instruct-1M-Unsloth:UD-IQ3_XXS become unusable.
>
> From the comments above it is not clear to me how to get this working with newer Ollama versions.

If it doesn't fit in your GPU, it will just load into RAM and run on the CPU. That's about it.

Author
Owner

@kiliansinger commented on GitHub (Oct 25, 2025):

Yes, indeed the model (13 GB) is bigger than 8 GB, so it does not fit completely into the GPU. But on 0.9.3 it was working, partially offloading the computation to the GPU, and it was fast enough to be usable. That behavior stopped working. LM Studio is also able to use the model, but Ollama 0.9.3 was actually about 30% more efficient. It would be sad to lose this.
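While partial offload on newer versions is being looked into, one possible workaround is to request an explicit layer split per call through the REST API's `num_gpu` option. This is only a sketch: the layer count below is a guess that would need tuning for an 8 GB card, and the model tag is taken verbatim from the comment above.

```bash
# Ask for a fixed number of layers on the GPU for this request only.
# "num_gpu" is the same knob as the Modelfile parameter; too high a value
# overflows VRAM, too low leaves the GPU mostly idle.
curl http://localhost:11434/api/generate -d '{
  "model": "Qwen3-Coder-30B-A3B-Instruct-1M-Unsloth:UD-IQ3_XXS",
  "prompt": "Write a hello world program in Go.",
  "options": { "num_gpu": 20 }
}'
```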

Author
Owner

@kiliansinger commented on GitHub (Oct 30, 2025):

I wrote a PR that will probably fix this issue as well: https://github.com/ollama/ollama/pull/12856

Reference: github-starred/ollama#7723