[GH-ISSUE #12197] Some requests get processed on CPU, even though model is loaded in GPU (GPT-OSS) #33873

Open
opened 2026-04-22 17:00:03 -05:00 by GiteaMirror · 55 comments

Originally created by @shiraz-shah on GitHub (Sep 6, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/12197

Originally assigned to: @ParthSareen on GitHub.

What is the issue?

The "overlapping GPU/CPU" feature from the last version results in some requests always being processed on the CPU only, even though the model is loaded in GPU. These requests take way longer than they should as a result. Like 10 times longer.

This happens even though the GPU is completely idle. It also happens across multiple platforms and installations of Ollama.

E.g. I can have two code editors open. One editor's requests get handled on the GPU, while the other's get handled consistently on the CPU, even when the GPU is idle. If I point the editors at a different server (with a different GPU, CPU, and amount of RAM) but the same Ollama version, the same editors respectively keep using GPU vs. CPU as on the first server, so it must be something in Ollama's CPU/GPU routing logic, not a hardware constraint.

Is there any way to disable the overlapping GPU/CPU capability?
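For reference, the documented `num_gpu` request option controls how many layers get offloaded; a minimal sketch of explicitly requesting full offload follows, though it is not a confirmed workaround for this routing behavior, and the model name is just the one used later in this thread:

```shell
# Hypothetical workaround, not verified: ask for maximum GPU offload via the
# documented num_gpu option (number of layers to send to the GPU).
curl http://localhost:11434/api/generate -d '{
  "model": "gpt-oss-long:latest",
  "prompt": "Why is the sky blue?",
  "options": { "num_gpu": 999 }
}'
```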

Relevant log output


OS

No response

GPU

No response

CPU

No response

Ollama version

No response

GiteaMirror added the bug label 2026-04-22 17:00:03 -05:00

@rick-github commented on GitHub (Sep 6, 2025):

[Server logs](https://github.com/ollama/ollama/blob/main/docs/troubleshooting.md#how-to-troubleshoot-issues) will help in debugging.
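On a Linux install managed by systemd (which matches the journald-style output in this thread), something like:

```shell
# Follow the Ollama server logs while reproducing a slow request
journalctl -u ollama -f
```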


@shiraz-shah commented on GitHub (Sep 6, 2025):

From an 8-core i5 machine with 64 GB of system RAM and an RTX 4090. The multi-minute requests are the CPU ones, whereas the ones that took less than a minute were the GPU ones:

Sep 06 09:02:24 ersa ollama[1177]: [GIN] 2025/09/06 - 09:02:24 | 200 |  4.445931013s |       127.0.0.1 | POST     "/v1/chat/completions"
Sep 06 09:02:28 ersa ollama[1177]: [GIN] 2025/09/06 - 09:02:28 | 200 |  8.115474696s |       127.0.0.1 | POST     "/v1/chat/completions"
Sep 06 09:02:32 ersa ollama[1177]: [GIN] 2025/09/06 - 09:02:32 | 200 |  4.367456039s |       127.0.0.1 | POST     "/v1/chat/completions"
Sep 06 09:05:50 ersa ollama[1177]: [GIN] 2025/09/06 - 09:05:50 | 200 |      17.181µs |       127.0.0.1 | HEAD     "/"
Sep 06 09:05:50 ersa ollama[1177]: [GIN] 2025/09/06 - 09:05:50 | 200 |      20.987µs |       127.0.0.1 | GET      "/api/ps"
Sep 06 09:05:52 ersa ollama[1177]: [GIN] 2025/09/06 - 09:05:52 | 200 |      16.244µs |       127.0.0.1 | HEAD     "/"
Sep 06 09:05:52 ersa ollama[1177]: [GIN] 2025/09/06 - 09:05:52 | 200 |      19.509µs |       127.0.0.1 | GET      "/api/ps"
Sep 06 09:05:54 ersa ollama[1177]: [GIN] 2025/09/06 - 09:05:54 | 200 |      15.223µs |       127.0.0.1 | HEAD     "/"
Sep 06 09:05:54 ersa ollama[1177]: [GIN] 2025/09/06 - 09:05:54 | 200 |      14.804µs |       127.0.0.1 | GET      "/api/ps"
Sep 06 09:05:56 ersa ollama[1177]: [GIN] 2025/09/06 - 09:05:56 | 200 |      16.335µs |       127.0.0.1 | HEAD     "/"
Sep 06 09:05:56 ersa ollama[1177]: [GIN] 2025/09/06 - 09:05:56 | 200 |      12.701µs |       127.0.0.1 | GET      "/api/ps"
Sep 06 09:05:58 ersa ollama[1177]: [GIN] 2025/09/06 - 09:05:58 | 200 |       19.62µs |       127.0.0.1 | HEAD     "/"
Sep 06 09:05:58 ersa ollama[1177]: [GIN] 2025/09/06 - 09:05:58 | 200 |      12.603µs |       127.0.0.1 | GET      "/api/ps"
Sep 06 09:06:00 ersa ollama[1177]: [GIN] 2025/09/06 - 09:06:00 | 200 |      15.716µs |       127.0.0.1 | HEAD     "/"
Sep 06 09:06:00 ersa ollama[1177]: [GIN] 2025/09/06 - 09:06:00 | 200 |      17.709µs |       127.0.0.1 | GET      "/api/ps"
Sep 06 09:06:02 ersa ollama[1177]: [GIN] 2025/09/06 - 09:06:02 | 200 |      15.347µs |       127.0.0.1 | HEAD     "/"
Sep 06 09:06:02 ersa ollama[1177]: [GIN] 2025/09/06 - 09:06:02 | 200 |      18.952µs |       127.0.0.1 | GET      "/api/ps"
Sep 06 09:06:04 ersa ollama[1177]: [GIN] 2025/09/06 - 09:06:04 | 200 |      21.642µs |       127.0.0.1 | HEAD     "/"
Sep 06 09:06:04 ersa ollama[1177]: [GIN] 2025/09/06 - 09:06:04 | 200 |      14.254µs |       127.0.0.1 | GET      "/api/ps"
Sep 06 09:06:06 ersa ollama[1177]: [GIN] 2025/09/06 - 09:06:06 | 200 |      15.577µs |       127.0.0.1 | HEAD     "/"
Sep 06 09:06:06 ersa ollama[1177]: [GIN] 2025/09/06 - 09:06:06 | 200 |      13.479µs |       127.0.0.1 | GET      "/api/ps"
Sep 06 09:06:08 ersa ollama[1177]: [GIN] 2025/09/06 - 09:06:08 | 200 |      14.691µs |       127.0.0.1 | HEAD     "/"
Sep 06 09:06:08 ersa ollama[1177]: [GIN] 2025/09/06 - 09:06:08 | 200 |       13.71µs |       127.0.0.1 | GET      "/api/ps"
Sep 06 09:06:10 ersa ollama[1177]: [GIN] 2025/09/06 - 09:06:10 | 200 |      19.424µs |       127.0.0.1 | HEAD     "/"
Sep 06 09:06:10 ersa ollama[1177]: [GIN] 2025/09/06 - 09:06:10 | 200 |      18.943µs |       127.0.0.1 | GET      "/api/ps"
Sep 06 09:06:12 ersa ollama[1177]: [GIN] 2025/09/06 - 09:06:12 | 200 |      14.311µs |       127.0.0.1 | HEAD     "/"
Sep 06 09:06:12 ersa ollama[1177]: [GIN] 2025/09/06 - 09:06:12 | 200 |      14.798µs |       127.0.0.1 | GET      "/api/ps"
Sep 06 09:06:14 ersa ollama[1177]: [GIN] 2025/09/06 - 09:06:14 | 200 |      20.094µs |       127.0.0.1 | HEAD     "/"
Sep 06 09:06:14 ersa ollama[1177]: [GIN] 2025/09/06 - 09:06:14 | 200 |      17.698µs |       127.0.0.1 | GET      "/api/ps"
Sep 06 09:06:16 ersa ollama[1177]: [GIN] 2025/09/06 - 09:06:16 | 200 |      16.746µs |       127.0.0.1 | HEAD     "/"
Sep 06 09:06:16 ersa ollama[1177]: [GIN] 2025/09/06 - 09:06:16 | 200 |      12.151µs |       127.0.0.1 | GET      "/api/ps"
Sep 06 09:06:18 ersa ollama[1177]: [GIN] 2025/09/06 - 09:06:18 | 200 |      16.082µs |       127.0.0.1 | HEAD     "/"
Sep 06 09:06:18 ersa ollama[1177]: [GIN] 2025/09/06 - 09:06:18 | 200 |      14.362µs |       127.0.0.1 | GET      "/api/ps"
Sep 06 09:06:20 ersa ollama[1177]: [GIN] 2025/09/06 - 09:06:20 | 200 |      20.729µs |       127.0.0.1 | HEAD     "/"
Sep 06 09:06:20 ersa ollama[1177]: [GIN] 2025/09/06 - 09:06:20 | 200 |      14.712µs |       127.0.0.1 | GET      "/api/ps"
Sep 06 09:06:22 ersa ollama[1177]: [GIN] 2025/09/06 - 09:06:22 | 200 |      16.145µs |       127.0.0.1 | HEAD     "/"
Sep 06 09:06:22 ersa ollama[1177]: [GIN] 2025/09/06 - 09:06:22 | 200 |      13.636µs |       127.0.0.1 | GET      "/api/ps"
Sep 06 09:06:24 ersa ollama[1177]: [GIN] 2025/09/06 - 09:06:24 | 200 |      15.597µs |       127.0.0.1 | HEAD     "/"
Sep 06 09:06:24 ersa ollama[1177]: [GIN] 2025/09/06 - 09:06:24 | 200 |      15.276µs |       127.0.0.1 | GET      "/api/ps"
Sep 06 09:06:26 ersa ollama[1177]: [GIN] 2025/09/06 - 09:06:26 | 200 |      15.365µs |       127.0.0.1 | HEAD     "/"
Sep 06 09:06:26 ersa ollama[1177]: [GIN] 2025/09/06 - 09:06:26 | 200 |      18.928µs |       127.0.0.1 | GET      "/api/ps"
Sep 06 09:06:29 ersa ollama[1177]: [GIN] 2025/09/06 - 09:06:29 | 200 |      15.635µs |       127.0.0.1 | HEAD     "/"
Sep 06 09:06:29 ersa ollama[1177]: [GIN] 2025/09/06 - 09:06:29 | 200 |      17.938µs |       127.0.0.1 | GET      "/api/ps"
Sep 06 09:06:31 ersa ollama[1177]: [GIN] 2025/09/06 - 09:06:31 | 200 |      14.343µs |       127.0.0.1 | HEAD     "/"
Sep 06 09:06:31 ersa ollama[1177]: [GIN] 2025/09/06 - 09:06:31 | 200 |      20.918µs |       127.0.0.1 | GET      "/api/ps"
Sep 06 09:06:41 ersa ollama[1177]: [GIN] 2025/09/06 - 09:06:41 | 200 |         2m39s |    192.168.1.57 | POST     "/api/generate"
Sep 06 09:06:46 ersa ollama[1177]: [GIN] 2025/09/06 - 09:06:46 | 200 |         1m12s |       127.0.0.1 | POST     "/v1/chat/completions"
Sep 06 09:06:50 ersa ollama[1177]: [GIN] 2025/09/06 - 09:06:50 | 200 |   3.48477101s |       127.0.0.1 | POST     "/v1/chat/completions"
Sep 06 09:06:58 ersa ollama[1177]: [GIN] 2025/09/06 - 09:06:58 | 200 |  584.056205ms |       127.0.0.1 | POST     "/v1/chat/completions"
Sep 06 09:10:44 ersa ollama[1177]: [GIN] 2025/09/06 - 09:10:44 | 200 | 20.677645904s |       127.0.0.1 | POST     "/v1/chat/completions"
Sep 06 09:13:22 ersa ollama[1177]: [GIN] 2025/09/06 - 09:13:22 | 200 |         2m36s |    192.168.1.57 | POST     "/api/generate"
Sep 06 09:13:26 ersa ollama[1177]: [GIN] 2025/09/06 - 09:13:26 | 200 |         2m41s |       127.0.0.1 | POST     "/v1/chat/completions"
Sep 06 09:14:18 ersa ollama[1177]: [GIN] 2025/09/06 - 09:14:18 | 200 |         5m14s |       127.0.0.1 | POST     "/v1/chat/completions"
Sep 06 09:14:25 ersa ollama[1177]: [GIN] 2025/09/06 - 09:14:25 | 200 | 59.472853625s |       127.0.0.1 | POST     "/v1/chat/completions"
Sep 06 09:14:35 ersa ollama[1177]: [GIN] 2025/09/06 - 09:14:35 | 200 |  9.801204967s |       127.0.0.1 | POST     "/v1/chat/completions"
Sep 06 09:15:03 ersa ollama[1177]: [GIN] 2025/09/06 - 09:15:03 | 200 | 19.575180022s |       127.0.0.1 | POST     "/v1/chat/completions"
Sep 06 09:20:29 ersa ollama[1177]: [GIN] 2025/09/06 - 09:20:29 | 200 |         2m59s |    192.168.1.57 | POST     "/api/generate"
Sep 06 09:20:35 ersa ollama[1177]: [GIN] 2025/09/06 - 09:20:35 | 200 |          4m2s |       127.0.0.1 | POST     "/v1/chat/completions"
Sep 06 09:21:28 ersa ollama[1177]: [GIN] 2025/09/06 - 09:21:28 | 200 |          7m9s |       127.0.0.1 | POST     "/v1/chat/completions"
Sep 06 09:22:33 ersa ollama[1177]: [GIN] 2025/09/06 - 09:22:33 | 200 | 39.112550948s |       127.0.0.1 | POST     "/v1/chat/completions"
Sep 06 09:24:33 ersa ollama[1177]: [GIN] 2025/09/06 - 09:24:33 | 200 |          1m9s |       127.0.0.1 | POST     "/v1/chat/completions"
Sep 06 09:26:01 ersa ollama[1177]: [GIN] 2025/09/06 - 09:26:01 | 200 |         4m32s |       127.0.0.1 | POST     "/v1/chat/completions"
Sep 06 09:27:58 ersa ollama[1177]: time=2025-09-06T09:27:58.911+02:00 level=WARN source=harmonyparser.go:402 msg="harmony parser: no reverse mapping found for function name" harmonyFunctionName=typebash
Sep 06 09:27:58 ersa ollama[1177]: [GIN] 2025/09/06 - 09:27:58 | 200 | 55.605017854s |       127.0.0.1 | POST     "/v1/chat/completions"
Sep 06 09:29:22 ersa ollama[1177]: [GIN] 2025/09/06 - 09:29:22 | 200 |         1m23s |       127.0.0.1 | POST     "/v1/chat/completions"
Sep 06 09:30:41 ersa ollama[1177]: [GIN] 2025/09/06 - 09:30:41 | 200 |         4m40s |       127.0.0.1 | POST     "/v1/chat/completions"
Sep 06 09:30:46 ersa ollama[1177]: [GIN] 2025/09/06 - 09:30:46 | 200 |         1m21s |       127.0.0.1 | POST     "/v1/chat/completions"
Sep 06 09:31:38 ersa ollama[1177]: [GIN] 2025/09/06 - 09:31:38 | 200 | 35.664520566s |       127.0.0.1 | POST     "/v1/chat/completions"
Sep 06 09:33:49 ersa ollama[1177]: [GIN] 2025/09/06 - 09:33:49 | 200 |  17.74590042s |    192.168.1.57 | POST     "/api/generate"
Sep 06 09:34:19 ersa ollama[1177]: [GIN] 2025/09/06 - 09:34:19 | 200 |         1m46s |       127.0.0.1 | POST     "/v1/chat/completions"
Sep 06 09:35:40 ersa ollama[1177]: [GIN] 2025/09/06 - 09:35:40 | 200 |         4m58s |       127.0.0.1 | POST     "/v1/chat/completions"
Sep 06 09:35:50 ersa ollama[1177]: [GIN] 2025/09/06 - 09:35:50 | 200 |          1m5s |       127.0.0.1 | POST     "/v1/chat/completions"
Sep 06 09:36:35 ersa ollama[1177]: [GIN] 2025/09/06 - 09:36:35 | 200 |   4.38553234s |       127.0.0.1 | POST     "/v1/chat/completions"
Sep 06 09:36:47 ersa ollama[1177]: [GIN] 2025/09/06 - 09:36:47 | 200 |  4.370975001s |       127.0.0.1 | POST     "/v1/chat/completions"
Sep 06 09:37:13 ersa ollama[1177]: [GIN] 2025/09/06 - 09:37:13 | 200 | 11.272572506s |       127.0.0.1 | POST     "/v1/chat/completions"
Sep 06 09:37:23 ersa ollama[1177]: [GIN] 2025/09/06 - 09:37:23 | 200 |  6.406938698s |       127.0.0.1 | POST     "/v1/chat/completions"
Sep 06 09:37:49 ersa ollama[1177]: [GIN] 2025/09/06 - 09:37:49 | 200 |  6.128323622s |       127.0.0.1 | POST     "/v1/chat/completions"
Sep 06 09:38:00 ersa ollama[1177]: [GIN] 2025/09/06 - 09:38:00 | 200 |  7.270876557s |       127.0.0.1 | POST     "/v1/chat/completions"
Sep 06 11:01:36 ersa ollama[1177]: [GIN] 2025/09/06 - 11:01:36 | 200 |    3.707021ms |    192.168.1.57 | GET      "/api/tags"
Sep 06 11:01:36 ersa ollama[1177]: [GIN] 2025/09/06 - 11:01:36 | 200 |      58.934µs |    192.168.1.57 | GET      "/api/ps"
Sep 06 11:01:37 ersa ollama[1177]: [GIN] 2025/09/06 - 11:01:37 | 200 |      81.331µs |    192.168.1.57 | GET      "/api/version"
Sep 06 11:01:40 ersa ollama[1177]: [GIN] 2025/09/06 - 11:01:40 | 200 |    3.765025ms |    192.168.1.57 | GET      "/api/tags"
Sep 06 11:01:40 ersa ollama[1177]: [GIN] 2025/09/06 - 11:01:40 | 200 |      53.903µs |    192.168.1.57 | GET      "/api/ps"
Sep 06 11:01:40 ersa ollama[1177]: [GIN] 2025/09/06 - 11:01:40 | 200 |    3.529917ms |    192.168.1.57 | GET      "/api/tags"
Sep 06 11:01:40 ersa ollama[1177]: [GIN] 2025/09/06 - 11:01:40 | 200 |      55.603µs |    192.168.1.57 | GET      "/api/ps"
Sep 06 11:01:53 ersa ollama[1177]: [GIN] 2025/09/06 - 11:01:53 | 200 |    3.764105ms |    192.168.1.57 | GET      "/api/tags"
Sep 06 11:01:53 ersa ollama[1177]: [GIN] 2025/09/06 - 11:01:53 | 200 |      54.271µs |    192.168.1.57 | GET      "/api/ps"
Sep 06 11:01:54 ersa ollama[1177]: [GIN] 2025/09/06 - 11:01:54 | 200 |    3.736763ms |    192.168.1.57 | GET      "/api/tags"
Sep 06 11:01:54 ersa ollama[1177]: [GIN] 2025/09/06 - 11:01:54 | 200 |      54.405µs |    192.168.1.57 | GET      "/api/ps"

@rick-github commented on GitHub (Sep 6, 2025):

What's the output of `ollama ps`? Set `OLLAMA_DEBUG=1` in the server environment and post the logs with the extra debug information.
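For a systemd-managed install, one way to do that (assuming the standard Linux setup) is a service override:

```shell
sudo systemctl edit ollama
# In the override editor, add:
#   [Service]
#   Environment="OLLAMA_DEBUG=1"
# then restart the server:
sudo systemctl restart ollama
```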


@shiraz-shah commented on GitHub (Sep 6, 2025):

From a 40-core HP ProLiant with 512 GB of system memory and dual RTX 3060s. The 17-minute request was the one done on the CPU here:

Sep 06 07:29:07 hades ollama[2290141]: [GIN] 2025/09/06 - 07:29:07 | 200 |      74.229µs |       127.0.0.1 | HEAD     "/"
Sep 06 07:29:07 hades ollama[2290141]: [GIN] 2025/09/06 - 07:29:07 | 200 |    2.476322ms |       127.0.0.1 | GET      "/api/tags"
Sep 06 07:29:12 hades ollama[2290141]: [GIN] 2025/09/06 - 07:29:12 | 200 |      45.532µs |       127.0.0.1 | HEAD     "/"
Sep 06 07:29:12 hades ollama[2290141]: [GIN] 2025/09/06 - 07:29:12 | 200 |        52.9µs |       127.0.0.1 | GET      "/api/ps"
Sep 06 07:30:00 hades ollama[2290141]: [GIN] 2025/09/06 - 07:30:00 | 200 |       60.42µs |       127.0.0.1 | HEAD     "/"
Sep 06 07:30:00 hades ollama[2290141]: [GIN] 2025/09/06 - 07:30:00 | 200 |    2.496576ms |       127.0.0.1 | GET      "/api/tags"
Sep 06 07:30:05 hades ollama[2290141]: [GIN] 2025/09/06 - 07:30:05 | 200 |      51.084µs |       127.0.0.1 | HEAD     "/"
Sep 06 07:30:05 hades ollama[2290141]: [GIN] 2025/09/06 - 07:30:05 | 200 |  346.290892ms |       127.0.0.1 | POST     "/api/show"
Sep 06 07:30:07 hades ollama[2290141]: time=2025-09-06T07:30:07.934Z level=INFO source=server.go:199 msg="model wants flash attention"
Sep 06 07:30:07 hades ollama[2290141]: time=2025-09-06T07:30:07.934Z level=INFO source=server.go:216 msg="enabling flash attention"
Sep 06 07:30:07 hades ollama[2290141]: time=2025-09-06T07:30:07.934Z level=WARN source=ggml.go:764 msg="model only supports non-quantized cache types " mode=gptoss
Sep 06 07:30:07 hades ollama[2290141]: time=2025-09-06T07:30:07.934Z level=WARN source=server.go:224 msg="kv cache type not supported by model" type=q8_0
Sep 06 07:30:07 hades ollama[2290141]: time=2025-09-06T07:30:07.934Z level=INFO source=server.go:388 msg="starting runner" cmd="/usr/local/bin/ollama runner --ollama-engine --model /data/ollama/models/blobs/sha256-b112e727c6f18875636c56a779790a590d705aec9e1c0eb5a97d51fc2a778583 --port 44345"
Sep 06 07:30:07 hades ollama[2290141]: time=2025-09-06T07:30:07.963Z level=INFO source=runner.go:1006 msg="starting ollama engine"
Sep 06 07:30:07 hades ollama[2290141]: time=2025-09-06T07:30:07.964Z level=INFO source=runner.go:1043 msg="Server listening on 127.0.0.1:44345"
Sep 06 07:30:08 hades ollama[2290141]: time=2025-09-06T07:30:08.528Z level=INFO source=server.go:493 msg="system memory" total="503.9 GiB" free="448.8 GiB" free_swap="0 B"
Sep 06 07:30:09 hades ollama[2290141]: time=2025-09-06T07:30:09.114Z level=WARN source=ggml.go:764 msg="model only supports non-quantized cache types " mode=gptoss
Sep 06 07:30:09 hades ollama[2290141]: time=2025-09-06T07:30:09.702Z level=WARN source=ggml.go:764 msg="model only supports non-quantized cache types " mode=gptoss
Sep 06 07:30:09 hades ollama[2290141]: time=2025-09-06T07:30:09.703Z level=INFO source=memory.go:36 msg="new model will fit in available VRAM across minimum required GPUs, loading" model=/data/ollama/models/blobs/sha256-b112e727c6f18875636c56a779790a590d705aec9e1c0eb5a97d51fc2a778583 library=cuda parallel=1 required="17.1 GiB" gpus=2
Sep 06 07:30:10 hades ollama[2290141]: time=2025-09-06T07:30:10.292Z level=WARN source=ggml.go:764 msg="model only supports non-quantized cache types " mode=gptoss
Sep 06 07:30:10 hades ollama[2290141]: time=2025-09-06T07:30:10.293Z level=INFO source=server.go:533 msg=offload library=cuda layers.requested=-1 layers.model=25 layers.offload=25 layers.split="[13 12]" memory.available="[11.5 GiB 11.5 GiB]" memory.gpu_overhead="0 B" memory.required.full="17.1 GiB" memory.required.partial="17.1 GiB" memory.required.kv="3.1 GiB" memory.required.allocations="[7.7 GiB 9.5 GiB]" memory.weights.total="11.7 GiB" memory.weights.repeating="10.7 GiB" memory.weights.nonrepeating="1.1 GiB" memory.graph.full="242.0 MiB" memory.graph.partial="242.0 MiB"
Sep 06 07:30:10 hades ollama[2290141]: time=2025-09-06T07:30:10.297Z level=INFO source=runner.go:925 msg=load request="{Operation:commit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:true KvSize:131072 KvCacheType: NumThreads:40 GPULayers:25[ID:GPU-63e34265-b678-21e5-4fc2-d55b2c62d1fb Layers:13(0..12) ID:GPU-689620b9-e9eb-499a-8bb1-d5b8390e5f5b Layers:12(13..24)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
Sep 06 07:30:10 hades ollama[2290141]: time=2025-09-06T07:30:10.501Z level=INFO source=ggml.go:130 msg="" architecture=gptoss file_type=MXFP4 name="" description="" num_tensors=315 num_key_values=30
Sep 06 07:30:10 hades ollama[2290141]: ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
Sep 06 07:30:10 hades ollama[2290141]: ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
Sep 06 07:30:10 hades ollama[2290141]: ggml_cuda_init: found 2 CUDA devices:
Sep 06 07:30:10 hades ollama[2290141]:   Device 0: NVIDIA GeForce RTX 3060, compute capability 8.6, VMM: yes, ID: GPU-63e34265-b678-21e5-4fc2-d55b2c62d1fb
Sep 06 07:30:10 hades ollama[2290141]:   Device 1: NVIDIA GeForce RTX 3060, compute capability 8.6, VMM: yes, ID: GPU-689620b9-e9eb-499a-8bb1-d5b8390e5f5b
Sep 06 07:30:10 hades ollama[2290141]: load_backend: loaded CUDA backend from /usr/local/lib/ollama/libggml-cuda.so
Sep 06 07:30:10 hades ollama[2290141]: load_backend: loaded CPU backend from /usr/local/lib/ollama/libggml-cpu-sse42.so
Sep 06 07:30:10 hades ollama[2290141]: time=2025-09-06T07:30:10.682Z level=INFO source=ggml.go:104 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 CUDA.0.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.0.USE_GRAPHS=1 CUDA.0.PEER_MAX_BATCH_SIZE=128 CUDA.1.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.1.USE_GRAPHS=1 CUDA.1.PEER_MAX_BATCH_SIZE=128 compiler=cgo(gcc)
Sep 06 07:30:11 hades ollama[2290141]: time=2025-09-06T07:30:11.168Z level=INFO source=ggml.go:486 msg="offloading 24 repeating layers to GPU"
Sep 06 07:30:11 hades ollama[2290141]: time=2025-09-06T07:30:11.168Z level=INFO source=ggml.go:492 msg="offloading output layer to GPU"
Sep 06 07:30:11 hades ollama[2290141]: time=2025-09-06T07:30:11.168Z level=INFO source=ggml.go:497 msg="offloaded 25/25 layers to GPU"
Sep 06 07:30:11 hades ollama[2290141]: time=2025-09-06T07:30:11.169Z level=INFO source=backend.go:310 msg="model weights" device=CUDA0 size="5.8 GiB"
Sep 06 07:30:11 hades ollama[2290141]: time=2025-09-06T07:30:11.169Z level=INFO source=backend.go:310 msg="model weights" device=CUDA1 size="6.0 GiB"
Sep 06 07:30:11 hades ollama[2290141]: time=2025-09-06T07:30:11.169Z level=INFO source=backend.go:315 msg="model weights" device=CPU size="1.1 GiB"
Sep 06 07:30:11 hades ollama[2290141]: time=2025-09-06T07:30:11.169Z level=INFO source=backend.go:321 msg="kv cache" device=CUDA0 size="1.6 GiB"
Sep 06 07:30:11 hades ollama[2290141]: time=2025-09-06T07:30:11.169Z level=INFO source=backend.go:321 msg="kv cache" device=CUDA1 size="1.5 GiB"
Sep 06 07:30:11 hades ollama[2290141]: time=2025-09-06T07:30:11.169Z level=INFO source=backend.go:332 msg="compute graph" device=CUDA0 size="241.8 MiB"
Sep 06 07:30:11 hades ollama[2290141]: time=2025-09-06T07:30:11.169Z level=INFO source=backend.go:332 msg="compute graph" device=CUDA1 size="233.9 MiB"
Sep 06 07:30:11 hades ollama[2290141]: time=2025-09-06T07:30:11.169Z level=INFO source=backend.go:337 msg="compute graph" device=CPU size="5.6 MiB"
Sep 06 07:30:11 hades ollama[2290141]: time=2025-09-06T07:30:11.169Z level=INFO source=backend.go:342 msg="total memory" size="16.4 GiB"
Sep 06 07:30:11 hades ollama[2290141]: time=2025-09-06T07:30:11.169Z level=INFO source=sched.go:473 msg="loaded runners" count=1
Sep 06 07:30:11 hades ollama[2290141]: time=2025-09-06T07:30:11.169Z level=INFO source=server.go:1236 msg="waiting for llama runner to start responding"
Sep 06 07:30:11 hades ollama[2290141]: time=2025-09-06T07:30:11.171Z level=INFO source=server.go:1270 msg="waiting for server to become available" status="llm server loading model"
Sep 06 07:30:18 hades ollama[2290141]: time=2025-09-06T07:30:18.986Z level=INFO source=server.go:1274 msg="llama runner started in 11.05 seconds"
Sep 06 07:30:18 hades ollama[2290141]: [GIN] 2025/09/06 - 07:30:18 | 200 | 13.131135129s |       127.0.0.1 | POST     "/api/generate"
Sep 06 07:31:51 hades ollama[2290141]: [GIN] 2025/09/06 - 07:31:51 | 200 |     102.794µs |       127.0.0.1 | HEAD     "/"
Sep 06 07:31:51 hades ollama[2290141]: [GIN] 2025/09/06 - 07:31:51 | 200 |      131.16µs |       127.0.0.1 | GET      "/api/ps"
Sep 06 07:58:54 hades ollama[2290141]: time=2025-09-06T07:58:54.341Z level=INFO source=server.go:1405 msg="aborting completion request due to client closing the connection"
Sep 06 07:58:54 hades ollama[2290141]: [GIN] 2025/09/06 - 07:58:54 | 500 |        17m49s |    192.168.1.52 | POST     "/v1/chat/completions"
Sep 06 09:01:36 hades ollama[2290141]: [GIN] 2025/09/06 - 09:01:36 | 200 |    3.536646ms |       127.0.0.1 | GET      "/api/tags"
Sep 06 09:01:36 hades ollama[2290141]: [GIN] 2025/09/06 - 09:01:36 | 200 |      69.077µs |       127.0.0.1 | GET      "/api/ps"
Sep 06 09:01:37 hades ollama[2290141]: [GIN] 2025/09/06 - 09:01:37 | 200 |     141.294µs |       127.0.0.1 | GET      "/api/version"
Sep 06 09:01:40 hades ollama[2290141]: [GIN] 2025/09/06 - 09:01:40 | 200 |    2.911168ms |       127.0.0.1 | GET      "/api/tags"
Sep 06 09:01:40 hades ollama[2290141]: [GIN] 2025/09/06 - 09:01:40 | 200 |       49.06µs |       127.0.0.1 | GET      "/api/ps"
Sep 06 09:01:40 hades ollama[2290141]: [GIN] 2025/09/06 - 09:01:40 | 200 |    2.839001ms |       127.0.0.1 | GET      "/api/tags"
Sep 06 09:01:40 hades ollama[2290141]: [GIN] 2025/09/06 - 09:01:40 | 200 |       43.44µs |       127.0.0.1 | GET      "/api/ps"
Sep 06 09:01:53 hades ollama[2290141]: [GIN] 2025/09/06 - 09:01:53 | 200 |    2.757047ms |       127.0.0.1 | GET      "/api/tags"
Sep 06 09:01:53 hades ollama[2290141]: [GIN] 2025/09/06 - 09:01:53 | 200 |      46.399µs |       127.0.0.1 | GET      "/api/ps"
Sep 06 09:01:54 hades ollama[2290141]: [GIN] 2025/09/06 - 09:01:54 | 200 |    2.608707ms |       127.0.0.1 | GET      "/api/tags"
Sep 06 09:01:54 hades ollama[2290141]: [GIN] 2025/09/06 - 09:01:54 | 200 |      44.176µs |       127.0.0.1 | GET      "/api/ps"

@shiraz-shah commented on GitHub (Sep 6, 2025):

`ollama ps`:

NAME                   ID              SIZE     PROCESSOR    CONTEXT    UNTIL               
gpt-oss-long:latest    23de7335a074    18 GB    100% GPU     131072     29 minutes from now    

`nvidia-smi`:

Sat Sep  6 19:02:32 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.133.07             Driver Version: 570.133.07     CUDA Version: 12.8     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 3060        Off |   00000000:86:00.0 Off |                  N/A |
|  0%   38C    P8             18W /  170W |    7909MiB /  12288MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA GeForce RTX 3060        Off |   00000000:8C:00.0 Off |                  N/A |
|  0%   38C    P8             16W /  170W |    8069MiB /  12288MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A         2468366      C   /usr/local/bin/ollama                  7900MiB |
|    1   N/A  N/A         2468366      C   /usr/local/bin/ollama                  8060MiB |
+-----------------------------------------------------------------------------------------+

`top`:

Tasks: 814 total,   5 running, 809 sleeping,   0 stopped,   0 zombie
%Cpu(s): 13.8 us,  0.4 sy,  0.0 ni, 85.8 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st 
MiB Mem : 515944.1 total, 109350.8 free,  57836.2 used, 392241.3 buff/cache     
MiB Swap:      0.0 total,      0.0 free,      0.0 used. 458107.8 avail Mem 

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND                                                                   
2290141 ollama    20   0   15.2g 630112  30724 S 693.8   0.1 135:37.37 ollama                                                                    

So GPU utilisation is 0% even though the VRAM is allocated. And CPU utilisation is around 700% here.
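A quick way to confirm where a request actually runs is to sample utilisation per second while it's in flight, e.g.:

```shell
# Sample GPU utilization (sm%) once per second during a slow request;
# if it stays near 0%, the compute is happening on the CPU.
nvidia-smi dmon -s u
```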


@shiraz-shah commented on GitHub (Sep 6, 2025):

Server log with debug mode on, while it's doing inference on the CPU and `ollama ps` says 100% GPU:

Sep 06 19:06:35 hades systemd[1]: Started ollama.service - Ollama Service.
Sep 06 19:06:35 hades ollama[2468567]: time=2025-09-06T19:06:35.109Z level=INFO source=routes.go:1331 msg="server config" env="map[CUDA_VISIBLE_DEVICES: GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: HTTPS_PROXY: HTTP_PROXY: NO_PROXY: OLLAMA_CONTEXT_LENGTH:4096 OLLAMA_DEBUG:DEBUG OLLAMA_FLASH_ATTENTION:true OLLAMA_GPU_OVERHEAD:0 OLLAMA_HOST:http://0.0.0.0:11434 OLLAMA_INTEL_GPU:false OLLAMA_KEEP_ALIVE:30m0s OLLAMA_KV_CACHE_TYPE:q8_0 OLLAMA_LLM_LIBRARY: OLLAMA_LOAD_TIMEOUT:5m0s OLLAMA_MAX_LOADED_MODELS:0 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:/data/ollama/models OLLAMA_MULTIUSER_CACHE:false OLLAMA_NEW_ENGINE:false OLLAMA_NEW_ESTIMATES:false OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:1 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://* vscode-webview://* vscode-file://*] OLLAMA_SCHED_SPREAD:false ROCR_VISIBLE_DEVICES: http_proxy: https_proxy: no_proxy:]"
Sep 06 19:06:35 hades ollama[2468567]: time=2025-09-06T19:06:35.112Z level=INFO source=images.go:477 msg="total blobs: 19"
Sep 06 19:06:35 hades ollama[2468567]: time=2025-09-06T19:06:35.113Z level=INFO source=images.go:484 msg="total unused blobs removed: 0"
Sep 06 19:06:35 hades ollama[2468567]: time=2025-09-06T19:06:35.115Z level=INFO source=routes.go:1384 msg="Listening on [::]:11434 (version 0.11.8)"
Sep 06 19:06:35 hades ollama[2468567]: time=2025-09-06T19:06:35.116Z level=DEBUG source=sched.go:121 msg="starting llm scheduler"
Sep 06 19:06:35 hades ollama[2468567]: time=2025-09-06T19:06:35.116Z level=INFO source=gpu.go:217 msg="looking for compatible GPUs"
Sep 06 19:06:35 hades ollama[2468567]: time=2025-09-06T19:06:35.122Z level=DEBUG source=gpu.go:98 msg="searching for GPU discovery libraries for NVIDIA"
Sep 06 19:06:35 hades ollama[2468567]: time=2025-09-06T19:06:35.123Z level=DEBUG source=gpu.go:503 msg="Searching for GPU library" name=libcuda.so*
Sep 06 19:06:35 hades ollama[2468567]: time=2025-09-06T19:06:35.123Z level=DEBUG source=gpu.go:527 msg="gpu library search" globs="[/usr/local/lib/ollama/libcuda.so* /libcuda.so* /usr/local/cuda*/targets/*/lib/libcuda.so* /usr/lib/*-linux-gnu/nvidia/current/libcuda.so* /usr/lib/*-linux-gnu/libcuda.so* /usr/lib/wsl/lib/libcuda.so* /usr/lib/wsl/drivers/*/libcuda.so* /opt/cuda/lib*/libcuda.so* /usr/local/cuda/lib*/libcuda.so* /usr/lib*/libcuda.so* /usr/local/lib*/libcuda.so*]"
Sep 06 19:06:35 hades ollama[2468567]: time=2025-09-06T19:06:35.125Z level=DEBUG source=gpu.go:560 msg="discovered GPU libraries" paths=[/usr/lib/x86_64-linux-gnu/libcuda.so.570.133.07]
Sep 06 19:06:35 hades ollama[2468567]: initializing /usr/lib/x86_64-linux-gnu/libcuda.so.570.133.07
Sep 06 19:06:35 hades ollama[2468567]: dlsym: cuInit - 0x76aee3d0fe70
Sep 06 19:06:35 hades ollama[2468567]: dlsym: cuDriverGetVersion - 0x76aee3d0fe90
Sep 06 19:06:35 hades ollama[2468567]: dlsym: cuDeviceGetCount - 0x76aee3d0fed0
Sep 06 19:06:35 hades ollama[2468567]: dlsym: cuDeviceGet - 0x76aee3d0feb0
Sep 06 19:06:35 hades ollama[2468567]: dlsym: cuDeviceGetAttribute - 0x76aee3d0ffb0
Sep 06 19:06:35 hades ollama[2468567]: dlsym: cuDeviceGetUuid - 0x76aee3d0ff10
Sep 06 19:06:35 hades ollama[2468567]: dlsym: cuDeviceGetName - 0x76aee3d0fef0
Sep 06 19:06:35 hades ollama[2468567]: dlsym: cuCtxCreate_v3 - 0x76aee3d10190
Sep 06 19:06:35 hades ollama[2468567]: dlsym: cuMemGetInfo_v2 - 0x76aee3d10910
Sep 06 19:06:35 hades ollama[2468567]: dlsym: cuCtxDestroy - 0x76aee3d6eab0
Sep 06 19:06:35 hades ollama[2468567]: calling cuInit
Sep 06 19:06:35 hades ollama[2468567]: calling cuDriverGetVersion
Sep 06 19:06:35 hades ollama[2468567]: raw version 0x2f30
Sep 06 19:06:35 hades ollama[2468567]: CUDA driver version: 12.8
Sep 06 19:06:35 hades ollama[2468567]: calling cuDeviceGetCount
Sep 06 19:06:35 hades ollama[2468567]: device count 2
Sep 06 19:06:35 hades ollama[2468567]: time=2025-09-06T19:06:35.243Z level=DEBUG source=gpu.go:125 msg="detected GPUs" count=2 library=/usr/lib/x86_64-linux-gnu/libcuda.so.570.133.07
Sep 06 19:06:35 hades ollama[2468567]: [GPU-63e34265-b678-21e5-4fc2-d55b2c62d1fb] CUDA totalMem 11920mb
Sep 06 19:06:35 hades ollama[2468567]: [GPU-63e34265-b678-21e5-4fc2-d55b2c62d1fb] CUDA freeMem 11809mb
Sep 06 19:06:35 hades ollama[2468567]: [GPU-63e34265-b678-21e5-4fc2-d55b2c62d1fb] Compute Capability 8.6
Sep 06 19:06:35 hades ollama[2468567]: [GPU-689620b9-e9eb-499a-8bb1-d5b8390e5f5b] CUDA totalMem 11920mb
Sep 06 19:06:35 hades ollama[2468567]: [GPU-689620b9-e9eb-499a-8bb1-d5b8390e5f5b] CUDA freeMem 11809mb
Sep 06 19:06:35 hades ollama[2468567]: [GPU-689620b9-e9eb-499a-8bb1-d5b8390e5f5b] Compute Capability 8.6
Sep 06 19:06:35 hades ollama[2468567]: time=2025-09-06T19:06:35.881Z level=WARN source=amd_linux.go:61 msg="ollama recommends running the https://www.amd.com/en/support/download/linux-drivers.html" error="amdgpu version file missing: /sys/module/amdgpu/version stat /sys/module/amdgpu/version: no such file or directory"
Sep 06 19:06:35 hades ollama[2468567]: time=2025-09-06T19:06:35.882Z level=DEBUG source=amd_linux.go:102 msg="evaluating amdgpu node /sys/class/kfd/kfd/topology/nodes/0/properties"
Sep 06 19:06:35 hades ollama[2468567]: time=2025-09-06T19:06:35.882Z level=DEBUG source=amd_linux.go:122 msg="detected CPU /sys/class/kfd/kfd/topology/nodes/0/properties"
Sep 06 19:06:35 hades ollama[2468567]: time=2025-09-06T19:06:35.882Z level=DEBUG source=amd_linux.go:102 msg="evaluating amdgpu node /sys/class/kfd/kfd/topology/nodes/1/properties"
Sep 06 19:06:35 hades ollama[2468567]: time=2025-09-06T19:06:35.882Z level=DEBUG source=amd_linux.go:122 msg="detected CPU /sys/class/kfd/kfd/topology/nodes/1/properties"
Sep 06 19:06:35 hades ollama[2468567]: time=2025-09-06T19:06:35.882Z level=DEBUG source=amd_linux.go:102 msg="evaluating amdgpu node /sys/class/kfd/kfd/topology/nodes/2/properties"
Sep 06 19:06:35 hades ollama[2468567]: time=2025-09-06T19:06:35.882Z level=DEBUG source=amd_linux.go:122 msg="detected CPU /sys/class/kfd/kfd/topology/nodes/2/properties"
Sep 06 19:06:35 hades ollama[2468567]: time=2025-09-06T19:06:35.882Z level=DEBUG source=amd_linux.go:102 msg="evaluating amdgpu node /sys/class/kfd/kfd/topology/nodes/3/properties"
Sep 06 19:06:35 hades ollama[2468567]: time=2025-09-06T19:06:35.882Z level=DEBUG source=amd_linux.go:122 msg="detected CPU /sys/class/kfd/kfd/topology/nodes/3/properties"
Sep 06 19:06:35 hades ollama[2468567]: time=2025-09-06T19:06:35.882Z level=INFO source=amd_linux.go:405 msg="no compatible amdgpu devices detected"
Sep 06 19:06:35 hades ollama[2468567]: releasing cuda driver library
Sep 06 19:06:35 hades ollama[2468567]: time=2025-09-06T19:06:35.882Z level=INFO source=types.go:130 msg="inference compute" id=GPU-63e34265-b678-21e5-4fc2-d55b2c62d1fb library=cuda variant=v12 compute=8.6 driver=12.8 name="NVIDIA GeForce RTX 3060" total="11.6 GiB" available="11.5 GiB"
Sep 06 19:06:35 hades ollama[2468567]: time=2025-09-06T19:06:35.882Z level=INFO source=types.go:130 msg="inference compute" id=GPU-689620b9-e9eb-499a-8bb1-d5b8390e5f5b library=cuda variant=v12 compute=8.6 driver=12.8 name="NVIDIA GeForce RTX 3060" total="11.6 GiB" available="11.5 GiB"
Sep 06 19:07:09 hades ollama[2468567]: time=2025-09-06T19:07:09.087Z level=DEBUG source=gpu.go:393 msg="updating system memory data" before.total="503.9 GiB" before.free="449.4 GiB" before.free_swap="0 B" now.total="503.9 GiB" now.free="449.3 GiB" now.free_swap="0 B"
Sep 06 19:07:09 hades ollama[2468567]: initializing /usr/lib/x86_64-linux-gnu/libcuda.so.570.133.07
Sep 06 19:07:09 hades ollama[2468567]: dlsym: cuInit - 0x76aee3d0fe70
Sep 06 19:07:09 hades ollama[2468567]: dlsym: cuDriverGetVersion - 0x76aee3d0fe90
Sep 06 19:07:09 hades ollama[2468567]: dlsym: cuDeviceGetCount - 0x76aee3d0fed0
Sep 06 19:07:09 hades ollama[2468567]: dlsym: cuDeviceGet - 0x76aee3d0feb0
Sep 06 19:07:09 hades ollama[2468567]: dlsym: cuDeviceGetAttribute - 0x76aee3d0ffb0
Sep 06 19:07:09 hades ollama[2468567]: dlsym: cuDeviceGetUuid - 0x76aee3d0ff10
Sep 06 19:07:09 hades ollama[2468567]: dlsym: cuDeviceGetName - 0x76aee3d0fef0
Sep 06 19:07:09 hades ollama[2468567]: dlsym: cuCtxCreate_v3 - 0x76aee3d10190
Sep 06 19:07:09 hades ollama[2468567]: dlsym: cuMemGetInfo_v2 - 0x76aee3d10910
Sep 06 19:07:09 hades ollama[2468567]: dlsym: cuCtxDestroy - 0x76aee3d6eab0
Sep 06 19:07:09 hades ollama[2468567]: calling cuInit
Sep 06 19:07:09 hades ollama[2468567]: calling cuDriverGetVersion
Sep 06 19:07:09 hades ollama[2468567]: raw version 0x2f30
Sep 06 19:07:09 hades ollama[2468567]: CUDA driver version: 12.8
Sep 06 19:07:09 hades ollama[2468567]: calling cuDeviceGetCount
Sep 06 19:07:09 hades ollama[2468567]: device count 2
Sep 06 19:07:09 hades ollama[2468567]: time=2025-09-06T19:07:09.426Z level=DEBUG source=gpu.go:443 msg="updating cuda memory data" gpu=GPU-63e34265-b678-21e5-4fc2-d55b2c62d1fb name="NVIDIA GeForce RTX 3060" overhead="0 B" before.total="11.6 GiB" before.free="11.5 GiB" now.total="11.6 GiB" now.free="11.5 GiB" now.used="110.9 MiB"
Sep 06 19:07:09 hades ollama[2468567]: time=2025-09-06T19:07:09.730Z level=DEBUG source=gpu.go:443 msg="updating cuda memory data" gpu=GPU-689620b9-e9eb-499a-8bb1-d5b8390e5f5b name="NVIDIA GeForce RTX 3060" overhead="0 B" before.total="11.6 GiB" before.free="11.5 GiB" now.total="11.6 GiB" now.free="11.5 GiB" now.used="110.9 MiB"
Sep 06 19:07:09 hades ollama[2468567]: releasing cuda driver library
Sep 06 19:07:09 hades ollama[2468567]: time=2025-09-06T19:07:09.731Z level=DEBUG source=sched.go:188 msg="updating default concurrency" OLLAMA_MAX_LOADED_MODELS=6 gpu_count=2
Sep 06 19:07:09 hades ollama[2468567]: time=2025-09-06T19:07:09.805Z level=DEBUG source=ggml.go:210 msg="key with type not found" key=general.alignment default=32
Sep 06 19:07:09 hades ollama[2468567]: time=2025-09-06T19:07:09.805Z level=DEBUG source=sched.go:208 msg="loading first model" model=/data/ollama/models/blobs/sha256-b112e727c6f18875636c56a779790a590d705aec9e1c0eb5a97d51fc2a778583
Sep 06 19:07:10 hades ollama[2468567]: time=2025-09-06T19:07:10.159Z level=DEBUG source=ggml.go:210 msg="key with type not found" key=general.alignment default=32
Sep 06 19:07:10 hades ollama[2468567]: time=2025-09-06T19:07:10.159Z level=DEBUG source=ggml.go:210 msg="key with type not found" key=tokenizer.ggml.pretokenizer default="[^\\r\\n\\p{L}\\p{N}]?[\\p{Lu}\\p{Lt}\\p{Lm}\\p{Lo}\\p{M}]*[\\p{Ll}\\p{Lm}\\p{Lo}\\p{M}]+(?i:'s|'t|'re|'ve|'m|'ll|'d)?|[^\\r\\n\\p{L}\\p{N}]?[\\p{Lu}\\p{Lt}\\p{Lm}\\p{Lo}\\p{M}]+[\\p{Ll}\\p{Lm}\\p{Lo}\\p{M}]*(?i:'s|'t|'re|'ve|'m|'ll|'d)?|\\p{N}{1,3}| ?[^\\s\\p{L}\\p{N}]+[\\r\\n/]*|\\s*[\\r\\n]+|\\s+(?!\\S)|\\s+"
Sep 06 19:07:10 hades ollama[2468567]: time=2025-09-06T19:07:10.160Z level=DEBUG source=gpu.go:393 msg="updating system memory data" before.total="503.9 GiB" before.free="449.3 GiB" before.free_swap="0 B" now.total="503.9 GiB" now.free="449.2 GiB" now.free_swap="0 B"
Sep 06 19:07:10 hades ollama[2468567]: initializing /usr/lib/x86_64-linux-gnu/libcuda.so.570.133.07
Sep 06 19:07:10 hades ollama[2468567]: dlsym: cuInit - 0x76aee3d0fe70
Sep 06 19:07:10 hades ollama[2468567]: dlsym: cuDriverGetVersion - 0x76aee3d0fe90
Sep 06 19:07:10 hades ollama[2468567]: dlsym: cuDeviceGetCount - 0x76aee3d0fed0
Sep 06 19:07:10 hades ollama[2468567]: dlsym: cuDeviceGet - 0x76aee3d0feb0
Sep 06 19:07:10 hades ollama[2468567]: dlsym: cuDeviceGetAttribute - 0x76aee3d0ffb0
Sep 06 19:07:10 hades ollama[2468567]: dlsym: cuDeviceGetUuid - 0x76aee3d0ff10
Sep 06 19:07:10 hades ollama[2468567]: dlsym: cuDeviceGetName - 0x76aee3d0fef0
Sep 06 19:07:10 hades ollama[2468567]: dlsym: cuCtxCreate_v3 - 0x76aee3d10190
Sep 06 19:07:10 hades ollama[2468567]: dlsym: cuMemGetInfo_v2 - 0x76aee3d10910
Sep 06 19:07:10 hades ollama[2468567]: dlsym: cuCtxDestroy - 0x76aee3d6eab0
Sep 06 19:07:10 hades ollama[2468567]: calling cuInit
Sep 06 19:07:10 hades ollama[2468567]: calling cuDriverGetVersion
Sep 06 19:07:10 hades ollama[2468567]: raw version 0x2f30
Sep 06 19:07:10 hades ollama[2468567]: CUDA driver version: 12.8
Sep 06 19:07:10 hades ollama[2468567]: calling cuDeviceGetCount
Sep 06 19:07:10 hades ollama[2468567]: device count 2
Sep 06 19:07:10 hades ollama[2468567]: time=2025-09-06T19:07:10.483Z level=DEBUG source=gpu.go:443 msg="updating cuda memory data" gpu=GPU-63e34265-b678-21e5-4fc2-d55b2c62d1fb name="NVIDIA GeForce RTX 3060" overhead="0 B" before.total="11.6 GiB" before.free="11.5 GiB" now.total="11.6 GiB" now.free="11.5 GiB" now.used="110.9 MiB"
Sep 06 19:07:10 hades ollama[2468567]: time=2025-09-06T19:07:10.781Z level=DEBUG source=gpu.go:443 msg="updating cuda memory data" gpu=GPU-689620b9-e9eb-499a-8bb1-d5b8390e5f5b name="NVIDIA GeForce RTX 3060" overhead="0 B" before.total="11.6 GiB" before.free="11.5 GiB" now.total="11.6 GiB" now.free="11.5 GiB" now.used="110.9 MiB"
Sep 06 19:07:10 hades ollama[2468567]: releasing cuda driver library
Sep 06 19:07:10 hades ollama[2468567]: time=2025-09-06T19:07:10.781Z level=INFO source=server.go:199 msg="model wants flash attention"
Sep 06 19:07:10 hades ollama[2468567]: time=2025-09-06T19:07:10.781Z level=INFO source=server.go:216 msg="enabling flash attention"
Sep 06 19:07:10 hades ollama[2468567]: time=2025-09-06T19:07:10.781Z level=WARN source=ggml.go:764 msg="model only supports non-quantized cache types " mode=gptoss
Sep 06 19:07:10 hades ollama[2468567]: time=2025-09-06T19:07:10.781Z level=WARN source=server.go:224 msg="kv cache type not supported by model" type=q8_0
Sep 06 19:07:10 hades ollama[2468567]: time=2025-09-06T19:07:10.782Z level=INFO source=server.go:388 msg="starting runner" cmd="/usr/local/bin/ollama runner --ollama-engine --model /data/ollama/models/blobs/sha256-b112e727c6f18875636c56a779790a590d705aec9e1c0eb5a97d51fc2a778583 --port 34411"
Sep 06 19:07:10 hades ollama[2468567]: time=2025-09-06T19:07:10.783Z level=DEBUG source=server.go:389 msg=subprocess PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin OLLAMA_FLASH_ATTENTION=1 OLLAMA_KV_CACHE_TYPE=q8_0 OLLAMA_MODELS=/data/ollama/models OLLAMA_KEEP_ALIVE=1800 OLLAMA_HOST=0.0.0.0 OLLAMA_DEBUG=1 OLLAMA_MAX_LOADED_MODELS=6 OLLAMA_LIBRARY_PATH=/usr/local/lib/ollama LD_LIBRARY_PATH=/usr/local/lib/ollama:/usr/local/lib/ollama
Sep 06 19:07:10 hades ollama[2468567]: time=2025-09-06T19:07:10.784Z level=DEBUG source=gpu.go:393 msg="updating system memory data" before.total="503.9 GiB" before.free="449.2 GiB" before.free_swap="0 B" now.total="503.9 GiB" now.free="449.2 GiB" now.free_swap="0 B"
Sep 06 19:07:10 hades ollama[2468567]: initializing /usr/lib/x86_64-linux-gnu/libcuda.so.570.133.07
Sep 06 19:07:10 hades ollama[2468567]: dlsym: cuInit - 0x76aee3d0fe70
Sep 06 19:07:10 hades ollama[2468567]: dlsym: cuDriverGetVersion - 0x76aee3d0fe90
Sep 06 19:07:10 hades ollama[2468567]: dlsym: cuDeviceGetCount - 0x76aee3d0fed0
Sep 06 19:07:10 hades ollama[2468567]: dlsym: cuDeviceGet - 0x76aee3d0feb0
Sep 06 19:07:10 hades ollama[2468567]: dlsym: cuDeviceGetAttribute - 0x76aee3d0ffb0
Sep 06 19:07:10 hades ollama[2468567]: dlsym: cuDeviceGetUuid - 0x76aee3d0ff10
Sep 06 19:07:10 hades ollama[2468567]: dlsym: cuDeviceGetName - 0x76aee3d0fef0
Sep 06 19:07:10 hades ollama[2468567]: dlsym: cuCtxCreate_v3 - 0x76aee3d10190
Sep 06 19:07:10 hades ollama[2468567]: dlsym: cuMemGetInfo_v2 - 0x76aee3d10910
Sep 06 19:07:10 hades ollama[2468567]: dlsym: cuCtxDestroy - 0x76aee3d6eab0
Sep 06 19:07:10 hades ollama[2468567]: calling cuInit
Sep 06 19:07:10 hades ollama[2468567]: calling cuDriverGetVersion
Sep 06 19:07:10 hades ollama[2468567]: raw version 0x2f30
Sep 06 19:07:10 hades ollama[2468567]: CUDA driver version: 12.8
Sep 06 19:07:10 hades ollama[2468567]: calling cuDeviceGetCount
Sep 06 19:07:10 hades ollama[2468567]: device count 2
Sep 06 19:07:10 hades ollama[2468567]: time=2025-09-06T19:07:10.814Z level=INFO source=runner.go:1006 msg="starting ollama engine"
Sep 06 19:07:10 hades ollama[2468567]: time=2025-09-06T19:07:10.814Z level=INFO source=runner.go:1043 msg="Server listening on 127.0.0.1:34411"
Sep 06 19:07:11 hades ollama[2468567]: time=2025-09-06T19:07:11.077Z level=DEBUG source=gpu.go:443 msg="updating cuda memory data" gpu=GPU-63e34265-b678-21e5-4fc2-d55b2c62d1fb name="NVIDIA GeForce RTX 3060" overhead="0 B" before.total="11.6 GiB" before.free="11.5 GiB" now.total="11.6 GiB" now.free="11.5 GiB" now.used="110.9 MiB"
Sep 06 19:07:11 hades ollama[2468567]: time=2025-09-06T19:07:11.376Z level=DEBUG source=gpu.go:443 msg="updating cuda memory data" gpu=GPU-689620b9-e9eb-499a-8bb1-d5b8390e5f5b name="NVIDIA GeForce RTX 3060" overhead="0 B" before.total="11.6 GiB" before.free="11.5 GiB" now.total="11.6 GiB" now.free="11.5 GiB" now.used="110.9 MiB"
Sep 06 19:07:11 hades ollama[2468567]: releasing cuda driver library
Sep 06 19:07:11 hades ollama[2468567]: time=2025-09-06T19:07:11.376Z level=INFO source=server.go:493 msg="system memory" total="503.9 GiB" free="449.2 GiB" free_swap="0 B"
Sep 06 19:07:11 hades ollama[2468567]: time=2025-09-06T19:07:11.376Z level=DEBUG source=memory.go:181 msg=evaluating library=cuda gpu_count=1 available="[11.5 GiB]"
Sep 06 19:07:11 hades ollama[2468567]: time=2025-09-06T19:07:11.377Z level=DEBUG source=ggml.go:210 msg="key with type not found" key=gptoss.vision.block_count default=0
Sep 06 19:07:11 hades ollama[2468567]: time=2025-09-06T19:07:11.377Z level=DEBUG source=gpu.go:393 msg="updating system memory data" before.total="503.9 GiB" before.free="449.2 GiB" before.free_swap="0 B" now.total="503.9 GiB" now.free="449.1 GiB" now.free_swap="0 B"
Sep 06 19:07:11 hades ollama[2468567]: initializing /usr/lib/x86_64-linux-gnu/libcuda.so.570.133.07
Sep 06 19:07:11 hades ollama[2468567]: dlsym: cuInit - 0x76aee3d0fe70
Sep 06 19:07:11 hades ollama[2468567]: dlsym: cuDriverGetVersion - 0x76aee3d0fe90
Sep 06 19:07:11 hades ollama[2468567]: dlsym: cuDeviceGetCount - 0x76aee3d0fed0
Sep 06 19:07:11 hades ollama[2468567]: dlsym: cuDeviceGet - 0x76aee3d0feb0
Sep 06 19:07:11 hades ollama[2468567]: dlsym: cuDeviceGetAttribute - 0x76aee3d0ffb0
Sep 06 19:07:11 hades ollama[2468567]: dlsym: cuDeviceGetUuid - 0x76aee3d0ff10
Sep 06 19:07:11 hades ollama[2468567]: dlsym: cuDeviceGetName - 0x76aee3d0fef0
Sep 06 19:07:11 hades ollama[2468567]: dlsym: cuCtxCreate_v3 - 0x76aee3d10190
Sep 06 19:07:11 hades ollama[2468567]: dlsym: cuMemGetInfo_v2 - 0x76aee3d10910
Sep 06 19:07:11 hades ollama[2468567]: dlsym: cuCtxDestroy - 0x76aee3d6eab0
Sep 06 19:07:11 hades ollama[2468567]: calling cuInit
Sep 06 19:07:11 hades ollama[2468567]: calling cuDriverGetVersion
Sep 06 19:07:11 hades ollama[2468567]: raw version 0x2f30
Sep 06 19:07:11 hades ollama[2468567]: CUDA driver version: 12.8
Sep 06 19:07:11 hades ollama[2468567]: calling cuDeviceGetCount
Sep 06 19:07:11 hades ollama[2468567]: device count 2
Sep 06 19:07:11 hades ollama[2468567]: time=2025-09-06T19:07:11.672Z level=DEBUG source=gpu.go:443 msg="updating cuda memory data" gpu=GPU-63e34265-b678-21e5-4fc2-d55b2c62d1fb name="NVIDIA GeForce RTX 3060" overhead="0 B" before.total="11.6 GiB" before.free="11.5 GiB" now.total="11.6 GiB" now.free="11.5 GiB" now.used="110.9 MiB"
Sep 06 19:07:11 hades ollama[2468567]: time=2025-09-06T19:07:11.966Z level=DEBUG source=gpu.go:443 msg="updating cuda memory data" gpu=GPU-689620b9-e9eb-499a-8bb1-d5b8390e5f5b name="NVIDIA GeForce RTX 3060" overhead="0 B" before.total="11.6 GiB" before.free="11.5 GiB" now.total="11.6 GiB" now.free="11.5 GiB" now.used="110.9 MiB"
Sep 06 19:07:11 hades ollama[2468567]: releasing cuda driver library
Sep 06 19:07:11 hades ollama[2468567]: time=2025-09-06T19:07:11.966Z level=WARN source=ggml.go:764 msg="model only supports non-quantized cache types " mode=gptoss
Sep 06 19:07:11 hades ollama[2468567]: time=2025-09-06T19:07:11.967Z level=DEBUG source=memory.go:181 msg=evaluating library=cuda gpu_count=2 available="[11.5 GiB 11.5 GiB]"
Sep 06 19:07:11 hades ollama[2468567]: time=2025-09-06T19:07:11.967Z level=DEBUG source=ggml.go:210 msg="key with type not found" key=gptoss.vision.block_count default=0
Sep 06 19:07:11 hades ollama[2468567]: time=2025-09-06T19:07:11.968Z level=DEBUG source=gpu.go:393 msg="updating system memory data" before.total="503.9 GiB" before.free="449.1 GiB" before.free_swap="0 B" now.total="503.9 GiB" now.free="449.2 GiB" now.free_swap="0 B"
Sep 06 19:07:11 hades ollama[2468567]: initializing /usr/lib/x86_64-linux-gnu/libcuda.so.570.133.07
Sep 06 19:07:11 hades ollama[2468567]: dlsym: cuInit - 0x76aee3d0fe70
Sep 06 19:07:11 hades ollama[2468567]: dlsym: cuDriverGetVersion - 0x76aee3d0fe90
Sep 06 19:07:11 hades ollama[2468567]: dlsym: cuDeviceGetCount - 0x76aee3d0fed0
Sep 06 19:07:11 hades ollama[2468567]: dlsym: cuDeviceGet - 0x76aee3d0feb0
Sep 06 19:07:11 hades ollama[2468567]: dlsym: cuDeviceGetAttribute - 0x76aee3d0ffb0
Sep 06 19:07:11 hades ollama[2468567]: dlsym: cuDeviceGetUuid - 0x76aee3d0ff10
Sep 06 19:07:11 hades ollama[2468567]: dlsym: cuDeviceGetName - 0x76aee3d0fef0
Sep 06 19:07:11 hades ollama[2468567]: dlsym: cuCtxCreate_v3 - 0x76aee3d10190
Sep 06 19:07:11 hades ollama[2468567]: dlsym: cuMemGetInfo_v2 - 0x76aee3d10910
Sep 06 19:07:11 hades ollama[2468567]: dlsym: cuCtxDestroy - 0x76aee3d6eab0
Sep 06 19:07:11 hades ollama[2468567]: calling cuInit
Sep 06 19:07:11 hades ollama[2468567]: calling cuDriverGetVersion
Sep 06 19:07:11 hades ollama[2468567]: raw version 0x2f30
Sep 06 19:07:11 hades ollama[2468567]: CUDA driver version: 12.8
Sep 06 19:07:11 hades ollama[2468567]: calling cuDeviceGetCount
Sep 06 19:07:11 hades ollama[2468567]: device count 2
Sep 06 19:07:12 hades ollama[2468567]: time=2025-09-06T19:07:12.264Z level=DEBUG source=gpu.go:443 msg="updating cuda memory data" gpu=GPU-63e34265-b678-21e5-4fc2-d55b2c62d1fb name="NVIDIA GeForce RTX 3060" overhead="0 B" before.total="11.6 GiB" before.free="11.5 GiB" now.total="11.6 GiB" now.free="11.5 GiB" now.used="110.9 MiB"
Sep 06 19:07:12 hades ollama[2468567]: time=2025-09-06T19:07:12.552Z level=DEBUG source=gpu.go:443 msg="updating cuda memory data" gpu=GPU-689620b9-e9eb-499a-8bb1-d5b8390e5f5b name="NVIDIA GeForce RTX 3060" overhead="0 B" before.total="11.6 GiB" before.free="11.5 GiB" now.total="11.6 GiB" now.free="11.5 GiB" now.used="110.9 MiB"
Sep 06 19:07:12 hades ollama[2468567]: releasing cuda driver library
Sep 06 19:07:12 hades ollama[2468567]: time=2025-09-06T19:07:12.552Z level=WARN source=ggml.go:764 msg="model only supports non-quantized cache types " mode=gptoss
Sep 06 19:07:12 hades ollama[2468567]: time=2025-09-06T19:07:12.553Z level=INFO source=memory.go:36 msg="new model will fit in available VRAM across minimum required GPUs, loading" model=/data/ollama/models/blobs/sha256-b112e727c6f18875636c56a779790a590d705aec9e1c0eb5a97d51fc2a778583 library=cuda parallel=1 required="17.1 GiB" gpus=2
Sep 06 19:07:12 hades ollama[2468567]: time=2025-09-06T19:07:12.553Z level=DEBUG source=memory.go:181 msg=evaluating library=cuda gpu_count=2 available="[11.5 GiB 11.5 GiB]"
Sep 06 19:07:12 hades ollama[2468567]: time=2025-09-06T19:07:12.553Z level=DEBUG source=ggml.go:210 msg="key with type not found" key=gptoss.vision.block_count default=0
Sep 06 19:07:12 hades ollama[2468567]: time=2025-09-06T19:07:12.553Z level=DEBUG source=gpu.go:393 msg="updating system memory data" before.total="503.9 GiB" before.free="449.2 GiB" before.free_swap="0 B" now.total="503.9 GiB" now.free="449.2 GiB" now.free_swap="0 B"
Sep 06 19:07:12 hades ollama[2468567]: initializing /usr/lib/x86_64-linux-gnu/libcuda.so.570.133.07
Sep 06 19:07:12 hades ollama[2468567]: dlsym: cuInit - 0x76aee3d0fe70
Sep 06 19:07:12 hades ollama[2468567]: dlsym: cuDriverGetVersion - 0x76aee3d0fe90
Sep 06 19:07:12 hades ollama[2468567]: dlsym: cuDeviceGetCount - 0x76aee3d0fed0
Sep 06 19:07:12 hades ollama[2468567]: dlsym: cuDeviceGet - 0x76aee3d0feb0
Sep 06 19:07:12 hades ollama[2468567]: dlsym: cuDeviceGetAttribute - 0x76aee3d0ffb0
Sep 06 19:07:12 hades ollama[2468567]: dlsym: cuDeviceGetUuid - 0x76aee3d0ff10
Sep 06 19:07:12 hades ollama[2468567]: dlsym: cuDeviceGetName - 0x76aee3d0fef0
Sep 06 19:07:12 hades ollama[2468567]: dlsym: cuCtxCreate_v3 - 0x76aee3d10190
Sep 06 19:07:12 hades ollama[2468567]: dlsym: cuMemGetInfo_v2 - 0x76aee3d10910
Sep 06 19:07:12 hades ollama[2468567]: dlsym: cuCtxDestroy - 0x76aee3d6eab0
Sep 06 19:07:12 hades ollama[2468567]: calling cuInit
Sep 06 19:07:12 hades ollama[2468567]: calling cuDriverGetVersion
Sep 06 19:07:12 hades ollama[2468567]: raw version 0x2f30
Sep 06 19:07:12 hades ollama[2468567]: CUDA driver version: 12.8
Sep 06 19:07:12 hades ollama[2468567]: calling cuDeviceGetCount
Sep 06 19:07:12 hades ollama[2468567]: device count 2
Sep 06 19:07:12 hades ollama[2468567]: time=2025-09-06T19:07:12.842Z level=DEBUG source=gpu.go:443 msg="updating cuda memory data" gpu=GPU-63e34265-b678-21e5-4fc2-d55b2c62d1fb name="NVIDIA GeForce RTX 3060" overhead="0 B" before.total="11.6 GiB" before.free="11.5 GiB" now.total="11.6 GiB" now.free="11.5 GiB" now.used="110.9 MiB"
Sep 06 19:07:13 hades ollama[2468567]: time=2025-09-06T19:07:13.129Z level=DEBUG source=gpu.go:443 msg="updating cuda memory data" gpu=GPU-689620b9-e9eb-499a-8bb1-d5b8390e5f5b name="NVIDIA GeForce RTX 3060" overhead="0 B" before.total="11.6 GiB" before.free="11.5 GiB" now.total="11.6 GiB" now.free="11.5 GiB" now.used="110.9 MiB"
Sep 06 19:07:13 hades ollama[2468567]: releasing cuda driver library
Sep 06 19:07:13 hades ollama[2468567]: time=2025-09-06T19:07:13.129Z level=WARN source=ggml.go:764 msg="model only supports non-quantized cache types " mode=gptoss
Sep 06 19:07:13 hades ollama[2468567]: time=2025-09-06T19:07:13.130Z level=INFO source=server.go:533 msg=offload library=cuda layers.requested=-1 layers.model=25 layers.offload=25 layers.split="[13 12]" memory.available="[11.5 GiB 11.5 GiB]" memory.gpu_overhead="0 B" memory.required.full="17.1 GiB" memory.required.partial="17.1 GiB" memory.required.kv="3.1 GiB" memory.required.allocations="[7.7 GiB 9.5 GiB]" memory.weights.total="11.7 GiB" memory.weights.repeating="10.7 GiB" memory.weights.nonrepeating="1.1 GiB" memory.graph.full="242.0 MiB" memory.graph.partial="242.0 MiB"
Sep 06 19:07:13 hades ollama[2468567]: time=2025-09-06T19:07:13.134Z level=INFO source=runner.go:925 msg=load request="{Operation:commit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:true KvSize:131072 KvCacheType: NumThreads:40 GPULayers:25[ID:GPU-63e34265-b678-21e5-4fc2-d55b2c62d1fb Layers:13(0..12) ID:GPU-689620b9-e9eb-499a-8bb1-d5b8390e5f5b Layers:12(13..24)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
Sep 06 19:07:13 hades ollama[2468567]: time=2025-09-06T19:07:13.329Z level=DEBUG source=ggml.go:210 msg="key with type not found" key=general.alignment default=32
Sep 06 19:07:13 hades ollama[2468567]: time=2025-09-06T19:07:13.330Z level=DEBUG source=ggml.go:210 msg="key with type not found" key=general.name default=""
Sep 06 19:07:13 hades ollama[2468567]: time=2025-09-06T19:07:13.330Z level=DEBUG source=ggml.go:210 msg="key with type not found" key=general.description default=""
Sep 06 19:07:13 hades ollama[2468567]: time=2025-09-06T19:07:13.330Z level=INFO source=ggml.go:130 msg="" architecture=gptoss file_type=MXFP4 name="" description="" num_tensors=315 num_key_values=30
Sep 06 19:07:13 hades ollama[2468567]: time=2025-09-06T19:07:13.330Z level=DEBUG source=ggml.go:94 msg="ggml backend load all from path" path=/usr/local/lib/ollama
Sep 06 19:07:13 hades ollama[2468567]: ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
Sep 06 19:07:13 hades ollama[2468567]: ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
Sep 06 19:07:13 hades ollama[2468567]: ggml_cuda_init: found 2 CUDA devices:
Sep 06 19:07:13 hades ollama[2468567]:   Device 0: NVIDIA GeForce RTX 3060, compute capability 8.6, VMM: yes, ID: GPU-63e34265-b678-21e5-4fc2-d55b2c62d1fb
Sep 06 19:07:13 hades ollama[2468567]:   Device 1: NVIDIA GeForce RTX 3060, compute capability 8.6, VMM: yes, ID: GPU-689620b9-e9eb-499a-8bb1-d5b8390e5f5b
Sep 06 19:07:13 hades ollama[2468567]: load_backend: loaded CUDA backend from /usr/local/lib/ollama/libggml-cuda.so
Sep 06 19:07:13 hades ollama[2468567]: load_backend: loaded CPU backend from /usr/local/lib/ollama/libggml-cpu-sse42.so
Sep 06 19:07:13 hades ollama[2468567]: time=2025-09-06T19:07:13.515Z level=INFO source=ggml.go:104 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 CUDA.0.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.0.USE_GRAPHS=1 CUDA.0.PEER_MAX_BATCH_SIZE=128 CUDA.1.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.1.USE_GRAPHS=1 CUDA.1.PEER_MAX_BATCH_SIZE=128 compiler=cgo(gcc)
Sep 06 19:07:13 hades ollama[2468567]: time=2025-09-06T19:07:13.970Z level=DEBUG source=ggml.go:210 msg="key with type not found" key=tokenizer.ggml.pretokenizer default="[^\\r\\n\\p{L}\\p{N}]?[\\p{Lu}\\p{Lt}\\p{Lm}\\p{Lo}\\p{M}]*[\\p{Ll}\\p{Lm}\\p{Lo}\\p{M}]+(?i:'s|'t|'re|'ve|'m|'ll|'d)?|[^\\r\\n\\p{L}\\p{N}]?[\\p{Lu}\\p{Lt}\\p{Lm}\\p{Lo}\\p{M}]+[\\p{Ll}\\p{Lm}\\p{Lo}\\p{M}]*(?i:'s|'t|'re|'ve|'m|'ll|'d)?|\\p{N}{1,3}| ?[^\\s\\p{L}\\p{N}]+[\\r\\n/]*|\\s*[\\r\\n]+|\\s+(?!\\S)|\\s+"
Sep 06 19:07:14 hades ollama[2468567]: time=2025-09-06T19:07:14.002Z level=DEBUG source=ggml.go:784 msg="compute graph" nodes=1325 splits=3
Sep 06 19:07:14 hades ollama[2468567]: time=2025-09-06T19:07:14.002Z level=INFO source=ggml.go:486 msg="offloading 24 repeating layers to GPU"
Sep 06 19:07:14 hades ollama[2468567]: time=2025-09-06T19:07:14.002Z level=INFO source=ggml.go:492 msg="offloading output layer to GPU"
Sep 06 19:07:14 hades ollama[2468567]: time=2025-09-06T19:07:14.002Z level=INFO source=ggml.go:497 msg="offloaded 25/25 layers to GPU"
Sep 06 19:07:14 hades ollama[2468567]: time=2025-09-06T19:07:14.003Z level=INFO source=backend.go:310 msg="model weights" device=CUDA0 size="5.8 GiB"
Sep 06 19:07:14 hades ollama[2468567]: time=2025-09-06T19:07:14.003Z level=INFO source=backend.go:310 msg="model weights" device=CUDA1 size="6.0 GiB"
Sep 06 19:07:14 hades ollama[2468567]: time=2025-09-06T19:07:14.003Z level=INFO source=backend.go:315 msg="model weights" device=CPU size="1.1 GiB"
Sep 06 19:07:14 hades ollama[2468567]: time=2025-09-06T19:07:14.003Z level=INFO source=backend.go:321 msg="kv cache" device=CUDA0 size="1.6 GiB"
Sep 06 19:07:14 hades ollama[2468567]: time=2025-09-06T19:07:14.003Z level=INFO source=backend.go:321 msg="kv cache" device=CUDA1 size="1.5 GiB"
Sep 06 19:07:14 hades ollama[2468567]: time=2025-09-06T19:07:14.003Z level=INFO source=backend.go:332 msg="compute graph" device=CUDA0 size="241.8 MiB"
Sep 06 19:07:14 hades ollama[2468567]: time=2025-09-06T19:07:14.003Z level=INFO source=backend.go:332 msg="compute graph" device=CUDA1 size="233.9 MiB"
Sep 06 19:07:14 hades ollama[2468567]: time=2025-09-06T19:07:14.003Z level=INFO source=backend.go:337 msg="compute graph" device=CPU size="5.6 MiB"
Sep 06 19:07:14 hades ollama[2468567]: time=2025-09-06T19:07:14.003Z level=INFO source=backend.go:342 msg="total memory" size="16.4 GiB"
Sep 06 19:07:14 hades ollama[2468567]: time=2025-09-06T19:07:14.003Z level=INFO source=sched.go:473 msg="loaded runners" count=1
Sep 06 19:07:14 hades ollama[2468567]: time=2025-09-06T19:07:14.003Z level=INFO source=server.go:1236 msg="waiting for llama runner to start responding"
Sep 06 19:07:14 hades ollama[2468567]: time=2025-09-06T19:07:14.004Z level=INFO source=server.go:1270 msg="waiting for server to become available" status="llm server loading model"
Sep 06 19:07:14 hades ollama[2468567]: time=2025-09-06T19:07:14.004Z level=DEBUG source=server.go:1280 msg="model load progress 0.00"
Sep 06 19:07:14 hades ollama[2468567]: time=2025-09-06T19:07:14.256Z level=DEBUG source=server.go:1280 msg="model load progress 0.03"
Sep 06 19:07:14 hades ollama[2468567]: time=2025-09-06T19:07:14.508Z level=DEBUG source=server.go:1280 msg="model load progress 0.06"
Sep 06 19:07:14 hades ollama[2468567]: time=2025-09-06T19:07:14.760Z level=DEBUG source=server.go:1280 msg="model load progress 0.09"
Sep 06 19:07:15 hades ollama[2468567]: time=2025-09-06T19:07:15.012Z level=DEBUG source=server.go:1280 msg="model load progress 0.12"
Sep 06 19:07:15 hades ollama[2468567]: time=2025-09-06T19:07:15.264Z level=DEBUG source=server.go:1280 msg="model load progress 0.15"
Sep 06 19:07:15 hades ollama[2468567]: time=2025-09-06T19:07:15.517Z level=DEBUG source=server.go:1280 msg="model load progress 0.18"
Sep 06 19:07:15 hades ollama[2468567]: time=2025-09-06T19:07:15.768Z level=DEBUG source=server.go:1280 msg="model load progress 0.22"
Sep 06 19:07:16 hades ollama[2468567]: time=2025-09-06T19:07:16.020Z level=DEBUG source=server.go:1280 msg="model load progress 0.27"
Sep 06 19:07:16 hades ollama[2468567]: time=2025-09-06T19:07:16.272Z level=DEBUG source=server.go:1280 msg="model load progress 0.31"
Sep 06 19:07:16 hades ollama[2468567]: time=2025-09-06T19:07:16.523Z level=DEBUG source=server.go:1280 msg="model load progress 0.36"
Sep 06 19:07:16 hades ollama[2468567]: time=2025-09-06T19:07:16.776Z level=DEBUG source=server.go:1280 msg="model load progress 0.41"
Sep 06 19:07:17 hades ollama[2468567]: time=2025-09-06T19:07:17.028Z level=DEBUG source=server.go:1280 msg="model load progress 0.45"
Sep 06 19:07:17 hades ollama[2468567]: time=2025-09-06T19:07:17.280Z level=DEBUG source=server.go:1280 msg="model load progress 0.49"
Sep 06 19:07:17 hades ollama[2468567]: time=2025-09-06T19:07:17.532Z level=DEBUG source=server.go:1280 msg="model load progress 0.52"
Sep 06 19:07:17 hades ollama[2468567]: time=2025-09-06T19:07:17.784Z level=DEBUG source=server.go:1280 msg="model load progress 0.55"
Sep 06 19:07:18 hades ollama[2468567]: time=2025-09-06T19:07:18.036Z level=DEBUG source=server.go:1280 msg="model load progress 0.58"
Sep 06 19:07:18 hades ollama[2468567]: time=2025-09-06T19:07:18.288Z level=DEBUG source=server.go:1280 msg="model load progress 0.61"
Sep 06 19:07:18 hades ollama[2468567]: time=2025-09-06T19:07:18.541Z level=DEBUG source=server.go:1280 msg="model load progress 0.65"
Sep 06 19:07:18 hades ollama[2468567]: time=2025-09-06T19:07:18.793Z level=DEBUG source=server.go:1280 msg="model load progress 0.68"
Sep 06 19:07:19 hades ollama[2468567]: time=2025-09-06T19:07:19.045Z level=DEBUG source=server.go:1280 msg="model load progress 0.71"
Sep 06 19:07:19 hades ollama[2468567]: time=2025-09-06T19:07:19.297Z level=DEBUG source=server.go:1280 msg="model load progress 0.74"
Sep 06 19:07:19 hades ollama[2468567]: time=2025-09-06T19:07:19.549Z level=DEBUG source=server.go:1280 msg="model load progress 0.77"
Sep 06 19:07:19 hades ollama[2468567]: time=2025-09-06T19:07:19.802Z level=DEBUG source=server.go:1280 msg="model load progress 0.80"
Sep 06 19:07:20 hades ollama[2468567]: time=2025-09-06T19:07:20.054Z level=DEBUG source=server.go:1280 msg="model load progress 0.84"
Sep 06 19:07:20 hades ollama[2468567]: time=2025-09-06T19:07:20.306Z level=DEBUG source=server.go:1280 msg="model load progress 0.87"
Sep 06 19:07:20 hades ollama[2468567]: time=2025-09-06T19:07:20.558Z level=DEBUG source=server.go:1280 msg="model load progress 0.90"
Sep 06 19:07:20 hades ollama[2468567]: time=2025-09-06T19:07:20.811Z level=DEBUG source=server.go:1280 msg="model load progress 0.93"
Sep 06 19:07:21 hades ollama[2468567]: time=2025-09-06T19:07:21.063Z level=DEBUG source=server.go:1280 msg="model load progress 0.96"
Sep 06 19:07:21 hades ollama[2468567]: time=2025-09-06T19:07:21.315Z level=DEBUG source=server.go:1280 msg="model load progress 0.98"
Sep 06 19:07:21 hades ollama[2468567]: time=2025-09-06T19:07:21.567Z level=DEBUG source=server.go:1280 msg="model load progress 1.00"
Sep 06 19:07:21 hades ollama[2468567]: time=2025-09-06T19:07:21.819Z level=INFO source=server.go:1274 msg="llama runner started in 11.04 seconds"
Sep 06 19:07:21 hades ollama[2468567]: time=2025-09-06T19:07:21.819Z level=DEBUG source=sched.go:485 msg="finished setting up" runner.name=registry.ollama.ai/library/gpt-oss-long:latest runner.inference=cuda runner.devices=2 runner.size="17.1 GiB" runner.vram="17.1 GiB" runner.parallel=1 runner.pid=2468609 runner.model=/data/ollama/models/blobs/sha256-b112e727c6f18875636c56a779790a590d705aec9e1c0eb5a97d51fc2a778583 runner.num_ctx=131072
<!-- gh-comment-id:3263057187 --> @shiraz-shah commented on GitHub (Sep 6, 2025): Server log with debug mode enabled, captured while inference is running on the CPU even though `ollama ps` reports 100% GPU.
progress 0.00" Sep 06 19:07:14 hades ollama[2468567]: time=2025-09-06T19:07:14.256Z level=DEBUG source=server.go:1280 msg="model load progress 0.03" Sep 06 19:07:14 hades ollama[2468567]: time=2025-09-06T19:07:14.508Z level=DEBUG source=server.go:1280 msg="model load progress 0.06" Sep 06 19:07:14 hades ollama[2468567]: time=2025-09-06T19:07:14.760Z level=DEBUG source=server.go:1280 msg="model load progress 0.09" Sep 06 19:07:15 hades ollama[2468567]: time=2025-09-06T19:07:15.012Z level=DEBUG source=server.go:1280 msg="model load progress 0.12" Sep 06 19:07:15 hades ollama[2468567]: time=2025-09-06T19:07:15.264Z level=DEBUG source=server.go:1280 msg="model load progress 0.15" Sep 06 19:07:15 hades ollama[2468567]: time=2025-09-06T19:07:15.517Z level=DEBUG source=server.go:1280 msg="model load progress 0.18" Sep 06 19:07:15 hades ollama[2468567]: time=2025-09-06T19:07:15.768Z level=DEBUG source=server.go:1280 msg="model load progress 0.22" Sep 06 19:07:16 hades ollama[2468567]: time=2025-09-06T19:07:16.020Z level=DEBUG source=server.go:1280 msg="model load progress 0.27" Sep 06 19:07:16 hades ollama[2468567]: time=2025-09-06T19:07:16.272Z level=DEBUG source=server.go:1280 msg="model load progress 0.31" Sep 06 19:07:16 hades ollama[2468567]: time=2025-09-06T19:07:16.523Z level=DEBUG source=server.go:1280 msg="model load progress 0.36" Sep 06 19:07:16 hades ollama[2468567]: time=2025-09-06T19:07:16.776Z level=DEBUG source=server.go:1280 msg="model load progress 0.41" Sep 06 19:07:17 hades ollama[2468567]: time=2025-09-06T19:07:17.028Z level=DEBUG source=server.go:1280 msg="model load progress 0.45" Sep 06 19:07:17 hades ollama[2468567]: time=2025-09-06T19:07:17.280Z level=DEBUG source=server.go:1280 msg="model load progress 0.49" Sep 06 19:07:17 hades ollama[2468567]: time=2025-09-06T19:07:17.532Z level=DEBUG source=server.go:1280 msg="model load progress 0.52" Sep 06 19:07:17 hades ollama[2468567]: time=2025-09-06T19:07:17.784Z level=DEBUG source=server.go:1280 msg="model load progress 0.55" Sep 06 19:07:18 hades ollama[2468567]: time=2025-09-06T19:07:18.036Z level=DEBUG source=server.go:1280 msg="model load progress 0.58" Sep 06 19:07:18 hades ollama[2468567]: time=2025-09-06T19:07:18.288Z level=DEBUG source=server.go:1280 msg="model load progress 0.61" Sep 06 19:07:18 hades ollama[2468567]: time=2025-09-06T19:07:18.541Z level=DEBUG source=server.go:1280 msg="model load progress 0.65" Sep 06 19:07:18 hades ollama[2468567]: time=2025-09-06T19:07:18.793Z level=DEBUG source=server.go:1280 msg="model load progress 0.68" Sep 06 19:07:19 hades ollama[2468567]: time=2025-09-06T19:07:19.045Z level=DEBUG source=server.go:1280 msg="model load progress 0.71" Sep 06 19:07:19 hades ollama[2468567]: time=2025-09-06T19:07:19.297Z level=DEBUG source=server.go:1280 msg="model load progress 0.74" Sep 06 19:07:19 hades ollama[2468567]: time=2025-09-06T19:07:19.549Z level=DEBUG source=server.go:1280 msg="model load progress 0.77" Sep 06 19:07:19 hades ollama[2468567]: time=2025-09-06T19:07:19.802Z level=DEBUG source=server.go:1280 msg="model load progress 0.80" Sep 06 19:07:20 hades ollama[2468567]: time=2025-09-06T19:07:20.054Z level=DEBUG source=server.go:1280 msg="model load progress 0.84" Sep 06 19:07:20 hades ollama[2468567]: time=2025-09-06T19:07:20.306Z level=DEBUG source=server.go:1280 msg="model load progress 0.87" Sep 06 19:07:20 hades ollama[2468567]: time=2025-09-06T19:07:20.558Z level=DEBUG source=server.go:1280 msg="model load progress 0.90" Sep 06 19:07:20 hades ollama[2468567]: 
time=2025-09-06T19:07:20.811Z level=DEBUG source=server.go:1280 msg="model load progress 0.93" Sep 06 19:07:21 hades ollama[2468567]: time=2025-09-06T19:07:21.063Z level=DEBUG source=server.go:1280 msg="model load progress 0.96" Sep 06 19:07:21 hades ollama[2468567]: time=2025-09-06T19:07:21.315Z level=DEBUG source=server.go:1280 msg="model load progress 0.98" Sep 06 19:07:21 hades ollama[2468567]: time=2025-09-06T19:07:21.567Z level=DEBUG source=server.go:1280 msg="model load progress 1.00" Sep 06 19:07:21 hades ollama[2468567]: time=2025-09-06T19:07:21.819Z level=INFO source=server.go:1274 msg="llama runner started in 11.04 seconds" Sep 06 19:07:21 hades ollama[2468567]: time=2025-09-06T19:07:21.819Z level=DEBUG source=sched.go:485 msg="finished setting up" runner.name=registry.ollama.ai/library/gpt-oss-long:latest runner.inference=cuda runner.devices=2 runner.size="17.1 GiB" runner.vram="17.1 GiB" runner.parallel=1 runner.pid=2468609 runner.model=/data/ollama/models/blobs/sha256-b112e727c6f18875636c56a779790a590d705aec9e1c0eb5a97d51fc2a778583 runner.num_ctx=131072 ```
Author
Owner

@shiraz-shah commented on GitHub (Sep 6, 2025):

In the meantime, I can run the same model on the CLI with `ollama run gpt-oss-long` and say `Hey`, and that request gets executed on the GPU while the other one is still running on the CPU. The server log then has the following appended to the above:

Sep 06 19:14:46 hades ollama[2468567]: [GIN] 2025/09/06 - 19:14:46 | 200 |       94.43µs |       127.0.0.1 | HEAD     "/"
Sep 06 19:14:46 hades ollama[2468567]: time=2025-09-06T19:14:46.701Z level=DEBUG source=ggml.go:210 msg="key with type not found" key=general.alignment default=32
Sep 06 19:14:46 hades ollama[2468567]: [GIN] 2025/09/06 - 19:14:46 | 200 |  363.567081ms |       127.0.0.1 | POST     "/api/show"
Sep 06 19:14:47 hades ollama[2468567]: time=2025-09-06T19:14:47.303Z level=DEBUG source=sched.go:583 msg="evaluating already loaded" model=/data/ollama/models/blobs/sha256-b112e727c6f18875636c56a779790a590d705aec9e1c0eb5a97d51fc2a778583
Sep 06 19:14:47 hades ollama[2468567]: [GIN] 2025/09/06 - 19:14:47 | 200 |   596.52635ms |       127.0.0.1 | POST     "/api/generate"
Sep 06 19:14:47 hades ollama[2468567]: time=2025-09-06T19:14:47.307Z level=DEBUG source=sched.go:377 msg="context for request finished" runner.name=registry.ollama.ai/library/gpt-oss-long:latest runner.inference=cuda runner.devices=2 runner.size="17.1 GiB" runner.vram="17.1 GiB" runner.parallel=1 runner.pid=2468609 runner.model=/data/ollama/models/blobs/sha256-b112e727c6f18875636c56a779790a590d705aec9e1c0eb5a97d51fc2a778583 runner.num_ctx=131072
Sep 06 19:14:47 hades ollama[2468567]: time=2025-09-06T19:14:47.307Z level=DEBUG source=sched.go:304 msg="after processing request finished event" runner.name=registry.ollama.ai/library/gpt-oss-long:latest runner.inference=cuda runner.devices=2 runner.size="17.1 GiB" runner.vram="17.1 GiB" runner.parallel=1 runner.pid=2468609 runner.model=/data/ollama/models/blobs/sha256-b112e727c6f18875636c56a779790a590d705aec9e1c0eb5a97d51fc2a778583 runner.num_ctx=131072 refCount=1
Sep 06 19:14:49 hades ollama[2468567]: time=2025-09-06T19:14:49.939Z level=DEBUG source=sched.go:583 msg="evaluating already loaded" model=/data/ollama/models/blobs/sha256-b112e727c6f18875636c56a779790a590d705aec9e1c0eb5a97d51fc2a778583
Sep 06 19:14:49 hades ollama[2468567]: time=2025-09-06T19:14:49.941Z level=DEBUG source=server.go:1373 msg="completion request" images=0 prompt=305 format=""
Sep 06 19:14:50 hades ollama[2468567]: time=2025-09-06T19:14:50.108Z level=DEBUG source=cache.go:140 msg="loading cache slot" id=0 cache=0 prompt=68 used=0 remaining=68
Sep 06 19:14:52 hades ollama[2468567]: [GIN] 2025/09/06 - 19:14:52 | 200 |  2.852194733s |       127.0.0.1 | POST     "/api/chat"
Sep 06 19:14:52 hades ollama[2468567]: time=2025-09-06T19:14:52.196Z level=DEBUG source=sched.go:377 msg="context for request finished" runner.name=registry.ollama.ai/library/gpt-oss-long:latest runner.inference=cuda runner.devices=2 runner.size="17.1 GiB" runner.vram="17.1 GiB" runner.parallel=1 runner.pid=2468609 runner.model=/data/ollama/models/blobs/sha256-b112e727c6f18875636c56a779790a590d705aec9e1c0eb5a97d51fc2a778583 runner.num_ctx=131072
Sep 06 19:14:52 hades ollama[2468567]: time=2025-09-06T19:14:52.197Z level=DEBUG source=sched.go:304 msg="after processing request finished event" runner.name=registry.ollama.ai/library/gpt-oss-long:latest runner.inference=cuda runner.devices=2 runner.size="17.1 GiB" runner.vram="17.1 GiB" runner.parallel=1 runner.pid=2468609 runner.model=/data/ollama/models/blobs/sha256-b112e727c6f18875636c56a779790a590d705aec9e1c0eb5a97d51fc2a778583 runner.num_ctx=131072 refCount=1
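
For timing the two paths side by side, a minimal client sketch (assuming the default localhost:11434 endpoint and the non-streaming generate API; the model name is the one from this thread):

```go
package main

// Times a single non-streaming generate request against the local Ollama
// server. Run one of these while the editor's request is in flight to
// compare latencies.

import (
	"bytes"
	"fmt"
	"io"
	"net/http"
	"time"
)

func main() {
	payload := []byte(`{"model":"gpt-oss-long","prompt":"Hey","stream":false}`)
	start := time.Now()
	resp, err := http.Post("http://127.0.0.1:11434/api/generate",
		"application/json", bytes.NewReader(payload))
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	io.Copy(io.Discard, resp.Body)
	// Seconds here vs. minutes for the editor's identical request is the
	// discrepancy described in this issue.
	fmt.Println("elapsed:", time.Since(start))
}
```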

@shiraz-shah commented on GitHub (Sep 6, 2025):

Next, my code editor times out waiting for the CPU-routed request to finish. The log gets updated with this:

Sep 06 19:17:08 hades ollama[2468567]: time=2025-09-06T19:17:08.641Z level=DEBUG source=sched.go:493 msg="context for request finished"
Sep 06 19:17:08 hades ollama[2468567]: time=2025-09-06T19:17:08.641Z level=DEBUG source=sched.go:286 msg="runner with non-zero duration has gone idle, adding timer" runner.name=registry.ollama.ai/library/gpt-oss-long:latest runner.inference=cuda runner.devices=2 runner.size="17.1 GiB" runner.vram="17.1 GiB" runner.parallel=1 runner.pid=2468609 runner.model=/data/ollama/models/blobs/sha256-b112e727c6f18875636c56a779790a590d705aec9e1c0eb5a97d51fc2a778583 runner.num_ctx=131072 duration=30m0s
Sep 06 19:17:08 hades ollama[2468567]: time=2025-09-06T19:17:08.641Z level=DEBUG source=sched.go:304 msg="after processing request finished event" runner.name=registry.ollama.ai/library/gpt-oss-long:latest runner.inference=cuda runner.devices=2 runner.size="17.1 GiB" runner.vram="17.1 GiB" runner.parallel=1 runner.pid=2468609 runner.model=/data/ollama/models/blobs/sha256-b112e727c6f18875636c56a779790a590d705aec9e1c0eb5a97d51fc2a778583 runner.num_ctx=131072 refCount=0

while the code editor looks like this:

[Image: screenshot of the code editor]

@whp-Henry commented on GitHub (Sep 7, 2025):

I encountered similar issues. If you don't need any of the newer features, try downgrading to Ollama v0.11.8 to avoid the "Improved performance via overlapping GPU and CPU computations" change introduced in v0.11.9.


@shiraz-shah commented on GitHub (Sep 7, 2025):

OK, I just downgraded to 0.11.8 as shown below.

I still have the same problem. Some requests end up on CPU, while others get handled by the GPU.

How I downgraded:

# remove debug flag
systemctl edit ollama.service
# stop ollama
systemctl stop ollama
# remove existing ollama
rm -rf /usr/lib/ollama
# get ollama 0.11.8
curl -LO https://github.com/ollama/ollama/releases/download/v0.11.8/ollama-linux-amd64.tgz
# install ollama 0.11.8
tar -C /usr -xzf ollama-linux-amd64.tgz
# start ollama 0.11.8
systemctl daemon-reload
systemctl start ollama
# verify version
ollama -v

@rick-github commented on GitHub (Sep 7, 2025):

"Improved performance via overlapping GPU and CPU computations" doesn't mean that inference is run on different processors. It means that while the GPU is running an inference, the CPU is preparing the state necessary for the next step of the inference. This has always been the case but previously has been serialized (CPU->GPU->CPU->GPU), the change in 0.11.9 was to allow these to overlap so that the GPU is not waiting as long for the CPU to finish preparation.

The logs posted so far show that the model is running 100% on the GPU.

Sep 06 19:07:14 hades ollama[2468567]: time=2025-09-06T19:07:14.002Z level=INFO source=ggml.go:492 msg="offloading output layer to GPU"
Sep 06 19:07:14 hades ollama[2468567]: time=2025-09-06T19:07:14.002Z level=INFO source=ggml.go:497 msg="offloaded 25/25 layers to GPU"

Since the issue is happening on 0.11.8, it's unrelated to the pipeline optimization.

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND                                                                   
2290141 ollama    20   0   15.2g 630112  30724 S 693.8   0.1 135:37.37 ollama                                                                    

It's not clear if this is the server or the runner. What's the output of

ps wo pid,user,priority,ni,vsize:10,rss:8,sz:8,s,pcpu,pmem,time,cmd p$(pidof ollama)

@shiraz-shah commented on GitHub (Sep 7, 2025):

Looks like this when inference is happening on GPU:

ps wo pid,user,priority,ni,vsize:10,rss:8,sz:8,s,pcpu,pmem,time,cmd p$(pidof ollama)
    PID USER     PRI  NI        VSZ      RSS       SZ S %CPU %MEM     TIME CMD
2470951 ollama    20   0   15992640   206384  3998160 S 71.7  0.0 04:40:06 /usr/local/bin/ollama serve
2472092 ollama    20   0  107362684  2625368 26840671 S  182  0.4 00:08:37 /usr/local/bin/ollama runner --ollama-engine --model /data/ollama/model

But top says:

top - 18:07:29 up 40 days,  8:27,  6 users,  load average: 13.74, 11.21, 8.26
Tasks: 816 total,   4 running, 812 sleeping,   0 stopped,   0 zombie
%Cpu(s):  6.7 us,  0.1 sy,  0.0 ni, 93.2 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st 
MiB Mem : 515944.1 total, 104596.4 free,  58664.6 used, 396171.3 buff/cache     
MiB Swap:      0.0 total,      0.0 free,      0.0 used. 457279.5 avail Mem 

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND                                                                   
2472092 ollama    20   0  102.4g   2.9g 296960 S 236.7   0.6  23:31.57 /usr/local/bin/ollama runner --ollama-engine --model /data/ollama/models/+
2470951 ollama    20   0   15.3g 233288  30688 S   1.3   0.0      6,02 /usr/local/bin/ollama serve                                               

and nvidia-smi says:

Sun Sep  7 18:20:41 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.133.07             Driver Version: 570.133.07     CUDA Version: 12.8     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 3060        Off |   00000000:86:00.0 Off |                  N/A |
|  0%   40C    P0             54W /  170W |    8011MiB /  12288MiB |     24%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA GeForce RTX 3060        Off |   00000000:8C:00.0 Off |                  N/A |
|  0%   42C    P0             62W /  170W |    8471MiB /  12288MiB |     32%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A         2472092      C   /usr/local/bin/ollama                  8002MiB |
|    1   N/A  N/A         2472092      C   /usr/local/bin/ollama                  8462MiB |
+-----------------------------------------------------------------------------------------+

And when inference is happening on CPU, ps says:

    PID USER     PRI  NI        VSZ      RSS       SZ S %CPU %MEM     TIME CMD
2470951 ollama    20   0   15992640   238656  3998160 S 73.6  0.0 04:53:03 /usr/local/bin/ollama serve
2472092 ollama    20   0  107368732  3894984 26842183 S  158  0.7 00:19:58 /usr/local/bin/ollama runner --ollama-engine --model /data/ollama/model

But top looks like so:

top - 18:05:19 up 40 days,  8:25,  6 users,  load average: 7.87, 9.60, 7.38
Tasks: 816 total,   4 running, 812 sleeping,   0 stopped,   0 zombie
%Cpu(s): 13.0 us,  0.2 sy,  0.0 ni, 86.8 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st 
MiB Mem : 515944.1 total, 104642.9 free,  58618.0 used, 396171.3 buff/cache     
MiB Swap:      0.0 total,      0.0 free,      0.0 used. 457326.1 avail Mem 

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND                                                                   
2470951 ollama    20   0   15.3g 237272  30688 S 749.5   0.0 348:02.36 /usr/local/bin/ollama serve                                               
2472092 ollama    20   0  102.4g   2.8g 296960 S   0.3   0.6  22:48.27 /usr/local/bin/ollama runner --ollama-engine --model /data/ollama/models/+

and nvidia-smi says:

Sun Sep  7 18:09:55 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.133.07             Driver Version: 570.133.07     CUDA Version: 12.8     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 3060        Off |   00000000:86:00.0 Off |                  N/A |
|  0%   40C    P8             18W /  170W |    8011MiB /  12288MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA GeForce RTX 3060        Off |   00000000:8C:00.0 Off |                  N/A |
|  0%   40C    P8             16W /  170W |    8471MiB /  12288MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A         2472092      C   /usr/local/bin/ollama                  8002MiB |
|    1   N/A  N/A         2472092      C   /usr/local/bin/ollama                  8462MiB |
+-----------------------------------------------------------------------------------------+

I don't know why there's this difference between what top shows and what ps shows.

But it looks like when the GPU is inferring, the runner uses between 150% and 300% CPU. And when the "CPU" is inferring, the runner uses almost none, as does the GPU, while the server averages around 750% in top. But the ps command from above only shows 75%, and I don't know why.
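
One possible explanation, though this is just my assumption: ps averages CPU time over the process's entire lifetime (total CPU time divided by elapsed time), while top samples over its refresh interval, so a long-lived server that only recently got busy reads much lower in ps. A toy sketch of the top-style measurement (Go, Linux-only, reading /proc):

```go
package main

// Illustrates why ps and top can disagree: ps %CPU divides total CPU time
// by the process's whole elapsed lifetime, while top only counts CPU time
// used during its sampling interval.

import (
	"fmt"
	"os"
	"strings"
	"time"
)

// cpuTicks returns utime+stime (in clock ticks) from /proc/<pid>/stat.
func cpuTicks(pid string) uint64 {
	b, err := os.ReadFile("/proc/" + pid + "/stat")
	if err != nil {
		panic(err)
	}
	s := string(b)
	// Fields after the ")" start at field 3; utime and stime are fields 14-15.
	f := strings.Fields(s[strings.LastIndexByte(s, ')')+2:])
	var utime, stime uint64
	fmt.Sscan(f[11], &utime)
	fmt.Sscan(f[12], &stime)
	return utime + stime
}

func main() {
	pid := os.Args[1]
	const hz = 100 // assumes USER_HZ=100, the usual Linux default

	// top-style: CPU ticks consumed over a one-second window.
	before := cpuTicks(pid)
	time.Sleep(time.Second)
	after := cpuTicks(pid)
	fmt.Printf("top-style %%CPU over 1s: %d%%\n", (after-before)*100/hz)
	// ps-style would instead be total ticks / seconds since process start,
	// which for a 40-day-old server dilutes a recent burst enormously.
}
```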

So I guess the GPU is inferring in both cases, but in some cases it's the server that's the bottleneck. This is also consistent with "CPU inference" not being memory-heavy.

I wonder if this intermittent server bottleneck is GPT-OSS-specific, and whether it has anything to do with flash attention. I don't remember seeing issues like this before flash attention was introduced for GPT-OSS.


@rick-github commented on GitHub (Sep 7, 2025):

There is no inference running on the CPU. The CPU is busy in the server, not the runner. Exactly why the CPU is busy in the server is unclear. What's the output of the following when the server is busy:

timeout 5 sudo strace -f -s 1500 -p $(ps ax | grep ollama.serve | awk '{print $1}')
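# -f follows all threads, -s 1500 prints strings up to 1500 chars, -p attaches to the given PID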

@shiraz-shah commented on GitHub (Sep 8, 2025):

root> timeout 5 strace -f -s 1500 -p $(ps ax | grep ollama.serve | awk '{print $1}')
strace: Cannot find executable '2483635'

@rick-github commented on GitHub (Sep 8, 2025):

Seems like you are running multiple ollama servers. What's the output of

ps uax | grep ollama.serve

@shiraz-shah commented on GitHub (Sep 8, 2025):

No, I don't think so. Not on purpose anyway.

root> ps uax | grep ollama.serve
ollama   2470951 40.6  0.0 15992640 496112 ?     Ssl  Sep07 572:08 /usr/local/bin/ollama serve
root     2485089  0.0  0.0   4092  1920 pts/4    S+   10:45   0:00 grep --color=auto ollama.serve

@shiraz-shah commented on GitHub (Sep 8, 2025):

Under load it looks like this:

root> ps uax | grep ollama.serve
ollama   2470951 40.7  0.1 15992640 623348 ?     Ssl  Sep07 574:02 /usr/local/bin/ollama serve
root     2485268  0.0  0.0   4092  1920 pts/4    S+   10:47   0:00 grep --color=auto ollama.serve

@rick-github commented on GitHub (Sep 8, 2025):

My mistake. Try:

timeout 5 sudo strace -f -s 1500 -p $(ps ax | grep 'ollama[.]serve' | awk '{print $1}')

@shiraz-shah commented on GitHub (Sep 8, 2025):

I modified your command to make it work. It was tracing the grep process itself, because grep's own command line contained the matching string, hence the lack of output before.

root> timeout 1 strace -f -s 1500 -p $(ps ax | grep 'ollama serve' | head -1 | awk '{print $1}')
strace: Process 2470951 attached with 81 threads
[pid 2471471] futex(0xc006988148, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2471393] futex(0xc00611a948, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2471387] futex(0xc00639a948, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2471379] futex(0xc0060b8948, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2471378] futex(0xc006300948, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>

Remaining output attached:

straceOllama.txt


@shiraz-shah commented on GitHub (Sep 8, 2025):

Here's how it looks when there's no load:

root> timeout 5 strace -f -s 1500 -p $(ps ax | grep 'ollama serve' | head -1 | awk '{print $1}')
strace: Process 2470951 attached with 81 threads
[pid 2471471] epoll_pwait(4,  <unfinished ...>
[pid 2471393] futex(0xc00611a948, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2471387] futex(0xc00639a948, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2471379] futex(0xc0060b8948, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2471378] futex(0xc006300948, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2471366] futex(0xc005f8c948, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2471357] futex(0xc006800948, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2471356] futex(0xc006200948, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2471325] futex(0xc005b0a148, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2471324] futex(0xc006190148, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2471323] futex(0xc005f8c148, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2471322] futex(0xc006216148, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2471316] futex(0xc006012148, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2471311] futex(0xc005f96148, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2471305] futex(0xc00639a148, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2471304] futex(0xc006322148, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2471300] futex(0xc00611a148, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2471299] futex(0xc00660a148, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2471298] futex(0xc00650a148, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2471297] futex(0xc006780148, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2471296] futex(0xc006700148, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2471295] futex(0xc006680148, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2471294] futex(0xc006800148, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2471289] futex(0xc00600c148, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2471288] futex(0xc0062a2148, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2471287] futex(0xc00625e148, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2471286] futex(0xc0060b8148, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2471285] futex(0xc006504148, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2471284] futex(0xc006490148, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2471283] futex(0xc006390948, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2471282] futex(0xc006390148, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2471281] futex(0xc006314148, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2471280] futex(0xc006218148, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2471279] futex(0xc006192148, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2471278] futex(0xc006312148, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2471277] futex(0xc0062b8148, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2471276] futex(0xc006900148, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2471275] futex(0xc00660e148, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2471274] futex(0xc006588148, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2471273] futex(0xc006508148, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2471272] futex(0xc006408148, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2471271] futex(0xc006310148, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2471270] futex(0xc006480148, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2471269] futex(0xc005c0a948, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2471268] futex(0xc005c0a148, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2471267] futex(0xc006380148, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2471266] futex(0xc006300148, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2471265] futex(0xc006280148, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2471264] futex(0xc006200148, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2471263] futex(0xc006180148, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2471262] futex(0xc006100148, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2471261] futex(0xc006080148, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2471260] futex(0xc006000148, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2471259] futex(0xc005f80148, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2471258] futex(0xc005f00148, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2471257] futex(0xc000f81148, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2471256] futex(0xc00467a148, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2471255] futex(0xc000f80948, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2471254] futex(0xc000581948, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2471253] futex(0xc000181948, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2471252] futex(0xc000681148, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2471251] futex(0xc000610948, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2471001] futex(0xc0003b3948, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2471000] futex(0xc000181148, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2470999] futex(0xc000580948, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2470996] futex(0xc000680948, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2470995] futex(0xc000f80148, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2470994] futex(0xc000153948, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2470993] futex(0xc0003b2148, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2470962] restart_syscall(<... resuming interrupted read ...> <unfinished ...>
[pid 2470961] futex(0xc000680148, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2470960] waitid(P_PIDFD, 32,  <unfinished ...>
[pid 2470959] futex(0xc000600148, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2470958] futex(0xc0003b3148, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2470957] futex(0xc000580148, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2470956] futex(0x55eddfbd2498, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2470955] futex(0xc000180148, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2470954] futex(0x55eddfbd2640, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2470953] futex(0xc000123148, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2470952] restart_syscall(<... resuming interrupted read ...> <unfinished ...>
[pid 2470951] futex(0x55eddfb26e60, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2471471] <... epoll_pwait resumed>[], 128, 0, NULL, 0) = 0
[pid 2471471] epoll_pwait(4, strace: Process 2471387 detached
strace: Process 2471379 detached
strace: Process 2471378 detached
strace: Process 2471366 detached
strace: Process 2471357 detached
strace: Process 2471471 detached
 <detached ...>
strace: Process 2471393 detached
strace: Process 2471356 detached
strace: Process 2471325 detached
strace: Process 2471324 detached
strace: Process 2471323 detached
strace: Process 2471322 detached
strace: Process 2471316 detached
strace: Process 2471311 detached
strace: Process 2471305 detached
strace: Process 2471304 detached
strace: Process 2471300 detached
strace: Process 2471299 detached
strace: Process 2471298 detached
strace: Process 2471297 detached
strace: Process 2471296 detached
strace: Process 2471295 detached
strace: Process 2471294 detached
strace: Process 2471289 detached
strace: Process 2471288 detached
strace: Process 2471287 detached
strace: Process 2471286 detached
strace: Process 2471285 detached
strace: Process 2471284 detached
strace: Process 2471283 detached
strace: Process 2471282 detached
strace: Process 2471281 detached
strace: Process 2471280 detached
strace: Process 2471279 detached
strace: Process 2471278 detached
strace: Process 2471277 detached
strace: Process 2471276 detached
strace: Process 2471275 detached
strace: Process 2471274 detached
strace: Process 2471273 detached
strace: Process 2471272 detached
strace: Process 2471271 detached
strace: Process 2471270 detached
strace: Process 2471269 detached
strace: Process 2471268 detached
strace: Process 2471267 detached
strace: Process 2471266 detached
strace: Process 2471265 detached
strace: Process 2471264 detached
strace: Process 2471263 detached
strace: Process 2471262 detached
strace: Process 2471261 detached
strace: Process 2471260 detached
strace: Process 2471259 detached
strace: Process 2471258 detached
strace: Process 2471257 detached
strace: Process 2471256 detached
strace: Process 2471255 detached
strace: Process 2471254 detached
strace: Process 2471253 detached
strace: Process 2471252 detached
strace: Process 2471251 detached
strace: Process 2471001 detached
strace: Process 2471000 detached
strace: Process 2470999 detached
strace: Process 2470996 detached
strace: Process 2470995 detached
strace: Process 2470994 detached
strace: Process 2470993 detached
strace: Process 2470962 detached
strace: Process 2470961 detached
strace: Process 2470960 detached
strace: Process 2470959 detached
strace: Process 2470958 detached
strace: Process 2470957 detached
strace: Process 2470956 detached
strace: Process 2470955 detached
strace: Process 2470954 detached
strace: Process 2470953 detached
strace: Process 2470952 detached
strace: Process 2470951 detached

@shiraz-shah commented on GitHub (Sep 8, 2025):

And attached is how it looks under "correct" load (i.e. well-functioning GPU inference without the CPU bottleneck):

[straceOllamaGPU.txt](https://github.com/user-attachments/files/22209075/straceOllamaGPU.txt)

<!-- gh-comment-id:3265841815 -->
Author
Owner

@shiraz-shah commented on GitHub (Sep 9, 2025):

The problem seems to be GPT-OSS-specific.

Maybe it's related to context quantisation. Not sure though.

Can I disable context quantisation for this model without having to downgrade to 0.11.7?

<!-- gh-comment-id:3271461770 -->
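For anyone wanting to try the same experiment: the knob Ollama exposes for this is the server-level `OLLAMA_KV_CACHE_TYPE` environment variable, not a per-model setting. A minimal sketch of the attempt (later comments in this thread suggest gpt-oss ignores the setting, so treat this as diagnostic rather than a confirmed fix):

```shell
# Restart the server with an explicit fp16 (unquantized) KV cache.
# The quantized cache types (q8_0, q4_0) additionally require flash attention.
OLLAMA_FLASH_ATTENTION=1 OLLAMA_KV_CACHE_TYPE=f16 ollama serve
```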
Author
Owner

@scotty2 commented on GitHub (Sep 9, 2025):

I'm having the same problem (I think).

gpt-oss-120b, M4 Max 128GB MacBook Pro.

Running codex, ollama sits there using 800-900% CPU, then swaps to GPU for a bit, then back to CPU.
ollama ps shows 100% GPU offloaded, but this is obviously not the case.

It looks to me like it's doing prompt processing on the CPU. Inference itself does seem to actually be happening on the GPU.

ollama 0.11.10, arm64, macOS 15.6.1

<!-- gh-comment-id:3272481188 -->
Author
Owner

@scotty2 commented on GitHub (Sep 9, 2025):

```
time=2025-09-09T16:12:32.411-07:00 level=DEBUG source=sched.go:583 msg="evaluating already loaded" model=/Users/scotty2/.ollama/models/blobs/sha256-90a618fe6ff21b09ca968df959104eb650658b0bef0faef785c18c2795d993e3
time=2025-09-09T16:12:44.009-07:00 level=DEBUG source=server.go:1387 msg="completion request" images=0 prompt=97693 format=""
```

It is between these two log entries that ollama is completely CPU-bound. This gap gets larger and larger with increasing context size, it seems.

After that, the load moves to the GPU, and it returns the completion.

<!-- gh-comment-id:3272585062 -->
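A rough way to quantify that gap from a captured server log (a sketch; it assumes the log was saved to `ollama.log`, e.g. via `ollama serve 2>ollama.log`, that GNU `date` is available, and that the two DEBUG lines look exactly like the ones quoted above):

```shell
# Extract the timestamps of the scheduling line and the completion-request
# line, then print the stall between them in whole seconds.
t1=$(grep 'evaluating already loaded' ollama.log | tail -1 | sed 's/^time=\([^ ]*\).*/\1/')
t2=$(grep 'msg="completion request"' ollama.log | tail -1 | sed 's/^time=\([^ ]*\).*/\1/')
echo "$(( $(date -d "$t2" +%s) - $(date -d "$t1" +%s) )) s between scheduling and inference"
```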
Author
Owner

@scotty2 commented on GitHub (Sep 9, 2025):

ollama run details:

```
OLLAMA_DEBUG=1 OLLAMA_FLASH_ATTENTION=1 ollama serve
```

```
time=2025-09-09T16:32:51.883-07:00 level=INFO source=routes.go:1331 msg="server config" env="map[HTTPS_PROXY: HTTP_PROXY: NO_PROXY: OLLAMA_CONTEXT_LENGTH:4096 OLLAMA_DEBUG:DEBUG OLLAMA_FLASH_ATTENTION:true OLLAMA_GPU_OVERHEAD:0 OLLAMA_HOST:http://127.0.0.1:11434 OLLAMA_KEEP_ALIVE:5m0s OLLAMA_KV_CACHE_TYPE: OLLAMA_LLM_LIBRARY: OLLAMA_LOAD_TIMEOUT:5m0s OLLAMA_MAX_LOADED_MODELS:0 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:/Users/scotty2/.ollama/models OLLAMA_MULTIUSER_CACHE:false OLLAMA_NEW_ENGINE:false OLLAMA_NEW_ESTIMATES:false OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:1 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://* vscode-webview://* vscode-file://*] OLLAMA_SCHED_SPREAD:false http_proxy: https_proxy: no_proxy:]"
```
<!-- gh-comment-id:3272608264 -->
Author
Owner

@eggshake commented on GitHub (Sep 19, 2025):

I'm having the same problem.
I'm using gpt-oss and had no issues with 0.11.6.

<!-- gh-comment-id:3310749669 -->
Author
Owner

@jessegross commented on GitHub (Sep 19, 2025):

Possibly related to the Harmony parser.

<!-- gh-comment-id:3313923961 -->
Author
Owner

@shiraz-shah commented on GitHub (Sep 20, 2025):

Either that, or maybe the introduction of context quantization.

I experience the problem mostly with agentic workloads. Haven't felt it with chat loads.

And yes, it does seem to scale with context size.

<!-- gh-comment-id:3314997309 -->
Author
Owner

@scotty2 commented on GitHub (Sep 22, 2025):

KV quantization should be disabled on my run (OLLAMA_KV_CACHE_TYPE=).
The Harmony parser doesn't seem likely to me, since whatever this is maxes out all of my CPU cores. It smells very much like compute that has been scheduled on a CPU kernel, rather than some kind of serial parsing.

<!-- gh-comment-id:3320787387 -->
Author
Owner

@shiraz-shah commented on GitHub (Sep 22, 2025):

Are you sure it's disabled though? I feel like GPT-OSS, the way ollama treats it, disregards your context quant settings and does its own thing. Have you checked how your GPU footprint scales with increasing context size?

<!-- gh-comment-id:3320802631 -->
Author
Owner

@scotty2 commented on GitHub (Sep 22, 2025):

I can't be 100% sure; only that if it's unset, it's supposed to use fp16.
I can definitely try to get a gauge of the quantization via memory use.
I do agree that it seems that the prompt processing is being scheduled on the CPU, which is at least at some level related to the KV caches, whether they're quantized or not.

<!-- gh-comment-id:3320904018 -->
Author
Owner

@shiraz-shah commented on GitHub (Sep 22, 2025):

It's because I don't remember this problem before context quantization was introduced for GPT OSS. But back then I was also running way smaller context windows. So yes, could definitely be prompt processing over context quantization.

<!-- gh-comment-id:3320983202 -->
Author
Owner

@scotty2 commented on GitHub (Sep 22, 2025):

There is definitely something funny going on in KV quantization land.

Memory use by model, `OLLAMA_KV_CACHE_TYPE`, and context length:

```
GPT-OSS 120b (MXFP4)
 OLLAMA_KV_CACHE_TYPE=
  8192   65 GB
  131072 70 GB
 OLLAMA_KV_CACHE_TYPE=q8_0
  8192   65 GB
  131072 70 GB

Qwen2.5 Coder 32b (FP16)
 OLLAMA_KV_CACHE_TYPE=
  131072 111 GB
  4096   66 GB
 OLLAMA_KV_CACHE_TYPE=q8_0
  131072 93 GB
  4096   66 GB

Gemma3 27b (FP16)
 OLLAMA_KV_CACHE_TYPE=
  131072 68 GB
  4096   58 GB
 OLLAMA_KV_CACHE_TYPE=q8_0
  131072 63 GB
  4096   57 GB
```

Honestly, I can't make any sense of those numbers.

<!-- gh-comment-id:3321013639 -->
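For anyone repeating this measurement: one way to load a model at a given context length is a one-token generate call with `num_ctx` set, then reading the loaded size back. A sketch (the model name and context value are just examples):

```shell
# Load gpt-oss:120b with a 131072-token context, generate a single token,
# then check how much memory the loaded model reports.
curl -s http://localhost:11434/api/generate -d '{
  "model": "gpt-oss:120b",
  "prompt": "hi",
  "stream": false,
  "options": { "num_ctx": 131072, "num_predict": 1 }
}' > /dev/null
ollama ps
```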
Author
Owner

@scotty2 commented on GitHub (Sep 22, 2025):

Prompt processing (pre-fill) creates the KV cache, so quantization would (I assume) be hooked into there. That's the compute-heavy workload that I think we're seeing offloaded to our CPUs.

<!-- gh-comment-id:3321030889 -->
Author
Owner

@shiraz-shah commented on GitHub (Sep 22, 2025):

Off topic, but the Ollama docs are not transparent about how context quantization is done and how manual settings are handled for different models. For some models it's just disabled no matter what.

And my theory is that for GPT OSS it defaults to Q4 no matter what you set.

Nice detective work though!!

<!-- gh-comment-id:3321079123 -->
Author
Owner

@scotty2 commented on GitHub (Sep 22, 2025):

The numbers for GPT-OSS definitely support your hypothesis that OLLAMA_KV_CACHE_TYPE is meaningless for GPT-OSS.

<!-- gh-comment-id:3321086408 -->
Author
Owner

@rick-github commented on GitHub (Sep 22, 2025):

https://github.com/ollama/ollama/pull/11929

<!-- gh-comment-id:3321185403 -->
Author
Owner

@scotty2 commented on GitHub (Sep 22, 2025):

![Image](https://github.com/user-attachments/assets/b375b44b-4420-4ab2-b7f2-b9b20c77b0e7)

If it's of value, the high CPU usage happens in the `ollama serve` process, rather than the `ollama runner` process.
<!-- gh-comment-id:3321495308 -->
Author
Owner

@shiraz-shah commented on GitHub (Sep 23, 2025):

Yes, that's exactly how it is on linux as well.

Looks like @rick-github has provided the verdict though. It has to do with GPT OSS's use of attention sinks for KV quant, which can't be CUDA'd as of today. No quick fix for this, I guess.

Before we close this, @scotty2, since you're running this on a Mac, have you tried whether you get the same problem in LM Studio using the MLX version of the model?

<!-- gh-comment-id:3322442753 -->
Author
Owner

@scotty2 commented on GitHub (Sep 23, 2025):

I do not have the same problem on LMS, via llama.cpp backend or MLX.

<!-- gh-comment-id:3324557759 -->
Author
Owner

@scotty2 commented on GitHub (Sep 23, 2025):

I tested with the same workload (codex) pointed at LMS, and it does not exhibit the behavior.
MLX doesn't fully support MXFP4 at the moment, so it's quite broken in other ways, but not this particular way.
The llama.cpp backend though should be an equivalent comparison.

<!-- gh-comment-id:3324704353 -->
Author
Owner

@jessegross commented on GitHub (Sep 23, 2025):

As Rick pointed out, KV cache quantization is disabled for gpt-oss, so the setting has no effect and is not the cause of the issue here. In addition, inference does not run on the Ollama server process.

Looking at these log lines, the time in between them is after a request is received and before it is passed to the runner for inference. The load moves to the GPU because inference is actually happening on the GPU, as reported. Therefore, it must be something else that happens on the server and CPU, such as parsing.

> ```
> time=2025-09-09T16:12:32.411-07:00 level=DEBUG source=sched.go:583 msg="evaluating already loaded" model=/Users/scotty2/.ollama/models/blobs/sha256-90a618fe6ff21b09ca968df959104eb650658b0bef0faef785c18c2795d993e3
> time=2025-09-09T16:12:44.009-07:00 level=DEBUG source=server.go:1387 msg="completion request" images=0 prompt=97693 format=""
> ```
>
> It is between these 2 log entries in a request that ollama is completely CPU bound. This gap gets larger and larger with increasing context size, it seems.
>
> After that, the load moves to the GPU, and it returns the completion.

If you post the log with OLLAMA_DEBUG=2 set, we might be able to reproduce the issue. WARNING: This will include user data and will be large.

<!-- gh-comment-id:3324942707 -->
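For anyone providing that dump, a minimal way to capture it when starting the server by hand (rather than via the system service):

```shell
# Capture a trace-level log to a file. As warned above, this includes full
# prompts (user data) and can grow very large.
OLLAMA_DEBUG=2 ollama serve 2>&1 | tee ollama-debug2.log
```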
Author
Owner

@scotty2 commented on GitHub (Sep 23, 2025):

It's clear that inference is happening on the GPU. The question was whether or not pre-fill/prompt processing was happening on the main process. If not, then something is parsing with a very, very high degree of parallelism. Which is a cool trick, for sure.

I will be happy to provide an OLLAMA_DEBUG=2 dump.

<!-- gh-comment-id:3325002137 -->
Author
Owner

@scotty2 commented on GitHub (Sep 23, 2025):

Without sanitizing the log yet, it's very obvious where the CPU-bound load is happening.

```
time=2025-09-23T11:14:13.613-07:00 level=TRACE source=bytepairencoding.go:208 msg=encoded string="... prompt ..." ids="[ ... ids ...]"
```

So, during tokenization, perhaps?
As it is dumping that very long list of tokens, it gets slower and slower.
Next log line (after the multi-line token ID dump) is:

```
time=2025-09-23T11:14:13.619-07:00 level=DEBUG source=cache.go:142 msg="loading cache slot" id=0 cache=32520 prompt=32597 used=32520 remaining=77
```

At which point, we're on the GPU (or rather, visually, whatever comes next is indistinguishable on the load graphs).

<!-- gh-comment-id:3325073456 -->
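One way to test the tokenization hypothesis without trace logging is to time one-token completions over prompts that grow linearly, and see whether the pre-inference stall grows super-linearly while the GPU sits idle (a sketch; the model name, prompt sizes, and GNU `time` are assumptions):

```shell
# Time a one-token completion for increasing prompt sizes. If server-side
# tokenization is the bottleneck, wall time should blow up super-linearly.
for words in 1000 10000 50000 100000; do
  printf '{"model":"gpt-oss:20b","stream":false,"options":{"num_ctx":131072,"num_predict":1},"prompt":"' > /tmp/payload.json
  yes "hello " | head -n "$words" | tr -d '\n' >> /tmp/payload.json
  printf '"}' >> /tmp/payload.json
  /usr/bin/time -f "%e s for $words words" \
    curl -s http://localhost:11434/api/generate -d @/tmp/payload.json > /dev/null
done
```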
Author
Owner

@nfsecurity commented on GitHub (Sep 24, 2025):

I think I am experiencing the same behavior:

![Image](https://github.com/user-attachments/assets/ae1dcc23-7e52-4c62-8c21-c3e23dc312ae)

Some inferences (not all; for example, 6/30) get stuck and don't respond in normal time. For example, the majority of my 100% GPU inferences take between 1 and 12 seconds, but when one of them gets stuck, it takes 11 minutes to complete. I am investigating this behavior and saw excessive CPU usage DURING that specific problematic inference alongside IDLE GPU utilization (see the htop image), and my first conclusion was: "that specific inference was processed entirely by the CPU, not the GPU", even though `ollama ps` shows that all of my model was loaded 100% on the GPU.

I did several tests and "I think" I found the cause: "this is happening only with large SYSTEM content". Let me explain:

All my prompts are made of SYSTEM, DEVELOPER and USER instructions. My SYSTEM and DEVELOPER content is around 1024 tokens, and then the USER content can be another 2048 tokens. If I reduce the SYSTEM and DEVELOPER length, the excessive CPU consumption in some inferences doesn't happen.

My conclusion at this point is that maybe those large PROMPTS are causing some kind of bottleneck, but not on the first try (because all my prompts are large and work well the majority of the time); it's something like a buffer overflow. If I run the same problematic inference in isolation (only that one, manually), it works well, but in a "batch" it does not.

My temporary solution was to reduce the length of the SYSTEM content pending more investigation.

Other very interesting things I have found:

This is not an Ollama-only issue. Inference through Unsloth suffers the same problem with large prompts (some of them get stuck with heavy CPU usage and take 10 minutes to respond instead of 10 seconds). I was able to reproduce this same behavior in Unsloth, and the same workaround helps (reduce the system and developer length).

I have an NVIDIA RTX 6000 ADA SFF 48GB GPU and I am running GPT-OSS 20B (pulled from Ollama); the problem also occurs when I run my fine-tuned GPT-OSS 20B in GGUF format (MXFP4).

Hope this helps!

<!-- gh-comment-id:3329741553 -->
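A hedged sketch of a reproduction along those lines: fire a batch of chat requests with a deliberately large system message and log each wall time, watching for the occasional multi-minute outlier. The model name and sizes are assumptions, and since Ollama's chat API uses system/user/assistant roles, the DEVELOPER content is folded into the system message here:

```shell
# Send 30 sequential chat requests with a roughly 1k-token system message;
# per the report above, a handful of them may stall for minutes.
system=$(yes "You are a careful assistant. " | head -n 200 | tr -d '\n')
for i in $(seq 1 30); do
  printf '{"model":"gpt-oss:20b","stream":false,"messages":[{"role":"system","content":"%s"},{"role":"user","content":"Request %s: summarize the rules above."}]}' \
    "$system" "$i" > /tmp/chat.json
  start=$(date +%s)
  curl -s http://localhost:11434/api/chat -d @/tmp/chat.json > /dev/null
  echo "request $i: $(( $(date +%s) - start )) s"
done
```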
Author
Owner

@jessegross commented on GitHub (Sep 24, 2025):

@scotty2 Possibly; there is a token counting step as part of the preprocessing on the server. (This is not prompt processing; that still happens on the runner.) However, tokenization also prints out a lot of log lines, so that could cause slowdowns with debug logging.

It would be most helpful if you could share the actual prompts that trigger this so that we can reproduce it, as it is likely dependent on the actual content as @nfsecurity pointed out. Sanitizing the logs unfortunately will remove this.

<!-- gh-comment-id:3330172534 -->
Author
Owner

@T1bolus commented on GitHub (Mar 29, 2026):

Can confirm, still a huge problem for big prompts (>100k tokens). It's not the preprocessing and it's not the normal inference itself; both run completely on the GPU as intended. I sadly couldn't pinpoint exactly what it is processing, but it's the ollama serve process, and it does not use all cores, just a few.

<!-- gh-comment-id:4151142774 -->
Author
Owner

@shiraz-shah commented on GitHub (Mar 30, 2026):

I've stopped using GPT-OSS for this reason.

<!-- gh-comment-id:4154155975 -->
Author
Owner

@T1bolus commented on GitHub (Mar 30, 2026):

> I've stopped using GPT-OSS for this reason.

It is not a GPT-OSS-specific model problem. It also happens with Qwen3.5, Nemotron 3 Super, and all other models I have tested.

<!-- gh-comment-id:4154369013 -->
Author
Owner

@shiraz-shah commented on GitHub (Mar 30, 2026):

OK, that's interesting. For me it does not happen with Qwen3.5 27B and Nemotron Cascade 2. Haven't tested Super extensively yet. Also doesn't happen with GLM 4.7 Flash and Qwen 3 Coder 30B. So far I've only really seen it with GPT-OSS.

<!-- gh-comment-id:4154424967 -->
Author
Owner

@T1bolus commented on GitHub (Mar 30, 2026):

Have you tried it with context length above 100k?

<!-- gh-comment-id:4156077812 -->
Author
Owner

@shiraz-shah commented on GitHub (Mar 31, 2026):

This thread is about CPU-bound prompt processing so that's what I thought you meant.

But to answer your question, yes. Prompt processing time does go up with increasing context length for all models; I've tried up to 262,144 tokens. But with the other models it's not CPU-bound. With GPT-OSS some of the prompt processing happens on the CPU, so it's much slower than the other models with large prompts.

It varies between models and hardware, but as a rule of thumb, I see prompt processing speed around 1000 tps, meaning a 100k prompt can take two minutes on the GPU. For GPT-OSS it can take 10 minutes.

<!-- gh-comment-id:4161202934 -->
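For concreteness, the arithmetic behind that rule of thumb (the 10-minute figure is the observation above; the implied rate is just division):

```shell
# A 100k-token prompt at ~1000 tps vs. the ~10 min observed for gpt-oss.
awk 'BEGIN {
  printf "expected, GPU-bound:     %.0f s\n", 100000 / 1000   # ~100 s
  printf "observed effective rate: %.0f tps\n", 100000 / 600  # ~167 tps
}'
```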
Author
Owner

@T1bolus commented on GitHub (Mar 31, 2026):

For me it's CPU-bound as well, but much more noticeable with large context sizes. So everything is on the GPU, but for whatever reason the CPU goes wild doing some preprocessing that can take up to a minute. And across all models, not just GPT OSS; nearly all models I've tried.
This does not happen on vLLM, for example.

<!-- gh-comment-id:4161555241 -->
Author
Owner

@shiraz-shah commented on GitHub (Apr 1, 2026):

I haven't tried vLLM yet.

In my experience, with ollama, a minute of prompt processing for a 100k token query is normal, especially if the GPU is fully engaged. What's strange with GPT-OSS is that the GPU is flat while this happens, and instead the CPU is working with 8 threads or so for several minutes before the GPU becomes active and starts generating tokens.

<!-- gh-comment-id:4168107101 -->
Author
Owner

@eliphatfs commented on GitHub (Apr 9, 2026):

I met the same issue on gemma 4 31b, with the openclaw default system prompt of around 80k tokens. I don't know why it is so slow either; it doesn't finish in 1 minute.

<!-- gh-comment-id:4211624871 -->
Reference: github-starred/ollama#33873