[GH-ISSUE #13751] Ollama switches to CPU usage in Codex-CLI 0.86.0 after 10K input tokens are reached with gpt-oss:120b #71074

Open
opened 2026-05-04 23:55:36 -05:00 by GiteaMirror · 4 comments

Originally created by @SingularityMan on GitHub (Jan 16, 2026).
Original GitHub issue: https://github.com/ollama/ollama/issues/13751

Originally assigned to: @ParthSareen on GitHub.

What is the issue?

[Issue in codex-cli repo](https://github.com/openai/codex/issues/9395)

I was redirected here immediately after posting, so I figured you might want to look into this one. Everything about the issue is covered in that post, but I do want to mention a few additional things:

I tried playing around with different configurations in both Ollama and Codex CLI, and did the following:

  • Created a separate Modelfile to explicitly set `num_ctx` for `gpt-oss:120b` to `131072` tokens (128K) and changed some sampling parameter settings.

  • Then I switched back to the original model after discovering that you can set `model_context_window = 131072` in Codex CLI's `config.toml`, so I set it there instead (both settings are sketched right after this list). I also tried lower `num_ctx` values but still ran into the same issue.

  • I then set Codex CLI's `auto-compact` feature to a very low ceiling of 12K tokens, after noticing that CPU usage would kick in once the prompt passed 10K tokens and the model would switch back to the GPU when it was below 10K.
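For reference, the two settings above look roughly like this. The Modelfile is only a sketch (the base model and `num_ctx` value come from the steps above; the `gpt-oss:120b-128k` tag matches the runner name in the logs below), and the `config.toml` line is the Codex CLI override mentioned in the second bullet:

```
# Modelfile sketch: a 128K-context variant of gpt-oss:120b
# created with something like: ollama create gpt-oss:120b-128k -f Modelfile
FROM gpt-oss:120b
PARAMETER num_ctx 131072
```

```toml
# Codex CLI config.toml: advertise the same context window to the client
model_context_window = 131072
```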

I hadn't seen this issue in any other application I've developed that runs Ollama on my GPU, which is why I went to Codex CLI first, thinking the issue was on their end; that's when they sent me to you. Is there some sort of issue with this model? Is there an updated version that specifically fixes it? I'm a little worried about upgrading in case Ollama breaks something, which is why I've stayed on 0.13.5.
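One way to see whether anything has actually been pushed off the GPU (rather than the CPU just doing prompt processing) is to query Ollama's `/api/ps` endpoint while the model is loaded and compare the runner's total size against its VRAM-resident size. A minimal sketch against the default local endpoint (the script name and output format are just illustrative):

```python
# check_offload.py - report how much of each loaded model is resident in VRAM
import requests

resp = requests.get("http://localhost:11434/api/ps", timeout=10)
resp.raise_for_status()

for m in resp.json().get("models", []):
    size = m.get("size", 0)       # total bytes used by the runner
    vram = m.get("size_vram", 0)  # bytes resident on the GPU
    pct = (vram / size * 100) if size else 0.0
    print(f"{m['name']}: {vram / 2**30:.1f} / {size / 2**30:.1f} GiB in VRAM ({pct:.0f}% GPU)")
```

In the log below, `runner.size` and `runner.vram` are both reported as 64.1 GiB, so the weights appear to stay fully GPU-resident; a check like this would catch the case where a growing KV cache forces part of the allocation back into system RAM.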

Relevant log output

time=2026-01-16T16:01:52.424-05:00 level=DEBUG source=sched.go:626 msg="evaluating already loaded" model=H:\ai\ollama\models\blobs\sha256-6be6d66a3f546d8c19b130dc41dc24b2fc159f84ffbc76a0ee0676205083cf5a
time=2026-01-16T16:02:08.806-05:00 level=DEBUG source=server.go:1509 msg="completion request" images=0 prompt=72931 format=""
time=2026-01-16T16:02:09.139-05:00 level=DEBUG source=cache.go:142 msg="loading cache slot" id=0 cache=17268 prompt=17314 used=17268 remaining=46
[GIN] 2026/01/16 - 16:02:10 | 200 |   18.0416882s |       127.0.0.1 | POST     "/v1/responses"
time=2026-01-16T16:02:10.333-05:00 level=DEBUG source=sched.go:385 msg="context for request finished" runner.name=registry.ollama.ai/library/gpt-oss:120b-128k runner.inference="[{ID:GPU-94db278a-855e-2012-495e-be319102a97a Library:CUDA}]" runner.size="64.1 GiB" runner.vram="64.1 GiB" runner.parallel=1 runner.pid=510340 runner.model=H:\ai\ollama\models\blobs\sha256-6be6d66a3f546d8c19b130dc41dc24b2fc159f84ffbc76a0ee0676205083cf5a runner.num_ctx=131072
time=2026-01-16T16:02:10.334-05:00 level=DEBUG source=sched.go:290 msg="runner with non-zero duration has gone idle, adding timer" runner.name=registry.ollama.ai/library/gpt-oss:120b-128k runner.inference="[{ID:GPU-94db278a-855e-2012-495e-be319102a97a Library:CUDA}]" runner.size="64.1 GiB" runner.vram="64.1 GiB" runner.parallel=1 runner.pid=510340 runner.model=H:\ai\ollama\models\blobs\sha256-6be6d66a3f546d8c19b130dc41dc24b2fc159f84ffbc76a0ee0676205083cf5a runner.num_ctx=131072 duration=2562047h47m16.854775807s
time=2026-01-16T16:02:10.334-05:00 level=DEBUG source=sched.go:308 msg="after processing request finished event" runner.name=registry.ollama.ai/library/gpt-oss:120b-128k runner.inference="[{ID:GPU-94db278a-855e-2012-495e-be319102a97a Library:CUDA}]" runner.size="64.1 GiB" runner.vram="64.1 GiB" runner.parallel=1 runner.pid=510340 runner.model=H:\ai\ollama\models\blobs\sha256-6be6d66a3f546d8c19b130dc41dc24b2fc159f84ffbc76a0ee0676205083cf5a runner.num_ctx=131072 refCount=0
time=2026-01-16T16:02:11.859-05:00 level=DEBUG source=sched.go:626 msg="evaluating already loaded" model=H:\ai\ollama\models\blobs\sha256-6be6d66a3f546d8c19b130dc41dc24b2fc159f84ffbc76a0ee0676205083cf5a
time=2026-01-16T16:02:29.729-05:00 level=DEBUG source=server.go:1509 msg="completion request" images=0 prompt=73493 format=""
time=2026-01-16T16:02:30.088-05:00 level=DEBUG source=cache.go:142 msg="loading cache slot" id=0 cache=17392 prompt=17440 used=17340 remaining=100
[GIN] 2026/01/16 - 16:02:31 | 200 |    19.817603s |       127.0.0.1 | POST     "/v1/responses"
time=2026-01-16T16:02:31.551-05:00 level=DEBUG source=sched.go:385 msg="context for request finished" runner.name=registry.ollama.ai/library/gpt-oss:120b-128k runner.inference="[{ID:GPU-94db278a-855e-2012-495e-be319102a97a Library:CUDA}]" runner.size="64.1 GiB" runner.vram="64.1 GiB" runner.parallel=1 runner.pid=510340 runner.model=H:\ai\ollama\models\blobs\sha256-6be6d66a3f546d8c19b130dc41dc24b2fc159f84ffbc76a0ee0676205083cf5a runner.num_ctx=131072
time=2026-01-16T16:02:31.552-05:00 level=DEBUG source=sched.go:290 msg="runner with non-zero duration has gone idle, adding timer" runner.name=registry.ollama.ai/library/gpt-oss:120b-128k runner.inference="[{ID:GPU-94db278a-855e-2012-495e-be319102a97a Library:CUDA}]" runner.size="64.1 GiB" runner.vram="64.1 GiB" runner.parallel=1 runner.pid=510340 runner.model=H:\ai\ollama\models\blobs\sha256-6be6d66a3f546d8c19b130dc41dc24b2fc159f84ffbc76a0ee0676205083cf5a runner.num_ctx=131072 duration=2562047h47m16.854775807s
time=2026-01-16T16:02:31.552-05:00 level=DEBUG source=sched.go:308 msg="after processing request finished event" runner.name=registry.ollama.ai/library/gpt-oss:120b-128k runner.inference="[{ID:GPU-94db278a-855e-2012-495e-be319102a97a Library:CUDA}]" runner.size="64.1 GiB" runner.vram="64.1 GiB" runner.parallel=1 runner.pid=510340 runner.model=H:\ai\ollama\models\blobs\sha256-6be6d66a3f546d8c19b130dc41dc24b2fc159f84ffbc76a0ee0676205083cf5a runner.num_ctx=131072 refCount=0
time=2026-01-16T16:02:33.594-05:00 level=DEBUG source=sched.go:626 msg="evaluating already loaded" model=H:\ai\ollama\models\blobs\sha256-6be6d66a3f546d8c19b130dc41dc24b2fc159f84ffbc76a0ee0676205083cf5a
time=2026-01-16T16:02:53.708-05:00 level=DEBUG source=server.go:1509 msg="completion request" images=0 prompt=74176 format=""
time=2026-01-16T16:02:54.098-05:00 level=DEBUG source=cache.go:142 msg="loading cache slot" id=0 cache=17548 prompt=17594 used=17548 remaining=46
[GIN] 2026/01/16 - 16:02:55 | 200 |    22.281456s |       127.0.0.1 | POST     "/v1/responses"
time=2026-01-16T16:02:55.748-05:00 level=DEBUG source=sched.go:385 msg="context for request finished" runner.name=registry.ollama.ai/library/gpt-oss:120b-128k runner.inference="[{ID:GPU-94db278a-855e-2012-495e-be319102a97a Library:CUDA}]" runner.size="64.1 GiB" runner.vram="64.1 GiB" runner.parallel=1 runner.pid=510340 runner.model=H:\ai\ollama\models\blobs\sha256-6be6d66a3f546d8c19b130dc41dc24b2fc159f84ffbc76a0ee0676205083cf5a runner.num_ctx=131072
time=2026-01-16T16:02:55.750-05:00 level=DEBUG source=sched.go:290 msg="runner with non-zero duration has gone idle, adding timer" runner.name=registry.ollama.ai/library/gpt-oss:120b-128k runner.inference="[{ID:GPU-94db278a-855e-2012-495e-be319102a97a Library:CUDA}]" runner.size="64.1 GiB" runner.vram="64.1 GiB" runner.parallel=1 runner.pid=510340 runner.model=H:\ai\ollama\models\blobs\sha256-6be6d66a3f546d8c19b130dc41dc24b2fc159f84ffbc76a0ee0676205083cf5a runner.num_ctx=131072 duration=2562047h47m16.854775807s
time=2026-01-16T16:02:55.750-05:00 level=DEBUG source=sched.go:308 msg="after processing request finished event" runner.name=registry.ollama.ai/library/gpt-oss:120b-128k runner.inference="[{ID:GPU-94db278a-855e-2012-495e-be319102a97a Library:CUDA}]" runner.size="64.1 GiB" runner.vram="64.1 GiB" runner.parallel=1 runner.pid=510340 runner.model=H:\ai\ollama\models\blobs\sha256-6be6d66a3f546d8c19b130dc41dc24b2fc159f84ffbc76a0ee0676205083cf5a runner.num_ctx=131072 refCount=0
time=2026-01-16T16:02:57.676-05:00 level=DEBUG source=sched.go:626 msg="evaluating already loaded" model=H:\ai\ollama\models\blobs\sha256-6be6d66a3f546d8c19b130dc41dc24b2fc159f84ffbc76a0ee0676205083cf5a
time=2026-01-16T16:03:19.501-05:00 level=DEBUG source=server.go:1509 msg="completion request" images=0 prompt=74927 format=""
time=2026-01-16T16:03:19.936-05:00 level=DEBUG source=cache.go:142 msg="loading cache slot" id=0 cache=17725 prompt=17778 used=17641 remaining=137
[GIN] 2026/01/16 - 16:03:21 | 200 |   24.0370866s |       127.0.0.1 | POST     "/v1/responses"
time=2026-01-16T16:03:21.583-05:00 level=DEBUG source=sched.go:385 msg="context for request finished" runner.name=registry.ollama.ai/library/gpt-oss:120b-128k runner.inference="[{ID:GPU-94db278a-855e-2012-495e-be319102a97a Library:CUDA}]" runner.size="64.1 GiB" runner.vram="64.1 GiB" runner.parallel=1 runner.pid=510340 runner.model=H:\ai\ollama\models\blobs\sha256-6be6d66a3f546d8c19b130dc41dc24b2fc159f84ffbc76a0ee0676205083cf5a runner.num_ctx=131072
time=2026-01-16T16:03:21.585-05:00 level=DEBUG source=sched.go:290 msg="runner with non-zero duration has gone idle, adding timer" runner.name=registry.ollama.ai/library/gpt-oss:120b-128k runner.inference="[{ID:GPU-94db278a-855e-2012-495e-be319102a97a Library:CUDA}]" runner.size="64.1 GiB" runner.vram="64.1 GiB" runner.parallel=1 runner.pid=510340 runner.model=H:\ai\ollama\models\blobs\sha256-6be6d66a3f546d8c19b130dc41dc24b2fc159f84ffbc76a0ee0676205083cf5a runner.num_ctx=131072 duration=2562047h47m16.854775807s
time=2026-01-16T16:03:21.585-05:00 level=DEBUG source=sched.go:308 msg="after processing request finished event" runner.name=registry.ollama.ai/library/gpt-oss:120b-128k runner.inference="[{ID:GPU-94db278a-855e-2012-495e-be319102a97a Library:CUDA}]" runner.size="64.1 GiB" runner.vram="64.1 GiB" runner.parallel=1 runner.pid=510340 runner.model=H:\ai\ollama\models\blobs\sha256-6be6d66a3f546d8c19b130dc41dc24b2fc159f84ffbc76a0ee0676205083cf5a runner.num_ctx=131072 refCount=0
time=2026-01-16T16:03:22.998-05:00 level=DEBUG source=sched.go:626 msg="evaluating already loaded" model=H:\ai\ollama\models\blobs\sha256-6be6d66a3f546d8c19b130dc41dc24b2fc159f84ffbc76a0ee0676205083cf5a
time=2026-01-16T16:03:41.606-05:00 level=DEBUG source=sched.go:626 msg="evaluating already loaded" model=H:\ai\ollama\models\blobs\sha256-6be6d66a3f546d8c19b130dc41dc24b2fc159f84ffbc76a0ee0676205083cf5a
time=2026-01-16T16:03:41.606-05:00 level=DEBUG source=sched.go:626 msg="evaluating already loaded" model=H:\ai\ollama\models\blobs\sha256-6be6d66a3f546d8c19b130dc41dc24b2fc159f84ffbc76a0ee0676205083cf5a
time=2026-01-16T16:03:41.608-05:00 level=DEBUG source=sched.go:161 msg=reloading runner.name=registry.ollama.ai/library/gpt-oss:120b-128k runner.inference="[{ID:GPU-94db278a-855e-2012-495e-be319102a97a Library:CUDA}]" runner.size="64.1 GiB" runner.vram="64.1 GiB" runner.parallel=1 runner.pid=510340 runner.model=H:\ai\ollama\models\blobs\sha256-6be6d66a3f546d8c19b130dc41dc24b2fc159f84ffbc76a0ee0676205083cf5a runner.num_ctx=131072
time=2026-01-16T16:03:41.608-05:00 level=DEBUG source=sched.go:236 msg="resetting model to expire immediately to make room" runner.name=registry.ollama.ai/library/gpt-oss:120b-128k runner.inference="[{ID:GPU-94db278a-855e-2012-495e-be319102a97a Library:CUDA}]" runner.size="64.1 GiB" runner.vram="64.1 GiB" runner.parallel=1 runner.pid=510340 runner.model=H:\ai\ollama\models\blobs\sha256-6be6d66a3f546d8c19b130dc41dc24b2fc159f84ffbc76a0ee0676205083cf5a runner.num_ctx=131072 refCount=1
time=2026-01-16T16:03:41.608-05:00 level=DEBUG source=sched.go:247 msg="waiting for pending requests to complete and unload to occur" runner.name=registry.ollama.ai/library/gpt-oss:120b-128k runner.inference="[{ID:GPU-94db278a-855e-2012-495e-be319102a97a Library:CUDA}]" runner.size="64.1 GiB" runner.vram="64.1 GiB" runner.parallel=1 runner.pid=510340 runner.model=H:\ai\ollama\models\blobs\sha256-6be6d66a3f546d8c19b130dc41dc24b2fc159f84ffbc76a0ee0676205083cf5a runner.num_ctx=131072
time=2026-01-16T16:03:46.591-05:00 level=DEBUG source=server.go:1509 msg="completion request" images=0 prompt=75593 format=""
time=2026-01-16T16:03:47.038-05:00 level=DEBUG source=cache.go:142 msg="loading cache slot" id=0 cache=17887 prompt=17942 used=17807 remaining=135
[GIN] 2026/01/16 - 16:03:48 | 200 |   25.9359246s |       127.0.0.1 | POST     "/v1/responses"
time=2026-01-16T16:03:48.811-05:00 level=DEBUG source=sched.go:385 msg="context for request finished" runner.name=registry.ollama.ai/library/gpt-oss:120b-128k runner.inference="[{ID:GPU-94db278a-855e-2012-495e-be319102a97a Library:CUDA}]" runner.size="64.1 GiB" runner.vram="64.1 GiB" runner.parallel=1 runner.pid=510340 runner.model=H:\ai\ollama\models\blobs\sha256-6be6d66a3f546d8c19b130dc41dc24b2fc159f84ffbc76a0ee0676205083cf5a runner.num_ctx=131072
time=2026-01-16T16:03:48.812-05:00 level=DEBUG source=sched.go:283 msg="runner with zero duration has gone idle, expiring to unload" runner.name=registry.ollama.ai/library/gpt-oss:120b-128k runner.inference="[{ID:GPU-94db278a-855e-2012-495e-be319102a97a Library:CUDA}]" runner.size="64.1 GiB" runner.vram="64.1 GiB" runner.parallel=1 runner.pid=510340 runner.model=H:\ai\ollama\models\blobs\sha256-6be6d66a3f546d8c19b130dc41dc24b2fc159f84ffbc76a0ee0676205083cf5a runner.num_ctx=131072
time=2026-01-16T16:03:48.812-05:00 level=DEBUG source=sched.go:308 msg="after processing request finished event" runner.name=registry.ollama.ai/library/gpt-oss:120b-128k runner.inference="[{ID:GPU-94db278a-855e-2012-495e-be319102a97a Library:CUDA}]" runner.size="64.1 GiB" runner.vram="64.1 GiB" runner.parallel=1 runner.pid=510340 runner.model=H:\ai\ollama\models\blobs\sha256-6be6d66a3f546d8c19b130dc41dc24b2fc159f84ffbc76a0ee0676205083cf5a runner.num_ctx=131072 refCount=0

OS

Windows

GPU

Nvidia

CPU

AMD

Ollama version

0.13.5

GiteaMirror added the bug label 2026-05-04 23:55:36 -05:00

@ParthSareen commented on GitHub (Jan 20, 2026):

Haven't run into this before! Let me dig in a bit. Thanks for sharing!


@SingularityMan commented on GitHub (Jan 20, 2026):

> Haven't run into this before! Let me dig in a bit. Thanks for sharing!

Much appreciated!


@SingularityMan commented on GitHub (Jan 20, 2026):

> Haven't run into this before! Let me dig in a bit. Thanks for sharing!

UPDATE: Found something big, but nothing conclusive yet. I think the Codex team may have been mistaken in redirecting me to you, though I'm not sure yet; it doesn't seem to be Codex's fault either. I think the fault lies squarely with OpenAI's [Agents SDK](https://github.com/openai/openai-agents-python/issues), so I'm going to open a separate issue there shortly and update this message once I do.

I tried using OpenAI's Agents SDK separately from Codex CLI, locally via Ollama, for a custom agentic solution, and Ollama rapidly blew up CPU and RAM usage to dangerously high levels, skyrocketing past any reasonable amount of RAM, never mind that the model is loaded entirely on the GPU to begin with.

Something about the Agents SDK triggers a huge CPU/RAM blowup when called, but this is particularly challenging to diagnose. I'm going to look into the source code and do some digging on my end once I post that issue.
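One way to narrow this down is to take the Agents SDK out of the loop entirely and replay a comparably long prompt against Ollama's OpenAI-compatible endpoint with the plain OpenAI client: if CPU and RAM stay flat there, the blowup is in the SDK layer rather than in Ollama. A rough sketch (the filler prompt, message contents, and repeat count are illustrative assumptions):

```python
# replay_long_prompt.py - send a long prompt straight at Ollama's
# OpenAI-compatible endpoint, bypassing the Agents SDK entirely
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

# Illustrative filler to push the prompt well past the ~10K-token mark.
long_context = "lorem ipsum dolor sit amet " * 4000

resp = client.chat.completions.create(
    model="gpt-oss:120b",
    messages=[
        {"role": "system", "content": "You are a coding assistant."},
        {"role": "user", "content": long_context + "\n\nSummarize the above in one sentence."},
    ],
)
print(resp.choices[0].message.content)
print(resp.usage)
```

Watching GPU/CPU utilization while this runs, and then sending the same conversation through the Agents SDK, should show which layer introduces the extra load.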


@pd95 commented on GitHub (Jan 30, 2026):

To be honest, I also see this problem of a rather large slowdown of Codex inference on my Mac with the gpt-oss:20b model, after around 10% of the context window is used.
Currently I suspect (⚠️ warning: wild guess on my side) this is due to Ollama not having a "prompt cache", so every new item added to the current session/thread makes Ollama process all the tokens from the beginning again.

See the OpenAI blog post [Unrolling the Codex agent loop](https://openai.com/index/unrolling-the-codex-agent-loop/#:~:text=Generally%2C%20the%20cost,in%20more%20detail).

But I will also try to debug the situation you are describing, where a CPU fallback happens after a certain number of tokens has been processed.
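The prompt-cache hypothesis can be checked directly from Ollama's response metadata: the `/api/chat` endpoint reports `prompt_eval_count` per request, so with an effective cache only the newly appended turns should be evaluated on each round trip. A rough sketch of that measurement (model tag and message contents are placeholders):

```python
# measure_cache_reuse.py - grow a conversation turn by turn and watch how many
# prompt tokens Ollama actually evaluates each time (prompt_eval_count)
import requests

URL = "http://localhost:11434/api/chat"
messages = []

for turn in range(5):
    messages.append({"role": "user", "content": f"Step {turn}: add one more fact about KV caching."})
    r = requests.post(
        URL,
        json={"model": "gpt-oss:20b", "messages": messages, "stream": False},
        timeout=600,
    )
    r.raise_for_status()
    body = r.json()
    messages.append(body["message"])  # keep the assistant reply in the thread
    # With prompt caching, prompt_eval_count stays small after turn 0;
    # without it, it grows with the full conversation length on every turn.
    print(f"turn {turn}: prompt_eval_count={body.get('prompt_eval_count')}, eval_count={body.get('eval_count')}")
```

For what it's worth, the `loading cache slot ... used=17268 remaining=46` lines in the log above suggest the runner is reusing almost the entire cached prefix on each request, so full-prompt reprocessing is probably not the whole story here.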

Reference: github-starred/ollama#71074