[GH-ISSUE #13401] GPT-OSS-120b RAM blows up occasionally when performing tool calls despite being loaded entirely on GPU. #34609

Open
opened 2026-04-22 18:19:41 -05:00 by GiteaMirror · 2 comments
Owner

Originally created by @SingularityMan on GitHub (Dec 9, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/13401

Originally assigned to: @ParthSareen on GitHub.

What is the issue?

I have an issue that seems to arise when gpt-oss-120b runs locally, entirely on GPU, and performs tool calls in the appropriate format. I do have a couple of extra smaller models loaded on the GPU by a separate framework I run alongside Ollama, but the combined VRAM usage of all of them doesn't exceed 86GB on that GPU.

In spite of that, I've noticed a pattern: if my Python script calls ollama.chat() with tool calling enabled, occasionally my RAM randomly blows up to an obscene amount and my PC turns into a brick. I'm talking about 40-50GB of RAM being flooded by a single Ollama process when a tool call is performed by gpt-oss-120b. A couple of additional points:

  • I have 128GB RAM on my MOBO and none of the models or tool call functions are using any of that.

  • The output of the tool call is usually recycled back to gpt-oss-120b along with its entire thought chain up until that point so it can continue performing tool calls in between thoughts instead of in between messages. That part works, but then the occasional RAM blowup is triggered by the Ollama instance that runs that model. Note: The tool call output/thought process is cleared immediately before the next tool call is performed, and gc.collect() and torch.cuda.empty_cache() are run immediately after.

  • The entire think/tool_call_result chains are wiped from the chat history once the next tool call is completed, since they're considered single-use contextual data; they're immediately dropped and replaced with a short tool call message appended to the chat history list in the proper format Ollama expects: {"role": "tool", "content": "Tool call was made here.", "tool_name": _tool_call_name}. (A simplified sketch of this loop is shown below, just before the log output.)

  • When I check the server logs, nothing seems out of the ordinary: regular /chat requests with reasonable prompt sizes well under the configured num_ctx are routinely issued, and Ollama's server recognizes that gpt-oss-120b is running on CUDA on the right GPU.

(Three screenshots of the server logs were attached here.)
  • ollama ps shows the model loaded entirely on GPU with plenty of VRAM to spare, nvidia-smi also supports this.
(Screenshot of the ollama ps output was attached here.)
  • On Windows, I disabled system memory fallback via the NVIDIA Control Panel so Ollama wouldn't spill anything to RAM. That should've put an end to this kind of thing by isolating my AI GPU from the rest of the PC's components, yet it still happened anyway. I'm talking about the top-most process:
(Task Manager screenshot was attached here.)

The ollama process at the top blows up the RAM sky high when it happens.

  • Ollama itself (both the program and the Python API) is up-to-date, running on the new engine (enabled via the system's environment variables) with the KV cache set to q8 and flash attention enabled.

I can't say for sure whether it only happens when a tool call is performed. I also can't say for sure whether it only happens with this model, because this is the only model I've tested tool calling on and it has a pretty unique message format that Ollama supports. But what I CAN say is that the only time I've noticed this pattern re-emerging is when a tool call happens.
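
For reference, here's a stripped-down sketch of the loop described above. The get_current_time tool and the prompt are placeholders rather than my actual tools, and the real script also carries the thought chain forward and prunes it as described in the bullets; this is just the overall shape of the calls.

```python
import gc

import ollama
import torch

# Placeholder tool schema/implementation, just to make the loop concrete.
TOOLS = [{
    "type": "function",
    "function": {
        "name": "get_current_time",
        "description": "Return the current local time.",
        "parameters": {"type": "object", "properties": {}},
    },
}]

def get_current_time(**_kwargs):
    from datetime import datetime
    return datetime.now().isoformat()

messages = [{"role": "user", "content": "What time is it right now?"}]

for _ in range(5):  # cap the number of tool-call rounds
    response = ollama.chat(model="gpt-oss:120b", messages=messages, tools=TOOLS)
    tool_calls = response.message.tool_calls or []
    if not tool_calls:
        break
    for call in tool_calls:
        result = get_current_time(**(call.function.arguments or {}))
        # Recycle the tool output back to the model, keeping only a short
        # tool message in the history, in the format Ollama expects.
        messages.append({
            "role": "tool",
            "content": f"Tool call was made here. Result: {result}",
            "tool_name": call.function.name,
        })
    # Client-side cleanup as described above (this only frees memory held by
    # this Python process, not by the Ollama server).
    gc.collect()
    torch.cuda.empty_cache()

print(response.message.content)
```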

Relevant log output

[GIN] 2025/12/09 - 18:02:57 | 200 |    1.1105271s |       127.0.0.1 | POST     "/api/generate"
time=2025-12-09T18:02:57.876-05:00 level=DEBUG source=sched.go:385 msg="context for request finished" runner.name=registry.ollama.ai/library/qwen3-vl:2b-instruct-q4_K_M runner.inference="[{ID:GPU-94db278a-855e-2012-495e-be319102a97a Library:CUDA}]" runner.size="2.9 GiB" runner.vram="2.9 GiB" runner.parallel=1 runner.pid=26768 runner.model=H:\ai\ollama\models\blobs\sha256-aafed9e48b157ae913cee994e0d9ac927af51e256feafbd923bf2852e8856d00 runner.num_ctx=8192
time=2025-12-09T18:02:57.876-05:00 level=DEBUG source=sched.go:290 msg="runner with non-zero duration has gone idle, adding timer" runner.name=registry.ollama.ai/library/qwen3-vl:2b-instruct-q4_K_M runner.inference="[{ID:GPU-94db278a-855e-2012-495e-be319102a97a Library:CUDA}]" runner.size="2.9 GiB" runner.vram="2.9 GiB" runner.parallel=1 runner.pid=26768 runner.model=H:\ai\ollama\models\blobs\sha256-aafed9e48b157ae913cee994e0d9ac927af51e256feafbd923bf2852e8856d00 runner.num_ctx=8192 duration=2562047h47m16.854775807s
time=2025-12-09T18:02:57.876-05:00 level=DEBUG source=sched.go:308 msg="after processing request finished event" runner.name=registry.ollama.ai/library/qwen3-vl:2b-instruct-q4_K_M runner.inference="[{ID:GPU-94db278a-855e-2012-495e-be319102a97a Library:CUDA}]" runner.size="2.9 GiB" runner.vram="2.9 GiB" runner.parallel=1 runner.pid=26768 runner.model=H:\ai\ollama\models\blobs\sha256-aafed9e48b157ae913cee994e0d9ac927af51e256feafbd923bf2852e8856d00 runner.num_ctx=8192 refCount=0
time=2025-12-09T18:03:02.582-05:00 level=WARN source=types.go:800 msg="invalid option provided" option=reasoning_effort
time=2025-12-09T18:03:02.637-05:00 level=DEBUG source=sched.go:626 msg="evaluating already loaded" model=H:\ai\ollama\models\blobs\sha256-6be6d66a3f546d8c19b130dc41dc24b2fc159f84ffbc76a0ee0676205083cf5a
time=2025-12-09T18:03:03.370-05:00 level=DEBUG source=server.go:1465 msg="completion request" images=0 prompt=18636 format=""
time=2025-12-09T18:03:03.407-05:00 level=DEBUG source=cache.go:142 msg="loading cache slot" id=0 cache=3735 prompt=3861 used=602 remaining=3259
[GIN] 2025/12/09 - 18:03:05 | 200 |    3.3750209s |       127.0.0.1 | POST     "/api/chat"
time=2025-12-09T18:03:05.841-05:00 level=DEBUG source=sched.go:385 msg="context for request finished" runner.name=registry.ollama.ai/library/gpt-oss:120b runner.inference="[{ID:GPU-94db278a-855e-2012-495e-be319102a97a Library:CUDA}]" runner.size="67.0 GiB" runner.vram="67.0 GiB" runner.parallel=2 runner.pid=31716 runner.model=H:\ai\ollama\models\blobs\sha256-6be6d66a3f546d8c19b130dc41dc24b2fc159f84ffbc76a0ee0676205083cf5a runner.num_ctx=128000
time=2025-12-09T18:03:05.843-05:00 level=DEBUG source=sched.go:290 msg="runner with non-zero duration has gone idle, adding timer" runner.name=registry.ollama.ai/library/gpt-oss:120b runner.inference="[{ID:GPU-94db278a-855e-2012-495e-be319102a97a Library:CUDA}]" runner.size="67.0 GiB" runner.vram="67.0 GiB" runner.parallel=2 runner.pid=31716 runner.model=H:\ai\ollama\models\blobs\sha256-6be6d66a3f546d8c19b130dc41dc24b2fc159f84ffbc76a0ee0676205083cf5a runner.num_ctx=128000 duration=2562047h47m16.854775807s
time=2025-12-09T18:03:05.843-05:00 level=DEBUG source=sched.go:308 msg="after processing request finished event" runner.name=registry.ollama.ai/library/gpt-oss:120b runner.inference="[{ID:GPU-94db278a-855e-2012-495e-be319102a97a Library:CUDA}]" runner.size="67.0 GiB" runner.vram="67.0 GiB" runner.parallel=2 runner.pid=31716 runner.model=H:\ai\ollama\models\blobs\sha256-6be6d66a3f546d8c19b130dc41dc24b2fc159f84ffbc76a0ee0676205083cf5a runner.num_ctx=128000 refCount=0
time=2025-12-09T18:03:21.986-05:00 level=DEBUG source=sched.go:626 msg="evaluating already loaded" model=H:\ai\ollama\models\blobs\sha256-aafed9e48b157ae913cee994e0d9ac927af51e256feafbd923bf2852e8856d00
time=2025-12-09T18:03:21.987-05:00 level=DEBUG source=server.go:1465 msg="completion request" images=1 prompt=98 format=""
time=2025-12-09T18:03:22.124-05:00 level=DEBUG source=cache.go:142 msg="loading cache slot" id=0 cache=2166 prompt=2060 used=7 remaining=2053
[GIN] 2025/12/09 - 18:03:23 | 200 |    1.5598903s |       127.0.0.1 | POST     "/api/generate"
time=2025-12-09T18:03:23.482-05:00 level=DEBUG source=sched.go:385 msg="context for request finished" runner.name=registry.ollama.ai/library/qwen3-vl:2b-instruct-q4_K_M runner.inference="[{ID:GPU-94db278a-855e-2012-495e-be319102a97a Library:CUDA}]" runner.size="2.9 GiB" runner.vram="2.9 GiB" runner.parallel=1 runner.pid=26768 runner.model=H:\ai\ollama\models\blobs\sha256-aafed9e48b157ae913cee994e0d9ac927af51e256feafbd923bf2852e8856d00 runner.num_ctx=8192
time=2025-12-09T18:03:23.482-05:00 level=DEBUG source=sched.go:290 msg="runner with non-zero duration has gone idle, adding timer" runner.name=registry.ollama.ai/library/qwen3-vl:2b-instruct-q4_K_M runner.inference="[{ID:GPU-94db278a-855e-2012-495e-be319102a97a Library:CUDA}]" runner.size="2.9 GiB" runner.vram="2.9 GiB" runner.parallel=1 runner.pid=26768 runner.model=H:\ai\ollama\models\blobs\sha256-aafed9e48b157ae913cee994e0d9ac927af51e256feafbd923bf2852e8856d00 runner.num_ctx=8192 duration=2562047h47m16.854775807s
time=2025-12-09T18:03:23.482-05:00 level=DEBUG source=sched.go:308 msg="after processing request finished event" runner.name=registry.ollama.ai/library/qwen3-vl:2b-instruct-q4_K_M runner.inference="[{ID:GPU-94db278a-855e-2012-495e-be319102a97a Library:CUDA}]" runner.size="2.9 GiB" runner.vram="2.9 GiB" runner.parallel=1 runner.pid=26768 runner.model=H:\ai\ollama\models\blobs\sha256-aafed9e48b157ae913cee994e0d9ac927af51e256feafbd923bf2852e8856d00 runner.num_ctx=8192 refCount=0
time=2025-12-09T18:03:33.245-05:00 level=WARN source=types.go:800 msg="invalid option provided" option=reasoning_effort
time=2025-12-09T18:03:33.300-05:00 level=DEBUG source=sched.go:626 msg="evaluating already loaded" model=H:\ai\ollama\models\blobs\sha256-6be6d66a3f546d8c19b130dc41dc24b2fc159f84ffbc76a0ee0676205083cf5a
time=2025-12-09T18:03:34.060-05:00 level=DEBUG source=server.go:1465 msg="completion request" images=0 prompt=12490 format=""
time=2025-12-09T18:03:34.101-05:00 level=DEBUG source=cache.go:142 msg="loading cache slot" id=0 cache=4043 prompt=2690 used=603 remaining=2087
[GIN] 2025/12/09 - 18:03:36 | 200 |    2.8823465s |       127.0.0.1 | POST     "/api/chat"
time=2025-12-09T18:03:36.007-05:00 level=DEBUG source=sched.go:385 msg="context for request finished" runner.name=registry.ollama.ai/library/gpt-oss:120b runner.inference="[{ID:GPU-94db278a-855e-2012-495e-be319102a97a Library:CUDA}]" runner.size="67.0 GiB" runner.vram="67.0 GiB" runner.parallel=2 runner.pid=31716 runner.model=H:\ai\ollama\models\blobs\sha256-6be6d66a3f546d8c19b130dc41dc24b2fc159f84ffbc76a0ee0676205083cf5a runner.num_ctx=128000
time=2025-12-09T18:03:36.008-05:00 level=DEBUG source=sched.go:290 msg="runner with non-zero duration has gone idle, adding timer" runner.name=registry.ollama.ai/library/gpt-oss:120b runner.inference="[{ID:GPU-94db278a-855e-2012-495e-be319102a97a Library:CUDA}]" runner.size="67.0 GiB" runner.vram="67.0 GiB" runner.parallel=2 runner.pid=31716 runner.model=H:\ai\ollama\models\blobs\sha256-6be6d66a3f546d8c19b130dc41dc24b2fc159f84ffbc76a0ee0676205083cf5a runner.num_ctx=128000 duration=2562047h47m16.854775807s

OS

Windows

GPU

Nvidia

CPU

AMD

Ollama version

0.13.2

GiteaMirror added the bug label 2026-04-22 18:19:41 -05:00
Author
Owner

@SingularityMan commented on GitHub (Dec 16, 2025):

Bump.

Author
Owner

@SingularityMan commented on GitHub (Dec 16, 2025):

PARTIALLY SOLVED: So the issue doesn't actually seem related to Ollama itself but to the old qwen3-0.6b-reranker model I ran from sentence-transformers directly. That particular model is problematic due to padding issues that cause a 50GB RAM blowup, so another model, tomaarsen/Qwen3-Reranker-0.6B-seq-cls, was released that tackles those RAM blowups.
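
For anyone hitting the same thing, this is roughly the swap I made on the sentence-transformers side. It's a minimal sketch, assuming the seq-cls model loads through sentence-transformers' CrossEncoder the way its model card describes; the query/document strings are placeholders.

```python
from sentence_transformers import CrossEncoder

# Load the seq-cls conversion of the Qwen3 reranker instead of the old
# qwen3-0.6b-reranker that was causing the padding-related RAM blowups.
reranker = CrossEncoder("tomaarsen/Qwen3-Reranker-0.6B-seq-cls", device="cuda")

# Placeholder query/document pairs, not my actual data.
pairs = [
    ("what is the capital of France?", "Paris is the capital of France."),
    ("what is the capital of France?", "Cells contain mitochondria."),
]

scores = reranker.predict(pairs)
print(scores)  # higher score = more relevant document
```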

I've only seen that RAM blowup from Ollama once since I switched models on sentence-transformers, so I'd say this is partially solved.

It's not fully solved, because the RAM blowups were always coming from Ollama, not sentence-transformers. I can confirm this because the RAM would free up every time I unloaded gpt-oss-120b from Ollama. Yet despite disabling system memory fallback for Ollama via the NVIDIA Control Panel, the RAM blowups persisted while the previous reranker was running from sentence-transformers. But given that Task Manager insists Ollama is the one blowing up the RAM and not the other way around, I feel like whatever is happening with the reranker model is causing Ollama to overreact, sidestepping the system memory fallback setting and continuing to blow up the RAM with gpt-oss-120b.

Of course, that was with the old reranker. I've only seen this happen once with the new one, and that was hours ago. I think this needs more looking into, because it confirms to me that Ollama in its current state can't actually be isolated via NVIDIA's system memory fallback setting on Windows 10, and any future models or systems running alongside Ollama might trigger this overreaction again.


Reference: github-starred/ollama#34609