[GH-ISSUE #11729] GPT-OSS 120B just uses CPU #7767

Closed
opened 2026-04-12 19:55:26 -05:00 by GiteaMirror · 4 comments

Originally created by @ZYJZYJZYJ0801 on GitHub (Aug 6, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/11729

What is the issue?

OS: Windows 11
Ollama version: 0.11.2
GPU: RTX 5000 Ada ×3

```shell
NAME            SIZE     PROCESSOR    CONTEXT    UNTIL
gpt-oss:120b    67 GB    100% CPU     8192       3 minutes from now
```
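
For triage, a quick sketch (assuming an NVIDIA setup; adjust for your drivers) to confirm the offload state, since "100% CPU" under PROCESSOR means no layers were offloaded:

```shell
# Sketch: check what the runner scheduled and what the GPUs report.
ollama ps
nvidia-smi --query-gpu=name,memory.used,memory.total --format=csv
```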

Relevant log output


OS

No response

GPU

No response

CPU

No response

Ollama version

No response

GiteaMirror added the bug label 2026-04-12 19:55:26 -05:00

@ajunca commented on GitHub (Aug 6, 2025):

I had a similar problem with the 20B model. I could force the offload-layers setting to maximum in OpenWebUI, and then the GPU is used.
Now the problem is that it crashes mid-answer. I have two RTX 3090s, and one always crashes mid-response. It's not a hardware problem, and all the other models work correctly.
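
For reference, the same offload can be forced against the Ollama API directly via the documented `num_gpu` option; a minimal sketch (model name and value are examples):

```shell
# Sketch: request full GPU offload via the Ollama REST API.
# num_gpu is the number of layers to offload; a large value like 999
# is a common way to ask for "everything" (adjust to your model).
curl http://localhost:11434/api/generate -d '{
  "model": "gpt-oss:20b",
  "prompt": "Hi",
  "options": { "num_gpu": 999 }
}'
```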

I'm attaching the log; probably the most relevant lines are:

```
panic: failed to sample token: sample: logits sum to NaN, check model output

goroutine 12 [running]:
github.com/ollama/ollama/runner/ollamarunner.(*Server).run(0xc0005bf8c0, {0x55bd8ee85af0, 0xc000383630})
	github.com/ollama/ollama/runner/ollamarunner/runner.go:364 +0x65
created by github.com/ollama/ollama/runner/ollamarunner.Execute in goroutine 1
	github.com/ollama/ollama/runner/ollamarunner/runner.go:960 +0xa74
```

[log.txt](https://github.com/user-attachments/files/21615080/log.txt)
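
If it helps reproduce, a fuller log around the panic can be captured by rerunning the server with debug logging enabled (sketch; Linux shown):

```shell
# Sketch: OLLAMA_DEBUG=1 enables debug-level logging in the server.
OLLAMA_DEBUG=1 ollama serve 2>&1 | tee log.txt
```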


@danielfl commented on GitHub (Aug 6, 2025):

@ZYJZYJZYJ0801 Are you using a frontend or the ollama CLI directly?
I get the same issue using OpenWebUI: Ubuntu with an AMD card that fits the 20b model (not the 120b).
Using the CLI, the model ran on the GPU during a few tests. Saying "Hi" in OpenWebUI makes the model move to the CPU.

https://github.com/open-webui/open-webui/issues/16303#issuecomment-3157788670
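
One way to isolate the frontend is to send the same minimal prompt straight to the API and compare the offload state afterwards (sketch; default port assumed):

```shell
# Sketch: bypass OpenWebUI and hit the Ollama API directly with the
# same "Hi" prompt, then check whether the model stayed on the GPU.
curl http://localhost:11434/api/generate -d '{"model": "gpt-oss:20b", "prompt": "Hi"}'
ollama ps
```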


@jessegross commented on GitHub (Aug 6, 2025):

There was a bug in 0.11.2 and below where the memory estimate for gpt-oss would become too high if the model needed to be split across GPU and CPU or across multiple GPUs. This would often result in 100% CPU usage once the model overflowed a single GPU.

This is fixed in 0.11.3.
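
To confirm the running server actually has the fix (sketch; either command reports the version):

```shell
# Sketch: verify the server is on 0.11.3 or later.
ollama -v
curl http://localhost:11434/api/version
```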


@alienatedsec commented on GitHub (Aug 7, 2025):

> There was a bug in 0.11.2 and below where the memory estimate for gpt-oss would become too high if the model needed to be split across GPU and CPU or across multiple GPUs. This would often result in 100% CPU usage once the model overflowed a single GPU.
>
> This is fixed in 0.11.3.

@jessegross it's much better, thanks, but please see https://github.com/ollama/ollama/issues/11676#issuecomment-3166035491
