[GH-ISSUE #14002] CPU > gpt-oss:latest vs GPU > llama3.1:latest, Ollama w/ Open WebUI #55665

Closed
opened 2026-04-29 09:33:31 -05:00 by GiteaMirror · 2 comments

Originally created by @jekv2 on GitHub (Jan 31, 2026).
Original GitHub issue: https://github.com/ollama/ollama/issues/14002

I've been asking questions of both models in Ollama through Open WebUI.

I noticed that gpt-oss:latest is CPU-intensive, while llama3.1:latest is GPU-intensive.

I understand that AI inference is compute-intensive. My question is: why is gpt-oss using my CPU (9950X) rather than my GPU (3090 Ti FE)?
llama3.1 uses my GPU.

Is there a way to change this for gpt-oss in Open WebUI?

I cannot access Ollama's settings from the Windows 10 tray icon; it does not open, as described here: https://github.com/ollama/ollama/issues/14000


@winstonma commented on GitHub (Jan 31, 2026):

[llama3.1:latest](https://ollama.com/library/llama3.1) is a 4.9GB model while [gpt-oss:latest](https://ollama.com/library/gpt-oss) is a 14GB model.

Could you also list which GPU you’re using and how much memory it has?
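
A quick way to verify the on-disk sizes yourself is `ollama list`; the output below is a sketch (the IDs and dates are placeholders, the sizes are the ones quoted above):

```
$ ollama list
NAME               ID              SIZE     MODIFIED
gpt-oss:latest     <model-id>      14 GB    ...
llama3.1:latest    <model-id>      4.9 GB   ...
```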


@rick-github commented on GitHub (Feb 1, 2026):

```
time=2026-01-31T15:21:10.646-06:00 level=INFO source=types.go:42 msg="inference compute" id=GPU-31f97312-c981-417d-08bd-00e35abe8dd1 filter_id="" library=CUDA compute=8.6 name=CUDA0 description="NVIDIA GeForce RTX 3060 Ti" libdirs=ollama,cuda_v13 driver=13.1 pci_id=0000:01:00.0 type=discrete total="8.0 GiB" available="7.5 GiB"
```

You have 7.5 GiB of VRAM available. As winstonma has pointed out, the two models are very different in size. The 14GB gpt-oss will not fit in the GPU VRAM, so part of the model is loaded into system RAM, where the CPU does inference. If you run `ollama ps` while gpt-oss is loaded, it will show what percentage of the model is running on the CPU versus the GPU.
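
For illustration, a partially offloaded model shows a split in the PROCESSOR column of `ollama ps`; the exact percentages below are made up, not measured on this system:

```
$ ollama ps
NAME              ID            SIZE     PROCESSOR          UNTIL
gpt-oss:latest    <model-id>    14 GB    53%/47% CPU/GPU    4 minutes from now
```

A model that fits entirely in VRAM would instead show `100% GPU` there.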
