[GH-ISSUE #11864] Slow performance when web search is enabled in GPT_OSS #54388

Closed
opened 2026-04-29 05:52:28 -05:00 by GiteaMirror · 4 comments

Originally created by @sakana-max on GitHub (Aug 12, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/11864

Originally assigned to: @ParthSareen on GitHub.

What is the issue?

Slow performance when web search is enabled in GPT_OSS

When web search is enabled in GPT_OSS, it uses a lot of CPU power, resulting in very long output times.

The CPU is an i5-12400.
The GPU is an RTX 4060 Ti with 8 GB of VRAM.
The RAM and internet connection are also sufficient.
The model is gpt-oss:20b.
The web search itself seems slow, and the text never finishes loading, even after waiting.

I used machine translation.

![Image](https://github.com/user-attachments/assets/36fe3110-b550-4f28-8fe7-ea11d3a5be76)
![Image](https://github.com/user-attachments/assets/5066388b-8a5e-41ec-b710-8838bc6543ec)

Relevant log output


OS

Windows

GPU

Nvidia

CPU

Intel

Ollama version

0.11.4

GiteaMirror added the bug label 2026-04-29 05:52:28 -05:00

@ParthSareen commented on GitHub (Aug 12, 2025):

Hey @sakana-max. The gpt-oss model needs at least 16 GB of VRAM on the low end, so my gut says it's CPU offloading causing the spike and slowness. Could you show me the output of `ollama ps` while the model is loaded?


@sakana-max commented on GitHub (Aug 12, 2025):

Thank you for your reply. Is this output from `ollama ps` correct?

```
NAME           ID              SIZE     PROCESSOR          CONTEXT    UNTIL
gpt-oss:20b    e95023cf3b7b    14 GB    54%/46% CPU/GPU    4096       4 minutes from now
```

I have one more question. When outputting, the same internet search is performed multiple times. Is this correct? Or is this due to insufficient VRAM?

![Image](https://github.com/user-attachments/assets/ce532b44-301c-4f0f-93b4-337395988b2c)

@ParthSareen commented on GitHub (Aug 12, 2025):

Yeah, so the issue is that the model is too big for VRAM - the `54%/46% CPU/GPU` split shows that more than half the model is running on the CPU, which is why you see the slowdown. The model has been trained to make multiple searches, so this is expected behavior. Sorry. We should have web search for smaller models coming soon! You can try out Turbo in the meantime if that is of interest.
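A back-of-the-envelope sketch of why the split lands near half-and-half here: the loaded model reports 14 GB, but the card only has 8 GB of VRAM. This is an illustration, not Ollama's actual scheduler, and the 1 GB headroom reserved for the KV cache and driver is an assumed value:

```python
def offload_split(model_size_gb: float, vram_gb: float, reserved_gb: float = 1.0):
    """Return (gpu_fraction, cpu_fraction) of model weights under a naive
    fill-the-GPU-first allocation. reserved_gb is assumed headroom."""
    usable = max(vram_gb - reserved_gb, 0.0)
    gpu = min(model_size_gb, usable)
    cpu = model_size_gb - gpu
    return gpu / model_size_gb, cpu / model_size_gb


# gpt-oss:20b loads as 14 GB; the RTX 4060 Ti in this report has 8 GB VRAM.
gpu_frac, cpu_frac = offload_split(14.0, 8.0)
print(f"{cpu_frac:.0%}/{gpu_frac:.0%} CPU/GPU")  # → 50%/50% CPU/GPU
```

This lands close to the `54%/46% CPU/GPU` that `ollama ps` reported; the real allocator also accounts for per-layer granularity and context buffers, so the exact split differs.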


@sakana-max commented on GitHub (Aug 12, 2025):

Thank you for your reply. I understand the problem. I'm a junior high school student and can't upgrade my graphics card, so I'll try a different model.

Reference: github-starred/ollama#54388