[GH-ISSUE #11773] GPT-OSS:20B Lower GPU Usage when increase the PARAMETER num_ctx #54315

Closed
opened 2026-04-29 05:44:07 -05:00 by GiteaMirror · 3 comments

Originally created by @fangshengren on GitHub (Aug 7, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/11773

What is the issue?

I am using an RTX 3090 to run gpt-oss:20b and am seeing the following problem:

[screenshot]

With the default PARAMETER num_ctx, GPU usage is 100%. When I increase PARAMETER num_ctx, GPU usage drops:

[screenshot]

Eventually the model runs only on the CPU.

It's so weird, why does this happen?
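For reference, the parameter in question is set in a Modelfile; the model name and value below are just examples, not the reporter's exact setup:

```
FROM gpt-oss:20b
PARAMETER num_ctx 32768
```

After building the model with `ollama create`, the `ollama ps` command reports the CPU/GPU split in its PROCESSOR column (e.g. `100% GPU` vs. `41%/59% CPU/GPU`), which is an easy way to see the offload behavior described above.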

Relevant log output


OS

Linux

GPU

NVIDIA GeForce RTX 3090

CPU

Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 39 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 16
On-line CPU(s) list: 0-15
Vendor ID: GenuineIntel
Model name: 13th Gen Intel(R) Core(TM) i5-13400
CPU family: 6
Model: 191
Thread(s) per core: 2
Core(s) per socket: 10
Socket(s): 1
Stepping: 2
CPU max MHz: 4600.0000
CPU min MHz: 800.0000
BogoMIPS: 4992.00

Ollama version


0.11.3

GiteaMirror added the question label 2026-04-29 05:44:07 -05:00

@rick-github commented on GitHub (Aug 7, 2025):

There are data structures that are duplicated per-device when the model is loaded. When the model is spread across multiple devices (CPU+GPU) it uses more memory than when the model is loaded on a single device (GPU or CPU).


@pdevine commented on GitHub (Aug 7, 2025):

As @rick-github mentioned, the context needs memory. As you increase the context size it will require more of your GPU's memory until we can't fit anything onto the GPU and it will instead be loaded onto the CPU.
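To see why larger contexts push layers off the GPU, here is a back-of-envelope KV-cache estimate. The layer/head counts below are illustrative placeholders, not the exact gpt-oss:20b configuration, and the sketch ignores the weights themselves and per-device overhead:

```python
def kv_cache_bytes(num_ctx, n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """Rough KV-cache size: 2 tensors (K and V) per layer, each holding
    n_kv_heads * head_dim values per token, for num_ctx tokens."""
    return 2 * n_layers * n_kv_heads * head_dim * num_ctx * bytes_per_elem

# Illustrative numbers only (NOT the real gpt-oss:20b config):
for num_ctx in (4096, 32768, 131072):
    gib = kv_cache_bytes(num_ctx, n_layers=24, n_kv_heads=8, head_dim=64) / 2**30
    print(f"num_ctx={num_ctx:6d} -> ~{gib:.2f} GiB of KV cache")
```

The cache grows linearly with `num_ctx`, so each doubling of the context claims VRAM that would otherwise hold model layers; once the total no longer fits in 24 GB, layers spill to the CPU.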


@fangshengren commented on GitHub (Aug 8, 2025):

Thanks, guys.

[screenshot]
Reference: github-starred/ollama#54315