[GH-ISSUE #11731] GPT-OSS:20B running almost entirely on CPU #69828

Closed
opened 2026-05-04 19:30:17 -05:00 by GiteaMirror · 11 comments
Owner

Originally created by @jhsmith409 on GitHub (Aug 6, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/11731

What is the issue?

On an RTX 5090 + 5070 Ti (48 GB VRAM total), gpt-oss:20b runs almost entirely on the CPU. Only the KV cache is stored on the GPU; GPU utilization sits around 1% while all 20 CPU cores are maxed out. That gets about 6 tokens/second.

Diagnosed it with Claude, which says: "The gpt-oss:20b model runs on CPU because it uses MXFP4 quantization, which likely lacks GPU acceleration support in Ollama. Your working models (gemma3, qwen3) use standard quantizations (Q4_0, etc.) that have full GPU support.

Solution: Look for a gpt-oss model with standard quantization (Q4_K_M, Q5_K_M, Q8_0) for GPU acceleration, or wait for MXFP4 GPU support in future Ollama updates."

Running Ollama 0.11.3, CUDA 12.9.1, Ubuntu 24.04.2 LTS

```yaml
services:
  ollama:
    image: ollama/ollama:0.11.3
    container_name: ollama
    restart: unless-stopped
    ports:
      - "11434:11434"
    environment:
      - NVIDIA_VISIBLE_DEVICES=all
      # - OLLAMA_CONTEXT_LENGTH=32768
      # - OLLAMA_CONTEXT_LENGTH=49152
      # - OLLAMA_CONTEXT_LENGTH=65536
      - OLLAMA_CONTEXT_LENGTH=131072
      - OLLAMA_KV_CACHE_TYPE=q8_0
      - OLLAMA_NUM_PARALLEL=1
      - OLLAMA_NUM_THREADS=16
      - OLLAMA_MAX_QUEUE=32
      - OLLAMA_MAX_LOADED_MODELS=1
      - OLLAMA_FLASH_ATTENTION=1
      - OLLAMA_KEEP_ALIVE=5m
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    volumes:
      - ollama-data:/root/.ollama
      - /mnt/raid0/medgemma27b/medgemma-27b-text-it-q4_k_m:/models/medgemma-27b-text-it-q4_k_m

volumes:
  ollama-data:
```
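One way to confirm how much of the model actually landed on the GPU is `ollama ps`, or the equivalent `/api/ps` endpoint. A minimal sketch, assuming the server is reachable at `localhost:11434` and the `requests` package is installed:

```python
import requests

# Query the running-models endpoint (same data as `ollama ps`).
resp = requests.get("http://localhost:11434/api/ps", timeout=5)
resp.raise_for_status()

for m in resp.json().get("models", []):
    size = m["size"]               # total bytes the loaded model occupies
    vram = m.get("size_vram", 0)   # bytes resident in GPU memory
    pct_gpu = 100 * vram / size if size else 0
    print(f"{m['name']}: {pct_gpu:.0f}% GPU / {100 - pct_gpu:.0f}% CPU")
```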

Relevant log output


OS

No response

GPU

No response

CPU

No response

Ollama version

No response

GiteaMirror added the bug label 2026-05-04 19:30:17 -05:00
Author
Owner

@jhsmith409 commented on GitHub (Aug 6, 2025):

On the vLLM repository, one of the users ran the new model's PR through Gemini code review and it flagged this:

"vllm/model_executor/models/gpt_oss.py

    for name, weight in weights:
        # FIXME(woosuk): Remove this after testing.
        weight = weight.cuda()

@gemini-code-assist bot ([review comment](https://github.com/vllm-project/vllm/pull/22327#discussion_r2255968264), flagged critical):
The hardcoded weight.cuda() call inside the weight loading loop is problematic. It prevents the model from running on non-CUDA devices (e.g., for testing on CPU) and is inefficient as it moves weights to the GPU one by one. This can lead to performance issues and portability problems. The weight loading pipeline should handle device placement, and this line should be removed. The FIXME comment suggests this might be temporary, but it's a critical issue to address before merging."

Does Ollama have the same issue?

Author
Owner

@rick-github commented on GitHub (Aug 6, 2025):

Reduce context size to allow more layers to be loaded in VRAM.
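For example, the context size can be dropped per request through the API's `num_ctx` option instead of rebuilding the container. A minimal sketch, assuming the default endpoint and the `requests` package; the 16384 value is only an illustration, not a recommendation:

```python
import requests

# Ask for a smaller context window so more layers fit in VRAM.
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "gpt-oss:20b",
        "prompt": "Hello",
        "stream": False,
        "options": {"num_ctx": 16384},  # illustrative value
    },
    timeout=300,
)
resp.raise_for_status()
print(resp.json()["response"])
```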

Author
Owner

@maglat commented on GitHub (Aug 6, 2025):

It's some kind of faulty memory allocation that Ollama has suffered from for a long time now. Try a context size of 32k, which Ollama should calculate out to about 22 GB of VRAM. Go above 32k and the requirement suddenly jumps to 50 GB of VRAM.

Author
Owner

@jhsmith409 commented on GitHub (Aug 6, 2025):

When I calculate the KV cache size from the info on the model card, I get a relatively small size even at 128k context.

KV cache size = 2 × context_length × num_layers × num_kv_heads × head_dim × bytes_per_element

For gpt-oss:20b (from its model card on HF):

context_length (C) = 128,000
num_layers (L) = 24
num_kv_heads = 8 (due to grouped query attention with group size 8)
head_dim = hidden_size / num_attention_heads = 2,880 / 64 = 45
bytes_per_element = 1 (FP8)

Step-by-step calculation:

Compute per-cache size: C × L × num_kv_heads × head_dim = 128,000 × 24 × 8 × 45 = 1,105,920,000 elements
Double for key + value: 2 × 1,105,920,000 = 2,211,840,000 elements @ 1 byte (FP8) = 2.2 GB

2.2 GB seems small, am I missing something?
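The same arithmetic as a short Python sketch; the parameter values are copied from the comment above (the poster's reading of the HF model card), not taken from the GGUF metadata:

```python
# KV cache size estimate, using the numbers quoted above.
context_length = 128_000
num_layers = 24
num_kv_heads = 8
head_dim = 2_880 // 64        # hidden_size / num_attention_heads = 45
bytes_per_element = 1         # FP8

elements_per_cache = context_length * num_layers * num_kv_heads * head_dim
total_bytes = 2 * elements_per_cache * bytes_per_element  # key + value

print(f"{total_bytes / 1e9:.1f} GB")  # -> 2.2 GB
```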

Author
Owner

@rick-github commented on GitHub (Aug 6, 2025):

Splitting the model across multiple devices causes increased memory usage because state needs to be duplicated on each device.

Author
Owner

@asjad3 commented on GitHub (Aug 9, 2025):

Hi, were you able to resolve this? I have the same issue with my RTX 4080. I've tried turning the context size all the way down to 4k, but no luck.

The model fills up 15 GB of VRAM (out of the 16 GB on my GPU), but the actual computation appears to be stressing the CPU while the GPU just sits idle. Very low inference rates, too.

Author
Owner

@Queracus commented on GitHub (Aug 9, 2025):

Dropping from 128k context to 32k in Ollama now fits into a 24 GB 3090, just barely. And I must say it's a fast model, to be fair.

I should add that I have OLLAMA_KV_CACHE_TYPE set to f16 and OLLAMA_FLASH_ATTENTION set to 1 in the environment variables.

Author
Owner

@azomDev commented on GitHub (Aug 10, 2025):

I think this might be related to #11676, but I'm not completely sure in this case.

Author
Owner

@rick-github commented on GitHub (Aug 10, 2025):

The deciding factor is the graph size, not the KV cache size. For gpt-oss:20b @ 128000, that's 31.2GB. If ollama can't fit the graph on a device, it can't use the device. Until flash attention is supported for gpt-oss, the only solution to getting the model to run on a GPU is to get more VRAM or reduce the context size.

Author
Owner

@jhsmith409 commented on GitHub (Aug 10, 2025):

Rick, oh! I didn't realize that flash attention wasn't supported. 128k context without flash attention would require a huge amount of memory, likely TBs of VRAM. My application requires a long context window, so I'll stick with Qwen3, which fits in my 48 GB of VRAM at 128k, until flash attention is working. Thank you for that piece of information.

Author
Owner

@rick-github commented on GitHub (Sep 1, 2025):

Flash attention has been enabled for gpt-oss in recent ollama releases.


Reference: github-starred/ollama#69828