[GH-ISSUE #11758] Slowness of ollama running openai oss models #69850

Closed
opened 2026-05-04 19:33:23 -05:00 by GiteaMirror · 12 comments
Owner

Originally created by @VictorWangwz on GitHub (Aug 6, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/11758

What is the issue?

Running the OpenAI OSS models via Ollama is very slow on the same prompts of 20-30k tokens each.

It only gives 15 T/s, while with similar dependencies LM Studio could reach 60 T/s and llama.cpp could reach 40 T/s.
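
For anyone trying to reproduce these numbers, Ollama's CLI can report generation speed directly; a minimal sketch, assuming a locally pulled model (the `gpt-oss:20b` tag is used here only for illustration):

```shell
# Run one prompt with timing statistics; --verbose prints prompt/eval
# token counts, durations, and tokens per second after the response.
# The model tag gpt-oss:20b is an assumption for illustration.
ollama run gpt-oss:20b --verbose "Summarize this 20k-token document: ..."
```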

Relevant log output


OS

No response

GPU

No response

CPU

No response

Ollama version

No response

GiteaMirror added the bug label 2026-05-04 19:33:23 -05:00
Author
Owner

@pamelafox commented on GitHub (Aug 6, 2025):

What machine did you try? I ran it on a Mac M1 with 16 GB RAM and got about 6 tokens per second, and I noticed that it did not use very much of my GPU at all. Other models use much more of my GPU (100%), so it seems that no optimizations have been made for this model to use the GPU.
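
GPU utilization on Apple Silicon can be checked while the model is generating; a minimal sketch, assuming macOS with admin rights (`powermetrics` requires sudo):

```shell
# Sample Apple Silicon GPU activity once per second for five samples
# while Ollama generates in another terminal; low GPU residency during
# generation would support the observation above.
sudo powermetrics --samplers gpu_power -i 1000 -n 5
```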

Author
Owner

@rick-github commented on GitHub (Aug 7, 2025):

[Server logs](https://github.com/ollama/ollama/blob/main/docs/troubleshooting.md#how-to-troubleshoot-issues) will help in debugging.
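
For reference, the default log locations below follow the linked troubleshooting guide; a minimal sketch for macOS and Linux installs:

```shell
# macOS: the server log lives under the Ollama home directory.
cat ~/.ollama/logs/server.log

# Linux (systemd service): read the journal for the ollama unit.
journalctl -u ollama --no-pager
```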

Author
Owner

@tmape commented on GitHub (Aug 9, 2025):

Same here. Hardware is a 4060 Ti 16 GB with a 12700K. After updating to the latest version of Ollama, the speed went from a max of 15 T/s to 39 T/s, but LM Studio is still faster and can exceed 60 T/s.

message: how many r are in the word 'strawberry'?

ollama [v0.11.4]
total duration: 4.2997423s
load duration: 66.8125ms
prompt eval count: 114 token(s)
prompt eval duration: 703.725ms
prompt eval rate: 162.00 tokens/s
eval count: 134 token(s)
eval duration: 3.4925531s
eval rate: 38.37 tokens/s

lm studio:
66.23 tok/sec
283 tokens
0.26s to first token
Stop reason: EOS Token Found
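
The same timing fields are also returned by Ollama's REST API, so tokens per second can be computed without the CLI; a minimal sketch, assuming a default local server on port 11434 and the same illustrative model tag:

```shell
# Non-streaming generation; durations in the response are in nanoseconds.
# Generation speed = eval_count / eval_duration * 1e9 tokens per second.
curl -s http://localhost:11434/api/generate \
  -d '{"model": "gpt-oss:20b", "prompt": "how many r are in the word strawberry?", "stream": false}' \
  | grep -oE '"(prompt_eval_count|prompt_eval_duration|eval_count|eval_duration)":[0-9]+'
```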

Author
Owner

@azomDev commented on GitHub (Aug 10, 2025):

This appears to be related to #11676

Author
Owner

@sezze commented on GitHub (Aug 11, 2025):

The same happens on macOS with an M2 Max chip and 32 GB of memory; LM Studio is significantly faster.

Author
Owner

@rick-github commented on GitHub (Aug 11, 2025):

[Server logs](https://github.com/ollama/ollama/blob/main/docs/troubleshooting.md#how-to-troubleshoot-issues) will help in debugging.

Author
Owner

@tmape commented on GitHub (Aug 11, 2025):

How do I check the logs on Windows?

Author
Owner

@rick-github commented on GitHub (Aug 11, 2025):

https://github.com/ollama/ollama/blob/main/docs/troubleshooting.md#:~:text=When%20you%20run%20Ollama%20on%20Windows
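
Per that section of the guide, the Windows logs are kept under the local app data folder; a minimal sketch for cmd.exe:

```shell
# Open the Ollama log folder in Explorer, or print the current server log.
explorer %LOCALAPPDATA%\Ollama
type %LOCALAPPDATA%\Ollama\server.log
```
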
Author
Owner

@tmape commented on GitHub (Aug 12, 2025):

[server.log](https://github.com/user-attachments/files/21726800/server.log)

Author
Owner

@rick-github commented on GitHub (Aug 12, 2025):

time=2025-08-12T12:47:04.287+08:00 level=INFO source=ggml.go:376 msg="offloaded 25/25 layers to GPU"

The model is fully loaded on the GPU. It's slower than LM Studio because the MXFP4 implementation in Ollama doesn't perform as well as the one in LM Studio (llama.cpp). Fortunately there is an [open PR](https://github.com/ollama/ollama/pull/11823) which will merge the upstream llama.cpp MXFP4 implementation, and performance will then be on par with LM Studio.
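
To check the same thing on another machine, the offload line can be pulled straight from the server log; a minimal sketch, assuming the macOS/Linux log locations mentioned earlier in the thread:

```shell
# Confirm whether all model layers were offloaded to the GPU
# (look for a line like: offloaded 25/25 layers to GPU).
grep "offloaded" ~/.ollama/logs/server.log | tail -n 5
```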

Author
Owner

@tmape commented on GitHub (Aug 13, 2025):

> time=2025-08-12T12:47:04.287+08:00 level=INFO source=ggml.go:376 msg="offloaded 25/25 layers to GPU"
>
> The model is fully loaded on the GPU. It's slower than LM Studio because the MXFP4 implementation in Ollama doesn't perform as well as the one in LM Studio (llama.cpp). Fortunately there is an [open PR](https://github.com/ollama/ollama/pull/11823) which will merge the upstream llama.cpp MXFP4 implementation, and performance will then be on par with LM Studio.

Thanks for identifying the issue. Ollama still works great for me, and I’m excited to see it get even better.

Author
Owner

@rick-github commented on GitHub (Sep 1, 2025):

Recent releases of ollama have improved MXFP4 performance. Upgrade and leave a comment if the problem persists.
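
Upgrading is a one-line script on Linux (macOS and Windows users re-download the app); a minimal sketch using the official install script:

```shell
# Re-run the official install script to upgrade an existing Linux install,
# then confirm the new version.
curl -fsSL https://ollama.com/install.sh | sh
ollama -v
```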
