[GH-ISSUE #11809] Low-inference-speed/ Low-level mistakes by you. #7836

Closed
opened 2026-04-12 20:00:17 -05:00 by GiteaMirror · 5 comments

Originally created by @MasihMoafi on GitHub (Aug 8, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/11809

What is the issue?

So I tried very much NOT to leave like a hateful note, but your app sucks! It's using more than 80% of my RAM whenever I'm running any model like Qwen3 even! This is a bug in the new version. Isn't your app supposed to be made on top of llama.cpp? Then why is it so slow and it's not even using my GPU? Based on the documentation I should be able to get like six tokens per second using the newest GPT version, but I'm getting like one every second or two at most. So, to sum up, you suck.

Relevant log output

```shell
You suck.
```

OS

Windows

GPU

Nvidia

CPU

Intel

Ollama version

0.11.0

GiteaMirror added the bug label 2026-04-12 20:00:17 -05:00

@rick-github commented on GitHub (Aug 8, 2025):

[Server logs](https://github.com/ollama/ollama/blob/main/docs/troubleshooting.md#how-to-troubleshoot-issues) will aid in debugging.
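
For Windows users following along, the linked troubleshooting guide notes that Ollama writes its server logs to `%LOCALAPPDATA%\Ollama`. A minimal way to get at them, assuming a default install:

```shell
REM Per the linked troubleshooting doc, Ollama on Windows writes its logs
REM to %LOCALAPPDATA%\Ollama (server.log, plus rotated server-N.log files).
explorer %LOCALAPPDATA%\Ollama
```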

@MasihMoafi commented on GitHub (Aug 8, 2025):

Thank you. It shows that ollama detects my VRAM as low (8GB) and then dumps everything into system memory! Is there a low-level control anywhere?!
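
A quick way to confirm where a loaded model actually ended up (not mentioned in the thread, but a standard Ollama CLI command) is `ollama ps`, which reports the CPU/GPU split:

```shell
REM Run while the model is loaded; a PROCESSOR column reading "100% CPU"
REM means no layers were offloaded to the GPU.
ollama ps
```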

@rick-github commented on GitHub (Aug 8, 2025):

Post logs.


@MasihMoafi commented on GitHub (Aug 8, 2025):

[server-1.log](https://github.com/user-attachments/files/21691135/server-1.log)
[server-2.log](https://github.com/user-attachments/files/21691138/server-2.log)
[server-3.log](https://github.com/user-attachments/files/21691137/server-3.log)
[server-4.log](https://github.com/user-attachments/files/21691136/server-4.log)
[server-5.log](https://github.com/user-attachments/files/21691139/server-5.log)

@rick-github commented on GitHub (Aug 8, 2025):

```
time=2025-08-06T23:02:44.035+03:30 level=INFO source=server.go:175 msg=offload library=cuda layers.requested=-1
  layers.model=25 layers.offload=0 layers.split="" memory.available="[7.0 GiB]" memory.gpu_overhead="0 B"
  memory.required.full="14.8 GiB" memory.required.partial="0 B" memory.required.kv="3.1 GiB" memory.required.allocations="[0 B]"
  memory.weights.total="11.7 GiB" memory.weights.repeating="10.7 GiB" memory.weights.nonrepeating="1.1 GiB"
  memory.graph.full="32.0 GiB" memory.graph.partial="64.0 GiB"
```

A context of 131072 is too large: it produces a memory graph of 32 GiB, which will not fit in the available 7 GiB of VRAM, so the model is loaded into system RAM instead. You could reduce [`OLLAMA_CONTEXT_LENGTH`](https://github.com/ollama/ollama/blob/main/docs/faq.md#how-can-i-specify-the-context-window-size), and then set [`OLLAMA_FLASH_ATTENTION`](https://github.com/ollama/ollama/blob/main/docs/faq.md#how-can-i-enable-flash-attention) and [`OLLAMA_KV_CACHE_TYPE`](https://github.com/ollama/ollama/blob/main/docs/faq.md#how-can-i-set-the-quantization-type-for-the-kv-cache) to reduce the memory footprint.
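
A minimal sketch of that suggestion on Windows (the values are illustrative assumptions, not recommendations from the thread; pick a context length that suits your workload):

```shell
REM Illustrative settings for an 8 GB card. setx persists user environment
REM variables; restart the Ollama app afterwards so they take effect.
setx OLLAMA_CONTEXT_LENGTH 8192
setx OLLAMA_FLASH_ATTENTION 1
setx OLLAMA_KV_CACHE_TYPE q8_0
```

With a smaller context window and a quantized (q8_0) KV cache, the memory graph and KV allocations shrink, so more of the model's 25 layers can be offloaded into the 7 GiB of available VRAM.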
Reference: github-starred/ollama#7836