[GH-ISSUE #11136] Performance drop 0.6.8 -> 0.7.0 .. 0.9.2 #33105

Closed
opened 2026-04-22 15:23:59 -05:00 by GiteaMirror · 16 comments

Originally created by @Rokazas on GitHub (Jun 19, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/11136

What is the issue?

Hi, I’m using a 5060 GPU with 16GB RAM and running various Gemma3 models. I noticed a significant performance drop in the newer versions. In Task Manager, the GPU shows much higher usage with version 0.6.8, whereas in the updated versions, it seems like only the CPU is being used.

server_068_debug.log
server_070_debug.log
server_092_debug.log

Relevant log output


OS

Windows

GPU

Nvidia

CPU

AMD

Ollama version

0.7.0 and later

GiteaMirror added the needs more info, bug labels 2026-04-22 15:24:00 -05:00

@indogood1 commented on GitHub (Jun 20, 2025):

Can you try setting `OLLAMA_NUM_PARALLEL=1`?
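For reference, a minimal sketch of launching the server with that variable set yourself (assuming `ollama` is on PATH and the tray app/service isn't already running; on Windows you would normally set it as a user environment variable and restart Ollama instead):

```python
# Minimal sketch: start `ollama serve` with OLLAMA_NUM_PARALLEL=1 so only one
# request is scheduled per model at a time. Commands/paths are assumptions.
import os
import subprocess

env = os.environ.copy()
env["OLLAMA_NUM_PARALLEL"] = "1"

# Runs the server under this script; stop it with server.terminate().
server = subprocess.Popen(["ollama", "serve"], env=env)
server.wait()
```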


@Rokazas commented on GitHub (Jun 20, 2025):

Thanks for taking a look.
That is already set, as seen in the log files.


@indogood1 commented on GitHub (Jun 20, 2025):

OK, please try setting `OLLAMA_NEW_ENGINE=false`.


@Rokazas commented on GitHub (Jun 20, 2025):

That is also already set.


@rick-github commented on GitHub (Jun 20, 2025):

Speed of token generation is affected a lot by how many layers are running on the GPU. Broadly speaking, more layers on the GPU, faster generation.

0.6.8

```
time=2025-06-19T21:25:32.932+03:00 level=INFO source=server.go:139 msg=offload library=cuda layers.requested=-1
 layers.model=63 layers.offload=45 layers.split="" memory.available="[14.7 GiB]" memory.gpu_overhead="0 B"
 memory.required.full="21.0 GiB" memory.required.partial="14.5 GiB" memory.required.kv="944.0 MiB"
 memory.required.allocations="[14.5 GiB]" memory.weights.total="16.0 GiB" memory.weights.repeating="13.4 GiB"
 memory.weights.nonrepeating="2.6 GiB" memory.graph.full="522.5 MiB" memory.graph.partial="1.6 GiB"
 projector.weights="806.2 MiB" projector.graph="1.0 GiB"
[GIN] 2025/06/19 - 21:26:55 | 200 |   10.1946911s |       127.0.0.1 | POST     "/api/chat"
```

0.7.0

```
time=2025-06-19T21:04:39.649+03:00 level=INFO source=server.go:168 msg=offload library=cuda layers.requested=-1
 layers.model=63 layers.offload=35 layers.split="" memory.available="[14.8 GiB]" memory.gpu_overhead="0 B"
 memory.required.full="23.4 GiB" memory.required.partial="14.6 GiB" memory.required.kv="944.0 MiB"
 memory.required.allocations="[14.6 GiB]" memory.weights.total="16.0 GiB" memory.weights.repeating="13.4 GiB"
 memory.weights.nonrepeating="2.6 GiB" memory.graph.full="522.5 MiB" memory.graph.partial="1.6 GiB"
 projector.weights="806.2 MiB" projector.graph="1.0 GiB"
[GIN] 2025/06/19 - 21:12:56 | 200 |         1m23s |       127.0.0.1 | POST     "/api/chat"
```

0.9.2

```
time=2025-06-19T19:48:27.406+03:00 level=INFO source=server.go:168 msg=offload library=cuda layers.requested=-1
 layers.model=63 layers.offload=53 layers.split="" memory.available="[14.7 GiB]" memory.gpu_overhead="0 B"
 memory.required.full="21.0 GiB" memory.required.partial="14.5 GiB" memory.required.kv="944.0 MiB"
 memory.required.allocations="[14.5 GiB]" memory.weights.total="16.0 GiB" memory.weights.repeating="13.4 GiB"
 memory.weights.nonrepeating="2.6 GiB" memory.graph.full="522.5 MiB" memory.graph.partial="1.6 GiB"
 projector.weights="806.2 MiB" projector.graph="1.0 GiB"
[GIN] 2025/06/19 - 19:53:48 | 200 |   51.2136842s |       127.0.0.1 | POST     "/api/chat"
```

0.6.8 offloaded 45 layers, 0.7.0 offloaded 35, and 0.9.2 offloaded 53. 0.7.0 has the fewest layers on the GPU and so is the slowest. 0.9.2 has the most layers, so you would expect it to be quicker. However, as the new engine has evolved, models with a vision component have a different allocation strategy. If you set `OLLAMA_DEBUG=2`, the logs will show which tensors are allocated to the GPU and which to the CPU. The memory estimation is getting a shake-up in #11090, which should create more optimal tensor allocations, thereby improving overall performance.
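As a rough sketch of how to compare versions quickly, the offload counts can be pulled straight out of the server logs; this only relies on the `msg=offload ... layers.model=N layers.offload=M` lines quoted above (the log path is a placeholder):

```python
# Hedged sketch: extract layer-offload counts from an Ollama server log.
import re

OFFLOAD = re.compile(r"layers\.model=(\d+)\s+layers\.offload=(\d+)")

with open("server.log", encoding="utf-8", errors="ignore") as f:  # placeholder path
    for line in f:
        m = OFFLOAD.search(line)
        if m:
            total, offloaded = map(int, m.groups())
            print(f"{offloaded}/{total} layers offloaded to GPU")
```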


@Rokazas commented on GitHub (Jun 20, 2025):

Thanks for the pointers, but I'm still not sure what I can do to speed things up. I ran the tests again with `OLLAMA_DEBUG=2` (logs attached) and with:
OLLAMA_NEW_ESTIMATES=1
OLLAMA_NEW_ENGINE=true

server_068_debug2.log
server_092_debug2.log

Image

My only solution is to stick to 0.6.8 and test the latest versions from week to week.


@rick-github commented on GitHub (Jun 20, 2025):

`OLLAMA_NEW_ESTIMATES` is only useful if you've compiled the #11090 branch, and you don't need to set `OLLAMA_NEW_ENGINE` as that's the default for gemma3.

What you can try is overriding the estimate that ollama has come up with by explicitly specifying how many layers to offload via `num_gpu`. See https://github.com/ollama/ollama/issues/6950#issuecomment-2373663650 for details. Note that this can lead to OOMs or performance degradation (https://github.com/ollama/ollama/issues/7584#issuecomment-2466715900).
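For example, a minimal sketch of passing `num_gpu` per request through the REST API (the model name and layer count below are placeholders rather than values from this thread; lower `num_gpu` if you hit OOMs):

```python
# Hedged sketch: override the layer-offload estimate with the num_gpu option.
import requests

resp = requests.post(
    "http://localhost:11434/api/chat",
    json={
        "model": "gemma3:27b",  # placeholder; use the model from your logs
        "messages": [{"role": "user", "content": "Hello"}],
        "options": {"num_gpu": 49},  # hypothetical value; tune and watch for OOMs
        "stream": False,
    },
    timeout=600,
)
resp.raise_for_status()
print(resp.json()["message"]["content"])
```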


@rick-github commented on GitHub (Jun 20, 2025):

Looking at the logs, 0.9.2 has offloaded 7 extra layers to CPU. In both logs, the vision projector is running purely on CPU. I don't think 7 layers would affect the performance so dramatically, so this might be a different issue. I did a quick comparison with a 16GB 3080 and 0.6.8 offloaded 49 layers to the GPU and did a completion at 5.55 tps, while 0.9.2 offloaded 57 layers and did a completion at 8.1 tps. So it's not clear why 0.9.2 is performing poorly on your system.


@rick-github commented on GitHub (Jun 20, 2025):

I noticed in the logs that you are processing images, so I re-did my quick test with an image: 0.6.8 5.37 tps, 0.9.2 7.65 tps.
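For anyone wanting to reproduce this kind of comparison, a rough sketch of computing tokens per second from the API response (the final non-streamed response reports `eval_count` and `eval_duration`, the latter in nanoseconds; the model name is a placeholder):

```python
# Hedged sketch: measure generation speed (tokens/s) for a single completion.
import requests

r = requests.post(
    "http://localhost:11434/api/chat",
    json={
        "model": "gemma3:27b",  # placeholder
        "messages": [{"role": "user", "content": "Write a short paragraph about GPUs."}],
        "stream": False,
    },
    timeout=600,
).json()

tps = r["eval_count"] / (r["eval_duration"] / 1e9)
print(f"{tps:.2f} tokens/s")
```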


@davidair commented on GitHub (Jun 21, 2025):

I am seeing a similar problem - I have an RTX 5080, running on Windows, and I am analyzing news headlines using the following prompt, with options={"temperature": 0}:

```python
prompt = (f"Analyze the sentiment of this news item:\n\nTitle: {title}\nDescription: {description}\n\n"
          "Is it positive, neutral, or negative? "
          "You must start your response with -1 for negative, 0 for neutral and 1 for positive, followed by an explanation. "
          "Note that the analysis is done for the purpose of determining if the news article is likely "
          "to cause distress to the reader so it's important to annotate anything possibly causing distress as negative.")
```

With 0.9.2, it takes 20-25 seconds to generate a response, whereas with 0.6.8, it's about 1 second or so. Like the OP, I am going to stick to 0.6.8 until it's resolved.
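A quick latency check along the same lines (wall-clock time per request at temperature 0; the URL, model name, and prompt text are placeholders) might look like:

```python
# Hedged sketch: time a single /api/chat request, as a way to compare versions.
import time
import requests

payload = {
    "model": "gemma3:12b",  # placeholder
    "messages": [{"role": "user", "content": "Analyze the sentiment of this news item: ..."}],
    "options": {"temperature": 0},
    "stream": False,
}

t0 = time.time()
r = requests.post("http://localhost:11434/api/chat", json=payload, timeout=600)
r.raise_for_status()
print(f"Response in {time.time() - t0:.1f}s")
```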


@rick-github commented on GitHub (Jun 21, 2025):

Server logs (see https://github.com/ollama/ollama/blob/main/docs/troubleshooting.md#how-to-troubleshoot-issues) will aid in debugging.


@davidair commented on GitHub (Jun 28, 2025):

I am no longer able to reproduce the problem with the newest release (0.9.3).


@Rokazas commented on GitHub (Jul 3, 2025):

Why was it closed? The issue persists for me.


@rick-github commented on GitHub (Jul 3, 2025):

Sorry, over-zealous bug closing. Please post server logs from 0.9.5.


@Rokazas commented on GitHub (Jul 3, 2025):

Sure thing. The same benchmark batch script shows no improvement.

Image

server.log


@jessegross commented on GitHub (Sep 24, 2025):

I'm going to go ahead and close this now that the new memory management logic is on by default. If you continue to see problems, please file a new issue.

Reference: github-starred/ollama#33105