[GH-ISSUE #4468] Ollama speed dropped with setting OLLAMA_NUM_PARALLEL #2790

Closed
opened 2026-04-12 13:07:13 -05:00 by GiteaMirror · 6 comments

Originally created by @hugefrog on GitHub (May 16, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/4468

Originally assigned to: @dhiltgen on GitHub.

What is the issue?

After setting OLLAMA_NUM_PARALLEL in Ollama 0.1.38, the speed for a single user has dropped by half, and GPU utilization is only about 50%.

OS

Windows

GPU

Nvidia

CPU

Intel

Ollama version

0.1.38

GiteaMirror added the bug label 2026-04-12 13:07:13 -05:00

@DarkCaster commented on GitHub (May 18, 2024):

Same problem here :(
But I'm using it on Debian 11, CPU - Ryzen 3900X, no GPU.


@DarkCaster commented on GitHub (May 19, 2024):

Okay, I did a little research on my case, and it doesn't seem to be related to this bug after all. I found a similar issue with explanations [here](https://github.com/ollama/ollama/issues/2496). Setting the `num_thread` parameter in the model file did help utilize 100% of the CPU in my case, but it did not really improve performance. Sorry for misleading you. I have not been using Ollama for long, and until recently I had a server without HT. Your case is probably different.
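
For readers who want to try the `num_thread` setting mentioned above without editing a model file, here is a minimal Python sketch that passes it as a per-request option to Ollama's `/api/generate` endpoint. The model name, prompt, thread count, and the helper names `build_request` and `send` are illustrative assumptions, not taken from this thread, and the URL assumes a default local install.

```python
import json
import urllib.request

# Assumed default address of a local Ollama server.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_request(model: str, prompt: str, num_thread: int) -> dict:
    """Build a non-streaming generate request that overrides num_thread."""
    if num_thread < 1:
        raise ValueError("num_thread must be at least 1")
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": {"num_thread": num_thread},
    }


def send(payload: dict, url: str = OLLAMA_URL) -> dict:
    """POST the payload to a running Ollama server and return the parsed reply."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

With a server running, `send(build_request("llama3", "Why is the sky blue?", num_thread=12))` would return a JSON object whose `response` field holds the generated text. The value 12 is only an example, matching the physical core count of a Ryzen 3900X.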

@dhiltgen commented on GitHub (May 21, 2024):

@hugefrog can you share some more details? What does `ollama ps` show with and without the parallel setting, and what did you set it to? We have to multiply the context size by the num parallel value when loading into the GPU, so if you're loading a model that only just fit without parallel, adding parallel might be pushing layers off the GPU and into system memory, which could explain the slowdown. Our goal is to auto-select parallelism in the future based on available VRAM so we can avoid overflowing to CPU.
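
The memory effect described above can be illustrated with a rough back-of-the-envelope calculation. The sketch below assumes the usual transformer key/value cache formula (a K and a V tensor per layer, sized by KV heads, head dimension, and tokens) and multiplies the context length by the parallel setting, as the comment describes. The architecture numbers are illustrative assumptions for a 34B-class model with grouped-query attention, not values reported in this thread, and this is not Ollama's actual allocation code.

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   num_ctx: int, num_parallel: int = 1,
                   bytes_per_elem: int = 2) -> int:
    """Estimate KV cache size when the context is multiplied by num_parallel."""
    total_ctx = num_ctx * num_parallel  # one context window per parallel slot
    # Factor 2 accounts for storing both keys and values.
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
    return per_token * total_ctx


# Assumed, illustrative model shape (hypothetical values, fp16 cache).
shape = dict(n_layers=60, n_kv_heads=8, head_dim=128, num_ctx=2048)

for parallel in (1, 2, 4):
    gib = kv_cache_bytes(**shape, num_parallel=parallel) / 2**30
    print(f"num_parallel={parallel}: ~{gib:.2f} GiB of KV cache")
```

Under these assumptions the cache grows linearly with the parallel setting, so a model that only just fits at parallel 1 can overflow VRAM once the setting is raised.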

@dhiltgen commented on GitHub (Jun 21, 2024):

If you're still seeing performance problems, please make sure to upgrade to the latest version and share the ollama ps output so we can evaluate.


@hugefrog commented on GitHub (Jun 24, 2024):

> If you're still seeing performance problems, please make sure to upgrade to the latest version and share the `ollama ps` output so we can evaluate.

Thank you for your response. I have updated Ollama to version 0.1.141 and ran some tests. I found that after setting OLLAMA_NUM_PARALLEL, the memory consumption of the yi:34b-chat-v1.5-q4_K_M model increased from 22GB to 25GB, which exceeds the memory capacity of my NVIDIA 3090, resulting in a decrease in speed.

@dhiltgen commented on GitHub (Jun 24, 2024):

@hugefrog that sounds like expected behavior with the current architecture. In an upcoming release if no parallel setting is defined, we'll auto-detect available VRAM and set a reasonable parallel level that keeps the model fitting in VRAM.

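
As a rough illustration of the auto-selection idea, the sketch below picks the largest parallel level whose estimated footprint still fits in VRAM. It is a hypothetical heuristic, not Ollama's implementation: the function name `pick_parallel`, the linear per-slot cost model, and all numbers (a 24 GB card, a 22 GB footprint at parallel 1, 1 GB per extra slot) are assumptions chosen only to resemble the situation reported in this thread.

```python
def pick_parallel(vram_gb: float, base_gb: float, per_slot_gb: float,
                  max_parallel: int = 4) -> int:
    """Return the largest parallel level (>= 1) whose estimate fits in VRAM.

    base_gb is the estimated footprint at parallel = 1; each extra slot is
    assumed to add per_slot_gb (a simplified, linear cost model).
    """
    best = 1  # fall back to a single slot even if that already overflows
    for parallel in range(1, max_parallel + 1):
        needed = base_gb + (parallel - 1) * per_slot_gb
        if needed <= vram_gb:
            best = parallel
    return best


# Hypothetical numbers resembling a 24 GB card and a 22 GB model footprint.
print(pick_parallel(vram_gb=24.0, base_gb=22.0, per_slot_gb=1.0))
```

A real implementation would need measured memory figures rather than a fixed per-slot cost, but the shape of the decision is the same: raise parallelism only while the model stays fully in VRAM.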

Reference: github-starred/ollama#2790