[GH-ISSUE #5559] ollama 0.2.0 version is slower than 0.1.48 #3475

Closed
opened 2026-04-12 14:09:54 -05:00 by GiteaMirror · 5 comments

Originally created by @ceykmc on GitHub (Jul 9, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/5559

Originally assigned to: @jmorganca on GitHub.

### What is the issue?

I upgraded Ollama from 0.1.48 to 0.2.0 on Ubuntu 20.04.
With all configuration unchanged, using deepseek-coder-v2:16b on an RTX 4090D:
in 0.1.48, inference takes 0.4s,
but in 0.2.0, it takes 4s.
I don't know why.
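For reference, per-request latency like this can be measured with a minimal curl loop against the local API; this is a sketch assuming the default endpoint on localhost:11434 and that the model is already pulled:

```sh
# Time three consecutive non-streaming generate requests. The first request
# includes model load time; later ones should hit the already-loaded model.
for i in 1 2 3; do
  time curl -s http://localhost:11434/api/generate \
    -d '{"model": "deepseek-coder-v2:16b", "prompt": "hello", "stream": false}' \
    > /dev/null
done
```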

### OS

Linux

### GPU

Nvidia

### CPU

Intel

### Ollama version

0.2.0

GiteaMirror added the bug label 2026-04-12 14:09:54 -05:00

@jmorganca commented on GitHub (Jul 9, 2024):

Hi @ceykmc, do you happen to have the logs available? `journalctl -u ollama`
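For anyone collecting them, a sketch of dumping those service logs to a file, assuming Ollama runs as a systemd service:

```sh
# Grab the last 200 lines of the ollama service log without a pager
# and save them to a file that can be attached to the issue.
journalctl -u ollama --no-pager -n 200 > ollama.log
```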


@ceykmc commented on GitHub (Jul 9, 2024):

When using 0.2.0:
with `OLLAMA_NUM_PARALLEL=1`, the default, everything is fine.
With `OLLAMA_NUM_PARALLEL` larger than 1, e.g. `OLLAMA_NUM_PARALLEL=4`, every inference is slow, taking as long as the first load.

In my case, with `OLLAMA_NUM_PARALLEL=1` the first inference takes 4s (which is fine, since the model needs to be loaded), and subsequent inferences take 0.4s.
With `OLLAMA_NUM_PARALLEL=4`, the first inference takes 4s, and every subsequent inference still takes 4s.
Maybe when `OLLAMA_NUM_PARALLEL` is larger than 1, the model is reloaded every time?
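For reference, a sketch of how the two settings can be switched when Ollama runs as a systemd service (assumes a standard service install; `systemctl edit` manages a drop-in override):

```sh
# Open a drop-in override for the service and, under [Service], add:
#   Environment="OLLAMA_NUM_PARALLEL=4"
sudo systemctl edit ollama

# Restart so the new environment takes effect.
sudo systemctl restart ollama
```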


@binarrii commented on GitHub (Jul 9, 2024):

I encountered the same problem, just like @ceykmc mentioned.


@binarrii commented on GitHub (Jul 9, 2024):

> I encountered the same problem, just like @ceykmc mentioned.

It seems that the model needs to be reloaded every time a request is made.


@jmorganca commented on GitHub (Jul 9, 2024):

Hi folks, this should be fixed in https://github.com/ollama/ollama/releases/tag/v0.2.1 – sorry about the issue. Let me know if you're still seeing slowness in 0.2.1.
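For completeness, a sketch of upgrading in place on Linux, assuming Ollama was installed with the official install script rather than a package manager:

```sh
# Re-running the install script upgrades an existing installation in place.
curl -fsSL https://ollama.com/install.sh | sh

# Confirm the running version.
ollama --version
```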
