[GH-ISSUE #3860] Serial generation performance regression from v0.1.32 on main #64427

Closed
opened 2026-05-03 17:37:28 -05:00 by GiteaMirror · 9 comments

Originally created by @brycereitano on GitHub (Apr 24, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/3860

Originally assigned to: @dhiltgen on GitHub.

What is the issue?

In an effort to test the latest code, which includes the recently merged concurrency branch (#3418), I noticed a performance regression when prompting a model already loaded in VRAM. This appears on the latest main branch (2ac3dd6853); I haven't been able to identify the commit that caused the regression yet, as the Docker builds take a long time.

I have confirmed that a Docker build of v0.1.32 works as intended and subsequent calls to the same model are snappy, whereas on the 2ac3dd68 commit, subsequent calls can take up to 20 seconds instead of the normal 1 second for a complete reply.

The severity of this regression seems to scale with model size: the larger the model, the longer the delay between prompts.

I have attached the debug logs for the few prompts I ran: 2ac3dd68.log (https://github.com/ollama/ollama/files/15087058/2ac3dd68.log). Although these logs are from an interactive session in open-webui, I can reproduce the issue by calling the following serially multiple times.

```
curl https://ollama.zete.dev/api/generate -d '{
  "model": "llama3:8b-instruct-q8_0",
  "prompt": "Why is the sky blue?"
}'
```
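
For reference, a minimal sketch that times a handful of these serial calls, assuming the same endpoint and model as the curl command above (adjust both for your own deployment):

```go
// Times serial /api/generate calls so a per-request model reload shows up
// as extra wall-clock time. Host and model are taken from the curl example
// above; adjust them for your own deployment.
package main

import (
	"bytes"
	"fmt"
	"io"
	"net/http"
	"time"
)

func main() {
	payload := []byte(`{"model": "llama3:8b-instruct-q8_0", "prompt": "Why is the sky blue?"}`)
	for i := 1; i <= 5; i++ {
		start := time.Now()
		resp, err := http.Post("https://ollama.zete.dev/api/generate", "application/json", bytes.NewReader(payload))
		if err != nil {
			fmt.Println("request failed:", err)
			return
		}
		io.Copy(io.Discard, resp.Body) // drain the streamed response before stopping the clock
		resp.Body.Close()
		fmt.Printf("call %d: %s\n", i, time.Since(start).Round(time.Millisecond))
	}
}
```

On v0.1.32 every call after the first finishes in roughly the same time; on 2ac3dd68 each call pays the extra delay again.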

OS

Docker

GPU

Nvidia

CPU

AMD

Ollama version

2ac3dd6

GiteaMirror added the bug label 2026-05-03 17:37:28 -05:00

@jmorganca commented on GitHub (Apr 24, 2024):

Thanks for testing `main`! cc @dhiltgen

@jmorganca commented on GitHub (Apr 24, 2024):

Hi @brycereitano would you happen to have the container logs handy? This might provide some info as to why the slowdown is happening. Thanks again for flagging this 😊

@erasmus74 commented on GitHub (Apr 24, 2024):

Seeing the same. Docker is my preferred deployment method as it covers ROCm and CUDA in a single image without many dependencies, but as soon as I upgraded my image, my 70b calls started taking longer. Some of my agents went from ~5m runtime to over ~30m. I'll do some more testing and try to attach relevant logs.

@brycereitano commented on GitHub (Apr 24, 2024):

@jmorganca It's included in my original post, but here it is again: https://github.com/ollama/ollama/files/15087058/2ac3dd68.log

Currently trying to bisect down to the commit causing the issue. I'll chime in if I find anything.

@brycereitano commented on GitHub (Apr 24, 2024):

Running git bisect, the regression was definitely introduced in 34b9db5 with the merge of #3418. I reverted it on a local branch based on the latest `main` and got back to baseline performance.

I haven't set any of the environment variables that were introduced in that change, so this would affect anybody running CUDA. I'm unsure about ROCm, but it appears not to impact CPU inference in any way.

@brycereitano commented on GitHub (Apr 24, 2024):

Identified the cause: previously, Ollama kept global state of whether a model was loaded and only loaded it when necessary. The new logic loads the model into memory from disk every time you make a request: https://github.com/ollama/ollama/blob/74d2a9ef9aa6a4ee31f027926f3985c9e1610346/server/sched.go#L76-L86

The simplest solution, if I'm not missing something, would be to defer loading the model into memory until we are actually trying to fit the model into VRAM, here (https://github.com/ollama/ollama/blob/74d2a9ef9aa6a4ee31f027926f3985c9e1610346/server/sched.go#L136-L140) and here (https://github.com/ollama/ollama/blob/74d2a9ef9aa6a4ee31f027926f3985c9e1610346/server/sched.go#L151-L156).

If that is how we want to handle it, without a larger rework, I can create a patch for this next evening.

@dhiltgen commented on GitHub (Apr 24, 2024):

@brycereitano thanks! If you think you can get a PR up today, go for it!

As to the proposed approach, that sounds reasonable. Perhaps add a `func (runner *runnerRef) loadGgml() error` and wire that in as a lazy loader.
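
A minimal sketch of what that lazy loader could look like, using placeholder names for the GGML type, the modelPath field, and the existing disk-load routine (the real fields in server/sched.go may differ):

```go
package server

import "sync"

// GGML is a stand-in for the parsed model metadata the scheduler needs when
// estimating whether a model fits into VRAM; the real type lives elsewhere.
type GGML struct{}

type runnerRef struct {
	mu        sync.Mutex
	modelPath string
	ggml      *GGML                            // nil until a VRAM fit actually requires it
	loadFn    func(path string) (*GGML, error) // placeholder for the existing disk-load routine
}

// loadGgml reads the model from disk only on first use, so requests that hit
// an already-loaded runner skip the expensive per-request disk read.
func (runner *runnerRef) loadGgml() error {
	runner.mu.Lock()
	defer runner.mu.Unlock()
	if runner.ggml != nil {
		return nil // already loaded; nothing to do
	}
	ggml, err := runner.loadFn(runner.modelPath)
	if err != nil {
		return err
	}
	runner.ggml = ggml
	return nil
}
```

The two VRAM-fit paths linked above would then call loadGgml() right before they need the size estimate, and every other request path leaves the model on disk.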

@erasmus74 commented on GitHub (Apr 25, 2024):

Is there possibly a way to trigger the Docker build? I'm running into issues trying to build it locally, and the published Docker image is still 8 days old.

@dhiltgen commented on GitHub (Apr 28, 2024):

@erasmus74 The 0.1.33 pre-release is up and available on Docker Hub if you specify the version tag. Once we drop the pre-release status, the "latest" tags will get updated.

https://hub.docker.com/r/ollama/ollama/tags

Reference: github-starred/ollama#64427