[GH-ISSUE #8779] When using open-webui and others ollama seems to struggle stopping/unloading/switching models #31461

Closed
opened 2026-04-22 11:54:50 -05:00 by GiteaMirror · 14 comments

Originally created by @AncientMystic on GitHub (Feb 2, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/8779

I have noticed, across a few installs and a few different UIs and after quite a few attempts, that when ollama has to switch, stop, or unload models for whatever reason, it seems to hang for a while, often causing problems with requests.

In open-webui this causes problems with responses: the model is scheduled to unload after 5 minutes, and when you make another request it hangs because ollama is still struggling to unload the model from last time; sometimes it just never responds at all.

It seems ollama would be much better and more efficient if it could unload models more quickly. Some workflows also require rapid switching between models, for example memories and other features that use different models in succession.

GiteaMirror added the needs more info label 2026-04-22 11:54:50 -05:00

@AncientMystic commented on GitHub (Feb 3, 2025):

Just to add: it probably just needs a verification step or something to ensure the model was actually unloaded, or to force the unload if it hasn't happened and the model is still stuck in "stopping" after x seconds.
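A minimal sketch of that kind of check against the documented ollama HTTP API, assuming a default install listening on localhost:11434 (the model name is only an example):

```
# See which models are currently loaded and when they are due to expire.
curl http://localhost:11434/api/ps

# Force an immediate unload: an empty generate request with keep_alive set to 0
# asks the server to evict the model right away.
curl http://localhost:11434/api/generate -d '{"model": "llama3.3:70b-instruct-q4_K_M", "keep_alive": 0}'

# Recent releases expose the same thing on the CLI:
ollama ps
ollama stop llama3.3:70b-instruct-q4_K_M
```

If `/api/ps` still lists the model after a forced unload, that would confirm the runner is wedged rather than just slow.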

Sometimes it unloads fine and other times it just won't. Half the time when this happens I have to remotely connect to the host and force-stop ollama to restart it manually (or wait 5-20 minutes, which is just too long to have everything grind to a halt), and it happens often enough to be annoying.

It also tends not to handle requests properly while it is doing all of this; it just waits until keep_alive expires and unloads again, while open-webui hangs as if it is about to reply, but of course ollama has already unloaded the model and stopped trying.

Edit: it also often has issues with consecutive replies; the first 1-2 replies arrive within a few seconds, then it will randomly hang and take forever to reply for no apparent reason.


@rick-github commented on GitHub (Feb 3, 2025):

[Server logs](https://github.com/ollama/ollama/blob/main/docs/troubleshooting.md) will aid in debugging.


@DarkIceXD commented on GitHub (Feb 20, 2025):

I'm running LibreChat and Ollama inside two separate Docker containers, and I intermittently experience the same issue. Today I tried limiting `OLLAMA_NUM_PARALLEL` to 1 as a test, but unfortunately it didn't make a difference.

Here are the server logs from when it happened again today: [logs.txt](https://github.com/user-attachments/files/18884036/logs.txt)

Should I enable `OLLAMA_DEBUG` and report back when the issue happens again?
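A minimal sketch of passing those variables to the official Docker image, following the standard `docker run` invocation from the ollama README (volume and container names are only examples):

```
docker run -d --gpus=all \
  -e OLLAMA_DEBUG=1 \
  -e OLLAMA_NUM_PARALLEL=1 \
  -v ollama:/root/.ollama \
  -p 11434:11434 \
  --name ollama \
  ollama/ollama

# Follow the server log and watch for DEBUG-level lines.
docker logs -f ollama
```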


@rick-github commented on GitHub (Feb 20, 2025):

From the log, you loaded llama3.3:70b-instruct-q4_K_M and had chats using both the ollama API (`/api/chat`) and the OpenAI API (`/v1/chat/completions`). It looks like your last successful chat was at 08:51:51 using the OpenAI API; after that the chats take so long to complete that the client times out after 10 minutes. Do you know what the requests after 08:51:51 were about?

I speculate that the model has lost coherence and is "rambling", i.e. generating a stream of tokens without ever hitting an EOS (end of sequence) token. This can happen when a response fills the context buffer and the resulting shift to make room causes the model to lose track of what it was doing. Enabling `OLLAMA_DEBUG` should show lots of shifting happening. The reason this prevents model unloading is that ollama waits for a completion to finish before unloading the model that is generating it. You can mitigate this by setting `num_predict`. Since this is happening in the OpenAI API call, you would need to set this in the Modelfile.

Logs with `OLLAMA_DEBUG=1` would verify or refute the above.
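A minimal sketch of that Modelfile route (the derived model name and the 512-token cap are only examples):

```
# Derive a model with a hard cap on generation length, so a rambling
# completion cannot run until the client times out.
cat > Modelfile <<'EOF'
FROM llama3.3:70b-instruct-q4_K_M
PARAMETER num_predict 512
EOF

ollama create llama3.3-capped -f Modelfile
```

Clients that talk to the OpenAI-compatible endpoints would then request `llama3.3-capped` instead of the base tag.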


@DarkIceXD commented on GitHub (Feb 20, 2025):

> Do you know what the requests after 08:51:51 were about?

Not sure, but the last one was definitely a new chat just asking "hello" to see if there was any response. It didn't give me one, so I had to restart ollama.

Usually I only use `/v1/completions` via the VS Code extension Continue and `/v1/chat/completions` via LibreChat; the reason you see `/api/chat` is that I was testing whether I had disabled `OLLAMA_NUM_PARALLEL` correctly. Also, I exclusively use llama3.3:70b-instruct-q4_K_M and try to never unload it, for fast responses.

I will add `OLLAMA_DEBUG=1`.


@DarkIceXD commented on GitHub (Feb 21, 2025):

Here is another log for the freeze, but I had to redact the messages. It looks like the messages still get generated correctly but are not sent to the client?
[logs_freeze.txt](https://github.com/user-attachments/files/18902724/logs_freeze.txt)

Also, I encountered the "GGGGG" issue again. Should I open a new ticket for that?
[logs_gggg.txt](https://github.com/user-attachments/files/18902726/logs_gggg.txt)


@AncientMystic commented on GitHub (Feb 24, 2025):

I kept running into this issue on Windows, so I switched my VM environment to Linux, which never does it, so I am no longer experiencing the issue myself. It is still absolutely a problem on Windows, though: across multiple ollama installs on multiple Windows installs, both 10 and 11, the problem persisted.

It is also pretty easy to replicate in my experience: just run ollama on Windows with open-webui and try to get it to respond more than once. Sometimes it works, but most of the time it has a lot of problems replying and it's just completely unreliable.

Reading over my logs each time it happened, they said nothing about any error; at most, that the connection was closed and ollama unloaded the model, or the "gpu VRAM usage didn't recover within timeout" message, which I see you (rick-github) have said on other issue threads here is just a warning and can be ignored.

I will see if I can get any specific errors in debug-mode logs, since the normal log said nothing about a problem. It would respond fine the first time, then on the 2nd-3rd request stop and hang as if it were thinking forever, and it did this more often than not, making it mostly unusable unless I wanted to close ollama between every single reply.

(Linux/WSL always seems to work better than running directly on Windows for everything that has a Linux variant.)


@rick-github commented on GitHub (Feb 24, 2025):

> It is also pretty easy to replicate in my experience

If you can reproduce it reliably, it would be interesting to see whether setting `num_predict` has any effect - my suggestion that this might help is purely guesswork.
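For a quick test without creating a new model, `num_predict` can also be passed per request in the `options` field of the native API (the values here are placeholders; `options` is specific to `/api/...`, which is why the Modelfile route was suggested for OpenAI-style clients):

```
curl http://localhost:11434/api/chat -d '{
  "model": "llama3.3:70b-instruct-q4_K_M",
  "messages": [{"role": "user", "content": "hello"}],
  "options": {"num_predict": 200}
}'
```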


@rick-github commented on GitHub (Feb 24, 2025):

> logs_freeze.txt

There's a lot of truncation happening. Multiple conversation turns are filling up the context buffer and ollama is discarding earlier parts of the conversation. This may lead to the model losing track of the conversation and going off the rails. I suggest increasing `num_ctx` and/or adding `num_predict` to see if it helps.

> logs_gggg.txt

time=2025-02-20T15:22:28.984Z level=INFO source=server.go:596 msg="llama runner started in 8.02 seconds"
time=2025-02-20T15:22:28.984Z level=DEBUG source=sched.go:462 msg="finished setting up runner" model=/root/.ollama/models/blobs/sha256-4824460d29f2058aaf6e1118a63a7a197a09bed509f0e7d4e2efb1ee273b447d
time=2025-02-20T15:22:28.984Z level=DEBUG source=routes.go:1461 msg="chat request" images=0 prompt="<|start_header_id|>user<|end_header_id|>\n\ntest<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"
time=2025-02-20T15:22:28.985Z level=DEBUG source=cache.go:104 msg="loading cache slot" id=0 cache=0 prompt=11 used=0 remaining=11
time=2025-02-20T15:22:30.877Z level=DEBUG source=server.go:818 msg="prediction aborted, token repeat limit reached"
[GIN] 2025/02/20 - 15:22:30 | 200 | 10.355523674s |      172.20.0.7 | POST     "/v1/chat/completions"

Freshly started server and a short prompt. It's not clear why the model starts generating nonsense, but it's probably related to the multi-GPU setup - there are multiple tickets for this issue and they usually involve multiple GPUs. There are reports that setting `GGML_CUDA_NO_PEER_COPY` helps, but that's unfortunately a compile-time setting, and there is no strong evidence (yet) that it actually resolves the issue. What's interesting is that the nonsense generation is not more widespread; it appears to affect some users more than others. What's the configuration of the machine that Docker is running on? OS, CPU, virtualized, etc.


@DarkIceXD commented on GitHub (Feb 24, 2025):

> There's a lot of truncation happening. Multiple conversation turns are filling up the context buffer and ollama is discarding earlier parts of the conversation. This may lead to the model losing track of the conversation and going off the rails. I suggest increasing `num_ctx` and/or adding `num_predict` to see if it helps.

I will need to look into how I can incorporate `num_predict` into LibreChat, since I would have to send it on every request, is that correct?
And `num_ctx` will need to be set in the Modelfile, right?
Also, what values would you recommend?

> What's interesting is that the nonsense generation is not more widespread; it appears to affect some users more than others. What's the configuration of the machine that Docker is running on? OS, CPU, virtualized, etc.

I am running:

  • Debian 12 (bare metal, no virtualization)
  • 6.1.0-30-amd64 x86_64 GNU/Linux
  • AMD Ryzen Threadripper PRO 5955WX
  • 128 GB RAM
  • Official Docker (not the debian version) + NVIDIA Container Toolkit
  • Official Ollama Docker Image

@rick-github commented on GitHub (Feb 24, 2025):

> I will need to look into how I can incorporate `num_predict` into LibreChat, since I would have to send it on every request, is that correct?

It's probably easier to just create models as required:

$ ollama run llama3.3
>>> /set parameter num_ctx 4096
>>> /set parameter num_predict 200
>>> /save llama3.3:c4k-p200
>>> /set parameter num_ctx 8192
>>> /set parameter num_predict 400
>>> /save llama3.3:c8k-p400
>>> /bye

Then in LibreChat, choose llama3.3:c4k-p200 or llama3.3:c8k-p400, or whichever model you create, based on the parameters you want.

> Also, what values would you recommend?

It depends on what your inputs and outputs are. If it's just a conversation, then you probably want a large input buffer (`num_ctx`) to handle multiple turns, but only need a small number of tokens (`num_predict`) for the response. If you expect verbose responses, you would increase `num_predict` so as not to prematurely cut off the response. Note that if you make `num_ctx` larger it may push model layers off the GPU into system RAM, which will make the model run slower. You are borderline at the moment (`memory.available="[23.4 GiB 23.4 GiB]" memory.required.allocations="[23.3 GiB 22.5 GiB]"`); you can make room for more context by setting `OLLAMA_NUM_PARALLEL=1`. This will reduce the context memory requirement to a quarter of its current value, so you should be able to get `num_ctx` up to around 8192.
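A quick way to check whether a larger `num_ctx` has pushed layers off the GPU is `ollama ps` (shown as a sketch; the important part is the PROCESSOR column):

```
# If PROCESSOR shows anything other than "100% GPU", some layers have
# spilled into system RAM and generation will be noticeably slower.
ollama ps
```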


@DarkIceXD commented on GitHub (Feb 25, 2025):

> It depends on what your inputs and outputs are. If it's just a conversation, then you probably want a large input buffer (`num_ctx`) to handle multiple turns, but only need a small number of tokens (`num_predict`) for the response. If you expect verbose responses, you would increase `num_predict` so as not to prematurely cut off the response. Note that if you make `num_ctx` larger it may push model layers off the GPU into system RAM, which will make the model run slower. You are borderline at the moment (`memory.available="[23.4 GiB 23.4 GiB]" memory.required.allocations="[23.3 GiB 22.5 GiB]"`); you can make room for more context by setting `OLLAMA_NUM_PARALLEL=1`. This will reduce the context memory requirement to a quarter of its current value, so you should be able to get `num_ctx` up to around 8192.

Actually, I don't really want to push it past VRAM and into RAM. Is there no other way than increasing `num_ctx` and `num_predict`? I feel like it's a stopgap solution until the context in one chat pushes it to the new limit, right?


@rick-github commented on GitHub (Feb 25, 2025):

Increasing `num_ctx` allows longer conversations to fit in the context buffer; `num_predict` protects against the model losing coherence. You don't need to do both.

Long conversations aren't normally a problem because, as mentioned, ollama will trim the conversation down to fit in the context buffer.

The model losing coherence seems to be the issue. This is supposed to be uncommon; if it's an ongoing issue for you, it might be tied to your hardware/software setup, which may also be the cause of the "GGGG" output. Have you considered running a GPU/VRAM tester to see if anything pops up? If you are dual booting, [OCCT](https://www.ocbase.com/download) or [gpumemtest](https://www.programming4beginners.com/gpumemtest); otherwise there is a (less featureful) open source one [here](https://github.com/ihaque/memtestG80). There's also the possibility of a software issue with multi-GPU cards, as raised earlier; you could try building ollama from source and setting `GGML_CUDA_NO_PEER_COPY` (or other flags) to see if it resolves the problem.
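A rough sketch of what that source build might look like, assuming the vendored ggml CMake option can simply be passed through (this follows the generic developer build steps and is untested here):

```
git clone https://github.com/ollama/ollama.git
cd ollama

# GGML_CUDA_NO_PEER_COPY is a ggml compile-time option; pass it via CMake
# when configuring the native backends, then build the Go binary.
cmake -B build -DGGML_CUDA_NO_PEER_COPY=ON
cmake --build build
go build .

./ollama serve
```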


@rick-github commented on GitHub (Apr 13, 2025):

Marking this as a dupe of #7606 for the stopping issue.

Reference: github-starred/ollama#31461