[GH-ISSUE #4545] Ollama stops serving requests after 10-15 minutes #2850

Closed
opened 2026-04-12 13:11:24 -05:00 by GiteaMirror · 54 comments
Owner

Originally created by @iganev on GitHub (May 20, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/4545

Originally assigned to: @dhiltgen on GitHub.

What is the issue?

Using ollama:latest with nvidia-docker and 2x4090.

Tried blasting a bunch of 256-word text snippets at ollama for embedding generation using all-minilm:l6-v2.

Every time I start the task, it works for about 10-15 minutes; then ollama initially hangs requests until they time out, and later requests respond immediately with the error {"error":"failed to generate embedding"}.
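
For reference, each request is just an ordinary POST to /api/embeddings carrying the model name and one text chunk. A rough standalone sketch of that request shape (using reqwest and serde_json purely for illustration here; the actual client code is posted further down in the thread) would be:

use serde_json::json;

fn main() -> Result<(), reqwest::Error> {
    // One embedding request, equivalent to what each worker sends.
    let client = reqwest::blocking::Client::new();
    let resp: serde_json::Value = client
        .post("http://localhost:11434/api/embeddings")
        .json(&json!({
            "model": "all-minilm:l6-v2",
            "prompt": "roughly 256 words of article-like text go here"
        }))
        .send()?
        .json()?;
    // On success the server answers with {"embedding": [...]}.
    println!("got {} dimensions", resp["embedding"].as_array().map_or(0, |a| a.len()));
    Ok(())
}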

Looking at the logs:

[GIN] 2024/05/20 - 20:50:30 | 200 |  445.487928ms |      172.20.1.1 | POST     "/api/embeddings"
[GIN] 2024/05/20 - 20:50:30 | 200 |   446.51795ms |      172.20.1.1 | POST     "/api/embeddings"
[GIN] 2024/05/20 - 20:50:30 | 200 |  441.551062ms |      172.20.1.1 | POST     "/api/embeddings"
[GIN] 2024/05/20 - 20:50:30 | 200 |  439.109191ms |      172.20.1.1 | POST     "/api/embeddings"
[GIN] 2024/05/20 - 20:50:30 | 200 |  395.145587ms |      172.20.1.1 | POST     "/api/embeddings"
[GIN] 2024/05/20 - 20:50:30 | 200 |  393.074237ms |      172.20.1.1 | POST     "/api/embeddings"
[GIN] 2024/05/20 - 20:50:30 | 200 |  434.758996ms |      172.20.1.1 | POST     "/api/embeddings"
[GIN] 2024/05/20 - 20:50:30 | 200 |  394.631605ms |      172.20.1.1 | POST     "/api/embeddings"
[GIN] 2024/05/20 - 20:50:30 | 200 |  401.354417ms |      172.20.1.1 | POST     "/api/embeddings"
[GIN] 2024/05/20 - 20:50:30 | 200 |  443.634336ms |      172.20.1.1 | POST     "/api/embeddings"
time=2024-05-20T21:48:01.015Z level=ERROR source=server.go:800 msg="Failed to acquire semaphore" error="context canceled"
time=2024-05-20T21:48:01.015Z level=ERROR source=server.go:800 msg="Failed to acquire semaphore" error="context canceled"
time=2024-05-20T21:48:01.015Z level=ERROR source=server.go:800 msg="Failed to acquire semaphore" error="context canceled"
time=2024-05-20T21:48:01.015Z level=ERROR source=server.go:800 msg="Failed to acquire semaphore" error="context canceled"
time=2024-05-20T21:48:01.015Z level=INFO source=routes.go:400 msg="embedding generation failed: context canceled"
time=2024-05-20T21:48:01.015Z level=INFO source=routes.go:400 msg="embedding generation failed: context canceled"
time=2024-05-20T21:48:01.015Z level=INFO source=routes.go:400 msg="embedding generation failed: context canceled"
time=2024-05-20T21:48:01.015Z level=ERROR source=server.go:800 msg="Failed to acquire semaphore" error="context canceled"
time=2024-05-20T21:48:01.015Z level=INFO source=routes.go:400 msg="embedding generation failed: context canceled"
time=2024-05-20T21:48:01.015Z level=INFO source=routes.go:400 msg="embedding generation failed: do embedding request: Post \"http://127.0.0.1:33007/embedding\": context canceled"
time=2024-05-20T21:48:01.015Z level=INFO source=routes.go:400 msg="embedding generation failed: context canceled"
time=2024-05-20T21:48:01.015Z level=ERROR source=server.go:800 msg="Failed to acquire semaphore" error="context canceled"
time=2024-05-20T21:48:01.015Z level=INFO source=routes.go:400 msg="embedding generation failed: context canceled"
[GIN] 2024/05/20 - 21:48:01 | 500 |        57m30s |      172.20.1.1 | POST     "/api/embeddings"
time=2024-05-20T21:48:01.015Z level=ERROR source=server.go:800 msg="Failed to acquire semaphore" error="context canceled"
time=2024-05-20T21:48:01.016Z level=INFO source=routes.go:400 msg="embedding generation failed: context canceled"
[GIN] 2024/05/20 - 21:48:01 | 500 |        57m30s |      172.20.1.1 | POST     "/api/embeddings"
[GIN] 2024/05/20 - 21:48:01 | 500 |        57m30s |      172.20.1.1 | POST     "/api/embeddings"
[GIN] 2024/05/20 - 21:48:01 | 500 |        57m30s |      172.20.1.1 | POST     "/api/embeddings"
[GIN] 2024/05/20 - 21:48:01 | 500 |        57m30s |      172.20.1.1 | POST     "/api/embeddings"
[GIN] 2024/05/20 - 21:48:01 | 500 |        57m30s |      172.20.1.1 | POST     "/api/embeddings"
[GIN] 2024/05/20 - 21:48:01 | 500 |        57m30s |      172.20.1.1 | POST     "/api/embeddings"
[GIN] 2024/05/20 - 21:48:01 | 500 |        57m30s |      172.20.1.1 | POST     "/api/embeddings"
[GIN] 2024/05/20 - 21:48:38 | 500 |  369.514148ms |      10.252.1.3 | POST     "/api/embeddings"
time=2024-05-20T21:48:38.515Z level=INFO source=routes.go:400 msg="embedding generation failed: no slots available after 10 retries"

Here's my startup log as well:

2024/05/20 19:54:46 routes.go:1008: INFO server config env="map[OLLAMA_DEBUG:false OLLAMA_LLM_LIBRARY: OLLAMA_MAX_LOADED_MODELS:1 OLLAMA_MAX_QUEUE:512 OLLAMA_MAX_VRAM:0 OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:1 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:*] OLLAMA_RUNNERS_DIR: OLLAMA_TMPDIR:]"
time=2024-05-20T19:54:46.610Z level=INFO source=images.go:704 msg="total blobs: 43"
time=2024-05-20T19:54:46.611Z level=INFO source=images.go:711 msg="total unused blobs removed: 0"
time=2024-05-20T19:54:46.612Z level=INFO source=routes.go:1054 msg="Listening on [::]:11434 (version 0.1.38)"
time=2024-05-20T19:54:46.612Z level=INFO source=payload.go:30 msg="extracting embedded files" dir=/tmp/ollama2944120655/runners
time=2024-05-20T19:54:47.964Z level=INFO source=payload.go:44 msg="Dynamic LLM libraries [cpu cpu_avx cpu_avx2 cuda_v11 rocm_v60002]"
time=2024-05-20T19:54:48.961Z level=INFO source=types.go:71 msg="inference compute" id=GPU-d419dbd5-adab-6e8b-e46b-4e45491c3e50 library=cuda compute=8.9 driver=12.4 name="NVIDIA GeForce RTX 4090" total="23.6 GiB" available="23.3 GiB"
time=2024-05-20T19:54:48.961Z level=INFO source=types.go:71 msg="inference compute" id=GPU-6da9f13b-9b65-b30a-fd59-910f358a7824 library=cuda compute=8.9 driver=12.4 name="NVIDIA GeForce RTX 4090" total="23.6 GiB" available="23.3 GiB"

The issue persists until I restart the container, then recurs within the next 10-15 minutes.

Any suggestions?

Thanks in advance <3

OS

Linux, Docker

GPU

Nvidia

CPU

Intel

Ollama version

0.1.38

GiteaMirror added the needs more info, bug labels 2026-04-12 13:11:24 -05:00
Author
Owner

@iganev commented on GitHub (May 20, 2024):

Tried bumping the queue to 1024 and the parallel count to 10. Watching the log live (-fn100), I see this popping up every once in a while:

[GIN] 2024/05/20 - 22:17:48 | 200 |  262.519652ms |      172.20.1.1 | POST     "/api/embeddings"
GGML_ASSERT: /go/src/github.com/ollama/ollama/llm/llama.cpp/llama.cpp:11010: seq_id < n_tokens && "seq_id cannot be larger than n_tokens with pooling_type == MEAN"
[GIN] 2024/05/20 - 22:17:48 | 200 |  312.790001ms |      172.20.1.1 | POST     "/api/embeddings"
time=2024-05-20T22:17:48.786Z level=INFO source=routes.go:400 msg="embedding generation failed: do embedding request: Post \"http://127.0.0.1:42663/embedding\": EOF"
time=2024-05-20T22:17:48.786Z level=INFO source=routes.go:400 msg="embedding generation failed: do embedding request: Post \"http://127.0.0.1:42663/embedding\": EOF"
[GIN] 2024/05/20 - 22:17:48 | 500 |  260.588554ms |      172.20.1.1 | POST     "/api/embeddings"
[GIN] 2024/05/20 - 22:17:48 | 500 |  253.715301ms |      172.20.1.1 | POST     "/api/embeddings"
time=2024-05-20T22:17:48.789Z level=INFO source=routes.go:400 msg="embedding generation failed: do embedding request: Post \"http://127.0.0.1:42663/embedding\": EOF"
[GIN] 2024/05/20 - 22:17:48 | 500 |  210.774869ms |      172.20.1.1 | POST     "/api/embeddings"
time=2024-05-20T22:17:48.789Z level=INFO source=routes.go:400 msg="embedding generation failed: health resp: Get \"http://127.0.0.1:42663/health\": dial tcp 127.0.0.1:42663: connect: connection refused"
[GIN] 2024/05/20 - 22:17:48 | 500 |   390.45269ms |      172.20.1.1 | POST     "/api/embeddings"
time=2024-05-20T22:17:53.914Z level=WARN source=sched.go:512 msg="gpu VRAM usage didn't recover within timeout" seconds=5.124544061
time=2024-05-20T22:17:54.163Z level=WARN source=sched.go:512 msg="gpu VRAM usage didn't recover within timeout" seconds=5.374084634
time=2024-05-20T22:17:54.185Z level=INFO source=memory.go:133 msg="offload to gpu" layers.requested=-1 layers.real=7 memory.available="23.3 GiB" memory.required.full="508.7 MiB" memory.required.partial="508.7 MiB" memory.required.kv="1.9 MiB" memory.weights.total="42.7 MiB" memory.weights.repeating="20.4 MiB" memory.weights.nonrepeating="22.4 MiB" memory.graph.full="3.8 MiB" memory.graph.partial="3.8 MiB"
time=2024-05-20T22:17:54.185Z level=INFO source=memory.go:133 msg="offload to gpu" layers.requested=-1 layers.real=7 memory.available="23.3 GiB" memory.required.full="508.7 MiB" memory.required.partial="508.7 MiB" memory.required.kv="1.9 MiB" memory.weights.total="42.7 MiB" memory.weights.repeating="20.4 MiB" memory.weights.nonrepeating="22.4 MiB" memory.graph.full="3.8 MiB" memory.graph.partial="3.8 MiB"
time=2024-05-20T22:17:54.185Z level=INFO source=server.go:320 msg="starting llama server" cmd="/tmp/ollama3420298939/runners/cuda_v11/ollama_llama_server --model /root/.ollama/models/blobs/sha256-797b70c4edf85907fe0a49eb85811256f65fa0f7bf52166b147fd16be2be4662 --ctx-size 2560 --batch-size 512 --embedding --log-disable --n-gpu-layers 7 --parallel 10 --port 45467"
time=2024-05-20T22:17:54.185Z level=INFO source=sched.go:338 msg="loaded runners" count=1
time=2024-05-20T22:17:54.185Z level=INFO source=server.go:504 msg="waiting for llama runner to start responding"
time=2024-05-20T22:17:54.186Z level=INFO source=server.go:540 msg="waiting for server to become available" status="llm server error"
INFO [main] build info | build=1 commit="952d03d" tid="124954675605504" timestamp=1716243474
INFO [main] system info | n_threads=8 n_threads_batch=-1 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | " tid="124954675605504" timestamp=1716243474 total_threads=32
INFO [main] HTTP server listening | hostname="127.0.0.1" n_threads_http="31" port="45467" tid="124954675605504" timestamp=1716243474
llama_model_loader: loaded meta data with 23 key-value pairs and 101 tensors from /root/.ollama/models/blobs/sha256-797b70c4edf85907fe0a49eb85811256f65fa0f7bf52166b147fd16be2be4662 (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = bert
llama_model_loader: - kv   1:                               general.name str              = all-MiniLM-L6-v2
llama_model_loader: - kv   2:                           bert.block_count u32              = 6
llama_model_loader: - kv   3:                        bert.context_length u32              = 512
llama_model_loader: - kv   4:                      bert.embedding_length u32              = 384
llama_model_loader: - kv   5:                   bert.feed_forward_length u32              = 1536
llama_model_loader: - kv   6:                  bert.attention.head_count u32              = 12
llama_model_loader: - kv   7:          bert.attention.layer_norm_epsilon f32              = 0.000000
llama_model_loader: - kv   8:                          general.file_type u32              = 1
llama_model_loader: - kv   9:                      bert.attention.causal bool             = false
llama_model_loader: - kv  10:                          bert.pooling_type u32              = 1
llama_model_loader: - kv  11:            tokenizer.ggml.token_type_count u32              = 2
llama_model_loader: - kv  12:                tokenizer.ggml.bos_token_id u32              = 101
llama_model_loader: - kv  13:                tokenizer.ggml.eos_token_id u32              = 102
llama_model_loader: - kv  14:                       tokenizer.ggml.model str              = bert
llama_model_loader: - kv  15:                      tokenizer.ggml.tokens arr[str,30522]   = ["[PAD]", "[unused0]", "[unused1]", "...
llama_model_loader: - kv  16:                      tokenizer.ggml.scores arr[f32,30522]   = [-1000.000000, -1000.000000, -1000.00...
llama_model_loader: - kv  17:                  tokenizer.ggml.token_type arr[i32,30522]   = [3, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  18:            tokenizer.ggml.unknown_token_id u32              = 100
llama_model_loader: - kv  19:          tokenizer.ggml.seperator_token_id u32              = 102
llama_model_loader: - kv  20:            tokenizer.ggml.padding_token_id u32              = 0
llama_model_loader: - kv  21:                tokenizer.ggml.cls_token_id u32              = 101
llama_model_loader: - kv  22:               tokenizer.ggml.mask_token_id u32              = 103
llama_model_loader: - type  f32:   63 tensors
llama_model_loader: - type  f16:   38 tensors
llm_load_vocab: mismatch in special tokens definition ( 7104/30522 vs 5/30522 ).
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = bert
llm_load_print_meta: vocab type       = WPM
llm_load_print_meta: n_vocab          = 30522
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: n_ctx_train      = 512
llm_load_print_meta: n_embd           = 384
llm_load_print_meta: n_head           = 12
llm_load_print_meta: n_head_kv        = 12
llm_load_print_meta: n_layer          = 6
llm_load_print_meta: n_rot            = 32
llm_load_print_meta: n_embd_head_k    = 32
llm_load_print_meta: n_embd_head_v    = 32
llm_load_print_meta: n_gqa            = 1
llm_load_print_meta: n_embd_k_gqa     = 384
llm_load_print_meta: n_embd_v_gqa     = 384
llm_load_print_meta: f_norm_eps       = 1.0e-12
llm_load_print_meta: f_norm_rms_eps   = 0.0e+00
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 1536
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 0
llm_load_print_meta: pooling type     = 1
llm_load_print_meta: rope type        = 2
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 512
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = 22M
llm_load_print_meta: model ftype      = F16
llm_load_print_meta: model params     = 22.57 M
llm_load_print_meta: model size       = 43.10 MiB (16.02 BPW) 
llm_load_print_meta: general.name     = all-MiniLM-L6-v2
llm_load_print_meta: BOS token        = 101 '[CLS]'
llm_load_print_meta: EOS token        = 102 '[SEP]'
llm_load_print_meta: UNK token        = 100 '[UNK]'
llm_load_print_meta: SEP token        = 102 '[SEP]'
llm_load_print_meta: PAD token        = 0 '[PAD]'
llm_load_print_meta: CLS token        = 101 '[CLS]'
llm_load_print_meta: MASK token       = 103 '[MASK]'
llm_load_print_meta: LF token         = 0 '[PAD]'
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:   yes
ggml_cuda_init: CUDA_USE_TENSOR_CORES: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
llm_load_tensors: ggml ctx size =    0.09 MiB
llm_load_tensors: offloading 6 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 7/7 layers to GPU
llm_load_tensors:        CPU buffer size =    22.73 MiB
llm_load_tensors:      CUDA0 buffer size =    20.37 MiB
...............................
llama_new_context_with_model: n_ctx      = 2560
llama_new_context_with_model: n_batch    = 512
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:      CUDA0 KV buffer size =    22.50 MiB
llama_new_context_with_model: KV self size  =   22.50 MiB, K (f16):   11.25 MiB, V (f16):   11.25 MiB
llama_new_context_with_model:        CPU  output buffer size =     0.00 MiB
llama_new_context_with_model:      CUDA0 compute buffer size =    17.00 MiB
llama_new_context_with_model:  CUDA_Host compute buffer size =     3.50 MiB
llama_new_context_with_model: graph nodes  = 221
llama_new_context_with_model: graph splits = 2
time=2024-05-20T22:17:54.413Z level=WARN source=sched.go:512 msg="gpu VRAM usage didn't recover within timeout" seconds=5.624121318
time=2024-05-20T22:17:54.437Z level=INFO source=server.go:540 msg="waiting for server to become available" status="llm server loading model"
INFO [main] model loaded | tid="124954675605504" timestamp=1716243474
time=2024-05-20T22:17:54.688Z level=INFO source=server.go:545 msg="llama runner started in 0.50 seconds"
[GIN] 2024/05/20 - 22:17:54 | 200 |  6.186218241s |      172.20.1.1 | POST     "/api/embeddings"
Author
Owner

@jmorganca commented on GitHub (May 20, 2024):

Sorry this happened – may I ask how you're sending the embedding requests? Hoping to reproduce on my side ASAP!

Author
Owner

@iganev commented on GitHub (May 20, 2024):

I am using ollama-rs (https://crates.io/crates/ollama-rs):

let ollama_host = env::var("OLLAMA_HOST").expect("Ollama host not configured.");
let ollama = Ollama::new(format!("http://{}", ollama_host), 11434);

let res = ollama
        .generate_embeddings("all-minilm:l6-v2".to_string(), chunk.clone(), None)
        .await;

Nothing out of the ordinary. It happens every 5-10 seconds now: a bunch of requests pass without issue, then it glitches and resumes a few moments later. I will check whether the payload itself causes a reproducible error sequence or not. Just a sec.

Author
Owner

@iganev commented on GitHub (May 20, 2024):

It doesn't seem to be payload-dependent. I tested the payloads manually using curl, both with the last successful embedding before an error and with the first failing one. Unsuccessful requests succeed when retried a few seconds later, once the service reloads.

It might have something to do with the fact that I run 10 workers blasting requests in parallel.

My OLLAMA_NUM_PARALLEL is currently 10 and the queue is at 1024. In theory I shouldn't be hitting the queue at all, because the number of workers matches the number of parallel requests ollama expects.

Do you need me to test something specific and report back the result?

Author
Owner

@iganev commented on GitHub (May 20, 2024):

OK, here's a reproducible example. I asked llama3 for exactly 256 words of placeholder text that sounds like a news article, with a "{}" placeholder to be replaced by a number. I made a mini-version of what I am actually doing with Ollama and let it blast 1 million requests, 8 at a time in parallel. The first error occurs somewhere in the first few dozen requests; one run failed on the 63rd request, another on the 41st.

Here's the exact code I am using:

use std::env;

use anyhow::Result;
use clap::Parser;
use futures::{stream::FuturesUnordered, StreamExt};
use ollama_rs::Ollama;

pub const CONCURRENT_WORKERS: usize = 8;

#[derive(Parser, Debug)]
#[command(version, about, long_about = None)]
struct Args {
    /// Number of threads
    #[arg(short = 't', long)]
    threads: Option<usize>,
}

#[tokio::main]
async fn main() -> Result<()> {
    dotenvy::dotenv()?;

    let args = Args::parse();

    let tasks_limit = if let Some(threads) = args.threads {
        threads
    } else {
        CONCURRENT_WORKERS
    };

    let ollama_host = env::var("OLLAMA_HOST").expect("Ollama host not configured.");
    let ollama = Ollama::new(format!("http://{}", ollama_host), 11434);

    println!("Working...");
    let mut futs = FuturesUnordered::new();
    for i in 1..1_000_000 {
        let chunk = format!(
            r#"Breaking News: Economy Sees Slight Uptick in {} Quarter
        According to recent reports, the economy has seen a slight increase in growth over the past few months. Experts attribute this upswing to a combination of factors, including increased consumer spending and a boost in small business confidence.
        The news comes as a welcome relief after a tumultuous year that saw market fluctuations and economic uncertainty. "This is a promising sign for the economy," said Dr. Jane Smith, a leading economist at Harvard University. "We're seeing a renewed sense of optimism among consumers and businesses alike."
        In related news, the housing market has also seen a surge in recent months. With interest rates at an all-time low, many Americans are taking advantage of the opportunity to purchase or refinance their homes.
        However, not everyone is convinced that this upswing will last. "We need to be cautious," said Senator John Doe, a leading voice on economic policy. "While this news is certainly welcome, we can't ignore the underlying issues that still plague our economy."
        As the year draws to a close, economists and policymakers alike will be watching closely to see if this trend continues. Will it be enough to propel us towards a full recovery? Only time will tell.
        In other news, the {} annual conference on economic development is set to take place next month in Washington D.C. The event will bring together leading experts from around the world to discuss the latest trends and strategies for promoting economic growth."#,
            i, i
        );

        let fut = ollama_worker(chunk, &ollama);

        futs.push(fut);

        if futs.len() == tasks_limit {
            let _ = futs.next().await;
        }
    }

    // wait for the rest
    while let Some(()) = futs.next().await {}

    Ok(())
}

pub async fn ollama_worker(chunk: String, ollama: &Ollama) {
    let res = ollama
        .generate_embeddings("all-minilm:l6-v2".to_string(), chunk.clone(), None)
        .await;

    match res {
        Ok(_embeddings) => {
            //println!("Generated embeddings for: {}", &chunk);
        }
        Err(err) => {
            eprintln!(
                "Error fetching embeddings from Ollama: {}\nChunk: {}",
                err, &chunk
            );
        }
    }
}

Cargo.toml:

[package]
name = "ollama-test"
version = "0.1.0"
edition = "2021"

[dependencies]
anyhow = "1"
tokio = { version = "1", features = ["full", "rt-multi-thread"] }
futures = "0"
clap = { version = "4", features = ["derive"] }
dotenvy = "0"
ollama-rs = { version = "0", features = ["rustls", "tokio", "tokio-stream"] }

.env file:

OLLAMA_HOST="localhost"

The Ollama docker-compose.yml:

services:
  ollama:
    image: ollama/ollama
    container_name: ollama
    restart: always
    privileged: true
    ports:
      - '11434:11434'
    volumes:
      - ollama-data:/root/.ollama
      - /dev:/dev
    environment:
      - OLLAMA_KEEP_ALIVE=-1
      - OLLAMA_MAX_QUEUE=1024
      - OLLAMA_NUM_PARALLEL=20
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]

  ollama-ui:
    image: ghcr.io/open-webui/open-webui:main
    container_name: ollama-ui
    restart: always
    depends_on:
      - ollama
    ports:
      - '8080:8080'
    volumes:
      - ollama-ui-data:/app/backend/data
    environment:
      - OLLAMA_BASE_URL=http://ollama:11434
volumes:
  ollama-data:
  ollama-ui-data:
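
With the compose stack above running and the model pulled (ollama pull all-minilm:l6-v2), the test client can be started with something like cargo run --release -- -t 8, using the -t/--threads flag defined in the code above to control the number of concurrent workers.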
Author
Owner

@iganev commented on GitHub (May 20, 2024):

Here's a debug-enabled log from a test run of the above setup:
ollama_log.txt (https://github.com/ollama/ollama/files/15382623/ollama_log.txt)

It took around 100 requests to encounter an error, so the log isn't too crowded.

Let me know if I can be of any more help.

Author
Owner

@lyriccoder commented on GitHub (May 21, 2024):

@jmorganca I am encountering the same issue, though in my case without Docker. I installed Ollama using curl, and the service starts automatically; as far as I know there is no Docker involved, but I experience the same problem: after 20-30 minutes the service hangs. I am using Ubuntu 23.04 with two 4090 GPUs. Consequently, I have to restart the service every 30 minutes.

Here is how I send requests:

import json
import time

import requests

def send_req_llama(prompt, ip, port):
    fail = 0
    resp, res, status_code = None, None, None  # keep these defined if the very first attempt raises
    while fail < 10:
        try:
            resp = requests.post(url=f'http://{ip}:{port}/api/generate',
                                 data=json.dumps({"model": "llama3", "prompt": prompt, 'stream': False}))
            status_code = resp.status_code
            res = resp.json()['response']
            break
        except Exception:
            # retry up to 10 times with a short pause
            fail += 1
            time.sleep(5)
            if resp is not None:
                res = resp.content
            print(resp, status_code)
    return res, status_code

Could you please help?

Author
Owner

@Levichev commented on GitHub (May 21, 2024):

@lyriccoder same issue

Author
Owner

@ChenhuaYang commented on GitHub (May 21, 2024):

We have the same issue.

Author
Owner

@Adefful commented on GitHub (May 21, 2024):

I thought I was the only one with this problem. I have an RTX 4090. After 20 minutes, it freezes. I have to restart it. Six months ago, I had the same problem, and it hasn't gone away.

Author
Owner

@lucasbarrosomk6 commented on GitHub (May 21, 2024):

I have an RTX 4050, running llama3, and I'm getting this issue under high load:

[GIN] 2024/05/21 - 14:36:23 | 500 |     2.0974796s |       127.0.0.1 | POST     "/api/generate"
time=2024-05-21T14:36:23.400-07:00 level=ERROR source=server.go:624 msg="Failed to acquire semaphore" error="context canceled"
[GIN] 2024/05/21 - 14:36:23 | 500 |     3.2332771s |       127.0.0.1 | POST     "/api/generate"
time=2024-05-21T14:36:23.400-07:00 level=ERROR source=server.go:624 msg="Failed to acquire semaphore" error="context canceled"
[GIN] 2024/05/21 - 14:36:23 | 500 |     805.2666ms |       127.0.0.1 | POST     "/api/generate"
time=2024-05-21T14:36:23.401-07:00 level=ERROR source=server.go:624 msg="Failed to acquire semaphore" error="context canceled"
[GIN] 2024/05/21 - 14:36:23 | 500 |     5.8394386s |       127.0.0.1 | POST     "/api/generate"
[GIN] 2024/05/21 - 14:36:23 | 500 |     4.7348779s |       127.0.0.1 | POST     "/api/generate"

Author
Owner

@iganev commented on GitHub (May 21, 2024):

Got myself a piece of weirdly broken text that reliably crashes ollama every time. Apparently the problem I am experiencing is partly caused by payloads like that. Not sure how that relates to the rest of the issues.

I got this from poppler when reading a PDF file... there's something seriously messed up with it. Any version of poppler I tried produced the same text and it does indeed make ollama scream.

116                                                                                                                                 

P.S.: Turns out those come from PDFs with "CID Fonts", which have "CID Maps" or "CMaps" that basically map their font characters to regular Unicode characters, usually by adding a certain byte offset, in my case F000. These payloads crash ollama 100% of the time.

This is some clue to what's going on, but certainly not the full picture, as I can make it crash regularly with "normal" text.
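
For anyone trying to reproduce this locally, a rough way to fabricate a similar payload (assuming the problem characters sit in the U+F0xx private-use range mentioned above; the helper below is purely illustrative) would be:

// Build a string of private-use-area characters around U+F000, similar to what
// poppler emits for these CID-mapped fonts, and prepend the "116" seen above.
fn pua_payload(len: usize) -> String {
    (0..len)
        .filter_map(|i| char::from_u32(0xF000 + (i as u32 % 0x100)))
        .collect()
}

fn main() {
    let chunk = format!("116 {}", pua_payload(200));
    println!("{}", chunk);
}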

Author
Owner

@iganev commented on GitHub (May 21, 2024):

Ok, here's a reproducible example. I asked llama3 for exactly 256 words of placeholder text sounding like a news article, with a "{}" placeholder to be replaced by a number. I made a mini-version of what I am actually doing with Ollama and let it blast 1 million requests, 8 at a time in parallel. The first error occurs somewhere in the first few dozen requests; one run failed on the 63rd request, another on the 41st.

Here's the exact code I am using:

```rust
use std::env;

use anyhow::Result;
use clap::Parser;
use futures::{stream::FuturesUnordered, StreamExt};
use ollama_rs::Ollama;

pub const CONCURRENT_WORKERS: usize = 8;

#[derive(Parser, Debug)]
#[command(version, about, long_about = None)]
struct Args {
    /// Number of threads
    #[arg(short = 't', long)]
    threads: Option<usize>,
}

#[tokio::main]
async fn main() -> Result<()> {
    dotenvy::dotenv()?;

    let args = Args::parse();

    let tasks_limit = if let Some(threads) = args.threads {
        threads
    } else {
        CONCURRENT_WORKERS
    };

    let ollama_host = env::var("OLLAMA_HOST").expect("Ollama host not configured.");
    let ollama = Ollama::new(format!("http://{}", ollama_host), 11434);

    println!("Working...");
    let mut futs = FuturesUnordered::new();
    for i in 1..1_000_000 {
        let chunk = format!(
            r#"Breaking News: Economy Sees Slight Uptick in {} Quarter
        According to recent reports, the economy has seen a slight increase in growth over the past few months. Experts attribute this upswing to a combination of factors, including increased consumer spending and a boost in small business confidence.
        The news comes as a welcome relief after a tumultuous year that saw market fluctuations and economic uncertainty. "This is a promising sign for the economy," said Dr. Jane Smith, a leading economist at Harvard University. "We're seeing a renewed sense of optimism among consumers and businesses alike."
        In related news, the housing market has also seen a surge in recent months. With interest rates at an all-time low, many Americans are taking advantage of the opportunity to purchase or refinance their homes.
        However, not everyone is convinced that this upswing will last. "We need to be cautious," said Senator John Doe, a leading voice on economic policy. "While this news is certainly welcome, we can't ignore the underlying issues that still plague our economy."
        As the year draws to a close, economists and policymakers alike will be watching closely to see if this trend continues. Will it be enough to propel us towards a full recovery? Only time will tell.
        In other news, the {} annual conference on economic development is set to take place next month in Washington D.C. The event will bring together leading experts from around the world to discuss the latest trends and strategies for promoting economic growth."#,
            i, i
        );

        let fut = ollama_worker(chunk, &ollama);

        futs.push(fut);

        if futs.len() == tasks_limit {
            let _ = futs.next().await;
        }
    }

    // wait for the rest
    while let Some(()) = futs.next().await {}

    Ok(())
}

pub async fn ollama_worker(chunk: String, ollama: &Ollama) {
    let res = ollama
        .generate_embeddings("all-minilm:l6-v2".to_string(), chunk.clone(), None)
        .await;

    match res {
        Ok(_embeddings) => {
            //println!("Generated embeddings for: {}", &chunk);
        }
        Err(err) => {
            eprintln!(
                "Error fetching embeddings from Ollama: {}\nChunk: {}",
                err, &chunk
            );
        }
    }
}
```
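
(For reference, a minimal sketch of how this harness would typically be built and invoked, assuming the Cargo project and `.env` file shown below:)

```bash
# Sketch: build and run the load generator with 8 concurrent workers,
# reading OLLAMA_HOST from the .env file shown below.
cargo run --release -- --threads 8
```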

Cargo.toml:

```toml
[package]
name = "ollama-test"
version = "0.1.0"
edition = "2021"

[dependencies]
anyhow = "1"
tokio = { version = "1", features = ["full", "rt-multi-thread"] }
futures = "0"
clap = { version = "4", features = ["derive"] }
dotenvy = "0"
ollama-rs = { version = "0", features = ["rustls", "tokio", "tokio-stream"] }
```

.env file:

```
OLLAMA_HOST="localhost"
```

The Ollama docker-compose.yml:

```yaml
services:
  ollama:
    image: ollama/ollama
    container_name: ollama
    restart: always
    privileged: true
    ports:
      - '11434:11434'
    volumes:
      - ollama-data:/root/.ollama
      - /dev:/dev
    environment:
      - OLLAMA_KEEP_ALIVE=-1
      - OLLAMA_MAX_QUEUE=1024
      - OLLAMA_NUM_PARALLEL=20
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]

  ollama-ui:
    image: ghcr.io/open-webui/open-webui:main
    container_name: ollama-ui
    restart: always
    depends_on:
      - ollama
    ports:
      - '8080:8080'
    volumes:
      - ollama-ui-data:/app/backend/data
    environment:
      - OLLAMA_BASE_URL=http://ollama:11434
volumes:
  ollama-data:
  ollama-ui-data:
```

So here's how I can reproduce the issue with staggering reliability:

1. Run the code from this post to keep ollama preoccupied.
2. Pick one of the following "interesting" payloads as a metaphorical wrench in the metaphorical gears:

   ```
    debt 
   ```

   or

   ```
   and harvard business review • september 2003
   ```

   or even

   ```
   over the brokerage industry. 8
   ```

3. Open another terminal and run the following curl:

   ```bash
   curl http://localhost:11434/api/embeddings -d '{
     "model": "all-minilm:l6-v2",
     "prompt": "over the brokerage industry. 8"
   }'
   ```

4. Re-send a few times if it doesn't work the first time.
5. ???
6. PROFIT

```
GGML_ASSERT: /go/src/github.com/ollama/ollama/llm/llama.cpp/llama.cpp:11010: seq_id < n_tokens && "seq_id cannot be larger than n_tokens with pooling_type == MEAN"
```

My theory is that whenever something untokenizable comes in, it breaks the LLM runner and causes a crash.

This happens only when `OLLAMA_NUM_PARALLEL` is greater than 1.

N.B.!
My initial post was regarding a different error that I encountered when `OLLAMA_NUM_PARALLEL` was 1. I have yet to find a reliable way to reproduce that one as well...

P.S.: Further testing seems to reproduce the initially reported issue when `OLLAMA_NUM_PARALLEL` = 1 as well. These look like different symptoms of the same thing; when parallelism is enabled it appears to somewhat act like a partial fix...


@mruniverse8 commented on GitHub (May 22, 2024):

I have the same issue; the ollama-model service fails every 20-30 minutes.


@VecherVhatuX commented on GitHub (May 22, 2024):

Yo, bro, thanks for the issue. Got a serious beef with that Ollama service also. Tried to get some fancy script running to make the model spit out smart stuff, but it’s just sittin’ there like a lazy bum. Can’t figure out what’s trippin’ it up, and now my science experiments are in the toilet. Can you fix this mess? By the way, your service is the bomb, much love!


@ciekawy commented on GitHub (May 22, 2024):

I am getting the same error when using just the llama.cpp server with --embeddings and -np > 1 (number of slots for processing requests), but only with some models.


@jmorganca commented on GitHub (Jun 2, 2024):

Hi all, looking into this – sorry you're seeing Ollama hang.


@travisgu commented on GitHub (Jun 6, 2024):

Similar issue, Ollama fails to serve embeddings every 5 minutes:
![image](https://github.com/ollama/ollama/assets/10804665/b9865f8b-36b4-40aa-8b0d-7cc0bc65eb24)


@mfriedman-pr commented on GitHub (Jun 11, 2024):

I also seem to have this issue in 0.1.42. After a number of sequential generate calls, the server will not respond. Subsequent calls will return this error: `ollama Response Error: no slots available after 10 retries`


@kojinglick-ctec commented on GitHub (Jun 13, 2024):

I've noticed it only happens with `OLLAMA_NUM_PARALLEL` > 3. Using NVIDIA GeForce RTX 4090 on WSL2.


@advay-pakhale commented on GitHub (Jun 18, 2024):

> I've noticed it only happens with `OLLAMA_NUM_PARALLEL` > 3. Using NVIDIA GeForce RTX 4090 on WSL2.

Haven't tested the exact value of `OLLAMA_NUM_PARALLEL` that causes inference to fail, but it definitely has something to do with it. Not setting the value fixes things.


@iganev commented on GitHub (Jun 19, 2024):

> > I've noticed it only happens with `OLLAMA_NUM_PARALLEL` > 3. Using NVIDIA GeForce RTX 4090 on WSL2.
>
> Haven't tested the exact value of `OLLAMA_NUM_PARALLEL` that causes inference to fail, but it definitely has something to do with it. Not setting the value fixes things.

I find this not to be the case. It just errors out in a slightly different and less recoverable way.


@jmccrosky commented on GitHub (Jun 20, 2024):

I'm also getting this I think. After some unpredictable number of llava calls with an image and prompt, the server will seemingly hang until I drop the connection. Subsequent calls give `ollama._types.ResponseError: no slots available after 10 retries`.

```
[GIN] 2024/06/20 - 14:10:37 | 200 | 317.913792ms | 127.0.0.1 | POST "/api/chat"
[GIN] 2024/06/20 - 14:10:40 | 200 | 3.044791542s | 127.0.0.1 | POST "/api/chat"
[GIN] 2024/06/20 - 14:10:41 | 200 | 325.550292ms | 127.0.0.1 | POST "/api/chat"
[GIN] 2024/06/20 - 14:10:45 | 200 | 4.189981541s | 127.0.0.1 | POST "/api/chat"
[GIN] 2024/06/20 - 14:10:46 | 200 | 373.940917ms | 127.0.0.1 | POST "/api/chat"
[GIN] 2024/06/20 - 14:10:52 | 200 | 5.615178041s | 127.0.0.1 | POST "/api/chat"
[GIN] 2024/06/20 - 14:10:52 | 200 | 418.198541ms | 127.0.0.1 | POST "/api/chat"
[GIN] 2024/06/20 - 14:14:57 | 500 | 4m4s | 127.0.0.1 | POST "/api/chat"
[GIN] 2024/06/20 - 14:18:14 | 500 | 76.112916ms | 127.0.0.1 | POST "/api/chat"
```


@LoopControl commented on GitHub (Jun 21, 2024):

I'm also getting this no-slots-available error after 15-20 minutes as well.

I'm running Ollama directly on the Ubuntu host, and parallel requests in ollama have been disabled to try to make it more stable, but it still happens. After it hangs, it throws 500s when trying to use the API.

Having to restart the API server manually every 15-ish minutes makes it unusable for long-running jobs.


@SebastianGode commented on GitHub (Jun 25, 2024):

I've got the same issue when trying to generate embeddings with multiple threads running on an Nvidia T4.

```
ollama  | GGML_ASSERT: /go/src/github.com/ollama/ollama/llm/llama.cpp/llama.cpp:12095: seq_id < n_tokens && "seq_id cannot be larger than n_tokens with pooling_type == CLS"
ollama  | [GIN] 2024/06/25 - 14:34:41 | 200 |   967.02334ms |      172.30.0.3 | POST     "/api/embeddings"
ollama  | time=2024-06-25T14:34:41.450Z level=INFO source=routes.go:400 msg="embedding generation failed: do embedding request: Post \"http://127.0.0.1:43813/embedding\": EOF"
ollama  | time=2024-06-25T14:34:41.450Z level=INFO source=routes.go:400 msg="embedding generation failed: health resp: Get \"http://127.0.0.1:43813/health\": EOF"
ollama  | [GIN] 2024/06/25 - 14:34:41 | 500 |  1.233321395s |      172.30.0.3 | POST     "/api/embeddings"
ollama  | [GIN] 2024/06/25 - 14:34:41 | 500 |  1.103395244s |      172.30.0.3 | POST     "/api/embeddings"
ollama  | time=2024-06-25T14:34:46.724Z level=WARN source=sched.go:575 msg="gpu VRAM usage didn't recover within timeout" seconds=5.24813948 model=/root/.ollama/models/blobs/sha256-819c2adf5ce6df2b6bd2ae4ca90d2a69f060afeb438d0c171db57daa02e39c3d
ollama  | time=2024-06-25T14:34:46.975Z level=WARN source=sched.go:575 msg="gpu VRAM usage didn't recover within timeout" seconds=5.498807909 model=/root/.ollama/models/blobs/sha256-819c2adf5ce6df2b6bd2ae4ca90d2a69f060afeb438d0c171db57daa02e39c3d
ollama  | time=2024-06-25T14:34:47.226Z level=INFO source=memory.go:309 msg="offload to cuda" layers.requested=-1 layers.model=25 layers.offload=25 layers.split="" memory.available="[14.5 GiB]" memory.required.full="1.2 GiB" memory.required.partial="1.2 GiB" memory.required.kv="30.0 MiB" memory.required.allocations="[1.2 GiB]" memory.weights.total="607.2 MiB" memory.weights.repeating="547.6 MiB" memory.weights.nonrepeating="59.6 MiB" memory.graph.full="80.0 MiB" memory.graph.partial="80.0 MiB"
ollama  | time=2024-06-25T14:34:47.227Z level=INFO source=server.go:359 msg="starting llama server" cmd="/tmp/ollama2786287991/runners/cuda_v11/ollama_llama_server --model /root/.ollama/models/blobs/sha256-819c2adf5ce6df2b6bd2ae4ca90d2a69f060afeb438d0c171db57daa02e39c3d --ctx-size 5120 --batch-size 512 --embedding --log-disable --n-gpu-layers 25 --parallel 10 --port 36791"
ollama  | time=2024-06-25T14:34:47.227Z level=INFO source=sched.go:382 msg="loaded runners" count=1
ollama  | time=2024-06-25T14:34:47.227Z level=INFO source=server.go:547 msg="waiting for llama runner to start responding"
ollama  | time=2024-06-25T14:34:47.227Z level=INFO source=server.go:585 msg="waiting for server to become available" status="llm server error"
ollama  | INFO [main] build info | build=1 commit="7c26775" tid="139639105748992" timestamp=1719326087
ollama  | INFO [main] system info | n_threads=4 n_threads_batch=-1 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | " tid="139639105748992" timestamp=1719326087 total_threads=8
ollama  | INFO [main] HTTP server listening | hostname="127.0.0.1" n_threads_http="12" port="36791" tid="139639105748992" timestamp=1719326087
ollama  | llama_model_loader: loaded meta data with 23 key-value pairs and 389 tensors from /root/.ollama/models/blobs/sha256-819c2adf5ce6df2b6bd2ae4ca90d2a69f060afeb438d0c171db57daa02e39c3d (version GGUF V3 (latest))
ollama  | llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
ollama  | llama_model_loader: - kv   0:                       general.architecture str              = bert
ollama  | llama_model_loader: - kv   1:                               general.name str              = mxbai-embed-large-v1
ollama  | llama_model_loader: - kv   2:                           bert.block_count u32              = 24
ollama  | llama_model_loader: - kv   3:                        bert.context_length u32              = 512
ollama  | llama_model_loader: - kv   4:                      bert.embedding_length u32              = 1024
ollama  | llama_model_loader: - kv   5:                   bert.feed_forward_length u32              = 4096
ollama  | llama_model_loader: - kv   6:                  bert.attention.head_count u32              = 16
ollama  | llama_model_loader: - kv   7:          bert.attention.layer_norm_epsilon f32              = 0.000000
ollama  | llama_model_loader: - kv   8:                          general.file_type u32              = 1
ollama  | llama_model_loader: - kv   9:                      bert.attention.causal bool             = false
ollama  | llama_model_loader: - kv  10:                          bert.pooling_type u32              = 2
ollama  | llama_model_loader: - kv  11:            tokenizer.ggml.token_type_count u32              = 2
ollama  | llama_model_loader: - kv  12:                tokenizer.ggml.bos_token_id u32              = 101
ollama  | llama_model_loader: - kv  13:                tokenizer.ggml.eos_token_id u32              = 102
ollama  | llama_model_loader: - kv  14:                       tokenizer.ggml.model str              = bert
ollama  | llama_model_loader: - kv  15:                      tokenizer.ggml.tokens arr[str,30522]   = ["[PAD]", "[unused0]", "[unused1]", "...
ollama  | llama_model_loader: - kv  16:                      tokenizer.ggml.scores arr[f32,30522]   = [-1000.000000, -1000.000000, -1000.00...
ollama  | llama_model_loader: - kv  17:                  tokenizer.ggml.token_type arr[i32,30522]   = [3, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
ollama  | llama_model_loader: - kv  18:            tokenizer.ggml.unknown_token_id u32              = 100
ollama  | llama_model_loader: - kv  19:          tokenizer.ggml.seperator_token_id u32              = 102
ollama  | llama_model_loader: - kv  20:            tokenizer.ggml.padding_token_id u32              = 0
ollama  | llama_model_loader: - kv  21:                tokenizer.ggml.cls_token_id u32              = 101
ollama  | llama_model_loader: - kv  22:               tokenizer.ggml.mask_token_id u32              = 103
```
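
The kind of concurrent load described here can be sketched roughly like this (the model name is inferred from the `general.name` metadata above and may differ from the exact tag in use):

```bash
# Sketch: fire 20 embedding requests concurrently against one embedding model.
for i in $(seq 1 20); do
  curl -s http://localhost:11434/api/embeddings \
    -d "{\"model\": \"mxbai-embed-large\", \"prompt\": \"test sentence $i\"}" &
done
wait
```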

@qdaioc commented on GitHub (Jul 15, 2024):

Same issue, any fix? thanks!


@hybra commented on GitHub (Jul 15, 2024):

I just had the same issue "no slots available after 10 retries", I don't know if this could be related or not, but the Ollama menubar icon was showing that there was an update available "Restart to update" - is maybe the server locked somehow when an update is pending?
Mac version 0.1.48 updated to 0.2.5


@iganev commented on GitHub (Jul 17, 2024):

> I just had the same issue "no slots available after 10 retries", I don't know if this could be related or not, but the Ollama menubar icon was showing that there was an update available "Restart to update" - is maybe the server locked somehow when an update is pending? Mac version 0.1.48 updated to 0.2.5

Unfortunately, that's not the case. There's (imho) an issue with the underlying runner that ollama uses (namely `llama.cpp`) that isn't resolved and doesn't get handled properly when it occurs.


@Morphus1 commented on GitHub (Aug 11, 2024):

Yup, think you're right. Not running ollama, just the Java wrapper for llama.cpp:

```
GGML_ASSERT(seq_id < n_tokens && "seq_id cannot be larger than n_tokens with pooling_type == MEAN") failed
```

~~with latest update~~ Sorry, "GIT_TAG b3534"

Rebuilt the wrapper with the latest:

```
GGML_ASSERT(seq_id < n_tokens && "seq_id cannot be larger than n_tokens with pooling_type == CLS") failed
GGML_ASSERT(seq_id < n_tokens && "seq_id cannot be larger than n_tokens with pooling_type == MEAN") failed
```


@figaroserg1 commented on GitHub (Sep 3, 2024):

Getting this error with the **llava** model after just a few requests: **no slots available after 10 retries**

```
2024-09-03 11:44:39,890 - INFO - HTTP Request: POST http://127.0.0.1:11434/api/chat "HTTP/1.1 500 Internal Server Error"
2024-09-03 11:44:39,890 - DEBUG - receive_response_body.started request=<Request [b'POST']>
2024-09-03 11:44:39,890 - DEBUG - receive_response_body.complete
2024-09-03 11:44:39,890 - DEBUG - response_closed.started
2024-09-03 11:44:39,890 - DEBUG - response_closed.complete
2024-09-03 11:44:39,896 - DEBUG - load_ssl_context verify=True cert=None trust_env=True http2=False
2024-09-03 11:44:39,897 - ERROR - Error processing file: Ready\August 22 Gen1\test (108).jpg - no slots available after 10 retries
2024-09-03 11:44:39,898 - DEBUG - load_verify_locations cafile='C:\Users\serg\AppData\Local\pypoetry\Cache\virtualenvs\autotagger-E-BNvv9j-py3.11\Lib\site-packages\certifi\cacert.pem'
2024-09-03 11:44:40,462 - DEBUG - send_request_headers.started request=<Request [b'POST']>
2024-09-03 11:44:40,463 - DEBUG - send_request_headers.complete
2024-09-03 11:44:40,463 - DEBUG - send_request_body.started request=<Request [b'POST']>
2024-09-03 11:44:40,476 - DEBUG - send_request_body.complete
2024-09-03 11:44:40,476 - DEBUG - receive_response_headers.started request=<Request [b'POST']>
2024-09-03 11:44:41,225 - DEBUG - receive_response_headers.complete return_value=(b'HTTP/1.1', 500, b'Internal Server Error', [(b'Content-Type', b'application/json; charset=utf-8'), (b'Date', b'Tue, 03 Sep 2024 08:44:41 GMT'), (b'Content-Length', b'47')])
```


@mjspeck commented on GitHub (Sep 3, 2024):

This is also a major issue for my team. We're running Ollama on an Ubuntu server with two H100s. We try to call Llava 1.6 with a large text prompt and an image and experience the exact same thing. We also do NOT have `OLLAMA_NUM_PARALLEL` set.

> I'm also getting this I think. After some unpredictable number of llava calls with an image and prompt, the server will seemingly hang until I drop the connection. Subsequent calls give `ollama._types.ResponseError: no slots available after 10 retries`.
>
> [GIN] 2024/06/20 - 14:10:37 | 200 | 317.913792ms | 127.0.0.1 | POST "/api/chat" [GIN] 2024/06/20 - 14:10:40 | 200 | 3.044791542s | 127.0.0.1 | POST "/api/chat" [GIN] 2024/06/20 - 14:10:41 | 200 | 325.550292ms | 127.0.0.1 | POST "/api/chat" [GIN] 2024/06/20 - 14:10:45 | 200 | 4.189981541s | 127.0.0.1 | POST "/api/chat" [GIN] 2024/06/20 - 14:10:46 | 200 | 373.940917ms | 127.0.0.1 | POST "/api/chat" [GIN] 2024/06/20 - 14:10:52 | 200 | 5.615178041s | 127.0.0.1 | POST "/api/chat" [GIN] 2024/06/20 - 14:10:52 | 200 | 418.198541ms | 127.0.0.1 | POST "/api/chat" [GIN] 2024/06/20 - 14:14:57 | 500 | 4m4s | 127.0.0.1 | POST "/api/chat" [GIN] 2024/06/20 - 14:18:14 | 500 | 76.112916ms | 127.0.0.1 | POST "/api/chat"

From testing a bit, this issue seems to come from longer prompts. Shortening our prompt made the error go away.


@liimiing commented on GitHub (Sep 13, 2024):

Same error, how to resolve it?
Added image 'E:\At_Work\2a94e678-6fed-44e7-84c2-b69d92b43f33.jpg'
Error: no slots available after 10 retries


@iganev commented on GitHub (Sep 13, 2024):

> Same error, how to resolve it? Added image 'E:\At_Work\2a94e678-6fed-44e7-84c2-b69d92b43f33.jpg' Error: no slots available after 10 retries

You can partially mitigate the issue by enabling parallelism. It will still occur, but it will self-recover and start serving requests again in a few seconds.

No permanent solution so far.
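
For anyone looking for the setting, a minimal sketch of enabling parallelism on the server (the value 4 is only an example; the docker-compose earlier in this thread sets the same variable under `environment:`):

```bash
# Sketch: start the server with parallel request slots enabled
# (4 is an arbitrary example value).
OLLAMA_NUM_PARALLEL=4 ollama serve
```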


@HTK-Tech commented on GitHub (Sep 30, 2024):

My workaround ATM when I get this error is to make a request with only the keep_alive parameter set to 0 to unload the model, wait one second, and then resume making requests normally.

```js
await ollama.chat({
    model: 'llava:34b',
    keep_alive: 0,
});
```
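
For reference, a roughly equivalent raw API call (a sketch, assuming the default localhost:11434 address; the empty `messages` array just makes it a no-op chat request that unloads the model):

```bash
# Sketch: ask the server to unload llava:34b right away (keep_alive = 0),
# then pause a second before resuming normal requests.
curl http://localhost:11434/api/chat -d '{
  "model": "llava:34b",
  "messages": [],
  "keep_alive": 0
}'
sleep 1
```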

@dhiltgen commented on GitHub (Oct 23, 2024):

Please give the new 0.4.0 RC a try. We've rewritten the caching (and slot) model for processing requests in the new Go server, so this issue is most likely resolved.

https://github.com/ollama/ollama/releases


@giladrom commented on GitHub (Oct 27, 2024):

I can confirm 0.4.0-rc5 does not fix the issue for me. It still times out after 1m40s for most workloads, which I am not sure how to resolve. Where is the 1m40s timeout coming from?

```
source=llama-server.go:821 msg="Failed to acquire semaphore" error="context canceled"
[GIN] 2024/10/27 - 08:40:55 | 500 |         1m40s |  XXX.XXX.XXX.XXX  | POST     "/api/embed"
source=.:0 msg="http: superfluous response.WriteHeader call from main.(*Server).embeddings (runner.go:743)"
```

@dhiltgen commented on GitHub (Oct 28, 2024):

@giladrom looking at the code, I believe your client is timing out and closing the connection before the server is able to process the request. It sounds like you may have a 100s timeout in your client code, or maybe in the client libraries you're using. Do you believe the system is hung when this happens? Can you try to bump up the timeout on your side and see if it is able to make forward progress, or if the system is truly hung? The semaphore in question manages the concurrent requests to the model, so depending on what it got wired up with (4 or 1 depending on VRAM, unless you set OLLAMA_NUM_PARALLEL to something else) it will prevent more requests from being processed in parallel until prior requests complete.
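
One quick way to separate a client-side timeout from a genuine hang is to replay the request with a generous timeout from curl (a sketch; the endpoint is taken from the /api/embed log line above and the model from earlier in the thread):

```bash
# Sketch: allow up to 10 minutes instead of the ~100 s client default, to see
# whether the server eventually answers (backlogged) or never does (hung).
curl --max-time 600 http://localhost:11434/api/embed -d '{
  "model": "all-minilm:l6-v2",
  "input": "test sentence for embedding"
}'
```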


@giladrom commented on GitHub (Oct 28, 2024):

@dhiltgen That was my conclusion too - I tried looking for 100s timeouts in my client code but couldn't find anything. I'm using LangChain, so that's probably the culprit. For now, I resolved this by using a different Embedding Model vs. Chat model (llama3.2 works fine for chat, and using nomic for embeddings, which is super fast)
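
For anyone wanting to copy that split, a minimal sketch (model names are examples; `nomic-embed-text` is the nomic embedding model in the Ollama library):

```bash
# Sketch: pull a dedicated embedding model and use it only for embeddings,
# keeping the chat model separate.
ollama pull nomic-embed-text
curl http://localhost:11434/api/embeddings -d '{
  "model": "nomic-embed-text",
  "prompt": "text to embed"
}'
```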


@dhiltgen commented on GitHub (Oct 28, 2024):

@giladrom so just to confirm, avoiding slower models that trigger a LangChain default client-side timeout, you're able to keep running for a long period of time without hangs, is that correct? Has 0.4.0 solved the hangs for you?


@giladrom commented on GitHub (Oct 30, 2024):

@dhiltgen Correct. 0.3.6 (the docker image default) works fine with smaller embedding models, taking about 12s to complete a job that times out with llama3.2, as does 0.4.0. Trying to use llama3.2 for embeddings times out for both versions.


@KIC commented on GitHub (Oct 31, 2024):

Interestingly, I ended up here having the same issue using plain ollama server. However, I realized I only have this problem with one specific model like hhao/openbmb-minicpm-llama3-v-2_5 although it has worked just fine so far. After I delete the model and download it again the problem seems to be gone - at least for the moment.


@jashanj0tsingh commented on GitHub (Nov 3, 2024):

> @giladrom so just to confirm, avoiding slower models that trigger a LangChain default client-side timeout, you're able to keep running for a long period of time without hangs, is that correct? Has 0.4.0 solved the hangs for you?

Pardon my partial understanding of @giladrom's use case, but considering the client-timeout issue with LangChain, I tried a regular old curl request and tried to load a model different from the one already loaded and not responding (more on this later**). A simple curl request, once ollama was in that state, took approximately 10 minutes to answer why the sky is blue. For me, neither docker nor building from source made any difference (even with the go runner); it's heartbreaking being unable to use such a great tool.

**With OLLAMA_DEBUG=1 I noticed that the server didn't even manage to log the POST request on a forever-loaded model, once my LangChain application got stuck waiting for the ollama req/resp cycle.

Unfortunately, for me at the moment the only resolution is a reboot of the VM, and then 20 minutes of sweet glory until it is time to reboot again. I have started considering vLLM at this point, because there is no way this would fly in production.

Note: I am on a fully loaded VM on vCenter (Ubuntu 22.04), CUDA 12, with 2 x 16 GB Nvidia A16-16Q and 64 GB RAM.


@dhiltgen commented on GitHub (Nov 3, 2024):

@jashanj0tsingh can you clarify your scenario a bit? It sounds like you are exercising 2 different models, one with LangChain as the client, one with curl. Do both of these models fit 100% in the GPU(s) or is the Ollama scheduler waiting those 10m for the LangChain client connections to close so that it can unload that model and load the secondary model requested by curl?

Can you explain a little more about what the LangChain client is doing? Is it making multiple requests in parallel, and if so, what limits does it have? Is it calling generate/chat or embedding APIs? Have you changed any scheduler settings in Ollama? Embedding models currently run a single processing thread, and our default queue size is 512 client requests, so if the LangChain client is launching hundreds of parallel requests and those in turn are taking seconds or more to process, perhaps the system isn't actually stuck, just backlogged?

Are you able to observe other metrics on the system when it gets into this state? CPU load, memory load, disk I/O, GPU load, etc.

I might be misunderstanding your scenario, but feels more like a heavy load and working through a backlog and not a "hang after 10-15 minutes" scenario, where Ollama becomes completely wedged and has to be killed to recover.
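
A simple way to capture those metrics while the issue is being reproduced might look like this (a sketch; adjust the interval as needed):

```bash
# Sketch: record GPU utilization/memory and CPU load once per second while
# reproducing the issue, to tell a backlogged server apart from a wedged one.
while true; do
  date
  nvidia-smi --query-gpu=utilization.gpu,memory.used,memory.total --format=csv,noheader
  uptime   # shows 1/5/15-minute load averages
  sleep 1
done
```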


@jashanj0tsingh commented on GitHub (Nov 3, 2024):

@dhiltgen thank you for your prompt reply, really appreciate you working on a weekend, yes sure, I'd be happy to explain the scenario, I will try to answer all of your questions,

Context: I building a rather simple REST API server using FastAPI that wraps around LangGraph and Ollama to provide chat/generate endpoints with additional features such as chat history persistence etc.

My Issue: After a couple of requests via the REST API the model seems unreachable, and if the request manages to reach the model somehow, it responds extremely sluggishly, imagine a streaming response with 1 word per two seconds. Experienced with or without LangGraph, consistent across binary installation, manual build ( with and without go runner ), and docker images 3.14 and 0.4.0-rc5.

My resolution: A reboot of the VM, nothing else works, killing the service, killing the container, restarting the binary, none of these work. If I don't reboot, the model will take forever to accept the request and respond extremely sluggishly, talked a bit more below.

Now, I am using smaller models like Llama 3.1 [7B] and Llama 3.2 [3B], and yes they fit with adequate room in my 2x 16 GB GPUs, also, my use case doesn't necessarily requires two separate models, that was just a test as I actively follow this thread and saw your recent conversation and since Llama 3.1 was already in that state, I tried a curl request instead with Llama 3.2, but the result was the same, firstly it took forever for the request to land at the server and then the stream response was 1 words per 2 seconds.

Embeddings are pre-created, just using a simple retriever to query the embeddings instead for a RAG like use-case. I am not doing multiple requests rather sequential conversation back and forth with one user at a time. I have not changed any scheduler settings, everything default, recently started using keep_alive=-1 to keep the model loaded forever but no luck.

htop shows nothing crazy, nvidia-smi seems fine, with no memory overheads, its something I have been experiencing from quite some time, and I am not sure if its related or not but it does stop working after 10-15 minutes, max 20 minutes, and that's it.

With OLLAMA_DEBUG=1 nothing suspicious jumps out in the logs as well.

Edit: I can verify if vCenter has anything to do with it but I am running a fairly small setup on an adequate hardware with enough room to spare. I am planning to verify this by running the model with vLLM. If vLLM works fine it has something to do with Ollama, if not then I may have a different issue.

Edit 2: It's just single, sequential request/response cycles with one model that behave as mentioned above.

@giladrom commented on GitHub (Nov 4, 2024):

@dhiltgen It seems there's a huge difference in performance between 0.3.x and 0.4.x after all - I just retested 0.4.0-rc6 this morning and all files I've tried (ranging from a few kilobytes to 60MB PDFs) completed embedding without timing out, whereas 0.3.6 would time out for anything larger than a few megabytes.
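(A sketch of the kind of embedding test described above: chunking a large text and timing each /api/embeddings call. Chunk size, timeout, and model name are arbitrary choices, not values from this thread.)

```python
# Rough embedding throughput check: split a large text into chunks and
# time each /api/embeddings call, surfacing hangs via a request timeout.
import time
import requests

def embed_chunks(text: str, chunk_size: int = 1000, model: str = "all-minilm"):
    chunks = [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    for n, chunk in enumerate(chunks, start=1):
        start = time.monotonic()
        resp = requests.post(
            "http://localhost:11434/api/embeddings",
            json={"model": model, "prompt": chunk},
            timeout=60,  # fail loudly instead of waiting forever
        )
        resp.raise_for_status()
        dims = len(resp.json()["embedding"])
        print(f"chunk {n}/{len(chunks)}: {time.monotonic() - start:.2f}s, {dims} dims")

# Example usage (hypothetical file name):
# embed_chunks(open("big_document.txt").read())
```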

@dhiltgen commented on GitHub (Nov 4, 2024):

@jashanj0tsingh

> A reboot of the VM; nothing else works. Killing the service, killing the container, restarting the binary: none of these work. If I don't reboot, the model will take forever to accept a request and then responds extremely sluggishly.

This is interesting. Just to confirm, terminating ollama and any/all subprocesses (ollama_llama_server) does not clear it up? You mentioned you're running in a VM in vSphere, so this is ESXi passing the GPU through, correct? That seems to imply this fault may be a driver bug, or perhaps a virtualization bug that Ollama somehow triggers through its usage patterns of the GPU. Is it possible for you to run your scenario on bare metal to see whether the same problem occurs without virtualization involved? It may also be useful to run `dmesg -w` in the VM to see if any interesting warnings/errors are being reported by the NVIDIA drivers when things slow down. For the Ollama server process, setting `CUDA_ERROR_LEVEL=50` may yield more warnings/errors from the CUDA libraries.
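(A sketch of launching the server with those diagnostics enabled from Python; the environment variables are the ones suggested above, while the log file name and the use of subprocess are just illustrative.)

```python
# Start `ollama serve` with extra CUDA/debug logging and capture its output,
# so driver-level warnings are easier to correlate with slowdowns.
import os
import subprocess

env = dict(os.environ, CUDA_ERROR_LEVEL="50", OLLAMA_DEBUG="1")

with open("ollama-debug.log", "wb") as log:
    proc = subprocess.Popen(
        ["ollama", "serve"],
        env=env,
        stdout=log,
        stderr=subprocess.STDOUT,
    )
    print(f"ollama serve started with pid {proc.pid}; logs in ollama-debug.log")
```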

@iganev commented on GitHub (Nov 5, 2024):

My observation on Ubuntu 22.04, NVIDIA driver 550.127.05, CUDA 12.4, using nvidia-docker with Ollama 0.3.14, is that it seems to behave much more stably than before.
I need to find time to do more in-depth testing, but so far I can't reproduce the issue.

@dhiltgen commented on GitHub (Nov 6, 2024):

@iganev can you give 0.4.0 a try and see if it resolves the hang after ~15m?

@jashanj0tsingh commented on GitHub (Nov 7, 2024):

@dhiltgen Wow, that was on point: it was indeed virtualization related. There was an issue disrupting the license token ping for the hardware, causing the virtual hardware itself to be sluggish and non-responsive. I accidentally stumbled upon this just because I fired a curl, left it running, and saw the next morning that it had finally landed on ollama. Anyway, I am equally relieved and embarrassed, but ollama is off the hook; I was just looking in the wrong direction. Apologies for any inconvenience this may have caused.

@dhiltgen commented on GitHub (Nov 7, 2024):

I'm going to go ahead and close this one out now. If anyone does see the hang reappear with 0.4.0 let us know and I'll reopen the issue to investigate further.

@JTHesse commented on GitHub (Nov 8, 2024):

I'm seeing a similar behavior for 0.4.0, but with a 503 error -> #7573

@gslin1224 commented on GitHub (Jan 16, 2025):

My version is 0.4.1; I still have the same issue.

@jessegross commented on GitHub (Jan 17, 2025):

@gslin1224 Please try it with 0.4.4 or later as there were some additional fixes needed.

@Chleba commented on GitHub (Mar 13, 2025):

I'm still experiencing the same issue as well, using vision models for labeling images. After every 30-60 calls, ollama will freeze and needs to be restarted.
Using Ollama v0.5.7.
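(For illustration, a minimal sketch of that kind of image-labeling loop against Ollama's /api/chat endpoint; the model name, prompt, and file paths are placeholders, not details from this report.)

```python
# Label images one at a time with a vision-capable model via /api/chat.
# Images are passed base64-encoded in the message's "images" field.
import base64
import requests

def label_image(path: str, model: str = "llava") -> str:
    with open(path, "rb") as f:
        img_b64 = base64.b64encode(f.read()).decode("ascii")
    resp = requests.post(
        "http://localhost:11434/api/chat",
        json={
            "model": model,
            "messages": [{
                "role": "user",
                "content": "Describe this image with a few short labels.",
                "images": [img_b64],
            }],
            "stream": False,
        },
        timeout=300,
    )
    resp.raise_for_status()
    return resp.json()["message"]["content"]

# Example usage (hypothetical file names):
# for p in ["photo1.jpg", "photo2.jpg"]:
#     print(p, "->", label_image(p))
```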

Reference: github-starred/ollama#2850