[GH-ISSUE #6380] Hangs after 20-30 mins, a periodic restart of the ollama service is required #29767

Closed
opened 2026-04-22 08:58:14 -05:00 by GiteaMirror · 9 comments

Originally created by @itinance on GitHub (Aug 15, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/6380

What is the issue?

We tried Ollama on our production GPU server, which has an RTX 4000, with a Python script that handles tens of thousands of requests per day from a backend for llama3.1:latest, with 2-5 requests in parallel.

Unfortunately, the duration per request grows over time until the process hangs indefinitely, consuming 200% + 100% CPU (two "ollama_llama_server" processes) while the GPU sits at 94%.

During that time, a new instance started with "ollama run ..." responds quickly, but the old process stays hung.
We never faced this on a Mac M1 during development, where the same scripts ran for over 48h without any issue.

At the moment, we run a cronjob that restarts the ollama process every 20 min as a workaround.

We faced the issue already with 0.3.4 and it still exists with 0.3.6.

Has anyone faced the same issue? Are any other workarounds available?

More detailed information can be found here: https://github.com/ollama/ollama/issues/6380#issuecomment-2294829833

OS

Linux

GPU

Nvidia

CPU

Intel

Ollama version

0.3.6

GiteaMirror added the "needs more info" and "bug" labels 2026-04-22 08:58:14 -05:00

@rick-github commented on GitHub (Aug 15, 2024):

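[Server logs](https://github.com/ollama/ollama/blob/main/docs/troubleshooting.md#how-to-troubleshoot-issues) may aid in debugging. Does it get into this state with `OLLAMA_NUM_PARALLEL=1`? Does the size of the CPU processes increase over time?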
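
To answer the last question, the resident memory of the runner processes can be sampled at intervals and compared. The following is a minimal sketch, assuming a Linux host with `/proc`; all helper names are hypothetical, and the name fragment to match should be whatever `ps` shows on the machine. Note that `/proc/<pid>/comm` is truncated to 15 characters, so a short fragment such as `ollama` is the safer match.

```python
import os
import time


def parse_vm_rss_kib(status_text):
    """Return the VmRSS value in KiB from the text of /proc/<pid>/status, or None."""
    for line in status_text.splitlines():
        if line.startswith("VmRSS:"):
            # The line looks like: "VmRSS:     123456 kB"
            return int(line.split()[1])
    return None


def find_pids(name_fragment):
    """Return PIDs whose command name (/proc/<pid>/comm) contains name_fragment."""
    pids = []
    for entry in os.listdir("/proc"):
        if not entry.isdigit():
            continue
        try:
            with open(f"/proc/{entry}/comm") as f:
                if name_fragment in f.read():
                    pids.append(int(entry))
        except OSError:
            continue  # the process exited while we were scanning
    return pids


def sample_rss(name_fragment):
    """Return {pid: rss_kib} for all processes matching name_fragment."""
    result = {}
    for pid in find_pids(name_fragment):
        try:
            with open(f"/proc/{pid}/status") as f:
                rss = parse_vm_rss_kib(f.read())
        except OSError:
            continue
        if rss is not None:
            result[pid] = rss
    return result


def log_rss(name_fragment, interval_s=60, samples=30):
    """Print one timestamped sample every interval_s seconds, samples times."""
    for _ in range(samples):
        print(time.strftime("%H:%M:%S"), sample_rss(name_fragment))
        time.sleep(interval_s)
```

Calling `log_rss("ollama")` while the clients are running gives one line per minute; a steadily growing value for the same PID would suggest a leak rather than a pure deadlock.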


@jmorganca commented on GitHub (Aug 15, 2024):

Thanks @rick-github !

Sorry about this – looking into it now


@itinance commented on GitHub (Aug 17, 2024):

> OLLAMA_NUM_PARALLEL=1

@rick-github When and where should I apply this setting?


@itinance commented on GitHub (Aug 17, 2024):

I've conducted extensive testing today and encountered an issue with the Ollama model while running a FastAPI-based Python API on a GPU machine (RTX 4000 from Hetzner). Here are the detailed observations:

Setup

I've built an API using FastAPI that runs with uvicorn directly on the GPU machine. The relevant API endpoint is as follows:


```python
@app.post("/chat", response_model=ChatResponse)
async def chat_with_model(request: ChatRequest):
    print (request)
    current_time = datetime.now().strftime("%H:%M:%S")
    print("Current Time:", current_time)

    response = ollama.chat(
        model=request.model,
        keep_alive="15m",
        format=request.format,
        messages=[message.dict() for message in request.messages]
    )

    print (response)
    current_time = datetime.now().strftime("%H:%M:%S")
    print("Current Time:", current_time)
    # Extract and parse the response content
    message_content = response['message']['content']

    print("Duration: " + response.get('total_duration', 0).__str__())

    # Creating a Message instance
    message_instance = Message(
        role=response['message']['role'],
        content=message_content
    )

    # Creating the ChatResponse instance
    chat_response = ChatResponse(
        id=str(uuid.uuid4()),
        model=response['model'],
        messages=[message_instance],
        content=message_content,
        eval_count=response.get('eval_count', 0),
        prompt_eval_count=response.get('prompt_eval_count', 0),
        prompt_eval_duration=response.get('prompt_eval_duration', 0),
        eval_duration=response.get('eval_duration', 0),
        load_duration=response.get('load_duration', 0),
        total_duration=response.get('total_duration', 0)
    )

    return chat_response
```

Note: This code includes debug statements to print timestamps for further investigation.

Observations

Single Client Performance:

When running a single instance of my client, which sends thousands of sequential requests to the API, the system performs reliably and without issues for hours.

Multiple Clients Performance:

Problems arise when I initiate additional client instances, performing requests in parallel:

  • After launching multiple clients, some requests began timing out client-side after 15 seconds, despite normal log outputs.
    The system resumed normal operations intermittently until I started a fourth client instance.

System Deadlock:

  • After approximately 10 minutes with four clients, all API requests became unresponsive and timed out.

  • New requests to the API endpoint would hang when invoking the chat function of Ollama.

  • At this time, gpustat showed two ollama_llama_server processes with 100% and 200% GPU utilization, respectively.

  • Stopping all clients did not resolve the issue; no new requests were processed, and the Ollama processes remained at high utilization.

Unable to Terminate Python Application:

  • Attempting to terminate the Python application failed as it appeared to be stuck, likely waiting for a response from the Ollama process.
  • When I SSHed into the machine, manually started a new instance via ollama run llama3.1, and sent a simple "hello", the response was delayed but eventually came back as "Hello, nice to meet you". The other processes were still stuck and hanging.
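
The multi-client scenario described above can be approximated with a small load generator. This is a sketch and not the client used in the report: `send` is a hypothetical callable standing in for one HTTP request to the `/chat` endpoint, it is assumed to raise `TimeoutError` when the request times out, and the 15-second default mirrors the client-side timeout mentioned above.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def run_client(send, n_requests, timeout_s):
    """Send n_requests sequentially; return (ok, timed_out, slowest_seconds)."""
    ok = timed_out = 0
    slowest = 0.0
    for i in range(n_requests):
        start = time.monotonic()
        try:
            send(i, timeout_s)
            ok += 1
        except TimeoutError:
            timed_out += 1
        slowest = max(slowest, time.monotonic() - start)
    return ok, timed_out, slowest


def run_clients(send, n_clients, n_requests, timeout_s=15.0):
    """Run n_clients sequential clients in parallel threads and collect their stats."""
    with ThreadPoolExecutor(max_workers=n_clients) as pool:
        futures = [
            pool.submit(run_client, send, n_requests, timeout_s)
            for _ in range(n_clients)
        ]
        return [future.result() for future in futures]
```

Running it with one client and then with four should show whether the timeouts appear only under parallel load. A real `send` would wrap an HTTP call and translate the HTTP library's timeout exception into `TimeoutError`.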

Hypothesis

It seems that multiple Ollama invocations are causing a deadlock or some form of resource contention, leading to the processes being locked in a high GPU utilization state. This issue requires restarting the Ollama service to restore normal operation.


@rick-github commented on GitHub (Aug 17, 2024):

> > OLLAMA_NUM_PARALLEL=1
>
> @rick-github When and where should I apply this setting?

Depends on how you installed ollama. If you did `curl -fsSL https://ollama.com/install.sh | sh`, then in the file /etc/systemd/system/ollama.service in the [Service] section, add the line `Environment="OLLAMA_NUM_PARALLEL=1"`, then restart the service:

```
sudo systemctl stop ollama
sudo systemctl daemon-reload
sudo systemctl start ollama
```

If you are using docker, pass an environment variable to the container:

```
docker run -d --gpus=all --env OLLAMA_NUM_PARALLEL=1 -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama
```

If you are using docker compose, add it to the environment section and restart (`docker compose up -d ollama`):

```yaml
services:
  ollama:
    environment:
      - OLLAMA_NUM_PARALLEL=1
```

@itinance commented on GitHub (Aug 17, 2024):

Btw, I also found out that the invocation of ollama.chat blocks the entire Python process. I cannot call any other endpoint during that time, not even FastAPI's "/docs" endpoint, which renders the Swagger documentation.
Can we somehow make it non-blocking and more async?
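
One way to keep the event loop responsive is to run the blocking call in a worker thread. The sketch below uses only the standard library (Python 3.9+ for `asyncio.to_thread`); `blocking_chat` is a hypothetical stand-in for the synchronous `ollama.chat(...)` call and is simulated here with `time.sleep`.

```python
import asyncio
import time


def blocking_chat(prompt):
    """Stand-in for a synchronous call such as ollama.chat(...)."""
    time.sleep(0.2)  # simulates time spent waiting for the model
    return f"echo: {prompt}"


async def chat_endpoint(prompt):
    # asyncio.to_thread runs the blocking function in a worker thread,
    # so the event loop stays free to serve other requests meanwhile.
    return await asyncio.to_thread(blocking_chat, prompt)


async def serve_concurrently(prompts):
    """Handle several requests at once and return their replies in order."""
    return await asyncio.gather(*(chat_endpoint(p) for p in prompts))
```

In FastAPI, declaring the endpoint with plain `def` instead of `async def` has a similar effect, because FastAPI runs such endpoints in a thread pool. The `ollama` Python package also provides an `AsyncClient` whose `chat` method can be awaited directly.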


@dhiltgen commented on GitHub (Nov 6, 2024):

Please give the new 0.4.0 release a try. We've restructured context and cache handling, and the old code was a plausible source of this sort of hang in long-running scenarios.


@pdevine commented on GitHub (Jan 12, 2025):

I'm going to go ahead and close the issue, but we can reopen if it's still causing problems.


@SIMSB-99 commented on GitHub (Jul 8, 2025):

Still facing a similar issue: https://github.com/ollama/ollama/issues/11257#issue-3193483999

Reference: github-starred/ollama#29767