[GH-ISSUE #11889] Ollama does not stop processing a request if the client has terminated. #7892

Open
opened 2026-04-12 20:02:41 -05:00 by GiteaMirror · 13 comments
Owner

Originally created by @MarkWard0110 on GitHub (Aug 13, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/11889

What is the issue?

With the gpt-oss:120b model, I have observed that if the client terminates its connection before the request is completed, Ollama will continue to process the request.
I don't know yet if this happens with other models.

Ollama 0.11.4
gpt-oss:120b is split between GPU and CPU.

A client sends a chat request and terminates the connection before the response is received; Ollama will continue to process the request. For example, stopping the debugger and the program before the server completes the request reproduces this (using a non-streaming request).

What I would expect: when the client terminates its connection, Ollama should stop processing the request.
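
For reference, a minimal reproduction sketch (not the reporter's original code): it assumes a local Ollama instance on the default port and uses a short client-side timeout to stand in for killing the program mid-request; the model name and prompt are placeholders.

```python
import requests

try:
    requests.post(
        "http://localhost:11434/api/chat",
        json={
            "model": "gpt-oss:120b",  # placeholder model
            "stream": False,          # non-streaming, as in the report
            "messages": [{"role": "user", "content": "Write a very long story."}],
        },
        timeout=5,  # client gives up and drops the connection after 5 seconds
    )
except requests.exceptions.Timeout:
    # The client is now gone; watch the server's GPU/CPU usage to see
    # whether Ollama keeps working on the request.
    print("client disconnected before the response was produced")
```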

What has not been checked

  • stream requests
  • other models
  • other models GPU vs CPU

Relevant log output


OS

Windows 11

GPU

RTX 3090

CPU

Intel Ultra 9 285K

Ollama version

0.11.4

GiteaMirror added the bug label 2026-04-12 20:02:41 -05:00
Author
Owner

@rick-github commented on GitHub (Aug 13, 2025):

It sounds like the runner is still processing the prompt. The runner doesn't find out that the client has gone away until it tries to send a response.

Author
Owner

@MarkWard0110 commented on GitHub (Aug 14, 2025):

I think it will shut down the runner if there is a pending request from another client.

Author
Owner

@rick-github commented on GitHub (Aug 14, 2025):

The runner processes the prompt and then starts to generate a response. After transmitting the first response token, the runner will be informed that the client has gone away and the runner will stop the generation. If there's a pending request from another client, the runner will receive the new prompt, process it, and start the generation of response tokens. The runner is shutdown if the keep_alive timeout expires or a client requests a new model that can't co-reside with the current model.
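
A client-side sketch of that behaviour (assuming the standard `/api/chat` streaming endpoint; the model and the cancel condition are placeholders): with `stream: true` the server writes a chunk per token, so closing the response mid-stream gives the runner a failed write to notice, which is what stops generation.

```python
import json
import requests

resp = requests.post(
    "http://localhost:11434/api/chat",
    json={
        "model": "gpt-oss:120b",  # placeholder model
        "stream": True,
        "messages": [{"role": "user", "content": "Write a very long story."}],
    },
    stream=True,
)
for i, line in enumerate(resp.iter_lines()):
    if not line:
        continue
    chunk = json.loads(line)
    print(chunk["message"]["content"], end="", flush=True)
    if chunk.get("done"):
        break
    if i >= 10:          # stand-in for "the user hit Ctrl-C"
        resp.close()     # the server's next write fails, so the runner
        break            # learns the client is gone and stops generating
```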

Author
Owner

@eliciel0513 commented on GitHub (Aug 15, 2025):

If all this means that the model remains loaded/processing without continuing on to the next chat input, then yes. I've been struggling to get follow-up responses. The model just stays processing without a follow-up response to a follow-up chat input.

Author
Owner

@xxxajk commented on GitHub (Feb 13, 2026):

I have the same issue with Devstral-Small-2-24B-Instruct-2512, so it is not model-related. When I completely exit the Python CLI I built (which uses pydantic_ai and synchronous mode), the request is never cancelled, even though Python has exited entirely.
What would be nice is an HTTP "v1/" URL with a cancel option, even if I have to send it manually, if this does not exist already. One thing that could be done is a TCP keepalive option on Ollama's frontend, which would properly detect that the other side has slammed the socket shut.
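
For illustration only, a sketch of the TCP keepalive mechanism being suggested (Linux socket option names; this is not an existing Ollama option, and it would need to be applied server-side to the accepted connection):

```python
import socket

def enable_keepalive(sock: socket.socket, idle=60, interval=10, count=3):
    """Probe an idle peer so one that vanished without cleanly closing the
    socket is eventually detected and the connection torn down."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle)       # seconds idle before first probe
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval)  # seconds between probes
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, count)       # failed probes before the drop
```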

Author
Owner

@rick-github commented on GitHub (Feb 13, 2026):

Can you provide a minimal version of your cli that reproduces the problem?

Author
Owner

@xxxajk commented on GitHub (Feb 13, 2026):

Yeah, it is super tiny, and WIP. I also do a few neat tricks to use sync mode to get chat output during think stages.

I just need to convince GitHub to accept the zip/tarball, and it refuses to do that...

Author
Owner

@xxxajk commented on GitHub (Feb 13, 2026):

[bloody.zip](https://github.com/user-attachments/files/25308817/bloody.zip)
Got it to work; here is the tiny source.
This is running Ollama in a Docker container (two, actually: one is CPU-only, the other a GPU/CPU split on an older ROCm graphics card, if that matters...).

Author
Owner

@rick-github commented on GitHub (Feb 15, 2026):

Unable to duplicate. I run bloody and give it a prompt ("write a long story about a panda meeting a penguin"), the agent summarizes the action required, the GPU usage goes up on nvtop, but if I exit the CLI before the story is returned, the GPU usage drops back to idle.

Author
Owner

@xxxajk commented on GitHub (Feb 15, 2026):

Try it running CPU-only in Docker, all on Linux, using the distribution.

Author
Owner

@rick-github commented on GitHub (Feb 15, 2026):

This is the docker-compose file:

```yaml
services:
  ollama-bloody:
    image: ollama/ollama:${DOCKER_OLLAMA_TAG-latest}
    volumes:
      - /tank/ai/models/ollama/models:/root/.ollama/models
  bloody:
    image: bloody
    build:
      context: bloodyRAG
      dockerfile_inline: |
        FROM python:3.11
        RUN pip install typer pydantic-ai python-magic
        COPY . .
        ENTRYPOINT [ "sleep", "inf" ]
    environment:
      - ORCHESTRATOR_ENDPOINT=http://ollama-bloody:11434/v1
```
Running `docker compose exec bloody python3 bloody.py` starts the CLI. Monitoring with top while giving it the prompt, waiting a few seconds, and then disconnecting shows CPU usage spike up and then fall back to idle.

Author
Owner

@xxxajk commented on GitHub (Feb 16, 2026):

I'll do a screenshot, I guess... I'll have to show you how I am running it.
I asked the question as follows:
`show me the contents of the current directory please.`
and pressed Control-K, then waited for the CPU to spike up.
Then I broke out of the program with Control-C (sometimes it takes more than once; the first time should have aborted...)
and...

![Image](https://github.com/user-attachments/assets/5e3ab199-24c8-464e-a42e-d9ecd92ab43c)

The CPU will be pegged until it tries to send something; the 500 error was the socket closing. It never stopped the request.
500 errors are normal (I understand that part) and I do a retry on them, because they happen when something grinds on for over 30 minutes. What we need is a "cleanup" URL to hit that does a POST to cancel when the request is aborted.
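
As far as I can tell there is no cancel endpoint in the current API. The closest documented control is an unload request: posting to `/api/chat` with an empty `messages` list and `keep_alive` set to 0 asks Ollama to free the model (the `ollama stop <model>` CLI command does the same). Note this unloads the model rather than cancelling a specific in-flight request; the model name below is a placeholder.

```python
import requests

# Documented unload request: empty messages plus keep_alive: 0 frees the model.
requests.post(
    "http://localhost:11434/api/chat",
    json={"model": "gpt-oss:120b", "messages": [], "keep_alive": 0},
)
```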

Author
Owner

@xxxajk commented on GitHub (Feb 16, 2026):

The short of it is this, and why you can't replicate it: the CLI app runs OUTSIDE of Docker, and Ollama is inside Docker as a service.

The same thing happens when it is accessed from somewhere else on the LAN...
The whole misunderstanding is: we are asking if there is a cancel URL and, if not, pointing out that there should be.


Reference: github-starred/ollama#7892