[GH-ISSUE #3409] API to terminate the running job before the completion #2100

Closed
opened 2026-04-12 12:20:35 -05:00 by GiteaMirror · 2 comments

Originally created by @ansis-m on GitHub (Mar 29, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/3409

What are you trying to do?

I am using Ollama via the REST API. Sometimes when a model streams a long response (which can be quite slow on my computer) I would like to terminate the process before completion. I checked the API documentation and did not find an option for this. Even if I unsubscribe from the stream (Java Spring AI interface), I see that CPU/GPU usage remains high for an extended period (until the whole response has been generated).
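For illustration, here is what I would expect "interrupting the stream" to mean at the HTTP level: a minimal sketch using the JDK's built-in `java.net.http` client rather than Spring AI, assuming a local Ollama on the default port (the model name `llama2` is just a placeholder):

```java
import java.io.BufferedReader;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;

public class InterruptOllamaStream {
    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:11434/api/generate"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(
                        // "llama2" is a placeholder model name
                        "{\"model\": \"llama2\", \"prompt\": \"Write a very long story.\"}"))
                .build();

        HttpResponse<InputStream> response =
                client.send(request, HttpResponse.BodyHandlers.ofInputStream());

        // /api/generate streams one JSON object per line; read a few
        // chunks, then abandon the response mid-stream.
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(response.body(), StandardCharsets.UTF_8))) {
            for (int i = 0; i < 5; i++) {
                String chunk = reader.readLine();
                if (chunk == null) break;
                System.out.println(chunk);
            }
            // Leaving the try block closes the stream and drops the
            // connection; the question is whether the server then stops
            // generating or keeps burning CPU/GPU until the end.
        }
    }
}
```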

How should we solve this?

Implement a REST API endpoint that allows terminating a running job before completion.

What is the impact of not solving this?

A poorer user experience. Many users do not own super-fast computers, and response generation takes some time (0.5-2 minutes on my computer). Sometimes when you are halfway through the answer you know that you are not getting what you want, and you would like to reformulate the prompt without waiting for completion. It would greatly improve the user experience if the running job could be stopped before completion.

Anything else?

Thank you for the awesome project!


@pdevine commented on GitHub (Apr 1, 2024):

If you interrupt the stream it should stop generating immediately. The model will stay loaded in GPU memory for 5 minutes, but you can control that with the `keep_alive` parameter or the `OLLAMA_KEEP_ALIVE` environment variable.

I just re-tested this with curl and it is working as expected (at least on Mac). What platform are you using?
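For reference, a minimal sketch of setting `keep_alive` on a single request (JDK HTTP client again; `llama2` is a placeholder model name, and `"stream": false` just collapses the reply into one JSON object):

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class KeepAliveExample {
    public static void main(String[] args) throws Exception {
        // keep_alive accepts a duration string such as "5m" or a number of
        // seconds; 0 unloads the model as soon as this response finishes.
        // Setting OLLAMA_KEEP_ALIVE on the server changes the default for
        // all requests instead.
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:11434/api/generate"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(
                        "{\"model\": \"llama2\", \"prompt\": \"Hello\", "
                                + "\"stream\": false, \"keep_alive\": 0}"))
                .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.body());
    }
}
```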


@jmorganca commented on GitHub (Apr 15, 2024):

Hi @ansis-m, as @pdevine mentioned, it's possible to have the model unload immediately after generation (including after cancelling it).
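Concretely, the documented way to unload a model on demand is a request with no prompt and `keep_alive` set to 0; a minimal sketch (placeholder model name again):

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class UnloadModelNow {
    public static void main(String[] args) throws Exception {
        // A request with no prompt and keep_alive: 0 generates nothing and
        // asks the server to evict the model from memory right away.
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:11434/api/generate"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(
                        "{\"model\": \"llama2\", \"keep_alive\": 0}"))
                .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.body());
    }
}
```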
