[GH-ISSUE #1521] Adjust Time for Ollama Serve to Stop Llama Runner Service #26589

Closed
opened 2026-04-22 02:56:34 -05:00 by GiteaMirror · 3 comments

Originally created by @gzuuus on GitHub (Dec 14, 2023).
Original GitHub issue: https://github.com/ollama/ollama/issues/1521

Originally assigned to: @dhiltgen on GitHub.

Currently, the delay before Ollama Serve stops the Llama Runner service is too short. It would be great to have a longer delay before the kill signal is sent and the Llama Runner is stopped. Perhaps it's possible to add a configuration option to set how long Ollama Serve waits before stopping the Llama Runner service.


@rgaidot commented on GitHub (Dec 14, 2023):

Yes, especially since llama.cpp's server (see the example in their repo) runs permanently and is much faster, with no re-launches, etc.

I also understand why ollama wants to shut it down. My guess, since I haven't gone through all the code, is that if users want to switch models while ollama is using llama.cpp's server (*) in its older form, another server has to be launched with the new model specified for the `/completion` endpoint.

FYI, I'm using llama.cpp's server with their new endpoint (`/v1/chat/completions`).

_(*) well, judging from the logs, there's a good chance_

Best.
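
For context, a rough sketch of the workaround described above, i.e. talking to a separately launched llama.cpp `server` through its OpenAI-compatible `/v1/chat/completions` endpoint; the host, port, and model name below are assumptions for illustration, not values from this thread.

```python
# Hedged sketch: querying a llama.cpp `server` instance directly via its
# OpenAI-compatible chat endpoint. Assumes the server was started separately
# (e.g. listening on port 8080) with a model already loaded; adjust to your setup.
import requests

resp = requests.post("http://localhost:8080/v1/chat/completions", json={
    "model": "local-model",  # placeholder; the server answers with whichever model it loaded
    "messages": [{"role": "user", "content": "Why is the sky blue?"}],
})
print(resp.json()["choices"][0]["message"]["content"])
```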


@transcriptionstream commented on GitHub (Dec 19, 2023):

I'm after something similar, but the opposite: the llama runner holds GPU memory long after an API generate request completes. I'd love a way to adjust or stop the runners via the API.


@dhiltgen commented on GitHub (Mar 12, 2024):

In 0.1.25 we added `keep_alive`, which allows you to control how long we keep the model loaded.

We've solved a number of memory leak problems, but are still working on some remaining ones, which is why it never fully unloads from the GPU. That's tracked in issue #2767.
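
As a minimal sketch of how `keep_alive` can be used (not taken from this thread): the request below assumes a local server at the default `http://localhost:11434` and a model that is already pulled; the model name and duration are placeholders, and per the API docs a `keep_alive` of 0 should unload the model right away, which is what the GPU-memory comment above was asking for.

```python
# Hedged sketch: controlling how long the runner stays loaded via the
# keep_alive field on a generate request. Assumes a local Ollama server on
# the default port and that the "llama2" model has already been pulled.
import requests

OLLAMA_URL = "http://localhost:11434/api/generate"

# Keep the model (and its GPU memory) loaded for 30 minutes after this call.
resp = requests.post(OLLAMA_URL, json={
    "model": "llama2",
    "prompt": "Why is the sky blue?",
    "stream": False,
    "keep_alive": "30m",
})
print(resp.json()["response"])

# To free GPU memory right away instead, send a request with keep_alive set to 0.
requests.post(OLLAMA_URL, json={"model": "llama2", "keep_alive": 0})
```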
