[GH-ISSUE #2235] GPU RAM not released when exiting ollama run #63321

Closed
opened 2026-05-03 12:59:17 -05:00 by GiteaMirror · 5 comments

Originally created by @aseedb on GitHub (Jan 27, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/2235

Originally assigned to: @dhiltgen on GitHub.

I'm running ollama version 0.1.22 under Ubuntu, installed with the default procedure. After exiting a run with the /exit command, the GPU RAM used by ollama is not released immediately. I either need to restart the ollama service or wait several minutes for that to happen. Is this expected behavior?
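For anyone reproducing this, a quick way to watch whether the runner's VRAM is ever returned after /exit is to poll `nvidia-smi`. The snippet below is only an illustrative sketch (it assumes an NVIDIA GPU with `nvidia-smi` on the PATH and polls once a minute):

```python
import subprocess
import time

# Poll total GPU memory in use once a minute. If the idle Ollama runner is
# still holding VRAM, this number stays high long after the model "unloads".
while True:
    used = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.used", "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    ).stdout.strip()
    print(time.strftime("%H:%M:%S"), used)
    time.sleep(60)
```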

@easp commented on GitHub (Jan 27, 2024):

Sounds like the expected behavior. Ollama unloads the model after 5m of inactivity.

This will be configurable in an upcoming version.
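For context, newer Ollama versions expose this as a `keep_alive` option on API requests (and an `OLLAMA_KEEP_ALIVE` environment variable for the server). A rough sketch of forcing an immediate unload, assuming a local server on the default port 11434 and a pulled `llama2` model:

```python
import requests

# keep_alive controls how long the model stays resident after the request:
# 0 unloads (and frees VRAM) immediately; "10m" keeps it loaded for 10 minutes.
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "llama2", "prompt": "Hello", "stream": False, "keep_alive": 0},
)
print(resp.json()["response"])
```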

@jukofyork commented on GitHub (Jan 28, 2024):

> Sounds like the expected behavior. Ollama unloads the model after 5m of inactivity.
>
> This will be configurable in an upcoming version.

Unless this has been fixed in the last week or so (currently running 'main' pulled a few days ago), it still seems to hang on to around 800 to 1500 MB of VRAM for me even when the server unloads the model (even many hours after!). It seems to be a leak in the wrapped llama.cpp server from what I could see.

This can be quite irritating if you push the number of offloaded layers to the limit, since everything works fine against a freshly crashed Ollama server (using only ~3 MB of VRAM), but then crashes again when switching models and takes 30s+ to recover.
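One workaround sketch (not from the thread): instead of letting Ollama offload as many layers as will fit, cap the layer count yourself via the `num_gpu` option so a few hundred MB of leftover VRAM doesn't push the next load over the edge. The value 28 below is just a placeholder; pick something below your model's maximum:

```python
import requests

# Cap GPU offload at a fixed number of layers to leave VRAM headroom for
# whatever the previous runner failed to release. 28 is a placeholder value.
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "llama2",
        "prompt": "Hello",
        "stream": False,
        "options": {"num_gpu": 28},
    },
)
print(resp.json()["response"])
```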

@remy415 commented on GitHub (Jan 31, 2024):

Most AI servers have this problem; I ran into it a lot with stable-diffusion. I would build periodic server restarts into any application you want to leave running indefinitely.
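A minimal sketch of that workaround, assuming the default Linux install where Ollama runs as a systemd service (the 6-hour interval is arbitrary):

```python
import subprocess
import time

# Periodically restart the ollama systemd unit so any VRAM the idle runner
# failed to release is returned to the GPU. The interval is arbitrary.
while True:
    time.sleep(6 * 60 * 60)
    subprocess.run(["sudo", "systemctl", "restart", "ollama"], check=True)
```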

@aseedb commented on GitHub (Feb 5, 2024):

> It still seems to hang on to around 800 to 1500 MB of VRAM for me even when the server unloads the model (even many hours after!).

Same here and yes it would be nice to have it fixed.

@dhiltgen commented on GitHub (Mar 11, 2024):

Let's track this via #1848

We fixed a number of scenarios, but there are still some where GPU memory isn't properly freed when we go idle.

Reference: github-starred/ollama#63321