[GH-ISSUE #3294] System ram won't free up when using cuda. #64065

Closed
opened 2026-05-03 16:02:13 -05:00 by GiteaMirror · 6 comments
Owner

Originally created by @stevenhobs on GitHub (Mar 22, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/3294

Originally assigned to: @dhiltgen on GitHub.

What is the issue?

When I run starcoder2:15b, the model occupies 9.1 GB of GPU memory, and the ollama serve process should occupy about 1 GB of system RAM. But when I check the actual usage, the model also seems to be cached in system RAM, and that memory is not freed until I end the ollama process. Is this typical?
![f46ccf38a0dc283fc8289ca0cca6137](https://github.com/ollama/ollama/assets/47906512/92bc1a27-49e3-4ffa-8187-200ec9c023fa)
![6ed57bf1559d218909f80a378f4eb6e](https://github.com/ollama/ollama/assets/47906512/3f7920f6-fdd7-4553-ae64-bfe9fd93769c)

What did you expect to see?

When the LLM is loaded via CUDA, the copy of the model cached in system RAM should be released, or Windows Task Manager should report memory usage normally.

Steps to reproduce

Previously, the process behaved normally.

Are there any recent changes that introduced the issue?

I only applied a Windows OS patch before this started happening, and updated Ollama to version 1.29.

OS

Windows

Architecture

amd64

Platform

No response

Ollama version

No response

GPU

Nvidia

GPU info

Nvidia 3060m. The GPU itself works fine!

CPU

Intel

Other software

None

GiteaMirror added the "bug" and "needs more info" labels 2026-05-03 16:02:13 -05:00
Author
Owner

@BruceMacD commented on GitHub (Mar 25, 2024):

Hi @stevenhobs, is the memory still held after 5 minutes? Ollama keeps the model loaded into memory for 5 minutes after the request.

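The 5-minute default mentioned above can also be overridden per request through the HTTP API. A minimal sketch of such a request body, assuming the `/api/generate` endpoint's `keep_alive` field (values like `"5m"`, `"30s"`, or `0` to unload immediately):

```python
import json

# Hypothetical request body for Ollama's /api/generate endpoint.
# keep_alive controls how long the model stays loaded in memory
# after the request completes (the server default is 5 minutes).
payload = {
    "model": "starcoder2:15b",
    "prompt": "write a hello world in python. be brief",
    "keep_alive": "30s",  # or 0 to unload as soon as the response finishes
}
body = json.dumps(payload)
print(body)
```

Sending this body as a POST to a running server should shorten the keep-alive window for that load only, without changing the server-wide default.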
Author
Owner

@stevenhobs commented on GitHub (Mar 25, 2024):

Yeah, it is in the state you described. How should I go about debugging this?

Author
Owner

@stevenhobs commented on GitHub (Mar 26, 2024):

As you can see, I also set the environment variable OLLAMA_KEEPALIVE=3m, but got the same result.
The last request was at 12:53, and the model still had not unloaded at 12:59.
![Screenshot 2024-03-26 125909](https://github.com/ollama/ollama/assets/47906512/7cfd5dad-967a-418c-bafd-ef51fe37c6e3)

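One thing worth double-checking here: in recent Ollama builds the server-side variable appears to be spelled `OLLAMA_KEEP_ALIVE` (with an underscore between KEEP and ALIVE), so `OLLAMA_KEEPALIVE` may simply be ignored. The value is a Go-style duration string such as `3m` or `10s`; a rough sketch of how such strings map to seconds (a hypothetical helper for illustration, not Ollama's actual parser):

```python
import re

def parse_keep_alive(value: str) -> int:
    """Parse a keep-alive duration like "5m", "10s", or "1h" into seconds.

    Illustrative only: mirrors the shape of Go-style duration strings,
    but the real server-side parsing may accept more forms.
    """
    units = {"s": 1, "m": 60, "h": 3600}
    match = re.fullmatch(r"(\d+)([smh])", value.strip())
    if not match:
        raise ValueError(f"unrecognized duration: {value!r}")
    return int(match.group(1)) * units[match.group(2)]

print(parse_keep_alive("3m"))  # 180
```

With `3m` parsed as 180 seconds, a model last used at 12:53 would be expected to unload around 12:56, so still seeing it resident at 12:59 does suggest either the variable was not picked up or something else is pinning the model.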
Author
Owner

@konczdev commented on GitHub (Mar 27, 2024):

I think my thread on Discord is related to this: https://discord.com/channels/1128867683291627614/1218874957556224102

Author
Owner

@oldgithubman commented on GitHub (Mar 30, 2024):

It'd be nice if people stopped using discord...just sayin'

Author
Owner

@dhiltgen commented on GitHub (Jun 1, 2024):

I'm not able to reproduce this with the latest version. I started Task Manager, watched the GPU allocation, and then ran

```
> ollama run starcoder2:15b --keepalive 10s write a hello world in python.  be brief
...
> ollama ps
NAME            ID              SIZE    PROCESSOR       UNTIL
starcoder2:15b  20cdb0f709c2    10 GB   100% GPU        7 seconds from now
```

and after the timer expired, the GPU VRAM dropped back down.

I will note, this model seems to ramble, so it's possible you might have a client that's getting a long response from the model and that's pinning the model in memory. If no clients are present, the model does unload.

If you're still seeing the model stay pinned when no clients are active, please share more information about your scenario and I'll re-open the issue.

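The `UNTIL` column above reflects the keep-alive expiry, which can also be checked programmatically. A sketch of that check, assuming a ps-style entry carries an `expires_at` timestamp (the field name is an assumption about the `/api/ps` response shape):

```python
from datetime import datetime, timezone

def model_still_loaded(ps_entry: dict, now: datetime) -> bool:
    """Report whether a model's keep-alive window is still open.

    ps_entry is assumed to be one entry from a ps-style listing with an
    ISO-8601 `expires_at` field; real response shapes may differ.
    """
    expires = datetime.fromisoformat(ps_entry["expires_at"])
    return now < expires

entry = {"name": "starcoder2:15b", "expires_at": "2024-06-01T10:00:07+00:00"}
print(model_still_loaded(entry, datetime(2024, 6, 1, 10, 0, 0, tzinfo=timezone.utc)))   # True
print(model_still_loaded(entry, datetime(2024, 6, 1, 10, 0, 30, tzinfo=timezone.utc)))  # False
```

Polling like this after the keep-alive window closes is one way to distinguish "model still pinned by an active client" from "model unloaded but memory misreported by Task Manager."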

Reference: github-starred/ollama#64065