[GH-ISSUE #3131] CLIP model isn't being freed correctly #1927

Closed
opened 2026-04-12 12:03:15 -05:00 by GiteaMirror · 6 comments
Owner

Originally created by @RandomGitUser321 on GitHub (Mar 14, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/3131

Originally assigned to: @mxyng on GitHub.

I'm on Windows and do a lot of things with models. Mostly a VLM -> get a detailed description of an image -> use a different LLM that's better at writing prompts to mix my ideas in -> Stable Diffusion -> image workflow in ComfyUI. Obviously, I need all the VRAM I can get, but I sometimes run into scenarios where every megabyte of VRAM is precious (IPAdapter, ControlNets, etc.).

I have my own custom nodes that incorporate sending the command to unload the model after it's used, so that I don't run into any OOM/out-of-VRAM scenarios that push me into shared memory (which destroys performance).
For example, it pretty much ultimately runs the command:
`client.generate(model=model, prompt=prompt, images=images_b64, system=system, options={'num_predict': num_predict, 'temperature': temperature, 'seed': seed}, keep_alive="0")`

Everything works fine and `keep_alive="0"` does indeed unload the model when it's done, but it seems to leave the `mmproj-model-f16` portion of the associated model loaded in VRAM. Whatever VRAM load I was at before starting -> loading -> unloading stays higher until I exit Ollama from the Windows task-tray icon, which frees the chunk that was trapped.
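For reference, the unload can apparently also be triggered explicitly with a request that has no prompt and `keep_alive` set to 0. A rough sketch using the `ollama` Python client (the model name and host are placeholders) would be:

```python
# Rough sketch: an explicit unload request (no prompt, keep_alive="0").
# Assumes the `ollama` Python client and a local server on the default port;
# "llava" is a placeholder model name.
import ollama

client = ollama.Client(host="http://localhost:11434")
client.generate(model="llava", keep_alive="0")  # ask the server to evict the model
```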

I've tested this with regular LLMs and there the command works completely, returning my VRAM load to what it was before I loaded the model.

EDIT: I can also replicate this identical behaviour with a regular *.py script doing the same thing with a basic template.
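For completeness, a basic standalone script along those lines might look roughly like this (model name, image path, and generation options are placeholders; it assumes the `ollama` Python client and a locally pulled LLaVA-style model):

```python
# Rough reproduction sketch, not the exact node code.
# Assumes: `pip install ollama`, a local Ollama server, a llava-style VLM
# already pulled, and an image at a placeholder path.
import base64
import ollama

client = ollama.Client(host="http://localhost:11434")

with open("example.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

# keep_alive="0" asks the server to unload the model as soon as the request finishes.
response = client.generate(
    model="llava",
    prompt="Describe this image in detail.",
    images=[image_b64],
    options={"num_predict": 256, "temperature": 0.7, "seed": 42},
    keep_alive="0",
)
print(response["response"])

# Observed behaviour from this report: the LLM weights are freed afterwards,
# but VRAM stays above the pre-load baseline, consistent with the
# mmproj/CLIP projector remaining resident.
```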


@ZaneA commented on GitHub (Mar 14, 2024):

Newer versions of Ollama have moved the LLM processing into the server itself, and it looks like there are some memory leaks to be fixed, you might want to follow this issue https://github.com/ollama/ollama/issues/1848


@RandomGitUser321 commented on GitHub (Mar 14, 2024):

Thanks. For some dumb reason, I was under the impression that Ollama wasn't llama.cpp-based. I switched to Ollama because llama.cpp is refactoring a lot of stuff right now and, last I heard, they were temporarily removing multimodal model support from the server, or something like that, forcing you to run llava-cli for vision models.


@pdevine commented on GitHub (Mar 14, 2024):

@RandomGitUser321 yes, llama.cpp is temporarily removing llava support, but we'll continue to support it. There have been a few issues that have crept up with it lately, though, which we're working through right now.


@pdevine commented on GitHub (Mar 14, 2024):

#3149 should address the issue.


@mxyng commented on GitHub (Mar 14, 2024):

This should be fixed now.


@RandomGitUser321 commented on GitHub (Mar 15, 2024):

> #3149 should address the issue.

You guys rock, that was fast!

Reference: github-starred/ollama#1927