[GH-ISSUE #3264] "CUDA error: out of memory" after random number of API requests #27769

Closed
opened 2026-04-22 05:20:50 -05:00 by GiteaMirror · 4 comments
Owner

Originally created by @RandomGitUser321 on GitHub (Mar 20, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/3264

Originally assigned to: @dhiltgen on GitHub.

What is the issue?

I run a workflow in ComfyUI that makes calls to the Ollama server's API to generate prompts or analyze images. It normally works fine, but occasionally I get CUDA errors that force me to restart the server. It's disruptive to my workflow because I have to check back every 5-10 minutes to make sure the queue isn't stalled.

Within the API call, I use keep_alive="0" because otherwise I run into issues with generating an image right after (stable diffusion needs a lot of VRAM), and sometimes parts of either model get stuck in shared memory. The call works fine and unloads the LLM from VRAM. I think the model persists in system RAM afterwards, which is also fine, since it's faster to reload RAM->VRAM than it is to reopen the whole model off the drive again.
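
As I understand it, the same keep_alive mechanism can also trigger an unload on its own, by sending a request with no prompt and keep_alive set to 0. A rough sketch with the Python client (the host and model name are placeholders):

```python
from ollama import Client  # assumed: the official ollama Python client

client = Client(host="http://localhost:11434")  # placeholder: default local server

# A generate request with no prompt and keep_alive="0" should just unload the
# model from VRAM instead of producing any text.
client.generate(model="llama2", keep_alive="0")
```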

The basic flow of what I'm doing is: send a request to Ollama -> get the response -> unload the LLM from VRAM -> use the response for stable diffusion -> new seed for Ollama -> rinse and repeat. ComfyUI is also set to unload models back to system RAM as well.

I added a time.sleep() to my node that sends the requests to Ollama, thinking maybe it just needs a little more time for the unloading phase.

What did you expect to see?

I'd expect the requests to keep working without this error.

Steps to reproduce

I should also note that my call looks like this:

```
time.sleep(2)  # attempting to see if this helps solve the problem
response = client.generate(model=model, prompt=prompt, system=system, options={'num_predict': num_predict, 'temperature': temperature, 'seed': seed}, keep_alive="0")
```
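
For reference, a self-contained version of the loop I'm describing would look roughly like this (the official ollama Python client is assumed; the model name, prompts, and the stable diffusion step are placeholders):

```python
import time
from ollama import Client  # assumed: the official ollama Python client

client = Client(host="http://localhost:11434")  # placeholder: default local server

def render_image(sd_prompt, seed):
    # placeholder for the ComfyUI / stable diffusion step
    pass

seed = 1
while True:
    time.sleep(2)  # attempting to see if this helps solve the problem
    response = client.generate(
        model="llama2",                                # placeholder model name
        prompt="Describe a scene for an image",        # placeholder prompt
        system="You write stable diffusion prompts.",  # placeholder system prompt
        options={"num_predict": 128, "temperature": 0.8, "seed": seed},
        keep_alive="0",  # unload the LLM from VRAM as soon as the response is done
    )
    render_image(response["response"], seed)  # VRAM is now free for stable diffusion
    seed += 1  # new seed for the next Ollama call
```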

Are there any recent changes that introduced the issue?

No response

OS

Windows

Architecture

No response

Platform

No response

Ollama version

v0.1.29

GPU

Nvidia

GPU info

RTX 2080 with latest Nvidia drivers.

CPU

Intel

Other software

No response

GiteaMirror added the nvidia, bug, windows labels 2026-04-22 05:20:50 -05:00
Author
Owner

@RandomGitUser321 commented on GitHub (Mar 20, 2024):

Example of the VRAM usage during my workflow:
![example](https://github.com/ollama/ollama/assets/27916165/a6d4d17f-1cb7-4f7e-a967-d762451c7d58)

EDIT: I've managed to reliably trigger this three times in a row. It happens on exactly the 30th call to Ollama. The task manager graph when it happens shows nothing abnormal. I'm going to test it again with shorter time periods between requests. Right now, there's roughly 30 seconds between requests (waiting on stable diffusion to render).

EDIT2: Yeah, even with shorter delays between requests, it still breaks on exactly the 30th request (I switched to a lightning model for 6-second renders vs 30-second renders before).

Author
Owner

@shanchuantian commented on GitHub (Mar 21, 2024):

I also hit the "CUDA error: out of memory" error on Windows. I was wondering if this problem exists on Linux, or if it's only on Windows? Does anyone know? Thanks.

Author
Owner

@dhiltgen commented on GitHub (Jun 1, 2024):

We've been making steady improvements in our memory prediction logic, so I'd give the latest release a try. If things are still ~unstable with the latest release, you might try adding some delay when toggling between VRAM consumers, to give the system some time to stabilize.

That said, I think you'll see the most improvement once I get #4599 resolved.
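
One way to apply that suggestion on the client side is to wrap the call in a small retry loop that pauses when the server reports an error. This is only a sketch, assuming the official ollama Python client and its ResponseError exception; the model name and delay are placeholders:

```python
import time
from ollama import Client, ResponseError  # assumed: the official ollama Python client

client = Client(host="http://localhost:11434")  # placeholder: default local server

def generate_with_retry(prompt, retries=3, delay=5.0):
    """Retry the request a few times, pausing between attempts so VRAM can settle."""
    for attempt in range(retries):
        try:
            return client.generate(model="llama2", prompt=prompt, keep_alive="0")
        except ResponseError as err:
            # e.g. "CUDA error: out of memory" surfaced by the server
            print(f"attempt {attempt + 1} failed: {err.error}")
            time.sleep(delay)  # give the GPU time to free memory before retrying
    raise RuntimeError("Ollama request kept failing after retries")
```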

Author
Owner

@dhiltgen commented on GitHub (Jun 22, 2024):

The latest release should perform much better on understanding the available VRAM on the system. If you're still having problems after updating to 0.1.45, please share an updated log and repro scenario and I'll reopen the issue.
