[GH-ISSUE #10484] Load the model in increments #53408

Closed
opened 2026-04-29 03:00:52 -05:00 by GiteaMirror · 1 comment
Owner

Originally created by @ferdinandkeller on GitHub (Apr 29, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/10484

We use small servers with a GPU to run our models, maybe something with 8 GB of RAM and 8 GB of VRAM for a 4 GB model.

When the model is loaded onto the GPU, it (apparently) is first read into RAM as one big chunk and then sent to VRAM, after which main memory is cleared.

But after a few hours of use, the server has presumably put the available memory to other uses, and now only 2 GB remain free.

That means that once the model is unloaded from VRAM, it can no longer be loaded as one big chunk, and ollama ends up waiting indefinitely.

We fixed it temporarily by preventing the model from being unloaded from VRAM (see the sketch below), but it would be great if there were an option to load the model in chunks, i.e. split the big file into smaller pieces that are progressively sent to the GPU.
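For anyone hitting the same issue, one way to pin a model in VRAM (not necessarily exactly what we did) is to send a request with `keep_alive` set to `-1`, which tells the server to keep the model loaded indefinitely. A minimal sketch in Go; the URL, port, and model name are placeholders for a local Ollama instance:

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"net/http"
)

func main() {
	// A generate request with no prompt just loads the model;
	// keep_alive: -1 asks the server to keep it resident in VRAM
	// instead of unloading it after the idle timeout.
	body := []byte(`{"model": "llama3", "keep_alive": -1}`)

	resp, err := http.Post("http://localhost:11434/api/generate",
		"application/json", bytes.NewReader(body))
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	out, _ := io.ReadAll(resp.Body)
	fmt.Println(resp.Status, string(out))
}
```

If I remember correctly, setting the `OLLAMA_KEEP_ALIVE` environment variable to `-1` on the server has the same effect for every model.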

This probably also affects people using really big models: they need the VRAM to run them efficiently, but requiring the same amount of RAM on top of that is a waste of resources.
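To make the request concrete, a rough host-side sketch of incremental loading could look like the following: read the weights file in fixed-size pieces and hand each piece to the GPU before reading the next, so peak RAM usage stays around one chunk instead of the whole file. This is only an illustration of the idea, not how ollama's loader works; `uploadChunkToGPU` is a hypothetical placeholder for whatever backend call actually copies data into VRAM, and the file name and chunk size are arbitrary.

```go
package main

import (
	"fmt"
	"io"
	"os"
)

// uploadChunkToGPU is a hypothetical placeholder: a real backend would do
// a cudaMemcpy (or equivalent) into a preallocated VRAM region here.
func uploadChunkToGPU(offset int64, chunk []byte) error {
	fmt.Printf("uploaded %d bytes at offset %d\n", len(chunk), offset)
	return nil
}

func main() {
	const chunkSize = 256 << 20 // 256 MiB: peak host memory stays near this size

	f, err := os.Open("model.gguf") // placeholder path to the model weights
	if err != nil {
		panic(err)
	}
	defer f.Close()

	buf := make([]byte, chunkSize)
	var offset int64
	for {
		n, err := io.ReadFull(f, buf)
		if n > 0 {
			if uerr := uploadChunkToGPU(offset, buf[:n]); uerr != nil {
				panic(uerr)
			}
			offset += int64(n)
		}
		if err == io.EOF || err == io.ErrUnexpectedEOF {
			break // last (possibly short) chunk was handled above
		}
		if err != nil {
			panic(err)
		}
	}
}
```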

GiteaMirror added the feature request label 2026-04-29 03:00:52 -05:00
Author
Owner

@rick-github commented on GitHub (Apr 29, 2025):

ollama doesn't stage the entire model in RAM before writing it to VRAM. What does happen is that the operating system page cache stores the disk blocks of the model in RAM as they are read from disk, reducing the amount of free RAM seen in `free`. This is just cache, though - if the OS needs some RAM to load a new program, it will discard the cached model blocks and re-use the RAM.
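You can see this distinction on a Linux host in /proc/meminfo: MemFree shrinks as the page cache fills with model blocks, while MemAvailable stays high because that cache is reclaimable. A small sketch that prints the relevant fields (assuming Linux):

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
)

func main() {
	// Print the /proc/meminfo fields that matter here: MemFree drops as the
	// page cache fills, but MemAvailable counts reclaimable cache as usable.
	f, err := os.Open("/proc/meminfo")
	if err != nil {
		panic(err)
	}
	defer f.Close()

	want := map[string]bool{"MemFree": true, "MemAvailable": true, "Cached": true}
	sc := bufio.NewScanner(f)
	for sc.Scan() {
		line := sc.Text()
		key := strings.TrimSpace(strings.SplitN(line, ":", 2)[0])
		if want[key] {
			fmt.Println(line)
		}
	}
}
```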

> and ollama ends up waiting indefinitely.

How are you observing this?

Reference: github-starred/ollama#53408