[GH-ISSUE #14965] Start inferencing on CPU while the model weights are being uploaded to GPU VRAM #71683

Closed
opened 2026-05-05 02:20:32 -05:00 by GiteaMirror · 1 comment
Owner

Originally created by @AIWintermuteAI on GitHub (Mar 19, 2026).
Original GitHub issue: https://github.com/ollama/ollama/issues/14965

As per the title: I'm wondering whether this is possible.
For more context: my situation is probably similar to that of a fair share of ollama users. I run Open WebUI + ollama locally, using it myself and sharing it with family members. There are multiple downloaded models to choose from, so keeping a single model resident in VRAM at all times is not desirable.
There is quite a bit of delay while a model is being loaded into VRAM. Perhaps it would be possible to use the CPU for slower inference during that time, so that users at least see a response being generated?

GiteaMirror added the feature request label 2026-05-05 02:20:32 -05:00

@rick-github commented on GitHub (Mar 19, 2026):

Generating a single token requires a full forward pass through the model weights, so even if inference started as soon as the first layer was loaded, the user still wouldn't see a response until the model is fully loaded. If you are OK with slower responses, you could configure some of the less frequently used models to run only in RAM by setting [`num_gpu`](https://github.com/ollama/ollama/issues/6950#issuecomment-2373663650); that way they won't evict the main model from the GPU.
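
To illustrate that workaround, here is a minimal sketch using the `ollama` Python client; the model name is hypothetical, and the exact response object shape depends on the client version. Setting `num_gpu` to 0 requests that zero layers be offloaded to the GPU, so the model runs entirely from system RAM:

```python
import ollama

# Ask for a CPU-only load of a less frequently used model by offloading
# zero layers to the GPU; the frequently used model stays resident in VRAM.
response = ollama.chat(
    model="llama3.2",  # hypothetical model name, substitute your own
    messages=[{"role": "user", "content": "Hello"}],
    options={"num_gpu": 0},  # 0 GPU layers -> inference runs on the CPU
)
print(response["message"]["content"])
```

The same `num_gpu` option can also be passed per request in the REST API's `options` field, so it doesn't require modifying the model itself.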


Reference: github-starred/ollama#71683