[GH-ISSUE #10631] Proposal for Manual VRAM Allocation in Modelfile #6994

Open
opened 2026-04-12 18:53:16 -05:00 by GiteaMirror · 0 comments

Originally created by @sunhy0316 on GitHub (May 9, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/10631

> ```
> PARAMETER num_gpu xxx
> ```

_Originally posted by @rick-github in [#10359](https://github.com/ollama/ollama/issues/10359#issuecomment-2864661617)_

---


Hi Team,

I'd like to suggest adding a feature to manually specify a model's VRAM allocation size directly in the Modelfile. Currently, when a second model is loaded, the first model is forcibly unloaded even if there is sufficient remaining VRAM. While `num_gpu` ensures a single model stays entirely in VRAM, it doesn't guarantee that multiple models can coexist.

By allowing an explicit VRAM size declaration in the Modelfile (e.g., `vram_size 24GB`), Ollama could (see the sketches after this list):

  1. Compare the declared size against the actual free VRAM. If adequate, load the new model *without* unloading the first.
  2. Maximize GPU utilization by enabling parallel model loading when possible.
  3. Unload other models if the declared size exceeds the available VRAM, with clear error messaging.
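
For concreteness, a hypothetical Modelfile using the proposed parameter could look like the following. Note that `vram_size` does not exist today; it is only the syntax suggested above, while `num_gpu` is an existing parameter:

```
FROM llama3
# Proposed, hypothetical parameter: declare this model's VRAM footprint up front
PARAMETER vram_size 24GB
# Existing parameter: number of layers to offload to the GPU
PARAMETER num_gpu 99
```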
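
And here is a minimal Go sketch of the fit-or-evict decision described in points 1–3. All names here (`scheduler`, `admit`, `loadedModel`) are hypothetical, and Ollama's actual scheduler is far more involved; this only illustrates the proposed admission check:

```go
package main

import (
	"errors"
	"fmt"
)

// loadedModel tracks a resident model and its declared VRAM footprint.
type loadedModel struct {
	name     string
	vramSize uint64 // bytes, from the proposed `vram_size` parameter
}

// scheduler is a toy stand-in for the part of the server that decides
// whether a new model can be loaded alongside the resident ones.
type scheduler struct {
	freeVRAM uint64 // bytes currently free on the GPU
	resident []loadedModel
}

// admit implements the proposed decision logic:
//  1. if the declared size fits in free VRAM, load alongside resident models;
//  2. otherwise evict resident models until it fits, or fail with a clear error.
func (s *scheduler) admit(m loadedModel) error {
	if m.vramSize <= s.freeVRAM {
		s.resident = append(s.resident, m)
		s.freeVRAM -= m.vramSize
		return nil
	}
	// Evict resident models (oldest first) until the new model fits.
	for len(s.resident) > 0 && m.vramSize > s.freeVRAM {
		evicted := s.resident[0]
		s.resident = s.resident[1:]
		s.freeVRAM += evicted.vramSize
		fmt.Printf("unloading %s to free %d bytes\n", evicted.name, evicted.vramSize)
	}
	if m.vramSize > s.freeVRAM {
		return errors.New("declared vram_size exceeds total available VRAM")
	}
	s.resident = append(s.resident, m)
	s.freeVRAM -= m.vramSize
	return nil
}

func main() {
	s := &scheduler{freeVRAM: 24 << 30} // 24 GiB GPU
	_ = s.admit(loadedModel{name: "llama3", vramSize: 8 << 30})
	_ = s.admit(loadedModel{name: "qwen", vramSize: 10 << 30}) // fits alongside
	if err := s.admit(loadedModel{name: "big", vramSize: 30 << 30}); err != nil {
		fmt.Println("load failed:", err) // exceeds even the empty GPU
	}
}
```

The key property is that eviction only happens when the declared sizes genuinely don't fit, rather than because of a conservative estimate.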

This approach would address:

  • The current conservative VRAM estimation, which forces unnecessary unloads.
  • User control over resource allocation for multi-model workflows.
  • Efficient VRAM usage without compromising stability.

Would this align with Ollama's memory management roadmap? Looking forward to your thoughts.



Reference: github-starred/ollama#6994