[GH-ISSUE #9319] Ollama parallel configuration tweaks for more workloads on the same server #6082

Open
opened 2026-04-12 17:24:42 -05:00 by GiteaMirror · 0 comments
Owner

Originally created by @Fade78 on GitHub (Feb 24, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/9319

Ollama parallel configuration tweaks for more workloads on the same server

Increase responsiveness when you mix light (quick) and heavy (slow, where you are OK waiting) workloads.

Use case

  • I use a front end with multiple users.
  • Most of the time the model I use takes all the VRAM, so I set both the parallelism (OLLAMA_NUM_PARALLEL) and the number of simultaneously loaded models (OLLAMA_MAX_LOADED_MODELS) to 1.
  • But sometimes I run a very big model that doesn't fit in VRAM, so Ollama schedules it on the CPU and it runs for dozens of minutes.
  • While a model scheduled 100% on the CPU is running, the GPU sits unused because of those settings!

Feature request

  • The current limit variables (OLLAMA_MAX_LOADED_MODELS, OLLAMA_NUM_PARALLEL, OLLAMA_MAX_QUEUE) stay in place and become overall limits, behaving as they do now.
  • New OLLAMA_MAX_LOADED_MODELS_CPU, OLLAMA_NUM_PARALLEL_CPU, and OLLAMA_MAX_QUEUE_CPU apply only to models loaded 100% on the CPU.
  • New OLLAMA_MAX_LOADED_MODELS_GPU, OLLAMA_NUM_PARALLEL_GPU, and OLLAMA_MAX_QUEUE_GPU apply only to models loaded 100% on the GPU.
  • A new OLLAMA_PREVENT_GPU_HYBRID=[percentage] loads a model 100% on the CPU instead of splitting it CPU/GPU whenever the GPU portion would be at least the given percentage (a minimal sketch of this rule follows the list).
    • When a model is loaded in hybrid mode, it takes 100% of the GPU VRAM, but depending on how much of it sits on the CPU, the GPU computation has to wait for the CPU computation; the GPU then runs at low utilization, which wastes its power.
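To make the threshold concrete, here is a minimal Go sketch of the proposed rule. This is not Ollama's scheduler code: placeModel and its gpuPct parameter are hypothetical names, and gpuPct stands for the percentage of the model that would sit in VRAM under a hybrid load.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

// placeModel returns where a model should run. gpuPct is the percentage of
// the model that would sit in VRAM if it were loaded in hybrid mode
// (100 means it fits entirely on the GPU).
func placeModel(gpuPct int) string {
	if gpuPct >= 100 {
		return "gpu" // fits in VRAM: no hybrid penalty
	}
	// Proposed OLLAMA_PREVENT_GPU_HYBRID: once the GPU portion reaches the
	// threshold, force a full-CPU load so the VRAM stays free for others.
	if raw := os.Getenv("OLLAMA_PREVENT_GPU_HYBRID"); raw != "" {
		if threshold, err := strconv.Atoi(raw); err == nil && gpuPct >= threshold {
			return "cpu"
		}
	}
	return "hybrid"
}

func main() {
	os.Setenv("OLLAMA_PREVENT_GPU_HYBRID", "30")
	for _, pct := range []int{100, 29, 30} {
		fmt.Printf("gpuPct=%3d%% -> %s\n", pct, placeModel(pct))
	}
}
```

With OLLAMA_PREVENT_GPU_HYBRID=30 this prints gpu, hybrid, and cpu for 100%, 29%, and 30% respectively, matching the consequences worked through below.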

How you can use it

```
OLLAMA_MAX_LOADED_MODELS_GPU=1
OLLAMA_NUM_PARALLEL_GPU=1
OLLAMA_MAX_LOADED_MODELS_CPU=2
OLLAMA_NUM_PARALLEL_CPU=2
OLLAMA_PREVENT_GPU_HYBRID=30
```

Some consequences:

  • If a model takes 29% GPU or less, it runs in hybrid mode. If another model is then requested, it is automatically loaded on the CPU, because in this scenario the hybrid model already occupies 100% of the GPU VRAM.
  • If a model would take 30% GPU or more, then because of OLLAMA_PREVENT_GPU_HYBRID=30 it runs in full CPU mode instead of hybrid mode, leaving 100% of the GPU free.
  • With this configuration, two long queries can run on the CPU (and some CPUs are very powerful nowadays), while short ones, which use models small enough to fit in VRAM, are handled quickly by the GPU (a rough admission sketch follows the list).
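To illustrate how the split caps could interact with the existing overall limit, here is a rough Go sketch. The limits type, canLoad, and the overall cap of three models are illustrative assumptions, not Ollama behavior.

```go
package main

import "fmt"

// limits mirrors the proposed split caps plus the existing overall cap.
type limits struct {
	maxLoadedGPU, maxLoadedCPU, maxLoadedTotal int
}

// canLoad reports whether one more model may be loaded on the given backend,
// given how many models each pool already holds.
func (l limits) canLoad(backend string, loadedGPU, loadedCPU int) bool {
	if loadedGPU+loadedCPU >= l.maxLoadedTotal {
		return false // OLLAMA_MAX_LOADED_MODELS stays an overall limit
	}
	switch backend {
	case "gpu":
		return loadedGPU < l.maxLoadedGPU // OLLAMA_MAX_LOADED_MODELS_GPU
	case "cpu":
		return loadedCPU < l.maxLoadedCPU // OLLAMA_MAX_LOADED_MODELS_CPU
	}
	return false
}

func main() {
	// Matches the example configuration; the overall cap of 3 is assumed.
	l := limits{maxLoadedGPU: 1, maxLoadedCPU: 2, maxLoadedTotal: 3}
	fmt.Println(l.canLoad("cpu", 1, 1)) // true: room for a second CPU model
	fmt.Println(l.canLoad("gpu", 1, 1)) // false: the GPU pool is already full
}
```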
GiteaMirror added the feature request label 2026-04-12 17:24:42 -05:00
Reference: github-starred/ollama#6082