[GH-ISSUE #9462] Ollama logic for GPU choice is suboptimal. #68224

Closed
opened 2026-05-04 12:56:21 -05:00 by GiteaMirror · 4 comments

Originally created by @CoolShades on GitHub (Mar 2, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/9462

What is the issue?

Suboptimal GPU Selection Logic in Ollama

Issue Description

Ollama's current GPU selection logic doesn't optimally allocate models based on GPU performance characteristics. When multiple GPUs with different performance profiles are present, Ollama should prioritize the fastest GPU for small models and utilize multiple GPUs for larger models that exceed the memory of a single GPU.

Current Behavior

  • With multiple GPUs present, Ollama appears to make GPU selection decisions based primarily on VRAM capacity rather than performance characteristics
  • In a mixed GPU environment (e.g., RTX 4090 + RTX 3090), Ollama may use the slower GPU with slightly more VRAM instead of the faster GPU
  • When a model exceeds the memory of the faster GPU, Ollama may fall back to system RAM rather than utilizing both GPUs

Expected Behavior

Ollama should (a sketch of this policy follows the list):

  1. Default to using the fastest GPU for models that fit within its VRAM
  2. Utilize multiple GPUs for models that exceed the memory of the fastest GPU
  3. Never use only the slower GPU when a faster GPU is available
  4. Provide configurable options to specify preferred GPU selection logic
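
A minimal sketch of this policy in Go (the language Ollama is written in), assuming a hypothetical gpuInfo struct and pickGPUs function; none of these names correspond to Ollama's actual scheduler types, and the performance scores are illustrative only.

```go
package main

import (
	"fmt"
	"sort"
)

// gpuInfo is a hypothetical stand-in for whatever struct the scheduler
// uses to describe a discovered GPU; PerfScore is an assumed relative
// throughput estimate (higher is faster).
type gpuInfo struct {
	ID        int
	FreeVRAM  uint64 // usable VRAM in bytes
	PerfScore float64
}

// pickGPUs sketches the policy from the list above: use the fastest GPU
// alone when the model fits in its VRAM, otherwise add GPUs fastest-first
// until the model fits, and only fall back to CPU when even that fails.
func pickGPUs(gpus []gpuInfo, modelBytes uint64) []gpuInfo {
	if len(gpus) == 0 {
		return nil
	}

	// Order by performance, not by free VRAM.
	sort.Slice(gpus, func(i, j int) bool {
		return gpus[i].PerfScore > gpus[j].PerfScore
	})

	// 1. Model fits on the fastest GPU: use it alone.
	if gpus[0].FreeVRAM >= modelBytes {
		return gpus[:1]
	}

	// 2. Otherwise accumulate GPUs fastest-first, so the fastest GPU is
	//    always part of the set, until the model fits.
	var free uint64
	for i, g := range gpus {
		free += g.FreeVRAM
		if free >= modelBytes {
			return gpus[:i+1]
		}
	}

	// 3. Not enough VRAM across all GPUs: signal fallback to system RAM.
	return nil
}

func main() {
	gpus := []gpuInfo{
		{ID: 0, FreeVRAM: 24564 << 20, PerfScore: 1.3}, // RTX 4090, score is illustrative
		{ID: 1, FreeVRAM: 24576 << 20, PerfScore: 1.0}, // RTX 3090
	}
	// A 22 GiB model (as in the comment below) fits on the 4090,
	// so only device 0 is returned.
	fmt.Println(pickGPUs(gpus, 22<<30))
}
```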

System Configuration

Environment:
- RTX 4090 (24564 MiB VRAM, Device #0) - Faster but slightly less VRAM
- RTX 3090 (24576 MiB VRAM, Device #1) - Slower but marginally more VRAM
- Docker configuration with nvidia-runtime

Docker Compose Configuration

```yaml
ollama:
  image: ollama/ollama
  environment:
    - NVIDIA_VISIBLE_DEVICES=0,1
    - CUDA_DEVICE_ORDER=PCI_BUS_ID
    - OLLAMA_NUM_GPU=2
    - OLLAMA_HOST=0.0.0.0
  deploy:
    resources:
      reservations:
        devices:
          - driver: nvidia
            count: all
            capabilities: [gpu]
```

Proposed Solutions

  1. Add configuration options to specify GPU selection preferences (a sketch of how these might be read follows this list):
    • OLLAMA_PREFERRED_GPU: Specify the primary GPU device index
    • OLLAMA_GPU_STRATEGY: Options like "fastest-first", "memory-first", or "performance-model"
  2. Implement smarter default behavior that considers both VRAM capacity and GPU performance metrics when making allocation decisions
  3. Add a way to specify model-specific GPU configurations without needing custom entrypoint scripts
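
As a purely hypothetical illustration of option 1, the sketch below shows how the proposed environment variables might be consumed; neither OLLAMA_PREFERRED_GPU nor OLLAMA_GPU_STRATEGY exists in current Ollama releases, and the names, values, and defaults are assumptions made for this proposal only.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

// gpuStrategy enumerates the values proposed above; these strings are part
// of this proposal, not of any released Ollama version.
type gpuStrategy string

const (
	strategyFastestFirst     gpuStrategy = "fastest-first"
	strategyMemoryFirst      gpuStrategy = "memory-first"
	strategyPerformanceModel gpuStrategy = "performance-model"
)

// readGPUPrefs reads the two proposed environment variables, returning the
// preferred device index (-1 meaning "no preference") and the strategy,
// defaulting to fastest-first when unset or unrecognized.
func readGPUPrefs() (int, gpuStrategy) {
	preferred := -1
	if v := os.Getenv("OLLAMA_PREFERRED_GPU"); v != "" {
		if idx, err := strconv.Atoi(v); err == nil && idx >= 0 {
			preferred = idx
		}
	}

	strategy := strategyFastestFirst
	switch s := gpuStrategy(os.Getenv("OLLAMA_GPU_STRATEGY")); s {
	case strategyMemoryFirst, strategyPerformanceModel:
		strategy = s
	}
	return preferred, strategy
}

func main() {
	preferred, strategy := readGPUPrefs()
	fmt.Printf("preferred GPU: %d, strategy: %s\n", preferred, strategy)
}
```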

Impact

This issue affects users with heterogeneous GPU setups who want to maximize performance. The current behavior leads to suboptimal performance by not utilizing the fastest available GPU or by falling back to system RAM instead of using multiple GPUs.

Workarounds

The current workaround is custom entrypoint scripts or manually setting CUDA_VISIBLE_DEVICES for different models, but this is cumbersome and requires manual intervention.

Relevant log output


OS

Docker

GPU

Nvidia

CPU

Intel

Ollama version

0.5.12

GiteaMirror added the bug label 2026-05-04 12:56:21 -05:00

@CoolShades commented on GitHub (Mar 3, 2025):

[Screenshot: https://github.com/user-attachments/assets/70db5efd-ee2c-4234-af54-bdc9de001107]

TL;DR: The model is 22 GB. I have a 4090 and a 3090. Ollama chooses the 3090 because it has slightly more VRAM (by 12 MB).
It should instead be using the 4090, as it is the more capable card.


@CoolShades commented on GitHub (Mar 4, 2025):

Does anyone know whereabouts in the codebase this logic lives, so I can find it and possibly fix it with a pull request?


@rick-github commented on GitHub (Mar 4, 2025):

#8430


@apunkt commented on GitHub (Mar 15, 2025):

This behavior seems to have been introduced somewhere between v0.5.x and v0.6:
my T4 (16 GB) had always been selected before my M40s (24 GB), but since upgrading to 0.6 Ollama avoids the T4 at all costs.

Which is bad for me, as the T4 is more powerful than the M40s...


Reference: github-starred/ollama#68224