[GH-ISSUE #4747] Running multiple models simultaneously, always using one card #2990

Closed
opened 2026-04-12 13:23:04 -05:00 by GiteaMirror · 4 comments
Owner

Originally created by @leoHostProject on GitHub (May 31, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/4747

What is the issue?

When running multiple models simultaneously, Ollama always uses one card, even though I have 4 cards and have downloaded 4 models.
When multiple users send requests at the same time, it always unloads the model on the first card and loads the other models there, instead of using my other idle cards.

OS

Linux

GPU

Nvidia

CPU

Intel

Ollama version

0.1.33

GiteaMirror added the bug label 2026-04-12 13:23:04 -05:00
Author
Owner

@leoHostProject commented on GitHub (May 31, 2024):

These are my Ollama settings:

ExecStart=/usr/local/bin/ollama serve
User=ollama
Group=ollama
Restart=always
RestartSec=3
Environment="PATH=/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin"
Environment="OLLAMA_HOST=0.0.0.0"
Environment="CUDA_VISIBLE_DEVICES=0,1,2,3"
Environment="OLLAMA_NUM_PARALLEL=4"

Author
Owner

@leoHostProject commented on GitHub (May 31, 2024):

GPU resources: 4x NVIDIA A100 80GB PCIe
NVIDIA-SMI 535.154.05
CUDA Version: 12.2
All four GPUs (0, 1, 2, 3) show up in nvidia-smi.
Ollama always uses GPU 0 and leaves 1, 2, 3 idle.

Author
Owner

@DuckyBlender commented on GitHub (May 31, 2024):

Please update your Ollama to see if this has been fixed in a newer version.

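On Linux, re-running the official install script also upgrades an existing install (a sketch, assuming the standard install method rather than a manual or containerized setup):

curl -fsSL https://ollama.com/install.sh | sh
ollama -v

ollama -v prints the running version, so you can confirm the upgrade took effect before retesting.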
Author
Owner

@dhiltgen commented on GitHub (May 31, 2024):

@leoHostProject to load 4 models on 4 GPUs concurrently, you need to also set OLLAMA_MAX_LOADED_MODELS - the current default is 1. In a future version this will be autodetected, but for now while it's still experimental, you have to set this variable to allow concurrent model loading.

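For a systemd-managed install, that could look like the drop-in below (a sketch, assuming the unit is named ollama.service and a version where OLLAMA_MAX_LOADED_MODELS and OLLAMA_NUM_PARALLEL are honored):

# /etc/systemd/system/ollama.service.d/override.conf
[Service]
Environment="OLLAMA_HOST=0.0.0.0"
Environment="CUDA_VISIBLE_DEVICES=0,1,2,3"
Environment="OLLAMA_NUM_PARALLEL=4"
Environment="OLLAMA_MAX_LOADED_MODELS=4"

Then run sudo systemctl daemon-reload && sudo systemctl restart ollama. With up to four models allowed in memory at once, the scheduler should be able to place them across the idle GPUs instead of evicting the model on GPU 0.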
Reference: github-starred/ollama#2990