[GH-ISSUE #10952] Cannot run multiple models concurrently #69272

Closed
opened 2026-05-04 17:37:08 -05:00 by GiteaMirror · 6 comments
Owner

Originally created by @trdischat on GitHub (Jun 2, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/10952

What is the issue?

I am running Ollama on a separate Linux server on my LAN. The server has 96G of RAM and an RTX 3060 with 12G of VRAM. I am trying to run more than one model concurrently so that I don't have to wait for models to reload constantly. I set the following environment variables in ollama.service:

Environment="OLLAMA_HOST=0.0.0.0"
Environment="OLLAMA_ORIGINS=*"
Environment="OLLAMA_KEEP_ALIVE=-1"
Environment="OLLAMA_NUM_PARALLEL=4"
Environment="OLLAMA_MAX_LOADED_MODELS=4"
Environment="OLLAMA_DEBUG=1"

It doesn't matter which model I use: nothing runs concurrently, and querying one model always unloads the other from memory, regardless of model size. I tried this with many different models, including fairly small ones (mistral:7b and llama2:7b) that should have fit into VRAM with no problem. FWIW, I set up the exact same software and models on my Windows desktop computer, and it can run the models concurrently without a problem. Is this an issue with the Linux version of Ollama, or is it something in my Linux setup?
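
For anyone reproducing this, a minimal sketch of how these settings can be applied and then verified on a systemd install (assuming the stock `ollama.service` unit from the Linux installer):

```
# Put the Environment= lines in a drop-in rather than editing the packaged unit
sudo systemctl edit ollama.service      # paste the lines above under [Service]
sudo systemctl daemon-reload
sudo systemctl restart ollama

# Confirm the service actually picked the variables up
systemctl show ollama --property=Environment

# Load two small models, then check what stays resident
ollama run mistral:7b "hello" > /dev/null
ollama run llama2:7b "hello" > /dev/null
ollama ps
```

If `ollama ps` still shows only one model after the second run, the scheduler is evicting rather than co-loading, which matches the log lines below.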

[ollama.log](https://github.com/user-attachments/files/20559277/ollama.log)

Relevant log output

Log file attached.

Not sure if it is relevant, but the Linux logs contain these lines that are not present in the Windows logs:

msg="runner with non-zero duration has gone idle, adding timer"
msg="found an idle runner to unload"
msg="resetting model to expire immediately to make room"
msg="waiting for pending requests to complete and unload to occur"
msg="runner expired event received"
msg="got lock to unload expired event"

OS

Linux

GPU

Nvidia

CPU

Intel

Ollama version

0.9.0

GiteaMirror added the bug and needs more info labels 2026-05-04 17:37:09 -05:00
Author
Owner

@Mugl3 commented on GitHub (Jun 3, 2025):

Seeing the same here. Windows with v0.90.0

Author
Owner

@trdischat commented on GitHub (Jun 3, 2025):

I am really confused now. I added a second RTX 3060 to my server, and now I can load and run multiple models concurrently. The two models I am testing with, gemma3:1b and qwen2.5-coder:1.5b, are quite small. Each RTX 3060 has 12GB of VRAM, so I went from 12GB of VRAM to 24GB. Concurrent loading should have worked with either one card or two, but adding the second card seems to have made a difference.

Reviewing the log files from today and yesterday was interesting:

| Memory (GB) | Yesterday | Today |
| ------------------ | :-------: | :---: |
| gemma3:1b | 3.0 | 2.1 |
| qwen2.5-coder:1.5b | 8.1 | 4.7 |
| Available VRAM | 12 | 24 |

The amount of memory used by these models changed pretty significantly. Overall, my impression is that you either need to have a massive amount of VRAM or use really small models if you want to run multiple models concurrently.
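
For side-by-side comparisons like this, a simple way to snapshot per-model and per-GPU memory at the same moment (commands only; actual output will depend on the setup):

```
# Per-model allocation as Ollama reports it
ollama ps

# Per-GPU usage as the driver reports it
nvidia-smi --query-gpu=index,name,memory.used,memory.total --format=csv
```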

Author
Owner

@NGC13009 commented on GitHub (Jun 5, 2025):

@trdischat

Environment="OLLAMA_NUM_PARALLEL=1"   # fix to 1
Environment="OLLAMA_MAX_LOADED_MODELS=4"

The parallelism setting multiplies the VRAM that the KV cache actually needs, because every concurrent slot gets its own cache. For example, with the default num_ctx of 2048 and a parallelism of 4, a load effectively requests VRAM for an 8192-token context. A 7B-scale model (qwen, for instance) may then need roughly 8 GB of VRAM (assuming q4 quantization), so starting a second one exceeds the card's capacity and the older model gets unloaded.

With a parallelism of 1, requests are processed one at a time, so this problem doesn't arise.

OLLAMA_MAX_LOADED_MODELS doesn't need any special value; it is just the maximum number of models that can be loaded at once, and it doesn't by itself reserve VRAM for that many.
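
To make the arithmetic concrete, a rough sketch (the 2048-token default and the q4 sizes above are estimates, not measurements): with `OLLAMA_NUM_PARALLEL=4`, the server reserves KV cache for 4 × 2048 = 8192 tokens per loaded model. Either drop the parallelism to 1 in the service drop-in, or set the per-request context explicitly, which also affects how much KV cache a load reserves, for example via the API:

```
# One-off request with an explicit 2048-token context window
curl -s http://localhost:11434/api/generate -d '{
  "model": "qwen2.5-coder:1.5b",
  "prompt": "hello",
  "options": { "num_ctx": 2048 }
}'
```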

Author
Owner

@sunhy0316 commented on GitHub (Jun 5, 2025):

Two models each run fine on their own, and `nvidia-smi` shows their VRAM usage as a and b respectively, with a + b < available VRAM. Even so, they sometimes still cannot be loaded simultaneously.

Author
Owner

@duck-5 commented on GitHub (Jun 6, 2025):

Can you include the `nvidia-smi` output and `ollama ps` output?

Author
Owner

@trdischat commented on GitHub (Jun 12, 2025):

After quite a bit of testing on a system with 96GB of system RAM and two RTX 3060 12GB cards (total of 24GB of VRAM), the key environment settings for me were:

| Variable | Impact |
| ----- | ----- |
| `OLLAMA_CONTEXT_LENGTH=16000` | Reducing the size of the context window left a lot more room for other models to load |
| `GGML_CUDA_ENABLE_UNIFIED_MEMORY=1` | Enables unified memory, which avoids crashes when loading larger models |

Other variables that I experimented with but found I could leave at their defaults for my setup: `OLLAMA_NUM_PARALLEL`, `OLLAMA_MAX_LOADED_MODELS`, and `CUDA_VISIBLE_DEVICES`.

With the right settings and appropriate model selection, I can run multiple models concurrently without a problem. At this point, the issue that I originally reported appears to have been operator error on my part.
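
Put together, a sketch of the drop-in this thread ends up with (the host, origins, and keep-alive lines are carried over from the original post; the file path is just the conventional location `systemctl edit` writes to, and values should be adjusted to the hardware at hand):

```
# /etc/systemd/system/ollama.service.d/override.conf
[Service]
Environment="OLLAMA_HOST=0.0.0.0"
Environment="OLLAMA_ORIGINS=*"
Environment="OLLAMA_KEEP_ALIVE=-1"
Environment="OLLAMA_CONTEXT_LENGTH=16000"
Environment="GGML_CUDA_ENABLE_UNIFIED_MEMORY=1"
```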
