[GH-ISSUE #5114] Ollama not loading in gpu with docker on latest version but works on 0.1.31 which doesn't have multi-user concurrency #3227

Closed
opened 2026-04-12 13:44:03 -05:00 by GiteaMirror · 4 comments

Originally created by @bluenevus on GitHub (Jun 18, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/5114

Originally assigned to: @dhiltgen on GitHub.

What is the issue?

Absolutely excited to see you have multi-user concurrency. I set up Ollama on Docker with 8 GPUs. I could get two models to run on GPU, Llava and Llamaguard2, each in their own container. No other models would load onto a GPU even when nothing else was using it. I tried `--gpus 2`, I tried `--gpus '"device=0,1"'`, and I tried `--gpus all`; no luck, only those two specific models loaded, and I could only get a model to load if I assigned a single device, like so: `--gpus device=1`. I read through the issues and there was one comment suggesting going back to 0.1.31, but that version doesn't seem to support multi-user concurrency via `-e OLLAMA_NUM_PARALLEL=10`. I tried this on 4090s, RTX 8000s, and A6000s, and they all have the same issue with v0.1.44.
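
For reference, a minimal sketch of the invocations described above, assuming the standard `ollama/ollama` image with the NVIDIA Container Toolkit installed; the container name is illustrative, and the port and volume are the documented defaults:

```shell
# The only variant that reportedly worked: pin the container to a single GPU
docker run -d --gpus device=1 \
  -e OLLAMA_NUM_PARALLEL=10 \
  -v ollama:/root/.ollama -p 11434:11434 \
  --name ollama-llava ollama/ollama

# Variants that reportedly failed to load models onto the GPU:
#   --gpus 2                (request any two GPUs)
#   --gpus '"device=0,1"'   (explicit device list; the extra quotes keep
#                            the shell from splitting on the comma)
#   --gpus all              (all eight GPUs)
```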

OS

Linux, Docker

GPU

Nvidia

CPU

Intel

Ollama version

v0.1.44

GiteaMirror added the bug label 2026-04-12 13:44:03 -05:00

@dhiltgen commented on GitHub (Jun 18, 2024):

To clarify, `OLLAMA_NUM_PARALLEL` only allows multiple concurrent requests to a given model. To load multiple models, you need to adjust `OLLAMA_MAX_LOADED_MODELS`, which is what it sounds like you're trying to do. Be aware that high parallel settings will lead to a lot of VRAM usage, as the context size has to be multiplied to handle the concurrent requests. You'll see the results of this via `ollama ps` and the VRAM usage.

(hint: you can now run `ollama serve --help` to get a quick synopsis of the settings that are available)
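
As a concrete sketch of the distinction between the two settings (the values here are illustrative, not taken from the thread):

```shell
# OLLAMA_NUM_PARALLEL:      concurrent requests served per loaded model
# OLLAMA_MAX_LOADED_MODELS: how many distinct models may stay resident at once
# VRAM scales with parallelism: e.g. a 2048-token context with
# OLLAMA_NUM_PARALLEL=10 reserves a 10 * 2048 = 20480-token KV cache.
docker run -d --gpus all \
  -e OLLAMA_NUM_PARALLEL=10 \
  -e OLLAMA_MAX_LOADED_MODELS=2 \
  -v ollama:/root/.ollama -p 11434:11434 \
  --name ollama ollama/ollama

# Inspect what is loaded and its VRAM footprint
docker exec -it ollama ollama ps
```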


@bluenevus commented on GitHub (Jun 18, 2024):

Thanks. Yes, I'm only trying to do multiple concurrent connections, and that doesn't work. I'm not trying to load multiple models from the same container; I have one container per model, each with its own GPU or GPUs attached.
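
A sketch of that per-model, per-GPU layout, under the same assumptions as above (container names, host ports, and device assignments are illustrative):

```shell
# One Ollama container per model, each pinned to its own GPU and host port
docker run -d --gpus device=0 -p 11434:11434 --name ollama-llava      ollama/ollama
docker run -d --gpus device=1 -p 11435:11434 --name ollama-llamaguard ollama/ollama
```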


@dhiltgen commented on GitHub (Jun 19, 2024):

Can you share the server logs, so I can try to understand why it isn't setting up for parallel requests?
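
For a Docker deployment like the one described, the server log can typically be captured from the container itself (the container name is illustrative):

```shell
docker logs --tail 200 ollama-llava > ollama-server.log 2>&1
```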


@bluenevus commented on GitHub (Jun 19, 2024):

Thanks, I just removed Ollama everywhere and went with vLLM. I appreciate your fast response and help, though, and wish Ollama the best. It's an amazing product with an amazing team doing great work.

Reference: github-starred/ollama#3227