[GH-ISSUE #6251] Ollama multiuser scale #65948

Open
opened 2026-05-03 23:17:52 -05:00 by GiteaMirror · 0 comments
Owner

Originally created by @jamiabailey on GitHub (Aug 8, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/6251

Originally assigned to: @dhiltgen on GitHub.

I'm looking for some scale numbers on what Ollama supports in multi-user environments. I see OLLAMA_NUM_PARALLEL for adjusting how many simultaneous requests can be served, and OLLAMA_MAX_QUEUE for how many requests can be queued before being rejected, but nothing that helps me understand how those settings translate into designing a system that serves a large number of users, or how much GPU capacity that would require. Is Ollama a fit for large-scale environments with a very large number of users, without having to put an endless number of Ollama instances behind a load-balancer VIP? Has anyone done scale testing to inform larger designs, or does Ollama still fit mostly the desktop use case? Will containers help here, or is it strictly an underlying GPU/memory constraint? The cost of the underlying servers is not an issue for us; we just need scale. Please advise.
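For reference, the two knobs mentioned above are environment variables read by `ollama serve` at startup. A minimal sketch of raising them (the values here are illustrative assumptions, not tested scale numbers):

```shell
# Illustrative values only -- tune for your hardware; these are not benchmarks.
# OLLAMA_NUM_PARALLEL: simultaneous requests served per loaded model.
# OLLAMA_MAX_QUEUE: requests held in the queue before new ones are rejected.
export OLLAMA_NUM_PARALLEL=8
export OLLAMA_MAX_QUEUE=512
ollama serve
```

Note that raising OLLAMA_NUM_PARALLEL also raises GPU memory use, since each parallel slot needs its own share of the model's context (KV cache), so the practical ceiling is set by VRAM rather than by the variable itself.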

GiteaMirror added the question, feature request labels 2026-05-03 23:17:52 -05:00

Reference: github-starred/ollama#65948