[GH-ISSUE #3170] Concurrency #27712

Closed
opened 2026-04-22 05:15:15 -05:00 by GiteaMirror · 2 comments

Originally created by @samyfodil on GitHub (Mar 15, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/3170

What are you trying to do?

Right now, Ollama is limited to one request at a time against one loaded model. Even desktop GPUs can easily hold more than one model in memory. While you can work around this by running multiple instances of Ollama, fixing the issue at the core is better.
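
To make the limitation concrete, here is a minimal Go sketch (mine, not from the original issue) that fires two generate requests at a single Ollama instance concurrently via the `/api/generate` endpoint; with one instance, the second request queues behind the first, which the per-request timings make visible. The model names are placeholders for models you have pulled locally, and 11434 is Ollama's default port.

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"net/http"
	"sync"
	"time"
)

// generate sends one non-streaming request to /api/generate and
// returns how long it took end to end.
func generate(model, prompt string) time.Duration {
	body := fmt.Sprintf(`{"model":%q,"prompt":%q,"stream":false}`, model, prompt)
	start := time.Now()
	resp, err := http.Post("http://localhost:11434/api/generate",
		"application/json", bytes.NewBufferString(body))
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	io.Copy(io.Discard, resp.Body) // wait for the full response
	return time.Since(start)
}

func main() {
	var wg sync.WaitGroup
	for _, m := range []string{"llama2", "mistral"} { // placeholder model names
		wg.Add(1)
		go func(model string) {
			defer wg.Done()
			fmt.Printf("%s took %v\n", model, generate(model, "Say hi."))
		}(m)
	}
	wg.Wait()
}
```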

How should we solve this?

The Ext Server was built on top of the server example in llama.cpp, which can only load one model at a time. That limitation belongs to the example server, not to llama.cpp itself. Instead of building on top of the example server, we could reuse the same code but add the ability to load multiple models. While doing that, we could perhaps also get rid of the JSON encoding/decoding between the ext server and the example server used on the backend.
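
As a rough illustration of that direction (my sketch, not code from Ollama or llama.cpp; `LoadedModel` and `loadModel` are hypothetical stand-ins for whatever handle and loader the llama.cpp bindings would actually expose), a pool keyed by model name loads each model at most once and lets requests against already-loaded models proceed concurrently:

```go
package main

import (
	"fmt"
	"sync"
)

// LoadedModel and loadModel are hypothetical placeholders for the
// real llama.cpp model handle and loading call.
type LoadedModel struct{ name string }

func loadModel(name string) (*LoadedModel, error) {
	return &LoadedModel{name: name}, nil
}

// ModelPool keeps one loaded instance per model name instead of a
// single global model, so requests for already-loaded models can be
// served concurrently.
type ModelPool struct {
	mu     sync.Mutex
	models map[string]*LoadedModel
}

func NewModelPool() *ModelPool {
	return &ModelPool{models: make(map[string]*LoadedModel)}
}

// Get returns the model for name, loading it on first use. Note the
// lock serializes loads as well as map access here; a real
// implementation would load outside the lock and budget VRAM.
func (p *ModelPool) Get(name string) (*LoadedModel, error) {
	p.mu.Lock()
	defer p.mu.Unlock()
	if m, ok := p.models[name]; ok {
		return m, nil
	}
	m, err := loadModel(name)
	if err != nil {
		return nil, err
	}
	p.models[name] = m
	return m, nil
}

func main() {
	pool := NewModelPool()
	var wg sync.WaitGroup
	for _, name := range []string{"llama2", "mistral", "llama2"} {
		wg.Add(1)
		go func(n string) {
			defer wg.Done()
			m, _ := pool.Get(n)
			fmt.Println("request served by", m.name)
		}(name)
	}
	wg.Wait()
}
```

A real implementation would also need to decide when to evict models under VRAM pressure and how to bound concurrent loads, but the keyed pool is the essential change from a single global instance.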

What is the impact of not solving this?

There is a significant performance impact, since requests are queued and served one at a time, and switching between models is especially costly because only one model is resident.
It also makes Ollama hard to use in a cloud setup.

Anything else?

I discovered the issue while building a [tau](https://github.com/taubyte/tau) plugin for exposing Ollama through WASM: https://github.com/samyfodil/ollama (see the `tau` folder).

![](https://github.com/ollama/ollama/assets/76626119/be0fee96-e9b6-4996-aba5-0c7b76ea0589)


@mxyng commented on GitHub (Mar 18, 2024):

Duplicate of #358


@pdevine commented on GitHub (Mar 18, 2024):

Thanks for the issue @samyfodil, this is definitely on the radar.
