mirror of
https://github.com/ollama/ollama.git
synced 2026-05-06 08:02:14 -05:00
Closed
opened 2026-04-22 01:46:18 -05:00 by GiteaMirror
·
84 comments
Originally created by @youssef02 on GitHub (Aug 16, 2023).
Original GitHub issue: https://github.com/ollama/ollama/issues/358
Originally assigned to: @dhiltgen on GitHub.
The app is amazing, but the problem is: if I want to create a multi-agent setup from one API, I need to create a queue system, as it can reply to only one request at a time. Is there a way to improve this, or do I have to implement a queue system?
I just started here, so sorry for any mistakes ;)
@jmorganca commented on GitHub (Aug 23, 2023):
Not a mistake – Ollama will serve one generation at a time currently, but supporting 2+ concurrent requests is definitely on the roadmap
@LeHaroun commented on GitHub (Oct 17, 2023):
Is there a way to run multiple instances on the same machine?
I am working on a similar implementation using MAS. If you have beefy hardware, I recommend running several instances in closed environments; save the completions or generations and process them later on.
Also check out LiteLLM APIs on top of Ollama. Quite helpful.
@skye0402 commented on GitHub (Jan 9, 2024):
This would definitely be a great addition to Ollama:
I'm running it in the cloud on a T4 with 16GB of GPU memory, and having both phi-2 and codellama in VRAM would be no issue at all. Ideally, other models would be kept in regular RAM instead of being loaded from disk.
Adding to it: users might switch models, so the queuing approach would apply to model switches, too 😃
@ivanfioravanti commented on GitHub (Jan 18, 2024):
Any news on this one? Parallel requests can be a real game changer for Ollama
@ParisNeo commented on GitHub (Jan 30, 2024):
If you have enough VRAM to run multiple models, you can create multiple instances of Ollama with different port numbers, then use my proxy to manage access and route requests to each:
ParisNeo's ollama_proxy_server
@ehartford commented on GitHub (Jan 30, 2024):
I would really like to have an M2 Ultra 192GB on my company's intranet that can service the whole R&D department (a dozen people).
As long as I have enough RAM, I wish to be able to run multiple inference requests at the same time. Thank you for considering my wish!
@ivanfioravanti commented on GitHub (Jan 30, 2024):
same here, multiple 7B models served by an M2 Ultra. My dream! 🙏
@Adphi commented on GitHub (Jan 31, 2024):
At first glance, when I started examining the source code, I thought that the problem with concurrent requests came from the current implementation's use of global variables in its llama.cpp binding. After paying a little more attention to the original llama.cpp source code, I realized that the original implementation wasn't really geared towards multithreaded or server-side use, but rather towards a local development/experimentation use case.
So apart from setting up an API using queued workers (managed with gRPC, for example) based on a fork/exec model to get around the lack of batch processing on the llama.cpp side (which doesn't seem to get much attention from the developers; reference missing, but found in the GitHub project's conversations), which LocalAI does in a way, I don't see exactly how it could be implemented here.
@ParisNeo commented on GitHub (Jan 31, 2024):
That's why I had to build a proxy. You can install multiple servers on a single machine or on multiple machines, then use my proxy to service multiple users with multiple queues. For example, if you have 2 servers, you can service at most 2 clients simultaneously, and when both are busy, the current message is queued on the server with the least-full queue. In practice, since generation is very fast, you don't get too many collisions. We rarely get more than 1 person in the queue.
I also added security features as well as logging features. It works fine and serves all of us perfectly.
Also, some dude built a Docker version and I accepted his PR:
https://github.com/ParisNeo/ollama_proxy_server
If you are interested, you can try it; it is open source, so you can also read the code, get inspired, help enhance it, etc. It is Apache 2.0, so you can do whatever you want with it, 100% free.
We get around 300 tokens/s for each user, so it is not causing any significant delay.
@ehartford commented on GitHub (Jan 31, 2024):
llama.cpp can be run as multiple processes, multiple threads, or as a server.
but it's totally fair for the feature request to be deprioritized, of course.
@farhanhubble commented on GitHub (Feb 1, 2024):
The easiest way to multiplex Ollama, at least on a Linux system, should be with a reverse-proxy load balancer like HAProxy. Launch multiple instances of `ollama serve` on different ports and map them to a single port using HAProxy.
Note that this approach can sometimes deteriorate performance due to CPU contention. I have a decent system with 64 cores and 24GB of GPU RAM. When I run 3 instances of Ollama behind HAProxy to generate embeddings, it does speed up the process; however, if I try to generate text, the processing time is much worse than with a single instance.
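For anyone who wants to try this, here is a minimal sketch of that setup; the ports, instance count, and timeouts are illustrative, not from the original comment:

```sh
# Start three Ollama instances on separate ports (OLLAMA_HOST controls the bind address).
OLLAMA_HOST=127.0.0.1:11431 ollama serve &
OLLAMA_HOST=127.0.0.1:11432 ollama serve &
OLLAMA_HOST=127.0.0.1:11433 ollama serve &

# Minimal HAProxy config that round-robins the default Ollama port across them.
sudo tee /etc/haproxy/haproxy.cfg >/dev/null <<'EOF'
defaults
    mode http
    timeout connect 5s
    timeout client  300s
    timeout server  300s

frontend ollama_front
    bind *:11434
    default_backend ollama_back

backend ollama_back
    balance roundrobin
    server ollama1 127.0.0.1:11431
    server ollama2 127.0.0.1:11432
    server ollama3 127.0.0.1:11433
EOF

sudo systemctl restart haproxy
```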
@trymeouteh commented on GitHub (Mar 1, 2024):
Would like to see the ability for using the same LLM in two or more apps at the same time and would like to see the ability to use multiple LLMs in two or more apps at the same time
@ParisNeo commented on GitHub (Mar 1, 2024):
Well, technically, you can run multiple instances of the same model by running multiple instances of Ollama with different port numbers; configure them in the proxy config file, and then they can be accessed by multiple clients at once.
@ehartford commented on GitHub (Mar 1, 2024):
That's just wrong 😂
@ParisNeo commented on GitHub (Mar 1, 2024):
Ollama does not support batching, and unless they build something like vLLM, I don't see how you can do it.
@ehartford commented on GitHub (Mar 1, 2024):
manage a process/thread pool. Load balance. It's not rocket science.
@youssef02 commented on GitHub (Mar 1, 2024):
This is somewhat a scaling problem, I think. An Ollama model can only handle one request at a time; it can't serve multiple concurrent requests without spinning up new model instances, which take more resources. I'd really like someone to share their findings on handling many requests, e.g. in a university environment with many students.
How should we approach this problem?
@ehartford commented on GitHub (Mar 2, 2024):
The idea is that the system (a MacBook or Mac Studio with 128GB+ of unified RAM) can handle several dozen 4-bit 7B models at once.
So taking more resources is OK, no problem.
@ParisNeo commented on GitHub (Mar 2, 2024):
If you want it that way, then you can use my ollama_proxy project; it does exactly what you asked for.
If multiplying the resources doesn't bother you, then this is a simple solution. Each Ollama process runs on its own, and the proxy manages the routing, the queues, and everything else:
https://github.com/ParisNeo/ollama_proxy_server
I was really hoping for a better, more optimized way. Neural nets can do inference on a batch of inputs; you don't need to duplicate the model weights. That's what we mean by a real multi-generation system. You can perform fast inference that way, just as in vLLM.
@ehartford commented on GitHub (Mar 2, 2024):
I appreciate that you have provided a workaround.
Basically that's what ollama needs to implement internally.
I want ollama to be smart, and internally spawn and destroy llama.cpp threads/processes as needed to handle parallel requests without an external proxy.
That's what I'm asking for. It should "just work", Ollama-style.
@ffaubert commented on GitHub (Mar 18, 2024):
+1 for this. It would be a game changer. @bmizerany is this something in progress?
@triet0612 commented on GitHub (Mar 19, 2024):
I notice that new requests will wait for old requests to complete, even when we interrupt and stop the old requests while they're running or waiting to be executed. Has anyone had this issue? Thank you, and sorry for taking your time.
@ehartford commented on GitHub (Mar 19, 2024):
I've noticed something similar; sometimes I have to kill and restart Ollama because of it, but I think that deserves its own issue.
@jammsen commented on GitHub (Mar 20, 2024):
@jmorganca you wrote it's on the roadmap; could you explain when? Do you have a date?
Something being on a roadmap doesn't, in reality, mean it gets solved/done/programmed.
Could you help us/the community out, please? The request is from 6+ months ago.
@ehartford commented on GitHub (Mar 20, 2024):
This feature would allow Ollama to be a production backend. Low traffic for sure, but still, this would get a lot of startups running/hosting MVPs with minimal effort.
@ParisNeo commented on GitHub (Mar 24, 2024):
In the meantime you can try my ollama proxy :)
It can be configured to fire up multiple Ollama services; you can also define users with keys to access the service, and it balances the load between the Ollama instances with multiple queues.
@jammsen commented on GitHub (Mar 24, 2024):
@ParisNeo
Do you have a link for me, please? Also, how does that work from an architecture point of view? I'm guessing sticky sessions, so people using this won't get mixed up and end up with an English cooking recipe containing C++ snippets and German dictionary entries, right?
@m4r1k commented on GitHub (Mar 26, 2024):
While the `ollama_proxy_server` is a valuable starting point, Ollama's lack of native parallel request handling hinders its scalability. This is particularly evident in cloud environments where Ollama pods/instances could dynamically scale based on load. The current approach in `ollama_proxy_server` requires hardcoded backend configuration, complicating consistent and dynamic scaling.
@0x77dev commented on GitHub (Mar 27, 2024):
This issue seems to have been stale for a while. Ollama is a promising project with the potential to expand beyond just "running on laptops." I believe there are many projects that wish to use it in production despite facing challenges (pre-pulling a model on start, custom model registry, etc.), with parallel requests being a significant obstacle in my opinion.
Merged PR #3348 has a newer version of llama.cpp, where parallel decoding is supported for sure.
I will try to delve into the Ollama codebase to address this issue. It would be helpful if a maintainer or someone familiar with the `/llm` section of the codebase could identify any blockers, maybe potential blocking of parallel request processing on the Golang side.
Related:
@0x77dev commented on GitHub (Mar 28, 2024):
There has been some progress, but there are still issues to address. For instance, support for calling different models simultaneously (maybe that can be skipped, but it may cause DX confusion), and `loaded.mu.{Lock,Unlock}()` is not implemented correctly in my changes below.
This will require more testing and work on a better design for model loading
https://github.com/ollama/ollama/assets/46429701/00d84851-ca18-4e0a-a939-be367d49226d
git diff
@easp commented on GitHub (Mar 30, 2024):
Looks relevant: https://github.com/ollama/ollama/pull/3418
@darwinvelez58 commented on GitHub (Apr 9, 2024):
any update?
@0x77dev commented on GitHub (Apr 9, 2024):
@darwinvelez58 there is some work being done in https://github.com/ollama/ollama/pull/3418
and a9195b3efa was testing both options; at the moment it seems very unstable
@ParisNeo commented on GitHub (Apr 9, 2024):
Hi, sorry, I was very busy lately and had no time to look at the GitHub messages.
The proxy server can be found here:
https://github.com/ParisNeo/ollama_proxy_server
It also has optional security, allowing you to have a user base with a personal key for each user.
The architecture is simple:
You specify multiple Ollama instances, visible only on the server side, each with a different port number; the proxy then manages user authentication, access logging, and the distribution of users over multiple queues.
Each queue is behind one instance of Ollama, and the proxy will always put you in the least-filled queue.
So if you run 4 Ollama instances, you can simultaneously serve 4 users, and the others will be queued.
I agree that you could do more if Ollama managed all of this internally to share the weights and so on. I built my proxy because I needed the security and management features, and back then Ollama had no plans to do this, so I had to move; but I'll be happy if they integrate multi-user support directly.
@TheMasterFX commented on GitHub (Apr 10, 2024):
vLLM uses PagedAttention. Is this something that must be integrated in Ollama, or is it on the llama.cpp side?
@0x77dev commented on GitHub (Apr 10, 2024):
@TheMasterFX this is llama.cpp side of things
@TheMasterFX commented on GitHub (Apr 12, 2024):
Seems like there are improvements coming:
https://twitter.com/AlexReibman/status/1778695203957977148
@guitmonk-1290 commented on GitHub (Apr 21, 2024):
I was reading ray serve's docs and it seems they support Dynamic Request Batching, which creates a queue system.
https://docs.ray.io/en/latest/serve/advanced-guides/dyn-req-batch.html
I am new to this, so can we use this with Ollama for batch inference?
@jpmcb commented on GitHub (Apr 24, 2024):
I've had some success load balancing multiple Ollama instances on Kubernetes with a pool of GPU nodes:
Notice the 5 "gpupool" nodes, each with an Nvidia T4 GPU. Ollama is then deployed as a daemonset on each of the GPU nodes.
These can then be load balanced behind a Kubernetes Service, with the ClusterIP exposed internally "as a service" to the rest of the cluster.
Inside the cluster, I can hit the load-balancing service resolved by CoreDNS.
This works pretty well for slower requests but has 2 problems:
It'd be absolutely incredible if multiple requests could be serviced by a single `ollama serve`: it would take this example daemonset deployment with a pool of GPUs and make it much more scalable.
Thanks for all the amazing work on ollama - I'm happy to test anything out and report back if there's an `ollama/ollama:concurrency` test image or if there's code reviews needed ❤️
@skye0402 commented on GitHub (Apr 26, 2024):
I'm with vLLM now for that use case; it's very fast, manages concurrency, has a big model pool, and an OpenAI-compatible API.
I'm still with Ollama for trying out new models; that's where it really shines and saves time.
@shing100 commented on GitHub (Apr 29, 2024):
I really wish this feature was added.
@iakashpaul commented on GitHub (Apr 29, 2024):
For anyone else stumbling onto this thread:
To use a web UI with parallel chat generation across multiple clients, use TGI or server.cpp (or duplicate my HF Space onto a GPU instance directly) with the parallel flag set to 2 or more depending on your VRAM, then run the Ollama container with server.cpp acting as an OpenAI replacement as shown below.
cc: @ehartford
The latest release has experimental flags.
@mili-tan commented on GitHub (Apr 29, 2024):
https://github.com/ollama/ollama/releases/tag/v0.1.33-rc5
@hanzlaramey commented on GitHub (Apr 29, 2024):
game changer!!! Thank you @mili-tan
@ehartford commented on GitHub (Apr 30, 2024):
Thank you!
@ehartford commented on GitHub (Apr 30, 2024):
If I don't want to limit the number of loaded models, I just don't set that variable?
@BBjie commented on GitHub (Apr 30, 2024):
My question is: for the concurrency feature, can I edit my compose file and add them there? Or is there another way to pass the values in for:
`OLLAMA_NUM_PARALLEL=4 OLLAMA_MAX_LOADED_MODELS=4 ollama serve`
@jammsen commented on GitHub (Apr 30, 2024):
I think those are 2 different things; the env vars should be passed to `ollama serve` when you run the native app on your system. Not sure, though, how you can use this in a Docker image of Ollama. @jmorganca could you clear this up for us, please?
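For what it's worth, in Docker these are just container environment variables; a minimal compose sketch, with an illustrative service layout:

```sh
# Write a compose file that passes the concurrency env vars to the container.
cat > docker-compose.yml <<'EOF'
services:
  ollama:
    image: ollama/ollama
    ports:
      - "11434:11434"
    volumes:
      - ollama:/root/.ollama
    environment:
      - OLLAMA_NUM_PARALLEL=4
      - OLLAMA_MAX_LOADED_MODELS=4
volumes:
  ollama:
EOF

docker compose up -d
```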
@BBjie commented on GitHub (Apr 30, 2024):
I tested it one hour ago, it was not working...
@laktosterror commented on GitHub (Apr 30, 2024):
Sorry, you might need to get the pre-release image. Corrected my previous answer.
@laktosterror commented on GitHub (Apr 30, 2024):
On mobile. I seem to have messed up my original post.
Try this:
@BBjie commented on GitHub (Apr 30, 2024):
Thanks, but I believe it's still not working as expected...
@dhiltgen commented on GitHub (May 2, 2024):
I'm going to close this as fixed now in 0.1.33. As commenters above have pointed out, it's opt-in for now, but we do intend to eventually do concurrency automatically without requiring the env vars.
To clarify some questions above: there are 2 layers of concurrency. There's `OLLAMA_NUM_PARALLEL`, which controls how many requests can be answered against a single loaded model, and there's `OLLAMA_MAX_LOADED_MODELS`, which controls how many models can be loaded at the same time, up to the limits of VRAM. (Note: we do not support loading two or more copies of the same model.)
https://github.com/ollama/ollama/releases
@taozhiyuai commented on GitHub (May 3, 2024):
@dhiltgen
I installed Ollama on a Mac.
Are these the default values for these two env vars: `OLLAMA_NUM_PARALLEL=4 OLLAMA_MAX_LOADED_MODELS=4`?
The following should work, right? :)
```
launchctl setenv OLLAMA_NUM_PARALLEL 5
launchctl setenv OLLAMA_MAX_LOADED_MODELS 5
```
To check the values of the env vars :)
```
launchctl getenv OLLAMA_NUM_PARALLEL
launchctl getenv OLLAMA_MAX_LOADED_MODELS
```
Restart Ollama to activate the new values. :) Very nice!
@dhiltgen commented on GitHub (May 4, 2024):
The default values for these settings in 0.1.33 retain existing behavior of only 1 request at a time, and only 1 model at a time. In a future version we plan to adjust the default to enable this automatically, but until then, yes, if you want to use concurrency, you'll have to set these environment variables on the server.
@murillojndem commented on GitHub (May 21, 2024):
How do I set 2 models to run on different ports? I want to use crewAI but I need the ports of each one to use them as agents.
@ITHealer commented on GitHub (May 22, 2024):
How do I set the env var to run parallel requests on Linux?
I had set it up with:
```
sudo nano ~/.bashrc   # added: export OLLAMA_NUM_PARALLEL=4
source ~/.bashrc
```
But it doesn't work.
Please help me!!!
@Miroun commented on GitHub (May 22, 2024):
How does it work? I mean, what prevents me from doing `OLLAMA_NUM_PARALLEL=999999`?
How can I smartly choose this number if I want to use my GPU(s) at full capacity?
@dhiltgen commented on GitHub (May 23, 2024):
@murillojndem the Ollama server only binds to a single port, so in your use case you'll want to run multiple servers on different ports, using the OLLAMA_HOST environment variable to control which port each binds to. Each Ollama instance won't know about the others, so if you only have a single GPU, you may run into OOM problems if you don't sequence model loading to ensure they're not racing, or control the number of layers they load by explicitly setting `num_gpu`. If you do have multiple GPUs, you can use the GPU-vendor-specific GPU selection variable to dedicate specific GPUs to specific instances: https://github.com/ollama/ollama/blob/main/docs/gpu.md
@ITHealer this variable must be set on the server, not the client. See https://github.com/ollama/ollama/blob/main/docs/faq.md#setting-environment-variables-on-linux
@Miroun our goal is to make this setting automatic in a future release. Until then, it's your responsibility to test your setup to ensure it meets your needs, balancing the model size, the context size, and your GPUs' VRAM capacity. The memory allocation on the GPU is impacted by the parallel setting multiplied by the context size, so a large parallel count with a large context will result in significant VRAM usage. You can experiment and use the new `ollama ps` to see where the optimal configuration is for your use case.
@murillojndem commented on GitHub (May 24, 2024):
But how do I set the environment variable? Do I set 2 variables, one for each port? It doesn't say in the docs. And they'll be running at the same time, but I'll only use one at a time.
@dhiltgen commented on GitHub (May 25, 2024):
@Miroun I'd suggest the easiest path would be to use 2 different containers, but if you want to run on the host, you'll need to set up 2 different systemd services that have the settings you want.
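To make the multi-server option concrete, here is a sketch of two host-side instances on different ports; the ports, model name, and GPU pinning are illustrative, and CUDA_VISIBLE_DEVICES assumes NVIDIA GPUs:

```sh
# Instance 1 on the default port, pinned to GPU 0.
CUDA_VISIBLE_DEVICES=0 OLLAMA_HOST=127.0.0.1:11434 ollama serve &

# Instance 2 on a second port, pinned to GPU 1.
CUDA_VISIBLE_DEVICES=1 OLLAMA_HOST=127.0.0.1:11435 ollama serve &

# Clients pick an instance with the same variable.
OLLAMA_HOST=127.0.0.1:11435 ollama run llama3 "hello"
```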
@goreng2 commented on GitHub (Jun 10, 2024):
@dhiltgen
How does it handle multiple requests?
I want to know the principle.
I've looked at the change log, but I'm not sure.
@EthraZa commented on GitHub (Jun 19, 2024):
It took me some minutes to figure it out, but to make things clear for the next person searching for it:
1. You can run multiple Docker containers on different ports and probably have clear control over the resources for each one.
or
2. You can run multiple LLMs on one host:
`$ sudo vi /etc/systemd/system/ollama.service`
Add the env vars to the [Service] section:
Restart the services:
Thank you Ollama guys.
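The snippets from that comment weren't preserved in this mirror; a sketch of what the systemd route typically looks like (the variable values are illustrative):

```sh
# Add the env vars to the [Service] section via a drop-in override:
sudo systemctl edit ollama.service
#   [Service]
#   Environment="OLLAMA_NUM_PARALLEL=4"
#   Environment="OLLAMA_MAX_LOADED_MODELS=2"

# Reload units and restart so the new environment takes effect:
sudo systemctl daemon-reload
sudo systemctl restart ollama
```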
@dhiltgen commented on GitHub (Jun 19, 2024):
@goreng2 while concurrency is in experimental preview, you can enable it with 2 settings
`OLLAMA_NUM_PARALLEL` sets the number of concurrent requests for each individual loaded model.
`OLLAMA_MAX_LOADED_MODELS` sets the number of models that can be loaded at the same time, provided there is sufficient VRAM to fully load the models. (Partial loads always max out at 1 loaded model.)
So for example, if you set OLLAMA_NUM_PARALLEL=4 and OLLAMA_MAX_LOADED_MODELS=2, then you could load 2 models, and each model could handle 4 requests in parallel, so a total of 8 concurrent requests can be processed (assuming you have enough VRAM to load the 2 models with 4x the context size). If more requests are sent, they'll queue in FIFO order, up to OLLAMA_MAX_QUEUE requests (default 512).
When concurrency is out of experimental status in a future release, if you don't specify these settings, we'll pick reasonable defaults based on the capabilities of your system.
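As a concrete example of the preview-era invocation described above:

```sh
# 2 loaded models x 4 parallel requests each = up to 8 concurrent requests,
# VRAM permitting; further requests queue FIFO up to OLLAMA_MAX_QUEUE.
OLLAMA_NUM_PARALLEL=4 OLLAMA_MAX_LOADED_MODELS=2 ollama serve
```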
@edenbuaa commented on GitHub (Jun 26, 2024):
good idea. @edenbuaa
@vincent-herlemont commented on GitHub (Jul 15, 2024):
@dhiltgen Is there a way for an app that uses the Ollama API to retrieve the values of OLLAMA_NUM_PARALLEL and OLLAMA_MAX_LOADED_MODELS, in order to know how many parallel requests Ollama can handle?
@dhiltgen commented on GitHub (Jul 18, 2024):
@vincent-herlemont this isn't exposed via API today.
@Omega-Centauri-21 commented on GitHub (Jul 28, 2024):
For everyone trying this on Windows, first make sure your Ollama is updated to v0.3.0+.
Then set the variables in your command prompt:
And now restart the server.
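The command snippet wasn't preserved in this mirror; on Windows the variables are usually persisted along these lines (a sketch; the values are illustrative):

```
:: Persist the settings for the current user (new processes pick them up):
setx OLLAMA_NUM_PARALLEL 4
setx OLLAMA_MAX_LOADED_MODELS 2
:: Quit and restart Ollama so the server sees the new values.
```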
@dhiltgen commented on GitHub (Jul 29, 2024):
Concurrency is no longer experimental as of 0.2.0 and newer. I would recommend leaving these 2 variables unset so the default algorithm is used. If you want more (or less) parallelism per model than 4 (the default) then you can adjust OLLAMA_NUM_PARALLEL. The default max loaded models is 3x GPU count. In our performance testing, we found >3 often leads to degraded performance if all the models are being accessed concurrently, so make sure to run perf tests before settling on an alternative max models setting.
@Omega-Centauri-21 commented on GitHub (Jul 29, 2024):
So, could you please provide the detailed implementation of parallelism on Windows that can be utilized with LangServe and API calls?
@reinhardt-bit commented on GitHub (Oct 11, 2024):
If you can't get it right, then use LiteLLM.
@Omega-Centauri-21 commented on GitHub (Oct 12, 2024):
Actually, I finally got it!
I was facing some issues with setting the variables through the terminal, so I set them via Edit Environment Variables in System Properties on Windows and created a new variable for my user.
As of today, with the latest release, the parallelism feature has been delivered, and this issue can be closed for good. Anybody still facing any issues can ping me for help.
My heart goes out to the community for the help and support.
Thank you.
@Vimiso commented on GitHub (Oct 13, 2024):
Can I request more information on sensible `OLLAMA_NUM_PARALLEL` settings? For example, let's say I have one loaded model in memory. How many parallel requests can I throw at it, or rather, what system details must I know to arrive at an optimal number? I assume increasing `OLLAMA_MAX_LOADED_MODELS` (given enough VRAM) would increase throughput. Also, is there a queue built in, or will HTTP requests time out while Ollama is busy?
@dhiltgen commented on GitHub (Oct 14, 2024):
@Vimiso the current algorithm is described here. It depends on how large the model is and how much VRAM (or system memory) you have: we try to load the model to support 4 concurrent requests, but if there isn't sufficient [V]RAM, we fall back to supporting a single request at a time. We don't currently have an API that exposes the final settings, but you can check the server log to see how the model was loaded.
@phazei commented on GitHub (Nov 12, 2024):
So when using a model in parallel, it uses batching, right? So the two requests HAVE to start at the same time? Is there a setting for how long it waits for the request pool to fill before starting the batch? Suppose I have 2 agents and they both request at roughly the same time, but it takes 1 second to process one request before the next arrives. I'd like to use batching, but it would be ideal if it would wait at least 1 second, to be sure I can get both requests started together instead of having one wait.
@jessegross commented on GitHub (Nov 12, 2024):
There's no waiting time for batches, processing will start as soon as something is available with everything that is ready at that time. Generally, it's not helpful to wait because the savings from batching are overwhelmed by the lost cycles while things are sitting idle.
If requests are coming in continuously then new requests will be queued while the previous batch is running. Once that batch finishes, all of the queued requests will form the next batch.
@phazei commented on GitHub (Nov 13, 2024):
If lots of requests are coming in, then I'd agree it's not helpful to wait, since things will double up often enough. But on a local machine with a not-so-fast card and only two agents, it seems it would always end up being sequential and not taking advantage of batching, because the next batch starts the moment the previous output is complete, which leaves no time for the response to be processed before being turned around into another request. If there were a configurable window of 300-1000 ms, it could save a great deal of time on a GPU-poor system.
There's also a thing I recently heard of called continuous batching, which can start any request right away. I don't know if Ollama supports that, but if it's available, then a waiting pool wouldn't be necessary.
I was thinking it might be possible to create my own waiting pool, but for that to work there'd need to be a way to poll the server to see whether it's busy, and once it's no longer busy, wait x seconds before passing through requests. I think there are 2 issues with that, though. First, I looked at the API and didn't see anything that would work for it; I can't even find an output log showing what requests are hitting the API. And second, you said the requests already have to be waiting, so if they're sent x seconds after the server becomes available, there will inevitably be a race condition where only one is chosen, since there's no pool.
@easp commented on GitHub (Nov 13, 2024):
@phazei I'm pretty sure Ollama uses continuous batching. If I prompt the same model from two different terminal windows in parallel, I get a stream of results in each window simultaneously and the total # of tokens generated per second across the two is higher than it is when generating a single result set. This is what you'd expect from continuous batching.
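A quick way to reproduce that observation (assumes a server on the default port; the model name is illustrative):

```sh
# Fire two generate requests at the same model concurrently; with parallelism
# enabled, both streams make progress at the same time.
curl -s http://localhost:11434/api/generate \
  -d '{"model": "llama3", "prompt": "Write a haiku about GPUs."}' &
curl -s http://localhost:11434/api/generate \
  -d '{"model": "llama3", "prompt": "Explain continuous batching briefly."}' &
wait
```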
@jessegross commented on GitHub (Nov 14, 2024):
Yes, I should have been clearer in what I wrote previously. This is more specifically what will happen:
@Ivanrs297 commented on GitHub (Jan 28, 2025):
Almost a year later, any news on this request?
@hackthedev commented on GitHub (Jan 28, 2025):
I tried to set the env variables on Windows and they're being ignored :|
@jammsen commented on GitHub (Jan 29, 2025):
@Ivanrs297
You've been able to use it for a long time already; did you try reading the comment one person above yours before posting?
https://github.com/ollama/ollama/issues/358#issuecomment-2477448229
@alirezaghey commented on GitHub (Mar 4, 2025):
Hey folks, I went through this thread and also ran some tests on my servers, and I'm still not sure if the following scenario is possible:
Is there a way to load the same model on multiple GPUs and have each of them handle some of the requests, given the right configuration and settings?
The next step that comes to mind is to deploy different Ollama instances on different ports and handle load balancing in the application.
Any feedback is appreciated. Thank you!
@dhiltgen commented on GitHub (Mar 5, 2025):
@alirezaghey tracked in feature request #3902