[GH-ISSUE #761] Processing inference in parallel #360

Closed
opened 2026-04-12 10:00:09 -05:00 by GiteaMirror · 9 comments

Originally created by @SabareeshGC on GitHub (Oct 11, 2023).
Original GitHub issue: https://github.com/ollama/ollama/issues/761

I was using the HTTP endpoint, but it appears to be limited to processing one request at a time. Is it possible to process multiple inference requests at the same time?
ref https://github.com/ggerganov/llama.cpp/pull/3228
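
One way to observe the single-request behavior described above is to fire two requests at the HTTP API at once. A minimal sketch, assuming a local server on the default port 11434 and an already-pulled model (the model name below is only a placeholder):

```sh
# Send two generate requests concurrently to a local Ollama server.
# Assumes the default port and a locally pulled model; replace "llama2"
# with whatever model you actually have.
curl -s http://localhost:11434/api/generate \
  -d '{"model": "llama2", "prompt": "Say hello", "stream": false}' &
curl -s http://localhost:11434/api/generate \
  -d '{"model": "llama2", "prompt": "Say goodbye", "stream": false}' &
wait
# If requests are serialized, the second response only arrives after the first
# one completes; with parallel processing both return at roughly the same time.
```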

GiteaMirror added the feature request label 2026-04-12 10:00:09 -05:00

@morandalex commented on GitHub (Oct 12, 2023):

Good point! How can we process multiple requests in parallel?


@UmutAlihan commented on GitHub (Oct 25, 2023):

I am also looking forward to this feature. It would be perfect if "ollama serve" could deploy the model so that multiple users can run inference in parallel.

As of now, on each request from a different client, ollama loads the model again even though it is the same model name.


@morandalex commented on GitHub (Nov 7, 2023):

Any news on that? This feature would be game-changing.


@UmutAlihan commented on GitHub (Nov 7, 2023):

It is solved in newer ollama versions. Currently our Ollama deployment gracefully balances the inference load across requests from different sources. You can check the discourse for up-to-date news.


@UmutAlihan commented on GitHub (Nov 7, 2023):

> It is solved in newer ollama versions. Currently our Ollama deployment gracefully balances the inference load across requests from different sources. You can check the discourse for up-to-date news.

Even though it is not parallel, there is a kind of queue mechanism and it saves the day :)


@gningue commented on GitHub (Dec 19, 2023):

Hi, any news?


@jmorganca commented on GitHub (Dec 22, 2023):

Hi all! Thanks for the issues and comments. Going to merge this with https://github.com/jmorganca/ollama/issues/358


@ITHealer commented on GitHub (May 22, 2024):

How do I set environment variables to run parallel requests on Linux? I set it up like this:

sudo nano ~/.bashrc
export OLLAMA_NUM_PARALLEL=4
source ~/.bashrc

But it doesn't work. Please help me!
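
A likely reason the above has no effect: ~/.bashrc is only sourced by interactive shells, so a systemd-managed ollama service never sees variables exported there. One way to check which environment the running service actually has, assuming the standard systemd install:

```sh
# Show the Environment= settings systemd applies to the ollama service
systemctl show ollama --property=Environment

# Or inspect the live process environment directly (requires root)
sudo cat /proc/$(pgrep -x ollama | head -n1)/environ | tr '\0' '\n' | grep OLLAMA
```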


@swlee9087 commented on GitHub (Jun 7, 2024):

> How do I set environment variables to run parallel requests on Linux? I set it up like this:
>
> sudo nano ~/.bashrc
> export OLLAMA_NUM_PARALLEL=4
> source ~/.bashrc
>
> But it doesn't work. Please help me!

I had the exact same problem. I solved it this way, hope it helps:

### Open the service file

sudo nano /etc/systemd/system/ollama.service

### Paste the service content below and save the file

#### DO NOT JUST ADD THE REVISED CONTENT TO THE ORIGINAL SERVICE FILE

#### Start off with OLLAMA_NUM_PARALLEL=2, then gradually increase it if your GPU can take it

[Unit]
Description=Ollama Service
After=network-online.target

[Service]
# Adjust ExecStart to wherever your Linux ollama binary is installed
ExecStart=/usr/local/bin/ollama serve
User=ollama
Group=ollama
# I find that this restart option doesn't always work
Restart=always
RestartSec=3
# Add an Environment="PATH=..." line here if needed (check your paths on Linux)
Environment="OLLAMA_HOST=0.0.0.0:11434"
Environment="OLLAMA_NUM_PARALLEL=2"
# I think OLLAMA_MAX_LOADED_MODELS is optional, but I'm always juggling two models at the same time, so it works for me
Environment="OLLAMA_MAX_LOADED_MODELS=2"

[Install]
WantedBy=default.target

### Reload the systemd configuration

sudo systemctl daemon-reload

### Enable the Ollama service

sudo systemctl enable ollama

(Again, I'm unsure about "enable", but it's definitely not "start"/"restart" if you want to monitor ollama directly.)
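
A lighter-weight alternative to replacing the whole unit file is a systemd drop-in override, which leaves the installed unit untouched. A sketch of that approach, assuming the standard systemd-managed install, followed by a restart so the running server actually picks up the new environment:

```sh
# Create an override file at /etc/systemd/system/ollama.service.d/override.conf
sudo systemctl edit ollama
# In the editor, add:
#   [Service]
#   Environment="OLLAMA_NUM_PARALLEL=2"
#   Environment="OLLAMA_MAX_LOADED_MODELS=2"

# Reload unit definitions and restart the server so the change takes effect
# ("enable" only controls start-at-boot; it does not restart a running server)
sudo systemctl daemon-reload
sudo systemctl restart ollama
```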

Reference: github-starred/ollama#360