[GH-ISSUE #9054] Ollama Does Not Utilize Multiple Instances of the Same Model for Parallel Processing #67949

Closed
opened 2026-05-04 12:06:24 -05:00 by GiteaMirror · 20 comments

Originally created by @BennisonDevadoss on GitHub (Feb 12, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/9054

I have a server with two Nvidia L4 GPUs, and I’m running the LLaMA 3.1 8B model using Ollama. Here’s the current behavior:

  1. What Works:

    • If a model is large enough to require both GPUs, Ollama successfully splits the workload and utilizes both GPUs for the same instance.
    • If different models are loaded, Ollama utilizes the available VRAM efficiently and runs them concurrently across the GPUs.
  2. What Does Not Work:

    • When multiple users send concurrent requests, Ollama doesn’t load multiple instances of the same model on available VRAM to handle parallel requests. I’ve set the `OLLAMA_NUM_PARALLEL` parameter to 3, but it doesn’t seem to have any effect.

What I Want to Achieve:

  • I’d like Ollama to load multiple instances of the same model on different GPUs or available VRAM to handle parallel user requests efficiently.

Questions:

  • Why doesn’t Ollama load multiple instances of the same model for parallel processing?
  • Is there any configuration or workaround to achieve this behavior?

Any help or insights would be greatly appreciated!

@rick-github commented on GitHub (Feb 12, 2025):

Ollama doesn't load a model multiple times. What `OLLAMA_NUM_PARALLEL` does is create a context buffer for each parallel request; the model weights are shared across those buffers. If your clients send 3 simultaneous requests, they are processed concurrently. If more simultaneous requests arrive than there are completion slots, they are queued until one of the ongoing completions finishes. See the [concurrency section of the FAQ](https://github.com/ollama/ollama/blob/main/docs/faq.md#how-does-ollama-handle-concurrent-requests) for more info on concurrent request processing.
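
A quick way to see the slots in action — a minimal sketch, assuming a local server on the default port and using `llama3.1:8b` as a stand-in model name:

```sh
# Start the server with 3 completion slots per loaded model.
OLLAMA_NUM_PARALLEL=3 ollama serve &

# Send 3 requests at once; all three are decoded concurrently.
# A 4th simultaneous request would be queued until a slot frees up.
for i in 1 2 3; do
  curl -s http://localhost:11434/api/generate \
       -d '{"model":"llama3.1:8b","prompt":"count to ten","stream":false}' &
done
wait
```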

@pdevine commented on GitHub (Feb 12, 2025):

@BennisonDevadoss as @rick-github mentioned, you don't need to waste the VRAM loading the same model into memory multiple times. I'm going to go ahead and close the issue as answered, but feel free to keep commenting.

@BilibalaX commented on GitHub (Apr 29, 2025):

I have a similar question. With four GPUs (40GB VRAM), the model size (20GB) is OK to load on each GPU. My task is to input 100,000 messages for information extraction. If I set OLLAMA_NUM_PARALLEL to 4, does that mean I have tasks running on each of the GPUs separately, and can I get 4 times the speed?

Then there is the other parameter, OLLAMA_MAX_LOADED_MODELS, the maximum number of models that can be loaded. If I set it to 4, will Ollama regard them as 4 instances?

@rick-github commented on GitHub (Apr 29, 2025):

No. If you set `OLLAMA_NUM_PARALLEL=4`, ollama will process up to 4 requests at a time, but which GPU they run on is not deterministic. If you really must have each GPU processing a different request simultaneously, you need to run 4 ollama servers, assign each a different GPU with `CUDA_VISIBLE_DEVICES`, and use a load-balancing proxy in front to distribute the queries.

Note ollama gets better utilization when a single GPU does concurrent requests:

![Throughput vs. number of parallel requests](https://github.com/user-attachments/assets/1c6c3b30-93ea-46c7-95fc-681d83776267)

So you could also set OLLAMA_NUM_PARALLEL for each of these servers and increase throughput. For example:

```yaml
x-ollama: &ollama
  image: ollama/ollama:${OLLAMA_DOCKER_TAG-latest}
  volumes:
    - ${OLLAMA_MODELS-./ollama}:/root/.ollama
  environment: &env
    OLLAMA_KEEP_ALIVE: ${OLLAMA_KEEP_ALIVE--1}
    OLLAMA_LOAD_TIMEOUT: ${OLLAMA_LOAD_TIMEOUT-5m}
    OLLAMA_MAX_LOADED_MODELS: 1
    OLLAMA_NUM_PARALLEL: 2
    OLLAMA_FLASH_ATTENTION: ${OLLAMA_FLASH_ATTENTION-0}
    OLLAMA_DEBUG: ${OLLAMA_DEBUG-1}
  deploy:
    resources:
      reservations:
        devices:
          - driver: nvidia
            count: 1
            capabilities: [gpu]

services:
  ollama-1:
    << : *ollama
    environment:
      << : *env
      CUDA_VISIBLE_DEVICES: 0

  ollama-2:
    << : *ollama
    environment:
      << : *env
      CUDA_VISIBLE_DEVICES: 1

  ollama-3:
    << : *ollama
    environment:
      << : *env
      CUDA_VISIBLE_DEVICES: 2

  ollama-4:
    << : *ollama
    environment:
      << : *env
      CUDA_VISIBLE_DEVICES: 3

  ollama:
    image: nginx-lb
    build:
      dockerfile_inline: |
        FROM nginx:latest
        RUN cat > /etc/nginx/conf.d/default.conf <<EOF
        upstream ollama_group {
          least_conn;
          server ollama-1:11434 max_conns=2;
          server ollama-2:11434 max_conns=2;
          server ollama-3:11434 max_conns=2;
          server ollama-4:11434 max_conns=2;
        }
        server {
          listen 11434;
          server_name localhost;
          location / {
            proxy_pass http://ollama_group;
          }
        }
        EOF
    ports:
      - 11434:11434
```

@BilibalaX commented on GitHub (Apr 30, 2025):

@rick-github Thanks for your detailed explanation. I finally managed to run multiple Ollama instances simultaneously on our HPC system, where I have to use different containers and assign different ports. I’m not a software engineer; I’m using Ollama and LLMs as tools for social-science research, so please forgive any naïve questions.

Regarding the graph you posted, the speed improvement from setting OLLAMA_NUM_PARALLEL is remarkable. My workflow involves feeding many independent texts into the model for information extraction. If I increase OLLAMA_NUM_PARALLEL, can I expect higher overall throughput?

[The FAQ](https://github.com/ollama/ollama/blob/main/docs/faq.md#how-does-ollama-handle-concurrent-requests) mentions that “the default will auto-select either 4 or 1 based on available memory.” Under what circumstances will it choose 4 versus 1? And is there a recommended value for OLLAMA_NUM_PARALLEL given typical hardware capabilities?

Many thanks for your help.

@rick-github commented on GitHub (Apr 30, 2025):

Yes, the overall throughput will increase, up to a point. You can see how the graph starts to flatten: as more and more of the processing units in the GPU are engaged, there will eventually be a bottleneck after which throughput will not increase (and may go down). Note that while the aggregate TPS goes up, the TPS per request goes down, which is a drawback if the clients are expecting real-time interaction. In your case of information extraction I assume that's not an issue.

The selection algorithm is: will 4x context buffer fit in VRAM without spilling layers to system RAM? If yes, `OLLAMA_NUM_PARALLEL=4`, otherwise 1. In your case, where you have 40G VRAM and a 20G model, there's no downside to increasing `OLLAMA_NUM_PARALLEL` until all of the VRAM on a GPU is allocated. You don't indicate what context size you are using, so what you are aiming for is (40 - 20) > (context_size * parallel * k), where k is a constant for how much VRAM a single token takes. You can adjust either context size (`OLLAMA_CONTEXT_LENGTH`) or parallel (`OLLAMA_NUM_PARALLEL`) to maximize the size of the allocated context buffer to the point where most of the VRAM on a GPU has been allocated (you can check with `nvidia-smi`).
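
For a back-of-the-envelope check of that inequality, something like the sketch below works. Every number in it is an assumption to substitute with your own measurements; the per-token KV-cache cost in particular varies a lot by model and quantization (you can estimate it by comparing `nvidia-smi` memory use at two different context lengths).

```sh
#!/bin/bash
# Rough feasibility check for: (VRAM - model size) > context_size * parallel * k
# All values are illustrative assumptions, not measurements.
vram_gb=40            # per-GPU VRAM
model_gb=20           # model weights
context_size=8192     # OLLAMA_CONTEXT_LENGTH per slot
parallel=4            # OLLAMA_NUM_PARALLEL
k_mb_per_token=0.125  # assumed KV-cache cost per token in MB (ballpark for an 8B llama-class model at fp16)

free_mb=$(( (vram_gb - model_gb) * 1024 ))
need_mb=$(echo "$context_size * $parallel * $k_mb_per_token" | bc)
echo "free VRAM: ${free_mb} MB, KV cache needed: ${need_mb} MB"
```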

Also note that because of the way throughput scales with `OLLAMA_NUM_PARALLEL`, it might be simpler to just use a single ollama server, give it access to all GPU devices, and set `OLLAMA_NUM_PARALLEL` to 4x what you would use in the four-server case. Since the model is loaded only once, you get an extra 60G of space for context buffers. The downside is that there is an inherent bottleneck in this arrangement from the PCI bus (see [this comment](https://github.com/ollama/ollama/issues/7648#issuecomment-2473561990)), so performance will not be as good as 4 individual servers. But the simpler configuration (no nginx) might be worth it.
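
A sketch of that simpler single-server alternative, with illustrative values (4 GPUs visible, and 4x the per-server parallelism from the compose file above — tune to your own workload):

```sh
# One ollama server that can see all four GPUs; no nginx needed.
# OLLAMA_NUM_PARALLEL=8 mirrors "4 servers x OLLAMA_NUM_PARALLEL=2" above.
CUDA_VISIBLE_DEVICES=0,1,2,3 \
OLLAMA_KEEP_ALIVE=-1 \
OLLAMA_NUM_PARALLEL=8 \
ollama serve
```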

And a caveat: the scaling of throughput depends on model, configuration, hardware and workload, so some experimentation may be required to get optimal results.

@aubourg commented on GitHub (May 2, 2025):

I have a similar issue; here is my use case. I have a machine with 8 GPUs. I want to run embarrassingly parallel tasks, but with only two classes of models. I see only 2 GPUs are used to process 8 parallel and independent tasks; the others are idle. It looks like, since I am using only two different models, ollama uses 2 GPUs. It would clearly be more efficient to load 4 instances of each model and run on all 8 GPUs. If I use artificially different models, then it indeed runs faster, using more of my GPUs...

@rick-github commented on GitHub (May 2, 2025):

As described above, increase OLLAMA_NUM_PARALLEL, run multiple servers, or both.

@aubourg commented on GitHub (May 2, 2025):

From what I understand, OLLAMA_NUM_PARALLEL will not increase the number of GPUs that are used, so I should run multiple servers. Or perhaps clone the models under different names?

@rick-github commented on GitHub (May 2, 2025):

Models are identified by the sha256 of the model weights, so cloning a model will not allow the same model to be run more than once in a server. If you increase `OLLAMA_NUM_PARALLEL` to the point where the required KV cache no longer fits on a single GPU, ollama will distribute the model and cache across all available GPUs. You can achieve the same effect by setting [`OLLAMA_SCHED_SPREAD=1`](https://github.com/ollama/ollama/blob/a6ef73f4f26a22cc605516113625a404bd064250/envconfig/config.go#L256) in the server environment.
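
For example, a minimal sketch of forcing the spread on a multi-GPU host (the parallel value is just a placeholder to tune):

```sh
# Spread the loaded model and its KV cache across all visible GPUs
# instead of packing it onto the fewest GPUs that can hold it.
OLLAMA_SCHED_SPREAD=1 OLLAMA_NUM_PARALLEL=8 ollama serve &

# Watch per-GPU memory use while sending requests.
nvidia-smi --query-gpu=index,memory.used --format=csv -l 1
```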

@LukasBreit commented on GitHub (May 9, 2025):

I am stuck on this topic. I am quite new to Ollama, so please excuse any misunderstandings.
I am working on an M3 Pro and have a Langchain app where I am using ChatOllama with the model "granite3.2:8b". The same model is invoked several times, which I want to run in parallel.

I have tried serving multiple servers on different ports with commands like `OLLAMA_HOST=localhost:11436 OLLAMA_NUM_PARALLEL=2 ollama serve`, which led to a small performance improvement.

When I serve just a single ollama instance, but with different values for OLLAMA_NUM_PARALLEL, I can't detect any difference in performance. I have attached the output generated when starting the server with `OLLAMA_HOST=localhost OLLAMA_PORT=11434 OLLAMA_NUM_PARALLEL=1 OLLAMA_CONTEXT_LENGTH=32000 ollama serve` - honestly, I don't understand the output.

[ollama serve_output.txt](https://github.com/user-attachments/files/20122769/ollama.serve_output.txt)

Attached is also the code I am using for calling the model. As the context exceeds the default context length, I have set the num_ctx parameter in ChatOllama.

[code.txt](https://github.com/user-attachments/files/20122868/code.txt)

Any feedback would be great. Is it possible to achieve parallel processing on my M3 with this large context size, either by setting the OLLAMA_NUM_PARALLEL variable or by running multiple ollama servers? If more information is needed, just let me know!

@rick-github commented on GitHub (May 12, 2025):

```
[GIN] 2025/05/09 - 15:14:30 | 200 |  6.674368042s |       127.0.0.1 | POST     "/api/chat"
[GIN] 2025/05/09 - 15:14:37 | 200 |  6.598018125s |       127.0.0.1 | POST     "/api/chat"
[GIN] 2025/05/09 - 15:14:41 | 200 |     4.321513s |       127.0.0.1 | POST     "/api/chat"
```

From the log we can see that queries are processed sequentially, because `OLLAMA_NUM_PARALLEL=1`. If you increase this, ollama will do concurrent processing, up to the number you specify. I'm not familiar with the GPU in an M3 Pro (the graph above is for Nvidia hardware), but Apple describes a '16-core Neural Engine', so I imagine you should see some degree of parallelism.

You can use the script below to see if parallelism improves generation speed. I'm not sure how close macOS is to a Linux environment, so you may have to install some command-line utilities. Note that the overall TPS from the script includes prompt processing and `curl` overhead, so it is a little inaccurate, but you should see the rate go up as the amount of parallelism is increased.

```sh
#!/bin/bash

export OLLAMA_HOST=${OLLAMA_HOST-localhost:11434}
export MODEL=${MODEL-granite3.2:8b}

need() {
  _=$(command -v $1) || { echo "Need $1" ; exit 1 ; }
}

need curl
need jq
need dc
need parallel
need date
need seq

parallel=$1
[ -z "$parallel" ] && parallel=2

# load the model
curl -s $OLLAMA_HOST/api/generate -d '{"model":"'$MODEL'"}' >&-

t0=$(date +%s.%N)
res=$(
  for i in $(seq $[parallel * 4]) ; do
    echo '{"model":"'$MODEL'","prompt":"'x=$RANDOM'\ncount to ten","stream":false}'
  done | parallel --jobs $parallel curl -s $OLLAMA_HOST/api/generate -d "{}"
)
t1=$(date +%s.%N)
total_tokens=0
total_elapsed=0
evals=()
while read count duration ; do
  total_tokens=$[total_tokens + count]
  total_elapsed=$[total_elapsed + duration]
  evals+=( $(dc <<< "9k $count $duration 1000000000//pq") )
done <<< "$(echo "$res" | jq -r '"\(.eval_count) \(.eval_duration)"')"

echo "Overall TPS: " $(dc <<< "2k $total_tokens $t1 $t0 - /pq")
echo "Average per-completion TPS $(dc <<< "2k 0 ${evals[*]/%/+} ${#evals[*]} /p")"
$ ./9054.sh 2
Overall TPS:  110.86
Average per-completion TPS 68.12
$ ./9054.sh 3
Overall TPS:  132.55
Average per-completion TPS 65.71
$ ./9054.sh 4
Overall TPS:  155.41
Average per-completion TPS 60.99
```

@LukasBreit commented on GitHub (May 12, 2025):

Thanks for your feedback!

With a few small adjustments I was able to run the script, and I can see the values change in the output:
```
bash ollama_parallel_test.sh 2
Overall TPS: 56.13
Average per-completion TPS: 35.77
bash ollama_parallel_test.sh 3
Overall TPS: 62.42
Average per-completion TPS: 27.78
bash ollama_parallel_test.sh 4
Overall TPS: 66.41
Average per-completion TPS: 20.40
bash ollama_parallel_test.sh 5
Overall TPS: 63.77
Average per-completion TPS: 20.10
bash ollama_parallel_test.sh 8
Overall TPS: 66.42
Average per-completion TPS: 19.12
```

I have tested around a bit today and there are some results that I don't quite understand.

In my setup I am working with ChatOllama in Langchain and have a list of prompts that should be processed in parallel, so I use the abatch method. These prompts need a large context size of up to 32K.
So I have instantiated ChatOllama with `is_relevant_llm = ChatOllama(model="granite3.2:8b", temperature=0, format="json", max_tokens=100, num_ctx=32000)`.

For processing I have tried different setups:

  1. Starting the Ollama server on the command line with different values for the parameters OLLAMA_NUM_PARALLEL and OLLAMA_CONTEXT_LENGTH:
    `OLLAMA_HOST=localhost OLLAMA_PORT=11434 OLLAMA_NUM_PARALLEL=2 OLLAMA_CONTEXT_LENGTH=32000 ollama serve`
    ...and using this instance from the ChatOllama object. According to the logs, it seems to me that two requests are indeed handled in parallel.

    [ollama_parallel_test.txt](https://github.com/user-attachments/files/20162701/ollama_parallel_test.txt)

  2. Starting multiple Ollama servers on different ports via the command line. In the code I distributed each LLM call to the next server/port. In the macOS Activity Monitor I could also see that GPU usage was distributed across the Ollama servers.
  3. Just using the plain ChatOllama object and, instead of giving all prompts at once with the abatch method, iterating over the list of prompts and calling the invoke method for each sequentially. The Ollama instance was started automatically, not by me via the command line. This is the default way I have used ChatOllama so far, before trying to improve performance.

I have tested all these settings and played around with them, but it seems that the plain ChatOllama handling (case 3) was already the quickest. Can you explain how this is handled behind the scenes? Is my hardware simply not able to process the prompts faster? It seems that when I increase the degree of parallelism, each request just gets slower, so there is no performance increase in the end.

@rick-github commented on GitHub (May 12, 2025):

Note that for `ollama_parallel_test.sh` to show the effect of parallelism, you need to set `OLLAMA_NUM_PARALLEL` to the maximum number of parallel requests you want to handle. You may have done that, but your post and the log only show `OLLAMA_NUM_PARALLEL=2`. You also started the server with `OLLAMA_CONTEXT_LENGTH=32000`, but the log shows the context length is 10000. Also, `OLLAMA_PORT` is not an ollama configuration variable. The form is `OLLAMA_HOST=localhost:11434`, where the port number uses the default 11434 if not supplied (which is why it works anyway in your invocation).
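
In other words, the port belongs inside `OLLAMA_HOST`. A corrected invocation would look something like this sketch (values taken from the post above):

```sh
# Server: the port is part of OLLAMA_HOST; there is no OLLAMA_PORT variable.
OLLAMA_HOST=localhost:11434 OLLAMA_NUM_PARALLEL=4 OLLAMA_CONTEXT_LENGTH=32000 ollama serve

# Client: point the CLI (or API client) at the same host:port.
OLLAMA_HOST=localhost:11434 ollama run granite3.2:8b "count to ten"
```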

I'm not familiar enough with langchain or Apple hardware to offer any explanation about the different scenarios. If scenario 3 is using the default configuration, then it may be starting with OLLAMA_NUM_PARALLEL=4. In general, individual completions will get slower as parallelism increases, while the overall completion rate will increase up to a limit determined by the hardware.

@Teeeto commented on GitHub (May 22, 2025):

I have the same question about scaling Ollama effectively. I have 4 GPUs (Nvidia), but only one is used (the model fits on it completely, 20 GB / 24 GB). I want to avoid using a load balancer and multiple instances, since that introduces another point of configuration/maintenance/failure.
The Ollama docs say there is a built-in balancer, but I cannot get it to work:
https://github.com/ollama/ollama/blob/main/docs/faq.md#how-does-ollama-handle-concurrent-requests
OLLAMA_MAX_LOADED_MODELS is set to 3 and OLLAMA_NUM_PARALLEL to 4, but when running 10 complex parallel requests I cannot see ollama spawning another model.

@rick-github commented on GitHub (May 22, 2025):

It doesn't spawn another model instance; it just uses more of the GPU's capacity when processing in parallel. See the graph in [the earlier comment](https://github.com/ollama/ollama/issues/9054#issuecomment-2839756297). If you want to maximize overall throughput when running 10 complex parallel requests, set `OLLAMA_NUM_PARALLEL=10`. As discussed above, the token generation rate for individual completions will be slower, but the aggregate token generation rate will be higher.

@simonabisiani commented on GitHub (Jun 12, 2025):

@BilibalaX I come from a similar background and am facing a similar problem. Could I reach out to you with some questions?

@falmanna commented on GitHub (Aug 21, 2025):

> No, if you set `OLLAMA_NUM_PARALLEL=4`, ollama will process up to 4 requests at a time, but which GPU they run on is not deterministic. If you really must have each GPU processing a different request simultaneously, you need to run 4 ollama servers, assign each a different GPU with `CUDA_VISIBLE_DEVICES`, and use a load-balancing proxy in front to distribute the queries.
>
> Note ollama gets better utilization when a single GPU does concurrent requests: [...]
>
> So you could also set `OLLAMA_NUM_PARALLEL` for each of these servers and increase throughput. [The docker-compose and nginx example above is quoted in full.]

This should be somewhere in the FAQs. It took me a lot of reading to understand how parallelism works in ollama.

@rick-github commented on GitHub (Aug 21, 2025):

https://github.com/ollama/ollama/blob/main/docs/faq.md#how-does-ollama-handle-concurrent-requests

@falmanna commented on GitHub (Aug 21, 2025):

@rick-github I read that multiple times, but the comment + the plot made it click for me.

Reference: github-starred/ollama#67949