[GH-ISSUE #9054] Ollama Does Not Utilize Multiple Instances of the Same Model for Parallel Processing #67949

Closed
opened 2026-05-04 12:06:24 -05:00 by GiteaMirror · 20 comments

Originally created by @BennisonDevadoss on GitHub (Feb 12, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/9054

I have a server with two Nvidia L4 GPUs, and I’m running the LLaMA 3.1 8B model using Ollama. Here’s the current behavior:

  1. What Works:

    • If a model is large enough to require both GPUs, Ollama successfully splits the workload and utilizes both GPUs for the same instance.
    • If different models are loaded, Ollama utilizes the available VRAM efficiently and runs them concurrently across the GPUs.
  2. What Does Not Work:

    • When multiple users send concurrent requests, Ollama doesn’t load multiple instances of the same model on available VRAM to handle parallel requests. I’ve set the `OLLAMA_NUM_PARALLEL` parameter to 3, but it doesn’t seem to have any effect.

What I Want to Achieve:

  • I’d like Ollama to load multiple instances of the same model on different GPUs or available VRAM to handle parallel user requests efficiently.

Questions:

  • Why doesn’t Ollama load multiple instances of the same model for parallel processing?
  • Is there any configuration or workaround to achieve this behavior?

Any help or insights would be greatly appreciated!

@rick-github commented on GitHub (Feb 12, 2025):

Ollama doesn't load a model multiple times. What `OLLAMA_NUM_PARALLEL` does is create a context buffer for each parallel request; the model weights are shared across those buffers. If your clients send 3 simultaneous requests, they are processed concurrently. If more simultaneous requests arrive than there are completion slots, they are queued until one of the ongoing completions finishes. See the [concurrency section of the FAQ](https://github.com/ollama/ollama/blob/main/docs/faq.md#how-does-ollama-handle-concurrent-requests) for more info on concurrent request processing.
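
A quick way to see the slots in action — a minimal sketch, assuming a local server on the default port and using `llama3.1:8b` as a stand-in model name:

```sh
# Start the server with 3 completion slots per loaded model.
OLLAMA_NUM_PARALLEL=3 ollama serve &

# Send 3 requests at once; all three are decoded concurrently.
# A 4th simultaneous request would be queued until a slot frees up.
for i in 1 2 3; do
  curl -s http://localhost:11434/api/generate \
       -d '{"model":"llama3.1:8b","prompt":"count to ten","stream":false}' &
done
wait
```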

@pdevine commented on GitHub (Feb 12, 2025):

@BennisonDevadoss as @rick-github mentioned, you don't need to waste the VRAM loading the same model into memory multiple times. I'm going to go ahead and close the issue as answered, but feel free to keep commenting.

@BilibalaX commented on GitHub (Apr 29, 2025):

I have a similar question. With four GPUs (40GB VRAM), the model size (20GB) is OK to load on each GPU. My task is to input 100,000 messages for information extraction. If I set OLLAMA_NUM_PARALLEL to 4, does that mean I have tasks running on each of the GPUs separately, and can I get 4 times the speed?

Then there is the other parameter, OLLAMA_MAX_LOADED_MODELS, the maximum number of models that can be loaded. If I set it to 4, will Ollama regard them as 4 instances?

@rick-github commented on GitHub (Apr 29, 2025):

No. If you set `OLLAMA_NUM_PARALLEL=4`, ollama will process up to 4 requests at a time, but which GPU they run on is not deterministic. If you really must have each GPU processing a different request simultaneously, you need to run 4 ollama servers, assign each a different GPU with `CUDA_VISIBLE_DEVICES`, and use a load-balancing proxy in front to distribute the queries.

Note ollama gets better utilization when a single GPU does concurrent requests:

![Throughput vs. number of parallel requests](https://github.com/user-attachments/assets/1c6c3b30-93ea-46c7-95fc-681d83776267)

So you could also set OLLAMA_NUM_PARALLEL for each of these servers and increase throughput. For example:

```yaml
x-ollama: &ollama
  image: ollama/ollama:${OLLAMA_DOCKER_TAG-latest}
  volumes:
    - ${OLLAMA_MODELS-./ollama}:/root/.ollama
  environment: &env
    OLLAMA_KEEP_ALIVE: ${OLLAMA_KEEP_ALIVE--1}
    OLLAMA_LOAD_TIMEOUT: ${OLLAMA_LOAD_TIMEOUT-5m}
    OLLAMA_MAX_LOADED_MODELS: 1
    OLLAMA_NUM_PARALLEL: 2
    OLLAMA_FLASH_ATTENTION: ${OLLAMA_FLASH_ATTENTION-0}
    OLLAMA_DEBUG: ${OLLAMA_DEBUG-1}
  deploy:
    resources:
      reservations:
        devices:
          - driver: nvidia
            count: 1
            capabilities: [gpu]

services:
  ollama-1:
    << : *ollama
    environment:
      << : *env
      CUDA_VISIBLE_DEVICES: 0

  ollama-2:
    << : *ollama
    environment:
      << : *env
      CUDA_VISIBLE_DEVICES: 1

  ollama-3:
    << : *ollama
    environment:
      << : *env
      CUDA_VISIBLE_DEVICES: 2

  ollama-4:
    << : *ollama
    environment:
      << : *env
      CUDA_VISIBLE_DEVICES: 3

  ollama:
    image: nginx-lb
    build:
      dockerfile_inline: |
        FROM nginx:latest
        RUN cat > /etc/nginx/conf.d/default.conf <<EOF
        upstream ollama_group {
          least_conn;
          server ollama-1:11434 max_conns=2;
          server ollama-2:11434 max_conns=2;
          server ollama-3:11434 max_conns=2;
          server ollama-4:11434 max_conns=2;
        }
        server {
          listen 11434;
          server_name localhost;
          location / {
            proxy_pass http://ollama_group;
          }
        }
        EOF
    ports:
      - 11434:11434
```

@BilibalaX commented on GitHub (Apr 30, 2025):

@rick-github Thanks for your detailed explanation. I finally managed to run multiple Ollama instances simultaneously on our HPC system, where I have to use different containers and assign different ports. I’m not a software engineer; I’m using Ollama and LLMs as tools for social-science research, so please forgive any naïve questions.

Regarding the graph you posted, the speed improvement from setting OLLAMA_NUM_PARALLEL is remarkable. My workflow involves feeding many independent texts into the model for information extraction. If I increase OLLAMA_NUM_PARALLEL, can I expect higher overall throughput?

[The FAQ](https://github.com/ollama/ollama/blob/main/docs/faq.md#how-does-ollama-handle-concurrent-requests) mentions that “the default will auto-select either 4 or 1 based on available memory.” Under what circumstances will it choose 4 versus 1? And is there a recommended value for OLLAMA_NUM_PARALLEL given typical hardware capabilities?

Many thanks for your help.

@rick-github commented on GitHub (Apr 30, 2025):

Yes, the overall throughput will increase, up to a point. You can see how the graph starts to flatten: as more and more of the processing units in the GPU are engaged, there will eventually be a bottleneck after which throughput will not increase (and may go down). Note that while the aggregate TPS goes up, the TPS per request goes down, which is a drawback if the clients are expecting real-time interaction. In your case of information extraction I assume that's not an issue.

The selection algorithm is: will 4x context buffer fit in VRAM without spilling layers to system RAM? If yes, `OLLAMA_NUM_PARALLEL=4`, otherwise 1. In your case, where you have 40G VRAM and a 20G model, there's no downside to increasing `OLLAMA_NUM_PARALLEL` until all of the VRAM on a GPU is allocated. You don't indicate what context size you are using, so what you are aiming for is (40 - 20) > (context_size * parallel * k), where k is a constant for how much VRAM a single token takes. You can adjust either context size (`OLLAMA_CONTEXT_LENGTH`) or parallel (`OLLAMA_NUM_PARALLEL`) to maximize the size of the allocated context buffer to the point where most of the VRAM on a GPU has been allocated (you can check with `nvidia-smi`).
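
For a back-of-the-envelope check of that inequality, something like the sketch below works. Every number in it is an assumption to substitute with your own measurements; the per-token KV-cache cost in particular varies a lot by model and quantization (you can estimate it by comparing `nvidia-smi` memory use at two different context lengths).

```sh
#!/bin/bash
# Rough feasibility check for: (VRAM - model size) > context_size * parallel * k
# All values are illustrative assumptions, not measurements.
vram_gb=40            # per-GPU VRAM
model_gb=20           # model weights
context_size=8192     # OLLAMA_CONTEXT_LENGTH per slot
parallel=4            # OLLAMA_NUM_PARALLEL
k_mb_per_token=0.125  # assumed KV-cache cost per token in MB (ballpark for an 8B llama-class model at fp16)

free_mb=$(( (vram_gb - model_gb) * 1024 ))
need_mb=$(echo "$context_size * $parallel * $k_mb_per_token" | bc)
echo "free VRAM: ${free_mb} MB, KV cache needed: ${need_mb} MB"
```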

Also note that because of the way throughput scales with `OLLAMA_NUM_PARALLEL`, it might be simpler to just use a single ollama server, give it access to all GPU devices, and set `OLLAMA_NUM_PARALLEL` to 4x what you would use in the four-server case. Since the model is loaded only once, you get an extra 60G of space for context buffers. The downside is that there is an inherent bottleneck in this arrangement from the PCI bus (see [this comment](https://github.com/ollama/ollama/issues/7648#issuecomment-2473561990)), so performance will not be as good as 4 individual servers. But the simpler configuration (no nginx) might be worth it.
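
A sketch of that simpler single-server alternative, with illustrative values (4 GPUs visible, and 4x the per-server parallelism from the compose file above — tune to your own workload):

```sh
# One ollama server that can see all four GPUs; no nginx needed.
# OLLAMA_NUM_PARALLEL=8 mirrors "4 servers x OLLAMA_NUM_PARALLEL=2" above.
CUDA_VISIBLE_DEVICES=0,1,2,3 \
OLLAMA_KEEP_ALIVE=-1 \
OLLAMA_NUM_PARALLEL=8 \
ollama serve
```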

And a caveat: the scaling of throughput depends on model, configuration, hardware and workload, so some experimentation may be required to get optimal results.

@aubourg commented on GitHub (May 2, 2025):

I have a similar issue; here is my use case. I have a machine with 8 GPUs. I want to run embarrassingly parallel tasks, but with only two classes of models. I see only 2 GPUs are used to process 8 parallel and independent tasks; the others are idle. It looks like, since I am using only two different models, ollama uses 2 GPUs. It would clearly be more efficient to load 4 instances of each model and run on all 8 GPUs. If I use artificially different models, then it indeed runs faster, using more of my GPUs...

@rick-github commented on GitHub (May 2, 2025):

As described above, increase OLLAMA_NUM_PARALLEL, run multiple servers, or both.

@aubourg commented on GitHub (May 2, 2025):

From what I understand, OLLAMA_NUM_PARALLEL will not increase the number of GPUs that are used, so I should run multiple servers. Or perhaps clone the models under different names?

@rick-github commented on GitHub (May 2, 2025):

Models are identified by the sha256 of the model weights, so cloning a model will not allow the same model to be run more than once in a server. If you increase `OLLAMA_NUM_PARALLEL` to the point where the required KV cache no longer fits on a single GPU, ollama will distribute the model and cache across all available GPUs. You can achieve the same effect by setting [`OLLAMA_SCHED_SPREAD=1`](https://github.com/ollama/ollama/blob/a6ef73f4f26a22cc605516113625a404bd064250/envconfig/config.go#L256) in the server environment.
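
For example, a minimal sketch of forcing the spread on a multi-GPU host (the parallel value is just a placeholder to tune):

```sh
# Spread the loaded model and its KV cache across all visible GPUs
# instead of packing it onto the fewest GPUs that can hold it.
OLLAMA_SCHED_SPREAD=1 OLLAMA_NUM_PARALLEL=8 ollama serve &

# Watch per-GPU memory use while sending requests.
nvidia-smi --query-gpu=index,memory.used --format=csv -l 1
```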

@LukasBreit commented on GitHub (May 9, 2025):

I am stuck on this topic. I am quite new to Ollama, so please excuse any misunderstandings.
I am working on an M3 Pro and have a Langchain app where I am using ChatOllama with the model "granite3.2:8b". The same model is invoked several times, which I want to run in parallel.

I have tried serving multiple servers on different ports with commands like `OLLAMA_HOST=localhost:11436 OLLAMA_NUM_PARALLEL=2 ollama serve`, which led to a small performance improvement.

When I serve just a single ollama instance, but with different values for OLLAMA_NUM_PARALLEL, I can't detect any difference in performance. I have attached the output generated when starting the server with `OLLAMA_HOST=localhost OLLAMA_PORT=11434 OLLAMA_NUM_PARALLEL=1 OLLAMA_CONTEXT_LENGTH=32000 ollama serve` - honestly, I don't understand the output.

[ollama serve_output.txt](https://github.com/user-attachments/files/20122769/ollama.serve_output.txt)

Attached is also the code I am using for calling the model. As the context exceeds the default context length, I have set the num_ctx parameter in ChatOllama.

[code.txt](https://github.com/user-attachments/files/20122868/code.txt)

Any feedback would be great. Is it possible to achieve parallel processing on my M3 with this large context size, either by setting the OLLAMA_NUM_PARALLEL variable or by running multiple ollama servers? If more information is needed, just let me know!

@rick-github commented on GitHub (May 12, 2025):

```
[GIN] 2025/05/09 - 15:14:30 | 200 |  6.674368042s |       127.0.0.1 | POST     "/api/chat"
[GIN] 2025/05/09 - 15:14:37 | 200 |  6.598018125s |       127.0.0.1 | POST     "/api/chat"
[GIN] 2025/05/09 - 15:14:41 | 200 |     4.321513s |       127.0.0.1 | POST     "/api/chat"
```

From the log we can see that queries are processed sequentially, because `OLLAMA_NUM_PARALLEL=1`. If you increase this, ollama will do concurrent processing, up to the number you specify. I'm not familiar with the GPU in an M3 Pro (the graph above is for Nvidia hardware), but Apple describes a '16-core Neural Engine', so I imagine you should see some degree of parallelism.

You can use the script below to see if parallelism improves generation speed. I'm not sure how close macOS is to a Linux environment, so you may have to install some command-line utilities. Note that the overall TPS from the script includes prompt processing and `curl` overhead, so it is a little inaccurate, but you should see the rate go up as the amount of parallelism is increased.

```sh
#!/bin/bash

export OLLAMA_HOST=${OLLAMA_HOST-localhost:11434}
export MODEL=${MODEL-granite3.2:8b}

need() {
  _=$(command -v $1) || { echo "Need $1" ; exit 1 ; }
}

need curl
need jq
need dc
need parallel
need date
need seq

parallel=$1
[ -z "$parallel" ] && parallel=2

# load the model
curl -s $OLLAMA_HOST/api/generate -d '{"model":"'$MODEL'"}' >&-

t0=$(date +%s.%N)
res=$(
  for i in $(seq $[parallel * 4]) ; do
    echo '{"model":"'$MODEL'","prompt":"'x=$RANDOM'\ncount to ten","stream":false}'
  done | parallel --jobs $parallel curl -s $OLLAMA_HOST/api/generate -d "{}"
)
t1=$(date +%s.%N)
total_tokens=0
total_elapsed=0
evals=()
while read count duration ; do
  total_tokens=$[total_tokens + count]
  total_elapsed=$[total_elapsed + duration]
  evals+=( $(dc <<< "9k $count $duration 1000000000//pq") )
done <<< "$(echo "$res" | jq -r '"\(.eval_count) \(.eval_duration)"')"

echo "Overall TPS: " $(dc <<< "2k $total_tokens $t1 $t0 - /pq")
echo "Average per-completion TPS $(dc <<< "2k 0 ${evals[*]/%/+} ${#evals[*]} /p")"
$ ./9054.sh 2
Overall TPS:  110.86
Average per-completion TPS 68.12
$ ./9054.sh 3
Overall TPS:  132.55
Average per-completion TPS 65.71
$ ./9054.sh 4
Overall TPS:  155.41
Average per-completion TPS 60.99
```

@LukasBreit commented on GitHub (May 12, 2025):

Thanks for your feedback!

With a few small adjustments I was able to run the script, and I can see the values change in the output:
```
bash ollama_parallel_test.sh 2
Overall TPS: 56.13
Average per-completion TPS: 35.77
bash ollama_parallel_test.sh 3
Overall TPS: 62.42
Average per-completion TPS: 27.78
bash ollama_parallel_test.sh 4
Overall TPS: 66.41
Average per-completion TPS: 20.40
bash ollama_parallel_test.sh 5
Overall TPS: 63.77
Average per-completion TPS: 20.10
bash ollama_parallel_test.sh 8
Overall TPS: 66.42
Average per-completion TPS: 19.12
```

I have tested around a bit today and there are some results that I don't quite understand.

In my setup I am working with ChatOllama in Langchain and have a list of prompts that should be processed in parallel, so I use the abatch method. These prompts need a large context size of up to 32K.
So I have instantiated ChatOllama with `is_relevant_llm = ChatOllama(model="granite3.2:8b", temperature=0, format="json", max_tokens=100, num_ctx=32000)`.

For processing I have tried different setups:

  1. Starting the Ollama server on the command line with different values for the parameters OLLAMA_NUM_PARALLEL and OLLAMA_CONTEXT_LENGTH:
    `OLLAMA_HOST=localhost OLLAMA_PORT=11434 OLLAMA_NUM_PARALLEL=2 OLLAMA_CONTEXT_LENGTH=32000 ollama serve`
    ...and using this instance from the ChatOllama object. According to the logs, it seems to me that two requests are indeed handled in parallel.

    [ollama_parallel_test.txt](https://github.com/user-attachments/files/20162701/ollama_parallel_test.txt)

  2. Starting multiple Ollama servers on different ports via the command line. In the code I distributed each LLM call to the next server/port. In the macOS Activity Monitor I could also see that GPU usage was distributed across the Ollama servers.
  3. Just using the plain ChatOllama object and, instead of giving all prompts at once with the abatch method, iterating over the list of prompts and calling the invoke method for each sequentially. The Ollama instance was started automatically, not by me via the command line. This is the default way I have used ChatOllama so far, before trying to improve performance.

I have tested all these settings and played around with them, but it seems that the plain ChatOllama handling (case 3) was already the quickest. Can you explain how this is handled behind the scenes? Is my hardware simply not able to process the prompts faster? It seems that when I increase the degree of parallelism, each request just gets slower, so there is no performance increase in the end.

@rick-github commented on GitHub (May 12, 2025):

Note that for `ollama_parallel_test.sh` to show the effect of parallelism, you need to set `OLLAMA_NUM_PARALLEL` to the maximum number of parallel requests you want to handle. You may have done that, but your post and the log only show `OLLAMA_NUM_PARALLEL=2`. You also started the server with `OLLAMA_CONTEXT_LENGTH=32000`, but the log shows the context length is 10000. Also, `OLLAMA_PORT` is not an ollama configuration variable. The form is `OLLAMA_HOST=localhost:11434`, where the port number uses the default 11434 if not supplied (which is why it works anyway in your invocation).
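
In other words, the port belongs inside `OLLAMA_HOST`. A corrected invocation would look something like this sketch (values taken from the post above):

```sh
# Server: the port is part of OLLAMA_HOST; there is no OLLAMA_PORT variable.
OLLAMA_HOST=localhost:11434 OLLAMA_NUM_PARALLEL=4 OLLAMA_CONTEXT_LENGTH=32000 ollama serve

# Client: point the CLI (or API client) at the same host:port.
OLLAMA_HOST=localhost:11434 ollama run granite3.2:8b "count to ten"
```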

I'm not familiar enough with langchain or Apple hardware to offer any explanation about the different scenarios. If scenario 3 is using the default configuration, then it may be starting with OLLAMA_NUM_PARALLEL=4. In general, individual completions will get slower as parallelism increases, while the overall completion rate will increase up to a limit determined by the hardware.

@Teeeto commented on GitHub (May 22, 2025):

I have the same question about scaling Ollama effectively. I have 4 GPUs (Nvidia), but only one is used (the model fits on it completely, 20 GB / 24 GB). I want to avoid using a load balancer and multiple instances, since that introduces another point of configuration/maintenance/failure.
The Ollama docs say there is a built-in balancer, but I cannot get it to work:
https://github.com/ollama/ollama/blob/main/docs/faq.md#how-does-ollama-handle-concurrent-requests
OLLAMA_MAX_LOADED_MODELS is set to 3 and OLLAMA_NUM_PARALLEL to 4, but when running 10 complex parallel requests I cannot see ollama spawning another model.

@rick-github commented on GitHub (May 22, 2025):

It doesn't spawn another model instance; it just uses more of the GPU's capacity when processing in parallel. See the graph in [the earlier comment](https://github.com/ollama/ollama/issues/9054#issuecomment-2839756297). If you want to maximize overall throughput when running 10 complex parallel requests, set `OLLAMA_NUM_PARALLEL=10`. As discussed above, the token generation rate for individual completions will be slower, but the aggregate token generation rate will be higher.

@simonabisiani commented on GitHub (Jun 12, 2025):

@BilibalaX I come from a similar background and am facing a similar problem. Could I reach out to you with some questions?

@falmanna commented on GitHub (Aug 21, 2025):

> No, if you set `OLLAMA_NUM_PARALLEL=4`, ollama will process up to 4 requests at a time, but which GPU they run on is not deterministic. If you really must have each GPU processing a different request simultaneously, you need to run 4 ollama servers, assign each a different GPU with `CUDA_VISIBLE_DEVICES`, and use a load-balancing proxy in front to distribute the queries.
>
> Note ollama gets better utilization when a single GPU does concurrent requests: [...]
>
> So you could also set `OLLAMA_NUM_PARALLEL` for each of these servers and increase throughput. [The docker-compose and nginx example above is quoted in full.]

This should be somewhere in the FAQs. It took me a lot of reading to understand how parallelism works in ollama.

@rick-github commented on GitHub (Aug 21, 2025):

https://github.com/ollama/ollama/blob/main/docs/faq.md#how-does-ollama-handle-concurrent-requests

@falmanna commented on GitHub (Aug 21, 2025):

@rick-github I read that multiple times, but the comment + the plot made it click for me.

Reference: github-starred/ollama#67949