[GH-ISSUE #380] Increase Inference Throughput by Employing Parallelism #62207

Closed
opened 2026-05-03 07:53:22 -05:00 by GiteaMirror · 5 comments

Originally created by @gusanmaz on GitHub (Aug 18, 2023).
Original GitHub issue: https://github.com/ollama/ollama/issues/380

I am running the llama2 model for inference on a Mac Mini M2 Pro using LangChain. According to the system monitor, the ollama process doesn't consume significant CPU, but it uses around 95% GPU and around 3 GB of memory. When I run two instances of almost the same code, inference speed drops roughly 2-fold.

The code I am running looks like this:

```python
import json
import time

from langchain.llms import Ollama

with open("queries.json", "r") as file:
    queries = json.load(file)

# Resume from a previous run so already-answered queries are skipped.
try:
    with open("output.json", "r") as file:
        output_data = json.load(file)
except FileNotFoundError:
    output_data = {}

ollama = Ollama(base_url="http://localhost:11434", model="llama2")

prev_time = None

for query in queries:
    if query not in output_data:
        current_time = time.time()

        # Time elapsed since the previous query started.
        if prev_time:
            elapsed_time = current_time - prev_time
            print(f"Elapsed Time: {elapsed_time:.2f} seconds\n")

        out = ollama(query)
        output_data[query] = out
        # Persist after every query so progress survives interruption.
        with open("output.json", "w") as file:
            json.dump(output_data, file, indent=4)
        print("\nQuery: " + query)
        print("Answer: " + out + "\n")

        prev_time = current_time
```

Is there a way to increase inference throughput using parallelism or other methods?

GiteaMirror added the feature request label 2026-05-03 07:53:22 -05:00

@antonpolishko commented on GitHub (Aug 19, 2023):

What you are looking for is called batch inference. Ollama is based on llama.cpp, which doesn't support that yet.

On hosts with CUDA GPUs, exllama supports batch inference, though I only saw 2-3x speedups in my experiments. The other popular approach is vLLM.

LLM inference is mostly bottlenecked by memory bandwidth, and Ollama is good at squeezing every bit of Metal performance out of Macs.
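
For illustration, here is a minimal sketch of offline batch inference with vLLM's Python API; the model name and prompts are just examples, but `LLM.generate` does accept a list of prompts and schedules them together:

```python
# Sketch: offline batch inference with vLLM (model name is illustrative).
from vllm import LLM, SamplingParams

prompts = [
    "What is the capital of France?",
    "Explain batch inference in one sentence.",
    "Write a haiku about GPUs.",
]
sampling_params = SamplingParams(temperature=0.8, max_tokens=128)

# vLLM batches all prompts in one scheduling pass, so throughput scales
# much better than issuing the same requests one at a time.
llm = LLM(model="meta-llama/Llama-2-7b-chat-hf")
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    print(output.prompt, "->", output.outputs[0].text)
```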


@vividfog commented on GitHub (Oct 14, 2023):

Llama.cpp now has batched inference, aka parallel decoding: https://github.com/ggerganov/llama.cpp/blob/11dc1091f64b24ca6d643acc6d0051117ba60161/examples/batched/README.md?plain=1#L4
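
The linked example can be tried roughly like this; treat the exact invocation and model path as assumptions taken from that README, which may have changed since:

```
# Sketch: llama.cpp's batched example (invocation assumed from the linked
# README; verify against your checkout).
#   ./batched MODEL_PATH [PROMPT] [N_PARALLEL]
./batched ./models/llama-2-7b/ggml-model-f16.gguf "Hello my name is" 4
```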


@ishaan-jaff commented on GitHub (Nov 29, 2023):

@gusanmaz @antonpolishko

I'm the maintainer of LiteLLM. We provide an open-source proxy for load balancing Ollama + Azure + OpenAI, and it can process 500+ requests/second.

From the thread it looks like you're trying to maximize throughput (I'd love feedback if that's your goal).

Here's the quick start:

Doc: https://docs.litellm.ai/docs/simple_proxy#load-balancing---multiple-instances-of-1-model

Step 1: Create a config.yaml

```yaml
model_list:
  - model_name: llama2
    litellm_params:
      model: ollama/zephyr
      api_base: http://localhost:11435
  - model_name: llama2
    litellm_params:
      model: ollama/llama2
      api_base: http://localhost:11436
  - model_name: llama2
    litellm_params:
      model: ollama/llama2
      api_base: http://localhost:11434
```
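
The config above assumes three Ollama servers listening on ports 11434-11436. The default install runs one on 11434; as I understand it, extra instances can be started by overriding `OLLAMA_HOST` (treat this as an assumption to verify against the Ollama docs):

```
# Assumed OLLAMA_HOST behavior -- verify against the Ollama docs.
OLLAMA_HOST=127.0.0.1:11435 ollama serve &
OLLAMA_HOST=127.0.0.1:11436 ollama serve &
```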

Step 2: Start the litellm proxy:

```
litellm --config /path/to/config.yaml
```

Step 3: Make a request to the LiteLLM proxy:

```
curl --location 'http://0.0.0.0:8000/chat/completions' \
--header 'Content-Type: application/json' \
--data '{
      "model": "llama2",
      "messages": [
        {
          "role": "user",
          "content": "what llm are you"
        }
      ]
    }'
```
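
Since the proxy speaks the OpenAI chat-completions format (as the curl above shows), one hedged way to drive it from Python is the `openai` client plus a thread pool; the endpoint and model name are taken from this quick start:

```python
# Sketch: concurrent requests against the LiteLLM proxy, assuming the
# OpenAI-compatible endpoint shown in the curl example above.
from concurrent.futures import ThreadPoolExecutor

from openai import OpenAI

client = OpenAI(base_url="http://0.0.0.0:8000", api_key="anything")

def ask(prompt: str) -> str:
    resp = client.chat.completions.create(
        model="llama2",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

prompts = ["what llm are you"] * 8

# The thread pool fans the prompts out concurrently; the proxy load-balances
# them across the Ollama instances listed in config.yaml.
with ThreadPoolExecutor(max_workers=8) as pool:
    for answer in pool.map(ask, prompts):
        print(answer)
```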

@jmorganca commented on GitHub (Dec 22, 2023):

Thanks so much for the issue! Going to merge this with https://github.com/jmorganca/ollama/issues/358


@Franckegao commented on GitHub (May 12, 2024):

@ishaan-jaff how do I run this with LangChain? I'm getting these errors:

```
Error occurred in getting api base - LLM Provider NOT provided. Pass in the LLM provider you are trying to call. You passed model=lama3-70b-chinese
Pass model as E.g. For 'Huggingface' inference endpoints pass in completion(model='huggingface/starcoder',..) Learn more: https://docs.litellm.ai/docs/providers
2024-05-12 23:05:02 - Error occurred in getting api base - LLM Provider NOT provided. Pass in the LLM provider you are trying to call. You passed model=lama3-70b-chinese
Pass model as E.g. For 'Huggingface' inference endpoints pass in completion(model='huggingface/starcoder',..) Learn more: https://docs.litellm.ai/docs/providers
A file generated an exception: LLM Provider NOT provided. Pass in the LLM provider you are trying to call. You passed model=lama3-70b-chinese
Pass model as E.g. For 'Huggingface' inference endpoints pass in completion(model='huggingface/starcoder',..) Learn more: https://docs.litellm.ai/docs/providers. Retry 1..
```

And my config is

```yaml
model_list:
  - model_name: lama3-70b-chinese
    litellm_params:
      model: ollama/wangshenzhi/llama3-70b-chinese-chat-ollama-q4:latest
      api_base: http://localhost:11434
  - model_name: lama3-70b-chinese
    litellm_params:
      model: ollama/wangshenzhi/llama3-70b-chinese-chat-ollama-q4:latest
      api_base: http://localhost:11434
  - model_name: lama3-70b-chinese
    litellm_params:
      model: ollama/wangshenzhi/llama3-70b-chinese-chat-ollama-q4:latest
      api_base: http://localhost:11434
```

And I am calling by using:

```python
model = ChatLiteLLM(
    model="lama3-70b-chinese",
    temperature=0.1,
    streaming=True,
    # top_k=40,
    verbose=True,
)
```
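
Per the error message above, LiteLLM cannot infer a provider from the bare model name: `ChatLiteLLM` calls LiteLLM directly rather than going through the proxy, so the model needs a provider prefix. A sketch of one possible fix, assuming `langchain_community`'s `ChatLiteLLM` and its `api_base` parameter:

```python
# Sketch of a possible fix: add the ollama/ provider prefix so LiteLLM knows
# how to route the call (parameter names assumed from langchain_community).
from langchain_community.chat_models import ChatLiteLLM

model = ChatLiteLLM(
    model="ollama/wangshenzhi/llama3-70b-chinese-chat-ollama-q4:latest",
    api_base="http://localhost:11434",
    temperature=0.1,
    streaming=True,
)
```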