[GH-ISSUE #764] How to multi threading with api << python >> #62400

Closed
opened 2026-05-03 08:48:52 -05:00 by GiteaMirror · 4 comments

Originally created by @missandi on GitHub (Oct 12, 2023).
Original GitHub issue: https://github.com/ollama/ollama/issues/764

import json
import requests

# Base URL of the local Ollama server (the default port is 11434; adjust if yours differs)
BASE_URL = "http://localhost:11434"

def generate(model_name, prompt, system=None, template=None, context=None, options=None, callback=None):
    try:
        url = f"{BASE_URL}/api/generate"
        payload = {
            "model": model_name,
            "prompt": prompt,
            "system": system,
            "template": template,
            "context": context,
            "options": options
        }

        # Remove keys with None values
        payload = {k: v for k, v in payload.items() if v is not None}

        with requests.post(url, json=payload, stream=True) as response:
            response.raise_for_status()

            # Holds the context returned by the final chunk
            final_context = None

            # Holds the concatenated response text if no callback is provided
            full_response = ""

            # Iterate over the streamed response line by line
            for line in response.iter_lines():
                if line:
                    # Each line is one JSON chunk
                    chunk = json.loads(line)

                    # If a callback function is provided, call it with the chunk
                    if callback:
                        callback(chunk)
                    else:
                        # If this is not the last chunk, append the "response" field and print it
                        if not chunk.get("done"):
                            response_piece = chunk.get("response", "")
                            full_response += response_piece
                            print(response_piece, end="", flush=True)

                    # The last chunk (done is true) carries the conversation context
                    if chunk.get("done"):
                        final_context = chunk.get("context")
            # Return the full response and the final context
            return full_response, final_context
    except requests.exceptions.RequestException as e:
        print(f"An error occurred: {e}")
        return None, None

I am currently using this function to call the API locally from Python, but I have observed that the performance is notably slow. As a potential solution, I am considering multi-threading so that multiple requests run simultaneously. I would greatly appreciate any assistance or recommendations. Thank you for your support.
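For illustration, here is a minimal sketch of how requests could be fanned out with concurrent.futures.ThreadPoolExecutor, reusing the generate() function above; the model name and prompts are placeholders. As the replies below note, whether this actually achieves parallelism depends on the server: a single Ollama instance at the time simply queued the requests.

from concurrent.futures import ThreadPoolExecutor

# Example prompts to run against the same model (placeholders)
prompts = [
    "Why is the sky blue?",
    "Summarize the plot of Hamlet in one sentence.",
    "Write a haiku about the sea.",
]

def run_one(prompt):
    # Collect streamed chunks through a callback instead of printing,
    # so output from different worker threads does not interleave on stdout
    pieces = []

    def collect(chunk):
        if not chunk.get("done"):
            pieces.append(chunk.get("response", ""))

    generate("mistral", prompt, callback=collect)
    return "".join(pieces)

# Issue the requests from a small pool of worker threads
with ThreadPoolExecutor(max_workers=3) as pool:
    results = list(pool.map(run_one, prompts))

for prompt, result in zip(prompts, results):
    print(f"--- {prompt}\n{result}\n")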


@mxyng commented on GitHub (Oct 25, 2023):

Ollama currently queues requests, so multithreaded Python API requests will simply be queued.

You could start multiple instances of Ollama and have your client send requests to the different instances. However, the limitation is the hardware: a single model will use all available resources for inference, so starting multiple instances reduces the performance of each instance in proportion to the number of instances.

As an example, if a single instance of a 7B model evaluates at ~12 tokens/s, 4 instances of the same 7B model will evaluate at ~3 tokens/s.


@oliverbob commented on GitHub (Nov 22, 2023):

May I know the line of code that allows setting the number of Ollama instances and/or threads without going through a Modelfile? For some reason my system cannot locate the Modelfile.

I don't mind recompiling if needed. Can I also enable multithreading as an extra parameter in Docker?

I have 8 vCPUs on DigitalOcean with 16 GB of RAM. I don't mind increasing the resources with a dedicated droplet, renting a GPU in the cloud, or upgrading the physical hardware on my local server to achieve the desired result. Currently it is running the 7B-parameter mistral and the 3B-parameter orca-mini, but I only receive a response when the queue finishes.

In particular, I want to achieve lightning speed for at least 12 or 24 threads/sessions.

Locally, I want to be able to run it on a target system with the following specs:

GPU: 24 GB VRAM (also, what difference will it make if I make it 48 GB?)
CPU (Ryzen 9): 24 threads
RAM (DDR5): 128 GB

I came here to look for answers. Your help is appreciated.

Thank you very much.


@morgendigital commented on GitHub (Dec 1, 2023):

On Linux, you can use systemd services to spin up multiple Ollama instances on different ports. This allows you to serve multiple requests at once; a client-side sketch for spreading requests across the instances follows the steps below.

  1. Create an ollama-x.service file, where x is the instance number (e.g. ollama-1.service), in the /etc/systemd/system folder.
  2. Copy the configuration example below, setting the port (11435) in the OLLAMA_HOST variable uniquely for each instance.
[Unit]
Description=Ollama Service 1
After=network-online.target

[Service]
ExecStart=/usr/local/bin/ollama serve
User=ollama
Group=ollama
Restart=always
RestartSec=3
Environment="OLLAMA_ORIGINS=*"
Environment="OLLAMA_HOST=127.0.0.1:11435"
Environment="PATH=/root/.cargo/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin"

[Install]
WantedBy=default.target
  3. Run systemctl daemon-reload && systemctl start ollama-x.service (replace x with your instance number).
  4. Optionally, keep the service enabled across reboots with systemctl enable ollama-x.service.
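As an illustrative sketch of the client side (the port list, model name, and prompts below are assumptions), requests could be round-robined across the instances from worker threads; "stream": false asks the server for a single JSON reply instead of a streamed one.

from concurrent.futures import ThreadPoolExecutor
from itertools import cycle

import requests

# Base URLs of the Ollama instances started above (example ports; one per systemd unit)
instance_urls = cycle([
    "http://127.0.0.1:11434",
    "http://127.0.0.1:11435",
    "http://127.0.0.1:11436",
])

def generate_on(base_url, prompt):
    # Non-streaming request to one specific instance; the reply is a single JSON object
    resp = requests.post(
        f"{base_url}/api/generate",
        json={"model": "mistral", "prompt": prompt, "stream": False},
    )
    resp.raise_for_status()
    return resp.json().get("response", "")

prompts = ["Prompt one", "Prompt two", "Prompt three", "Prompt four"]

# Round-robin each prompt onto the next instance and run the requests in parallel threads
with ThreadPoolExecutor(max_workers=3) as pool:
    futures = [pool.submit(generate_on, next(instance_urls), p) for p in prompts]
    results = [f.result() for f in futures]

for p, r in zip(prompts, results):
    print(f"--- {p}\n{r}\n")

Each instance still loads its own copy of the model and shares the same hardware, so, as noted earlier in the thread, per-instance throughput drops roughly in proportion to the number of instances.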

@jmorganca commented on GitHub (Dec 22, 2023):

Hi folks! Thanks so much for the issue. I'll merge this with the existing issue open for parallel requests https://github.com/jmorganca/ollama/issues/358

Reference: github-starred/ollama#62400