[GH-ISSUE #4752] Multi-GPU and batch management #2994

Open
opened 2026-04-12 13:23:24 -05:00 by GiteaMirror · 1 comment
Owner

Originally created by @LaetLanf on GitHub (May 31, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/4752

Hello,

I'm confident that a feature enabling multi-GPU optimization and batch management would be beneficial.

I may have made a mistake, as I couldn't effectively use the ollama_num_parallel and ollama_max_loaded_models settings to optimize my Linux VM, which has four A100 80GB GPUs, using Llama3:70b-instruct.

I finally succeed to use the 4 GPUs in parallel, thanks to separate docker containers assigned to different ports. I also used AsyncClient() with Asyncio for effective asynchronous operations.

In any case, I'm happy to share my code if it might help someone.

Assign docker containers to GPU and ports

sudo docker run -d --gpus=1 -v ollama:/root/.ollama -p 11435:11434 --name ollama0 ollama/ollama:latest
sudo docker run -d --gpus=2 -v ollama:/root/.ollama -p 11436:11434 --name ollama1 ollama/ollama:latest
sudo docker run -d --gpus=3 -v ollama:/root/.ollama -p 11437:11434 --name ollama2 ollama/ollama:latest
sudo docker run -d --gpus=all -v ollama:/root/.ollama -p 11438:11434 --name ollama3 ollama/ollama:latest

Pull llama3:70b-instruct

sudo docker exec -it ollama0 ollama pull llama3:70b-instruct
sudo docker exec -it ollama1 ollama pull llama3:70b-instruct
sudo docker exec -it ollama2 ollama pull llama3:70b-instruct
sudo docker exec -it ollama3 ollama pull llama3:70b-instruct

Python import

import asyncio
import ollama
from ollama import AsyncClient

Chat Ollama with an asynchronous python function

async def ollama_chat_solo(client, messages, model_name):
    response = await client.chat(model=model_name, messages=messages, keep_alive=-1)
    return response

Batch processing, Ollama client and queue management

async def ollama_chat_batches(df, client_pool, sys_instruction, model_name):
    
    nb_thread = len(df['id_msg'])

    # Create an empty queue:
    task_queue = asyncio.Queue()

    # Build and add each task to the queue:
    for i in range(0, nb_questions, 4):
        for j in range(len(client_pool)):
            if i + j < nb_questions:
                id_question = df['id_question'][i + j]
                question = df['question'][i + j]

                messages = [
                    {'role': "system", 'content': sys_instruction},
                    {'role': "user", 'content': question}
                ]
                task = asyncio.ensure_future(ollama_chat_solo(client_pool[j % len(client_pool)], messages, model_name))

                await task_queue.put((id_question, task))

    # Process tasks in the order they were added to the queue
    responses = []
    while not task_queue.empty():
        thread_id, task = await task_queue.get()
        response = await task  # Wait for task completion
        if response is not None:
            responses.append((thread_id, response))  # Store response with thread ID

    return responses

Calling

model_name = 'llama3:70b-instruct'

client_pool = [AsyncClient(host='http://localhost:{}'.format(port)) for port in range(11435, 11439)]

sys_instruction = f"""You are an expert in geographic. Answer the question."""

responses = await ollama_chat_batches(questions_df, client_pool, sys_instruction, model_name)
Originally created by @LaetLanf on GitHub (May 31, 2024). Original GitHub issue: https://github.com/ollama/ollama/issues/4752 Hello, I'm confident that a feature enabling multi-GPU optimization and batch management would be beneficial. I may have made a mistake, as I couldn't effectively use the `ollama_num_parallel` and `ollama_max_loaded_models` settings to optimize my Linux VM, which has four A100 80GB GPUs, using Llama3:70b-instruct. I finally succeed to use the 4 GPUs in parallel, thanks to separate docker containers assigned to different ports. I also used AsyncClient() with Asyncio for effective asynchronous operations. In any case, I'm happy to share my code if it might help someone. # Assign docker containers to GPU and ports ``` sudo docker run -d --gpus=1 -v ollama:/root/.ollama -p 11435:11434 --name ollama0 ollama/ollama:latest sudo docker run -d --gpus=2 -v ollama:/root/.ollama -p 11436:11434 --name ollama1 ollama/ollama:latest sudo docker run -d --gpus=3 -v ollama:/root/.ollama -p 11437:11434 --name ollama2 ollama/ollama:latest sudo docker run -d --gpus=all -v ollama:/root/.ollama -p 11438:11434 --name ollama3 ollama/ollama:latest ``` # Pull llama3:70b-instruct ``` sudo docker exec -it ollama0 ollama pull llama3:70b-instruct sudo docker exec -it ollama1 ollama pull llama3:70b-instruct sudo docker exec -it ollama2 ollama pull llama3:70b-instruct sudo docker exec -it ollama3 ollama pull llama3:70b-instruct ``` # Python import ``` import asyncio import ollama from ollama import AsyncClient ``` # Chat Ollama with an asynchronous python function ``` async def ollama_chat_solo(client, messages, model_name): response = await client.chat(model=model_name, messages=messages, keep_alive=-1) return response ``` # Batch processing, Ollama client and queue management ``` async def ollama_chat_batches(df, client_pool, sys_instruction, model_name): nb_thread = len(df['id_msg']) # Create an empty queue: task_queue = asyncio.Queue() # Build and add each task to the queue: for i in range(0, nb_questions, 4): for j in range(len(client_pool)): if i + j < nb_questions: id_question = df['id_question'][i + j] question = df['question'][i + j] messages = [ {'role': "system", 'content': sys_instruction}, {'role': "user", 'content': question} ] task = asyncio.ensure_future(ollama_chat_solo(client_pool[j % len(client_pool)], messages, model_name)) await task_queue.put((id_question, task)) # Process tasks in the order they were added to the queue responses = [] while not task_queue.empty(): thread_id, task = await task_queue.get() response = await task # Wait for task completion if response is not None: responses.append((thread_id, response)) # Store response with thread ID return responses ``` # Calling ``` model_name = 'llama3:70b-instruct' client_pool = [AsyncClient(host='http://localhost:{}'.format(port)) for port in range(11435, 11439)] sys_instruction = f"""You are an expert in geographic. Answer the question.""" responses = await ollama_chat_batches(questions_df, client_pool, sys_instruction, model_name) ```
GiteaMirror added the feature request label 2026-04-12 13:23:24 -05:00
Author
Owner

@KingingWang commented on GitHub (Jun 2, 2024):

You should be able to mount /usr/local/ollama/ as a read-only volume in Docker. This way, you won't need to download the model for each container.

<!-- gh-comment-id:2143765353 --> @KingingWang commented on GitHub (Jun 2, 2024): You should be able to mount `/usr/local/ollama/` as a read-only volume in Docker. This way, you won't need to download the model for each container.
Sign in to join this conversation.
1 Participants
Notifications
Due Date
No due date set.
Dependencies

No dependencies set.

Reference: github-starred/ollama#2994