[GH-ISSUE #7570] How to install Ollama in a distributed manner #51334

Closed
opened 2026-04-28 19:33:12 -05:00 by GiteaMirror · 2 comments

Originally created by @smileyboy2019 on GitHub (Nov 8, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/7570

How can two servers, each with 4090 graphics cards, be connected to provide a single unified service?

GiteaMirror added the feature request label 2026-04-28 19:33:12 -05:00

@rick-github commented on GitHub (Nov 8, 2024):

ollama doesn't currently support distributed inference. See #6729 for ongoing work.

If you don't need to run a single model across multiple distributed cards, then a load-balancing proxy like [litellm](https://github.com/BerriAI/litellm) or [ollama_proxy](https://github.com/ParisNeo/ollama_proxy_server) might work for you.
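
For illustration, a minimal litellm proxy config along these lines might look like the following. This is a sketch, not a tested setup: the server addresses and model name are placeholders, 11434 is Ollama's default port, and the exact config keys can vary between litellm versions. Two `model_list` entries sharing one `model_name` are load-balanced across both backends:

```
# config.yaml -- litellm proxy in front of two Ollama servers (sketch)
model_list:
  - model_name: llama3            # the name clients request
    litellm_params:
      model: ollama/llama3        # backend model on server 1
      api_base: http://SERVER_1:11434
  - model_name: llama3            # same name -> router balances across both
    litellm_params:
      model: ollama/llama3        # backend model on server 2
      api_base: http://SERVER_2:11434

router_settings:
  routing_strategy: least-busy    # default is simple-shuffle
```

The proxy would then be started with `litellm --config config.yaml` and expose a single OpenAI-compatible endpoint in front of both Ollama servers.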


@Jotschi commented on GitHub (Nov 8, 2024):

I use nginx for this; no dedicated service is needed. A cluster is selected with the custom `X-LLM-Cluster` HTTP header, which lets me reach different Ollama clusters through a single endpoint.


```
upstream ollama_nemo {
    least_conn;

    server SERVER_1:11436 max_conns=8; # GPU 1
    server SERVER_2:11436 max_conns=8; # GPU 2
}

upstream ollama_fallback {
    least_conn;

    server SERVER_1:11439 max_conns=8; # GPU 1
}

server {
    listen       8080;
    server_name  localhost;

    location / {
        # Route to the nemo cluster when the custom header matches;
        # otherwise fall through to the fallback upstream.
        if ($http_x_llm_cluster = "nemo") {
            proxy_pass http://ollama_nemo;
        }

        proxy_pass http://ollama_fallback;
        #return 400 'X-LLM-Cluster header not found or invalid';
    }

    # redirect server error pages to the static page /50x.html
    error_page   500 502 503 504  /50x.html;
    location = /50x.html {
        root   /usr/share/nginx/html;
    }
}
```
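
A request could then be steered like this (a sketch, assuming the proxy runs on localhost:8080; `mistral-nemo` is a placeholder for whatever model the target cluster serves):

```
# Route to the nemo cluster via the custom header
curl http://localhost:8080/api/generate \
  -H "X-LLM-Cluster: nemo" \
  -d '{"model": "mistral-nemo", "prompt": "Why is the sky blue?"}'

# Without the header, the request falls through to the fallback upstream
curl http://localhost:8080/api/generate \
  -d '{"model": "mistral-nemo", "prompt": "Why is the sky blue?"}'
```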
Reference: github-starred/ollama#51334