Feat: Ability to work with multiple Ollama servers #122

Closed
opened 2025-11-11 14:06:57 -06:00 by GiteaMirror · 18 comments
Owner

Originally created by @jeremiahsb on GitHub (Dec 26, 2023).

Originally assigned to: @tjbck on GitHub.

Is your feature request related to a problem? Please describe.
On my system I have a capable CPU with a large amount of RAM that is able to run quite large models, albeit slowly. I also have an RTX 3060 which is able to run smaller models quite quickly. I can easily have two Docker instances of Ollama running, one in GPU mode and one in CPU-only mode. It would be great to have a single instance of Ollama-Webui with the ability to switch between the two Ollama instances.

Describe the solution you'd like
Have a settings screen where I can add one or more additional Ollama servers. Along with the entry screen, have a toggle or drop-down list that would ideally let me set the default Ollama server on a per-model level.

Describe alternatives you've considered
An alternative solution, which would be less ideal, would be to simply run two instances of Ollama-Webui, each pointing to a different Ollama container.
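As a concrete sketch of the two-instance setup described above, a compose file along these lines runs one GPU and one CPU-only Ollama side by side (service names, host ports, and paths are illustrative, not from this thread):

```yaml
version: '3.8'

services:
  ollama-gpu:
    image: ollama/ollama:latest
    ports:
      - 11434:11434          # GPU instance on the default port
    runtime: nvidia
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: ['0']
              capabilities: [gpu]

  ollama-cpu:
    image: ollama/ollama:latest
    ports:
      - 11435:11434          # CPU-only instance on a different host port
```

With no `runtime: nvidia` / device reservation, the second container falls back to CPU inference; the webui would then need to be pointed at one port or the other.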


@tjbck commented on GitHub (Dec 27, 2023):

Hi, thanks for the suggestion. I'll take a look in the near future and assess its usability/feasibility. Thanks!


@Loki321 commented on GitHub (Dec 30, 2023):

+1 but I would like to use it slightly differently.

I have 2 ollama instances on 2 different machines, one that can only do CPU inference (but with a lot of RAM so can run larger models) and one that is not always available (due to power cost concerns) but uses fast GPU.

Ideally I would be able to define multiple ollama servers in the web-ui and an order in which to use them.

For example, if the machine with the GPU is available (and the selected model is on that machine) then route requests to that instance otherwise use the slower (but always on) machine.

Manual selection would also be fine, or listing them all in the same way the UI currently separates Ollama and external (OpenAI API) models. But being able to define a primary and a backup connection (or more than two, with priorities) would be great, as it would mean I could use a single interface to interact with Ollama and speed things up by just switching on a faster machine when it's needed for the task.
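The priority-with-fallback routing described above boils down to "try servers in preference order, use the first one that answers". A minimal sketch of that selection logic (a hypothetical helper, not part of the webui; the health check is injected so anything from a TCP probe to an `/api/tags` request could back it):

```python
def pick_server(servers, is_alive):
    """Return the first reachable server from a priority-ordered list.

    `servers` is ordered from most to least preferred (e.g. the GPU box
    first, the always-on CPU box last); `is_alive` is any callable that
    health-checks a single server URL and returns True/False.
    """
    for server in servers:
        if is_alive(server):
            return server
    raise RuntimeError("no Ollama server reachable")
```

In practice `is_alive` would also need to verify that the selected model exists on that server, as noted above.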


@dnviti commented on GitHub (Jan 3, 2024):

There would be a "context" problem if you switch from one instance to another in the middle of a conversation.
I tried to assess this during the Kubernetes support development.
As you say, the only way I can think of is to switch instances manually before starting the conversation; in Kubernetes this would mean selecting the exact StatefulSet by its StatefulSet id.

A really cool feature that ollama itself could implement would be to save the "chat context" data into a shared database (like MongoDB) and reuse it as context for the next prompt response. Just speculating; I don't know if that's possible.


@Loki321 commented on GitHub (Jan 3, 2024):

My use case only really needs to be able to spin up a machine and connect when I know I'm going to need it. Wouldn't really happen mid conversation.

Related to saving the chat context though, llama.cpp has a --prompt-cache flag where you can save the prompt cache to a file and load it back in again later. I would think it could be leveraged to achieve what you're talking about. Like you say though, would need to be done in ollama itself.


@dnviti commented on GitHub (Jan 4, 2024):

It would be possible in ollama then; the backend would only need to save the prompt cache under the chat id and reuse it whenever a new question is asked in that same chat, of course using a shared data volume.
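The idea above (a prompt cache keyed by chat id, backed by storage that every instance can see) could be sketched roughly as follows. This is purely speculative: no such API exists in ollama today, and all names are illustrative.

```python
import os


class SharedPromptCache:
    """Store opaque prompt-cache blobs keyed by chat id.

    `root` would be a directory on a shared volume (or the file layer of a
    shared database) that all Ollama instances mount, so any instance can
    resume a conversation another instance started.
    """

    def __init__(self, root):
        self.root = root
        os.makedirs(root, exist_ok=True)

    def _path(self, chat_id):
        return os.path.join(self.root, f"{chat_id}.bin")

    def save(self, chat_id, blob):
        # Overwrite the cache after every response for this chat.
        with open(self._path(chat_id), "wb") as f:
            f.write(blob)

    def load(self, chat_id):
        # Return None on a cache miss (first message in a chat).
        try:
            with open(self._path(chat_id), "rb") as f:
                return f.read()
        except FileNotFoundError:
            return None
```

The blob itself would be whatever the backend's prompt-cache format is (e.g. what llama.cpp writes for `--prompt-cache`, as mentioned above).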


@SethBurkart123 commented on GitHub (Jan 16, 2024):

Just for anyone looking over this thread before the feature gets introduced: if you want to run the same models on multiple Ollama instances (basically just to shorten queue times if a decent number of people are using your webui instance), you can do load balancing with nginx.

This solution isn't completely useful for some of the people in this thread, although it came in handy for me, as I have two GPUs and wanted to make sure that multiple people could generate at once without queueing. If you are using the least_conn; method (not the one described later for @Loki321), both Ollama instances must have the same models; otherwise you can get errors because the webui thinks certain models are available when they aren't.

All you need is an nginx instance running (there are heaps of tutorials for that) and then to put this in your nginx.conf:

http {
    upstream backend_servers {
        least_conn;                     # Enable least connections load balancing

        # Add your Ollama servers here eg. if one of them was http://localhost:11434/api then you would add `server localhost:11434`
        server localhost:11434;          # Ollama server 1
        server localhost:11435;          # Ollama server 2
    }

    server {
        listen 9090;                     # This is the port you would use for the Ollama API URL in the WebUI eg. http://localhost:9090/api

        location / {
            proxy_pass http://backend_servers; # Forward requests to the upstream block
        }
    }
}

@Loki321 if you want to prioritise one server over the other instead of whichever has the smallest queue (ie. prioritise your GPU server when it's available otherwise fallback to cpu) you can remove the least_conn; line and modify the servers to:

        server localhost:11434 max_fails=3 fail_timeout=30s;  # Primary GPU server
        server localhost:11435 backup;                      # Backup CPU only server

@justinh-rahb commented on GitHub (Jan 17, 2024):

@SethBurkart123 fascinating, I didn't think this would actually work, but it seems to so far... I can increase my Mac Studio's throughput by running 3x Ollama "workers" on the same machine; it's got enough CPU and RAM. It's not quite triple the throughput, but it's definitely an improvement for users.


@davidamacey commented on GitHub (Jan 24, 2024):

@SethBurkart123 I followed your instructions, but I am having a bit of trouble getting nginx to forward to the different ollama servers. Below are some code snippets from my nginx config file and the docker compose file that set up the deployment across 4 separate GPUs.

  1. Changed port numbers of ollama containers in compose and conf file, in various combinations
  2. Restarted compose after each change

With the setup below it will run only on the container with the 11434 port. If I change 11434 to a different ollama container, it loads on that respective GPU. I tested with least_conn and with none, which defaults to round_robin.

The load balancer will not send requests to the other ollama containers. I tested with two browsers on the same machine and with multiple different machines.

It may be something simple I missed; open to suggestions.

Any assistance is greatly appreciated.

Docker compose file:

version: '3.8'

services:

  nginx:
    image: nginx:latest
    container_name: nginx
    volumes:
      - ./nginx.conf:/etc/nginx/nginx.conf
    ports:
      - 9090:80
    networks:
      - ollama_net

  ollama-00:
    volumes:
      - /mnt/md0/ollama_models:/root/.ollama
    container_name: ollama-00
    pull_policy: always
    tty: true
    restart: unless-stopped
    image: ollama/ollama:latest
    ports:
      - 11434:11434
    # GPU support
    runtime: nvidia
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: [ '0' ]
              capabilities:
                - gpu
    networks:
      - ollama_net

  ollama-01:
    volumes:
      - /mnt/md0/ollama_models:/root/.ollama
    container_name: ollama-01
    pull_policy: always
    tty: true
    restart: unless-stopped
    image: ollama/ollama:latest
    ports:
      - 11435:11434
    # GPU support
    runtime: nvidia
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: [ '1' ]
              capabilities:
                - gpu
    networks:
      - ollama_net

  ollama-02:
    volumes:
      - /mnt/md0/ollama_models:/root/.ollama
    container_name: ollama-02
    pull_policy: always
    tty: true
    restart: unless-stopped
    image: ollama/ollama:latest
    ports:
      - 11436:11434
    # GPU support
    runtime: nvidia
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: [ '2' ]
              capabilities:
                - gpu
    networks:
      - ollama_net

  ollama-03:
    volumes:
      - /mnt/md0/ollama_models:/root/.ollama
    container_name: ollama-03
    pull_policy: always
    tty: true
    restart: unless-stopped
    image: ollama/ollama:latest
    ports:
      - 11437:11434
    # GPU support
    runtime: nvidia
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: [ '3' ]
              capabilities:
                - gpu
    networks:
      - ollama_net

  ollama-webui:
    build:
      context: .
      dockerfile: Dockerfile
    container_name: ollama-webui
    volumes:
      - /mnt/md0/ollama_webui/:/app/backend/data
    depends_on:
      - nginx
      - ollama-00
      - ollama-01
      - ollama-02
      - ollama-03
    ports:
      - 3000:8080
    environment:
      - OLLAMA_API_BASE_URL=http://nginx:9090/api
    # extra_hosts:
    #   - host.docker.internal:host-gateway
    restart: unless-stopped
    networks:
      - ollama_net

networks:
  ollama_net:
    driver: bridge

# volumes:
#   ollama: {}
#   ollama-webui: {}
nginx.conf:

worker_processes auto;

events { worker_connections 1024; }

http {
    upstream ollama {
        least_conn;                     # Enable least connections load balancing

        server ollama-00:11434;          # Ollama server 0
        server ollama-01:11435;          # Ollama server 1
        server ollama-02:11436;          # Ollama server 2
        server ollama-03:11437;          # Ollama server 3
    }

    server {
        listen 9090;                     # This is the port you would use for the Ollama API URL in the WebUI eg. http://localhost:9090/api

        location / {
            proxy_pass http://ollama; # Forward requests to the upstream block
        }
    }
}


@justinh-rahb commented on GitHub (Jan 24, 2024):

@davidamacey here are two possible solutions to address the issue in your Nginx configuration file:

  1. Change all the ports in your nginx.conf file to 11434, while keeping the current hostnames as they are. This will ensure that Nginx communicates with the containers on their internal Docker network with port 11434. In this case you don't actually need to publish external ports unless you have something else accessing your Ollama instances directly from outside of Docker.
  2. Change the ollama-##:1143# hostnames in your nginx.conf file to host.docker.internal:1143# instead. This will ensure that Nginx communicates with the containers using their published external ports. This is one way to do it, but I'd recommend the first.

Either of these changes should resolve the issue with your Nginx configuration file.
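Applied to the config shown earlier, option 1 would look roughly like this (same container hostnames, internal port everywhere; the 1143x host ports then only matter for access from outside Docker):

```nginx
upstream ollama {
    least_conn;                  # Least-connections load balancing

    # Inside the Docker network every container listens on 11434.
    server ollama-00:11434;
    server ollama-01:11434;
    server ollama-02:11434;
    server ollama-03:11434;
}
```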


@davidamacey commented on GitHub (Jan 24, 2024):

@justinh-rahb and @SethBurkart123 Thank you for the prompt response! I tried a few more tests.

  1. I changed all ports in the config to the same 11434, but I had the same result: sending two simultaneous requests from two browsers resulted in a queue for GPU 0
  2. host.docker.internal was not resolvable in nginx
  3. I tried making separate volumes for each ollama container, in case there was per-container config. Currently I am using the same volume for all of the ollama model files.

I greatly appreciate your quick feedback. I will continue to troubleshoot.


@davidamacey commented on GitHub (Jan 25, 2024):

@justinh-rahb and @SethBurkart123 I appreciate the guidance. Unfortunately, I was not able to get multiple ollama containers deployed behind an nginx load balancer. It seems the FastAPI StreamRequest receives a stream of data followed by the final POST once the chat is completed. My conclusion (after consulting with friends) is that nginx is working, but it doesn't handle the API stream as we expect.

This drove me to learn about vLLM. vLLM directly complies with OpenAI's API format, so I can deploy a local vLLM container with a selected LLM. In the UI I enter an EMPTY key and the URL of the vLLM instance, select the model in chat, and off to the races! This provides faster response times and async requests.

The con of vLLM is that it requires an NVIDIA GPU, which not all users will have given the popularity of Apple Silicon M chips, etc.

I am happy to report your application works with a vLLM backend without much effort. There is one potential bug with the System Prompt, but I will open an issue for it.

Below is my Docker compose setup for those who are interested in giving it a try. Note: an ollama container is still required; otherwise the UI will throw errors that there isn't an Ollama connection.

version: '3.8'

services:

  ollama-00:
    volumes:
      - /mnt/nas/ollama_webui/ollama:/root/.ollama
    container_name: ollama-00
    pull_policy: always
    tty: true
    restart: unless-stopped
    image: ollama/ollama:latest
    ports:
      - 11434:11434
    # GPU support
    runtime: nvidia
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: [ '0' ]
              capabilities:
                - gpu
    networks:
      - ollama_net

  vllm:
    container_name: vllm
    image: vllm/vllm-openai:latest
    pull_policy: always
    volumes:
      - /mnt/nas/hf_vllm_models/:/root/.cache/huggingface
    environment:
      - HUGGING_FACE_HUB_TOKEN=<token>
    ports:
      - "8000:8000"
    ipc: host
    # command: ["--model", "mistralai/Mixtral-8x7B-Instruct-v0.1"]
    command: ["--model", "mistralai/Mistral-7B-Instruct-v0.2", "--tensor-parallel-size", "2"]
    runtime: nvidia
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: ['0', '2' ]
              capabilities:
                - gpu
    networks:
      - ollama_net

  ollama-webui:
    image: ghcr.io/ollama-webui/ollama-webui:main
    container_name: ollama-webui
    pull_policy: always
    volumes:
      - /mnt/nas/ollama_webui/webui:/app/backend/data
    depends_on:
      - vllm
      - ollama-00
    ports:
      - 3000:8080
    environment:
      - OLLAMA_API_BASE_URL=http://ollama-00:11434/api
      - OPENAI_API_BASE_URL=http://vllm:8000/v1
      - OPENAI_API_KEY=EMPTY
    restart: unless-stopped
    networks:
      - ollama_net

networks:
  ollama_net:
    driver: bridge

Happy coding!


@explorigin commented on GitHub (Feb 12, 2024):

I recommend that webui become a frontend to litellm proxy. There are a lot of things that litellm can do that we could be the pretty face for. Managing multiple endpoints is one of them. https://docs.litellm.ai/docs/proxy/configs
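For reference, the litellm proxy can route a single model name across several Ollama endpoints by repeating it in `model_list`; a minimal config would look along these lines (hostnames are illustrative, and the linked docs are authoritative for the exact schema):

```yaml
model_list:
  - model_name: llama2
    litellm_params:
      model: ollama/llama2
      api_base: http://gpu-box:11434
  - model_name: llama2
    litellm_params:
      model: ollama/llama2
      api_base: http://cpu-box:11434
```

Requests for `llama2` are then load-balanced by the proxy across both `api_base` entries.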


@justinh-rahb commented on GitHub (Feb 12, 2024):

> I recommend that webui become a frontend to litellm proxy. There are a lot of things that litellm can do that we could be the pretty face for. Managing multiple endpoints is one of them. https://docs.litellm.ai/docs/proxy/configs

I've been beating this drum for a while now; it does seem to me to be the quickest way to bootstrap support for other backends.


@VfBfoerst commented on GitHub (Feb 20, 2024):

Speaking of litellm, I got it to work with my open-webui, and it handles load balancing very well (tested with 2 GPUs and 4 Ollama instances).
The only "problem" appeared after adding authentication to the litellm proxy server. Then the webui couldn't talk to the API anymore, I guess because there is no way to add the bearer token.
Can you maybe add a bearer token field when adding the external litellm API URL, e.g. here?:
[image]

Also, it would be nice to set different bearer tokens per user, so I can track the usage of the litellm API on a per-user basis.

I can also create a new issue with this, if wanted :)

Edit: the corresponding header is e.g. curl http://123.123.123.123/v1/chat/completions -H 'Authorization: Bearer sk-1234'


@VfBfoerst commented on GitHub (Feb 22, 2024):

> Speaking of litellm, I got it to work with my open-webui and it handles loadbalancing very well (tested with 2 GPUs and 4 Ollama Instances). The only "problem" which appeared was after adding authentication to the litellm proxy server. Then, the webui couldn't speak with the API anymore, I guess because of the missing possibility to add the bearer token. Can you maybe add a bearer token field when adding the external litellm-api URL, e.g. here?: [image]
>
> Also it would be nice to set different bearer tokens per user, so i can be able to track the usage of the litellm api on a per user basis.
>
> I can also create a new issue with this, if wanted :)
>
> Edit: the corresponding header is e.g. curl http://123.123.123.123/v1/chat/completions -H 'Authorization: Bearer sk-1234'

Works in the newest version, thanks. You can set the token in the API key field :}


@christopher-kapic commented on GitHub (Feb 22, 2024):

I know this isn't exactly what you guys are discussing, but I'm also not sure if it's sufficiently different to open a new feature request. If it is, I can do that.

It would be awesome if users could add their own OpenAI endpoints/API keys, or at least allow multiple OpenAI endpoints from the admin's perspective (I think this option would require database changes, and I'm not familiar enough with peewee at the moment to implement it).

One way this could be done specifically for individual users is how TypingMind does it—storing the API key and endpoint in localStorage and making a direct request from the browser to the OpenAI endpoint (although this may introduce CORS errors, I think only when working with custom APIs that follow the OpenAI API specs). However, I think the ideal solution would allow users to store their custom keys/endpoints in the database so the request can be made on the backend to avoid CORS errors.

Any thoughts on this?


@justinh-rahb commented on GitHub (Feb 22, 2024):

@christopher-kapic, the WebUI initially processed OpenAI requests solely on the browser side, with settings stored in local storage exactly as you say. However, we received several requests about it and decided to change the implementation to proxy through the backend like the Ollama API requests. It appears that supporting both methods might be necessary to cater to all users, but this could become quite intricate.


@yanchxx commented on GitHub (Jun 22, 2025):

> @justinh-rahb and @SethBurkart123 I appreciate the guidance. Unfortunately, I was not able to get multiple ollama containers deployed behind an nginx load balancer. It seems the FastAPI StreamRequest receives a stream of data followed by the final POST once the chat is completed. My conclusion (after consulting with friends) is that nginx is working, but it doesn't handle the API stream as we expect.
>
> This drove me to learn about vLLM. vLLM is directly compliant with OpenAI's API format, so I can deploy a local vLLM container with a selected LLM. In the UI I enter an EMPTY key and the URL of the vLLM instance, select the model in chat, and off to the races! This provides faster response times and async requests.
>
> The con of vLLM is that it requires an NVIDIA GPU, so not all users will have this, given the popularity of Apple Silicon M chips, etc.
>
> I am happy to report your application works with a vLLM backend without much effort. There is one potential bug with the System Prompt, but I will open an issue for it.
>
> Below is my Docker Compose setup for those interested in giving it a try. Note: the ollama container is required, otherwise the UI will throw errors that there isn't an ollama connection.
>
> ```yaml
> version: '3.8'
>
> services:
>
>   ollama-00:
>     volumes:
>       - /mnt/nas/ollama_webui/ollama:/root/.ollama
>     container_name: ollama-00
>     pull_policy: always
>     tty: true
>     restart: unless-stopped
>     image: ollama/ollama:latest
>     ports:
>       - 11434:11434
>     # GPU support
>     runtime: nvidia
>     deploy:
>       resources:
>         reservations:
>           devices:
>             - driver: nvidia
>               device_ids: ['0']
>               capabilities:
>                 - gpu
>     networks:
>       - ollama_net
>
>   vllm:
>     container_name: vllm
>     image: vllm/vllm-openai:latest
>     pull_policy: always
>     volumes:
>       - /mnt/nas/hf_vllm_models/:/root/.cache/huggingface
>     environment:
>       - HUGGING_FACE_HUB_TOKEN=<token>
>     ports:
>       - "8000:8000"
>     ipc: host
>     # command: ["--model", "mistralai/Mixtral-8x7B-Instruct-v0.1"]
>     command: ["--model", "mistralai/Mistral-7B-Instruct-v0.2", "--tensor-parallel-size", "2"]
>     runtime: nvidia
>     deploy:
>       resources:
>         reservations:
>           devices:
>             - driver: nvidia
>               device_ids: ['0', '2']
>               capabilities:
>                 - gpu
>     networks:
>       - ollama_net
>
>   ollama-webui:
>     image: ghcr.io/ollama-webui/ollama-webui:main
>     container_name: ollama-webui
>     pull_policy: always
>     volumes:
>       - /mnt/nas/ollama_webui/webui:/app/backend/data
>     depends_on:
>       - vllm
>       - ollama-00
>     ports:
>       - 3000:8080
>     environment:
>       - OLLAMA_API_BASE_URL=http://ollama-00:11434/api
>       - OPENAI_API_BASE_URL=http://vllm:8000/v1
>       - OPENAI_API_KEY=EMPTY
>     restart: unless-stopped
>     networks:
>       - ollama_net
>
> networks:
>   ollama_net:
>     driver: bridge
> ```
>
> Happy coding!

It seems to be a bug in the latest version of nginx. I used nginx 1.18 and it load balances normally. This also troubled me for a long time, haha.
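For anyone who still wants nginx in front of multiple Ollama instances, a hedged config sketch (upstream names and ports are placeholders, not a tested setup): chat responses are streamed as chunked/SSE-style output, so disabling proxy buffering is commonly what is needed to keep the stream flowing through the load balancer:

```nginx
upstream ollama_pool {
    # Placeholder backends; one entry per Ollama instance.
    server ollama-00:11434;
    server ollama-01:11434;
}

server {
    listen 11434;

    location / {
        proxy_pass http://ollama_pool;
        proxy_http_version 1.1;       # needed for chunked streaming responses
        proxy_set_header Connection "";
        proxy_buffering off;          # do not buffer the token stream
        proxy_read_timeout 300s;      # allow long generations
    }
}
```

Whether buffering defaults are the cause of the version-dependent behaviour reported above is an assumption, but `proxy_buffering off` is the usual first thing to try when a streaming API stalls behind nginx.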


Reference: github-starred/open-webui#122