[GH-ISSUE #9607] non global CUDA_VISIBLE_DEVICES #52779

Closed
opened 2026-04-29 00:51:05 -05:00 by GiteaMirror · 5 comments

Originally created by @NGC13009 on GitHub (Mar 9, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/9607

The currently available GPUs are set globally by an environment variable. Would it be possible to provide a method that lets me specify that different models should be loaded onto different GPUs?

For example, have `qwen2.5:32b` load onto `cuda:0,cuda:1` and have `qwq`, which runs with larger contexts, load onto `cuda:2~6`. I ask because I've found that with the current loading behavior, no matter how sequentially models are loaded, multiple models end up evenly distributed across all graphics cards, which is less efficient and wastes more video memory.

Also, setting different concurrency settings for different models is urgently needed, like #8842.
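
For context, the only control available today is process-wide: the environment variable applies to every model a given server loads. A minimal illustration of the current behavior, assuming an NVIDIA setup where the server is started manually:

```bash
# Current behavior: the GPU set is fixed per server process, not per model.
# Every model loaded by this instance is restricted to GPUs 0 and 1.
CUDA_VISIBLE_DEVICES=0,1 ollama serve
```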

GiteaMirror added the feature request label 2026-04-29 00:51:05 -05:00

@rick-github commented on GitHub (Mar 9, 2025):

Use multiple ollama servers, bind the required GPUs to each with CUDA_VISIBLE_DEVICES, and run a proxy in front to provide a unified interface.

For example:

```yaml
x-ollama: &ollama
  image: ollama/ollama:${OLLAMA_DOCKER_TAG-latest}
  volumes:
    - ${OLLAMA_MODELS-./ollama}:/root/.ollama
  environment: &env
    OLLAMA_KEEP_ALIVE: ${OLLAMA_KEEP_ALIVE--1}
    OLLAMA_LOAD_TIMEOUT: ${OLLAMA_LOAD_TIMEOUT-5m}
    OLLAMA_MAX_LOADED_MODELS: 1
    OLLAMA_NUM_PARALLEL: 1
    OLLAMA_FLASH_ATTENTION: ${OLLAMA_FLASH_ATTENTION-0}
    OLLAMA_DEBUG: ${OLLAMA_DEBUG-1}
  deploy:
    resources:
      reservations:
        devices:
          - driver: nvidia
            count: 1
            capabilities: [gpu]

services:
  ollama-1:
    << : *ollama
    environment:
      << : *env
      CUDA_VISIBLE_DEVICES: 0,1

  ollama-2:
    << : *ollama
    environment:
      << : *env
      CUDA_VISIBLE_DEVICES: 2,3,4,5,6

  ollama:
    image: nginx-lb
    build:
      dockerfile_inline: |
        FROM nginx:latest
        RUN cat > /etc/nginx/conf.d/default.conf <<EOF
        upstream ollama_group {
          least_conn;
          server ollama-1:11434 max_conns=8;
          server ollama-2:11434 max_conns=8;
        }
        server {
          listen 11434;
          server_name localhost;
          location / {
            proxy_pass http://ollama_group;
          }
        }
        EOF
    ports:
      - 11434:11434
```
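
A hedged sketch of how this stack might be brought up and checked, assuming the file above is saved as `docker-compose.yml` and the NVIDIA container toolkit is installed:

```bash
# Start both ollama backends and the nginx front end.
docker compose up -d

# The proxy listens on the usual ollama port, so existing clients need no changes:
curl http://localhost:11434/api/tags

# Inspect each backend's startup logs to see which GPUs it detected:
docker compose logs ollama-1
docker compose logs ollama-2
```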

@rick-github commented on GitHub (Mar 9, 2025):

See also #3902


@NGC13009 commented on GitHub (Mar 9, 2025):

Due to network restrictions within China, I may not be able to use the Docker solution. Could you give a solution that runs directly in the system environment? For example, defining temporary environment variables within a specific terminal, starting the model from there, and bypassing ollama's own model process management.

Also, with this approach, does nginx automatically route requests for different models? For example, if ollama-1 runs model A and ollama-2 runs model B, are API requests automatically routed to the right server?


@rick-github commented on GitHub (Mar 9, 2025):

The Docker containers are just a convenient management mechanism; you can start individual servers directly:

```bash
#!/bin/bash
export OLLAMA_HOST=:11435
export CUDA_VISIBLE_DEVICES=0,1
ollama serve &
```

```bash
#!/bin/bash
export OLLAMA_HOST=:11436
export CUDA_VISIBLE_DEVICES=2,3,4,5,6
ollama serve &
```

and configure nginx to talk to the servers:

```nginx
upstream ollama_group {
  least_conn;
  server localhost:11435 max_conns=8;
  server localhost:11436 max_conns=8;
}
server {
  listen 11434;
  server_name localhost;
  location / {
    proxy_pass http://ollama_group;
  }
}
```
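
One way to install that snippet on a stock nginx and sanity-check the whole chain; this is a sketch, and the conf.d path, file name, and systemd unit name are assumptions about a typical Linux install:

```bash
# Hypothetical file name; any *.conf under conf.d is picked up by a default nginx install.
sudo cp ollama-lb.conf /etc/nginx/conf.d/ollama-lb.conf
sudo nginx -t && sudo systemctl reload nginx

# Each backend, then the unified front end, should all answer:
curl http://127.0.0.1:11435/api/version
curl http://127.0.0.1:11436/api/version
curl http://127.0.0.1:11434/api/version
```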

For routing to particular models, you can use litellm in place of nginx and set up aliases for each model:

```yaml
  - model_name: qwen2.5-coder
    litellm_params:
      model: ollama/qwen2.5-coder:32b
      api_base: http://localhost:11435

  - model_name: qwq
    litellm_params:
      model: ollama/qwq
      api_base: http://localhost:11436
```

But this means switching to the OpenAI format, as litellm doesn't have an ollama-API-compatible endpoint (or didn't, last time I checked).
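
For instance, a hedged sketch of what that looks like in practice, assuming the aliases above sit under litellm's `model_list:` key in a config file; the file name and listen port below are assumptions, not part of the original comment:

```bash
# Start the litellm proxy with the model aliases (hypothetical file name and port).
litellm --config litellm-config.yaml --port 4000 &

# Requests use the OpenAI-compatible endpoint; the "model" field selects the alias,
# which litellm maps to the ollama instance configured in api_base.
curl http://localhost:4000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "qwq", "messages": [{"role": "user", "content": "hello"}]}'
```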

There are many other ollama proxy servers on GitHub; some may support request routing based on model.


@NGC13009 commented on GitHub (Mar 12, 2025):

@rick-github Thank you very much for the method; we have solved our problem with it for now.

However, I still hope that in the future ollama will integrate these features into the model's configuration file to make this easier to use!

Reference: github-starred/ollama#52779