[GH-ISSUE #12097] VRAM consumption on 4xH100NVL thru parallel gpt-oss-20b #70098

Closed
opened 2026-05-04 20:21:04 -05:00 by GiteaMirror · 3 comments

Originally created by @BBOBDI on GitHub (Aug 27, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/12097

What is the issue?

Hi,

I have a Debian Linux machine with four Nvidia H100 NVLs (4x94GB VRAM). I'd like to run a gpt-oss-20b serving hub on it, with a context window of 131,072 tokens for each model, and with OLLAMA_NUM_PARALLEL set as high as possible. But I notice a sudden jump in VRAM consumption as soon as the OLLAMA_NUM_PARALLEL value pushes the requirement past what a single H100 NVL card holds (94GB):

  • OLLAMA_NUM_PARALLEL = 1 ==> memory.required.full="31.7 GiB"
  • OLLAMA_NUM_PARALLEL = 2 ==> memory.required.full="50.8 GiB"
  • OLLAMA_NUM_PARALLEL = 3 ==> memory.required.full="70.0 GiB"
  • OLLAMA_NUM_PARALLEL = 4 ==> memory.required.full="89.1 GiB"

And then...

  • OLLAMA_NUM_PARALLEL = 5 ==> memory.required.full="270.0 GiB"

And it gets worse with OLLAMA_NUM_PARALLEL = 6 :-( At that value, inference falls back to CPU only.

I've attached the Ollama logs for each value of OLLAMA_NUM_PARALLEL. Going by the numbers, a machine like mine should support an OLLAMA_NUM_PARALLEL of around 16. Did I miss something? Or is this a bug?
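
For reference, the increments above are nearly constant at about 19.1 GiB per slot, which is where a target of around 16 comes from. A quick sanity check (back-of-the-envelope only; it ignores any per-device overhead):

# Per-slot increments: 50.8-31.7 = 19.1, 70.0-50.8 = 19.2, 89.1-70.0 = 19.1 GiB
echo "31.7 + 15 * 19.1" | bc   # 318.2 GiB estimated for 16 slots
echo "4 * 94" | bc             # 376 GiB aggregate VRAM available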

  • gpt-oss-20b-OLLAMA_NUM_PARALLEL_1.txt (https://github.com/user-attachments/files/22006918/gpt-oss-20b-OLLAMA_NUM_PARALLEL_1.txt)
  • gpt-oss-20b-OLLAMA_NUM_PARALLEL_2.txt (https://github.com/user-attachments/files/22006920/gpt-oss-20b-OLLAMA_NUM_PARALLEL_2.txt)
  • gpt-oss-20b-OLLAMA_NUM_PARALLEL_3.txt (https://github.com/user-attachments/files/22006922/gpt-oss-20b-OLLAMA_NUM_PARALLEL_3.txt)
  • gpt-oss-20b-OLLAMA_NUM_PARALLEL_4.txt (https://github.com/user-attachments/files/22006923/gpt-oss-20b-OLLAMA_NUM_PARALLEL_4.txt)
  • gpt-oss-20b-OLLAMA_NUM_PARALLEL_5.txt (https://github.com/user-attachments/files/22006925/gpt-oss-20b-OLLAMA_NUM_PARALLEL_5.txt)
  • gpt-oss-20b-OLLAMA_NUM_PARALLEL_6.txt (https://github.com/user-attachments/files/22006929/gpt-oss-20b-OLLAMA_NUM_PARALLEL_6.txt)

Relevant log output

(none provided)
OS

Linux

GPU

Nvidia

CPU

Intel

Ollama version

0.11.7

GiteaMirror added the bug label 2026-05-04 20:21:04 -05:00

@rick-github commented on GitHub (Aug 27, 2025):

The memory graph is duplicated per device, so once the model spills across multiple devices, the memory requirement shoots up. You can reduce the size of the graph by setting OLLAMA_FLASH_ATTENTION=1, but it will still be duplicated per device. One approach is to run one model per GPU (thereby avoiding the duplication) by binding each ollama server to a single GPU, then using a load balancer to route queries.
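
For a quick test without containers, the flag can also be set directly in the environment of a standalone server (a minimal sketch; adapt to however you launch ollama):

OLLAMA_FLASH_ATTENTION=1 OLLAMA_NUM_PARALLEL=4 ollama serve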

An example docker config:

# Shared base definition, merged into each of the four Ollama services below.
x-ollama: &ollama
  image: ollama/ollama:${OLLAMA_DOCKER_TAG-latest}
  volumes:
    - ${OLLAMA_MODELS-./ollama}:/root/.ollama
  environment: &env
    OLLAMA_KEEP_ALIVE: ${OLLAMA_KEEP_ALIVE--1}
    OLLAMA_LOAD_TIMEOUT: ${OLLAMA_LOAD_TIMEOUT-5m}
    OLLAMA_MAX_LOADED_MODELS: 1
    OLLAMA_NUM_PARALLEL: 4                              # 4 slots per instance, 16 across the stack
    OLLAMA_FLASH_ATTENTION: ${OLLAMA_FLASH_ATTENTION-0} # set to 1 to shrink the graph
    OLLAMA_DEBUG: ${OLLAMA_DEBUG-1}
  deploy:
    resources:
      reservations:
        devices:
          - driver: nvidia
            count: all          # each service narrows this via CUDA_VISIBLE_DEVICES
            capabilities: [gpu]

services:
  ollama-1:
    << : *ollama
    environment:
      << : *env
      CUDA_VISIBLE_DEVICES: 0   # pin this instance to GPU 0

  ollama-2:
    << : *ollama
    environment:
      << : *env
      CUDA_VISIBLE_DEVICES: 1

  ollama-3:
    << : *ollama
    environment:
      << : *env
      CUDA_VISIBLE_DEVICES: 2

  ollama-4:
    << : *ollama
    environment:
      << : *env
      CUDA_VISIBLE_DEVICES: 3

  # nginx front end: least-connections balancing across the four instances.
  ollama:
    image: nginx-lb
    build:
      dockerfile_inline: |
        # syntax=docker/dockerfile:1
        FROM nginx:latest
        RUN cat > /etc/nginx/conf.d/default.conf <<EOF
        upstream ollama_group {
          least_conn;
          server ollama-1:11434 max_conns=4;
          server ollama-2:11434 max_conns=4;
          server ollama-3:11434 max_conns=4;
          server ollama-4:11434 max_conns=4;
        }
        server {
          listen 11434;
          server_name localhost;
          location / {
            proxy_pass http://ollama_group;
          }
        }
        EOF
    ports:
      - 11434:11434
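
Once the stack is up, clients talk to the nginx front end exactly as if it were a single Ollama server, e.g. (assuming gpt-oss:20b has been pulled on each instance):

curl http://localhost:11434/api/generate -d '{"model": "gpt-oss:20b", "prompt": "Hello"}'

The max_conns=4 cap in the upstream block deliberately mirrors OLLAMA_NUM_PARALLEL: 4, so nginx never sends an instance more concurrent requests than it has parallel slots.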

@jessegross commented on GitHub (Aug 27, 2025):

In this case, flash attention will dramatically reduce the size of the graph. You can turn it on now, as Rick mentioned, or wait for the next release, where it will be on by default for gpt-oss.
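
To confirm the setting took effect on a standard systemd-based Linux install, the startup log mentions the flash attention state (a sketch; the exact log wording and unit name vary by install and version):

journalctl -u ollama | grep -i flash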


@BBOBDI commented on GitHub (Aug 27, 2025):

Thank you all for your various feedback!

