[GH-ISSUE #8472] Hangs with 2x P40 GPUs when switch model #83138

Closed
opened 2026-05-09 17:16:57 -05:00 by GiteaMirror · 6 comments

Originally created by @fred-vaneijk on GitHub (Jan 17, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/8472

What is the issue?

When ollama runs in a Docker container with two Nvidia P40 GPUs and Open WebUI is used to switch the model (from Phi to llama3.2 or vice versa), the web UI hangs when a new query is issued. In nvtop I see a Compute process that is using 100% CPU (not GPU). The only way to recover is to restart the Docker container. I suspect an infinite loop during or just before the model switch. I also noticed that the 100% CPU issue appears when OLLAMA_KEEP_ALIVE is set to a very short time such as 10s.

Model switching works fine when configured with a single GPU, i.e. NVIDIA_VISIBLE_DEVICES=0 and count: 1.
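A minimal sketch for reproducing the switch without Open WebUI in the loop, by alternating prompts between two models through Ollama's `/api/generate` endpoint. The base URL assumes the `11434:11434` port mapping from the compose file, and the model tags `phi` and `llama3.2` are assumptions; substitute whatever tags are actually pulled. The helper names are hypothetical.

```python
import json
import time
import urllib.request

# Assumption: the ollama port is published on the host as in the compose file.
OLLAMA_URL = "http://localhost:11434"


def build_request(model: str, prompt: str) -> urllib.request.Request:
    """Build a non-streaming POST to Ollama's /api/generate endpoint."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    return urllib.request.Request(
        f"{OLLAMA_URL}/api/generate",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def timed_generate(model: str, prompt: str, timeout_s: float = 120.0) -> float:
    """Send one prompt and return the elapsed seconds; raises on timeout."""
    start = time.monotonic()
    with urllib.request.urlopen(build_request(model, prompt), timeout=timeout_s) as resp:
        json.load(resp)  # read and parse the full reply
    return time.monotonic() - start


def alternate(models: list[str], rounds: int) -> list[str]:
    """Return the model order for the test: a, b, a, b, ..."""
    return [models[i % len(models)] for i in range(rounds)]


if __name__ == "__main__":
    # Assumption: these tags match the models that are already pulled.
    for name in alternate(["phi", "llama3.2"], rounds=6):
        try:
            print(name, f"{timed_generate(name, 'Say hi.'):.1f}s")
        except OSError as exc:  # URLError and socket timeouts are OSError subclasses
            print(name, "failed or timed out:", exc)
            break
```

If the hang is in ollama itself, a request after a model switch should hit the timeout here as well, which would rule out the web UI as the cause.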

Here is the Docker Compose YAML config file:

version: '3.8'
services:
  open-webui:
    image: ghcr.io/open-webui/open-webui:main
    container_name: open-webui
    restart: unless-stopped
    ports:
      - "3000:8080"
    environment:
      - OLLAMA_API_BASE_URL=http://host.docker.internal:11434/api
      - WEBUI_AUTH=false
    depends_on:
      - ollama
    volumes:
      - open-webui:/app/backend/data
    networks:
      - ollama-network
    extra_hosts:
      - "host.docker.internal:host-gateway"

  ollama:
    image: ollama/ollama:latest
    container_name: ollama
    restart: unless-stopped
    ports:
      - "11434:11434"
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 2
              capabilities: [gpu]
    environment:
      - NVIDIA_VISIBLE_DEVICES=0,1
      - OLLAMA_CUDA_VERSION=12.6
      - OLLAMA_MAX_LOADED_MODELS=1
      - OLLAMA_KEEP_ALIVE=-1
      - OLLAMA_MAX_QUEUE=1
      - OLLAMA_HOST=0.0.0.0
      - OLLAMA_DEBUG=1 # Add this for verbose logging
      - OLLAMA_TIMEOUT=10s # Add timeout for operations
      - OLLAMA_NOPRUNE=true
    runtime: nvidia
    volumes:
      - ollama:/root/.ollama
    networks:
      - ollama-network

networks:
  ollama-network:
    driver: bridge

volumes:
  open-webui:
  ollama:

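![Image](https://github.com/user-attachments/assets/5831ca57-b64f-446b-a6a0-dc5b7f266cce)
[ollama-p40-log.txt](https://github.com/user-attachments/files/18458285/ollama-p40-log.txt)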

OS

Linux

GPU

Nvidia

CPU

Intel

Ollama version

0.5.7-0-ga420a45-dirty

GiteaMirror added the bug label 2026-05-09 17:16:57 -05:00

@fred-vaneijk commented on GitHub (Jan 17, 2025):

Setting the following in the Docker YAML seems to make it work. Can someone confirm this is the right thing to do?

  • OLLAMA_NUM_PARALLEL=2
  • NVIDIA_VISIBLE_DEVICES=all

Here is the new YAML:

version: '3.8'
services:
  open-webui:
    image: ghcr.io/open-webui/open-webui:main
    container_name: open-webui
    restart: unless-stopped
    ports:
      - "3000:8080"
    environment:
      - OLLAMA_API_BASE_URL=http://host.docker.internal:11434/api
      - WEBUI_AUTH=false
    depends_on:
      - ollama
    volumes:
      - open-webui:/app/backend/data
    networks:
      - ollama-network
    extra_hosts:
      - "host.docker.internal:host-gateway"

  ollama:
    image: ollama/ollama:latest
    container_name: ollama
    restart: unless-stopped
    ports:
      - "11434:11434"
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 2
              capabilities: [gpu]
    environment:
      - NVIDIA_VISIBLE_DEVICES=all
      - OLLAMA_CUDA_VERSION=12.6
      - OLLAMA_MAX_LOADED_MODELS=1
      - OLLAMA_KEEP_ALIVE=-1
      - OLLAMA_MAX_QUEUE=1
      - OLLAMA_HOST=0.0.0.0
      - OLLAMA_DEBUG=1 # Add this for verbose logging
      - OLLAMA_TIMEOUT=10s # Add timeout for operations
      - OLLAMA_NOPRUNE=true
      - OLLAMA_NUM_PARALLEL=2
    runtime: nvidia
    volumes:
      - ollama:/root/.ollama
    networks:
      - ollama-network

networks:
  ollama-network:
    driver: bridge

volumes:
  open-webui:
  ollama:


@rick-github commented on GitHub (Jan 17, 2025):

The log shows phi:2.7b-chat-v2-q4_0 being loaded, answering a few questions, then being unloaded because a new model has been requested. It's being unloaded because OLLAMA_MAX_LOADED_MODELS=1. I assume there's a reason for this, since phi and llama3.2 should happily co-reside on your GPUs. There's nothing else in the log that shows a problem.

NVIDIA_VISIBLE_DEVICES=all and NVIDIA_VISIBLE_DEVICES=0,1 should be functionally identical and shouldn't make a difference to model loading. Having said that, it's an Nvidia library variable, so there may be something inside the driver that is affected; we have no visibility into that.
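One way to compare the two settings is to list the GPUs the container actually sees under each. The sketch below runs `nvidia-smi` inside the container from the host; it assumes the container is named `ollama` as in the compose file and that `nvidia-smi` is available inside it, which depends on the Nvidia runtime configuration. The function names are hypothetical.

```python
import subprocess


def parse_gpu_list(csv_text: str) -> list[tuple[int, str]]:
    """Parse `index, name` rows as printed by nvidia-smi in CSV mode."""
    gpus = []
    for line in csv_text.strip().splitlines():
        index, name = (field.strip() for field in line.split(",", 1))
        gpus.append((int(index), name))
    return gpus


def visible_gpus(container: str = "ollama") -> list[tuple[int, str]]:
    """Ask nvidia-smi inside the container which GPUs it can see."""
    out = subprocess.run(
        ["docker", "exec", container,
         "nvidia-smi", "--query-gpu=index,name", "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_gpu_list(out)


if __name__ == "__main__":
    try:
        for index, name in visible_gpus():
            print(index, name)
    except (OSError, subprocess.CalledProcessError) as exc:
        print("could not query GPUs:", exc)
```

If both settings print the same two devices, the difference in behaviour is unlikely to come from device visibility alone.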

OLLAMA_NUM_PARALLEL=2 only affects the number of simultaneous completions that a model will do. The only difference from the point of view of model loading is that it allocates a larger context buffer. There's plenty of free VRAM, so this shouldn't make a difference.
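A rough sketch of how much larger that buffer gets, assuming the KV cache grows linearly with the number of parallel slots. The model dimensions and context length below are illustrative assumptions for a small model, not values read from the logs in this thread.

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   n_ctx: int, num_parallel: int, bytes_per_elem: int = 2) -> int:
    """Estimate KV cache size: keys and values for every layer and token."""
    # Factor 2 covers the separate K and V tensors; bytes_per_elem=2 assumes f16.
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
    return per_token * n_ctx * num_parallel


if __name__ == "__main__":
    # Assumption: illustrative dimensions for a roughly 3B-parameter model.
    dims = dict(n_layers=28, n_kv_heads=8, head_dim=128, n_ctx=2048)
    for slots in (1, 2):
        mib = kv_cache_bytes(num_parallel=slots, **dims) / 2**20
        print(f"num_parallel={slots}: {mib:.0f} MiB")
```

Under these assumptions the buffer goes from 224 MiB to 448 MiB, which is small next to the memory available on two P40s.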

Note that OLLAMA_CUDA_VERSION and OLLAMA_TIMEOUT are not ollama variables; they have no effect.
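A small sketch for catching this kind of mistake: it flags `OLLAMA_`-prefixed entries from a compose `environment:` list that are not in a set of known names. The `KNOWN` set is an assumption and deliberately partial, so a valid variable missing from it would also be flagged; `ollama serve --help` should print the list for the installed version.

```python
# Assumption: a partial list of ollama server variables, not exhaustive.
KNOWN = {
    "OLLAMA_DEBUG",
    "OLLAMA_FLASH_ATTENTION",
    "OLLAMA_HOST",
    "OLLAMA_KEEP_ALIVE",
    "OLLAMA_MAX_LOADED_MODELS",
    "OLLAMA_MAX_QUEUE",
    "OLLAMA_MODELS",
    "OLLAMA_NOPRUNE",
    "OLLAMA_NUM_PARALLEL",
    "OLLAMA_ORIGINS",
    "OLLAMA_SCHED_SPREAD",
}


def unknown_ollama_vars(env_entries: list[str]) -> list[str]:
    """Return OLLAMA_* names from NAME=value entries that are not in KNOWN."""
    names = [entry.split("=", 1)[0].strip() for entry in env_entries]
    return [n for n in names if n.startswith("OLLAMA_") and n not in KNOWN]


if __name__ == "__main__":
    # The environment list from the original compose file in this issue.
    env = [
        "NVIDIA_VISIBLE_DEVICES=0,1",
        "OLLAMA_CUDA_VERSION=12.6",
        "OLLAMA_MAX_LOADED_MODELS=1",
        "OLLAMA_KEEP_ALIVE=-1",
        "OLLAMA_MAX_QUEUE=1",
        "OLLAMA_HOST=0.0.0.0",
        "OLLAMA_DEBUG=1 # Add this for verbose logging",
        "OLLAMA_TIMEOUT=10s # Add timeout for operations",
        "OLLAMA_NOPRUNE=true",
    ]
    print(unknown_ollama_vars(env))
```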

There's not a lot of info to go on. I speculate that ollama unloaded phi, started to load llama3.2 and somehow got wedged; the 100% CPU may be the model loader thread trying to get the model loaded. But I would expect to see some log lines about memory allocations, etc., so it seems to have got wedged quite early in the process, which is unusual. The fact that it seems to work fine with one GPU is also a puzzle, because these models are small and ollama would only be using one GPU in either case.

So I'm afraid all I can ask is that you continue experimenting and provide logs. There's nothing here so far that is actionable.


@fred-vaneijk commented on GitHub (Jan 18, 2025):

Thanks for the response...

I ran it this way (see below) and it will not switch models. Also, it looks like there is a crash around GPU initialization and memory management. Can you take another look? With OLLAMA_MAX_LOADED_MODELS=1 un-commented, it was switching models.

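![Image](https://github.com/user-attachments/assets/1be13c35-5e9a-4f97-82f4-e69a6f296821)

[log-18jan.txt](https://github.com/user-attachments/files/18464752/log-18jan.txt)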

version: '3.8'
services:
  open-webui:
    image: ghcr.io/open-webui/open-webui:main
    container_name: open-webui
    restart: unless-stopped
    ports:
      - "3000:8080"
    environment:
      - OLLAMA_API_BASE_URL=http://host.docker.internal:11434/api
      - WEBUI_AUTH=false
    depends_on:
      - ollama
    volumes:
      - open-webui:/app/backend/data
    networks:
      - ollama-network
    extra_hosts:
      - "host.docker.internal:host-gateway"

  ollama:
    image: ollama/ollama:latest
    container_name: ollama
    restart: unless-stopped
    ports:
      - "11434:11434"
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 2
              capabilities: [gpu]
    environment:
      - NVIDIA_VISIBLE_DEVICES=all  # Changed to use all GPUs

      - OLLAMA_NUM_PARALLEL=2
#      - OLLAMA_MAX_LOADED_MODELS=1
      - OLLAMA_KEEP_ALIVE=-1
      - OLLAMA_MAX_QUEUE=1
      - OLLAMA_HOST=0.0.0.0
      - OLLAMA_DEBUG=true
      - OLLAMA_NOPRUNE=true

    runtime: nvidia
    volumes:
      - ollama:/root/.ollama
    networks:
      - ollama-network

networks:
  ollama-network:
    driver: bridge

volumes:
  open-webui:
  ollama:



@fred-vaneijk commented on GitHub (Jan 18, 2025):

For reference, here are the nvtop screenshot and the log of it switching models with OLLAMA_MAX_LOADED_MODELS un-commented.

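![Image](https://github.com/user-attachments/assets/e73669fb-ce9d-474e-90e5-506ca3b0a71a)

[log-jan18-works.txt](https://github.com/user-attachments/files/18464775/log-jan18-works.txt)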
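Alongside nvtop, the set of loaded models can be checked directly through Ollama's `/api/ps` endpoint before and after a switch. This is a minimal sketch; the base URL assumes the `11434:11434` port mapping from the compose file, and the function names are hypothetical.

```python
import json
import urllib.request


def loaded_model_names(ps_response: dict) -> list[str]:
    """Extract model names from a parsed /api/ps response."""
    return [m["name"] for m in ps_response.get("models", [])]


def fetch_ps(base_url: str = "http://localhost:11434", timeout_s: float = 5.0) -> dict:
    """GET /api/ps, which reports the models currently held in memory."""
    with urllib.request.urlopen(f"{base_url}/api/ps", timeout=timeout_s) as resp:
        return json.load(resp)


if __name__ == "__main__":
    try:
        print(loaded_model_names(fetch_ps()))
    except OSError as exc:  # connection errors and timeouts
        print("ollama not reachable:", exc)
```

With OLLAMA_MAX_LOADED_MODELS=1 the list should hold at most one name; without the limit, both models would be expected to appear once each has been used.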


@fred-vaneijk commented on GitHub (Jan 18, 2025):

As an experiment, I ran some of the issues through 3.5 Sonnet. It came up with this modified sched.go file. Does what it suggested make any sense? (I am new to golang and ollama; I work professionally as a C/C++ programmer.) Can you do a diff with the latest sched.go? Also, I would love to be able to build and run ollama under a debugger; any help there would be appreciated.

[sched.go.txt](https://github.com/user-attachments/files/18464782/sched.go.txt)
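For the diff itself, a minimal sketch using Python's standard `difflib`. It assumes a local copy of the upstream `sched.go` and the downloaded attachment; both paths are passed on the command line, and the `unified` helper is hypothetical.

```python
import difflib
import sys
from pathlib import Path


def unified(old_text: str, new_text: str,
            old_name: str = "sched.go", new_name: str = "sched.go.txt") -> str:
    """Return a unified diff of two source texts as one string."""
    return "".join(difflib.unified_diff(
        old_text.splitlines(keepends=True),
        new_text.splitlines(keepends=True),
        fromfile=old_name,
        tofile=new_name,
    ))


if __name__ == "__main__":
    # Usage: python diff_sched.py path/to/sched.go path/to/sched.go.txt
    if len(sys.argv) == 3:
        old_path, new_path = Path(sys.argv[1]), Path(sys.argv[2])
        print(unified(old_path.read_text(), new_path.read_text()), end="")
```

The output is only meaningful if the upstream file is taken from the same release the modified file was based on; otherwise unrelated upstream changes show up in the diff.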


@fred-vaneijk commented on GitHub (Jan 20, 2025):

Submitted PR https://github.com/ollama/ollama/pull/8504

The PR describes how to deal with this issue; basically, the VM/Proxmox was not set up correctly. Closing this issue.


Reference: github-starred/ollama#83138