[GH-ISSUE #4846] Performance degrades over time when running in Docker with Nvidia GPU #49576

Open
opened 2026-04-28 12:18:20 -05:00 by GiteaMirror · 12 comments

Originally created by @zeyuchenphd on GitHub (Jun 6, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/4846

What is the issue?

I am working in a multi-GPU environment. I set up multiple Docker containers, assigning one GPU to each, so I can process my workload in parallel.

Here is the command I use to set up the container:

```
sudo docker run -d --gpus device=GPU-46b6fece-aec9-853f-0956-2d43359e28e3 -v ollama:/root/.ollama -p 11435:11434 --name ollama0 ollama/ollama
```

I change the port for each container and use a list of clients to split the workload.
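
For context, a minimal sketch of how such a per-GPU fleet might be scripted (the loop, base port, and container names here are illustrative, not the author's exact setup):

```
#!/usr/bin/env bash
# Illustrative sketch: start one Ollama container per GPU, each pinned
# to a single device by UUID and exposed on its own host port.
# All containers share the same named volume, as in the command above.
port=11435
i=0
for gpu in $(nvidia-smi --query-gpu=uuid --format=csv,noheader); do
  sudo docker run -d \
    --gpus "device=${gpu}" \
    -v ollama:/root/.ollama \
    -p "${port}:11434" \
    --name "ollama${i}" \
    ollama/ollama
  port=$((port + 1))
  i=$((i + 1))
done
```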

I noticed the performance of the Ollama Docker container degrades significantly over time. I am processing a workload of over 134,000 queries with llama3:instruct. In the beginning, the processing speed is about 1 to 2 items/s; after processing a few thousand queries, it slows to 10 to 12 items/s, and it gets worse over time.

If I remove and reconfigure the container, the performance returns to normal.

OS

Linux

GPU

Nvidia

CPU

AMD

Ollama version

0.1.38

GiteaMirror added the docker, performance, bug, nvidia labels 2026-04-28 12:18:21 -05:00

@JerryGamble1 commented on GitHub (Jun 24, 2024):

We are seeing the same behavior (the probably irrelevant difference being that we have an Intel processor in the server).


@zeyuchenphd commented on GitHub (Jun 24, 2024):

I suspect there might be an NVIDIA driver compatibility issue. We have two 48 GB A6000s with driver 535.161.07 installed.
I have seen a similar issue reported by other users on Reddit: https://www.reddit.com/r/ollama/comments/1debdr7/dockerized_ollama_doesnt_use_gpu_even_though_its/


@shivakharbanda commented on GitHub (Oct 19, 2024):

Facing the same issue. Help!


@jessegross commented on GitHub (Oct 23, 2024):

@shivakharbanda Can you provide more information about the version of Ollama you are running, and whether the NVIDIA driver version in the linked thread helps?

There are some fixes in current versions that may help, but it's hard to tell from the original context since this issue is relatively old.


@shivakharbanda commented on GitHub (Oct 25, 2024):

I am using an 80 GB A100 GPU (if any more info is needed, let me know), and I am running the latest Ollama via:

```
docker run -d --gpus=all -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama
```

I followed all the instructions in the link below:
https://ollama.com/blog/ollama-is-now-available-as-an-official-docker-image

By the way, we have switched to vLLM now. It is utilizing the GPU well, and there is no performance issue so far.


@neuhaus commented on GitHub (Jan 21, 2025):

I have an issue where Ollama in the Docker container stops recognizing the NVIDIA card **after a while**.
In the Ollama Docker log, I get this message:

```
ggml_cuda_init: failed to initialize CUDA: no CUDA-capable device is detected
```

whereas when it detects the card successfully, it looks like this:

```
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA RTX 4000 SFF Ada Generation, compute capability 8.9, VMM: yes
```

Restarting the Docker container resolves the issue.

Here is the relevant part of my compose.yml file:

```
version: '3.8'
services:
  ollama:
    image: ollama/ollama
    runtime: nvidia
    environment:
      - NVIDIA_VISIBLE_DEVICES=all
    container_name: ollama
    networks:
      - ollama_network
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: ["gpu"]  # enables GPU usage
    restart: unless-stopped
```

I'm using nvidia-container-toolkit 1.17.3-1 on Ubuntu 24.04 with nvidia-driver-560 version 560.35.05-0ubuntu1.

The problem occurs so frequently that I have added a cronjob that checks for the log line indicating the hardware was not found and then automatically restarts the Ollama Docker container. Not a good solution.
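
For illustration, a watchdog along those lines might look like the following (a sketch only, not the author's actual script; the container name and log pattern are taken from this comment):

```
#!/usr/bin/env bash
# Sketch: restart the ollama container if recent logs show the CUDA
# init failure described above. Intended to run periodically from cron, e.g.:
#   */5 * * * * /usr/local/bin/ollama-gpu-watchdog.sh
if docker logs --since 10m ollama 2>&1 | \
    grep -q "failed to initialize CUDA: no CUDA-capable device is detected"; then
  docker restart ollama
fi
```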


@neuhaus commented on GitHub (Jan 21, 2025):

> In the beginning, the processing speed is about 1 to 2 items/s; after processing a few thousand queries, it slows to 10 to 12 items/s, and it gets worse over time.

@nycameraguy Can you clarify? Is it seconds per item?


@FrobtheBuilder commented on GitHub (May 1, 2025):

@neuhaus can I get a copy of that script? Because this is STILL a problem; it's exactly what's happening on my instance.


@neuhaus commented on GitHub (May 2, 2025):

> @neuhaus can I get a copy of that script? Because this is STILL a problem; it's exactly what's happening on my instance.

Sure, check it out at https://gist.github.com/neuhaus/2fc66f9b1ebaa08ff95a527b91556113
The problem persists on Ubuntu 24.04.2 with nvidia-container-toolkit 1.17.6-1 and nvidia-driver 560.35.05-0ubuntu1.


@lightrush commented on GitHub (May 3, 2025):

I'm observing this on Ubuntu 22.04 with NVIDIA driver 550.144.03 on an RTX 3090.


@neuhaus commented on GitHub (May 5, 2025):

Sorry, I kind of hijacked this issue, which is really about a different thing, so @lightrush I'm not sure what you are referring to: is the GPU no longer recognized by Ollama running in the container every now and then?


@lightrush commented on GitHub (Jun 20, 2025):

It looks like this is a known problem (see https://github.com/ollama/ollama/blob/main/docs/troubleshooting.md#linux-docker). I'm not sure what the implications of the workaround are.

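For reference, the workaround in that troubleshooting section (at the time of these comments) was to disable systemd cgroup management in Docker on the host. A minimal sketch of the change, assuming that recommendation still stands:

```
# Add to /etc/docker/daemon.json on the host (merge with any existing settings):
#   { "exec-opts": ["native.cgroupdriver=cgroupfs"] }
# then restart the Docker daemon:
sudo systemctl restart docker
```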
Reference: github-starred/ollama#49576