[GH-ISSUE #7761] High Inference Time and Limited GPU Utilization with Ollama Docker #4957

Open
opened 2026-04-12 16:00:58 -05:00 by GiteaMirror · 2 comments

Originally created by @nicho2 on GitHub (Nov 20, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/7761

What is the issue?

Description:

I am using Ollama in a Docker setup with GPU support, configured to use all available GPUs on my system. However, when using the NemoTron model with a simple prompt and the function calling feature, it takes around 50 seconds to get a response, which is too slow for my use case.
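
For reference, the scenario can be reproduced and timed with a request like the one below (a sketch only: the model tag "nemotron" and the get_weather tool are placeholders; substitute the actual model and tool schema in use):

time curl -s http://localhost:11434/api/chat -d '{
  "model": "nemotron",
  "stream": false,
  "messages": [{"role": "user", "content": "What is the weather in Paris?"}],
  "tools": [{
    "type": "function",
    "function": {
      "name": "get_weather",
      "description": "Get the current weather for a city",
      "parameters": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"]
      }
    }
  }]
}'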

Docker Configuration:

Here is my docker-compose.yml file for Ollama:

services:
  ollama:
    image: ollama/ollama
    container_name: ollama
    hostname: ollama    
    ports:
      - "11434:11434"
    volumes:
      - /home/system/dockers/volumes/ollama:/root/.ollama
    runtime: nvidia
    environment:
      - NVIDIA_VISIBLE_DEVICES=all
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    restart: unless-stopped
    networks:
      - genai_network

networks:
  genai_network:
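
A quick sanity check that all three GPUs are visible inside the container (using the container name from the compose file above):

docker exec -it ollama nvidia-smi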

Log Details:

Server Initialization:

Listening on [::]:11434 (version 0.4.2)
Dynamic LLM libraries runners="[cpu_avx2 cuda_v11 cuda_v12 cpu cpu_avx]"
Looking for compatible GPUs
Inference compute id=GPU-660ca8b7-181a-ede9-f6fe-8ccd5f9dbb89 library=cuda variant=v12 compute=8.9 driver=12.4 name="NVIDIA RTX 6000 Ada Generation" total="47.5 GiB" available="47.1 GiB"
Inference compute id=GPU-d62f5e11-4192-0e70-0732-55b558edcb7a library=cuda variant=v12 compute=8.9 driver=12.4 name="NVIDIA RTX 6000 Ada Generation" total="47.5 GiB" available="46.4 GiB"
Inference compute id=GPU-45972459-815c-f304-9fdf-b952276c9b13 library=cuda variant=v12 compute=8.9 driver=12.4 name="NVIDIA RTX 6000 Ada Generation" total="47.5 GiB" available="47.0 GiB"

Model Loading:

ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA RTX 6000 Ada Generation, compute capability 8.9, VMM: yes
llm_load_tensors: ggml ctx size =    0.68 MiB

Observed Issue:

During inference with the NemoTron model, the response time is around 50 seconds.
The logs show that only one GPU seems to be utilized (found 1 CUDA devices), despite multiple GPUs being detected during service initialization.
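
One way to confirm which GPUs are actually busy is to watch per-GPU utilization live while the prompt runs, e.g.:

nvidia-smi dmon -s u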

Questions:

GPU Load Balancing:

Does Ollama support load balancing across multiple GPUs? If yes, why do the logs indicate that only one GPU is used (found 1 CUDA devices) when the model is loaded?
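
(For context: Ollama exposes an OLLAMA_SCHED_SPREAD environment variable that asks the scheduler to spread a model across all GPUs instead of packing it onto one GPU when it fits; whether it helps latency here is untested. A sketch, equivalent to adding OLLAMA_SCHED_SPREAD=1 to the "environment:" list in the compose file above:)

docker run -d --name ollama --gpus=all \
  -e OLLAMA_SCHED_SPREAD=1 \
  -p 11434:11434 \
  -v /home/system/dockers/volumes/ollama:/root/.ollama \
  ollama/ollama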

Performance Optimization:

What steps are recommended to reduce inference time?
Should I adjust configuration settings in Docker or Ollama?
Could variables like GGML_CUDA_FORCE_CUBLAS or GGML_CUDA_FORCE_MMQ improve performance?

Technical Context:

Ollama Version: 0.4.2
Hardware Configuration:
GPUs: 3 x NVIDIA RTX 6000 Ada Generation (47.5 GiB VRAM each)
NVIDIA Driver: CUDA 12.4 (as reported in the server logs)
Model Used: NemoTron
Usage Scenario: Simple prompt with function calling.

Expectation:

  • Confirmation on multi-GPU support in Ollama.
  • Suggestions to reduce inference time.
  • Documentation or examples of optimized configuration for heavy workloads with multiple GPUs.

Thank You for Your Help!

(nvidia-smi screenshot: https://github.com/user-attachments/assets/b3be6b97-0b16-47ed-8942-2ff7f4dd6aff)

OS

Linux

GPU

Nvidia

CPU

Intel

Ollama version

0.4.2

GiteaMirror added the bug label 2026-04-12 16:00:58 -05:00

@rick-github commented on GitHub (Nov 20, 2024):

Load balancing a single inference is not a thing for LLMs: https://github.com/ollama/ollama/issues/7648#issuecomment-2473561990

What prompt are you sending? Full server logs (https://github.com/ollama/ollama/blob/main/docs/troubleshooting.md#how-to-troubleshoot-issues) captured with OLLAMA_DEBUG=1 (https://github.com/ollama/ollama/blob/main/docs/faq.md#how-do-i-configure-ollama-server) may help in debugging.
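
A minimal way to turn that on for the Docker setup above (container name as in the compose file):

# Add OLLAMA_DEBUG=1 to the "environment:" list in docker-compose.yml, then:
docker compose up -d --force-recreate ollama
docker logs -f ollama 2>&1 | tee ollama_debug.log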


@nicho2 commented on GitHub (Nov 21, 2024):

Hello,

I have captured the full log in the file below:
log_ollama_inference.txt (https://github.com/user-attachments/files/17841101/log_ollama_inference.txt)

Look at line 633.

I installed Ollama as a service on Linux (to verify there is no difference with Docker).
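
For reference, the documented way to enable the same debug logging for a systemd install (per the FAQ linked above) is roughly:

sudo systemctl edit ollama
# in the override, add:
#   [Service]
#   Environment="OLLAMA_DEBUG=1"
sudo systemctl restart ollama
journalctl -u ollama -f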


Reference: github-starred/ollama#4957