[GH-ISSUE #1561] GPU not being used and 'out of memory' - 'no CUDA-capable device is detected' errors while running on Docker Compose #47366

Closed
opened 2026-04-28 03:37:53 -05:00 by GiteaMirror · 8 comments

Originally created by @seth100 on GitHub (Dec 16, 2023).
Original GitHub issue: https://github.com/ollama/ollama/issues/1561

Originally assigned to: @dhiltgen on GitHub.

I'm using the following Docker Compose file:

```yml
ollama:
    image: ollama/ollama:latest
    container_name: ollama
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    volumes:
      - ./ollama:/root/.ollama
    ports:
      - 11434:11434
    tty: true
    restart: unless-stopped
```
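For comparison, this is the equivalent `docker run` invocation from the ollama README, with the volume path and port adjusted to mirror the compose file above; it should expose the GPU the same way:

```sh
# Equivalent of the compose service above; --gpus=all needs the
# nvidia-container-toolkit, just like the deploy.resources block.
docker run -d --gpus=all \
  -v ./ollama:/root/.ollama \
  -p 11434:11434 \
  --name ollama \
  ollama/ollama:latest
```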

I'm on Ubuntu 22.04.
The GPU is a `GeForce GTX 1660 OC edition 6GB GDDR5` and `nvidia-container-toolkit` is installed.

Here is the output of `docker exec -it ollama nvidia-smi`:

```sh
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 545.29.06              Driver Version: 545.29.06    CUDA Version: 12.3     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce GTX 1660        Off | 00000000:26:00.0  On |                  N/A |
| 27%   36C    P5              10W / 120W |    887MiB /  6144MiB |      8%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
+---------------------------------------------------------------------------------------+
```

The issue is that I get the following errors and only the CPU is used while running ollama; the GPU sits idle:

**`llama2`, `mistral`**:

```sh
ollama        | CUDA error 2 at /go/src/github.com/jmorganca/ollama/llm/llama.cpp/gguf/ggml-cuda.cu:9080: out of memory
ollama        | current device: 0
ollama        | GGML_ASSERT: /go/src/github.com/jmorganca/ollama/llm/llama.cpp/gguf/ggml-cuda.cu:9080: !"CUDA error"
ollama        | 2023/12/16 08:19:26 llama.go:451: 2 at /go/src/github.com/jmorganca/ollama/llm/llama.cpp/gguf/ggml-cuda.cu:9080: out of memory
ollama        | current device: 0
ollama        | GGML_ASSERT: /go/src/github.com/jmorganca/ollama/llm/llama.cpp/gguf/ggml-cuda.cu:9080: !"CUDA error"
ollama        | 2023/12/16 08:19:26 llama.go:459: error starting llama runner: llama runner process has terminated
ollama        | 2023/12/16 08:19:26 llama.go:525: llama runner stopped successfully
ollama        | 2023/12/16 08:19:26 llama.go:436: starting llama runner
ollama        | 2023/12/16 08:19:26 llama.go:494: waiting for llama runner to start responding
ollama        | {"timestamp":1702714766,"level":"WARNING","function":"server_params_parse","line":2148,"message":"Not compiled with GPU offload support, --n-gpu-layers option will be ignored. See main README.md for information on enabling GPU BLAS support","n_gpu_layers":-1}
ollama        | {"timestamp":1702714766,"level":"INFO","function":"main","line":2652,"message":"build info","build":441,"commit":"948ff13"}
ollama        | {"timestamp":1702714766,"level":"INFO","function":"main","line":2655,"message":"system info","n_threads":6,"n_threads_batch":-1,"total_threads":12,"system_info":"AVX = 1 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | "}

**`orca-mini`**:

```sh
ollama        | CUDA error 100 at /go/src/github.com/jmorganca/ollama/llm/llama.cpp/gguf/ggml-cuda.cu:493: no CUDA-capable device is detected
ollama        | current device: 624750624
ollama        | GGML_ASSERT: /go/src/github.com/jmorganca/ollama/llm/llama.cpp/gguf/ggml-cuda.cu:493: !"CUDA error"
ollama        | 2023/12/16 08:48:01 llama.go:451: 100 at /go/src/github.com/jmorganca/ollama/llm/llama.cpp/gguf/ggml-cuda.cu:493: no CUDA-capable device is detected
ollama        | current device: 624750624
ollama        | GGML_ASSERT: /go/src/github.com/jmorganca/ollama/llm/llama.cpp/gguf/ggml-cuda.cu:493: !"CUDA error"
ollama        | 2023/12/16 08:48:01 llama.go:459: error starting llama runner: llama runner process has terminated
ollama        | 2023/12/16 08:48:01 llama.go:525: llama runner stopped successfully
ollama        | 2023/12/16 08:48:01 llama.go:436: starting llama runner
ollama        | 2023/12/16 08:48:01 llama.go:494: waiting for llama runner to start responding
ollama        | {"timestamp":1702716481,"level":"WARNING","function":"server_params_parse","line":2148,"message":"Not compiled with GPU offload support, --n-gpu-layers option will be ignored. See main README.md for information on enabling GPU BLAS support","n_gpu_layers":-1}
ollama        | {"timestamp":1702716481,"level":"INFO","function":"main","line":2652,"message":"build info","build":441,"commit":"948ff13"}
ollama        | {"timestamp":1702716481,"level":"INFO","function":"main","line":2655,"message":"system info","n_threads":6,"n_threads_batch":-1,"total_threads":12,"system_info":"AVX = 1 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | "}

I noticed from other issues that some of these errors are common for other people. Is this a bug, or am I doing something wrong?

I also tried adding the following to the Compose YAML file:

```yml
    runtime: nvidia
    cap_add:
      - SYS_ADMIN
    privileged: true
    environment:
      - NVIDIA_DRIVER_CAPABILITIES=compute,utility
      - NVIDIA_VISIBLE_DEVICES=all
```

but I get the same results!
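For completeness, a standard way to sanity-check the NVIDIA runtime wiring outside of ollama (the CUDA image tag below is only an example):

```sh
# Register the NVIDIA runtime with Docker and restart the daemon.
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker

# A bare CUDA container should see the GPU if the toolkit is wired up.
docker run --rm --gpus=all nvidia/cuda:12.3.1-base-ubuntu22.04 nvidia-smi
```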

Thanks


@rgaidot commented on GitHub (Dec 18, 2023):

Can you build your own ollama image from scratch via a Dockerfile (`FROM ...`, `git clone https://github.com/jmorganca/ollama.git`, ...)?
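A minimal sketch of that, assuming the repository's root `Dockerfile` (the one the official image is built from) and an arbitrary local tag name:

```sh
# Build the image locally from the repo's own Dockerfile instead of
# pulling ollama/ollama:latest; "ollama-local" is an arbitrary tag.
git clone https://github.com/jmorganca/ollama.git
cd ollama
docker build -t ollama-local .

# Then point the compose service at the local image:
#   image: ollama-local
```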

<!-- gh-comment-id:1859405183 --> @rgaidot commented on GitHub (Dec 18, 2023): Can you create your own ollama image via a Dockerfile (FROM ... git clone https://github.com/jmorganca/ollama.git ...) from scratch?
Author
Owner

@seth100 commented on GitHub (Dec 18, 2023):

@rgaidot just tried it; the behavior did not change, it is still using only the CPU.

<!-- gh-comment-id:1859877232 --> @seth100 commented on GitHub (Dec 18, 2023): @rgaidot just tried, it did not change the behavior, still using only CPU
Author
Owner

@gagliardetto commented on GitHub (Dec 19, 2023):

I was able to fix the issue by modifying this line of code:

From this:

https://github.com/jmorganca/ollama/blob/1ca484f67e6f607114496211004942013e5595eb/llm/llama.go#L299

to this:

```go
layers := int(freeBytes/bytesPerLayer) * 3 / 5
```
<!-- gh-comment-id:1863422508 --> @gagliardetto commented on GitHub (Dec 19, 2023): I was able to fix the issue modifying this line of code: From this: https://github.com/jmorganca/ollama/blob/1ca484f67e6f607114496211004942013e5595eb/llm/llama.go#L299 to this: ```go layers := int(freeBytes/bytesPerLayer) * 3 / 5 ```

@djmaze commented on GitHub (Dec 20, 2023):

I am quite sure the Docker image is built incorrectly, so CUDA cannot work. I created PR #1644, which makes it work correctly for me.

<!-- gh-comment-id:1865234254 --> @djmaze commented on GitHub (Dec 20, 2023): I am quite sure the docker image is built wrongly so CUDA cannot work. I created a PR at #1644 which makes it work correctly for me.
Author
Owner

@mongolu commented on GitHub (Dec 20, 2023):

Sorry to intervene, but I'm using it with Docker on WSL2 and it is using the GPU.

<!-- gh-comment-id:1865254242 --> @mongolu commented on GitHub (Dec 20, 2023): Sorry to intervene, I'm using it with docker on wsl2 and it's using GPUs
Author
Owner

@seth100 commented on GitHub (Dec 21, 2023):

> I am quite sure the Docker image is built incorrectly, so CUDA cannot work. I created PR #1644, which makes it work correctly for me.

Hope it'll be merged soon, thanks!

<!-- gh-comment-id:1865770582 --> @seth100 commented on GitHub (Dec 21, 2023): > I am quite sure the docker image is built wrongly so CUDA cannot work. I created a PR at #1644 which makes it work correctly for me. hope it'll be merged soon, thanks!
Author
Owner

@dhiltgen commented on GitHub (Jan 27, 2024):

@seth100 please give the latest Docker image we produce (version 0.1.22) a try. It should be able to detect the CUDA GPU and, if supported, use it; otherwise it falls back to CPU mode. If it still doesn't detect the GPU, please run the container with `OLLAMA_DEBUG=1` in the environment and share the logs so we can see why it's failing.
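A minimal sketch of such a run as a one-off `docker run` (flags mirror the compose setup from the issue; adjust paths as needed):

```sh
# Run the 0.1.22 image with debug logging enabled and follow the logs.
docker run -d --gpus=all -e OLLAMA_DEBUG=1 \
  -v ./ollama:/root/.ollama -p 11434:11434 \
  --name ollama ollama/ollama:0.1.22
docker logs -f ollama
```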

<!-- gh-comment-id:1912897143 --> @dhiltgen commented on GitHub (Jan 27, 2024): @seth100 please give the latest docker image we produce a try? (version 0.1.22) It should be able to detect the CUDA GPU, and if supported, use it, otherwise fallback to CPU mode. If it still doesn't detect the GPU, please run the container with OLLAMA_DEBUG=1 in the environment and share the logs so we can see why it's failing.
Author
Owner

@dhiltgen commented on GitHub (Feb 1, 2024):

If you're still having problems with 0.1.22 or newer, please re-open.

<!-- gh-comment-id:1922460806 --> @dhiltgen commented on GitHub (Feb 1, 2024): If you're still having problems with 0.1.22 or newer, please re-open.