[GH-ISSUE #3647] Ollama reverts to CPU on A100 Docker: "error looking up CUDA GPU memory: device memory info lookup failure 0: 4" #48761

Closed
opened 2026-04-28 09:13:16 -05:00 by GiteaMirror · 4 comments

Originally created by @Yaffa16 on GitHub (Apr 15, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/3647

Originally assigned to: @dhiltgen on GitHub.

What is the issue?

time=2024-04-15T09:17:48.609Z level=INFO source=gpu.go:82 msg="Nvidia GPU detected"
time=2024-04-15T09:17:48.609Z level=INFO source=cpu_common.go:11 msg="CPU has AVX2"
time=2024-04-15T09:17:48.617Z level=INFO source=gpu.go:109 msg="error looking up CUDA GPU memory: device memory info lookup failure 0: 4"
time=2024-04-15T09:17:48.617Z level=INFO source=routes.go:1133 msg="no GPU detected"
time=2024-04-15T09:17:49.031Z level=INFO source=cpu_common.go:11 msg="CPU has AVX2"
time=2024-04-15T09:17:49.031Z level=INFO source=gpu.go:109 msg="error looking up CUDA GPU memory: device memory info lookup failure 0: 4"
time=2024-04-15T09:17:49.031Z level=INFO source=cpu_common.go:11 msg="CPU has AVX2"
time=2024-04-15T09:17:49.031Z level=INFO source=gpu.go:109 msg="error looking up CUDA GPU memory: device memory info lookup failure 0: 4"
time=2024-04-15T09:17:49.031Z level=INFO source=llm.go:85 msg="GPU not available, falling back to CPU"
time=2024-04-15T09:17:49.034Z level=INFO source=dyn_ext_server.go:90 msg="Loading Dynamic llm server: /tmp/ollama3883625654/runners/cpu_avx2/libext_server.so"
time=2024-04-15T09:17:49.034Z level=INFO source=dyn_ext_server.go:150 msg="Initializing llama server"

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15 Driver Version: 550.54.15 CUDA Version: 12.4 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA A100-PCIE-40GB Off | 00000000:01:00.0 Off | On |
| N/A 38C P0 32W / 250W | 90MiB / 40960MiB | N/A Default |
| | | Enabled |
+-----------------------------------------+------------------------+----------------------+
| 1 NVIDIA A100-PCIE-40GB Off | 00000000:43:00.0 Off | On |
| N/A 32C P0 35W / 250W | 17843MiB / 40960MiB | N/A Default |
| | | Enabled |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| MIG devices: |
+------------------+----------------------------------+-----------+-----------------------+
| GPU GI CI MIG | Memory-Usage | Vol| Shared |
| ID ID Dev | BAR1-Usage | SM Unc| CE ENC DEC OFA JPG |
| | | ECC| |
|==================+==================================+===========+=======================|
| 0 1 0 0 | 53MiB / 19968MiB | 56 0 | 4 0 2 0 0 |
| | 1MiB / 32767MiB | | |
+------------------+----------------------------------+-----------+-----------------------+
| 0 5 0 1 | 25MiB / 9856MiB | 28 0 | 2 0 1 0 0 |
| | 0MiB / 16383MiB | | |
+------------------+----------------------------------+-----------+-----------------------+
| 0 13 0 2 | 12MiB / 4864MiB | 14 0 | 1 0 0 0 0 |
| | 0MiB / 8191MiB | | |
+------------------+----------------------------------+-----------+-----------------------+
| 1 0 0 0 | 17843MiB / 40326MiB | 98 0 | 7 0 5 1 1 |
| | 7MiB / 65536MiB | | |
+------------------+----------------------------------+-----------+-----------------------+

+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| 1 0 0 4197 C /opt/conda/bin/python 5180MiB |
| 1 0 0 4201 C /opt/conda/bin/python 12644MiB |
+-----------------------------------------------------------------------------------------+

What did you expect to see?

No response

Steps to reproduce

No response

Are there any recent changes that introduced the issue?

No response

OS

Linux

Architecture

No response

Platform

No response

Ollama version

latest

GPU

Nvidia

GPU info

a100

CPU

No response

Other software

No response

GiteaMirror added the gpu, bug, nvidia labels 2026-04-28 09:13:17 -05:00

@dhiltgen commented on GitHub (Apr 15, 2024):

You didn't mention, but I believe you're running 0.1.29 or perhaps older.

The error message error looking up CUDA GPU memory: device memory info lookup failure 0: 4 maps to an error code from nvmlDeviceGetMemoryInfo failing on device 0, and status code 4 maps to NVML_ERROR_NO_PERMISSION.
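
To double-check that mapping, here is a minimal standalone NVML probe (a sketch, not Ollama's own lookup code; the file name and build line are only examples) that exercises the same call:

/* Minimal sketch, assuming the NVML C API from nvml.h.
 * Build with something like: gcc nvml_probe.c -lnvidia-ml
 * It performs the same nvmlDeviceGetMemoryInfo call on device 0;
 * a non-zero status of 4 is NVML_ERROR_NO_PERMISSION. */
#include <stdio.h>
#include <nvml.h>

int main(void) {
    nvmlReturn_t st = nvmlInit();
    if (st != NVML_SUCCESS) {
        fprintf(stderr, "nvmlInit failed: %d (%s)\n", st, nvmlErrorString(st));
        return 1;
    }
    nvmlDevice_t dev;
    nvmlMemory_t mem;
    st = nvmlDeviceGetHandleByIndex(0, &dev);
    if (st == NVML_SUCCESS)
        st = nvmlDeviceGetMemoryInfo(dev, &mem);
    if (st != NVML_SUCCESS)
        /* st == 4 here corresponds to NVML_ERROR_NO_PERMISSION */
        fprintf(stderr, "device memory info lookup failure 0: %d (%s)\n",
                st, nvmlErrorString(st));
    else
        printf("device 0: %llu MiB free / %llu MiB total\n",
               (unsigned long long)(mem.free >> 20),
               (unsigned long long)(mem.total >> 20));
    nvmlShutdown();
    return 0;
}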

A quick workaround would be to run Ollama as root, but a proper solution would be to adjust the system permissions so the ollama user can access the GPU. I don't know what Distro you're running, or if this is a container, so I'm not sure what the exact solution is.
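
If this is a container, one hedged check is whether the user Ollama runs as can open the NVIDIA device nodes that NVML relies on; the node list below is what a typical driver install exposes and may differ on your system:

/* Hypothetical permission probe: tries to open the usual NVIDIA device nodes
 * read/write, the access NVML generally needs. Failing with EACCES here would
 * be consistent with NVML_ERROR_NO_PERMISSION; the node list is an assumption
 * about a typical driver install, not a confirmed diagnostic. */
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void) {
    const char *nodes[] = { "/dev/nvidiactl", "/dev/nvidia0", "/dev/nvidia-uvm" };
    for (size_t i = 0; i < sizeof nodes / sizeof nodes[0]; i++) {
        int fd = open(nodes[i], O_RDWR);
        if (fd < 0)
            printf("%-18s not accessible: %s\n", nodes[i], strerror(errno));
        else {
            printf("%-18s OK\n", nodes[i]);
            close(fd);
        }
    }
    return 0;
}

If these opens fail for the ollama user but succeed as root, fixing the device node ownership/group (or the container's device mounts) would line up with the permission error above.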


@dhiltgen commented on GitHub (Apr 24, 2024):

If adjusting permissions doesn't resolve the problem, please let us know and I'll reopen the issue.


@UmutAlihan commented on GitHub (Sep 16, 2024):

Hi, I am currently experiencing this issue in a production OpenShift (k8s) environment, deployed via the official Docker Hub image.

We know the permissions to access the GPUs are OK, because nvidia-smi works when run as the same user.

Our Ollama version is 0.3.1.

If there is any output you want me to share, please let me know. Any help is appreciated. Cheers

Ollama debug output:

ggml_cuda_init: failed to initialize CUDA: no CUDA-capable device is detected
llm_load_tensors: ggml ctx size = 0.14 MiB
ggml_cuda_host_malloc: failed to allocate 15317.02 MiB of pinned memory: no CUDA-capable device is detected
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
llm_load_tensors: CPU buffer size = 15317.02 MiB
time=2024-09-16T09:15:43.930Z level=INFO source=server.go:618 msg="waiting for server to become available" status="llm server not responding"
time=2024-09-16T09:15:44.182Z level=ERROR source=sched.go:446 msg="error loading llama server" error="llama runner process has terminated: signal: killed"
time=2024-09-16T09:15:44.182Z level=WARN source=server.go:503 msg="llama runner process no longer running" sys=9 string="signal: killed"

@dhiltgen commented on GitHub (Sep 25, 2024):

@UmutAlihan your scenario looks different from this issue. Please open a new issue and include a more complete server log. The partial logs you included don't really make sense on their own: if we didn't discover the GPU, we shouldn't have reported offloading 33/33 layers to the GPU, so there is more going on here, and the other log lines are important for understanding it.

Reference: github-starred/ollama#48761