[GH-ISSUE #6149] Why does the NVIDIA GPU keep crashing when using ./ollama-linux-amd64? #3840

Closed
opened 2026-04-12 14:40:43 -05:00 by GiteaMirror · 4 comments

Originally created by @tifDev on GitHub (Aug 3, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/6149

Originally assigned to: @dhiltgen on GitHub.

What is the issue?

Hello,

I've tried the portable edition that doesn't need a root installation (./ollama-linux-amd64).
Everything works fine, but after a couple of minutes the GPU stops working and ollama starts using the CPU only.
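A quick sketch to confirm whether the model is still on the GPU or has fallen back to CPU (assuming `ollama ps` and `nvidia-smi` are available on the machine):

```sh
# Check where the loaded model is running; the PROCESSOR column shows GPU vs CPU
ollama ps

# While the model is on the GPU, the ollama runner should also appear in nvidia-smi with VRAM allocated
nvidia-smi
```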

This is the error I'm facing:


ggml_cuda_init: failed to initialize CUDA: unknown error
llm_load_tensors: ggml ctx size =    0.14 MiB
llm_load_tensors: offloading 21 repeating layers to GPU
llm_load_tensors: offloaded 21/33 layers to GPU
llm_load_tensors:        CPU buffer size =  4437.80 MiB
llama_new_context_with_model: n_ctx      = 2048
llama_new_context_with_model: n_batch    = 512
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 500000.0
llama_new_context_with_model: freq_scale = 1
ggml_cuda_host_malloc: failed to allocate 256.00 MiB of pinned memory: unknown error
llama_kv_cache_init:        CPU KV buffer size =   256.00 MiB
llama_new_context_with_model: KV self size  =  256.00 MiB, K (f16):  128.00 MiB, V (f16):  128.00 MiB
ggml_cuda_host_malloc: failed to allocate 0.50 MiB of pinned memory: unknown error
llama_new_context_with_model:        CPU  output buffer size =     0.50 MiB
ggml_cuda_host_malloc: failed to allocate 258.50 MiB of pinned memory: unknown error
llama_new_context_with_model:  CUDA_Host compute buffer size =   258.50 MiB
llama_new_context_with_model: graph nodes  = 1030
llama_new_context_with_model: graph splits = 1
INFO [main] model loaded | tid="132032175734784" timestamp=1722637437
time=2024-08-02T23:23:57.145+01:00 level=INFO source=server.go:609 msg="llama runner started in 1.51 seconds"
INFO [update_slots] input truncated | n_ctx=2048 n_erase=1611 n_keep=4 n_left=2044 n_shift=1022 tid="132032175734784" timestamp=1722637437
[GIN] 2024/08/02 - 23:24:46 | 500 |          4m0s |       127.0.0.1 | POST     "/api/chat"
cuda driver library failed to get device context 999time=2024-08-02T23:29:46.493+01:00 level=WARN source=gpu.go:374 msg="error looking up nvidia GPU memory"
cuda driver library failed to get device context 999time=2024-08-02T23:29:46.744+01:00 level=WARN source=gpu.go:374 msg="error looking up nvidia GPU memory"
cuda driver library failed to get device context 999time=2024-08-02T23:29:46.995+01:00 level=WARN source=gpu.go:374 msg="error looking up nvidia GPU memory"
cuda driver library failed to get device context 999time=2024-08-02T23:29:47.245+01:00 level=WARN source=gpu.go:374 msg="error looking up nvidia GPU memory"
cuda driver library failed to get device context 999time=2024-08-02T23:29:47.495+01:00 level=WARN source=gpu.go:374 msg="error looking up nvidia GPU memory"
cuda driver library failed to get device context 999time=2024-08-02T23:29:47.744+01:00 level=WARN source=gpu.go:374 msg="error looking up nvidia GPU memory"
cuda driver library failed to get device context 999time=2024-08-02T23:29:47.994+01:00 level=WARN source=gpu.go:374 msg="error looking up nvidia GPU memory"
cuda driver library failed to get device context 999time=2024-08-02T23:29:48.245+01:00 level=WARN source=gpu.go:374 msg="error looking up nvidia GPU memory"
cuda driver library failed to get device context 999time=2024-08-02T23:29:48.495+01:00 level=WARN source=gpu.go:374 msg="error looking up nvidia GPU memory"
cuda driver library failed to get device context 999time=2024-08-02T23:29:48.744+01:00 level=WARN source=gpu.go:374 msg="error looking up nvidia GPU memory"
cuda driver library failed to get device context 999time=2024-08-02T23:29:48.995+01:00 level=WARN source=gpu.go:374 msg="error looking up nvidia GPU memory"
cuda driver library failed to get device context 999time=2024-08-02T23:29:49.244+01:00 level=WARN source=gpu.go:374 msg="error looking up nvidia GPU memory"
cuda driver library failed to get device context 999time=2024-08-02T23:29:49.494+01:00 level=WARN source=gpu.go:374 msg="error looking up nvidia GPU memory"
cuda driver library failed to get device context 999time=2024-08-02T23:29:49.745+01:00 level=WARN source=gpu.go:374 msg="error looking up nvidia GPU memory"
cuda driver library failed to get device context 999time=2024-08-02T23:29:49.995+01:00 level=WARN source=gpu.go:374 msg="error looking up nvidia GPU memory"
cuda driver library failed to get device context 999time=2024-08-02T23:29:50.245+01:00 level=WARN source=gpu.go:374 msg="error looking up nvidia GPU memory"
cuda driver library failed to get device context 999time=2024-08-02T23:29:50.495+01:00 level=WARN source=gpu.go:374 msg="error looking up nvidia GPU memory"
cuda driver library failed to get device context 999time=2024-08-02T23:29:50.745+01:00 level=WARN source=gpu.go:374 msg="error looking up nvidia GPU memory"
cuda driver library failed to get device context 999time=2024-08-02T23:29:50.994+01:00 level=WARN source=gpu.go:374 msg="error looking up nvidia GPU memory"
cuda driver library failed to get device context 999time=2024-08-02T23:29:51.245+01:00 level=WARN source=gpu.go:374 msg="error looking up nvidia GPU memory"
time=2024-08-02T23:29:51.494+01:00 level=WARN source=sched.go:674 msg="gpu VRAM usage didn't recover within timeout" seconds=5.001706621 model=/home/xxxx/.ollama/models/blobs/sha256-87048bcd55216712ef14c11c2c303728463207b165bf18440b9b84b07ec00f87
cuda driver library failed to get device context 999time=2024-08-02T23:29:51.494+01:00 level=WARN source=gpu.go:374 msg="error looking up nvidia GPU memory"
time=2024-08-02T23:29:51.744+01:00 level=WARN source=sched.go:674 msg="gpu VRAM usage didn't recover within timeout" seconds=5.251527354 model=/home/xxxx/.ollama/models/blobs/sha256-87048bcd55216712ef14c11c2c303728463207b165bf18440b9b84b07ec00f87
cuda driver library failed to get device context 999time=2024-08-02T23:29:51.745+01:00 level=WARN source=gpu.go:374 msg="error looking up nvidia GPU memory"
time=2024-08-02T23:29:51.994+01:00 level=WARN source=sched.go:674 msg="gpu VRAM usage didn't recover within timeout" seconds=5.501999011

As a workaround, I need to execute:
sudo modprobe -r nvidia_uvm && sudo modprobe nvidia_uvm
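A minimal sketch that wraps this workaround, assuming no other process is still holding the GPU (the `lsof` guard below is an extra precaution and is not part of the original report):

```sh
#!/bin/sh
# Hypothetical helper: reset the nvidia_uvm module when CUDA init starts failing.
set -e

if lsmod | grep -q '^nvidia_uvm'; then
    # Refuse to unload while something still has the UVM device open
    if sudo lsof /dev/nvidia-uvm >/dev/null 2>&1; then
        echo "nvidia_uvm is still in use; stop GPU workloads (e.g. the ollama runner) first" >&2
        exit 1
    fi
    sudo modprobe -r nvidia_uvm
fi
sudo modprobe nvidia_uvm
```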

This issue doesn't happen with the system-wide installation.

Can you suggest why this is happening and how to solve it?

Rgds,
KS

OS

Linux

GPU

Nvidia

CPU

Intel

Ollama version

ollama version is 0.3.3

GiteaMirror added the docker, nvidia, bug labels 2026-04-12 14:40:43 -05:00

@R-Prady commented on GitHub (Aug 5, 2024):

I'm running on an NVIDIA L4. I have the same issue as well.


@rick-github commented on GitHub (Aug 5, 2024):

Would it be possible to post the full log? You posted from what I presume is the first error message, but there may be relevant information earlier in the log.


@Trismegiste commented on GitHub (Aug 7, 2024):

Is this running on a laptop?
Does `dmesg -T` contain the infamous message "GPU has fallen off the bus"?

It could be overheating, or aggressive power-saving management that disables the GPU.

Example : https://forums.developer.nvidia.com/t/how-to-solve-nvrm-gpu-000000-0-gpu-has-fallen-off-the-bus-completely/230923
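A quick sketch for checking the kernel log for that message and related NVRM/Xid errors:

```sh
# Scan the kernel ring buffer for GPU bus drops and other NVIDIA driver errors
sudo dmesg -T | grep -iE 'fallen off the bus|NVRM|Xid' | tail -n 20
```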


@dhiltgen commented on GitHub (Oct 24, 2024):

You didn't mention, but I believe you're running in a container.

If so, on the host, setting up the same thing we do in our install script might help keep things happy. https://github.com/ollama/ollama/blob/main/scripts/install.sh#L358-L367
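(A sketch of the general idea, assuming the linked install.sh section is about keeping the nvidia_uvm kernel module loaded; check the linked lines themselves for the exact commands it runs:)

```sh
# Hypothetical host-side setup, modeled on the intent of the linked install.sh section:
# load nvidia_uvm immediately and register it to load on every boot.
sudo modprobe nvidia_uvm
echo nvidia_uvm | sudo tee /etc/modules-load.d/nvidia_uvm.conf
```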

Reference: github-starred/ollama#3840