Ollama CUDA on Ubuntu Issue #1905

Closed
opened 2025-11-12 10:37:32 -06:00 by GiteaMirror · 4 comments

Originally created by @frankmedia on GitHub (Mar 13, 2024).

Originally assigned to: @dhiltgen on GitHub.

Ollama runs for about 10-15 minutes and then it stops due to some CUDA issue.

Mar 13 04:27:53 marco-All-Series ollama[886]: [GIN] 2024/03/13 - 04:27:53 | 200 | 1m18s | 192.168.1.186 | POST "/api/generate"
Mar 13 04:27:54 marco-All-Series ollama[886]: {"function":"launch_slot_with_data","level":"INFO","line":823,"msg":"slot is processing task","slot_id":0,"task_id":4327,"tid":"139815765968640","timestamp":1710318474}
Mar 13 04:27:54 marco-All-Series ollama[886]: {"function":"update_slots","level":"INFO","line":1796,"msg":"slot progression","n_past":86,"n_prompt_tokens_processed":149,"slot_id":0,"task_id":4327,"tid":"139815765968640","timestamp":1710318474}
Mar 13 04:27:54 marco-All-Series ollama[886]: {"function":"update_slots","level":"INFO","line":1821,"msg":"kv cache rm [p0, end)","p0":86,"slot_id":0,"task_id":4327,"tid":"139815765968640","timestamp":1710318474}
Mar 13 04:28:19 marco-All-Series ollama[886]: CUDA error: unknown error
Mar 13 04:28:19 marco-All-Series ollama[886]: current device: 0, in function ggml_backend_cuda_get_tensor_async at /go/src/github.com/jmorganca/ollama/llm/llama.cpp/ggml-cuda.cu:12080
Mar 13 04:28:19 marco-All-Series ollama[886]: cudaMemcpyAsync(data, (const char *)tensor->data + offset, size, cudaMemcpyDeviceToHost, g_cudaStreams[cuda_ctx->device][0])
Mar 13 04:28:19 marco-All-Series ollama[886]: GGML_ASSERT: /go/src/github.com/jmorganca/ollama/llm/llama.cpp/ggml-cuda.cu:256: !"CUDA error"
Mar 13 04:28:19 marco-All-Series ollama[1385]: [New LWP 905]
Mar 13 04:28:19 marco-All-Series ollama[1385]: [New LWP 906]
Mar 13 04:28:19 marco-All-Series ollama[1385]: [New LWP 907]
Mar 13 04:28:19 marco-All-Series ollama[1385]: [New LWP 908]
Mar 13 04:28:19 marco-All-Series ollama[1385]: [New LWP 909]
Mar 13 04:28:19 marco-All-Series ollama[1385]: [New LWP 917]
Mar 13 04:28:19 marco-All-Series ollama[1385]: [New LWP 918]
Mar 13 04:28:19 marco-All-Series ollama[1385]: [New LWP 919]
Mar 13 04:28:19 marco-All-Series ollama[1385]: [New LWP 920]
Mar 13 04:28:19 marco-All-Series ollama[1385]: [New LWP 921]
Mar 13 04:28:19 marco-All-Series ollama[1385]: [New LWP 922]
Mar 13 04:28:19 marco-All-Series ollama[1385]: [New LWP 923]
Mar 13 04:28:19 marco-All-Series ollama[1385]: [New LWP 1368]
Mar 13 04:28:19 marco-All-Series ollama[1385]: [New LWP 1370]
Mar 13 04:28:19 marco-All-Series ollama[1385]: [New LWP 1371]
Mar 13 04:28:19 marco-All-Series ollama[1385]: [New LWP 1372]
Mar 13 04:28:19 marco-All-Series ollama[1385]: [Thread debugging using libthread_db enabled]
Mar 13 04:28:19 marco-All-Series ollama[1385]: Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Mar 13 04:28:19 marco-All-Series ollama[1385]: 0x00000000004700a3 in ?? ()
Mar 13 04:28:19 marco-All-Series ollama[1385]: #0 0x00000000004700a3 in ?? ()
Mar 13 04:28:19 marco-All-Series ollama[1385]: #1 0x0000000000437eb0 in ?? ()
Mar 13 04:28:19 marco-All-Series ollama[1385]: #2 0x0000000011b3a7e8 in ?? ()
Mar 13 04:28:19 marco-All-Series ollama[1385]: #3 0x0000000000000080 in ?? ()
Mar 13 04:28:19 marco-All-Series ollama[1385]: #4 0x0000000000000000 in ?? ()
Mar 13 04:28:19 marco-All-Series ollama[1385]: [Inferior 1 (process 886) detached]

Also, `nvidia-smi` no longer shows any valid GPU. I have an NVIDIA Tesla T4 installed, which works perfectly fine at startup before Ollama kicks in.

$ nvidia-smi
Unable to determine the device handle for GPU 0000:01:00.0: Unknown Error

GiteaMirror added the bug, linux, nvidia labels 2025-11-12 10:37:32 -06:00

@dhiltgen commented on GitHub (Mar 21, 2024):

Given `nvidia-smi` stops working, this sounds like it might be an NVIDIA driver bug. For similar "unknown error" failures from the CUDA library APIs, some users have reported that `sudo rmmod nvidia_uvm && sudo modprobe nvidia_uvm` helps reset a wedged driver.
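
A minimal sketch of that reset sequence (assuming Ollama is installed as the usual systemd service; stop it first so nothing is holding the GPU):

```bash
# Stop Ollama so no process is holding the GPU (assumes the systemd install).
sudo systemctl stop ollama

# Unload and reload the UVM module to reset a wedged driver.
sudo rmmod nvidia_uvm
sudo modprobe nvidia_uvm

# Verify the GPU handle is back before restarting the service.
nvidia-smi
sudo systemctl start ollama
```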


@dhiltgen commented on GitHub (Mar 26, 2024):

One other thing to try: for systems showing the "unknown error" or error 999 on NVIDIA GPUs, check the `dmesg` logs (`dmesg -l err`) to see if anything interesting is being reported by the NVIDIA drivers.
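
For example (standard `dmesg` flags; the grep pattern is just one way to surface NVRM/Xid lines):

```bash
# Error-level kernel messages only.
sudo dmesg -l err

# NVIDIA runtime (NVRM) and Xid messages with human-readable timestamps;
# an Xid code, if present, usually identifies the class of failure.
sudo dmesg -T | grep -iE 'nvrm|xid'
```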


@dhiltgen commented on GitHub (Apr 12, 2024):

If you're still having trouble, please give the above suggestions a try and let us know.


@h9052300 commented on GitHub (Nov 4, 2025):

Here is the full troubleshooting report, translated into English.


System Fault Troubleshooting Report

Date: November 4, 2025
Affected System: mmxrtx4090-To-Be-Filled-By-O-E-M


1. System Environment Information

  • Operating System: Ubuntu 24.04 LTS (or derivative)
  • GPU: NVIDIA GeForce RTX 4090 (24GB VRAM)
  • Kernel Version (Initial): 6.14.0-27-generic (An older kernel, potential source of incompatibility)
  • Kernel Version (Current): 6.14.0-33-generic (Observed in initramfs update logs)
  • Key Software: Ollama API (for AI inference), NVIDIA Drivers (multiple versions), dkms

2. Core Fault Description

The system is fully stable under low-load desktop conditions. Executing nvidia-smi successfully displays GPU information, indicating the GPU is in a P8 (minimum power) state, consuming approximately 10-20W.

The point of failure is the performance state transition.

As soon as any application (in this case, the Ollama API) attempts to utilize the GPU for a high-load compute task, the driver attempts to switch the GPU from the P8 state to the P0 (maximum performance) state.

At this exact moment of transition, the NVIDIA driver immediately crashes.

Post-crash, the GPU "drops" off the system bus, causing nvidia-smi to immediately report Unable to determine the device handle... Unknown Error or No devices were found. A system reboot is required before the GPU is detected again, and the failure is 100% reproducible.
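
One way to watch the P8 -> P0 transition die in real time is to poll the performance state while the workload starts (a sketch using standard `nvidia-smi` query fields; the 1-second interval is arbitrary):

```bash
# In one terminal: log pstate and power draw once per second.
# In another: send the Ollama request that triggers the crash.
# The last line written before nvidia-smi starts erroring shows
# where the P8 -> P0 transition failed.
nvidia-smi --query-gpu=timestamp,pstate,power.draw,temperature.gpu \
           --format=csv -l 1 | tee pstate.log
```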


3. Troubleshooting History (Chronological)

We have progressed through four primary phases of troubleshooting, with each phase's hypothesis based on the failure of the previous one.

Phase 1: Hypothesis - Driver/Kernel Mismatch

  • Symptoms: nvidia-smi reported Unknown Error.
  • Investigation: dmesg logs showed signature... missing errors, and dkms status was inconsistent.
  • Hypothesis: A failed dkms installation or Secure Boot was blocking the module.
  • Actions (sketched after this phase):
    1. Checked mokutil, confirming SecureBoot disabled.
    2. Performed the first driver purge.
  • Result: Advanced to Phase 2.
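
The Phase 1 checks, reconstructed as commands (a sketch; the purge pattern is an assumption that the drivers were installed via apt):

```bash
# Confirm Secure Boot is not blocking the unsigned module.
mokutil --sb-state            # expected output: SecureBoot disabled

# Inspect DKMS for half-built or mismatched NVIDIA modules.
dkms status

# First driver purge (package pattern is an assumption about what was installed).
sudo apt purge 'nvidia-*' 'libnvidia-*'
sudo apt autoremove
```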

Phase 2: Hypothesis - Unstable Driver Version

  • Hypothesis: A specific driver version (e.g., a New Feature Branch) was buggy.
  • Actions (sketched after this phase):
    1. Installed nvidia-driver-570 (New Feature Branch).
    2. Result: nvidia-smi was successful at idle. However, it crashed immediately when running Ollama (P8 -> P0 failure).
    3. Downgraded and installed nvidia-driver-550 (Production Branch).
    4. Result: nvidia-smi was successful at idle. It crashed again in the exact same manner when running Ollama.
  • Conclusion: This hypothesis was disproven. The issue is not driver-version-specific (550 vs. 570), as both fail identically. The problem is deeper.
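
For reference, the branch swap in Phase 2 would look roughly like this on Ubuntu (a sketch; exact package names depend on what `ubuntu-drivers list` offers for this GPU):

```bash
ubuntu-drivers list                   # see which driver branches are available

sudo apt install nvidia-driver-570    # New Feature Branch: crashed on P8 -> P0
# ...reboot, test with Ollama, then downgrade:
sudo apt purge 'nvidia-driver-570*'
sudo apt install nvidia-driver-550    # Production Branch: crashed identically
sudo reboot
```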

Phase 3: Hypothesis - Kernel Chaos & System Artifacts

  • Symptoms: During an apt -f install attempt, logs showed the 560 driver failing to compile (Bad return status for module build) against the older 6.14.0-27-generic kernel.
  • Hypothesis: The system environment was a "hodgepodge" of artifacts from multiple, failed NVIDIA driver installations (560, 575, 580) and was running on an outdated, incompatible kernel.
  • Actions (The "Fundamental Fix" Plan, sketched after this phase):
    1. Repaired the dpkg lock state (sudo kill, sudo rm /var/lib/dpkg/lock..., sudo dpkg --configure -a).
    2. Performed a second, more thorough purge.
    3. Upgraded the kernel (sudo apt install linux-generic-hwe-24.04).
    4. Reinstalled nvidia-driver-550.
  • Result: The system was in a "perfect" state: clean, new kernel, and a stable 550 driver. nvidia-smi was successful. However, upon running the Ollama API, the fault was 100% reproduced.
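
The "Fundamental Fix" plan as a command sequence (a sketch; the lock-file paths are the usual dpkg ones and are an assumption about what was actually removed):

```bash
# 1. Repair the dpkg lock state (only after confirming no apt/dpkg is running).
sudo rm /var/lib/dpkg/lock /var/lib/dpkg/lock-frontend
sudo dpkg --configure -a

# 2. Second, more thorough purge of every NVIDIA package.
sudo apt purge 'nvidia-*' 'libnvidia-*'
sudo apt autoremove

# 3. Move to the HWE kernel series.
sudo apt install linux-generic-hwe-24.04

# 4. Reinstall the production-branch driver and reboot into the new kernel.
sudo apt install nvidia-driver-550
sudo reboot
```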

Phase 4: Hypothesis - Hardware/BIOS Power Management Conflict

  • Hypothesis: With all software-layer variables (driver version, kernel match, system purity) eliminated, the fault must lie in the low-level communication between the driver and the hardware. The most likely culprit is the "Dynamic Power Management" feature conflicting with the motherboard's BIOS/ACPI.
  • Actions: Attempted to implement your planned Step 3 (disabling dynamic power management) by creating /etc/modprobe.d/nvidia-power.conf.
  • Latest Development: While running sudo update-initramfs -u, the system reported libkmod: ERROR... ignoring bad line starting with 'ptions'.
  • Cause: A typo was identified (options was mistyped as ptions).

4. Current Status & Unresolved Issues

We are currently in the middle of executing Phase 4.

  1. Identified Problem: The typo in nvidia-power.conf has prevented our attempt to disable dynamic power management from taking effect.
  2. Immediate Past Action: I provided the command to correct this typo (sudo bash -c 'echo "options nvidia NVreg_DynamicPowerManagement=0x00" > /etc/modprobe.d/nvidia-power.conf').
  3. Unresolved Core Problem:
    • The P8 -> P0 state transition crash still exists.
    • Our primary solution for this (NVreg_DynamicPowerManagement=0x00) has not yet been correctly applied and tested with a reboot.
    • If this solution fails after being applied correctly, the root cause will definitively point to your original troubleshooting plan's final steps: BIOS settings (like PCI-E Link State Power Management) or physical hardware (PSU, GPU power cabling).

Summary: We are not running in circles. We have used the process of elimination to narrow the problem from a broad "software" issue to a highly specific "hardware/BIOS power management" conflict. Your immediate next step is to correctly execute the plan for Phase 4, as sketched below.
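
For anyone following along, the corrected Phase 4 sequence would be roughly this (a sketch; the final verification assumes the driver exposes its module parameters under /proc/driver/nvidia/params):

```bash
# Write the module option correctly ("options", not "ptions").
echo 'options nvidia NVreg_DynamicPowerManagement=0x00' | \
    sudo tee /etc/modprobe.d/nvidia-power.conf

# Rebuild the initramfs so the option is applied at boot, then reboot.
sudo update-initramfs -u
sudo reboot

# After reboot: confirm the option took effect, then re-run the Ollama
# workload to test the P8 -> P0 transition.
grep -i dynamicpowermanagement /proc/driver/nvidia/params
```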


Reference: github-starred/ollama-ollama#1905