[GH-ISSUE #12682] Fan on passively cooled server GPUs (NVIDIA P40) runs at 100% after v1.30 update due to GPU inactivity #54925

Open
opened 2026-04-29 08:00:30 -05:00 by GiteaMirror · 3 comments

Originally created by @JanDamek on GitHub (Oct 17, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/12682

What is the issue?

Describe the bug

Since the update to version 0.1.30 (and later), Ollama's new behavior of fully unloading the GPU context after a period of inactivity causes the chassis fans that cool passively cooled server GPUs, such as the NVIDIA P40, to spin up to 100%.

This is a significant regression from v0.1.28, where the Ollama service kept the GPU initialized and "awake" for the entire duration of its runtime, allowing proper thermal management and quiet fan operation.

To Reproduce

  1. Use a server with a passively cooled GPU, such as an NVIDIA Tesla P40.
  2. Install Ollama v0.1.30 or a more recent version.
  3. Start the Ollama service (ollama serve).
  4. Observe that the system/GPU fans spin up to maximum speed because the GPU is not actively initialized by Ollama at startup.
  5. Run a model (e.g., ollama run llama3). The GPU initializes, the model loads, and the fans calm down to normal levels.
  6. Wait for 5 minutes after the request is complete.
  7. Observe that the model is unloaded, the GPU context is released, and the fans immediately spin back up to 100% (one way to watch this transition is sketched below).
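
For anyone reproducing this, a simple way to watch the GPU drop in and out of idle during steps 4–7 is to poll its performance state with nvidia-smi (assuming the tool is available on the host, or in the container if the NVIDIA runtime is passed through):

```shell
# Poll the performance state (P0 = fully active; higher P-numbers are
# deeper idle states), temperature, and power draw once per second.
watch -n 1 'nvidia-smi --query-gpu=pstate,temperature.gpu,power.draw --format=csv,noheader'
```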

Expected behavior

The behavior should be similar to version 0.1.28. The Ollama service, upon starting, should perform a basic initialization of the GPU to bring it out of a deep idle state. This keeps the card "awake" and allows the server's Baseboard Management Controller (BMC) to manage fan speeds correctly based on actual temperature readings, rather than defaulting to 100% as a safety precaution.

Unloading a model from VRAM should not mean fully de-initializing the GPU context, which leads to this issue on server hardware.
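
As an aside, on servers whose BMC speaks IPMI, the fan readings the BMC actually sees can be checked independently of Ollama; a minimal sketch, assuming ipmitool is installed on the host:

```shell
# List all fan sensors and their current readings as reported by the BMC.
ipmitool sdr type Fan
```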

Actual behavior

Ollama v0.1.30+ leaves the GPU in a deep idle state until a model is requested. When the model is unloaded after the keep_alive timeout (default 5 minutes), the GPU returns to this deep idle state. For passively cooled cards like the P40, the server's management system interprets this inactive state as a potential risk and ramps up fans to maximum speed to prevent overheating.

This makes running Ollama on dedicated server hardware extremely loud and likely increases power consumption unnecessarily.

System Information

  • OS: TrueNAS SCALE 23.10
  • GPU: NVIDIA Tesla P40
  • Ollama Version: 0.1.30
  • Docker: Yes (Running inside a Docker container on TrueNAS)

Additional context

I have tried setting the OLLAMA_KEEP_ALIVE=-1 environment variable in my Docker setup, but it doesn't seem to reliably solve the core issue. The ideal solution would be for the ollama serve process to maintain a persistent, low-power connection to the GPU, preventing it from entering the problematic deep idle state, just as it did in v1.28. This is a critical issue for anyone using Ollama on headless server hardware with passively cooled GPUs.
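
For reference, this is roughly how the variable would be set in a Docker deployment; the container name, port, and volume below are illustrative, not necessarily the reporter's actual configuration:

```shell
# Run Ollama with GPU access and an indefinite keep-alive, so a loaded
# model (and its CUDA context) is never unloaded due to inactivity.
docker run -d --gpus all \
  -e OLLAMA_KEEP_ALIVE=-1 \
  -v ollama:/root/.ollama \
  -p 11434:11434 \
  --name ollama ollama/ollama
```

Note that OLLAMA_KEEP_ALIVE=-1 only keeps a model resident once one has been loaded; it does not initialize the GPU at service startup, which may be why it does not fully address the problem described here.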

Relevant log output


OS

Docker

GPU

Nvidia

CPU

Intel

Ollama version

0.1.30

GiteaMirror added the bug, needs more info labels 2026-04-29 08:00:30 -05:00

@rick-github commented on GitHub (Oct 17, 2025):

v0.1.30 is quite old and there have been significant architectural changes since it was released. Does this still occur with the latest version (v0.12.6)?

See also #8591.


@JanDamek commented on GitHub (Oct 17, 2025):

Hello, thank you for the follow-up.

I can confirm this issue is happening with the :latest image, which I pulled today (October 17, 2025) on my TrueNAS system.

You are correct that issue #8591 is very similar, but I want to clarify the specific cause on my hardware. My server uses a passively cooled NVIDIA Tesla P40. The core problem is how the server's management system reacts when the GPU becomes completely idle.

Here is the chain of events with the new version:

  1. When Ollama is not processing a request, it seems to completely release the GPU, likely de-initializing the CUDA context.
  2. This puts the P40 into a deep idle state where it no longer reports its status to the server's Baseboard Management Controller (BMC).
  3. As a failsafe, the server assumes the card might be overheating without being able to report it, so it spins all system fans to 100% to prevent damage (a driver-level mitigation is sketched after this list).
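
One driver-level mitigation worth trying for this deep-idle behavior, independent of Ollama, is NVIDIA's persistence mode, which keeps the kernel driver initialized even when no CUDA client is attached. Assuming root access on the host (not just inside the container), it can be enabled with:

```shell
# Legacy method: enable persistence mode on all GPUs via nvidia-smi.
sudo nvidia-smi -pm 1

# Preferred on current drivers: run the persistence daemon instead
# (usually managed as a systemd service rather than started by hand).
sudo nvidia-persistenced --persistence-mode
```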

The most important piece of information is this: I performed a rollback to my previously working Docker image, and the problem is completely gone.

  • The working (older) image reports Ollama version 0.12.2 via ollama -v.
  • With this older version, the moment the Ollama container starts, the server fans immediately slow down to normal operating speeds. This proves that version 0.12.2 initializes the GPU upon startup and keeps it in an "awake" state, which is the correct behavior for this type of hardware.

So, the architectural change between v0.12.2 and the latest versions, which was likely intended to save resources, has this critical side effect on passively cooled server hardware.

Thank you for looking into this.


@rick-github commented on GitHub (Oct 17, 2025):

So it also worked somewhere between 0.1.30 and 0.12.2? What is unreliable about OLLAMA_KEEP_ALIVE=-1?
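
For context on the reliability question: keep_alive can also be supplied per request, and a request-level value takes precedence over the server-wide OLLAMA_KEEP_ALIVE, so any client sending its own short keep_alive could cause unexpected unloads. A quick way to pin a model from the API (model name illustrative):

```shell
# Load the model and ask the server to keep it resident indefinitely.
curl http://localhost:11434/api/generate -d '{
  "model": "llama3",
  "prompt": "hello",
  "keep_alive": -1
}'
```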

Reference: github-starred/ollama#54925