[GH-ISSUE #13494] Integer underflow in GPU memory calculation causes "unable to allocate CUDA buffer" error when loading small models #55410

Closed
opened 2026-04-29 09:08:11 -05:00 by GiteaMirror · 2 comments

Originally created by @takumi-ricoh on GitHub (Dec 16, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/13494

Integer underflow in GPU memory calculation causes "unable to allocate CUDA buffer" error when loading small models

Summary

When loading a small 8B Q4_K_M model after two large models (120B and 20B) are already loaded, Ollama fails with an "unable to allocate CUDA buffer" error due to an integer underflow bug in the GPU memory calculation. The bug causes Ollama to incorrectly believe a nearly full GPU has ~17 exabytes of free memory.

Environment

  • Ollama version: v0.12.11 (also reproduced on v0.13.4)
  • OS: Linux
  • GPUs: 3x NVIDIA L40S (48GB each)
  • Docker deployment

Configuration

OLLAMA_KEEP_ALIVE=-1
OLLAMA_MAX_LOADED_MODELS=3
OLLAMA_SCHED_SPREAD=0
OLLAMA_DEBUG=1

Steps to Reproduce

  1. Load first large model (120B model, e.g., a quantized 120B model):

    ollama run gpt-oss:120b "Hello"
    # Model loads successfully (~80GB VRAM used)
    
  2. Load second model (20B model, e.g., a safeguard/moderation model):

    ollama run your-safeguard-model:20b "Hello"
    # Model loads successfully (~39GB VRAM used)
    # Total: 2 models loaded
    
  3. Attempt to load small 8B Q4_K_M model:

    ollama run your-8b-model:q4_k_m "Hello"
    # FAILS with error below
    

Expected Behavior

Since OLLAMA_MAX_LOADED_MODELS=3, Ollama should either:

  1. Auto-unload one of the existing models to make room, OR
  2. Load the small model on a GPU with sufficient free space (GPU 1 or GPU 2)

Actual Behavior

Error: 500 Internal Server Error: llama runner process has terminated:
error loading model: unable to allocate CUDA0 buffer

GPU state before error:

GPU 0: 357 MiB free, 45.7 GB used
GPU 1: 21 GB free, 25 GB used
GPU 2: 1.4 GB free, 44.7 GB used

Debug Logs Analysis

With OLLAMA_DEBUG=1, the logs reveal integer underflow:

time=2025-12-16T07:36:37.183Z level=INFO msg="updated VRAM" gpu=GPU-5320d871... available="357.9 MiB"
time=2025-12-16T07:36:37.183Z level=DEBUG source=server.go:921 msg="available gpu" id=GPU-5320d871... "available layer vram"="17179869183.3 GiB"
time=2025-12-16T07:36:37.183Z level=INFO msg=load request="GPULayers:33[ID:GPU-5320d871... Layers:33(0..32)]"

Key evidence:

  • GPU 0 has only 357.9 MiB free
  • But Ollama calculates 17,179,869,183.3 GiB (~17 exabytes) of "available layer vram"
  • All 33 layers are assigned to GPU 0 despite insufficient space
  • Actual CUDA allocation fails

Root Cause

Integer underflow occurs in llm/server.go around lines 914-919 in the buildLayout function:

reserved := uint64(float32(gl[i].FreeMemory)*backoff) + gl[i].MinimumMemory() + envconfig.GpuOverhead() + memory.GPUs[j].Graph
if gl[i].FreeMemory > reserved {
    gl[i].FreeMemory -= reserved
} else {
    gl[i].FreeMemory = 0
}

Issue: When this code is executed multiple times for the same GPU (when memory.GPUs contains multiple entries for the same device), the second iteration causes underflow:

  1. First iteration: FreeMemory = 357.9 MiB, reserved = 457 MiB + 17.5 GiB

    • Since 357.9 MiB < reserved, set FreeMemory = 0
  2. Second iteration (same GPU): FreeMemory = 0, tries to subtract reserved again

    • Although the if-else prevents a direct underflow here, the logs still show 17,179,869,183.3 GiB
    • This suggests the graph memory calculation results in an incorrect (wrapped) value being used

The value 17,179,869,183.3 GiB ≈ 2^64 bytes (17,179,869,184 GiB), indicating uint64 underflow when subtracting graph memory from a GPU with insufficient free space.
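
To make the failure mode concrete, here is a minimal standalone Go program (my own illustration, not Ollama code) showing how an unchecked uint64 subtraction of a ~17.5 GiB reservation from a few hundred MiB of free memory wraps around to just below 2^64 bytes, the same order of magnitude as the impossible figure in the debug log:

package main

import "fmt"

// Minimal illustration (not Ollama code): subtracting a reservation larger
// than the free memory from a uint64 wraps around to just below 2^64.
func main() {
	const (
		MiB = uint64(1) << 20
		GiB = uint64(1) << 30
	)

	free := 357 * MiB       // roughly what GPU 0 actually had free
	graph := 17*GiB + GiB/2 // the 17.5 GiB graph reservation from the logs

	wrapped := free - graph // unchecked uint64 subtraction underflows
	fmt.Printf("%.1f GiB\n", float64(wrapped)/float64(GiB))
	// Prints ~17179869166.8 GiB, matching the magnitude of the
	// "available layer vram" values reported in the debug log.
}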

Why Large Models Don't Trigger This Bug

When loading another large model (e.g., gpt-oss:120b → another 70B+ model):

  • The new model is obviously too large to fit
  • Auto-unload logic kicks in before memory calculation
  • Clean state prevents integer underflow condition

When loading a small 8B Q4_K_M model after large models:

  • OLLAMA_MAX_LOADED_MODELS still has capacity (2/3 used)
  • Small 8B model appears to fit (only ~4.4GB needed), so no auto-unload occurs
  • GPU memory calculation proceeds
  • Integer underflow bug is triggered
  • Ollama incorrectly assigns all layers to nearly-full GPU 0
  • Actual allocation fails

This is an ironic situation: smaller models are more likely to encounter this bug than larger models.

Suggested Fix

Ensure gl[i].FreeMemory calculations are performed only once per GPU, or add bounds checking to prevent underflow when FreeMemory is already 0 or insufficient.

Possible fix in llm/server.go around line 914-919:

reserved := uint64(float32(gl[i].FreeMemory)*backoff) + gl[i].MinimumMemory() + envconfig.GpuOverhead() + memory.GPUs[j].Graph
if gl[i].FreeMemory > reserved {
    gl[i].FreeMemory -= reserved
} else {
    gl[i].FreeMemory = 0
    // Prevent further processing on this GPU if it has no available memory
}

Or ensure that each GPU is only processed once in the loop by tracking which GPUs have already been processed.
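
For illustration only, a minimal self-contained sketch of that second approach combined with the existing clamp-at-zero check. The gpuInfo type, its field names, and the applyReservation function are hypothetical; they do not match Ollama's actual structures in llm/server.go:

package main

import "fmt"

type gpuInfo struct {
	ID         string
	FreeMemory uint64
}

// applyReservation charges the reserved bytes against each physical device
// at most once, clamping at zero so the uint64 counter can never wrap.
func applyReservation(gpus []*gpuInfo, reserved map[string]uint64) {
	seen := make(map[string]bool)
	for _, g := range gpus {
		if seen[g.ID] {
			continue // this device was already charged its reservation
		}
		seen[g.ID] = true
		if r := reserved[g.ID]; g.FreeMemory > r {
			g.FreeMemory -= r
		} else {
			g.FreeMemory = 0
		}
	}
}

func main() {
	const MiB, GiB = uint64(1) << 20, uint64(1) << 30
	// The same physical device appearing twice, as hypothesized above.
	g := &gpuInfo{ID: "GPU-0", FreeMemory: 357 * MiB}
	applyReservation([]*gpuInfo{g, g}, map[string]uint64{"GPU-0": 18 * GiB})
	fmt.Println(g.FreeMemory) // 0 — clamped once, never charged a second time
}

An actual fix would need to live inside buildLayout itself; this only demonstrates the idea of deduplicating per-device reservations.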

Workaround

Option 1: Use SCHED_SPREAD

Set OLLAMA_SCHED_SPREAD=1 to spread models across all GPUs, which reduces the likelihood of hitting this bug (though it does not eliminate it):

OLLAMA_SCHED_SPREAD=1

Option 2: Manual unload

Manually unload models before loading new ones:

ollama stop gpt-oss:120b
ollama run your-8b-model:q4_k_m "Hello"

Additional Notes

This bug affects scenarios where:

  • Multiple large models are loaded sequentially
  • OLLAMA_MAX_LOADED_MODELS allows one more model
  • The next model is small enough that auto-unload doesn't trigger
  • One GPU is nearly full but has a small amount of free memory

The bug is triggered more frequently with OLLAMA_SCHED_SPREAD=0 (pack strategy) than with OLLAMA_SCHED_SPREAD=1 (spread strategy), but can occur in both cases.

GiteaMirror added the bug label 2026-04-29 09:08:11 -05:00

@jessegross commented on GitHub (Dec 17, 2025):

Please include the log from when this happens on 0.13.4.


@takumi-ricoh commented on GitHub (Dec 19, 2025):

Attached log: ollama_bug_logs_v0.13.4.txt (https://github.com/user-attachments/files/24256096/ollama_bug_logs_v0.13.4.txt)

Response to GitHub Issue #13494

Reproduced on v0.13.4

I've reproduced this issue on Ollama v0.13.4. Here are the logs showing the integer underflow bug:

Environment

  • Ollama version: v0.13.4 (Docker)
  • GPUs: 3x NVIDIA L40S (48GB each)
  • Other GPU consumers: LocalAI, vLLM, Whisper running on the same system (consuming some GPU 0 memory)
  • Configuration:
    OLLAMA_DEBUG=1
    OLLAMA_KEEP_ALIVE=-1
    OLLAMA_MAX_LOADED_MODELS=3
    OLLAMA_SCHED_SPREAD=0
    

Steps to Reproduce

  1. Have other GPU-consuming services running (LocalAI, vLLM, etc.) that use some GPU memory
  2. Load a large model (e.g., 120B model using ~80GB VRAM across GPUs)
  3. Load a second model (e.g., 8B model with 64K context using ~39GB VRAM)
  4. Attempt to load a third small model (8B Q4_K_M)

Key Log Evidence

Integer underflow detected - GPU reports ~17 exabytes of "available" VRAM:

time=2025-12-19T10:36:20.958Z level=DEBUG source=server.go:965 msg="available gpu" id=GPU-5320d871... library=CUDA "available layer vram"="17179869183.3 GiB" backoff=0.00 minimum="457.0 MiB" overhead="0 B" graph="0 B"
time=2025-12-19T10:36:20.958Z level=DEBUG source=server.go:965 msg="available gpu" id=GPU-5320d871... library=CUDA "available layer vram"="17179869165.8 GiB" backoff=0.00 minimum="457.0 MiB" overhead="0 B" graph="17.5 GiB"

The value 17179869183.3 GiB ≈ 2^64 bytes (17,179,869,184 GiB), indicating uint64 underflow; the second logged value, 17179869165.8 GiB, is that figure minus the 17.5 GiB graph.
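
As a quick arithmetic sanity check (my own verification, not from the attached logs): 2^64 bytes expressed in GiB is 2^34 = 17,179,869,184, and the two logged values differ by exactly the 17.5 GiB graph size, consistent with the graph reservation being applied on top of an already-wrapped counter:

package main

import "fmt"

func main() {
	// 2^64 bytes expressed in GiB: 2^64 / 2^30 = 2^34.
	fmt.Println(uint64(1) << 34) // 17179869184

	// The two logged "available layer vram" values differ by the 17.5 GiB
	// graph reservation reported on the same lines.
	fmt.Printf("%.1f\n", 17179869183.3-17.5) // 17179869165.8
}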

Resulting error:

ggml_backend_cuda_buffer_type_alloc_buffer: allocating 4403.49 MiB on device 0: cudaMalloc failed: out of memory
alloc_tensor_range: failed to allocate CUDA0 buffer of size 4617396480
llama_model_load: error loading model: unable to allocate CUDA0 buffer
time=2025-12-19T10:36:22.622Z level=INFO source=sched.go:470 msg="Load failed" model=... error="llama runner process has terminated: error loading model: unable to allocate CUDA0 buffer"

GPU State Before Error

GPU 0: 378 MiB free (nearly full - partially used by LocalAI/vLLM/Whisper)
GPU 1: 1.5 GB free
GPU 2: 21 GB free

Despite GPU 2 having 21GB free, the scheduler incorrectly assigned all layers to GPU 0 (which only had 378 MiB free) due to the underflow bug.

Root Cause Hypothesis

The integer underflow occurs in the GPU memory calculation when:

  1. Multiple models are already loaded, and/or other GPU consumers (LocalAI, vLLM, etc.) are using GPU memory
  2. FreeMemory is less than reserved (MinimumMemory + Overhead + Graph)
  3. The subtraction results in uint64 underflow, producing ~17 exabytes

This bug is more likely to trigger in environments where Ollama shares GPUs with other services.

Workaround

Setting OLLAMA_SCHED_SPREAD=1 reduces the likelihood of triggering this bug by spreading models across GPUs more evenly.
