[GH-ISSUE #13499] CUBLAS_STATUS_ALLOC_FAILED when running nemotron-3-nano:30b on an 8GB GPU #34662

Closed
opened 2026-04-22 18:24:36 -05:00 by GiteaMirror · 2 comments

Originally created by @0x7CFE on GitHub (Dec 16, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/13499

What is the issue?

Hi,

I am trying to run `nemotron-3-nano:30b` on my rather old Radeon 5700 XT with 8GB VRAM. I am using the most recent ollama, installed via the default `sh` script.
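
For reference, the "default sh script" here, assuming the standard documented invocation, is:

```shell
# Standard ollama Linux installer (documented one-liner)
curl -fsSL https://ollama.com/install.sh | sh
```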

This setup successfully runs `qwen3:30b` as well as `gpt-oss:20B`, so I expected nemotron to work as well.

However, it fails with a `CUBLAS_STATUS_ALLOC_FAILED` error, while the same model runs fine on CPU only.

[ollama.log](https://github.com/user-attachments/files/24195246/ollama.log)

Ubuntu 24.04.3 LTS
6.18.0-061800-generic
AMD Ryzen AI 9 HX 370 w/ Radeon 890M

According to LACT my GPU VRAM is almost empty:

![LACT screenshot showing VRAM usage](https://github.com/user-attachments/assets/f80a558b-db85-4c22-8b95-edf9b4100d33)

Relevant log output

```shell
Dec 16 20:53:03 fw13 ollama[365955]: load_tensors: loading model tensors, this can take a while... (mmap = true)
Dec 16 20:53:03 fw13 ollama[365955]: load_tensors: offloading 15 repeating layers to GPU
Dec 16 20:53:03 fw13 ollama[365955]: load_tensors: offloaded 15/53 layers to GPU
Dec 16 20:53:03 fw13 ollama[365955]: load_tensors:   CPU_Mapped model buffer size = 23139.98 MiB
Dec 16 20:53:03 fw13 ollama[365955]: load_tensors:        ROCm0 model buffer size =  6841.82 MiB
Dec 16 20:53:06 fw13 ollama[365955]: llama_context: constructing llama_context
Dec 16 20:53:06 fw13 ollama[365955]: llama_context: n_seq_max     = 1
Dec 16 20:53:06 fw13 ollama[365955]: llama_context: n_ctx         = 4096
Dec 16 20:53:06 fw13 ollama[365955]: llama_context: n_ctx_seq     = 4096
Dec 16 20:53:06 fw13 ollama[365955]: llama_context: n_batch       = 512
Dec 16 20:53:06 fw13 ollama[365955]: llama_context: n_ubatch      = 512
Dec 16 20:53:06 fw13 ollama[365955]: llama_context: causal_attn   = 1
Dec 16 20:53:06 fw13 ollama[365955]: llama_context: flash_attn    = auto
Dec 16 20:53:06 fw13 ollama[365955]: llama_context: kv_unified    = false
Dec 16 20:53:06 fw13 ollama[365955]: llama_context: freq_base     = 10000.0
Dec 16 20:53:06 fw13 ollama[365955]: llama_context: freq_scale    = 1
Dec 16 20:53:06 fw13 ollama[365955]: llama_context: n_ctx_seq (4096) < n_ctx_train (1048576) -- the full capacity of the model will not be utilized
Dec 16 20:53:06 fw13 ollama[365955]: llama_context:        CPU  output buffer size =     0.51 MiB
Dec 16 20:53:06 fw13 ollama[365955]: llama_kv_cache:        CPU KV buffer size =    20.00 MiB
Dec 16 20:53:06 fw13 ollama[365955]: llama_kv_cache:      ROCm0 KV buffer size =     4.00 MiB
Dec 16 20:53:06 fw13 ollama[365955]: llama_kv_cache: size =   24.00 MiB (  4096 cells,   6 layers,  1/1 seqs), K (f16):   12.00 MiB, V (f16):   12.00 MiB
Dec 16 20:53:06 fw13 ollama[365955]: llama_memory_recurrent:        CPU RS buffer size =    33.12 MiB
Dec 16 20:53:06 fw13 ollama[365955]: llama_memory_recurrent:      ROCm0 RS buffer size =    14.49 MiB
Dec 16 20:53:06 fw13 ollama[365955]: llama_memory_recurrent: size =   47.62 MiB (     1 cells,  52 layers,  1 seqs), R (f32):    1.62 MiB, S (f32):   46.00 MiB
Dec 16 20:53:06 fw13 ollama[365955]: llama_context: Flash Attention was auto, set to enabled
Dec 16 20:53:06 fw13 ollama[365955]: ROCm error: CUBLAS_STATUS_ALLOC_FAILED
Dec 16 20:53:06 fw13 ollama[365955]:   current device: 0, in function cublas_handle at //ml/backend/ggml/ggml/src/ggml-cuda/common.cuh:1257
Dec 16 20:53:06 fw13 ollama[365955]:   hipblasCreate(&cublas_handles[device])
Dec 16 20:53:06 fw13 ollama[365955]: //ml/backend/ggml/ggml/src/ggml-cuda/ggml-cuda.cu:94: ROCm error
```

OS

Linux

GPU

AMD

CPU

AMD

Ollama version

0.13.4

GiteaMirror added the bug label 2026-04-22 18:24:36 -05:00

@rick-github commented on GitHub (Dec 16, 2025):

`nemotron-3-nano:30b` is run with the llama.cpp engine, which is not as accurate with memory requirements as the ollama engine. OOMs can be mitigated with some of the methods shown [here](https://github.com/ollama/ollama/issues/8597#issuecomment-2614533288).

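For example, a minimal sketch of such mitigations, assuming a typical Linux install (the exact knobs recommended in the linked comment may differ):

```shell
# Reserve extra VRAM headroom so the scheduler offloads fewer layers to the GPU;
# the value is in bytes (268435456 bytes = 256 MiB).
OLLAMA_GPU_OVERHEAD=268435456 ollama serve

# Alternatively, offload fewer layers and/or use a smaller context window for
# just this model from the interactive CLI (`ollama run nemotron-3-nano:30b`):
#   /set parameter num_gpu 10
#   /set parameter num_ctx 2048
```

Either approach trades GPU utilization for a smaller chance of the allocation failing at load time.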

@0x7CFE commented on GitHub (Dec 17, 2025):

Thank you for the hint!

Indeed, after I set `OLLAMA_GPU_OVERHEAD=268435456` it loaded successfully. Though for this particular model, in my setup it was slower than CPU-only 🤷.

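With the default systemd install, a minimal sketch of making that workaround persistent (unit name assumed to be the installer's default, ollama.service):

```shell
# Add a drop-in override for the service created by the install script
sudo systemctl edit ollama.service

# In the editor, add:
#   [Service]
#   Environment="OLLAMA_GPU_OVERHEAD=268435456"

# Apply the new environment
sudo systemctl daemon-reload
sudo systemctl restart ollama
```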
Reference: github-starred/ollama#34662