[GH-ISSUE #11202] Ollama overcommitting GPU memory on my config with large models #69439

Open
opened 2026-05-04 18:06:08 -05:00 by GiteaMirror · 1 comment

Originally created by @int13h82 on GitHub (Jun 25, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/11202

What is the issue?

Threadripper 3995WX
512 GB system RAM
2x AMD MI50 (32 GB each)

Everything is fine up to qwen3:235b-a22b (142 GB): system and GPU memory are distributed correctly. With deepseek-r1:671b, Ollama stops with:

```shell
$ ollama run deepseek-r1:671b --verbose --keepalive=-1m
Error: llama runner process has terminated: cudaMalloc failed: out of memory
ggml_gallocr_reserve_n: failed to allocate ROCm0 buffer of size 5274339328
```

See the corresponding log in the next field. It looks like Ollama tries to allocate 28426.68 MiB + 1280.00 MiB + 5030.00 MiB on GPU0, requesting roughly 35 GB of VRAM when only 32 GB is available.
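For scale, those three ROCm0 allocations from the log can be summed and compared against the card's VRAM (a quick sanity check, assuming the full 32 GiB, i.e. 32768 MiB, of the MI50 is usable):

```shell
# Sum the ROCm0 allocations reported in the log (model buffer + KV buffer
# + compute buffer, all in MiB) and compare against 32 GiB of VRAM.
awk 'BEGIN { total = 28426.68 + 1280.00 + 5030.00; vram = 32 * 1024;
             printf "requested %.2f MiB vs %d MiB available (over by %.2f MiB)\n",
                    total, vram, total - vram }'
```

That comes out to 34736.68 MiB requested, about 1.9 GiB more than the device holds, before any runtime overhead is counted.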

This can be worked around with a Modelfile that limits the GPU offload to 6 layers:

```
FROM deepseek-r1:671b
PARAMETER num_gpu 6
```

The per-GPU layer buffer drops to about 21 GB, resulting in a correct launch: 78% memory load on GPU0 and 87% on GPU1.
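A sketch of applying the workaround, assuming the Modelfile above is saved as `./Modelfile` (the tag `deepseek-r1-671b-gpu6` is a hypothetical name; any name works):

```shell
# Build a derived model from the Modelfile and run it.
ollama create deepseek-r1-671b-gpu6 -f Modelfile
ollama run deepseek-r1-671b-gpu6 --verbose
```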

```shell
$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 24.04.2 LTS
Release:        24.04
Codename:       noble
$ uname -r
6.8.0-62-generic
$ dkms status
amdgpu/6.10.5-2119913.24.04, 6.8.0-62-generic, x86_64: installed
$ dpkg -l | grep rocm
ii rocm 6.3.3.60303-74~24.04 amd64 Radeon Open Compute (ROCm) software stack meta package
```

Relevant log output

```shell
Jun 25 18:39:14 tr64-ai ollama[1753774]: load_tensors: loading model tensors, this can take a while... (mmap = true)
Jun 25 18:43:09 tr64-ai ollama[1753774]: load_tensors: offloading 8 repeating layers to GPU
Jun 25 18:43:09 tr64-ai ollama[1753774]: load_tensors: offloaded 8/62 layers to GPU
Jun 25 18:43:09 tr64-ai ollama[1753774]: load_tensors:        ROCm0 model buffer size = 28426.68 MiB
Jun 25 18:43:09 tr64-ai ollama[1753774]: load_tensors:        ROCm1 model buffer size = 28426.68 MiB
Jun 25 18:43:09 tr64-ai ollama[1753774]: load_tensors:   CPU_Mapped model buffer size = 328836.27 MiB
Jun 25 18:43:19 tr64-ai ollama[1753774]: llama_context: constructing llama_context
Jun 25 18:43:19 tr64-ai ollama[1753774]: llama_context: n_seq_max     = 1
Jun 25 18:43:19 tr64-ai ollama[1753774]: llama_context: n_ctx         = 4096
Jun 25 18:43:19 tr64-ai ollama[1753774]: llama_context: n_ctx_per_seq = 4096
Jun 25 18:43:19 tr64-ai ollama[1753774]: llama_context: n_batch       = 512
Jun 25 18:43:19 tr64-ai ollama[1753774]: llama_context: n_ubatch      = 512
Jun 25 18:43:19 tr64-ai ollama[1753774]: llama_context: causal_attn   = 1
Jun 25 18:43:19 tr64-ai ollama[1753774]: llama_context: flash_attn    = 0
Jun 25 18:43:19 tr64-ai ollama[1753774]: llama_context: freq_base     = 10000.0
Jun 25 18:43:19 tr64-ai ollama[1753774]: llama_context: freq_scale    = 0.025
Jun 25 18:43:19 tr64-ai ollama[1753774]: llama_context: n_ctx_per_seq (4096) < n_ctx_train (163840) -- the full capacity of>
Jun 25 18:43:19 tr64-ai ollama[1753774]: llama_context:        CPU  output buffer size =     0.52 MiB
Jun 25 18:43:19 tr64-ai ollama[1753774]: llama_kv_cache_unified: kv_size = 4096, type_k = 'f16', type_v = 'f16', n_layer = >
Jun 25 18:43:19 tr64-ai ollama[1753774]: llama_kv_cache_unified:      ROCm0 KV buffer size =  1280.00 MiB
Jun 25 18:43:19 tr64-ai ollama[1753774]: llama_kv_cache_unified:      ROCm1 KV buffer size =  1280.00 MiB
Jun 25 18:43:26 tr64-ai ollama[1753774]: llama_kv_cache_unified:        CPU KV buffer size = 16960.00 MiB
Jun 25 18:43:26 tr64-ai ollama[1753774]: llama_kv_cache_unified: KV self size  = 19520.00 MiB, K (f16): 11712.00 MiB, V (f1>
Jun 25 18:43:26 tr64-ai ollama[1753774]: ggml_backend_cuda_buffer_type_alloc_buffer: allocating 5030.00 MiB on device 0: cu>
Jun 25 18:43:26 tr64-ai ollama[1753774]: ggml_gallocr_reserve_n: failed to allocate ROCm0 buffer of size 5274339328
Jun 25 18:43:27 tr64-ai ollama[1753774]: llama_init_from_model: failed to initialize the context: failed to allocate comput>
Jun 25 18:43:27 tr64-ai ollama[1753774]: panic: unable to create llama context
Jun 25 18:43:27 tr64-ai ollama[1753774]: goroutine 147 [running]:
Jun 25 18:43:27 tr64-ai ollama[1753774]: github.com/ollama/ollama/runner/llamarunner.(*Server).loadModel(0xc0006283f0, {0x8>
Jun 25 18:43:27 tr64-ai ollama[1753774]:         github.com/ollama/ollama/runner/llamarunner/runner.go:757 +0x389
Jun 25 18:43:27 tr64-ai ollama[1753774]: created by github.com/ollama/ollama/runner/llamarunner.Execute in goroutine 1
Jun 25 18:43:27 tr64-ai ollama[1753774]:         github.com/ollama/ollama/runner/llamarunner/runner.go:848 +0xb57
Jun 25 18:43:28 tr64-ai ollama[1753774]: time=2025-06-25T18:43:28.100Z level=INFO source=server.go:625 msg="waiting for ser>
Jun 25 18:43:40 tr64-ai ollama[1753774]: time=2025-06-25T18:43:40.041Z level=ERROR source=server.go:457 msg="llama runner t>
Jun 25 18:43:40 tr64-ai ollama[1753774]: time=2025-06-25T18:43:40.075Z level=ERROR source=sched.go:489 msg="error loading l>
Jun 25 18:43:40 tr64-ai ollama[1753774]: [GIN] 2025/06/25 - 18:43:40 | 500 |         4m27s |       127.0.0.1 | POST     "/a
```

OS

Linux

GPU

AMD

CPU

AMD

Ollama version

0.9.2

GiteaMirror added the bug label 2026-05-04 18:06:08 -05:00

@int13h82 commented on GitHub (Jun 25, 2025):

[ollama.txt](https://github.com/user-attachments/files/20912870/ollama.txt)
Full log with both failed and successful attempts.


Reference: github-starred/ollama#69439