[GH-ISSUE #5949] Out of Memory Error when using Meta-Llama-3.1-8B-Instruct-Q8_0.gguf model with Ollama ROCm with num_ctx=120000 #29475

Open
opened 2026-04-22 08:24:07 -05:00 by GiteaMirror · 15 comments

Originally created by @renbuarl on GitHub (Jul 25, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/5949

Originally assigned to: @dhiltgen on GitHub.

What is the issue?

OS: Linux 6.5.0-44-generic #44~22.04.1-Ubuntu

GPU:

AMD Radeon RX 7900 XTX (24 GiB VRAM)

AMD Radeon RX 7900 XTX (24 GiB VRAM)

AMD Radeon RX 7900 XTX (24 GiB VRAM)

Ollama version: 0.2.8

ROCm module version: 6.7.0
amdgpu-install_6.1.60103-1_all.deb

Model: Meta-Llama-3.1-8B-Instruct-Q8_0

While testing the Meta-Llama-3.1-8B-Instruct-Q8_0.gguf model, I encountered an out-of-memory error well before reaching the model's maximum context size of 128k. The model crashes after processing approximately 28,000 tokens, regardless of whether I use one GPU with 24GB of memory (num_ctx = 30,000) or three GPUs with a combined 72GB (num_ctx = 120,000).
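For scale, here is a back-of-envelope sketch of the KV-cache cost, assuming the published Llama-3.1-8B dimensions (32 layers, 8 KV heads, head dim 128) and fp16 K/V; the figures are illustrative, not taken from the report:

```
# KV-cache bytes per token = 2 (K and V) * n_layers * n_kv_heads * head_dim * 2 bytes (fp16)
echo $(( 2 * 32 * 8 * 128 * 2 ))            # 131072 bytes = 128 KiB per token
echo $(( 131072 * 120000 / 1024 / 1024 ))   # 15000 MiB (~14.6 GiB) of KV at num_ctx=120000
echo $(( 131072 * 30000 / 1024 / 1024 ))    # 3750 MiB (~3.7 GiB) of KV at num_ctx=30000
```

By this arithmetic the KV cache alone fits comfortably in either configuration, so the failing allocation is more plausibly a non-KV buffer that also grows with context; the ggml_cuda_device_malloc call in the log below appears to be a pool ("look_ahead") allocation for scratch data rather than the KV cache itself.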

Error:
Jul 25 12:39:17 ailab ollama[683]: CUDA error: out of memory
Jul 25 12:39:17 ailab ollama[683]: current device: 0, in function alloc at /go/src/github.com/ollama/ollama/llm/llama.cpp/ggml/src/ggml-cuda.cu:291
Jul 25 12:39:17 ailab ollama[683]: ggml_cuda_device_malloc(&ptr, look_ahead_size, device)
Jul 25 12:39:17 ailab ollama[683]: GGML_ASSERT: /go/src/github.com/ollama/ollama/llm/llama.cpp/ggml/src/ggml-cuda.cu:101: !"CUDA error"

There might be similar issues, but out of memory errors with multiple GPUs have not been reported yet.

OS

Linux

GPU

AMD

CPU

Intel

Ollama version

0.2.8

GiteaMirror added the memory, amd, bug labels 2026-04-22 08:24:08 -05:00

@rick-github commented on GitHub (Jul 25, 2024):

Server logs would help with diagnosis. Sounds similar to https://github.com/ollama/ollama/issues/5913, there's a workaround in the comments.


@renbuarl commented on GitHub (Jul 25, 2024):

Very similar to #5913, but here with multiple GPUs. In #5913 the suggestion is indeed a workaround, as VRAM is genuinely low; in this case, reducing num_gpu simply offloads to the CPU even though there are available GPUs with sufficient VRAM. This is an obvious bug.
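For reference, a hedged sketch of the knob under discussion: num_gpu caps how many layers Ollama offloads to GPU, and can be baked into a model variant via a Modelfile (the tag, file name, and layer count here are illustrative):

```
# Illustrative only: create a model variant with a num_gpu cap baked in.
cat > Modelfile <<'EOF'
FROM llama3.1
PARAMETER num_gpu 24
EOF
ollama create llama3.1-capped -f Modelfile
```

As noted above, this trades the OOM for CPU offload, so it is a mitigation rather than a fix.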


@rick-github commented on GitHub (Jul 25, 2024):

Server logs would help with diagnosis.


@renbuarl commented on GitHub (Jul 25, 2024):

[journal.txt](https://github.com/user-attachments/files/16377575/journal.txt)


@dhiltgen commented on GitHub (Jul 26, 2024):

The bug here is likely we're not properly adjusting the prediction for the large context size.


@rick-github commented on GitHub (Jul 26, 2024):

I did a little experiment: I loaded the same model multiple times with different versions of ollama. Ollama always made the same calculations, but across versions from 0.1.40 to 0.3.0, the VRAM usage of the llama server went from 5156MiB to 5214MiB. Not a lot, but when llama.cpp is using 23.9 of 24G (https://github.com/ollama/ollama/issues/5913), it may be enough to push things over the edge.
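A minimal way to repeat that measurement on a ROCm box, assuming the stock rocm-smi tool (the model tag is illustrative):

```
# Load the model, then snapshot VRAM once the runner is resident.
ollama run llama3.1 "hi" > /dev/null
rocm-smi --showmeminfo vram
```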


@Speedway1 commented on GitHub (Jul 28, 2024):

Just a heads-up that the problem of generating garbage with more than one AMD GPU is still an issue. Something broke a few versions of Ollama ago, because it used to work. It's also specific to Ollama: llama.cpp works well for the same models where VRAM usage exceeds 24GB, sharing the load across GPUs without any issues. But with Ollama, as soon as more than one GPU is needed, garbage is produced.

We're still trying to gather helpful information to feed back to the team here to get this fixed, but I would set expectations that even if this reported OOM is fixed, it's still not going to run across multiple cards successfully. Best to limit to one GPU plus CPU RAM, which seems to work.

At the moment we've downgraded our multi-GPU AMD boxes to run multiple Ollamas on single GPUs, separated by port number. E.g. a 2-GPU box will have two instances of Ollama running, on two different port numbers. Each Ollama instance is restricted to one GPU only and of course can use the CPU if needed; a sketch of this setup follows below. We run multi-card jobs on NVIDIA, which is better supported both at the OS level and within Ollama.
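A minimal sketch of that split, assuming Ollama's standard OLLAMA_HOST variable and ROCm's device-masking variable (ports and device indices are illustrative):

```
# Instance 1: restricted to GPU 0, default port.
ROCR_VISIBLE_DEVICES=0 OLLAMA_HOST=127.0.0.1:11434 ollama serve &
# Instance 2: restricted to GPU 1, second port.
ROCR_VISIBLE_DEVICES=1 OLLAMA_HOST=127.0.0.1:11435 ollama serve &
```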

Where we absolutely must use multi-card AMD GPUs, we're using llama.cpp and its OpenAI-API-compatible server. It runs across all GPUs with no problem, provided it's compiled with the LLAMA_CUDA_NO_PEER_COPY=1 flag. But this means Ollama's wonderful LLM-swapping support is missing, so it ties a machine down to serving that one LLM only, e.g. Llama3 70B. We adjust LLM routing accordingly in that case.

Hope these comments help. It won't be long before the problems with multiple AMD cards are fixed; it's just a matter of getting the correct diagnostics, which are elusive at the moment.


@renbuarl commented on GitHub (Jul 30, 2024):

Speedway1, thank you for your message!
However, it seems that the issue is not with ollama, but with llama.cpp.
I built the latest release of llama.cpp (b3488) following the methodology described in
https://github.com/eliranwong/MultiAMDGPU_AIDev_Ubuntu (Thanks to the author!)

git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
make clean && make -j4 GGML_HIPBLAS=1 AMDGPU_TARGETS=gfx1100

~/llama.cpp/llama-server -v -m /home/user/backups/models/70/Qwen2-72B-Instruct-Q4_K_M.gguf -c 32000 --host '192.168.0.5' --port 8081 -ngl 99

When the real context is more than 10k, the same error occurs as in ollama:

CUDA error: out of memory
current device: 2, in function alloc at ggml/src/ggml-cuda.cu:291
ggml_cuda_device_malloc(&ptr, look_ahead_size, device)
ggml/src/ggml-cuda.cu:101: CUDA error


@renbuarl commented on GitHub (Jul 30, 2024):

https://github.com/ggerganov/llama.cpp/issues/8766


@Speedway1 commented on GitHub (Jul 30, 2024):

Hi @renbuarl, I think the problem there is your massive context length; it takes a lot of VRAM. Here is a simple bit of bash that we run when loading up LLMs on AMD to monitor consumption; it's handy to have open in a window!

while true; do rocm-smi; sleep 1; done

At the moment we have Mistral Large, quantised to Q2_K, running on 2x Radeon 7900 XTX (2x24GB) with the following:

llama.cpp/server -m /home/tmp/Mistral-Large-Instruct-2407_q2_k.gguf -ngl 89 -n 1500 -c 1500 --host 0.0.0.0 --port 2600 -a mistral
For those worried about security: this is behind a firewall on our dev VPN, hence the open listen address.

This is the SMI output:


======================================================== Concise Info ========================================================
Device  Node  IDs              Temp    Power   Partitions          SCLK     MCLK     Fan     Perf  PwrCap       VRAM%  GPU%  
              (DID,     GUID)  (Edge)  (Avg)   (Mem, Compute, ID)                                                            
==============================================================================================================================
0       1     0x744c,   55924  68.0°C  175.0W  N/A, N/A, 0         1667Mhz  1249Mhz  40.0%   auto  327.0W       99%    49%   
1       2     0x744c,   27211  70.0°C  178.0W  N/A, N/A, 0         1725Mhz  1249Mhz  41.96%  auto  327.0W       98%    49%   
2       3     0x164e,   33198  39.0°C  41.04W  N/A, N/A, 0         None     1800Mhz  0%      auto  Unsupported  17%    0%    
==============================================================================================================================

(Currently the machine is busy as you can see).

This is on llama.cpp. We cannot get Ollama to work across the cards at the moment.

Not sure if any of this is useful to you, but hoping that maybe some of it is.


@renbuarl commented on GitHub (Jul 31, 2024):

> Hi @renbuarl, I think the problem there is your massive context length.

Great advice to use the '--flash-attn' option.

~/llama.cpp/llama-server -v -m /home/user/backups/models/70/Qwen2-72B-Instruct-Q4_K_M.gguf -c 65536 --host '192.168.0.5' --port 8081 -ngl 99 --flash-attn

Maximum VRAM consumption is 68.88 GB with a real context of 32k, and there is no 'CUDA error: out of memory'.


@renbuarl commented on GitHub (Jul 31, 2024):

> The bug here is likely we're not properly adjusting the prediction for the large context size.

What do we have?

When launching without the --flash-attn option for llama-server:

~/llama.cpp/llama-server -v -m /home/user/backups/models/70/Qwen2-72B-Instruct-Q4_K_M.gguf -c 32768 --host '192.168.0.5' --port 8081 -ngl 99
The average VRAM consumption is 68.40 GB, but we crash with 'CUDA error: out of memory' at a relatively small actual context.

When launching with the --flash-attn option for llama-server, it works perfectly:
~/llama.cpp/llama-server -v -m /home/user/backups/models/70/Qwen2-72B-Instruct-Q4_K_M.gguf -c 32768 --host '192.168.0.5' --port 8081 -ngl 99 --flash-attn
The average VRAM consumption is 58.56 GB.
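For readers who want the same workaround on the Ollama side rather than raw llama-server, Ollama exposes flash attention as an environment variable on the server process (assuming a version recent enough to support it):

```
# Enable flash attention in Ollama's bundled llama.cpp runner.
OLLAMA_FLASH_ATTENTION=1 ollama serve
```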


@ott2 commented on GitHub (Aug 1, 2024):

Specifying --no-kv-offload bypasses this error for me, even with the default 128K context. Otherwise using context -c 70689 or larger results in an out of memory error.

Background:

Totally different setup here (M1 Mac, 32GB RAM), but I'm also seeing repeatable out-of-memory failures with this model. They happen when the context size goes beyond a threshold value.

In my case

./llama-cli -m models/Meta-Llama-3.1-8B-Instruct-Q8_0.gguf -c 70688 -p sing

works fine, but changing that to -c 70689 results in

ggml_metal_graph_compute: command buffer 0 failed with status 5
error: Insufficient Memory (00000008:kIOGPUCommandBufferCallbackErrorOutOfMemory)

With a word-level diff, the only significant difference I can see in the logs is the pair of lines:

llama_kv_cache_init:      Metal KV buffer size =  8836.00 MiB
llama_new_context_with_model:      Metal compute buffer size =  4588.07 MiB

versus

llama_kv_cache_init:      Metal KV buffer size =    8840.00 MiB
llama_new_context_with_model:      Metal compute buffer size =  4590.13 MiB

Increasing the context window increases the Metal KV buffer size until it hits the mysterious maximum value of 8836 MiB in my case (a bit more than 8GB). Specifying --no-kv-offload seems to effectively switch off using the GPU, but at least allows inference to go ahead.
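Those two buffer sizes are consistent with simple KV arithmetic for this model (32 layers, 8 KV heads, head dim 128, fp16 K/V, i.e. 128 KiB per token), assuming llama.cpp pads the requested context up to a multiple of 32, as the logged sizes suggest:

```
echo $(( 131072 * 70688 / 1024 / 1024 ))   # 8836 MiB -- matches the -c 70688 run
echo $(( 131072 * 70720 / 1024 / 1024 ))   # 8840 MiB -- 70689 padded up to 70720
```

So the step from 8836 to 8840 MiB is just the next padding increment; the ceiling itself presumably reflects whatever per-buffer or working-set limit Metal enforces on this device.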


@DevElCuy commented on GitHub (Oct 3, 2024):

> The bug here is likely we're not properly adjusting the prediction for the large context size.

I just learned today that memory allocation is tied to the context-size parameter. Any way to make it dynamic? We are talking about the MAX context size, so there's no need to allocate a lot of VRAM that we hardly ever use.
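Until something like that lands, a practical mitigation is to request only the context you actually need per call; a sketch against the standard Ollama REST API (the model tag and size are illustrative):

```
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.1",
  "prompt": "hello",
  "options": {"num_ctx": 8192}
}'
```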


@dhiltgen commented on GitHub (Oct 17, 2024):

@develCuy we're tracking dynamic context size management via #1005

Reference: github-starred/ollama#29475