[GH-ISSUE #5741] ROCm Memory Issues with Long Contexts #3575

Open
opened 2026-04-12 14:18:23 -05:00 by GiteaMirror · 3 comments

Originally created by @ProjectMoon on GitHub (Jul 17, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/5741

What is the issue?

Similar (?) to #1952. I've noticed that ollama crashes when using long context lengths on ROCm. Most noticeably, I can carry on a large conversation with ollama from the start, as long as the model stays loaded in memory. But after coming back later, when the model needs to reload, it cannot process the accumulated context.

Here is the message I posted in #1952.


Would like to prod this issue again, as I am still seeing this with GLM4 at a 65k context size. It loads fine without much context, but has issues loading larger contexts. I even set the context size down to 8k o_O.

Important bits:

  • It looks like GPU VRAM hits 100% but then can't spill over into system memory for larger contexts. `rocm-smi` shows VRAM going 98%... 99%... 100%, then a crash.
  • Forcing GPU layers down to 15 of 41, disabling mmap, and setting `num_batch` to 256 for GLM 4 makes VRAM hover around 35% with an 8k context size (a sketch of passing these settings through the API follows this list).
  • Leaving mmap disabled and `num_batch` at 256, and letting it load all 41 GPU layers into memory, uses 69% VRAM.
  • Setting `num_ctx` to 60,000 will still make it try to load all layers onto the GPU, and then it crashes because it runs out of VRAM.
  • Moving `num_gpu` down to 30 or even 20 allows it to load more context, but this only delays the inevitable: a long enough context still means a crash.
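
For anyone wanting to reproduce these settings, here is a minimal sketch of passing them as per-request options to ollama's `/api/generate` endpoint. The option keys (`num_ctx`, `num_gpu`, `num_batch`, `use_mmap`) are standard ollama request options; the model tag and values are illustrative, not a recommendation.

```python
import json
import urllib.request

# Minimal sketch (illustrative values, not a recommendation): pass the
# workaround settings from the list above as per-request options to
# ollama's /api/generate endpoint.
payload = {
    "model": "glm4",        # example model tag
    "prompt": "Hello",
    "stream": False,
    "options": {
        "num_ctx": 8192,    # context length
        "num_gpu": 15,      # offload only 15 of the 41 layers to the GPU
        "num_batch": 256,   # smaller batch to shrink compute-buffer VRAM
        "use_mmap": False,  # disable mmap, as described above
    },
}

req = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read()).get("response"))
```

The same values can also be baked into a Modelfile with `PARAMETER` lines, but per-request options are easier to vary while experimenting.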

Shouldn't ollama be calculating that it needs to load fewer layers onto the GPU in this situation? I can adjust it manually, but if ollama receives a `num_ctx` that will make the model crash, shouldn't it start using system RAM instead?
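
For reference, the kind of accounting this question asks for looks roughly like the sketch below. This is a hedged back-of-the-envelope model, not ollama's actual estimator (which accounts for more buffers than this); every model dimension here is an assumption you would substitute per model.

```python
def layers_that_fit(
    free_vram_bytes: int,
    n_layers: int,            # total transformer layers (41 for this GLM4 build)
    layer_weight_bytes: int,  # average weight size of one layer, after quantization
    num_ctx: int,             # requested context length
    n_kv_heads: int,          # KV attention heads -- assumed, model-specific
    head_dim: int,            # dimension per head -- assumed, model-specific
    kv_bytes_per_elem: int = 2,     # f16 KV cache
    overhead_bytes: int = 1 << 30,  # compute/graph buffers, very rough guess
) -> int:
    """Back-of-the-envelope: how many layers fit once the per-layer KV
    cache for num_ctx tokens is charged against VRAM alongside the weights.

    Per layer, the KV cache holds keys and values:
        2 * num_ctx * n_kv_heads * head_dim * kv_bytes_per_elem
    """
    kv_per_layer = 2 * num_ctx * n_kv_heads * head_dim * kv_bytes_per_elem
    per_layer = layer_weight_bytes + kv_per_layer
    budget = free_vram_bytes - overhead_bytes
    return max(0, min(n_layers, budget // per_layer))
```

The reports in this thread are consistent with something like `kv_per_layer` being under-counted on ROCm, so the estimator offloads more layers than actually fit once the context fills.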

OS

Linux

GPU

AMD

CPU

AMD

Ollama version

0.2.5

GiteaMirror added the bug label 2026-04-12 14:18:24 -05:00

@ProjectMoon commented on GitHub (Jul 17, 2024):

Associated with this: I am running `gemma2:27b`. Before the latest version, it ran at around 5 tokens/second with 38 of 47 layers on the GPU. But now, if it tries to load 38 layers onto the GPU, ollama runs out of VRAM, even with no context (i.e. a new conversation). I had to dial `num_gpu` down to 30 layers to get it to run.


@ProjectMoon commented on GitHub (Jul 22, 2024):

For Deepseek Chat Lite, I set the context length to 128k. ollama will load only 5 GPU layers due to the large context size. However, once it begins processing longer inputs, it still runs out of VRAM on the GPU. I set `num_gpu` down to 1 layer, and it was stable, using 81% of my 16 GB of VRAM. So maybe this points to the algorithm being too optimistic about how much VRAM a large context will use under ROCm?
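
To put numbers on why a 128k context can overwhelm 16 GB even with very few layers offloaded, here is illustrative arithmetic using the per-layer KV-cache formula from the sketch above, with assumed dimensions for a generic grouped-query-attention model. These are not Deepseek Chat Lite's real dimensions (it uses MLA, which stores a much smaller compressed KV cache), so treat this purely as an order-of-magnitude sketch:

```python
# Illustrative only: assumed dimensions for a generic GQA model, NOT
# Deepseek Chat Lite's actual architecture (which uses MLA and keeps a
# much smaller compressed KV cache).
num_ctx    = 131072  # 128k context
n_kv_heads = 8
head_dim   = 128
bytes_f16  = 2

kv_per_layer = 2 * num_ctx * n_kv_heads * head_dim * bytes_f16
print(kv_per_layer / 2**20, "MiB of KV cache per layer")       # 512.0 MiB
print(5 * kv_per_layer / 2**30, "GiB for 5 offloaded layers")  # 2.5 GiB
```

If the KV cache is also allocated lazily as the prompt grows, rather than reserved up front, usage would climb only once long inputs arrive, which matches the behavior described here.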


@fgsfds1 commented on GitHub (Dec 19, 2024):

I also have this problem: long contexts make ollama OOM on ROCm.

https://github.com/ollama/ollama/pull/5922 is a hacky workaround: configuring it for a long context will probably reduce the number of layers offloaded to the GPU, which is fine when the context is long and needs somewhere to go, but working with a shorter context right afterwards then limits the VRAM used for no reason, making performance worse.

From what I've seen of memory usage, the context length isn't accounted for at all on ROCm when loading the model. For example, on my 16 GB of VRAM, with `OLLAMA_GPU_OVERHEAD` set to 3G, the model loads right up until ~13 GB is used, and then usage slowly rises to 16 GB and it OOMs.
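
A small sketch for capturing this behavior: poll `rocm-smi` once a second while a long-context request runs and log VRAM usage, so the slow climb to OOM shows up in the log. The JSON field names below are what recent `rocm-smi` releases emit and may differ on your ROCm version; treat this as a diagnostic aid, not part of ollama.

```python
import json
import subprocess
import time

# Hedged diagnostic: poll rocm-smi and log VRAM usage once a second.
# Run this alongside a long-context request to capture the slow climb
# to OOM described above. JSON key names may vary by ROCm version.
while True:
    out = subprocess.run(
        ["rocm-smi", "--showmeminfo", "vram", "--json"],
        capture_output=True, text=True, check=True,
    ).stdout
    for card, fields in json.loads(out).items():
        used = next((v for k, v in fields.items() if "Used" in k), "?")
        total = next(
            (v for k, v in fields.items() if "Total Memory" in k and "Used" not in k),
            "?",
        )
        print(f"{time.strftime('%H:%M:%S')} {card}: used={used} B, total={total} B")
    time.sleep(1)
```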


Reference: github-starred/ollama#3575