[GH-ISSUE #4354] Models often don't load on versions after 0.1.132 #49228

Closed
opened 2026-04-28 10:57:32 -05:00 by GiteaMirror · 9 comments

Originally created by @ProjectMoon on GitHub (May 11, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/4354

Originally assigned to: @dhiltgen on GitHub.

What is the issue?

Many models, in particular codegemma 1.1 7b q8_0, fail to load for various reasons on versions after 0.1.132; everything works fine on 0.1.132 itself. I don't have the logs on hand at the moment, but can add them later. The errors relate to out-of-memory conditions and being unable to reset the GPU VRAM.

This is using ROCm (the ollama distribution of it, from the tar.gz) on an AMD RX 6800 XT.

Is there a centralized issue for this already?

OS

No response

GPU

AMD

CPU

AMD

Ollama version

0.1.133-0.1.136

GiteaMirror added the memory, bug labels 2026-04-28 10:57:32 -05:00

@coder543 commented on GitHub (May 11, 2024):

This was probably the main issue for this kind of thing: https://github.com/ollama/ollama/issues/1952#issuecomment-2105376333

I would probably leave a comment there too. Since you're on AMD, it's not actually related to CUDA, but it sounds like the same fundamental issue. Something in the calculation of how much VRAM the model will use on AMD is probably wrong.

You can manually lower the number of layers being offloaded to the GPU, either by creating a custom Modelfile and importing it, or by setting the right parameter on your requests. If you're using Open WebUI, for instance, it's very easy to change that parameter there. It's not convenient, but there are ways to work around the out-of-memory crashes.
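
For illustration, here's a minimal sketch of the request-parameter route, assuming a local Ollama server on the default port and the standard `/api/generate` options field; the model tag and layer count are made up for the example:

```python
# Minimal sketch: cap the number of offloaded layers per request so the
# allocation fits in VRAM. Assumes a local Ollama server on the default
# port; the model tag and num_gpu value are illustrative.
import requests

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "codegemma:7b-v1.1-q8_0",  # hypothetical tag
        "prompt": "Write a binary search in Python.",
        "stream": False,
        "options": {"num_gpu": 30},  # offload fewer layers than the default
    },
    timeout=600,
)
print(resp.json()["response"])
```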


@ProjectMoon commented on GitHub (May 13, 2024):

Here are the logs:

```
llama_kv_cache_init:      ROCm0 KV buffer size =  7182.00 MiB
llama_new_context_with_model: KV self size  = 7182.00 MiB, K (f16): 3591.00 MiB, V (f16): 3591.00 MiB
llama_new_context_with_model:  ROCm_Host  output buffer size =     1.98 MiB
ggml_backend_cuda_buffer_type_alloc_buffer: allocating 573.07 MiB on device 0: cudaMalloc failed: out of memory
ggml_gallocr_reserve_n: failed to allocate ROCm0 buffer of size 600903680
llama_new_context_with_model: failed to allocate compute buffers
llama_init_from_gpt_params: error: failed to create context with model '/mnt/tank/ai/models/ollama/blobs/sha256-039382c23c3693407458d9d014b661611c81a957ded71da627db6aaeee3fb1cc'
terminate called without an active exception
```

It seems to be related to either context size or max token size. Will add more info later.


@ProjectMoon commented on GitHub (May 13, 2024):

So, I can run the model with a smaller context size/number of tokens to generate, but not a larger one. I was trying with a context size of 8199 (the model forced it); the error above occurred, and then ollama started trying to load the model repeatedly.
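
For reference, this is roughly how an explicit context size gets requested; a sketch using the official ollama Python client, with an illustrative model tag and values mirroring the report (smaller sizes load, ~8k does not):

```python
# Sketch: request an explicit context window via the num_ctx option.
# Uses the official ollama-python client; the model tag is illustrative.
import ollama

resp = ollama.generate(
    model="codegemma:7b-v1.1-q8_0",  # hypothetical tag
    prompt="hello",
    options={"num_ctx": 4096},  # loads; 8192+ hits the OOM above on this setup
)
print(resp["response"])
```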


@ProjectMoon commented on GitHub (May 13, 2024):

A bit more testing: if I force the context size to 8192 on codegemma v1.1, it crashes with the above error. Leaving it at the defaults seems to work fine. Oddly enough, when loading the model with n_ctx 8192, it showed up as n_ctx 16384 in the ollama logs? o_O
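
The doubling also lines up with the KV numbers in the log above. A back-of-the-envelope check, assuming Gemma-7B-like attention shapes (28 layers, 16 KV heads, head dim 256; these figures are assumptions, not confirmed in this thread):

```python
# f16 KV-cache size for one of K or V is roughly:
#   n_layer * n_ctx * n_kv_heads * head_dim * 2 bytes
# With n_ctx = 16416 (2 * 8199 = 16398, presumably padded up to a
# multiple of 32), the K buffer matches the log's "K (f16): 3591.00 MiB".
n_layer, n_kv_heads, head_dim, bytes_f16 = 28, 16, 256, 2
n_ctx = 16416
k_mib = n_layer * n_ctx * n_kv_heads * head_dim * bytes_f16 / 2**20
print(k_mib)  # 3591.0
```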


@dhiltgen commented on GitHub (Jul 25, 2024):

Are you still seeing this behavior on the latest release? I was able to load codegemma with various context sizes on an RX 6800 without hitting OOM crashes.


@ProjectMoon commented on GitHub (Jul 25, 2024):

Not this exact issue, but I still run into problems, particularly with longer contexts: https://github.com/ollama/ollama/issues/5741

I actually had to implement a GPU-layer downscaling filter for Open WebUI that overrides num_gpu when ollama crashes, so I can continue conversations. 🤔
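
For anyone wanting the same stopgap, here's a hedged sketch of what such a filter can look like. This is not the commenter's actual code; it assumes Open WebUI's Filter function interface, where an inlet() hook can mutate the request body before it reaches Ollama, and the crash-handling helper is hypothetical:

```python
# Hedged sketch, NOT the actual filter from the comment above. Assumes
# Open WebUI's Filter interface: inlet() runs on each outgoing request.
class Filter:
    def __init__(self):
        self.num_gpu_override = None  # set after an observed Ollama crash

    def inlet(self, body: dict) -> dict:
        # Once a crash has been seen, force fewer offloaded layers.
        if self.num_gpu_override is not None:
            body.setdefault("options", {})["num_gpu"] = self.num_gpu_override
        return body

    def step_down(self, current_layers: int, step: int = 4) -> None:
        # Hypothetical helper: call on an OOM-style failure so the retry
        # offloads fewer GPU layers.
        self.num_gpu_override = max(0, current_layers - step)
```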


@dhiltgen commented on GitHub (Jul 26, 2024):

Sorry you're still hitting OOM challenges with large contexts. I've got a PR up to add back a workaround that should make it a little easier to mitigate these defects until we can get them properly fixed - see #5922


@ProjectMoon commented on GitHub (Jul 26, 2024):

That will certainly help. Do you want any logs or test contexts or anything to diagnose the underlying issues?


@pdevine commented on GitHub (Oct 16, 2024):

I'm going to close this since #5922 was merged, but we can reopen it if you're still hitting it.
