[GH-ISSUE #12571] Beepo-22B model not working in 0.12.5 #8339

Closed
opened 2026-04-12 20:55:32 -05:00 by GiteaMirror · 8 comments
Owner

Originally created by @twhlai on GitHub (Oct 11, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/12571

What is the issue?

When I try to run the Hugging Face model hf.co/mradermacher/Beepo-22B-i1-GGUF:IQ3_S on 0.12.5, I get the following error:
`Error: 500 Internal Server Error: llama runner process has terminated: exit status 2`
The model used to work on 0.12.3.
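
A minimal reproduction sketch, assuming the model is run via the CLI with the tag given above:

```shell
# Pull and run the IQ3_S quant directly from Hugging Face; on 0.12.5 this
# exits with "llama runner process has terminated: exit status 2".
ollama run hf.co/mradermacher/Beepo-22B-i1-GGUF:IQ3_S
```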

Relevant log output


OS

Windows

GPU

AMD, Nvidia

CPU

AMD

Ollama version

0.12.5

GiteaMirror added the bug label 2026-04-12 20:55:32 -05:00

@rick-github commented on GitHub (Oct 11, 2025):

A [server log](https://github.com/ollama/ollama/blob/main/docs/troubleshooting.md#how-to-troubleshoot-issues) will help in debugging.
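
For context, a hedged sketch of where to find it on Windows (per the linked troubleshooting guide; the exact path may differ depending on how Ollama was installed):

```shell
# From a Windows cmd prompt: open the folder that contains server.log.
explorer %LOCALAPPDATA%\Ollama
```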


@twhlai commented on GitHub (Oct 11, 2025):

Here is the [server.log](https://github.com/user-attachments/files/22867062/server.log).


@rick-github commented on GitHub (Oct 11, 2025):

```
time=2025-10-11T17:26:59.685-04:00 level=INFO source=server.go:545 msg=offload library=CUDA layers.requested=-1
  layers.model=57 layers.offload=54 layers.split=[54] memory.available="[11.8 GiB]" memory.gpu_overhead="0 B"
  memory.required.full="12.2 GiB" memory.required.partial="11.6 GiB" memory.required.kv="1.8 GiB"
  memory.required.allocations="[11.6 GiB]" memory.weights.total="8.9 GiB" memory.weights.repeating="8.8 GiB"
  memory.weights.nonrepeating="157.5 MiB" memory.graph.full="832.0 MiB" memory.graph.partial="860.3 MiB"
...
graph_reserve: failed to allocate compute buffers
llama_init_from_model: failed to initialize the context: failed to allocate compute pp buffers
panic: unable to create llama context
```

The runner tried to use more memory than was available. The server estimated that it needed 11.6 GiB with 11.8 GiB available, but when it came time to allocate the memory, there was not enough. Comparing this with a log from 0.12.3 should reveal whether 0.12.5 was over-optimistic, or whether the allocation itself was less efficient. In the meantime, you can try some of the mitigations shown [here](https://github.com/ollama/ollama/issues/8597#issuecomment-2614533288).
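
A hedged sketch of the kind of mitigations these usually are, namely leaving the GPU more headroom (the exact steps in the linked comment may differ; 4096 is an illustrative context size, not a recommendation):

```shell
# Start the server with flash attention enabled, which can reduce the memory
# needed for attention buffers.
OLLAMA_FLASH_ATTENTION=1 ollama serve

# In another terminal, load the model with a smaller context window so the
# KV cache and compute buffers need less VRAM.
ollama run hf.co/mradermacher/Beepo-22B-i1-GGUF:IQ3_S
>>> /set parameter num_ctx 4096
```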


@twhlai commented on GitHub (Oct 12, 2025):

Thanks. I installed 0.12.3 and got [server_0_12_3.log](https://github.com/user-attachments/files/22867833/server_0_12_3.log).


@rick-github commented on GitHub (Oct 12, 2025):

```
time=2025-10-11T20:24:48.456-04:00 level=INFO source=server.go:544 msg=offload library=cuda layers.requested=-1
  layers.model=57 layers.offload=50 layers.split=[50] memory.available="[11.0 GiB]" memory.gpu_overhead="0 B"
  memory.required.full="12.2 GiB" memory.required.partial="10.9 GiB" memory.required.kv="1.8 GiB"
  memory.required.allocations="[10.9 GiB]" memory.weights.total="8.9 GiB" memory.weights.repeating="8.8 GiB"
  memory.weights.nonrepeating="157.5 MiB" memory.graph.full="832.0 MiB" memory.graph.partial="860.3 MiB"
```

0.12.3 only tried to offload 50 layers, compared with 54 in 0.12.5, so 0.12.5 was too optimistic. There's another ticket about memory issues in 0.12.5, #12579, which might be related.
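
As a stop-gap, and assuming 0.12.3's choice of 50 offloaded layers is safe on this GPU, that layer count can be pinned explicitly instead of relying on the estimate; a sketch using the generate API (the prompt is just a placeholder):

```shell
# Ask for at most 50 GPU-offloaded layers, matching what 0.12.3 picked.
curl http://localhost:11434/api/generate -d '{
  "model": "hf.co/mradermacher/Beepo-22B-i1-GGUF:IQ3_S",
  "prompt": "Hello",
  "options": { "num_gpu": 50 }
}'
```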


@jessegross commented on GitHub (Oct 14, 2025):

This model uses the llama architecture, so it runs on the old engine by default and therefore uses the old memory estimates, which are known not to be 100% accurate. You could try running it on the Ollama engine by setting OLLAMA_NEW_ENGINE=1, which should be more accurate about memory usage.
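
A minimal sketch of applying that, assuming the server is started from a terminal rather than the tray app:

```shell
# Start the server with the new engine enabled.
OLLAMA_NEW_ENGINE=1 ollama serve

# Windows PowerShell equivalent:
#   $env:OLLAMA_NEW_ENGINE = "1"; ollama serve
```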

As far as the differences between versions, there are probably 2 factors:

- The log from 0.12.3 shows less available VRAM, which is why fewer layers were offloaded. Depending on how layers pack on the GPU, this might also leave more slack space for any mis-estimations.
- As mentioned in #12579, there are cases where we might crash at longer context lengths during inference, after the model has loaded. The problem and fix affect both the old and new engines. Since that causes more memory to be allocated at startup, and the old engine is less robust to that, it might trigger the crash here. In this case the context length is shorter than in that bug, so the effect is less dramatic, but OLLAMA_FLASH_ATTENTION=1 might help somewhat as well.

@twhlai commented on GitHub (Oct 14, 2025):

Beepo-22B does run once I set OLLAMA_NEW_ENGINE=1. Thanks!


@bhse610 commented on GitHub (Oct 16, 2025):

`OLLAMA_NEW_ENGINE=1 ollama serve` solves the issue. Thanks.


Reference: github-starred/ollama#8339