[GH-ISSUE #5157] Update llama.cpp to support qwen2-57B-A14B pls #49762

Closed
opened 2026-04-28 12:52:53 -05:00 by GiteaMirror · 3 comments

Originally created by @CoreJa on GitHub (Jun 20, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/5157

Originally assigned to: @dhiltgen on GitHub.

What is the issue?

llama.cpp now supports the qwen2-57B-A14B MoE model via https://github.com/ggerganov/llama.cpp/pull/7835. Please update the llama.cpp submodule to a newer version. Otherwise, i-quants of this model (I tried IQ4_XS) fail with a `CUDA error: CUBLAS_STATUS_NOT_INITIALIZED`.
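
For anyone who can't wait for the official bump, here is a minimal sketch of rebuilding against a newer llama.cpp locally. Assumptions not confirmed by this thread: the submodule path is `llm/llama.cpp` (as in the ollama source tree around that time) and `go generate` still drives the native build.

```sh
# Sketch only: rebuild ollama against a llama.cpp commit that includes
# ggerganov/llama.cpp#7835 (qwen2-57B-A14B MoE support).
git clone https://github.com/ollama/ollama.git
cd ollama
git submodule update --init llm/llama.cpp   # assumed submodule path
(cd llm/llama.cpp && git fetch origin && git checkout master)
go generate ./...   # regenerates the bundled llama.cpp build
go build .
```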

OS

Linux

GPU

Nvidia

CPU

Intel

Ollama version

v0.1.44

GiteaMirror added the bug label 2026-04-28 12:52:53 -05:00

@dhiltgen commented on GitHub (Jun 20, 2024):

This landed just after we updated for 0.1.45. We'll pick it up in our next update, ~next week.


@CoreJa commented on GitHub (Jun 21, 2024):

Thanks for your quick response; I'll close this issue once the release is out. Appreciate your help!


@bunnyfu commented on GitHub (Jun 23, 2024):

Also something to keep in mind with the Qwen2 57B MoE model: it breaks if you raise the batch size above 256, and Ollama defaults to batches of 512. I have been manually setting `--ubatch-size 256` in llama.cpp to avoid this error on some configs (Macs): `GGML_ASSERT: ggml-metal.m:1857: dst_rows <= 2048`

More info here: https://github.com/ggerganov/llama.cpp/issues/7652
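
Until the default changes, one possible mitigation on the Ollama side is capping the batch per model via the `num_batch` option. This is a hedged sketch: the model tag below is a placeholder, and whether `num_batch` also constrains the micro-batch that trips the Metal assert is an assumption I haven't verified.

```sh
# Sketch only: cap the batch at 256 for this model via a Modelfile.
cat > Modelfile <<'EOF'
# Placeholder tag; substitute whichever qwen2-57B-A14B quant you pulled.
FROM qwen2:57b-a14b-instruct-q4_0
PARAMETER num_batch 256
EOF
ollama create qwen2-57b-b256 -f Modelfile
ollama run qwen2-57b-b256
```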
