[GH-ISSUE #13480] Fail to generate when using unsloth/Qwen3-VL-4B-Instruct-GGUF #34653

Open
opened 2026-04-22 18:23:47 -05:00 by GiteaMirror · 13 comments

Originally created by @cvrunmin on GitHub (Dec 15, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/13480

What is the issue?

When using unsloth/Qwen3-VL-4B-Instruct-GGUF (and also unsloth/Qwen3-VL-8B-Instruct-GGUF), Ollama can pull the model and load it into memory, but as soon as the user sends it a prompt, it responds with:

Error: 500 Internal Server Error: model runner has unexpectedly stopped, this may be due to resource limitations or an internal error, check ollama server logs for details

When I check the server log, there is a noticeable error:

llama-context.cpp:1186: GGML_ASSERT((n_outputs_prev + n_outputs)*n_embd <= (int64_t) embd_size) failed
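
For readers who haven't hit this assert before: it checks that the rows of per-token output embeddings written across micro-batches fit in the buffer llama.cpp reserved when the batch was set up. Below is a minimal sketch of that invariant; this is a simplified illustration with made-up names, not the actual llama.cpp implementation.

```cpp
// Simplified sketch of the invariant behind the failing GGML_ASSERT.
// NOT the actual llama.cpp code; struct and function names are invented.
#include <cassert>
#include <cstdint>
#include <vector>

struct output_buffer {
    int64_t n_embd;              // embedding width per output row
    std::vector<float> embd;     // reserved up front for n_outputs_max rows
    int64_t n_outputs_prev = 0;  // rows already written by earlier micro-batches

    void write_outputs(int64_t n_outputs) {
        const int64_t embd_size = (int64_t) embd.size();
        // The invariant asserted in llama-context.cpp: everything written so
        // far must fit in the buffer reserved when the batch was set up.
        assert((n_outputs_prev + n_outputs) * n_embd <= embd_size);
        // ... copy n_outputs * n_embd floats into embd here ...
        n_outputs_prev += n_outputs;
    }
};

int main() {
    // Reserve room for 2 output rows of width 4.
    output_buffer buf{4, std::vector<float>(2 * 4)};
    buf.write_outputs(1);     // ok: 1 of 2 rows used
    buf.write_outputs(1);     // ok: 2 of 2 rows used
    // buf.write_outputs(1); // a third row would trip the assert, as in this issue
    return 0;
}
```

If the runner tries to write more output rows than were reserved, the assert aborts the process, which is why Ollama then reports that the runner "unexpectedly stopped".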

Relevant log output

clip_ctx: CLIP using CPU backend
load_hparams: Qwen-VL models require at minimum 1024 image tokens to function correctly on grounding tasks
load_hparams: if you encounter problems with accuracy, try adding --image-min-tokens 1024
load_hparams: more info: https://github.com/ggml-org/llama.cpp/issues/16842

load_hparams: projector:          qwen3vl_merger
load_hparams: n_embd:             1024
load_hparams: n_head:             16
load_hparams: n_ff:               4096
load_hparams: n_layer:            24
load_hparams: ffn_op:             gelu
load_hparams: projection_dim:     2560

--- vision hparams ---
load_hparams: image_size:         768
load_hparams: patch_size:         16
load_hparams: has_llava_proj:     0
load_hparams: minicpmv_version:   0
load_hparams: n_merge:            2
load_hparams: n_wa_pattern:       0
load_hparams: image_min_pixels:   8192
load_hparams: image_max_pixels:   4194304

load_hparams: model size:         797.43 MiB
load_hparams: metadata size:      0.11 MiB
load_tensors: loaded 316 tensors from /usr/share/ollama/.ollama/models/blobs/sha256-1b9f4e92f0fbda14d7d7b58baed86039b8a980fe503d9d6a9393f25c0028f1fc
alloc_compute_meta: warmup with image size = 1472 x 1472
alloc_compute_meta:        CPU compute buffer size =   322.49 MiB
alloc_compute_meta: graph splits = 1, nodes = 766
warmup: flash attention is enabled
time=2025-12-15T17:08:25.275+08:00 level=INFO source=server.go:1332 msg="llama runner started in 2.77 seconds"
time=2025-12-15T17:08:25.275+08:00 level=INFO source=sched.go:517 msg="loaded runners" count=1
time=2025-12-15T17:08:25.275+08:00 level=INFO source=server.go:1294 msg="waiting for llama runner to start responding"
time=2025-12-15T17:08:25.275+08:00 level=INFO source=server.go:1332 msg="llama runner started in 2.77 seconds"
[GIN] 2025/12/15 - 17:08:25 | 200 |   3.00066843s |       127.0.0.1 | POST     "/api/generate"
llama-context.cpp:1186: GGML_ASSERT((n_outputs_prev + n_outputs)*n_embd <= (int64_t) embd_size) failed
[New LWP 3124676]
[New LWP 3124677]
[New LWP 3124678]
[New LWP 3124679]
[New LWP 3124680]
[New LWP 3124681]
[New LWP 3124682]
[New LWP 3124683]

OS

Linux

GPU

Nvidia

CPU

Intel

Ollama version

0.13.2, 0.13.3

GiteaMirror added the bug label 2026-04-22 18:23:47 -05:00

@iosub commented on GitHub (Dec 15, 2025):

You may want to try https://github.com/ollama/ollama/pull/13456

@rick-github commented on GitHub (Dec 15, 2025):

The model loads so #13456 is not relevant.

@iosub commented on GitHub (Dec 15, 2025):

> The model loads so #13456 is not relevant.

Thanks for the reply, but I need to be very clear on a key point: “the model is loading” is not the same as “split models are supported.”

Right now, main can appear to load a split model (you’ll see progress / memory activity), but inference will not work correctly because the runtime does not actually support the split/sharded layout end-to-end. In other words: loading succeeds, execution does not.

At the moment, the only way to correctly support split models in Ollama is via one of the PRs that adds the missing split handling logic:

https://github.com/ollama/ollama/pull/13456
https://github.com/ollama/ollama/pull/13306
If you believe split models already work on main without these changes, please provide a minimal reproducible example (exact model files/layout, exact command, and the full logs). Otherwise, stating “you don’t need this PR because the model is loading” is misleading for users, because it implies functionality that isn’t actually there.

I’m happy to adjust the PRs if there’s a preferred approach, but we need to align on the fact that true split support requires code changes, not just a successful load step.

@rick-github commented on GitHub (Dec 15, 2025):

The model is not split.

@iosub commented on GitHub (Dec 15, 2025):

> The model is not split.

If we are talking about this model, https://huggingface.co/unsloth/Qwen3-VL-4B-Instruct-GGUF

As far as I know, it's split.

@rick-github commented on GitHub (Dec 15, 2025):

My apologies, I thought "split" was referring to the HF practice of models being partitioned into multiple smaller GGUF files, not the HF practice of dividing a model into text and vision weights. Does the model load with your modified ollama?

@iosub commented on GitHub (Dec 15, 2025):

> My apologies, I thought "split" was referring to the HF practice of models being partitioned into multiple smaller GGUF files, not the HF practice of dividing a model into text and vision weights. Does the model load with your modified ollama?

Yes, with both options: the llama engine and the new Ollama engine.

@iosub commented on GitHub (Dec 18, 2025):

As noted in https://github.com/ollama/ollama/pull/13456#issuecomment-3669326557, none of the PRs have been accepted.

A new split model has come out
https://huggingface.co/unsloth/GLM-4.6V-Flash-GGUF

@cvrunmin @rick-github

Because it is closer to llama.cpp and much easier to manage, I will keep updating my first PR, https://github.com/ollama/ollama/pull/13306, publicly in this repo, https://github.com/iosub/ollama/tree/feat/mrope-main, for the models that come out as split versions. But if I get a message from an Ollama maintainer asking me not to, I will make it private. I don't want any problems with the Ollama team; I respect their work.

thanks

@timothyleung1 commented on GitHub (Feb 14, 2026):

Having the same issue - using hf.co/Qwen/Qwen3-VL-32B-Instruct-GGUF:Q4_K_M on a 5070 Ti and a 4000 Pro RTX.

@elkay commented on GitHub (Mar 4, 2026):

Same problem here with multiple versions of hf.co/Qwen/Qwen3-VL-32B-Instruct. Oddly enough, the BF16 does work (it's the first one I tried, but it was slow). I'm running with 96 GB RAM and 96 GB VRAM. I was trying quantized models to improve speed, but Q8 ran out of memory and nearly crashed my machine, which is odd since BF16 worked. Q6: same thing, system RAM climbs until it blows up. Q5: same. Q4: now I'm just getting the crashes reported in this issue. Odd behavior indeed, since I've run much larger models than this without any issues, and as mentioned the BF16 did work, at least for a few text-only test prompts.

warmup: flash attention is enabled
time=2026-03-04T16:42:31.592-05:00 level=INFO source=server.go:1388 msg="llama runner started in 1.80 seconds"
time=2026-03-04T16:42:31.592-05:00 level=INFO source=sched.go:565 msg="loaded runners" count=1
time=2026-03-04T16:42:31.592-05:00 level=INFO source=server.go:1350 msg="waiting for llama runner to start responding"
time=2026-03-04T16:42:31.592-05:00 level=INFO source=server.go:1388 msg="llama runner started in 1.80 seconds"
llama-context.cpp:1238: GGML_ASSERT((n_outputs_prev + n_outputs)*n_embd <= (int64_t) embd_size) failed

Works:

https://ollama.com/library/qwen3-vl:32b-instruct-bf16

No workie:

https://huggingface.co/unsloth/Qwen3-VL-32B-Instruct-GGUF

@dougmaitelli commented on GitHub (Mar 4, 2026):

As mentioned in https://github.com/ollama/ollama/issues/14575#issuecomment-3989918451, this won't work in Ollama until the upstream fix is merged into llama.cpp.

@elkay commented on GitHub (Mar 4, 2026):

> As mentioned here: #14575 (comment) This won't work on Ollama until upstream fix is merged to Llama.cpp

Ah ok. Does the current Ollama-hosted BF16 one have the vision stripped and that's why it works?

@rick-github commented on GitHub (Mar 5, 2026):

The error in this issue (GGML_ASSERT((n_outputs_prev + n_outputs)*n_embd <= (int64_t) embd_size)) is not a result of the split model. The model loads fine; it just errors out with the assert when it starts inference.

Reference: github-starred/ollama#34653