[GH-ISSUE #14401] glm-ocr model crashes with GGML_ASSERT – rope dimension metadata missing? #55867

Closed
opened 2026-04-29 09:49:42 -05:00 by GiteaMirror · 2 comments

Originally created by @hapm on GitHub (Feb 24, 2026).
Original GitHub issue: https://github.com/ollama/ollama/issues/14401

What is the issue?

Hello Ollama team,

I’m investigating the prebuilt GLM-OCR model in Ollama 0.17.0 and noticed that it crashes during load with:

ggml.c:4081: GGML_ASSERT(a->ne[2] * 4 == b->ne[0]) failed

Analysis by OpenAI ChatGPT (the furthest I could get with my own knowledge) suggests this may be due to a mismatch in rotary positional embedding (RoPE) dimensions:

  • GLM-OCR has hidden_size = 1536 and num_heads = 16, so each head should have a rope dimension of 96.

  • Ollama’s loader appears to default to 128 because the prebuilt model metadata lacks glmocr.rope.dimension_count.

  • GGML expects either:

    • a->ne[2] == b->ne[0] (single RoPE per token), or
    • a->ne[2] * 4 == b->ne[0] (multi-RoPE / YaRN mode),

    so the mismatch triggers the assertion (a sketch of the arithmetic follows this list).
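
To make the hypothesized mismatch concrete, here is a minimal sketch of the arithmetic. This is a hypothetical reconstruction: the config values come from the list above, and b->ne[0] is assumed to carry four position sections per token, per the multi-RoPE condition in the assert text.

```python
# Sketch of the hypothesized dimension mismatch. hidden_size and
# num_heads are the GLM-OCR config values quoted above; the accepted
# conditions mirror the assert text in the crash, not GGML's source.
hidden_size = 1536
num_heads = 16
head_dim = hidden_size // num_heads   # 96: expected per-head rope dimension

b_ne0 = head_dim * 4                  # stand-in for b->ne[0], assuming
                                      # 4 position sections per token

for rope_dim in (head_dim, 128):      # 128 = the loader's alleged default
    a_ne2 = rope_dim                  # stand-in for a->ne[2]
    ok = (a_ne2 == b_ne0) or (a_ne2 * 4 == b_ne0)
    print(f"rope_dim={rope_dim}: assert {'passes' if ok else 'FAILS'}")
```

With rope_dim = 96 the multi-RoPE condition holds (96 * 4 == 384); with the defaulted 128, neither condition does, matching the reported assert.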

Question: Could the prebuilt GLM-OCR bundle be missing the correct rope dimension metadata? If so, would adding:

"glmocr.rope.dimension_count": 96

to the model manifest or GGUF metadata resolve this, or is there another reason GGML asserts in this case?
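
For what it's worth, here is a sketch of how one could check whether the key is present in the bundled GGUF, using the `gguf` Python package from llama.cpp's gguf-py. The file path is illustrative, and `glmocr.rope.dimension_count` is the key name hypothesized above, not a confirmed one.

```python
# Sketch: dump RoPE-related metadata keys from a GGUF file using the
# `gguf` package (pip install gguf). Point the path at the model blob
# in Ollama's store; "glm-ocr.gguf" here is just a placeholder.
from gguf import GGUFReader

reader = GGUFReader("glm-ocr.gguf")
for name, field in reader.fields.items():
    if "rope" in name:
        # For scalar fields, the last payload part holds the value.
        print(name, field.parts[-1])
```

If no `*.rope.dimension_count` key shows up in the output, that would support the missing-metadata theory.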

References:

  • llama.cpp GLM-OCR support PR (Feb 2026): https://github.com/ggml-org/llama.cpp/pull/51
  • ChatGPT analysis and investigation: https://chatgpt.com/share/699e291d-bf78-800f-8ed4-f468705ed3d4

Relevant log output

time=2026-02-24T20:44:09.164+01:00 level=INFO source=ggml.go:104 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.BMI2=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 CUDA.0.ARCHS=750,800,860,870,890,900,1000,1030,1100,1200,1210 CUDA.0.USE_GRAPHS=1 CUDA.0.PEER_MAX_BATCH_SIZE=128 compiler=cgo(clang)
time=2026-02-24T20:44:10.006+01:00 level=INFO source=runner.go:1284 msg=load request="{Operation:alloc LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:Enabled KvSize:7000 KvCacheType: NumThreads:8 GPULayers:17[ID:GPU-a320ee8a-1774-feab-4f66-466ab92102d0 Layers:17(0..16)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2026-02-24T20:44:10.669+01:00 level=INFO source=runner.go:1284 msg=load request="{Operation:commit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:Enabled KvSize:7000 KvCacheType: NumThreads:8 GPULayers:17[ID:GPU-a320ee8a-1774-feab-4f66-466ab92102d0 Layers:17(0..16)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}" 
time=2026-02-24T20:44:10.669+01:00 level=INFO source=ggml.go:482 msg="offloading 16 repeating layers to GPU"
time=2026-02-24T20:44:10.669+01:00 level=INFO source=ggml.go:489 msg="offloading output layer to GPU" 
time=2026-02-24T20:44:10.669+01:00 level=INFO source=ggml.go:494 msg="offloaded 17/17 layers to GPU" 
time=2026-02-24T20:44:10.669+01:00 level=INFO source=device.go:240 msg="model weights" device=CUDA0 size="1.9 GiB" 
time=2026-02-24T20:44:10.669+01:00 level=INFO source=device.go:245 msg="model weights" device=CPU size="174.0 MiB" 
time=2026-02-24T20:44:10.669+01:00 level=INFO source=device.go:251 msg="kv cache" device=CUDA0 size="448.0 MiB" 
time=2026-02-24T20:44:10.669+01:00 level=INFO source=device.go:262 msg="compute graph" device=CUDA0 size="1.9 GiB"
time=2026-02-24T20:44:10.669+01:00 level=INFO source=device.go:267 msg="compute graph" device=CPU size="31.2 MiB" 
time=2026-02-24T20:44:10.669+01:00 level=INFO source=device.go:272 msg="total memory" size="4.5 GiB" 
time=2026-02-24T20:44:10.669+01:00 level=INFO source=sched.go:566 msg="loaded runners" count=1 
time=2026-02-24T20:44:10.669+01:00 level=INFO source=server.go:1350 msg="waiting for llama runner to start responding" 
time=2026-02-24T20:44:10.670+01:00 level=INFO source=server.go:1384 msg="waiting for server to become available" status="llm server loading model" 
time=2026-02-24T20:44:11.171+01:00 level=INFO source=server.go:1388 msg="llama runner started in 2.19 seconds" 
ggml.c:4081: GGML_ASSERT(a->ne[2] * 4 == b->ne[0]) failed

OS

Windows 10

GPU

NVIDIA GeForce RTX 2080 Ti

CPU

Intel Core i9 9900K

Ollama version

0.17.0

GiteaMirror added the bug label 2026-04-29 09:49:42 -05:00

@rick-github commented on GitHub (Feb 24, 2026):

Perhaps #14171. What happens if you set context size to 8192?
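
For anyone else hitting this assert, a minimal way to try the suggested context size without rebuilding the model is to override it per request through Ollama's REST API (the model tag below is illustrative):

```python
# Sketch: request an 8192-token context via Ollama's /api/generate.
# The tag "glm-ocr" is a placeholder; substitute the model you pulled.
import requests

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "glm-ocr",
        "prompt": "Describe this document.",
        "options": {"num_ctx": 8192},  # the context size suggested above
        "stream": False,
    },
    timeout=300,
)
print(resp.json()["response"])
```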


@hapm commented on GitHub (Feb 25, 2026):

That's it. Thanks for ignoring all the missleading ai stuff. Should have tried that first. Context size was configured to 7000, works well with 8192.
