[GH-ISSUE #12821] gemma3:27b|gemma3:27b-it-qat failed in ollama 0.12.7-rc0 #34256

Closed
opened 2026-04-22 17:41:48 -05:00 by GiteaMirror · 4 comments

Originally created by @jay875037671 on GitHub (Oct 29, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/12821

What is the issue?

Those two models worked well in 0.12.6 (though qwen3-vl failed in 0.12.6, lol).
Crash: `compiler=cgo(clang) ggml.c:1921: GGML_ASSERT(ggml_can_repeat(b, a)) failed`
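
For context, `ggml_can_repeat(b, a)` is a broadcast-shape check: it asks whether tensor `b` can be repeated (tiled) to fill the shape of `a`. Below is a minimal self-contained sketch of that check, paraphrased from recent ggml sources; the exact code in Ollama's vendored ggml may differ, and the shapes used here are illustrative only, not the ones from this crash.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define DIMS 4  // ggml tensors always carry four dimension sizes (ne[0..3])

// Paraphrase of ggml_can_repeat: b can be repeated to a's shape iff every
// dimension of a is an exact multiple of the matching dimension of b.
// (Assumes non-empty tensors; ggml special-cases empty ones.)
static bool can_repeat(const int64_t b_ne[DIMS], const int64_t a_ne[DIMS]) {
    for (int i = 0; i < DIMS; i++) {
        if (a_ne[i] % b_ne[i] != 0) {
            return false;
        }
    }
    return true;
}

int main(void) {
    int64_t act[DIMS]  = {4096, 8, 1, 1};  // a batch of activations
    int64_t bias[DIMS] = {4096, 1, 1, 1};  // per-row bias: broadcastable
    int64_t bad[DIMS]  = {4095, 1, 1, 1};  // off-by-one shape: not broadcastable

    printf("bias -> act: %s\n", can_repeat(bias, act) ? "ok" : "assert would fire");
    printf("bad  -> act: %s\n", can_repeat(bad, act)  ? "ok" : "assert would fire");
    return 0;
}
```

Element-wise ops such as `ggml_add` assert this condition at graph-build time, which is why a shape mismatch introduced by a model-loading change can abort the whole runner before any inference happens.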

Log file:

```shell
[GIN] 2025/10/29 - 16:22:50 | 200 | 0s | 127.0.0.1 | HEAD "/"
[GIN] 2025/10/29 - 16:22:50 | 200 | 8.5009ms | 127.0.0.1 | GET "/api/tags"
[GIN] 2025/10/29 - 16:22:59 | 200 | 0s | 127.0.0.1 | HEAD "/"
[GIN] 2025/10/29 - 16:22:59 | 200 | 72.9136ms | 127.0.0.1 | POST "/api/show"
ggml_backend_cuda_device_get_memory device GPU-2fb823f9-0c5c-9291-49c5-2685c7450e9a utilizing NVML memory reporting free: 20149010432 total: 34190917632
time=2025-10-29T16:22:59.298+08:00 level=INFO source=cpu_windows.go:139 msg=packages count=1
time=2025-10-29T16:22:59.298+08:00 level=INFO source=cpu_windows.go:186 msg="" package=0 cores=8 efficiency=0 threads=16
time=2025-10-29T16:22:59.322+08:00 level=INFO source=sched.go:559 msg="updated VRAM based on existing loaded models" gpu=GPU-2fb823f9-0c5c-9291-49c5-2685c7450e9a library=CUDA total="31.8 GiB" available="18.8 GiB"
time=2025-10-29T16:22:59.399+08:00 level=INFO source=server.go:215 msg="enabling flash attention"
time=2025-10-29T16:22:59.422+08:00 level=INFO source=server.go:385 msg="starting runner" cmd="C:\Users\jayli\AppData\Local\Programs\Ollama\ollama.exe runner --ollama-engine --model G:\ollama\models\blobs\sha256-e796792eba26c4d3b04b0ac5adb01a453dd9ec2dfd83b6c59cbf6fe5f30b0f68 --port 7761"
time=2025-10-29T16:22:59.435+08:00 level=INFO source=server.go:638 msg="loading model" "model layers"=63 requested=-1
time=2025-10-29T16:22:59.435+08:00 level=INFO source=server.go:643 msg="system memory" total="63.6 GiB" free="46.1 GiB" free_swap="95.0 GiB"
time=2025-10-29T16:22:59.435+08:00 level=INFO source=server.go:650 msg="gpu memory" id=GPU-2fb823f9-0c5c-9291-49c5-2685c7450e9a library=CUDA available="18.3 GiB" free="18.8 GiB" minimum="457.0 MiB" overhead="0 B"
time=2025-10-29T16:22:59.478+08:00 level=INFO source=runner.go:1337 msg="starting ollama engine"
time=2025-10-29T16:22:59.498+08:00 level=INFO source=runner.go:1372 msg="Server listening on 127.0.0.1:7761"
time=2025-10-29T16:22:59.499+08:00 level=INFO source=runner.go:1210 msg=load request="{Operation:fit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:true KvSize:4096 KvCacheType: NumThreads:8 GPULayers:63[ID:GPU-2fb823f9-0c5c-9291-49c5-2685c7450e9a Layers:63(0..62)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2025-10-29T16:22:59.540+08:00 level=INFO source=ggml.go:135 msg="" architecture=gemma3 file_type=Q4_K_M name="" description="" num_tensors=1247 num_key_values=37
load_backend: loaded CPU backend from C:\Users\jayli\AppData\Local\Programs\Ollama\lib\ollama\ggml-cpu-icelake.dll
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 5090 D, compute capability 12.0, VMM: yes, ID: GPU-2fb823f9-0c5c-9291-49c5-2685c7450e9a
load_backend: loaded CUDA backend from C:\Users\jayli\AppData\Local\Programs\Ollama\lib\ollama\cuda_v13\ggml-cuda.dll
time=2025-10-29T16:22:59.632+08:00 level=INFO source=ggml.go:104 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.BMI2=1 CPU.0.AVX512=1 CPU.0.AVX512_VBMI=1 CPU.0.AVX512_VNNI=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 CUDA.0.ARCHS=750,800,860,870,890,900,1000,1030,1100,1200,1210 CUDA.0.USE_GRAPHS=1 CUDA.0.PEER_MAX_BATCH_SIZE=128 compiler=cgo(clang)
ggml.c:1921: GGML_ASSERT(ggml_can_repeat(b, a)) failed
time=2025-10-29T16:22:59.839+08:00 level=INFO source=sched.go:446 msg="Load failed" model=G:\ollama\models\blobs\sha256-e796792eba26c4d3b04b0ac5adb01a453dd9ec2dfd83b6c59cbf6fe5f30b0f68 error="do load request: Post \"http://127.0.0.1:7761/load\": read tcp 127.0.0.1:7769->127.0.0.1:7761: wsarecv: An existing connection was forcibly closed by the remote host."
[GIN] 2025/10/29 - 16:22:59 | 500 | 708.8267ms | 127.0.0.1 | POST "/api/generate"
time=2025-10-29T16:22:59.853+08:00 level=ERROR source=server.go:273 msg="llama runner terminated" error="exit status 0xc0000409"
```

OS

Windows

GPU

Nvidia

CPU

AMD

Ollama version

0.12.7-rc0

GiteaMirror added the bug label 2026-04-22 17:41:48 -05:00

@theuncivilizedarchive commented on GitHub (Oct 29, 2025):

This is a failure inside the ggml library: `GGML_ASSERT(ggml_can_repeat(b, a))` fired in `ggml.c` at line 1921. As a workaround, downgrade Ollama to a stable release or try CPU-only inference (e.g., by setting the `num_gpu` option to 0 so no layers are offloaded to the GPU).
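
To make the failure mode concrete: a hypothetical stand-alone repro against a plain ggml checkout (shapes invented for illustration, not taken from this model). `ggml_add(ctx, a, b)` asserts `ggml_can_repeat(b, a)` at graph-build time, so a bias whose length does not divide the activation row size aborts with the same message:

```c
// Hypothetical repro sketch; requires a ggml checkout to build and link.
#include "ggml.h"

int main(void) {
    struct ggml_init_params params = {
        .mem_size   = 16 * 1024 * 1024,  // small scratch arena
        .mem_buffer = NULL,
        .no_alloc   = false,
    };
    struct ggml_context * ctx = ggml_init(params);

    // 8 rows of 4096 activations, plus a bias that is one element short.
    struct ggml_tensor * act  = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, 4096, 8);
    struct ggml_tensor * bias = ggml_new_tensor_1d(ctx, GGML_TYPE_F32, 4095);

    // ggml_add requires ggml_can_repeat(bias, act); 4096 % 4095 != 0,
    // so this aborts with "GGML_ASSERT(ggml_can_repeat(b, a)) failed".
    ggml_add(ctx, act, bias);

    ggml_free(ctx);
    return 0;
}
```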


@maglat commented on GitHub (Oct 29, 2025):

Yes, Gemma 3 fails for me in 0.12.7-rc0 as well.


@Burnarz commented on GitHub (Oct 29, 2025):

I'm also using gemma3:27b-it-qat, but got no errors with the official release of 0.12.7.


@jessegross commented on GitHub (Oct 29, 2025):

Fixed by #12834

Reference: github-starred/ollama#34256