[GH-ISSUE #12845] v0.12.7: 500 Error when loading Qwen3 VL 235B (Instruct & Thinking) #70569

Closed
opened 2026-05-04 22:02:00 -05:00 by GiteaMirror · 4 comments
Owner

Originally created by @asitwere on GitHub (Oct 29, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/12845

What is the issue?

500 error when attempting to load `qwen3-vl:235b-a22b-instruct-q8_0` and `qwen3-vl:235b-a22b-thinking-q8_0`:

`Error: 500 Internal Server Error: do load request: Post "http://127.0.0.1:49700/load": EOF`

Relevant log output

time=2025-10-29T19:40:52.117-04:00 level=ERROR source=server.go:273 msg="llama runner terminated" error="exit status 2"
time=2025-10-29T19:40:52.117-04:00 level=INFO source=sched.go:446 msg="Load failed" model=$HOME/.ollama/models/blobs/sha256-ce3185bd840262a8900cf68a8af99526135813a2179dc07f4455d45fc5e674b5 error="do load request: Post \"http://127.0.0.1:49819/load\": EOF"
[GIN] 2025/10/29 - 19:40:52 | 500 |  609.105416ms |       127.0.0.1 | POST     "/api/generate"

time=2025-10-29T19:40:57.321-04:00 level=INFO source=server.go:385 msg="starting runner" cmd="/Applications/Ollama.app/Contents/Resources/ollama runner --ollama-engine --model $HOME/.ollama/models/blobs/sha256-16e9268fa8528f00458a3881b9325fa93c92322b4cb5b5efcc29962096d2cef9 --port 49824"
time=2025-10-29T19:40:57.322-04:00 level=INFO source=server.go:638 msg="loading model" "model layers"=95 requested=-1
time=2025-10-29T19:40:57.322-04:00 level=INFO source=server.go:643 msg="system memory" total="512.0 GiB" free="493.5 GiB" free_swap="0 B"
time=2025-10-29T19:40:57.322-04:00 level=INFO source=server.go:650 msg="gpu memory" id=0 library=Metal available="463.5 GiB" free="464.0 GiB" minimum="512.0 MiB" overhead="0 B"
time=2025-10-29T19:40:57.330-04:00 level=INFO source=runner.go:1337 msg="starting ollama engine"
time=2025-10-29T19:40:57.331-04:00 level=INFO source=runner.go:1372 msg="Server listening on 127.0.0.1:49824"
time=2025-10-29T19:40:57.334-04:00 level=INFO source=runner.go:1210 msg=load request="{Operation:fit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:false KvSize:32768 KvCacheType: NumThreads:24 GPULayers:95[ID:0 Layers:95(0..94)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2025-10-29T19:40:57.349-04:00 level=INFO source=ggml.go:135 msg="" architecture=qwen3vlmoe file_type=Q8_0 name="" description="" num_tensors=1590 num_key_values=43
ggml_metal_library_init: using embedded metal library
ggml_metal_library_init: loaded in 0.005 sec
ggml_metal_device_init: GPU name:   Apple M3 Ultra
ggml_metal_device_init: GPU family: MTLGPUFamilyApple9  (1009)
ggml_metal_device_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_device_init: GPU family: MTLGPUFamilyMetal3  (5001)
ggml_metal_device_init: simdgroup reduction   = true
ggml_metal_device_init: simdgroup matrix mul. = true
ggml_metal_device_init: has unified memory    = true
ggml_metal_device_init: has bfloat            = true
ggml_metal_device_init: use residency sets    = true
ggml_metal_device_init: use shared buffers    = true
ggml_metal_device_init: recommendedMaxWorkingSetSize  = 498216.21 MB
time=2025-10-29T19:40:57.350-04:00 level=INFO source=ggml.go:104 msg=system Metal.0.EMBED_LIBRARY=1 CPU.0.NEON=1 CPU.0.ARM_FMA=1 CPU.0.FP16_VA=1 CPU.0.DOTPROD=1 CPU.0.LLAMAFILE=1 CPU.0.ACCELERATE=1 compiler=cgo(clang)
ggml_metal_init: allocating
ggml_metal_init: picking default device: Apple M3 Ultra
ggml_metal_init: use bfloat         = true
ggml_metal_init: use fusion         = true
ggml_metal_init: use concurrency    = true
ggml_metal_init: use graph optimize = true
ggml-backend.cpp:1751: GGML_ASSERT((int)sched->hash_set.size >= measure_graph->n_nodes + measure_graph->n_leafs) failed
(lldb) process attach --pid <PID>
error: attach failed: attach failed (Not allowed to attach to process.  Look in the console messages (Console.app), near the debugserver entries, when the attach failed.  The subsystem that denied the attach permission will likely have logged an informative message about why it was denied.)
SIGABRT: abort
PC=0x186196388 m=14 sigcode=0
signal arrived during cgo execution

goroutine 16 gp=0x14000503180 m=14 mp=0x14000501008 [syscall]:
runtime.cgocall(0x105679f00, 0x140000450d8)
	/Users/runner/hostedtoolcache/go/1.24.0/arm64/src/runtime/cgocall.go:167 +0x44 fp=0x14000045090 sp=0x14000045050 pc=0x104b2a764
github.com/ollama/ollama/ml/backend/ggml._Cfunc_ggml_backend_sched_reserve(0x12e81fe00, 0x13de04ff0)
	_cgo_gotypes.go:980 +0x34 fp=0x140000450d0 sp=0x14000045090 pc=0x104efe9f4
github.com/ollama/ollama/ml/backend/ggml.(*Context).Reserve.func1(...)
	/Users/runner/work/ollama/ollama/ml/backend/ggml/ggml.go:838
github.com/ollama/ollama/ml/backend/ggml.(*Context).Reserve(0x14000d0fa40)
	/Users/runner/work/ollama/ollama/ml/backend/ggml/ggml.go:838 +0x8c fp=0x14000045350 sp=0x140000450d0 pc=0x104f0862c
github.com/ollama/ollama/runner/ollamarunner.(*Server).reserveWorstCaseGraph(0x140002290e0)
	/Users/runner/work/ollama/ollama/runner/ollamarunner/runner.go:1106 +0xa18 fp=0x14000045680 sp=0x14000045350 pc=0x104fada78
github.com/ollama/ollama/runner/ollamarunner.(*Server).allocModel(0x140002290e0, {0x16b347be2?, 0x0?}, {0x0, 0x18, {0x14000780840, 0x1, 0x1}, 0x0}, {0x0?, ...}, ...)
	/Users/runner/work/ollama/ollama/runner/ollamarunner/runner.go:1163 +0x22c fp=0x14000045710 sp=0x14000045680 pc=0x104fade7c
github.com/ollama/ollama/runner/ollamarunner.(*Server).load(0x140002290e0, {0x105dee8e8, 0x14000516000}, 0x14000512000)
	/Users/runner/work/ollama/ollama/runner/ollamarunner/runner.go:1237 +0x460 fp=0x14000045aa0 sp=0x14000045710 pc=0x104fae750
github.com/ollama/ollama/runner/ollamarunner.(*Server).load-fm({0x105dee8e8?, 0x14000516000?}, 0x14000319b28?)
	<autogenerated>:1 +0x40 fp=0x14000045ad0 sp=0x14000045aa0 pc=0x104fb0650
net/http.HandlerFunc.ServeHTTP(0x14000035bc0?, {0x105dee8e8?, 0x14000516000?}, 0x14000319b10?)
	/Users/runner/hostedtoolcache/go/1.24.0/arm64/src/net/http/server.go:2294 +0x38 fp=0x14000045b00 sp=0x14000045ad0 pc=0x104de5908
net/http.(*ServeMux).ServeHTTP(0x10?, {0x105dee8e8, 0x14000516000}, 0x14000512000)
	/Users/runner/hostedtoolcache/go/1.24.0/arm64/src/net/http/server.go:2822 +0x1b4 fp=0x14000045b50 sp=0x14000045b00 pc=0x104de7494
net/http.serverHandler.ServeHTTP({0x105deaed0?}, {0x105dee8e8?, 0x14000516000?}, 0x1?)
	/Users/runner/hostedtoolcache/go/1.24.0/arm64/src/net/http/server.go:3301 +0xbc fp=0x14000045b80 sp=0x14000045b50 pc=0x104e0317c
net/http.(*conn).serve(0x1400001e3f0, {0x105df0c78, 0x14000222060})
	/Users/runner/hostedtoolcache/go/1.24.0/arm64/src/net/http/server.go:2102 +0x52c fp=0x14000045fa0 sp=0x14000045b80 pc=0x104de40ac
net/http.(*Server).Serve.gowrap3()
	/Users/runner/hostedtoolcache/go/1.24.0/arm64/src/net/http/server.go:3454 +0x30 fp=0x14000045fd0 sp=0x14000045fa0 pc=0x104de9270
runtime.goexit({})
	/Users/runner/hostedtoolcache/go/1.24.0/arm64/src/runtime/asm_arm64.s:1223 +0x4 fp=0x14000045fd0 sp=0x14000045fd0 pc=0x104b358f4
created by net/http.(*Server).Serve in goroutine 1
	/Users/runner/hostedtoolcache/go/1.24.0/arm64/src/net/http/server.go:3454 +0x3d8

goroutine 1 gp=0x140000021c0 m=nil [IO wait]:
runtime.gopark(0x0?, 0x0?, 0x0?, 0x0?, 0x0?)
	/Users/runner/hostedtoolcache/go/1.24.0/arm64/src/runtime/proc.go:435 +0xc8 fp=0x14000b49720 sp=0x14000b49700 pc=0x104b2dc88
...
runtime.goexit({})
	/Users/runner/hostedtoolcache/go/1.24.0/arm64/src/runtime/asm_arm64.s:1223 +0x4 fp=0x140004617d0 sp=0x140004617d0 pc=0x104b358f4
created by net/http.(*connReader).startBackgroundRead in goroutine 16
	/Users/runner/hostedtoolcache/go/1.24.0/arm64/src/net/http/server.go:686 +0xc4

r0      0x0
r1      0x0
...
fault   0x186196388
time=2025-10-29T19:40:57.823-04:00 level=ERROR source=server.go:273 msg="llama runner terminated" error="exit status 2"
time=2025-10-29T19:40:57.824-04:00 level=INFO source=sched.go:446 msg="Load failed" model=$HOME/.ollama/models/blobs/sha256-16e9268fa8528f00458a3881b9325fa93c92322b4cb5b5efcc29962096d2cef9 error="do load request: Post \"http://127.0.0.1:49824/load\": EOF"
[GIN] 2025/10/29 - 19:40:57 | 500 |  601.827417ms |       127.0.0.1 | POST     "/api/generate"

OS

macOS

GPU

Apple

CPU

Apple

Ollama version

v0.12.7

GiteaMirror added the macos, bug labels 2026-05-04 22:02:01 -05:00
Author
Owner

@MarkMuravev commented on GitHub (Oct 30, 2025):

Same problem on Ubuntu 22.04


@jessegross commented on GitHub (Oct 30, 2025):

Fixed by #12862


@linkdata commented on GitHub (Oct 31, 2025):

Enabling flash attention did not solve this for me:

time=2025-10-31T11:52:52.190+01:00 level=INFO source=runner.go:76 msg="discovering available GPUs..."
time=2025-10-31T11:52:52.192+01:00 level=INFO source=server.go:385 msg="starting runner" cmd="/Applications/Ollama.app/Contents/Resources/ollama runner --ollama-engine --port 55969"
time=2025-10-31T11:52:52.284+01:00 level=INFO source=types.go:42 msg="inference compute" id=0 filtered_id="" library=Metal compute=0.0 name=Metal description="Apple M3 Ultra" libdirs="" driver=0.0 pci_id="" type=discrete total="208.0 GiB" available="208.0 GiB"
[GIN] 2025/10/31 - 11:52:52 | 200 |      48.166µs |       127.0.0.1 | GET      "/api/version"
[GIN] 2025/10/31 - 11:52:52 | 200 |      81.583µs |       127.0.0.1 | GET      "/api/version"
[GIN] 2025/10/31 - 11:52:52 | 200 |     842.459µs |       127.0.0.1 | GET      "/api/tags"
[GIN] 2025/10/31 - 11:52:52 | 200 |   44.342833ms |       127.0.0.1 | POST     "/api/show"
[GIN] 2025/10/31 - 11:52:56 | 200 |      19.625µs |       127.0.0.1 | GET      "/api/version"
[GIN] 2025/10/31 - 11:52:56 | 200 |     772.542µs |       127.0.0.1 | GET      "/api/tags"
[GIN] 2025/10/31 - 11:52:56 | 200 |   49.850583ms |       127.0.0.1 | POST     "/api/show"
[GIN] 2025/10/31 - 11:52:56 | 200 |    33.70425ms |       127.0.0.1 | POST     "/api/show"
time=2025-10-31T11:52:56.259+01:00 level=INFO source=server.go:215 msg="enabling flash attention"
time=2025-10-31T11:52:56.259+01:00 level=INFO source=server.go:385 msg="starting runner" cmd="/Applications/Ollama.app/Contents/Resources/ollama runner --ollama-engine --model /Users/user/.ollama/models/blobs/sha256-02e588929c87e95d29571faea0693185503bea6f06ac3ea516092b037e149c8a --port 55973"
time=2025-10-31T11:52:56.261+01:00 level=INFO source=server.go:638 msg="loading model" "model layers"=95 requested=-1
time=2025-10-31T11:52:56.261+01:00 level=INFO source=server.go:643 msg="system memory" total="256.0 GiB" free="182.8 GiB" free_swap="0 B"
time=2025-10-31T11:52:56.261+01:00 level=INFO source=server.go:650 msg="gpu memory" id=0 library=Metal available="207.5 GiB" free="208.0 GiB" minimum="512.0 MiB" overhead="0 B"
time=2025-10-31T11:52:56.269+01:00 level=INFO source=runner.go:1337 msg="starting ollama engine"
time=2025-10-31T11:52:56.269+01:00 level=INFO source=runner.go:1372 msg="Server listening on 127.0.0.1:55973"
time=2025-10-31T11:52:56.272+01:00 level=INFO source=runner.go:1210 msg=load request="{Operation:fit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:true KvSize:131072 KvCacheType: NumThreads:20 GPULayers:95[ID:0 Layers:95(0..94)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2025-10-31T11:52:56.289+01:00 level=INFO source=ggml.go:135 msg="" architecture=qwen3vlmoe file_type=Q4_K_M name="" description="" num_tensors=1590 num_key_values=43
ggml_metal_library_init: using embedded metal library
ggml_metal_library_init: loaded in 0.006 sec
ggml_metal_device_init: GPU name:   Apple M3 Ultra
ggml_metal_device_init: GPU family: MTLGPUFamilyApple9  (1009)
ggml_metal_device_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_device_init: GPU family: MTLGPUFamilyMetal3  (5001)
ggml_metal_device_init: simdgroup reduction   = true
ggml_metal_device_init: simdgroup matrix mul. = true
ggml_metal_device_init: has unified memory    = true
ggml_metal_device_init: has bfloat            = true
ggml_metal_device_init: use residency sets    = true
ggml_metal_device_init: use shared buffers    = true
ggml_metal_device_init: recommendedMaxWorkingSetSize  = 223338.30 MB
time=2025-10-31T11:52:56.290+01:00 level=INFO source=ggml.go:104 msg=system Metal.0.EMBED_LIBRARY=1 CPU.0.NEON=1 CPU.0.ARM_FMA=1 CPU.0.FP16_VA=1 CPU.0.DOTPROD=1 CPU.0.LLAMAFILE=1 CPU.0.ACCELERATE=1 compiler=cgo(clang)
ggml_metal_init: allocating
ggml_metal_init: picking default device: Apple M3 Ultra
ggml_metal_init: use bfloat         = true
ggml_metal_init: use fusion         = true
ggml_metal_init: use concurrency    = true
ggml_metal_init: use graph optimize = true
ggml-backend.cpp:1751: GGML_ASSERT((int)sched->hash_set.size >= measure_graph->n_nodes + measure_graph->n_leafs) failed

@jessegross commented on GitHub (Nov 3, 2025):

@linkdata My previous comment was incorrect, this was actually fixed by #12807, which is available in the current 0.12.9 release.
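
Since the fix shipped in v0.12.9, the first thing to check is whether the installed build (`ollama -v`) is at least that version. A minimal sketch of that comparison using `sort -V`; `has_fix` is a hypothetical helper written for illustration, not part of the ollama CLI:

```shell
# Hypothetical helper: succeeds when the given version string is >= 0.12.9,
# i.e. already contains the fix for this assert (per the comment above).
has_fix() {
  [ "$(printf '%s\n' "0.12.9" "$1" | sort -V | head -n1)" = "0.12.9" ]
}

has_fix "0.12.7" && echo "0.12.7: contains fix" || echo "0.12.7: affected"
has_fix "0.12.9" && echo "0.12.9: contains fix" || echo "0.12.9: affected"
```

If the reported version predates 0.12.9, update (on macOS, via the app's built-in updater or by reinstalling) and retry loading the model.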

Reference: github-starred/ollama#70569