[GH-ISSUE #14444] qwen3.5:35b fails with cudaMemcpyAsyncReserve #9381

Closed
opened 2026-04-12 22:17:07 -05:00 by GiteaMirror · 29 comments

Originally created by @thepyper on GitHub (Feb 26, 2026).
Original GitHub issue: https://github.com/ollama/ollama/issues/14444

Originally assigned to: @jmorganca on GitHub.

What is the issue?

Hi!

What happens is that opencode (v1.2.25) is not able to use this model: it reports "model runner has unexpectedly stopped, this may be due to resource limitations or an internal error, check ollama server logs for details".
In the server logs, a CUDA error is displayed; see the attached log.
What is quite strange is that the same model works correctly from the ollama chat interface, where the model responds correctly.
Ollama v0.17.1, Windows 11, 64 GB RAM.
Thanks!!

Relevant log output

....
time=2026-02-26T16:50:28.505+01:00 level=INFO source=server.go:431 msg="starting runner" cmd="C:\\Users\\thepy\\AppData\\Local\\Programs\\Ollama\\ollama.exe runner --ollama-engine --port 64892"
time=2026-02-26T16:50:30.736+01:00 level=INFO source=server.go:431 msg="starting runner" cmd="C:\\Users\\thepy\\AppData\\Local\\Programs\\Ollama\\ollama.exe runner --ollama-engine --port 64898"
time=2026-02-26T16:50:31.444+01:00 level=INFO source=server.go:431 msg="starting runner" cmd="C:\\Users\\thepy\\AppData\\Local\\Programs\\Ollama\\ollama.exe runner --ollama-engine --port 64904"
time=2026-02-26T16:50:31.693+01:00 level=INFO source=server.go:431 msg="starting runner" cmd="C:\\Users\\thepy\\AppData\\Local\\Programs\\Ollama\\ollama.exe runner --ollama-engine --port 64910"
time=2026-02-26T16:50:31.940+01:00 level=INFO source=server.go:431 msg="starting runner" cmd="C:\\Users\\thepy\\AppData\\Local\\Programs\\Ollama\\ollama.exe runner --ollama-engine --port 64916"
time=2026-02-26T16:50:32.173+01:00 level=INFO source=server.go:431 msg="starting runner" cmd="C:\\Users\\thepy\\AppData\\Local\\Programs\\Ollama\\ollama.exe runner --ollama-engine --port 64922"
time=2026-02-26T16:50:32.407+01:00 level=INFO source=server.go:431 msg="starting runner" cmd="C:\\Users\\thepy\\AppData\\Local\\Programs\\Ollama\\ollama.exe runner --ollama-engine --port 64928"
time=2026-02-26T16:50:32.638+01:00 level=INFO source=server.go:431 msg="starting runner" cmd="C:\\Users\\thepy\\AppData\\Local\\Programs\\Ollama\\ollama.exe runner --ollama-engine --port 64934"
time=2026-02-26T16:50:32.870+01:00 level=INFO source=server.go:431 msg="starting runner" cmd="C:\\Users\\thepy\\AppData\\Local\\Programs\\Ollama\\ollama.exe runner --ollama-engine --port 64940"
time=2026-02-26T16:50:33.101+01:00 level=INFO source=server.go:431 msg="starting runner" cmd="C:\\Users\\thepy\\AppData\\Local\\Programs\\Ollama\\ollama.exe runner --ollama-engine --port 64946"
time=2026-02-26T16:50:33.336+01:00 level=INFO source=server.go:431 msg="starting runner" cmd="C:\\Users\\thepy\\AppData\\Local\\Programs\\Ollama\\ollama.exe runner --ollama-engine --port 64952"
time=2026-02-26T16:50:33.571+01:00 level=INFO source=server.go:431 msg="starting runner" cmd="C:\\Users\\thepy\\AppData\\Local\\Programs\\Ollama\\ollama.exe runner --ollama-engine --port 64958"
time=2026-02-26T16:50:33.802+01:00 level=INFO source=server.go:431 msg="starting runner" cmd="C:\\Users\\thepy\\AppData\\Local\\Programs\\Ollama\\ollama.exe runner --ollama-engine --port 64964"
time=2026-02-26T16:50:34.051+01:00 level=INFO source=server.go:431 msg="starting runner" cmd="C:\\Users\\thepy\\AppData\\Local\\Programs\\Ollama\\ollama.exe runner --ollama-engine --port 64970"
time=2026-02-26T16:50:34.282+01:00 level=INFO source=server.go:431 msg="starting runner" cmd="C:\\Users\\thepy\\AppData\\Local\\Programs\\Ollama\\ollama.exe runner --ollama-engine --port 64976"
time=2026-02-26T16:50:34.515+01:00 level=INFO source=server.go:431 msg="starting runner" cmd="C:\\Users\\thepy\\AppData\\Local\\Programs\\Ollama\\ollama.exe runner --ollama-engine --port 64982"
time=2026-02-26T16:50:34.745+01:00 level=INFO source=server.go:431 msg="starting runner" cmd="C:\\Users\\thepy\\AppData\\Local\\Programs\\Ollama\\ollama.exe runner --ollama-engine --port 64988"
time=2026-02-26T16:50:34.986+01:00 level=INFO source=server.go:431 msg="starting runner" cmd="C:\\Users\\thepy\\AppData\\Local\\Programs\\Ollama\\ollama.exe runner --ollama-engine --port 64994"
time=2026-02-26T16:50:35.236+01:00 level=INFO source=server.go:431 msg="starting runner" cmd="C:\\Users\\thepy\\AppData\\Local\\Programs\\Ollama\\ollama.exe runner --ollama-engine --port 65000"
time=2026-02-26T16:50:35.486+01:00 level=INFO source=server.go:431 msg="starting runner" cmd="C:\\Users\\thepy\\AppData\\Local\\Programs\\Ollama\\ollama.exe runner --ollama-engine --port 65006"
time=2026-02-26T16:50:35.490+01:00 level=INFO source=runner.go:464 msg="failure during GPU discovery" OLLAMA_LIBRARY_PATH="[C:\\Users\\thepy\\AppData\\Local\\Programs\\Ollama\\lib\\ollama C:\\Users\\thepy\\AppData\\Local\\Programs\\Ollama\\lib\\ollama\\cuda_v13]" extra_envs=map[] error="failed to finish discovery before timeout"
time=2026-02-26T16:50:35.490+01:00 level=WARN source=runner.go:356 msg="unable to refresh free memory, using old values"
time=2026-02-26T16:50:35.491+01:00 level=INFO source=server.go:431 msg="starting runner" cmd="C:\\Users\\thepy\\AppData\\Local\\Programs\\Ollama\\ollama.exe runner --ollama-engine --port 65007"
time=2026-02-26T16:50:35.721+01:00 level=INFO source=cpu_windows.go:148 msg=packages count=1
time=2026-02-26T16:50:35.721+01:00 level=INFO source=cpu_windows.go:195 msg="" package=0 cores=12 efficiency=0 threads=24
time=2026-02-26T16:50:35.817+01:00 level=INFO source=server.go:247 msg="enabling flash attention"
time=2026-02-26T16:50:35.818+01:00 level=INFO source=server.go:431 msg="starting runner" cmd="C:\\Users\\thepy\\AppData\\Local\\Programs\\Ollama\\ollama.exe runner --ollama-engine --model C:\\Users\\thepy\\.ollama\\models\\blobs\\sha256-2abd0d805943fa113f934d1ae4f2d5a749b5d4fe2a0a9c64b645c1df15868da7 --port 65013"
time=2026-02-26T16:50:35.821+01:00 level=INFO source=sched.go:491 msg="system memory" total="63.2 GiB" free="50.0 GiB" free_swap="73.1 GiB"
time=2026-02-26T16:50:35.821+01:00 level=INFO source=sched.go:498 msg="gpu memory" id=GPU-f3695d4c-f3d0-07f8-69e1-1269f8d1e52d library=CUDA available="7.3 GiB" free="7.8 GiB" minimum="457.0 MiB" overhead="0 B"
time=2026-02-26T16:50:35.821+01:00 level=INFO source=server.go:757 msg="loading model" "model layers"=41 requested=-1
time=2026-02-26T16:50:35.853+01:00 level=INFO source=runner.go:1411 msg="starting ollama engine"
time=2026-02-26T16:50:35.863+01:00 level=INFO source=runner.go:1446 msg="Server listening on 127.0.0.1:65013"
time=2026-02-26T16:50:35.863+01:00 level=INFO source=runner.go:1284 msg=load request="{Operation:fit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:Enabled KvSize:4096 KvCacheType: NumThreads:12 GPULayers:41[ID:GPU-f3695d4c-f3d0-07f8-69e1-1269f8d1e52d Layers:41(0..40)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2026-02-26T16:50:35.899+01:00 level=INFO source=ggml.go:136 msg="" architecture=qwen35moe file_type=Q4_K_M name="" description="" num_tensors=1959 num_key_values=57
load_backend: loaded CPU backend from C:\Users\thepy\AppData\Local\Programs\Ollama\lib\ollama\ggml-cpu-icelake.dll
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 4070 Laptop GPU, compute capability 8.9, VMM: yes, ID: GPU-f3695d4c-f3d0-07f8-69e1-1269f8d1e52d
load_backend: loaded CUDA backend from C:\Users\thepy\AppData\Local\Programs\Ollama\lib\ollama\cuda_v13\ggml-cuda.dll
time=2026-02-26T16:50:36.044+01:00 level=INFO source=ggml.go:104 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.BMI2=1 CPU.0.AVX512=1 CPU.0.AVX512_VBMI=1 CPU.0.AVX512_VNNI=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 CUDA.0.ARCHS=750,800,860,870,890,900,1000,1030,1100,1200,1210 CUDA.0.USE_GRAPHS=1 CUDA.0.PEER_MAX_BATCH_SIZE=128 compiler=cgo(clang)
time=2026-02-26T16:50:36.959+01:00 level=INFO source=runner.go:1284 msg=load request="{Operation:fit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:Enabled KvSize:4096 KvCacheType: NumThreads:12 GPULayers:12[ID:GPU-f3695d4c-f3d0-07f8-69e1-1269f8d1e52d Layers:12(28..39)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2026-02-26T16:50:37.229+01:00 level=INFO source=runner.go:1284 msg=load request="{Operation:alloc LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:Enabled KvSize:4096 KvCacheType: NumThreads:12 GPULayers:12[ID:GPU-f3695d4c-f3d0-07f8-69e1-1269f8d1e52d Layers:12(28..39)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2026-02-26T16:50:37.750+01:00 level=INFO source=runner.go:1284 msg=load request="{Operation:commit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:Enabled KvSize:4096 KvCacheType: NumThreads:12 GPULayers:12[ID:GPU-f3695d4c-f3d0-07f8-69e1-1269f8d1e52d Layers:12(28..39)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2026-02-26T16:50:37.750+01:00 level=INFO source=ggml.go:482 msg="offloading 12 repeating layers to GPU"
time=2026-02-26T16:50:37.750+01:00 level=INFO source=ggml.go:486 msg="offloading output layer to CPU"
time=2026-02-26T16:50:37.750+01:00 level=INFO source=ggml.go:494 msg="offloaded 12/41 layers to GPU"
time=2026-02-26T16:50:37.750+01:00 level=INFO source=device.go:240 msg="model weights" device=CUDA0 size="6.1 GiB"
time=2026-02-26T16:50:37.750+01:00 level=INFO source=device.go:245 msg="model weights" device=CPU size="16.1 GiB"
time=2026-02-26T16:50:37.750+01:00 level=INFO source=device.go:251 msg="kv cache" device=CUDA0 size="495.1 MiB"
time=2026-02-26T16:50:37.750+01:00 level=INFO source=device.go:256 msg="kv cache" device=CPU size="1.1 GiB"
time=2026-02-26T16:50:37.750+01:00 level=INFO source=device.go:262 msg="compute graph" device=CUDA0 size="589.2 MiB"
time=2026-02-26T16:50:37.750+01:00 level=INFO source=device.go:267 msg="compute graph" device=CPU size="630.8 MiB"
time=2026-02-26T16:50:37.750+01:00 level=INFO source=device.go:272 msg="total memory" size="25.0 GiB"
time=2026-02-26T16:50:37.750+01:00 level=INFO source=sched.go:566 msg="loaded runners" count=1
time=2026-02-26T16:50:37.750+01:00 level=INFO source=server.go:1350 msg="waiting for llama runner to start responding"
time=2026-02-26T16:50:37.750+01:00 level=INFO source=server.go:1384 msg="waiting for server to become available" status="llm server loading model"
time=2026-02-26T16:50:40.760+01:00 level=INFO source=server.go:1388 msg="llama runner started in 4.94 seconds"
time=2026-02-26T16:50:40.860+01:00 level=WARN source=runner.go:187 msg="truncating input prompt" limit=4096 prompt=11346 keep=4 new=4096
CUDA error: invalid argument
  current device: 0, in function ggml_cuda_cpy at C:\a\ollama\ollama\ml\backend\ggml\ggml\src\ggml-cuda\cpy.cu:438
  cudaMemcpyAsyncReserve(src1_ddc, src0_ddc, ggml_nbytes(src0), cudaMemcpyDeviceToDevice, main_stream)
C:\a\ollama\ollama\ml\backend\ggml\ggml\src\ggml-cuda\ggml-cuda.cu:94: CUDA error
time=2026-02-26T16:50:41.032+01:00 level=ERROR source=server.go:1610 msg="post predict" error="Post \"http://127.0.0.1:65013/completion\": read tcp 127.0.0.1:65018->127.0.0.1:65013: wsarecv: An existing connection was forcibly closed by the remote host."
[GIN] 2026/02/26 - 16:50:41 | 500 |   12.6669052s |       127.0.0.1 | POST     "/v1/chat/completions"
time=2026-02-26T16:50:41.514+01:00 level=ERROR source=server.go:304 msg="llama runner terminated" error="exit status 1"
[GIN] 2026/02/26 - 16:50:48 | 200 |      3.1183ms |       127.0.0.1 | GET      "/api/tags"
[GIN] 2026/02/26 - 16:51:18 | 200 |       3.137ms |       127.0.0.1 | GET      "/api/tags"
[GIN] 2026/02/26 - 16:51:48 | 200 |      3.1599ms |       127.0.0.1 | GET      "/api/tags"

OS

Windows

GPU

Nvidia

CPU

AMD

Ollama version

0.17.1

GiteaMirror added the bug label 2026-04-12 22:17:07 -05:00

@rick-github commented on GitHub (Feb 26, 2026):

time=2026-02-26T16:50:40.860+01:00 level=WARN source=runner.go:187 msg="truncating input prompt" limit=4096 prompt=11346 keep=4 new=4096

OpenCode is sending instructions and tools too big for the default context, so the ollama server is truncating the prompt to fit exactly in the available space. This triggers a bug in the CUDA memcpy routine that needs to be investigated, but as a workaround you can set OLLAMA_CONTEXT_LENGTH=32768 to allow space for both the context and token generation.
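For reference, the workaround above can be applied like this (a sketch, assuming a default install; the model name and the 32768 value are examples — `num_ctx` is the per-request equivalent of the server-wide environment variable):

```shell
# Server-wide: set the context length before starting ollama (Linux/macOS)
export OLLAMA_CONTEXT_LENGTH=32768
ollama serve

# Windows (PowerShell):
#   $env:OLLAMA_CONTEXT_LENGTH = "32768"; ollama serve

# Per-request: ask for a larger context on a single API call instead
curl http://127.0.0.1:11434/api/generate -d '{
  "model": "qwen3.5:35b",
  "prompt": "hello",
  "options": { "num_ctx": 32768 }
}'
```

Either approach avoids the prompt truncation that triggers the crash; the environment variable is simplest when the client (OpenCode here) does not expose per-request options.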

<!-- gh-comment-id:3967586663 -->

@rick-github commented on GitHub (Feb 26, 2026):

https://github.com/ollama/ollama/issues/14419#issuecomment-3959159035

<!-- gh-comment-id:3968015979 -->

@Adrian-at-CrimsonAzure commented on GitHub (Feb 26, 2026):

I am getting the same invalid argument error, but without the truncation line. This happens with both OpenWebUI and the ollama run command for any message more than a few words long. What's strange to me is that it doesn't look like it uses more VRAM or RAM than qwen3:30b or this 42b qwen3+brainstorm model, at least judging by the allocations reported in the logs. The same prompt works fine on either of those two, but crashes on Qwen3.5.

Ollama v0.17.1 docker, Ubuntu 24.04.
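For anyone trying to reproduce this under Docker, a minimal sketch for capturing the relevant runner output (the container name `ollama` is an assumption — adjust to your setup):

```shell
# Follow the server log while reproducing the crash,
# filtering for the CUDA error and truncation warnings:
docker logs -f ollama 2>&1 | grep -iE "cuda|error|truncat"
```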

Logs
[GIN] 2026/02/26 - 19:15:34 | 200 |      42.429µs |       127.0.0.1 | HEAD     "/"
[GIN] 2026/02/26 - 19:15:35 | 200 |  715.789254ms |       127.0.0.1 | POST     "/api/show"
[GIN] 2026/02/26 - 19:15:36 | 200 |  697.127338ms |       127.0.0.1 | POST     "/api/show"
time=2026-02-26T19:15:37.031Z level=INFO source=server.go:431 msg="starting runner" cmd="/usr/bin/ollama runner --ollama-engine --port 35197"
time=2026-02-26T19:15:37.602Z level=INFO source=server.go:431 msg="starting runner" cmd="/usr/bin/ollama runner --ollama-engine --port 36969"
time=2026-02-26T19:15:37.891Z level=INFO source=server.go:431 msg="starting runner" cmd="/usr/bin/ollama runner --ollama-engine --port 34163"
time=2026-02-26T19:15:38.183Z level=INFO source=server.go:431 msg="starting runner" cmd="/usr/bin/ollama runner --ollama-engine --port 44353"
time=2026-02-26T19:15:38.462Z level=INFO source=server.go:431 msg="starting runner" cmd="/usr/bin/ollama runner --ollama-engine --port 35081"
time=2026-02-26T19:15:38.755Z level=INFO source=server.go:431 msg="starting runner" cmd="/usr/bin/ollama runner --ollama-engine --port 45225"
time=2026-02-26T19:15:39.046Z level=INFO source=server.go:431 msg="starting runner" cmd="/usr/bin/ollama runner --ollama-engine --port 33447"
time=2026-02-26T19:15:39.340Z level=INFO source=server.go:431 msg="starting runner" cmd="/usr/bin/ollama runner --ollama-engine --port 40459"
time=2026-02-26T19:15:39.625Z level=INFO source=server.go:431 msg="starting runner" cmd="/usr/bin/ollama runner --ollama-engine --port 45787"
time=2026-02-26T19:15:39.862Z level=INFO source=server.go:431 msg="starting runner" cmd="/usr/bin/ollama runner --ollama-engine --port 45541"
time=2026-02-26T19:15:40.130Z level=INFO source=server.go:431 msg="starting runner" cmd="/usr/bin/ollama runner --ollama-engine --port 42583"
time=2026-02-26T19:15:40.420Z level=INFO source=server.go:431 msg="starting runner" cmd="/usr/bin/ollama runner --ollama-engine --port 35447"
time=2026-02-26T19:15:40.701Z level=INFO source=server.go:431 msg="starting runner" cmd="/usr/bin/ollama runner --ollama-engine --port 42267"
time=2026-02-26T19:15:40.986Z level=INFO source=server.go:431 msg="starting runner" cmd="/usr/bin/ollama runner --ollama-engine --port 37165"
time=2026-02-26T19:15:41.285Z level=INFO source=server.go:431 msg="starting runner" cmd="/usr/bin/ollama runner --ollama-engine --port 34497"
time=2026-02-26T19:15:41.565Z level=INFO source=server.go:431 msg="starting runner" cmd="/usr/bin/ollama runner --ollama-engine --port 34005"
time=2026-02-26T19:15:41.867Z level=INFO source=server.go:431 msg="starting runner" cmd="/usr/bin/ollama runner --ollama-engine --port 41865"
time=2026-02-26T19:15:42.161Z level=INFO source=server.go:431 msg="starting runner" cmd="/usr/bin/ollama runner --ollama-engine --port 33223"
time=2026-02-26T19:15:42.351Z level=INFO source=runner.go:464 msg="failure during GPU discovery" OLLAMA_LIBRARY_PATH="[/usr/lib/ollama /usr/lib/ollama/cuda_v12]" extra_envs=map[] error="failed to finish discovery before timeout"
time=2026-02-26T19:15:42.352Z level=WARN source=runner.go:356 msg="unable to refresh free memory, using old values"
time=2026-02-26T19:15:42.352Z level=INFO source=server.go:431 msg="starting runner" cmd="/usr/bin/ollama runner --ollama-engine --port 36083"
time=2026-02-26T19:15:42.645Z level=WARN source=cpu_linux.go:130 msg="failed to parse CPU allowed micro secs" error="strconv.ParseInt: parsing \"max\": invalid syntax"
time=2026-02-26T19:15:43.095Z level=INFO source=server.go:247 msg="enabling flash attention"
time=2026-02-26T19:15:43.095Z level=INFO source=server.go:431 msg="starting runner" cmd="/usr/bin/ollama runner --ollama-engine --model /root/.ollama/models/blobs/sha256-2abd0d805943fa113f934d1ae4f2d5a749b5d4fe2a0a9c64b645c1df15868da7 --port 38985"
time=2026-02-26T19:15:43.095Z level=INFO source=sched.go:491 msg="system memory" total="62.7 GiB" free="62.3 GiB" free_swap="0 B"
time=2026-02-26T19:15:43.095Z level=INFO source=sched.go:498 msg="gpu memory" id=GPU-95d928de-a574-b766-ce5e-1bbf75c34c1e library=CUDA available="10.5 GiB" free="10.9 GiB" minimum="457.0 MiB" overhead="0 B"
time=2026-02-26T19:15:43.096Z level=INFO source=sched.go:498 msg="gpu memory" id=GPU-5fc14822-5dbe-647e-adcd-448f67369791 library=CUDA available="10.3 GiB" free="10.8 GiB" minimum="457.0 MiB" overhead="0 B"
time=2026-02-26T19:15:43.096Z level=INFO source=server.go:757 msg="loading model" "model layers"=41 requested=-1
time=2026-02-26T19:15:43.117Z level=INFO source=runner.go:1411 msg="starting ollama engine"
time=2026-02-26T19:15:43.117Z level=INFO source=runner.go:1446 msg="Server listening on 127.0.0.1:38985"
time=2026-02-26T19:15:43.128Z level=INFO source=runner.go:1284 msg=load request="{Operation:fit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:Enabled KvSize:256000 KvCacheType: NumThreads:6 GPULayers:41[ID:GPU-95d928de-a574-b766-ce5e-1bbf75c34c1e Layers:41(0..40)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2026-02-26T19:15:43.296Z level=INFO source=ggml.go:136 msg="" architecture=qwen35moe file_type=Q4_K_M name="" description="" num_tensors=1959 num_key_values=57
load_backend: loaded CPU backend from /usr/lib/ollama/libggml-cpu-haswell.so
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
Device 0: NVIDIA GeForce GTX 1080 Ti, compute capability 6.1, VMM: yes, ID: GPU-95d928de-a574-b766-ce5e-1bbf75c34c1e
Device 1: NVIDIA GeForce GTX 1080 Ti, compute capability 6.1, VMM: yes, ID: GPU-5fc14822-5dbe-647e-adcd-448f67369791
load_backend: loaded CUDA backend from /usr/lib/ollama/cuda_v12/libggml-cuda.so
time=2026-02-26T19:15:43.378Z level=INFO source=ggml.go:104 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.BMI2=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 CUDA.0.ARCHS=500,520,600,610,700,750,800,860,890,900,1200 CUDA.0.USE_GRAPHS=1 CUDA.0.PEER_MAX_BATCH_SIZE=128 CUDA.1.ARCHS=500,520,600,610,700,750,800,860,890,900,1200 CUDA.1.USE_GRAPHS=1 CUDA.1.PEER_MAX_BATCH_SIZE=128 compiler=cgo(gcc)
time=2026-02-26T19:15:48.850Z level=INFO source=runner.go:1284 msg=load request="{Operation:fit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:Enabled KvSize:256000 KvCacheType: NumThreads:6 GPULayers:25[ID:GPU-95d928de-a574-b766-ce5e-1bbf75c34c1e Layers:10(15..24) ID:GPU-5fc14822-5dbe-647e-adcd-448f67369791 Layers:15(25..39)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2026-02-26T19:15:53.133Z level=INFO source=runner.go:1284 msg=load request="{Operation:fit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:Enabled KvSize:256000 KvCacheType: NumThreads:6 GPULayers:21[ID:GPU-95d928de-a574-b766-ce5e-1bbf75c34c1e Layers:9(19..27) ID:GPU-5fc14822-5dbe-647e-adcd-448f67369791 Layers:12(28..39)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2026-02-26T19:15:57.385Z level=INFO source=runner.go:1284 msg=load request="{Operation:alloc LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:Enabled KvSize:256000 KvCacheType: NumThreads:6 GPULayers:21[ID:GPU-95d928de-a574-b766-ce5e-1bbf75c34c1e Layers:9(19..27) ID:GPU-5fc14822-5dbe-647e-adcd-448f67369791 Layers:12(28..39)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2026-02-26T19:16:03.652Z level=INFO source=runner.go:1284 msg=load request="{Operation:commit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:Enabled KvSize:256000 KvCacheType: NumThreads:6 GPULayers:21[ID:GPU-95d928de-a574-b766-ce5e-1bbf75c34c1e Layers:9(19..27) ID:GPU-5fc14822-5dbe-647e-adcd-448f67369791 Layers:12(28..39)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2026-02-26T19:16:03.653Z level=INFO source=device.go:240 msg="model weights" device=CUDA0 size="4.6 GiB"
time=2026-02-26T19:16:03.653Z level=INFO source=device.go:240 msg="model weights" device=CUDA1 size="6.1 GiB"
time=2026-02-26T19:16:03.653Z level=INFO source=device.go:245 msg="model weights" device=CPU size="11.6 GiB"
time=2026-02-26T19:16:03.653Z level=INFO source=device.go:251 msg="kv cache" device=CUDA0 size="1.8 GiB"
time=2026-02-26T19:16:03.653Z level=INFO source=device.go:251 msg="kv cache" device=CUDA1 size="1.9 GiB"
time=2026-02-26T19:16:03.653Z level=INFO source=device.go:256 msg="kv cache" device=CPU size="2.7 GiB"
time=2026-02-26T19:16:03.653Z level=INFO source=device.go:262 msg="compute graph" device=CUDA0 size="3.7 GiB"
time=2026-02-26T19:16:03.653Z level=INFO source=device.go:262 msg="compute graph" device=CUDA1 size="1.4 GiB"
time=2026-02-26T19:16:03.653Z level=INFO source=device.go:267 msg="compute graph" device=CPU size="630.8 MiB"
time=2026-02-26T19:16:03.653Z level=INFO source=device.go:272 msg="total memory" size="34.3 GiB"
time=2026-02-26T19:16:03.653Z level=INFO source=sched.go:566 msg="loaded runners" count=1
time=2026-02-26T19:16:03.653Z level=INFO source=server.go:1350 msg="waiting for llama runner to start responding"
time=2026-02-26T19:16:03.653Z level=INFO source=ggml.go:482 msg="offloading 21 repeating layers to GPU"
time=2026-02-26T19:16:03.653Z level=INFO source=ggml.go:486 msg="offloading output layer to CPU"
time=2026-02-26T19:16:03.653Z level=INFO source=ggml.go:494 msg="offloaded 21/41 layers to GPU"
time=2026-02-26T19:16:03.654Z level=INFO source=server.go:1384 msg="waiting for server to become available" status="llm server loading model"
time=2026-02-26T19:16:16.526Z level=INFO source=server.go:1388 msg="llama runner started in 33.43 seconds"
[GIN] 2026/02/26 - 19:16:16 | 200 |  40.30545129s |       127.0.0.1 | POST     "/api/generate"
[GIN] 2026/02/26 - 19:16:54 | 200 |    4.764626ms |      172.16.5.4 | GET      "/api/tags"
[GIN] 2026/02/26 - 19:16:54 | 200 |     148.766µs |      172.16.5.4 | GET      "/api/ps"
CUDA error: invalid argument
current device: 0, in function ggml_cuda_cpy at //ml/backend/ggml/ggml/src/ggml-cuda/cpy.cu:438
cudaMemcpyAsyncReserve(src1_ddc, src0_ddc, ggml_nbytes(src0), cudaMemcpyDeviceToDevice, main_stream)
//ml/backend/ggml/ggml/src/ggml-cuda/ggml-cuda.cu:94: CUDA error
/usr/lib/ollama/libggml-base.so.0(+0x1bae8)[0x7fe4005c1ae8]
/usr/lib/ollama/libggml-base.so.0(ggml_print_backtrace+0x1e6)[0x7fe4005c1eb6]
/usr/lib/ollama/libggml-base.so.0(ggml_abort+0x11d)[0x7fe4005c203d]
/usr/lib/ollama/cuda_v12/libggml-cuda.so(+0x143272)[0x7fe37769d272]
/usr/lib/ollama/cuda_v12/libggml-cuda.so(_Z13ggml_cuda_cpyR25ggml_backend_cuda_contextPK11ggml_tensorPS1_+0x1e50)[0x7fe37765b1d0]
/usr/lib/ollama/cuda_v12/libggml-cuda.so(+0x153cae)[0x7fe3776adcae]
/usr/lib/ollama/cuda_v12/libggml-cuda.so(+0x156eea)[0x7fe3776b0eea]
/usr/lib/ollama/cuda_v12/libggml-cuda.so(+0x158e55)[0x7fe3776b2e55]
/usr/bin/ollama(+0x13ac156)[0x55f692353156]
/usr/bin/ollama(+0x132034b)[0x55f6922c734b]
/usr/bin/ollama(+0x3ddae1)[0x55f691384ae1]
SIGABRT: abort
PC=0x7fe44f61fb2c m=5 sigcode=18446744073709551610
signal arrived during cgo execution

goroutine 2072 gp=0xc000532000 m=5 mp=0xc0000a8008 [syscall]:
runtime.cgocall(0x55f6922c7330, 0xc000088aa0)
  runtime/cgocall.go:167 +0x4b fp=0xc000088a78 sp=0xc000088a40 pc=0x55f691379a6b
github.com/ollama/ollama/ml/backend/ggml._Cfunc_ggml_backend_sched_graph_compute_async(0x7fe3de82c8d0, 0x7fdab4f08f20)
  _cgo_gotypes.go:979 +0x4a fp=0xc000088aa0 sp=0xc000088a78 pc=0x55f691864b0a
github.com/ollama/ollama/ml/backend/ggml.(*Context).ComputeWithNotify.func2(...)
  github.com/ollama/ollama/ml/backend/ggml/ggml.go:825
github.com/ollama/ollama/ml/backend/ggml.(*Context).ComputeWithNotify(0xc000546880, 0xc001c1c000?, {0xc0000448c0, 0x1, 0x2?})
  github.com/ollama/ollama/ml/backend/ggml/ggml.go:825 +0x1b2 fp=0xc000088b78 sp=0xc000088aa0 pc=0x55f691873492
github.com/ollama/ollama/runner/ollamarunner.(*Server).computeBatch(0xc00022f0e0, {0x0, {0x55f692c1e9d0, 0xc000546880}, {0x55f692c2be30, 0xc001c1a720}, {0xc001ac2008, 0x200, 0x25f}, {{0x55f692c2be30, ...}, ...}, ...})
  github.com/ollama/ollama/runner/ollamarunner/runner.go:716 +0x862 fp=0xc000088ef0 sp=0xc000088b78 pc=0x55f69199e282
github.com/ollama/ollama/runner/ollamarunner.(*Server).run.gowrap1()
  github.com/ollama/ollama/runner/ollamarunner/runner.go:459 +0x58 fp=0xc000088fe0 sp=0xc000088ef0 pc=0x55f69199bf78
runtime.goexit({})
  runtime/asm_amd64.s:1700 +0x1 fp=0xc000088fe8 sp=0xc000088fe0 pc=0x55f691384e61
created by github.com/ollama/ollama/runner/ollamarunner.(*Server).run in goroutine 5
  github.com/ollama/ollama/runner/ollamarunner/runner.go:459 +0x2cd

goroutine 1 gp=0xc000002380 m=nil [IO wait, 1 minutes]:
runtime.gopark(0x0?, 0x0?, 0x0?, 0x0?, 0x0?)
  runtime/proc.go:435 +0xce fp=0xc000ef9778 sp=0xc000ef9758 pc=0x55f69137ceee
runtime.netpollblock(0xc0005177c8?, 0x913164a6?, 0xf6?)
  runtime/netpoll.go:575 +0xf7 fp=0xc000ef97b0 sp=0xc000ef9778 pc=0x55f691342097
internal/poll.runtime_pollWait(0x7fe408234610, 0x72)
  runtime/netpoll.go:351 +0x85 fp=0xc000ef97d0 sp=0xc000ef97b0 pc=0x55f69137c105
internal/poll.(*pollDesc).wait(0xc0001d8080?, 0x900000036?, 0x0)
  internal/poll/fd_poll_runtime.go:84 +0x27 fp=0xc000ef97f8 sp=0xc000ef97d0 pc=0x55f691404487
internal/poll.(*pollDesc).waitRead(...)
  internal/poll/fd_poll_runtime.go:89
internal/poll.(*FD).Accept(0xc0001d8080)
  internal/poll/fd_unix.go:620 +0x295 fp=0xc000ef98a0 sp=0xc000ef97f8 pc=0x55f691409855
net.(*netFD).accept(0xc0001d8080)
  net/fd_unix.go:172 +0x29 fp=0xc000ef9958 sp=0xc000ef98a0 pc=0x55f69147cd49
net.(*TCPListener).accept(0xc00044d900)
  net/tcpsock_posix.go:159 +0x1b fp=0xc000ef99a8 sp=0xc000ef9958 pc=0x55f691492c5b
net.(*TCPListener).Accept(0xc00044d900)
  net/tcpsock.go:380 +0x30 fp=0xc000ef99d8 sp=0xc000ef99a8 pc=0x55f691491b10
net/http.(*onceCloseListener).Accept(0xc000036750?)
  <autogenerated>:1 +0x24 fp=0xc000ef99f0 sp=0xc000ef99d8 pc=0x55f6916a99c4
net/http.(*Server).Serve(0xc000697500, {0x55f692c0fbc0, 0xc00044d900})
  net/http/server.go:3424 +0x30c fp=0xc000ef9b20 sp=0xc000ef99f0 pc=0x55f69168128c
github.com/ollama/ollama/runner/ollamarunner.Execute({0xc00012a030, 0x4, 0x4})
  github.com/ollama/ollama/runner/ollamarunner/runner.go:1447 +0x94e fp=0xc000ef9cf0 sp=0xc000ef9b20 pc=0x55f6919a520e
github.com/ollama/ollama/runner.Execute({0xc00012a010?, 0x0?, 0x0?})
  github.com/ollama/ollama/runner/runner.go:18 +0x10e fp=0xc000ef9d30 sp=0xc000ef9cf0 pc=0x55f691a4476e
github.com/ollama/ollama/cmd.NewCLI.func3(0xc000697200?, {0x55f69262d236?, 0x4?, 0x55f69262d23a?})
  github.com/ollama/ollama/cmd/cmd.go:2270 +0x45 fp=0xc000ef9d58 sp=0xc000ef9d30 pc=0x55f692257845
github.com/spf13/cobra.(*Command).execute(0xc000347b08, {0xc00039d770, 0x5, 0x5})
  github.com/spf13/cobra@v1.7.0/command.go:940 +0x85c fp=0xc000ef9e78 sp=0xc000ef9d58 pc=0x55f6914f6cdc
github.com/spf13/cobra.(*Command).ExecuteC(0xc000236908)
  github.com/spf13/cobra@v1.7.0/command.go:1068 +0x3a5 fp=0xc000ef9f30 sp=0xc000ef9e78 pc=0x55f6914f7525
github.com/spf13/cobra.(*Command).Execute(...)
  github.com/spf13/cobra@v1.7.0/command.go:992
github.com/spf13/cobra.(*Command).ExecuteContext(...)
  github.com/spf13/cobra@v1.7.0/command.go:985
main.main()
  github.com/ollama/ollama/main.go:12 +0x4d fp=0xc000ef9f50 sp=0xc000ef9f30 pc=0x55f692259ced
runtime.main()
  runtime/proc.go:283 +0x29d fp=0xc000ef9fe0 sp=0xc000ef9f50 pc=0x55f69134971d
runtime.goexit({})
  runtime/asm_amd64.s:1700 +0x1 fp=0xc000ef9fe8 sp=0xc000ef9fe0 pc=0x55f691384e61

goroutine 2 gp=0xc000002e00 m=nil [force gc (idle), 1 minutes]:
runtime.gopark(0x0?, 0x0?, 0x0?, 0x0?, 0x0?)
  runtime/proc.go:435 +0xce fp=0xc000072fa8 sp=0xc000072f88 pc=0x55f69137ceee
runtime.goparkunlock(...)
  runtime/proc.go:441
runtime.forcegchelper()
  runtime/proc.go:348 +0xb8 fp=0xc000072fe0 sp=0xc000072fa8 pc=0x55f691349a58
runtime.goexit({})
  runtime/asm_amd64.s:1700 +0x1 fp=0xc000072fe8 sp=0xc000072fe0 pc=0x55f691384e61
created by runtime.init.7 in goroutine 1
  runtime/proc.go:336 +0x1a

goroutine 18 gp=0xc0000aa380 m=nil [GC sweep wait]:
runtime.gopark(0x1?, 0x0?, 0x0?, 0x0?, 0x0?)
  runtime/proc.go:435 +0xce fp=0xc00006e780 sp=0xc00006e760 pc=0x55f69137ceee
runtime.goparkunlock(...)
  runtime/proc.go:441
runtime.bgsweep(0xc0000b8000)
  runtime/mgcsweep.go:316 +0xdf fp=0xc00006e7c8 sp=0xc00006e780 pc=0x55f6913341ff
runtime.gcenable.gowrap1()
  runtime/mgc.go:204 +0x25 fp=0xc00006e7e0 sp=0xc00006e7c8 pc=0x55f6913285e5
runtime.goexit({})
  runtime/asm_amd64.s:1700 +0x1 fp=0xc00006e7e8 sp=0xc00006e7e0 pc=0x55f691384e61
created by runtime.gcenable in goroutine 1
  runtime/mgc.go:204 +0x66

goroutine 19 gp=0xc0000aa540 m=nil [GC scavenge wait]:
runtime.gopark(0x21c5e4?, 0x1b55b6?, 0x0?, 0x0?, 0x0?)
  runtime/proc.go:435 +0xce fp=0xc00006ef78 sp=0xc00006ef58 pc=0x55f69137ceee
runtime.goparkunlock(...)
  runtime/proc.go:441
runtime.(*scavengerState).park(0x55f6936465a0)
  runtime/mgcscavenge.go:425 +0x49 fp=0xc00006efa8 sp=0xc00006ef78 pc=0x55f691331c49
runtime.bgscavenge(0xc0000b8000)
  runtime/mgcscavenge.go:658 +0x59 fp=0xc00006efc8 sp=0xc00006efa8 pc=0x55f6913321d9
runtime.gcenable.gowrap2()
  runtime/mgc.go:205 +0x25 fp=0xc00006efe0 sp=0xc00006efc8 pc=0x55f691328585
runtime.goexit({})
  runtime/asm_amd64.s:1700 +0x1 fp=0xc00006efe8 sp=0xc00006efe0 pc=0x55f691384e61
created by runtime.gcenable in goroutine 1
  runtime/mgc.go:205 +0xa5

goroutine 34 gp=0xc000104380 m=nil [finalizer wait, 1 minutes]:
runtime.gopark(0x1b8?, 0xc000002380?, 0x1?, 0x23?, 0xc000072688?)
  runtime/proc.go:435 +0xce fp=0xc000072630 sp=0xc000072610 pc=0x55f69137ceee
runtime.runfinq()
  runtime/mfinal.go:196 +0x107 fp=0xc0000727e0 sp=0xc000072630 pc=0x55f6913275a7
runtime.goexit({})
  runtime/asm_amd64.s:1700 +0x1 fp=0xc0000727e8 sp=0xc0000727e0 pc=0x55f691384e61
created by runtime.createfing in goroutine 1
  runtime/mfinal.go:166 +0x3d

goroutine 35 gp=0xc000104e00 m=nil [chan receive]:
runtime.gopark(0xc000181b80?, 0xc01a902018?, 0x60?, 0x47?, 0x55f6914638a8?)
  runtime/proc.go:435 +0xce fp=0xc0002a4718 sp=0xc0002a46f8 pc=0x55f69137ceee
runtime.chanrecv(0xc000100310, 0x0, 0x1)
  runtime/chan.go:664 +0x445 fp=0xc0002a4790 sp=0xc0002a4718 pc=0x55f691319085
runtime.chanrecv1(0x0?, 0x0?)
  runtime/chan.go:506 +0x12 fp=0xc0002a47b8 sp=0xc0002a4790 pc=0x55f691318c12
runtime.unique_runtime_registerUniqueMapCleanup.func2(...)
  runtime/mgc.go:1796
runtime.unique_runtime_registerUniqueMapCleanup.gowrap1()
  runtime/mgc.go:1799 +0x2f fp=0xc0002a47e0 sp=0xc0002a47b8 pc=0x55f69132b78f
runtime.goexit({})
  runtime/asm_amd64.s:1700 +0x1 fp=0xc0002a47e8 sp=0xc0002a47e0 pc=0x55f691384e61
created by unique.runtime_registerUniqueMapCleanup in goroutine 1
  runtime/mgc.go:1794 +0x85

goroutine 36 gp=0xc000105180 m=nil [GC worker (idle), 1 minutes]:
runtime.gopark(0x0?, 0x0?, 0x0?, 0x0?, 0x0?)
  runtime/proc.go:435 +0xce fp=0xc0002a4f38 sp=0xc0002a4f18 pc=0x55f69137ceee
runtime.gcBgMarkWorker(0xc000101730)
  runtime/mgc.go:1423 +0xe9 fp=0xc0002a4fc8 sp=0xc0002a4f38 pc=0x55f69132aaa9
runtime.gcBgMarkStartWorkers.gowrap1()
  runtime/mgc.go:1339 +0x25 fp=0xc0002a4fe0 sp=0xc0002a4fc8 pc=0x55f69132a985
runtime.goexit({})
  runtime/asm_amd64.s:1700 +0x1 fp=0xc0002a4fe8 sp=0xc0002a4fe0 pc=0x55f691384e61
created by runtime.gcBgMarkStartWorkers in goroutine 1
  runtime/mgc.go:1339 +0x105

goroutine 20 gp=0xc0000aa700 m=nil [GC worker (idle), 1 minutes]:
runtime.gopark(0x0?, 0x0?, 0x0?, 0x0?, 0x0?)
  runtime/proc.go:435 +0xce fp=0xc00006f738 sp=0xc00006f718 pc=0x55f69137ceee
runtime.gcBgMarkWorker(0xc000101730)
  runtime/mgc.go:1423 +0xe9 fp=0xc00006f7c8 sp=0xc00006f738 pc=0x55f69132aaa9
runtime.gcBgMarkStartWorkers.gowrap1()
  runtime/mgc.go:1339 +0x25 fp=0xc00006f7e0 sp=0xc00006f7c8 pc=0x55f69132a985
runtime.goexit({})
  runtime/asm_amd64.s:1700 +0x1 fp=0xc00006f7e8 sp=0xc00006f7e0 pc=0x55f691384e61
created by runtime.gcBgMarkStartWorkers in goroutine 1
  runtime/mgc.go:1339 +0x105

goroutine 21 gp=0xc0000aa8c0 m=nil [GC worker (idle), 1 minutes]:
runtime.gopark(0x0?, 0x0?, 0x0?, 0x0?, 0x0?)
  runtime/proc.go:435 +0xce fp=0xc00006ff38 sp=0xc00006ff18 pc=0x55f69137ceee
runtime.gcBgMarkWorker(0xc000101730)
  runtime/mgc.go:1423 +0xe9 fp=0xc00006ffc8 sp=0xc00006ff38 pc=0x55f69132aaa9
runtime.gcBgMarkStartWorkers.gowrap1()
  runtime/mgc.go:1339 +0x25 fp=0xc00006ffe0 sp=0xc00006ffc8 pc=0x55f69132a985
runtime.goexit({})
  runtime/asm_amd64.s:1700 +0x1 fp=0xc00006ffe8 sp=0xc00006ffe0 pc=0x55f691384e61
created by runtime.gcBgMarkStartWorkers in goroutine 1
  runtime/mgc.go:1339 +0x105

goroutine 22 gp=0xc0000aaa80 m=nil [GC worker (idle)]:
runtime.gopark(0x2d89bcb25f033?, 0x3?, 0xfb?, 0x9c?, 0x0?)
  runtime/proc.go:435 +0xce fp=0xc000070738 sp=0xc000070718 pc=0x55f69137ceee
runtime.gcBgMarkWorker(0xc000101730)
  runtime/mgc.go:1423 +0xe9 fp=0xc0000707c8 sp=0xc000070738 pc=0x55f69132aaa9
runtime.gcBgMarkStartWorkers.gowrap1()
  runtime/mgc.go:1339 +0x25 fp=0xc0000707e0 sp=0xc0000707c8 pc=0x55f69132a985
runtime.goexit({})
  runtime/asm_amd64.s:1700 +0x1 fp=0xc0000707e8 sp=0xc0000707e0 pc=0x55f691384e61
created by runtime.gcBgMarkStartWorkers in goroutine 1
  runtime/mgc.go:1339 +0x105

goroutine 23 gp=0xc0000aac40 m=nil [GC worker (idle)]:
runtime.gopark(0x2d89bcb210047?, 0x0?, 0x0?, 0x0?, 0x0?)
  runtime/proc.go:435 +0xce fp=0xc000070f38 sp=0xc000070f18 pc=0x55f69137ceee
runtime.gcBgMarkWorker(0xc000101730)
  runtime/mgc.go:1423 +0xe9 fp=0xc000070fc8 sp=0xc000070f38 pc=0x55f69132aaa9
runtime.gcBgMarkStartWorkers.gowrap1()
  runtime/mgc.go:1339 +0x25 fp=0xc000070fe0 sp=0xc000070fc8 pc=0x55f69132a985
runtime.goexit({})
  runtime/asm_amd64.s:1700 +0x1 fp=0xc000070fe8 sp=0xc000070fe0 pc=0x55f691384e61
created by runtime.gcBgMarkStartWorkers in goroutine 1
  runtime/mgc.go:1339 +0x105

goroutine 24 gp=0xc0000aae00 m=nil [GC worker (idle)]:
runtime.gopark(0x2d89bcb3b2f26?, 0x0?, 0x0?, 0x0?, 0x0?)
  runtime/proc.go:435 +0xce fp=0xc000071738 sp=0xc000071718 pc=0x55f69137ceee
runtime.gcBgMarkWorker(0xc000101730)
  runtime/mgc.go:1423 +0xe9 fp=0xc0000717c8 sp=0xc000071738 pc=0x55f69132aaa9
runtime.gcBgMarkStartWorkers.gowrap1()
  runtime/mgc.go:1339 +0x25 fp=0xc0000717e0 sp=0xc0000717c8 pc=0x55f69132a985
runtime.goexit({})
  runtime/asm_amd64.s:1700 +0x1 fp=0xc0000717e8 sp=0xc0000717e0 pc=0x55f691384e61
created by runtime.gcBgMarkStartWorkers in goroutine 1
  runtime/mgc.go:1339 +0x105

goroutine 25 gp=0xc0000aafc0 m=nil [GC worker (idle)]:
runtime.gopark(0x55f69371b520?, 0x1?, 0x1e?, 0x3b?, 0x0?)
  runtime/proc.go:435 +0xce fp=0xc000071f38 sp=0xc000071f18 pc=0x55f69137ceee
runtime.gcBgMarkWorker(0xc000101730)
  runtime/mgc.go:1423 +0xe9 fp=0xc000071fc8 sp=0xc000071f38 pc=0x55f69132aaa9
runtime.gcBgMarkStartWorkers.gowrap1()
  runtime/mgc.go:1339 +0x25 fp=0xc000071fe0 sp=0xc000071fc8 pc=0x55f69132a985
runtime.goexit({})
  runtime/asm_amd64.s:1700 +0x1 fp=0xc000071fe8 sp=0xc000071fe0 pc=0x55f691384e61
created by runtime.gcBgMarkStartWorkers in goroutine 1
  runtime/mgc.go:1339 +0x105

goroutine 3 gp=0xc0000036c0 m=nil [GC worker (idle)]:
runtime.gopark(0x2d89bcb25f4c1?, 0x3?, 0xa0?, 0x51?, 0x0?)
  runtime/proc.go:435 +0xce fp=0xc000073738 sp=0xc000073718 pc=0x55f69137ceee
runtime.gcBgMarkWorker(0xc000101730)
  runtime/mgc.go:1423 +0xe9 fp=0xc0000737c8 sp=0xc000073738 pc=0x55f69132aaa9
runtime.gcBgMarkStartWorkers.gowrap1()
  runtime/mgc.go:1339 +0x25 fp=0xc0000737e0 sp=0xc0000737c8 pc=0x55f69132a985
runtime.goexit({})
  runtime/asm_amd64.s:1700 +0x1 fp=0xc0000737e8 sp=0xc0000737e0 pc=0x55f691384e61
created by runtime.gcBgMarkStartWorkers in goroutine 1
  runtime/mgc.go:1339 +0x105

goroutine 4 gp=0xc000003880 m=nil [GC worker (idle)]:
runtime.gopark(0x55f69371b520?, 0x3?, 0x3e?, 0x50?, 0x0?)
  runtime/proc.go:435 +0xce fp=0xc000073f38 sp=0xc000073f18 pc=0x55f69137ceee
runtime.gcBgMarkWorker(0xc000101730)
  runtime/mgc.go:1423 +0xe9 fp=0xc000073fc8 sp=0xc000073f38 pc=0x55f69132aaa9
runtime.gcBgMarkStartWorkers.gowrap1()
  runtime/mgc.go:1339 +0x25 fp=0xc000073fe0 sp=0xc000073fc8 pc=0x55f69132a985
runtime.goexit({})
  runtime/asm_amd64.s:1700 +0x1 fp=0xc000073fe8 sp=0xc000073fe0 pc=0x55f691384e61
created by runtime.gcBgMarkStartWorkers in goroutine 1
  runtime/mgc.go:1339 +0x105

goroutine 37 gp=0xc000105340 m=nil [GC worker (idle)]:
runtime.gopark(0x55f69371b520?, 0x1?, 0x94?, 0x6d?, 0x0?)
  runtime/proc.go:435 +0xce fp=0xc0002a5738 sp=0xc0002a5718 pc=0x55f69137ceee
runtime.gcBgMarkWorker(0xc000101730)
  runtime/mgc.go:1423 +0xe9 fp=0xc0002a57c8 sp=0xc0002a5738 pc=0x55f69132aaa9
runtime.gcBgMarkStartWorkers.gowrap1()
  runtime/mgc.go:1339 +0x25 fp=0xc0002a57e0 sp=0xc0002a57c8 pc=0x55f69132a985
runtime.goexit({})
  runtime/asm_amd64.s:1700 +0x1 fp=0xc0002a57e8 sp=0xc0002a57e0 pc=0x55f691384e61
created by runtime.gcBgMarkStartWorkers in goroutine 1
  runtime/mgc.go:1339 +0x105

goroutine 26 gp=0xc0000ab180 m=nil [GC worker (idle)]:
runtime.gopark(0x2d89bcb6c4cbe?, 0x1?, 0xe0?, 0x26?, 0x0?)
  runtime/proc.go:435 +0xce fp=0xc0002a0738 sp=0xc0002a0718 pc=0x55f69137ceee
runtime.gcBgMarkWorker(0xc000101730)
  runtime/mgc.go:1423 +0xe9 fp=0xc0002a07c8 sp=0xc0002a0738 pc=0x55f69132aaa9
runtime.gcBgMarkStartWorkers.gowrap1()
  runtime/mgc.go:1339 +0x25 fp=0xc0002a07e0 sp=0xc0002a07c8 pc=0x55f69132a985
runtime.goexit({})
  runtime/asm_amd64.s:1700 +0x1 fp=0xc0002a07e8 sp=0xc0002a07e0 pc=0x55f691384e61
created by runtime.gcBgMarkStartWorkers in goroutine 1
  runtime/mgc.go:1339 +0x105

goroutine 27 gp=0xc0000ab340 m=nil [GC worker (idle)]:
runtime.gopark(0x2d89bcb228d96?, 0x3?, 0xc8?, 0x91?, 0x0?)
  runtime/proc.go:435 +0xce fp=0xc0002a0f38 sp=0xc0002a0f18 pc=0x55f69137ceee
runtime.gcBgMarkWorker(0xc000101730)
  runtime/mgc.go:1423 +0xe9 fp=0xc0002a0fc8 sp=0xc0002a0f38 pc=0x55f69132aaa9
runtime.gcBgMarkStartWorkers.gowrap1()
  runtime/mgc.go:1339 +0x25 fp=0xc0002a0fe0 sp=0xc0002a0fc8 pc=0x55f69132a985
runtime.goexit({})
  runtime/asm_amd64.s:1700 +0x1 fp=0xc0002a0fe8 sp=0xc0002a0fe0 pc=0x55f691384e61
created by runtime.gcBgMarkStartWorkers in goroutine 1
  runtime/mgc.go:1339 +0x105

goroutine 5 gp=0xc000582fc0 m=nil [chan receive]:
runtime.gopark(0x30?, 0x55f692b4dd80?, 0x1?, 0x0?, 0xc000efb798?)
  runtime/proc.go:435 +0xce fp=0xc000efb750 sp=0xc000efb730 pc=0x55f69137ceee
runtime.chanrecv(0xc000bc0230, 0x0, 0x1)
  runtime/chan.go:664 +0x445 fp=0xc000efb7c8 sp=0xc000efb750 pc=0x55f691319085
runtime.chanrecv1(0x55f692670664?, 0x29?)
  runtime/chan.go:506 +0x12 fp=0xc000efb7f0 sp=0xc000efb7c8 pc=0x55f691318c12
github.com/ollama/ollama/runner/ollamarunner.(*Server).forwardBatch(_, {0x1, {0x55f692c1e9d0, 0xc000bb9840}, {0x55f692c2be30, 0xc001ce2120}, {0xc000bca000, 0x37, 0x40}, {{0x55f692c2be30, ...}, ...}, ...})
  github.com/ollama/ollama/runner/ollamarunner/runner.go:476 +0xfa fp=0xc000efbb58 sp=0xc000efb7f0 pc=0x55f69199c09a
github.com/ollama/ollama/runner/ollamarunner.(*Server).run(0xc00022f0e0, {0x55f692c12520, 0xc00039d810})
  github.com/ollama/ollama/runner/ollamarunner/runner.go:453 +0x18c fp=0xc000efbfb8 sp=0xc000efbb58 pc=0x55f69199bd4c
github.com/ollama/ollama/runner/ollamarunner.Execute.gowrap1()
  github.com/ollama/ollama/runner/ollamarunner/runner.go:1424 +0x28 fp=0xc000efbfe0 sp=0xc000efbfb8 pc=0x55f6919a5488
runtime.goexit({})
  runtime/asm_amd64.s:1700 +0x1 fp=0xc000efbfe8 sp=0xc000efbfe0 pc=0x55f691384e61
created by github.com/ollama/ollama/runner/ollamarunner.Execute in goroutine 1
  github.com/ollama/ollama/runner/ollamarunner/runner.go:1424 +0x4c9

goroutine 6 gp=0xc000583180 m=nil [select]:
runtime.gopark(0xc000049a08?, 0x2?, 0xc0?, 0x97?, 0xc00004986c?)
  runtime/proc.go:435 +0xce fp=0xc000049698 sp=0xc000049678 pc=0x55f69137ceee
runtime.selectgo(0xc000049a08, 0xc000049868, 0x237?, 0x0, 0x1?, 0x1)
  runtime/select.go:351 +0x837 fp=0xc0000497d0 sp=0xc000049698 pc=0x55f69135bc17
github.com/ollama/ollama/runner/ollamarunner.(*Server).completion(0xc00022f0e0, {0x55f692c0fda0, 0xc000020620}, 0xc00033af00)
  github.com/ollama/ollama/runner/ollamarunner/runner.go:956 +0xc4e fp=0xc000049ac0 sp=0xc0000497d0 pc=0x55f6919a052e
github.com/ollama/ollama/runner/ollamarunner.(*Server).completion-fm({0x55f692c0fda0?, 0xc000020620?}, 0xc000049b40?)
  <autogenerated>:1 +0x36 fp=0xc000049af0 sp=0xc000049ac0 pc=0x55f6919a5976
net/http.HandlerFunc.ServeHTTP(0xc000693bc0?, {0x55f692c0fda0?, 0xc000020620?}, 0xc000049b60?)
  net/http/server.go:2294 +0x29 fp=0xc000049b18 sp=0xc000049af0 pc=0x55f69167d8c9
net/http.(*ServeMux).ServeHTTP(0x55f691321ac5?, {0x55f692c0fda0, 0xc000020620}, 0xc00033af00)
  net/http/server.go:2822 +0x1c4 fp=0xc000049b68 sp=0xc000049b18 pc=0x55f69167f7c4
net/http.serverHandler.ServeHTTP({0x55f692c0c090?}, {0x55f692c0fda0?, 0xc000020620?}, 0x1?)
  net/http/server.go:3301 +0x8e fp=0xc000049b98 sp=0xc000049b68 pc=0x55f69169d24e
net/http.(*conn).serve(0xc000036750, {0x55f692c124e8, 0xc00033f410})
  net/http/server.go:2102 +0x625 fp=0xc000049fb8 sp=0xc000049b98 pc=0x55f69167bdc5
net/http.(*Server).Serve.gowrap3()
  net/http/server.go:3454 +0x28 fp=0xc000049fe0 sp=0xc000049fb8 pc=0x55f691681688
runtime.goexit({})
  runtime/asm_amd64.s:1700 +0x1 fp=0xc000049fe8 sp=0xc000049fe0 pc=0x55f691384e61
created by net/http.(*Server).Serve in goroutine 1
  net/http/server.go:3454 +0x485

goroutine 1982 gp=0xc000532540 m=nil [chan receive]:
runtime.gopark(0x30?, 0x55f692b4dd80?, 0x1?, 0x99?, 0xc000089b20?)
  runtime/proc.go:435 +0xce fp=0xc000089ad8 sp=0xc000089ab8 pc=0x55f69137ceee
runtime.chanrecv(0xc000544d90, 0x0, 0x1)
  runtime/chan.go:664 +0x445 fp=0xc000089b50 sp=0xc000089ad8 pc=0x55f691319085
runtime.chanrecv1(0x55f692674342?, 0x2c?)
  runtime/chan.go:506 +0x12 fp=0xc000089b78 sp=0xc000089b50 pc=0x55f691318c12
github.com/ollama/ollama/runner/ollamarunner.(*Server).computeBatch(0xc00022f0e0, {0x1, {0x55f692c1e9d0, 0xc000bb9840}, {0x55f692c2be30, 0xc001ce2120}, {0xc000bca000, 0x37, 0x40}, {{0x55f692c2be30, ...}, ...}, ...})
  github.com/ollama/ollama/runner/ollamarunner/runner.go:645 +0x185 fp=0xc000089ef0 sp=0xc000089b78 pc=0x55f69199dba5
github.com/ollama/ollama/runner/ollamarunner.(*Server).run.gowrap1()
  github.com/ollama/ollama/runner/ollamarunner/runner.go:459 +0x58 fp=0xc000089fe0 sp=0xc000089ef0 pc=0x55f69199bf78
runtime.goexit({})
  runtime/asm_amd64.s:1700 +0x1 fp=0xc000089fe8 sp=0xc000089fe0 pc=0x55f691384e61
created by github.com/ollama/ollama/runner/ollamarunner.(*Server).run in goroutine 5
  github.com/ollama/ollama/runner/ollamarunner/runner.go:459 +0x2cd

goroutine 2071 gp=0xc000aacc40 m=nil [IO wait]:
runtime.gopark(0x6574696c?, 0xa0c4686120726f62?, 0x45?, 0x6e?, 0xb?)
  runtime/proc.go:435 +0xce fp=0xc0005315d8 sp=0xc0005315b8 pc=0x55f69137ceee
runtime.netpollblock(0x55f6913a0798?, 0x913164a6?, 0xf6?)
  runtime/netpoll.go:575 +0xf7 fp=0xc000531610 sp=0xc0005315d8 pc=0x55f691342097
internal/poll.runtime_pollWait(0x7fe4082344f8, 0x72)
  runtime/netpoll.go:351 +0x85 fp=0xc000531630 sp=0xc000531610 pc=0x55f69137c105
internal/poll.(*pollDesc).wait(0xc0001d8a80?, 0xc00033f511?, 0x0)
  internal/poll/fd_poll_runtime.go:84 +0x27 fp=0xc000531658 sp=0xc000531630 pc=0x55f691404487
internal/poll.(*pollDesc).waitRead(...)
  internal/poll/fd_poll_runtime.go:89
internal/poll.(*FD).Read(0xc0001d8a80, {0xc00033f511, 0x1, 0x1})
  internal/poll/fd_unix.go:165 +0x27a fp=0xc0005316f0 sp=0xc000531658 pc=0x55f69140577a
net.(*netFD).Read(0xc0001d8a80, {0xc00033f511?, 0xc00044d9d8?, 0xc000531770?})
  net/fd_posix.go:55 +0x25 fp=0xc000531738 sp=0xc0005316f0 pc=0x55f69147ada5
net.(*conn).Read(0xc00068c6e8, {0xc00033f511?, 0xc000127d00?, 0x55f6916f3f80?})
  net/net.go:194 +0x45 fp=0xc000531780 sp=0xc000531738 pc=0x55f691489165
net/http.(*connReader).backgroundRead(0xc00033f500)
  net/http/server.go:690 +0x37 fp=0xc0005317c8 sp=0xc000531780 pc=0x55f691675c97
net/http.(*connReader).startBackgroundRead.gowrap2()
  net/http/server.go:686 +0x25 fp=0xc0005317e0 sp=0xc0005317c8 pc=0x55f691675bc5
runtime.goexit({})
  runtime/asm_amd64.s:1700 +0x1 fp=0xc0005317e8 sp=0xc0005317e0 pc=0x55f691384e61
created by net/http.(*connReader).startBackgroundRead in goroutine 6
  net/http/server.go:686 +0xb6

rax    0x0
rbx    0x295
rcx    0x7fe44f61fb2c
rdx    0x6
rdi    0x291
rsi    0x295
rbp    0x7fe4037fcc00
rsp    0x7fe4037fcbc0
r8     0x0
r9     0x7
r10    0x8
r11    0x246
r12    0x6
r13    0x7fe377df6710
r14    0x16
r15    0x200000
rip    0x7fe44f61fb2c
rflags 0x246
cs     0x33
fs     0x0
gs     0x0
time=2026-02-26T19:17:09.801Z level=ERROR source=server.go:1610 msg="post predict" error="Post \"http://127.0.0.1:38985/completion\": EOF"
[GIN] 2026/02/26 - 19:17:09 | 500 |  2.156077265s |       127.0.0.1 | POST     "/api/chat"
time=2026-02-26T19:17:09.911Z level=ERROR source=server.go:304 msg="llama runner terminated" error="exit status 2"

<!-- gh-comment-id:3968871023 --> @Adrian-at-CrimsonAzure commented on GitHub (Feb 26, 2026): I am getting the same invalid argument error, but without the truncation line. This happens with both OpenWebUI and from the `ollama run` command for any message more than a few words long. What's strange to me is that it doesn't look like it uses more VRAM or RAM than `qwen3:30b` or [this 42b qwen3+brainstorm](https://huggingface.co/mradermacher/Qwen3-42B-A3B-2507-Thinking-TOTAL-RECALL-v2-Medium-MASTER-CODER-GGUF) model, at least, judging by the allocation given in the logs. The same prompt works fine on either of those two, but crashes on Qwen3.5. Ollama v0.17.1 docker, Ubuntu 24.04. <details> <summary>Logs</summary> ``` [GIN] 2026/02/26 - 19:15:34 | 200 | 42.429µs | 127.0.0.1 | HEAD "/" [GIN] 2026/02/26 - 19:15:35 | 200 | 715.789254ms | 127.0.0.1 | POST "/api/show" [GIN] 2026/02/26 - 19:15:36 | 200 | 697.127338ms | 127.0.0.1 | POST "/api/show" time=2026-02-26T19:15:37.031Z level=INFO source=server.go:431 msg="starting runner" cmd="/usr/bin/ollama runner --ollama-engine --port 35197" time=2026-02-26T19:15:37.602Z level=INFO source=server.go:431 msg="starting runner" cmd="/usr/bin/ollama runner --ollama-engine --port 36969" time=2026-02-26T19:15:37.891Z level=INFO source=server.go:431 msg="starting runner" cmd="/usr/bin/ollama runner --ollama-engine --port 34163" time=2026-02-26T19:15:38.183Z level=INFO source=server.go:431 msg="starting runner" cmd="/usr/bin/ollama runner --ollama-engine --port 44353" time=2026-02-26T19:15:38.462Z level=INFO source=server.go:431 msg="starting runner" cmd="/usr/bin/ollama runner --ollama-engine --port 35081" time=2026-02-26T19:15:38.755Z level=INFO source=server.go:431 msg="starting runner" cmd="/usr/bin/ollama runner --ollama-engine --port 45225" time=2026-02-26T19:15:39.046Z level=INFO source=server.go:431 msg="starting runner" cmd="/usr/bin/ollama runner --ollama-engine --port 33447" time=2026-02-26T19:15:39.340Z level=INFO 
source=server.go:431 msg="starting runner" cmd="/usr/bin/ollama runner --ollama-engine --port 40459" time=2026-02-26T19:15:39.625Z level=INFO source=server.go:431 msg="starting runner" cmd="/usr/bin/ollama runner --ollama-engine --port 45787" time=2026-02-26T19:15:39.862Z level=INFO source=server.go:431 msg="starting runner" cmd="/usr/bin/ollama runner --ollama-engine --port 45541" time=2026-02-26T19:15:40.130Z level=INFO source=server.go:431 msg="starting runner" cmd="/usr/bin/ollama runner --ollama-engine --port 42583" time=2026-02-26T19:15:40.420Z level=INFO source=server.go:431 msg="starting runner" cmd="/usr/bin/ollama runner --ollama-engine --port 35447" time=2026-02-26T19:15:40.701Z level=INFO source=server.go:431 msg="starting runner" cmd="/usr/bin/ollama runner --ollama-engine --port 42267" time=2026-02-26T19:15:40.986Z level=INFO source=server.go:431 msg="starting runner" cmd="/usr/bin/ollama runner --ollama-engine --port 37165" time=2026-02-26T19:15:41.285Z level=INFO source=server.go:431 msg="starting runner" cmd="/usr/bin/ollama runner --ollama-engine --port 34497" time=2026-02-26T19:15:41.565Z level=INFO source=server.go:431 msg="starting runner" cmd="/usr/bin/ollama runner --ollama-engine --port 34005" time=2026-02-26T19:15:41.867Z level=INFO source=server.go:431 msg="starting runner" cmd="/usr/bin/ollama runner --ollama-engine --port 41865" time=2026-02-26T19:15:42.161Z level=INFO source=server.go:431 msg="starting runner" cmd="/usr/bin/ollama runner --ollama-engine --port 33223" time=2026-02-26T19:15:42.351Z level=INFO source=runner.go:464 msg="failure during GPU discovery" OLLAMA_LIBRARY_PATH="[/usr/lib/ollama /usr/lib/ollama/cuda_v12]" extra_envs=map[] error="failed to finish discovery before timeout" time=2026-02-26T19:15:42.352Z level=WARN source=runner.go:356 msg="unable to refresh free memory, using old values" time=2026-02-26T19:15:42.352Z level=INFO source=server.go:431 msg="starting runner" cmd="/usr/bin/ollama runner --ollama-engine 
--port 36083" time=2026-02-26T19:15:42.645Z level=WARN source=cpu_linux.go:130 msg="failed to parse CPU allowed micro secs" error="strconv.ParseInt: parsing \"max\": invalid syntax" time=2026-02-26T19:15:43.095Z level=INFO source=server.go:247 msg="enabling flash attention" time=2026-02-26T19:15:43.095Z level=INFO source=server.go:431 msg="starting runner" cmd="/usr/bin/ollama runner --ollama-engine --model /root/.ollama/models/blobs/sha256-2abd0d805943fa113f934d1ae4f2d5a749b5d4fe2a0a9c64b645c1df15868da7 --port 38985" time=2026-02-26T19:15:43.095Z level=INFO source=sched.go:491 msg="system memory" total="62.7 GiB" free="62.3 GiB" free_swap="0 B" time=2026-02-26T19:15:43.095Z level=INFO source=sched.go:498 msg="gpu memory" id=GPU-95d928de-a574-b766-ce5e-1bbf75c34c1e library=CUDA available="10.5 GiB" free="10.9 GiB" minimum="457.0 MiB" overhead="0 B" time=2026-02-26T19:15:43.096Z level=INFO source=sched.go:498 msg="gpu memory" id=GPU-5fc14822-5dbe-647e-adcd-448f67369791 library=CUDA available="10.3 GiB" free="10.8 GiB" minimum="457.0 MiB" overhead="0 B" time=2026-02-26T19:15:43.096Z level=INFO source=server.go:757 msg="loading model" "model layers"=41 requested=-1 time=2026-02-26T19:15:43.117Z level=INFO source=runner.go:1411 msg="starting ollama engine" time=2026-02-26T19:15:43.117Z level=INFO source=runner.go:1446 msg="Server listening on 127.0.0.1:38985" time=2026-02-26T19:15:43.128Z level=INFO source=runner.go:1284 msg=load request="{Operation:fit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:Enabled KvSize:256000 KvCacheType: NumThreads:6 GPULayers:41[ID:GPU-95d928de-a574-b766-ce5e-1bbf75c34c1e Layers:41(0..40)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}" time=2026-02-26T19:15:43.296Z level=INFO source=ggml.go:136 msg="" architecture=qwen35moe file_type=Q4_K_M name="" description="" num_tensors=1959 num_key_values=57 load_backend: loaded CPU backend from /usr/lib/ollama/libggml-cpu-haswell.so ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no 
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
  Device 0: NVIDIA GeForce GTX 1080 Ti, compute capability 6.1, VMM: yes, ID: GPU-95d928de-a574-b766-ce5e-1bbf75c34c1e
  Device 1: NVIDIA GeForce GTX 1080 Ti, compute capability 6.1, VMM: yes, ID: GPU-5fc14822-5dbe-647e-adcd-448f67369791
load_backend: loaded CUDA backend from /usr/lib/ollama/cuda_v12/libggml-cuda.so
time=2026-02-26T19:15:43.378Z level=INFO source=ggml.go:104 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.BMI2=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 CUDA.0.ARCHS=500,520,600,610,700,750,800,860,890,900,1200 CUDA.0.USE_GRAPHS=1 CUDA.0.PEER_MAX_BATCH_SIZE=128 CUDA.1.ARCHS=500,520,600,610,700,750,800,860,890,900,1200 CUDA.1.USE_GRAPHS=1 CUDA.1.PEER_MAX_BATCH_SIZE=128 compiler=cgo(gcc)
time=2026-02-26T19:15:48.850Z level=INFO source=runner.go:1284 msg=load request="{Operation:fit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:Enabled KvSize:256000 KvCacheType: NumThreads:6 GPULayers:25[ID:GPU-95d928de-a574-b766-ce5e-1bbf75c34c1e Layers:10(15..24) ID:GPU-5fc14822-5dbe-647e-adcd-448f67369791 Layers:15(25..39)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2026-02-26T19:15:53.133Z level=INFO source=runner.go:1284 msg=load request="{Operation:fit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:Enabled KvSize:256000 KvCacheType: NumThreads:6 GPULayers:21[ID:GPU-95d928de-a574-b766-ce5e-1bbf75c34c1e Layers:9(19..27) ID:GPU-5fc14822-5dbe-647e-adcd-448f67369791 Layers:12(28..39)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2026-02-26T19:15:57.385Z level=INFO source=runner.go:1284 msg=load request="{Operation:alloc LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:Enabled KvSize:256000 KvCacheType: NumThreads:6 GPULayers:21[ID:GPU-95d928de-a574-b766-ce5e-1bbf75c34c1e Layers:9(19..27) ID:GPU-5fc14822-5dbe-647e-adcd-448f67369791 Layers:12(28..39)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2026-02-26T19:16:03.652Z level=INFO source=runner.go:1284 msg=load request="{Operation:commit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:Enabled KvSize:256000 KvCacheType: NumThreads:6 GPULayers:21[ID:GPU-95d928de-a574-b766-ce5e-1bbf75c34c1e Layers:9(19..27) ID:GPU-5fc14822-5dbe-647e-adcd-448f67369791 Layers:12(28..39)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2026-02-26T19:16:03.653Z level=INFO source=device.go:240 msg="model weights" device=CUDA0 size="4.6 GiB"
time=2026-02-26T19:16:03.653Z level=INFO source=device.go:240 msg="model weights" device=CUDA1 size="6.1 GiB"
time=2026-02-26T19:16:03.653Z level=INFO source=device.go:245 msg="model weights" device=CPU size="11.6 GiB"
time=2026-02-26T19:16:03.653Z level=INFO source=device.go:251 msg="kv cache" device=CUDA0 size="1.8 GiB"
time=2026-02-26T19:16:03.653Z level=INFO source=device.go:251 msg="kv cache" device=CUDA1 size="1.9 GiB"
time=2026-02-26T19:16:03.653Z level=INFO source=device.go:256 msg="kv cache" device=CPU size="2.7 GiB"
time=2026-02-26T19:16:03.653Z level=INFO source=device.go:262 msg="compute graph" device=CUDA0 size="3.7 GiB"
time=2026-02-26T19:16:03.653Z level=INFO source=device.go:262 msg="compute graph" device=CUDA1 size="1.4 GiB"
time=2026-02-26T19:16:03.653Z level=INFO source=device.go:267 msg="compute graph" device=CPU size="630.8 MiB"
time=2026-02-26T19:16:03.653Z level=INFO source=device.go:272 msg="total memory" size="34.3 GiB"
time=2026-02-26T19:16:03.653Z level=INFO source=sched.go:566 msg="loaded runners" count=1
time=2026-02-26T19:16:03.653Z level=INFO source=server.go:1350 msg="waiting for llama runner to start responding"
time=2026-02-26T19:16:03.653Z level=INFO source=ggml.go:482 msg="offloading 21 repeating layers to GPU"
time=2026-02-26T19:16:03.653Z level=INFO source=ggml.go:486 msg="offloading output layer to CPU"
time=2026-02-26T19:16:03.653Z level=INFO source=ggml.go:494 msg="offloaded 21/41 layers to GPU"
time=2026-02-26T19:16:03.654Z level=INFO source=server.go:1384 msg="waiting for server to become available" status="llm server loading model"
time=2026-02-26T19:16:16.526Z level=INFO source=server.go:1388 msg="llama runner started in 33.43 seconds"
[GIN] 2026/02/26 - 19:16:16 | 200 | 40.30545129s | 127.0.0.1 | POST "/api/generate"
[GIN] 2026/02/26 - 19:16:54 | 200 | 4.764626ms | 172.16.5.4 | GET "/api/tags"
[GIN] 2026/02/26 - 19:16:54 | 200 | 148.766µs | 172.16.5.4 | GET "/api/ps"
CUDA error: invalid argument
  current device: 0, in function ggml_cuda_cpy at //ml/backend/ggml/ggml/src/ggml-cuda/cpy.cu:438
  cudaMemcpyAsyncReserve(src1_ddc, src0_ddc, ggml_nbytes(src0), cudaMemcpyDeviceToDevice, main_stream)
//ml/backend/ggml/ggml/src/ggml-cuda/ggml-cuda.cu:94: CUDA error
/usr/lib/ollama/libggml-base.so.0(+0x1bae8)[0x7fe4005c1ae8]
/usr/lib/ollama/libggml-base.so.0(ggml_print_backtrace+0x1e6)[0x7fe4005c1eb6]
/usr/lib/ollama/libggml-base.so.0(ggml_abort+0x11d)[0x7fe4005c203d]
/usr/lib/ollama/cuda_v12/libggml-cuda.so(+0x143272)[0x7fe37769d272]
/usr/lib/ollama/cuda_v12/libggml-cuda.so(_Z13ggml_cuda_cpyR25ggml_backend_cuda_contextPK11ggml_tensorPS1_+0x1e50)[0x7fe37765b1d0]
/usr/lib/ollama/cuda_v12/libggml-cuda.so(+0x153cae)[0x7fe3776adcae]
/usr/lib/ollama/cuda_v12/libggml-cuda.so(+0x156eea)[0x7fe3776b0eea]
/usr/lib/ollama/cuda_v12/libggml-cuda.so(+0x158e55)[0x7fe3776b2e55]
/usr/bin/ollama(+0x13ac156)[0x55f692353156]
/usr/bin/ollama(+0x132034b)[0x55f6922c734b]
/usr/bin/ollama(+0x3ddae1)[0x55f691384ae1]
SIGABRT: abort
PC=0x7fe44f61fb2c m=5 sigcode=18446744073709551610
signal arrived during cgo execution

goroutine 2072 gp=0xc000532000 m=5 mp=0xc0000a8008 [syscall]:
runtime.cgocall(0x55f6922c7330, 0xc000088aa0)
	runtime/cgocall.go:167 +0x4b fp=0xc000088a78 sp=0xc000088a40 pc=0x55f691379a6b
github.com/ollama/ollama/ml/backend/ggml._Cfunc_ggml_backend_sched_graph_compute_async(0x7fe3de82c8d0, 0x7fdab4f08f20)
	_cgo_gotypes.go:979 +0x4a fp=0xc000088aa0 sp=0xc000088a78 pc=0x55f691864b0a
github.com/ollama/ollama/ml/backend/ggml.(*Context).ComputeWithNotify.func2(...)
	github.com/ollama/ollama/ml/backend/ggml/ggml.go:825
github.com/ollama/ollama/ml/backend/ggml.(*Context).ComputeWithNotify(0xc000546880, 0xc001c1c000?, {0xc0000448c0, 0x1, 0x2?})
	github.com/ollama/ollama/ml/backend/ggml/ggml.go:825 +0x1b2 fp=0xc000088b78 sp=0xc000088aa0 pc=0x55f691873492
github.com/ollama/ollama/runner/ollamarunner.(*Server).computeBatch(0xc00022f0e0, {0x0, {0x55f692c1e9d0, 0xc000546880}, {0x55f692c2be30, 0xc001c1a720}, {0xc001ac2008, 0x200, 0x25f}, {{0x55f692c2be30, ...}, ...}, ...})
	github.com/ollama/ollama/runner/ollamarunner/runner.go:716 +0x862 fp=0xc000088ef0 sp=0xc000088b78 pc=0x55f69199e282
github.com/ollama/ollama/runner/ollamarunner.(*Server).run.gowrap1()
	github.com/ollama/ollama/runner/ollamarunner/runner.go:459 +0x58 fp=0xc000088fe0 sp=0xc000088ef0 pc=0x55f69199bf78
runtime.goexit({})
	runtime/asm_amd64.s:1700 +0x1 fp=0xc000088fe8 sp=0xc000088fe0 pc=0x55f691384e61
created by github.com/ollama/ollama/runner/ollamarunner.(*Server).run in goroutine 5
	github.com/ollama/ollama/runner/ollamarunner/runner.go:459 +0x2cd

goroutine 1 gp=0xc000002380 m=nil [IO wait, 1 minutes]:
runtime.gopark(0x0?, 0x0?, 0x0?, 0x0?, 0x0?)
	runtime/proc.go:435 +0xce fp=0xc000ef9778 sp=0xc000ef9758 pc=0x55f69137ceee
runtime.netpollblock(0xc0005177c8?, 0x913164a6?, 0xf6?)
	runtime/netpoll.go:575 +0xf7 fp=0xc000ef97b0 sp=0xc000ef9778 pc=0x55f691342097
internal/poll.runtime_pollWait(0x7fe408234610, 0x72)
	runtime/netpoll.go:351 +0x85 fp=0xc000ef97d0 sp=0xc000ef97b0 pc=0x55f69137c105
internal/poll.(*pollDesc).wait(0xc0001d8080?, 0x900000036?, 0x0)
	internal/poll/fd_poll_runtime.go:84 +0x27 fp=0xc000ef97f8 sp=0xc000ef97d0 pc=0x55f691404487
internal/poll.(*pollDesc).waitRead(...)
internal/poll/fd_poll_runtime.go:89 internal/poll.(*FD).Accept(0xc0001d8080) internal/poll/fd_unix.go:620 +0x295 fp=0xc000ef98a0 sp=0xc000ef97f8 pc=0x55f691409855 net.(*netFD).accept(0xc0001d8080) net/fd_unix.go:172 +0x29 fp=0xc000ef9958 sp=0xc000ef98a0 pc=0x55f69147cd49 net.(*TCPListener).accept(0xc00044d900) net/tcpsock_posix.go:159 +0x1b fp=0xc000ef99a8 sp=0xc000ef9958 pc=0x55f691492c5b net.(*TCPListener).Accept(0xc00044d900) net/tcpsock.go:380 +0x30 fp=0xc000ef99d8 sp=0xc000ef99a8 pc=0x55f691491b10 net/http.(*onceCloseListener).Accept(0xc000036750?) <autogenerated>:1 +0x24 fp=0xc000ef99f0 sp=0xc000ef99d8 pc=0x55f6916a99c4 net/http.(*Server).Serve(0xc000697500, {0x55f692c0fbc0, 0xc00044d900}) net/http/server.go:3424 +0x30c fp=0xc000ef9b20 sp=0xc000ef99f0 pc=0x55f69168128c github.com/ollama/ollama/runner/ollamarunner.Execute({0xc00012a030, 0x4, 0x4}) github.com/ollama/ollama/runner/ollamarunner/runner.go:1447 +0x94e fp=0xc000ef9cf0 sp=0xc000ef9b20 pc=0x55f6919a520e github.com/ollama/ollama/runner.Execute({0xc00012a010?, 0x0?, 0x0?}) github.com/ollama/ollama/runner/runner.go:18 +0x10e fp=0xc000ef9d30 sp=0xc000ef9cf0 pc=0x55f691a4476e github.com/ollama/ollama/cmd.NewCLI.func3(0xc000697200?, {0x55f69262d236?, 0x4?, 0x55f69262d23a?}) github.com/ollama/ollama/cmd/cmd.go:2270 +0x45 fp=0xc000ef9d58 sp=0xc000ef9d30 pc=0x55f692257845 github.com/spf13/cobra.(*Command).execute(0xc000347b08, {0xc00039d770, 0x5, 0x5}) github.com/spf13/cobra@v1.7.0/command.go:940 +0x85c fp=0xc000ef9e78 sp=0xc000ef9d58 pc=0x55f6914f6cdc github.com/spf13/cobra.(*Command).ExecuteC(0xc000236908) github.com/spf13/cobra@v1.7.0/command.go:1068 +0x3a5 fp=0xc000ef9f30 sp=0xc000ef9e78 pc=0x55f6914f7525 github.com/spf13/cobra.(*Command).Execute(...) github.com/spf13/cobra@v1.7.0/command.go:992 github.com/spf13/cobra.(*Command).ExecuteContext(...) 
github.com/spf13/cobra@v1.7.0/command.go:985 main.main() github.com/ollama/ollama/main.go:12 +0x4d fp=0xc000ef9f50 sp=0xc000ef9f30 pc=0x55f692259ced runtime.main() runtime/proc.go:283 +0x29d fp=0xc000ef9fe0 sp=0xc000ef9f50 pc=0x55f69134971d runtime.goexit({}) runtime/asm_amd64.s:1700 +0x1 fp=0xc000ef9fe8 sp=0xc000ef9fe0 pc=0x55f691384e61 goroutine 2 gp=0xc000002e00 m=nil [force gc (idle), 1 minutes]: runtime.gopark(0x0?, 0x0?, 0x0?, 0x0?, 0x0?) runtime/proc.go:435 +0xce fp=0xc000072fa8 sp=0xc000072f88 pc=0x55f69137ceee runtime.goparkunlock(...) runtime/proc.go:441 runtime.forcegchelper() runtime/proc.go:348 +0xb8 fp=0xc000072fe0 sp=0xc000072fa8 pc=0x55f691349a58 runtime.goexit({}) runtime/asm_amd64.s:1700 +0x1 fp=0xc000072fe8 sp=0xc000072fe0 pc=0x55f691384e61 created by runtime.init.7 in goroutine 1 runtime/proc.go:336 +0x1a goroutine 18 gp=0xc0000aa380 m=nil [GC sweep wait]: runtime.gopark(0x1?, 0x0?, 0x0?, 0x0?, 0x0?) runtime/proc.go:435 +0xce fp=0xc00006e780 sp=0xc00006e760 pc=0x55f69137ceee runtime.goparkunlock(...) runtime/proc.go:441 runtime.bgsweep(0xc0000b8000) runtime/mgcsweep.go:316 +0xdf fp=0xc00006e7c8 sp=0xc00006e780 pc=0x55f6913341ff runtime.gcenable.gowrap1() runtime/mgc.go:204 +0x25 fp=0xc00006e7e0 sp=0xc00006e7c8 pc=0x55f6913285e5 runtime.goexit({}) runtime/asm_amd64.s:1700 +0x1 fp=0xc00006e7e8 sp=0xc00006e7e0 pc=0x55f691384e61 created by runtime.gcenable in goroutine 1 runtime/mgc.go:204 +0x66 goroutine 19 gp=0xc0000aa540 m=nil [GC scavenge wait]: runtime.gopark(0x21c5e4?, 0x1b55b6?, 0x0?, 0x0?, 0x0?) runtime/proc.go:435 +0xce fp=0xc00006ef78 sp=0xc00006ef58 pc=0x55f69137ceee runtime.goparkunlock(...) 
runtime/proc.go:441 runtime.(*scavengerState).park(0x55f6936465a0) runtime/mgcscavenge.go:425 +0x49 fp=0xc00006efa8 sp=0xc00006ef78 pc=0x55f691331c49 runtime.bgscavenge(0xc0000b8000) runtime/mgcscavenge.go:658 +0x59 fp=0xc00006efc8 sp=0xc00006efa8 pc=0x55f6913321d9 runtime.gcenable.gowrap2() runtime/mgc.go:205 +0x25 fp=0xc00006efe0 sp=0xc00006efc8 pc=0x55f691328585 runtime.goexit({}) runtime/asm_amd64.s:1700 +0x1 fp=0xc00006efe8 sp=0xc00006efe0 pc=0x55f691384e61 created by runtime.gcenable in goroutine 1 runtime/mgc.go:205 +0xa5 goroutine 34 gp=0xc000104380 m=nil [finalizer wait, 1 minutes]: runtime.gopark(0x1b8?, 0xc000002380?, 0x1?, 0x23?, 0xc000072688?) runtime/proc.go:435 +0xce fp=0xc000072630 sp=0xc000072610 pc=0x55f69137ceee runtime.runfinq() runtime/mfinal.go:196 +0x107 fp=0xc0000727e0 sp=0xc000072630 pc=0x55f6913275a7 runtime.goexit({}) runtime/asm_amd64.s:1700 +0x1 fp=0xc0000727e8 sp=0xc0000727e0 pc=0x55f691384e61 created by runtime.createfing in goroutine 1 runtime/mfinal.go:166 +0x3d goroutine 35 gp=0xc000104e00 m=nil [chan receive]: runtime.gopark(0xc000181b80?, 0xc01a902018?, 0x60?, 0x47?, 0x55f6914638a8?) runtime/proc.go:435 +0xce fp=0xc0002a4718 sp=0xc0002a46f8 pc=0x55f69137ceee runtime.chanrecv(0xc000100310, 0x0, 0x1) runtime/chan.go:664 +0x445 fp=0xc0002a4790 sp=0xc0002a4718 pc=0x55f691319085 runtime.chanrecv1(0x0?, 0x0?) runtime/chan.go:506 +0x12 fp=0xc0002a47b8 sp=0xc0002a4790 pc=0x55f691318c12 runtime.unique_runtime_registerUniqueMapCleanup.func2(...) runtime/mgc.go:1796 runtime.unique_runtime_registerUniqueMapCleanup.gowrap1() runtime/mgc.go:1799 +0x2f fp=0xc0002a47e0 sp=0xc0002a47b8 pc=0x55f69132b78f runtime.goexit({}) runtime/asm_amd64.s:1700 +0x1 fp=0xc0002a47e8 sp=0xc0002a47e0 pc=0x55f691384e61 created by unique.runtime_registerUniqueMapCleanup in goroutine 1 runtime/mgc.go:1794 +0x85 goroutine 36 gp=0xc000105180 m=nil [GC worker (idle), 1 minutes]: runtime.gopark(0x0?, 0x0?, 0x0?, 0x0?, 0x0?) 
runtime/proc.go:435 +0xce fp=0xc0002a4f38 sp=0xc0002a4f18 pc=0x55f69137ceee runtime.gcBgMarkWorker(0xc000101730) runtime/mgc.go:1423 +0xe9 fp=0xc0002a4fc8 sp=0xc0002a4f38 pc=0x55f69132aaa9 runtime.gcBgMarkStartWorkers.gowrap1() runtime/mgc.go:1339 +0x25 fp=0xc0002a4fe0 sp=0xc0002a4fc8 pc=0x55f69132a985 runtime.goexit({}) runtime/asm_amd64.s:1700 +0x1 fp=0xc0002a4fe8 sp=0xc0002a4fe0 pc=0x55f691384e61 created by runtime.gcBgMarkStartWorkers in goroutine 1 runtime/mgc.go:1339 +0x105 goroutine 20 gp=0xc0000aa700 m=nil [GC worker (idle), 1 minutes]: runtime.gopark(0x0?, 0x0?, 0x0?, 0x0?, 0x0?) runtime/proc.go:435 +0xce fp=0xc00006f738 sp=0xc00006f718 pc=0x55f69137ceee runtime.gcBgMarkWorker(0xc000101730) runtime/mgc.go:1423 +0xe9 fp=0xc00006f7c8 sp=0xc00006f738 pc=0x55f69132aaa9 runtime.gcBgMarkStartWorkers.gowrap1() runtime/mgc.go:1339 +0x25 fp=0xc00006f7e0 sp=0xc00006f7c8 pc=0x55f69132a985 runtime.goexit({}) runtime/asm_amd64.s:1700 +0x1 fp=0xc00006f7e8 sp=0xc00006f7e0 pc=0x55f691384e61 created by runtime.gcBgMarkStartWorkers in goroutine 1 runtime/mgc.go:1339 +0x105 goroutine 21 gp=0xc0000aa8c0 m=nil [GC worker (idle), 1 minutes]: runtime.gopark(0x0?, 0x0?, 0x0?, 0x0?, 0x0?) runtime/proc.go:435 +0xce fp=0xc00006ff38 sp=0xc00006ff18 pc=0x55f69137ceee runtime.gcBgMarkWorker(0xc000101730) runtime/mgc.go:1423 +0xe9 fp=0xc00006ffc8 sp=0xc00006ff38 pc=0x55f69132aaa9 runtime.gcBgMarkStartWorkers.gowrap1() runtime/mgc.go:1339 +0x25 fp=0xc00006ffe0 sp=0xc00006ffc8 pc=0x55f69132a985 runtime.goexit({}) runtime/asm_amd64.s:1700 +0x1 fp=0xc00006ffe8 sp=0xc00006ffe0 pc=0x55f691384e61 created by runtime.gcBgMarkStartWorkers in goroutine 1 runtime/mgc.go:1339 +0x105 goroutine 22 gp=0xc0000aaa80 m=nil [GC worker (idle)]: runtime.gopark(0x2d89bcb25f033?, 0x3?, 0xfb?, 0x9c?, 0x0?) 
runtime/proc.go:435 +0xce fp=0xc000070738 sp=0xc000070718 pc=0x55f69137ceee runtime.gcBgMarkWorker(0xc000101730) runtime/mgc.go:1423 +0xe9 fp=0xc0000707c8 sp=0xc000070738 pc=0x55f69132aaa9 runtime.gcBgMarkStartWorkers.gowrap1() runtime/mgc.go:1339 +0x25 fp=0xc0000707e0 sp=0xc0000707c8 pc=0x55f69132a985 runtime.goexit({}) runtime/asm_amd64.s:1700 +0x1 fp=0xc0000707e8 sp=0xc0000707e0 pc=0x55f691384e61 created by runtime.gcBgMarkStartWorkers in goroutine 1 runtime/mgc.go:1339 +0x105 goroutine 23 gp=0xc0000aac40 m=nil [GC worker (idle)]: runtime.gopark(0x2d89bcb210047?, 0x0?, 0x0?, 0x0?, 0x0?) runtime/proc.go:435 +0xce fp=0xc000070f38 sp=0xc000070f18 pc=0x55f69137ceee runtime.gcBgMarkWorker(0xc000101730) runtime/mgc.go:1423 +0xe9 fp=0xc000070fc8 sp=0xc000070f38 pc=0x55f69132aaa9 runtime.gcBgMarkStartWorkers.gowrap1() runtime/mgc.go:1339 +0x25 fp=0xc000070fe0 sp=0xc000070fc8 pc=0x55f69132a985 runtime.goexit({}) runtime/asm_amd64.s:1700 +0x1 fp=0xc000070fe8 sp=0xc000070fe0 pc=0x55f691384e61 created by runtime.gcBgMarkStartWorkers in goroutine 1 runtime/mgc.go:1339 +0x105 goroutine 24 gp=0xc0000aae00 m=nil [GC worker (idle)]: runtime.gopark(0x2d89bcb3b2f26?, 0x0?, 0x0?, 0x0?, 0x0?) runtime/proc.go:435 +0xce fp=0xc000071738 sp=0xc000071718 pc=0x55f69137ceee runtime.gcBgMarkWorker(0xc000101730) runtime/mgc.go:1423 +0xe9 fp=0xc0000717c8 sp=0xc000071738 pc=0x55f69132aaa9 runtime.gcBgMarkStartWorkers.gowrap1() runtime/mgc.go:1339 +0x25 fp=0xc0000717e0 sp=0xc0000717c8 pc=0x55f69132a985 runtime.goexit({}) runtime/asm_amd64.s:1700 +0x1 fp=0xc0000717e8 sp=0xc0000717e0 pc=0x55f691384e61 created by runtime.gcBgMarkStartWorkers in goroutine 1 runtime/mgc.go:1339 +0x105 goroutine 25 gp=0xc0000aafc0 m=nil [GC worker (idle)]: runtime.gopark(0x55f69371b520?, 0x1?, 0x1e?, 0x3b?, 0x0?) 
runtime/proc.go:435 +0xce fp=0xc000071f38 sp=0xc000071f18 pc=0x55f69137ceee runtime.gcBgMarkWorker(0xc000101730) runtime/mgc.go:1423 +0xe9 fp=0xc000071fc8 sp=0xc000071f38 pc=0x55f69132aaa9 runtime.gcBgMarkStartWorkers.gowrap1() runtime/mgc.go:1339 +0x25 fp=0xc000071fe0 sp=0xc000071fc8 pc=0x55f69132a985 runtime.goexit({}) runtime/asm_amd64.s:1700 +0x1 fp=0xc000071fe8 sp=0xc000071fe0 pc=0x55f691384e61 created by runtime.gcBgMarkStartWorkers in goroutine 1 runtime/mgc.go:1339 +0x105 goroutine 3 gp=0xc0000036c0 m=nil [GC worker (idle)]: runtime.gopark(0x2d89bcb25f4c1?, 0x3?, 0xa0?, 0x51?, 0x0?) runtime/proc.go:435 +0xce fp=0xc000073738 sp=0xc000073718 pc=0x55f69137ceee runtime.gcBgMarkWorker(0xc000101730) runtime/mgc.go:1423 +0xe9 fp=0xc0000737c8 sp=0xc000073738 pc=0x55f69132aaa9 runtime.gcBgMarkStartWorkers.gowrap1() runtime/mgc.go:1339 +0x25 fp=0xc0000737e0 sp=0xc0000737c8 pc=0x55f69132a985 runtime.goexit({}) runtime/asm_amd64.s:1700 +0x1 fp=0xc0000737e8 sp=0xc0000737e0 pc=0x55f691384e61 created by runtime.gcBgMarkStartWorkers in goroutine 1 runtime/mgc.go:1339 +0x105 goroutine 4 gp=0xc000003880 m=nil [GC worker (idle)]: runtime.gopark(0x55f69371b520?, 0x3?, 0x3e?, 0x50?, 0x0?) runtime/proc.go:435 +0xce fp=0xc000073f38 sp=0xc000073f18 pc=0x55f69137ceee runtime.gcBgMarkWorker(0xc000101730) runtime/mgc.go:1423 +0xe9 fp=0xc000073fc8 sp=0xc000073f38 pc=0x55f69132aaa9 runtime.gcBgMarkStartWorkers.gowrap1() runtime/mgc.go:1339 +0x25 fp=0xc000073fe0 sp=0xc000073fc8 pc=0x55f69132a985 runtime.goexit({}) runtime/asm_amd64.s:1700 +0x1 fp=0xc000073fe8 sp=0xc000073fe0 pc=0x55f691384e61 created by runtime.gcBgMarkStartWorkers in goroutine 1 runtime/mgc.go:1339 +0x105 goroutine 37 gp=0xc000105340 m=nil [GC worker (idle)]: runtime.gopark(0x55f69371b520?, 0x1?, 0x94?, 0x6d?, 0x0?) 
runtime/proc.go:435 +0xce fp=0xc0002a5738 sp=0xc0002a5718 pc=0x55f69137ceee runtime.gcBgMarkWorker(0xc000101730) runtime/mgc.go:1423 +0xe9 fp=0xc0002a57c8 sp=0xc0002a5738 pc=0x55f69132aaa9 runtime.gcBgMarkStartWorkers.gowrap1() runtime/mgc.go:1339 +0x25 fp=0xc0002a57e0 sp=0xc0002a57c8 pc=0x55f69132a985 runtime.goexit({}) runtime/asm_amd64.s:1700 +0x1 fp=0xc0002a57e8 sp=0xc0002a57e0 pc=0x55f691384e61 created by runtime.gcBgMarkStartWorkers in goroutine 1 runtime/mgc.go:1339 +0x105 goroutine 26 gp=0xc0000ab180 m=nil [GC worker (idle)]: runtime.gopark(0x2d89bcb6c4cbe?, 0x1?, 0xe0?, 0x26?, 0x0?) runtime/proc.go:435 +0xce fp=0xc0002a0738 sp=0xc0002a0718 pc=0x55f69137ceee runtime.gcBgMarkWorker(0xc000101730) runtime/mgc.go:1423 +0xe9 fp=0xc0002a07c8 sp=0xc0002a0738 pc=0x55f69132aaa9 runtime.gcBgMarkStartWorkers.gowrap1() runtime/mgc.go:1339 +0x25 fp=0xc0002a07e0 sp=0xc0002a07c8 pc=0x55f69132a985 runtime.goexit({}) runtime/asm_amd64.s:1700 +0x1 fp=0xc0002a07e8 sp=0xc0002a07e0 pc=0x55f691384e61 created by runtime.gcBgMarkStartWorkers in goroutine 1 runtime/mgc.go:1339 +0x105 goroutine 27 gp=0xc0000ab340 m=nil [GC worker (idle)]: runtime.gopark(0x2d89bcb228d96?, 0x3?, 0xc8?, 0x91?, 0x0?) runtime/proc.go:435 +0xce fp=0xc0002a0f38 sp=0xc0002a0f18 pc=0x55f69137ceee runtime.gcBgMarkWorker(0xc000101730) runtime/mgc.go:1423 +0xe9 fp=0xc0002a0fc8 sp=0xc0002a0f38 pc=0x55f69132aaa9 runtime.gcBgMarkStartWorkers.gowrap1() runtime/mgc.go:1339 +0x25 fp=0xc0002a0fe0 sp=0xc0002a0fc8 pc=0x55f69132a985 runtime.goexit({}) runtime/asm_amd64.s:1700 +0x1 fp=0xc0002a0fe8 sp=0xc0002a0fe0 pc=0x55f691384e61 created by runtime.gcBgMarkStartWorkers in goroutine 1 runtime/mgc.go:1339 +0x105 goroutine 5 gp=0xc000582fc0 m=nil [chan receive]: runtime.gopark(0x30?, 0x55f692b4dd80?, 0x1?, 0x0?, 0xc000efb798?) 
runtime/proc.go:435 +0xce fp=0xc000efb750 sp=0xc000efb730 pc=0x55f69137ceee runtime.chanrecv(0xc000bc0230, 0x0, 0x1) runtime/chan.go:664 +0x445 fp=0xc000efb7c8 sp=0xc000efb750 pc=0x55f691319085 runtime.chanrecv1(0x55f692670664?, 0x29?) runtime/chan.go:506 +0x12 fp=0xc000efb7f0 sp=0xc000efb7c8 pc=0x55f691318c12 github.com/ollama/ollama/runner/ollamarunner.(*Server).forwardBatch(_, {0x1, {0x55f692c1e9d0, 0xc000bb9840}, {0x55f692c2be30, 0xc001ce2120}, {0xc000bca000, 0x37, 0x40}, {{0x55f692c2be30, ...}, ...}, ...}) github.com/ollama/ollama/runner/ollamarunner/runner.go:476 +0xfa fp=0xc000efbb58 sp=0xc000efb7f0 pc=0x55f69199c09a github.com/ollama/ollama/runner/ollamarunner.(*Server).run(0xc00022f0e0, {0x55f692c12520, 0xc00039d810}) github.com/ollama/ollama/runner/ollamarunner/runner.go:453 +0x18c fp=0xc000efbfb8 sp=0xc000efbb58 pc=0x55f69199bd4c github.com/ollama/ollama/runner/ollamarunner.Execute.gowrap1() github.com/ollama/ollama/runner/ollamarunner/runner.go:1424 +0x28 fp=0xc000efbfe0 sp=0xc000efbfb8 pc=0x55f6919a5488 runtime.goexit({}) runtime/asm_amd64.s:1700 +0x1 fp=0xc000efbfe8 sp=0xc000efbfe0 pc=0x55f691384e61 created by github.com/ollama/ollama/runner/ollamarunner.Execute in goroutine 1 github.com/ollama/ollama/runner/ollamarunner/runner.go:1424 +0x4c9 goroutine 6 gp=0xc000583180 m=nil [select]: runtime.gopark(0xc000049a08?, 0x2?, 0xc0?, 0x97?, 0xc00004986c?) runtime/proc.go:435 +0xce fp=0xc000049698 sp=0xc000049678 pc=0x55f69137ceee runtime.selectgo(0xc000049a08, 0xc000049868, 0x237?, 0x0, 0x1?, 0x1) runtime/select.go:351 +0x837 fp=0xc0000497d0 sp=0xc000049698 pc=0x55f69135bc17 github.com/ollama/ollama/runner/ollamarunner.(*Server).completion(0xc00022f0e0, {0x55f692c0fda0, 0xc000020620}, 0xc00033af00) github.com/ollama/ollama/runner/ollamarunner/runner.go:956 +0xc4e fp=0xc000049ac0 sp=0xc0000497d0 pc=0x55f6919a052e github.com/ollama/ollama/runner/ollamarunner.(*Server).completion-fm({0x55f692c0fda0?, 0xc000020620?}, 0xc000049b40?) 
<autogenerated>:1 +0x36 fp=0xc000049af0 sp=0xc000049ac0 pc=0x55f6919a5976 net/http.HandlerFunc.ServeHTTP(0xc000693bc0?, {0x55f692c0fda0?, 0xc000020620?}, 0xc000049b60?) net/http/server.go:2294 +0x29 fp=0xc000049b18 sp=0xc000049af0 pc=0x55f69167d8c9 net/http.(*ServeMux).ServeHTTP(0x55f691321ac5?, {0x55f692c0fda0, 0xc000020620}, 0xc00033af00) net/http/server.go:2822 +0x1c4 fp=0xc000049b68 sp=0xc000049b18 pc=0x55f69167f7c4 net/http.serverHandler.ServeHTTP({0x55f692c0c090?}, {0x55f692c0fda0?, 0xc000020620?}, 0x1?) net/http/server.go:3301 +0x8e fp=0xc000049b98 sp=0xc000049b68 pc=0x55f69169d24e net/http.(*conn).serve(0xc000036750, {0x55f692c124e8, 0xc00033f410}) net/http/server.go:2102 +0x625 fp=0xc000049fb8 sp=0xc000049b98 pc=0x55f69167bdc5 net/http.(*Server).Serve.gowrap3() net/http/server.go:3454 +0x28 fp=0xc000049fe0 sp=0xc000049fb8 pc=0x55f691681688 runtime.goexit({}) runtime/asm_amd64.s:1700 +0x1 fp=0xc000049fe8 sp=0xc000049fe0 pc=0x55f691384e61 created by net/http.(*Server).Serve in goroutine 1 net/http/server.go:3454 +0x485 goroutine 1982 gp=0xc000532540 m=nil [chan receive]: runtime.gopark(0x30?, 0x55f692b4dd80?, 0x1?, 0x99?, 0xc000089b20?) runtime/proc.go:435 +0xce fp=0xc000089ad8 sp=0xc000089ab8 pc=0x55f69137ceee runtime.chanrecv(0xc000544d90, 0x0, 0x1) runtime/chan.go:664 +0x445 fp=0xc000089b50 sp=0xc000089ad8 pc=0x55f691319085 runtime.chanrecv1(0x55f692674342?, 0x2c?) 
runtime/chan.go:506 +0x12 fp=0xc000089b78 sp=0xc000089b50 pc=0x55f691318c12 github.com/ollama/ollama/runner/ollamarunner.(*Server).computeBatch(0xc00022f0e0, {0x1, {0x55f692c1e9d0, 0xc000bb9840}, {0x55f692c2be30, 0xc001ce2120}, {0xc000bca000, 0x37, 0x40}, {{0x55f692c2be30, ...}, ...}, ...}) github.com/ollama/ollama/runner/ollamarunner/runner.go:645 +0x185 fp=0xc000089ef0 sp=0xc000089b78 pc=0x55f69199dba5 github.com/ollama/ollama/runner/ollamarunner.(*Server).run.gowrap1() github.com/ollama/ollama/runner/ollamarunner/runner.go:459 +0x58 fp=0xc000089fe0 sp=0xc000089ef0 pc=0x55f69199bf78 runtime.goexit({}) runtime/asm_amd64.s:1700 +0x1 fp=0xc000089fe8 sp=0xc000089fe0 pc=0x55f691384e61 created by github.com/ollama/ollama/runner/ollamarunner.(*Server).run in goroutine 5 github.com/ollama/ollama/runner/ollamarunner/runner.go:459 +0x2cd goroutine 2071 gp=0xc000aacc40 m=nil [IO wait]: runtime.gopark(0x6574696c?, 0xa0c4686120726f62?, 0x45?, 0x6e?, 0xb?) runtime/proc.go:435 +0xce fp=0xc0005315d8 sp=0xc0005315b8 pc=0x55f69137ceee runtime.netpollblock(0x55f6913a0798?, 0x913164a6?, 0xf6?) runtime/netpoll.go:575 +0xf7 fp=0xc000531610 sp=0xc0005315d8 pc=0x55f691342097 internal/poll.runtime_pollWait(0x7fe4082344f8, 0x72) runtime/netpoll.go:351 +0x85 fp=0xc000531630 sp=0xc000531610 pc=0x55f69137c105 internal/poll.(*pollDesc).wait(0xc0001d8a80?, 0xc00033f511?, 0x0) internal/poll/fd_poll_runtime.go:84 +0x27 fp=0xc000531658 sp=0xc000531630 pc=0x55f691404487 internal/poll.(*pollDesc).waitRead(...) 
	internal/poll/fd_poll_runtime.go:89
internal/poll.(*FD).Read(0xc0001d8a80, {0xc00033f511, 0x1, 0x1})
	internal/poll/fd_unix.go:165 +0x27a fp=0xc0005316f0 sp=0xc000531658 pc=0x55f69140577a
net.(*netFD).Read(0xc0001d8a80, {0xc00033f511?, 0xc00044d9d8?, 0xc000531770?})
	net/fd_posix.go:55 +0x25 fp=0xc000531738 sp=0xc0005316f0 pc=0x55f69147ada5
net.(*conn).Read(0xc00068c6e8, {0xc00033f511?, 0xc000127d00?, 0x55f6916f3f80?})
	net/net.go:194 +0x45 fp=0xc000531780 sp=0xc000531738 pc=0x55f691489165
net/http.(*connReader).backgroundRead(0xc00033f500)
	net/http/server.go:690 +0x37 fp=0xc0005317c8 sp=0xc000531780 pc=0x55f691675c97
net/http.(*connReader).startBackgroundRead.gowrap2()
	net/http/server.go:686 +0x25 fp=0xc0005317e0 sp=0xc0005317c8 pc=0x55f691675bc5
runtime.goexit({})
	runtime/asm_amd64.s:1700 +0x1 fp=0xc0005317e8 sp=0xc0005317e0 pc=0x55f691384e61
created by net/http.(*connReader).startBackgroundRead in goroutine 6
	net/http/server.go:686 +0xb6

rax    0x0
rbx    0x295
rcx    0x7fe44f61fb2c
rdx    0x6
rdi    0x291
rsi    0x295
rbp    0x7fe4037fcc00
rsp    0x7fe4037fcbc0
r8     0x0
r9     0x7
r10    0x8
r11    0x246
r12    0x6
r13    0x7fe377df6710
r14    0x16
r15    0x200000
rip    0x7fe44f61fb2c
rflags 0x246
cs     0x33
fs     0x0
gs     0x0
time=2026-02-26T19:17:09.801Z level=ERROR source=server.go:1610 msg="post predict" error="Post \"http://127.0.0.1:38985/completion\": EOF"
[GIN] 2026/02/26 - 19:17:09 | 500 | 2.156077265s | 127.0.0.1 | POST "/api/chat"
time=2026-02-26T19:17:09.911Z level=ERROR source=server.go:304 msg="llama runner terminated" error="exit status 2"
```
</details>

@rick-github commented on GitHub (Feb 26, 2026):

```
CUDA error: invalid argument
current device: 0, in function ggml_cuda_cpy at //ml/backend/ggml/ggml/src/ggml-cuda/cpy.cu:438
cudaMemcpyAsyncReserve(src1_ddc, src0_ddc, ggml_nbytes(src0), cudaMemcpyDeviceToDevice, main_stream)
```

Yes, it's the same issue with a different trigger. AFAIK hasn't been root-caused yet.


@ArthurusDent commented on GitHub (Feb 26, 2026):

I'm seeing the same bug when running qwen3.5:27b-q4_K_M (193ec05b1e80). I hope it's OK to add my log even though it's a different model. If it's not OK, tell me and I'll create a different issue.

Running the model from the command line, or in Open WebUI with the default context window of 4096 (i.e. without truncation), doesn't crash it.
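Since the crash seems tied to larger contexts, one way to reproduce it outside a chat frontend (a sketch; the model tag and context size here are just examples) is to raise `num_ctx` via a Modelfile:

```
FROM qwen3.5:27b-q4_K_M
PARAMETER num_ctx 32768
```

Then `ollama create qwen35-bigctx -f Modelfile` followed by `ollama run qwen35-bigctx` may exercise the same larger-context path that the failing clients request.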

Gemini thought the bug was caused by flash attention combined with mixing two different GPU architectures, Pascal (1060 6GB) and Turing (2060 12GB), but I've benchmarked many models, including qwen3, and never saw this bug, so maybe it's a different issue.
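For what it's worth, both theories could be tested independently by restarting the server with different settings (just suggestions, not a confirmed fix; in Docker these would be passed with `-e`):

```
# Theory A: mixed Pascal/Turing architectures -> pin the runner to a single GPU
CUDA_VISIBLE_DEVICES=0 ollama serve

# Theory B: flash attention -> disable it explicitly
OLLAMA_FLASH_ATTENTION=0 ollama serve
```

If the crash disappears under either setting, that would point at the corresponding theory.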

Deployment: Docker
Ollama: 0.17.1
OS: Ubuntu Server 24.04

time=2026-02-26T14:26:04.438Z level=DEBUG source=runner.go:264 msg="refreshing free memory"
time=2026-02-26T14:26:04.438Z level=DEBUG source=runner.go:328 msg="unable to refresh all GPUs with existing runners, performing bootstrap discovery"
time=2026-02-26T14:26:04.439Z level=INFO source=server.go:431 msg="starting runner" cmd="/usr/bin/ollama runner --ollama-engine --port 38035"
time=2026-02-26T14:26:04.439Z level=DEBUG source=server.go:432 msg=subprocess PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin OLLAMA_DEBUG=1 OLLAMA_KEEP_ALIVE=60m LD_LIBRARY_PATH=/usr/lib/ollama:/usr/lib/ollama/cuda_v12:/usr/local/nvidia/lib:/usr/local/nvidia/lib64 OLLAMA_HOST=0.0.0.0:11434 OLLAMA_LIBRARY_PATH=/usr/lib/ollama:/usr/lib/ollama/cuda_v12
time=2026-02-26T14:26:04.846Z level=DEBUG source=runner.go:437 msg="bootstrap discovery took" duration=407.512657ms OLLAMA_LIBRARY_PATH="[/usr/lib/ollama /usr/lib/ollama/cuda_v12]" extra_envs=map[]
time=2026-02-26T14:26:04.846Z level=DEBUG source=runner.go:40 msg="overall device VRAM discovery took" duration=407.794952ms
time=2026-02-26T14:26:04.847Z level=WARN source=cpu_linux.go:130 msg="failed to parse CPU allowed micro secs" error="strconv.ParseInt: parsing \"max\": invalid syntax"
time=2026-02-26T14:26:04.847Z level=DEBUG source=sched.go:222 msg="updating default concurrency" OLLAMA_MAX_LOADED_MODELS=6 gpu_count=2
time=2026-02-26T14:26:04.893Z level=DEBUG source=ggml.go:324 msg="key with type not found" key=general.alignment default=32
time=2026-02-26T14:26:04.895Z level=DEBUG source=sched.go:258 msg="loading first model" model=/root/.ollama/models/blobs/sha256-7935de6e08f9444536d0edcacf19d2166b34bef8ddb4ac7ce9263ff5cad0693b
time=2026-02-26T14:26:05.055Z level=DEBUG source=ggml.go:324 msg="key with type not found" key=general.alignment default=32
time=2026-02-26T14:26:05.059Z level=DEBUG source=ggml.go:324 msg="key with type not found" key=qwen35.pooling_type default=0
time=2026-02-26T14:26:05.059Z level=DEBUG source=ggml.go:324 msg="key with type not found" key=qwen35.attention.head_count_kv default=0
time=2026-02-26T14:26:05.059Z level=DEBUG source=ggml.go:324 msg="key with type not found" key=qwen35.expert_count default=0
time=2026-02-26T14:26:05.059Z level=DEBUG source=ggml.go:324 msg="key with type not found" key=qwen35.rope.scaling.type default=""
time=2026-02-26T14:26:05.059Z level=DEBUG source=ggml.go:324 msg="key with type not found" key=qwen35.rope.type default=""
time=2026-02-26T14:26:05.059Z level=DEBUG source=ggml.go:324 msg="key with type not found" key=qwen35.rope.scaling.factor default=1
time=2026-02-26T14:26:05.059Z level=DEBUG source=ggml.go:324 msg="key with type not found" key=qwen35.rope.scaling.original_context_length default=0
time=2026-02-26T14:26:05.059Z level=DEBUG source=ggml.go:324 msg="key with type not found" key=qwen35.attention.scale default=0
time=2026-02-26T14:26:05.059Z level=DEBUG source=ggml.go:324 msg="key with type not found" key=qwen35.expert_count default=0
time=2026-02-26T14:26:05.059Z level=DEBUG source=ggml.go:324 msg="key with type not found" key=qwen35.expert_used_count default=0
time=2026-02-26T14:26:05.059Z level=DEBUG source=ggml.go:324 msg="key with type not found" key=qwen35.norm_top_k_prob default=true
time=2026-02-26T14:26:05.059Z level=DEBUG source=ggml.go:324 msg="key with type not found" key=qwen35.mrope_interleaved default=false
time=2026-02-26T14:26:05.059Z level=DEBUG source=ggml.go:324 msg="key with type not found" key=qwen35.vision.attention.layer_norm_epsilon default=9.999999974752427e-07
time=2026-02-26T14:26:05.059Z level=DEBUG source=ggml.go:324 msg="key with type not found" key=qwen35.vision.rope.freq_base default=10000
time=2026-02-26T14:26:05.059Z level=DEBUG source=ggml.go:324 msg="key with type not found" key=qwen35.vision.num_positional_embeddings default=2304
time=2026-02-26T14:26:05.059Z level=DEBUG source=ggml.go:324 msg="key with type not found" key=tokenizer.ggml.add_bos_token default=false
time=2026-02-26T14:26:05.059Z level=DEBUG source=ggml.go:324 msg="key with type not found" key=tokenizer.ggml.bos_token_id default=0
time=2026-02-26T14:26:05.060Z level=INFO source=server.go:247 msg="enabling flash attention"
time=2026-02-26T14:26:05.060Z level=INFO source=server.go:431 msg="starting runner" cmd="/usr/bin/ollama runner --ollama-engine --model /root/.ollama/models/blobs/sha256-7935de6e08f9444536d0edcacf19d2166b34bef8ddb4ac7ce9263ff5cad0693b --port 37403"
time=2026-02-26T14:26:05.060Z level=DEBUG source=server.go:432 msg=subprocess PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin OLLAMA_DEBUG=1 OLLAMA_KEEP_ALIVE=60m LD_LIBRARY_PATH=/usr/lib/ollama:/usr/lib/ollama/cuda_v12:/usr/local/nvidia/lib:/usr/local/nvidia/lib64 OLLAMA_HOST=0.0.0.0:11434 OLLAMA_LIBRARY_PATH=/usr/lib/ollama:/usr/lib/ollama/cuda_v12
time=2026-02-26T14:26:05.061Z level=INFO source=sched.go:491 msg="system memory" total="15.5 GiB" free="6.1 GiB" free_swap="19.6 GiB"
time=2026-02-26T14:26:05.061Z level=INFO source=sched.go:498 msg="gpu memory" id=GPU-8416b7d6-2f5f-d827-5992-4affda11c96a library=CUDA available="11.1 GiB" free="11.6 GiB" minimum="457.0 MiB" overhead="0 B"
time=2026-02-26T14:26:05.061Z level=INFO source=sched.go:498 msg="gpu memory" id=GPU-a4c65198-ad5c-f3c2-d8a3-2829beb71c35 library=CUDA available="5.4 GiB" free="5.9 GiB" minimum="457.0 MiB" overhead="0 B"
time=2026-02-26T14:26:05.061Z level=INFO source=server.go:757 msg="loading model" "model layers"=65 requested=-1
time=2026-02-26T14:26:05.076Z level=INFO source=runner.go:1411 msg="starting ollama engine"
time=2026-02-26T14:26:05.076Z level=INFO source=runner.go:1446 msg="Server listening on 127.0.0.1:37403"
time=2026-02-26T14:26:05.084Z level=INFO source=runner.go:1284 msg=load request="{Operation:fit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:Enabled KvSize:4096 KvCacheType: NumThreads:4 GPULayers:65[ID:GPU-8416b7d6-2f5f-d827-5992-4affda11c96a Layers:65(0..64)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2026-02-26T14:26:05.162Z level=DEBUG source=ggml.go:324 msg="key with type not found" key=general.alignment default=32
time=2026-02-26T14:26:05.165Z level=DEBUG source=ggml.go:324 msg="key with type not found" key=general.name default=""
time=2026-02-26T14:26:05.165Z level=DEBUG source=ggml.go:324 msg="key with type not found" key=general.description default=""
time=2026-02-26T14:26:05.165Z level=INFO source=ggml.go:136 msg="" architecture=qwen35 file_type=Q4_K_M name="" description="" num_tensors=1307 num_key_values=53
time=2026-02-26T14:26:05.165Z level=DEBUG source=ggml.go:94 msg="ggml backend load all from path" path=/usr/lib/ollama
load_backend: loaded CPU backend from /usr/lib/ollama/libggml-cpu-sandybridge.so
time=2026-02-26T14:26:05.174Z level=DEBUG source=ggml.go:94 msg="ggml backend load all from path" path=/usr/lib/ollama/cuda_v12
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
  Device 0: NVIDIA GeForce RTX 2060, compute capability 7.5, VMM: yes, ID: GPU-8416b7d6-2f5f-d827-5992-4affda11c96a
  Device 1: NVIDIA GeForce GTX 1060 6GB, compute capability 6.1, VMM: yes, ID: GPU-a4c65198-ad5c-f3c2-d8a3-2829beb71c35
load_backend: loaded CUDA backend from /usr/lib/ollama/cuda_v12/libggml-cuda.so
time=2026-02-26T14:26:05.320Z level=INFO source=ggml.go:104 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 CUDA.0.ARCHS=500,520,600,610,700,750,800,860,890,900,1200 CUDA.0.USE_GRAPHS=1 CUDA.0.PEER_MAX_BATCH_SIZE=128 CUDA.1.ARCHS=500,520,600,610,700,750,800,860,890,900,1200 CUDA.1.USE_GRAPHS=1 CUDA.1.PEER_MAX_BATCH_SIZE=128 compiler=cgo(gcc)
time=2026-02-26T14:26:05.330Z level=DEBUG source=ggml.go:324 msg="key with type not found" key=qwen35.pooling_type default=0
time=2026-02-26T14:26:05.330Z level=DEBUG source=ggml.go:324 msg="key with type not found" key=qwen35.attention.head_count_kv default=0
time=2026-02-26T14:26:05.330Z level=DEBUG source=ggml.go:324 msg="key with type not found" key=qwen35.expert_count default=0
time=2026-02-26T14:26:05.330Z level=DEBUG source=ggml.go:324 msg="key with type not found" key=qwen35.rope.scaling.type default=""
time=2026-02-26T14:26:05.330Z level=DEBUG source=ggml.go:324 msg="key with type not found" key=qwen35.rope.type default=""
time=2026-02-26T14:26:05.330Z level=DEBUG source=ggml.go:324 msg="key with type not found" key=qwen35.rope.scaling.factor default=1
time=2026-02-26T14:26:05.330Z level=DEBUG source=ggml.go:324 msg="key with type not found" key=qwen35.rope.scaling.original_context_length default=0
time=2026-02-26T14:26:05.330Z level=DEBUG source=ggml.go:324 msg="key with type not found" key=qwen35.attention.scale default=0
time=2026-02-26T14:26:05.330Z level=DEBUG source=ggml.go:324 msg="key with type not found" key=qwen35.expert_count default=0
time=2026-02-26T14:26:05.330Z level=DEBUG source=ggml.go:324 msg="key with type not found" key=qwen35.expert_used_count default=0
time=2026-02-26T14:26:05.330Z level=DEBUG source=ggml.go:324 msg="key with type not found" key=qwen35.norm_top_k_prob default=true
time=2026-02-26T14:26:05.330Z level=DEBUG source=ggml.go:324 msg="key with type not found" key=qwen35.mrope_interleaved default=false
time=2026-02-26T14:26:05.330Z level=DEBUG source=ggml.go:324 msg="key with type not found" key=qwen35.vision.attention.layer_norm_epsilon default=9.999999974752427e-07
time=2026-02-26T14:26:05.330Z level=DEBUG source=ggml.go:324 msg="key with type not found" key=qwen35.vision.rope.freq_base default=10000
time=2026-02-26T14:26:05.330Z level=DEBUG source=ggml.go:324 msg="key with type not found" key=qwen35.vision.num_positional_embeddings default=2304
time=2026-02-26T14:26:05.330Z level=DEBUG source=ggml.go:324 msg="key with type not found" key=tokenizer.ggml.add_bos_token default=false
time=2026-02-26T14:26:05.330Z level=DEBUG source=ggml.go:324 msg="key with type not found" key=tokenizer.ggml.bos_token_id default=0
time=2026-02-26T14:26:06.230Z level=DEBUG source=ggml.go:852 msg="compute graph" nodes=1258 splits=1
time=2026-02-26T14:26:06.638Z level=DEBUG source=ggml.go:852 msg="compute graph" nodes=33583 splits=388
time=2026-02-26T14:26:06.674Z level=DEBUG source=ggml.go:852 msg="compute graph" nodes=4919 splits=2
time=2026-02-26T14:26:06.675Z level=DEBUG source=device.go:240 msg="model weights" device=CUDA0 size="15.5 GiB"
time=2026-02-26T14:26:06.675Z level=DEBUG source=device.go:245 msg="model weights" device=CPU size="710.2 MiB"
time=2026-02-26T14:26:06.675Z level=DEBUG source=device.go:251 msg="kv cache" device=CUDA0 size="3.9 GiB"
time=2026-02-26T14:26:06.675Z level=DEBUG source=device.go:262 msg="compute graph" device=CUDA0 size="1.1 GiB"
time=2026-02-26T14:26:06.675Z level=DEBUG source=device.go:267 msg="compute graph" device=CPU size="216.0 MiB"
time=2026-02-26T14:26:06.675Z level=DEBUG source=device.go:272 msg="total memory" size="21.4 GiB"
time=2026-02-26T14:26:06.675Z level=DEBUG source=server.go:782 msg=memory success=true required.InputWeights=744714240 required.CPU.Graph=226492416 required.CUDA0.ID=GPU-8416b7d6-2f5f-d827-5992-4affda11c96a required.CUDA0.Weights="[472507392 238746624 238746624 233760768 238746624 238746624 238746624 233760768 215768064 215768064 238746624 210782208 215768064 238746624 215768064 210782208 238746624 215768064 215768064 233760768 215768064 215768064 238746624 210782208 215768064 238746624 215768064 210782208 238746624 215768064 215768064 233760768 215768064 215768064 238746624 210782208 215768064 238746624 215768064 210782208 238746624 215768064 215768064 233760768 215768064 215768064 238746624 210782208 215768064 238746624 215768064 210782208 238746624 215768064 215768064 233760768 238746624 238746624 238746624 233760768 238746624 238746624 238746624 233760768 1965245056]" required.CUDA0.Cache="[81715200 81715200 81715200 16777216 81715200 81715200 81715200 16777216 81715200 81715200 81715200 16777216 81715200 81715200 81715200 16777216 81715200 81715200 81715200 16777216 81715200 81715200 81715200 16777216 81715200 81715200 81715200 16777216 81715200 81715200 81715200 16777216 81715200 81715200 81715200 16777216 81715200 81715200 81715200 16777216 81715200 81715200 81715200 16777216 81715200 81715200 81715200 16777216 81715200 81715200 81715200 16777216 81715200 81715200 81715200 16777216 81715200 81715200 81715200 16777216 81715200 81715200 81715200 16777216 0]" required.CUDA0.Graph=1138774016
time=2026-02-26T14:26:06.676Z level=DEBUG source=server.go:976 msg="available gpu" id=GPU-8416b7d6-2f5f-d827-5992-4affda11c96a library=CUDA "available layer vram"="10.0 GiB" backoff=0.00 minimum="457.0 MiB" overhead="0 B" graph="1.1 GiB"
time=2026-02-26T14:26:06.676Z level=DEBUG source=server.go:976 msg="available gpu" id=GPU-a4c65198-ad5c-f3c2-d8a3-2829beb71c35 library=CUDA "available layer vram"="5.4 GiB" backoff=0.00 minimum="457.0 MiB" overhead="0 B" graph="0 B"
time=2026-02-26T14:26:06.676Z level=DEBUG source=server.go:793 msg="new layout created" layers="56[ID:GPU-8416b7d6-2f5f-d827-5992-4affda11c96a Layers:37(8..44) ID:GPU-a4c65198-ad5c-f3c2-d8a3-2829beb71c35 Layers:19(45..63)]"
time=2026-02-26T14:26:06.676Z level=INFO source=runner.go:1284 msg=load request="{Operation:fit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:Enabled KvSize:4096 KvCacheType: NumThreads:4 GPULayers:56[ID:GPU-8416b7d6-2f5f-d827-5992-4affda11c96a Layers:37(8..44) ID:GPU-a4c65198-ad5c-f3c2-d8a3-2829beb71c35 Layers:19(45..63)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2026-02-26T14:26:06.730Z level=DEBUG source=ggml.go:324 msg="key with type not found" key=general.alignment default=32
time=2026-02-26T14:26:06.737Z level=DEBUG source=ggml.go:324 msg="key with type not found" key=qwen35.pooling_type default=0
time=2026-02-26T14:26:06.737Z level=DEBUG source=ggml.go:324 msg="key with type not found" key=qwen35.attention.head_count_kv default=0
time=2026-02-26T14:26:06.737Z level=DEBUG source=ggml.go:324 msg="key with type not found" key=qwen35.expert_count default=0
time=2026-02-26T14:26:06.737Z level=DEBUG source=ggml.go:324 msg="key with type not found" key=qwen35.rope.scaling.type default=""
time=2026-02-26T14:26:06.737Z level=DEBUG source=ggml.go:324 msg="key with type not found" key=qwen35.rope.type default=""
time=2026-02-26T14:26:06.737Z level=DEBUG source=ggml.go:324 msg="key with type not found" key=qwen35.rope.scaling.factor default=1
time=2026-02-26T14:26:06.737Z level=DEBUG source=ggml.go:324 msg="key with type not found" key=qwen35.rope.scaling.original_context_length default=0
time=2026-02-26T14:26:06.738Z level=DEBUG source=ggml.go:324 msg="key with type not found" key=qwen35.attention.scale default=0
time=2026-02-26T14:26:06.738Z level=DEBUG source=ggml.go:324 msg="key with type not found" key=qwen35.expert_count default=0
time=2026-02-26T14:26:06.738Z level=DEBUG source=ggml.go:324 msg="key with type not found" key=qwen35.expert_used_count default=0
time=2026-02-26T14:26:06.738Z level=DEBUG source=ggml.go:324 msg="key with type not found" key=qwen35.norm_top_k_prob default=true
time=2026-02-26T14:26:06.738Z level=DEBUG source=ggml.go:324 msg="key with type not found" key=qwen35.mrope_interleaved default=false
time=2026-02-26T14:26:06.738Z level=DEBUG source=ggml.go:324 msg="key with type not found" key=qwen35.vision.attention.layer_norm_epsilon default=9.999999974752427e-07
time=2026-02-26T14:26:06.738Z level=DEBUG source=ggml.go:324 msg="key with type not found" key=qwen35.vision.rope.freq_base default=10000
time=2026-02-26T14:26:06.738Z level=DEBUG source=ggml.go:324 msg="key with type not found" key=qwen35.vision.num_positional_embeddings default=2304
time=2026-02-26T14:26:06.738Z level=DEBUG source=ggml.go:324 msg="key with type not found" key=tokenizer.ggml.add_bos_token default=false
time=2026-02-26T14:26:06.738Z level=DEBUG source=ggml.go:324 msg="key with type not found" key=tokenizer.ggml.bos_token_id default=0
time=2026-02-26T14:26:07.246Z level=DEBUG source=ggml.go:852 msg="compute graph" nodes=1258 splits=442
time=2026-02-26T14:26:07.439Z level=DEBUG source=ggml.go:852 msg="compute graph" nodes=33583 splits=523
time=2026-02-26T14:26:07.451Z level=DEBUG source=ggml.go:852 msg="compute graph" nodes=4919 splits=4
time=2026-02-26T14:26:07.452Z level=DEBUG source=device.go:240 msg="model weights" device=CUDA0 size="7.7 GiB"
time=2026-02-26T14:26:07.452Z level=DEBUG source=device.go:240 msg="model weights" device=CUDA1 size="4.1 GiB"
time=2026-02-26T14:26:07.453Z level=DEBUG source=device.go:245 msg="model weights" device=CPU size="4.5 GiB"
time=2026-02-26T14:26:07.453Z level=DEBUG source=device.go:251 msg="kv cache" device=CUDA0 size="2.3 GiB"
time=2026-02-26T14:26:07.453Z level=DEBUG source=device.go:251 msg="kv cache" device=CUDA1 size="1.1 GiB"
time=2026-02-26T14:26:07.453Z level=DEBUG source=device.go:256 msg="kv cache" device=CPU size="499.6 MiB"
time=2026-02-26T14:26:07.453Z level=DEBUG source=device.go:262 msg="compute graph" device=CUDA0 size="1.1 GiB"
time=2026-02-26T14:26:07.453Z level=DEBUG source=device.go:262 msg="compute graph" device=CUDA1 size="736.6 MiB"
time=2026-02-26T14:26:07.453Z level=DEBUG source=device.go:267 msg="compute graph" device=CPU size="1.0 GiB"
time=2026-02-26T14:26:07.453Z level=DEBUG source=device.go:272 msg="total memory" size="22.9 GiB"
time=2026-02-26T14:26:07.453Z level=DEBUG source=server.go:782 msg=memory success=true required.InputWeights=744714240 required.CPU.Weights="[472507264 238746496 238746496 233760768 238746496 238746496 238746496 233760768 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1965243328]" required.CPU.Cache="[81715200 81715200 81715200 16777216 81715200 81715200 81715200 16777216 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]" required.CPU.Graph=1098865536 required.CUDA0.ID=GPU-8416b7d6-2f5f-d827-5992-4affda11c96a required.CUDA0.Weights="[0 0 0 0 0 0 0 0 215768064 215768064 238746624 210782208 215768064 238746624 215768064 210782208 238746624 215768064 215768064 233760768 215768064 215768064 238746624 210782208 215768064 238746624 215768064 210782208 238746624 215768064 215768064 233760768 215768064 215768064 238746624 210782208 215768064 238746624 215768064 210782208 238746624 215768064 215768064 233760768 215768064 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]" required.CUDA0.Cache="[0 0 0 0 0 0 0 0 81715200 81715200 81715200 16777216 81715200 81715200 81715200 16777216 81715200 81715200 81715200 16777216 81715200 81715200 81715200 16777216 81715200 81715200 81715200 16777216 81715200 81715200 81715200 16777216 81715200 81715200 81715200 16777216 81715200 81715200 81715200 16777216 81715200 81715200 81715200 16777216 81715200 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]" required.CUDA0.Graph=1129623552 required.CUDA1.ID=GPU-a4c65198-ad5c-f3c2-d8a3-2829beb71c35 required.CUDA1.Weights="[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 215768064 238746624 210782208 215768064 238746624 215768064 210782208 238746624 215768064 215768064 233760768 238746624 238746624 238746624 233760768 238746624 238746624 238746624 233760768 0]" required.CUDA1.Cache="[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 81715200 81715200 16777216 81715200 81715200 81715200 16777216 81715200 81715200 81715200 16777216 81715200 81715200 81715200 16777216 81715200 81715200 81715200 16777216 0]" required.CUDA1.Graph=772374528
time=2026-02-26T14:26:07.453Z level=DEBUG source=server.go:976 msg="available gpu" id=GPU-8416b7d6-2f5f-d827-5992-4affda11c96a library=CUDA "available layer vram"="10.1 GiB" backoff=0.00 minimum="457.0 MiB" overhead="0 B" graph="1.1 GiB"
time=2026-02-26T14:26:07.453Z level=DEBUG source=server.go:976 msg="available gpu" id=GPU-a4c65198-ad5c-f3c2-d8a3-2829beb71c35 library=CUDA "available layer vram"="4.7 GiB" backoff=0.00 minimum="457.0 MiB" overhead="0 B" graph="736.6 MiB"
time=2026-02-26T14:26:07.454Z level=DEBUG source=server.go:793 msg="new layout created" layers="54[ID:GPU-8416b7d6-2f5f-d827-5992-4affda11c96a Layers:37(10..46) ID:GPU-a4c65198-ad5c-f3c2-d8a3-2829beb71c35 Layers:17(47..63)]"
time=2026-02-26T14:26:07.454Z level=INFO source=runner.go:1284 msg=load request="{Operation:fit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:Enabled KvSize:4096 KvCacheType: NumThreads:4 GPULayers:54[ID:GPU-8416b7d6-2f5f-d827-5992-4affda11c96a Layers:37(10..46) ID:GPU-a4c65198-ad5c-f3c2-d8a3-2829beb71c35 Layers:17(47..63)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2026-02-26T14:26:07.555Z level=DEBUG source=ggml.go:324 msg="key with type not found" key=general.alignment default=32
time=2026-02-26T14:26:07.563Z level=DEBUG source=ggml.go:324 msg="key with type not found" key=qwen35.pooling_type default=0
time=2026-02-26T14:26:07.563Z level=DEBUG source=ggml.go:324 msg="key with type not found" key=qwen35.attention.head_count_kv default=0
time=2026-02-26T14:26:07.563Z level=DEBUG source=ggml.go:324 msg="key with type not found" key=qwen35.expert_count default=0
time=2026-02-26T14:26:07.563Z level=DEBUG source=ggml.go:324 msg="key with type not found" key=qwen35.rope.scaling.type default=""
time=2026-02-26T14:26:07.563Z level=DEBUG source=ggml.go:324 msg="key with type not found" key=qwen35.rope.type default=""
time=2026-02-26T14:26:07.564Z level=DEBUG source=ggml.go:324 msg="key with type not found" key=qwen35.rope.scaling.factor default=1
time=2026-02-26T14:26:07.564Z level=DEBUG source=ggml.go:324 msg="key with type not found" key=qwen35.rope.scaling.original_context_length default=0
time=2026-02-26T14:26:07.564Z level=DEBUG source=ggml.go:324 msg="key with type not found" key=qwen35.attention.scale default=0
time=2026-02-26T14:26:07.564Z level=DEBUG source=ggml.go:324 msg="key with type not found" key=qwen35.expert_count default=0
time=2026-02-26T14:26:07.564Z level=DEBUG source=ggml.go:324 msg="key with type not found" key=qwen35.expert_used_count default=0
time=2026-02-26T14:26:07.564Z level=DEBUG source=ggml.go:324 msg="key with type not found" key=qwen35.norm_top_k_prob default=true
time=2026-02-26T14:26:07.564Z level=DEBUG source=ggml.go:324 msg="key with type not found" key=qwen35.mrope_interleaved default=false
time=2026-02-26T14:26:07.564Z level=DEBUG source=ggml.go:324 msg="key with type not found" key=qwen35.vision.attention.layer_norm_epsilon default=9.999999974752427e-07
time=2026-02-26T14:26:07.564Z level=DEBUG source=ggml.go:324 msg="key with type not found" key=qwen35.vision.rope.freq_base default=10000
time=2026-02-26T14:26:07.564Z level=DEBUG source=ggml.go:324 msg="key with type not found" key=qwen35.vision.num_positional_embeddings default=2304
time=2026-02-26T14:26:07.564Z level=DEBUG source=ggml.go:324 msg="key with type not found" key=tokenizer.ggml.add_bos_token default=false
time=2026-02-26T14:26:07.564Z level=DEBUG source=ggml.go:324 msg="key with type not found" key=tokenizer.ggml.bos_token_id default=0
time=2026-02-26T14:26:08.013Z level=DEBUG source=ggml.go:852 msg="compute graph" nodes=1258 splits=442
time=2026-02-26T14:26:08.186Z level=DEBUG source=ggml.go:852 msg="compute graph" nodes=33583 splits=559
time=2026-02-26T14:26:08.198Z level=DEBUG source=ggml.go:852 msg="compute graph" nodes=4919 splits=4
time=2026-02-26T14:26:08.199Z level=DEBUG source=device.go:240 msg="model weights" device=CUDA0 size="7.7 GiB"
time=2026-02-26T14:26:08.199Z level=DEBUG source=device.go:240 msg="model weights" device=CUDA1 size="3.6 GiB"
time=2026-02-26T14:26:08.199Z level=DEBUG source=device.go:245 msg="model weights" device=CPU size="4.9 GiB"
time=2026-02-26T14:26:08.199Z level=DEBUG source=device.go:251 msg="kv cache" device=CUDA0 size="2.3 GiB"
time=2026-02-26T14:26:08.199Z level=DEBUG source=device.go:251 msg="kv cache" device=CUDA1 size="1015.2 MiB"
time=2026-02-26T14:26:08.199Z level=DEBUG source=device.go:256 msg="kv cache" device=CPU size="655.4 MiB"
time=2026-02-26T14:26:08.199Z level=DEBUG source=device.go:262 msg="compute graph" device=CUDA0 size="1.1 GiB"
time=2026-02-26T14:26:08.199Z level=DEBUG source=device.go:262 msg="compute graph" device=CUDA1 size="704.5 MiB"
time=2026-02-26T14:26:08.199Z level=DEBUG source=device.go:267 msg="compute graph" device=CPU size="1.0 GiB"
time=2026-02-26T14:26:08.199Z level=DEBUG source=device.go:272 msg="total memory" size="22.9 GiB"
time=2026-02-26T14:26:08.199Z level=DEBUG source=server.go:782 msg=memory success=true required.InputWeights=744714240 required.CPU.Weights="[472507264 238746496 238746496 233760768 238746496 238746496 238746496 233760768 215767936 215767936 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1965243328]" required.CPU.Cache="[81715200 81715200 81715200 16777216 81715200 81715200 81715200 16777216 81715200 81715200 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]" required.CPU.Graph=1098865536 required.CUDA0.ID=GPU-8416b7d6-2f5f-d827-5992-4affda11c96a required.CUDA0.Weights="[0 0 0 0 0 0 0 0 0 0 238746624 210782208 215768064 238746624 215768064 210782208 238746624 215768064 215768064 233760768 215768064 215768064 238746624 210782208 215768064 238746624 215768064 210782208 238746624 215768064 215768064 233760768 215768064 215768064 238746624 210782208 215768064 238746624 215768064 210782208 238746624 215768064 215768064 233760768 215768064 215768064 238746624 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]" required.CUDA0.Cache="[0 0 0 0 0 0 0 0 0 0 81715200 16777216 81715200 81715200 81715200 16777216 81715200 81715200 81715200 16777216 81715200 81715200 81715200 16777216 81715200 81715200 81715200 16777216 81715200 81715200 81715200 16777216 81715200 81715200 81715200 16777216 81715200 81715200 81715200 16777216 81715200 81715200 81715200 16777216 81715200 81715200 81715200 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]" required.CUDA0.Graph=1129623552 required.CUDA1.ID=GPU-a4c65198-ad5c-f3c2-d8a3-2829beb71c35 required.CUDA1.Weights="[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 210782208 215768064 238746624 215768064 210782208 238746624 215768064 215768064 233760768 238746624 238746624 238746624 233760768 238746624 238746624 238746624 233760768 0]" required.CUDA1.Cache="[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 16777216 81715200 81715200 81715200 16777216 81715200 81715200 81715200 16777216 81715200 81715200 81715200 16777216 81715200 81715200 81715200 16777216 0]" required.CUDA1.Graph=738688768
time=2026-02-26T14:26:08.199Z level=DEBUG source=server.go:976 msg="available gpu" id=GPU-8416b7d6-2f5f-d827-5992-4affda11c96a library=CUDA "available layer vram"="10.1 GiB" backoff=0.00 minimum="457.0 MiB" overhead="0 B" graph="1.1 GiB"
time=2026-02-26T14:26:08.199Z level=DEBUG source=server.go:976 msg="available gpu" id=GPU-a4c65198-ad5c-f3c2-d8a3-2829beb71c35 library=CUDA "available layer vram"="4.7 GiB" backoff=0.00 minimum="457.0 MiB" overhead="0 B" graph="704.5 MiB"
time=2026-02-26T14:26:08.200Z level=DEBUG source=server.go:793 msg="new layout created" layers="54[ID:GPU-8416b7d6-2f5f-d827-5992-4affda11c96a Layers:37(10..46) ID:GPU-a4c65198-ad5c-f3c2-d8a3-2829beb71c35 Layers:17(47..63)]"
time=2026-02-26T14:26:08.200Z level=INFO source=runner.go:1284 msg=load request="{Operation:alloc LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:Enabled KvSize:4096 KvCacheType: NumThreads:4 GPULayers:54[ID:GPU-8416b7d6-2f5f-d827-5992-4affda11c96a Layers:37(10..46) ID:GPU-a4c65198-ad5c-f3c2-d8a3-2829beb71c35 Layers:17(47..63)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2026-02-26T14:26:08.300Z level=DEBUG source=ggml.go:324 msg="key with type not found" key=general.alignment default=32
time=2026-02-26T14:26:08.310Z level=DEBUG source=ggml.go:324 msg="key with type not found" key=qwen35.pooling_type default=0
time=2026-02-26T14:26:08.310Z level=DEBUG source=ggml.go:324 msg="key with type not found" key=qwen35.attention.head_count_kv default=0
time=2026-02-26T14:26:08.310Z level=DEBUG source=ggml.go:324 msg="key with type not found" key=qwen35.expert_count default=0
time=2026-02-26T14:26:08.310Z level=DEBUG source=ggml.go:324 msg="key with type not found" key=qwen35.rope.scaling.type default=""
time=2026-02-26T14:26:08.310Z level=DEBUG source=ggml.go:324 msg="key with type not found" key=qwen35.rope.type default=""
time=2026-02-26T14:26:08.310Z level=DEBUG source=ggml.go:324 msg="key with type not found" key=qwen35.rope.scaling.factor default=1
time=2026-02-26T14:26:08.310Z level=DEBUG source=ggml.go:324 msg="key with type not found" key=qwen35.rope.scaling.original_context_length default=0
time=2026-02-26T14:26:08.310Z level=DEBUG source=ggml.go:324 msg="key with type not found" key=qwen35.attention.scale default=0
time=2026-02-26T14:26:08.310Z level=DEBUG source=ggml.go:324 msg="key with type not found" key=qwen35.expert_count default=0
time=2026-02-26T14:26:08.310Z level=DEBUG source=ggml.go:324 msg="key with type not found" key=qwen35.expert_used_count default=0
time=2026-02-26T14:26:08.310Z level=DEBUG source=ggml.go:324 msg="key with type not found" key=qwen35.norm_top_k_prob default=true
time=2026-02-26T14:26:08.310Z level=DEBUG source=ggml.go:324 msg="key with type not found" key=qwen35.mrope_interleaved default=false
time=2026-02-26T14:26:08.310Z level=DEBUG source=ggml.go:324 msg="key with type not found" key=qwen35.vision.attention.layer_norm_epsilon default=9.999999974752427e-07
time=2026-02-26T14:26:08.310Z level=DEBUG source=ggml.go:324 msg="key with type not found" key=qwen35.vision.rope.freq_base default=10000
time=2026-02-26T14:26:08.310Z level=DEBUG source=ggml.go:324 msg="key with type not found" key=qwen35.vision.num_positional_embeddings default=2304
time=2026-02-26T14:26:08.310Z level=DEBUG source=ggml.go:324 msg="key with type not found" key=tokenizer.ggml.add_bos_token default=false
time=2026-02-26T14:26:08.310Z level=DEBUG source=ggml.go:324 msg="key with type not found" key=tokenizer.ggml.bos_token_id default=0
time=2026-02-26T14:26:08.792Z level=DEBUG source=ggml.go:852 msg="compute graph" nodes=1258 splits=442
time=2026-02-26T14:26:09.292Z level=DEBUG source=ggml.go:852 msg="compute graph" nodes=33583 splits=559
time=2026-02-26T14:26:09.413Z level=DEBUG source=device.go:240 msg="model weights" device=CUDA0 size="7.7 GiB"
time=2026-02-26T14:26:09.413Z level=DEBUG source=device.go:240 msg="model weights" device=CUDA1 size="3.6 GiB"
time=2026-02-26T14:26:09.413Z level=DEBUG source=device.go:245 msg="model weights" device=CPU size="4.9 GiB"
time=2026-02-26T14:26:09.413Z level=DEBUG source=device.go:251 msg="kv cache" device=CUDA0 size="2.3 GiB"
time=2026-02-26T14:26:09.413Z level=DEBUG source=device.go:251 msg="kv cache" device=CUDA1 size="1015.2 MiB"
time=2026-02-26T14:26:09.413Z level=DEBUG source=device.go:256 msg="kv cache" device=CPU size="655.4 MiB"
time=2026-02-26T14:26:09.413Z level=DEBUG source=device.go:262 msg="compute graph" device=CUDA0 size="1.1 GiB"
time=2026-02-26T14:26:09.413Z level=DEBUG source=device.go:262 msg="compute graph" device=CUDA1 size="608.5 MiB"
time=2026-02-26T14:26:09.413Z level=DEBUG source=device.go:267 msg="compute graph" device=CPU size="1.0 GiB"
time=2026-02-26T14:26:09.413Z level=DEBUG source=device.go:272 msg="total memory" size="22.8 GiB"
time=2026-02-26T14:26:09.413Z level=DEBUG source=server.go:782 msg=memory success=false required.InputWeights=744714240 required.CPU.Weights="[472507264 238746496 238746496 233760768 238746496 238746496 238746496 233760768 215767936 215767936 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1965243328]" required.CPU.Cache="[81715200 81715200 81715200 16777216 81715200 81715200 81715200 16777216 81715200 81715200 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]" required.CPU.Graph=1098865536 required.CUDA0.ID=GPU-8416b7d6-2f5f-d827-5992-4affda11c96a required.CUDA0.Weights="[0 0 0 0 0 0 0 0 0 0 238746624 210782208 215768064 238746624 215768064 210782208 238746624 215768064 215768064 233760768 215768064 215768064 238746624 210782208 215768064 238746624 215768064 210782208 238746624 215768064 215768064 233760768 215768064 215768064 238746624 210782208 215768064 238746624 215768064 210782208 238746624 215768064 215768064 233760768 215768064 215768064 238746624 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]" required.CUDA0.Cache="[0 0 0 0 0 0 0 0 0 0 81715200 16777216 81715200 81715200 81715200 16777216 81715200 81715200 81715200 16777216 81715200 81715200 81715200 16777216 81715200 81715200 81715200 16777216 81715200 81715200 81715200 16777216 81715200 81715200 81715200 16777216 81715200 81715200 81715200 16777216 81715200 81715200 81715200 16777216 81715200 81715200 81715200 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]" required.CUDA0.Graph=1129623552 required.CUDA1.ID=GPU-a4c65198-ad5c-f3c2-d8a3-2829beb71c35 required.CUDA1.Weights="[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 210782208 215768064 238746624 215768064 210782208 238746624 215768064 215768064 233760768 238746624 238746624 238746624 233760768 238746624 238746624 238746624 233760768 0]" required.CUDA1.Cache="[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 16777216 81715200 81715200 81715200 16777216 81715200 81715200 81715200 16777216 81715200 81715200 81715200 16777216 81715200 81715200 81715200 16777216 0]" required.CUDA1.Graph=638025472
time=2026-02-26T14:26:09.414Z level=DEBUG source=server.go:976 msg="available gpu" id=GPU-8416b7d6-2f5f-d827-5992-4affda11c96a library=CUDA "available layer vram"="10.1 GiB" backoff=0.00 minimum="457.0 MiB" overhead="0 B" graph="1.1 GiB"
time=2026-02-26T14:26:09.414Z level=DEBUG source=server.go:976 msg="available gpu" id=GPU-a4c65198-ad5c-f3c2-d8a3-2829beb71c35 library=CUDA "available layer vram"="4.8 GiB" backoff=0.00 minimum="457.0 MiB" overhead="0 B" graph="608.5 MiB"
time=2026-02-26T14:26:09.414Z level=DEBUG source=server.go:793 msg="new layout created" layers="54[ID:GPU-8416b7d6-2f5f-d827-5992-4affda11c96a Layers:37(10..46) ID:GPU-a4c65198-ad5c-f3c2-d8a3-2829beb71c35 Layers:17(47..63)]"
time=2026-02-26T14:26:09.414Z level=INFO source=server.go:879 msg="model layout did not fit, applying backoff" backoff=0.10
time=2026-02-26T14:26:09.414Z level=DEBUG source=server.go:976 msg="available gpu" id=GPU-8416b7d6-2f5f-d827-5992-4affda11c96a library=CUDA "available layer vram"="8.9 GiB" backoff=0.10 minimum="457.0 MiB" overhead="0 B" graph="1.1 GiB"
time=2026-02-26T14:26:09.414Z level=DEBUG source=server.go:976 msg="available gpu" id=GPU-a4c65198-ad5c-f3c2-d8a3-2829beb71c35 library=CUDA "available layer vram"="4.2 GiB" backoff=0.10 minimum="457.0 MiB" overhead="0 B" graph="608.5 MiB"
time=2026-02-26T14:26:09.414Z level=DEBUG source=server.go:793 msg="new layout created" layers="48[ID:GPU-8416b7d6-2f5f-d827-5992-4affda11c96a Layers:33(16..48) ID:GPU-a4c65198-ad5c-f3c2-d8a3-2829beb71c35 Layers:15(49..63)]"
time=2026-02-26T14:26:09.415Z level=INFO source=runner.go:1284 msg=load request="{Operation:alloc LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:Enabled KvSize:4096 KvCacheType: NumThreads:4 GPULayers:48[ID:GPU-8416b7d6-2f5f-d827-5992-4affda11c96a Layers:33(16..48) ID:GPU-a4c65198-ad5c-f3c2-d8a3-2829beb71c35 Layers:15(49..63)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2026-02-26T14:26:09.516Z level=DEBUG source=ggml.go:324 msg="key with type not found" key=general.alignment default=32
time=2026-02-26T14:26:09.588Z level=DEBUG source=ggml.go:324 msg="key with type not found" key=qwen35.pooling_type default=0
time=2026-02-26T14:26:09.588Z level=DEBUG source=ggml.go:324 msg="key with type not found" key=qwen35.attention.head_count_kv default=0
time=2026-02-26T14:26:09.589Z level=DEBUG source=ggml.go:324 msg="key with type not found" key=qwen35.expert_count default=0
time=2026-02-26T14:26:09.589Z level=DEBUG source=ggml.go:324 msg="key with type not found" key=qwen35.rope.scaling.type default=""
time=2026-02-26T14:26:09.589Z level=DEBUG source=ggml.go:324 msg="key with type not found" key=qwen35.rope.type default=""
time=2026-02-26T14:26:09.589Z level=DEBUG source=ggml.go:324 msg="key with type not found" key=qwen35.rope.scaling.factor default=1
time=2026-02-26T14:26:09.589Z level=DEBUG source=ggml.go:324 msg="key with type not found" key=qwen35.rope.scaling.original_context_length default=0
time=2026-02-26T14:26:09.589Z level=DEBUG source=ggml.go:324 msg="key with type not found" key=qwen35.attention.scale default=0
time=2026-02-26T14:26:09.589Z level=DEBUG source=ggml.go:324 msg="key with type not found" key=qwen35.expert_count default=0
time=2026-02-26T14:26:09.589Z level=DEBUG source=ggml.go:324 msg="key with type not found" key=qwen35.expert_used_count default=0
time=2026-02-26T14:26:09.589Z level=DEBUG source=ggml.go:324 msg="key with type not found" key=qwen35.norm_top_k_prob default=true
time=2026-02-26T14:26:09.589Z level=DEBUG source=ggml.go:324 msg="key with type not found" key=qwen35.mrope_interleaved default=false
time=2026-02-26T14:26:09.589Z level=DEBUG source=ggml.go:324 msg="key with type not found" key=qwen35.vision.attention.layer_norm_epsilon default=9.999999974752427e-07
time=2026-02-26T14:26:09.589Z level=DEBUG source=ggml.go:324 msg="key with type not found" key=qwen35.vision.rope.freq_base default=10000
time=2026-02-26T14:26:09.589Z level=DEBUG source=ggml.go:324 msg="key with type not found" key=qwen35.vision.num_positional_embeddings default=2304
time=2026-02-26T14:26:09.589Z level=DEBUG source=ggml.go:324 msg="key with type not found" key=tokenizer.ggml.add_bos_token default=false
time=2026-02-26T14:26:09.589Z level=DEBUG source=ggml.go:324 msg="key with type not found" key=tokenizer.ggml.bos_token_id default=0
time=2026-02-26T14:26:10.111Z level=DEBUG source=ggml.go:852 msg="compute graph" nodes=1258 splits=442
time=2026-02-26T14:26:10.786Z level=DEBUG source=ggml.go:852 msg="compute graph" nodes=33583 splits=657
time=2026-02-26T14:26:10.800Z level=DEBUG source=ggml.go:852 msg="compute graph" nodes=4919 splits=4
time=2026-02-26T14:26:10.800Z level=DEBUG source=device.go:240 msg="model weights" device=CUDA0 size="6.8 GiB"
time=2026-02-26T14:26:10.800Z level=DEBUG source=device.go:240 msg="model weights" device=CUDA1 size="3.2 GiB"
time=2026-02-26T14:26:10.800Z level=DEBUG source=device.go:245 msg="model weights" device=CPU size="6.2 GiB"
time=2026-02-26T14:26:10.801Z level=DEBUG source=device.go:251 msg="kv cache" device=CUDA0 size="2.0 GiB"
time=2026-02-26T14:26:10.801Z level=DEBUG source=device.go:251 msg="kv cache" device=CUDA1 size="921.2 MiB"
time=2026-02-26T14:26:10.801Z level=DEBUG source=device.go:256 msg="kv cache" device=CPU size="999.2 MiB"
time=2026-02-26T14:26:10.801Z level=DEBUG source=device.go:262 msg="compute graph" device=CUDA0 size="1.1 GiB"
time=2026-02-26T14:26:10.801Z level=DEBUG source=device.go:262 msg="compute graph" device=CUDA1 size="736.6 MiB"
time=2026-02-26T14:26:10.801Z level=DEBUG source=device.go:267 msg="compute graph" device=CPU size="1.0 GiB"
time=2026-02-26T14:26:10.801Z level=DEBUG source=device.go:272 msg="total memory" size="22.9 GiB"
time=2026-02-26T14:26:10.801Z level=DEBUG source=server.go:782 msg=memory success=true required.InputWeights=744714240 required.CPU.Weights="[472507264 238746496 238746496 233760768 238746496 238746496 238746496 233760768 215767936 215767936 238746496 210782208 215767936 238746496 215767936 210782208 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1965243328]" required.CPU.Cache="[81715200 81715200 81715200 16777216 81715200 81715200 81715200 16777216 81715200 81715200 81715200 16777216 81715200 81715200 81715200 16777216 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]" required.CPU.Graph=1098865536 required.CUDA0.ID=GPU-8416b7d6-2f5f-d827-5992-4affda11c96a required.CUDA0.Weights="[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 238746624 215768064 215768064 233760768 215768064 215768064 238746624 210782208 215768064 238746624 215768064 210782208 238746624 215768064 215768064 233760768 215768064 215768064 238746624 210782208 215768064 238746624 215768064 210782208 238746624 215768064 215768064 233760768 215768064 215768064 238746624 210782208 215768064 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]" required.CUDA0.Cache="[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 81715200 81715200 81715200 16777216 81715200 81715200 81715200 16777216 81715200 81715200 81715200 16777216 81715200 81715200 81715200 16777216 81715200 81715200 81715200 16777216 81715200 81715200 81715200 16777216 81715200 81715200 81715200 16777216 81715200 81715200 81715200 16777216 81715200 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]" required.CUDA0.Graph=1129623552 required.CUDA1.ID=GPU-a4c65198-ad5c-f3c2-d8a3-2829beb71c35 required.CUDA1.Weights="[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 238746624 215768064 210782208 238746624 215768064 215768064 233760768 238746624 238746624 238746624 233760768 238746624 238746624 238746624 233760768 0]" required.CUDA1.Cache="[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 81715200 81715200 16777216 81715200 81715200 81715200 16777216 81715200 81715200 81715200 16777216 81715200 81715200 81715200 16777216 0]" required.CUDA1.Graph=772374528
time=2026-02-26T14:26:10.801Z level=DEBUG source=server.go:976 msg="available gpu" id=GPU-8416b7d6-2f5f-d827-5992-4affda11c96a library=CUDA "available layer vram"="8.9 GiB" backoff=0.10 minimum="457.0 MiB" overhead="0 B" graph="1.1 GiB"
time=2026-02-26T14:26:10.801Z level=DEBUG source=server.go:976 msg="available gpu" id=GPU-a4c65198-ad5c-f3c2-d8a3-2829beb71c35 library=CUDA "available layer vram"="4.1 GiB" backoff=0.10 minimum="457.0 MiB" overhead="0 B" graph="736.6 MiB"
time=2026-02-26T14:26:10.801Z level=DEBUG source=server.go:793 msg="new layout created" layers="47[ID:GPU-8416b7d6-2f5f-d827-5992-4affda11c96a Layers:33(17..49) ID:GPU-a4c65198-ad5c-f3c2-d8a3-2829beb71c35 Layers:14(50..63)]"
time=2026-02-26T14:26:10.802Z level=INFO source=runner.go:1284 msg=load request="{Operation:alloc LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:Enabled KvSize:4096 KvCacheType: NumThreads:4 GPULayers:47[ID:GPU-8416b7d6-2f5f-d827-5992-4affda11c96a Layers:33(17..49) ID:GPU-a4c65198-ad5c-f3c2-d8a3-2829beb71c35 Layers:14(50..63)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2026-02-26T14:26:11.011Z level=DEBUG source=ggml.go:324 msg="key with type not found" key=general.alignment default=32
time=2026-02-26T14:26:11.078Z level=DEBUG source=ggml.go:324 msg="key with type not found" key=qwen35.pooling_type default=0
time=2026-02-26T14:26:11.078Z level=DEBUG source=ggml.go:324 msg="key with type not found" key=qwen35.attention.head_count_kv default=0
time=2026-02-26T14:26:11.078Z level=DEBUG source=ggml.go:324 msg="key with type not found" key=qwen35.expert_count default=0
time=2026-02-26T14:26:11.078Z level=DEBUG source=ggml.go:324 msg="key with type not found" key=qwen35.rope.scaling.type default=""
time=2026-02-26T14:26:11.078Z level=DEBUG source=ggml.go:324 msg="key with type not found" key=qwen35.rope.type default=""
time=2026-02-26T14:26:11.078Z level=DEBUG source=ggml.go:324 msg="key with type not found" key=qwen35.rope.scaling.factor default=1
time=2026-02-26T14:26:11.078Z level=DEBUG source=ggml.go:324 msg="key with type not found" key=qwen35.rope.scaling.original_context_length default=0
time=2026-02-26T14:26:11.078Z level=DEBUG source=ggml.go:324 msg="key with type not found" key=qwen35.attention.scale default=0
time=2026-02-26T14:26:11.078Z level=DEBUG source=ggml.go:324 msg="key with type not found" key=qwen35.expert_count default=0
time=2026-02-26T14:26:11.078Z level=DEBUG source=ggml.go:324 msg="key with type not found" key=qwen35.expert_used_count default=0
time=2026-02-26T14:26:11.078Z level=DEBUG source=ggml.go:324 msg="key with type not found" key=qwen35.norm_top_k_prob default=true
time=2026-02-26T14:26:11.078Z level=DEBUG source=ggml.go:324 msg="key with type not found" key=qwen35.mrope_interleaved default=false
time=2026-02-26T14:26:11.078Z level=DEBUG source=ggml.go:324 msg="key with type not found" key=qwen35.vision.attention.layer_norm_epsilon default=9.999999974752427e-07
time=2026-02-26T14:26:11.078Z level=DEBUG source=ggml.go:324 msg="key with type not found" key=qwen35.vision.rope.freq_base default=10000
time=2026-02-26T14:26:11.078Z level=DEBUG source=ggml.go:324 msg="key with type not found" key=qwen35.vision.num_positional_embeddings default=2304
time=2026-02-26T14:26:11.078Z level=DEBUG source=ggml.go:324 msg="key with type not found" key=tokenizer.ggml.add_bos_token default=false
time=2026-02-26T14:26:11.078Z level=DEBUG source=ggml.go:324 msg="key with type not found" key=tokenizer.ggml.bos_token_id default=0
time=2026-02-26T14:26:11.629Z level=DEBUG source=ggml.go:852 msg="compute graph" nodes=1258 splits=442
time=2026-02-26T14:26:12.320Z level=DEBUG source=ggml.go:852 msg="compute graph" nodes=33583 splits=675
time=2026-02-26T14:26:12.335Z level=DEBUG source=ggml.go:852 msg="compute graph" nodes=4919 splits=4
time=2026-02-26T14:26:12.335Z level=DEBUG source=device.go:240 msg="model weights" device=CUDA0 size="6.8 GiB"
time=2026-02-26T14:26:12.335Z level=DEBUG source=device.go:240 msg="model weights" device=CUDA1 size="3.0 GiB"
time=2026-02-26T14:26:12.335Z level=DEBUG source=device.go:245 msg="model weights" device=CPU size="6.4 GiB"
time=2026-02-26T14:26:12.335Z level=DEBUG source=device.go:251 msg="kv cache" device=CUDA0 size="2.0 GiB"
time=2026-02-26T14:26:12.335Z level=DEBUG source=device.go:251 msg="kv cache" device=CUDA1 size="843.3 MiB"
time=2026-02-26T14:26:12.335Z level=DEBUG source=device.go:256 msg="kv cache" device=CPU size="1.1 GiB"
time=2026-02-26T14:26:12.335Z level=DEBUG source=device.go:262 msg="compute graph" device=CUDA0 size="1.1 GiB"
time=2026-02-26T14:26:12.335Z level=DEBUG source=device.go:262 msg="compute graph" device=CUDA1 size="815.6 MiB"
time=2026-02-26T14:26:12.335Z level=DEBUG source=device.go:267 msg="compute graph" device=CPU size="1.0 GiB"
time=2026-02-26T14:26:12.335Z level=DEBUG source=device.go:272 msg="total memory" size="23.0 GiB"
time=2026-02-26T14:26:12.335Z level=DEBUG source=server.go:782 msg=memory success=true required.InputWeights=744714240 required.CPU.Weights="[472507264 238746496 238746496 233760768 238746496 238746496 238746496 233760768 215767936 215767936 238746496 210782208 215767936 238746496 215767936 210782208 238746496 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1965243328]" required.CPU.Cache="[81715200 81715200 81715200 16777216 81715200 81715200 81715200 16777216 81715200 81715200 81715200 16777216 81715200 81715200 81715200 16777216 81715200 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]" required.CPU.Graph=1098865536 required.CUDA0.ID=GPU-8416b7d6-2f5f-d827-5992-4affda11c96a required.CUDA0.Weights="[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 215768064 215768064 233760768 215768064 215768064 238746624 210782208 215768064 238746624 215768064 210782208 238746624 215768064 215768064 233760768 215768064 215768064 238746624 210782208 215768064 238746624 215768064 210782208 238746624 215768064 215768064 233760768 215768064 215768064 238746624 210782208 215768064 238746624 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]" required.CUDA0.Cache="[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 81715200 81715200 16777216 81715200 81715200 81715200 16777216 81715200 81715200 81715200 16777216 81715200 81715200 81715200 16777216 81715200 81715200 81715200 16777216 81715200 81715200 81715200 16777216 81715200 81715200 81715200 16777216 81715200 81715200 81715200 16777216 81715200 81715200 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]" required.CUDA0.Graph=1129623552 required.CUDA1.ID=GPU-a4c65198-ad5c-f3c2-d8a3-2829beb71c35 required.CUDA1.Weights="[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 215768064 210782208 238746624 215768064 215768064 233760768 238746624 238746624 238746624 233760768 238746624 238746624 238746624 233760768 0]" required.CUDA1.Cache="[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 81715200 16777216 81715200 81715200 81715200 16777216 81715200 81715200 81715200 16777216 81715200 81715200 81715200 16777216 0]" required.CUDA1.Graph=855166720
time=2026-02-26T14:26:12.336Z level=DEBUG source=server.go:976 msg="available gpu" id=GPU-8416b7d6-2f5f-d827-5992-4affda11c96a library=CUDA "available layer vram"="8.9 GiB" backoff=0.10 minimum="457.0 MiB" overhead="0 B" graph="1.1 GiB"
time=2026-02-26T14:26:12.336Z level=DEBUG source=server.go:976 msg="available gpu" id=GPU-a4c65198-ad5c-f3c2-d8a3-2829beb71c35 library=CUDA "available layer vram"="4.0 GiB" backoff=0.10 minimum="457.0 MiB" overhead="0 B" graph="815.6 MiB"
time=2026-02-26T14:26:12.336Z level=DEBUG source=server.go:793 msg="new layout created" layers="47[ID:GPU-8416b7d6-2f5f-d827-5992-4affda11c96a Layers:33(17..49) ID:GPU-a4c65198-ad5c-f3c2-d8a3-2829beb71c35 Layers:14(50..63)]"
time=2026-02-26T14:26:12.336Z level=INFO source=runner.go:1284 msg=load request="{Operation:commit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:Enabled KvSize:4096 KvCacheType: NumThreads:4 GPULayers:47[ID:GPU-8416b7d6-2f5f-d827-5992-4affda11c96a Layers:33(17..49) ID:GPU-a4c65198-ad5c-f3c2-d8a3-2829beb71c35 Layers:14(50..63)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2026-02-26T14:26:12.336Z level=INFO source=ggml.go:482 msg="offloading 47 repeating layers to GPU"
time=2026-02-26T14:26:12.336Z level=INFO source=ggml.go:486 msg="offloading output layer to CPU"
time=2026-02-26T14:26:12.336Z level=INFO source=ggml.go:494 msg="offloaded 47/65 layers to GPU"
time=2026-02-26T14:26:12.336Z level=INFO source=device.go:240 msg="model weights" device=CUDA0 size="6.8 GiB"
time=2026-02-26T14:26:12.336Z level=INFO source=device.go:240 msg="model weights" device=CUDA1 size="3.0 GiB"
time=2026-02-26T14:26:12.336Z level=INFO source=device.go:245 msg="model weights" device=CPU size="6.4 GiB"
time=2026-02-26T14:26:12.336Z level=INFO source=device.go:251 msg="kv cache" device=CUDA0 size="2.0 GiB"
time=2026-02-26T14:26:12.336Z level=INFO source=device.go:251 msg="kv cache" device=CUDA1 size="843.3 MiB"
time=2026-02-26T14:26:12.337Z level=INFO source=device.go:256 msg="kv cache" device=CPU size="1.1 GiB"
time=2026-02-26T14:26:12.337Z level=INFO source=device.go:262 msg="compute graph" device=CUDA0 size="1.1 GiB"
time=2026-02-26T14:26:12.337Z level=INFO source=device.go:262 msg="compute graph" device=CUDA1 size="815.6 MiB"
time=2026-02-26T14:26:12.337Z level=INFO source=device.go:267 msg="compute graph" device=CPU size="1.0 GiB"
time=2026-02-26T14:26:12.337Z level=INFO source=device.go:272 msg="total memory" size="23.0 GiB"
time=2026-02-26T14:26:12.337Z level=INFO source=sched.go:566 msg="loaded runners" count=1
time=2026-02-26T14:26:12.337Z level=INFO source=server.go:1350 msg="waiting for llama runner to start responding"
time=2026-02-26T14:26:12.355Z level=INFO source=server.go:1384 msg="waiting for server to become available" status="llm server loading model"
time=2026-02-26T14:26:12.355Z level=DEBUG source=server.go:1394 msg="model load progress 0.00"
[...]
time=2026-02-26T14:26:42.220Z level=DEBUG source=server.go:1394 msg="model load progress 0.99"
time=2026-02-26T14:26:42.441Z level=DEBUG source=ggml.go:324 msg="key with type not found" key=qwen35.pooling_type default=0
time=2026-02-26T14:26:42.471Z level=INFO source=server.go:1388 msg="llama runner started in 37.41 seconds"
time=2026-02-26T14:26:42.471Z level=DEBUG source=sched.go:578 msg="finished setting up" runner.name=registry.ollama.ai/library/qwen3.5:27b-q4_K_M runner.inference="[{ID:GPU-8416b7d6-2f5f-d827-5992-4affda11c96a Library:CUDA} {ID:GPU-a4c65198-ad5c-f3c2-d8a3-2829beb71c35 Library:CUDA}]" runner.size="23.0 GiB" runner.vram="14.5 GiB" runner.parallel=1 runner.pid=167 runner.model=/root/.ollama/models/blobs/sha256-7935de6e08f9444536d0edcacf19d2166b34bef8ddb4ac7ce9263ff5cad0693b runner.num_ctx=4096
[GIN] 2026/02/26 - 14:26:42 | 200 | 38.612650389s |      172.18.0.1 | POST     "/api/generate"
time=2026-02-26T14:26:42.472Z level=DEBUG source=sched.go:586 msg="context for request finished"
time=2026-02-26T14:26:42.472Z level=DEBUG source=sched.go:338 msg="runner with non-zero duration has gone idle, adding timer" runner.name=registry.ollama.ai/library/qwen3.5:27b-q4_K_M runner.inference="[{ID:GPU-8416b7d6-2f5f-d827-5992-4affda11c96a Library:CUDA} {ID:GPU-a4c65198-ad5c-f3c2-d8a3-2829beb71c35 Library:CUDA}]" runner.size="23.0 GiB" runner.vram="14.5 GiB" runner.parallel=1 runner.pid=167 runner.model=/root/.ollama/models/blobs/sha256-7935de6e08f9444536d0edcacf19d2166b34bef8ddb4ac7ce9263ff5cad0693b runner.num_ctx=4096 duration=1h0m0s
time=2026-02-26T14:26:42.472Z level=DEBUG source=sched.go:356 msg="after processing request finished event" runner.name=registry.ollama.ai/library/qwen3.5:27b-q4_K_M runner.inference="[{ID:GPU-8416b7d6-2f5f-d827-5992-4affda11c96a Library:CUDA} {ID:GPU-a4c65198-ad5c-f3c2-d8a3-2829beb71c35 Library:CUDA}]" runner.size="23.0 GiB" runner.vram="14.5 GiB" runner.parallel=1 runner.pid=167 runner.model=/root/.ollama/models/blobs/sha256-7935de6e08f9444536d0edcacf19d2166b34bef8ddb4ac7ce9263ff5cad0693b runner.num_ctx=4096 refCount=0
time=2026-02-26T14:26:42.914Z level=DEBUG source=sched.go:734 msg="evaluating already loaded" model=/root/.ollama/models/blobs/sha256-7935de6e08f9444536d0edcacf19d2166b34bef8ddb4ac7ce9263ff5cad0693b
time=2026-02-26T14:26:43.003Z level=DEBUG source=server.go:1536 msg="completion request" images=0 prompt=15825 format=""
time=2026-02-26T14:26:43.096Z level=WARN source=runner.go:187 msg="truncating input prompt" limit=4096 prompt=6317 keep=4 new=4096
time=2026-02-26T14:26:43.096Z level=DEBUG source=cache.go:151 msg="loading cache slot" id=0 cache=0 prompt=4096 used=0 remaining=4096
CUDA error: invalid argument
  current device: 0, in function ggml_cuda_cpy at //ml/backend/ggml/ggml/src/ggml-cuda/cpy.cu:438
  cudaMemcpyAsyncReserve(src1_ddc, src0_ddc, ggml_nbytes(src0), cudaMemcpyDeviceToDevice, main_stream)
//ml/backend/ggml/ggml/src/ggml-cuda/ggml-cuda.cu:94: CUDA error
/usr/lib/ollama/libggml-base.so.0(+0x1bae8)[0x719e783fcae8]
/usr/lib/ollama/libggml-base.so.0(ggml_print_backtrace+0x1e6)[0x719e783fceb6]
/usr/lib/ollama/libggml-base.so.0(ggml_abort+0x11d)[0x719e783fd03d]
/usr/lib/ollama/cuda_v12/libggml-cuda.so(+0x143272)[0x719deb69d272]
/usr/lib/ollama/cuda_v12/libggml-cuda.so(_Z13ggml_cuda_cpyR25ggml_backend_cuda_contextPK11ggml_tensorPS1_+0x1e50)[0x719deb65b1d0]
/usr/lib/ollama/cuda_v12/libggml-cuda.so(+0x153cae)[0x719deb6adcae]
/usr/lib/ollama/cuda_v12/libggml-cuda.so(+0x156eea)[0x719deb6b0eea]
/usr/lib/ollama/cuda_v12/libggml-cuda.so(+0x158e55)[0x719deb6b2e55]
/usr/bin/ollama(+0x13ac156)[0x596dc083b156]
/usr/bin/ollama(+0x132034b)[0x596dc07af34b]
/usr/bin/ollama(+0x3ddae1)[0x596dbf86cae1]
SIGABRT: abort
PC=0x719ec7cabb2c m=13 sigcode=18446744073709551610
signal arrived during cgo execution

goroutine 1506 gp=0xc0005828c0 m=13 mp=0xc000278008 [syscall]:
runtime.cgocall(0x596dc07af330, 0xc00008baa0)
	runtime/cgocall.go:167 +0x4b fp=0xc00008ba78 sp=0xc00008ba40 pc=0x596dbf861a6b
github.com/ollama/ollama/ml/backend/ggml._Cfunc_ggml_backend_sched_graph_compute_async(0x719e50104d90, 0x71971c0039a0)
	_cgo_gotypes.go:979 +0x4a fp=0xc00008baa0 sp=0xc00008ba78 pc=0x596dbfd4cb0a
github.com/ollama/ollama/ml/backend/ggml.(*Context).ComputeWithNotify.func2(...)
	github.com/ollama/ollama/ml/backend/ggml/ggml.go:825
github.com/ollama/ollama/ml/backend/ggml.(*Context).ComputeWithNotify(0xc001736100, 0xc000359800?, {0xc0094b7070, 0x1, 0x2?})
	github.com/ollama/ollama/ml/backend/ggml/ggml.go:825 +0x1b2 fp=0xc00008bb78 sp=0xc00008baa0 pc=0x596dbfd5b492
github.com/ollama/ollama/runner/ollamarunner.(*Server).computeBatch(0xc0002370e0, {0x0, {0x596dc11069d0, 0xc001736100}, {0x596dc1113e30, 0xc000660648}, {0xc0003c6c08, 0x200, 0x25f}, {{0x596dc1113e30, ...}, ...}, ...})
	github.com/ollama/ollama/runner/ollamarunner/runner.go:716 +0x862 fp=0xc00008bef0 sp=0xc00008bb78 pc=0x596dbfe86282
github.com/ollama/ollama/runner/ollamarunner.(*Server).run.gowrap1()
	github.com/ollama/ollama/runner/ollamarunner/runner.go:459 +0x58 fp=0xc00008bfe0 sp=0xc00008bef0 pc=0x596dbfe83f78
runtime.goexit({})
	runtime/asm_amd64.s:1700 +0x1 fp=0xc00008bfe8 sp=0xc00008bfe0 pc=0x596dbf86ce61
created by github.com/ollama/ollama/runner/ollamarunner.(*Server).run in goroutine 8
	github.com/ollama/ollama/runner/ollamarunner/runner.go:459 +0x2cd

goroutine 1 gp=0xc000002380 m=nil [IO wait]:
runtime.gopark(0x0?, 0x0?, 0x0?, 0x0?, 0x0?)
	runtime/proc.go:435 +0xce fp=0xc000339778 sp=0xc000339758 pc=0x596dbf864eee
runtime.netpollblock(0xc00051f7c8?, 0xbf7fe4a6?, 0x6d?)
	runtime/netpoll.go:575 +0xf7 fp=0xc0003397b0 sp=0xc000339778 pc=0x596dbf82a097
internal/poll.runtime_pollWait(0x719e800c8610, 0x72)
	runtime/netpoll.go:351 +0x85 fp=0xc0003397d0 sp=0xc0003397b0 pc=0x596dbf864105
internal/poll.(*pollDesc).wait(0xc0000dc900?, 0x900000036?, 0x0)
	internal/poll/fd_poll_runtime.go:84 +0x27 fp=0xc0003397f8 sp=0xc0003397d0 pc=0x596dbf8ec487
internal/poll.(*pollDesc).waitRead(...)
	internal/poll/fd_poll_runtime.go:89
internal/poll.(*FD).Accept(0xc0000dc900)
	internal/poll/fd_unix.go:620 +0x295 fp=0xc0003398a0 sp=0xc0003397f8 pc=0x596dbf8f1855
net.(*netFD).accept(0xc0000dc900)
	net/fd_unix.go:172 +0x29 fp=0xc000339958 sp=0xc0003398a0 pc=0x596dbf964d49
net.(*TCPListener).accept(0xc000051740)
	net/tcpsock_posix.go:159 +0x1b fp=0xc0003399a8 sp=0xc000339958 pc=0x596dbf97ac5b
net.(*TCPListener).Accept(0xc000051740)
	net/tcpsock.go:380 +0x30 fp=0xc0003399d8 sp=0xc0003399a8 pc=0x596dbf979b10
net/http.(*onceCloseListener).Accept(0xc0000f2480?)
	<autogenerated>:1 +0x24 fp=0xc0003399f0 sp=0xc0003399d8 pc=0x596dbfb919c4
net/http.(*Server).Serve(0xc0001ffb00, {0x596dc10f7bc0, 0xc000051740})
	net/http/server.go:3424 +0x30c fp=0xc000339b20 sp=0xc0003399f0 pc=0x596dbfb6928c
github.com/ollama/ollama/runner/ollamarunner.Execute({0xc000130030, 0x4, 0x4})
	github.com/ollama/ollama/runner/ollamarunner/runner.go:1447 +0x94e fp=0xc000339cf0 sp=0xc000339b20 pc=0x596dbfe8d20e
github.com/ollama/ollama/runner.Execute({0xc000130010?, 0x0?, 0x0?})
	github.com/ollama/ollama/runner/runner.go:18 +0x10e fp=0xc000339d30 sp=0xc000339cf0 pc=0x596dbff2c76e
github.com/ollama/ollama/cmd.NewCLI.func3(0xc0001ff800?, {0x596dc0b15236?, 0x4?, 0x596dc0b1523a?})
	github.com/ollama/ollama/cmd/cmd.go:2270 +0x45 fp=0xc000339d58 sp=0xc000339d30 pc=0x596dc073f845
github.com/spf13/cobra.(*Command).execute(0xc0000f7b08, {0xc0005b96d0, 0x5, 0x5})
	github.com/spf13/cobra@v1.7.0/command.go:940 +0x85c fp=0xc000339e78 sp=0xc000339d58 pc=0x596dbf9decdc
github.com/spf13/cobra.(*Command).ExecuteC(0xc0005be908)
	github.com/spf13/cobra@v1.7.0/command.go:1068 +0x3a5 fp=0xc000339f30 sp=0xc000339e78 pc=0x596dbf9df525
github.com/spf13/cobra.(*Command).Execute(...)
	github.com/spf13/cobra@v1.7.0/command.go:992
github.com/spf13/cobra.(*Command).ExecuteContext(...)
	github.com/spf13/cobra@v1.7.0/command.go:985
main.main()
	github.com/ollama/ollama/main.go:12 +0x4d fp=0xc000339f50 sp=0xc000339f30 pc=0x596dc0741ced
runtime.main()
	runtime/proc.go:283 +0x29d fp=0xc000339fe0 sp=0xc000339f50 pc=0x596dbf83171d
runtime.goexit({})
	runtime/asm_amd64.s:1700 +0x1 fp=0xc000339fe8 sp=0xc000339fe0 pc=0x596dbf86ce61

goroutine 2 gp=0xc000002e00 m=nil [force gc (idle)]:
runtime.gopark(0x0?, 0x0?, 0x0?, 0x0?, 0x0?)
	runtime/proc.go:435 +0xce fp=0xc000072fa8 sp=0xc000072f88 pc=0x596dbf864eee
runtime.goparkunlock(...)
	runtime/proc.go:441
runtime.forcegchelper()
	runtime/proc.go:348 +0xb8 fp=0xc000072fe0 sp=0xc000072fa8 pc=0x596dbf831a58
runtime.goexit({})
	runtime/asm_amd64.s:1700 +0x1 fp=0xc000072fe8 sp=0xc000072fe0 pc=0x596dbf86ce61
created by runtime.init.7 in goroutine 1
	runtime/proc.go:336 +0x1a

goroutine 3 gp=0xc000003340 m=nil [GC sweep wait]:
runtime.gopark(0x1?, 0x0?, 0x0?, 0x0?, 0x0?)
	runtime/proc.go:435 +0xce fp=0xc000073780 sp=0xc000073760 pc=0x596dbf864eee
runtime.goparkunlock(...)
	runtime/proc.go:441
runtime.bgsweep(0xc00007e000)
	runtime/mgcsweep.go:316 +0xdf fp=0xc0000737c8 sp=0xc000073780 pc=0x596dbf81c1ff
runtime.gcenable.gowrap1()
	runtime/mgc.go:204 +0x25 fp=0xc0000737e0 sp=0xc0000737c8 pc=0x596dbf8105e5
runtime.goexit({})
	runtime/asm_amd64.s:1700 +0x1 fp=0xc0000737e8 sp=0xc0000737e0 pc=0x596dbf86ce61
created by runtime.gcenable in goroutine 1
	runtime/mgc.go:204 +0x66

goroutine 4 gp=0xc000003500 m=nil [GC scavenge wait]:
runtime.gopark(0x4f1263?, 0x4c195c?, 0x0?, 0x0?, 0x0?)
	runtime/proc.go:435 +0xce fp=0xc000073f78 sp=0xc000073f58 pc=0x596dbf864eee
runtime.goparkunlock(...)
	runtime/proc.go:441
runtime.(*scavengerState).park(0x596dc1b2e5a0)
	runtime/mgcscavenge.go:425 +0x49 fp=0xc000073fa8 sp=0xc000073f78 pc=0x596dbf819c49
runtime.bgscavenge(0xc00007e000)
	runtime/mgcscavenge.go:658 +0x59 fp=0xc000073fc8 sp=0xc000073fa8 pc=0x596dbf81a1d9
runtime.gcenable.gowrap2()
	runtime/mgc.go:205 +0x25 fp=0xc000073fe0 sp=0xc000073fc8 pc=0x596dbf810585
runtime.goexit({})
	runtime/asm_amd64.s:1700 +0x1 fp=0xc000073fe8 sp=0xc000073fe0 pc=0x596dbf86ce61
created by runtime.gcenable in goroutine 1
	runtime/mgc.go:205 +0xa5

goroutine 18 gp=0xc000102700 m=nil [finalizer wait]:
runtime.gopark(0x1b8?, 0xc000002380?, 0x1?, 0x23?, 0xc000072688?)
	runtime/proc.go:435 +0xce fp=0xc000072630 sp=0xc000072610 pc=0x596dbf864eee
runtime.runfinq()
	runtime/mfinal.go:196 +0x107 fp=0xc0000727e0 sp=0xc000072630 pc=0x596dbf80f5a7
runtime.goexit({})
	runtime/asm_amd64.s:1700 +0x1 fp=0xc0000727e8 sp=0xc0000727e0 pc=0x596dbf86ce61
created by runtime.createfing in goroutine 1
	runtime/mfinal.go:166 +0x3d

goroutine 19 gp=0xc000103180 m=nil [chan receive]:
runtime.gopark(0xc000235b80?, 0xc000010168?, 0x60?, 0xe7?, 0x596dbf94b8a8?)
	runtime/proc.go:435 +0xce fp=0xc00006e718 sp=0xc00006e6f8 pc=0x596dbf864eee
runtime.chanrecv(0xc000110310, 0x0, 0x1)
	runtime/chan.go:664 +0x445 fp=0xc00006e790 sp=0xc00006e718 pc=0x596dbf801085
runtime.chanrecv1(0x0?, 0x0?)
	runtime/chan.go:506 +0x12 fp=0xc00006e7b8 sp=0xc00006e790 pc=0x596dbf800c12
runtime.unique_runtime_registerUniqueMapCleanup.func2(...)
	runtime/mgc.go:1796
runtime.unique_runtime_registerUniqueMapCleanup.gowrap1()
	runtime/mgc.go:1799 +0x2f fp=0xc00006e7e0 sp=0xc00006e7b8 pc=0x596dbf81378f
runtime.goexit({})
	runtime/asm_amd64.s:1700 +0x1 fp=0xc00006e7e8 sp=0xc00006e7e0 pc=0x596dbf86ce61
created by unique.runtime_registerUniqueMapCleanup in goroutine 1
	runtime/mgc.go:1794 +0x85

goroutine 20 gp=0xc000103500 m=nil [GC worker (idle)]:
runtime.gopark(0x36c21352c46?, 0x3?, 0xa9?, 0xc9?, 0x0?)
	runtime/proc.go:435 +0xce fp=0xc00006ef38 sp=0xc00006ef18 pc=0x596dbf864eee
runtime.gcBgMarkWorker(0xc000111730)
	runtime/mgc.go:1423 +0xe9 fp=0xc00006efc8 sp=0xc00006ef38 pc=0x596dbf812aa9
runtime.gcBgMarkStartWorkers.gowrap1()
	runtime/mgc.go:1339 +0x25 fp=0xc00006efe0 sp=0xc00006efc8 pc=0x596dbf812985
runtime.goexit({})
	runtime/asm_amd64.s:1700 +0x1 fp=0xc00006efe8 sp=0xc00006efe0 pc=0x596dbf86ce61
created by runtime.gcBgMarkStartWorkers in goroutine 1
	runtime/mgc.go:1339 +0x105

goroutine 34 gp=0xc000504000 m=nil [GC worker (idle)]:
runtime.gopark(0x36ad8f541be?, 0x1?, 0x49?, 0x33?, 0x0?)
	runtime/proc.go:435 +0xce fp=0xc00050a738 sp=0xc00050a718 pc=0x596dbf864eee
runtime.gcBgMarkWorker(0xc000111730)
	runtime/mgc.go:1423 +0xe9 fp=0xc00050a7c8 sp=0xc00050a738 pc=0x596dbf812aa9
runtime.gcBgMarkStartWorkers.gowrap1()
	runtime/mgc.go:1339 +0x25 fp=0xc00050a7e0 sp=0xc00050a7c8 pc=0x596dbf812985
runtime.goexit({})
	runtime/asm_amd64.s:1700 +0x1 fp=0xc00050a7e8 sp=0xc00050a7e0 pc=0x596dbf86ce61
created by runtime.gcBgMarkStartWorkers in goroutine 1
	runtime/mgc.go:1339 +0x105

goroutine 5 gp=0xc000003a40 m=nil [GC worker (idle)]:
runtime.gopark(0x36c21af2acb?, 0x1?, 0x52?, 0xef?, 0x0?)
	runtime/proc.go:435 +0xce fp=0xc000074738 sp=0xc000074718 pc=0x596dbf864eee
runtime.gcBgMarkWorker(0xc000111730)
	runtime/mgc.go:1423 +0xe9 fp=0xc0000747c8 sp=0xc000074738 pc=0x596dbf812aa9
runtime.gcBgMarkStartWorkers.gowrap1()
	runtime/mgc.go:1339 +0x25 fp=0xc0000747e0 sp=0xc0000747c8 pc=0x596dbf812985
runtime.goexit({})
	runtime/asm_amd64.s:1700 +0x1 fp=0xc0000747e8 sp=0xc0000747e0 pc=0x596dbf86ce61
created by runtime.gcBgMarkStartWorkers in goroutine 1
	runtime/mgc.go:1339 +0x105

goroutine 6 gp=0xc000003c00 m=nil [GC worker (idle)]:
runtime.gopark(0x596dc1c03520?, 0x1?, 0xc2?, 0x9b?, 0x0?)
	runtime/proc.go:435 +0xce fp=0xc000074f38 sp=0xc000074f18 pc=0x596dbf864eee
runtime.gcBgMarkWorker(0xc000111730)
	runtime/mgc.go:1423 +0xe9 fp=0xc000074fc8 sp=0xc000074f38 pc=0x596dbf812aa9
runtime.gcBgMarkStartWorkers.gowrap1()
	runtime/mgc.go:1339 +0x25 fp=0xc000074fe0 sp=0xc000074fc8 pc=0x596dbf812985
runtime.goexit({})
	runtime/asm_amd64.s:1700 +0x1 fp=0xc000074fe8 sp=0xc000074fe0 pc=0x596dbf86ce61
created by runtime.gcBgMarkStartWorkers in goroutine 1
	runtime/mgc.go:1339 +0x105

goroutine 7 gp=0xc000003dc0 m=nil [GC worker (idle)]:
runtime.gopark(0x36c21352f04?, 0x3?, 0x14?, 0x67?, 0x0?)
	runtime/proc.go:435 +0xce fp=0xc000075738 sp=0xc000075718 pc=0x596dbf864eee
runtime.gcBgMarkWorker(0xc000111730)
	runtime/mgc.go:1423 +0xe9 fp=0xc0000757c8 sp=0xc000075738 pc=0x596dbf812aa9
runtime.gcBgMarkStartWorkers.gowrap1()
	runtime/mgc.go:1339 +0x25 fp=0xc0000757e0 sp=0xc0000757c8 pc=0x596dbf812985
runtime.goexit({})
	runtime/asm_amd64.s:1700 +0x1 fp=0xc0000757e8 sp=0xc0000757e0 pc=0x596dbf86ce61
created by runtime.gcBgMarkStartWorkers in goroutine 1
	runtime/mgc.go:1339 +0x105

goroutine 21 gp=0xc0001036c0 m=nil [GC worker (idle)]:
runtime.gopark(0x36c21adf40d?, 0x1?, 0x71?, 0x67?, 0x0?)
	runtime/proc.go:435 +0xce fp=0xc00006f738 sp=0xc00006f718 pc=0x596dbf864eee
runtime.gcBgMarkWorker(0xc000111730)
	runtime/mgc.go:1423 +0xe9 fp=0xc00006f7c8 sp=0xc00006f738 pc=0x596dbf812aa9
runtime.gcBgMarkStartWorkers.gowrap1()
	runtime/mgc.go:1339 +0x25 fp=0xc00006f7e0 sp=0xc00006f7c8 pc=0x596dbf812985
runtime.goexit({})
	runtime/asm_amd64.s:1700 +0x1 fp=0xc00006f7e8 sp=0xc00006f7e0 pc=0x596dbf86ce61
created by runtime.gcBgMarkStartWorkers in goroutine 1
	runtime/mgc.go:1339 +0x105

goroutine 35 gp=0xc0005041c0 m=nil [GC worker (idle)]:
runtime.gopark(0x36c21b9830f?, 0x3?, 0x87?, 0x43?, 0x0?)
	runtime/proc.go:435 +0xce fp=0xc00050af38 sp=0xc00050af18 pc=0x596dbf864eee
runtime.gcBgMarkWorker(0xc000111730)
	runtime/mgc.go:1423 +0xe9 fp=0xc00050afc8 sp=0xc00050af38 pc=0x596dbf812aa9
runtime.gcBgMarkStartWorkers.gowrap1()
	runtime/mgc.go:1339 +0x25 fp=0xc00050afe0 sp=0xc00050afc8 pc=0x596dbf812985
runtime.goexit({})
	runtime/asm_amd64.s:1700 +0x1 fp=0xc00050afe8 sp=0xc00050afe0 pc=0x596dbf86ce61
created by runtime.gcBgMarkStartWorkers in goroutine 1
	runtime/mgc.go:1339 +0x105

goroutine 36 gp=0xc000504380 m=nil [GC worker (idle)]:
runtime.gopark(0x36c21abc622?, 0x1?, 0x68?, 0xa1?, 0x0?)
	runtime/proc.go:435 +0xce fp=0xc00050b738 sp=0xc00050b718 pc=0x596dbf864eee
runtime.gcBgMarkWorker(0xc000111730)
	runtime/mgc.go:1423 +0xe9 fp=0xc00050b7c8 sp=0xc00050b738 pc=0x596dbf812aa9
runtime.gcBgMarkStartWorkers.gowrap1()
	runtime/mgc.go:1339 +0x25 fp=0xc00050b7e0 sp=0xc00050b7c8 pc=0x596dbf812985
runtime.goexit({})
	runtime/asm_amd64.s:1700 +0x1 fp=0xc00050b7e8 sp=0xc00050b7e0 pc=0x596dbf86ce61
created by runtime.gcBgMarkStartWorkers in goroutine 1
	runtime/mgc.go:1339 +0x105

goroutine 8 gp=0xc000505180 m=nil [chan receive]:
runtime.gopark(0x30?, 0x596dc1035d80?, 0x1?, 0x2?, 0xc000335798?)
	runtime/proc.go:435 +0xce fp=0xc000335750 sp=0xc000335730 pc=0x596dbf864eee
runtime.chanrecv(0xc0001101c0, 0x0, 0x1)
	runtime/chan.go:664 +0x445 fp=0xc0003357c8 sp=0xc000335750 pc=0x596dbf801085
runtime.chanrecv1(0x596dc0b58664?, 0x29?)
	runtime/chan.go:506 +0x12 fp=0xc0003357f0 sp=0xc0003357c8 pc=0x596dbf800c12
github.com/ollama/ollama/runner/ollamarunner.(*Server).forwardBatch(_, {0x1, {0x596dc11069d0, 0xc006fe4000}, {0x596dc1113e30, 0xc002c790b0}, {0xc00227a008, 0x200, 0x25f}, {{0x596dc1113e30, ...}, ...}, ...})
	github.com/ollama/ollama/runner/ollamarunner/runner.go:476 +0xfa fp=0xc000335b58 sp=0xc0003357f0 pc=0x596dbfe8409a
github.com/ollama/ollama/runner/ollamarunner.(*Server).run(0xc0002370e0, {0x596dc10fa520, 0xc0005b9770})
	github.com/ollama/ollama/runner/ollamarunner/runner.go:453 +0x18c fp=0xc000335fb8 sp=0xc000335b58 pc=0x596dbfe83d4c
github.com/ollama/ollama/runner/ollamarunner.Execute.gowrap1()
	github.com/ollama/ollama/runner/ollamarunner/runner.go:1424 +0x28 fp=0xc000335fe0 sp=0xc000335fb8 pc=0x596dbfe8d488
runtime.goexit({})
	runtime/asm_amd64.s:1700 +0x1 fp=0xc000335fe8 sp=0xc000335fe0 pc=0x596dbf86ce61
created by github.com/ollama/ollama/runner/ollamarunner.Execute in goroutine 1
	github.com/ollama/ollama/runner/ollamarunner/runner.go:1424 +0x4c9

goroutine 9 gp=0xc000505340 m=nil [select]:
runtime.gopark(0xc000049a08?, 0x2?, 0xc0?, 0x97?, 0xc00004986c?)
	runtime/proc.go:435 +0xce fp=0xc000049698 sp=0xc000049678 pc=0x596dbf864eee
runtime.selectgo(0xc000049a08, 0xc000049868, 0x1000?, 0x0, 0x1?, 0x1)
	runtime/select.go:351 +0x837 fp=0xc0000497d0 sp=0xc000049698 pc=0x596dbf843c17
github.com/ollama/ollama/runner/ollamarunner.(*Server).completion(0xc0002370e0, {0x596dc10f7da0, 0xc0000e81c0}, 0xc0003a8280)
	github.com/ollama/ollama/runner/ollamarunner/runner.go:956 +0xc4e fp=0xc000049ac0 sp=0xc0000497d0 pc=0x596dbfe8852e
github.com/ollama/ollama/runner/ollamarunner.(*Server).completion-fm({0x596dc10f7da0?, 0xc0000e81c0?}, 0xc00033bb40?)
	<autogenerated>:1 +0x36 fp=0xc000049af0 sp=0xc000049ac0 pc=0x596dbfe8d976
net/http.HandlerFunc.ServeHTTP(0xc0000c8780?, {0x596dc10f7da0?, 0xc0000e81c0?}, 0xc00033bb60?)
	net/http/server.go:2294 +0x29 fp=0xc000049b18 sp=0xc000049af0 pc=0x596dbfb658c9
net/http.(*ServeMux).ServeHTTP(0x596dbf809ac5?, {0x596dc10f7da0, 0xc0000e81c0}, 0xc0003a8280)
	net/http/server.go:2822 +0x1c4 fp=0xc000049b68 sp=0xc000049b18 pc=0x596dbfb677c4
net/http.serverHandler.ServeHTTP({0x596dc10f4090?}, {0x596dc10f7da0?, 0xc0000e81c0?}, 0x1?)
	net/http/server.go:3301 +0x8e fp=0xc000049b98 sp=0xc000049b68 pc=0x596dbfb8524e
net/http.(*conn).serve(0xc0000f2480, {0x596dc10fa4e8, 0xc0000f13b0})
	net/http/server.go:2102 +0x625 fp=0xc000049fb8 sp=0xc000049b98 pc=0x596dbfb63dc5
net/http.(*Server).Serve.gowrap3()
	net/http/server.go:3454 +0x28 fp=0xc000049fe0 sp=0xc000049fb8 pc=0x596dbfb69688
runtime.goexit({})
	runtime/asm_amd64.s:1700 +0x1 fp=0xc000049fe8 sp=0xc000049fe0 pc=0x596dbf86ce61
created by net/http.(*Server).Serve in goroutine 1
	net/http/server.go:3454 +0x485

goroutine 1406 gp=0xc000505c00 m=nil [IO wait]:
runtime.gopark(0x0?, 0x0?, 0x0?, 0x0?, 0xb?)
	runtime/proc.go:435 +0xce fp=0xc000187dd8 sp=0xc000187db8 pc=0x596dbf864eee
runtime.netpollblock(0x596dbf888798?, 0xbf7fe4a6?, 0x6d?)
	runtime/netpoll.go:575 +0xf7 fp=0xc000187e10 sp=0xc000187dd8 pc=0x596dbf82a097
internal/poll.runtime_pollWait(0x719e800c84f8, 0x72)
	runtime/netpoll.go:351 +0x85 fp=0xc000187e30 sp=0xc000187e10 pc=0x596dbf864105
internal/poll.(*pollDesc).wait(0xc0000dc980?, 0xc000574041?, 0x0)
	internal/poll/fd_poll_runtime.go:84 +0x27 fp=0xc000187e58 sp=0xc000187e30 pc=0x596dbf8ec487
internal/poll.(*pollDesc).waitRead(...)
	internal/poll/fd_poll_runtime.go:89
internal/poll.(*FD).Read(0xc0000dc980, {0xc000574041, 0x1, 0x1})
	internal/poll/fd_unix.go:165 +0x27a fp=0xc000187ef0 sp=0xc000187e58 pc=0x596dbf8ed77a
net.(*netFD).Read(0xc0000dc980, {0xc000574041?, 0xc0005b6198?, 0xc000187f70?})
	net/fd_posix.go:55 +0x25 fp=0xc000187f38 sp=0xc000187ef0 pc=0x596dbf962da5
net.(*conn).Read(0xc00052e6e0, {0xc000574041?, 0xc0005b7a80?, 0x596dbfbdbf80?})
	net/net.go:194 +0x45 fp=0xc000187f80 sp=0xc000187f38 pc=0x596dbf971165
net/http.(*connReader).backgroundRead(0xc000574030)
	net/http/server.go:690 +0x37 fp=0xc000187fc8 sp=0xc000187f80 pc=0x596dbfb5dc97
net/http.(*connReader).startBackgroundRead.gowrap2()
	net/http/server.go:686 +0x25 fp=0xc000187fe0 sp=0xc000187fc8 pc=0x596dbfb5dbc5
runtime.goexit({})
	runtime/asm_amd64.s:1700 +0x1 fp=0xc000187fe8 sp=0xc000187fe0 pc=0x596dbf86ce61
created by net/http.(*connReader).startBackgroundRead in goroutine 9
	net/http/server.go:686 +0xb6

goroutine 1494 gp=0xc0008b4380 m=nil [chan receive]:
runtime.gopark(0x30?, 0x596dc1035d80?, 0x1?, 0xf8?, 0xc001a93b20?)
	runtime/proc.go:435 +0xce fp=0xc001a93ad8 sp=0xc001a93ab8 pc=0x596dbf864eee
runtime.chanrecv(0xc00031e150, 0x0, 0x1)
	runtime/chan.go:664 +0x445 fp=0xc001a93b50 sp=0xc001a93ad8 pc=0x596dbf801085
runtime.chanrecv1(0x596dc0b5c342?, 0x2c?)
	runtime/chan.go:506 +0x12 fp=0xc001a93b78 sp=0xc001a93b50 pc=0x596dbf800c12
github.com/ollama/ollama/runner/ollamarunner.(*Server).computeBatch(0xc0002370e0, {0x1, {0x596dc11069d0, 0xc006fe4000}, {0x596dc1113e30, 0xc002c790b0}, {0xc00227a008, 0x200, 0x25f}, {{0x596dc1113e30, ...}, ...}, ...})
	github.com/ollama/ollama/runner/ollamarunner/runner.go:645 +0x185 fp=0xc001a93ef0 sp=0xc001a93b78 pc=0x596dbfe85ba5
github.com/ollama/ollama/runner/ollamarunner.(*Server).run.gowrap1()
	github.com/ollama/ollama/runner/ollamarunner/runner.go:459 +0x58 fp=0xc001a93fe0 sp=0xc001a93ef0 pc=0x596dbfe83f78
runtime.goexit({})
	runtime/asm_amd64.s:1700 +0x1 fp=0xc001a93fe8 sp=0xc001a93fe0 pc=0x596dbf86ce61
created by github.com/ollama/ollama/runner/ollamarunner.(*Server).run in goroutine 8
	github.com/ollama/ollama/runner/ollamarunner/runner.go:459 +0x2cd

rax    0x0
rbx    0xb8
rcx    0x719ec7cabb2c
rdx    0x6
rdi    0xa7
rsi    0xb8
rbp    0x719d6cd95c00
rsp    0x719d6cd95bc0
r8     0x0
r9     0x7
r10    0x8
r11    0x246
r12    0x6
r13    0x719debdf6710
r14    0x16
r15    0x300000
rip    0x719ec7cabb2c
rflags 0x246
cs     0x33
fs     0x0
gs     0x0
time=2026-02-26T14:26:44.147Z level=ERROR source=server.go:1610 msg="post predict" error="Post \"http://127.0.0.1:37403/completion\": EOF"
[GIN] 2026/02/26 - 14:26:44 | 500 |  1.670626538s |      172.18.0.1 | POST     "/api/generate"
time=2026-02-26T14:26:44.147Z level=DEBUG source=sched.go:433 msg="context for request finished" runner.name=registry.ollama.ai/library/qwen3.5:27b-q4_K_M runner.inference="[{ID:GPU-8416b7d6-2f5f-d827-5992-4affda11c96a Library:CUDA} {ID:GPU-a4c65198-ad5c-f3c2-d8a3-2829beb71c35 Library:CUDA}]" runner.size="23.0 GiB" runner.vram="14.5 GiB" runner.parallel=1 runner.pid=167 runner.model=/root/.ollama/models/blobs/sha256-7935de6e08f9444536d0edcacf19d2166b34bef8ddb4ac7ce9263ff5cad0693b runner.num_ctx=4096
time=2026-02-26T14:26:44.147Z level=DEBUG source=sched.go:338 msg="runner with non-zero duration has gone idle, adding timer" runner.name=registry.ollama.ai/library/qwen3.5:27b-q4_K_M runner.inference="[{ID:GPU-8416b7d6-2f5f-d827-5992-4affda11c96a Library:CUDA} {ID:GPU-a4c65198-ad5c-f3c2-d8a3-2829beb71c35 Library:CUDA}]" runner.size="23.0 GiB" runner.vram="14.5 GiB" runner.parallel=1 runner.pid=167 runner.model=/root/.ollama/models/blobs/sha256-7935de6e08f9444536d0edcacf19d2166b34bef8ddb4ac7ce9263ff5cad0693b runner.num_ctx=4096 duration=1h0m0s
time=2026-02-26T14:26:44.147Z level=DEBUG source=sched.go:356 msg="after processing request finished event" runner.name=registry.ollama.ai/library/qwen3.5:27b-q4_K_M runner.inference="[{ID:GPU-8416b7d6-2f5f-d827-5992-4affda11c96a Library:CUDA} {ID:GPU-a4c65198-ad5c-f3c2-d8a3-2829beb71c35 Library:CUDA}]" runner.size="23.0 GiB" runner.vram="14.5 GiB" runner.parallel=1 runner.pid=167 runner.model=/root/.ollama/models/blobs/sha256-7935de6e08f9444536d0edcacf19d2166b34bef8ddb4ac7ce9263ff5cad0693b runner.num_ctx=4096 refCount=0
time=2026-02-26T14:26:44.226Z level=ERROR source=server.go:304 msg="llama runner terminated" error="exit status 2"
<!-- gh-comment-id:3969038426 --> @ArthurusDent commented on GitHub (Feb 26, 2026):

I'm seeing the same bug when running `qwen3.5:27b-q4_K_M` (193ec05b1e80). I hope it's OK to add my log here even though it's a different model; if not, tell me and I'll open a separate issue. Running the model from the command line or in Open WebUI with the default context window of 4096, i.e. without truncation, doesn't crash it. Gemini suggested the crash is caused by flash attention combined with mixing two GPU architectures, Pascal (1060 6GB) and Turing (2060 12GB), but I've benchmarked many models, including qwen3, and never saw this bug, so maybe it's a different issue.

Deployment: Docker
Ollama: 0.17.1
OS: Ubuntu Server 24.04

<details>

```bash
time=2026-02-26T14:26:04.438Z level=DEBUG source=runner.go:264 msg="refreshing free memory"
time=2026-02-26T14:26:04.438Z level=DEBUG source=runner.go:328 msg="unable to refresh all GPUs with existing runners, performing bootstrap discovery"
time=2026-02-26T14:26:04.439Z level=INFO source=server.go:431 msg="starting runner" cmd="/usr/bin/ollama runner --ollama-engine --port 38035"
time=2026-02-26T14:26:04.439Z level=DEBUG source=server.go:432 msg=subprocess PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin OLLAMA_DEBUG=1 OLLAMA_KEEP_ALIVE=60m LD_LIBRARY_PATH=/usr/lib/ollama:/usr/lib/ollama/cuda_v12:/usr/local/nvidia/lib:/usr/local/nvidia/lib64 OLLAMA_HOST=0.0.0.0:11434 OLLAMA_LIBRARY_PATH=/usr/lib/ollama:/usr/lib/ollama/cuda_v12
time=2026-02-26T14:26:04.846Z level=DEBUG source=runner.go:437 msg="bootstrap discovery took" duration=407.512657ms OLLAMA_LIBRARY_PATH="[/usr/lib/ollama /usr/lib/ollama/cuda_v12]" extra_envs=map[]
time=2026-02-26T14:26:04.846Z level=DEBUG source=runner.go:40 msg="overall device VRAM discovery took" duration=407.794952ms
time=2026-02-26T14:26:04.847Z level=WARN source=cpu_linux.go:130 msg="failed to parse CPU allowed micro secs" error="strconv.ParseInt: parsing \"max\": invalid syntax"
time=2026-02-26T14:26:04.847Z level=DEBUG source=sched.go:222 msg="updating default concurrency" OLLAMA_MAX_LOADED_MODELS=6 gpu_count=2 time=2026-02-26T14:26:04.893Z level=DEBUG source=ggml.go:324 msg="key with type not found" key=general.alignment default=32 time=2026-02-26T14:26:04.895Z level=DEBUG source=sched.go:258 msg="loading first model" model=/root/.ollama/models/blobs/sha256-7935de6e08f9444536d0edcacf19d2166b34bef8ddb4ac7ce9263ff5cad0693b time=2026-02-26T14:26:05.055Z level=DEBUG source=ggml.go:324 msg="key with type not found" key=general.alignment default=32 time=2026-02-26T14:26:05.059Z level=DEBUG source=ggml.go:324 msg="key with type not found" key=qwen35.pooling_type default=0 time=2026-02-26T14:26:05.059Z level=DEBUG source=ggml.go:324 msg="key with type not found" key=qwen35.attention.head_count_kv default=0 time=2026-02-26T14:26:05.059Z level=DEBUG source=ggml.go:324 msg="key with type not found" key=qwen35.expert_count default=0 time=2026-02-26T14:26:05.059Z level=DEBUG source=ggml.go:324 msg="key with type not found" key=qwen35.rope.scaling.type default="" time=2026-02-26T14:26:05.059Z level=DEBUG source=ggml.go:324 msg="key with type not found" key=qwen35.rope.type default="" time=2026-02-26T14:26:05.059Z level=DEBUG source=ggml.go:324 msg="key with type not found" key=qwen35.rope.scaling.factor default=1 time=2026-02-26T14:26:05.059Z level=DEBUG source=ggml.go:324 msg="key with type not found" key=qwen35.rope.scaling.original_context_length default=0 time=2026-02-26T14:26:05.059Z level=DEBUG source=ggml.go:324 msg="key with type not found" key=qwen35.attention.scale default=0 time=2026-02-26T14:26:05.059Z level=DEBUG source=ggml.go:324 msg="key with type not found" key=qwen35.expert_count default=0 time=2026-02-26T14:26:05.059Z level=DEBUG source=ggml.go:324 msg="key with type not found" key=qwen35.expert_used_count default=0 time=2026-02-26T14:26:05.059Z level=DEBUG source=ggml.go:324 msg="key with type not found" key=qwen35.norm_top_k_prob 
default=true time=2026-02-26T14:26:05.059Z level=DEBUG source=ggml.go:324 msg="key with type not found" key=qwen35.mrope_interleaved default=false time=2026-02-26T14:26:05.059Z level=DEBUG source=ggml.go:324 msg="key with type not found" key=qwen35.vision.attention.layer_norm_epsilon default=9.999999974752427e-07 time=2026-02-26T14:26:05.059Z level=DEBUG source=ggml.go:324 msg="key with type not found" key=qwen35.vision.rope.freq_base default=10000 time=2026-02-26T14:26:05.059Z level=DEBUG source=ggml.go:324 msg="key with type not found" key=qwen35.vision.num_positional_embeddings default=2304 time=2026-02-26T14:26:05.059Z level=DEBUG source=ggml.go:324 msg="key with type not found" key=tokenizer.ggml.add_bos_token default=false time=2026-02-26T14:26:05.059Z level=DEBUG source=ggml.go:324 msg="key with type not found" key=tokenizer.ggml.bos_token_id default=0 time=2026-02-26T14:26:05.060Z level=INFO source=server.go:247 msg="enabling flash attention" time=2026-02-26T14:26:05.060Z level=INFO source=server.go:431 msg="starting runner" cmd="/usr/bin/ollama runner --ollama-engine --model /root/.ollama/models/blobs/sha256-7935de6e08f9444536d0edcacf19d2166b34bef8ddb4ac7ce9263ff5cad0693b --port 37403" time=2026-02-26T14:26:05.060Z level=DEBUG source=server.go:432 msg=subprocess PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin OLLAMA_DEBUG=1 OLLAMA_KEEP_ALIVE=60m LD_LIBRARY_PATH=/usr/lib/ollama:/usr/lib/ollama/cuda_v12:/usr/local/nvidia/lib:/usr/local/nvidia/lib64 OLLAMA_HOST=0.0.0.0:11434 OLLAMA_LIBRARY_PATH=/usr/lib/ollama:/usr/lib/ollama/cuda_v12 time=2026-02-26T14:26:05.061Z level=INFO source=sched.go:491 msg="system memory" total="15.5 GiB" free="6.1 GiB" free_swap="19.6 GiB" time=2026-02-26T14:26:05.061Z level=INFO source=sched.go:498 msg="gpu memory" id=GPU-8416b7d6-2f5f-d827-5992-4affda11c96a library=CUDA available="11.1 GiB" free="11.6 GiB" minimum="457.0 MiB" overhead="0 B" time=2026-02-26T14:26:05.061Z level=INFO source=sched.go:498 msg="gpu 
memory" id=GPU-a4c65198-ad5c-f3c2-d8a3-2829beb71c35 library=CUDA available="5.4 GiB" free="5.9 GiB" minimum="457.0 MiB" overhead="0 B" time=2026-02-26T14:26:05.061Z level=INFO source=server.go:757 msg="loading model" "model layers"=65 requested=-1 time=2026-02-26T14:26:05.076Z level=INFO source=runner.go:1411 msg="starting ollama engine" time=2026-02-26T14:26:05.076Z level=INFO source=runner.go:1446 msg="Server listening on 127.0.0.1:37403" time=2026-02-26T14:26:05.084Z level=INFO source=runner.go:1284 msg=load request="{Operation:fit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:Enabled KvSize:4096 KvCacheType: NumThreads:4 GPULayers:65[ID:GPU-8416b7d6-2f5f-d827-5992-4affda11c96a Layers:65(0..64)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}" time=2026-02-26T14:26:05.162Z level=DEBUG source=ggml.go:324 msg="key with type not found" key=general.alignment default=32 time=2026-02-26T14:26:05.165Z level=DEBUG source=ggml.go:324 msg="key with type not found" key=general.name default="" time=2026-02-26T14:26:05.165Z level=DEBUG source=ggml.go:324 msg="key with type not found" key=general.description default="" time=2026-02-26T14:26:05.165Z level=INFO source=ggml.go:136 msg="" architecture=qwen35 file_type=Q4_K_M name="" description="" num_tensors=1307 num_key_values=53 time=2026-02-26T14:26:05.165Z level=DEBUG source=ggml.go:94 msg="ggml backend load all from path" path=/usr/lib/ollama load_backend: loaded CPU backend from /usr/lib/ollama/libggml-cpu-sandybridge.so time=2026-02-26T14:26:05.174Z level=DEBUG source=ggml.go:94 msg="ggml backend load all from path" path=/usr/lib/ollama/cuda_v12 ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no ggml_cuda_init: found 2 CUDA devices: Device 0: NVIDIA GeForce RTX 2060, compute capability 7.5, VMM: yes, ID: GPU-8416b7d6-2f5f-d827-5992-4affda11c96a Device 1: NVIDIA GeForce GTX 1060 6GB, compute capability 6.1, VMM: yes, ID: GPU-a4c65198-ad5c-f3c2-d8a3-2829beb71c35 
load_backend: loaded CUDA backend from /usr/lib/ollama/cuda_v12/libggml-cuda.so time=2026-02-26T14:26:05.320Z level=INFO source=ggml.go:104 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 CUDA.0.ARCHS=500,520,600,610,700,750,800,860,890,900,1200 CUDA.0.USE_GRAPHS=1 CUDA.0.PEER_MAX_BATCH_SIZE=128 CUDA.1.ARCHS=500,520,600,610,700,750,800,860,890,900,1200 CUDA.1.USE_GRAPHS=1 CUDA.1.PEER_MAX_BATCH_SIZE=128 compiler=cgo(gcc) time=2026-02-26T14:26:05.330Z level=DEBUG source=ggml.go:324 msg="key with type not found" key=qwen35.pooling_type default=0 time=2026-02-26T14:26:05.330Z level=DEBUG source=ggml.go:324 msg="key with type not found" key=qwen35.attention.head_count_kv default=0 time=2026-02-26T14:26:05.330Z level=DEBUG source=ggml.go:324 msg="key with type not found" key=qwen35.expert_count default=0 time=2026-02-26T14:26:05.330Z level=DEBUG source=ggml.go:324 msg="key with type not found" key=qwen35.rope.scaling.type default="" time=2026-02-26T14:26:05.330Z level=DEBUG source=ggml.go:324 msg="key with type not found" key=qwen35.rope.type default="" time=2026-02-26T14:26:05.330Z level=DEBUG source=ggml.go:324 msg="key with type not found" key=qwen35.rope.scaling.factor default=1 time=2026-02-26T14:26:05.330Z level=DEBUG source=ggml.go:324 msg="key with type not found" key=qwen35.rope.scaling.original_context_length default=0 time=2026-02-26T14:26:05.330Z level=DEBUG source=ggml.go:324 msg="key with type not found" key=qwen35.attention.scale default=0 time=2026-02-26T14:26:05.330Z level=DEBUG source=ggml.go:324 msg="key with type not found" key=qwen35.expert_count default=0 time=2026-02-26T14:26:05.330Z level=DEBUG source=ggml.go:324 msg="key with type not found" key=qwen35.expert_used_count default=0 time=2026-02-26T14:26:05.330Z level=DEBUG source=ggml.go:324 msg="key with type not found" key=qwen35.norm_top_k_prob default=true time=2026-02-26T14:26:05.330Z level=DEBUG source=ggml.go:324 msg="key with type not found" 
key=qwen35.mrope_interleaved default=false time=2026-02-26T14:26:05.330Z level=DEBUG source=ggml.go:324 msg="key with type not found" key=qwen35.vision.attention.layer_norm_epsilon default=9.999999974752427e-07 time=2026-02-26T14:26:05.330Z level=DEBUG source=ggml.go:324 msg="key with type not found" key=qwen35.vision.rope.freq_base default=10000 time=2026-02-26T14:26:05.330Z level=DEBUG source=ggml.go:324 msg="key with type not found" key=qwen35.vision.num_positional_embeddings default=2304 time=2026-02-26T14:26:05.330Z level=DEBUG source=ggml.go:324 msg="key with type not found" key=tokenizer.ggml.add_bos_token default=false time=2026-02-26T14:26:05.330Z level=DEBUG source=ggml.go:324 msg="key with type not found" key=tokenizer.ggml.bos_token_id default=0 time=2026-02-26T14:26:06.230Z level=DEBUG source=ggml.go:852 msg="compute graph" nodes=1258 splits=1 time=2026-02-26T14:26:06.638Z level=DEBUG source=ggml.go:852 msg="compute graph" nodes=33583 splits=388 time=2026-02-26T14:26:06.674Z level=DEBUG source=ggml.go:852 msg="compute graph" nodes=4919 splits=2 time=2026-02-26T14:26:06.675Z level=DEBUG source=device.go:240 msg="model weights" device=CUDA0 size="15.5 GiB" time=2026-02-26T14:26:06.675Z level=DEBUG source=device.go:245 msg="model weights" device=CPU size="710.2 MiB" time=2026-02-26T14:26:06.675Z level=DEBUG source=device.go:251 msg="kv cache" device=CUDA0 size="3.9 GiB" time=2026-02-26T14:26:06.675Z level=DEBUG source=device.go:262 msg="compute graph" device=CUDA0 size="1.1 GiB" time=2026-02-26T14:26:06.675Z level=DEBUG source=device.go:267 msg="compute graph" device=CPU size="216.0 MiB" time=2026-02-26T14:26:06.675Z level=DEBUG source=device.go:272 msg="total memory" size="21.4 GiB" time=2026-02-26T14:26:06.675Z level=DEBUG source=server.go:782 msg=memory success=true required.InputWeights=744714240 required.CPU.Graph=226492416 required.CUDA0.ID=GPU-8416b7d6-2f5f-d827-5992-4affda11c96a required.CUDA0.Weights="[472507392 238746624 238746624 233760768 
238746624 238746624 238746624 233760768 215768064 215768064 238746624 210782208 215768064 238746624 215768064 210782208 238746624 215768064 215768064 233760768 215768064 215768064 238746624 210782208 215768064 238746624 215768064 210782208 238746624 215768064 215768064 233760768 215768064 215768064 238746624 210782208 215768064 238746624 215768064 210782208 238746624 215768064 215768064 233760768 215768064 215768064 238746624 210782208 215768064 238746624 215768064 210782208 238746624 215768064 215768064 233760768 238746624 238746624 238746624 233760768 238746624 238746624 238746624 233760768 1965245056]" required.CUDA0.Cache="[81715200 81715200 81715200 16777216 81715200 81715200 81715200 16777216 81715200 81715200 81715200 16777216 81715200 81715200 81715200 16777216 81715200 81715200 81715200 16777216 81715200 81715200 81715200 16777216 81715200 81715200 81715200 16777216 81715200 81715200 81715200 16777216 81715200 81715200 81715200 16777216 81715200 81715200 81715200 16777216 81715200 81715200 81715200 16777216 81715200 81715200 81715200 16777216 81715200 81715200 81715200 16777216 81715200 81715200 81715200 16777216 81715200 81715200 81715200 16777216 81715200 81715200 81715200 16777216 0]" required.CUDA0.Graph=1138774016 time=2026-02-26T14:26:06.676Z level=DEBUG source=server.go:976 msg="available gpu" id=GPU-8416b7d6-2f5f-d827-5992-4affda11c96a library=CUDA "available layer vram"="10.0 GiB" backoff=0.00 minimum="457.0 MiB" overhead="0 B" graph="1.1 GiB" time=2026-02-26T14:26:06.676Z level=DEBUG source=server.go:976 msg="available gpu" id=GPU-a4c65198-ad5c-f3c2-d8a3-2829beb71c35 library=CUDA "available layer vram"="5.4 GiB" backoff=0.00 minimum="457.0 MiB" overhead="0 B" graph="0 B" time=2026-02-26T14:26:06.676Z level=DEBUG source=server.go:793 msg="new layout created" layers="56[ID:GPU-8416b7d6-2f5f-d827-5992-4affda11c96a Layers:37(8..44) ID:GPU-a4c65198-ad5c-f3c2-d8a3-2829beb71c35 Layers:19(45..63)]" time=2026-02-26T14:26:06.676Z level=INFO 
source=runner.go:1284 msg=load request="{Operation:fit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:Enabled KvSize:4096 KvCacheType: NumThreads:4 GPULayers:56[ID:GPU-8416b7d6-2f5f-d827-5992-4affda11c96a Layers:37(8..44) ID:GPU-a4c65198-ad5c-f3c2-d8a3-2829beb71c35 Layers:19(45..63)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}" time=2026-02-26T14:26:06.730Z level=DEBUG source=ggml.go:324 msg="key with type not found" key=general.alignment default=32 time=2026-02-26T14:26:06.737Z level=DEBUG source=ggml.go:324 msg="key with type not found" key=qwen35.pooling_type default=0 time=2026-02-26T14:26:06.737Z level=DEBUG source=ggml.go:324 msg="key with type not found" key=qwen35.attention.head_count_kv default=0 time=2026-02-26T14:26:06.737Z level=DEBUG source=ggml.go:324 msg="key with type not found" key=qwen35.expert_count default=0 time=2026-02-26T14:26:06.737Z level=DEBUG source=ggml.go:324 msg="key with type not found" key=qwen35.rope.scaling.type default="" time=2026-02-26T14:26:06.737Z level=DEBUG source=ggml.go:324 msg="key with type not found" key=qwen35.rope.type default="" time=2026-02-26T14:26:06.737Z level=DEBUG source=ggml.go:324 msg="key with type not found" key=qwen35.rope.scaling.factor default=1 time=2026-02-26T14:26:06.737Z level=DEBUG source=ggml.go:324 msg="key with type not found" key=qwen35.rope.scaling.original_context_length default=0 time=2026-02-26T14:26:06.738Z level=DEBUG source=ggml.go:324 msg="key with type not found" key=qwen35.attention.scale default=0 time=2026-02-26T14:26:06.738Z level=DEBUG source=ggml.go:324 msg="key with type not found" key=qwen35.expert_count default=0 time=2026-02-26T14:26:06.738Z level=DEBUG source=ggml.go:324 msg="key with type not found" key=qwen35.expert_used_count default=0 time=2026-02-26T14:26:06.738Z level=DEBUG source=ggml.go:324 msg="key with type not found" key=qwen35.norm_top_k_prob default=true time=2026-02-26T14:26:06.738Z level=DEBUG source=ggml.go:324 msg="key with type not 
found" key=qwen35.mrope_interleaved default=false time=2026-02-26T14:26:06.738Z level=DEBUG source=ggml.go:324 msg="key with type not found" key=qwen35.vision.attention.layer_norm_epsilon default=9.999999974752427e-07 time=2026-02-26T14:26:06.738Z level=DEBUG source=ggml.go:324 msg="key with type not found" key=qwen35.vision.rope.freq_base default=10000 time=2026-02-26T14:26:06.738Z level=DEBUG source=ggml.go:324 msg="key with type not found" key=qwen35.vision.num_positional_embeddings default=2304 time=2026-02-26T14:26:06.738Z level=DEBUG source=ggml.go:324 msg="key with type not found" key=tokenizer.ggml.add_bos_token default=false time=2026-02-26T14:26:06.738Z level=DEBUG source=ggml.go:324 msg="key with type not found" key=tokenizer.ggml.bos_token_id default=0 time=2026-02-26T14:26:07.246Z level=DEBUG source=ggml.go:852 msg="compute graph" nodes=1258 splits=442 time=2026-02-26T14:26:07.439Z level=DEBUG source=ggml.go:852 msg="compute graph" nodes=33583 splits=523 time=2026-02-26T14:26:07.451Z level=DEBUG source=ggml.go:852 msg="compute graph" nodes=4919 splits=4 time=2026-02-26T14:26:07.452Z level=DEBUG source=device.go:240 msg="model weights" device=CUDA0 size="7.7 GiB" time=2026-02-26T14:26:07.452Z level=DEBUG source=device.go:240 msg="model weights" device=CUDA1 size="4.1 GiB" time=2026-02-26T14:26:07.453Z level=DEBUG source=device.go:245 msg="model weights" device=CPU size="4.5 GiB" time=2026-02-26T14:26:07.453Z level=DEBUG source=device.go:251 msg="kv cache" device=CUDA0 size="2.3 GiB" time=2026-02-26T14:26:07.453Z level=DEBUG source=device.go:251 msg="kv cache" device=CUDA1 size="1.1 GiB" time=2026-02-26T14:26:07.453Z level=DEBUG source=device.go:256 msg="kv cache" device=CPU size="499.6 MiB" time=2026-02-26T14:26:07.453Z level=DEBUG source=device.go:262 msg="compute graph" device=CUDA0 size="1.1 GiB" time=2026-02-26T14:26:07.453Z level=DEBUG source=device.go:262 msg="compute graph" device=CUDA1 size="736.6 MiB" time=2026-02-26T14:26:07.453Z level=DEBUG 
source=device.go:267 msg="compute graph" device=CPU size="1.0 GiB" time=2026-02-26T14:26:07.453Z level=DEBUG source=device.go:272 msg="total memory" size="22.9 GiB" time=2026-02-26T14:26:07.453Z level=DEBUG source=server.go:782 msg=memory success=true required.InputWeights=744714240 required.CPU.Weights="[472507264 238746496 238746496 233760768 238746496 238746496 238746496 233760768 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1965243328]" required.CPU.Cache="[81715200 81715200 81715200 16777216 81715200 81715200 81715200 16777216 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]" required.CPU.Graph=1098865536 required.CUDA0.ID=GPU-8416b7d6-2f5f-d827-5992-4affda11c96a required.CUDA0.Weights="[0 0 0 0 0 0 0 0 215768064 215768064 238746624 210782208 215768064 238746624 215768064 210782208 238746624 215768064 215768064 233760768 215768064 215768064 238746624 210782208 215768064 238746624 215768064 210782208 238746624 215768064 215768064 233760768 215768064 215768064 238746624 210782208 215768064 238746624 215768064 210782208 238746624 215768064 215768064 233760768 215768064 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]" required.CUDA0.Cache="[0 0 0 0 0 0 0 0 81715200 81715200 81715200 16777216 81715200 81715200 81715200 16777216 81715200 81715200 81715200 16777216 81715200 81715200 81715200 16777216 81715200 81715200 81715200 16777216 81715200 81715200 81715200 16777216 81715200 81715200 81715200 16777216 81715200 81715200 81715200 16777216 81715200 81715200 81715200 16777216 81715200 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]" required.CUDA0.Graph=1129623552 required.CUDA1.ID=GPU-a4c65198-ad5c-f3c2-d8a3-2829beb71c35 required.CUDA1.Weights="[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 215768064 238746624 210782208 215768064 238746624 215768064 210782208 238746624 215768064 215768064 233760768 
238746624 238746624 238746624 233760768 238746624 238746624 238746624 233760768 0]" required.CUDA1.Cache="[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 81715200 81715200 16777216 81715200 81715200 81715200 16777216 81715200 81715200 81715200 16777216 81715200 81715200 81715200 16777216 81715200 81715200 81715200 16777216 0]" required.CUDA1.Graph=772374528 time=2026-02-26T14:26:07.453Z level=DEBUG source=server.go:976 msg="available gpu" id=GPU-8416b7d6-2f5f-d827-5992-4affda11c96a library=CUDA "available layer vram"="10.1 GiB" backoff=0.00 minimum="457.0 MiB" overhead="0 B" graph="1.1 GiB" time=2026-02-26T14:26:07.453Z level=DEBUG source=server.go:976 msg="available gpu" id=GPU-a4c65198-ad5c-f3c2-d8a3-2829beb71c35 library=CUDA "available layer vram"="4.7 GiB" backoff=0.00 minimum="457.0 MiB" overhead="0 B" graph="736.6 MiB" time=2026-02-26T14:26:07.454Z level=DEBUG source=server.go:793 msg="new layout created" layers="54[ID:GPU-8416b7d6-2f5f-d827-5992-4affda11c96a Layers:37(10..46) ID:GPU-a4c65198-ad5c-f3c2-d8a3-2829beb71c35 Layers:17(47..63)]" time=2026-02-26T14:26:07.454Z level=INFO source=runner.go:1284 msg=load request="{Operation:fit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:Enabled KvSize:4096 KvCacheType: NumThreads:4 GPULayers:54[ID:GPU-8416b7d6-2f5f-d827-5992-4affda11c96a Layers:37(10..46) ID:GPU-a4c65198-ad5c-f3c2-d8a3-2829beb71c35 Layers:17(47..63)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}" time=2026-02-26T14:26:07.555Z level=DEBUG source=ggml.go:324 msg="key with type not found" key=general.alignment default=32 time=2026-02-26T14:26:07.563Z level=DEBUG source=ggml.go:324 msg="key with type not found" key=qwen35.pooling_type default=0 time=2026-02-26T14:26:07.563Z level=DEBUG source=ggml.go:324 msg="key with type not found" key=qwen35.attention.head_count_kv default=0 time=2026-02-26T14:26:07.563Z level=DEBUG source=ggml.go:324 msg="key with type not found" key=qwen35.expert_count 
default=0 time=2026-02-26T14:26:07.563Z level=DEBUG source=ggml.go:324 msg="key with type not found" key=qwen35.rope.scaling.type default="" time=2026-02-26T14:26:07.563Z level=DEBUG source=ggml.go:324 msg="key with type not found" key=qwen35.rope.type default="" time=2026-02-26T14:26:07.564Z level=DEBUG source=ggml.go:324 msg="key with type not found" key=qwen35.rope.scaling.factor default=1 time=2026-02-26T14:26:07.564Z level=DEBUG source=ggml.go:324 msg="key with type not found" key=qwen35.rope.scaling.original_context_length default=0 time=2026-02-26T14:26:07.564Z level=DEBUG source=ggml.go:324 msg="key with type not found" key=qwen35.attention.scale default=0 time=2026-02-26T14:26:07.564Z level=DEBUG source=ggml.go:324 msg="key with type not found" key=qwen35.expert_count default=0 time=2026-02-26T14:26:07.564Z level=DEBUG source=ggml.go:324 msg="key with type not found" key=qwen35.expert_used_count default=0 time=2026-02-26T14:26:07.564Z level=DEBUG source=ggml.go:324 msg="key with type not found" key=qwen35.norm_top_k_prob default=true time=2026-02-26T14:26:07.564Z level=DEBUG source=ggml.go:324 msg="key with type not found" key=qwen35.mrope_interleaved default=false time=2026-02-26T14:26:07.564Z level=DEBUG source=ggml.go:324 msg="key with type not found" key=qwen35.vision.attention.layer_norm_epsilon default=9.999999974752427e-07 time=2026-02-26T14:26:07.564Z level=DEBUG source=ggml.go:324 msg="key with type not found" key=qwen35.vision.rope.freq_base default=10000 time=2026-02-26T14:26:07.564Z level=DEBUG source=ggml.go:324 msg="key with type not found" key=qwen35.vision.num_positional_embeddings default=2304 time=2026-02-26T14:26:07.564Z level=DEBUG source=ggml.go:324 msg="key with type not found" key=tokenizer.ggml.add_bos_token default=false time=2026-02-26T14:26:07.564Z level=DEBUG source=ggml.go:324 msg="key with type not found" key=tokenizer.ggml.bos_token_id default=0 time=2026-02-26T14:26:08.013Z level=DEBUG source=ggml.go:852 msg="compute graph" 
nodes=1258 splits=442 time=2026-02-26T14:26:08.186Z level=DEBUG source=ggml.go:852 msg="compute graph" nodes=33583 splits=559 time=2026-02-26T14:26:08.198Z level=DEBUG source=ggml.go:852 msg="compute graph" nodes=4919 splits=4 time=2026-02-26T14:26:08.199Z level=DEBUG source=device.go:240 msg="model weights" device=CUDA0 size="7.7 GiB" time=2026-02-26T14:26:08.199Z level=DEBUG source=device.go:240 msg="model weights" device=CUDA1 size="3.6 GiB" time=2026-02-26T14:26:08.199Z level=DEBUG source=device.go:245 msg="model weights" device=CPU size="4.9 GiB" time=2026-02-26T14:26:08.199Z level=DEBUG source=device.go:251 msg="kv cache" device=CUDA0 size="2.3 GiB" time=2026-02-26T14:26:08.199Z level=DEBUG source=device.go:251 msg="kv cache" device=CUDA1 size="1015.2 MiB" time=2026-02-26T14:26:08.199Z level=DEBUG source=device.go:256 msg="kv cache" device=CPU size="655.4 MiB" time=2026-02-26T14:26:08.199Z level=DEBUG source=device.go:262 msg="compute graph" device=CUDA0 size="1.1 GiB" time=2026-02-26T14:26:08.199Z level=DEBUG source=device.go:262 msg="compute graph" device=CUDA1 size="704.5 MiB" time=2026-02-26T14:26:08.199Z level=DEBUG source=device.go:267 msg="compute graph" device=CPU size="1.0 GiB" time=2026-02-26T14:26:08.199Z level=DEBUG source=device.go:272 msg="total memory" size="22.9 GiB" time=2026-02-26T14:26:08.199Z level=DEBUG source=server.go:782 msg=memory success=true required.InputWeights=744714240 required.CPU.Weights="[472507264 238746496 238746496 233760768 238746496 238746496 238746496 233760768 215767936 215767936 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1965243328]" required.CPU.Cache="[81715200 81715200 81715200 16777216 81715200 81715200 81715200 16777216 81715200 81715200 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]" required.CPU.Graph=1098865536 required.CUDA0.ID=GPU-8416b7d6-2f5f-d827-5992-4affda11c96a 
required.CUDA0.Weights="[0 0 0 0 0 0 0 0 0 0 238746624 210782208 215768064 238746624 215768064 210782208 238746624 215768064 215768064 233760768 215768064 215768064 238746624 210782208 215768064 238746624 215768064 210782208 238746624 215768064 215768064 233760768 215768064 215768064 238746624 210782208 215768064 238746624 215768064 210782208 238746624 215768064 215768064 233760768 215768064 215768064 238746624 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]" required.CUDA0.Cache="[0 0 0 0 0 0 0 0 0 0 81715200 16777216 81715200 81715200 81715200 16777216 81715200 81715200 81715200 16777216 81715200 81715200 81715200 16777216 81715200 81715200 81715200 16777216 81715200 81715200 81715200 16777216 81715200 81715200 81715200 16777216 81715200 81715200 81715200 16777216 81715200 81715200 81715200 16777216 81715200 81715200 81715200 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]" required.CUDA0.Graph=1129623552 required.CUDA1.ID=GPU-a4c65198-ad5c-f3c2-d8a3-2829beb71c35 required.CUDA1.Weights="[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 210782208 215768064 238746624 215768064 210782208 238746624 215768064 215768064 233760768 238746624 238746624 238746624 233760768 238746624 238746624 238746624 233760768 0]" required.CUDA1.Cache="[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 16777216 81715200 81715200 81715200 16777216 81715200 81715200 81715200 16777216 81715200 81715200 81715200 16777216 81715200 81715200 81715200 16777216 0]" required.CUDA1.Graph=738688768 time=2026-02-26T14:26:08.199Z level=DEBUG source=server.go:976 msg="available gpu" id=GPU-8416b7d6-2f5f-d827-5992-4affda11c96a library=CUDA "available layer vram"="10.1 GiB" backoff=0.00 minimum="457.0 MiB" overhead="0 B" graph="1.1 GiB" time=2026-02-26T14:26:08.199Z level=DEBUG source=server.go:976 msg="available gpu" id=GPU-a4c65198-ad5c-f3c2-d8a3-2829beb71c35 library=CUDA "available layer vram"="4.7 GiB" backoff=0.00 minimum="457.0 MiB" 
overhead="0 B" graph="704.5 MiB" time=2026-02-26T14:26:08.200Z level=DEBUG source=server.go:793 msg="new layout created" layers="54[ID:GPU-8416b7d6-2f5f-d827-5992-4affda11c96a Layers:37(10..46) ID:GPU-a4c65198-ad5c-f3c2-d8a3-2829beb71c35 Layers:17(47..63)]" time=2026-02-26T14:26:08.200Z level=INFO source=runner.go:1284 msg=load request="{Operation:alloc LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:Enabled KvSize:4096 KvCacheType: NumThreads:4 GPULayers:54[ID:GPU-8416b7d6-2f5f-d827-5992-4affda11c96a Layers:37(10..46) ID:GPU-a4c65198-ad5c-f3c2-d8a3-2829beb71c35 Layers:17(47..63)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}" time=2026-02-26T14:26:08.300Z level=DEBUG source=ggml.go:324 msg="key with type not found" key=general.alignment default=32 time=2026-02-26T14:26:08.310Z level=DEBUG source=ggml.go:324 msg="key with type not found" key=qwen35.pooling_type default=0 time=2026-02-26T14:26:08.310Z level=DEBUG source=ggml.go:324 msg="key with type not found" key=qwen35.attention.head_count_kv default=0 time=2026-02-26T14:26:08.310Z level=DEBUG source=ggml.go:324 msg="key with type not found" key=qwen35.expert_count default=0 time=2026-02-26T14:26:08.310Z level=DEBUG source=ggml.go:324 msg="key with type not found" key=qwen35.rope.scaling.type default="" time=2026-02-26T14:26:08.310Z level=DEBUG source=ggml.go:324 msg="key with type not found" key=qwen35.rope.type default="" time=2026-02-26T14:26:08.310Z level=DEBUG source=ggml.go:324 msg="key with type not found" key=qwen35.rope.scaling.factor default=1 time=2026-02-26T14:26:08.310Z level=DEBUG source=ggml.go:324 msg="key with type not found" key=qwen35.rope.scaling.original_context_length default=0 time=2026-02-26T14:26:08.310Z level=DEBUG source=ggml.go:324 msg="key with type not found" key=qwen35.attention.scale default=0 time=2026-02-26T14:26:08.310Z level=DEBUG source=ggml.go:324 msg="key with type not found" key=qwen35.expert_count default=0 time=2026-02-26T14:26:08.310Z level=DEBUG 
source=ggml.go:324 msg="key with type not found" key=qwen35.expert_used_count default=0 time=2026-02-26T14:26:08.310Z level=DEBUG source=ggml.go:324 msg="key with type not found" key=qwen35.norm_top_k_prob default=true time=2026-02-26T14:26:08.310Z level=DEBUG source=ggml.go:324 msg="key with type not found" key=qwen35.mrope_interleaved default=false time=2026-02-26T14:26:08.310Z level=DEBUG source=ggml.go:324 msg="key with type not found" key=qwen35.vision.attention.layer_norm_epsilon default=9.999999974752427e-07 time=2026-02-26T14:26:08.310Z level=DEBUG source=ggml.go:324 msg="key with type not found" key=qwen35.vision.rope.freq_base default=10000 time=2026-02-26T14:26:08.310Z level=DEBUG source=ggml.go:324 msg="key with type not found" key=qwen35.vision.num_positional_embeddings default=2304 time=2026-02-26T14:26:08.310Z level=DEBUG source=ggml.go:324 msg="key with type not found" key=tokenizer.ggml.add_bos_token default=false time=2026-02-26T14:26:08.310Z level=DEBUG source=ggml.go:324 msg="key with type not found" key=tokenizer.ggml.bos_token_id default=0 time=2026-02-26T14:26:08.792Z level=DEBUG source=ggml.go:852 msg="compute graph" nodes=1258 splits=442 time=2026-02-26T14:26:09.292Z level=DEBUG source=ggml.go:852 msg="compute graph" nodes=33583 splits=559 time=2026-02-26T14:26:09.413Z level=DEBUG source=device.go:240 msg="model weights" device=CUDA0 size="7.7 GiB" time=2026-02-26T14:26:09.413Z level=DEBUG source=device.go:240 msg="model weights" device=CUDA1 size="3.6 GiB" time=2026-02-26T14:26:09.413Z level=DEBUG source=device.go:245 msg="model weights" device=CPU size="4.9 GiB" time=2026-02-26T14:26:09.413Z level=DEBUG source=device.go:251 msg="kv cache" device=CUDA0 size="2.3 GiB" time=2026-02-26T14:26:09.413Z level=DEBUG source=device.go:251 msg="kv cache" device=CUDA1 size="1015.2 MiB" time=2026-02-26T14:26:09.413Z level=DEBUG source=device.go:256 msg="kv cache" device=CPU size="655.4 MiB" time=2026-02-26T14:26:09.413Z level=DEBUG source=device.go:262 
msg="compute graph" device=CUDA0 size="1.1 GiB" time=2026-02-26T14:26:09.413Z level=DEBUG source=device.go:262 msg="compute graph" device=CUDA1 size="608.5 MiB" time=2026-02-26T14:26:09.413Z level=DEBUG source=device.go:267 msg="compute graph" device=CPU size="1.0 GiB" time=2026-02-26T14:26:09.413Z level=DEBUG source=device.go:272 msg="total memory" size="22.8 GiB" time=2026-02-26T14:26:09.413Z level=DEBUG source=server.go:782 msg=memory success=false required.InputWeights=744714240 required.CPU.Weights="[472507264 238746496 238746496 233760768 238746496 238746496 238746496 233760768 215767936 215767936 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1965243328]" required.CPU.Cache="[81715200 81715200 81715200 16777216 81715200 81715200 81715200 16777216 81715200 81715200 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]" required.CPU.Graph=1098865536 required.CUDA0.ID=GPU-8416b7d6-2f5f-d827-5992-4affda11c96a required.CUDA0.Weights="[0 0 0 0 0 0 0 0 0 0 238746624 210782208 215768064 238746624 215768064 210782208 238746624 215768064 215768064 233760768 215768064 215768064 238746624 210782208 215768064 238746624 215768064 210782208 238746624 215768064 215768064 233760768 215768064 215768064 238746624 210782208 215768064 238746624 215768064 210782208 238746624 215768064 215768064 233760768 215768064 215768064 238746624 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]" required.CUDA0.Cache="[0 0 0 0 0 0 0 0 0 0 81715200 16777216 81715200 81715200 81715200 16777216 81715200 81715200 81715200 16777216 81715200 81715200 81715200 16777216 81715200 81715200 81715200 16777216 81715200 81715200 81715200 16777216 81715200 81715200 81715200 16777216 81715200 81715200 81715200 16777216 81715200 81715200 81715200 16777216 81715200 81715200 81715200 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]" required.CUDA0.Graph=1129623552 
required.CUDA1.ID=GPU-a4c65198-ad5c-f3c2-d8a3-2829beb71c35 required.CUDA1.Weights="[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 210782208 215768064 238746624 215768064 210782208 238746624 215768064 215768064 233760768 238746624 238746624 238746624 233760768 238746624 238746624 238746624 233760768 0]" required.CUDA1.Cache="[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 16777216 81715200 81715200 81715200 16777216 81715200 81715200 81715200 16777216 81715200 81715200 81715200 16777216 81715200 81715200 81715200 16777216 0]" required.CUDA1.Graph=638025472 time=2026-02-26T14:26:09.414Z level=DEBUG source=server.go:976 msg="available gpu" id=GPU-8416b7d6-2f5f-d827-5992-4affda11c96a library=CUDA "available layer vram"="10.1 GiB" backoff=0.00 minimum="457.0 MiB" overhead="0 B" graph="1.1 GiB" time=2026-02-26T14:26:09.414Z level=DEBUG source=server.go:976 msg="available gpu" id=GPU-a4c65198-ad5c-f3c2-d8a3-2829beb71c35 library=CUDA "available layer vram"="4.8 GiB" backoff=0.00 minimum="457.0 MiB" overhead="0 B" graph="608.5 MiB" time=2026-02-26T14:26:09.414Z level=DEBUG source=server.go:793 msg="new layout created" layers="54[ID:GPU-8416b7d6-2f5f-d827-5992-4affda11c96a Layers:37(10..46) ID:GPU-a4c65198-ad5c-f3c2-d8a3-2829beb71c35 Layers:17(47..63)]" time=2026-02-26T14:26:09.414Z level=INFO source=server.go:879 msg="model layout did not fit, applying backoff" backoff=0.10 time=2026-02-26T14:26:09.414Z level=DEBUG source=server.go:976 msg="available gpu" id=GPU-8416b7d6-2f5f-d827-5992-4affda11c96a library=CUDA "available layer vram"="8.9 GiB" backoff=0.10 minimum="457.0 MiB" overhead="0 B" graph="1.1 GiB" time=2026-02-26T14:26:09.414Z level=DEBUG source=server.go:976 msg="available gpu" id=GPU-a4c65198-ad5c-f3c2-d8a3-2829beb71c35 library=CUDA "available layer vram"="4.2 GiB" backoff=0.10 minimum="457.0 MiB" overhead="0 B" graph="608.5 MiB" time=2026-02-26T14:26:09.414Z 
level=DEBUG source=server.go:793 msg="new layout created" layers="48[ID:GPU-8416b7d6-2f5f-d827-5992-4affda11c96a Layers:33(16..48) ID:GPU-a4c65198-ad5c-f3c2-d8a3-2829beb71c35 Layers:15(49..63)]" time=2026-02-26T14:26:09.415Z level=INFO source=runner.go:1284 msg=load request="{Operation:alloc LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:Enabled KvSize:4096 KvCacheType: NumThreads:4 GPULayers:48[ID:GPU-8416b7d6-2f5f-d827-5992-4affda11c96a Layers:33(16..48) ID:GPU-a4c65198-ad5c-f3c2-d8a3-2829beb71c35 Layers:15(49..63)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}" time=2026-02-26T14:26:09.516Z level=DEBUG source=ggml.go:324 msg="key with type not found" key=general.alignment default=32 time=2026-02-26T14:26:09.588Z level=DEBUG source=ggml.go:324 msg="key with type not found" key=qwen35.pooling_type default=0 time=2026-02-26T14:26:09.588Z level=DEBUG source=ggml.go:324 msg="key with type not found" key=qwen35.attention.head_count_kv default=0 time=2026-02-26T14:26:09.589Z level=DEBUG source=ggml.go:324 msg="key with type not found" key=qwen35.expert_count default=0 time=2026-02-26T14:26:09.589Z level=DEBUG source=ggml.go:324 msg="key with type not found" key=qwen35.rope.scaling.type default="" time=2026-02-26T14:26:09.589Z level=DEBUG source=ggml.go:324 msg="key with type not found" key=qwen35.rope.type default="" time=2026-02-26T14:26:09.589Z level=DEBUG source=ggml.go:324 msg="key with type not found" key=qwen35.rope.scaling.factor default=1 time=2026-02-26T14:26:09.589Z level=DEBUG source=ggml.go:324 msg="key with type not found" key=qwen35.rope.scaling.original_context_length default=0 time=2026-02-26T14:26:09.589Z level=DEBUG source=ggml.go:324 msg="key with type not found" key=qwen35.attention.scale default=0 time=2026-02-26T14:26:09.589Z level=DEBUG source=ggml.go:324 msg="key with type not found" key=qwen35.expert_count default=0 time=2026-02-26T14:26:09.589Z level=DEBUG source=ggml.go:324 msg="key with type not found" 
key=qwen35.expert_used_count default=0 time=2026-02-26T14:26:09.589Z level=DEBUG source=ggml.go:324 msg="key with type not found" key=qwen35.norm_top_k_prob default=true time=2026-02-26T14:26:09.589Z level=DEBUG source=ggml.go:324 msg="key with type not found" key=qwen35.mrope_interleaved default=false time=2026-02-26T14:26:09.589Z level=DEBUG source=ggml.go:324 msg="key with type not found" key=qwen35.vision.attention.layer_norm_epsilon default=9.999999974752427e-07 time=2026-02-26T14:26:09.589Z level=DEBUG source=ggml.go:324 msg="key with type not found" key=qwen35.vision.rope.freq_base default=10000 time=2026-02-26T14:26:09.589Z level=DEBUG source=ggml.go:324 msg="key with type not found" key=qwen35.vision.num_positional_embeddings default=2304 time=2026-02-26T14:26:09.589Z level=DEBUG source=ggml.go:324 msg="key with type not found" key=tokenizer.ggml.add_bos_token default=false time=2026-02-26T14:26:09.589Z level=DEBUG source=ggml.go:324 msg="key with type not found" key=tokenizer.ggml.bos_token_id default=0 time=2026-02-26T14:26:10.111Z level=DEBUG source=ggml.go:852 msg="compute graph" nodes=1258 splits=442 time=2026-02-26T14:26:10.786Z level=DEBUG source=ggml.go:852 msg="compute graph" nodes=33583 splits=657 time=2026-02-26T14:26:10.800Z level=DEBUG source=ggml.go:852 msg="compute graph" nodes=4919 splits=4 time=2026-02-26T14:26:10.800Z level=DEBUG source=device.go:240 msg="model weights" device=CUDA0 size="6.8 GiB" time=2026-02-26T14:26:10.800Z level=DEBUG source=device.go:240 msg="model weights" device=CUDA1 size="3.2 GiB" time=2026-02-26T14:26:10.800Z level=DEBUG source=device.go:245 msg="model weights" device=CPU size="6.2 GiB" time=2026-02-26T14:26:10.801Z level=DEBUG source=device.go:251 msg="kv cache" device=CUDA0 size="2.0 GiB" time=2026-02-26T14:26:10.801Z level=DEBUG source=device.go:251 msg="kv cache" device=CUDA1 size="921.2 MiB" time=2026-02-26T14:26:10.801Z level=DEBUG source=device.go:256 msg="kv cache" device=CPU size="999.2 MiB" 
time=2026-02-26T14:26:10.801Z level=DEBUG source=device.go:262 msg="compute graph" device=CUDA0 size="1.1 GiB" time=2026-02-26T14:26:10.801Z level=DEBUG source=device.go:262 msg="compute graph" device=CUDA1 size="736.6 MiB" time=2026-02-26T14:26:10.801Z level=DEBUG source=device.go:267 msg="compute graph" device=CPU size="1.0 GiB" time=2026-02-26T14:26:10.801Z level=DEBUG source=device.go:272 msg="total memory" size="22.9 GiB" time=2026-02-26T14:26:10.801Z level=DEBUG source=server.go:782 msg=memory success=true required.InputWeights=744714240 required.CPU.Weights="[472507264 238746496 238746496 233760768 238746496 238746496 238746496 233760768 215767936 215767936 238746496 210782208 215767936 238746496 215767936 210782208 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1965243328]" required.CPU.Cache="[81715200 81715200 81715200 16777216 81715200 81715200 81715200 16777216 81715200 81715200 81715200 16777216 81715200 81715200 81715200 16777216 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]" required.CPU.Graph=1098865536 required.CUDA0.ID=GPU-8416b7d6-2f5f-d827-5992-4affda11c96a required.CUDA0.Weights="[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 238746624 215768064 215768064 233760768 215768064 215768064 238746624 210782208 215768064 238746624 215768064 210782208 238746624 215768064 215768064 233760768 215768064 215768064 238746624 210782208 215768064 238746624 215768064 210782208 238746624 215768064 215768064 233760768 215768064 215768064 238746624 210782208 215768064 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]" required.CUDA0.Cache="[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 81715200 81715200 81715200 16777216 81715200 81715200 81715200 16777216 81715200 81715200 81715200 16777216 81715200 81715200 81715200 16777216 81715200 81715200 81715200 16777216 81715200 81715200 81715200 16777216 81715200 81715200 81715200 16777216 81715200 81715200 81715200 16777216 81715200 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
0]" required.CUDA0.Graph=1129623552 required.CUDA1.ID=GPU-a4c65198-ad5c-f3c2-d8a3-2829beb71c35 required.CUDA1.Weights="[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 238746624 215768064 210782208 238746624 215768064 215768064 233760768 238746624 238746624 238746624 233760768 238746624 238746624 238746624 233760768 0]" required.CUDA1.Cache="[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 81715200 81715200 16777216 81715200 81715200 81715200 16777216 81715200 81715200 81715200 16777216 81715200 81715200 81715200 16777216 0]" required.CUDA1.Graph=772374528 time=2026-02-26T14:26:10.801Z level=DEBUG source=server.go:976 msg="available gpu" id=GPU-8416b7d6-2f5f-d827-5992-4affda11c96a library=CUDA "available layer vram"="8.9 GiB" backoff=0.10 minimum="457.0 MiB" overhead="0 B" graph="1.1 GiB" time=2026-02-26T14:26:10.801Z level=DEBUG source=server.go:976 msg="available gpu" id=GPU-a4c65198-ad5c-f3c2-d8a3-2829beb71c35 library=CUDA "available layer vram"="4.1 GiB" backoff=0.10 minimum="457.0 MiB" overhead="0 B" graph="736.6 MiB" time=2026-02-26T14:26:10.801Z level=DEBUG source=server.go:793 msg="new layout created" layers="47[ID:GPU-8416b7d6-2f5f-d827-5992-4affda11c96a Layers:33(17..49) ID:GPU-a4c65198-ad5c-f3c2-d8a3-2829beb71c35 Layers:14(50..63)]" time=2026-02-26T14:26:10.802Z level=INFO source=runner.go:1284 msg=load request="{Operation:alloc LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:Enabled KvSize:4096 KvCacheType: NumThreads:4 GPULayers:47[ID:GPU-8416b7d6-2f5f-d827-5992-4affda11c96a Layers:33(17..49) ID:GPU-a4c65198-ad5c-f3c2-d8a3-2829beb71c35 Layers:14(50..63)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}" time=2026-02-26T14:26:11.011Z level=DEBUG source=ggml.go:324 msg="key with type not found" key=general.alignment default=32 time=2026-02-26T14:26:11.078Z level=DEBUG source=ggml.go:324 msg="key with type not found" key=qwen35.pooling_type 
default=0 time=2026-02-26T14:26:11.078Z level=DEBUG source=ggml.go:324 msg="key with type not found" key=qwen35.attention.head_count_kv default=0 time=2026-02-26T14:26:11.078Z level=DEBUG source=ggml.go:324 msg="key with type not found" key=qwen35.expert_count default=0 time=2026-02-26T14:26:11.078Z level=DEBUG source=ggml.go:324 msg="key with type not found" key=qwen35.rope.scaling.type default="" time=2026-02-26T14:26:11.078Z level=DEBUG source=ggml.go:324 msg="key with type not found" key=qwen35.rope.type default="" time=2026-02-26T14:26:11.078Z level=DEBUG source=ggml.go:324 msg="key with type not found" key=qwen35.rope.scaling.factor default=1 time=2026-02-26T14:26:11.078Z level=DEBUG source=ggml.go:324 msg="key with type not found" key=qwen35.rope.scaling.original_context_length default=0 time=2026-02-26T14:26:11.078Z level=DEBUG source=ggml.go:324 msg="key with type not found" key=qwen35.attention.scale default=0 time=2026-02-26T14:26:11.078Z level=DEBUG source=ggml.go:324 msg="key with type not found" key=qwen35.expert_count default=0 time=2026-02-26T14:26:11.078Z level=DEBUG source=ggml.go:324 msg="key with type not found" key=qwen35.expert_used_count default=0 time=2026-02-26T14:26:11.078Z level=DEBUG source=ggml.go:324 msg="key with type not found" key=qwen35.norm_top_k_prob default=true time=2026-02-26T14:26:11.078Z level=DEBUG source=ggml.go:324 msg="key with type not found" key=qwen35.mrope_interleaved default=false time=2026-02-26T14:26:11.078Z level=DEBUG source=ggml.go:324 msg="key with type not found" key=qwen35.vision.attention.layer_norm_epsilon default=9.999999974752427e-07 time=2026-02-26T14:26:11.078Z level=DEBUG source=ggml.go:324 msg="key with type not found" key=qwen35.vision.rope.freq_base default=10000 time=2026-02-26T14:26:11.078Z level=DEBUG source=ggml.go:324 msg="key with type not found" key=qwen35.vision.num_positional_embeddings default=2304 time=2026-02-26T14:26:11.078Z level=DEBUG source=ggml.go:324 msg="key with type not found" 
key=tokenizer.ggml.add_bos_token default=false time=2026-02-26T14:26:11.078Z level=DEBUG source=ggml.go:324 msg="key with type not found" key=tokenizer.ggml.bos_token_id default=0 time=2026-02-26T14:26:11.629Z level=DEBUG source=ggml.go:852 msg="compute graph" nodes=1258 splits=442 time=2026-02-26T14:26:12.320Z level=DEBUG source=ggml.go:852 msg="compute graph" nodes=33583 splits=675 time=2026-02-26T14:26:12.335Z level=DEBUG source=ggml.go:852 msg="compute graph" nodes=4919 splits=4 time=2026-02-26T14:26:12.335Z level=DEBUG source=device.go:240 msg="model weights" device=CUDA0 size="6.8 GiB" time=2026-02-26T14:26:12.335Z level=DEBUG source=device.go:240 msg="model weights" device=CUDA1 size="3.0 GiB" time=2026-02-26T14:26:12.335Z level=DEBUG source=device.go:245 msg="model weights" device=CPU size="6.4 GiB" time=2026-02-26T14:26:12.335Z level=DEBUG source=device.go:251 msg="kv cache" device=CUDA0 size="2.0 GiB" time=2026-02-26T14:26:12.335Z level=DEBUG source=device.go:251 msg="kv cache" device=CUDA1 size="843.3 MiB" time=2026-02-26T14:26:12.335Z level=DEBUG source=device.go:256 msg="kv cache" device=CPU size="1.1 GiB" time=2026-02-26T14:26:12.335Z level=DEBUG source=device.go:262 msg="compute graph" device=CUDA0 size="1.1 GiB" time=2026-02-26T14:26:12.335Z level=DEBUG source=device.go:262 msg="compute graph" device=CUDA1 size="815.6 MiB" time=2026-02-26T14:26:12.335Z level=DEBUG source=device.go:267 msg="compute graph" device=CPU size="1.0 GiB" time=2026-02-26T14:26:12.335Z level=DEBUG source=device.go:272 msg="total memory" size="23.0 GiB" time=2026-02-26T14:26:12.335Z level=DEBUG source=server.go:782 msg=memory success=true required.InputWeights=744714240 required.CPU.Weights="[472507264 238746496 238746496 233760768 238746496 238746496 238746496 233760768 215767936 215767936 238746496 210782208 215767936 238746496 215767936 210782208 238746496 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1965243328]" 
required.CPU.Cache="[81715200 81715200 81715200 16777216 81715200 81715200 81715200 16777216 81715200 81715200 81715200 16777216 81715200 81715200 81715200 16777216 81715200 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]" required.CPU.Graph=1098865536 required.CUDA0.ID=GPU-8416b7d6-2f5f-d827-5992-4affda11c96a required.CUDA0.Weights="[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 215768064 215768064 233760768 215768064 215768064 238746624 210782208 215768064 238746624 215768064 210782208 238746624 215768064 215768064 233760768 215768064 215768064 238746624 210782208 215768064 238746624 215768064 210782208 238746624 215768064 215768064 233760768 215768064 215768064 238746624 210782208 215768064 238746624 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]" required.CUDA0.Cache="[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 81715200 81715200 16777216 81715200 81715200 81715200 16777216 81715200 81715200 81715200 16777216 81715200 81715200 81715200 16777216 81715200 81715200 81715200 16777216 81715200 81715200 81715200 16777216 81715200 81715200 81715200 16777216 81715200 81715200 81715200 16777216 81715200 81715200 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]" required.CUDA0.Graph=1129623552 required.CUDA1.ID=GPU-a4c65198-ad5c-f3c2-d8a3-2829beb71c35 required.CUDA1.Weights="[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 215768064 210782208 238746624 215768064 215768064 233760768 238746624 238746624 238746624 233760768 238746624 238746624 238746624 233760768 0]" required.CUDA1.Cache="[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 81715200 16777216 81715200 81715200 81715200 16777216 81715200 81715200 81715200 16777216 81715200 81715200 81715200 16777216 0]" required.CUDA1.Graph=855166720 time=2026-02-26T14:26:12.336Z level=DEBUG source=server.go:976 msg="available gpu" id=GPU-8416b7d6-2f5f-d827-5992-4affda11c96a library=CUDA "available layer vram"="8.9 GiB" backoff=0.10 
minimum="457.0 MiB" overhead="0 B" graph="1.1 GiB" time=2026-02-26T14:26:12.336Z level=DEBUG source=server.go:976 msg="available gpu" id=GPU-a4c65198-ad5c-f3c2-d8a3-2829beb71c35 library=CUDA "available layer vram"="4.0 GiB" backoff=0.10 minimum="457.0 MiB" overhead="0 B" graph="815.6 MiB" time=2026-02-26T14:26:12.336Z level=DEBUG source=server.go:793 msg="new layout created" layers="47[ID:GPU-8416b7d6-2f5f-d827-5992-4affda11c96a Layers:33(17..49) ID:GPU-a4c65198-ad5c-f3c2-d8a3-2829beb71c35 Layers:14(50..63)]" time=2026-02-26T14:26:12.336Z level=INFO source=runner.go:1284 msg=load request="{Operation:commit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:Enabled KvSize:4096 KvCacheType: NumThreads:4 GPULayers:47[ID:GPU-8416b7d6-2f5f-d827-5992-4affda11c96a Layers:33(17..49) ID:GPU-a4c65198-ad5c-f3c2-d8a3-2829beb71c35 Layers:14(50..63)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}" time=2026-02-26T14:26:12.336Z level=INFO source=ggml.go:482 msg="offloading 47 repeating layers to GPU" time=2026-02-26T14:26:12.336Z level=INFO source=ggml.go:486 msg="offloading output layer to CPU" time=2026-02-26T14:26:12.336Z level=INFO source=ggml.go:494 msg="offloaded 47/65 layers to GPU" time=2026-02-26T14:26:12.336Z level=INFO source=device.go:240 msg="model weights" device=CUDA0 size="6.8 GiB" time=2026-02-26T14:26:12.336Z level=INFO source=device.go:240 msg="model weights" device=CUDA1 size="3.0 GiB" time=2026-02-26T14:26:12.336Z level=INFO source=device.go:245 msg="model weights" device=CPU size="6.4 GiB" time=2026-02-26T14:26:12.336Z level=INFO source=device.go:251 msg="kv cache" device=CUDA0 size="2.0 GiB" time=2026-02-26T14:26:12.336Z level=INFO source=device.go:251 msg="kv cache" device=CUDA1 size="843.3 MiB" time=2026-02-26T14:26:12.337Z level=INFO source=device.go:256 msg="kv cache" device=CPU size="1.1 GiB" time=2026-02-26T14:26:12.337Z level=INFO source=device.go:262 msg="compute graph" device=CUDA0 size="1.1 GiB" time=2026-02-26T14:26:12.337Z 
level=INFO source=device.go:262 msg="compute graph" device=CUDA1 size="815.6 MiB" time=2026-02-26T14:26:12.337Z level=INFO source=device.go:267 msg="compute graph" device=CPU size="1.0 GiB" time=2026-02-26T14:26:12.337Z level=INFO source=device.go:272 msg="total memory" size="23.0 GiB" time=2026-02-26T14:26:12.337Z level=INFO source=sched.go:566 msg="loaded runners" count=1 time=2026-02-26T14:26:12.337Z level=INFO source=server.go:1350 msg="waiting for llama runner to start responding" time=2026-02-26T14:26:12.355Z level=INFO source=server.go:1384 msg="waiting for server to become available" status="llm server loading model" time=2026-02-26T14:26:12.355Z level=DEBUG source=server.go:1394 msg="model load progress 0.00" [...] time=2026-02-26T14:26:42.220Z level=DEBUG source=server.go:1394 msg="model load progress 0.99" time=2026-02-26T14:26:42.441Z level=DEBUG source=ggml.go:324 msg="key with type not found" key=qwen35.pooling_type default=0 time=2026-02-26T14:26:42.471Z level=INFO source=server.go:1388 msg="llama runner started in 37.41 seconds" time=2026-02-26T14:26:42.471Z level=DEBUG source=sched.go:578 msg="finished setting up" runner.name=registry.ollama.ai/library/qwen3.5:27b-q4_K_M runner.inference="[{ID:GPU-8416b7d6-2f5f-d827-5992-4affda11c96a Library:CUDA} {ID:GPU-a4c65198-ad5c-f3c2-d8a3-2829beb71c35 Library:CUDA}]" runner.size="23.0 GiB" runner.vram="14.5 GiB" runner.parallel=1 runner.pid=167 runner.model=/root/.ollama/models/blobs/sha256-7935de6e08f9444536d0edcacf19d2166b34bef8ddb4ac7ce9263ff5cad0693b runner.num_ctx=4096 [GIN] 2026/02/26 - 14:26:42 | 200 | 38.612650389s | 172.18.0.1 | POST  "/api/generate" time=2026-02-26T14:26:42.472Z level=DEBUG source=sched.go:586 msg="context for request finished" time=2026-02-26T14:26:42.472Z level=DEBUG source=sched.go:338 msg="runner with non-zero duration has gone idle, adding timer" runner.name=registry.ollama.ai/library/qwen3.5:27b-q4_K_M runner.inference="[{ID:GPU-8416b7d6-2f5f-d827-5992-4affda11c96a 
Library:CUDA} {ID:GPU-a4c65198-ad5c-f3c2-d8a3-2829beb71c35 Library:CUDA}]" runner.size="23.0 GiB" runner.vram="14.5 GiB" runner.parallel=1 runner.pid=167 runner.model=/root/.ollama/models/blobs/sha256-7935de6e08f9444536d0edcacf19d2166b34bef8ddb4ac7ce9263ff5cad0693b runner.num_ctx=4096 duration=1h0m0s
time=2026-02-26T14:26:42.472Z level=DEBUG source=sched.go:356 msg="after processing request finished event" runner.name=registry.ollama.ai/library/qwen3.5:27b-q4_K_M runner.inference="[{ID:GPU-8416b7d6-2f5f-d827-5992-4affda11c96a Library:CUDA} {ID:GPU-a4c65198-ad5c-f3c2-d8a3-2829beb71c35 Library:CUDA}]" runner.size="23.0 GiB" runner.vram="14.5 GiB" runner.parallel=1 runner.pid=167 runner.model=/root/.ollama/models/blobs/sha256-7935de6e08f9444536d0edcacf19d2166b34bef8ddb4ac7ce9263ff5cad0693b runner.num_ctx=4096 refCount=0
time=2026-02-26T14:26:42.914Z level=DEBUG source=sched.go:734 msg="evaluating already loaded" model=/root/.ollama/models/blobs/sha256-7935de6e08f9444536d0edcacf19d2166b34bef8ddb4ac7ce9263ff5cad0693b
time=2026-02-26T14:26:43.003Z level=DEBUG source=server.go:1536 msg="completion request" images=0 prompt=15825 format=""
time=2026-02-26T14:26:43.096Z level=WARN source=runner.go:187 msg="truncating input prompt" limit=4096 prompt=6317 keep=4 new=4096
time=2026-02-26T14:26:43.096Z level=DEBUG source=cache.go:151 msg="loading cache slot" id=0 cache=0 prompt=4096 used=0 remaining=4096
CUDA error: invalid argument
  current device: 0, in function ggml_cuda_cpy at //ml/backend/ggml/ggml/src/ggml-cuda/cpy.cu:438
  cudaMemcpyAsyncReserve(src1_ddc, src0_ddc, ggml_nbytes(src0), cudaMemcpyDeviceToDevice, main_stream)
//ml/backend/ggml/ggml/src/ggml-cuda/ggml-cuda.cu:94: CUDA error
/usr/lib/ollama/libggml-base.so.0(+0x1bae8)[0x719e783fcae8]
/usr/lib/ollama/libggml-base.so.0(ggml_print_backtrace+0x1e6)[0x719e783fceb6]
/usr/lib/ollama/libggml-base.so.0(ggml_abort+0x11d)[0x719e783fd03d]
/usr/lib/ollama/cuda_v12/libggml-cuda.so(+0x143272)[0x719deb69d272]
/usr/lib/ollama/cuda_v12/libggml-cuda.so(_Z13ggml_cuda_cpyR25ggml_backend_cuda_contextPK11ggml_tensorPS1_+0x1e50)[0x719deb65b1d0]
/usr/lib/ollama/cuda_v12/libggml-cuda.so(+0x153cae)[0x719deb6adcae]
/usr/lib/ollama/cuda_v12/libggml-cuda.so(+0x156eea)[0x719deb6b0eea]
/usr/lib/ollama/cuda_v12/libggml-cuda.so(+0x158e55)[0x719deb6b2e55]
/usr/bin/ollama(+0x13ac156)[0x596dc083b156]
/usr/bin/ollama(+0x132034b)[0x596dc07af34b]
/usr/bin/ollama(+0x3ddae1)[0x596dbf86cae1]
SIGABRT: abort
PC=0x719ec7cabb2c m=13 sigcode=18446744073709551610
signal arrived during cgo execution

goroutine 1506 gp=0xc0005828c0 m=13 mp=0xc000278008 [syscall]:
runtime.cgocall(0x596dc07af330, 0xc00008baa0)
	runtime/cgocall.go:167 +0x4b fp=0xc00008ba78 sp=0xc00008ba40 pc=0x596dbf861a6b
github.com/ollama/ollama/ml/backend/ggml._Cfunc_ggml_backend_sched_graph_compute_async(0x719e50104d90, 0x71971c0039a0)
	_cgo_gotypes.go:979 +0x4a fp=0xc00008baa0 sp=0xc00008ba78 pc=0x596dbfd4cb0a
github.com/ollama/ollama/ml/backend/ggml.(*Context).ComputeWithNotify.func2(...)
	github.com/ollama/ollama/ml/backend/ggml/ggml.go:825
github.com/ollama/ollama/ml/backend/ggml.(*Context).ComputeWithNotify(0xc001736100, 0xc000359800?, {0xc0094b7070, 0x1, 0x2?})
	github.com/ollama/ollama/ml/backend/ggml/ggml.go:825 +0x1b2 fp=0xc00008bb78 sp=0xc00008baa0 pc=0x596dbfd5b492
github.com/ollama/ollama/runner/ollamarunner.(*Server).computeBatch(0xc0002370e0, {0x0, {0x596dc11069d0, 0xc001736100}, {0x596dc1113e30, 0xc000660648}, {0xc0003c6c08, 0x200, 0x25f}, {{0x596dc1113e30, ...}, ...}, ...})
	github.com/ollama/ollama/runner/ollamarunner/runner.go:716 +0x862 fp=0xc00008bef0 sp=0xc00008bb78 pc=0x596dbfe86282
github.com/ollama/ollama/runner/ollamarunner.(*Server).run.gowrap1()
	github.com/ollama/ollama/runner/ollamarunner/runner.go:459 +0x58 fp=0xc00008bfe0 sp=0xc00008bef0 pc=0x596dbfe83f78
runtime.goexit({})
	runtime/asm_amd64.s:1700 +0x1 fp=0xc00008bfe8 sp=0xc00008bfe0 pc=0x596dbf86ce61
created by github.com/ollama/ollama/runner/ollamarunner.(*Server).run in goroutine 8
	github.com/ollama/ollama/runner/ollamarunner/runner.go:459 +0x2cd

goroutine 1 gp=0xc000002380 m=nil [IO wait]:
runtime.gopark(0x0?, 0x0?, 0x0?, 0x0?, 0x0?)
	runtime/proc.go:435 +0xce fp=0xc000339778 sp=0xc000339758 pc=0x596dbf864eee
runtime.netpollblock(0xc00051f7c8?, 0xbf7fe4a6?, 0x6d?)
	runtime/netpoll.go:575 +0xf7 fp=0xc0003397b0 sp=0xc000339778 pc=0x596dbf82a097
internal/poll.runtime_pollWait(0x719e800c8610, 0x72)
	runtime/netpoll.go:351 +0x85 fp=0xc0003397d0 sp=0xc0003397b0 pc=0x596dbf864105
internal/poll.(*pollDesc).wait(0xc0000dc900?, 0x900000036?, 0x0)
	internal/poll/fd_poll_runtime.go:84 +0x27 fp=0xc0003397f8 sp=0xc0003397d0 pc=0x596dbf8ec487
internal/poll.(*pollDesc).waitRead(...)
	internal/poll/fd_poll_runtime.go:89
internal/poll.(*FD).Accept(0xc0000dc900)
	internal/poll/fd_unix.go:620 +0x295 fp=0xc0003398a0 sp=0xc0003397f8 pc=0x596dbf8f1855
net.(*netFD).accept(0xc0000dc900)
	net/fd_unix.go:172 +0x29 fp=0xc000339958 sp=0xc0003398a0 pc=0x596dbf964d49
net.(*TCPListener).accept(0xc000051740)
	net/tcpsock_posix.go:159 +0x1b fp=0xc0003399a8 sp=0xc000339958 pc=0x596dbf97ac5b
net.(*TCPListener).Accept(0xc000051740)
	net/tcpsock.go:380 +0x30 fp=0xc0003399d8 sp=0xc0003399a8 pc=0x596dbf979b10
net/http.(*onceCloseListener).Accept(0xc0000f2480?)
	<autogenerated>:1 +0x24 fp=0xc0003399f0 sp=0xc0003399d8 pc=0x596dbfb919c4
net/http.(*Server).Serve(0xc0001ffb00, {0x596dc10f7bc0, 0xc000051740})
	net/http/server.go:3424 +0x30c fp=0xc000339b20 sp=0xc0003399f0 pc=0x596dbfb6928c
github.com/ollama/ollama/runner/ollamarunner.Execute({0xc000130030, 0x4, 0x4})
	github.com/ollama/ollama/runner/ollamarunner/runner.go:1447 +0x94e fp=0xc000339cf0 sp=0xc000339b20 pc=0x596dbfe8d20e
github.com/ollama/ollama/runner.Execute({0xc000130010?, 0x0?, 0x0?})
	github.com/ollama/ollama/runner/runner.go:18 +0x10e fp=0xc000339d30 sp=0xc000339cf0 pc=0x596dbff2c76e
github.com/ollama/ollama/cmd.NewCLI.func3(0xc0001ff800?, {0x596dc0b15236?, 0x4?, 0x596dc0b1523a?})
	github.com/ollama/ollama/cmd/cmd.go:2270 +0x45 fp=0xc000339d58 sp=0xc000339d30 pc=0x596dc073f845
github.com/spf13/cobra.(*Command).execute(0xc0000f7b08, {0xc0005b96d0, 0x5, 0x5})
	github.com/spf13/cobra@v1.7.0/command.go:940 +0x85c fp=0xc000339e78 sp=0xc000339d58 pc=0x596dbf9decdc
github.com/spf13/cobra.(*Command).ExecuteC(0xc0005be908)
	github.com/spf13/cobra@v1.7.0/command.go:1068 +0x3a5 fp=0xc000339f30 sp=0xc000339e78 pc=0x596dbf9df525
github.com/spf13/cobra.(*Command).Execute(...)
	github.com/spf13/cobra@v1.7.0/command.go:992
github.com/spf13/cobra.(*Command).ExecuteContext(...)
	github.com/spf13/cobra@v1.7.0/command.go:985
main.main()
	github.com/ollama/ollama/main.go:12 +0x4d fp=0xc000339f50 sp=0xc000339f30 pc=0x596dc0741ced
runtime.main()
	runtime/proc.go:283 +0x29d fp=0xc000339fe0 sp=0xc000339f50 pc=0x596dbf83171d
runtime.goexit({})
	runtime/asm_amd64.s:1700 +0x1 fp=0xc000339fe8 sp=0xc000339fe0 pc=0x596dbf86ce61

goroutine 2 gp=0xc000002e00 m=nil [force gc (idle)]:
runtime.gopark(0x0?, 0x0?, 0x0?, 0x0?, 0x0?)
	runtime/proc.go:435 +0xce fp=0xc000072fa8 sp=0xc000072f88 pc=0x596dbf864eee
runtime.goparkunlock(...)
	runtime/proc.go:441
runtime.forcegchelper()
	runtime/proc.go:348 +0xb8 fp=0xc000072fe0 sp=0xc000072fa8 pc=0x596dbf831a58
runtime.goexit({})
	runtime/asm_amd64.s:1700 +0x1 fp=0xc000072fe8 sp=0xc000072fe0 pc=0x596dbf86ce61
created by runtime.init.7 in goroutine 1
	runtime/proc.go:336 +0x1a

goroutine 3 gp=0xc000003340 m=nil [GC sweep wait]:
runtime.gopark(0x1?, 0x0?, 0x0?, 0x0?, 0x0?)
	runtime/proc.go:435 +0xce fp=0xc000073780 sp=0xc000073760 pc=0x596dbf864eee
runtime.goparkunlock(...)
	runtime/proc.go:441
runtime.bgsweep(0xc00007e000)
	runtime/mgcsweep.go:316 +0xdf fp=0xc0000737c8 sp=0xc000073780 pc=0x596dbf81c1ff
runtime.gcenable.gowrap1()
	runtime/mgc.go:204 +0x25 fp=0xc0000737e0 sp=0xc0000737c8 pc=0x596dbf8105e5
runtime.goexit({})
	runtime/asm_amd64.s:1700 +0x1 fp=0xc0000737e8 sp=0xc0000737e0 pc=0x596dbf86ce61
created by runtime.gcenable in goroutine 1
	runtime/mgc.go:204 +0x66

goroutine 4 gp=0xc000003500 m=nil [GC scavenge wait]:
runtime.gopark(0x4f1263?, 0x4c195c?, 0x0?, 0x0?, 0x0?)
	runtime/proc.go:435 +0xce fp=0xc000073f78 sp=0xc000073f58 pc=0x596dbf864eee
runtime.goparkunlock(...)
	runtime/proc.go:441
runtime.(*scavengerState).park(0x596dc1b2e5a0)
	runtime/mgcscavenge.go:425 +0x49 fp=0xc000073fa8 sp=0xc000073f78 pc=0x596dbf819c49
runtime.bgscavenge(0xc00007e000)
	runtime/mgcscavenge.go:658 +0x59 fp=0xc000073fc8 sp=0xc000073fa8 pc=0x596dbf81a1d9
runtime.gcenable.gowrap2()
	runtime/mgc.go:205 +0x25 fp=0xc000073fe0 sp=0xc000073fc8 pc=0x596dbf810585
runtime.goexit({})
	runtime/asm_amd64.s:1700 +0x1 fp=0xc000073fe8 sp=0xc000073fe0 pc=0x596dbf86ce61
created by runtime.gcenable in goroutine 1
	runtime/mgc.go:205 +0xa5

goroutine 18 gp=0xc000102700 m=nil [finalizer wait]:
runtime.gopark(0x1b8?, 0xc000002380?, 0x1?, 0x23?, 0xc000072688?)
	runtime/proc.go:435 +0xce fp=0xc000072630 sp=0xc000072610 pc=0x596dbf864eee
runtime.runfinq()
	runtime/mfinal.go:196 +0x107 fp=0xc0000727e0 sp=0xc000072630 pc=0x596dbf80f5a7
runtime.goexit({})
	runtime/asm_amd64.s:1700 +0x1 fp=0xc0000727e8 sp=0xc0000727e0 pc=0x596dbf86ce61
created by runtime.createfing in goroutine 1
	runtime/mfinal.go:166 +0x3d

goroutine 19 gp=0xc000103180 m=nil [chan receive]:
runtime.gopark(0xc000235b80?, 0xc000010168?, 0x60?, 0xe7?, 0x596dbf94b8a8?)
	runtime/proc.go:435 +0xce fp=0xc00006e718 sp=0xc00006e6f8 pc=0x596dbf864eee
runtime.chanrecv(0xc000110310, 0x0, 0x1)
	runtime/chan.go:664 +0x445 fp=0xc00006e790 sp=0xc00006e718 pc=0x596dbf801085
runtime.chanrecv1(0x0?, 0x0?)
	runtime/chan.go:506 +0x12 fp=0xc00006e7b8 sp=0xc00006e790 pc=0x596dbf800c12
runtime.unique_runtime_registerUniqueMapCleanup.func2(...)
	runtime/mgc.go:1796
runtime.unique_runtime_registerUniqueMapCleanup.gowrap1()
	runtime/mgc.go:1799 +0x2f fp=0xc00006e7e0 sp=0xc00006e7b8 pc=0x596dbf81378f
runtime.goexit({})
	runtime/asm_amd64.s:1700 +0x1 fp=0xc00006e7e8 sp=0xc00006e7e0 pc=0x596dbf86ce61
created by unique.runtime_registerUniqueMapCleanup in goroutine 1
	runtime/mgc.go:1794 +0x85

goroutine 20 gp=0xc000103500 m=nil [GC worker (idle)]:
runtime.gopark(0x36c21352c46?, 0x3?, 0xa9?, 0xc9?, 0x0?)
runtime/proc.go:435 +0xce fp=0xc00006ef38 sp=0xc00006ef18 pc=0x596dbf864eee runtime.gcBgMarkWorker(0xc000111730) runtime/mgc.go:1423 +0xe9 fp=0xc00006efc8 sp=0xc00006ef38 pc=0x596dbf812aa9 runtime.gcBgMarkStartWorkers.gowrap1() runtime/mgc.go:1339 +0x25 fp=0xc00006efe0 sp=0xc00006efc8 pc=0x596dbf812985 runtime.goexit({}) runtime/asm_amd64.s:1700 +0x1 fp=0xc00006efe8 sp=0xc00006efe0 pc=0x596dbf86ce61 created by runtime.gcBgMarkStartWorkers in goroutine 1 runtime/mgc.go:1339 +0x105 goroutine 34 gp=0xc000504000 m=nil [GC worker (idle)]: runtime.gopark(0x36ad8f541be?, 0x1?, 0x49?, 0x33?, 0x0?) runtime/proc.go:435 +0xce fp=0xc00050a738 sp=0xc00050a718 pc=0x596dbf864eee runtime.gcBgMarkWorker(0xc000111730) runtime/mgc.go:1423 +0xe9 fp=0xc00050a7c8 sp=0xc00050a738 pc=0x596dbf812aa9 runtime.gcBgMarkStartWorkers.gowrap1() runtime/mgc.go:1339 +0x25 fp=0xc00050a7e0 sp=0xc00050a7c8 pc=0x596dbf812985 runtime.goexit({}) runtime/asm_amd64.s:1700 +0x1 fp=0xc00050a7e8 sp=0xc00050a7e0 pc=0x596dbf86ce61 created by runtime.gcBgMarkStartWorkers in goroutine 1 runtime/mgc.go:1339 +0x105 goroutine 5 gp=0xc000003a40 m=nil [GC worker (idle)]: runtime.gopark(0x36c21af2acb?, 0x1?, 0x52?, 0xef?, 0x0?) runtime/proc.go:435 +0xce fp=0xc000074738 sp=0xc000074718 pc=0x596dbf864eee runtime.gcBgMarkWorker(0xc000111730) runtime/mgc.go:1423 +0xe9 fp=0xc0000747c8 sp=0xc000074738 pc=0x596dbf812aa9 runtime.gcBgMarkStartWorkers.gowrap1() runtime/mgc.go:1339 +0x25 fp=0xc0000747e0 sp=0xc0000747c8 pc=0x596dbf812985 runtime.goexit({}) runtime/asm_amd64.s:1700 +0x1 fp=0xc0000747e8 sp=0xc0000747e0 pc=0x596dbf86ce61 created by runtime.gcBgMarkStartWorkers in goroutine 1 runtime/mgc.go:1339 +0x105 goroutine 6 gp=0xc000003c00 m=nil [GC worker (idle)]: runtime.gopark(0x596dc1c03520?, 0x1?, 0xc2?, 0x9b?, 0x0?) 
runtime/proc.go:435 +0xce fp=0xc000074f38 sp=0xc000074f18 pc=0x596dbf864eee runtime.gcBgMarkWorker(0xc000111730) runtime/mgc.go:1423 +0xe9 fp=0xc000074fc8 sp=0xc000074f38 pc=0x596dbf812aa9 runtime.gcBgMarkStartWorkers.gowrap1() runtime/mgc.go:1339 +0x25 fp=0xc000074fe0 sp=0xc000074fc8 pc=0x596dbf812985 runtime.goexit({}) runtime/asm_amd64.s:1700 +0x1 fp=0xc000074fe8 sp=0xc000074fe0 pc=0x596dbf86ce61 created by runtime.gcBgMarkStartWorkers in goroutine 1 runtime/mgc.go:1339 +0x105 goroutine 7 gp=0xc000003dc0 m=nil [GC worker (idle)]: runtime.gopark(0x36c21352f04?, 0x3?, 0x14?, 0x67?, 0x0?) runtime/proc.go:435 +0xce fp=0xc000075738 sp=0xc000075718 pc=0x596dbf864eee runtime.gcBgMarkWorker(0xc000111730) runtime/mgc.go:1423 +0xe9 fp=0xc0000757c8 sp=0xc000075738 pc=0x596dbf812aa9 runtime.gcBgMarkStartWorkers.gowrap1() runtime/mgc.go:1339 +0x25 fp=0xc0000757e0 sp=0xc0000757c8 pc=0x596dbf812985 runtime.goexit({}) runtime/asm_amd64.s:1700 +0x1 fp=0xc0000757e8 sp=0xc0000757e0 pc=0x596dbf86ce61 created by runtime.gcBgMarkStartWorkers in goroutine 1 runtime/mgc.go:1339 +0x105 goroutine 21 gp=0xc0001036c0 m=nil [GC worker (idle)]: runtime.gopark(0x36c21adf40d?, 0x1?, 0x71?, 0x67?, 0x0?) runtime/proc.go:435 +0xce fp=0xc00006f738 sp=0xc00006f718 pc=0x596dbf864eee runtime.gcBgMarkWorker(0xc000111730) runtime/mgc.go:1423 +0xe9 fp=0xc00006f7c8 sp=0xc00006f738 pc=0x596dbf812aa9 runtime.gcBgMarkStartWorkers.gowrap1() runtime/mgc.go:1339 +0x25 fp=0xc00006f7e0 sp=0xc00006f7c8 pc=0x596dbf812985 runtime.goexit({}) runtime/asm_amd64.s:1700 +0x1 fp=0xc00006f7e8 sp=0xc00006f7e0 pc=0x596dbf86ce61 created by runtime.gcBgMarkStartWorkers in goroutine 1 runtime/mgc.go:1339 +0x105 goroutine 35 gp=0xc0005041c0 m=nil [GC worker (idle)]: runtime.gopark(0x36c21b9830f?, 0x3?, 0x87?, 0x43?, 0x0?) 
runtime/proc.go:435 +0xce fp=0xc00050af38 sp=0xc00050af18 pc=0x596dbf864eee runtime.gcBgMarkWorker(0xc000111730) runtime/mgc.go:1423 +0xe9 fp=0xc00050afc8 sp=0xc00050af38 pc=0x596dbf812aa9 runtime.gcBgMarkStartWorkers.gowrap1() runtime/mgc.go:1339 +0x25 fp=0xc00050afe0 sp=0xc00050afc8 pc=0x596dbf812985 runtime.goexit({}) runtime/asm_amd64.s:1700 +0x1 fp=0xc00050afe8 sp=0xc00050afe0 pc=0x596dbf86ce61 created by runtime.gcBgMarkStartWorkers in goroutine 1 runtime/mgc.go:1339 +0x105 goroutine 36 gp=0xc000504380 m=nil [GC worker (idle)]: runtime.gopark(0x36c21abc622?, 0x1?, 0x68?, 0xa1?, 0x0?) runtime/proc.go:435 +0xce fp=0xc00050b738 sp=0xc00050b718 pc=0x596dbf864eee runtime.gcBgMarkWorker(0xc000111730) runtime/mgc.go:1423 +0xe9 fp=0xc00050b7c8 sp=0xc00050b738 pc=0x596dbf812aa9 runtime.gcBgMarkStartWorkers.gowrap1() runtime/mgc.go:1339 +0x25 fp=0xc00050b7e0 sp=0xc00050b7c8 pc=0x596dbf812985 runtime.goexit({}) runtime/asm_amd64.s:1700 +0x1 fp=0xc00050b7e8 sp=0xc00050b7e0 pc=0x596dbf86ce61 created by runtime.gcBgMarkStartWorkers in goroutine 1 runtime/mgc.go:1339 +0x105 goroutine 8 gp=0xc000505180 m=nil [chan receive]: runtime.gopark(0x30?, 0x596dc1035d80?, 0x1?, 0x2?, 0xc000335798?) runtime/proc.go:435 +0xce fp=0xc000335750 sp=0xc000335730 pc=0x596dbf864eee runtime.chanrecv(0xc0001101c0, 0x0, 0x1) runtime/chan.go:664 +0x445 fp=0xc0003357c8 sp=0xc000335750 pc=0x596dbf801085 runtime.chanrecv1(0x596dc0b58664?, 0x29?) 
runtime/chan.go:506 +0x12 fp=0xc0003357f0 sp=0xc0003357c8 pc=0x596dbf800c12 github.com/ollama/ollama/runner/ollamarunner.(*Server).forwardBatch(_, {0x1, {0x596dc11069d0, 0xc006fe4000}, {0x596dc1113e30, 0xc002c790b0}, {0xc00227a008, 0x200, 0x25f}, {{0x596dc1113e30, ...}, ...}, ...}) github.com/ollama/ollama/runner/ollamarunner/runner.go:476 +0xfa fp=0xc000335b58 sp=0xc0003357f0 pc=0x596dbfe8409a github.com/ollama/ollama/runner/ollamarunner.(*Server).run(0xc0002370e0, {0x596dc10fa520, 0xc0005b9770}) github.com/ollama/ollama/runner/ollamarunner/runner.go:453 +0x18c fp=0xc000335fb8 sp=0xc000335b58 pc=0x596dbfe83d4c github.com/ollama/ollama/runner/ollamarunner.Execute.gowrap1() github.com/ollama/ollama/runner/ollamarunner/runner.go:1424 +0x28 fp=0xc000335fe0 sp=0xc000335fb8 pc=0x596dbfe8d488 runtime.goexit({}) runtime/asm_amd64.s:1700 +0x1 fp=0xc000335fe8 sp=0xc000335fe0 pc=0x596dbf86ce61 created by github.com/ollama/ollama/runner/ollamarunner.Execute in goroutine 1 github.com/ollama/ollama/runner/ollamarunner/runner.go:1424 +0x4c9 goroutine 9 gp=0xc000505340 m=nil [select]: runtime.gopark(0xc000049a08?, 0x2?, 0xc0?, 0x97?, 0xc00004986c?) runtime/proc.go:435 +0xce fp=0xc000049698 sp=0xc000049678 pc=0x596dbf864eee runtime.selectgo(0xc000049a08, 0xc000049868, 0x1000?, 0x0, 0x1?, 0x1) runtime/select.go:351 +0x837 fp=0xc0000497d0 sp=0xc000049698 pc=0x596dbf843c17 github.com/ollama/ollama/runner/ollamarunner.(*Server).completion(0xc0002370e0, {0x596dc10f7da0, 0xc0000e81c0}, 0xc0003a8280) github.com/ollama/ollama/runner/ollamarunner/runner.go:956 +0xc4e fp=0xc000049ac0 sp=0xc0000497d0 pc=0x596dbfe8852e github.com/ollama/ollama/runner/ollamarunner.(*Server).completion-fm({0x596dc10f7da0?, 0xc0000e81c0?}, 0xc00033bb40?) <autogenerated>:1 +0x36 fp=0xc000049af0 sp=0xc000049ac0 pc=0x596dbfe8d976 net/http.HandlerFunc.ServeHTTP(0xc0000c8780?, {0x596dc10f7da0?, 0xc0000e81c0?}, 0xc00033bb60?) 
net/http/server.go:2294 +0x29 fp=0xc000049b18 sp=0xc000049af0 pc=0x596dbfb658c9 net/http.(*ServeMux).ServeHTTP(0x596dbf809ac5?, {0x596dc10f7da0, 0xc0000e81c0}, 0xc0003a8280) net/http/server.go:2822 +0x1c4 fp=0xc000049b68 sp=0xc000049b18 pc=0x596dbfb677c4 net/http.serverHandler.ServeHTTP({0x596dc10f4090?}, {0x596dc10f7da0?, 0xc0000e81c0?}, 0x1?) net/http/server.go:3301 +0x8e fp=0xc000049b98 sp=0xc000049b68 pc=0x596dbfb8524e net/http.(*conn).serve(0xc0000f2480, {0x596dc10fa4e8, 0xc0000f13b0}) net/http/server.go:2102 +0x625 fp=0xc000049fb8 sp=0xc000049b98 pc=0x596dbfb63dc5 net/http.(*Server).Serve.gowrap3() net/http/server.go:3454 +0x28 fp=0xc000049fe0 sp=0xc000049fb8 pc=0x596dbfb69688 runtime.goexit({}) runtime/asm_amd64.s:1700 +0x1 fp=0xc000049fe8 sp=0xc000049fe0 pc=0x596dbf86ce61 created by net/http.(*Server).Serve in goroutine 1 net/http/server.go:3454 +0x485 goroutine 1406 gp=0xc000505c00 m=nil [IO wait]: runtime.gopark(0x0?, 0x0?, 0x0?, 0x0?, 0xb?) runtime/proc.go:435 +0xce fp=0xc000187dd8 sp=0xc000187db8 pc=0x596dbf864eee runtime.netpollblock(0x596dbf888798?, 0xbf7fe4a6?, 0x6d?) runtime/netpoll.go:575 +0xf7 fp=0xc000187e10 sp=0xc000187dd8 pc=0x596dbf82a097 internal/poll.runtime_pollWait(0x719e800c84f8, 0x72) runtime/netpoll.go:351 +0x85 fp=0xc000187e30 sp=0xc000187e10 pc=0x596dbf864105 internal/poll.(*pollDesc).wait(0xc0000dc980?, 0xc000574041?, 0x0) internal/poll/fd_poll_runtime.go:84 +0x27 fp=0xc000187e58 sp=0xc000187e30 pc=0x596dbf8ec487 internal/poll.(*pollDesc).waitRead(...) 
internal/poll/fd_poll_runtime.go:89 internal/poll.(*FD).Read(0xc0000dc980, {0xc000574041, 0x1, 0x1}) internal/poll/fd_unix.go:165 +0x27a fp=0xc000187ef0 sp=0xc000187e58 pc=0x596dbf8ed77a net.(*netFD).Read(0xc0000dc980, {0xc000574041?, 0xc0005b6198?, 0xc000187f70?}) net/fd_posix.go:55 +0x25 fp=0xc000187f38 sp=0xc000187ef0 pc=0x596dbf962da5 net.(*conn).Read(0xc00052e6e0, {0xc000574041?, 0xc0005b7a80?, 0x596dbfbdbf80?}) net/net.go:194 +0x45 fp=0xc000187f80 sp=0xc000187f38 pc=0x596dbf971165 net/http.(*connReader).backgroundRead(0xc000574030) net/http/server.go:690 +0x37 fp=0xc000187fc8 sp=0xc000187f80 pc=0x596dbfb5dc97 net/http.(*connReader).startBackgroundRead.gowrap2() net/http/server.go:686 +0x25 fp=0xc000187fe0 sp=0xc000187fc8 pc=0x596dbfb5dbc5 runtime.goexit({}) runtime/asm_amd64.s:1700 +0x1 fp=0xc000187fe8 sp=0xc000187fe0 pc=0x596dbf86ce61 created by net/http.(*connReader).startBackgroundRead in goroutine 9 net/http/server.go:686 +0xb6 goroutine 1494 gp=0xc0008b4380 m=nil [chan receive]: runtime.gopark(0x30?, 0x596dc1035d80?, 0x1?, 0xf8?, 0xc001a93b20?) runtime/proc.go:435 +0xce fp=0xc001a93ad8 sp=0xc001a93ab8 pc=0x596dbf864eee runtime.chanrecv(0xc00031e150, 0x0, 0x1) runtime/chan.go:664 +0x445 fp=0xc001a93b50 sp=0xc001a93ad8 pc=0x596dbf801085 runtime.chanrecv1(0x596dc0b5c342?, 0x2c?) 
runtime/chan.go:506 +0x12 fp=0xc001a93b78 sp=0xc001a93b50 pc=0x596dbf800c12 github.com/ollama/ollama/runner/ollamarunner.(*Server).computeBatch(0xc0002370e0, {0x1, {0x596dc11069d0, 0xc006fe4000}, {0x596dc1113e30, 0xc002c790b0}, {0xc00227a008, 0x200, 0x25f}, {{0x596dc1113e30, ...}, ...}, ...}) github.com/ollama/ollama/runner/ollamarunner/runner.go:645 +0x185 fp=0xc001a93ef0 sp=0xc001a93b78 pc=0x596dbfe85ba5 github.com/ollama/ollama/runner/ollamarunner.(*Server).run.gowrap1() github.com/ollama/ollama/runner/ollamarunner/runner.go:459 +0x58 fp=0xc001a93fe0 sp=0xc001a93ef0 pc=0x596dbfe83f78 runtime.goexit({}) runtime/asm_amd64.s:1700 +0x1 fp=0xc001a93fe8 sp=0xc001a93fe0 pc=0x596dbf86ce61 created by github.com/ollama/ollama/runner/ollamarunner.(*Server).run in goroutine 8 github.com/ollama/ollama/runner/ollamarunner/runner.go:459 +0x2cd rax 0x0 rbx 0xb8 rcx 0x719ec7cabb2c rdx 0x6 rdi 0xa7 rsi 0xb8 rbp 0x719d6cd95c00 rsp 0x719d6cd95bc0 r8 0x0 r9 0x7 r10 0x8 r11 0x246 r12 0x6 r13 0x719debdf6710 r14 0x16 r15 0x300000 rip 0x719ec7cabb2c rflags 0x246 cs 0x33 fs 0x0 gs 0x0 time=2026-02-26T14:26:44.147Z level=ERROR source=server.go:1610 msg="post predict" error="Post \"http://127.0.0.1:37403/completion\": EOF" [GIN] 2026/02/26 - 14:26:44 | 500 | 1.670626538s | 172.18.0.1 | POST  "/api/generate" time=2026-02-26T14:26:44.147Z level=DEBUG source=sched.go:433 msg="context for request finished" runner.name=registry.ollama.ai/library/qwen3.5:27b-q4_K_M runner.inference="[{ID:GPU-8416b7d6-2f5f-d827-5992-4affda11c96a Library:CUDA} {ID:GPU-a4c65198-ad5c-f3c2-d8a3-2829beb71c35 Library:CUDA}]" runner.size="23.0 GiB" runner.vram="14.5 GiB" runner.parallel=1 runner.pid=167 runner.model=/root/.ollama/models/blobs/sha256-7935de6e08f9444536d0edcacf19d2166b34bef8ddb4ac7ce9263ff5cad0693b runner.num_ctx=4096 time=2026-02-26T14:26:44.147Z level=DEBUG source=sched.go:338 msg="runner with non-zero duration has gone idle, adding timer" runner.name=registry.ollama.ai/library/qwen3.5:27b-q4_K_M 
runner.inference="[{ID:GPU-8416b7d6-2f5f-d827-5992-4affda11c96a Library:CUDA} {ID:GPU-a4c65198-ad5c-f3c2-d8a3-2829beb71c35 Library:CUDA}]" runner.size="23.0 GiB" runner.vram="14.5 GiB" runner.parallel=1 runner.pid=167 runner.model=/root/.ollama/models/blobs/sha256-7935de6e08f9444536d0edcacf19d2166b34bef8ddb4ac7ce9263ff5cad0693b runner.num_ctx=4096 duration=1h0m0s time=2026-02-26T14:26:44.147Z level=DEBUG source=sched.go:356 msg="after processing request finished event" runner.name=registry.ollama.ai/library/qwen3.5:27b-q4_K_M runner.inference="[{ID:GPU-8416b7d6-2f5f-d827-5992-4affda11c96a Library:CUDA} {ID:GPU-a4c65198-ad5c-f3c2-d8a3-2829beb71c35 Library:CUDA}]" runner.size="23.0 GiB" runner.vram="14.5 GiB" runner.parallel=1 runner.pid=167 runner.model=/root/.ollama/models/blobs/sha256-7935de6e08f9444536d0edcacf19d2166b34bef8ddb4ac7ce9263ff5cad0693b runner.num_ctx=4096 refCount=0 time=2026-02-26T14:26:44.226Z level=ERROR source=server.go:304 msg="llama runner terminated" error="exit status 2" ``` </details>

@rick-github commented on GitHub (Feb 26, 2026):

I hope it's OK to add my log even though it's a different model.

It's a different size but the architecture is the same. 0.17.1 includes support for qwen35 and qwen35moe but it looks like there are still some rough edges that need attending to.

<!-- gh-comment-id:3969110682 -->

@rick-github commented on GitHub (Feb 26, 2026):

Updated the title to make this the main issue for users affected by this problem.

<!-- gh-comment-id:3969117667 -->

@iChristGit commented on GitHub (Feb 26, 2026):

```
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3090 Ti, compute capability 8.6, VMM: yes, ID: GPU-98307eba-46ec-f866-ae7f-784c6292fd2e
load_backend: loaded CUDA backend from C:\Users\admin\AppData\Local\Programs\Ollama\lib\ollama\cuda_v13\ggml-cuda.dll
time=2026-02-27T01:06:54.744+02:00 level=INFO source=ggml.go:104 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX_VNNI=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.BMI2=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 CUDA.0.ARCHS=750,800,860,870,890,900,1000,1030,1100,1200,1210 CUDA.0.USE_GRAPHS=1 CUDA.0.PEER_MAX_BATCH_SIZE=128 compiler=cgo(clang)
time=2026-02-27T01:06:55.405+02:00 level=INFO source=runner.go:1284 msg=load request="{Operation:fit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:Enabled KvSize:16000 KvCacheType: NumThreads:8 GPULayers:38[ID:GPU-98307eba-46ec-f866-ae7f-784c6292fd2e Layers:38(2..39)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2026-02-27T01:06:55.722+02:00 level=INFO source=runner.go:1284 msg=load request="{Operation:alloc LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:Enabled KvSize:16000 KvCacheType: NumThreads:8 GPULayers:38[ID:GPU-98307eba-46ec-f866-ae7f-784c6292fd2e Layers:38(2..39)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2026-02-27T01:06:56.529+02:00 level=INFO source=runner.go:1284 msg=load request="{Operation:commit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:Enabled KvSize:16000 KvCacheType: NumThreads:8 GPULayers:38[ID:GPU-98307eba-46ec-f866-ae7f-784c6292fd2e Layers:38(2..39)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2026-02-27T01:06:56.529+02:00 level=INFO source=ggml.go:482 msg="offloading 38 repeating layers to GPU"
time=2026-02-27T01:06:56.529+02:00 level=INFO source=ggml.go:486 msg="offloading output layer to CPU"
time=2026-02-27T01:06:56.529+02:00 level=INFO source=ggml.go:494 msg="offloaded 38/41 layers to GPU"
time=2026-02-27T01:06:56.529+02:00 level=INFO source=device.go:240 msg="model weights" device=CUDA0 size="19.3 GiB"
time=2026-02-27T01:06:56.530+02:00 level=INFO source=device.go:245 msg="model weights" device=CPU size="2.9 GiB"
time=2026-02-27T01:06:56.532+02:00 level=INFO source=device.go:251 msg="kv cache" device=CUDA0 size="1.7 GiB"
time=2026-02-27T01:06:56.532+02:00 level=INFO source=device.go:256 msg="kv cache" device=CPU size="104.7 MiB"
time=2026-02-27T01:06:56.532+02:00 level=INFO source=device.go:262 msg="compute graph" device=CUDA0 size="796.4 MiB"
time=2026-02-27T01:06:56.532+02:00 level=INFO source=device.go:267 msg="compute graph" device=CPU size="630.8 MiB"
time=2026-02-27T01:06:56.533+02:00 level=INFO source=device.go:272 msg="total memory" size="25.5 GiB"
time=2026-02-27T01:06:56.535+02:00 level=INFO source=sched.go:566 msg="loaded runners" count=1
time=2026-02-27T01:06:56.536+02:00 level=INFO source=server.go:1350 msg="waiting for llama runner to start responding"
time=2026-02-27T01:06:56.538+02:00 level=INFO source=server.go:1384 msg="waiting for server to become available" status="llm server loading model"
time=2026-02-27T01:07:00.298+02:00 level=INFO source=server.go:1388 msg="llama runner started in 5.70 seconds"
CUDA error: invalid argument
  current device: 0, in function ggml_cuda_cpy at C:\a\ollama\ollama\ml\backend\ggml\ggml\src\ggml-cuda\cpy.cu:438
  cudaMemcpyAsyncReserve(src1_ddc, src0_ddc, ggml_nbytes(src0), cudaMemcpyDeviceToDevice, main_stream)
C:\a\ollama\ollama\ml\backend\ggml\ggml\src\ggml-cuda\ggml-cuda.cu:94: CUDA error
time=2026-02-27T01:07:00.626+02:00 level=ERROR source=server.go:1610 msg="post predict" error="Post \"http://127.0.0.1:64085/completion\": read tcp 127.0.0.1:64091->127.0.0.1:64085: wsarecv: An existing connection was forcibly closed by the remote host."
[GIN] 2026/02/27 - 01:07:00 | 500 |    6.3789689s |       127.0.0.1 | POST     "/api/chat"
time=2026-02-27T01:07:00.827+02:00 level=ERROR source=server.go:304 msg="llama runner terminated" error="exit status 1"
```

RTX 3090 Ti
Win11
Ollama version 0.17.1
Model downloaded from ollama.com (not HF)

Non-stop errors with both the 27B and the 35B MoE. In the Ollama chat I can chat once and then get the error; through Open WebUI I can't get even one response, always getting:

500: model runner has unexpectedly stopped, this may be due to resource limitations or an internal error, check ollama server logs for details

Llama.cpp works just fine.
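For anyone triaging their own logs, a quick way to confirm they are hitting the same failure is to search the server log for the `ggml_cuda_cpy` / `invalid argument` signature shown above. A minimal sketch (the signature strings are taken from the logs in this issue; nothing here is an official Ollama tool):

```python
import re

# Lines that mark this particular crash, copied from the logs in this issue.
SIGNATURE = re.compile(
    r"CUDA error: invalid argument"
    r"|in function ggml_cuda_cpy"
    r"|llama runner terminated"
)

def find_crash_lines(log_text: str) -> list[str]:
    """Return the log lines that match the crash signature."""
    return [line for line in log_text.splitlines() if SIGNATURE.search(line)]

# Demo on an excerpt of the log above.
excerpt = (
    "CUDA error: invalid argument\n"
    "  current device: 0, in function ggml_cuda_cpy at cpy.cu:438\n"
    'msg="llama runner started in 5.70 seconds"\n'
)
print(find_crash_lines(excerpt))
```

Point `find_crash_lines` at a saved copy of the server log; if it returns the `ggml_cuda_cpy` line, you are seeing this bug rather than an out-of-memory failure.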

<!-- gh-comment-id:3969812051 -->

@webitube commented on GitHub (Feb 27, 2026):

> 3090Ti Win11 ollama version is 0.17.1 downloaded from ollama (no HF) non stop errors in both 27B and 35BMoe, in ollama i can chat once and then get the error, on open-webui cannot even get one respond, always getting:
>
> 500: model runner has unexpectedly stopped, this may be due to resource limitations or an internal error, check ollama server logs for details
>
> Llama.cpp works just fine.

This is exactly what I'm seeing. Same setup but with 4070 Super.

<!-- gh-comment-id:3970229921 -->

@tawagny commented on GitHub (Feb 27, 2026):

Same setup but with a 4080 Super, using 0.17.2.

<!-- gh-comment-id:3970356637 -->

@iChristGit commented on GitHub (Feb 27, 2026):

The issue is present on 0.17.4 as well.

<!-- gh-comment-id:3971257036 -->

@slcsolutions commented on GitHub (Feb 27, 2026):

I see this problem with all Qwen models (3.5 or code-next), starting from Ollama 0.17.1 up to the latest 0.17.4.
The problem appears on the second prompt: after the model loads, the first prompt gets a correct response, but then the model crashes and the second question gets this error.
On 0.17.0 qwen3-code-next works fine (qwen 3.5 is not supported there).
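The "second prompt crashes" pattern described above can be reproduced outside a chat UI by scripting two sequential requests against the standard Ollama REST API. A rough sketch (the host is Ollama's default, the model tag and prompts are placeholders; adjust for your setup):

```python
import json
import urllib.request

OLLAMA = "http://localhost:11434"  # default Ollama listen address
MODEL = "qwen3.5:35b"              # placeholder: the model you are testing

def build_request(prompt: str) -> urllib.request.Request:
    """Build a non-streaming POST to /api/generate for one prompt."""
    body = json.dumps({"model": MODEL, "prompt": prompt, "stream": False})
    return urllib.request.Request(
        f"{OLLAMA}/api/generate",
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )

def repro() -> None:
    # Per the reports above, the first prompt succeeds and the
    # second returns a 500 once the runner has crashed.
    for prompt in ("Say OK.", "Say OK again."):
        with urllib.request.urlopen(build_request(prompt)) as resp:
            print(json.load(resp).get("response", ""))

# repro()  # uncomment with an Ollama server running
```

If the bug is present, the second call fails with the "model runner has unexpectedly stopped" error while the CUDA stack trace lands in the server log.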

<!-- gh-comment-id:3971738814 -->

@iChristGit commented on GitHub (Feb 27, 2026):

> I find this problem on all qwen model (3.5 or code-next) start from ollama 0.17.1 to latest 0.17.4. Problem is after second prompt. After load model first prompt receive response correctly but then model crash and the second question receive this error. On 0.17.0 qwen3-code-next works fine. (qwen 3.5 is not supported)

Through Open WebUI even the first response gets an error.
In the Ollama web UI or CLI, yes, you can chat once and then the error pops up.

<!-- gh-comment-id:3971754808 -->

@without-ordinary commented on GitHub (Feb 27, 2026):

I'm getting this error trying to load qwen3.5:122b-a10b on 8x 4090s with multimodal requests. Text-only requests seem to work, though they have issues like responses getting stuck in an endless loop.

<!-- gh-comment-id:3972545786 -->

@Noyze-AI commented on GitHub (Feb 27, 2026):

Same problem with 5090Dv2 on 0.17.4.

Any first prompt completes successfully.
Any second prompt crashes with "Error: 500 Internal Server Error: model runner has unexpectedly stopped, this may be due to resource limitations or an internal error, check ollama server logs for details"

<!-- gh-comment-id:3973561875 -->

@shadowmite commented on GitHub (Feb 27, 2026):

> Same problem with 5090Dv2 on 0.17.4.
>
> Any first prompt complete success. Any second prompt crushed with "Error: 500 Internal Server Error: model runner has unexpectedly stopped, this may be due to resource limitations or an internal error, check ollama server logs for details"

A long enough first prompt will fail as well. I used: "This is a test to see if a really long sentence crashes ollama because in my experience I've tried multiple times to send a prompt that I have saved in my clipboard that seems to cause crashes and now I'm curious what'll happen if I just send a really long prompt like this one. I admit the LLM is likely to not know what to respond with so it should just say, "OK" if it understands whats going on here."
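That observation suggests the failure is length-dependent rather than strictly tied to the second prompt. The threshold could be bisected by growing the prompt until the 500 error appears. A sketch of the prompt generator; the filler word, sizes, and the idea of doubling are arbitrary choices, and the actual POSTing to `/api/generate` is left to the reader's client:

```python
def make_prompt(words: int) -> str:
    """Build a filler prompt of roughly `words` words."""
    return " ".join(["test"] * words) + ' Reply with "OK".'

def ramp(max_words: int = 4096):
    """Yield (word_count, prompt) pairs with doubling length."""
    n = 8
    while n <= max_words:
        yield n, make_prompt(n)
        n *= 2

# Send each prompt to a fresh model load and note the first size that
# produces the 500 "model runner has unexpectedly stopped" error.
for n, p in ramp(64):
    print(n, len(p.split()))
```

Reporting the failing size alongside GPU model and quant might help the maintainers correlate the crash with batch or graph sizing.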

<!-- gh-comment-id:3974599701 -->

@tawagny commented on GitHub (Feb 27, 2026):

The issue still exists from 0.17.2 through 0.17.4, tested on Linux and Windows 11 with an RTX 4080 SUPER.


@Code4SAFrankie commented on GitHub (Feb 27, 2026):

Me too on RTX 4090 with 64GB system RAM


@kerta1n commented on GitHub (Feb 27, 2026):

The Ollama Qwen3.5 35B model DOES work for me on 0.17.4 (RTX3090).

(Unsloth's GGUF 35B A3B version from huggingface failed when I was on 0.17.2).


@Adrian-at-CrimsonAzure commented on GitHub (Feb 27, 2026):

> The Ollama Qwen3.5 35B model DOES work for me on 0.17.4 (RTX3090).
>
> (Unsloth's GGUF 35B A3B version from huggingface failed when I was on 0.17.2).

What quant are you using? And how large of a prompt did you use? I can get a different 8 bit quant to respond to "Hello", but it breaks on anything larger than a few sentences.


@kerta1n commented on GitHub (Feb 27, 2026):

> > The Ollama Qwen3.5 35B model DOES work for me on 0.17.4 (RTX3090).
> > (Unsloth's GGUF 35B A3B version from huggingface failed when I was on 0.17.2).
>
> What quant are you using? And how large of a prompt did you use? I can get a different 8 bit quant to respond to "Hello", but it breaks on anything larger than a few sentences.

Oops, forgot to mention this, I have 2x3090 lol. But quant wise, it's just the [default Q4_K_M](https://ollama.com/library/qwen3.5:35b).
Hitting around 115 t/s.


@bertyhell commented on GitHub (Feb 27, 2026):

I have it with this prompt:

should i use my car or go by foot to the carwash which is 500 meters from my house?

model: qwen3.5:35b

ollama version is 0.17.4

OLLAMA_BATCH_SIZE=128
OLLAMA_CONTEXT_LENGTH=32768
OLLAMA_FLASH_ATTENTION=0
OLLAMA_KV_CACHE_TYPE=q8_0

log file:

time=2026-02-27T23:18:35.904+01:00 level=INFO source=routes.go:1663 msg="server config" env="map[CUDA_VISIBLE_DEVICES: GGML_VK_VISIBLE_DEVICES: GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: HTTPS_PROXY: HTTP_PROXY: NO_PROXY: OLLAMA_CONTEXT_LENGTH:65536 OLLAMA_DEBUG:INFO OLLAMA_EDITOR: OLLAMA_FLASH_ATTENTION:false OLLAMA_GPU_OVERHEAD:0 OLLAMA_HOST:http://0.0.0.0:11434 OLLAMA_KEEP_ALIVE:5m0s OLLAMA_KV_CACHE_TYPE:q8_0 OLLAMA_LLM_LIBRARY: OLLAMA_LOAD_TIMEOUT:5m0s OLLAMA_MAX_LOADED_MODELS:0 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:C:\\Users\\verheb4\\.ollama\\models OLLAMA_MULTIUSER_CACHE:false OLLAMA_NEW_ENGINE:false OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NO_CLOUD:true OLLAMA_NUM_PARALLEL:1 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://* vscode-webview://* vscode-file://*] OLLAMA_REMOTES:[ollama.com] OLLAMA_SCHED_SPREAD:false OLLAMA_VULKAN:false ROCR_VISIBLE_DEVICES:]"
time=2026-02-27T23:18:35.905+01:00 level=INFO source=routes.go:1665 msg="Ollama cloud disabled: true"
time=2026-02-27T23:18:35.911+01:00 level=INFO source=images.go:473 msg="total blobs: 29"
time=2026-02-27T23:18:35.913+01:00 level=INFO source=images.go:480 msg="total unused blobs removed: 0"
time=2026-02-27T23:18:35.914+01:00 level=INFO source=routes.go:1718 msg="Listening on [::]:11434 (version 0.17.4)"
time=2026-02-27T23:18:35.915+01:00 level=INFO source=runner.go:67 msg="discovering available GPUs..."
time=2026-02-27T23:18:35.926+01:00 level=INFO source=server.go:431 msg="starting runner" cmd="C:\\Users\\verheb4\\AppData\\Local\\Programs\\Ollama\\ollama.exe runner --ollama-engine --port 49702"
time=2026-02-27T23:18:36.588+01:00 level=INFO source=server.go:431 msg="starting runner" cmd="C:\\Users\\verheb4\\AppData\\Local\\Programs\\Ollama\\ollama.exe runner --ollama-engine --port 49708"
time=2026-02-27T23:18:36.850+01:00 level=INFO source=server.go:431 msg="starting runner" cmd="C:\\Users\\verheb4\\AppData\\Local\\Programs\\Ollama\\ollama.exe runner --ollama-engine --port 49713"
time=2026-02-27T23:18:37.078+01:00 level=INFO source=runner.go:106 msg="experimental Vulkan support disabled.  To enable, set OLLAMA_VULKAN=1"
time=2026-02-27T23:18:37.079+01:00 level=INFO source=server.go:431 msg="starting runner" cmd="C:\\Users\\verheb4\\AppData\\Local\\Programs\\Ollama\\ollama.exe runner --ollama-engine --port 49718"
time=2026-02-27T23:18:37.079+01:00 level=INFO source=server.go:431 msg="starting runner" cmd="C:\\Users\\verheb4\\AppData\\Local\\Programs\\Ollama\\ollama.exe runner --ollama-engine --port 49719"
time=2026-02-27T23:18:37.347+01:00 level=INFO source=types.go:42 msg="inference compute" id=GPU-dbb4ae2c-4a70-70ea-e6ca-8181610cd2aa filter_id="" library=CUDA compute=12.0 name=CUDA0 description="NVIDIA GeForce RTX 5080 Laptop GPU" libdirs=ollama,cuda_v13 driver=13.0 pci_id=0000:01:00.0 type=discrete total="15.9 GiB" available="14.3 GiB"
time=2026-02-27T23:18:37.347+01:00 level=INFO source=routes.go:1768 msg="vram-based default context" total_vram="15.9 GiB" default_num_ctx=4096
[GIN] 2026/02/27 - 23:18:37 | 200 |            0s |       127.0.0.1 | GET      "/api/version"
[GIN] 2026/02/27 - 23:18:37 | 200 |       522.4µs |       127.0.0.1 | GET      "/api/version"
[GIN] 2026/02/27 - 23:18:37 | 200 |            0s |       127.0.0.1 | GET      "/api/version"
[GIN] 2026/02/27 - 23:18:37 | 200 |      5.5593ms |       127.0.0.1 | GET      "/api/tags"
[GIN] 2026/02/27 - 23:18:37 | 200 |    110.6703ms |       127.0.0.1 | POST     "/api/show"
[GIN] 2026/02/27 - 23:18:37 | 401 |    245.1562ms |       127.0.0.1 | POST     "/api/me"
[GIN] 2026/02/27 - 23:18:37 | 401 |    245.8935ms |       127.0.0.1 | POST     "/api/me"
[GIN] 2026/02/27 - 23:18:42 | 200 |      5.1437ms |       127.0.0.1 | GET      "/api/tags"
[GIN] 2026/02/27 - 23:18:42 | 200 |     95.6449ms |       127.0.0.1 | POST     "/api/show"
[GIN] 2026/02/27 - 23:18:43 | 200 |     92.6455ms |       127.0.0.1 | POST     "/api/show"
time=2026-02-27T23:18:43.105+01:00 level=INFO source=server.go:431 msg="starting runner" cmd="C:\\Users\\verheb4\\AppData\\Local\\Programs\\Ollama\\ollama.exe runner --ollama-engine --port 59096"
time=2026-02-27T23:18:43.334+01:00 level=INFO source=cpu_windows.go:148 msg=packages count=1
time=2026-02-27T23:18:43.334+01:00 level=INFO source=cpu_windows.go:164 msg="efficiency cores detected" maxEfficiencyClass=1
time=2026-02-27T23:18:43.334+01:00 level=INFO source=cpu_windows.go:195 msg="" package=0 cores=24 efficiency=16 threads=24
time=2026-02-27T23:18:43.399+01:00 level=WARN source=server.go:258 msg="quantized kv cache requested but flash attention disabled" type=q8_0
time=2026-02-27T23:18:43.400+01:00 level=INFO source=server.go:431 msg="starting runner" cmd="C:\\Users\\verheb4\\AppData\\Local\\Programs\\Ollama\\ollama.exe runner --ollama-engine --model C:\\Users\\verheb4\\.ollama\\models\\blobs\\sha256-2abd0d805943fa113f934d1ae4f2d5a749b5d4fe2a0a9c64b645c1df15868da7 --port 59101"
time=2026-02-27T23:18:43.474+01:00 level=INFO source=sched.go:491 msg="system memory" total="63.5 GiB" free="47.3 GiB" free_swap="48.8 GiB"
time=2026-02-27T23:18:43.474+01:00 level=INFO source=sched.go:498 msg="gpu memory" id=GPU-dbb4ae2c-4a70-70ea-e6ca-8181610cd2aa library=CUDA available="13.9 GiB" free="14.3 GiB" minimum="457.0 MiB" overhead="0 B"
time=2026-02-27T23:18:43.474+01:00 level=INFO source=server.go:757 msg="loading model" "model layers"=41 requested=-1
time=2026-02-27T23:18:43.508+01:00 level=INFO source=runner.go:1411 msg="starting ollama engine"
time=2026-02-27T23:18:43.509+01:00 level=INFO source=runner.go:1446 msg="Server listening on 127.0.0.1:59101"
time=2026-02-27T23:18:43.517+01:00 level=INFO source=runner.go:1284 msg=load request="{Operation:fit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:Disabled KvSize:65536 KvCacheType: NumThreads:8 GPULayers:41[ID:GPU-dbb4ae2c-4a70-70ea-e6ca-8181610cd2aa Layers:41(0..40)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2026-02-27T23:18:43.543+01:00 level=INFO source=ggml.go:136 msg="" architecture=qwen35moe file_type=Q4_K_M name="" description="" num_tensors=1959 num_key_values=57
load_backend: loaded CPU backend from C:\Users\verheb4\AppData\Local\Programs\Ollama\lib\ollama\ggml-cpu-alderlake.dll
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 5080 Laptop GPU, compute capability 12.0, VMM: yes, ID: GPU-dbb4ae2c-4a70-70ea-e6ca-8181610cd2aa
load_backend: loaded CUDA backend from C:\Users\verheb4\AppData\Local\Programs\Ollama\lib\ollama\cuda_v13\ggml-cuda.dll
time=2026-02-27T23:18:43.635+01:00 level=INFO source=ggml.go:104 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX_VNNI=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.BMI2=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 CUDA.0.ARCHS=750,800,860,870,890,900,1000,1030,1100,1200,1210 CUDA.0.USE_GRAPHS=1 CUDA.0.PEER_MAX_BATCH_SIZE=128 compiler=cgo(clang)
time=2026-02-27T23:18:44.505+01:00 level=INFO source=runner.go:1284 msg=load request="{Operation:fit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:Disabled KvSize:65536 KvCacheType: NumThreads:8 GPULayers:2[ID:GPU-dbb4ae2c-4a70-70ea-e6ca-8181610cd2aa Layers:2(38..39)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2026-02-27T23:18:44.898+01:00 level=INFO source=runner.go:1284 msg=load request="{Operation:fit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:Disabled KvSize:65536 KvCacheType: NumThreads:8 GPULayers:1[ID:GPU-dbb4ae2c-4a70-70ea-e6ca-8181610cd2aa Layers:1(39..39)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2026-02-27T23:18:45.341+01:00 level=INFO source=runner.go:1284 msg=load request="{Operation:alloc LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:Disabled KvSize:65536 KvCacheType: NumThreads:8 GPULayers:1[ID:GPU-dbb4ae2c-4a70-70ea-e6ca-8181610cd2aa Layers:1(39..39)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2026-02-27T23:18:46.457+01:00 level=INFO source=runner.go:1284 msg=load request="{Operation:commit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:Disabled KvSize:65536 KvCacheType: NumThreads:8 GPULayers:1[ID:GPU-dbb4ae2c-4a70-70ea-e6ca-8181610cd2aa Layers:1(39..39)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2026-02-27T23:18:46.457+01:00 level=INFO source=ggml.go:482 msg="offloading 1 repeating layers to GPU"
time=2026-02-27T23:18:46.457+01:00 level=INFO source=ggml.go:486 msg="offloading output layer to CPU"
time=2026-02-27T23:18:46.457+01:00 level=INFO source=ggml.go:494 msg="offloaded 1/41 layers to GPU"
time=2026-02-27T23:18:46.457+01:00 level=INFO source=device.go:240 msg="model weights" device=CUDA0 size="516.8 MiB"
time=2026-02-27T23:18:46.457+01:00 level=INFO source=device.go:245 msg="model weights" device=CPU size="21.7 GiB"
time=2026-02-27T23:18:46.457+01:00 level=INFO source=device.go:251 msg="kv cache" device=CUDA0 size="128.0 MiB"
time=2026-02-27T23:18:46.457+01:00 level=INFO source=device.go:256 msg="kv cache" device=CPU size="2.7 GiB"
time=2026-02-27T23:18:46.457+01:00 level=INFO source=device.go:262 msg="compute graph" device=CUDA0 size="12.7 GiB"
time=2026-02-27T23:18:46.457+01:00 level=INFO source=device.go:267 msg="compute graph" device=CPU size="630.8 MiB"
time=2026-02-27T23:18:46.457+01:00 level=INFO source=device.go:272 msg="total memory" size="38.4 GiB"
time=2026-02-27T23:18:46.457+01:00 level=INFO source=sched.go:566 msg="loaded runners" count=1
time=2026-02-27T23:18:46.457+01:00 level=INFO source=server.go:1350 msg="waiting for llama runner to start responding"
time=2026-02-27T23:18:46.457+01:00 level=INFO source=server.go:1384 msg="waiting for server to become available" status="llm server loading model"
time=2026-02-27T23:18:49.728+01:00 level=INFO source=server.go:1388 msg="llama runner started in 6.25 seconds"
CUDA error: invalid argument
  current device: 0, in function ggml_cuda_cpy at C:\a\ollama\ollama\ml\backend\ggml\ggml\src\ggml-cuda\cpy.cu:438
  cudaMemcpyAsyncReserve(src1_ddc, src0_ddc, ggml_nbytes(src0), cudaMemcpyDeviceToDevice, main_stream)
C:\a\ollama\ollama\ml\backend\ggml\ggml\src\ggml-cuda\ggml-cuda.cu:94: CUDA error
time=2026-02-27T23:18:49.951+01:00 level=ERROR source=server.go:1610 msg="post predict" error="Post \"http://127.0.0.1:59101/completion\": read tcp 127.0.0.1:59106->127.0.0.1:59101: wsarecv: An existing connection was forcibly closed by the remote host."
[GIN] 2026/02/27 - 23:18:49 | 500 |    6.9285033s |       127.0.0.1 | POST     "/api/chat"
time=2026-02-27T23:18:51.195+01:00 level=ERROR source=server.go:304 msg="llama runner terminated" error="exit status 1"

Running with CPU only does work:

OLLAMA_LLM_LIBRARY=cpu_avx2

@chr0n1x commented on GitHub (Feb 27, 2026):

And here I don't even get past these logs:

time=2026-02-27T23:07:20.810Z level=INFO source=server.go:431 msg="starting runner" cmd="/usr/bin/ollama runner --ollama-engine --port 39499"
time=2026-02-27T23:07:21.041Z level=WARN source=cpu_linux.go:130 msg="failed to parse CPU allowed micro secs" error="strconv.ParseInt: parsing \"max\": invalid syntax"
time=2026-02-27T23:07:21.195Z level=INFO source=server.go:247 msg="enabling flash attention"
time=2026-02-27T23:07:21.195Z level=INFO source=server.go:431 msg="starting runner" cmd="/usr/bin/ollama runner --ollama-engine --model /root/.ollama/models/blobs/sha256-7935de6e08f9444536d0edcacf19d2166b34bef8ddb4ac7ce9263ff5cad0693b --port 45973"
time=2026-02-27T23:07:21.196Z level=INFO source=sched.go:491 msg="system memory" total="30.0 GiB" free="13.3 MiB" free_swap="0 B"
time=2026-02-27T23:07:21.196Z level=INFO source=sched.go:498 msg="gpu memory" id=GPU-5068c5ff-0f1d-ec77-edbc-85cca4831d5e library=CUDA available="23.1 GiB" free="23.6 GiB" minimum="457.0 MiB" overhead="0 B"

"failed to parse CPU allowed micro secs" 😕


@rick-github commented on GitHub (Feb 27, 2026):

"failed to parse CPU allowed micro secs" 😕

Not a relevant issue; it's just a warning that a string conversion was unsuccessful.


@yossiovadia commented on GitHub (Feb 28, 2026):

Root Cause Analysis and Verified Fix

I debugged this end-to-end and identified the exact root cause. The bug is in ggml/src/ggml-cuda/cpy.cu line 438.

Root Cause

When the model doesn't fully fit in VRAM (e.g. Qwen 3.5 35B-A3B on a 24GB RTX 4090 offloads 38/41 layers to GPU, remaining on CPU), the CUDA backend performs a tensor copy using:

CUDA_CHECK(cudaMemcpyAsync(src1_ddc, src0_ddc, ggml_nbytes(src0),
           cudaMemcpyDeviceToDevice, main_stream));

The problem is cudaMemcpyDeviceToDevice — it tells CUDA both pointers are device (GPU) memory. But when layers are split across GPU/CPU, some tensors (recurrent state from the KV cache for CPU-offloaded layers) reside in host (CPU) memory. Passing a host pointer with cudaMemcpyDeviceToDevice causes CUDA error: invalid argument.

This manifests on the second prompt because the first prompt populates the KV cache, and the second prompt triggers a copy/restore of that cached state — including the CPU-resident portions.

Note: cudaMemcpyAsync in this codebase is actually macro-replaced with cudaMemcpyAsyncReserve (see common.cuh), which skips the actual memcpy during graph reservation but executes it during real computation — that's why the error occurs at runtime, not during graph setup.

The Fix

Change cudaMemcpyDeviceToDevice to cudaMemcpyDefault at cpy.cu:438:

- CUDA_CHECK(cudaMemcpyAsync(src1_ddc, src0_ddc, ggml_nbytes(src0), cudaMemcpyDeviceToDevice, main_stream));
+ CUDA_CHECK(cudaMemcpyAsync(src1_ddc, src0_ddc, ggml_nbytes(src0), cudaMemcpyDefault, main_stream));

cudaMemcpyDefault automatically detects whether each pointer is host or device memory and performs the correct transfer type (D2D, H2D, D2H, or H2H). This is safe for all cases — including the existing fully-on-GPU case — with negligible overhead (CUDA resolves the pointer type via its internal page table lookup).

Before the Fix (crash on second prompt)

offloading 38 repeating layers to GPU
offloading output layer to CPU

CUDA error: invalid argument
  current device: 0, in function ggml_cuda_cpy at cpy.cu:438
  cudaMemcpyAsyncReserve(src1_ddc, src0_ddc, ggml_nbytes(src0), cudaMemcpyDeviceToDevice, main_stream)

After the Fix (multiple prompts succeed)

load_backend: loaded CUDA backend from ...\cuda_v12\ggml-cuda.dll
offloading 39 repeating layers to GPU
offloading output layer to CPU

>>> Hello, how are you?
Hello! I'm doing well, thank you for asking...

>>> Talk to me like you're The Dude from The Big Lebowski
Yeah, well, hello there. Come on in, don't mind the floor...

>>> Now explain quantum physics as The Dude
So, you want the lowdown on quantum physics, man? Alright. Sit down...

No CUDA errors. Multiple consecutive multi-turn prompts succeed.

Verified On

  • GPU: NVIDIA RTX 4090 (24GB), compute 8.9, CUDA driver 13.0
  • Model: qwen3.5:35b-a3b (Q4_K_M), 39/41 layers on GPU, flash attention enabled
  • Ollama: v0.17.4
  • Before fix: Crashes on second prompt every time
  • After fix: Multiple consecutive prompts succeed without error

I'm filing an upstream PR on ggml-org/llama.cpp with this one-line fix.


@without-ordinary commented on GitHub (Feb 28, 2026):

OS: Ubuntu 24.04.3 LTS
Ollama version: 0.17.4
Model: qwen3.5:122b-a10b and qwen3.5:35b
GPUs: 8x RTX 4090 24GB
Open WebUI Version: 0.8.5

Ollama service environment:

OLLAMA_SCHED_SPREAD=1
GGML_CUDA_ENABLE_UNIFIED_MEMORY=1
OLLAMA_NUM_PARALLEL=8
OLLAMA_CONTEXT_LENGTH=131000 # workaround for https://github.com/ollama/ollama/issues/13887

Prompt: "Describe the image."
Context: random image

Tried qwen3.5:35b and it works as expected.

qwen3.5:122b-a10b by itself crashes with:

current device: 0, in function ggml_cuda_cpy at //ml/backend/ggml/ggml/src/ggml-cuda/cpy.cu:438
cudaMemcpyAsyncReserve(src1_ddc, src0_ddc, ggml_nbytes(src0), cudaMemcpyDeviceToDevice, main_stream)
//ml/backend/ggml/ggml/src/ggml-cuda/ggml-cuda.cu:94: CUDA error

Full run server log: [ollama_server_log_144444.txt](https://github.com/user-attachments/files/25618665/ollama_server_log_144444.txt)

VRAM usage is quite low when qwen3.5:122b-a10b is loaded.

With 35b loaded first and then 122b loaded alongside it, VRAM usage across all 8 GPUs is between 14-17GB. Oddly, with both loaded in that order, 122b does not crash with that error on sampling. This behavior is similar to what I saw on https://github.com/ollama/ollama/issues/13887#issuecomment-3943173896, where having another, working model loaded first causes the misbehaving model to run without the issue.

Edit: On further testing, the odd behavior of it working when loading 35b first seems to only work sometimes. I am unable to replicate it consistently on additional runs.


@baileikyo commented on GitHub (Feb 28, 2026):

[Quotes @yossiovadia's "Root Cause Analysis and Verified Fix" comment above in full.]

Can you tell me exactly which files I need to replace?


@yossiovadia commented on GitHub (Mar 1, 2026):

Opened a fix for this: #14536

The root cause is in deltanet.go — the SetInplace call in the DeltaNet chunked attention loop creates GGML_OP_SET with a view of a buffer-less intermediate tensor. With partial offload, the ggml scheduler can't determine the correct backend for that view and leaks GPU assignments from neighboring layers into CPU-layer ops, causing the cudaMemcpyDeviceToDevice crash.

The fix replaces SetInplace with Concat — one file, pure Go, no C/CUDA changes.


@jmorganca commented on GitHub (Mar 2, 2026):

This was fixed by https://github.com/ollama/ollama/pull/14541 and will be released in 0.17.5 soon. Sorry for the issue and thanks for all the reports and help solving it.

Reference: github-starred/ollama#9381