[GH-ISSUE #14519] qwen3.5:122b Error: 500 with ollama 0.17.4 #35181

Closed
opened 2026-04-22 19:30:21 -05:00 by GiteaMirror · 5 comments

Originally created by @brantzh on GitHub (Mar 1, 2026).
Original GitHub issue: https://github.com/ollama/ollama/issues/14519

What is the issue?

qwen3.5:122b
ollama 0.17.4
Windows 11 IoT ENT LTSC 24H2
AMD Ryzen™ Chipset Driver Release Notes 8.01.20.513
AMD RYZEN AI MAX+ 395 w/ Radeon 8060S with 64G/64G setting

The first prompt in a session is answered successfully, but the second prompt returns an error. This happens consistently across multiple attempts.

ollama run qwen3.5:122b

>>> help
Thinking...
Okay, the user just said "help". That's pretty vague. Let me think. They might need assistance with something, but
I don't know what exactly. Since there's no context, maybe they're looking for general help or have a specific
issue. I should ask them to clarify what they need help with. Let me check if there's any previous conversation
history... Oh wait, this seems like the start of a conversation. No prior messages. So I need to prompt them to
provide more details. Maybe list some common areas where people usually need help, like technical issues, advice,
information, etc. But keep it friendly and open-ended. Let me make sure I don't assume too much. Just a simple
prompt asking them to specify what they need.
...done thinking.

Of course! I'd be happy to help. Could you please tell me more about what you need assistance with? For example:

  • Tech issues (software, devices, etc.)
  • Learning (explanations, tutorials, study tips)
  • Creative projects (writing, brainstorming, design)
  • Problem-solving (decisions, advice, step-by-step guidance)
  • Something else?

Just let me know! 😊

>>> help
Error: 500 Internal Server Error: model runner has unexpectedly stopped, this may be due to resource limitations or an internal error, check ollama server logs for details

Are there any other debugging methods?
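
One way to get more detail, assuming the documented OLLAMA_DEBUG environment variable applies to this build, is to restart the server with debug logging enabled and reproduce the failure, for example from PowerShell:

# Quit the Ollama tray app first, then start the server with verbose logging
$env:OLLAMA_DEBUG = "1"
ollama serve
# In a second terminal, reproduce the crash on the second prompt
ollama run qwen3.5:122b
# server.log is written under the local app data folder ($env:LOCALAPPDATA\Ollama)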

Relevant log output

[GIN] 2026/03/01 - 09:50:29 | 200 |            0s |       127.0.0.1 | HEAD     "/"
[GIN] 2026/03/01 - 09:50:29 | 200 |    135.5324ms |       127.0.0.1 | POST     "/api/show"
[GIN] 2026/03/01 - 09:50:29 | 200 |    130.3221ms |       127.0.0.1 | POST     "/api/show"
time=2026-03-01T09:50:29.736+08:00 level=INFO source=server.go:431 msg="starting runner" cmd="C:\\Users\\ xx\\AppData\\Local\\Programs\\Ollama\\ollama.exe runner --ollama-engine --port 58533"
time=2026-03-01T09:50:30.092+08:00 level=INFO source=cpu_windows.go:148 msg=packages count=1
time=2026-03-01T09:50:30.092+08:00 level=INFO source=cpu_windows.go:195 msg="" package=0 cores=16 efficiency=0 threads=32
time=2026-03-01T09:50:30.166+08:00 level=INFO source=server.go:247 msg="enabling flash attention"
time=2026-03-01T09:50:30.168+08:00 level=INFO source=server.go:431 msg="starting runner" cmd="C:\\Users\\xx\\AppData\\Local\\Programs\\Ollama\\ollama.exe runner --ollama-engine --model D:\\Models\\blobs\\sha256-93c83617a40560a61cda911ee327efdb5b5fbd39caa8b777a4ec565c0af1af3d --port 58538"
time=2026-03-01T09:50:30.172+08:00 level=INFO source=sched.go:491 msg="system memory" total="63.6 GiB" free="58.1 GiB" free_swap="245.2 GiB"
time=2026-03-01T09:50:30.172+08:00 level=INFO source=sched.go:498 msg="gpu memory" id=0 library=ROCm available="63.0 GiB" free="63.4 GiB" minimum="457.0 MiB" overhead="0 B"
time=2026-03-01T09:50:30.172+08:00 level=INFO source=server.go:757 msg="loading model" "model layers"=49 requested=-1
time=2026-03-01T09:50:30.199+08:00 level=INFO source=runner.go:1411 msg="starting ollama engine"
time=2026-03-01T09:50:30.204+08:00 level=INFO source=runner.go:1446 msg="Server listening on 127.0.0.1:58538"
time=2026-03-01T09:50:30.205+08:00 level=INFO source=runner.go:1284 msg=load request="{Operation:fit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:Enabled KvSize:16384 KvCacheType: NumThreads:16 GPULayers:49[ID:0 Layers:49(0..48)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2026-03-01T09:50:30.236+08:00 level=INFO source=ggml.go:136 msg="" architecture=qwen35moe file_type=Q4_K_M name="" description="" num_tensors=2105 num_key_values=57
load_backend: loaded CPU backend from C:\Users\xx\AppData\Local\Programs\Ollama\lib\ollama\ggml-cpu-icelake.dll
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
  Device 0: AMD Radeon(TM) 8060S Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32, ID: 0
load_backend: loaded ROCm backend from C:\Users\xx\AppData\Local\Programs\Ollama\lib\ollama\rocm\ggml-hip.dll
time=2026-03-01T09:50:30.275+08:00 level=INFO source=ggml.go:104 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.BMI2=1 CPU.0.AVX512=1 CPU.0.AVX512_VBMI=1 CPU.0.AVX512_VNNI=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 ROCm.0.NO_VMM=1 ROCm.0.NO_PEER_COPY=1 ROCm.0.PEER_MAX_BATCH_SIZE=128 compiler=cgo(clang)
time=2026-03-01T09:50:30.815+08:00 level=INFO source=runner.go:1284 msg=load request="{Operation:fit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:Enabled KvSize:16384 KvCacheType: NumThreads:16 GPULayers:38[ID:0 Layers:38(10..47)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2026-03-01T09:50:31.113+08:00 level=INFO source=runner.go:1284 msg=load request="{Operation:alloc LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:Enabled KvSize:16384 KvCacheType: NumThreads:16 GPULayers:38[ID:0 Layers:38(10..47)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2026-03-01T09:50:34.463+08:00 level=INFO source=runner.go:1284 msg=load request="{Operation:commit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:Enabled KvSize:16384 KvCacheType: NumThreads:16 GPULayers:38[ID:0 Layers:38(10..47)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2026-03-01T09:50:34.463+08:00 level=INFO source=device.go:240 msg="model weights" device=ROCm0 size="57.5 GiB"
time=2026-03-01T09:50:34.463+08:00 level=INFO source=device.go:245 msg="model weights" device=CPU size="18.3 GiB"
time=2026-03-01T09:50:34.463+08:00 level=INFO source=device.go:251 msg="kv cache" device=ROCm0 size="3.1 GiB"
time=2026-03-01T09:50:34.463+08:00 level=INFO source=device.go:256 msg="kv cache" device=CPU size="892.1 MiB"
time=2026-03-01T09:50:34.463+08:00 level=INFO source=device.go:262 msg="compute graph" device=ROCm0 size="1.5 GiB"
time=2026-03-01T09:50:34.463+08:00 level=INFO source=device.go:267 msg="compute graph" device=CPU size="1.4 GiB"
time=2026-03-01T09:50:34.463+08:00 level=INFO source=device.go:272 msg="total memory" size="82.6 GiB"
time=2026-03-01T09:50:34.463+08:00 level=INFO source=sched.go:566 msg="loaded runners" count=1
time=2026-03-01T09:50:34.463+08:00 level=INFO source=server.go:1350 msg="waiting for llama runner to start responding"
time=2026-03-01T09:50:34.463+08:00 level=INFO source=ggml.go:482 msg="offloading 38 repeating layers to GPU"
time=2026-03-01T09:50:34.463+08:00 level=INFO source=ggml.go:486 msg="offloading output layer to CPU"
time=2026-03-01T09:50:34.463+08:00 level=INFO source=ggml.go:494 msg="offloaded 38/49 layers to GPU"
time=2026-03-01T09:50:34.464+08:00 level=INFO source=server.go:1384 msg="waiting for server to become available" status="llm server loading model"
time=2026-03-01T09:51:13.589+08:00 level=INFO source=server.go:1388 msg="llama runner started in 43.42 seconds"
[GIN] 2026/03/01 - 09:51:13 | 200 |   43.9794571s |       127.0.0.1 | POST     "/api/generate"
[GIN] 2026/03/01 - 09:52:13 | 200 |            0s |       127.0.0.1 | HEAD     "/"
[GIN] 2026/03/01 - 09:52:13 | 200 |       522.2µs |       127.0.0.1 | GET      "/api/ps"
[GIN] 2026/03/01 - 09:52:20 | 200 |   23.5591904s |       127.0.0.1 | POST     "/api/chat"
ggml-backend.cpp:1554: GGML_ASSERT(id >= 0 && id < n_expert) failed
time=2026-03-01T09:52:58.262+08:00 level=ERROR source=server.go:1610 msg="post predict" error="Post \"http://127.0.0.1:58538/completion\": read tcp 127.0.0.1:58542->127.0.0.1:58538: wsarecv: An existing connection was forcibly closed by the remote host."
[GIN] 2026/03/01 - 09:52:58 | 500 |     1.696873s |       127.0.0.1 | POST     "/api/chat"
time=2026-03-01T09:52:59.680+08:00 level=ERROR source=server.go:304 msg="llama runner terminated" error="exit status 1"

OS

Windows

GPU

AMD

CPU

AMD

Ollama version

0.17.4

GiteaMirror added the bug label 2026-04-22 19:30:21 -05:00

@dish72 commented on GitHub (Mar 1, 2026):

Exactly the same issue with Ollama 0.17.4; however, in my case it happens with the smaller local qwen3.5:27b model and an NVIDIA card instead:

qwen3.5:27b
ollama 0.17.4
Windows 10 22H2, Version 10.0.19045.6937
NVIDIA RTX 5070 Ti

As indicated by the OP, the first question is answered properly, but the second question gives an error.

I haven't noticed the issue with other models.

Below is the log output:

time=2026-03-01T16:10:50.209-05:00 level=INFO source=cpu_windows.go:148 msg=packages count=1
time=2026-03-01T16:10:50.209-05:00 level=INFO source=cpu_windows.go:195 msg="" package=0 cores=8 efficiency=0 threads=16
time=2026-03-01T16:10:50.310-05:00 level=INFO source=server.go:247 msg="enabling flash attention"
time=2026-03-01T16:10:50.311-05:00 level=INFO source=server.go:431 msg="starting runner" cmd="C:\\Users\\User\\AppData\\Local\\Programs\\Ollama\\ollama.exe runner --ollama-engine --model E:\\LLMs\\blobs\\sha256-7935de6e08f9444536d0edcacf19d2166b34bef8ddb4ac7ce9263ff5cad0693b --port 53077"
time=2026-03-01T16:10:50.314-05:00 level=INFO source=sched.go:491 msg="system memory" total="31.9 GiB" free="25.0 GiB" free_swap="32.4 GiB"
time=2026-03-01T16:10:50.314-05:00 level=INFO source=sched.go:498 msg="gpu memory" id=GPU-443c2319-5159-5b04-7f7c-57a0aab9e4de library=CUDA available="14.6 GiB" free="15.0 GiB" minimum="457.0 MiB" overhead="0 B"
time=2026-03-01T16:10:50.314-05:00 level=INFO source=server.go:757 msg="loading model" "model layers"=65 requested=-1
time=2026-03-01T16:10:50.340-05:00 level=INFO source=runner.go:1411 msg="starting ollama engine"
time=2026-03-01T16:10:50.341-05:00 level=INFO source=runner.go:1446 msg="Server listening on 127.0.0.1:53077"
time=2026-03-01T16:10:50.346-05:00 level=INFO source=runner.go:1284 msg=load request="{Operation:fit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:Enabled KvSize:4096 KvCacheType: NumThreads:8 GPULayers:65[ID:GPU-443c2319-5159-5b04-7f7c-57a0aab9e4de Layers:65(0..64)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2026-03-01T16:10:50.385-05:00 level=INFO source=ggml.go:136 msg="" architecture=qwen35 file_type=Q4_K_M name="" description="" num_tensors=1307 num_key_values=53
load_backend: loaded CPU backend from C:\Users\User\AppData\Local\Programs\Ollama\lib\ollama\ggml-cpu-haswell.dll
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 5070 Ti, compute capability 12.0, VMM: yes, ID: GPU-443c2319-5159-5b04-7f7c-57a0aab9e4de
load_backend: loaded CUDA backend from C:\Users\User\AppData\Local\Programs\Ollama\lib\ollama\cuda_v13\ggml-cuda.dll
time=2026-03-01T16:10:50.473-05:00 level=INFO source=ggml.go:104 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.BMI2=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 CUDA.0.ARCHS=750,800,860,870,890,900,1000,1030,1100,1200,1210 CUDA.0.USE_GRAPHS=1 CUDA.0.PEER_MAX_BATCH_SIZE=128 compiler=cgo(clang)
time=2026-03-01T16:10:51.176-05:00 level=INFO source=runner.go:1284 msg=load request="{Operation:fit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:Enabled KvSize:4096 KvCacheType: NumThreads:8 GPULayers:51[ID:GPU-443c2319-5159-5b04-7f7c-57a0aab9e4de Layers:51(13..63)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2026-03-01T16:10:51.524-05:00 level=INFO source=runner.go:1284 msg=load request="{Operation:alloc LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:Enabled KvSize:4096 KvCacheType: NumThreads:8 GPULayers:51[ID:GPU-443c2319-5159-5b04-7f7c-57a0aab9e4de Layers:51(13..63)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2026-03-01T16:10:52.394-05:00 level=INFO source=runner.go:1284 msg=load request="{Operation:commit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:Enabled KvSize:4096 KvCacheType: NumThreads:8 GPULayers:51[ID:GPU-443c2319-5159-5b04-7f7c-57a0aab9e4de Layers:51(13..63)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2026-03-01T16:10:52.395-05:00 level=INFO source=ggml.go:482 msg="offloading 51 repeating layers to GPU"
time=2026-03-01T16:10:52.395-05:00 level=INFO source=ggml.go:486 msg="offloading output layer to CPU"
time=2026-03-01T16:10:52.395-05:00 level=INFO source=ggml.go:494 msg="offloaded 51/65 layers to GPU"
time=2026-03-01T16:10:52.395-05:00 level=INFO source=device.go:240 msg="model weights" device=CUDA0 size="10.7 GiB"
time=2026-03-01T16:10:52.395-05:00 level=INFO source=device.go:245 msg="model weights" device=CPU size="5.5 GiB"
time=2026-03-01T16:10:52.395-05:00 level=INFO source=device.go:251 msg="kv cache" device=CUDA0 size="3.1 GiB"
time=2026-03-01T16:10:52.395-05:00 level=INFO source=device.go:256 msg="kv cache" device=CPU size="827.3 MiB"
time=2026-03-01T16:10:52.395-05:00 level=INFO source=device.go:262 msg="compute graph" device=CUDA0 size="799.3 MiB"
time=2026-03-01T16:10:52.395-05:00 level=INFO source=device.go:267 msg="compute graph" device=CPU size="1.0 GiB"
time=2026-03-01T16:10:52.395-05:00 level=INFO source=device.go:272 msg="total memory" size="21.9 GiB"
time=2026-03-01T16:10:52.395-05:00 level=INFO source=sched.go:566 msg="loaded runners" count=1
time=2026-03-01T16:10:52.395-05:00 level=INFO source=server.go:1350 msg="waiting for llama runner to start responding"
time=2026-03-01T16:10:52.396-05:00 level=INFO source=server.go:1384 msg="waiting for server to become available" status="llm server loading model"
time=2026-03-01T16:10:55.401-05:00 level=INFO source=server.go:1388 msg="llama runner started in 5.09 seconds"
[GIN] 2026/03/01 - 16:10:55 | 200 |    5.5356548s |       127.0.0.1 | POST     "/api/generate"
[GIN] 2026/03/01 - 16:17:43 | 200 |         5m35s |       127.0.0.1 | POST     "/api/chat"
CUDA error: invalid argument
  current device: 0, in function ggml_cuda_cpy at C:\a\ollama\ollama\ml\backend\ggml\ggml\src\ggml-cuda\cpy.cu:438
  cudaMemcpyAsyncReserve(src1_ddc, src0_ddc, ggml_nbytes(src0), cudaMemcpyDeviceToDevice, main_stream)
C:\a\ollama\ollama\ml\backend\ggml\ggml\src\ggml-cuda\ggml-cuda.cu:94: CUDA error
time=2026-03-01T16:19:59.332-05:00 level=ERROR source=server.go:1610 msg="post predict" error="Post \"http://127.0.0.1:53077/completion\": read tcp 127.0.0.1:53081->127.0.0.1:53077: wsarecv: An existing connection was forcibly closed by the remote host."
[GIN] 2026/03/01 - 16:19:59 | 500 |         1m28s |       127.0.0.1 | POST     "/api/chat"
time=2026-03-01T16:19:59.590-05:00 level=ERROR source=server.go:304 msg="llama runner terminated" error="exit status 1"

@tririver commented on GitHub (Mar 2, 2026):

Same for me. I noticed that if the model can be loaded 100% into GPU memory, there's no problem. If the model is split between GPU memory and system memory (for an RTX 4090, that means a large context for qwen3.5-27b-q4km, or any context size for qwen3.5-35b-a3b), I encounter exactly the same problem.
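
A minimal way to probe that observation (an assumed workaround sketch, not a confirmed fix) is to pin the placement with the standard num_ctx / num_gpu request options, so the model sits either fully in VRAM or fully on the CPU, and check whether the second request still crashes. For example, with POSIX-style shell quoting:

# Shrink the context so the 27b model fits entirely in VRAM
curl http://localhost:11434/api/generate -d '{"model": "qwen3.5:27b", "prompt": "help", "options": {"num_ctx": 2048}}'
# Or force CPU-only inference to rule the GPU path out
curl http://localhost:11434/api/generate -d '{"model": "qwen3.5:27b", "prompt": "help", "options": {"num_gpu": 0}}'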


@marburps commented on GitHub (Mar 2, 2026):

Exactly the same error with Qwen3.5, the latest Ollama, and an RTX 5080 (64 GB RAM, 16 GB VRAM).


@mizaar666 commented on GitHub (Mar 2, 2026):

It is fixed in 0.17.5.


@prurigro commented on GitHub (Mar 4, 2026):

I'm seeing this in 0.17.5 when using the Vulkan backend. It's fine with ROCm.

EDIT: It works with Vulkan after updating to 0.17.6!

Reference: github-starred/ollama#35181