[GH-ISSUE #14491] After updating from version 0.15.4 to 0.17.4, a previously working model fails to load #35157

Closed
opened 2026-04-22 19:27:30 -05:00 by GiteaMirror · 3 comments
Owner

Originally created by @LiAI-tech on GitHub (Feb 27, 2026).
Original GitHub issue: https://github.com/ollama/ollama/issues/14491

What is the issue?

root@c64755252568:/# ollama run glm-4.7-flash:q4_K_M
Error: 500 Internal Server Error: model failed to load, this may be due to resource limitations or an internal error, check ollama server logs for details


ggml_cuda_init: found 2 CUDA devices:
Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes, ID: GPU-afbaaa7b-a1e8-fa9b-9518-5c88a787c217
Device 1: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes, ID: GPU-f5f932cd-d178-8e5a-fb80-9f5429163882
load_backend: loaded CUDA backend from /usr/lib/ollama/cuda_v12/libggml-cuda.so
time=2026-02-27T06:14:40.715Z level=INFO source=ggml.go:104 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX_VNNI=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.BMI2=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 CUDA.0.ARCHS=500,520,600,610,700,750,800,860,890,900,1200 CUDA.0.USE_GRAPHS=1 CUDA.0.PEER_MAX_BATCH_SIZE=128 CUDA.1.ARCHS=500,520,600,610,700,750,800,860,890,900,1200 CUDA.1.USE_GRAPHS=1 CUDA.1.PEER_MAX_BATCH_SIZE=128 compiler=cgo(gcc)
time=2026-02-27T06:14:42.170Z level=INFO source=runner.go:1284 msg=load request="{Operation:fit LoraPath:[] Parallel:4 BatchSize:512 FlashAttention:Enabled KvSize:811008 KvCacheType: NumThreads:8 GPULayers:19[ID:GPU-f5f932cd-d178-8e5a-fb80-9f5429163882 Layers:9(28..36) ID:GPU-afbaaa7b-a1e8-fa9b-9518-5c88a787c217 Layers:10(37..46)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2026-02-27T06:14:43.425Z level=INFO source=runner.go:1284 msg=load request="{Operation:fit LoraPath:[] Parallel:4 BatchSize:512 FlashAttention:Enabled KvSize:811008 KvCacheType: NumThreads:8 GPULayers:18[ID:GPU-f5f932cd-d178-8e5a-fb80-9f5429163882 Layers:10(29..38) ID:GPU-afbaaa7b-a1e8-fa9b-9518-5c88a787c217 Layers:8(39..46)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2026-02-27T06:14:44.630Z level=INFO source=runner.go:1284 msg=load request="{Operation:alloc LoraPath:[] Parallel:4 BatchSize:512 FlashAttention:Enabled KvSize:811008 KvCacheType: NumThreads:8 GPULayers:18[ID:GPU-f5f932cd-d178-8e5a-fb80-9f5429163882 Layers:10(29..38) ID:GPU-afbaaa7b-a1e8-fa9b-9518-5c88a787c217 Layers:8(39..46)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2026-02-27T06:14:58.014Z level=INFO source=runner.go:1284 msg=load request="{Operation:commit LoraPath:[] Parallel:4 BatchSize:512 FlashAttention:Enabled KvSize:811008 KvCacheType: NumThreads:8 GPULayers:18[ID:GPU-f5f932cd-d178-8e5a-fb80-9f5429163882 Layers:10(29..38) ID:GPU-afbaaa7b-a1e8-fa9b-9518-5c88a787c217 Layers:8(39..46)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2026-02-27T06:14:58.014Z level=INFO source=ggml.go:482 msg="offloading 18 repeating layers to GPU"
time=2026-02-27T06:14:58.014Z level=INFO source=ggml.go:486 msg="offloading output layer to CPU"
time=2026-02-27T06:14:58.014Z level=INFO source=ggml.go:494 msg="offloaded 18/48 layers to GPU"
time=2026-02-27T06:14:58.014Z level=INFO source=device.go:240 msg="model weights" device=CUDA0 size="3.1 GiB"
time=2026-02-27T06:14:58.014Z level=INFO source=device.go:240 msg="model weights" device=CUDA1 size="3.9 GiB"
time=2026-02-27T06:14:58.014Z level=INFO source=device.go:245 msg="model weights" device=CPU size="10.7 GiB"
time=2026-02-27T06:14:58.014Z level=INFO source=device.go:251 msg="kv cache" device=CUDA0 size="13.1 GiB"
time=2026-02-27T06:14:58.014Z level=INFO source=device.go:251 msg="kv cache" device=CUDA1 size="16.4 GiB"
time=2026-02-27T06:14:58.014Z level=INFO source=device.go:256 msg="kv cache" device=CPU size="47.7 GiB"
time=2026-02-27T06:14:58.014Z level=INFO source=device.go:262 msg="compute graph" device=CUDA0 size="4.2 GiB"
time=2026-02-27T06:14:58.014Z level=INFO source=device.go:262 msg="compute graph" device=CUDA1 size="992.9 MiB"
time=2026-02-27T06:14:58.014Z level=INFO source=device.go:267 msg="compute graph" device=CPU size="4.0 MiB"
time=2026-02-27T06:14:58.014Z level=INFO source=device.go:272 msg="total memory" size="100.1 GiB"
time=2026-02-27T06:14:58.014Z level=INFO source=sched.go:566 msg="loaded runners" count=1
time=2026-02-27T06:14:58.014Z level=INFO source=server.go:1350 msg="waiting for llama runner to start responding"
time=2026-02-27T06:14:58.014Z level=INFO source=server.go:1384 msg="waiting for server to become available" status="llm server loading model"
time=2026-02-27T06:18:35.184Z level=INFO source=server.go:1384 msg="waiting for server to become available" status="llm server not responding"
time=2026-02-27T06:18:35.446Z level=WARN source=server.go:1357 msg="client connection closed before server finished loading, aborting load"
time=2026-02-27T06:18:35.446Z level=ERROR source=sched.go:572 msg="error loading llama server" error="timed out waiting for llama runner to start: context canceled"
[GIN] 2026/02/27 - 06:18:35 | 499 | 10m15s | 192.168.1.68 | POST "/api/chat"
time=2026-02-27T06:18:39.072Z level=INFO source=server.go:431 msg="starting runner" cmd="/usr/bin/ollama runner --ollama-engine --port 40065"
time=2026-02-27T06:18:39.601Z level=INFO source=server.go:431 msg="starting runner" cmd="/usr/bin/ollama runner --ollama-engine --port 37943"
time=2026-02-27T06:18:39.852Z level=INFO source=server.go:431 msg="starting runner" cmd="/usr/bin/ollama runner --ollama-engine --port 41141"
time=2026-02-27T06:18:40.102Z level=INFO source=server.go:431 msg="starting runner" cmd="/usr/bin/ollama runner --ollama-engine --port 41769"
time=2026-02-27T06:18:40.351Z level=INFO source=server.go:431 msg="starting runner" cmd="/usr/bin/ollama runner --ollama-engine --port 42819"

Relevant log output


OS

No response

GPU

No response

CPU

No response

Ollama version

No response

GiteaMirror added the bug label 2026-04-22 19:27:30 -05:00
Author
Owner

@rick-github commented on GitHub (Feb 27, 2026):

The log does not show a `500 Internal Server Error` entry. The error shown is from the client disconnecting because the model load took too long. This may be due to the large KV cache that has been configured: `Parallel:4 KvSize:811008`.
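
If that is the cause, the cache can be capped before loading. A minimal sketch (the values are illustrative, and it assumes the server honors the standard `OLLAMA_CONTEXT_LENGTH` and `OLLAMA_NUM_PARALLEL` variables; `KvSize` here is the context length times the parallel slot count, 202752 × 4 = 811008):

```shell
# Restart the server with a smaller context window and a single
# parallel slot; the KV cache scales with num_ctx * num_parallel,
# so 16384 * 1 is roughly 50x smaller than 202752 * 4.
OLLAMA_CONTEXT_LENGTH=16384 OLLAMA_NUM_PARALLEL=1 ollama serve
```

The same cap can be applied per request with `options.num_ctx` on `/api/chat`, or interactively with `/set parameter num_ctx 16384` inside `ollama run`.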

Author
Owner

@LiAI-tech commented on GitHub (Feb 28, 2026):

> The log does not show a `500 Internal Server Error` entry. The error shown is from the client disconnecting because the model load took too long. This may be due to the large KV cache that has been configured: `Parallel:4 KvSize:811008`.

Yes, the model that originally occupied 20 GB of VRAM now requires 107 GB.

Image: https://github.com/user-attachments/assets/6b372c09-8dbf-4b88-8dc2-6523a2c12e08
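
For scale, the breakdown in the log accounts for the whole 100.1 GiB: model weights 3.1 + 3.9 + 10.7 = 17.7 GiB, KV cache 13.1 + 16.4 + 47.7 = 77.2 GiB, and compute graphs about 5.2 GiB, so nearly all of the growth over 0.15.4 is the KV cache sized for `KvSize:811008`.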
Author
Owner

@rick-github commented on GitHub (Feb 28, 2026):

#14116


Reference: github-starred/ollama#35157