[GH-ISSUE #12846] v0.12.7-Qwen3-VL-30b Memory management or misunderstanding of internal workings? #8511

Closed
opened 2026-04-12 21:12:16 -05:00 by GiteaMirror · 1 comment

Originally created by @Burnarz on GitHub (Oct 30, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/12846

What is the issue?

Hi,

I'm playing with Qwen3-VL:30b on an RTX 3090 with Ollama 0.12.7.

With the default settings for the model, I got this VRAM usage:

![VRAM usage](https://github.com/user-attachments/assets/317d95e8-43a9-4de2-b717-58c6b0738e59)
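(The screenshot comes from a GPU monitor; the same numbers can be read from a terminal, for example with nvidia-smi — a minimal sketch, not necessarily the exact tool used for the screenshot:)

```shell
# Report per-GPU memory usage; values should match the screenshot above
nvidia-smi --query-gpu=name,memory.used,memory.total --format=csv
```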

This is the `ollama ps` output:
```
NAME            ID              SIZE     PROCESSOR          CONTEXT    UNTIL
qwen3-vl:30b    eda0be100877    25 GB    23%/77% CPU/GPU    8192       Forever
```

And this log:
```
oct. 30 00:11:38 jarvis-server ollama[556420]: [GIN] 2025/10/30 - 00:11:38 | 200 | 1.865926ms | 127.0.0.1 | HEAD "/"
oct. 30 00:11:38 jarvis-server ollama[556420]: [GIN] 2025/10/30 - 00:11:38 | 200 | 84.801208ms | 127.0.0.1 | POST "/api/show"
oct. 30 00:11:38 jarvis-server ollama[556420]: time=2025-10-30T00:11:38.288Z level=INFO source=server.go:385 msg="starting runner" cmd="/usr/local/bin/ollama runner --ollama-engine --port 44625"
oct. 30 00:11:38 jarvis-server ollama[556420]: time=2025-10-30T00:11:38.732Z level=INFO source=server.go:215 msg="enabling flash attention"
oct. 30 00:11:38 jarvis-server ollama[556420]: time=2025-10-30T00:11:38.732Z level=INFO source=server.go:385 msg="starting runner" cmd="/usr/local/bin/ollama runner --ollama-engine --model /usr/share/ollama/.ollama/models/blobs/sha256-b1da6f96a2e40e5db05b6066d799c69411225b336bfa20ef1b002c223ed4b190 --port 42059"
oct. 30 00:11:38 jarvis-server ollama[556420]: time=2025-10-30T00:11:38.732Z level=INFO source=server.go:638 msg="loading model" "model layers"=49 requested=-1
oct. 30 00:11:38 jarvis-server ollama[556420]: time=2025-10-30T00:11:38.732Z level=INFO source=server.go:643 msg="system memory" total="15.5 GiB" free="14.1 GiB" free_swap="15.0 GiB"
oct. 30 00:11:38 jarvis-server ollama[556420]: time=2025-10-30T00:11:38.732Z level=INFO source=server.go:650 msg="gpu memory" id=GPU-d0364f00-33d1-a9a6-d173-85839f5f872c library=CUDA available="23.1 GiB" free="23.6 GiB" minimum="457.0 MiB" overhead="0 B"
oct. 30 00:11:38 jarvis-server ollama[556420]: time=2025-10-30T00:11:38.741Z level=INFO source=runner.go:1337 msg="starting ollama engine"
oct. 30 00:11:38 jarvis-server ollama[556420]: time=2025-10-30T00:11:38.742Z level=INFO source=runner.go:1372 msg="Server listening on 127.0.0.1:42059"
oct. 30 00:11:38 jarvis-server ollama[556420]: time=2025-10-30T00:11:38.745Z level=INFO source=runner.go:1210 msg=load request="{Operation:fit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:true KvSize:8192 KvCacheType: NumThreads:6 GPULayers:49[ID:GPU-d0364f00-33d1-a9a6-d173-85839f5f872c Layers:49(0..48)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
oct. 30 00:11:38 jarvis-server ollama[556420]: time=2025-10-30T00:11:38.777Z level=INFO source=ggml.go:135 msg="" architecture=qwen3vlmoe file_type=Q4_K_M name="" description="" num_tensors=1038 num_key_values=43
oct. 30 00:11:38 jarvis-server ollama[556420]: load_backend: loaded CPU backend from /usr/local/lib/ollama/libggml-cpu-haswell.so
oct. 30 00:11:38 jarvis-server ollama[556420]: ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
oct. 30 00:11:38 jarvis-server ollama[556420]: ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
oct. 30 00:11:38 jarvis-server ollama[556420]: ggml_cuda_init: found 1 CUDA devices:
oct. 30 00:11:38 jarvis-server ollama[556420]: Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes, ID: GPU-d0364f00-33d1-a9a6-d173-85839f5f872c
oct. 30 00:11:38 jarvis-server ollama[556420]: load_backend: loaded CUDA backend from /usr/local/lib/ollama/cuda_v12/libggml-cuda.so
oct. 30 00:11:38 jarvis-server ollama[556420]: time=2025-10-30T00:11:38.955Z level=INFO source=ggml.go:104 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.BMI2=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 CUDA.0.ARCHS=500,520,600,610,700,750,800,860,890,900,1200 CUDA.0.USE_GRAPHS=1 CUDA.0.PEER_MAX_BATCH_SIZE=128 compiler=cgo(gcc)
oct. 30 00:11:39 jarvis-server ollama[556420]: time=2025-10-30T00:11:39.698Z level=INFO source=runner.go:1210 msg=load request="{Operation:fit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:true KvSize:8192 KvCacheType: NumThreads:6 GPULayers:48[ID:GPU-d0364f00-33d1-a9a6-d173-85839f5f872c Layers:48(0..47)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
oct. 30 00:11:40 jarvis-server ollama[556420]: time=2025-10-30T00:11:40.021Z level=INFO source=runner.go:1210 msg=load request="{Operation:alloc LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:true KvSize:8192 KvCacheType: NumThreads:6 GPULayers:48[ID:GPU-d0364f00-33d1-a9a6-d173-85839f5f872c Layers:48(0..47)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
oct. 30 00:11:40 jarvis-server ollama[556420]: time=2025-10-30T00:11:40.427Z level=INFO source=runner.go:1210 msg=load request="{Operation:commit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:true KvSize:8192 KvCacheType: NumThreads:6 GPULayers:48[ID:GPU-d0364f00-33d1-a9a6-d173-85839f5f872c Layers:48(0..47)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
oct. 30 00:11:40 jarvis-server ollama[556420]: time=2025-10-30T00:11:40.427Z level=INFO source=ggml.go:481 msg="offloading 48 repeating layers to GPU"
oct. 30 00:11:40 jarvis-server ollama[556420]: time=2025-10-30T00:11:40.427Z level=INFO source=ggml.go:485 msg="offloading output layer to CPU"
oct. 30 00:11:40 jarvis-server ollama[556420]: time=2025-10-30T00:11:40.427Z level=INFO source=ggml.go:493 msg="offloaded 48/49 layers to GPU"
oct. 30 00:11:40 jarvis-server ollama[556420]: time=2025-10-30T00:11:40.427Z level=INFO source=device.go:212 msg="model weights" device=CUDA0 size="16.9 GiB"
oct. 30 00:11:40 jarvis-server ollama[556420]: time=2025-10-30T00:11:40.427Z level=INFO source=device.go:217 msg="model weights" device=CPU size="1.4 GiB"
oct. 30 00:11:40 jarvis-server ollama[556420]: time=2025-10-30T00:11:40.427Z level=INFO source=device.go:223 msg="kv cache" device=CUDA0 size="768.0 MiB"
oct. 30 00:11:40 jarvis-server ollama[556420]: time=2025-10-30T00:11:40.427Z level=INFO source=device.go:234 msg="compute graph" device=CUDA0 size="663.2 MiB"
oct. 30 00:11:40 jarvis-server ollama[556420]: time=2025-10-30T00:11:40.427Z level=INFO source=device.go:239 msg="compute graph" device=CPU size="4.2 GiB"
oct. 30 00:11:40 jarvis-server ollama[556420]: time=2025-10-30T00:11:40.427Z level=INFO source=device.go:244 msg="total memory" size="23.8 GiB"
oct. 30 00:11:40 jarvis-server ollama[556420]: time=2025-10-30T00:11:40.427Z level=INFO source=sched.go:493 msg="loaded runners" count=1
oct. 30 00:11:40 jarvis-server ollama[556420]: time=2025-10-30T00:11:40.427Z level=INFO source=server.go:1236 msg="waiting for llama runner to start responding"
oct. 30 00:11:40 jarvis-server ollama[556420]: time=2025-10-30T00:11:40.438Z level=INFO source=server.go:1270 msg="waiting for server to become available" status="llm server loading model"
oct. 30 00:11:45 jarvis-server ollama[556420]: time=2025-10-30T00:11:45.955Z level=INFO source=server.go:1274 msg="llama runner started in 7.22 seconds"
oct. 30 00:11:45 jarvis-server ollama[556420]: [GIN] 2025/10/30 - 00:11:45 | 200 | 7.77802376s | 127.0.0.1 | POST "/api/generate"
oct. 30 00:12:12 jarvis-server ollama[556420]: [GIN] 2025/10/30 - 00:12:12 | 200 | 18.269250086s | 127.0.0.1 | POST "/api/chat"
```

I tried forcing num_gpu to 49, but got this result:
Error: 500 Internal Server Error: memory layout cannot be allocated with num_gpu = 49
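For reference, num_gpu can be forced either with `PARAMETER num_gpu 49` in a Modelfile, with `/set parameter num_gpu 49` inside `ollama run`, or per request through the API options field; the request below is a minimal sketch of the latter, not necessarily the exact client call I used:

```shell
# Minimal sketch: force all 49 layers onto the GPU for a single /api/generate request
curl http://localhost:11434/api/generate -d '{
  "model": "qwen3-vl:30b",
  "prompt": "hello",
  "options": { "num_gpu": 49 }
}'
```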

With this log:
```
oct. 30 00:20:12 jarvis-server ollama[556420]: [GIN] 2025/10/30 - 00:20:12 | 200 | 19.86µs | 127.0.0.1 | HEAD "/"
oct. 30 00:20:12 jarvis-server ollama[556420]: [GIN] 2025/10/30 - 00:20:12 | 200 | 59.434428ms | 127.0.0.1 | POST "/api/show"
oct. 30 00:20:12 jarvis-server ollama[556420]: ggml_backend_cuda_device_get_memory device GPU-d0364f00-33d1-a9a6-d173-85839f5f872c utilizing NVML memory reporting free: 5361434624 total: 25769803776
oct. 30 00:20:15 jarvis-server ollama[556420]: time=2025-10-30T00:20:15.879Z level=INFO source=server.go:385 msg="starting runner" cmd="/usr/local/bin/ollama runner --ollama-engine --port 38615"
oct. 30 00:20:16 jarvis-server ollama[556420]: time=2025-10-30T00:20:16.053Z level=INFO source=server.go:385 msg="starting runner" cmd="/usr/local/bin/ollama runner --ollama-engine --port 40563"
oct. 30 00:20:16 jarvis-server ollama[556420]: time=2025-10-30T00:20:16.309Z level=INFO source=server.go:215 msg="enabling flash attention"
oct. 30 00:20:16 jarvis-server ollama[556420]: time=2025-10-30T00:20:16.309Z level=INFO source=server.go:385 msg="starting runner" cmd="/usr/local/bin/ollama runner --ollama-engine --model /usr/share/ollama/.ollama/models/blobs/sha256-b1da6f96a2e40e5db05b6066d799c69411225b336bfa20ef1b002c223ed4b190 --port 41305"
oct. 30 00:20:16 jarvis-server ollama[556420]: time=2025-10-30T00:20:16.309Z level=INFO source=server.go:638 msg="loading model" "model layers"=49 requested=49
oct. 30 00:20:16 jarvis-server ollama[556420]: time=2025-10-30T00:20:16.309Z level=INFO source=server.go:643 msg="system memory" total="15.5 GiB" free="14.1 GiB" free_swap="15.0 GiB"
oct. 30 00:20:16 jarvis-server ollama[556420]: time=2025-10-30T00:20:16.309Z level=INFO source=server.go:650 msg="gpu memory" id=GPU-d0364f00-33d1-a9a6-d173-85839f5f872c library=CUDA available="23.1 GiB" free="23.6 GiB" minimum="457.0 MiB" overhead="0 B"
oct. 30 00:20:16 jarvis-server ollama[556420]: time=2025-10-30T00:20:16.319Z level=INFO source=runner.go:1337 msg="starting ollama engine"
oct. 30 00:20:16 jarvis-server ollama[556420]: time=2025-10-30T00:20:16.319Z level=INFO source=runner.go:1372 msg="Server listening on 127.0.0.1:41305"
oct. 30 00:20:16 jarvis-server ollama[556420]: time=2025-10-30T00:20:16.320Z level=INFO source=runner.go:1210 msg=load request="{Operation:fit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:true KvSize:8192 KvCacheType: NumThreads:6 GPULayers:49[ID:GPU-d0364f00-33d1-a9a6-d173-85839f5f872c Layers:49(0..48)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
oct. 30 00:20:16 jarvis-server ollama[556420]: time=2025-10-30T00:20:16.352Z level=INFO source=ggml.go:135 msg="" architecture=qwen3vlmoe file_type=Q4_K_M name="" description="" num_tensors=1038 num_key_values=43
oct. 30 00:20:16 jarvis-server ollama[556420]: load_backend: loaded CPU backend from /usr/local/lib/ollama/libggml-cpu-haswell.so
oct. 30 00:20:16 jarvis-server ollama[556420]: ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
oct. 30 00:20:16 jarvis-server ollama[556420]: ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
oct. 30 00:20:16 jarvis-server ollama[556420]: ggml_cuda_init: found 1 CUDA devices:
oct. 30 00:20:16 jarvis-server ollama[556420]: Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes, ID: GPU-d0364f00-33d1-a9a6-d173-85839f5f872c
oct. 30 00:20:16 jarvis-server ollama[556420]: load_backend: loaded CUDA backend from /usr/local/lib/ollama/cuda_v12/libggml-cuda.so
oct. 30 00:20:16 jarvis-server ollama[556420]: time=2025-10-30T00:20:16.485Z level=INFO source=ggml.go:104 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.BMI2=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 CUDA.0.ARCHS=500,520,600,610,700,750,800,860,890,900,1200 CUDA.0.USE_GRAPHS=1 CUDA.0.PEER_MAX_BATCH_SIZE=128 compiler=cgo(gcc)
oct. 30 00:20:17 jarvis-server ollama[556420]: time=2025-10-30T00:20:17.051Z level=INFO source=runner.go:1210 msg=load request="{Operation:alloc LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:true KvSize:8192 KvCacheType: NumThreads:6 GPULayers:49[ID:GPU-d0364f00-33d1-a9a6-d173-85839f5f872c Layers:49(0..48)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
oct. 30 00:20:17 jarvis-server ollama[556420]: time=2025-10-30T00:20:17.547Z level=INFO source=runner.go:1210 msg=load request="{Operation:close LoraPath:[] Parallel:0 BatchSize:0 FlashAttention:false KvSize:0 KvCacheType: NumThreads:0 GPULayers:[] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
oct. 30 00:20:17 jarvis-server ollama[556420]: time=2025-10-30T00:20:17.547Z level=INFO source=device.go:212 msg="model weights" device=CUDA0 size="18.1 GiB"
oct. 30 00:20:17 jarvis-server ollama[556420]: time=2025-10-30T00:20:17.547Z level=INFO source=device.go:217 msg="model weights" device=CPU size="166.9 MiB"
oct. 30 00:20:17 jarvis-server ollama[556420]: time=2025-10-30T00:20:17.547Z level=INFO source=device.go:223 msg="kv cache" device=CUDA0 size="768.0 MiB"
oct. 30 00:20:17 jarvis-server ollama[556420]: time=2025-10-30T00:20:17.547Z level=INFO source=device.go:234 msg="compute graph" device=CUDA0 size="4.5 GiB"
oct. 30 00:20:17 jarvis-server ollama[556420]: time=2025-10-30T00:20:17.547Z level=INFO source=device.go:239 msg="compute graph" device=CPU size="31.7 MiB"
oct. 30 00:20:17 jarvis-server ollama[556420]: time=2025-10-30T00:20:17.547Z level=INFO source=device.go:244 msg="total memory" size="23.5 GiB"
oct. 30 00:20:17 jarvis-server ollama[556420]: time=2025-10-30T00:20:17.547Z level=INFO source=sched.go:446 msg="Load failed" model=/usr/share/ollama/.ollama/models/blobs/sha256-b1da6f96a2e40e5db05b6066d799c69411225b336bfa20ef1b002c223ed4b190 error="memory layout cannot be allocated with num_gpu = 49"
oct. 30 00:20:17 jarvis-server ollama[556420]: [GIN] 2025/10/30 - 00:20:17 | 500 | 5.241777807s | 127.0.0.1 | POST "/api/generate"
```

Is this normal?
Shouldn't 19 + 2 GB fit in 24 GB?

Relevant log output


OS

Linux

GPU

Nvidia

CPU

AMD

Ollama version

0.12.7

GiteaMirror added the bug label 2026-04-12 21:12:16 -05:00

@jessegross commented on GitHub (Oct 30, 2025):

Your log is very difficult to read but you have 23.1G available for allocation:

```
oct. 30 00:11:38 jarvis-server ollama[556420]: time=2025-10-30T00:11:38.732Z level=INFO source=server.go:650 msg="gpu memory" id=GPU-d0364f00-33d1-a9a6-d173-85839f5f872c library=CUDA available="23.1 GiB" free="23.6 GiB" minimum="457.0 MiB" overhead="0 B"
```

And we need to allocate a total of 23.8G:

```
oct. 30 00:11:40 jarvis-server ollama[556420]: time=2025-10-30T00:11:40.427Z level=INFO source=device.go:212 msg="model weights" device=CUDA0 size="16.9 GiB"
oct. 30 00:11:40 jarvis-server ollama[556420]: time=2025-10-30T00:11:40.427Z level=INFO source=device.go:217 msg="model weights" device=CPU size="1.4 GiB"
oct. 30 00:11:40 jarvis-server ollama[556420]: time=2025-10-30T00:11:40.427Z level=INFO source=device.go:223 msg="kv cache" device=CUDA0 size="768.0 MiB"
oct. 30 00:11:40 jarvis-server ollama[556420]: time=2025-10-30T00:11:40.427Z level=INFO source=device.go:234 msg="compute graph" device=CUDA0 size="663.2 MiB"
oct. 30 00:11:40 jarvis-server ollama[556420]: time=2025-10-30T00:11:40.427Z level=INFO source=device.go:239 msg="compute graph" device=CPU size="4.2 GiB"
oct. 30 00:11:40 jarvis-server ollama[556420]: time=2025-10-30T00:11:40.427Z level=INFO source=device.go:244 msg="total memory" size="23.8 GiB"
```

As a result, the vision projector gets bumped onto the CPU. This is large and moved as a single block, which is why the GPU percentage is lower than 100%.
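Rough arithmetic from the numbers logged above (a sketch of the accounting, sizes as reported by the allocator):

```
Default (48/49 layers on GPU, projector bumped to CPU):
  GPU:  16.9 GiB weights + 768 MiB KV cache + 663 MiB compute graph ≈ 18.3 GiB  (fits in the 23.1 GiB available)
  CPU:   1.4 GiB weights + 4.2 GiB compute graph (where the projector path ends up)
  Total ≈ 23.8 GiB

Forced num_gpu = 49 (everything on GPU):
  GPU:  18.1 GiB weights + 768 MiB KV cache + 4.5 GiB compute graph ≈ 23.4 GiB  > 23.1 GiB available → load fails
```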

Reference: github-starred/ollama#8511