[GH-ISSUE #12580] New Error: "memory layout cannot be allocated" when switching large multi-GPU models #70407

Closed
opened 2026-05-04 21:26:28 -05:00 by GiteaMirror · 4 comments

Originally created by @chrisoutwright on GitHub (Oct 11, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/12580

What is the issue?

When switching between two large models in Ollama using a multi-GPU setup (4090 + 3090), I get the following error:

500: memory layout cannot be allocated with num_gpu = 91

The issue only occurs when changing from one loaded model to another — restarting ollama serve resolves it temporarily.
It seems like GPU memory or CUDA context cleanup between model loads isn’t happening correctly.


What happened? What did you expect to happen?

What happened:

  • After successfully running one model (Qwen3-Coder-53B-A3B-Instruct-TOTAL-RECALL-v2-MASTER-CODER-L-i1-GGUF:Q4_K_M),
    I tried switching to another (Qwen-3-30B-FerrisMind:q6_k).
  • The Ollama API responded with HTTP 500, and the logs showed:
    memory layout cannot be allocated with num_gpu = 91
  • VRAM usage remained high (~20 GB per GPU), and the process didn’t fully clean up GPU allocations.

Expected behavior:
Ollama should unload the first model, release all GPU resources, and successfully load the second model without needing a server restart.
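
A quick way to sanity-check the observation above (that ~20 GB per GPU stayed allocated after the first model) is to compare what Ollama reports against the driver. A minimal sketch, assuming the standard ollama CLI and nvidia-smi are on PATH:

```powershell
# Compare Ollama's view of loaded models with the driver's view of VRAM.
# If `ollama ps` lists no model but memory.used is still high, cleanup did not complete.
ollama ps
nvidia-smi --query-gpu=index,name,memory.used,memory.free --format=csv
```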


OS

Windows 11 Pro
64 GB RAM
Swap file: 72 GB


GPU

  • NVIDIA GeForce RTX 4090 (24 GB VRAM, compute 8.9)
  • NVIDIA GeForce RTX 3090 (24 GB VRAM, compute 8.6)
  • Flash Attention: enabled

CPU

AMD Ryzen 7 (8 cores, 16 threads)


Ollama version

0.12.4-rc6


Additional Notes

  • Happens when switching between large models created with ollama create or pulled from Hugging Face (hf.co), using different quantizations or layer counts (Q4_K_M ↔ Q6_K).
  • Restarting the Ollama service clears the issue.
  • Both models load fine individually.
  • Looks like incomplete GPU cleanup or CUDA context fragmentation between model loads.
  • The issue appears independent of context size or cache quantization, and may be related to how the scheduler handles model transitions in multi-GPU mode.

Example Models

Model A:
hf.co/mradermacher/Qwen3-Coder-53B-A3B-Instruct-TOTAL-RECALL-v2-MASTER-CODER-L-i1-GGUF:Q4_K_M
{
  "repeat_penalty": 1.05,
  "stop": ["<|im_start|>", "<|im_end|>", "<|endoftext|>"],
  "temperature": 0.7,
  "top_k": 20,
  "top_p": 0.8,
  "num_gpu": 91,
  "num_ctx": 90000
}

Model B:
Qwen-3-30B-FerrisMind:q6_k
{
  "min_p": 0,
  "num_ctx": 126000,
  "num_gpu": 85,
  "repeat_penalty": 1.05,
  "stop": ["<|im_start|>", "<|im_end|>"],
  "temperature": 0.7,
  "top_k": 20,
  "top_p": 0.8
}
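
For context with the "ollama create models" mentioned above: these option sets are the kind of parameters that get baked in at create time. Below is a hedged sketch of how Model B could be registered that way; the reporter's actual Modelfiles are not part of this report, so the file contents and the model name "model-b-tuned" are assumptions:

```powershell
# Hypothetical sketch only -- the reporter's actual Modelfiles are not included in this issue.
# Writes a Modelfile carrying Model B's options and registers it with `ollama create`.
@'
FROM Qwen-3-30B-FerrisMind:q6_k
PARAMETER num_gpu 85
PARAMETER num_ctx 126000
PARAMETER min_p 0
PARAMETER repeat_penalty 1.05
PARAMETER temperature 0.7
PARAMETER top_k 20
PARAMETER top_p 0.8
PARAMETER stop "<|im_start|>"
PARAMETER stop "<|im_end|>"
'@ | Set-Content -Path .\Modelfile.model-b

ollama create model-b-tuned -f .\Modelfile.model-b
```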


Steps to Reproduce

  1. Start Ollama:
    ollama serve
  2. Load Model A → works fine
  3. Load Model B → error (reproduced in the sketch below):
    500: memory layout cannot be allocated with num_gpu = 91
  4. Restart ollama serve → both models load fine again, until the next switch.
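
A minimal PowerShell sketch of steps 2 and 3, assuming the server from step 1 is listening on the default port and both models exist locally:

```powershell
# Reproduction sketch: send one chat request per model against the local Ollama API.
# The second request is the one that fails with HTTP 500 in this report.
function Invoke-OllamaChat([string]$Model) {
    $body = @{
        model    = $Model
        messages = @(@{ role = 'user'; content = 'hello' })
        stream   = $false
    } | ConvertTo-Json -Depth 5
    Invoke-RestMethod -Uri 'http://localhost:11434/api/chat' -Method Post -ContentType 'application/json' -Body $body
}

# Step 2: Model A loads and responds normally.
Invoke-OllamaChat 'hf.co/mradermacher/Qwen3-Coder-53B-A3B-Instruct-TOTAL-RECALL-v2-MASTER-CODER-L-i1-GGUF:Q4_K_M'

# Step 3: switching to Model B returns
# "500: memory layout cannot be allocated with num_gpu = 91".
Invoke-OllamaChat 'Qwen-3-30B-FerrisMind:q6_k'
```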

Runner Script Used


# Relaunch this script elevated if it is not already running as Administrator
If (-Not ([Security.Principal.WindowsPrincipal] [Security.Principal.WindowsIdentity]::GetCurrent()).IsInRole([Security.Principal.WindowsBuiltInRole] "Administrator")) {
    Start-Process powershell -ArgumentList $myinvocation.mycommand.definition -Verb RunAs
    Exit
}

# Per-instance configuration, written to HKCU:\Environment and passed to the spawned ollama serve window
$envNode2 = @{
    HostAddress = "0.0.0.0:11434"
    CUDA = "0,1"
    OllamaPath = "D:\Ollama\models"
    MaxLoadedModels = "0"
    NumParallel = "1"
    SchedSpred = "1"
    FlashAttention = "1"
    KeepAlive = "20m"
    NewEstimate = "0"
    KVCacheType = "q8_0"
}

# Configure environment and start Ollama instances
foreach ($env in @($envNode2)) {
    Set-ItemProperty -Path 'HKCU:\Environment' -Name 'OLLAMA_HOST' -Value $env.HostAddress
    Set-ItemProperty -Path 'HKCU:\Environment' -Name 'CUDA_VISIBLE_DEVICES' -Value $env.CUDA
    Set-ItemProperty -Path 'HKCU:\Environment' -Name 'OLLAMA_MODELS' -Value $env.OllamaPath
    Set-ItemProperty -Path 'HKCU:\Environment' -Name 'OLLAMA_MAX_LOADED_MODELS' -Value $env.MaxLoadedModels
    Set-ItemProperty -Path 'HKCU:\Environment' -Name 'OLLAMA_NUM_PARALLEL' -Value $env.NumParallel
    Set-ItemProperty -Path 'HKCU:\Environment' -Name 'OLLAMA_SCHED_SPREAD' -Value $env.SchedSpred
    Set-ItemProperty -Path 'HKCU:\Environment' -Name 'OLLAMA_FLASH_ATTENTION' -Value $env.FlashAttention
    Set-ItemProperty -Path 'HKCU:\Environment' -Name 'OLLAMA_KEEP_ALIVE' -Value $env.KeepAlive
    Set-ItemProperty -Path 'HKCU:\Environment' -Name 'OLLAMA_KV_CACHE_TYPE' -Value $env.KVCacheType
    Set-ItemProperty -Path 'HKCU:\Environment' -Name 'OLLAMA_NEW_ESTIMATES' -Value $env.NewEstimate

    Start-Process powershell -ArgumentList "-Command `"`$env:OLLAMA_HOST='$($env.HostAddress)'; echo `$env:OLLAMA_HOST; `$env:CUDA_VISIBLE_DEVICES='$($env.CUDA)'; echo `$env:CUDA_VISIBLE_DEVICES; `$env:OLLAMA_MODELS='$($env.OllamaPath)'; echo `$env:OLLAMA_MODELS; `$env:OLLAMA_SCHED_SPREAD='$($env.SchedSpred)'; echo `$env:OLLAMA_SCHED_SPREAD; `$env:OLLAMA_FLASH_ATTENTION='$($env.FlashAttention)'; echo `$env:OLLAMA_FLASH_ATTENTION; `$env:OLLAMA_KEEP_ALIVE='$($env.KeepAlive)'; echo `$env:OLLAMA_KEEP_ALIVE; `$env:OLLAMA_KV_CACHE_TYPE='$($env.KVCacheType)'; echo `$env:OLLAMA_KV_CACHE_TYPE; `$env:OLLAMA_NEW_ESTIMATES='$($env.NewEstimate)'; echo `$env:OLLAMA_NEW_ESTIMATES; ollama serve; Read-Host 'Press any key to close the instance.'`"" -WindowStyle Normal -Verb RunAs
}

Relevant log output

0.0.0.0:11434
0,1
D:\Ollama\models
0
1
20m
q8_0
0
time=2025-10-11T20:54:37.399+02:00 level=INFO source=routes.go:1475 msg="server config" env="map[CUDA_VISIBLE_DEVICES:0,1 GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: HTTPS_PROXY: HTTP_PROXY: NO_PROXY: OLLAMA_CONTEXT_LENGTH:250000 OLLAMA_DEBUG:INFO OLLAMA_FLASH_ATTENTION:true OLLAMA_GPU_OVERHEAD:0 OLLAMA_HOST:http://0.0.0.0:11434 OLLAMA_INTEL_GPU:false OLLAMA_KEEP_ALIVE:20m0s OLLAMA_KV_CACHE_TYPE:q8_0 OLLAMA_LLM_LIBRARY: OLLAMA_LOAD_TIMEOUT:5m0s OLLAMA_MAX_LOADED_MODELS:0 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:D:\\Ollama\\models OLLAMA_MULTIUSER_CACHE:false OLLAMA_NEW_ENGINE:false OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:1 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://* vscode-webview://* vscode-file://*] OLLAMA_REMOTES:[ollama.com] OLLAMA_SCHED_SPREAD:false ROCR_VISIBLE_DEVICES:]"
time=2025-10-11T20:54:37.414+02:00 level=INFO source=images.go:522 msg="total blobs: 257"
time=2025-10-11T20:54:37.420+02:00 level=INFO source=images.go:529 msg="total unused blobs removed: 0"
time=2025-10-11T20:54:37.425+02:00 level=INFO source=routes.go:1528 msg="Listening on [::]:11434 (version 0.12.4-rc6)"
time=2025-10-11T20:54:37.426+02:00 level=INFO source=runner.go:80 msg="discovering available GPUs..."
time=2025-10-11T20:54:38.159+02:00 level=INFO source=types.go:111 msg="inference compute" id=GPU-971b407f-ae20-75ed-99c8-42c696057b0e library=CUDA compute=8.9 name=CUDA0 description="NVIDIA GeForce RTX 4090" libdirs=ollama,cuda_v13 driver=13.0 pci_id=02:00.0 type=discrete total="24.0 GiB" available="23.6 GiB"
time=2025-10-11T20:54:38.160+02:00 level=INFO source=types.go:111 msg="inference compute" id=GPU-3752f260-9f8c-48e9-780e-12430a037c53 library=CUDA compute=8.6 name=CUDA1 description="NVIDIA GeForce RTX 3090" libdirs=ollama,cuda_v13 driver=13.0 pci_id=01:00.0 type=discrete total="24.0 GiB" available="23.1 GiB"
time=2025-10-11T20:54:56.644+02:00 level=INFO source=cpu_windows.go:139 msg=packages count=1
time=2025-10-11T20:54:56.644+02:00 level=INFO source=cpu_windows.go:186 msg="" package=0 cores=8 efficiency=0 threads=8
time=2025-10-11T20:54:56.644+02:00 level=INFO source=server.go:216 msg="enabling flash attention"
time=2025-10-11T20:54:56.645+02:00 level=INFO source=server.go:395 msg="starting runner" cmd="C:\\Users\\Chris\\AppData\\Local\\Programs\\Ollama\\ollama.exe runner --ollama-engine --model D:\\Ollama\\models\\blobs\\sha256-77804d4b578b9bd192d9e83b88d512c737fb03e7eafb45d51998f2ce1ccc3393 --port 55095"
time=2025-10-11T20:54:56.647+02:00 level=INFO source=server.go:670 msg="loading model" "model layers"=85 requested=91
time=2025-10-11T20:54:56.647+02:00 level=INFO source=cpu_windows.go:139 msg=packages count=1
time=2025-10-11T20:54:56.647+02:00 level=INFO source=cpu_windows.go:186 msg="" package=0 cores=8 efficiency=0 threads=8
time=2025-10-11T20:54:56.647+02:00 level=INFO source=server.go:676 msg="system memory" total="63.9 GiB" free="56.0 GiB" free_swap="72.5 GiB"
time=2025-10-11T20:54:56.648+02:00 level=INFO source=server.go:684 msg="gpu memory" id=GPU-971b407f-ae20-75ed-99c8-42c696057b0e library=CUDA available="23.1 GiB" free="23.6 GiB" minimum="457.0 MiB" overhead="0 B"
time=2025-10-11T20:54:56.648+02:00 level=INFO source=server.go:684 msg="gpu memory" id=GPU-3752f260-9f8c-48e9-780e-12430a037c53 library=CUDA available="22.6 GiB" free="23.1 GiB" minimum="457.0 MiB" overhead="0 B"
time=2025-10-11T20:54:56.676+02:00 level=INFO source=runner.go:1299 msg="starting ollama engine"
time=2025-10-11T20:54:56.676+02:00 level=INFO source=runner.go:1335 msg="Server listening on 127.0.0.1:55095"
time=2025-10-11T20:54:56.681+02:00 level=INFO source=runner.go:1172 msg=load request="{Operation:fit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:true KvSize:90000 KvCacheType:q8_0 NumThreads:8 GPULayers:85[ID:GPU-971b407f-ae20-75ed-99c8-42c696057b0e Layers:85(0..84)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2025-10-11T20:54:56.705+02:00 level=INFO source=ggml.go:133 msg="" architecture=qwen3moe file_type=Q4_K_M name="Qwen3 Coder 53B A3B Instruct TOTAL RECALL v2 MASTER CODER L" description="" num_tensors=1011 num_key_values=51
load_backend: loaded CPU backend from C:\Users\Chris\AppData\Local\Programs\Ollama\lib\ollama\ggml-cpu-haswell.dll
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
  Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes, ID: GPU-971b407f-ae20-75ed-99c8-42c696057b0e
  Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes, ID: GPU-3752f260-9f8c-48e9-780e-12430a037c53
load_backend: loaded CUDA backend from C:\Users\Chris\AppData\Local\Programs\Ollama\lib\ollama\cuda_v13\ggml-cuda.dll
time=2025-10-11T20:54:56.804+02:00 level=INFO source=ggml.go:104 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.BMI2=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 CUDA.0.ARCHS=750,800,860,870,890,900,1000,1100,1200,1210 CUDA.0.USE_GRAPHS=1 CUDA.0.PEER_MAX_BATCH_SIZE=128 CUDA.1.ARCHS=750,800,860,870,890,900,1000,1100,1200,1210 CUDA.1.USE_GRAPHS=1 CUDA.1.PEER_MAX_BATCH_SIZE=128 compiler=cgo(clang)
time=2025-10-11T20:54:57.168+02:00 level=INFO source=runner.go:1172 msg=load request="{Operation:fit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:true KvSize:90000 KvCacheType:q8_0 NumThreads:8 GPULayers:85[ID:GPU-971b407f-ae20-75ed-99c8-42c696057b0e Layers:42(0..41) ID:GPU-3752f260-9f8c-48e9-780e-12430a037c53 Layers:43(42..84)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2025-10-11T20:54:57.291+02:00 level=INFO source=runner.go:1172 msg=load request="{Operation:fit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:true KvSize:90000 KvCacheType:q8_0 NumThreads:8 GPULayers:85[ID:GPU-971b407f-ae20-75ed-99c8-42c696057b0e Layers:43(0..42) ID:GPU-3752f260-9f8c-48e9-780e-12430a037c53 Layers:42(43..84)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2025-10-11T20:54:57.334+02:00 level=INFO source=runner.go:1172 msg=load request="{Operation:alloc LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:true KvSize:90000 KvCacheType:q8_0 NumThreads:8 GPULayers:85[ID:GPU-971b407f-ae20-75ed-99c8-42c696057b0e Layers:43(0..42) ID:GPU-3752f260-9f8c-48e9-780e-12430a037c53 Layers:42(43..84)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2025-10-11T20:54:58.018+02:00 level=INFO source=runner.go:1172 msg=load request="{Operation:commit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:true KvSize:90000 KvCacheType:q8_0 NumThreads:8 GPULayers:85[ID:GPU-971b407f-ae20-75ed-99c8-42c696057b0e Layers:43(0..42) ID:GPU-3752f260-9f8c-48e9-780e-12430a037c53 Layers:42(43..84)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2025-10-11T20:54:58.018+02:00 level=INFO source=ggml.go:477 msg="offloading 84 repeating layers to GPU"
time=2025-10-11T20:54:58.018+02:00 level=INFO source=ggml.go:483 msg="offloading output layer to GPU"
time=2025-10-11T20:54:58.018+02:00 level=INFO source=ggml.go:488 msg="offloaded 85/85 layers to GPU"
time=2025-10-11T20:54:58.018+02:00 level=INFO source=device.go:206 msg="model weights" device=CUDA0 size="15.1 GiB"
time=2025-10-11T20:54:58.020+02:00 level=INFO source=device.go:206 msg="model weights" device=CUDA1 size="14.7 GiB"
time=2025-10-11T20:54:58.020+02:00 level=INFO source=device.go:211 msg="model weights" device=CPU size="166.9 MiB"
time=2025-10-11T20:54:58.020+02:00 level=INFO source=device.go:217 msg="kv cache" device=CUDA0 size="3.9 GiB"
time=2025-10-11T20:54:58.021+02:00 level=INFO source=device.go:217 msg="kv cache" device=CUDA1 size="3.7 GiB"
time=2025-10-11T20:54:58.021+02:00 level=INFO source=device.go:228 msg="compute graph" device=CUDA0 size="431.0 MiB"
time=2025-10-11T20:54:58.022+02:00 level=INFO source=device.go:228 msg="compute graph" device=CUDA1 size="423.5 MiB"
time=2025-10-11T20:54:58.022+02:00 level=INFO source=device.go:233 msg="compute graph" device=CPU size="4.0 MiB"
time=2025-10-11T20:54:58.022+02:00 level=INFO source=device.go:238 msg="total memory" size="38.4 GiB"
time=2025-10-11T20:54:58.022+02:00 level=INFO source=sched.go:480 msg="loaded runners" count=1
time=2025-10-11T20:54:58.023+02:00 level=INFO source=server.go:1266 msg="waiting for llama runner to start responding"
time=2025-10-11T20:54:58.023+02:00 level=INFO source=server.go:1300 msg="waiting for server to become available" status="llm server loading model"
time=2025-10-11T20:55:06.804+02:00 level=INFO source=server.go:1304 msg="llama runner started in 10.16 seconds"
[GIN] 2025/10/11 - 20:55:21 | 200 |   25.5822703s |    192.168.1.88 | POST     "/api/chat"
[GIN] 2025/10/11 - 20:55:39 | 200 |      10.846ms |    192.168.1.88 | GET      "/api/tags"
[GIN] 2025/10/11 - 20:55:39 | 200 |            0s |    192.168.1.88 | GET      "/api/ps"
[GIN] 2025/10/11 - 20:55:40 | 200 |            0s |    192.168.1.88 | GET      "/api/version"
ggml_backend_cuda_device_get_memory utilizing NVML memory reporting free: 3852505088 total: 25757220864
ggml_backend_cuda_device_get_memory utilizing NVML memory reporting free: 4382294016 total: 25769803776
time=2025-10-11T20:55:47.118+02:00 level=INFO source=sched.go:543 msg="updated VRAM based on existing loaded models" gpu=GPU-971b407f-ae20-75ed-99c8-42c696057b0e library=CUDA total="24.0 GiB" available="3.6 GiB"
time=2025-10-11T20:55:47.118+02:00 level=INFO source=sched.go:543 msg="updated VRAM based on existing loaded models" gpu=GPU-3752f260-9f8c-48e9-780e-12430a037c53 library=CUDA total="24.0 GiB" available="4.1 GiB"
time=2025-10-11T20:55:47.165+02:00 level=INFO source=cpu_windows.go:139 msg=packages count=1
time=2025-10-11T20:55:47.165+02:00 level=INFO source=cpu_windows.go:186 msg="" package=0 cores=8 efficiency=0 threads=8
time=2025-10-11T20:55:47.167+02:00 level=INFO source=server.go:216 msg="enabling flash attention"
time=2025-10-11T20:55:47.169+02:00 level=INFO source=server.go:395 msg="starting runner" cmd="C:\\Users\\Chris\\AppData\\Local\\Programs\\Ollama\\ollama.exe runner --ollama-engine --model D:\\Ollama\\models\\blobs\\sha256-aa885ffefae01fe441fe8a50ab2e5f3cd72ad12a140f718c794f153d841010a7 --port 55117"
time=2025-10-11T20:55:47.172+02:00 level=INFO source=server.go:670 msg="loading model" "model layers"=49 requested=91
time=2025-10-11T20:55:47.172+02:00 level=INFO source=cpu_windows.go:139 msg=packages count=1
time=2025-10-11T20:55:47.172+02:00 level=INFO source=cpu_windows.go:186 msg="" package=0 cores=8 efficiency=0 threads=8
time=2025-10-11T20:55:47.173+02:00 level=INFO source=server.go:676 msg="system memory" total="63.9 GiB" free="55.0 GiB" free_swap="32.0 GiB"
time=2025-10-11T20:55:47.173+02:00 level=INFO source=server.go:684 msg="gpu memory" id=GPU-971b407f-ae20-75ed-99c8-42c696057b0e library=CUDA available="3.1 GiB" free="3.6 GiB" minimum="457.0 MiB" overhead="0 B"
time=2025-10-11T20:55:47.173+02:00 level=INFO source=server.go:684 msg="gpu memory" id=GPU-3752f260-9f8c-48e9-780e-12430a037c53 library=CUDA available="3.6 GiB" free="4.1 GiB" minimum="457.0 MiB" overhead="0 B"
time=2025-10-11T20:55:47.199+02:00 level=INFO source=runner.go:1299 msg="starting ollama engine"
time=2025-10-11T20:55:47.199+02:00 level=INFO source=runner.go:1335 msg="Server listening on 127.0.0.1:55117"
time=2025-10-11T20:55:47.207+02:00 level=INFO source=runner.go:1172 msg=load request="{Operation:fit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:true KvSize:126000 KvCacheType:q8_0 NumThreads:8 GPULayers:49[ID:GPU-3752f260-9f8c-48e9-780e-12430a037c53 Layers:49(0..48)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2025-10-11T20:55:47.231+02:00 level=INFO source=ggml.go:133 msg="" architecture=qwen3moe file_type=Q6_K name=Af2F80A8321937409Fa75Cf98Ab17Bfd94F55D5A description="" num_tensors=579 num_key_values=30
load_backend: loaded CPU backend from C:\Users\Chris\AppData\Local\Programs\Ollama\lib\ollama\ggml-cpu-haswell.dll
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
  Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes, ID: GPU-971b407f-ae20-75ed-99c8-42c696057b0e
  Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes, ID: GPU-3752f260-9f8c-48e9-780e-12430a037c53
load_backend: loaded CUDA backend from C:\Users\Chris\AppData\Local\Programs\Ollama\lib\ollama\cuda_v13\ggml-cuda.dll
time=2025-10-11T20:55:47.297+02:00 level=INFO source=ggml.go:104 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.BMI2=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 CUDA.0.ARCHS=750,800,860,870,890,900,1000,1100,1200,1210 CUDA.0.USE_GRAPHS=1 CUDA.0.PEER_MAX_BATCH_SIZE=128 CUDA.1.ARCHS=750,800,860,870,890,900,1000,1100,1200,1210 CUDA.1.USE_GRAPHS=1 CUDA.1.PEER_MAX_BATCH_SIZE=128 compiler=cgo(clang)
time=2025-10-11T20:55:47.615+02:00 level=INFO source=runner.go:1172 msg=load request="{Operation:fit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:true KvSize:126000 KvCacheType:q8_0 NumThreads:8 GPULayers:49[ID:GPU-3752f260-9f8c-48e9-780e-12430a037c53 Layers:24(0..23) ID:GPU-971b407f-ae20-75ed-99c8-42c696057b0e Layers:25(24..48)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2025-10-11T20:55:47.754+02:00 level=INFO source=runner.go:1172 msg=load request="{Operation:fit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:true KvSize:126000 KvCacheType:q8_0 NumThreads:8 GPULayers:49[ID:GPU-3752f260-9f8c-48e9-780e-12430a037c53 Layers:26(0..25) ID:GPU-971b407f-ae20-75ed-99c8-42c696057b0e Layers:23(26..48)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2025-10-11T20:55:47.789+02:00 level=INFO source=runner.go:1172 msg=load request="{Operation:alloc LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:true KvSize:126000 KvCacheType:q8_0 NumThreads:8 GPULayers:49[ID:GPU-3752f260-9f8c-48e9-780e-12430a037c53 Layers:26(0..25) ID:GPU-971b407f-ae20-75ed-99c8-42c696057b0e Layers:23(26..48)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2025-10-11T20:55:53.257+02:00 level=INFO source=runner.go:1172 msg=load request="{Operation:alloc LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:true KvSize:126000 KvCacheType:q8_0 NumThreads:8 GPULayers:49[ID:GPU-3752f260-9f8c-48e9-780e-12430a037c53 Layers:25(0..24) ID:GPU-971b407f-ae20-75ed-99c8-42c696057b0e Layers:24(25..48)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2025-10-11T20:55:54.287+02:00 level=INFO source=runner.go:1172 msg=load request="{Operation:close LoraPath:[] Parallel:0 BatchSize:0 FlashAttention:false KvSize:0 KvCacheType: NumThreads:0 GPULayers:[] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2025-10-11T20:55:54.287+02:00 level=INFO source=device.go:206 msg="model weights" device=CUDA0 size="11.2 GiB"
time=2025-10-11T20:55:54.288+02:00 level=INFO source=device.go:206 msg="model weights" device=CUDA1 size="11.9 GiB"
time=2025-10-11T20:55:54.288+02:00 level=INFO source=device.go:211 msg="model weights" device=CPU size="243.4 MiB"
time=2025-10-11T20:55:54.289+02:00 level=INFO source=device.go:217 msg="kv cache" device=CUDA0 size="2.9 GiB"
time=2025-10-11T20:55:54.289+02:00 level=INFO source=device.go:217 msg="kv cache" device=CUDA1 size="3.2 GiB"
time=2025-10-11T20:55:54.289+02:00 level=INFO source=device.go:228 msg="compute graph" device=CUDA0 size="323.8 MiB"
time=2025-10-11T20:55:54.289+02:00 level=INFO source=device.go:228 msg="compute graph" device=CUDA1 size="521.3 MiB"
time=2025-10-11T20:55:54.290+02:00 level=INFO source=device.go:233 msg="compute graph" device=CPU size="4.0 MiB"
time=2025-10-11T20:55:54.290+02:00 level=INFO source=device.go:238 msg="total memory" size="30.3 GiB"
time=2025-10-11T20:55:54.290+02:00 level=INFO source=sched.go:448 msg="Load failed" model=D:\Ollama\models\blobs\sha256-aa885ffefae01fe441fe8a50ab2e5f3cd72ad12a140f718c794f153d841010a7 error="memory layout cannot be allocated with num_gpu = 91"
[GIN] 2025/10/11 - 20:55:54 | 500 |    7.4332265s |    192.168.1.88 | POST     "/api/chat"
[GIN] 2025/10/11 - 20:57:10 | 200 |            0s |       127.0.0.1 | GET      "/api/version"

OS

Windows

GPU

Nvidia

CPU

Intel

Ollama version

0.12.4-rc6

GiteaMirror added the bug label 2026-05-04 21:26:28 -05:00

@chrisoutwright commented on GitHub (Oct 11, 2025):

Note: the same happens if I reverse the order of the models described above.
I also tried OLLAMA_MAX_LOADED_MODELS = 1, but that did not help. Using the same num_gpu for both models did not help either.


@chrisoutwright commented on GitHub (Oct 11, 2025):

I have now tried OLLAMA_NEW_ESTIMATES = 1, and the error seems to be gone.
Isn't OLLAMA_NEW_ESTIMATES = 0 supposed to match the previous behaviour? I had no such issues in prior releases.
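
For anyone wanting to try the same workaround, a minimal sketch of setting it for a single session (instead of writing it to HKCU:\Environment as the runner script above does):

```powershell
# Sketch of the reported workaround: enable the new memory estimates for this session only,
# then start the server. Whether 0.12.x still honors this variable is discussed in a later comment.
$env:OLLAMA_NEW_ESTIMATES = '1'
ollama serve
```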


@rick-github commented on GitHub (Oct 11, 2025):

0.12.4 series didn't end in a successful release. Does it still occur with 0.12.5?


@jessegross commented on GitHub (Oct 14, 2025):

This is happening because you are overriding the memory allocation logic by forcing num_gpu to 91. When you do that, Ollama will try to load the model with that number of layers no matter what. This includes not trying to evict the already loaded model. Since you don't have enough memory to fully offload both models at the same time, it returns an error. If you leave the settings at the default, it should swap between models like it normally does.

It does look like this is a change in behavior from the old memory estimates and the old behavior makes more sense.

By the way, in 0.12.x OLLAMA_NEW_ESTIMATES is no longer a configuration variable as it is always on for models that run the Ollama engine.
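
In practice that means leaving num_gpu at its default so the scheduler can pick the layout and evict the previously loaded model. A minimal sketch of such a request, reusing Model B's name from this report (if num_gpu was baked in via a Modelfile, it would need to be removed there as well):

```powershell
# Sketch: chat request with no num_gpu in "options", letting Ollama's scheduler
# decide the layer split across GPUs and unload the previous model as needed.
$body = @{
    model    = 'Qwen-3-30B-FerrisMind:q6_k'
    messages = @(@{ role = 'user'; content = 'hello' })
    stream   = $false
    options  = @{ num_ctx = 126000 }   # other options kept; num_gpu intentionally omitted
} | ConvertTo-Json -Depth 5

Invoke-RestMethod -Uri 'http://localhost:11434/api/chat' -Method Post -ContentType 'application/json' -Body $body
```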

Reference: github-starred/ollama#70407