[GH-ISSUE #14164] model load so slow after upgrade to 0.15.6 #71295

Closed
opened 2026-05-05 01:09:01 -05:00 by GiteaMirror · 8 comments

Originally created by @Arvintian on GitHub (Feb 9, 2026).
Original GitHub issue: https://github.com/ollama/ollama/issues/14164

What is the issue?

0.12.11: ~10s

```
time=2026-02-09T03:18:18.789Z level=INFO source=sched.go:443 msg="system memory" total="15.5 GiB" free="14.8 GiB" free_swap="0 B"
time=2026-02-09T03:18:18.789Z level=INFO source=sched.go:450 msg="gpu memory" id=GPU-3be405b4-935c-cbee-ad38-7068f15721db library=CUDA available="15.1 GiB" free="15.6 GiB" minimum="457.0 MiB" overhead="0 B"
time=2026-02-09T03:18:18.789Z level=INFO source=server.go:702 msg="loading model" "model layers"=41 requested=-1
time=2026-02-09T03:18:18.811Z level=INFO source=runner.go:1398 msg="starting ollama engine"
time=2026-02-09T03:18:18.814Z level=INFO source=runner.go:1433 msg="Server listening on 127.0.0.1:41075"
time=2026-02-09T03:18:18.823Z level=INFO source=runner.go:1271 msg=load request="{Operation:fit LoraPath:[] Parallel:2 BatchSize:512 FlashAttention:true KvSize:32768 KvCacheType:q8_0 NumThreads:14 GPULayers:41[ID:GPU-3be405b4-935c-cbee-ad38-7068f15721db Layers:41(0..40)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2026-02-09T03:18:18.859Z level=INFO source=ggml.go:136 msg="" architecture=qwen3 file_type=Q4_K_M name=Qwen3-14B-128K description="" num_tensors=443 num_key_values=37
load_backend: loaded CPU backend from /usr/lib/ollama/libggml-cpu-haswell.so
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 5060 Ti, compute capability 12.0, VMM: yes, ID: GPU-3be405b4-935c-cbee-ad38-7068f15721db
load_backend: loaded CUDA backend from /usr/lib/ollama/cuda_v12/libggml-cuda.so
time=2026-02-09T03:18:19.037Z level=INFO source=ggml.go:104 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.BMI2=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 CUDA.0.ARCHS=500,520,600,610,700,750,800,860,890,900,1200 CUDA.0.USE_GRAPHS=1 CUDA.0.PEER_MAX_BATCH_SIZE=128 compiler=cgo(gcc)
time=2026-02-09T03:18:19.721Z level=INFO source=runner.go:1271 msg=load request="{Operation:alloc LoraPath:[] Parallel:2 BatchSize:512 FlashAttention:true KvSize:32768 KvCacheType:q8_0 NumThreads:14 GPULayers:41[ID:GPU-3be405b4-935c-cbee-ad38-7068f15721db Layers:41(0..40)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2026-02-09T03:18:20.255Z level=INFO source=runner.go:1271 msg=load request="{Operation:commit LoraPath:[] Parallel:2 BatchSize:512 FlashAttention:true KvSize:32768 KvCacheType:q8_0 NumThreads:14 GPULayers:41[ID:GPU-3be405b4-935c-cbee-ad38-7068f15721db Layers:41(0..40)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2026-02-09T03:18:20.255Z level=INFO source=ggml.go:482 msg="offloading 40 repeating layers to GPU"
time=2026-02-09T03:18:20.255Z level=INFO source=ggml.go:489 msg="offloading output layer to GPU"
time=2026-02-09T03:18:20.255Z level=INFO source=ggml.go:494 msg="offloaded 41/41 layers to GPU"
time=2026-02-09T03:18:20.255Z level=INFO source=device.go:240 msg="model weights" device=CUDA0 size="8.0 GiB"
time=2026-02-09T03:18:20.255Z level=INFO source=device.go:245 msg="model weights" device=CPU size="417.3 MiB"
time=2026-02-09T03:18:20.255Z level=INFO source=device.go:251 msg="kv cache" device=CUDA0 size="2.7 GiB"
time=2026-02-09T03:18:20.255Z level=INFO source=device.go:262 msg="compute graph" device=CUDA0 size="352.0 MiB"
time=2026-02-09T03:18:20.255Z level=INFO source=device.go:267 msg="compute graph" device=CPU size="10.0 MiB"
time=2026-02-09T03:18:20.255Z level=INFO source=device.go:272 msg="total memory" size="11.4 GiB"
time=2026-02-09T03:18:20.255Z level=INFO source=sched.go:517 msg="loaded runners" count=1
time=2026-02-09T03:18:20.255Z level=INFO source=server.go:1294 msg="waiting for llama runner to start responding"
time=2026-02-09T03:18:20.256Z level=INFO source=server.go:1328 msg="waiting for server to become available" status="llm server loading model"
time=2026-02-09T03:18:25.283Z level=INFO source=server.go:1332 msg="llama runner started in 6.49 seconds"
[GIN] 2026/02/09 - 03:18:28 | 200 | 10.503781665s |       127.0.0.1 | POST     "/api/chat"
```

0.15.6: ~90s

```
time=2026-02-09T03:20:06.684Z level=INFO source=sched.go:463 msg="system memory" total="15.5 GiB" free="15.3 GiB" free_swap="0 B"
time=2026-02-09T03:20:06.684Z level=INFO source=sched.go:470 msg="gpu memory" id=GPU-3be405b4-935c-cbee-ad38-7068f15721db library=CUDA available="15.1 GiB" free="15.6 GiB" minimum="457.0 MiB" overhead="0 B"
time=2026-02-09T03:20:06.684Z level=INFO source=server.go:757 msg="loading model" "model layers"=41 requested=-1
time=2026-02-09T03:20:06.705Z level=INFO source=runner.go:1411 msg="starting ollama engine"
time=2026-02-09T03:20:06.709Z level=INFO source=runner.go:1446 msg="Server listening on 127.0.0.1:39565"
time=2026-02-09T03:20:06.718Z level=INFO source=runner.go:1284 msg=load request="{Operation:fit LoraPath:[] Parallel:2 BatchSize:512 FlashAttention:Enabled KvSize:32768 KvCacheType:q8_0 NumThreads:14 GPULayers:41[ID:GPU-3be405b4-935c-cbee-ad38-7068f15721db Layers:41(0..40)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2026-02-09T03:20:06.757Z level=INFO source=ggml.go:136 msg="" architecture=qwen3 file_type=Q4_K_M name=Qwen3-14B-128K description="" num_tensors=443 num_key_values=37
load_backend: loaded CPU backend from /usr/lib/ollama/libggml-cpu-haswell.so
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 5060 Ti, compute capability 12.0, VMM: yes, ID: GPU-3be405b4-935c-cbee-ad38-7068f15721db
load_backend: loaded CUDA backend from /usr/lib/ollama/cuda_v13/libggml-cuda.so
time=2026-02-09T03:20:06.946Z level=INFO source=ggml.go:104 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.BMI2=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 CUDA.0.ARCHS=750,800,860,870,890,900,1000,1030,1100,1200,1210 CUDA.0.USE_GRAPHS=1 CUDA.0.PEER_MAX_BATCH_SIZE=128 compiler=cgo(gcc)
time=2026-02-09T03:21:25.234Z level=INFO source=runner.go:1284 msg=load request="{Operation:alloc LoraPath:[] Parallel:2 BatchSize:512 FlashAttention:Enabled KvSize:32768 KvCacheType:q8_0 NumThreads:14 GPULayers:41[ID:GPU-3be405b4-935c-cbee-ad38-7068f15721db Layers:41(0..40)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2026-02-09T03:21:25.868Z level=INFO source=runner.go:1284 msg=load request="{Operation:commit LoraPath:[] Parallel:2 BatchSize:512 FlashAttention:Enabled KvSize:32768 KvCacheType:q8_0 NumThreads:14 GPULayers:41[ID:GPU-3be405b4-935c-cbee-ad38-7068f15721db Layers:41(0..40)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2026-02-09T03:21:25.869Z level=INFO source=device.go:240 msg="model weights" device=CUDA0 size="8.0 GiB"
time=2026-02-09T03:21:25.869Z level=INFO source=device.go:245 msg="model weights" device=CPU size="417.3 MiB"
time=2026-02-09T03:21:25.869Z level=INFO source=device.go:251 msg="kv cache" device=CUDA0 size="2.7 GiB"
time=2026-02-09T03:21:25.869Z level=INFO source=device.go:262 msg="compute graph" device=CUDA0 size="352.0 MiB"
time=2026-02-09T03:21:25.869Z level=INFO source=device.go:267 msg="compute graph" device=CPU size="10.0 MiB"
time=2026-02-09T03:21:25.869Z level=INFO source=device.go:272 msg="total memory" size="11.4 GiB"
time=2026-02-09T03:21:25.869Z level=INFO source=sched.go:537 msg="loaded runners" count=1
time=2026-02-09T03:21:25.869Z level=INFO source=server.go:1350 msg="waiting for llama runner to start responding"
time=2026-02-09T03:21:25.869Z level=INFO source=server.go:1384 msg="waiting for server to become available" status="llm server loading model"
time=2026-02-09T03:21:25.868Z level=INFO source=ggml.go:482 msg="offloading 40 repeating layers to GPU"
time=2026-02-09T03:21:25.869Z level=INFO source=ggml.go:489 msg="offloading output layer to GPU"
time=2026-02-09T03:21:25.869Z level=INFO source=ggml.go:494 msg="offloaded 41/41 layers to GPU"
time=2026-02-09T03:21:30.896Z level=INFO source=server.go:1388 msg="llama runner started in 84.21 seconds"
[GIN] 2026/02/09 - 03:21:33 | 200 |         1m27s |       127.0.0.1 | POST     "/api/chat"
```
```
Mon Feb  9 11:27:05 2026
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.105.07             Driver Version: 581.80         CUDA Version: 13.0     |
+-----------------------------------------+------------------------+----------------------+
```

NVIDIA GeForce RTX 5060 Ti on Windows 10, WSL2, with Docker:

```shell
docker run -d --restart unless-stopped --gpus=all -v /opt/ollama:/root/.ollama \
  -e OLLAMA_KEEP_ALIVE=60m -e OLLAMA_NUM_PARALLEL=2 -e OLLAMA_KV_CACHE_TYPE=q8_0 \
  -e OLLAMA_FLASH_ATTENTION=1 -e OLLAMA_CONTEXT_LENGTH=16384 \
  -p 11434:11434 --name ollama ollama/ollama:0.15.6   # or ollama/ollama:0.12.11
```
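
For reference, the ~10s / ~90s figures above come from the GIN request latency on the first chat request. A minimal sketch of reproducing that measurement with curl, assuming the port mapping above; the model tag is a hypothetical stand-in for whatever tag maps to Qwen3-14B-128K locally:

```shell
# Time a cold-start request: the first /api/chat call after the server
# starts (or after keep-alive expiry) includes the full model load.
time curl -s http://127.0.0.1:11434/api/chat -d '{
  "model": "qwen3-14b-128k",
  "messages": [{"role": "user", "content": "hi"}],
  "stream": false
}' > /dev/null
```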

Relevant log output


OS

Windows, WSL2

GPU

Nvidia

CPU

Intel

Ollama version

0.15.6

GiteaMirror added the bug label 2026-05-05 01:09:01 -05:00

@rick-github commented on GitHub (Feb 9, 2026):

Is the model load slow if using native Windows ollama?


@nairamk commented on GitHub (Feb 9, 2026):

There is a trick to speed up model loading.

You can enable the CUDA cache, and it will work for any of your models. This is not a response to a possible bug, but a general answer on how to enable the CUDA cache in Docker.

The first launch will take longer because the model is being “unpacked,” but subsequent launches will only take a few seconds.

Set these environment variables in the Docker container:

```
CUDA_CACHE_PATH=/root/.nv/ComputeCache
# cuda cache 2G for fast re-load of models
CUDA_CACHE_MAXSIZE=2147483648
```

and add a volume mapping:

```
-v /host_folder/ollama/cuda_cache:/root/.nv/ComputeCache
```

so that every run reuses the same cache with already-expanded objects.

Example ollama.yml:

```
version: '3.8'

services:
  ollama:
    image: ollama/ollama:0.15.6
    deploy:
      resources:
        limits:
          memory: 26G
    restart: always
    environment:
      - LANG=pl_PL.UTF-8
      - TZ=Europe/Warsaw
      - GENERIC_TIMEZONE=Europe/Warsaw
      - OLLAMA_MODELS=/models
      - CUDA_CACHE_PATH=/root/.nv/ComputeCache
      # cuda cache 2G for fast re-load models
      - CUDA_CACHE_MAXSIZE=2147483648
    volumes:
      - /etc/timezone:/etc/timezone:ro
      - /etc/localtime:/etc/localtime:ro
      - /opt/ai/models:/models
      - /opt/ai/ollama/ollama:/root/.ollama
      - /opt/ai/ollama/cuda_cache:/root/.nv/ComputeCache
    ports:
      - "11434:11434"

@rick-github commented on GitHub (Feb 9, 2026):

`CUDA_CACHE_PATH` is a cache for JIT PTX code, not models.
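
This is easy to verify. A sketch, assuming the container name `ollama` and the cache path from the comment above:

```shell
# The compute cache holds compiled PTX kernels, typically a few MiB in total;
# if it were caching models, it would be on the order of the model size (GiB).
docker exec ollama du -sh /root/.nv/ComputeCache
docker exec ollama find /root/.nv/ComputeCache -type f | head
```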


@eclgo commented on GitHub (Feb 10, 2026):

I also upgraded to v0.15.6, under Win11 with a 16G Nvidia GPU, but the loading of GLM-4.7-flash is super slow. In the first session, it had not yet responded to my question after 20 min. I noticed that less than 16G of VRAM was occupied and no shared VRAM was consumed. I attach the server logs for your reference.
BTW, Qwen3-8B also gave its first response in 5+ min (maybe a few minutes more, as I did not count accurately), while Qwen2.5-7B responded in around 70s.

[server_GLM-4.7-Flash_1st session.log](https://github.com/user-attachments/files/25201344/server_GLM-4.7-Flash_1st.session.log)


@rick-github commented on GitHub (Feb 10, 2026):

What is drive H:? SSD, HDD, NVMe, USB, network?


@eclgo commented on GitHub (Feb 10, 2026):

@rick-github H: drive is HDD. Thanks.


@rick-github commented on GitHub (Feb 10, 2026):

Ollama launches multiple IO readers when loading a model. It could be that, due to the latency of an HDD, there is significant IO contention causing the model load to be slow. Try setting `GOMAXPROCS=1` in the server environment and see if loading speed improves.
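
A sketch of where this would go in the Docker setup from the original report (for a native Windows install it would instead be set as a system environment variable before starting Ollama); the command is a trimmed version of the reporter's:

```shell
# GOMAXPROCS=1 limits the Go runtime to one OS thread for running goroutines,
# which in practice cuts down parallel reads of the model file and the
# resulting seek thrash on a spinning disk.
docker run -d --gpus=all -e GOMAXPROCS=1 \
  -v /opt/ollama:/root/.ollama \
  -p 11434:11434 --name ollama ollama/ollama:0.15.6
```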


@eclgo commented on GitHub (Feb 11, 2026):

@rick-github After setting `GOMAXPROCS=1`, the model loading speed improved a lot for some models. For your reference, GLM-4.7-flash Q4KM and Qwen3-8B can start the thinking process after 3 min 20 s and 1 min 25 s respectively. However, Qwen2.5-7B still responded in around 70 s. My HDD is 5700 RPM. Many thanks for your help!
