[GH-ISSUE #13627] Long startup time after docker container starts #8966

Open
opened 2026-04-12 21:47:52 -05:00 by GiteaMirror · 4 comments

Originally created by @thomas-meier85 on GitHub (Jan 5, 2026).
Original GitHub issue: https://github.com/ollama/ollama/issues/13627

What is the issue?

Hey guys,

I have a strange issue running Ollama as a Docker container, in two different scenarios.
1st: The Docker container is started and the first load attempt ends up with a huge Ollama startup time of around 30s.
2nd: However, when I only unload the model while the container keeps running, the Ollama runner starts in around 3s.

This system runs on an NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition with 96GB VRAM.
Another system with an A40 shows similar Ollama startup times of around 3s.

Below are the two log outputs - long and short Ollama runner startup times.
The model I used was gpt-oss:20b, which easily fits into the VRAM.
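
The two cases can be triggered roughly like this (a minimal sketch; the container name "ollama", the default port and the `ollama stop` command are assumptions, not copied from my exact setup):

```shell
# Case 1: restart the container, then load the model -> runner startup ~30s
docker restart ollama
time curl -s http://localhost:11434/api/generate -d '{"model": "gpt-oss:20b"}'

# Case 2: only unload the model, container keeps running -> runner startup ~3s
ollama stop gpt-oss:20b
time curl -s http://localhost:11434/api/generate -d '{"model": "gpt-oss:20b"}'
```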

Relevant log output

// Working as expected: "llama runner started in 2.60 seconds"
time=2026-01-05T20:48:43.178Z level=INFO source=server.go:429 msg="starting runner" cmd="/usr/bin/ollama runner --ollama-engine --port 45083"
time=2026-01-05T20:48:43.326Z level=WARN source=cpu_linux.go:130 msg="failed to parse CPU allowed micro secs" error="strconv.ParseInt: parsing \"max\": invalid syntax"
time=2026-01-05T20:48:43.538Z level=INFO source=server.go:245 msg="enabling flash attention"
time=2026-01-05T20:48:43.539Z level=INFO source=server.go:429 msg="starting runner" cmd="/usr/bin/ollama runner --ollama-engine --model /root/.ollama/models/blobs/sha256-e7b273f9636059a689e3ddcab3716e4f65abe0143ac978e46673ad0e52d09efb --port 34529"
time=2026-01-05T20:48:43.539Z level=INFO source=sched.go:443 msg="system memory" total="503.4 GiB" free="503.2 GiB" free_swap="4.0 GiB"
time=2026-01-05T20:48:43.539Z level=INFO source=sched.go:450 msg="gpu memory" id=GPU-7f2d3c91-8c50-7c91-e17f-ac3279c58604 library=CUDA available="93.4 GiB" free="93.9 GiB" minimum="457.0 MiB" overhead="0 B"
time=2026-01-05T20:48:43.539Z level=INFO source=server.go:746 msg="loading model" "model layers"=25 requested=-1
time=2026-01-05T20:48:43.557Z level=INFO source=runner.go:1405 msg="starting ollama engine"
time=2026-01-05T20:48:43.558Z level=INFO source=runner.go:1440 msg="Server listening on 127.0.0.1:34529"
time=2026-01-05T20:48:43.563Z level=INFO source=runner.go:1278 msg=load request="{Operation:fit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:Enabled KvSize:8192 KvCacheType: NumThreads:24 GPULayers:25[ID:GPU-7f2d3c91-8c50-7c91-e17f-ac3279c58604 Layers:25(0..24)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2026-01-05T20:48:43.642Z level=INFO source=ggml.go:136 msg="" architecture=gptoss file_type=MXFP4 name="" description="" num_tensors=459 num_key_values=32
load_backend: loaded CPU backend from /usr/lib/ollama/libggml-cpu-icelake.so
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition, compute capability 12.0, VMM: yes, ID: GPU-7f2d3c91-8c50-7c91-e17f-ac3279c58604
load_backend: loaded CUDA backend from /usr/lib/ollama/cuda_v13/libggml-cuda.so
time=2026-01-05T20:48:43.737Z level=INFO source=ggml.go:104 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.BMI2=1 CPU.0.AVX512=1 CPU.0.AVX512_VBMI=1 CPU.0.AVX512_VNNI=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 CUDA.0.ARCHS=750,800,860,870,890,900,1000,1030,1100,1200,1210 CUDA.0.USE_GRAPHS=1 CUDA.0.PEER_MAX_BATCH_SIZE=128 compiler=cgo(gc
time=2026-01-05T20:48:44.241Z level=INFO source=runner.go:1278 msg=load request="{Operation:alloc LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:Enabled KvSize:8192 KvCacheType: NumThreads:24 GPULayers:25[ID:GPU-7f2d3c91-8c50-7c91-e17f-ac3279c58604 Layers:25(0..24)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2026-01-05T20:48:44.381Z level=INFO source=runner.go:1278 msg=load request="{Operation:commit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:Enabled KvSize:8192 KvCacheType: NumThreads:24 GPULayers:25[ID:GPU-7f2d3c91-8c50-7c91-e17f-ac3279c58604 Layers:25(0..24)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2026-01-05T20:48:44.381Z level=INFO source=ggml.go:482 msg="offloading 24 repeating layers to GPU"
time=2026-01-05T20:48:44.381Z level=INFO source=ggml.go:489 msg="offloading output layer to GPU"
time=2026-01-05T20:48:44.381Z level=INFO source=ggml.go:494 msg="offloaded 25/25 layers to GPU"
time=2026-01-05T20:48:44.381Z level=INFO source=device.go:240 msg="model weights" device=CUDA0 size="11.8 GiB"
time=2026-01-05T20:48:44.381Z level=INFO source=device.go:245 msg="model weights" device=CPU size="1.1 GiB"
time=2026-01-05T20:48:44.381Z level=INFO source=device.go:251 msg="kv cache" device=CUDA0 size="300.0 MiB"
time=2026-01-05T20:48:44.381Z level=INFO source=device.go:262 msg="compute graph" device=CUDA0 size="160.6 MiB"
time=2026-01-05T20:48:44.381Z level=INFO source=device.go:267 msg="compute graph" device=CPU size="5.6 MiB"
time=2026-01-05T20:48:44.381Z level=INFO source=device.go:272 msg="total memory" size="13.3 GiB"
time=2026-01-05T20:48:44.381Z level=INFO source=sched.go:517 msg="loaded runners" count=1
time=2026-01-05T20:48:44.381Z level=INFO source=server.go:1338 msg="waiting for llama runner to start responding"
time=2026-01-05T20:48:44.382Z level=INFO source=server.go:1372 msg="waiting for server to become available" status="llm server loading model"
time=2026-01-05T20:48:46.139Z level=INFO source=server.go:1376 msg="llama runner started in 2.60 seconds"
[GIN] 2026/01/05 - 20:48:49 | 200 |  6.558038359s |      172.18.0.1 | POST     "/api/chat"
[GIN] 2026/01/05 - 20:48:51 | 200 |  2.115978354s |      172.18.0.1 | POST     "/api/chat"
[GIN] 2026/01/05 - 20:48:52 | 200 |  1.342621359s |      172.18.0.1 | POST     "/api/chat"
[GIN] 2026/01/05 - 20:48:53 | 200 |  852.120881ms |      172.18.0.1 | POST     "/api/chat"
 


// Not expected: "llama runner started in 29.23 seconds"
time=2026-01-05T20:51:04.121Z level=INFO source=server.go:429 msg="starting runner" cmd="/usr/bin/ollama runner --ollama-engine --port 44473"
time=2026-01-05T20:51:04.273Z level=WARN source=cpu_linux.go:130 msg="failed to parse CPU allowed micro secs" error="strconv.ParseInt: parsing \"max\": invalid syntax"
time=2026-01-05T20:51:04.474Z level=INFO source=server.go:245 msg="enabling flash attention"
time=2026-01-05T20:51:04.474Z level=INFO source=server.go:429 msg="starting runner" cmd="/usr/bin/ollama runner --ollama-engine --model /root/.ollama/models/blobs/sha256-e7b273f9636059a689e3ddcab3716e4f65abe0143ac978e46673ad0e52d09efb --port 43059"
time=2026-01-05T20:51:04.475Z level=INFO source=sched.go:443 msg="system memory" total="503.4 GiB" free="503.2 GiB" free_swap="4.0 GiB"
time=2026-01-05T20:51:04.475Z level=INFO source=sched.go:450 msg="gpu memory" id=GPU-7f2d3c91-8c50-7c91-e17f-ac3279c58604 library=CUDA available="93.4 GiB" free="93.9 GiB" minimum="457.0 MiB" overhead="0 B"
time=2026-01-05T20:51:04.475Z level=INFO source=server.go:746 msg="loading model" "model layers"=25 requested=-1
time=2026-01-05T20:51:04.494Z level=INFO source=runner.go:1405 msg="starting ollama engine"
time=2026-01-05T20:51:04.494Z level=INFO source=runner.go:1440 msg="Server listening on 127.0.0.1:43059"
time=2026-01-05T20:51:04.498Z level=INFO source=runner.go:1278 msg=load request="{Operation:fit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:Enabled KvSize:8192 KvCacheType: NumThreads:24 GPULayers:25[ID:GPU-7f2d3c91-8c50-7c91-e17f-ac3279c58604 Layers:25(0..24)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2026-01-05T20:51:04.580Z level=INFO source=ggml.go:136 msg="" architecture=gptoss file_type=MXFP4 name="" description="" num_tensors=459 num_key_values=32
load_backend: loaded CPU backend from /usr/lib/ollama/libggml-cpu-icelake.so
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition, compute capability 12.0, VMM: yes, ID: GPU-7f2d3c91-8c50-7c91-e17f-ac3279c58604
load_backend: loaded CUDA backend from /usr/lib/ollama/cuda_v13/libggml-cuda.so
time=2026-01-05T20:51:04.677Z level=INFO source=ggml.go:104 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.BMI2=1 CPU.0.AVX512=1 CPU.0.AVX512_VBMI=1 CPU.0.AVX512_VNNI=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 CUDA.0.ARCHS=750,800,860,870,890,900,1000,1030,1100,1200,1210 CUDA.0.USE_GRAPHS=1 CUDA.0.PEER_MAX_BATCH_SIZE=128 compiler=cgo(gcc)
time=2026-01-05T20:51:31.555Z level=INFO source=runner.go:1278 msg=load request="{Operation:alloc LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:Enabled KvSize:8192 KvCacheType: NumThreads:24 GPULayers:25[ID:GPU-7f2d3c91-8c50-7c91-e17f-ac3279c58604 Layers:25(0..24)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2026-01-05T20:51:31.695Z level=INFO source=runner.go:1278 msg=load request="{Operation:commit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:Enabled KvSize:8192 KvCacheType: NumThreads:24 GPULayers:25[ID:GPU-7f2d3c91-8c50-7c91-e17f-ac3279c58604 Layers:25(0..24)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2026-01-05T20:51:31.695Z level=INFO source=ggml.go:482 msg="offloading 24 repeating layers to GPU"
time=2026-01-05T20:51:31.695Z level=INFO source=ggml.go:489 msg="offloading output layer to GPU"
time=2026-01-05T20:51:31.695Z level=INFO source=ggml.go:494 msg="offloaded 25/25 layers to GPU"
time=2026-01-05T20:51:31.695Z level=INFO source=device.go:240 msg="model weights" device=CUDA0 size="11.8 GiB"
time=2026-01-05T20:51:31.695Z level=INFO source=device.go:245 msg="model weights" device=CPU size="1.1 GiB"
time=2026-01-05T20:51:31.695Z level=INFO source=device.go:251 msg="kv cache" device=CUDA0 size="300.0 MiB"
time=2026-01-05T20:51:31.695Z level=INFO source=device.go:262 msg="compute graph" device=CUDA0 size="160.6 MiB"
time=2026-01-05T20:51:31.695Z level=INFO source=device.go:267 msg="compute graph" device=CPU size="5.6 MiB"
time=2026-01-05T20:51:31.695Z level=INFO source=device.go:272 msg="total memory" size="13.3 GiB"
time=2026-01-05T20:51:31.696Z level=INFO source=sched.go:517 msg="loaded runners" count=1
time=2026-01-05T20:51:31.696Z level=INFO source=server.go:1338 msg="waiting for llama runner to start responding"
time=2026-01-05T20:51:31.696Z level=INFO source=server.go:1372 msg="waiting for server to become available" status="llm server loading model"
time=2026-01-05T20:51:33.704Z level=INFO source=server.go:1376 msg="llama runner started in 29.23 seconds"
[GIN] 2026/01/05 - 20:51:50 | 200 | 46.182873101s |      172.19.0.1 | POST     "/api/chat"
[GIN] 2026/01/05 - 20:51:52 | 200 |  2.079641882s |      172.19.0.1 | POST     "/api/chat"
[GIN] 2026/01/05 - 20:51:53 | 200 |  1.200260658s |      172.19.0.1 | POST     "/api/chat"
[GIN] 2026/01/05 - 20:51:54 | 200 |  1.483932971s |      172.19.0.1 | POST     "/api/chat"

OS

Ubuntu 24.04 LTS

GPU

NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition with 96GB VRAM

CPU

XEON Gold 5412U

Ollama version

0.13.5

GiteaMirror added the bug label 2026-04-12 21:47:52 -05:00

@rick-github commented on GitHub (Jan 6, 2026):

Where are the models stored?
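
For a Docker setup, something like this would show it (a quick sketch; the container name "ollama" is an assumption):

```shell
# Which host path/volume backs /root/.ollama inside the container
docker inspect -f '{{ json .Mounts }}' ollama
# Which filesystem/device that path lives on
docker exec ollama df -h /root/.ollama/models
```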


@thomas-meier85 commented on GitHub (Jan 6, 2026):

> Where are the models stored?

Local NVMe - RAID 1


@thomas-meier85 commented on GitHub (Jan 10, 2026):

There seems to be a huge difference here:

No delay:

time=2026-01-05T20:48:43.737Z level=INFO source=ggml.go:104 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.BMI2=1 CPU.0.AVX512=1 CPU.0.AVX512_VBMI=1 CPU.0.AVX512_VNNI=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 CUDA.0.ARCHS=750,800,860,870,890,900,1000,1030,1100,1200,1210 CUDA.0.USE_GRAPHS=1 CUDA.0.PEER_MAX_BATCH_SIZE=128 compiler=cgo(gc
time=2026-01-05T20:48:44.241Z level=INFO source=runner.go:1278 msg=load request="{Operation:alloc LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:Enabled KvSize:8192 KvCacheType: NumThreads:24 GPULayers:25[ID:GPU-7f2d3c91-8c50-7c91-e17f-ac3279c58604 Layers:25(0..24)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"

Huge delay: 27s
time=2026-01-05T20:51:04.677Z level=INFO source=ggml.go:104 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.BMI2=1 CPU.0.AVX512=1 CPU.0.AVX512_VBMI=1 CPU.0.AVX512_VNNI=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 CUDA.0.ARCHS=750,800,860,870,890,900,1000,1030,1100,1200,1210 CUDA.0.USE_GRAPHS=1 CUDA.0.PEER_MAX_BATCH_SIZE=128 compiler=cgo(gcc)
time=2026-01-05T20:51:31.555Z level=INFO source=runner.go:1278 msg=load request="{Operation:alloc LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:Enabled KvSize:8192 KvCacheType: NumThreads:24 GPULayers:25[ID:GPU-7f2d3c91-8c50-7c91-e17f-ac3279c58604 Layers:25(0..24)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"

In the cases with the huge delay, CPU and RAM are heavily used for around 30s. After that, the GPU is utilized and everything works as expected.
From that point on, Docker works without any delay until the container is restarted.
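
For reference, one way to watch where that time goes (a rough sketch; device and container names are assumptions, the blob path is taken from the log above):

```shell
# GPU memory/utilization over time while the model loads
nvidia-smi dmon -s mu
# Disk throughput on the host while the runner sits in "loading model"
iostat -xm 1
# Raw read speed of the model blob from inside the container
docker exec ollama bash -c 'time dd if=/root/.ollama/models/blobs/sha256-e7b273f9636059a689e3ddcab3716e4f65abe0143ac978e46673ad0e52d09efb of=/dev/null bs=1M'
```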

The only difference I see is:
compiler=cgo(gc vs compiler=cgo(gcc)

Best
Thomas


@thomas-meier85 commented on GitHub (Jan 18, 2026):

Any suggestion why this happens?
