[GH-ISSUE #13904] Ollama Docker with strange delay #9097

Closed
opened 2026-04-12 21:56:54 -05:00 by GiteaMirror · 0 comments

Originally created by @thomas-meier85 on GitHub (Jan 25, 2026).
Original GitHub issue: https://github.com/ollama/ollama/issues/13904

What is the issue?

I'm seeing strange delays when running Ollama in Docker (Compose).

Basic information:
Ubuntu 24.04 LTS, current Docker version
Nvidia RTX PRO 6000 Max-Q - 96GB GDDR7 VRAM
512GB of System Memory

Nvidia-SMI output:
Sun Jan 25 15:48:18 2026
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.95.05 Driver Version: 590.48.01 CUDA Version: 13.1 |
+-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA RTX PRO 6000 Blac... On | 00000000:17:00.0 Off | Off |
| 30% 39C P8 5W / 300W | 14343MiB / 97887MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+

Interesting as well:
When I stop the container with "docker compose stop ollama" and start it again, I don't get the 30s delay.
But when I start the container with "docker compose up -d", the first request to every model takes 30s.
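The two startup paths can be compared with a rough timing sketch like the following (hypothetical: it assumes the Compose service is named `ollama` and the API is published on the default port 11434; adjust to your compose file):

```shell
# Hypothetical repro sketch: compare first-request latency after a fresh
# "up -d" versus a stop/start cycle of the same container.
# Assumes the Compose service is named "ollama" and the API is
# reachable at localhost:11434.

# Cold path: (re)create the container, then time the first request.
docker compose up -d ollama
time curl -s http://localhost:11434/api/generate \
  -d '{"model":"gpt-oss:20b","prompt":"hi","stream":false}' > /dev/null

# Warm path: stop/start the existing container, then time the request again.
docker compose stop ollama && docker compose start ollama
time curl -s http://localhost:11434/api/generate \
  -d '{"model":"gpt-oss:20b","prompt":"hi","stream":false}' > /dev/null
```

If the cold path is consistently ~30s slower while the warm path is a few seconds, that matches the behavior described above.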

Please let me know in case you need more information or logs.

Best
Thomas

Relevant log output

**Scenario 1: Docker container freshly started, prompt "Chat" - Model: gpt-oss:20b**
**It takes around 30s to start the Ollama runner**
time=2026-01-25T14:42:32.062Z level=DEBUG source=runner.go:264 msg="refreshing free memory"
time=2026-01-25T14:42:32.062Z level=DEBUG source=runner.go:328 msg="unable to refresh all GPUs with existing runners, performing bootstrap discovery"
time=2026-01-25T14:42:32.063Z level=INFO source=server.go:429 msg="starting runner" cmd="/usr/bin/ollama runner --ollama-engine --port 43921"
time=2026-01-25T14:42:32.063Z level=DEBUG source=server.go:430 msg=subprocess PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin OLLAMA_LLM_LIBRARY=cuda_v13 OLLAMA_KEEP_ALIVE=-1 OLLAMA_HOST=0.0.0.0 OLLAMA_DEBUG=true OLLAMA_FLASH_ATTENTION=1 LD_LIBRARY_PATH=/usr/lib/ollama:/usr/lib/ollama/cuda_v13:/usr/local/nvidia/l
ib:/usr/local/nvidia/lib64 OLLAMA_LIBRARY_PATH=/usr/lib/ollama:/usr/lib/ollama/cuda_v13
time=2026-01-25T14:42:32.208Z level=DEBUG source=runner.go:437 msg="bootstrap discovery took" duration=145.477131ms OLLAMA_LIBRARY_PATH="[/usr/lib/ollama /usr/lib/ollama/cuda_v13]" extra_envs=map[]
time=2026-01-25T14:42:32.208Z level=DEBUG source=runner.go:40 msg="overall device VRAM discovery took" duration=145.670137ms
time=2026-01-25T14:42:32.213Z level=WARN source=cpu_linux.go:130 msg="failed to parse CPU allowed micro secs" error="strconv.ParseInt: parsing \"max\": invalid syntax"
time=2026-01-25T14:42:32.213Z level=DEBUG source=sched.go:195 msg="updating default concurrency" OLLAMA_MAX_LOADED_MODELS=3 gpu_count=1
time=2026-01-25T14:42:32.250Z level=DEBUG source=ggml.go:298 msg="key with type not found" key=general.alignment default=32
time=2026-01-25T14:42:32.251Z level=DEBUG source=sched.go:220 msg="loading first model" model=/root/.ollama/models/blobs/sha256-e7b273f9636059a689e3ddcab3716e4f65abe0143ac978e46673ad0e52d09efb
time=2026-01-25T14:42:32.412Z level=DEBUG source=ggml.go:298 msg="key with type not found" key=general.alignment default=32
time=2026-01-25T14:42:32.413Z level=DEBUG source=ggml.go:298 msg="key with type not found" key=gptoss.pooling_type default=0
time=2026-01-25T14:42:32.413Z level=INFO source=server.go:245 msg="enabling flash attention"
time=2026-01-25T14:42:32.413Z level=INFO source=server.go:429 msg="starting runner" cmd="/usr/bin/ollama runner --ollama-engine --model /root/.ollama/models/blobs/sha256-e7b273f9636059a689e3ddcab3716e4f65abe0143ac978e46673ad0e52d09efb --port 46675"
time=2026-01-25T14:42:32.413Z level=DEBUG source=server.go:430 msg=subprocess PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin OLLAMA_LLM_LIBRARY=cuda_v13 OLLAMA_KEEP_ALIVE=-1 OLLAMA_HOST=0.0.0.0 OLLAMA_DEBUG=true OLLAMA_FLASH_ATTENTION=1 LD_LIBRARY_PATH=/usr/lib/ollama:/usr/lib/ollama/cuda_v13:/usr/local/nvidia/l
ib:/usr/local/nvidia/lib64 OLLAMA_LIBRARY_PATH=/usr/lib/ollama:/usr/lib/ollama/cuda_v13
time=2026-01-25T14:42:32.414Z level=INFO source=sched.go:452 msg="system memory" total="503.4 GiB" free="503.2 GiB" free_swap="4.0 GiB"
time=2026-01-25T14:42:32.414Z level=INFO source=sched.go:459 msg="gpu memory" id=GPU-7f2d3c91-8c50-7c91-e17f-ac3279c58604 library=CUDA available="93.4 GiB" free="93.9 GiB" minimum="457.0 MiB" overhead="0 B"
time=2026-01-25T14:42:32.414Z level=INFO source=server.go:755 msg="loading model" "model layers"=25 requested=-1
time=2026-01-25T14:42:32.434Z level=INFO source=runner.go:1405 msg="starting ollama engine"
time=2026-01-25T14:42:32.434Z level=INFO source=runner.go:1440 msg="Server listening on 127.0.0.1:46675"
time=2026-01-25T14:42:32.437Z level=INFO source=runner.go:1278 msg=load request="{Operation:fit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:Enabled KvSize:8192 KvCacheType: NumThreads:24 GPULayers:25[ID:GPU-7f2d3c91-8c50-7c91-e17f-ac3279c58604 Layers:25(0..24)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2026-01-25T14:42:32.518Z level=DEBUG source=ggml.go:298 msg="key with type not found" key=general.alignment default=32
time=2026-01-25T14:42:32.519Z level=DEBUG source=ggml.go:298 msg="key with type not found" key=general.name default=""
time=2026-01-25T14:42:32.519Z level=DEBUG source=ggml.go:298 msg="key with type not found" key=general.description default=""
time=2026-01-25T14:42:32.519Z level=INFO source=ggml.go:136 msg="" architecture=gptoss file_type=MXFP4 name="" description="" num_tensors=459 num_key_values=32
time=2026-01-25T14:42:32.519Z level=DEBUG source=ggml.go:94 msg="ggml backend load all from path" path=/usr/lib/ollama
load_backend: loaded CPU backend from /usr/lib/ollama/libggml-cpu-icelake.so
time=2026-01-25T14:42:32.527Z level=DEBUG source=ggml.go:94 msg="ggml backend load all from path" path=/usr/lib/ollama/cuda_v13
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition, compute capability 12.0, VMM: yes, ID: GPU-7f2d3c91-8c50-7c91-e17f-ac3279c58604
load_backend: loaded CUDA backend from /usr/lib/ollama/cuda_v13/libggml-cuda.so
time=2026-01-25T14:42:32.614Z level=INFO source=ggml.go:104 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.BMI2=1 CPU.0.AVX512=1 CPU.0.AVX512_VBMI=1 CPU.0.AVX512_VNNI=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 CUDA.0.ARCHS=750,800,860,870,890,900,1000,1030,1100,1200,1210 CUDA.0.USE_GRAPHS=
1 CUDA.0.PEER_MAX_BATCH_SIZE=128 compiler=cgo(gcc)
time=2026-01-25T14:42:32.617Z level=DEBUG source=ggml.go:298 msg="key with type not found" key=gptoss.pooling_type default=0
time=2026-01-25T14:42:51.557Z level=DEBUG source=ggml.go:852 msg="compute graph" nodes=1399 splits=2
time=2026-01-25T14:43:00.007Z level=DEBUG source=ggml.go:852 msg="compute graph" nodes=1399 splits=2
time=2026-01-25T14:43:00.008Z level=DEBUG source=device.go:240 msg="model weights" device=CUDA0 size="11.8 GiB"
time=2026-01-25T14:43:00.008Z level=DEBUG source=device.go:245 msg="model weights" device=CPU size="1.1 GiB"
time=2026-01-25T14:43:00.008Z level=DEBUG source=device.go:251 msg="kv cache" device=CUDA0 size="300.0 MiB"
time=2026-01-25T14:43:00.008Z level=DEBUG source=device.go:262 msg="compute graph" device=CUDA0 size="160.6 MiB"
time=2026-01-25T14:43:00.008Z level=DEBUG source=device.go:267 msg="compute graph" device=CPU size="5.6 MiB"
time=2026-01-25T14:43:00.008Z level=DEBUG source=device.go:272 msg="total memory" size="13.3 GiB"
time=2026-01-25T14:43:00.008Z level=DEBUG source=server.go:780 msg=memory success=true required.InputWeights=1158266880 required.CPU.Graph=5898240 required.CUDA0.ID=GPU-7f2d3c91-8c50-7c91-e17f-ac3279c58604 required.CUDA0.Weights="[477628928 477628928 477628928 477628928 477628928 477628928 477628928 477628928 477628928 477628928 4
77628928 477628928 477628928 477628928 477628928 477628928 477628928 477628928 477628928 477628928 477628928 477628928 477628928 477628928 1158278400]" required.CUDA0.Cache="[9437184 16777216 9437184 16777216 9437184 16777216 9437184 16777216 9437184 16777216 9437184 16777216 9437184 16777216 9437184 16777216 9437184 16777216 9437
184 16777216 9437184 16777216 9437184 16777216 0]" required.CUDA0.Graph=168433792
time=2026-01-25T14:43:00.008Z level=DEBUG source=server.go:974 msg="available gpu" id=GPU-7f2d3c91-8c50-7c91-e17f-ac3279c58604 library=CUDA "available layer vram"="93.3 GiB" backoff=0.00 minimum="457.0 MiB" overhead="0 B" graph="160.6 MiB"
time=2026-01-25T14:43:00.008Z level=DEBUG source=server.go:791 msg="new layout created" layers="25[ID:GPU-7f2d3c91-8c50-7c91-e17f-ac3279c58604 Layers:25(0..24)]"
time=2026-01-25T14:43:00.009Z level=INFO source=runner.go:1278 msg=load request="{Operation:alloc LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:Enabled KvSize:8192 KvCacheType: NumThreads:24 GPULayers:25[ID:GPU-7f2d3c91-8c50-7c91-e17f-ac3279c58604 Layers:25(0..24)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}
"
time=2026-01-25T14:43:00.078Z level=DEBUG source=ggml.go:298 msg="key with type not found" key=general.alignment default=32
time=2026-01-25T14:43:00.082Z level=DEBUG source=ggml.go:298 msg="key with type not found" key=gptoss.pooling_type default=0
time=2026-01-25T14:43:00.147Z level=DEBUG source=ggml.go:852 msg="compute graph" nodes=1399 splits=2
time=2026-01-25T14:43:00.151Z level=DEBUG source=ggml.go:852 msg="compute graph" nodes=1399 splits=2
time=2026-01-25T14:43:00.151Z level=DEBUG source=device.go:240 msg="model weights" device=CUDA0 size="11.8 GiB"
time=2026-01-25T14:43:00.151Z level=DEBUG source=device.go:245 msg="model weights" device=CPU size="1.1 GiB"
time=2026-01-25T14:43:00.151Z level=DEBUG source=device.go:251 msg="kv cache" device=CUDA0 size="300.0 MiB"
time=2026-01-25T14:43:00.151Z level=DEBUG source=device.go:262 msg="compute graph" device=CUDA0 size="160.6 MiB"
time=2026-01-25T14:43:00.151Z level=DEBUG source=device.go:267 msg="compute graph" device=CPU size="5.6 MiB"
time=2026-01-25T14:43:00.151Z level=DEBUG source=device.go:272 msg="total memory" size="13.3 GiB"
time=2026-01-25T14:43:00.151Z level=DEBUG source=server.go:780 msg=memory success=true required.InputWeights=1158266880 required.CPU.Graph=5898240 required.CUDA0.ID=GPU-7f2d3c91-8c50-7c91-e17f-ac3279c58604 required.CUDA0.Weights="[477628928 477628928 477628928 477628928 477628928 477628928 477628928 477628928 477628928 477628928 4
77628928 477628928 477628928 477628928 477628928 477628928 477628928 477628928 477628928 477628928 477628928 477628928 477628928 477628928 1158278400]" required.CUDA0.Cache="[9437184 16777216 9437184 16777216 9437184 16777216 9437184 16777216 9437184 16777216 9437184 16777216 9437184 16777216 9437184 16777216 9437184 16777216 9437
184 16777216 9437184 16777216 9437184 16777216 0]" required.CUDA0.Graph=168433792
time=2026-01-25T14:43:00.151Z level=DEBUG source=server.go:974 msg="available gpu" id=GPU-7f2d3c91-8c50-7c91-e17f-ac3279c58604 library=CUDA "available layer vram"="93.3 GiB" backoff=0.00 minimum="457.0 MiB" overhead="0 B" graph="160.6 MiB"
time=2026-01-25T14:43:00.152Z level=DEBUG source=server.go:791 msg="new layout created" layers="25[ID:GPU-7f2d3c91-8c50-7c91-e17f-ac3279c58604 Layers:25(0..24)]"
time=2026-01-25T14:43:00.152Z level=INFO source=runner.go:1278 msg=load request="{Operation:commit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:Enabled KvSize:8192 KvCacheType: NumThreads:24 GPULayers:25[ID:GPU-7f2d3c91-8c50-7c91-e17f-ac3279c58604 Layers:25(0..24)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false
}"
time=2026-01-25T14:43:00.152Z level=INFO source=ggml.go:482 msg="offloading 24 repeating layers to GPU"
time=2026-01-25T14:43:00.153Z level=INFO source=ggml.go:489 msg="offloading output layer to GPU"
time=2026-01-25T14:43:00.153Z level=INFO source=ggml.go:494 msg="offloaded 25/25 layers to GPU"
time=2026-01-25T14:43:00.153Z level=INFO source=device.go:240 msg="model weights" device=CUDA0 size="11.8 GiB"
time=2026-01-25T14:43:00.153Z level=INFO source=device.go:245 msg="model weights" device=CPU size="1.1 GiB"
time=2026-01-25T14:43:00.153Z level=INFO source=device.go:251 msg="kv cache" device=CUDA0 size="300.0 MiB"
time=2026-01-25T14:43:00.153Z level=INFO source=device.go:262 msg="compute graph" device=CUDA0 size="160.6 MiB"
time=2026-01-25T14:43:00.153Z level=INFO source=device.go:267 msg="compute graph" device=CPU size="5.6 MiB"
time=2026-01-25T14:43:00.153Z level=INFO source=device.go:272 msg="total memory" size="13.3 GiB"
time=2026-01-25T14:43:00.153Z level=INFO source=sched.go:526 msg="loaded runners" count=1
time=2026-01-25T14:43:00.153Z level=INFO source=server.go:1347 msg="waiting for llama runner to start responding"
time=2026-01-25T14:43:00.154Z level=INFO source=server.go:1381 msg="waiting for server to become available" status="llm server loading model"
time=2026-01-25T14:43:00.154Z level=DEBUG source=server.go:1391 msg="model load progress 0.00"
time=2026-01-25T14:43:00.405Z level=DEBUG source=server.go:1391 msg="model load progress 0.05"
time=2026-01-25T14:43:00.656Z level=DEBUG source=server.go:1391 msg="model load progress 0.09"
time=2026-01-25T14:43:00.907Z level=DEBUG source=server.go:1391 msg="model load progress 0.14"
time=2026-01-25T14:43:01.158Z level=DEBUG source=server.go:1391 msg="model load progress 0.18"
time=2026-01-25T14:43:01.410Z level=DEBUG source=server.go:1391 msg="model load progress 0.22"
time=2026-01-25T14:43:01.661Z level=DEBUG source=server.go:1391 msg="model load progress 0.27"
time=2026-01-25T14:43:01.912Z level=DEBUG source=server.go:1391 msg="model load progress 0.30"
time=2026-01-25T14:43:02.164Z level=DEBUG source=server.go:1391 msg="model load progress 0.35"
time=2026-01-25T14:43:02.415Z level=DEBUG source=server.go:1391 msg="model load progress 0.48"
time=2026-01-25T14:43:02.666Z level=DEBUG source=server.go:1391 msg="model load progress 0.64"
time=2026-01-25T14:43:02.916Z level=DEBUG source=server.go:1391 msg="model load progress 0.82"
time=2026-01-25T14:43:03.168Z level=DEBUG source=server.go:1391 msg="model load progress 0.95"
time=2026-01-25T14:43:03.348Z level=DEBUG source=ggml.go:298 msg="key with type not found" key=gptoss.pooling_type default=0
**time=2026-01-25T14:43:03.420Z level=INFO source=server.go:1385 msg="llama runner started in 31.01 seconds"**
time=2026-01-25T14:43:03.420Z level=DEBUG source=sched.go:538 msg="finished setting up" runner.name=registry.ollama.ai/library/gpt-oss:latest runner.inference="[{ID:GPU-7f2d3c91-8c50-7c91-e17f-ac3279c58604 Library:CUDA}]" runner.size="13.3 GiB" runner.vram="13.3 GiB" runner.parallel=1 runner.pid=103 runner.model=/root/.ollama/mode
ls/blobs/sha256-e7b273f9636059a689e3ddcab3716e4f65abe0143ac978e46673ad0e52d09efb runner.num_ctx=8192
time=2026-01-25T14:43:03.421Z level=DEBUG source=server.go:1533 msg="completion request" images=0 prompt=306 format=""
time=2026-01-25T14:43:03.498Z level=DEBUG source=cache.go:142 msg="loading cache slot" id=0 cache=0 prompt=68 used=0 remaining=68

**Scenario 2: Docker container already started - model manually unloaded, prompt "Chat" - Model: gpt-oss:20b**
**It takes around 3s to start the Ollama runner**
time=2026-01-25T14:40:26.488Z level=DEBUG source=runner.go:264 msg="refreshing free memory"
time=2026-01-25T14:40:26.488Z level=DEBUG source=runner.go:328 msg="unable to refresh all GPUs with existing runners, performing bootstrap discovery"
time=2026-01-25T14:40:26.489Z level=INFO source=server.go:429 msg="starting runner" cmd="/usr/bin/ollama runner --ollama-engine --port 38037"
time=2026-01-25T14:40:26.489Z level=DEBUG source=server.go:430 msg=subprocess PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin OLLAMA_DEBUG=true OLLAMA_FLASH_ATTENTION=1 OLLAMA_LLM_LIBRARY=cuda_v13 OLLAMA_KEEP_ALIVE=-1 OLLAMA_HOST=0.0.0.0 LD_LIBRARY_PATH=/usr/lib/ollama:/usr/lib/ollama/cuda_v13:/usr/local/nvidia/l
ib:/usr/local/nvidia/lib64 OLLAMA_LIBRARY_PATH=/usr/lib/ollama:/usr/lib/ollama/cuda_v13
time=2026-01-25T14:40:26.627Z level=DEBUG source=runner.go:437 msg="bootstrap discovery took" duration=139.130954ms OLLAMA_LIBRARY_PATH="[/usr/lib/ollama /usr/lib/ollama/cuda_v13]" extra_envs=map[]
time=2026-01-25T14:40:26.628Z level=DEBUG source=runner.go:40 msg="overall device VRAM discovery took" duration=139.384896ms
time=2026-01-25T14:40:26.632Z level=WARN source=cpu_linux.go:130 msg="failed to parse CPU allowed micro secs" error="strconv.ParseInt: parsing \"max\": invalid syntax"
time=2026-01-25T14:40:26.671Z level=DEBUG source=ggml.go:298 msg="key with type not found" key=general.alignment default=32
time=2026-01-25T14:40:26.671Z level=DEBUG source=sched.go:220 msg="loading first model" model=/root/.ollama/models/blobs/sha256-e7b273f9636059a689e3ddcab3716e4f65abe0143ac978e46673ad0e52d09efb
time=2026-01-25T14:40:26.831Z level=DEBUG source=ggml.go:298 msg="key with type not found" key=general.alignment default=32
time=2026-01-25T14:40:26.832Z level=DEBUG source=ggml.go:298 msg="key with type not found" key=gptoss.pooling_type default=0
time=2026-01-25T14:40:26.832Z level=INFO source=server.go:245 msg="enabling flash attention"
time=2026-01-25T14:40:26.832Z level=INFO source=server.go:429 msg="starting runner" cmd="/usr/bin/ollama runner --ollama-engine --model /root/.ollama/models/blobs/sha256-e7b273f9636059a689e3ddcab3716e4f65abe0143ac978e46673ad0e52d09efb --port 37943"
time=2026-01-25T14:40:26.832Z level=DEBUG source=server.go:430 msg=subprocess PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin OLLAMA_DEBUG=true OLLAMA_FLASH_ATTENTION=1 OLLAMA_LLM_LIBRARY=cuda_v13 OLLAMA_KEEP_ALIVE=-1 OLLAMA_HOST=0.0.0.0 LD_LIBRARY_PATH=/usr/lib/ollama:/usr/lib/ollama/cuda_v13:/usr/local/nvidia/l
ib:/usr/local/nvidia/lib64 OLLAMA_LIBRARY_PATH=/usr/lib/ollama:/usr/lib/ollama/cuda_v13
time=2026-01-25T14:40:26.833Z level=INFO source=sched.go:452 msg="system memory" total="503.4 GiB" free="503.2 GiB" free_swap="4.0 GiB"
time=2026-01-25T14:40:26.833Z level=INFO source=sched.go:459 msg="gpu memory" id=GPU-7f2d3c91-8c50-7c91-e17f-ac3279c58604 library=CUDA available="93.4 GiB" free="93.9 GiB" minimum="457.0 MiB" overhead="0 B"
time=2026-01-25T14:40:26.833Z level=INFO source=server.go:755 msg="loading model" "model layers"=25 requested=-1
time=2026-01-25T14:40:26.853Z level=INFO source=runner.go:1405 msg="starting ollama engine"
time=2026-01-25T14:40:26.853Z level=INFO source=runner.go:1440 msg="Server listening on 127.0.0.1:37943"
time=2026-01-25T14:40:26.857Z level=INFO source=runner.go:1278 msg=load request="{Operation:fit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:Enabled KvSize:8192 KvCacheType: NumThreads:24 GPULayers:25[ID:GPU-7f2d3c91-8c50-7c91-e17f-ac3279c58604 Layers:25(0..24)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2026-01-25T14:40:26.935Z level=DEBUG source=ggml.go:298 msg="key with type not found" key=general.alignment default=32
time=2026-01-25T14:40:26.936Z level=DEBUG source=ggml.go:298 msg="key with type not found" key=general.name default=""
time=2026-01-25T14:40:26.936Z level=DEBUG source=ggml.go:298 msg="key with type not found" key=general.description default=""
time=2026-01-25T14:40:26.936Z level=INFO source=ggml.go:136 msg="" architecture=gptoss file_type=MXFP4 name="" description="" num_tensors=459 num_key_values=32
time=2026-01-25T14:40:26.936Z level=DEBUG source=ggml.go:94 msg="ggml backend load all from path" path=/usr/lib/ollama
load_backend: loaded CPU backend from /usr/lib/ollama/libggml-cpu-icelake.so
time=2026-01-25T14:40:26.944Z level=DEBUG source=ggml.go:94 msg="ggml backend load all from path" path=/usr/lib/ollama/cuda_v13
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition, compute capability 12.0, VMM: yes, ID: GPU-7f2d3c91-8c50-7c91-e17f-ac3279c58604
load_backend: loaded CUDA backend from /usr/lib/ollama/cuda_v13/libggml-cuda.so
time=2026-01-25T14:40:27.033Z level=INFO source=ggml.go:104 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.BMI2=1 CPU.0.AVX512=1 CPU.0.AVX512_VBMI=1 CPU.0.AVX512_VNNI=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 CUDA.0.ARCHS=750,800,860,870,890,900,1000,1030,1100,1200,1210 CUDA.0.USE_GRAPHS=
1 CUDA.0.PEER_MAX_BATCH_SIZE=128 compiler=cgo(gcc)
time=2026-01-25T14:40:27.035Z level=DEBUG source=ggml.go:298 msg="key with type not found" key=gptoss.pooling_type default=0
time=2026-01-25T14:40:27.588Z level=DEBUG source=ggml.go:852 msg="compute graph" nodes=1399 splits=2
time=2026-01-25T14:40:27.615Z level=DEBUG source=ggml.go:852 msg="compute graph" nodes=1399 splits=2
time=2026-01-25T14:40:27.616Z level=DEBUG source=device.go:240 msg="model weights" device=CUDA0 size="11.8 GiB"
time=2026-01-25T14:40:27.616Z level=DEBUG source=device.go:245 msg="model weights" device=CPU size="1.1 GiB"
time=2026-01-25T14:40:27.616Z level=DEBUG source=device.go:251 msg="kv cache" device=CUDA0 size="300.0 MiB"
time=2026-01-25T14:40:27.616Z level=DEBUG source=device.go:262 msg="compute graph" device=CUDA0 size="160.6 MiB"
time=2026-01-25T14:40:27.616Z level=DEBUG source=device.go:267 msg="compute graph" device=CPU size="5.6 MiB"
time=2026-01-25T14:40:27.616Z level=DEBUG source=device.go:272 msg="total memory" size="13.3 GiB"
time=2026-01-25T14:40:27.616Z level=DEBUG source=server.go:780 msg=memory success=true required.InputWeights=1158266880 required.CPU.Graph=5898240 required.CUDA0.ID=GPU-7f2d3c91-8c50-7c91-e17f-ac3279c58604 required.CUDA0.Weights="[477628928 477628928 477628928 477628928 477628928 477628928 477628928 477628928 477628928 477628928 4
77628928 477628928 477628928 477628928 477628928 477628928 477628928 477628928 477628928 477628928 477628928 477628928 477628928 477628928 1158278400]" required.CUDA0.Cache="[9437184 16777216 9437184 16777216 9437184 16777216 9437184 16777216 9437184 16777216 9437184 16777216 9437184 16777216 9437184 16777216 9437184 16777216 9437
184 16777216 9437184 16777216 9437184 16777216 0]" required.CUDA0.Graph=168433792
time=2026-01-25T14:40:27.616Z level=DEBUG source=server.go:974 msg="available gpu" id=GPU-7f2d3c91-8c50-7c91-e17f-ac3279c58604 library=CUDA "available layer vram"="93.3 GiB" backoff=0.00 minimum="457.0 MiB" overhead="0 B" graph="160.6 MiB"
time=2026-01-25T14:40:27.617Z level=DEBUG source=server.go:791 msg="new layout created" layers="25[ID:GPU-7f2d3c91-8c50-7c91-e17f-ac3279c58604 Layers:25(0..24)]"
time=2026-01-25T14:40:27.617Z level=INFO source=runner.go:1278 msg=load request="{Operation:alloc LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:Enabled KvSize:8192 KvCacheType: NumThreads:24 GPULayers:25[ID:GPU-7f2d3c91-8c50-7c91-e17f-ac3279c58604 Layers:25(0..24)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}
"
time=2026-01-25T14:40:27.688Z level=DEBUG source=ggml.go:298 msg="key with type not found" key=general.alignment default=32
time=2026-01-25T14:40:27.692Z level=DEBUG source=ggml.go:298 msg="key with type not found" key=gptoss.pooling_type default=0
time=2026-01-25T14:40:27.756Z level=DEBUG source=ggml.go:852 msg="compute graph" nodes=1399 splits=2
time=2026-01-25T14:40:27.760Z level=DEBUG source=ggml.go:852 msg="compute graph" nodes=1399 splits=2
time=2026-01-25T14:40:27.761Z level=DEBUG source=device.go:240 msg="model weights" device=CUDA0 size="11.8 GiB"
time=2026-01-25T14:40:27.761Z level=DEBUG source=device.go:245 msg="model weights" device=CPU size="1.1 GiB"
time=2026-01-25T14:40:27.761Z level=DEBUG source=device.go:251 msg="kv cache" device=CUDA0 size="300.0 MiB"
time=2026-01-25T14:40:27.761Z level=DEBUG source=device.go:262 msg="compute graph" device=CUDA0 size="160.6 MiB"
time=2026-01-25T14:40:27.761Z level=DEBUG source=device.go:267 msg="compute graph" device=CPU size="5.6 MiB"
time=2026-01-25T14:40:27.761Z level=DEBUG source=device.go:272 msg="total memory" size="13.3 GiB"
time=2026-01-25T14:40:27.761Z level=DEBUG source=server.go:780 msg=memory success=true required.InputWeights=1158266880 required.CPU.Graph=5898240 required.CUDA0.ID=GPU-7f2d3c91-8c50-7c91-e17f-ac3279c58604 required.CUDA0.Weights="[477628928 477628928 477628928 477628928 477628928 477628928 477628928 477628928 477628928 477628928 4
77628928 477628928 477628928 477628928 477628928 477628928 477628928 477628928 477628928 477628928 477628928 477628928 477628928 477628928 1158278400]" required.CUDA0.Cache="[9437184 16777216 9437184 16777216 9437184 16777216 9437184 16777216 9437184 16777216 9437184 16777216 9437184 16777216 9437184 16777216 9437184 16777216 9437
184 16777216 9437184 16777216 9437184 16777216 0]" required.CUDA0.Graph=168433792
time=2026-01-25T14:40:27.761Z level=DEBUG source=server.go:974 msg="available gpu" id=GPU-7f2d3c91-8c50-7c91-e17f-ac3279c58604 library=CUDA "available layer vram"="93.3 GiB" backoff=0.00 minimum="457.0 MiB" overhead="0 B" graph="160.6 MiB"
time=2026-01-25T14:40:27.761Z level=DEBUG source=server.go:791 msg="new layout created" layers="25[ID:GPU-7f2d3c91-8c50-7c91-e17f-ac3279c58604 Layers:25(0..24)]"
time=2026-01-25T14:40:27.762Z level=INFO source=runner.go:1278 msg=load request="{Operation:commit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:Enabled KvSize:8192 KvCacheType: NumThreads:24 GPULayers:25[ID:GPU-7f2d3c91-8c50-7c91-e17f-ac3279c58604 Layers:25(0..24)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false
}"
time=2026-01-25T14:40:27.762Z level=INFO source=ggml.go:482 msg="offloading 24 repeating layers to GPU"
time=2026-01-25T14:40:27.762Z level=INFO source=ggml.go:489 msg="offloading output layer to GPU"
time=2026-01-25T14:40:27.762Z level=INFO source=ggml.go:494 msg="offloaded 25/25 layers to GPU"
time=2026-01-25T14:40:27.763Z level=INFO source=device.go:240 msg="model weights" device=CUDA0 size="11.8 GiB"
time=2026-01-25T14:40:27.763Z level=INFO source=device.go:245 msg="model weights" device=CPU size="1.1 GiB"
time=2026-01-25T14:40:27.763Z level=INFO source=device.go:251 msg="kv cache" device=CUDA0 size="300.0 MiB"
time=2026-01-25T14:40:27.763Z level=INFO source=device.go:262 msg="compute graph" device=CUDA0 size="160.6 MiB"
time=2026-01-25T14:40:27.763Z level=INFO source=device.go:267 msg="compute graph" device=CPU size="5.6 MiB"
time=2026-01-25T14:40:27.763Z level=INFO source=device.go:272 msg="total memory" size="13.3 GiB"
time=2026-01-25T14:40:27.763Z level=INFO source=sched.go:526 msg="loaded runners" count=1
time=2026-01-25T14:40:27.763Z level=INFO source=server.go:1347 msg="waiting for llama runner to start responding"
time=2026-01-25T14:40:27.763Z level=INFO source=server.go:1381 msg="waiting for server to become available" status="llm server loading model"
time=2026-01-25T14:40:27.763Z level=DEBUG source=server.go:1391 msg="model load progress 0.00"
time=2026-01-25T14:40:28.014Z level=DEBUG source=server.go:1391 msg="model load progress 0.12"
time=2026-01-25T14:40:28.265Z level=DEBUG source=server.go:1391 msg="model load progress 0.27"
time=2026-01-25T14:40:28.515Z level=DEBUG source=server.go:1391 msg="model load progress 0.41"
time=2026-01-25T14:40:28.766Z level=DEBUG source=server.go:1391 msg="model load progress 0.55"
time=2026-01-25T14:40:29.017Z level=DEBUG source=server.go:1391 msg="model load progress 0.72"
time=2026-01-25T14:40:29.268Z level=DEBUG source=server.go:1391 msg="model load progress 0.90"
time=2026-01-25T14:40:29.519Z level=DEBUG source=server.go:1391 msg="model load progress 0.98"
time=2026-01-25T14:40:29.602Z level=DEBUG source=ggml.go:298 msg="key with type not found" key=gptoss.pooling_type default=0
**time=2026-01-25T14:40:29.771Z level=INFO source=server.go:1385 msg="llama runner started in 2.94 seconds"**
time=2026-01-25T14:40:29.772Z level=DEBUG source=sched.go:538 msg="finished setting up" runner.name=registry.ollama.ai/library/gpt-oss:latest runner.inference="[{ID:GPU-7f2d3c91-8c50-7c91-e17f-ac3279c58604 Library:CUDA}]" runner.size="13.3 GiB" runner.vram="13.3 GiB" runner.parallel=1 runner.pid=217 runner.model=/root/.ollama/mode
ls/blobs/sha256-e7b273f9636059a689e3ddcab3716e4f65abe0143ac978e46673ad0e52d09efb runner.num_ctx=8192
time=2026-01-25T14:40:29.772Z level=DEBUG source=server.go:1533 msg="completion request" images=0 prompt=306 format=""
time=2026-01-25T14:40:29.858Z level=DEBUG source=cache.go:142 msg="loading cache slot" id=0 cache=0 prompt=68 used=0 remaining=68

OS

Linux

GPU

Nvidia

CPU

Intel

Ollama version

latest

source=device.go:245 msg="model weights" device=CPU size="1.1 GiB" time=2026-01-25T14:43:00.008Z level=DEBUG source=device.go:251 msg="kv cache" device=CUDA0 size="300.0 MiB" time=2026-01-25T14:43:00.008Z level=DEBUG source=device.go:262 msg="compute graph" device=CUDA0 size="160.6 MiB" time=2026-01-25T14:43:00.008Z level=DEBUG source=device.go:267 msg="compute graph" device=CPU size="5.6 MiB" time=2026-01-25T14:43:00.008Z level=DEBUG source=device.go:272 msg="total memory" size="13.3 GiB" time=2026-01-25T14:43:00.008Z level=DEBUG source=server.go:780 msg=memory success=true required.InputWeights=1158266880 required.CPU.Graph=5898240 required.CUDA0.ID=GPU-7f2d3c91-8c50-7c91-e17f-ac3279c58604 required.CUDA0.Weights="[477628928 477628928 477628928 477628928 477628928 477628928 477628928 477628928 477628928 477628928 4 77628928 477628928 477628928 477628928 477628928 477628928 477628928 477628928 477628928 477628928 477628928 477628928 477628928 477628928 1158278400]" required.CUDA0.Cache="[9437184 16777216 9437184 16777216 9437184 16777216 9437184 16777216 9437184 16777216 9437184 16777216 9437184 16777216 9437184 16777216 9437184 16777216 9437 184 16777216 9437184 16777216 9437184 16777216 0]" required.CUDA0.Graph=168433792 time=2026-01-25T14:43:00.008Z level=DEBUG source=server.go:974 msg="available gpu" id=GPU-7f2d3c91-8c50-7c91-e17f-ac3279c58604 library=CUDA "available layer vram"="93.3 GiB" backoff=0.00 minimum="457.0 MiB" overhead="0 B" graph="160.6 MiB" time=2026-01-25T14:43:00.008Z level=DEBUG source=server.go:791 msg="new layout created" layers="25[ID:GPU-7f2d3c91-8c50-7c91-e17f-ac3279c58604 Layers:25(0..24)]" time=2026-01-25T14:43:00.009Z level=INFO source=runner.go:1278 msg=load request="{Operation:alloc LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:Enabled KvSize:8192 KvCacheType: NumThreads:24 GPULayers:25[ID:GPU-7f2d3c91-8c50-7c91-e17f-ac3279c58604 Layers:25(0..24)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false} " 
time=2026-01-25T14:43:00.078Z level=DEBUG source=ggml.go:298 msg="key with type not found" key=general.alignment default=32
time=2026-01-25T14:43:00.082Z level=DEBUG source=ggml.go:298 msg="key with type not found" key=gptoss.pooling_type default=0
time=2026-01-25T14:43:00.147Z level=DEBUG source=ggml.go:852 msg="compute graph" nodes=1399 splits=2
time=2026-01-25T14:43:00.151Z level=DEBUG source=ggml.go:852 msg="compute graph" nodes=1399 splits=2
time=2026-01-25T14:43:00.151Z level=DEBUG source=device.go:240 msg="model weights" device=CUDA0 size="11.8 GiB"
time=2026-01-25T14:43:00.151Z level=DEBUG source=device.go:245 msg="model weights" device=CPU size="1.1 GiB"
time=2026-01-25T14:43:00.151Z level=DEBUG source=device.go:251 msg="kv cache" device=CUDA0 size="300.0 MiB"
time=2026-01-25T14:43:00.151Z level=DEBUG source=device.go:262 msg="compute graph" device=CUDA0 size="160.6 MiB"
time=2026-01-25T14:43:00.151Z level=DEBUG source=device.go:267 msg="compute graph" device=CPU size="5.6 MiB"
time=2026-01-25T14:43:00.151Z level=DEBUG source=device.go:272 msg="total memory" size="13.3 GiB"
time=2026-01-25T14:43:00.151Z level=DEBUG source=server.go:780 msg=memory success=true required.InputWeights=1158266880 required.CPU.Graph=5898240 required.CUDA0.ID=GPU-7f2d3c91-8c50-7c91-e17f-ac3279c58604 required.CUDA0.Weights="[477628928 477628928 477628928 477628928 477628928 477628928 477628928 477628928 477628928 477628928 477628928 477628928 477628928 477628928 477628928 477628928 477628928 477628928 477628928 477628928 477628928 477628928 477628928 477628928 1158278400]" required.CUDA0.Cache="[9437184 16777216 9437184 16777216 9437184 16777216 9437184 16777216 9437184 16777216 9437184 16777216 9437184 16777216 9437184 16777216 9437184 16777216 9437184 16777216 9437184 16777216 9437184 16777216 0]" required.CUDA0.Graph=168433792
time=2026-01-25T14:43:00.151Z level=DEBUG source=server.go:974 msg="available gpu" id=GPU-7f2d3c91-8c50-7c91-e17f-ac3279c58604 library=CUDA "available
layer vram"="93.3 GiB" backoff=0.00 minimum="457.0 MiB" overhead="0 B" graph="160.6 MiB" time=2026-01-25T14:43:00.152Z level=DEBUG source=server.go:791 msg="new layout created" layers="25[ID:GPU-7f2d3c91-8c50-7c91-e17f-ac3279c58604 Layers:25(0..24)]" time=2026-01-25T14:43:00.152Z level=INFO source=runner.go:1278 msg=load request="{Operation:commit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:Enabled KvSize:8192 KvCacheType: NumThreads:24 GPULayers:25[ID:GPU-7f2d3c91-8c50-7c91-e17f-ac3279c58604 Layers:25(0..24)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false }" time=2026-01-25T14:43:00.152Z level=INFO source=ggml.go:482 msg="offloading 24 repeating layers to GPU" time=2026-01-25T14:43:00.153Z level=INFO source=ggml.go:489 msg="offloading output layer to GPU" time=2026-01-25T14:43:00.153Z level=INFO source=ggml.go:494 msg="offloaded 25/25 layers to GPU" time=2026-01-25T14:43:00.153Z level=INFO source=device.go:240 msg="model weights" device=CUDA0 size="11.8 GiB" time=2026-01-25T14:43:00.153Z level=INFO source=device.go:245 msg="model weights" device=CPU size="1.1 GiB" time=2026-01-25T14:43:00.153Z level=INFO source=device.go:251 msg="kv cache" device=CUDA0 size="300.0 MiB" time=2026-01-25T14:43:00.153Z level=INFO source=device.go:262 msg="compute graph" device=CUDA0 size="160.6 MiB" time=2026-01-25T14:43:00.153Z level=INFO source=device.go:267 msg="compute graph" device=CPU size="5.6 MiB" time=2026-01-25T14:43:00.153Z level=INFO source=device.go:272 msg="total memory" size="13.3 GiB" time=2026-01-25T14:43:00.153Z level=INFO source=sched.go:526 msg="loaded runners" count=1 time=2026-01-25T14:43:00.153Z level=INFO source=server.go:1347 msg="waiting for llama runner to start responding" time=2026-01-25T14:43:00.154Z level=INFO source=server.go:1381 msg="waiting for server to become available" status="llm server loading model" time=2026-01-25T14:43:00.154Z level=DEBUG source=server.go:1391 msg="model load progress 0.00" time=2026-01-25T14:43:00.405Z 
level=DEBUG source=server.go:1391 msg="model load progress 0.05"
time=2026-01-25T14:43:00.656Z level=DEBUG source=server.go:1391 msg="model load progress 0.09"
time=2026-01-25T14:43:00.907Z level=DEBUG source=server.go:1391 msg="model load progress 0.14"
time=2026-01-25T14:43:01.158Z level=DEBUG source=server.go:1391 msg="model load progress 0.18"
time=2026-01-25T14:43:01.410Z level=DEBUG source=server.go:1391 msg="model load progress 0.22"
time=2026-01-25T14:43:01.661Z level=DEBUG source=server.go:1391 msg="model load progress 0.27"
time=2026-01-25T14:43:01.912Z level=DEBUG source=server.go:1391 msg="model load progress 0.30"
time=2026-01-25T14:43:02.164Z level=DEBUG source=server.go:1391 msg="model load progress 0.35"
time=2026-01-25T14:43:02.415Z level=DEBUG source=server.go:1391 msg="model load progress 0.48"
time=2026-01-25T14:43:02.666Z level=DEBUG source=server.go:1391 msg="model load progress 0.64"
time=2026-01-25T14:43:02.916Z level=DEBUG source=server.go:1391 msg="model load progress 0.82"
time=2026-01-25T14:43:03.168Z level=DEBUG source=server.go:1391 msg="model load progress 0.95"
time=2026-01-25T14:43:03.348Z level=DEBUG source=ggml.go:298 msg="key with type not found" key=gptoss.pooling_type default=0
**time=2026-01-25T14:43:03.420Z level=INFO source=server.go:1385 msg="llama runner started in 31.01 seconds"**
time=2026-01-25T14:43:03.420Z level=DEBUG source=sched.go:538 msg="finished setting up" runner.name=registry.ollama.ai/library/gpt-oss:latest runner.inference="[{ID:GPU-7f2d3c91-8c50-7c91-e17f-ac3279c58604 Library:CUDA}]" runner.size="13.3 GiB" runner.vram="13.3 GiB" runner.parallel=1 runner.pid=103 runner.model=/root/.ollama/models/blobs/sha256-e7b273f9636059a689e3ddcab3716e4f65abe0143ac978e46673ad0e52d09efb runner.num_ctx=8192
time=2026-01-25T14:43:03.421Z level=DEBUG source=server.go:1533 msg="completion request" images=0 prompt=306 format=""
time=2026-01-25T14:43:03.498Z level=DEBUG source=cache.go:142 msg="loading cache slot" id=0 cache=0
prompt=68 used=0 remaining=68

**Scenario 2: Docker container already started - Model manually unloaded, prompt "Chat" - Model: gpt-oss:20b**
**It takes around 3s to start the Ollama runner**
time=2026-01-25T14:40:26.488Z level=DEBUG source=runner.go:264 msg="refreshing free memory"
time=2026-01-25T14:40:26.488Z level=DEBUG source=runner.go:328 msg="unable to refresh all GPUs with existing runners, performing bootstrap discovery"
time=2026-01-25T14:40:26.489Z level=INFO source=server.go:429 msg="starting runner" cmd="/usr/bin/ollama runner --ollama-engine --port 38037"
time=2026-01-25T14:40:26.489Z level=DEBUG source=server.go:430 msg=subprocess PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin OLLAMA_DEBUG=true OLLAMA_FLASH_ATTENTION=1 OLLAMA_LLM_LIBRARY=cuda_v13 OLLAMA_KEEP_ALIVE=-1 OLLAMA_HOST=0.0.0.0 LD_LIBRARY_PATH=/usr/lib/ollama:/usr/lib/ollama/cuda_v13:/usr/local/nvidia/lib:/usr/local/nvidia/lib64 OLLAMA_LIBRARY_PATH=/usr/lib/ollama:/usr/lib/ollama/cuda_v13
time=2026-01-25T14:40:26.627Z level=DEBUG source=runner.go:437 msg="bootstrap discovery took" duration=139.130954ms OLLAMA_LIBRARY_PATH="[/usr/lib/ollama /usr/lib/ollama/cuda_v13]" extra_envs=map[]
time=2026-01-25T14:40:26.628Z level=DEBUG source=runner.go:40 msg="overall device VRAM discovery took" duration=139.384896ms
time=2026-01-25T14:40:26.632Z level=WARN source=cpu_linux.go:130 msg="failed to parse CPU allowed micro secs" error="strconv.ParseInt: parsing \"max\": invalid syntax"
time=2026-01-25T14:40:26.671Z level=DEBUG source=ggml.go:298 msg="key with type not found" key=general.alignment default=32
time=2026-01-25T14:40:26.671Z level=DEBUG source=sched.go:220 msg="loading first model" model=/root/.ollama/models/blobs/sha256-e7b273f9636059a689e3ddcab3716e4f65abe0143ac978e46673ad0e52d09efb
time=2026-01-25T14:40:26.831Z level=DEBUG source=ggml.go:298 msg="key with type not found" key=general.alignment default=32
time=2026-01-25T14:40:26.832Z level=DEBUG source=ggml.go:298 msg="key
with type not found" key=gptoss.pooling_type default=0 time=2026-01-25T14:40:26.832Z level=INFO source=server.go:245 msg="enabling flash attention" time=2026-01-25T14:40:26.832Z level=INFO source=server.go:429 msg="starting runner" cmd="/usr/bin/ollama runner --ollama-engine --model /root/.ollama/models/blobs/sha256-e7b273f9636059a689e3ddcab3716e4f65abe0143ac978e46673ad0e52d09efb --port 37943" time=2026-01-25T14:40:26.832Z level=DEBUG source=server.go:430 msg=subprocess PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin OLLAMA_DEBUG=true OLLAMA_FLASH_ATTENTION=1 OLLAMA_LLM_LIBRARY=cuda_v13 OLLAMA_KEEP_ALIVE=-1 OLLAMA_HOST=0.0.0.0 LD_LIBRARY_PATH=/usr/lib/ollama:/usr/lib/ollama/cuda_v13:/usr/local/nvidia/l ib:/usr/local/nvidia/lib64 OLLAMA_LIBRARY_PATH=/usr/lib/ollama:/usr/lib/ollama/cuda_v13 time=2026-01-25T14:40:26.833Z level=INFO source=sched.go:452 msg="system memory" total="503.4 GiB" free="503.2 GiB" free_swap="4.0 GiB" time=2026-01-25T14:40:26.833Z level=INFO source=sched.go:459 msg="gpu memory" id=GPU-7f2d3c91-8c50-7c91-e17f-ac3279c58604 library=CUDA available="93.4 GiB" free="93.9 GiB" minimum="457.0 MiB" overhead="0 B" time=2026-01-25T14:40:26.833Z level=INFO source=server.go:755 msg="loading model" "model layers"=25 requested=-1 time=2026-01-25T14:40:26.853Z level=INFO source=runner.go:1405 msg="starting ollama engine" time=2026-01-25T14:40:26.853Z level=INFO source=runner.go:1440 msg="Server listening on 127.0.0.1:37943" time=2026-01-25T14:40:26.857Z level=INFO source=runner.go:1278 msg=load request="{Operation:fit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:Enabled KvSize:8192 KvCacheType: NumThreads:24 GPULayers:25[ID:GPU-7f2d3c91-8c50-7c91-e17f-ac3279c58604 Layers:25(0..24)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}" time=2026-01-25T14:40:26.935Z level=DEBUG source=ggml.go:298 msg="key with type not found" key=general.alignment default=32 time=2026-01-25T14:40:26.936Z level=DEBUG source=ggml.go:298 msg="key 
with type not found" key=general.name default="" time=2026-01-25T14:40:26.936Z level=DEBUG source=ggml.go:298 msg="key with type not found" key=general.description default="" time=2026-01-25T14:40:26.936Z level=INFO source=ggml.go:136 msg="" architecture=gptoss file_type=MXFP4 name="" description="" num_tensors=459 num_key_values=32 time=2026-01-25T14:40:26.936Z level=DEBUG source=ggml.go:94 msg="ggml backend load all from path" path=/usr/lib/ollama load_backend: loaded CPU backend from /usr/lib/ollama/libggml-cpu-icelake.so time=2026-01-25T14:40:26.944Z level=DEBUG source=ggml.go:94 msg="ggml backend load all from path" path=/usr/lib/ollama/cuda_v13 ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no ggml_cuda_init: found 1 CUDA devices: Device 0: NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition, compute capability 12.0, VMM: yes, ID: GPU-7f2d3c91-8c50-7c91-e17f-ac3279c58604 load_backend: loaded CUDA backend from /usr/lib/ollama/cuda_v13/libggml-cuda.so time=2026-01-25T14:40:27.033Z level=INFO source=ggml.go:104 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.BMI2=1 CPU.0.AVX512=1 CPU.0.AVX512_VBMI=1 CPU.0.AVX512_VNNI=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 CUDA.0.ARCHS=750,800,860,870,890,900,1000,1030,1100,1200,1210 CUDA.0.USE_GRAPHS= 1 CUDA.0.PEER_MAX_BATCH_SIZE=128 compiler=cgo(gcc) time=2026-01-25T14:40:27.035Z level=DEBUG source=ggml.go:298 msg="key with type not found" key=gptoss.pooling_type default=0 time=2026-01-25T14:40:27.588Z level=DEBUG source=ggml.go:852 msg="compute graph" nodes=1399 splits=2 time=2026-01-25T14:40:27.615Z level=DEBUG source=ggml.go:852 msg="compute graph" nodes=1399 splits=2 time=2026-01-25T14:40:27.616Z level=DEBUG source=device.go:240 msg="model weights" device=CUDA0 size="11.8 GiB" time=2026-01-25T14:40:27.616Z level=DEBUG source=device.go:245 msg="model weights" device=CPU size="1.1 GiB" time=2026-01-25T14:40:27.616Z level=DEBUG 
source=device.go:251 msg="kv cache" device=CUDA0 size="300.0 MiB" time=2026-01-25T14:40:27.616Z level=DEBUG source=device.go:262 msg="compute graph" device=CUDA0 size="160.6 MiB" time=2026-01-25T14:40:27.616Z level=DEBUG source=device.go:267 msg="compute graph" device=CPU size="5.6 MiB" time=2026-01-25T14:40:27.616Z level=DEBUG source=device.go:272 msg="total memory" size="13.3 GiB" time=2026-01-25T14:40:27.616Z level=DEBUG source=server.go:780 msg=memory success=true required.InputWeights=1158266880 required.CPU.Graph=5898240 required.CUDA0.ID=GPU-7f2d3c91-8c50-7c91-e17f-ac3279c58604 required.CUDA0.Weights="[477628928 477628928 477628928 477628928 477628928 477628928 477628928 477628928 477628928 477628928 4 77628928 477628928 477628928 477628928 477628928 477628928 477628928 477628928 477628928 477628928 477628928 477628928 477628928 477628928 1158278400]" required.CUDA0.Cache="[9437184 16777216 9437184 16777216 9437184 16777216 9437184 16777216 9437184 16777216 9437184 16777216 9437184 16777216 9437184 16777216 9437184 16777216 9437 184 16777216 9437184 16777216 9437184 16777216 0]" required.CUDA0.Graph=168433792 time=2026-01-25T14:40:27.616Z level=DEBUG source=server.go:974 msg="available gpu" id=GPU-7f2d3c91-8c50-7c91-e17f-ac3279c58604 library=CUDA "available layer vram"="93.3 GiB" backoff=0.00 minimum="457.0 MiB" overhead="0 B" graph="160.6 MiB" time=2026-01-25T14:40:27.617Z level=DEBUG source=server.go:791 msg="new layout created" layers="25[ID:GPU-7f2d3c91-8c50-7c91-e17f-ac3279c58604 Layers:25(0..24)]" time=2026-01-25T14:40:27.617Z level=INFO source=runner.go:1278 msg=load request="{Operation:alloc LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:Enabled KvSize:8192 KvCacheType: NumThreads:24 GPULayers:25[ID:GPU-7f2d3c91-8c50-7c91-e17f-ac3279c58604 Layers:25(0..24)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false} " time=2026-01-25T14:40:27.688Z level=DEBUG source=ggml.go:298 msg="key with type not found" key=general.alignment default=32 
time=2026-01-25T14:40:27.692Z level=DEBUG source=ggml.go:298 msg="key with type not found" key=gptoss.pooling_type default=0
time=2026-01-25T14:40:27.756Z level=DEBUG source=ggml.go:852 msg="compute graph" nodes=1399 splits=2
time=2026-01-25T14:40:27.760Z level=DEBUG source=ggml.go:852 msg="compute graph" nodes=1399 splits=2
time=2026-01-25T14:40:27.761Z level=DEBUG source=device.go:240 msg="model weights" device=CUDA0 size="11.8 GiB"
time=2026-01-25T14:40:27.761Z level=DEBUG source=device.go:245 msg="model weights" device=CPU size="1.1 GiB"
time=2026-01-25T14:40:27.761Z level=DEBUG source=device.go:251 msg="kv cache" device=CUDA0 size="300.0 MiB"
time=2026-01-25T14:40:27.761Z level=DEBUG source=device.go:262 msg="compute graph" device=CUDA0 size="160.6 MiB"
time=2026-01-25T14:40:27.761Z level=DEBUG source=device.go:267 msg="compute graph" device=CPU size="5.6 MiB"
time=2026-01-25T14:40:27.761Z level=DEBUG source=device.go:272 msg="total memory" size="13.3 GiB"
time=2026-01-25T14:40:27.761Z level=DEBUG source=server.go:780 msg=memory success=true required.InputWeights=1158266880 required.CPU.Graph=5898240 required.CUDA0.ID=GPU-7f2d3c91-8c50-7c91-e17f-ac3279c58604 required.CUDA0.Weights="[477628928 477628928 477628928 477628928 477628928 477628928 477628928 477628928 477628928 477628928 477628928 477628928 477628928 477628928 477628928 477628928 477628928 477628928 477628928 477628928 477628928 477628928 477628928 477628928 1158278400]" required.CUDA0.Cache="[9437184 16777216 9437184 16777216 9437184 16777216 9437184 16777216 9437184 16777216 9437184 16777216 9437184 16777216 9437184 16777216 9437184 16777216 9437184 16777216 9437184 16777216 9437184 16777216 0]" required.CUDA0.Graph=168433792
time=2026-01-25T14:40:27.761Z level=DEBUG source=server.go:974 msg="available gpu" id=GPU-7f2d3c91-8c50-7c91-e17f-ac3279c58604 library=CUDA "available layer vram"="93.3 GiB" backoff=0.00 minimum="457.0 MiB" overhead="0 B" graph="160.6 MiB"
time=2026-01-25T14:40:27.761Z
level=DEBUG source=server.go:791 msg="new layout created" layers="25[ID:GPU-7f2d3c91-8c50-7c91-e17f-ac3279c58604 Layers:25(0..24)]"
time=2026-01-25T14:40:27.762Z level=INFO source=runner.go:1278 msg=load request="{Operation:commit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:Enabled KvSize:8192 KvCacheType: NumThreads:24 GPULayers:25[ID:GPU-7f2d3c91-8c50-7c91-e17f-ac3279c58604 Layers:25(0..24)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2026-01-25T14:40:27.762Z level=INFO source=ggml.go:482 msg="offloading 24 repeating layers to GPU"
time=2026-01-25T14:40:27.762Z level=INFO source=ggml.go:489 msg="offloading output layer to GPU"
time=2026-01-25T14:40:27.762Z level=INFO source=ggml.go:494 msg="offloaded 25/25 layers to GPU"
time=2026-01-25T14:40:27.763Z level=INFO source=device.go:240 msg="model weights" device=CUDA0 size="11.8 GiB"
time=2026-01-25T14:40:27.763Z level=INFO source=device.go:245 msg="model weights" device=CPU size="1.1 GiB"
time=2026-01-25T14:40:27.763Z level=INFO source=device.go:251 msg="kv cache" device=CUDA0 size="300.0 MiB"
time=2026-01-25T14:40:27.763Z level=INFO source=device.go:262 msg="compute graph" device=CUDA0 size="160.6 MiB"
time=2026-01-25T14:40:27.763Z level=INFO source=device.go:267 msg="compute graph" device=CPU size="5.6 MiB"
time=2026-01-25T14:40:27.763Z level=INFO source=device.go:272 msg="total memory" size="13.3 GiB"
time=2026-01-25T14:40:27.763Z level=INFO source=sched.go:526 msg="loaded runners" count=1
time=2026-01-25T14:40:27.763Z level=INFO source=server.go:1347 msg="waiting for llama runner to start responding"
time=2026-01-25T14:40:27.763Z level=INFO source=server.go:1381 msg="waiting for server to become available" status="llm server loading model"
time=2026-01-25T14:40:27.763Z level=DEBUG source=server.go:1391 msg="model load progress 0.00"
time=2026-01-25T14:40:28.014Z level=DEBUG source=server.go:1391 msg="model load progress 0.12"
time=2026-01-25T14:40:28.265Z level=DEBUG
source=server.go:1391 msg="model load progress 0.27" time=2026-01-25T14:40:28.515Z level=DEBUG source=server.go:1391 msg="model load progress 0.41" time=2026-01-25T14:40:28.766Z level=DEBUG source=server.go:1391 msg="model load progress 0.55" time=2026-01-25T14:40:29.017Z level=DEBUG source=server.go:1391 msg="model load progress 0.72" time=2026-01-25T14:40:29.268Z level=DEBUG source=server.go:1391 msg="model load progress 0.90" time=2026-01-25T14:40:29.519Z level=DEBUG source=server.go:1391 msg="model load progress 0.98" time=2026-01-25T14:40:29.602Z level=DEBUG source=ggml.go:298 msg="key with type not found" key=gptoss.pooling_type default=0 **time=2026-01-25T14:40:29.771Z level=INFO source=server.go:1385 msg="llama runner started in 2.94 seconds"** time=2026-01-25T14:40:29.772Z level=DEBUG source=sched.go:538 msg="finished setting up" runner.name=registry.ollama.ai/library/gpt-oss:latest runner.inference="[{ID:GPU-7f2d3c91-8c50-7c91-e17f-ac3279c58604 Library:CUDA}]" runner.size="13.3 GiB" runner.vram="13.3 GiB" runner.parallel=1 runner.pid=217 runner.model=/root/.ollama/mode ls/blobs/sha256-e7b273f9636059a689e3ddcab3716e4f65abe0143ac978e46673ad0e52d09efb runner.num_ctx=8192 time=2026-01-25T14:40:29.772Z level=DEBUG source=server.go:1533 msg="completion request" images=0 prompt=306 format="" time=2026-01-25T14:40:29.858Z level=DEBUG source=cache.go:142 msg="loading cache slot" id=0 cache=0 prompt=68 used=0 remaining=68 ``` ### OS Linux ### GPU Nvidia ### CPU Intel ### Ollama version latest
GiteaMirror added the bug label 2026-04-12 21:56:54 -05:00
Reference: github-starred/ollama#9097