[GH-ISSUE #12048] High I/O contention during model initialization causes ~3 times slower model load time on HDD #33761

Open
opened 2026-04-22 16:45:13 -05:00 by GiteaMirror · 2 comments

Originally created by @Ghost99 on GitHub (Aug 23, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/12048

What is the issue?

When Ollama loads a large model file (gpt-oss:20b – ~13 GB) from an HDD-backed Docker volume, initializing the model takes roughly three times longer than a raw sequential read of the same file. The slowdown is due to Ollama spawning multiple concurrent read processes that saturate the HDD, causing I/O contention.


Environment

OS: Linux
Docker: 28.0.4
Container image: ollama/ollama:0.11.6
Ollama version: 0.11.6
Model: gpt-oss:20b
Storage: HDD (non‑SSD)


Steps to reproduce

  1. Start Ollama – docker run -d --name ollama -v /var/lib/ollama:/ollama ollama/ollama:0.11.6
  2. Prepare the model – Pull gpt-oss:20b into the container (this will download the blob into ollama/models/blobs/).
  3. Trigger load – Issue any request to the model (e.g. curl -X POST localhost:11434/api/chat -d '{"model":"gpt-oss:20b","messages":[{"role":"user","content":"hello"}]}').
  4. Record log – In the container logs you’ll see something like:
    source=server.go:1272 msg="llama runner started in 363.97 seconds"
    
  5. Stop Ollama & clear the OS read cache – docker stop ollama && sync; echo 3 > /proc/sys/vm/drop_caches
  6. Measure raw read time – Read the main gpt-oss:20b blob sha256-b112e727c6f18875636c56a779790a590d705aec9e1c0eb5a97d51fc2a778583 (~13 GB) directly:
    time cat /var/lib/ollama/models/blobs/sha256-b112e* > /dev/null
    
    In my case it's ~127 s.
  7. Compare – The init time (~364 s) is roughly 3× the raw read time (~127 s).
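
For convenience, steps 4–6 can be combined into a single benchmark script. This is only a sketch built from the commands above; the host path, blob prefix, and port are taken from this report and may differ on other setups, and dropping the page cache requires root:

```shell
#!/bin/bash
# Compare Ollama's cold-cache model load time against a single sequential
# read of the same blob. Path and blob prefix are the ones from this report.
BLOB=/var/lib/ollama/models/blobs/sha256-b112e*

# Measurement 1: cold-cache model load through Ollama.
docker stop ollama
sync; echo 3 > /proc/sys/vm/drop_caches        # drop the OS read cache (needs root)
docker start ollama
curl -s -X POST localhost:11434/api/chat \
  -d '{"model":"gpt-oss:20b","messages":[{"role":"user","content":"hello"}]}' > /dev/null
docker logs ollama 2>&1 | grep "llama runner started"

# Measurement 2: cold-cache raw sequential read of the same blob.
docker stop ollama
sync; echo 3 > /proc/sys/vm/drop_caches
time cat $BLOB > /dev/null
```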

Expected Behavior

Ollama should load the model with a single, sequential read that is only limited by the physical speed of the disk, i.e. init time ≈ raw read time (≈127 s for the 13 GB blob).


Actual Behavior

During initialization Ollama spawns multiple concurrent read workers (visible in iotop), causing significant I/O contention on the HDD. The combined throughput drops to ~1/3 of the single‑threaded read speed, resulting in a ≈364 s load time.
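
The concurrent readers are easy to see while the model is loading. iotop is what was used for this report; a per-thread view such as sysstat's pidstat (not part of the original report, shown here only for illustration) gives similar information:

```shell
# Show only threads that are actively doing I/O while the model loads.
sudo iotop -o -d 2

# Alternative: per-thread read throughput of the runner process,
# sampled every second (requires the sysstat package).
pidstat -d -t -p "$(pgrep -f 'ollama runner' | head -n1)" 1
```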


Additional Details

  • iotop shows >8 active read processes during initialization.
  • No errors or warnings appear in the logs; the behavior is purely performance‑related.

Relevant log output

time=2025-08-23T11:29:09.961Z level=INFO source=routes.go:1318 msg="server config" env="map[CUDA_VISIBLE_DEVICES: GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: HTTPS_PROXY: HTTP_PROXY: NO_PROXY: OLLAMA_CONTEXT_LENGTH:12288 OLLAMA_DEBUG:INFO OLLAMA_FLASH_ATTENTION:false OLLAMA_GPU_OVERHEAD:0 OLLAMA_HOST:http://0.0.0.0:11434 OLLAMA_INTEL_GPU:false OLLAMA_KEEP_ALIVE:5m0s OLLAMA_KV_CACHE_TYPE: OLLAMA_LLM_LIBRARY: OLLAMA_LOAD_TIMEOUT:5m0s OLLAMA_MAX_LOADED_MODELS:0 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:/root/.ollama/models OLLAMA_MULTIUSER_CACHE:false OLLAMA_NEW_ENGINE:false OLLAMA_NEW_ESTIMATES:false OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:1 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://* vscode-webview://* vscode-file://*] OLLAMA_SCHED_SPREAD:false ROCR_VISIBLE_DEVICES: http_proxy: https_proxy: no_proxy:]"
time=2025-08-23T11:29:09.964Z level=INFO source=images.go:477 msg="total blobs: 36"
time=2025-08-23T11:29:09.964Z level=INFO source=images.go:484 msg="total unused blobs removed: 0"
time=2025-08-23T11:29:09.964Z level=INFO source=routes.go:1371 msg="Listening on [::]:11434 (version 0.11.6)"
time=2025-08-23T11:29:09.965Z level=INFO source=gpu.go:217 msg="looking for compatible GPUs"
time=2025-08-23T11:29:10.243Z level=INFO source=types.go:130 msg="inference compute" id=GPU-xxxx library=cuda variant=v12 compute=8.9 driver=12.8 name="NVIDIA GeForce RTX 4060 Ti" total="15.6 GiB" available="14.9 GiB"
time=2025-08-23T11:29:10.243Z level=INFO source=routes.go:1412 msg="entering low vram mode" "total vram"="15.6 GiB" threshold="20.0 GiB"
time=2025-08-23T11:29:55.227Z level=INFO source=server.go:383 msg="starting runner" cmd="/usr/bin/ollama runner --ollama-engine --model /root/.ollama/models/blobs/sha256-b112e727c6f18875636c56a779790a590d705aec9e1c0eb5a97d51fc2a778583 --port 44423"
time=2025-08-23T11:29:55.242Z level=INFO source=runner.go:1006 msg="starting ollama engine"
time=2025-08-23T11:29:55.245Z level=INFO source=runner.go:1043 msg="Server listening on 127.0.0.1:44423"
time=2025-08-23T11:29:55.370Z level=INFO source=server.go:488 msg="system memory" total="62.7 GiB" free="56.3 GiB" free_swap="14.6 GiB"
time=2025-08-23T11:29:55.370Z level=INFO source=memory.go:36 msg="new model will fit in available VRAM across minimum required GPUs, loading" model=/root/.ollama/models/blobs/sha256-b112e727c6f18875636c56a779790a590d705aec9e1c0eb5a97d51fc2a778583 library=cuda parallel=1 required="14.5 GiB" gpus=1
time=2025-08-23T11:29:55.370Z level=INFO source=server.go:531 msg=offload library=cuda layers.requested=-1 layers.model=25 layers.offload=25 layers.split=[25] memory.available="[14.9 GiB]" memory.gpu_overhead="0 B" memory.required.full="14.5 GiB" memory.required.partial="14.5 GiB" memory.required.kv="396.0 MiB" memory.required.allocations="[14.5 GiB]" memory.weights.total="11.7 GiB" memory.weights.repeating="10.7 GiB" memory.weights.nonrepeating="1.1 GiB" memory.graph.full="1.5 GiB" memory.graph.partial="1.5 GiB"
time=2025-08-23T11:29:55.371Z level=INFO source=runner.go:925 msg=load request="{Operation:commit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:false KvSize:12288 KvCacheType: NumThreads:8 GPULayers:25[ID:GPU-xxxx Layers:25(0..24)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2025-08-23T11:29:55.492Z level=INFO source=ggml.go:130 msg="" architecture=gptoss file_type=MXFP4 name="" description="" num_tensors=315 num_key_values=30
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 4060 Ti, compute capability 8.9, VMM: yes, ID: GPU-xxxx
load_backend: loaded CUDA backend from /usr/lib/ollama/libggml-cuda.so
load_backend: loaded CPU backend from /usr/lib/ollama/libggml-cpu-haswell.so
time=2025-08-23T11:29:55.590Z level=INFO source=ggml.go:104 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.BMI2=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 CUDA.0.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.0.USE_GRAPHS=1 CUDA.0.PEER_MAX_BATCH_SIZE=128 compiler=cgo(gcc)
time=2025-08-23T11:29:55.687Z level=INFO source=ggml.go:486 msg="offloading 24 repeating layers to GPU"
time=2025-08-23T11:29:55.687Z level=INFO source=ggml.go:492 msg="offloading output layer to GPU"
time=2025-08-23T11:29:55.687Z level=INFO source=ggml.go:497 msg="offloaded 25/25 layers to GPU"
time=2025-08-23T11:29:55.688Z level=INFO source=backend.go:310 msg="model weights" device=CUDA0 size="11.8 GiB"
time=2025-08-23T11:29:55.688Z level=INFO source=backend.go:315 msg="model weights" device=CPU size="1.1 GiB"
time=2025-08-23T11:29:55.688Z level=INFO source=backend.go:321 msg="kv cache" device=CUDA0 size="396.0 MiB"
time=2025-08-23T11:29:55.688Z level=INFO source=backend.go:332 msg="compute graph" device=CUDA0 size="1.6 GiB"
time=2025-08-23T11:29:55.688Z level=INFO source=backend.go:337 msg="compute graph" device=CPU size="5.6 MiB"
time=2025-08-23T11:29:55.688Z level=INFO source=backend.go:342 msg="total memory" size="14.8 GiB"
time=2025-08-23T11:29:55.688Z level=INFO source=sched.go:473 msg="loaded runners" count=1
time=2025-08-23T11:29:55.688Z level=INFO source=server.go:1234 msg="waiting for llama runner to start responding"
time=2025-08-23T11:29:55.688Z level=INFO source=server.go:1268 msg="waiting for server to become available" status="llm server loading model"
time=2025-08-23T11:35:59.194Z level=INFO source=server.go:1272 msg="llama runner started in 363.97 seconds"
[GIN] 2025/08/23 - 11:36:42 | 200 |         6m47s |      172.25.0.2 | POST     "/api/chat"
[GIN] 2025/08/23 - 11:36:53 | 200 | 10.905647257s |      172.25.0.2 | POST     "/api/chat"
[GIN] 2025/08/23 - 11:37:02 | 200 |  9.150921422s |      172.25.0.2 | POST     "/api/chat"

OS

Docker

GPU

Nvidia

CPU

AMD

Ollama version

0.11.6

GiteaMirror added the bug label 2026-04-22 16:45:13 -05:00

@rick-github commented on GitHub (Sep 28, 2025):

Your problem seems to be multiple co-routines doing reads on a slow device.

You can work around this by reducing the number of co-routines that ollama launches by setting GOMAXPROCS in the environment of the server.
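
For the Docker setup described above, that would look roughly like the following (a sketch; the volume mount and image tag are copied from the reproduction steps, and as the next comment notes this reduces rather than eliminates the concurrent reads):

```shell
# Recreate the container with GOMAXPROCS=1 so the Go runtime schedules
# fewer concurrent readers during model load. Note that GOMAXPROCS=1 also
# limits how much Go code in the server can run in parallel.
docker rm -f ollama
docker run -d --name ollama \
  -e GOMAXPROCS=1 \
  -v /var/lib/ollama:/ollama \
  ollama/ollama:0.11.6
```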


@Ghost99 commented on GitHub (Sep 29, 2025):

I upgraded Ollama to the latest 0.12.3 release, flushed the filesystem cache, started the server, and issued a text request to the LLM. The results were:

  1. With GOMAXPROCS=1 set in the environment: using iotop I observed that Ollama used only 1–3 processes for reading the model file.
  2. Without GOMAXPROCS set: Ollama spawned more than 8 processes to read the file.

So the suggested workaround does reduce the contention, though it doesn’t eliminate the slowdown entirely. Thanks for the suggestion, but it’s probably not a complete solution.

For my own use case I’ve adopted a different workaround: I preload all the main model files into the OS read cache at system startup, so that subsequent reads hit the cache rather than the HDD. Not a perfect solution, but it helps.
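
A sketch of that idea (the blob path is the one from the reproduction steps; it assumes the machine has enough free RAM to hold the blobs):

```shell
#!/bin/bash
# Warm the OS read cache with the model blobs at system startup
# (e.g. from a cron @reboot entry or a systemd oneshot unit) so the
# first model load reads from RAM instead of the HDD.
for blob in /var/lib/ollama/models/blobs/sha256-*; do
    cat "$blob" > /dev/null
done
```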

