[GH-ISSUE #11986] Models are always split across multiple GPUs #70018

Closed
opened 2026-05-04 20:04:30 -05:00 by GiteaMirror · 13 comments

Originally created by @alromb01 on GitHub (Aug 20, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/11986

What is the issue?

Since version 0.11.5, models are always split across multiple GPUs, even when a single GPU has more than enough memory to serve the request. This happens with both `OLLAMA_NEW_ESTIMATES=0` and `OLLAMA_NEW_ESTIMATES=1`.

Further environment variables:
OLLAMA_FLASH_ATTENTION=1
OLLAMA_KV_CACHE_TYPE=q8_0

Setting `OLLAMA_SCHED_SPREAD=false` is not a viable solution, as it would prevent using multiple GPUs when they are actually required.
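
A minimal reproduction sketch, assuming a foreground `ollama serve` with the variables listed above (the note about the spreading default is an assumption based on current Ollama docs, not something stated in this report):

```shell
# Reproduction sketch: start the server in the foreground with the reporter's settings.
export OLLAMA_FLASH_ATTENTION=1
export OLLAMA_KV_CACHE_TYPE=q8_0
export OLLAMA_NEW_ESTIMATES=1      # the same placement was seen with 0 and 1
# OLLAMA_SCHED_SPREAD is left unset: spreading across GPUs should only happen
# when a single GPU cannot hold the model.
ollama serve
```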

Because of this, the generated tokens per second for gpt-oss:20b dropped from ~100 (when the model ran exclusively on the H200) to around 35.
Similarly, the tokens/s for Llama3.3-70b-q8 dropped from ~40 to around 27.

Is this intended behavior w.r.t. the "new memory management" introduced in this release?

Relevant log output

print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: layer   0 assigned to device CUDA0, is_swa = 0
load_tensors: layer   1 assigned to device CUDA0, is_swa = 0
load_tensors: layer   2 assigned to device CUDA0, is_swa = 0
load_tensors: layer   3 assigned to device CUDA0, is_swa = 0
load_tensors: layer   4 assigned to device CUDA0, is_swa = 0
load_tensors: layer   5 assigned to device CUDA0, is_swa = 0
load_tensors: layer   6 assigned to device CUDA0, is_swa = 0
load_tensors: layer   7 assigned to device CUDA0, is_swa = 0
load_tensors: layer   8 assigned to device CUDA0, is_swa = 0
load_tensors: layer   9 assigned to device CUDA0, is_swa = 0
load_tensors: layer  10 assigned to device CUDA0, is_swa = 0
load_tensors: layer  11 assigned to device CUDA0, is_swa = 0
load_tensors: layer  12 assigned to device CUDA0, is_swa = 0
load_tensors: layer  13 assigned to device CUDA0, is_swa = 0
load_tensors: layer  14 assigned to device CUDA0, is_swa = 0
load_tensors: layer  15 assigned to device CUDA0, is_swa = 0
load_tensors: layer  16 assigned to device CUDA0, is_swa = 0
load_tensors: layer  17 assigned to device CUDA0, is_swa = 0
load_tensors: layer  18 assigned to device CUDA0, is_swa = 0
load_tensors: layer  19 assigned to device CUDA0, is_swa = 0
load_tensors: layer  20 assigned to device CUDA0, is_swa = 0
load_tensors: layer  21 assigned to device CUDA0, is_swa = 0
load_tensors: layer  22 assigned to device CUDA0, is_swa = 0
load_tensors: layer  23 assigned to device CUDA0, is_swa = 0
load_tensors: layer  24 assigned to device CUDA0, is_swa = 0
load_tensors: layer  25 assigned to device CUDA0, is_swa = 0
load_tensors: layer  26 assigned to device CUDA0, is_swa = 0
load_tensors: layer  27 assigned to device CUDA0, is_swa = 0
load_tensors: layer  28 assigned to device CUDA2, is_swa = 0
load_tensors: layer  29 assigned to device CUDA2, is_swa = 0
load_tensors: layer  30 assigned to device CUDA2, is_swa = 0
load_tensors: layer  31 assigned to device CUDA2, is_swa = 0
load_tensors: layer  32 assigned to device CUDA2, is_swa = 0
load_tensors: layer  33 assigned to device CUDA2, is_swa = 0
load_tensors: layer  34 assigned to device CUDA2, is_swa = 0
load_tensors: layer  35 assigned to device CUDA2, is_swa = 0
load_tensors: layer  36 assigned to device CUDA2, is_swa = 0
load_tensors: layer  37 assigned to device CUDA2, is_swa = 0
load_tensors: layer  38 assigned to device CUDA2, is_swa = 0
load_tensors: layer  39 assigned to device CUDA2, is_swa = 0
load_tensors: layer  40 assigned to device CUDA2, is_swa = 0
load_tensors: layer  41 assigned to device CUDA2, is_swa = 0
load_tensors: layer  42 assigned to device CUDA2, is_swa = 0
load_tensors: layer  43 assigned to device CUDA2, is_swa = 0
load_tensors: layer  44 assigned to device CUDA2, is_swa = 0
load_tensors: layer  45 assigned to device CUDA2, is_swa = 0
load_tensors: layer  46 assigned to device CUDA2, is_swa = 0
load_tensors: layer  47 assigned to device CUDA2, is_swa = 0
load_tensors: layer  48 assigned to device CUDA2, is_swa = 0
load_tensors: layer  49 assigned to device CUDA2, is_swa = 0
load_tensors: layer  50 assigned to device CUDA2, is_swa = 0
load_tensors: layer  51 assigned to device CUDA2, is_swa = 0
load_tensors: layer  52 assigned to device CUDA2, is_swa = 0
load_tensors: layer  53 assigned to device CUDA2, is_swa = 0
load_tensors: layer  54 assigned to device CUDA2, is_swa = 0
load_tensors: layer  55 assigned to device CUDA2, is_swa = 0
load_tensors: layer  56 assigned to device CUDA2, is_swa = 0
load_tensors: layer  57 assigned to device CUDA2, is_swa = 0
load_tensors: layer  58 assigned to device CUDA2, is_swa = 0
load_tensors: layer  59 assigned to device CUDA2, is_swa = 0
load_tensors: layer  60 assigned to device CUDA2, is_swa = 0
load_tensors: layer  61 assigned to device CUDA2, is_swa = 0
load_tensors: layer  62 assigned to device CUDA2, is_swa = 0
load_tensors: layer  63 assigned to device CUDA2, is_swa = 0
load_tensors: layer  64 assigned to device CUDA2, is_swa = 0
load_tensors: layer  65 assigned to device CUDA2, is_swa = 0
load_tensors: layer  66 assigned to device CUDA2, is_swa = 0
load_tensors: layer  67 assigned to device CUDA2, is_swa = 0
load_tensors: layer  68 assigned to device CUDA2, is_swa = 0
load_tensors: layer  69 assigned to device CUDA2, is_swa = 0
load_tensors: layer  70 assigned to device CUDA2, is_swa = 0
load_tensors: layer  71 assigned to device CUDA2, is_swa = 0
load_tensors: layer  72 assigned to device CUDA2, is_swa = 0
load_tensors: layer  73 assigned to device CUDA2, is_swa = 0
load_tensors: layer  74 assigned to device CUDA2, is_swa = 0
load_tensors: layer  75 assigned to device CUDA2, is_swa = 0
load_tensors: layer  76 assigned to device CUDA2, is_swa = 0
load_tensors: layer  77 assigned to device CUDA2, is_swa = 0
load_tensors: layer  78 assigned to device CUDA2, is_swa = 0
load_tensors: layer  79 assigned to device CUDA2, is_swa = 0
load_tensors: layer  80 assigned to device CUDA2, is_swa = 0
load_tensors: tensor 'token_embd.weight' (q8_0) (and 0 others) cannot be used with preferred buffer type CUDA_Host, using CPU instead
time=2025-08-20T11:41:20.014Z level=INFO source=server.go:1268 msg="waiting for server to become available" status="llm server not responding"
load_tensors: offloading 80 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 81/81 layers to GPU
load_tensors:        CUDA0 model buffer size = 24277.75 MiB
load_tensors:        CUDA2 model buffer size = 46151.91 MiB
load_tensors:   CPU_Mapped model buffer size =  1064.62 MiB
time=2025-08-20T11:41:24.328Z level=INFO source=server.go:1268 msg="waiting for server to become available" status="llm server loading model"
time=2025-08-20T11:41:24.328Z level=DEBUG source=server.go:1278 msg="model load progress 0.00"
time=2025-08-20T11:41:24.579Z level=DEBUG source=server.go:1278 msg="model load progress 0.07"
time=2025-08-20T11:41:24.830Z level=DEBUG source=server.go:1278 msg="model load progress 0.14"
time=2025-08-20T11:41:25.082Z level=DEBUG source=server.go:1278 msg="model load progress 0.21"
time=2025-08-20T11:41:25.333Z level=DEBUG source=server.go:1278 msg="model load progress 0.28"
time=2025-08-20T11:41:25.584Z level=DEBUG source=server.go:1278 msg="model load progress 0.34"
time=2025-08-20T11:41:25.835Z level=DEBUG source=server.go:1278 msg="model load progress 0.44"
time=2025-08-20T11:41:26.087Z level=DEBUG source=server.go:1278 msg="model load progress 0.52"
time=2025-08-20T11:41:26.338Z level=DEBUG source=server.go:1278 msg="model load progress 0.61"
time=2025-08-20T11:41:26.589Z level=DEBUG source=server.go:1278 msg="model load progress 0.69"
time=2025-08-20T11:41:26.841Z level=DEBUG source=server.go:1278 msg="model load progress 0.78"
time=2025-08-20T11:41:27.092Z level=DEBUG source=server.go:1278 msg="model load progress 0.86"
time=2025-08-20T11:41:27.343Z level=DEBUG source=server.go:1278 msg="model load progress 0.95"
time=2025-08-20T11:41:27.594Z level=DEBUG source=server.go:1278 msg="model load progress 0.99"
llama_context: constructing llama_context
llama_context: n_seq_max     = 3
llama_context: n_ctx         = 98304
llama_context: n_ctx_per_seq = 32768
llama_context: n_batch       = 1536
llama_context: n_ubatch      = 512
llama_context: causal_attn   = 1
llama_context: flash_attn    = 1
llama_context: kv_unified    = false
llama_context: freq_base     = 500000.0
llama_context: freq_scale    = 1
llama_context: n_ctx_per_seq (32768) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
set_abort_callback: call
llama_context:  CUDA_Host  output buffer size =     1.56 MiB
create_memory: n_ctx = 98304 (padded)
llama_kv_cache_unified: layer   0: dev = CUDA0
llama_kv_cache_unified: layer   1: dev = CUDA0
llama_kv_cache_unified: layer   2: dev = CUDA0
llama_kv_cache_unified: layer   3: dev = CUDA0
llama_kv_cache_unified: layer   4: dev = CUDA0
llama_kv_cache_unified: layer   5: dev = CUDA0
llama_kv_cache_unified: layer   6: dev = CUDA0
llama_kv_cache_unified: layer   7: dev = CUDA0
llama_kv_cache_unified: layer   8: dev = CUDA0
llama_kv_cache_unified: layer   9: dev = CUDA0
llama_kv_cache_unified: layer  10: dev = CUDA0
llama_kv_cache_unified: layer  11: dev = CUDA0
llama_kv_cache_unified: layer  12: dev = CUDA0
llama_kv_cache_unified: layer  13: dev = CUDA0
llama_kv_cache_unified: layer  14: dev = CUDA0
llama_kv_cache_unified: layer  15: dev = CUDA0
llama_kv_cache_unified: layer  16: dev = CUDA0
llama_kv_cache_unified: layer  17: dev = CUDA0
llama_kv_cache_unified: layer  18: dev = CUDA0
llama_kv_cache_unified: layer  19: dev = CUDA0
llama_kv_cache_unified: layer  20: dev = CUDA0
llama_kv_cache_unified: layer  21: dev = CUDA0
llama_kv_cache_unified: layer  22: dev = CUDA0
llama_kv_cache_unified: layer  23: dev = CUDA0
llama_kv_cache_unified: layer  24: dev = CUDA0
llama_kv_cache_unified: layer  25: dev = CUDA0
llama_kv_cache_unified: layer  26: dev = CUDA0
llama_kv_cache_unified: layer  27: dev = CUDA0
llama_kv_cache_unified: layer  28: dev = CUDA2
llama_kv_cache_unified: layer  29: dev = CUDA2
llama_kv_cache_unified: layer  30: dev = CUDA2
llama_kv_cache_unified: layer  31: dev = CUDA2
llama_kv_cache_unified: layer  32: dev = CUDA2
llama_kv_cache_unified: layer  33: dev = CUDA2
llama_kv_cache_unified: layer  34: dev = CUDA2
llama_kv_cache_unified: layer  35: dev = CUDA2
llama_kv_cache_unified: layer  36: dev = CUDA2
llama_kv_cache_unified: layer  37: dev = CUDA2
llama_kv_cache_unified: layer  38: dev = CUDA2
llama_kv_cache_unified: layer  39: dev = CUDA2
llama_kv_cache_unified: layer  40: dev = CUDA2
llama_kv_cache_unified: layer  41: dev = CUDA2
llama_kv_cache_unified: layer  42: dev = CUDA2
llama_kv_cache_unified: layer  43: dev = CUDA2
llama_kv_cache_unified: layer  44: dev = CUDA2
llama_kv_cache_unified: layer  45: dev = CUDA2
llama_kv_cache_unified: layer  46: dev = CUDA2
llama_kv_cache_unified: layer  47: dev = CUDA2
llama_kv_cache_unified: layer  48: dev = CUDA2
llama_kv_cache_unified: layer  49: dev = CUDA2
llama_kv_cache_unified: layer  50: dev = CUDA2
llama_kv_cache_unified: layer  51: dev = CUDA2
llama_kv_cache_unified: layer  52: dev = CUDA2
llama_kv_cache_unified: layer  53: dev = CUDA2
llama_kv_cache_unified: layer  54: dev = CUDA2
llama_kv_cache_unified: layer  55: dev = CUDA2
llama_kv_cache_unified: layer  56: dev = CUDA2
llama_kv_cache_unified: layer  57: dev = CUDA2
llama_kv_cache_unified: layer  58: dev = CUDA2
llama_kv_cache_unified: layer  59: dev = CUDA2
llama_kv_cache_unified: layer  60: dev = CUDA2
llama_kv_cache_unified: layer  61: dev = CUDA2
llama_kv_cache_unified: layer  62: dev = CUDA2
llama_kv_cache_unified: layer  63: dev = CUDA2
llama_kv_cache_unified: layer  64: dev = CUDA2
llama_kv_cache_unified: layer  65: dev = CUDA2
llama_kv_cache_unified: layer  66: dev = CUDA2
llama_kv_cache_unified: layer  67: dev = CUDA2
llama_kv_cache_unified: layer  68: dev = CUDA2
llama_kv_cache_unified: layer  69: dev = CUDA2
llama_kv_cache_unified: layer  70: dev = CUDA2
llama_kv_cache_unified: layer  71: dev = CUDA2
llama_kv_cache_unified: layer  72: dev = CUDA2
llama_kv_cache_unified: layer  73: dev = CUDA2
llama_kv_cache_unified: layer  74: dev = CUDA2
llama_kv_cache_unified: layer  75: dev = CUDA2
llama_kv_cache_unified: layer  76: dev = CUDA2
llama_kv_cache_unified: layer  77: dev = CUDA2
llama_kv_cache_unified: layer  78: dev = CUDA2
llama_kv_cache_unified: layer  79: dev = CUDA2
llama_kv_cache_unified:      CUDA0 KV buffer size =  5712.00 MiB
llama_kv_cache_unified:      CUDA2 KV buffer size = 10608.00 MiB
llama_kv_cache_unified: size = 16320.00 MiB ( 32768 cells,  80 layers,  3/3 seqs), K (q8_0): 8160.00 MiB, V (q8_0): 8160.00 MiB
llama_context: enumerating backends
llama_context: backend_ptrs.size() = 4
llama_context: max_nodes = 5800
llama_context: pipeline parallelism enabled (n_copies=4)
llama_context: worst-case: n_tokens = 512, n_seqs = 3, n_outputs = 0
graph_reserve: reserving a graph for ubatch with n_tokens =  512, n_seqs =  3, n_outputs =  512
graph_reserve: making n_tokens a multiple of n_seqs - n_tokens = 513, n_seqs = 3, n_outputs = 512
time=2025-08-20T11:41:29.353Z level=DEBUG source=server.go:1278 msg="model load progress 1.00"
graph_reserve: reserving a graph for ubatch with n_tokens =    3, n_seqs =  3, n_outputs =    3
graph_reserve: reserving a graph for ubatch with n_tokens =  512, n_seqs =  3, n_outputs =  512
graph_reserve: making n_tokens a multiple of n_seqs - n_tokens = 513, n_seqs = 3, n_outputs = 512
llama_context:      CUDA0 compute buffer size =   604.59 MiB
llama_context:      CUDA2 compute buffer size =   474.67 MiB
llama_context:  CUDA_Host compute buffer size =   304.09 MiB
llama_context: graph nodes  = 2487
llama_context: graph splits = 3

OS

Linux

GPU

Nvidia (1x H200, 2x L40S)

CPU

No response

Ollama version

0.11.5

GiteaMirror added the bug label 2026-05-04 20:04:30 -05:00

@rick-github commented on GitHub (Aug 20, 2025):

Full log.


@alromb01 commented on GitHub (Aug 20, 2025):

> Full log.

Because logging is at TRACE level, attached as a file: [log.txt](https://github.com/user-attachments/files/21897677/log.txt) (with `OLLAMA_NEW_ESTIMATES=0`).
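
For reference, a log like this can typically be captured on a systemd install along these lines (a sketch; it assumes the default `ollama` service name, and `OLLAMA_DEBUG=1` enables debug-level output - the TRACE lines above suggest an even higher verbosity setting in this build):

```shell
# Capture a verbose server log on a systemd-based install (sketch).
sudo systemctl edit ollama       # add:  [Service]
                                 #       Environment="OLLAMA_DEBUG=1"
sudo systemctl restart ollama
journalctl -u ollama -f | tee log.txt
```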


@rick-github commented on GitHub (Aug 20, 2025):

This log shows gpt-oss:20b being loaded entirely onto a single NVIDIA H200 NVL.


@alromb01 commented on GitHub (Aug 20, 2025):

> This log shows gpt-oss:20b being loaded entirely onto a single NVIDIA H200 NVL.

[Screenshot: nvtop output]

This is the output of nvtop. The request spawns two small memory fragments on the other GPUs, and overall GPU utilization is quite low. Neither happened with version 0.11.4.

This behavior does not change between `OLLAMA_NEW_ESTIMATES=0` and `OLLAMA_NEW_ESTIMATES=1`.
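
One way to cross-check what nvtop shows, assuming a stock `nvidia-smi` is available:

```shell
# Per-GPU memory and utilization while the model is loaded (standard nvidia-smi query).
nvidia-smi --query-gpu=index,name,memory.used,memory.total,utilization.gpu --format=csv
# Small allocations on the other GPUs can come from CUDA context creation alone,
# not from offloaded model weights.
```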

Stats with OLLAMA_NEW_ESTIMATES=0:
Model load time: 4.992 seconds
Time to first token (TTFT) (model_load + prompt_eval): 5.301 seconds
Generation speed: 31.83 tokens/sec
Prompt tokens: 90 | Generation tokens: 1107
Overall duration: 40.17 seconds

Stats with OLLAMA_NEW_ESTIMATES=1:
Model load time: 4.421 seconds
Time to first token (TTFT) (model_load + prompt_eval): 4.749 seconds
Generation speed: 34.22 tokens/sec
Prompt tokens: 90 | Generation tokens: 1518
Overall duration: 49.19 seconds

So while `OLLAMA_NEW_ESTIMATES` seems to improve performance, it is still far off the ~100 tokens/s I achieved with version 0.11.4.
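
For context, numbers like these can be derived from the fields the Ollama generate API already returns (durations are reported in nanoseconds). The model name and prompt below are placeholders, and `jq` is assumed to be installed:

```shell
# Sketch: load time, TTFT (load + prompt eval) and tokens/sec from one non-streamed request.
curl -s http://localhost:11434/api/generate \
  -d '{"model":"gpt-oss:20b","prompt":"Hello","stream":false}' |
  jq '{load_s: (.load_duration/1e9),
       ttft_s: ((.load_duration + .prompt_eval_duration)/1e9),
       tok_per_s: (.eval_count / .eval_duration * 1e9)}'
```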


@rick-github commented on GitHub (Aug 20, 2025):

https://github.com/ollama/ollama/issues/11923#issuecomment-3192171487

Do you have logs for 0.11.4?


@dan-and commented on GitHub (Aug 20, 2025):

As @jessegross explained in #11923: previously the subprocess was only allowed to see the necessary GPUs via CUDA_VISIBLE_DEVICES, so the others were masked out at the CUDA level. Now it sees all of the GPUs but only schedules on the appropriate ones.

So yes, you are seeing that same effect and were thrown off by the three rows in nvtop, which is different behavior from before. But if you look closely at the memory allocation, only GPU0 is actually in use, which is the intended behavior with `OLLAMA_SCHED_SPREAD=0`.

Also, if you keep running, you will see no load on the other GPUs, so they will drop into power-saving states.
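
A rough sketch of the difference, assuming a manual `ollama serve` (the device index is an example, not taken from this setup):

```shell
# Pre-0.11.5 style masking, approximated by hand: the runner process can only
# ever see CUDA device 0, so nvtop shows a single row for it.
CUDA_VISIBLE_DEVICES=0 ollama serve
# Since 0.11.5 the subprocess sees all GPUs and the scheduler picks placement,
# so a row (and a small CUDA context) appears per GPU even when only one holds weights.
```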

So the main issue remaining: why is performance degraded between 0.11.4 and 0.11.5-rc2? Please provide log files for both versions with the same settings.


@jessegross commented on GitHub (Aug 20, 2025):

@alromb01 Note that in the first, partial log the model is getting split across GPUs, likely due to the old estimates over-estimating memory usage with the combination of a longer context and flash attention. OLLAMA_NEW_ESTIMATES=1 should help with this. However, the default behavior should not have changed between versions.

As others have said, the second, full log shows the model being fully loaded on the H200 alone. I wouldn't expect that OLLAMA_NEW_ESTIMATES will help in this case, so something else might be going on.


@alromb01 commented on GitHub (Aug 20, 2025):

Here is the full log, still on version 0.11.5 but now with `OLLAMA_NEW_ESTIMATES=1`: [log_newEstimates1_0.11.5.txt](https://github.com/user-attachments/files/21903883/log_newEstimates1_0.11.5.txt).

Again, with reduced performance of:
Model load time: 4.358 seconds
Time to first token (TTFT) (model_load + prompt_eval): 4.651 seconds
Generation speed: 35.85 tokens/sec
Prompt tokens: 90 | Generation tokens: 836
Overall duration: 28.07 seconds

Will update with a log for 0.11.4.


@alromb01 commented on GitHub (Aug 20, 2025):

And now the same setup/prompt with version 0.11.4:

Model load time: 4.936 seconds
Time to first token (TTFT) (model_load + prompt_eval): 5.354 seconds
Generation speed: 97.34 tokens/sec
Prompt tokens: 90 | Generation tokens: 2882
Overall duration: 35.07 seconds

See nvtop with much higher GPU usage:
[Screenshot: nvtop output]

Here is the full log:

[log_0.11.4.txt](https://github.com/user-attachments/files/21903972/log_0.11.4.txt)


@jessegross commented on GitHub (Aug 20, 2025):

The gpt-oss issue might be related to flash attention. In 0.11.4, flash attention didn't work for gpt-oss and was disabled; in 0.11.5 this was fixed and it was enabled. However, flash attention performance is much better in the upcoming 0.11.6 - you can try the rc0 now if you like. You can also try turning off OLLAMA_FLASH_ATTENTION to see whether that equalizes performance between versions.
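
A quick A/B sketch for that check (model name and prompt are placeholders; `--verbose` makes `ollama run` print eval rates when the response finishes):

```shell
# With flash attention disabled; repeat with OLLAMA_FLASH_ATTENTION=1 and compare the eval rate.
OLLAMA_FLASH_ATTENTION=0 ollama serve &
sleep 5
ollama run gpt-oss:20b --verbose "Write a short paragraph about GPUs."
```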

There is no equivalent change to llama3.3 in terms of flash attention being enabled/disabled between versions, so I'm not sure why that would change. However, that one does show splitting between GPUs so there might be something different that triggered that.
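
For trying the rc0 build mentioned above on Linux, one option is the version override of the official install script (the exact tag name here is an assumption):

```shell
curl -fsSL https://ollama.com/install.sh | OLLAMA_VERSION=0.11.6-rc0 sh
```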


@alromb01 commented on GitHub (Aug 21, 2025):

> However, the performance with flash attention is much better with the upcoming 0.11.6 - you can try the rc0 now if you like.

Thanks for the heads up, just tested with 0.11.6 and got the following results for gpt-oss:20b 🎉
(also with OLLAMA_FLASH_ATTENTION=1, OLLAMA_KV_CACHE_TYPE=q8_0)

Model load time: 4.337 seconds
Time to first token (TTFT) (model_load + prompt_eval): 4.629 seconds
Generation speed: 146.75 tokens/sec
Prompt tokens: 90 | Generation tokens: 650
Overall duration: 9.15 seconds

gpt-oss:120b is now also running with > 100 tokens/sec

And for llama3.3:70b-instruct-q4_0:
Model load time: 8.616 seconds
Time to first token (TTFT) (model_load + prompt_eval): 8.721 seconds
Generation speed: 41.22 tokens/sec
Prompt tokens: 30 | Generation tokens: 44
Overall duration: 9.79 seconds


@jessegross commented on GitHub (Aug 21, 2025):

@alromb01 That's great to hear. Sounds like everything is working at this point and we can close the issue?


@alromb01 commented on GitHub (Aug 21, 2025):

@jessegross yes, thanks
