[GH-ISSUE #11986] Models are always split across multiple GPUs #70018

Closed
opened 2026-05-04 20:04:30 -05:00 by GiteaMirror · 13 comments

Originally created by @alromb01 on GitHub (Aug 20, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/11986

What is the issue?

Since version 0.11.5, models are always split across multiple GPUs, even when a single GPU has more than enough memory to serve the request. This happens with both `OLLAMA_NEW_ESTIMATES=0` and `OLLAMA_NEW_ESTIMATES=1`.

Further environment variables:
OLLAMA_FLASH_ATTENTION=1
OLLAMA_KV_CACHE_TYPE=q8_0

Setting `OLLAMA_SCHED_SPREAD=false` is not a viable solution, as it would prevent using multiple GPUs when they are actually required.
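
A minimal reproduction sketch, assuming a foreground `ollama serve` with the variables listed above (the note about the spreading default is an assumption based on current Ollama docs, not something stated in this report):

```shell
# Reproduction sketch: start the server in the foreground with the reporter's settings.
export OLLAMA_FLASH_ATTENTION=1
export OLLAMA_KV_CACHE_TYPE=q8_0
export OLLAMA_NEW_ESTIMATES=1      # the same placement was seen with 0 and 1
# OLLAMA_SCHED_SPREAD is left unset: spreading across GPUs should only happen
# when a single GPU cannot hold the model.
ollama serve
```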

Because of this, the generated tokens per second for gpt-oss:20b dropped from ~100 (when the model ran exclusively on the H200) to around 35.
Similarly, the tokens/s for Llama3.3-70b-q8 dropped from ~40 to around 27.

Is this intended behavior w.r.t. the "new memory management" introduced in this release?

Relevant log output

print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: layer   0 assigned to device CUDA0, is_swa = 0
load_tensors: layer   1 assigned to device CUDA0, is_swa = 0
load_tensors: layer   2 assigned to device CUDA0, is_swa = 0
load_tensors: layer   3 assigned to device CUDA0, is_swa = 0
load_tensors: layer   4 assigned to device CUDA0, is_swa = 0
load_tensors: layer   5 assigned to device CUDA0, is_swa = 0
load_tensors: layer   6 assigned to device CUDA0, is_swa = 0
load_tensors: layer   7 assigned to device CUDA0, is_swa = 0
load_tensors: layer   8 assigned to device CUDA0, is_swa = 0
load_tensors: layer   9 assigned to device CUDA0, is_swa = 0
load_tensors: layer  10 assigned to device CUDA0, is_swa = 0
load_tensors: layer  11 assigned to device CUDA0, is_swa = 0
load_tensors: layer  12 assigned to device CUDA0, is_swa = 0
load_tensors: layer  13 assigned to device CUDA0, is_swa = 0
load_tensors: layer  14 assigned to device CUDA0, is_swa = 0
load_tensors: layer  15 assigned to device CUDA0, is_swa = 0
load_tensors: layer  16 assigned to device CUDA0, is_swa = 0
load_tensors: layer  17 assigned to device CUDA0, is_swa = 0
load_tensors: layer  18 assigned to device CUDA0, is_swa = 0
load_tensors: layer  19 assigned to device CUDA0, is_swa = 0
load_tensors: layer  20 assigned to device CUDA0, is_swa = 0
load_tensors: layer  21 assigned to device CUDA0, is_swa = 0
load_tensors: layer  22 assigned to device CUDA0, is_swa = 0
load_tensors: layer  23 assigned to device CUDA0, is_swa = 0
load_tensors: layer  24 assigned to device CUDA0, is_swa = 0
load_tensors: layer  25 assigned to device CUDA0, is_swa = 0
load_tensors: layer  26 assigned to device CUDA0, is_swa = 0
load_tensors: layer  27 assigned to device CUDA0, is_swa = 0
load_tensors: layer  28 assigned to device CUDA2, is_swa = 0
load_tensors: layer  29 assigned to device CUDA2, is_swa = 0
load_tensors: layer  30 assigned to device CUDA2, is_swa = 0
load_tensors: layer  31 assigned to device CUDA2, is_swa = 0
load_tensors: layer  32 assigned to device CUDA2, is_swa = 0
load_tensors: layer  33 assigned to device CUDA2, is_swa = 0
load_tensors: layer  34 assigned to device CUDA2, is_swa = 0
load_tensors: layer  35 assigned to device CUDA2, is_swa = 0
load_tensors: layer  36 assigned to device CUDA2, is_swa = 0
load_tensors: layer  37 assigned to device CUDA2, is_swa = 0
load_tensors: layer  38 assigned to device CUDA2, is_swa = 0
load_tensors: layer  39 assigned to device CUDA2, is_swa = 0
load_tensors: layer  40 assigned to device CUDA2, is_swa = 0
load_tensors: layer  41 assigned to device CUDA2, is_swa = 0
load_tensors: layer  42 assigned to device CUDA2, is_swa = 0
load_tensors: layer  43 assigned to device CUDA2, is_swa = 0
load_tensors: layer  44 assigned to device CUDA2, is_swa = 0
load_tensors: layer  45 assigned to device CUDA2, is_swa = 0
load_tensors: layer  46 assigned to device CUDA2, is_swa = 0
load_tensors: layer  47 assigned to device CUDA2, is_swa = 0
load_tensors: layer  48 assigned to device CUDA2, is_swa = 0
load_tensors: layer  49 assigned to device CUDA2, is_swa = 0
load_tensors: layer  50 assigned to device CUDA2, is_swa = 0
load_tensors: layer  51 assigned to device CUDA2, is_swa = 0
load_tensors: layer  52 assigned to device CUDA2, is_swa = 0
load_tensors: layer  53 assigned to device CUDA2, is_swa = 0
load_tensors: layer  54 assigned to device CUDA2, is_swa = 0
load_tensors: layer  55 assigned to device CUDA2, is_swa = 0
load_tensors: layer  56 assigned to device CUDA2, is_swa = 0
load_tensors: layer  57 assigned to device CUDA2, is_swa = 0
load_tensors: layer  58 assigned to device CUDA2, is_swa = 0
load_tensors: layer  59 assigned to device CUDA2, is_swa = 0
load_tensors: layer  60 assigned to device CUDA2, is_swa = 0
load_tensors: layer  61 assigned to device CUDA2, is_swa = 0
load_tensors: layer  62 assigned to device CUDA2, is_swa = 0
load_tensors: layer  63 assigned to device CUDA2, is_swa = 0
load_tensors: layer  64 assigned to device CUDA2, is_swa = 0
load_tensors: layer  65 assigned to device CUDA2, is_swa = 0
load_tensors: layer  66 assigned to device CUDA2, is_swa = 0
load_tensors: layer  67 assigned to device CUDA2, is_swa = 0
load_tensors: layer  68 assigned to device CUDA2, is_swa = 0
load_tensors: layer  69 assigned to device CUDA2, is_swa = 0
load_tensors: layer  70 assigned to device CUDA2, is_swa = 0
load_tensors: layer  71 assigned to device CUDA2, is_swa = 0
load_tensors: layer  72 assigned to device CUDA2, is_swa = 0
load_tensors: layer  73 assigned to device CUDA2, is_swa = 0
load_tensors: layer  74 assigned to device CUDA2, is_swa = 0
load_tensors: layer  75 assigned to device CUDA2, is_swa = 0
load_tensors: layer  76 assigned to device CUDA2, is_swa = 0
load_tensors: layer  77 assigned to device CUDA2, is_swa = 0
load_tensors: layer  78 assigned to device CUDA2, is_swa = 0
load_tensors: layer  79 assigned to device CUDA2, is_swa = 0
load_tensors: layer  80 assigned to device CUDA2, is_swa = 0
load_tensors: tensor 'token_embd.weight' (q8_0) (and 0 others) cannot be used with preferred buffer type CUDA_Host, using CPU instead
time=2025-08-20T11:41:20.014Z level=INFO source=server.go:1268 msg="waiting for server to become available" status="llm server not responding"
load_tensors: offloading 80 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 81/81 layers to GPU
load_tensors:        CUDA0 model buffer size = 24277.75 MiB
load_tensors:        CUDA2 model buffer size = 46151.91 MiB
load_tensors:   CPU_Mapped model buffer size =  1064.62 MiB
time=2025-08-20T11:41:24.328Z level=INFO source=server.go:1268 msg="waiting for server to become available" status="llm server loading model"
time=2025-08-20T11:41:24.328Z level=DEBUG source=server.go:1278 msg="model load progress 0.00"
time=2025-08-20T11:41:24.579Z level=DEBUG source=server.go:1278 msg="model load progress 0.07"
time=2025-08-20T11:41:24.830Z level=DEBUG source=server.go:1278 msg="model load progress 0.14"
time=2025-08-20T11:41:25.082Z level=DEBUG source=server.go:1278 msg="model load progress 0.21"
time=2025-08-20T11:41:25.333Z level=DEBUG source=server.go:1278 msg="model load progress 0.28"
time=2025-08-20T11:41:25.584Z level=DEBUG source=server.go:1278 msg="model load progress 0.34"
time=2025-08-20T11:41:25.835Z level=DEBUG source=server.go:1278 msg="model load progress 0.44"
time=2025-08-20T11:41:26.087Z level=DEBUG source=server.go:1278 msg="model load progress 0.52"
time=2025-08-20T11:41:26.338Z level=DEBUG source=server.go:1278 msg="model load progress 0.61"
time=2025-08-20T11:41:26.589Z level=DEBUG source=server.go:1278 msg="model load progress 0.69"
time=2025-08-20T11:41:26.841Z level=DEBUG source=server.go:1278 msg="model load progress 0.78"
time=2025-08-20T11:41:27.092Z level=DEBUG source=server.go:1278 msg="model load progress 0.86"
time=2025-08-20T11:41:27.343Z level=DEBUG source=server.go:1278 msg="model load progress 0.95"
time=2025-08-20T11:41:27.594Z level=DEBUG source=server.go:1278 msg="model load progress 0.99"
llama_context: constructing llama_context
llama_context: n_seq_max     = 3
llama_context: n_ctx         = 98304
llama_context: n_ctx_per_seq = 32768
llama_context: n_batch       = 1536
llama_context: n_ubatch      = 512
llama_context: causal_attn   = 1
llama_context: flash_attn    = 1
llama_context: kv_unified    = false
llama_context: freq_base     = 500000.0
llama_context: freq_scale    = 1
llama_context: n_ctx_per_seq (32768) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
set_abort_callback: call
llama_context:  CUDA_Host  output buffer size =     1.56 MiB
create_memory: n_ctx = 98304 (padded)
llama_kv_cache_unified: layer   0: dev = CUDA0
llama_kv_cache_unified: layer   1: dev = CUDA0
llama_kv_cache_unified: layer   2: dev = CUDA0
llama_kv_cache_unified: layer   3: dev = CUDA0
llama_kv_cache_unified: layer   4: dev = CUDA0
llama_kv_cache_unified: layer   5: dev = CUDA0
llama_kv_cache_unified: layer   6: dev = CUDA0
llama_kv_cache_unified: layer   7: dev = CUDA0
llama_kv_cache_unified: layer   8: dev = CUDA0
llama_kv_cache_unified: layer   9: dev = CUDA0
llama_kv_cache_unified: layer  10: dev = CUDA0
llama_kv_cache_unified: layer  11: dev = CUDA0
llama_kv_cache_unified: layer  12: dev = CUDA0
llama_kv_cache_unified: layer  13: dev = CUDA0
llama_kv_cache_unified: layer  14: dev = CUDA0
llama_kv_cache_unified: layer  15: dev = CUDA0
llama_kv_cache_unified: layer  16: dev = CUDA0
llama_kv_cache_unified: layer  17: dev = CUDA0
llama_kv_cache_unified: layer  18: dev = CUDA0
llama_kv_cache_unified: layer  19: dev = CUDA0
llama_kv_cache_unified: layer  20: dev = CUDA0
llama_kv_cache_unified: layer  21: dev = CUDA0
llama_kv_cache_unified: layer  22: dev = CUDA0
llama_kv_cache_unified: layer  23: dev = CUDA0
llama_kv_cache_unified: layer  24: dev = CUDA0
llama_kv_cache_unified: layer  25: dev = CUDA0
llama_kv_cache_unified: layer  26: dev = CUDA0
llama_kv_cache_unified: layer  27: dev = CUDA0
llama_kv_cache_unified: layer  28: dev = CUDA2
llama_kv_cache_unified: layer  29: dev = CUDA2
llama_kv_cache_unified: layer  30: dev = CUDA2
llama_kv_cache_unified: layer  31: dev = CUDA2
llama_kv_cache_unified: layer  32: dev = CUDA2
llama_kv_cache_unified: layer  33: dev = CUDA2
llama_kv_cache_unified: layer  34: dev = CUDA2
llama_kv_cache_unified: layer  35: dev = CUDA2
llama_kv_cache_unified: layer  36: dev = CUDA2
llama_kv_cache_unified: layer  37: dev = CUDA2
llama_kv_cache_unified: layer  38: dev = CUDA2
llama_kv_cache_unified: layer  39: dev = CUDA2
llama_kv_cache_unified: layer  40: dev = CUDA2
llama_kv_cache_unified: layer  41: dev = CUDA2
llama_kv_cache_unified: layer  42: dev = CUDA2
llama_kv_cache_unified: layer  43: dev = CUDA2
llama_kv_cache_unified: layer  44: dev = CUDA2
llama_kv_cache_unified: layer  45: dev = CUDA2
llama_kv_cache_unified: layer  46: dev = CUDA2
llama_kv_cache_unified: layer  47: dev = CUDA2
llama_kv_cache_unified: layer  48: dev = CUDA2
llama_kv_cache_unified: layer  49: dev = CUDA2
llama_kv_cache_unified: layer  50: dev = CUDA2
llama_kv_cache_unified: layer  51: dev = CUDA2
llama_kv_cache_unified: layer  52: dev = CUDA2
llama_kv_cache_unified: layer  53: dev = CUDA2
llama_kv_cache_unified: layer  54: dev = CUDA2
llama_kv_cache_unified: layer  55: dev = CUDA2
llama_kv_cache_unified: layer  56: dev = CUDA2
llama_kv_cache_unified: layer  57: dev = CUDA2
llama_kv_cache_unified: layer  58: dev = CUDA2
llama_kv_cache_unified: layer  59: dev = CUDA2
llama_kv_cache_unified: layer  60: dev = CUDA2
llama_kv_cache_unified: layer  61: dev = CUDA2
llama_kv_cache_unified: layer  62: dev = CUDA2
llama_kv_cache_unified: layer  63: dev = CUDA2
llama_kv_cache_unified: layer  64: dev = CUDA2
llama_kv_cache_unified: layer  65: dev = CUDA2
llama_kv_cache_unified: layer  66: dev = CUDA2
llama_kv_cache_unified: layer  67: dev = CUDA2
llama_kv_cache_unified: layer  68: dev = CUDA2
llama_kv_cache_unified: layer  69: dev = CUDA2
llama_kv_cache_unified: layer  70: dev = CUDA2
llama_kv_cache_unified: layer  71: dev = CUDA2
llama_kv_cache_unified: layer  72: dev = CUDA2
llama_kv_cache_unified: layer  73: dev = CUDA2
llama_kv_cache_unified: layer  74: dev = CUDA2
llama_kv_cache_unified: layer  75: dev = CUDA2
llama_kv_cache_unified: layer  76: dev = CUDA2
llama_kv_cache_unified: layer  77: dev = CUDA2
llama_kv_cache_unified: layer  78: dev = CUDA2
llama_kv_cache_unified: layer  79: dev = CUDA2
llama_kv_cache_unified:      CUDA0 KV buffer size =  5712.00 MiB
llama_kv_cache_unified:      CUDA2 KV buffer size = 10608.00 MiB
llama_kv_cache_unified: size = 16320.00 MiB ( 32768 cells,  80 layers,  3/3 seqs), K (q8_0): 8160.00 MiB, V (q8_0): 8160.00 MiB
llama_context: enumerating backends
llama_context: backend_ptrs.size() = 4
llama_context: max_nodes = 5800
llama_context: pipeline parallelism enabled (n_copies=4)
llama_context: worst-case: n_tokens = 512, n_seqs = 3, n_outputs = 0
graph_reserve: reserving a graph for ubatch with n_tokens =  512, n_seqs =  3, n_outputs =  512
graph_reserve: making n_tokens a multiple of n_seqs - n_tokens = 513, n_seqs = 3, n_outputs = 512
time=2025-08-20T11:41:29.353Z level=DEBUG source=server.go:1278 msg="model load progress 1.00"
graph_reserve: reserving a graph for ubatch with n_tokens =    3, n_seqs =  3, n_outputs =    3
graph_reserve: reserving a graph for ubatch with n_tokens =  512, n_seqs =  3, n_outputs =  512
graph_reserve: making n_tokens a multiple of n_seqs - n_tokens = 513, n_seqs = 3, n_outputs = 512
llama_context:      CUDA0 compute buffer size =   604.59 MiB
llama_context:      CUDA2 compute buffer size =   474.67 MiB
llama_context:  CUDA_Host compute buffer size =   304.09 MiB
llama_context: graph nodes  = 2487
llama_context: graph splits = 3

OS

Linux

GPU

Nvidia (1x H200, 2x L40S)

CPU

No response

Ollama version

0.11.5

GiteaMirror added the bug label 2026-05-04 20:04:30 -05:00

@rick-github commented on GitHub (Aug 20, 2025):

Full log.


@alromb01 commented on GitHub (Aug 20, 2025):

> Full log.

Because logging is at TRACE level, attached as a file: [log.txt](https://github.com/user-attachments/files/21897677/log.txt) (with `OLLAMA_NEW_ESTIMATES=0`).
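
For reference, a log like this can typically be captured on a systemd install along these lines (a sketch; it assumes the default `ollama` service name, and `OLLAMA_DEBUG=1` enables debug-level output - the TRACE lines above suggest an even higher verbosity setting in this build):

```shell
# Capture a verbose server log on a systemd-based install (sketch).
sudo systemctl edit ollama       # add:  [Service]
                                 #       Environment="OLLAMA_DEBUG=1"
sudo systemctl restart ollama
journalctl -u ollama -f | tee log.txt
```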


@rick-github commented on GitHub (Aug 20, 2025):

This log shows gpt-oss:20b being loaded entirely onto a single NVIDIA H200 NVL.


@alromb01 commented on GitHub (Aug 20, 2025):

> This log shows gpt-oss:20b being loaded entirely onto a single NVIDIA H200 NVL.

[Screenshot: nvtop output]

This is the output of nvtop. The request spawns two small memory fragments on the other GPUs, and overall GPU utilization is quite low. Neither happened with version 0.11.4.

This behavior does not change between `OLLAMA_NEW_ESTIMATES=0` and `OLLAMA_NEW_ESTIMATES=1`.
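
One way to cross-check what nvtop shows, assuming a stock `nvidia-smi` is available:

```shell
# Per-GPU memory and utilization while the model is loaded (standard nvidia-smi query).
nvidia-smi --query-gpu=index,name,memory.used,memory.total,utilization.gpu --format=csv
# Small allocations on the other GPUs can come from CUDA context creation alone,
# not from offloaded model weights.
```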

Stats with OLLAMA_NEW_ESTIMATES=0:
Model load time: 4.992 seconds
Time to first token (TTFT) (model_load + prompt_eval): 5.301 seconds
Generation speed: 31.83 tokens/sec
Prompt tokens: 90 | Generation tokens: 1107
Overall duration: 40.17 seconds

Stats with OLLAMA_NEW_ESTIMATES=1:
Model load time: 4.421 seconds
Time to first token (TTFT) (model_load + prompt_eval): 4.749 seconds
Generation speed: 34.22 tokens/sec
Prompt tokens: 90 | Generation tokens: 1518
Overall duration: 49.19 seconds

So while `OLLAMA_NEW_ESTIMATES` seems to improve performance, it is still far off the ~100 tokens/s I achieved with version 0.11.4.
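
For context, numbers like these can be derived from the fields the Ollama generate API already returns (durations are reported in nanoseconds). The model name and prompt below are placeholders, and `jq` is assumed to be installed:

```shell
# Sketch: load time, TTFT (load + prompt eval) and tokens/sec from one non-streamed request.
curl -s http://localhost:11434/api/generate \
  -d '{"model":"gpt-oss:20b","prompt":"Hello","stream":false}' |
  jq '{load_s: (.load_duration/1e9),
       ttft_s: ((.load_duration + .prompt_eval_duration)/1e9),
       tok_per_s: (.eval_count / .eval_duration * 1e9)}'
```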


@rick-github commented on GitHub (Aug 20, 2025):

https://github.com/ollama/ollama/issues/11923#issuecomment-3192171487

Do you have logs for 0.11.4?


@dan-and commented on GitHub (Aug 20, 2025):

As @jessegross explained in #11923: previously the subprocess was only allowed to see the necessary GPUs via CUDA_VISIBLE_DEVICES, so the others were masked out at the CUDA level. Now it sees all of the GPUs but only schedules on the appropriate ones.

So yes, you are seeing that same effect and were thrown off by the three rows in nvtop, which is different behavior from before. But if you look closely at the memory allocation, only GPU0 is actually in use, which is the intended behavior with `OLLAMA_SCHED_SPREAD=0`.

Also, if you keep running, you will see no load on the other GPUs, so they will drop into power-saving states.
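
A rough sketch of the difference, assuming a manual `ollama serve` (the device index is an example, not taken from this setup):

```shell
# Pre-0.11.5 style masking, approximated by hand: the runner process can only
# ever see CUDA device 0, so nvtop shows a single row for it.
CUDA_VISIBLE_DEVICES=0 ollama serve
# Since 0.11.5 the subprocess sees all GPUs and the scheduler picks placement,
# so a row (and a small CUDA context) appears per GPU even when only one holds weights.
```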

So the main issue remaining: why is performance degraded between 0.11.4 and 0.11.5-rc2? Please provide log files for both versions with the same settings.


@jessegross commented on GitHub (Aug 20, 2025):

@alromb01 Note that in the first, partial log the model is getting split across GPUs, likely due to the old estimates over-estimating memory usage with the combination of a longer context and flash attention. OLLAMA_NEW_ESTIMATES=1 should help with this. However, the default behavior should not have changed between versions.

As others have said, the second, full log shows the model being fully loaded on the H200 alone. I wouldn't expect that OLLAMA_NEW_ESTIMATES will help in this case, so something else might be going on.


@alromb01 commented on GitHub (Aug 20, 2025):

Here is the full log, still on version 0.11.5 but now with `OLLAMA_NEW_ESTIMATES=1`: [log_newEstimates1_0.11.5.txt](https://github.com/user-attachments/files/21903883/log_newEstimates1_0.11.5.txt).

Again, with reduced performance of:
Model load time: 4.358 seconds
Time to first token (TTFT) (model_load + prompt_eval): 4.651 seconds
Generation speed: 35.85 tokens/sec
Prompt tokens: 90 | Generation tokens: 836
Overall duration: 28.07 seconds

Will update with a log for 0.11.4.


@alromb01 commented on GitHub (Aug 20, 2025):

And now the same setup/prompt with version 0.11.4:

Model load time: 4.936 seconds
Time to first token (TTFT) (model_load + prompt_eval): 5.354 seconds
Generation speed: 97.34 tokens/sec
Prompt tokens: 90 | Generation tokens: 2882
Overall duration: 35.07 seconds

See nvtop with much higher GPU usage:
[Screenshot: nvtop output]

Here is the full log:

[log_0.11.4.txt](https://github.com/user-attachments/files/21903972/log_0.11.4.txt)


@jessegross commented on GitHub (Aug 20, 2025):

The gpt-oss issue might be related to flash attention. In 0.11.4, flash attention didn't work for gpt-oss and was disabled; in 0.11.5 this was fixed and it was enabled. However, flash attention performance is much better in the upcoming 0.11.6 - you can try the rc0 now if you like. You can also try turning off OLLAMA_FLASH_ATTENTION to see whether that equalizes performance between versions.
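
A quick A/B sketch for that check (model name and prompt are placeholders; `--verbose` makes `ollama run` print eval rates when the response finishes):

```shell
# With flash attention disabled; repeat with OLLAMA_FLASH_ATTENTION=1 and compare the eval rate.
OLLAMA_FLASH_ATTENTION=0 ollama serve &
sleep 5
ollama run gpt-oss:20b --verbose "Write a short paragraph about GPUs."
```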

There is no equivalent change to llama3.3 in terms of flash attention being enabled/disabled between versions, so I'm not sure why that would change. However, that one does show splitting between GPUs so there might be something different that triggered that.
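
For trying the rc0 build mentioned above on Linux, one option is the version override of the official install script (the exact tag name here is an assumption):

```shell
curl -fsSL https://ollama.com/install.sh | OLLAMA_VERSION=0.11.6-rc0 sh
```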


@alromb01 commented on GitHub (Aug 21, 2025):

> However, the performance with flash attention is much better with the upcoming 0.11.6 - you can try the rc0 now if you like.

Thanks for the heads up, just tested with 0.11.6 and got the following results for gpt-oss:20b 🎉
(also with OLLAMA_FLASH_ATTENTION=1, OLLAMA_KV_CACHE_TYPE=q8_0)

Model load time: 4.337 seconds
Time to first token (TTFT) (model_load + prompt_eval): 4.629 seconds
Generation speed: 146.75 tokens/sec
Prompt tokens: 90 | Generation tokens: 650
Overall duration: 9.15 seconds

gpt-oss:120b is now also running with > 100 tokens/sec

And for llama3.3:70b-instruct-q4_0:
Model load time: 8.616 seconds
Time to first token (TTFT) (model_load + prompt_eval): 8.721 seconds
Generation speed: 41.22 tokens/sec
Prompt tokens: 30 | Generation tokens: 44
Overall duration: 9.79 seconds


@jessegross commented on GitHub (Aug 21, 2025):

@alromb01 That's great to hear. Sounds like everything is working at this point and we can close the issue?


@alromb01 commented on GitHub (Aug 21, 2025):

@jessegross yes, thanks
