[GH-ISSUE #14351] Misallocation of MoE layers on multi-GPU partial offload #35088

Closed
opened 2026-04-22 19:17:33 -05:00 by GiteaMirror · 5 comments

Originally created by @ka-admin on GitHub (Feb 21, 2026).
Original GitHub issue: https://github.com/ollama/ollama/issues/14351

What is the issue?

When running a large MoE model with the new engine on a mixed multi-GPU system under partial offload, the layer scheduler assigns a GPU far more layer weight than its VRAM can physically hold, silently overflowing those tensors into pinned host RAM (CUDA_Host). The result is that layers reported as "GPU offloaded" are in fact served from system RAM, with added PCIe overhead.

Additionally, OLLAMA_SCHED_SPREAD=true has zero effect — the layer assignment and buffer allocation are byte-for-byte identical with the flag on or off, suggesting the bug lies in per-layer VRAM cost estimation before any placement policy is applied.

Environment

  • Ollama version: 0.16.3
  • OS: Linux (Ubuntu) with systemd
  • Model: MiniMax M2.5 Q8_0 GGUF (minimax-m2 architecture, 228.69B params, 226.43 GiB)
  • GPUs:
Device | Name                    | VRAM   | Compute
------ | ----------------------- | ------ | -------
CUDA0  | NVIDIA GeForce RTX 4090 | 24 GiB | 8.9
CUDA1  | NVIDIA GeForce RTX 4090 | 24 GiB | 8.9
CUDA2  | Tesla V100-SXM2-32GB    | 32 GiB | 7.0
  • System RAM: 256 GiB DDR5
  • Driver / CUDA: 580.126.18 / 13.0
  • Relevant env config:
CUDA_VISIBLE_DEVICES=0,1,2
OLLAMA_CONTEXT_LENGTH=131072
OLLAMA_FLASH_ATTENTION=true
OLLAMA_NUM_PARALLEL=1

Steps to reproduce

  1. Run Ollama 0.16.3 on a mixed multi-GPU system where the GPU with the largest total VRAM (V100, 32 GiB) cannot physically hold the layers the scheduler intends to assign to it
  2. Load MiniMax M2.5 Q8_0 with partial GPU offload (19 of 63 layers)
  3. Observe layer assignment and actual buffer allocation in logs

Expected behavior

Layers assigned to each GPU should not exceed that GPU's physical VRAM. If a GPU cannot fit its assigned layers, fewer layers should be assigned or they should be redistributed. OLLAMA_SCHED_SPREAD=true should produce a meaningfully different (more even) distribution than the default greedy policy.
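
As a minimal sketch of this invariant in Go (hypothetical checkPlacement helper, not current Ollama code), a post-placement guard would compare the weight bytes assigned to each device against its reported free VRAM before loading begins:

package sched

import "fmt"

// checkPlacement is a hypothetical post-placement guard: the weight bytes
// assigned to each device must fit within that device's free VRAM.
func checkPlacement(assignedBytes, freeVRAM map[string]uint64) error {
	for dev, b := range assignedBytes {
		if b > freeVRAM[dev] {
			return fmt.Errorf("%s: %d MiB assigned but only %d MiB free",
				dev, b>>20, freeVRAM[dev]>>20)
		}
	}
	return nil
}

Had such a check run here, CUDA2 (≈59,514 MiB assigned vs ≈32,183 MiB free per the logs) would have failed loudly instead of spilling to CUDA_Host.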

Actual behavior

The scheduler assigns 16 layers to the V100 (32 GiB), but the actual model buffer for those layers totals 59,514 MiB (~58 GiB) — nearly 2× its physical capacity. This overflows silently into CUDA_Host (pinned system RAM) with no warning or error. The 16 "GPU layers" on the V100 are functionally CPU layers.

Furthermore, enabling OLLAMA_SCHED_SPREAD=true produces identical output — same layer assignment, same buffer sizes, same graph splits — indicating the spread/pack policy is not the issue and that incorrect per-layer cost estimation is happening upstream of any placement decision.

Experiment 1 — OLLAMA_SCHED_SPREAD=false (default)

GPULayers:19[
  GPU-dfc3d6a8 (V100-32GB):  Layers:16 (43..58)
  GPU-7a420261 (RTX4090 #1): Layers:2  (59..60)
  GPU-0dcf0ac3 (RTX4090 #2): Layers:1  (61..61)
]

load_tensors: CUDA0 model buffer size =  7,439.35 MiB
load_tensors: CUDA1 model buffer size =  3,719.68 MiB
load_tensors: CUDA2 model buffer size = 59,514.83 MiB  ← exceeds 32 GiB physical VRAM
load_tensors: CUDA_Host model buffer  = 161,191.63 MiB

graph splits = 651 (with bs=512), 5 (with bs=1)

Experiment 2 — OLLAMA_SCHED_SPREAD=true

GPULayers:19[
  GPU-dfc3d6a8 (V100-32GB):  Layers:16 (43..58)   ← IDENTICAL
  GPU-7a420261 (RTX4090 #1): Layers:2  (59..60)
  GPU-0dcf0ac3 (RTX4090 #2): Layers:1  (61..61)
]

load_tensors: CUDA0 model buffer size =  7,439.35 MiB
load_tensors: CUDA1 model buffer size =  3,719.68 MiB
load_tensors: CUDA2 model buffer size = 59,514.83 MiB  ← IDENTICAL
load_tensors: CUDA_Host model buffer  = 161,191.63 MiB

graph splits = 651 (with bs=512), 5 (with bs=1)

Root cause hypothesis

minimax-m2 uses 256 experts per MoE layer with expert_feed_forward_length=1536 and embedding_length=3072. At Q8_0, each MoE layer weighs approximately 3.6 GiB in actual allocated buffers. The scheduler's per-layer VRAM cost estimator is clearly computing a much smaller value — otherwise it would never plan to fit 16 such layers (~58 GiB) onto a 32 GiB GPU.
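
That ~3.6 GiB figure can be checked directly from the metadata above. A back-of-the-envelope sketch in Go, assuming the usual three projections (gate/up/down) per expert and ignoring attention and norm tensors:

package main

import "fmt"

func main() {
	const (
		nEmbd   = 3072 // minimax-m2.embedding_length
		nFF     = 1536 // minimax-m2.expert_feed_forward_length
		experts = 256  // minimax-m2.expert_count
		// Q8_0 packs 32 weights as 32 int8 values plus one fp16 scale:
		// 34 bytes per 32 weights = 1.0625 B/weight (the 8.51 BPW in the log).
		bytesPerWeight = 34.0 / 32.0
	)
	// gate, up, and down projections for every expert in one layer
	paramsPerLayer := int64(experts) * 3 * nEmbd * nFF
	bytesPerLayer := float64(paramsPerLayer) * bytesPerWeight

	fmt.Printf("per MoE layer: %.2f GiB\n", bytesPerLayer/(1<<30))    // ≈ 3.59 GiB
	fmt.Printf("16 layers:     %.0f MiB\n", 16*bytesPerLayer/(1<<20)) // ≈ 58752 MiB
}

The 16-layer total (≈58,752 MiB of expert weights alone) is within ~1.3% of the observed 59,514.83 MiB CUDA2 buffer; the remainder is attention and norm tensors.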

The GPUs are sorted largest-VRAM-first (V100 at 32 GiB before both 4090s at 24 GiB), so the greedy packer fills the V100 until its estimated budget is exhausted. Because the estimate is wrong, it over-assigns by ~2×, and llama.cpp silently backs the overflow with pinned host RAM.
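
For illustration, a greedy packer that respected the real per-layer cost would look something like this (a hypothetical Go sketch using measured layer sizes, not Ollama's actual device.go logic):

// packLayers assigns layers to devices highest-index-first (matching the
// logs), never exceeding a device's free VRAM. layerBytes holds measured
// per-layer weight sizes (~3.6 GiB here). Hypothetical sketch only.
func packLayers(layerBytes, freeVRAM []uint64) map[int][]int {
	assigned := make(map[int][]int)
	dev := 0
	for layer := len(layerBytes) - 1; layer >= 0; layer-- {
		for dev < len(freeVRAM) && layerBytes[layer] > freeVRAM[dev] {
			dev++ // this device is full; try the next one
		}
		if dev == len(freeVRAM) {
			break // no VRAM left anywhere; remaining layers stay on the CPU
		}
		freeVRAM[dev] -= layerBytes[layer]
		assigned[dev] = append(assigned[dev], layer)
	}
	return assigned
}

With the true ~3.6 GiB per-layer cost, a 31 GiB budget admits at most 8 such layers; assigning 16 implies the packer priced them at under ~2 GiB each.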

The extremely high graph split count (651 at bs=512 vs 5 at bs=1) is a secondary symptom of the resulting cross-device tensor fragmentation.

The bug likely lives in the per-layer cost estimation for minimax-m2 in device.go (or equivalent scheduler weight logic), not in the spread/pack placement policy.

Relevant log output

ollama.service: Consumed 33min 52.940s CPU time, 218.4G memory peak, 5.6G memory swap peak.
Feb 21 20:36:21 systemd[1]: Started ollama.service - Ollama Service.
Feb 21 20:36:21 ollama[539850]: time=2026-02-21T20:36:21.938+03:00 level=INFO source=routes.go:1663 msg="server config" env="map[CUDA_VISIBLE_DEVICES:0,1,2 GGML_VK_VISIBLE_DEVICES: GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: HTTPS_PROXY: HTTP_PROXY: NO_PROXY: OLLAMA_CONTEXT_LENGTH:131072 OLLAMA_DEBUG:INFO OLLAMA_EDITOR: OLLAMA_FLASH_ATTENTION:true OLLAMA_GPU_OVERHEAD:0 OLLAMA_HOST:http://0.0.0.0:11434 OLLAMA_KEEP_ALIVE:1h0m0s OLLAMA_KV_CACHE_TYPE: OLLAMA_LLM_LIBRARY: OLLAMA_LOAD_TIMEOUT:30m0s OLLAMA_MAX_LOADED_MODELS:0 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:/ai/llm/models OLLAMA_MULTIUSER_CACHE:false OLLAMA_NEW_ENGINE:true OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NO_CLOUD:false OLLAMA_NUM_PARALLEL:1 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://* vscode-webview://* vscode-file://*] OLLAMA_REMOTES:[ollama.com] OLLAMA_SCHED_SPREAD:true OLLAMA_VULKAN:false ROCR_VISIBLE_DEVICES: http_proxy: https_proxy: no_proxy:]"
Feb 21 20:36:21 ollama[539850]: time=2026-02-21T20:36:21.938+03:00 level=INFO source=routes.go:1665 msg="Ollama cloud disabled: false"
Feb 21 20:36:21 ollama[539850]: time=2026-02-21T20:36:21.949+03:00 level=INFO source=images.go:473 msg="total blobs: 22"
Feb 21 20:36:21 ollama[539850]: time=2026-02-21T20:36:21.949+03:00 level=INFO source=images.go:480 msg="total unused blobs removed: 0"
Feb 21 20:36:21 ollama[539850]: time=2026-02-21T20:36:21.950+03:00 level=INFO source=routes.go:1718 msg="Listening on [::]:11434 (version 0.16.3)"
Feb 21 20:36:21 ollama[539850]: time=2026-02-21T20:36:21.951+03:00 level=INFO source=runner.go:67 msg="discovering available GPUs..."
Feb 21 20:36:21 ollama[539850]: time=2026-02-21T20:36:21.951+03:00 level=WARN source=runner.go:485 msg="user overrode visible devices" CUDA_VISIBLE_DEVICES=0,1,2
Feb 21 20:36:21 ollama[539850]: time=2026-02-21T20:36:21.951+03:00 level=WARN source=runner.go:489 msg="if GPUs are not correctly discovered, unset and try again"
Feb 21 20:36:21 ollama[539850]: time=2026-02-21T20:36:21.951+03:00 level=INFO source=runner.go:106 msg="experimental Vulkan support disabled.  To enable, set OLLAMA_VULKAN=1"
Feb 21 20:36:21 ollama[539850]: time=2026-02-21T20:36:21.951+03:00 level=INFO source=server.go:431 msg="starting runner" cmd="/usr/local/bin/ollama runner --ollama-engine --port 37329"
Feb 21 20:36:22 ollama[539850]: time=2026-02-21T20:36:22.657+03:00 level=INFO source=server.go:431 msg="starting runner" cmd="/usr/local/bin/ollama runner --ollama-engine --port 42953"
Feb 21 20:36:23 ollama[539850]: time=2026-02-21T20:36:23.391+03:00 level=INFO source=server.go:431 msg="starting runner" cmd="/usr/local/bin/ollama runner --ollama-engine --port 36173"
Feb 21 20:36:23 ollama[539850]: time=2026-02-21T20:36:23.391+03:00 level=INFO source=server.go:431 msg="starting runner" cmd="/usr/local/bin/ollama runner --ollama-engine --port 44529"
Feb 21 20:36:23 ollama[539850]: time=2026-02-21T20:36:23.391+03:00 level=INFO source=server.go:431 msg="starting runner" cmd="/usr/local/bin/ollama runner --ollama-engine --port 43121"
Feb 21 20:36:23 ollama[539850]: time=2026-02-21T20:36:23.391+03:00 level=INFO source=server.go:431 msg="starting runner" cmd="/usr/local/bin/ollama runner --ollama-engine --port 46013"
Feb 21 20:36:23 ollama[539850]: time=2026-02-21T20:36:23.391+03:00 level=INFO source=server.go:431 msg="starting runner" cmd="/usr/local/bin/ollama runner --ollama-engine --port 46881"
Feb 21 20:36:23 ollama[539850]: time=2026-02-21T20:36:23.391+03:00 level=INFO source=server.go:431 msg="starting runner" cmd="/usr/local/bin/ollama runner --ollama-engine --port 39921"
Feb 21 20:36:23 ollama[539850]: time=2026-02-21T20:36:23.726+03:00 level=INFO source=types.go:42 msg="inference compute" id=GPU-dfc3d6a8-e942-e903-be73-9b14ce01db39 filter_id="" library=CUDA compute=7.0 name=CUDA2 description="Tesla V100-SXM2-32GB" libdirs=ollama,cuda_v12 driver=13.0 pci_id=0000:09:00.0 type=discrete total="32.0 GiB" available="31.4 GiB"
Feb 21 20:36:23 ollama[539850]: time=2026-02-21T20:36:23.726+03:00 level=INFO source=types.go:42 msg="inference compute" id=GPU-7a420261-3e37-b15a-2f9b-2fd7da322dbe filter_id="" library=CUDA compute=8.9 name=CUDA0 description="NVIDIA GeForce RTX 4090" libdirs=ollama,cuda_v12 driver=13.0 pci_id=0000:01:00.0 type=discrete total="24.0 GiB" available="23.5 GiB"
Feb 21 20:36:23 ollama[539850]: time=2026-02-21T20:36:23.726+03:00 level=INFO source=types.go:42 msg="inference compute" id=GPU-0dcf0ac3-bef6-f019-91c8-276cd8e2c0c0 filter_id="" library=CUDA compute=8.9 name=CUDA1 description="NVIDIA GeForce RTX 4090" libdirs=ollama,cuda_v12 driver=13.0 pci_id=0000:03:00.0 type=discrete total="24.0 GiB" available="23.1 GiB"
Feb 21 20:36:23 ollama[539850]: time=2026-02-21T20:36:23.726+03:00 level=INFO source=routes.go:1768 msg="vram-based default context" total_vram="80.0 GiB" default_num_ctx=262144
Feb 21 20:36:28 ollama[539850]: time=2026-02-21T20:36:28.802+03:00 level=INFO source=server.go:431 msg="starting runner" cmd="/usr/local/bin/ollama runner --ollama-engine --port 43123"
Feb 21 20:36:29 ollama[539850]: llama_model_loader: loaded meta data with 38 key-value pairs and 809 tensors from /ai/llm/models/blobs/sha256-f4f7f4e8e9f04b21d5c5a277223592388f17ae22a6e08f2ae1ab12bef6f9fca3 (version GGUF V3 (latest))
Feb 21 20:36:29 ollama[539850]: llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
Feb 21 20:36:29 ollama[539850]: llama_model_loader: - kv   0:                       general.architecture str              = minimax-m2
Feb 21 20:36:29 ollama[539850]: llama_model_loader: - kv   1:                          general.file_type u32              = 7
Feb 21 20:36:29 ollama[539850]: llama_model_loader: - kv   2:                            general.license str              = other
Feb 21 20:36:29 ollama[539850]: llama_model_loader: - kv   3:                       general.license.link str              = https://github.com/MiniMax-AI/MiniMax...
Feb 21 20:36:29 ollama[539850]: llama_model_loader: - kv   4:                       general.license.name str              = modified-mit
Feb 21 20:36:29 ollama[539850]: llama_model_loader: - kv   5:                               general.name str              = Workdir
Feb 21 20:36:29 ollama[539850]: llama_model_loader: - kv   6:                    general.parameter_count u64              = 228689764864
Feb 21 20:36:29 ollama[539850]: llama_model_loader: - kv   7:               general.quantization_version u32              = 2
Feb 21 20:36:29 ollama[539850]: llama_model_loader: - kv   8:                      general.sampling.temp f32              = 1.000000
Feb 21 20:36:29 ollama[539850]: llama_model_loader: - kv   9:                     general.sampling.top_k i32              = 40
Feb 21 20:36:29 ollama[539850]: llama_model_loader: - kv  10:                     general.sampling.top_p f32              = 0.950000
Feb 21 20:36:29 ollama[539850]: llama_model_loader: - kv  11:                         general.size_label str              = 256x4.9B
Feb 21 20:36:29 ollama[539850]: llama_model_loader: - kv  12:                               general.tags arr[str,1]       = ["text-generation"]
Feb 21 20:36:29 ollama[539850]: llama_model_loader: - kv  13:                               general.type str              = model
Feb 21 20:36:29 ollama[539850]: llama_model_loader: - kv  14:            minimax-m2.attention.head_count u32              = 48
Feb 21 20:36:29 ollama[539850]: llama_model_loader: - kv  15:         minimax-m2.attention.head_count_kv u32              = 8
Feb 21 20:36:29 ollama[539850]: llama_model_loader: - kv  16:            minimax-m2.attention.key_length u32              = 128
Feb 21 20:36:29 ollama[539850]: llama_model_loader: - kv  17: minimax-m2.attention.layer_norm_rms_epsilon f32              = 0.000001
Feb 21 20:36:29 ollama[539850]: llama_model_loader: - kv  18:          minimax-m2.attention.value_length u32              = 128
Feb 21 20:36:29 ollama[539850]: llama_model_loader: - kv  19:                     minimax-m2.block_count u32              = 62
Feb 21 20:36:29 ollama[539850]: llama_model_loader: - kv  20:                  minimax-m2.context_length u32              = 196608
Feb 21 20:36:29 ollama[539850]: llama_model_loader: - kv  21:                minimax-m2.embedding_length u32              = 3072
Feb 21 20:36:29 ollama[539850]: llama_model_loader: - kv  22:                    minimax-m2.expert_count u32              = 256
Feb 21 20:36:29 ollama[539850]: llama_model_loader: - kv  23:      minimax-m2.expert_feed_forward_length u32              = 1536
Feb 21 20:36:29 ollama[539850]: llama_model_loader: - kv  24:              minimax-m2.expert_gating_func u32              = 2
Feb 21 20:36:29 ollama[539850]: llama_model_loader: - kv  25:               minimax-m2.expert_used_count u32              = 8
Feb 21 20:36:29 ollama[539850]: llama_model_loader: - kv  26:             minimax-m2.feed_forward_length u32              = 1536
Feb 21 20:36:29 ollama[539850]: llama_model_loader: - kv  27:            minimax-m2.rope.dimension_count u32              = 64
Feb 21 20:36:29 ollama[539850]: llama_model_loader: - kv  28:                  minimax-m2.rope.freq_base f32              = 5000000.000000
Feb 21 20:36:29 ollama[539850]: llama_model_loader: - kv  29:                    tokenizer.chat_template str              = {# ----------‑‑‑ special token ...
Feb 21 20:36:29 ollama[539850]: llama_model_loader: - kv  30:                tokenizer.ggml.bos_token_id u32              = 200034
Feb 21 20:36:29 ollama[539850]: llama_model_loader: - kv  31:                tokenizer.ggml.eos_token_id u32              = 200020
Feb 21 20:36:29 ollama[539850]: llama_model_loader: - kv  32:                      tokenizer.ggml.merges arr[str,199744]  = ["Ġ Ġ", "Ġ t", "Ġ a", "i n", "e r...
Feb 21 20:36:29 ollama[539850]: llama_model_loader: - kv  33:                       tokenizer.ggml.model str              = gpt2
Feb 21 20:36:29 ollama[539850]: llama_model_loader: - kv  34:                         tokenizer.ggml.pre str              = minimax-m2
Feb 21 20:36:29 ollama[539850]: llama_model_loader: - kv  35:                  tokenizer.ggml.token_type arr[i32,200064]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
Feb 21 20:36:29 ollama[539850]: llama_model_loader: - kv  36:                      tokenizer.ggml.tokens arr[str,200064]  = ["Ā", "ā", "Ă", "ă", "Ą", "ą", ...
Feb 21 20:36:29 ollama[539850]: llama_model_loader: - kv  37:            tokenizer.ggml.unknown_token_id u32              = 200021
Feb 21 20:36:29 ollama[539850]: llama_model_loader: - type  f32:  373 tensors
Feb 21 20:36:29 ollama[539850]: llama_model_loader: - type q8_0:  436 tensors
Feb 21 20:36:29 ollama[539850]: print_info: file format = GGUF V3 (latest)
Feb 21 20:36:29 ollama[539850]: print_info: file type   = Q8_0
Feb 21 20:36:29 ollama[539850]: print_info: file size   = 226.43 GiB (8.51 BPW)
Feb 21 20:36:29 ollama[539850]: load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
Feb 21 20:36:29 ollama[539850]: load: printing all EOG tokens:
Feb 21 20:36:29 ollama[539850]: load:   - 200004 ('<fim_pad>')
Feb 21 20:36:29 ollama[539850]: load:   - 200005 ('<reponame>')
Feb 21 20:36:29 ollama[539850]: load:   - 200020 ('[e~[')
Feb 21 20:36:29 ollama[539850]: load: special tokens cache size = 54
Feb 21 20:36:29 ollama[539850]: load: token to piece cache size = 1.3355 MB
Feb 21 20:36:29 ollama[539850]: print_info: arch             = minimax-m2
Feb 21 20:36:29 ollama[539850]: print_info: vocab_only       = 1
Feb 21 20:36:29 ollama[539850]: print_info: no_alloc         = 0
Feb 21 20:36:29 ollama[539850]: print_info: model type       = ?B
Feb 21 20:36:29 ollama[539850]: print_info: model params     = 228.69 B
Feb 21 20:36:29 ollama[539850]: print_info: general.name     = Workdir
Feb 21 20:36:29 ollama[539850]: print_info: vocab type       = BPE
Feb 21 20:36:29 ollama[539850]: print_info: n_vocab          = 200064
Feb 21 20:36:29 ollama[539850]: print_info: n_merges         = 199744
Feb 21 20:36:29 ollama[539850]: print_info: BOS token        = 200034 ']~!b['
Feb 21 20:36:29 ollama[539850]: print_info: EOS token        = 200020 '[e~['
Feb 21 20:36:29 ollama[539850]: print_info: UNK token        = 200021 ']!d~['
Feb 21 20:36:29 ollama[539850]: print_info: LF token         = 10 'Ċ'
Feb 21 20:36:29 ollama[539850]: print_info: FIM PRE token    = 200001 '<fim_prefix>'
Feb 21 20:36:29 ollama[539850]: print_info: FIM SUF token    = 200003 '<fim_suffix>'
Feb 21 20:36:29 ollama[539850]: print_info: FIM MID token    = 200002 '<fim_middle>'
Feb 21 20:36:29 ollama[539850]: print_info: FIM PAD token    = 200004 '<fim_pad>'
Feb 21 20:36:29 ollama[539850]: print_info: FIM REP token    = 200005 '<reponame>'
Feb 21 20:36:29 ollama[539850]: print_info: EOG token        = 200004 '<fim_pad>'
Feb 21 20:36:29 ollama[539850]: print_info: EOG token        = 200005 '<reponame>'
Feb 21 20:36:29 ollama[539850]: print_info: EOG token        = 200020 '[e~['
Feb 21 20:36:29 ollama[539850]: print_info: max token length = 256
Feb 21 20:36:29 ollama[539850]: llama_model_load: vocab only - skipping tensors
Feb 21 20:36:29 ollama[539850]: time=2026-02-21T20:36:29.386+03:00 level=INFO source=server.go:431 msg="starting runner" cmd="/usr/local/bin/ollama runner --model /ai/llm/models/blobs/sha256-f4f7f4e8e9f04b21d5c5a277223592388f17ae22a6e08f2ae1ab12bef6f9fca3 --port 42699"
Feb 21 20:36:29 ollama[539850]: time=2026-02-21T20:36:29.386+03:00 level=INFO source=sched.go:491 msg="system memory" total="245.1 GiB" free="239.3 GiB" free_swap="6.8 GiB"
Feb 21 20:36:29 ollama[539850]: time=2026-02-21T20:36:29.386+03:00 level=INFO source=sched.go:498 msg="gpu memory" id=GPU-7a420261-3e37-b15a-2f9b-2fd7da322dbe library=CUDA available="23.1 GiB" free="23.5 GiB" minimum="457.0 MiB" overhead="0 B"
Feb 21 20:36:29 ollama[539850]: time=2026-02-21T20:36:29.386+03:00 level=INFO source=sched.go:498 msg="gpu memory" id=GPU-0dcf0ac3-bef6-f019-91c8-276cd8e2c0c0 library=CUDA available="22.7 GiB" free="23.1 GiB" minimum="457.0 MiB" overhead="0 B"
Feb 21 20:36:29 ollama[539850]: time=2026-02-21T20:36:29.386+03:00 level=INFO source=sched.go:498 msg="gpu memory" id=GPU-dfc3d6a8-e942-e903-be73-9b14ce01db39 library=CUDA available="31.0 GiB" free="31.4 GiB" minimum="457.0 MiB" overhead="0 B"
Feb 21 20:36:29 ollama[539850]: time=2026-02-21T20:36:29.386+03:00 level=INFO source=server.go:498 msg="loading model" "model layers"=63 requested=19
Feb 21 20:36:29 ollama[539850]: time=2026-02-21T20:36:29.386+03:00 level=INFO source=device.go:240 msg="model weights" device=CUDA0 size="7.3 GiB"
Feb 21 20:36:29 ollama[539850]: time=2026-02-21T20:36:29.386+03:00 level=INFO source=device.go:240 msg="model weights" device=CUDA1 size="3.6 GiB"
Feb 21 20:36:29 ollama[539850]: time=2026-02-21T20:36:29.387+03:00 level=INFO source=device.go:240 msg="model weights" device=CUDA2 size="58.1 GiB"
Feb 21 20:36:29 ollama[539850]: time=2026-02-21T20:36:29.387+03:00 level=INFO source=device.go:245 msg="model weights" device=CPU size="156.8 GiB"
Feb 21 20:36:29 ollama[539850]: time=2026-02-21T20:36:29.387+03:00 level=INFO source=device.go:251 msg="kv cache" device=CUDA0 size="585.9 MiB"
Feb 21 20:36:29 ollama[539850]: time=2026-02-21T20:36:29.387+03:00 level=INFO source=device.go:251 msg="kv cache" device=CUDA1 size="293.0 MiB"
Feb 21 20:36:29 ollama[539850]: time=2026-02-21T20:36:29.387+03:00 level=INFO source=device.go:251 msg="kv cache" device=CUDA2 size="4.6 GiB"
Feb 21 20:36:29 ollama[539850]: time=2026-02-21T20:36:29.387+03:00 level=INFO source=device.go:256 msg="kv cache" device=CPU size="12.3 GiB"
Feb 21 20:36:29 ollama[539850]: time=2026-02-21T20:36:29.387+03:00 level=INFO source=device.go:262 msg="compute graph" device=CUDA0 size="17.7 GiB"
Feb 21 20:36:29 ollama[539850]: time=2026-02-21T20:36:29.387+03:00 level=INFO source=device.go:262 msg="compute graph" device=CUDA1 size="17.7 GiB"
Feb 21 20:36:29 ollama[539850]: time=2026-02-21T20:36:29.387+03:00 level=INFO source=device.go:262 msg="compute graph" device=CUDA2 size="17.7 GiB"
Feb 21 20:36:29 ollama[539850]: time=2026-02-21T20:36:29.387+03:00 level=INFO source=device.go:272 msg="total memory" size="296.8 GiB"
Feb 21 20:36:29 ollama[539850]: time=2026-02-21T20:36:29.392+03:00 level=INFO source=runner.go:965 msg="starting go runner"
Feb 21 20:36:29 ollama[539850]: load_backend: loaded CPU backend from /usr/local/lib/ollama/libggml-cpu-icelake.so
Feb 21 20:36:29 ollama[539850]: ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
Feb 21 20:36:29 ollama[539850]: ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
Feb 21 20:36:29 ollama[539850]: ggml_cuda_init: found 3 CUDA devices:
Feb 21 20:36:29 ollama[539850]:   Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes, ID: GPU-7a420261-3e37-b15a-2f9b-2fd7da322dbe
Feb 21 20:36:29 ollama[539850]:   Device 1: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes, ID: GPU-0dcf0ac3-bef6-f019-91c8-276cd8e2c0c0
Feb 21 20:36:29 ollama[539850]:   Device 2: Tesla V100-SXM2-32GB, compute capability 7.0, VMM: yes, ID: GPU-dfc3d6a8-e942-e903-be73-9b14ce01db39
Feb 21 20:36:29 ollama[539850]: load_backend: loaded CUDA backend from /usr/local/lib/ollama/cuda_v12/libggml-cuda.so
Feb 21 20:36:29 ollama[539850]: time=2026-02-21T20:36:29.537+03:00 level=INFO source=ggml.go:104 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.BMI2=1 CPU.0.AVX512=1 CPU.0.AVX512_VBMI=1 CPU.0.AVX512_VNNI=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 CUDA.0.ARCHS=500,520,600,610,700,750,800,860,890,900,1200 CUDA.0.USE_GRAPHS=1 CUDA.0.PEER_MAX_BATCH_SIZE=128 CUDA.1.ARCHS=500,520,600,610,700,750,800,860,890,900,1200 CUDA.1.USE_GRAPHS=1 CUDA.1.PEER_MAX_BATCH_SIZE=128 CUDA.2.ARCHS=500,520,600,610,700,750,800,860,890,900,1200 CUDA.2.USE_GRAPHS=1 CUDA.2.PEER_MAX_BATCH_SIZE=128 compiler=cgo(gcc)
Feb 21 20:36:29 ollama[539850]: time=2026-02-21T20:36:29.537+03:00 level=INFO source=runner.go:1001 msg="Server listening on 127.0.0.1:42699"
Feb 21 20:36:29 ollama[539850]: time=2026-02-21T20:36:29.541+03:00 level=INFO source=runner.go:895 msg=load request="{Operation:commit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:Enabled KvSize:75000 KvCacheType: NumThreads:16 GPULayers:19[ID:GPU-dfc3d6a8-e942-e903-be73-9b14ce01db39 Layers:16(43..58) ID:GPU-7a420261-3e37-b15a-2f9b-2fd7da322dbe Layers:2(59..60) ID:GPU-0dcf0ac3-bef6-f019-91c8-276cd8e2c0c0 Layers:1(61..61)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
Feb 21 20:36:29 ollama[539850]: time=2026-02-21T20:36:29.541+03:00 level=INFO source=server.go:1350 msg="waiting for llama runner to start responding"
Feb 21 20:36:29 ollama[539850]: time=2026-02-21T20:36:29.541+03:00 level=INFO source=server.go:1384 msg="waiting for server to become available" status="llm server loading model"
Feb 21 20:36:29 ollama[539850]: ggml_backend_cuda_device_get_memory device GPU-dfc3d6a8-e942-e903-be73-9b14ce01db39 utilizing NVML memory reporting free: 33746845696 total: 34359738368
Feb 21 20:36:29 ollama[539850]: llama_model_load_from_file_impl: using device CUDA2 (Tesla V100-SXM2-32GB) (0000:09:00.0) - 32183 MiB free
Feb 21 20:36:29 ollama[539850]: ggml_backend_cuda_device_get_memory device GPU-7a420261-3e37-b15a-2f9b-2fd7da322dbe utilizing NVML memory reporting free: 24836243456 total: 25757220864
Feb 21 20:36:29 ollama[539850]: llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 4090) (0000:01:00.0) - 23685 MiB free
Feb 21 20:36:29 ollama[539850]: ggml_backend_cuda_device_get_memory device GPU-0dcf0ac3-bef6-f019-91c8-276cd8e2c0c0 utilizing NVML memory reporting free: 24836243456 total: 25757220864
Feb 21 20:36:29 ollama[539850]: llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 4090) (0000:03:00.0) - 23685 MiB free
Feb 21 20:36:29 ollama[539850]: llama_model_loader: loaded meta data with 38 key-value pairs and 809 tensors from /ai/llm/models/blobs/sha256-f4f7f4e8e9f04b21d5c5a277223592388f17ae22a6e08f2ae1ab12bef6f9fca3 (version GGUF V3 (latest))
Feb 21 20:36:29 ollama[539850]: llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
Feb 21 20:36:29 ollama[539850]: llama_model_loader: - kv   0:                       general.architecture str              = minimax-m2
Feb 21 20:36:29 ollama[539850]: llama_model_loader: - kv   1:                          general.file_type u32              = 7
Feb 21 20:36:29 ollama[539850]: llama_model_loader: - kv   2:                            general.license str              = other
Feb 21 20:36:29 ollama[539850]: llama_model_loader: - kv   3:                       general.license.link str              = https://github.com/MiniMax-AI/MiniMax...
Feb 21 20:36:29 ollama[539850]: llama_model_loader: - kv   4:                       general.license.name str              = modified-mit
Feb 21 20:36:29 ollama[539850]: llama_model_loader: - kv   5:                               general.name str              = Workdir
Feb 21 20:36:29 ollama[539850]: llama_model_loader: - kv   6:                    general.parameter_count u64              = 228689764864
Feb 21 20:36:29 ollama[539850]: llama_model_loader: - kv   7:               general.quantization_version u32              = 2
Feb 21 20:36:29 ollama[539850]: llama_model_loader: - kv   8:                      general.sampling.temp f32              = 1.000000
Feb 21 20:36:29 ollama[539850]: llama_model_loader: - kv   9:                     general.sampling.top_k i32              = 40
Feb 21 20:36:29 ollama[539850]: llama_model_loader: - kv  10:                     general.sampling.top_p f32              = 0.950000
Feb 21 20:36:29 ollama[539850]: llama_model_loader: - kv  11:                         general.size_label str              = 256x4.9B
Feb 21 20:36:29 ollama[539850]: llama_model_loader: - kv  12:                               general.tags arr[str,1]       = ["text-generation"]
Feb 21 20:36:29 ollama[539850]: llama_model_loader: - kv  13:                               general.type str              = model
Feb 21 20:36:29 ollama[539850]: llama_model_loader: - kv  14:            minimax-m2.attention.head_count u32              = 48
Feb 21 20:36:29 ollama[539850]: llama_model_loader: - kv  15:         minimax-m2.attention.head_count_kv u32              = 8
Feb 21 20:36:29 ollama[539850]: llama_model_loader: - kv  16:            minimax-m2.attention.key_length u32              = 128
Feb 21 20:36:29 ollama[539850]: llama_model_loader: - kv  17: minimax-m2.attention.layer_norm_rms_epsilon f32              = 0.000001
Feb 21 20:36:29 ollama[539850]: llama_model_loader: - kv  18:          minimax-m2.attention.value_length u32              = 128
Feb 21 20:36:29 ollama[539850]: llama_model_loader: - kv  19:                     minimax-m2.block_count u32              = 62
Feb 21 20:36:29 ollama[539850]: llama_model_loader: - kv  20:                  minimax-m2.context_length u32              = 196608
Feb 21 20:36:29 ollama[539850]: llama_model_loader: - kv  21:                minimax-m2.embedding_length u32              = 3072
Feb 21 20:36:29 ollama[539850]: llama_model_loader: - kv  22:                    minimax-m2.expert_count u32              = 256
Feb 21 20:36:29 ollama[539850]: llama_model_loader: - kv  23:      minimax-m2.expert_feed_forward_length u32              = 1536
Feb 21 20:36:29 ollama[539850]: llama_model_loader: - kv  24:              minimax-m2.expert_gating_func u32              = 2
Feb 21 20:36:29 ollama[539850]: llama_model_loader: - kv  25:               minimax-m2.expert_used_count u32              = 8
Feb 21 20:36:29 ollama[539850]: llama_model_loader: - kv  26:             minimax-m2.feed_forward_length u32              = 1536
Feb 21 20:36:29 ollama[539850]: llama_model_loader: - kv  27:            minimax-m2.rope.dimension_count u32              = 64
Feb 21 20:36:29 ollama[539850]: llama_model_loader: - kv  28:                  minimax-m2.rope.freq_base f32              = 5000000.000000
Feb 21 20:36:29 ollama[539850]: llama_model_loader: - kv  29:                    tokenizer.chat_template str              = {# ----------‑‑‑ special token ...
Feb 21 20:36:29 ollama[539850]: llama_model_loader: - kv  30:                tokenizer.ggml.bos_token_id u32              = 200034
Feb 21 20:36:29 ollama[539850]: llama_model_loader: - kv  31:                tokenizer.ggml.eos_token_id u32              = 200020
Feb 21 20:36:29 ollama[539850]: llama_model_loader: - kv  32:                      tokenizer.ggml.merges arr[str,199744]  = ["Ġ Ġ", "Ġ t", "Ġ a", "i n", "e r...
Feb 21 20:36:29 ollama[539850]: llama_model_loader: - kv  33:                       tokenizer.ggml.model str              = gpt2
Feb 21 20:36:29 ollama[539850]: llama_model_loader: - kv  34:                         tokenizer.ggml.pre str              = minimax-m2
Feb 21 20:36:29 ollama[539850]: llama_model_loader: - kv  35:                  tokenizer.ggml.token_type arr[i32,200064]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
Feb 21 20:36:29 ollama[539850]: llama_model_loader: - kv  36:                      tokenizer.ggml.tokens arr[str,200064]  = ["Ā", "ā", "Ă", "ă", "Ą", "ą", ...
Feb 21 20:36:29 ollama[539850]: llama_model_loader: - kv  37:            tokenizer.ggml.unknown_token_id u32              = 200021
Feb 21 20:36:29 ollama[539850]: llama_model_loader: - type  f32:  373 tensors
Feb 21 20:36:29 ollama[539850]: llama_model_loader: - type q8_0:  436 tensors
Feb 21 20:36:29 ollama[539850]: print_info: file format = GGUF V3 (latest)
Feb 21 20:36:29 ollama[539850]: print_info: file type   = Q8_0
Feb 21 20:36:29 ollama[539850]: print_info: file size   = 226.43 GiB (8.51 BPW)
Feb 21 20:36:29 ollama[539850]: load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
Feb 21 20:36:29 ollama[539850]: load: printing all EOG tokens:
Feb 21 20:36:29 ollama[539850]: load:   - 200004 ('<fim_pad>')
Feb 21 20:36:29 ollama[539850]: load:   - 200005 ('<reponame>')
Feb 21 20:36:29 ollama[539850]: load:   - 200020 ('[e~[')
Feb 21 20:36:29 ollama[539850]: load: special tokens cache size = 54
Feb 21 20:36:29 ollama[539850]: load: token to piece cache size = 1.3355 MB
Feb 21 20:36:29 ollama[539850]: print_info: arch             = minimax-m2
Feb 21 20:36:29 ollama[539850]: print_info: vocab_only       = 0
Feb 21 20:36:29 ollama[539850]: print_info: no_alloc         = 0
Feb 21 20:36:29 ollama[539850]: print_info: n_ctx_train      = 196608
Feb 21 20:36:29 ollama[539850]: print_info: n_embd           = 3072
Feb 21 20:36:29 ollama[539850]: print_info: n_embd_inp       = 3072
Feb 21 20:36:29 ollama[539850]: print_info: n_layer          = 62
Feb 21 20:36:29 ollama[539850]: print_info: n_head           = 48
Feb 21 20:36:29 ollama[539850]: print_info: n_head_kv        = 8
Feb 21 20:36:29 ollama[539850]: print_info: n_rot            = 64
Feb 21 20:36:29 ollama[539850]: print_info: n_swa            = 0
Feb 21 20:36:29 ollama[539850]: print_info: is_swa_any       = 0
Feb 21 20:36:29 ollama[539850]: print_info: n_embd_head_k    = 128
Feb 21 20:36:29 ollama[539850]: print_info: n_embd_head_v    = 128
Feb 21 20:36:29 ollama[539850]: print_info: n_gqa            = 6
Feb 21 20:36:29 ollama[539850]: print_info: n_embd_k_gqa     = 1024
Feb 21 20:36:29 ollama[539850]: print_info: n_embd_v_gqa     = 1024
Feb 21 20:36:29 ollama[539850]: print_info: f_norm_eps       = 0.0e+00
Feb 21 20:36:29 ollama[539850]: print_info: f_norm_rms_eps   = 1.0e-06
Feb 21 20:36:29 ollama[539850]: print_info: f_clamp_kqv      = 0.0e+00
Feb 21 20:36:29 ollama[539850]: print_info: f_max_alibi_bias = 0.0e+00
Feb 21 20:36:29 ollama[539850]: print_info: f_logit_scale    = 0.0e+00
Feb 21 20:36:29 ollama[539850]: print_info: f_attn_scale     = 0.0e+00
Feb 21 20:36:29 ollama[539850]: print_info: n_ff             = 1536
Feb 21 20:36:29 ollama[539850]: print_info: n_expert         = 256
Feb 21 20:36:29 ollama[539850]: print_info: n_expert_used    = 8
Feb 21 20:36:29 ollama[539850]: print_info: n_expert_groups  = 0
Feb 21 20:36:29 ollama[539850]: print_info: n_group_used     = 0
Feb 21 20:36:29 ollama[539850]: print_info: causal attn      = 1
Feb 21 20:36:29 ollama[539850]: print_info: pooling type     = 0
Feb 21 20:36:29 ollama[539850]: print_info: rope type        = 2
Feb 21 20:36:29 ollama[539850]: print_info: rope scaling     = linear
Feb 21 20:36:29 ollama[539850]: print_info: freq_base_train  = 5000000.0
Feb 21 20:36:29 ollama[539850]: print_info: freq_scale_train = 1
Feb 21 20:36:29 ollama[539850]: print_info: n_ctx_orig_yarn  = 196608
Feb 21 20:36:29 ollama[539850]: print_info: rope_yarn_log_mul= 0.0000
Feb 21 20:36:29 ollama[539850]: print_info: rope_finetuned   = unknown
Feb 21 20:36:29 ollama[539850]: print_info: model type       = 230B.A10B
Feb 21 20:36:29 ollama[539850]: print_info: model params     = 228.69 B
Feb 21 20:36:29 ollama[539850]: print_info: general.name     = Workdir
Feb 21 20:36:29 ollama[539850]: print_info: vocab type       = BPE
Feb 21 20:36:29 ollama[539850]: print_info: n_vocab          = 200064
Feb 21 20:36:29 ollama[539850]: print_info: n_merges         = 199744
Feb 21 20:36:29 ollama[539850]: print_info: BOS token        = 200034 ']~!b['
Feb 21 20:36:29 ollama[539850]: print_info: EOS token        = 200020 '[e~['
Feb 21 20:36:29 ollama[539850]: print_info: UNK token        = 200021 ']!d~['
Feb 21 20:36:29 ollama[539850]: print_info: LF token         = 10 'Ċ'
Feb 21 20:36:29 ollama[539850]: print_info: FIM PRE token    = 200001 '<fim_prefix>'
Feb 21 20:36:29 ollama[539850]: print_info: FIM SUF token    = 200003 '<fim_suffix>'
Feb 21 20:36:29 ollama[539850]: print_info: FIM MID token    = 200002 '<fim_middle>'
Feb 21 20:36:29 ollama[539850]: print_info: FIM PAD token    = 200004 '<fim_pad>'
Feb 21 20:36:29 ollama[539850]: print_info: FIM REP token    = 200005 '<reponame>'
Feb 21 20:36:29 ollama[539850]: print_info: EOG token        = 200004 '<fim_pad>'
Feb 21 20:36:29 ollama[539850]: print_info: EOG token        = 200005 '<reponame>'
Feb 21 20:36:29 ollama[539850]: print_info: EOG token        = 200020 '[e~['
Feb 21 20:36:29 ollama[539850]: print_info: max token length = 256
Feb 21 20:36:29 ollama[539850]: load_tensors: loading model tensors, this can take a while... (mmap = false)
Feb 21 20:36:31 ollama[539850]: time=2026-02-21T20:36:31.997+03:00 level=INFO source=server.go:1384 msg="waiting for server to become available" status="llm server not responding"
Feb 21 20:36:32 ollama[539850]: time=2026-02-21T20:36:32.255+03:00 level=INFO source=server.go:1384 msg="waiting for server to become available" status="llm server loading model"
Feb 21 20:36:51 ollama[539850]: load_tensors: offloading 19 repeating layers to GPU
Feb 21 20:36:51 ollama[539850]: load_tensors: offloaded 19/63 layers to GPU
Feb 21 20:36:51 ollama[539850]: load_tensors:        CUDA0 model buffer size =  7439.35 MiB
Feb 21 20:36:51 ollama[539850]: load_tensors:        CUDA1 model buffer size =  3719.68 MiB
Feb 21 20:36:51 ollama[539850]: load_tensors:        CUDA2 model buffer size = 59514.83 MiB
Feb 21 20:36:51 ollama[539850]: load_tensors:    CUDA_Host model buffer size = 161191.63 MiB
Feb 21 20:38:20 ollama[539850]: llama_context: constructing llama_context
Feb 21 20:38:20 ollama[539850]: llama_context: n_seq_max     = 1
Feb 21 20:38:20 ollama[539850]: llama_context: n_ctx         = 75008
Feb 21 20:38:20 ollama[539850]: llama_context: n_ctx_seq     = 75008
Feb 21 20:38:20 ollama[539850]: llama_context: n_batch       = 512
Feb 21 20:38:20 ollama[539850]: llama_context: n_ubatch      = 512
Feb 21 20:38:20 ollama[539850]: llama_context: causal_attn   = 1
Feb 21 20:38:20 ollama[539850]: llama_context: flash_attn    = enabled
Feb 21 20:38:20 ollama[539850]: llama_context: kv_unified    = false
Feb 21 20:38:20 ollama[539850]: llama_context: freq_base     = 5000000.0
Feb 21 20:38:20 ollama[539850]: llama_context: freq_scale    = 1
Feb 21 20:38:20 ollama[539850]: llama_context: n_ctx_seq (75008) < n_ctx_train (196608) -- the full capacity of the model will not be utilized
Feb 21 20:38:20 ollama[539850]: llama_context:        CPU  output buffer size =     0.77 MiB
Feb 21 20:38:20 ollama[539850]: llama_kv_cache:        CPU KV buffer size = 12599.00 MiB
Feb 21 20:38:21 ollama[539850]: llama_kv_cache:      CUDA0 KV buffer size =   586.00 MiB
Feb 21 20:38:22 ollama[539850]: llama_kv_cache:      CUDA1 KV buffer size =   293.00 MiB
Feb 21 20:38:22 ollama[539850]: llama_kv_cache:      CUDA2 KV buffer size =  4688.00 MiB
Feb 21 20:38:24 ollama[539850]: llama_kv_cache: size = 18166.00 MiB ( 75008 cells,  62 layers,  1/1 seqs), K (f16): 9083.00 MiB, V (f16): 9083.00 MiB
Feb 21 20:38:24 ollama[539850]: llama_context:      CUDA2 compute buffer size =  1760.78 MiB
Feb 21 20:38:24 ollama[539850]: llama_context:      CUDA0 compute buffer size =   158.76 MiB
Feb 21 20:38:24 ollama[539850]: llama_context:      CUDA1 compute buffer size =   109.26 MiB
Feb 21 20:38:24 ollama[539850]: llama_context:  CUDA_Host compute buffer size =   152.51 MiB
Feb 21 20:38:24 ollama[539850]: llama_context: graph nodes  = 3975
Feb 21 20:38:24 ollama[539850]: llama_context: graph splits = 651 (with bs=512), 5 (with bs=1)
Feb 21 20:38:25 ollama[539850]: time=2026-02-21T20:38:25.031+03:00 level=INFO source=server.go:1388 msg="llama runner started in 115.64 seconds"
Feb 21 20:38:25 ollama[539850]: time=2026-02-21T20:38:25.031+03:00 level=INFO source=sched.go:566 msg="loaded runners" count=1
Feb 21 20:38:25 ollama[539850]: time=2026-02-21T20:38:25.031+03:00 level=INFO source=server.go:1350 msg="waiting for llama runner to start responding"
Feb 21 20:38:25 ollama[539850]: time=2026-02-21T20:38:25.031+03:00 level=INFO source=server.go:1388 msg="llama runner started in 115.65 seconds"
Feb 21 20:42:15 ollama[539850]: [GIN] 2026/02/21 - 20:42:15 | 200 |         5m46s |  192.168.127.20 | POST     "/api/chat"

OS

Linux

GPU

Nvidia

CPU

AMD

Ollama version

0.16.3

Originally created by @ka-admin on GitHub (Feb 21, 2026). Original GitHub issue: https://github.com/ollama/ollama/issues/14351 ### What is the issue? <html><body> <!--StartFragment--><p class="font-claude-response-body break-words whitespace-normal leading-[1.7]">When running a large MoE model with the new engine on a mixed multi-GPU system under partial offload, the layer scheduler assigns far more VRAM to a GPU than it physically has, silently overflowing those tensors into pinned host RAM (<code class="bg-text-200/5 border border-0.5 border-border-300 text-danger-000 whitespace-pre-wrap rounded-[0.4rem] px-1 py-px text-[0.9rem]">CUDA_Host</code>). The result is that layers reported as "GPU offloaded" are in fact running from system RAM, with added PCIe overhead.</p> <p class="font-claude-response-body break-words whitespace-normal leading-[1.7]">Additionally, <code class="bg-text-200/5 border border-0.5 border-border-300 text-danger-000 whitespace-pre-wrap rounded-[0.4rem] px-1 py-px text-[0.9rem]">OLLAMA_SCHED_SPREAD=true</code> has <strong>zero effect</strong> — the layer assignment and buffer allocation are byte-for-byte identical with the flag on or off, suggesting the bug lies in per-layer VRAM cost estimation before any placement policy is applied.</p> <h2 class="text-text-100 mt-3 -mb-1 text-[1.125rem] font-bold">Environment</h2> <ul class="[li_&amp;]:mb-0 [li_&amp;]:mt-1 [li_&amp;]:gap-1 [&amp;:not(:last-child)_ul]:pb-1 [&amp;:not(:last-child)_ol]:pb-1 list-disc flex flex-col gap-1 pl-8 mb-3"> <li class="whitespace-normal break-words pl-2"><strong>Ollama version:</strong> 0.16.3</li> <li class="whitespace-normal break-words pl-2"><strong>OS:</strong> Linux (Ubuntu), kernel with systemd</li> <li class="whitespace-normal break-words pl-2"><strong>Model:</strong> MiniMax M2.5 Q8_0 GGUF (<code class="bg-text-200/5 border border-0.5 border-border-300 text-danger-000 whitespace-pre-wrap rounded-[0.4rem] px-1 py-px text-[0.9rem]">minimax-m2</code> architecture, 228.69B params, 226.43 GiB)</li> <li class="whitespace-normal break-words pl-2"><strong>GPUs:</strong></li> </ul> <div class="overflow-x-auto w-full px-2 mb-6"> Device | Name | VRAM | Compute -- | -- | -- | -- CUDA0 | NVIDIA GeForce RTX 4090 | 24 GiB | 8.9 CUDA1 | NVIDIA GeForce RTX 4090 | 24 GiB | 8.9 CUDA2 | Tesla V100-SXM2-32GB | 32 GiB | 7.0 </div> <ul class="[li_&amp;]:mb-0 [li_&amp;]:mt-1 [li_&amp;]:gap-1 [&amp;:not(:last-child)_ul]:pb-1 [&amp;:not(:last-child)_ol]:pb-1 list-disc flex flex-col gap-1 pl-8 mb-3"> <li class="whitespace-normal break-words pl-2"><strong>System RAM:</strong> 256 GiB DDR5</li> <li class="whitespace-normal break-words pl-2"><strong>Driver / CUDA:</strong> 580.126.18 / 13.0</li> <li class="whitespace-normal break-words pl-2"><strong>Relevant env config:</strong></li> </ul> <div class="relative group/copy bg-bg-000/50 border-0.5 border-border-400 rounded-lg"><div class="sticky opacity-0 group-hover/copy:opacity-100 top-2 py-2 h-12 w-0 float-right"><div class="absolute right-0 h-8 px-2 items-center inline-flex z-10"><button class="inline-flex items-center justify-center relative shrink-0 can-focus select-none disabled:pointer-events-none disabled:opacity-50 disabled:shadow-none disabled:drop-shadow-none border-transparent transition font-base duration-300 ease-[cubic-bezier(0.165,0.85,0.45,1)] h-8 w-8 rounded-md active:scale-95 backdrop-blur-md Button_ghost__BUAoh" type="button" aria-label="Copy to clipboard" data-state="closed"><div class="relative"><div class="transition-all opacity-100 scale-100" 
style="width: 20px; height: 20px; display: flex; align-items: center; justify-content: center;"><svg width="20" height="20" viewBox="0 0 20 20" fill="currentColor" xmlns="http://www.w3.org/2000/svg" style="flex-shrink: 0;" class="transition-all opacity-100 scale-100" aria-hidden="true"><path d="M12.5 3C13.3284 3 14 3.67157 14 4.5V6H15.5C16.3284 6 17 6.67157 17 7.5V15.5C17 16.3284 16.3284 17 15.5 17H7.5C6.67157 17 6 16.3284 6 15.5V14H4.5C3.67157 14 3 13.3284 3 12.5V4.5C3 3.67157 3.67157 3 4.5 3H12.5ZM14 12.5C14 13.3284 13.3284 14 12.5 14H7V15.5C7 15.7761 7.22386 16 7.5 16H15.5C15.7761 16 16 15.7761 16 15.5V7.5C16 7.22386 15.7761 7 15.5 7H14V12.5ZM4.5 4C4.22386 4 4 4.22386 4 4.5V12.5C4 12.7761 4.22386 13 4.5 13H12.5C12.7761 13 13 12.7761 13 12.5V4.5C13 4.22386 12.7761 4 12.5 4H4.5Z"></path></svg></div><div class="absolute inset-0 flex items-center justify-center"><div class="transition-all opacity-0 scale-50" style="width: 20px; height: 20px; display: flex; align-items: center; justify-content: center;"><svg width="20" height="20" viewBox="0 0 20 20" fill="currentColor" xmlns="http://www.w3.org/2000/svg" style="flex-shrink: 0;" class="transition-all opacity-0 scale-50" aria-hidden="true"><path d="M15.1883 5.10908C15.3699 4.96398 15.6346 4.96153 15.8202 5.11592C16.0056 5.27067 16.0504 5.53125 15.9403 5.73605L15.8836 5.82003L8.38354 14.8202C8.29361 14.9279 8.16242 14.9925 8.02221 14.9989C7.88203 15.0051 7.74545 14.9526 7.64622 14.8534L4.14617 11.3533L4.08172 11.2752C3.95384 11.0811 3.97542 10.817 4.14617 10.6463C4.31693 10.4755 4.58105 10.4539 4.77509 10.5818L4.85321 10.6463L7.96556 13.7586L15.1161 5.1794L15.1883 5.10908Z"></path></svg></div></div></div></button></div></div><div class="overflow-x-auto"><pre class="code-block__code !my-0 !rounded-lg !text-sm !leading-relaxed p-3.5" style="color: rgb(20, 24, 31); background: transparent; font-family: var(--font-mono);"><code style="color: rgb(20, 24, 31); background: transparent; font-family: var(--font-mono); white-space: pre-wrap;"><span><span>CUDA_VISIBLE_DEVICES=0,1,2 </span></span><span>OLLAMA_CONTEXT_LENGTH=131072 </span><span>OLLAMA_FLASH_ATTENTION=true </span><span>OLLAMA_NUM_PARALLEL=1</span></code></pre></div></div> <h2 class="text-text-100 mt-3 -mb-1 text-[1.125rem] font-bold">Steps to reproduce</h2> <ol class="[li_&amp;]:mb-0 [li_&amp;]:mt-1 [li_&amp;]:gap-1 [&amp;:not(:last-child)_ul]:pb-1 [&amp;:not(:last-child)_ol]:pb-1 list-decimal flex flex-col gap-1 pl-8 mb-3"> <li class="whitespace-normal break-words pl-2">Run Ollama 0.16.3 on a mixed multi-GPU system where the GPU with largest total VRAM (V100, 32 GiB) cannot physically hold the layers the scheduler intends to assign it</li> <li class="whitespace-normal break-words pl-2">Load MiniMax M2.5 Q8_0 with partial GPU offload (19 of 63 layers)</li> <li class="whitespace-normal break-words pl-2">Observe layer assignment and actual buffer allocation in logs</li> </ol> <h2 class="text-text-100 mt-3 -mb-1 text-[1.125rem] font-bold">Expected behavior</h2> <p class="font-claude-response-body break-words whitespace-normal leading-[1.7]">Layers assigned to each GPU should not exceed that GPU's physical VRAM. If a GPU cannot fit its assigned layers, fewer layers should be assigned or they should be redistributed. 
<code class="bg-text-200/5 border border-0.5 border-border-300 text-danger-000 whitespace-pre-wrap rounded-[0.4rem] px-1 py-px text-[0.9rem]">OLLAMA_SCHED_SPREAD=true</code> should produce a meaningfully different (more even) distribution than the default greedy policy.</p> <h2 class="text-text-100 mt-3 -mb-1 text-[1.125rem] font-bold">Actual behavior</h2> <p class="font-claude-response-body break-words whitespace-normal leading-[1.7]">The scheduler assigns 16 layers to the V100 (32 GiB), but the actual model buffer for those layers totals <strong>59,514 MiB (~58 GiB)</strong> — nearly 2× its physical capacity. This overflows silently into <code class="bg-text-200/5 border border-0.5 border-border-300 text-danger-000 whitespace-pre-wrap rounded-[0.4rem] px-1 py-px text-[0.9rem]">CUDA_Host</code> (pinned system RAM) with no warning or error. The 16 "GPU layers" on the V100 are functionally CPU layers.</p> <p class="font-claude-response-body break-words whitespace-normal leading-[1.7]">Furthermore, enabling <code class="bg-text-200/5 border border-0.5 border-border-300 text-danger-000 whitespace-pre-wrap rounded-[0.4rem] px-1 py-px text-[0.9rem]">OLLAMA_SCHED_SPREAD=true</code> produces <strong>identical output</strong> — same layer assignment, same buffer sizes, same graph splits — indicating the spread/pack policy is not the issue and that incorrect per-layer cost estimation is happening upstream of any placement decision.</p> <h3 class="text-text-100 mt-2 -mb-1 text-base font-bold">Experiment 1 — <code class="bg-text-200/5 border border-0.5 border-border-300 text-danger-000 whitespace-pre-wrap rounded-[0.4rem] px-1 py-px text-[0.9rem]">OLLAMA_SCHED_SPREAD=false</code> (default)</h3> <div class="relative group/copy bg-bg-000/50 border-0.5 border-border-400 rounded-lg"><div class="sticky opacity-0 group-hover/copy:opacity-100 top-2 py-2 h-12 w-0 float-right"><div class="absolute right-0 h-8 px-2 items-center inline-flex z-10"><button class="inline-flex items-center justify-center relative shrink-0 can-focus select-none disabled:pointer-events-none disabled:opacity-50 disabled:shadow-none disabled:drop-shadow-none border-transparent transition font-base duration-300 ease-[cubic-bezier(0.165,0.85,0.45,1)] h-8 w-8 rounded-md active:scale-95 backdrop-blur-md Button_ghost__BUAoh" type="button" aria-label="Copy to clipboard" data-state="closed"><div class="relative"><div class="transition-all opacity-100 scale-100" style="width: 20px; height: 20px; display: flex; align-items: center; justify-content: center;"><svg width="20" height="20" viewBox="0 0 20 20" fill="currentColor" xmlns="http://www.w3.org/2000/svg" style="flex-shrink: 0;" class="transition-all opacity-100 scale-100" aria-hidden="true"><path d="M12.5 3C13.3284 3 14 3.67157 14 4.5V6H15.5C16.3284 6 17 6.67157 17 7.5V15.5C17 16.3284 16.3284 17 15.5 17H7.5C6.67157 17 6 16.3284 6 15.5V14H4.5C3.67157 14 3 13.3284 3 12.5V4.5C3 3.67157 3.67157 3 4.5 3H12.5ZM14 12.5C14 13.3284 13.3284 14 12.5 14H7V15.5C7 15.7761 7.22386 16 7.5 16H15.5C15.7761 16 16 15.7761 16 15.5V7.5C16 7.22386 15.7761 7 15.5 7H14V12.5ZM4.5 4C4.22386 4 4 4.22386 4 4.5V12.5C4 12.7761 4.22386 13 4.5 13H12.5C12.7761 13 13 12.7761 13 12.5V4.5C13 4.22386 12.7761 4 12.5 4H4.5Z"></path></svg></div><div class="absolute inset-0 flex items-center justify-center"><div class="transition-all opacity-0 scale-50" style="width: 20px; height: 20px; display: flex; align-items: center; justify-content: center;"><svg width="20" height="20" viewBox="0 0 20 20" fill="currentColor" 
xmlns="http://www.w3.org/2000/svg" style="flex-shrink: 0;" class="transition-all opacity-0 scale-50" aria-hidden="true"><path d="M15.1883 5.10908C15.3699 4.96398 15.6346 4.96153 15.8202 5.11592C16.0056 5.27067 16.0504 5.53125 15.9403 5.73605L15.8836 5.82003L8.38354 14.8202C8.29361 14.9279 8.16242 14.9925 8.02221 14.9989C7.88203 15.0051 7.74545 14.9526 7.64622 14.8534L4.14617 11.3533L4.08172 11.2752C3.95384 11.0811 3.97542 10.817 4.14617 10.6463C4.31693 10.4755 4.58105 10.4539 4.77509 10.5818L4.85321 10.6463L7.96556 13.7586L15.1161 5.1794L15.1883 5.10908Z"></path></svg></div></div></div></button></div></div><div class="overflow-x-auto"><pre class="code-block__code !my-0 !rounded-lg !text-sm !leading-relaxed p-3.5" style="color: rgb(20, 24, 31); background: transparent; font-family: var(--font-mono);"><code style="color: rgb(20, 24, 31); background: transparent; font-family: var(--font-mono); white-space: pre-wrap;"><span><span>GPULayers:19[ </span></span><span> GPU-dfc3d6a8 (V100-32GB): Layers:16 (43..58) </span><span> GPU-7a420261 (RTX4090 #1): Layers:2 (59..60) </span><span> GPU-0dcf0ac3 (RTX4090 #2): Layers:1 (61..61) </span><span>] </span><span> </span><span>load_tensors: CUDA0 model buffer size = 7,439.35 MiB </span><span>load_tensors: CUDA1 model buffer size = 3,719.68 MiB </span><span>load_tensors: CUDA2 model buffer size = 59,514.83 MiB ← exceeds 32 GiB physical VRAM </span><span>load_tensors: CUDA_Host model buffer = 161,191.63 MiB </span><span> </span><span>graph splits = 651 (with bs=512), 5 (with bs=1)</span></code></pre></div></div> <h3 class="text-text-100 mt-2 -mb-1 text-base font-bold">Experiment 2 — <code class="bg-text-200/5 border border-0.5 border-border-300 text-danger-000 whitespace-pre-wrap rounded-[0.4rem] px-1 py-px text-[0.9rem]">OLLAMA_SCHED_SPREAD=true</code></h3> <div class="relative group/copy bg-bg-000/50 border-0.5 border-border-400 rounded-lg"><div class="sticky opacity-0 group-hover/copy:opacity-100 top-2 py-2 h-12 w-0 float-right"><div class="absolute right-0 h-8 px-2 items-center inline-flex z-10"><button class="inline-flex items-center justify-center relative shrink-0 can-focus select-none disabled:pointer-events-none disabled:opacity-50 disabled:shadow-none disabled:drop-shadow-none border-transparent transition font-base duration-300 ease-[cubic-bezier(0.165,0.85,0.45,1)] h-8 w-8 rounded-md active:scale-95 backdrop-blur-md Button_ghost__BUAoh" type="button" aria-label="Copy to clipboard" data-state="closed"><div class="relative"><div class="transition-all opacity-100 scale-100" style="width: 20px; height: 20px; display: flex; align-items: center; justify-content: center;"><svg width="20" height="20" viewBox="0 0 20 20" fill="currentColor" xmlns="http://www.w3.org/2000/svg" style="flex-shrink: 0;" class="transition-all opacity-100 scale-100" aria-hidden="true"><path d="M12.5 3C13.3284 3 14 3.67157 14 4.5V6H15.5C16.3284 6 17 6.67157 17 7.5V15.5C17 16.3284 16.3284 17 15.5 17H7.5C6.67157 17 6 16.3284 6 15.5V14H4.5C3.67157 14 3 13.3284 3 12.5V4.5C3 3.67157 3.67157 3 4.5 3H12.5ZM14 12.5C14 13.3284 13.3284 14 12.5 14H7V15.5C7 15.7761 7.22386 16 7.5 16H15.5C15.7761 16 16 15.7761 16 15.5V7.5C16 7.22386 15.7761 7 15.5 7H14V12.5ZM4.5 4C4.22386 4 4 4.22386 4 4.5V12.5C4 12.7761 4.22386 13 4.5 13H12.5C12.7761 13 13 12.7761 13 12.5V4.5C13 4.22386 12.7761 4 12.5 4H4.5Z"></path></svg></div><div class="absolute inset-0 flex items-center justify-center"><div class="transition-all opacity-0 scale-50" style="width: 20px; height: 20px; display: flex; align-items: center; 
justify-content: center;"><svg width="20" height="20" viewBox="0 0 20 20" fill="currentColor" xmlns="http://www.w3.org/2000/svg" style="flex-shrink: 0;" class="transition-all opacity-0 scale-50" aria-hidden="true"><path d="M15.1883 5.10908C15.3699 4.96398 15.6346 4.96153 15.8202 5.11592C16.0056 5.27067 16.0504 5.53125 15.9403 5.73605L15.8836 5.82003L8.38354 14.8202C8.29361 14.9279 8.16242 14.9925 8.02221 14.9989C7.88203 15.0051 7.74545 14.9526 7.64622 14.8534L4.14617 11.3533L4.08172 11.2752C3.95384 11.0811 3.97542 10.817 4.14617 10.6463C4.31693 10.4755 4.58105 10.4539 4.77509 10.5818L4.85321 10.6463L7.96556 13.7586L15.1161 5.1794L15.1883 5.10908Z"></path></svg></div></div></div></button></div></div><div class="overflow-x-auto"><pre class="code-block__code !my-0 !rounded-lg !text-sm !leading-relaxed p-3.5" style="color: rgb(20, 24, 31); background: transparent; font-family: var(--font-mono);"><code style="color: rgb(20, 24, 31); background: transparent; font-family: var(--font-mono); white-space: pre-wrap;"><span><span>GPULayers:19[ </span></span><span> GPU-dfc3d6a8 (V100-32GB): Layers:16 (43..58) ← IDENTICAL </span><span> GPU-7a420261 (RTX4090 #1): Layers:2 (59..60) </span><span> GPU-0dcf0ac3 (RTX4090 #2): Layers:1 (61..61) </span><span>] </span><span> </span><span>load_tensors: CUDA0 model buffer size = 7,439.35 MiB </span><span>load_tensors: CUDA1 model buffer size = 3,719.68 MiB </span><span>load_tensors: CUDA2 model buffer size = 59,514.83 MiB ← IDENTICAL </span><span>load_tensors: CUDA_Host model buffer = 161,191.63 MiB </span><span> </span><span>graph splits = 651 (with bs=512), 5 (with bs=1)</span></code></pre></div></div> <h2 class="text-text-100 mt-3 -mb-1 text-[1.125rem] font-bold">Root cause hypothesis</h2> <p class="font-claude-response-body break-words whitespace-normal leading-[1.7]"><code class="bg-text-200/5 border border-0.5 border-border-300 text-danger-000 whitespace-pre-wrap rounded-[0.4rem] px-1 py-px text-[0.9rem]">minimax-m2</code> uses <strong>256 experts per MoE layer</strong> with <code class="bg-text-200/5 border border-0.5 border-border-300 text-danger-000 whitespace-pre-wrap rounded-[0.4rem] px-1 py-px text-[0.9rem]">expert_feed_forward_length=1536</code> and <code class="bg-text-200/5 border border-0.5 border-border-300 text-danger-000 whitespace-pre-wrap rounded-[0.4rem] px-1 py-px text-[0.9rem]">embedding_length=3072</code>. At Q8_0, each MoE layer weighs approximately <strong>3.6 GiB</strong> in actual allocated buffers. The scheduler's per-layer VRAM cost estimator is clearly computing a much smaller value — otherwise it would never plan to fit 16 such layers (~58 GiB) onto a 32 GiB GPU.</p> <p class="font-claude-response-body break-words whitespace-normal leading-[1.7]">The GPUs are sorted largest-VRAM-first (V100 at 32 GiB before both 4090s at 24 GiB), so the greedy packer fills the V100 until its estimated budget is exhausted. 
Because the estimate is wrong, it over-assigns by ~2×, and llama.cpp silently backs the overflow with pinned host RAM.</p> <p class="font-claude-response-body break-words whitespace-normal leading-[1.7]">The extremely high graph split count (<code class="bg-text-200/5 border border-0.5 border-border-300 text-danger-000 whitespace-pre-wrap rounded-[0.4rem] px-1 py-px text-[0.9rem]">651</code> at bs=512 vs <code class="bg-text-200/5 border border-0.5 border-border-300 text-danger-000 whitespace-pre-wrap rounded-[0.4rem] px-1 py-px text-[0.9rem]">5</code> at bs=1) is a secondary symptom of the resulting cross-device tensor fragmentation.</p> <p class="font-claude-response-body break-words whitespace-normal leading-[1.7]">The bug likely lives in the per-layer cost estimation for <code class="bg-text-200/5 border border-0.5 border-border-300 text-danger-000 whitespace-pre-wrap rounded-[0.4rem] px-1 py-px text-[0.9rem]">minimax-m2</code> in <code class="bg-text-200/5 border border-0.5 border-border-300 text-danger-000 whitespace-pre-wrap rounded-[0.4rem] px-1 py-px text-[0.9rem]">device.go</code> (or equivalent scheduler weight logic), not in the spread/pack placement policy.</p><!--EndFragment--> </body> </html> ### Relevant log output ```shell ollama.service: Consumed 33min 52.940s CPU time, 218.4G memory peak, 5.6G memory swap peak. Feb 21 20:36:21 systemd[1]: Started ollama.service - Ollama Service. Feb 21 20:36:21 ollama[539850]: time=2026-02-21T20:36:21.938+03:00 level=INFO source=routes.go:1663 msg="server config" env="map[CUDA_VISIBLE_DEVICES:0,1,2 GGML_VK_VISIBLE_DEVICES: GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: HTTPS_PROXY: HTTP_PROXY: NO_PROXY: OLLAMA_CONTEXT_LENGTH:131072 OLLAMA_DEBUG:INFO OLLAMA_EDITOR: OLLAMA_FLASH_ATTENTION:true OLLAMA_GPU_OVERHEAD:0 OLLAMA_HOST:http://0.0.0.0:11434 OLLAMA_KEEP_ALIVE:1h0m0s OLLAMA_KV_CACHE_TYPE: OLLAMA_LLM_LIBRARY: OLLAMA_LOAD_TIMEOUT:30m0s OLLAMA_MAX_LOADED_MODELS:0 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:/ai/llm/models OLLAMA_MULTIUSER_CACHE:false OLLAMA_NEW_ENGINE:true OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NO_CLOUD:false OLLAMA_NUM_PARALLEL:1 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://* vscode-webview://* vscode-file://*] OLLAMA_REMOTES:[ollama.com] OLLAMA_SCHED_SPREAD:true OLLAMA_VULKAN:false ROCR_VISIBLE_DEVICES: http_proxy: https_proxy: no_proxy:]" Feb 21 20:36:21 ollama[539850]: time=2026-02-21T20:36:21.938+03:00 level=INFO source=routes.go:1665 msg="Ollama cloud disabled: false" Feb 21 20:36:21 ollama[539850]: time=2026-02-21T20:36:21.949+03:00 level=INFO source=images.go:473 msg="total blobs: 22" Feb 21 20:36:21 ollama[539850]: time=2026-02-21T20:36:21.949+03:00 level=INFO source=images.go:480 msg="total unused blobs removed: 0" Feb 21 20:36:21 ollama[539850]: time=2026-02-21T20:36:21.950+03:00 level=INFO source=routes.go:1718 msg="Listening on [::]:11434 (version 0.16.3)" Feb 21 20:36:21 ollama[539850]: time=2026-02-21T20:36:21.951+03:00 level=INFO source=runner.go:67 msg="discovering available GPUs..." 
Feb 21 20:36:21 ollama[539850]: time=2026-02-21T20:36:21.951+03:00 level=WARN source=runner.go:485 msg="user overrode visible devices" CUDA_VISIBLE_DEVICES=0,1,2
Feb 21 20:36:21 ollama[539850]: time=2026-02-21T20:36:21.951+03:00 level=WARN source=runner.go:489 msg="if GPUs are not correctly discovered, unset and try again"
Feb 21 20:36:21 ollama[539850]: time=2026-02-21T20:36:21.951+03:00 level=INFO source=runner.go:106 msg="experimental Vulkan support disabled. To enable, set OLLAMA_VULKAN=1"
Feb 21 20:36:21 ollama[539850]: time=2026-02-21T20:36:21.951+03:00 level=INFO source=server.go:431 msg="starting runner" cmd="/usr/local/bin/ollama runner --ollama-engine --port 37329"
Feb 21 20:36:22 ollama[539850]: time=2026-02-21T20:36:22.657+03:00 level=INFO source=server.go:431 msg="starting runner" cmd="/usr/local/bin/ollama runner --ollama-engine --port 42953"
Feb 21 20:36:23 ollama[539850]: time=2026-02-21T20:36:23.391+03:00 level=INFO source=server.go:431 msg="starting runner" cmd="/usr/local/bin/ollama runner --ollama-engine --port 36173"
Feb 21 20:36:23 ollama[539850]: time=2026-02-21T20:36:23.391+03:00 level=INFO source=server.go:431 msg="starting runner" cmd="/usr/local/bin/ollama runner --ollama-engine --port 44529"
Feb 21 20:36:23 ollama[539850]: time=2026-02-21T20:36:23.391+03:00 level=INFO source=server.go:431 msg="starting runner" cmd="/usr/local/bin/ollama runner --ollama-engine --port 43121"
Feb 21 20:36:23 ollama[539850]: time=2026-02-21T20:36:23.391+03:00 level=INFO source=server.go:431 msg="starting runner" cmd="/usr/local/bin/ollama runner --ollama-engine --port 46013"
Feb 21 20:36:23 ollama[539850]: time=2026-02-21T20:36:23.391+03:00 level=INFO source=server.go:431 msg="starting runner" cmd="/usr/local/bin/ollama runner --ollama-engine --port 46881"
Feb 21 20:36:23 ollama[539850]: time=2026-02-21T20:36:23.391+03:00 level=INFO source=server.go:431 msg="starting runner" cmd="/usr/local/bin/ollama runner --ollama-engine --port 39921"
Feb 21 20:36:23 ollama[539850]: time=2026-02-21T20:36:23.726+03:00 level=INFO source=types.go:42 msg="inference compute" id=GPU-dfc3d6a8-e942-e903-be73-9b14ce01db39 filter_id="" library=CUDA compute=7.0 name=CUDA2 description="Tesla V100-SXM2-32GB" libdirs=ollama,cuda_v12 driver=13.0 pci_id=0000:09:00.0 type=discrete total="32.0 GiB" available="31.4 GiB"
Feb 21 20:36:23 ollama[539850]: time=2026-02-21T20:36:23.726+03:00 level=INFO source=types.go:42 msg="inference compute" id=GPU-7a420261-3e37-b15a-2f9b-2fd7da322dbe filter_id="" library=CUDA compute=8.9 name=CUDA0 description="NVIDIA GeForce RTX 4090" libdirs=ollama,cuda_v12 driver=13.0 pci_id=0000:01:00.0 type=discrete total="24.0 GiB" available="23.5 GiB"
Feb 21 20:36:23 ollama[539850]: time=2026-02-21T20:36:23.726+03:00 level=INFO source=types.go:42 msg="inference compute" id=GPU-0dcf0ac3-bef6-f019-91c8-276cd8e2c0c0 filter_id="" library=CUDA compute=8.9 name=CUDA1 description="NVIDIA GeForce RTX 4090" libdirs=ollama,cuda_v12 driver=13.0 pci_id=0000:03:00.0 type=discrete total="24.0 GiB" available="23.1 GiB"
Feb 21 20:36:23 ollama[539850]: time=2026-02-21T20:36:23.726+03:00 level=INFO source=routes.go:1768 msg="vram-based default context" total_vram="80.0 GiB" default_num_ctx=262144
Feb 21 20:36:28 ollama[539850]: time=2026-02-21T20:36:28.802+03:00 level=INFO source=server.go:431 msg="starting runner" cmd="/usr/local/bin/ollama runner --ollama-engine --port 43123"
Feb 21 20:36:29 ollama[539850]: llama_model_loader: loaded meta data with 38 key-value pairs and 809 tensors from /ai/llm/models/blobs/sha256-f4f7f4e8e9f04b21d5c5a277223592388f17ae22a6e08f2ae1ab12bef6f9fca3 (version GGUF V3 (latest))
Feb 21 20:36:29 ollama[539850]: llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
Feb 21 20:36:29 ollama[539850]: llama_model_loader: - kv 0: general.architecture str = minimax-m2
Feb 21 20:36:29 ollama[539850]: llama_model_loader: - kv 1: general.file_type u32 = 7
Feb 21 20:36:29 ollama[539850]: llama_model_loader: - kv 2: general.license str = other
Feb 21 20:36:29 ollama[539850]: llama_model_loader: - kv 3: general.license.link str = https://github.com/MiniMax-AI/MiniMax...
Feb 21 20:36:29 ollama[539850]: llama_model_loader: - kv 4: general.license.name str = modified-mit
Feb 21 20:36:29 ollama[539850]: llama_model_loader: - kv 5: general.name str = Workdir
Feb 21 20:36:29 ollama[539850]: llama_model_loader: - kv 6: general.parameter_count u64 = 228689764864
Feb 21 20:36:29 ollama[539850]: llama_model_loader: - kv 7: general.quantization_version u32 = 2
Feb 21 20:36:29 ollama[539850]: llama_model_loader: - kv 8: general.sampling.temp f32 = 1.000000
Feb 21 20:36:29 ollama[539850]: llama_model_loader: - kv 9: general.sampling.top_k i32 = 40
Feb 21 20:36:29 ollama[539850]: llama_model_loader: - kv 10: general.sampling.top_p f32 = 0.950000
Feb 21 20:36:29 ollama[539850]: llama_model_loader: - kv 11: general.size_label str = 256x4.9B
Feb 21 20:36:29 ollama[539850]: llama_model_loader: - kv 12: general.tags arr[str,1] = ["text-generation"]
Feb 21 20:36:29 ollama[539850]: llama_model_loader: - kv 13: general.type str = model
Feb 21 20:36:29 ollama[539850]: llama_model_loader: - kv 14: minimax-m2.attention.head_count u32 = 48
Feb 21 20:36:29 ollama[539850]: llama_model_loader: - kv 15: minimax-m2.attention.head_count_kv u32 = 8
Feb 21 20:36:29 ollama[539850]: llama_model_loader: - kv 16: minimax-m2.attention.key_length u32 = 128
Feb 21 20:36:29 ollama[539850]: llama_model_loader: - kv 17: minimax-m2.attention.layer_norm_rms_epsilon f32 = 0.000001
Feb 21 20:36:29 ollama[539850]: llama_model_loader: - kv 18: minimax-m2.attention.value_length u32 = 128
Feb 21 20:36:29 ollama[539850]: llama_model_loader: - kv 19: minimax-m2.block_count u32 = 62
Feb 21 20:36:29 ollama[539850]: llama_model_loader: - kv 20: minimax-m2.context_length u32 = 196608
Feb 21 20:36:29 ollama[539850]: llama_model_loader: - kv 21: minimax-m2.embedding_length u32 = 3072
Feb 21 20:36:29 ollama[539850]: llama_model_loader: - kv 22: minimax-m2.expert_count u32 = 256
Feb 21 20:36:29 ollama[539850]: llama_model_loader: - kv 23: minimax-m2.expert_feed_forward_length u32 = 1536
Feb 21 20:36:29 ollama[539850]: llama_model_loader: - kv 24: minimax-m2.expert_gating_func u32 = 2
Feb 21 20:36:29 ollama[539850]: llama_model_loader: - kv 25: minimax-m2.expert_used_count u32 = 8
Feb 21 20:36:29 ollama[539850]: llama_model_loader: - kv 26: minimax-m2.feed_forward_length u32 = 1536
Feb 21 20:36:29 ollama[539850]: llama_model_loader: - kv 27: minimax-m2.rope.dimension_count u32 = 64
Feb 21 20:36:29 ollama[539850]: llama_model_loader: - kv 28: minimax-m2.rope.freq_base f32 = 5000000.000000
Feb 21 20:36:29 ollama[539850]: llama_model_loader: - kv 29: tokenizer.chat_template str = {# ----------‑‑‑ special token ...
Feb 21 20:36:29 ollama[539850]: llama_model_loader: - kv 30: tokenizer.ggml.bos_token_id u32 = 200034
Feb 21 20:36:29 ollama[539850]: llama_model_loader: - kv 31: tokenizer.ggml.eos_token_id u32 = 200020
Feb 21 20:36:29 ollama[539850]: llama_model_loader: - kv 32: tokenizer.ggml.merges arr[str,199744] = ["Ġ Ġ", "Ġ t", "Ġ a", "i n", "e r...
Feb 21 20:36:29 ollama[539850]: llama_model_loader: - kv 33: tokenizer.ggml.model str = gpt2
Feb 21 20:36:29 ollama[539850]: llama_model_loader: - kv 34: tokenizer.ggml.pre str = minimax-m2
Feb 21 20:36:29 ollama[539850]: llama_model_loader: - kv 35: tokenizer.ggml.token_type arr[i32,200064] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
Feb 21 20:36:29 ollama[539850]: llama_model_loader: - kv 36: tokenizer.ggml.tokens arr[str,200064] = ["Ā", "ā", "Ă", "ă", "Ą", "ą", ...
Feb 21 20:36:29 ollama[539850]: llama_model_loader: - kv 37: tokenizer.ggml.unknown_token_id u32 = 200021
Feb 21 20:36:29 ollama[539850]: llama_model_loader: - type f32: 373 tensors
Feb 21 20:36:29 ollama[539850]: llama_model_loader: - type q8_0: 436 tensors
Feb 21 20:36:29 ollama[539850]: print_info: file format = GGUF V3 (latest)
Feb 21 20:36:29 ollama[539850]: print_info: file type = Q8_0
Feb 21 20:36:29 ollama[539850]: print_info: file size = 226.43 GiB (8.51 BPW)
Feb 21 20:36:29 ollama[539850]: load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
Feb 21 20:36:29 ollama[539850]: load: printing all EOG tokens:
Feb 21 20:36:29 ollama[539850]: load: - 200004 ('<fim_pad>')
Feb 21 20:36:29 ollama[539850]: load: - 200005 ('<reponame>')
Feb 21 20:36:29 ollama[539850]: load: - 200020 ('[e~[')
Feb 21 20:36:29 ollama[539850]: load: special tokens cache size = 54
Feb 21 20:36:29 ollama[539850]: load: token to piece cache size = 1.3355 MB
Feb 21 20:36:29 ollama[539850]: print_info: arch = minimax-m2
Feb 21 20:36:29 ollama[539850]: print_info: vocab_only = 1
Feb 21 20:36:29 ollama[539850]: print_info: no_alloc = 0
Feb 21 20:36:29 ollama[539850]: print_info: model type = ?B
Feb 21 20:36:29 ollama[539850]: print_info: model params = 228.69 B
Feb 21 20:36:29 ollama[539850]: print_info: general.name = Workdir
Feb 21 20:36:29 ollama[539850]: print_info: vocab type = BPE
Feb 21 20:36:29 ollama[539850]: print_info: n_vocab = 200064
Feb 21 20:36:29 ollama[539850]: print_info: n_merges = 199744
Feb 21 20:36:29 ollama[539850]: print_info: BOS token = 200034 ']~!b['
Feb 21 20:36:29 ollama[539850]: print_info: EOS token = 200020 '[e~['
Feb 21 20:36:29 ollama[539850]: print_info: UNK token = 200021 ']!d~['
Feb 21 20:36:29 ollama[539850]: print_info: LF token = 10 'Ċ'
Feb 21 20:36:29 ollama[539850]: print_info: FIM PRE token = 200001 '<fim_prefix>'
Feb 21 20:36:29 ollama[539850]: print_info: FIM SUF token = 200003 '<fim_suffix>'
Feb 21 20:36:29 ollama[539850]: print_info: FIM MID token = 200002 '<fim_middle>'
Feb 21 20:36:29 ollama[539850]: print_info: FIM PAD token = 200004 '<fim_pad>'
Feb 21 20:36:29 ollama[539850]: print_info: FIM REP token = 200005 '<reponame>'
Feb 21 20:36:29 ollama[539850]: print_info: EOG token = 200004 '<fim_pad>'
Feb 21 20:36:29 ollama[539850]: print_info: EOG token = 200005 '<reponame>'
Feb 21 20:36:29 ollama[539850]: print_info: EOG token = 200020 '[e~['
Feb 21 20:36:29 ollama[539850]: print_info: max token length = 256
Feb 21 20:36:29 ollama[539850]: llama_model_load: vocab only - skipping tensors
Feb 21 20:36:29 ollama[539850]: time=2026-02-21T20:36:29.386+03:00 level=INFO source=server.go:431 msg="starting runner" cmd="/usr/local/bin/ollama runner --model /ai/llm/models/blobs/sha256-f4f7f4e8e9f04b21d5c5a277223592388f17ae22a6e08f2ae1ab12bef6f9fca3 --port 42699"
Feb 21 20:36:29 ollama[539850]: time=2026-02-21T20:36:29.386+03:00 level=INFO source=sched.go:491 msg="system memory" total="245.1 GiB" free="239.3 GiB" free_swap="6.8 GiB"
Feb 21 20:36:29 ollama[539850]: time=2026-02-21T20:36:29.386+03:00 level=INFO source=sched.go:498 msg="gpu memory" id=GPU-7a420261-3e37-b15a-2f9b-2fd7da322dbe library=CUDA available="23.1 GiB" free="23.5 GiB" minimum="457.0 MiB" overhead="0 B"
Feb 21 20:36:29 ollama[539850]: time=2026-02-21T20:36:29.386+03:00 level=INFO source=sched.go:498 msg="gpu memory" id=GPU-0dcf0ac3-bef6-f019-91c8-276cd8e2c0c0 library=CUDA available="22.7 GiB" free="23.1 GiB" minimum="457.0 MiB" overhead="0 B"
Feb 21 20:36:29 ollama[539850]: time=2026-02-21T20:36:29.386+03:00 level=INFO source=sched.go:498 msg="gpu memory" id=GPU-dfc3d6a8-e942-e903-be73-9b14ce01db39 library=CUDA available="31.0 GiB" free="31.4 GiB" minimum="457.0 MiB" overhead="0 B"
Feb 21 20:36:29 ollama[539850]: time=2026-02-21T20:36:29.386+03:00 level=INFO source=server.go:498 msg="loading model" "model layers"=63 requested=19
Feb 21 20:36:29 ollama[539850]: time=2026-02-21T20:36:29.386+03:00 level=INFO source=device.go:240 msg="model weights" device=CUDA0 size="7.3 GiB"
Feb 21 20:36:29 ollama[539850]: time=2026-02-21T20:36:29.386+03:00 level=INFO source=device.go:240 msg="model weights" device=CUDA1 size="3.6 GiB"
Feb 21 20:36:29 ollama[539850]: time=2026-02-21T20:36:29.387+03:00 level=INFO source=device.go:240 msg="model weights" device=CUDA2 size="58.1 GiB"
Feb 21 20:36:29 ollama[539850]: time=2026-02-21T20:36:29.387+03:00 level=INFO source=device.go:245 msg="model weights" device=CPU size="156.8 GiB"
Feb 21 20:36:29 ollama[539850]: time=2026-02-21T20:36:29.387+03:00 level=INFO source=device.go:251 msg="kv cache" device=CUDA0 size="585.9 MiB"
Feb 21 20:36:29 ollama[539850]: time=2026-02-21T20:36:29.387+03:00 level=INFO source=device.go:251 msg="kv cache" device=CUDA1 size="293.0 MiB"
Feb 21 20:36:29 ollama[539850]: time=2026-02-21T20:36:29.387+03:00 level=INFO source=device.go:251 msg="kv cache" device=CUDA2 size="4.6 GiB"
Feb 21 20:36:29 ollama[539850]: time=2026-02-21T20:36:29.387+03:00 level=INFO source=device.go:256 msg="kv cache" device=CPU size="12.3 GiB"
Feb 21 20:36:29 ollama[539850]: time=2026-02-21T20:36:29.387+03:00 level=INFO source=device.go:262 msg="compute graph" device=CUDA0 size="17.7 GiB"
Feb 21 20:36:29 ollama[539850]: time=2026-02-21T20:36:29.387+03:00 level=INFO source=device.go:262 msg="compute graph" device=CUDA1 size="17.7 GiB"
Feb 21 20:36:29 ollama[539850]: time=2026-02-21T20:36:29.387+03:00 level=INFO source=device.go:262 msg="compute graph" device=CUDA2 size="17.7 GiB"
Feb 21 20:36:29 ollama[539850]: time=2026-02-21T20:36:29.387+03:00 level=INFO source=device.go:272 msg="total memory" size="296.8 GiB"
Feb 21 20:36:29 ollama[539850]: time=2026-02-21T20:36:29.392+03:00 level=INFO source=runner.go:965 msg="starting go runner"
Feb 21 20:36:29 ollama[539850]: load_backend: loaded CPU backend from /usr/local/lib/ollama/libggml-cpu-icelake.so
Feb 21 20:36:29 ollama[539850]: ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
Feb 21 20:36:29 ollama[539850]: ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
Feb 21 20:36:29 ollama[539850]: ggml_cuda_init: found 3 CUDA devices:
Feb 21 20:36:29 ollama[539850]: Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes, ID: GPU-7a420261-3e37-b15a-2f9b-2fd7da322dbe
Feb 21 20:36:29 ollama[539850]: Device 1: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes, ID: GPU-0dcf0ac3-bef6-f019-91c8-276cd8e2c0c0
Feb 21 20:36:29 ollama[539850]: Device 2: Tesla V100-SXM2-32GB, compute capability 7.0, VMM: yes, ID: GPU-dfc3d6a8-e942-e903-be73-9b14ce01db39
Feb 21 20:36:29 ollama[539850]: load_backend: loaded CUDA backend from /usr/local/lib/ollama/cuda_v12/libggml-cuda.so
Feb 21 20:36:29 ollama[539850]: time=2026-02-21T20:36:29.537+03:00 level=INFO source=ggml.go:104 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.BMI2=1 CPU.0.AVX512=1 CPU.0.AVX512_VBMI=1 CPU.0.AVX512_VNNI=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 CUDA.0.ARCHS=500,520,600,610,700,750,800,860,890,900,1200 CUDA.0.USE_GRAPHS=1 CUDA.0.PEER_MAX_BATCH_SIZE=128 CUDA.1.ARCHS=500,520,600,610,700,750,800,860,890,900,1200 CUDA.1.USE_GRAPHS=1 CUDA.1.PEER_MAX_BATCH_SIZE=128 CUDA.2.ARCHS=500,520,600,610,700,750,800,860,890,900,1200 CUDA.2.USE_GRAPHS=1 CUDA.2.PEER_MAX_BATCH_SIZE=128 compiler=cgo(gcc)
Feb 21 20:36:29 ollama[539850]: time=2026-02-21T20:36:29.537+03:00 level=INFO source=runner.go:1001 msg="Server listening on 127.0.0.1:42699"
Feb 21 20:36:29 ollama[539850]: time=2026-02-21T20:36:29.541+03:00 level=INFO source=runner.go:895 msg=load request="{Operation:commit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:Enabled KvSize:75000 KvCacheType: NumThreads:16 GPULayers:19[ID:GPU-dfc3d6a8-e942-e903-be73-9b14ce01db39 Layers:16(43..58) ID:GPU-7a420261-3e37-b15a-2f9b-2fd7da322dbe Layers:2(59..60) ID:GPU-0dcf0ac3-bef6-f019-91c8-276cd8e2c0c0 Layers:1(61..61)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
Feb 21 20:36:29 ollama[539850]: time=2026-02-21T20:36:29.541+03:00 level=INFO source=server.go:1350 msg="waiting for llama runner to start responding"
Feb 21 20:36:29 ollama[539850]: time=2026-02-21T20:36:29.541+03:00 level=INFO source=server.go:1384 msg="waiting for server to become available" status="llm server loading model"
Feb 21 20:36:29 ollama[539850]: ggml_backend_cuda_device_get_memory device GPU-dfc3d6a8-e942-e903-be73-9b14ce01db39 utilizing NVML memory reporting free: 33746845696 total: 34359738368
Feb 21 20:36:29 ollama[539850]: llama_model_load_from_file_impl: using device CUDA2 (Tesla V100-SXM2-32GB) (0000:09:00.0) - 32183 MiB free
Feb 21 20:36:29 ollama[539850]: ggml_backend_cuda_device_get_memory device GPU-7a420261-3e37-b15a-2f9b-2fd7da322dbe utilizing NVML memory reporting free: 24836243456 total: 25757220864
Feb 21 20:36:29 ollama[539850]: llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 4090) (0000:01:00.0) - 23685 MiB free
Feb 21 20:36:29 ollama[539850]: ggml_backend_cuda_device_get_memory device GPU-0dcf0ac3-bef6-f019-91c8-276cd8e2c0c0 utilizing NVML memory reporting free: 24836243456 total: 25757220864
Feb 21 20:36:29 ollama[539850]: llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 4090) (0000:03:00.0) - 23685 MiB free
Feb 21 20:36:29 ollama[539850]: llama_model_loader: loaded meta data with 38 key-value pairs and 809 tensors from /ai/llm/models/blobs/sha256-f4f7f4e8e9f04b21d5c5a277223592388f17ae22a6e08f2ae1ab12bef6f9fca3 (version GGUF V3 (latest))
Feb 21 20:36:29 ollama[539850]: llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
Feb 21 20:36:29 ollama[539850]: llama_model_loader: - kv 0: general.architecture str = minimax-m2
Feb 21 20:36:29 ollama[539850]: llama_model_loader: - kv 1: general.file_type u32 = 7
Feb 21 20:36:29 ollama[539850]: llama_model_loader: - kv 2: general.license str = other
Feb 21 20:36:29 ollama[539850]: llama_model_loader: - kv 3: general.license.link str = https://github.com/MiniMax-AI/MiniMax...
Feb 21 20:36:29 ollama[539850]: llama_model_loader: - kv 4: general.license.name str = modified-mit
Feb 21 20:36:29 ollama[539850]: llama_model_loader: - kv 5: general.name str = Workdir
Feb 21 20:36:29 ollama[539850]: llama_model_loader: - kv 6: general.parameter_count u64 = 228689764864
Feb 21 20:36:29 ollama[539850]: llama_model_loader: - kv 7: general.quantization_version u32 = 2
Feb 21 20:36:29 ollama[539850]: llama_model_loader: - kv 8: general.sampling.temp f32 = 1.000000
Feb 21 20:36:29 ollama[539850]: llama_model_loader: - kv 9: general.sampling.top_k i32 = 40
Feb 21 20:36:29 ollama[539850]: llama_model_loader: - kv 10: general.sampling.top_p f32 = 0.950000
Feb 21 20:36:29 ollama[539850]: llama_model_loader: - kv 11: general.size_label str = 256x4.9B
Feb 21 20:36:29 ollama[539850]: llama_model_loader: - kv 12: general.tags arr[str,1] = ["text-generation"]
Feb 21 20:36:29 ollama[539850]: llama_model_loader: - kv 13: general.type str = model
Feb 21 20:36:29 ollama[539850]: llama_model_loader: - kv 14: minimax-m2.attention.head_count u32 = 48
Feb 21 20:36:29 ollama[539850]: llama_model_loader: - kv 15: minimax-m2.attention.head_count_kv u32 = 8
Feb 21 20:36:29 ollama[539850]: llama_model_loader: - kv 16: minimax-m2.attention.key_length u32 = 128
Feb 21 20:36:29 ollama[539850]: llama_model_loader: - kv 17: minimax-m2.attention.layer_norm_rms_epsilon f32 = 0.000001
Feb 21 20:36:29 ollama[539850]: llama_model_loader: - kv 18: minimax-m2.attention.value_length u32 = 128
Feb 21 20:36:29 ollama[539850]: llama_model_loader: - kv 19: minimax-m2.block_count u32 = 62
Feb 21 20:36:29 ollama[539850]: llama_model_loader: - kv 20: minimax-m2.context_length u32 = 196608
Feb 21 20:36:29 ollama[539850]: llama_model_loader: - kv 21: minimax-m2.embedding_length u32 = 3072
Feb 21 20:36:29 ollama[539850]: llama_model_loader: - kv 22: minimax-m2.expert_count u32 = 256
Feb 21 20:36:29 ollama[539850]: llama_model_loader: - kv 23: minimax-m2.expert_feed_forward_length u32 = 1536
Feb 21 20:36:29 ollama[539850]: llama_model_loader: - kv 24: minimax-m2.expert_gating_func u32 = 2
Feb 21 20:36:29 ollama[539850]: llama_model_loader: - kv 25: minimax-m2.expert_used_count u32 = 8
Feb 21 20:36:29 ollama[539850]: llama_model_loader: - kv 26: minimax-m2.feed_forward_length u32 = 1536
Feb 21 20:36:29 ollama[539850]: llama_model_loader: - kv 27: minimax-m2.rope.dimension_count u32 = 64
Feb 21 20:36:29 ollama[539850]: llama_model_loader: - kv 28: minimax-m2.rope.freq_base f32 = 5000000.000000
Feb 21 20:36:29 ollama[539850]: llama_model_loader: - kv 29: tokenizer.chat_template str = {# ----------‑‑‑ special token ...
Feb 21 20:36:29 ollama[539850]: llama_model_loader: - kv 30: tokenizer.ggml.bos_token_id u32 = 200034
Feb 21 20:36:29 ollama[539850]: llama_model_loader: - kv 31: tokenizer.ggml.eos_token_id u32 = 200020
Feb 21 20:36:29 ollama[539850]: llama_model_loader: - kv 32: tokenizer.ggml.merges arr[str,199744] = ["Ġ Ġ", "Ġ t", "Ġ a", "i n", "e r...
Feb 21 20:36:29 ollama[539850]: llama_model_loader: - kv 33: tokenizer.ggml.model str = gpt2
Feb 21 20:36:29 ollama[539850]: llama_model_loader: - kv 34: tokenizer.ggml.pre str = minimax-m2
Feb 21 20:36:29 ollama[539850]: llama_model_loader: - kv 35: tokenizer.ggml.token_type arr[i32,200064] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
Feb 21 20:36:29 ollama[539850]: llama_model_loader: - kv 36: tokenizer.ggml.tokens arr[str,200064] = ["Ā", "ā", "Ă", "ă", "Ą", "ą", ...
Feb 21 20:36:29 ollama[539850]: llama_model_loader: - kv 37: tokenizer.ggml.unknown_token_id u32 = 200021
Feb 21 20:36:29 ollama[539850]: llama_model_loader: - type f32: 373 tensors
Feb 21 20:36:29 ollama[539850]: llama_model_loader: - type q8_0: 436 tensors
Feb 21 20:36:29 ollama[539850]: print_info: file format = GGUF V3 (latest)
Feb 21 20:36:29 ollama[539850]: print_info: file type = Q8_0
Feb 21 20:36:29 ollama[539850]: print_info: file size = 226.43 GiB (8.51 BPW)
Feb 21 20:36:29 ollama[539850]: load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
Feb 21 20:36:29 ollama[539850]: load: printing all EOG tokens:
Feb 21 20:36:29 ollama[539850]: load: - 200004 ('<fim_pad>')
Feb 21 20:36:29 ollama[539850]: load: - 200005 ('<reponame>')
Feb 21 20:36:29 ollama[539850]: load: - 200020 ('[e~[')
Feb 21 20:36:29 ollama[539850]: load: special tokens cache size = 54
Feb 21 20:36:29 ollama[539850]: load: token to piece cache size = 1.3355 MB
Feb 21 20:36:29 ollama[539850]: print_info: arch = minimax-m2
Feb 21 20:36:29 ollama[539850]: print_info: vocab_only = 0
Feb 21 20:36:29 ollama[539850]: print_info: no_alloc = 0
Feb 21 20:36:29 ollama[539850]: print_info: n_ctx_train = 196608
Feb 21 20:36:29 ollama[539850]: print_info: n_embd = 3072
Feb 21 20:36:29 ollama[539850]: print_info: n_embd_inp = 3072
Feb 21 20:36:29 ollama[539850]: print_info: n_layer = 62
Feb 21 20:36:29 ollama[539850]: print_info: n_head = 48
Feb 21 20:36:29 ollama[539850]: print_info: n_head_kv = 8
Feb 21 20:36:29 ollama[539850]: print_info: n_rot = 64
Feb 21 20:36:29 ollama[539850]: print_info: n_swa = 0
Feb 21 20:36:29 ollama[539850]: print_info: is_swa_any = 0
Feb 21 20:36:29 ollama[539850]: print_info: n_embd_head_k = 128
Feb 21 20:36:29 ollama[539850]: print_info: n_embd_head_v = 128
Feb 21 20:36:29 ollama[539850]: print_info: n_gqa = 6
Feb 21 20:36:29 ollama[539850]: print_info: n_embd_k_gqa = 1024
Feb 21 20:36:29 ollama[539850]: print_info: n_embd_v_gqa = 1024
Feb 21 20:36:29 ollama[539850]: print_info: f_norm_eps = 0.0e+00
Feb 21 20:36:29 ollama[539850]: print_info: f_norm_rms_eps = 1.0e-06
Feb 21 20:36:29 ollama[539850]: print_info: f_clamp_kqv = 0.0e+00
Feb 21 20:36:29 ollama[539850]: print_info: f_max_alibi_bias = 0.0e+00
Feb 21 20:36:29 ollama[539850]: print_info: f_logit_scale = 0.0e+00
Feb 21 20:36:29 ollama[539850]: print_info: f_attn_scale = 0.0e+00
Feb 21 20:36:29 ollama[539850]: print_info: n_ff = 1536
Feb 21 20:36:29 ollama[539850]: print_info: n_expert = 256
Feb 21 20:36:29 ollama[539850]: print_info: n_expert_used = 8
Feb 21 20:36:29 ollama[539850]: print_info: n_expert_groups = 0
Feb 21 20:36:29 ollama[539850]: print_info: n_group_used = 0
Feb 21 20:36:29 ollama[539850]: print_info: causal attn = 1
Feb 21 20:36:29 ollama[539850]: print_info: pooling type = 0
Feb 21 20:36:29 ollama[539850]: print_info: rope type = 2
Feb 21 20:36:29 ollama[539850]: print_info: rope scaling = linear
Feb 21 20:36:29 ollama[539850]: print_info: freq_base_train = 5000000.0
Feb 21 20:36:29 ollama[539850]: print_info: freq_scale_train = 1
Feb 21 20:36:29 ollama[539850]: print_info: n_ctx_orig_yarn = 196608
Feb 21 20:36:29 ollama[539850]: print_info: rope_yarn_log_mul= 0.0000
Feb 21 20:36:29 ollama[539850]: print_info: rope_finetuned = unknown
Feb 21 20:36:29 ollama[539850]: print_info: model type = 230B.A10B
Feb 21 20:36:29 ollama[539850]: print_info: model params = 228.69 B
Feb 21 20:36:29 ollama[539850]: print_info: general.name = Workdir
Feb 21 20:36:29 ollama[539850]: print_info: vocab type = BPE
Feb 21 20:36:29 ollama[539850]: print_info: n_vocab = 200064
Feb 21 20:36:29 ollama[539850]: print_info: n_merges = 199744
Feb 21 20:36:29 ollama[539850]: print_info: BOS token = 200034 ']~!b['
Feb 21 20:36:29 ollama[539850]: print_info: EOS token = 200020 '[e~['
Feb 21 20:36:29 ollama[539850]: print_info: UNK token = 200021 ']!d~['
Feb 21 20:36:29 ollama[539850]: print_info: LF token = 10 'Ċ'
Feb 21 20:36:29 ollama[539850]: print_info: FIM PRE token = 200001 '<fim_prefix>'
Feb 21 20:36:29 ollama[539850]: print_info: FIM SUF token = 200003 '<fim_suffix>'
Feb 21 20:36:29 ollama[539850]: print_info: FIM MID token = 200002 '<fim_middle>'
Feb 21 20:36:29 ollama[539850]: print_info: FIM PAD token = 200004 '<fim_pad>'
Feb 21 20:36:29 ollama[539850]: print_info: FIM REP token = 200005 '<reponame>'
Feb 21 20:36:29 ollama[539850]: print_info: EOG token = 200004 '<fim_pad>'
Feb 21 20:36:29 ollama[539850]: print_info: EOG token = 200005 '<reponame>'
Feb 21 20:36:29 ollama[539850]: print_info: EOG token = 200020 '[e~['
Feb 21 20:36:29 ollama[539850]: print_info: max token length = 256
Feb 21 20:36:29 ollama[539850]: load_tensors: loading model tensors, this can take a while... (mmap = false)
Feb 21 20:36:31 ollama[539850]: time=2026-02-21T20:36:31.997+03:00 level=INFO source=server.go:1384 msg="waiting for server to become available" status="llm server not responding"
Feb 21 20:36:32 ollama[539850]: time=2026-02-21T20:36:32.255+03:00 level=INFO source=server.go:1384 msg="waiting for server to become available" status="llm server loading model"
Feb 21 20:36:51 ollama[539850]: load_tensors: offloading 19 repeating layers to GPU
Feb 21 20:36:51 ollama[539850]: load_tensors: offloaded 19/63 layers to GPU
Feb 21 20:36:51 ollama[539850]: load_tensors: CUDA0 model buffer size = 7439.35 MiB
Feb 21 20:36:51 ollama[539850]: load_tensors: CUDA1 model buffer size = 3719.68 MiB
Feb 21 20:36:51 ollama[539850]: load_tensors: CUDA2 model buffer size = 59514.83 MiB
Feb 21 20:36:51 ollama[539850]: load_tensors: CUDA_Host model buffer size = 161191.63 MiB
Feb 21 20:38:20 ollama[539850]: llama_context: constructing llama_context
Feb 21 20:38:20 ollama[539850]: llama_context: n_seq_max = 1
Feb 21 20:38:20 ollama[539850]: llama_context: n_ctx = 75008
Feb 21 20:38:20 ollama[539850]: llama_context: n_ctx_seq = 75008
Feb 21 20:38:20 ollama[539850]: llama_context: n_batch = 512
Feb 21 20:38:20 ollama[539850]: llama_context: n_ubatch = 512
Feb 21 20:38:20 ollama[539850]: llama_context: causal_attn = 1
Feb 21 20:38:20 ollama[539850]: llama_context: flash_attn = enabled
Feb 21 20:38:20 ollama[539850]: llama_context: kv_unified = false
Feb 21 20:38:20 ollama[539850]: llama_context: freq_base = 5000000.0
Feb 21 20:38:20 ollama[539850]: llama_context: freq_scale = 1
Feb 21 20:38:20 ollama[539850]: llama_context: n_ctx_seq (75008) < n_ctx_train (196608) -- the full capacity of the model will not be utilized
Feb 21 20:38:20 ollama[539850]: llama_context: CPU output buffer size = 0.77 MiB
Feb 21 20:38:20 ollama[539850]: llama_kv_cache: CPU KV buffer size = 12599.00 MiB
Feb 21 20:38:21 ollama[539850]: llama_kv_cache: CUDA0 KV buffer size = 586.00 MiB
Feb 21 20:38:22 ollama[539850]: llama_kv_cache: CUDA1 KV buffer size = 293.00 MiB
Feb 21 20:38:22 ollama[539850]: llama_kv_cache: CUDA2 KV buffer size = 4688.00 MiB
Feb 21 20:38:24 ollama[539850]: llama_kv_cache: size = 18166.00 MiB ( 75008 cells, 62 layers, 1/1 seqs), K (f16): 9083.00 MiB, V (f16): 9083.00 MiB
Feb 21 20:38:24 ollama[539850]: llama_context: CUDA2 compute buffer size = 1760.78 MiB
Feb 21 20:38:24 ollama[539850]: llama_context: CUDA0 compute buffer size = 158.76 MiB
Feb 21 20:38:24 ollama[539850]: llama_context: CUDA1 compute buffer size = 109.26 MiB
Feb 21 20:38:24 ollama[539850]: llama_context: CUDA_Host compute buffer size = 152.51 MiB
Feb 21 20:38:24 ollama[539850]: llama_context: graph nodes = 3975
Feb 21 20:38:24 ollama[539850]: llama_context: graph splits = 651 (with bs=512), 5 (with bs=1)
Feb 21 20:38:25 ollama[539850]: time=2026-02-21T20:38:25.031+03:00 level=INFO source=server.go:1388 msg="llama runner started in 115.64 seconds"
Feb 21 20:38:25 ollama[539850]: time=2026-02-21T20:38:25.031+03:00 level=INFO source=sched.go:566 msg="loaded runners" count=1
Feb 21 20:38:25 ollama[539850]: time=2026-02-21T20:38:25.031+03:00 level=INFO source=server.go:1350 msg="waiting for llama runner to start responding"
Feb 21 20:38:25 ollama[539850]: time=2026-02-21T20:38:25.031+03:00 level=INFO source=server.go:1388 msg="llama runner started in 115.65 seconds"
Feb 21 20:42:15 ollama[539850]: [GIN] 2026/02/21 - 20:42:15 | 200 | 5m46s | 192.168.127.20 | POST "/api/chat"
```

### OS

Linux

### GPU

Nvidia

### CPU

AMD

### Ollama version

0.16.3
GiteaMirror added the bug label 2026-04-22 19:17:33 -05:00
@rick-github commented on GitHub (Feb 21, 2026):

`OLLAMA_SCHED_SPREAD` just tells ollama to spread layers across all devices rather than trying to schedule on the least number of devices. If the model is large enough to spill across multiple devices, then the setting of `OLLAMA_SCHED_SPREAD` becomes irrelevant.

The model spills into system RAM because you have `GGML_CUDA_ENABLE_UNIFIED_MEMORY` set. Remove it from the environment and the model will not allocate system RAM to a GPU.
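
For anyone checking the same thing, a quick way to confirm whether the variable is set on the systemd unit, and to remove it (a sketch assuming the `ollama.service` unit from the logs above; the exact `Environment=` line shown in the comment is a placeholder for whatever your override contains):

```bash
# List the environment configured on the unit; GGML_CUDA_ENABLE_UNIFIED_MEMORY
# will appear here if it is set for the service.
systemctl show ollama.service -p Environment

# Remove the variable from the drop-in override, then restart the service.
sudo systemctl edit ollama.service   # delete the Environment=GGML_CUDA_ENABLE_UNIFIED_MEMORY=1 line
sudo systemctl restart ollama.service
```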

@ka-admin commented on GitHub (Feb 21, 2026):

Thanks, I'll try removing `GGML_CUDA_ENABLE_UNIFIED_MEMORY`. One question though — after removing it, will the layers be distributed more evenly across all three GPUs, or will the scheduler still pile most of them onto the V100 since it has the most VRAM?

@rick-github commented on GitHub (Feb 21, 2026):

Since the model is larger than 80G, ollama will allocate as many layers as it can to the devices. Since the V100 has more VRAM, more layers will be assigned to it. Note that there's a packing factor to the layer assignment that needs to account for other resource requirements like the compute graph.
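
To see where layers actually land after a change like this, both the scheduler's plan and the real allocations can be pulled from the service log (a sketch assuming the systemd journal; the grep patterns match the log lines shown earlier in this issue):

```bash
# Planned per-device weights (device.go), actual ggml buffers (load_tensors),
# and the per-GPU layer ranges (GPULayers) from the most recent load:
journalctl -u ollama --since "1 hour ago" | \
  grep -E 'model weights|load_tensors|GPULayers'
```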

@ka-admin commented on GitHub (Feb 21, 2026):

Thank you, it helps a lot to understand the cause of the problem. Qwen3 with a 100k context length never gave me such an imbalance in offloading layers to GPUs.

@xXMrNidaXx commented on GitHub (Feb 23, 2026):

Multi-GPU MoE layer allocation is tricky. At RevolutionAI (https://revolutionai.io), we've deployed Mixtral and other MoE models across multi-GPU setups.

**What we've found:**

The default layer allocation doesn't account for MoE's uneven compute distribution — expert layers are much heavier than attention layers.

**Workarounds:**

1. **Manual layer assignment** (if supported):

   ```bash
   OLLAMA_NUM_GPU_LAYERS_0=20 OLLAMA_NUM_GPU_LAYERS_1=30 ollama run mixtral
   ```

2. **Use tensor parallelism** instead of pipeline parallelism for MoE:

   - vLLM handles this better for MoE models
   - TGI also has better MoE-aware sharding

3. **Monitor per-GPU utilization** to identify imbalance:

   ```bash
   watch -n 1 nvidia-smi --query-gpu=index,utilization.gpu,memory.used --format=csv
   ```

4. **Consider quantization** to fit more layers per GPU

**Root cause:** MoE routing means some experts get activated more than others, causing load imbalance even with "balanced" layer distribution.

What model and GPU configuration are you running? The optimal split varies significantly by architecture.

Reference: github-starred/ollama#35088