[GH-ISSUE #14351] Misallocation of MoE layers on multi-GPU partial offload #35088

Closed
opened 2026-04-22 19:17:33 -05:00 by GiteaMirror · 5 comments

Originally created by @ka-admin on GitHub (Feb 21, 2026).
Original GitHub issue: https://github.com/ollama/ollama/issues/14351

What is the issue?

When running a large MoE model with the new engine on a mixed multi-GPU system under partial offload, the layer scheduler assigns a GPU far more layer weight than its VRAM can physically hold, silently overflowing those tensors into pinned host RAM (CUDA_Host). The result is that layers reported as "GPU offloaded" are in fact served from system RAM, with added PCIe overhead.

Additionally, OLLAMA_SCHED_SPREAD=true has zero effect — the layer assignment and buffer allocation are byte-for-byte identical with the flag on or off, suggesting the bug lies in per-layer VRAM cost estimation before any placement policy is applied.

Environment

  • Ollama version: 0.16.3
  • OS: Linux (Ubuntu) with systemd
  • Model: MiniMax M2.5 Q8_0 GGUF (minimax-m2 architecture, 228.69B params, 226.43 GiB)
  • GPUs:
Device | Name                    | VRAM   | Compute
------ | ----------------------- | ------ | -------
CUDA0  | NVIDIA GeForce RTX 4090 | 24 GiB | 8.9
CUDA1  | NVIDIA GeForce RTX 4090 | 24 GiB | 8.9
CUDA2  | Tesla V100-SXM2-32GB    | 32 GiB | 7.0
  • System RAM: 256 GiB DDR5
  • Driver / CUDA: 580.126.18 / 13.0
  • Relevant env config:
CUDA_VISIBLE_DEVICES=0,1,2
OLLAMA_CONTEXT_LENGTH=131072
OLLAMA_FLASH_ATTENTION=true
OLLAMA_NUM_PARALLEL=1

Steps to reproduce

  1. Run Ollama 0.16.3 on a mixed multi-GPU system where the GPU with the largest total VRAM (V100, 32 GiB) cannot physically hold the layers the scheduler intends to assign to it
  2. Load MiniMax M2.5 Q8_0 with partial GPU offload (19 of 63 layers)
  3. Observe layer assignment and actual buffer allocation in logs

Expected behavior

Layers assigned to each GPU should not exceed that GPU's physical VRAM. If a GPU cannot fit its assigned layers, fewer layers should be assigned or they should be redistributed. OLLAMA_SCHED_SPREAD=true should produce a meaningfully different (more even) distribution than the default greedy policy.
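
As a minimal sketch of this invariant in Go (hypothetical checkPlacement helper, not current Ollama code), a post-placement guard would compare the weight bytes assigned to each device against its reported free VRAM before loading begins:

package sched

import "fmt"

// checkPlacement is a hypothetical post-placement guard: the weight bytes
// assigned to each device must fit within that device's free VRAM.
func checkPlacement(assignedBytes, freeVRAM map[string]uint64) error {
	for dev, b := range assignedBytes {
		if b > freeVRAM[dev] {
			return fmt.Errorf("%s: %d MiB assigned but only %d MiB free",
				dev, b>>20, freeVRAM[dev]>>20)
		}
	}
	return nil
}

Had such a check run here, CUDA2 (≈59,514 MiB assigned vs ≈32,183 MiB free per the logs) would have failed loudly instead of spilling to CUDA_Host.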

Actual behavior

The scheduler assigns 16 layers to the V100 (32 GiB), but the actual model buffer for those layers totals 59,514 MiB (~58 GiB) — nearly 2× its physical capacity. This overflows silently into CUDA_Host (pinned system RAM) with no warning or error. The 16 "GPU layers" on the V100 are functionally CPU layers.

Furthermore, enabling OLLAMA_SCHED_SPREAD=true produces identical output — same layer assignment, same buffer sizes, same graph splits — indicating the spread/pack policy is not the issue and that incorrect per-layer cost estimation is happening upstream of any placement decision.

Experiment 1 — OLLAMA_SCHED_SPREAD=false (default)

GPULayers:19[
  GPU-dfc3d6a8 (V100-32GB):  Layers:16 (43..58)
  GPU-7a420261 (RTX4090 #1): Layers:2  (59..60)
  GPU-0dcf0ac3 (RTX4090 #2): Layers:1  (61..61)
]

load_tensors: CUDA0 model buffer size =  7,439.35 MiB
load_tensors: CUDA1 model buffer size =  3,719.68 MiB
load_tensors: CUDA2 model buffer size = 59,514.83 MiB  ← exceeds 32 GiB physical VRAM
load_tensors: CUDA_Host model buffer  = 161,191.63 MiB

graph splits = 651 (with bs=512), 5 (with bs=1)

Experiment 2 — OLLAMA_SCHED_SPREAD=true

GPULayers:19[
  GPU-dfc3d6a8 (V100-32GB):  Layers:16 (43..58)   ← IDENTICAL
  GPU-7a420261 (RTX4090 #1): Layers:2  (59..60)
  GPU-0dcf0ac3 (RTX4090 #2): Layers:1  (61..61)
]

load_tensors: CUDA0 model buffer size =  7,439.35 MiB
load_tensors: CUDA1 model buffer size =  3,719.68 MiB
load_tensors: CUDA2 model buffer size = 59,514.83 MiB  ← IDENTICAL
load_tensors: CUDA_Host model buffer  = 161,191.63 MiB

graph splits = 651 (with bs=512), 5 (with bs=1)

Root cause hypothesis

minimax-m2 uses 256 experts per MoE layer with expert_feed_forward_length=1536 and embedding_length=3072. At Q8_0, each MoE layer weighs approximately 3.6 GiB in actual allocated buffers. The scheduler's per-layer VRAM cost estimator is clearly computing a much smaller value — otherwise it would never plan to fit 16 such layers (~58 GiB) onto a 32 GiB GPU.
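
That ~3.6 GiB figure can be checked directly from the metadata above. A back-of-the-envelope sketch in Go, assuming the usual three projections (gate/up/down) per expert and ignoring attention and norm tensors:

package main

import "fmt"

func main() {
	const (
		nEmbd   = 3072 // minimax-m2.embedding_length
		nFF     = 1536 // minimax-m2.expert_feed_forward_length
		experts = 256  // minimax-m2.expert_count
		// Q8_0 packs 32 weights as 32 int8 values plus one fp16 scale:
		// 34 bytes per 32 weights = 1.0625 B/weight (the 8.51 BPW in the log).
		bytesPerWeight = 34.0 / 32.0
	)
	// gate, up, and down projections for every expert in one layer
	paramsPerLayer := int64(experts) * 3 * nEmbd * nFF
	bytesPerLayer := float64(paramsPerLayer) * bytesPerWeight

	fmt.Printf("per MoE layer: %.2f GiB\n", bytesPerLayer/(1<<30))    // ≈ 3.59 GiB
	fmt.Printf("16 layers:     %.0f MiB\n", 16*bytesPerLayer/(1<<20)) // ≈ 58752 MiB
}

The 16-layer total (≈58,752 MiB of expert weights alone) is within ~1.3% of the observed 59,514.83 MiB CUDA2 buffer; the remainder is attention and norm tensors.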

The GPUs are sorted largest-VRAM-first (V100 at 32 GiB before both 4090s at 24 GiB), so the greedy packer fills the V100 until its estimated budget is exhausted. Because the estimate is wrong, it over-assigns by ~2×, and llama.cpp silently backs the overflow with pinned host RAM.
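
For illustration, a greedy packer that respected the real per-layer cost would look something like this (a hypothetical Go sketch using measured layer sizes, not Ollama's actual device.go logic):

// packLayers assigns layers to devices highest-index-first (matching the
// logs), never exceeding a device's free VRAM. layerBytes holds measured
// per-layer weight sizes (~3.6 GiB here). Hypothetical sketch only.
func packLayers(layerBytes, freeVRAM []uint64) map[int][]int {
	assigned := make(map[int][]int)
	dev := 0
	for layer := len(layerBytes) - 1; layer >= 0; layer-- {
		for dev < len(freeVRAM) && layerBytes[layer] > freeVRAM[dev] {
			dev++ // this device is full; try the next one
		}
		if dev == len(freeVRAM) {
			break // no VRAM left anywhere; remaining layers stay on the CPU
		}
		freeVRAM[dev] -= layerBytes[layer]
		assigned[dev] = append(assigned[dev], layer)
	}
	return assigned
}

With the true ~3.6 GiB per-layer cost, a 31 GiB budget admits at most 8 such layers; assigning 16 implies the packer priced them at under ~2 GiB each.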

The extremely high graph split count (651 at bs=512 vs 5 at bs=1) is a secondary symptom of the resulting cross-device tensor fragmentation.

The bug likely lives in the per-layer cost estimation for minimax-m2 in device.go (or equivalent scheduler weight logic), not in the spread/pack placement policy.

Relevant log output

ollama.service: Consumed 33min 52.940s CPU time, 218.4G memory peak, 5.6G memory swap peak.
Feb 21 20:36:21 systemd[1]: Started ollama.service - Ollama Service.
Feb 21 20:36:21 ollama[539850]: time=2026-02-21T20:36:21.938+03:00 level=INFO source=routes.go:1663 msg="server config" env="map[CUDA_VISIBLE_DEVICES:0,1,2 GGML_VK_VISIBLE_DEVICES: GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: HTTPS_PROXY: HTTP_PROXY: NO_PROXY: OLLAMA_CONTEXT_LENGTH:131072 OLLAMA_DEBUG:INFO OLLAMA_EDITOR: OLLAMA_FLASH_ATTENTION:true OLLAMA_GPU_OVERHEAD:0 OLLAMA_HOST:http://0.0.0.0:11434 OLLAMA_KEEP_ALIVE:1h0m0s OLLAMA_KV_CACHE_TYPE: OLLAMA_LLM_LIBRARY: OLLAMA_LOAD_TIMEOUT:30m0s OLLAMA_MAX_LOADED_MODELS:0 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:/ai/llm/models OLLAMA_MULTIUSER_CACHE:false OLLAMA_NEW_ENGINE:true OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NO_CLOUD:false OLLAMA_NUM_PARALLEL:1 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://* vscode-webview://* vscode-file://*] OLLAMA_REMOTES:[ollama.com] OLLAMA_SCHED_SPREAD:true OLLAMA_VULKAN:false ROCR_VISIBLE_DEVICES: http_proxy: https_proxy: no_proxy:]"
Feb 21 20:36:21 ollama[539850]: time=2026-02-21T20:36:21.938+03:00 level=INFO source=routes.go:1665 msg="Ollama cloud disabled: false"
Feb 21 20:36:21 ollama[539850]: time=2026-02-21T20:36:21.949+03:00 level=INFO source=images.go:473 msg="total blobs: 22"
Feb 21 20:36:21 ollama[539850]: time=2026-02-21T20:36:21.949+03:00 level=INFO source=images.go:480 msg="total unused blobs removed: 0"
Feb 21 20:36:21 ollama[539850]: time=2026-02-21T20:36:21.950+03:00 level=INFO source=routes.go:1718 msg="Listening on [::]:11434 (version 0.16.3)"
Feb 21 20:36:21 ollama[539850]: time=2026-02-21T20:36:21.951+03:00 level=INFO source=runner.go:67 msg="discovering available GPUs..."
Feb 21 20:36:21 ollama[539850]: time=2026-02-21T20:36:21.951+03:00 level=WARN source=runner.go:485 msg="user overrode visible devices" CUDA_VISIBLE_DEVICES=0,1,2
Feb 21 20:36:21 ollama[539850]: time=2026-02-21T20:36:21.951+03:00 level=WARN source=runner.go:489 msg="if GPUs are not correctly discovered, unset and try again"
Feb 21 20:36:21 ollama[539850]: time=2026-02-21T20:36:21.951+03:00 level=INFO source=runner.go:106 msg="experimental Vulkan support disabled.  To enable, set OLLAMA_VULKAN=1"
Feb 21 20:36:21 ollama[539850]: time=2026-02-21T20:36:21.951+03:00 level=INFO source=server.go:431 msg="starting runner" cmd="/usr/local/bin/ollama runner --ollama-engine --port 37329"
Feb 21 20:36:22 ollama[539850]: time=2026-02-21T20:36:22.657+03:00 level=INFO source=server.go:431 msg="starting runner" cmd="/usr/local/bin/ollama runner --ollama-engine --port 42953"
Feb 21 20:36:23 ollama[539850]: time=2026-02-21T20:36:23.391+03:00 level=INFO source=server.go:431 msg="starting runner" cmd="/usr/local/bin/ollama runner --ollama-engine --port 36173"
Feb 21 20:36:23 ollama[539850]: time=2026-02-21T20:36:23.391+03:00 level=INFO source=server.go:431 msg="starting runner" cmd="/usr/local/bin/ollama runner --ollama-engine --port 44529"
Feb 21 20:36:23 ollama[539850]: time=2026-02-21T20:36:23.391+03:00 level=INFO source=server.go:431 msg="starting runner" cmd="/usr/local/bin/ollama runner --ollama-engine --port 43121"
Feb 21 20:36:23 ollama[539850]: time=2026-02-21T20:36:23.391+03:00 level=INFO source=server.go:431 msg="starting runner" cmd="/usr/local/bin/ollama runner --ollama-engine --port 46013"
Feb 21 20:36:23 ollama[539850]: time=2026-02-21T20:36:23.391+03:00 level=INFO source=server.go:431 msg="starting runner" cmd="/usr/local/bin/ollama runner --ollama-engine --port 46881"
Feb 21 20:36:23 ollama[539850]: time=2026-02-21T20:36:23.391+03:00 level=INFO source=server.go:431 msg="starting runner" cmd="/usr/local/bin/ollama runner --ollama-engine --port 39921"
Feb 21 20:36:23 ollama[539850]: time=2026-02-21T20:36:23.726+03:00 level=INFO source=types.go:42 msg="inference compute" id=GPU-dfc3d6a8-e942-e903-be73-9b14ce01db39 filter_id="" library=CUDA compute=7.0 name=CUDA2 description="Tesla V100-SXM2-32GB" libdirs=ollama,cuda_v12 driver=13.0 pci_id=0000:09:00.0 type=discrete total="32.0 GiB" available="31.4 GiB"
Feb 21 20:36:23 ollama[539850]: time=2026-02-21T20:36:23.726+03:00 level=INFO source=types.go:42 msg="inference compute" id=GPU-7a420261-3e37-b15a-2f9b-2fd7da322dbe filter_id="" library=CUDA compute=8.9 name=CUDA0 description="NVIDIA GeForce RTX 4090" libdirs=ollama,cuda_v12 driver=13.0 pci_id=0000:01:00.0 type=discrete total="24.0 GiB" available="23.5 GiB"
Feb 21 20:36:23 ollama[539850]: time=2026-02-21T20:36:23.726+03:00 level=INFO source=types.go:42 msg="inference compute" id=GPU-0dcf0ac3-bef6-f019-91c8-276cd8e2c0c0 filter_id="" library=CUDA compute=8.9 name=CUDA1 description="NVIDIA GeForce RTX 4090" libdirs=ollama,cuda_v12 driver=13.0 pci_id=0000:03:00.0 type=discrete total="24.0 GiB" available="23.1 GiB"
Feb 21 20:36:23 ollama[539850]: time=2026-02-21T20:36:23.726+03:00 level=INFO source=routes.go:1768 msg="vram-based default context" total_vram="80.0 GiB" default_num_ctx=262144
Feb 21 20:36:28 ollama[539850]: time=2026-02-21T20:36:28.802+03:00 level=INFO source=server.go:431 msg="starting runner" cmd="/usr/local/bin/ollama runner --ollama-engine --port 43123"
Feb 21 20:36:29 ollama[539850]: llama_model_loader: loaded meta data with 38 key-value pairs and 809 tensors from /ai/llm/models/blobs/sha256-f4f7f4e8e9f04b21d5c5a277223592388f17ae22a6e08f2ae1ab12bef6f9fca3 (version GGUF V3 (latest))
Feb 21 20:36:29 ollama[539850]: llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
Feb 21 20:36:29 ollama[539850]: llama_model_loader: - kv   0:                       general.architecture str              = minimax-m2
Feb 21 20:36:29 ollama[539850]: llama_model_loader: - kv   1:                          general.file_type u32              = 7
Feb 21 20:36:29 ollama[539850]: llama_model_loader: - kv   2:                            general.license str              = other
Feb 21 20:36:29 ollama[539850]: llama_model_loader: - kv   3:                       general.license.link str              = https://github.com/MiniMax-AI/MiniMax...
Feb 21 20:36:29 ollama[539850]: llama_model_loader: - kv   4:                       general.license.name str              = modified-mit
Feb 21 20:36:29 ollama[539850]: llama_model_loader: - kv   5:                               general.name str              = Workdir
Feb 21 20:36:29 ollama[539850]: llama_model_loader: - kv   6:                    general.parameter_count u64              = 228689764864
Feb 21 20:36:29 ollama[539850]: llama_model_loader: - kv   7:               general.quantization_version u32              = 2
Feb 21 20:36:29 ollama[539850]: llama_model_loader: - kv   8:                      general.sampling.temp f32              = 1.000000
Feb 21 20:36:29 ollama[539850]: llama_model_loader: - kv   9:                     general.sampling.top_k i32              = 40
Feb 21 20:36:29 ollama[539850]: llama_model_loader: - kv  10:                     general.sampling.top_p f32              = 0.950000
Feb 21 20:36:29 ollama[539850]: llama_model_loader: - kv  11:                         general.size_label str              = 256x4.9B
Feb 21 20:36:29 ollama[539850]: llama_model_loader: - kv  12:                               general.tags arr[str,1]       = ["text-generation"]
Feb 21 20:36:29 ollama[539850]: llama_model_loader: - kv  13:                               general.type str              = model
Feb 21 20:36:29 ollama[539850]: llama_model_loader: - kv  14:            minimax-m2.attention.head_count u32              = 48
Feb 21 20:36:29 ollama[539850]: llama_model_loader: - kv  15:         minimax-m2.attention.head_count_kv u32              = 8
Feb 21 20:36:29 ollama[539850]: llama_model_loader: - kv  16:            minimax-m2.attention.key_length u32              = 128
Feb 21 20:36:29 ollama[539850]: llama_model_loader: - kv  17: minimax-m2.attention.layer_norm_rms_epsilon f32              = 0.000001
Feb 21 20:36:29 ollama[539850]: llama_model_loader: - kv  18:          minimax-m2.attention.value_length u32              = 128
Feb 21 20:36:29 ollama[539850]: llama_model_loader: - kv  19:                     minimax-m2.block_count u32              = 62
Feb 21 20:36:29 ollama[539850]: llama_model_loader: - kv  20:                  minimax-m2.context_length u32              = 196608
Feb 21 20:36:29 ollama[539850]: llama_model_loader: - kv  21:                minimax-m2.embedding_length u32              = 3072
Feb 21 20:36:29 ollama[539850]: llama_model_loader: - kv  22:                    minimax-m2.expert_count u32              = 256
Feb 21 20:36:29 ollama[539850]: llama_model_loader: - kv  23:      minimax-m2.expert_feed_forward_length u32              = 1536
Feb 21 20:36:29 ollama[539850]: llama_model_loader: - kv  24:              minimax-m2.expert_gating_func u32              = 2
Feb 21 20:36:29 ollama[539850]: llama_model_loader: - kv  25:               minimax-m2.expert_used_count u32              = 8
Feb 21 20:36:29 ollama[539850]: llama_model_loader: - kv  26:             minimax-m2.feed_forward_length u32              = 1536
Feb 21 20:36:29 ollama[539850]: llama_model_loader: - kv  27:            minimax-m2.rope.dimension_count u32              = 64
Feb 21 20:36:29 ollama[539850]: llama_model_loader: - kv  28:                  minimax-m2.rope.freq_base f32              = 5000000.000000
Feb 21 20:36:29 ollama[539850]: llama_model_loader: - kv  29:                    tokenizer.chat_template str              = {# ----------‑‑‑ special token ...
Feb 21 20:36:29 ollama[539850]: llama_model_loader: - kv  30:                tokenizer.ggml.bos_token_id u32              = 200034
Feb 21 20:36:29 ollama[539850]: llama_model_loader: - kv  31:                tokenizer.ggml.eos_token_id u32              = 200020
Feb 21 20:36:29 ollama[539850]: llama_model_loader: - kv  32:                      tokenizer.ggml.merges arr[str,199744]  = ["Ġ Ġ", "Ġ t", "Ġ a", "i n", "e r...
Feb 21 20:36:29 ollama[539850]: llama_model_loader: - kv  33:                       tokenizer.ggml.model str              = gpt2
Feb 21 20:36:29 ollama[539850]: llama_model_loader: - kv  34:                         tokenizer.ggml.pre str              = minimax-m2
Feb 21 20:36:29 ollama[539850]: llama_model_loader: - kv  35:                  tokenizer.ggml.token_type arr[i32,200064]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
Feb 21 20:36:29 ollama[539850]: llama_model_loader: - kv  36:                      tokenizer.ggml.tokens arr[str,200064]  = ["Ā", "ā", "Ă", "ă", "Ą", "ą", ...
Feb 21 20:36:29 ollama[539850]: llama_model_loader: - kv  37:            tokenizer.ggml.unknown_token_id u32              = 200021
Feb 21 20:36:29 ollama[539850]: llama_model_loader: - type  f32:  373 tensors
Feb 21 20:36:29 ollama[539850]: llama_model_loader: - type q8_0:  436 tensors
Feb 21 20:36:29 ollama[539850]: print_info: file format = GGUF V3 (latest)
Feb 21 20:36:29 ollama[539850]: print_info: file type   = Q8_0
Feb 21 20:36:29 ollama[539850]: print_info: file size   = 226.43 GiB (8.51 BPW)
Feb 21 20:36:29 ollama[539850]: load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
Feb 21 20:36:29 ollama[539850]: load: printing all EOG tokens:
Feb 21 20:36:29 ollama[539850]: load:   - 200004 ('<fim_pad>')
Feb 21 20:36:29 ollama[539850]: load:   - 200005 ('<reponame>')
Feb 21 20:36:29 ollama[539850]: load:   - 200020 ('[e~[')
Feb 21 20:36:29 ollama[539850]: load: special tokens cache size = 54
Feb 21 20:36:29 ollama[539850]: load: token to piece cache size = 1.3355 MB
Feb 21 20:36:29 ollama[539850]: print_info: arch             = minimax-m2
Feb 21 20:36:29 ollama[539850]: print_info: vocab_only       = 1
Feb 21 20:36:29 ollama[539850]: print_info: no_alloc         = 0
Feb 21 20:36:29 ollama[539850]: print_info: model type       = ?B
Feb 21 20:36:29 ollama[539850]: print_info: model params     = 228.69 B
Feb 21 20:36:29 ollama[539850]: print_info: general.name     = Workdir
Feb 21 20:36:29 ollama[539850]: print_info: vocab type       = BPE
Feb 21 20:36:29 ollama[539850]: print_info: n_vocab          = 200064
Feb 21 20:36:29 ollama[539850]: print_info: n_merges         = 199744
Feb 21 20:36:29 ollama[539850]: print_info: BOS token        = 200034 ']~!b['
Feb 21 20:36:29 ollama[539850]: print_info: EOS token        = 200020 '[e~['
Feb 21 20:36:29 ollama[539850]: print_info: UNK token        = 200021 ']!d~['
Feb 21 20:36:29 ollama[539850]: print_info: LF token         = 10 'Ċ'
Feb 21 20:36:29 ollama[539850]: print_info: FIM PRE token    = 200001 '<fim_prefix>'
Feb 21 20:36:29 ollama[539850]: print_info: FIM SUF token    = 200003 '<fim_suffix>'
Feb 21 20:36:29 ollama[539850]: print_info: FIM MID token    = 200002 '<fim_middle>'
Feb 21 20:36:29 ollama[539850]: print_info: FIM PAD token    = 200004 '<fim_pad>'
Feb 21 20:36:29 ollama[539850]: print_info: FIM REP token    = 200005 '<reponame>'
Feb 21 20:36:29 ollama[539850]: print_info: EOG token        = 200004 '<fim_pad>'
Feb 21 20:36:29 ollama[539850]: print_info: EOG token        = 200005 '<reponame>'
Feb 21 20:36:29 ollama[539850]: print_info: EOG token        = 200020 '[e~['
Feb 21 20:36:29 ollama[539850]: print_info: max token length = 256
Feb 21 20:36:29 ollama[539850]: llama_model_load: vocab only - skipping tensors
Feb 21 20:36:29 ollama[539850]: time=2026-02-21T20:36:29.386+03:00 level=INFO source=server.go:431 msg="starting runner" cmd="/usr/local/bin/ollama runner --model /ai/llm/models/blobs/sha256-f4f7f4e8e9f04b21d5c5a277223592388f17ae22a6e08f2ae1ab12bef6f9fca3 --port 42699"
Feb 21 20:36:29 ollama[539850]: time=2026-02-21T20:36:29.386+03:00 level=INFO source=sched.go:491 msg="system memory" total="245.1 GiB" free="239.3 GiB" free_swap="6.8 GiB"
Feb 21 20:36:29 ollama[539850]: time=2026-02-21T20:36:29.386+03:00 level=INFO source=sched.go:498 msg="gpu memory" id=GPU-7a420261-3e37-b15a-2f9b-2fd7da322dbe library=CUDA available="23.1 GiB" free="23.5 GiB" minimum="457.0 MiB" overhead="0 B"
Feb 21 20:36:29 ollama[539850]: time=2026-02-21T20:36:29.386+03:00 level=INFO source=sched.go:498 msg="gpu memory" id=GPU-0dcf0ac3-bef6-f019-91c8-276cd8e2c0c0 library=CUDA available="22.7 GiB" free="23.1 GiB" minimum="457.0 MiB" overhead="0 B"
Feb 21 20:36:29 ollama[539850]: time=2026-02-21T20:36:29.386+03:00 level=INFO source=sched.go:498 msg="gpu memory" id=GPU-dfc3d6a8-e942-e903-be73-9b14ce01db39 library=CUDA available="31.0 GiB" free="31.4 GiB" minimum="457.0 MiB" overhead="0 B"
Feb 21 20:36:29 ollama[539850]: time=2026-02-21T20:36:29.386+03:00 level=INFO source=server.go:498 msg="loading model" "model layers"=63 requested=19
Feb 21 20:36:29 ollama[539850]: time=2026-02-21T20:36:29.386+03:00 level=INFO source=device.go:240 msg="model weights" device=CUDA0 size="7.3 GiB"
Feb 21 20:36:29 ollama[539850]: time=2026-02-21T20:36:29.386+03:00 level=INFO source=device.go:240 msg="model weights" device=CUDA1 size="3.6 GiB"
Feb 21 20:36:29 ollama[539850]: time=2026-02-21T20:36:29.387+03:00 level=INFO source=device.go:240 msg="model weights" device=CUDA2 size="58.1 GiB"
Feb 21 20:36:29 ollama[539850]: time=2026-02-21T20:36:29.387+03:00 level=INFO source=device.go:245 msg="model weights" device=CPU size="156.8 GiB"
Feb 21 20:36:29 ollama[539850]: time=2026-02-21T20:36:29.387+03:00 level=INFO source=device.go:251 msg="kv cache" device=CUDA0 size="585.9 MiB"
Feb 21 20:36:29 ollama[539850]: time=2026-02-21T20:36:29.387+03:00 level=INFO source=device.go:251 msg="kv cache" device=CUDA1 size="293.0 MiB"
Feb 21 20:36:29 ollama[539850]: time=2026-02-21T20:36:29.387+03:00 level=INFO source=device.go:251 msg="kv cache" device=CUDA2 size="4.6 GiB"
Feb 21 20:36:29 ollama[539850]: time=2026-02-21T20:36:29.387+03:00 level=INFO source=device.go:256 msg="kv cache" device=CPU size="12.3 GiB"
Feb 21 20:36:29 ollama[539850]: time=2026-02-21T20:36:29.387+03:00 level=INFO source=device.go:262 msg="compute graph" device=CUDA0 size="17.7 GiB"
Feb 21 20:36:29 ollama[539850]: time=2026-02-21T20:36:29.387+03:00 level=INFO source=device.go:262 msg="compute graph" device=CUDA1 size="17.7 GiB"
Feb 21 20:36:29 ollama[539850]: time=2026-02-21T20:36:29.387+03:00 level=INFO source=device.go:262 msg="compute graph" device=CUDA2 size="17.7 GiB"
Feb 21 20:36:29 ollama[539850]: time=2026-02-21T20:36:29.387+03:00 level=INFO source=device.go:272 msg="total memory" size="296.8 GiB"
Feb 21 20:36:29 ollama[539850]: time=2026-02-21T20:36:29.392+03:00 level=INFO source=runner.go:965 msg="starting go runner"
Feb 21 20:36:29 ollama[539850]: load_backend: loaded CPU backend from /usr/local/lib/ollama/libggml-cpu-icelake.so
Feb 21 20:36:29 ollama[539850]: ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
Feb 21 20:36:29 ollama[539850]: ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
Feb 21 20:36:29 ollama[539850]: ggml_cuda_init: found 3 CUDA devices:
Feb 21 20:36:29 ollama[539850]:   Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes, ID: GPU-7a420261-3e37-b15a-2f9b-2fd7da322dbe
Feb 21 20:36:29 ollama[539850]:   Device 1: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes, ID: GPU-0dcf0ac3-bef6-f019-91c8-276cd8e2c0c0
Feb 21 20:36:29 ollama[539850]:   Device 2: Tesla V100-SXM2-32GB, compute capability 7.0, VMM: yes, ID: GPU-dfc3d6a8-e942-e903-be73-9b14ce01db39
Feb 21 20:36:29 ollama[539850]: load_backend: loaded CUDA backend from /usr/local/lib/ollama/cuda_v12/libggml-cuda.so
Feb 21 20:36:29 ollama[539850]: time=2026-02-21T20:36:29.537+03:00 level=INFO source=ggml.go:104 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.BMI2=1 CPU.0.AVX512=1 CPU.0.AVX512_VBMI=1 CPU.0.AVX512_VNNI=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 CUDA.0.ARCHS=500,520,600,610,700,750,800,860,890,900,1200 CUDA.0.USE_GRAPHS=1 CUDA.0.PEER_MAX_BATCH_SIZE=128 CUDA.1.ARCHS=500,520,600,610,700,750,800,860,890,900,1200 CUDA.1.USE_GRAPHS=1 CUDA.1.PEER_MAX_BATCH_SIZE=128 CUDA.2.ARCHS=500,520,600,610,700,750,800,860,890,900,1200 CUDA.2.USE_GRAPHS=1 CUDA.2.PEER_MAX_BATCH_SIZE=128 compiler=cgo(gcc)
Feb 21 20:36:29 ollama[539850]: time=2026-02-21T20:36:29.537+03:00 level=INFO source=runner.go:1001 msg="Server listening on 127.0.0.1:42699"
Feb 21 20:36:29 ollama[539850]: time=2026-02-21T20:36:29.541+03:00 level=INFO source=runner.go:895 msg=load request="{Operation:commit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:Enabled KvSize:75000 KvCacheType: NumThreads:16 GPULayers:19[ID:GPU-dfc3d6a8-e942-e903-be73-9b14ce01db39 Layers:16(43..58) ID:GPU-7a420261-3e37-b15a-2f9b-2fd7da322dbe Layers:2(59..60) ID:GPU-0dcf0ac3-bef6-f019-91c8-276cd8e2c0c0 Layers:1(61..61)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
Feb 21 20:36:29 ollama[539850]: time=2026-02-21T20:36:29.541+03:00 level=INFO source=server.go:1350 msg="waiting for llama runner to start responding"
Feb 21 20:36:29 ollama[539850]: time=2026-02-21T20:36:29.541+03:00 level=INFO source=server.go:1384 msg="waiting for server to become available" status="llm server loading model"
Feb 21 20:36:29 ollama[539850]: ggml_backend_cuda_device_get_memory device GPU-dfc3d6a8-e942-e903-be73-9b14ce01db39 utilizing NVML memory reporting free: 33746845696 total: 34359738368
Feb 21 20:36:29 ollama[539850]: llama_model_load_from_file_impl: using device CUDA2 (Tesla V100-SXM2-32GB) (0000:09:00.0) - 32183 MiB free
Feb 21 20:36:29 ollama[539850]: ggml_backend_cuda_device_get_memory device GPU-7a420261-3e37-b15a-2f9b-2fd7da322dbe utilizing NVML memory reporting free: 24836243456 total: 25757220864
Feb 21 20:36:29 ollama[539850]: llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 4090) (0000:01:00.0) - 23685 MiB free
Feb 21 20:36:29 ollama[539850]: ggml_backend_cuda_device_get_memory device GPU-0dcf0ac3-bef6-f019-91c8-276cd8e2c0c0 utilizing NVML memory reporting free: 24836243456 total: 25757220864
Feb 21 20:36:29 ollama[539850]: llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 4090) (0000:03:00.0) - 23685 MiB free
Feb 21 20:36:29 ollama[539850]: llama_model_loader: loaded meta data with 38 key-value pairs and 809 tensors from /ai/llm/models/blobs/sha256-f4f7f4e8e9f04b21d5c5a277223592388f17ae22a6e08f2ae1ab12bef6f9fca3 (version GGUF V3 (latest))
Feb 21 20:36:29 ollama[539850]: llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
Feb 21 20:36:29 ollama[539850]: llama_model_loader: - kv   0:                       general.architecture str              = minimax-m2
Feb 21 20:36:29 ollama[539850]: llama_model_loader: - kv   1:                          general.file_type u32              = 7
Feb 21 20:36:29 ollama[539850]: llama_model_loader: - kv   2:                            general.license str              = other
Feb 21 20:36:29 ollama[539850]: llama_model_loader: - kv   3:                       general.license.link str              = https://github.com/MiniMax-AI/MiniMax...
Feb 21 20:36:29 ollama[539850]: llama_model_loader: - kv   4:                       general.license.name str              = modified-mit
Feb 21 20:36:29 ollama[539850]: llama_model_loader: - kv   5:                               general.name str              = Workdir
Feb 21 20:36:29 ollama[539850]: llama_model_loader: - kv   6:                    general.parameter_count u64              = 228689764864
Feb 21 20:36:29 ollama[539850]: llama_model_loader: - kv   7:               general.quantization_version u32              = 2
Feb 21 20:36:29 ollama[539850]: llama_model_loader: - kv   8:                      general.sampling.temp f32              = 1.000000
Feb 21 20:36:29 ollama[539850]: llama_model_loader: - kv   9:                     general.sampling.top_k i32              = 40
Feb 21 20:36:29 ollama[539850]: llama_model_loader: - kv  10:                     general.sampling.top_p f32              = 0.950000
Feb 21 20:36:29 ollama[539850]: llama_model_loader: - kv  11:                         general.size_label str              = 256x4.9B
Feb 21 20:36:29 ollama[539850]: llama_model_loader: - kv  12:                               general.tags arr[str,1]       = ["text-generation"]
Feb 21 20:36:29 ollama[539850]: llama_model_loader: - kv  13:                               general.type str              = model
Feb 21 20:36:29 ollama[539850]: llama_model_loader: - kv  14:            minimax-m2.attention.head_count u32              = 48
Feb 21 20:36:29 ollama[539850]: llama_model_loader: - kv  15:         minimax-m2.attention.head_count_kv u32              = 8
Feb 21 20:36:29 ollama[539850]: llama_model_loader: - kv  16:            minimax-m2.attention.key_length u32              = 128
Feb 21 20:36:29 ollama[539850]: llama_model_loader: - kv  17: minimax-m2.attention.layer_norm_rms_epsilon f32              = 0.000001
Feb 21 20:36:29 ollama[539850]: llama_model_loader: - kv  18:          minimax-m2.attention.value_length u32              = 128
Feb 21 20:36:29 ollama[539850]: llama_model_loader: - kv  19:                     minimax-m2.block_count u32              = 62
Feb 21 20:36:29 ollama[539850]: llama_model_loader: - kv  20:                  minimax-m2.context_length u32              = 196608
Feb 21 20:36:29 ollama[539850]: llama_model_loader: - kv  21:                minimax-m2.embedding_length u32              = 3072
Feb 21 20:36:29 ollama[539850]: llama_model_loader: - kv  22:                    minimax-m2.expert_count u32              = 256
Feb 21 20:36:29 ollama[539850]: llama_model_loader: - kv  23:      minimax-m2.expert_feed_forward_length u32              = 1536
Feb 21 20:36:29 ollama[539850]: llama_model_loader: - kv  24:              minimax-m2.expert_gating_func u32              = 2
Feb 21 20:36:29 ollama[539850]: llama_model_loader: - kv  25:               minimax-m2.expert_used_count u32              = 8
Feb 21 20:36:29 ollama[539850]: llama_model_loader: - kv  26:             minimax-m2.feed_forward_length u32              = 1536
Feb 21 20:36:29 ollama[539850]: llama_model_loader: - kv  27:            minimax-m2.rope.dimension_count u32              = 64
Feb 21 20:36:29 ollama[539850]: llama_model_loader: - kv  28:                  minimax-m2.rope.freq_base f32              = 5000000.000000
Feb 21 20:36:29 ollama[539850]: llama_model_loader: - kv  29:                    tokenizer.chat_template str              = {# ----------‑‑‑ special token ...
Feb 21 20:36:29 ollama[539850]: llama_model_loader: - kv  30:                tokenizer.ggml.bos_token_id u32              = 200034
Feb 21 20:36:29 ollama[539850]: llama_model_loader: - kv  31:                tokenizer.ggml.eos_token_id u32              = 200020
Feb 21 20:36:29 ollama[539850]: llama_model_loader: - kv  32:                      tokenizer.ggml.merges arr[str,199744]  = ["Ġ Ġ", "Ġ t", "Ġ a", "i n", "e r...
Feb 21 20:36:29 ollama[539850]: llama_model_loader: - kv  33:                       tokenizer.ggml.model str              = gpt2
Feb 21 20:36:29 ollama[539850]: llama_model_loader: - kv  34:                         tokenizer.ggml.pre str              = minimax-m2
Feb 21 20:36:29 ollama[539850]: llama_model_loader: - kv  35:                  tokenizer.ggml.token_type arr[i32,200064]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
Feb 21 20:36:29 ollama[539850]: llama_model_loader: - kv  36:                      tokenizer.ggml.tokens arr[str,200064]  = ["Ā", "ā", "Ă", "ă", "Ą", "ą", ...
Feb 21 20:36:29 ollama[539850]: llama_model_loader: - kv  37:            tokenizer.ggml.unknown_token_id u32              = 200021
Feb 21 20:36:29 ollama[539850]: llama_model_loader: - type  f32:  373 tensors
Feb 21 20:36:29 ollama[539850]: llama_model_loader: - type q8_0:  436 tensors
Feb 21 20:36:29 ollama[539850]: print_info: file format = GGUF V3 (latest)
Feb 21 20:36:29 ollama[539850]: print_info: file type   = Q8_0
Feb 21 20:36:29 ollama[539850]: print_info: file size   = 226.43 GiB (8.51 BPW)
Feb 21 20:36:29 ollama[539850]: load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
Feb 21 20:36:29 ollama[539850]: load: printing all EOG tokens:
Feb 21 20:36:29 ollama[539850]: load:   - 200004 ('<fim_pad>')
Feb 21 20:36:29 ollama[539850]: load:   - 200005 ('<reponame>')
Feb 21 20:36:29 ollama[539850]: load:   - 200020 ('[e~[')
Feb 21 20:36:29 ollama[539850]: load: special tokens cache size = 54
Feb 21 20:36:29 ollama[539850]: load: token to piece cache size = 1.3355 MB
Feb 21 20:36:29 ollama[539850]: print_info: arch             = minimax-m2
Feb 21 20:36:29 ollama[539850]: print_info: vocab_only       = 0
Feb 21 20:36:29 ollama[539850]: print_info: no_alloc         = 0
Feb 21 20:36:29 ollama[539850]: print_info: n_ctx_train      = 196608
Feb 21 20:36:29 ollama[539850]: print_info: n_embd           = 3072
Feb 21 20:36:29 ollama[539850]: print_info: n_embd_inp       = 3072
Feb 21 20:36:29 ollama[539850]: print_info: n_layer          = 62
Feb 21 20:36:29 ollama[539850]: print_info: n_head           = 48
Feb 21 20:36:29 ollama[539850]: print_info: n_head_kv        = 8
Feb 21 20:36:29 ollama[539850]: print_info: n_rot            = 64
Feb 21 20:36:29 ollama[539850]: print_info: n_swa            = 0
Feb 21 20:36:29 ollama[539850]: print_info: is_swa_any       = 0
Feb 21 20:36:29 ollama[539850]: print_info: n_embd_head_k    = 128
Feb 21 20:36:29 ollama[539850]: print_info: n_embd_head_v    = 128
Feb 21 20:36:29 ollama[539850]: print_info: n_gqa            = 6
Feb 21 20:36:29 ollama[539850]: print_info: n_embd_k_gqa     = 1024
Feb 21 20:36:29 ollama[539850]: print_info: n_embd_v_gqa     = 1024
Feb 21 20:36:29 ollama[539850]: print_info: f_norm_eps       = 0.0e+00
Feb 21 20:36:29 ollama[539850]: print_info: f_norm_rms_eps   = 1.0e-06
Feb 21 20:36:29 ollama[539850]: print_info: f_clamp_kqv      = 0.0e+00
Feb 21 20:36:29 ollama[539850]: print_info: f_max_alibi_bias = 0.0e+00
Feb 21 20:36:29 ollama[539850]: print_info: f_logit_scale    = 0.0e+00
Feb 21 20:36:29 ollama[539850]: print_info: f_attn_scale     = 0.0e+00
Feb 21 20:36:29 ollama[539850]: print_info: n_ff             = 1536
Feb 21 20:36:29 ollama[539850]: print_info: n_expert         = 256
Feb 21 20:36:29 ollama[539850]: print_info: n_expert_used    = 8
Feb 21 20:36:29 ollama[539850]: print_info: n_expert_groups  = 0
Feb 21 20:36:29 ollama[539850]: print_info: n_group_used     = 0
Feb 21 20:36:29 ollama[539850]: print_info: causal attn      = 1
Feb 21 20:36:29 ollama[539850]: print_info: pooling type     = 0
Feb 21 20:36:29 ollama[539850]: print_info: rope type        = 2
Feb 21 20:36:29 ollama[539850]: print_info: rope scaling     = linear
Feb 21 20:36:29 ollama[539850]: print_info: freq_base_train  = 5000000.0
Feb 21 20:36:29 ollama[539850]: print_info: freq_scale_train = 1
Feb 21 20:36:29 ollama[539850]: print_info: n_ctx_orig_yarn  = 196608
Feb 21 20:36:29 ollama[539850]: print_info: rope_yarn_log_mul= 0.0000
Feb 21 20:36:29 ollama[539850]: print_info: rope_finetuned   = unknown
Feb 21 20:36:29 ollama[539850]: print_info: model type       = 230B.A10B
Feb 21 20:36:29 ollama[539850]: print_info: model params     = 228.69 B
Feb 21 20:36:29 ollama[539850]: print_info: general.name     = Workdir
Feb 21 20:36:29 ollama[539850]: print_info: vocab type       = BPE
Feb 21 20:36:29 ollama[539850]: print_info: n_vocab          = 200064
Feb 21 20:36:29 ollama[539850]: print_info: n_merges         = 199744
Feb 21 20:36:29 ollama[539850]: print_info: BOS token        = 200034 ']~!b['
Feb 21 20:36:29 ollama[539850]: print_info: EOS token        = 200020 '[e~['
Feb 21 20:36:29 ollama[539850]: print_info: UNK token        = 200021 ']!d~['
Feb 21 20:36:29 ollama[539850]: print_info: LF token         = 10 'Ċ'
Feb 21 20:36:29 ollama[539850]: print_info: FIM PRE token    = 200001 '<fim_prefix>'
Feb 21 20:36:29 ollama[539850]: print_info: FIM SUF token    = 200003 '<fim_suffix>'
Feb 21 20:36:29 ollama[539850]: print_info: FIM MID token    = 200002 '<fim_middle>'
Feb 21 20:36:29 ollama[539850]: print_info: FIM PAD token    = 200004 '<fim_pad>'
Feb 21 20:36:29 ollama[539850]: print_info: FIM REP token    = 200005 '<reponame>'
Feb 21 20:36:29 ollama[539850]: print_info: EOG token        = 200004 '<fim_pad>'
Feb 21 20:36:29 ollama[539850]: print_info: EOG token        = 200005 '<reponame>'
Feb 21 20:36:29 ollama[539850]: print_info: EOG token        = 200020 '[e~['
Feb 21 20:36:29 ollama[539850]: print_info: max token length = 256
Feb 21 20:36:29 ollama[539850]: load_tensors: loading model tensors, this can take a while... (mmap = false)
Feb 21 20:36:31 ollama[539850]: time=2026-02-21T20:36:31.997+03:00 level=INFO source=server.go:1384 msg="waiting for server to become available" status="llm server not responding"
Feb 21 20:36:32 ollama[539850]: time=2026-02-21T20:36:32.255+03:00 level=INFO source=server.go:1384 msg="waiting for server to become available" status="llm server loading model"
Feb 21 20:36:51 ollama[539850]: load_tensors: offloading 19 repeating layers to GPU
Feb 21 20:36:51 ollama[539850]: load_tensors: offloaded 19/63 layers to GPU
Feb 21 20:36:51 ollama[539850]: load_tensors:        CUDA0 model buffer size =  7439.35 MiB
Feb 21 20:36:51 ollama[539850]: load_tensors:        CUDA1 model buffer size =  3719.68 MiB
Feb 21 20:36:51 ollama[539850]: load_tensors:        CUDA2 model buffer size = 59514.83 MiB
Feb 21 20:36:51 ollama[539850]: load_tensors:    CUDA_Host model buffer size = 161191.63 MiB
Feb 21 20:38:20 ollama[539850]: llama_context: constructing llama_context
Feb 21 20:38:20 ollama[539850]: llama_context: n_seq_max     = 1
Feb 21 20:38:20 ollama[539850]: llama_context: n_ctx         = 75008
Feb 21 20:38:20 ollama[539850]: llama_context: n_ctx_seq     = 75008
Feb 21 20:38:20 ollama[539850]: llama_context: n_batch       = 512
Feb 21 20:38:20 ollama[539850]: llama_context: n_ubatch      = 512
Feb 21 20:38:20 ollama[539850]: llama_context: causal_attn   = 1
Feb 21 20:38:20 ollama[539850]: llama_context: flash_attn    = enabled
Feb 21 20:38:20 ollama[539850]: llama_context: kv_unified    = false
Feb 21 20:38:20 ollama[539850]: llama_context: freq_base     = 5000000.0
Feb 21 20:38:20 ollama[539850]: llama_context: freq_scale    = 1
Feb 21 20:38:20 ollama[539850]: llama_context: n_ctx_seq (75008) < n_ctx_train (196608) -- the full capacity of the model will not be utilized
Feb 21 20:38:20 ollama[539850]: llama_context:        CPU  output buffer size =     0.77 MiB
Feb 21 20:38:20 ollama[539850]: llama_kv_cache:        CPU KV buffer size = 12599.00 MiB
Feb 21 20:38:21 ollama[539850]: llama_kv_cache:      CUDA0 KV buffer size =   586.00 MiB
Feb 21 20:38:22 ollama[539850]: llama_kv_cache:      CUDA1 KV buffer size =   293.00 MiB
Feb 21 20:38:22 ollama[539850]: llama_kv_cache:      CUDA2 KV buffer size =  4688.00 MiB
Feb 21 20:38:24 ollama[539850]: llama_kv_cache: size = 18166.00 MiB ( 75008 cells,  62 layers,  1/1 seqs), K (f16): 9083.00 MiB, V (f16): 9083.00 MiB
Feb 21 20:38:24 ollama[539850]: llama_context:      CUDA2 compute buffer size =  1760.78 MiB
Feb 21 20:38:24 ollama[539850]: llama_context:      CUDA0 compute buffer size =   158.76 MiB
Feb 21 20:38:24 ollama[539850]: llama_context:      CUDA1 compute buffer size =   109.26 MiB
Feb 21 20:38:24 ollama[539850]: llama_context:  CUDA_Host compute buffer size =   152.51 MiB
Feb 21 20:38:24 ollama[539850]: llama_context: graph nodes  = 3975
Feb 21 20:38:24 ollama[539850]: llama_context: graph splits = 651 (with bs=512), 5 (with bs=1)
Feb 21 20:38:25 ollama[539850]: time=2026-02-21T20:38:25.031+03:00 level=INFO source=server.go:1388 msg="llama runner started in 115.64 seconds"
Feb 21 20:38:25 ollama[539850]: time=2026-02-21T20:38:25.031+03:00 level=INFO source=sched.go:566 msg="loaded runners" count=1
Feb 21 20:38:25 ollama[539850]: time=2026-02-21T20:38:25.031+03:00 level=INFO source=server.go:1350 msg="waiting for llama runner to start responding"
Feb 21 20:38:25 ollama[539850]: time=2026-02-21T20:38:25.031+03:00 level=INFO source=server.go:1388 msg="llama runner started in 115.65 seconds"
Feb 21 20:42:15 ollama[539850]: [GIN] 2026/02/21 - 20:42:15 | 200 |         5m46s |  192.168.127.20 | POST     "/api/chat"

OS

Linux

GPU

Nvidia

CPU

AMD

Ollama version

0.16.3

Originally created by @ka-admin on GitHub (Feb 21, 2026). Original GitHub issue: https://github.com/ollama/ollama/issues/14351 ### What is the issue? <html><body> <!--StartFragment--><p class="font-claude-response-body break-words whitespace-normal leading-[1.7]">When running a large MoE model with the new engine on a mixed multi-GPU system under partial offload, the layer scheduler assigns far more VRAM to a GPU than it physically has, silently overflowing those tensors into pinned host RAM (<code class="bg-text-200/5 border border-0.5 border-border-300 text-danger-000 whitespace-pre-wrap rounded-[0.4rem] px-1 py-px text-[0.9rem]">CUDA_Host</code>). The result is that layers reported as "GPU offloaded" are in fact running from system RAM, with added PCIe overhead.</p> <p class="font-claude-response-body break-words whitespace-normal leading-[1.7]">Additionally, <code class="bg-text-200/5 border border-0.5 border-border-300 text-danger-000 whitespace-pre-wrap rounded-[0.4rem] px-1 py-px text-[0.9rem]">OLLAMA_SCHED_SPREAD=true</code> has <strong>zero effect</strong> — the layer assignment and buffer allocation are byte-for-byte identical with the flag on or off, suggesting the bug lies in per-layer VRAM cost estimation before any placement policy is applied.</p> <h2 class="text-text-100 mt-3 -mb-1 text-[1.125rem] font-bold">Environment</h2> <ul class="[li_&amp;]:mb-0 [li_&amp;]:mt-1 [li_&amp;]:gap-1 [&amp;:not(:last-child)_ul]:pb-1 [&amp;:not(:last-child)_ol]:pb-1 list-disc flex flex-col gap-1 pl-8 mb-3"> <li class="whitespace-normal break-words pl-2"><strong>Ollama version:</strong> 0.16.3</li> <li class="whitespace-normal break-words pl-2"><strong>OS:</strong> Linux (Ubuntu), kernel with systemd</li> <li class="whitespace-normal break-words pl-2"><strong>Model:</strong> MiniMax M2.5 Q8_0 GGUF (<code class="bg-text-200/5 border border-0.5 border-border-300 text-danger-000 whitespace-pre-wrap rounded-[0.4rem] px-1 py-px text-[0.9rem]">minimax-m2</code> architecture, 228.69B params, 226.43 GiB)</li> <li class="whitespace-normal break-words pl-2"><strong>GPUs:</strong></li> </ul> <div class="overflow-x-auto w-full px-2 mb-6"> Device | Name | VRAM | Compute -- | -- | -- | -- CUDA0 | NVIDIA GeForce RTX 4090 | 24 GiB | 8.9 CUDA1 | NVIDIA GeForce RTX 4090 | 24 GiB | 8.9 CUDA2 | Tesla V100-SXM2-32GB | 32 GiB | 7.0 </div> <ul class="[li_&amp;]:mb-0 [li_&amp;]:mt-1 [li_&amp;]:gap-1 [&amp;:not(:last-child)_ul]:pb-1 [&amp;:not(:last-child)_ol]:pb-1 list-disc flex flex-col gap-1 pl-8 mb-3"> <li class="whitespace-normal break-words pl-2"><strong>System RAM:</strong> 256 GiB DDR5</li> <li class="whitespace-normal break-words pl-2"><strong>Driver / CUDA:</strong> 580.126.18 / 13.0</li> <li class="whitespace-normal break-words pl-2"><strong>Relevant env config:</strong></li> </ul> <div class="relative group/copy bg-bg-000/50 border-0.5 border-border-400 rounded-lg"><div class="sticky opacity-0 group-hover/copy:opacity-100 top-2 py-2 h-12 w-0 float-right"><div class="absolute right-0 h-8 px-2 items-center inline-flex z-10"><button class="inline-flex items-center justify-center relative shrink-0 can-focus select-none disabled:pointer-events-none disabled:opacity-50 disabled:shadow-none disabled:drop-shadow-none border-transparent transition font-base duration-300 ease-[cubic-bezier(0.165,0.85,0.45,1)] h-8 w-8 rounded-md active:scale-95 backdrop-blur-md Button_ghost__BUAoh" type="button" aria-label="Copy to clipboard" data-state="closed"><div class="relative"><div class="transition-all opacity-100 scale-100" 
style="width: 20px; height: 20px; display: flex; align-items: center; justify-content: center;"><svg width="20" height="20" viewBox="0 0 20 20" fill="currentColor" xmlns="http://www.w3.org/2000/svg" style="flex-shrink: 0;" class="transition-all opacity-100 scale-100" aria-hidden="true"><path d="M12.5 3C13.3284 3 14 3.67157 14 4.5V6H15.5C16.3284 6 17 6.67157 17 7.5V15.5C17 16.3284 16.3284 17 15.5 17H7.5C6.67157 17 6 16.3284 6 15.5V14H4.5C3.67157 14 3 13.3284 3 12.5V4.5C3 3.67157 3.67157 3 4.5 3H12.5ZM14 12.5C14 13.3284 13.3284 14 12.5 14H7V15.5C7 15.7761 7.22386 16 7.5 16H15.5C15.7761 16 16 15.7761 16 15.5V7.5C16 7.22386 15.7761 7 15.5 7H14V12.5ZM4.5 4C4.22386 4 4 4.22386 4 4.5V12.5C4 12.7761 4.22386 13 4.5 13H12.5C12.7761 13 13 12.7761 13 12.5V4.5C13 4.22386 12.7761 4 12.5 4H4.5Z"></path></svg></div><div class="absolute inset-0 flex items-center justify-center"><div class="transition-all opacity-0 scale-50" style="width: 20px; height: 20px; display: flex; align-items: center; justify-content: center;"><svg width="20" height="20" viewBox="0 0 20 20" fill="currentColor" xmlns="http://www.w3.org/2000/svg" style="flex-shrink: 0;" class="transition-all opacity-0 scale-50" aria-hidden="true"><path d="M15.1883 5.10908C15.3699 4.96398 15.6346 4.96153 15.8202 5.11592C16.0056 5.27067 16.0504 5.53125 15.9403 5.73605L15.8836 5.82003L8.38354 14.8202C8.29361 14.9279 8.16242 14.9925 8.02221 14.9989C7.88203 15.0051 7.74545 14.9526 7.64622 14.8534L4.14617 11.3533L4.08172 11.2752C3.95384 11.0811 3.97542 10.817 4.14617 10.6463C4.31693 10.4755 4.58105 10.4539 4.77509 10.5818L4.85321 10.6463L7.96556 13.7586L15.1161 5.1794L15.1883 5.10908Z"></path></svg></div></div></div></button></div></div><div class="overflow-x-auto"><pre class="code-block__code !my-0 !rounded-lg !text-sm !leading-relaxed p-3.5" style="color: rgb(20, 24, 31); background: transparent; font-family: var(--font-mono);"><code style="color: rgb(20, 24, 31); background: transparent; font-family: var(--font-mono); white-space: pre-wrap;"><span><span>CUDA_VISIBLE_DEVICES=0,1,2 </span></span><span>OLLAMA_CONTEXT_LENGTH=131072 </span><span>OLLAMA_FLASH_ATTENTION=true </span><span>OLLAMA_NUM_PARALLEL=1</span></code></pre></div></div> <h2 class="text-text-100 mt-3 -mb-1 text-[1.125rem] font-bold">Steps to reproduce</h2> <ol class="[li_&amp;]:mb-0 [li_&amp;]:mt-1 [li_&amp;]:gap-1 [&amp;:not(:last-child)_ul]:pb-1 [&amp;:not(:last-child)_ol]:pb-1 list-decimal flex flex-col gap-1 pl-8 mb-3"> <li class="whitespace-normal break-words pl-2">Run Ollama 0.16.3 on a mixed multi-GPU system where the GPU with largest total VRAM (V100, 32 GiB) cannot physically hold the layers the scheduler intends to assign it</li> <li class="whitespace-normal break-words pl-2">Load MiniMax M2.5 Q8_0 with partial GPU offload (19 of 63 layers)</li> <li class="whitespace-normal break-words pl-2">Observe layer assignment and actual buffer allocation in logs</li> </ol> <h2 class="text-text-100 mt-3 -mb-1 text-[1.125rem] font-bold">Expected behavior</h2> <p class="font-claude-response-body break-words whitespace-normal leading-[1.7]">Layers assigned to each GPU should not exceed that GPU's physical VRAM. If a GPU cannot fit its assigned layers, fewer layers should be assigned or they should be redistributed. 
<code class="bg-text-200/5 border border-0.5 border-border-300 text-danger-000 whitespace-pre-wrap rounded-[0.4rem] px-1 py-px text-[0.9rem]">OLLAMA_SCHED_SPREAD=true</code> should produce a meaningfully different (more even) distribution than the default greedy policy.</p> <h2 class="text-text-100 mt-3 -mb-1 text-[1.125rem] font-bold">Actual behavior</h2> <p class="font-claude-response-body break-words whitespace-normal leading-[1.7]">The scheduler assigns 16 layers to the V100 (32 GiB), but the actual model buffer for those layers totals <strong>59,514 MiB (~58 GiB)</strong> — nearly 2× its physical capacity. This overflows silently into <code class="bg-text-200/5 border border-0.5 border-border-300 text-danger-000 whitespace-pre-wrap rounded-[0.4rem] px-1 py-px text-[0.9rem]">CUDA_Host</code> (pinned system RAM) with no warning or error. The 16 "GPU layers" on the V100 are functionally CPU layers.</p> <p class="font-claude-response-body break-words whitespace-normal leading-[1.7]">Furthermore, enabling <code class="bg-text-200/5 border border-0.5 border-border-300 text-danger-000 whitespace-pre-wrap rounded-[0.4rem] px-1 py-px text-[0.9rem]">OLLAMA_SCHED_SPREAD=true</code> produces <strong>identical output</strong> — same layer assignment, same buffer sizes, same graph splits — indicating the spread/pack policy is not the issue and that incorrect per-layer cost estimation is happening upstream of any placement decision.</p> <h3 class="text-text-100 mt-2 -mb-1 text-base font-bold">Experiment 1 — <code class="bg-text-200/5 border border-0.5 border-border-300 text-danger-000 whitespace-pre-wrap rounded-[0.4rem] px-1 py-px text-[0.9rem]">OLLAMA_SCHED_SPREAD=false</code> (default)</h3> <div class="relative group/copy bg-bg-000/50 border-0.5 border-border-400 rounded-lg"><div class="sticky opacity-0 group-hover/copy:opacity-100 top-2 py-2 h-12 w-0 float-right"><div class="absolute right-0 h-8 px-2 items-center inline-flex z-10"><button class="inline-flex items-center justify-center relative shrink-0 can-focus select-none disabled:pointer-events-none disabled:opacity-50 disabled:shadow-none disabled:drop-shadow-none border-transparent transition font-base duration-300 ease-[cubic-bezier(0.165,0.85,0.45,1)] h-8 w-8 rounded-md active:scale-95 backdrop-blur-md Button_ghost__BUAoh" type="button" aria-label="Copy to clipboard" data-state="closed"><div class="relative"><div class="transition-all opacity-100 scale-100" style="width: 20px; height: 20px; display: flex; align-items: center; justify-content: center;"><svg width="20" height="20" viewBox="0 0 20 20" fill="currentColor" xmlns="http://www.w3.org/2000/svg" style="flex-shrink: 0;" class="transition-all opacity-100 scale-100" aria-hidden="true"><path d="M12.5 3C13.3284 3 14 3.67157 14 4.5V6H15.5C16.3284 6 17 6.67157 17 7.5V15.5C17 16.3284 16.3284 17 15.5 17H7.5C6.67157 17 6 16.3284 6 15.5V14H4.5C3.67157 14 3 13.3284 3 12.5V4.5C3 3.67157 3.67157 3 4.5 3H12.5ZM14 12.5C14 13.3284 13.3284 14 12.5 14H7V15.5C7 15.7761 7.22386 16 7.5 16H15.5C15.7761 16 16 15.7761 16 15.5V7.5C16 7.22386 15.7761 7 15.5 7H14V12.5ZM4.5 4C4.22386 4 4 4.22386 4 4.5V12.5C4 12.7761 4.22386 13 4.5 13H12.5C12.7761 13 13 12.7761 13 12.5V4.5C13 4.22386 12.7761 4 12.5 4H4.5Z"></path></svg></div><div class="absolute inset-0 flex items-center justify-center"><div class="transition-all opacity-0 scale-50" style="width: 20px; height: 20px; display: flex; align-items: center; justify-content: center;"><svg width="20" height="20" viewBox="0 0 20 20" fill="currentColor" 
xmlns="http://www.w3.org/2000/svg" style="flex-shrink: 0;" class="transition-all opacity-0 scale-50" aria-hidden="true"><path d="M15.1883 5.10908C15.3699 4.96398 15.6346 4.96153 15.8202 5.11592C16.0056 5.27067 16.0504 5.53125 15.9403 5.73605L15.8836 5.82003L8.38354 14.8202C8.29361 14.9279 8.16242 14.9925 8.02221 14.9989C7.88203 15.0051 7.74545 14.9526 7.64622 14.8534L4.14617 11.3533L4.08172 11.2752C3.95384 11.0811 3.97542 10.817 4.14617 10.6463C4.31693 10.4755 4.58105 10.4539 4.77509 10.5818L4.85321 10.6463L7.96556 13.7586L15.1161 5.1794L15.1883 5.10908Z"></path></svg></div></div></div></button></div></div><div class="overflow-x-auto"><pre class="code-block__code !my-0 !rounded-lg !text-sm !leading-relaxed p-3.5" style="color: rgb(20, 24, 31); background: transparent; font-family: var(--font-mono);"><code style="color: rgb(20, 24, 31); background: transparent; font-family: var(--font-mono); white-space: pre-wrap;"><span><span>GPULayers:19[ </span></span><span> GPU-dfc3d6a8 (V100-32GB): Layers:16 (43..58) </span><span> GPU-7a420261 (RTX4090 #1): Layers:2 (59..60) </span><span> GPU-0dcf0ac3 (RTX4090 #2): Layers:1 (61..61) </span><span>] </span><span> </span><span>load_tensors: CUDA0 model buffer size = 7,439.35 MiB </span><span>load_tensors: CUDA1 model buffer size = 3,719.68 MiB </span><span>load_tensors: CUDA2 model buffer size = 59,514.83 MiB ← exceeds 32 GiB physical VRAM </span><span>load_tensors: CUDA_Host model buffer = 161,191.63 MiB </span><span> </span><span>graph splits = 651 (with bs=512), 5 (with bs=1)</span></code></pre></div></div> <h3 class="text-text-100 mt-2 -mb-1 text-base font-bold">Experiment 2 — <code class="bg-text-200/5 border border-0.5 border-border-300 text-danger-000 whitespace-pre-wrap rounded-[0.4rem] px-1 py-px text-[0.9rem]">OLLAMA_SCHED_SPREAD=true</code></h3> <div class="relative group/copy bg-bg-000/50 border-0.5 border-border-400 rounded-lg"><div class="sticky opacity-0 group-hover/copy:opacity-100 top-2 py-2 h-12 w-0 float-right"><div class="absolute right-0 h-8 px-2 items-center inline-flex z-10"><button class="inline-flex items-center justify-center relative shrink-0 can-focus select-none disabled:pointer-events-none disabled:opacity-50 disabled:shadow-none disabled:drop-shadow-none border-transparent transition font-base duration-300 ease-[cubic-bezier(0.165,0.85,0.45,1)] h-8 w-8 rounded-md active:scale-95 backdrop-blur-md Button_ghost__BUAoh" type="button" aria-label="Copy to clipboard" data-state="closed"><div class="relative"><div class="transition-all opacity-100 scale-100" style="width: 20px; height: 20px; display: flex; align-items: center; justify-content: center;"><svg width="20" height="20" viewBox="0 0 20 20" fill="currentColor" xmlns="http://www.w3.org/2000/svg" style="flex-shrink: 0;" class="transition-all opacity-100 scale-100" aria-hidden="true"><path d="M12.5 3C13.3284 3 14 3.67157 14 4.5V6H15.5C16.3284 6 17 6.67157 17 7.5V15.5C17 16.3284 16.3284 17 15.5 17H7.5C6.67157 17 6 16.3284 6 15.5V14H4.5C3.67157 14 3 13.3284 3 12.5V4.5C3 3.67157 3.67157 3 4.5 3H12.5ZM14 12.5C14 13.3284 13.3284 14 12.5 14H7V15.5C7 15.7761 7.22386 16 7.5 16H15.5C15.7761 16 16 15.7761 16 15.5V7.5C16 7.22386 15.7761 7 15.5 7H14V12.5ZM4.5 4C4.22386 4 4 4.22386 4 4.5V12.5C4 12.7761 4.22386 13 4.5 13H12.5C12.7761 13 13 12.7761 13 12.5V4.5C13 4.22386 12.7761 4 12.5 4H4.5Z"></path></svg></div><div class="absolute inset-0 flex items-center justify-center"><div class="transition-all opacity-0 scale-50" style="width: 20px; height: 20px; display: flex; align-items: center; 
justify-content: center;"><svg width="20" height="20" viewBox="0 0 20 20" fill="currentColor" xmlns="http://www.w3.org/2000/svg" style="flex-shrink: 0;" class="transition-all opacity-0 scale-50" aria-hidden="true"><path d="M15.1883 5.10908C15.3699 4.96398 15.6346 4.96153 15.8202 5.11592C16.0056 5.27067 16.0504 5.53125 15.9403 5.73605L15.8836 5.82003L8.38354 14.8202C8.29361 14.9279 8.16242 14.9925 8.02221 14.9989C7.88203 15.0051 7.74545 14.9526 7.64622 14.8534L4.14617 11.3533L4.08172 11.2752C3.95384 11.0811 3.97542 10.817 4.14617 10.6463C4.31693 10.4755 4.58105 10.4539 4.77509 10.5818L4.85321 10.6463L7.96556 13.7586L15.1161 5.1794L15.1883 5.10908Z"></path></svg></div></div></div></button></div></div><div class="overflow-x-auto"><pre class="code-block__code !my-0 !rounded-lg !text-sm !leading-relaxed p-3.5" style="color: rgb(20, 24, 31); background: transparent; font-family: var(--font-mono);"><code style="color: rgb(20, 24, 31); background: transparent; font-family: var(--font-mono); white-space: pre-wrap;"><span><span>GPULayers:19[ </span></span><span> GPU-dfc3d6a8 (V100-32GB): Layers:16 (43..58) ← IDENTICAL </span><span> GPU-7a420261 (RTX4090 #1): Layers:2 (59..60) </span><span> GPU-0dcf0ac3 (RTX4090 #2): Layers:1 (61..61) </span><span>] </span><span> </span><span>load_tensors: CUDA0 model buffer size = 7,439.35 MiB </span><span>load_tensors: CUDA1 model buffer size = 3,719.68 MiB </span><span>load_tensors: CUDA2 model buffer size = 59,514.83 MiB ← IDENTICAL </span><span>load_tensors: CUDA_Host model buffer = 161,191.63 MiB </span><span> </span><span>graph splits = 651 (with bs=512), 5 (with bs=1)</span></code></pre></div></div> <h2 class="text-text-100 mt-3 -mb-1 text-[1.125rem] font-bold">Root cause hypothesis</h2> <p class="font-claude-response-body break-words whitespace-normal leading-[1.7]"><code class="bg-text-200/5 border border-0.5 border-border-300 text-danger-000 whitespace-pre-wrap rounded-[0.4rem] px-1 py-px text-[0.9rem]">minimax-m2</code> uses <strong>256 experts per MoE layer</strong> with <code class="bg-text-200/5 border border-0.5 border-border-300 text-danger-000 whitespace-pre-wrap rounded-[0.4rem] px-1 py-px text-[0.9rem]">expert_feed_forward_length=1536</code> and <code class="bg-text-200/5 border border-0.5 border-border-300 text-danger-000 whitespace-pre-wrap rounded-[0.4rem] px-1 py-px text-[0.9rem]">embedding_length=3072</code>. At Q8_0, each MoE layer weighs approximately <strong>3.6 GiB</strong> in actual allocated buffers. The scheduler's per-layer VRAM cost estimator is clearly computing a much smaller value — otherwise it would never plan to fit 16 such layers (~58 GiB) onto a 32 GiB GPU.</p> <p class="font-claude-response-body break-words whitespace-normal leading-[1.7]">The GPUs are sorted largest-VRAM-first (V100 at 32 GiB before both 4090s at 24 GiB), so the greedy packer fills the V100 until its estimated budget is exhausted. 
Because the estimate is wrong, it over-assigns by ~2×, and llama.cpp silently backs the overflow with pinned host RAM.</p> <p class="font-claude-response-body break-words whitespace-normal leading-[1.7]">The extremely high graph split count (<code class="bg-text-200/5 border border-0.5 border-border-300 text-danger-000 whitespace-pre-wrap rounded-[0.4rem] px-1 py-px text-[0.9rem]">651</code> at bs=512 vs <code class="bg-text-200/5 border border-0.5 border-border-300 text-danger-000 whitespace-pre-wrap rounded-[0.4rem] px-1 py-px text-[0.9rem]">5</code> at bs=1) is a secondary symptom of the resulting cross-device tensor fragmentation.</p> <p class="font-claude-response-body break-words whitespace-normal leading-[1.7]">The bug likely lives in the per-layer cost estimation for <code class="bg-text-200/5 border border-0.5 border-border-300 text-danger-000 whitespace-pre-wrap rounded-[0.4rem] px-1 py-px text-[0.9rem]">minimax-m2</code> in <code class="bg-text-200/5 border border-0.5 border-border-300 text-danger-000 whitespace-pre-wrap rounded-[0.4rem] px-1 py-px text-[0.9rem]">device.go</code> (or equivalent scheduler weight logic), not in the spread/pack placement policy.</p><!--EndFragment--> </body> </html> ### Relevant log output ```shell ollama.service: Consumed 33min 52.940s CPU time, 218.4G memory peak, 5.6G memory swap peak. Feb 21 20:36:21 systemd[1]: Started ollama.service - Ollama Service. Feb 21 20:36:21 ollama[539850]: time=2026-02-21T20:36:21.938+03:00 level=INFO source=routes.go:1663 msg="server config" env="map[CUDA_VISIBLE_DEVICES:0,1,2 GGML_VK_VISIBLE_DEVICES: GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: HTTPS_PROXY: HTTP_PROXY: NO_PROXY: OLLAMA_CONTEXT_LENGTH:131072 OLLAMA_DEBUG:INFO OLLAMA_EDITOR: OLLAMA_FLASH_ATTENTION:true OLLAMA_GPU_OVERHEAD:0 OLLAMA_HOST:http://0.0.0.0:11434 OLLAMA_KEEP_ALIVE:1h0m0s OLLAMA_KV_CACHE_TYPE: OLLAMA_LLM_LIBRARY: OLLAMA_LOAD_TIMEOUT:30m0s OLLAMA_MAX_LOADED_MODELS:0 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:/ai/llm/models OLLAMA_MULTIUSER_CACHE:false OLLAMA_NEW_ENGINE:true OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NO_CLOUD:false OLLAMA_NUM_PARALLEL:1 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://* vscode-webview://* vscode-file://*] OLLAMA_REMOTES:[ollama.com] OLLAMA_SCHED_SPREAD:true OLLAMA_VULKAN:false ROCR_VISIBLE_DEVICES: http_proxy: https_proxy: no_proxy:]" Feb 21 20:36:21 ollama[539850]: time=2026-02-21T20:36:21.938+03:00 level=INFO source=routes.go:1665 msg="Ollama cloud disabled: false" Feb 21 20:36:21 ollama[539850]: time=2026-02-21T20:36:21.949+03:00 level=INFO source=images.go:473 msg="total blobs: 22" Feb 21 20:36:21 ollama[539850]: time=2026-02-21T20:36:21.949+03:00 level=INFO source=images.go:480 msg="total unused blobs removed: 0" Feb 21 20:36:21 ollama[539850]: time=2026-02-21T20:36:21.950+03:00 level=INFO source=routes.go:1718 msg="Listening on [::]:11434 (version 0.16.3)" Feb 21 20:36:21 ollama[539850]: time=2026-02-21T20:36:21.951+03:00 level=INFO source=runner.go:67 msg="discovering available GPUs..." 
Feb 21 20:36:21 ollama[539850]: time=2026-02-21T20:36:21.951+03:00 level=WARN source=runner.go:485 msg="user overrode visible devices" CUDA_VISIBLE_DEVICES=0,1,2
Feb 21 20:36:21 ollama[539850]: time=2026-02-21T20:36:21.951+03:00 level=WARN source=runner.go:489 msg="if GPUs are not correctly discovered, unset and try again"
Feb 21 20:36:21 ollama[539850]: time=2026-02-21T20:36:21.951+03:00 level=INFO source=runner.go:106 msg="experimental Vulkan support disabled. To enable, set OLLAMA_VULKAN=1"
Feb 21 20:36:21 ollama[539850]: time=2026-02-21T20:36:21.951+03:00 level=INFO source=server.go:431 msg="starting runner" cmd="/usr/local/bin/ollama runner --ollama-engine --port 37329"
Feb 21 20:36:22 ollama[539850]: time=2026-02-21T20:36:22.657+03:00 level=INFO source=server.go:431 msg="starting runner" cmd="/usr/local/bin/ollama runner --ollama-engine --port 42953"
Feb 21 20:36:23 ollama[539850]: time=2026-02-21T20:36:23.391+03:00 level=INFO source=server.go:431 msg="starting runner" cmd="/usr/local/bin/ollama runner --ollama-engine --port 36173"
Feb 21 20:36:23 ollama[539850]: time=2026-02-21T20:36:23.391+03:00 level=INFO source=server.go:431 msg="starting runner" cmd="/usr/local/bin/ollama runner --ollama-engine --port 44529"
Feb 21 20:36:23 ollama[539850]: time=2026-02-21T20:36:23.391+03:00 level=INFO source=server.go:431 msg="starting runner" cmd="/usr/local/bin/ollama runner --ollama-engine --port 43121"
Feb 21 20:36:23 ollama[539850]: time=2026-02-21T20:36:23.391+03:00 level=INFO source=server.go:431 msg="starting runner" cmd="/usr/local/bin/ollama runner --ollama-engine --port 46013"
Feb 21 20:36:23 ollama[539850]: time=2026-02-21T20:36:23.391+03:00 level=INFO source=server.go:431 msg="starting runner" cmd="/usr/local/bin/ollama runner --ollama-engine --port 46881"
Feb 21 20:36:23 ollama[539850]: time=2026-02-21T20:36:23.391+03:00 level=INFO source=server.go:431 msg="starting runner" cmd="/usr/local/bin/ollama runner --ollama-engine --port 39921"
Feb 21 20:36:23 ollama[539850]: time=2026-02-21T20:36:23.726+03:00 level=INFO source=types.go:42 msg="inference compute" id=GPU-dfc3d6a8-e942-e903-be73-9b14ce01db39 filter_id="" library=CUDA compute=7.0 name=CUDA2 description="Tesla V100-SXM2-32GB" libdirs=ollama,cuda_v12 driver=13.0 pci_id=0000:09:00.0 type=discrete total="32.0 GiB" available="31.4 GiB"
Feb 21 20:36:23 ollama[539850]: time=2026-02-21T20:36:23.726+03:00 level=INFO source=types.go:42 msg="inference compute" id=GPU-7a420261-3e37-b15a-2f9b-2fd7da322dbe filter_id="" library=CUDA compute=8.9 name=CUDA0 description="NVIDIA GeForce RTX 4090" libdirs=ollama,cuda_v12 driver=13.0 pci_id=0000:01:00.0 type=discrete total="24.0 GiB" available="23.5 GiB"
Feb 21 20:36:23 ollama[539850]: time=2026-02-21T20:36:23.726+03:00 level=INFO source=types.go:42 msg="inference compute" id=GPU-0dcf0ac3-bef6-f019-91c8-276cd8e2c0c0 filter_id="" library=CUDA compute=8.9 name=CUDA1 description="NVIDIA GeForce RTX 4090" libdirs=ollama,cuda_v12 driver=13.0 pci_id=0000:03:00.0 type=discrete total="24.0 GiB" available="23.1 GiB"
Feb 21 20:36:23 ollama[539850]: time=2026-02-21T20:36:23.726+03:00 level=INFO source=routes.go:1768 msg="vram-based default context" total_vram="80.0 GiB" default_num_ctx=262144
Feb 21 20:36:28 ollama[539850]: time=2026-02-21T20:36:28.802+03:00 level=INFO source=server.go:431 msg="starting runner" cmd="/usr/local/bin/ollama runner --ollama-engine --port 43123"
Feb 21 20:36:29 ollama[539850]: llama_model_loader: loaded meta data with 38 key-value pairs and 809 tensors from /ai/llm/models/blobs/sha256-f4f7f4e8e9f04b21d5c5a277223592388f17ae22a6e08f2ae1ab12bef6f9fca3 (version GGUF V3 (latest))
Feb 21 20:36:29 ollama[539850]: llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
Feb 21 20:36:29 ollama[539850]: llama_model_loader: - kv 0: general.architecture str = minimax-m2
Feb 21 20:36:29 ollama[539850]: llama_model_loader: - kv 1: general.file_type u32 = 7
Feb 21 20:36:29 ollama[539850]: llama_model_loader: - kv 2: general.license str = other
Feb 21 20:36:29 ollama[539850]: llama_model_loader: - kv 3: general.license.link str = https://github.com/MiniMax-AI/MiniMax...
Feb 21 20:36:29 ollama[539850]: llama_model_loader: - kv 4: general.license.name str = modified-mit
Feb 21 20:36:29 ollama[539850]: llama_model_loader: - kv 5: general.name str = Workdir
Feb 21 20:36:29 ollama[539850]: llama_model_loader: - kv 6: general.parameter_count u64 = 228689764864
Feb 21 20:36:29 ollama[539850]: llama_model_loader: - kv 7: general.quantization_version u32 = 2
Feb 21 20:36:29 ollama[539850]: llama_model_loader: - kv 8: general.sampling.temp f32 = 1.000000
Feb 21 20:36:29 ollama[539850]: llama_model_loader: - kv 9: general.sampling.top_k i32 = 40
Feb 21 20:36:29 ollama[539850]: llama_model_loader: - kv 10: general.sampling.top_p f32 = 0.950000
Feb 21 20:36:29 ollama[539850]: llama_model_loader: - kv 11: general.size_label str = 256x4.9B
Feb 21 20:36:29 ollama[539850]: llama_model_loader: - kv 12: general.tags arr[str,1] = ["text-generation"]
Feb 21 20:36:29 ollama[539850]: llama_model_loader: - kv 13: general.type str = model
Feb 21 20:36:29 ollama[539850]: llama_model_loader: - kv 14: minimax-m2.attention.head_count u32 = 48
Feb 21 20:36:29 ollama[539850]: llama_model_loader: - kv 15: minimax-m2.attention.head_count_kv u32 = 8
Feb 21 20:36:29 ollama[539850]: llama_model_loader: - kv 16: minimax-m2.attention.key_length u32 = 128
Feb 21 20:36:29 ollama[539850]: llama_model_loader: - kv 17: minimax-m2.attention.layer_norm_rms_epsilon f32 = 0.000001
Feb 21 20:36:29 ollama[539850]: llama_model_loader: - kv 18: minimax-m2.attention.value_length u32 = 128
Feb 21 20:36:29 ollama[539850]: llama_model_loader: - kv 19: minimax-m2.block_count u32 = 62
Feb 21 20:36:29 ollama[539850]: llama_model_loader: - kv 20: minimax-m2.context_length u32 = 196608
Feb 21 20:36:29 ollama[539850]: llama_model_loader: - kv 21: minimax-m2.embedding_length u32 = 3072
Feb 21 20:36:29 ollama[539850]: llama_model_loader: - kv 22: minimax-m2.expert_count u32 = 256
Feb 21 20:36:29 ollama[539850]: llama_model_loader: - kv 23: minimax-m2.expert_feed_forward_length u32 = 1536
Feb 21 20:36:29 ollama[539850]: llama_model_loader: - kv 24: minimax-m2.expert_gating_func u32 = 2
Feb 21 20:36:29 ollama[539850]: llama_model_loader: - kv 25: minimax-m2.expert_used_count u32 = 8
Feb 21 20:36:29 ollama[539850]: llama_model_loader: - kv 26: minimax-m2.feed_forward_length u32 = 1536
Feb 21 20:36:29 ollama[539850]: llama_model_loader: - kv 27: minimax-m2.rope.dimension_count u32 = 64
Feb 21 20:36:29 ollama[539850]: llama_model_loader: - kv 28: minimax-m2.rope.freq_base f32 = 5000000.000000
Feb 21 20:36:29 ollama[539850]: llama_model_loader: - kv 29: tokenizer.chat_template str = {# ----------‑‑‑ special token ...
Feb 21 20:36:29 ollama[539850]: llama_model_loader: - kv 30: tokenizer.ggml.bos_token_id u32 = 200034
Feb 21 20:36:29 ollama[539850]: llama_model_loader: - kv 31: tokenizer.ggml.eos_token_id u32 = 200020
Feb 21 20:36:29 ollama[539850]: llama_model_loader: - kv 32: tokenizer.ggml.merges arr[str,199744] = ["Ġ Ġ", "Ġ t", "Ġ a", "i n", "e r...
Feb 21 20:36:29 ollama[539850]: llama_model_loader: - kv 33: tokenizer.ggml.model str = gpt2
Feb 21 20:36:29 ollama[539850]: llama_model_loader: - kv 34: tokenizer.ggml.pre str = minimax-m2
Feb 21 20:36:29 ollama[539850]: llama_model_loader: - kv 35: tokenizer.ggml.token_type arr[i32,200064] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
Feb 21 20:36:29 ollama[539850]: llama_model_loader: - kv 36: tokenizer.ggml.tokens arr[str,200064] = ["Ā", "ā", "Ă", "ă", "Ą", "ą", ...
Feb 21 20:36:29 ollama[539850]: llama_model_loader: - kv 37: tokenizer.ggml.unknown_token_id u32 = 200021
Feb 21 20:36:29 ollama[539850]: llama_model_loader: - type f32: 373 tensors
Feb 21 20:36:29 ollama[539850]: llama_model_loader: - type q8_0: 436 tensors
Feb 21 20:36:29 ollama[539850]: print_info: file format = GGUF V3 (latest)
Feb 21 20:36:29 ollama[539850]: print_info: file type = Q8_0
Feb 21 20:36:29 ollama[539850]: print_info: file size = 226.43 GiB (8.51 BPW)
Feb 21 20:36:29 ollama[539850]: load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
Feb 21 20:36:29 ollama[539850]: load: printing all EOG tokens:
Feb 21 20:36:29 ollama[539850]: load: - 200004 ('<fim_pad>')
Feb 21 20:36:29 ollama[539850]: load: - 200005 ('<reponame>')
Feb 21 20:36:29 ollama[539850]: load: - 200020 ('[e~[')
Feb 21 20:36:29 ollama[539850]: load: special tokens cache size = 54
Feb 21 20:36:29 ollama[539850]: load: token to piece cache size = 1.3355 MB
Feb 21 20:36:29 ollama[539850]: print_info: arch = minimax-m2
Feb 21 20:36:29 ollama[539850]: print_info: vocab_only = 1
Feb 21 20:36:29 ollama[539850]: print_info: no_alloc = 0
Feb 21 20:36:29 ollama[539850]: print_info: model type = ?B
Feb 21 20:36:29 ollama[539850]: print_info: model params = 228.69 B
Feb 21 20:36:29 ollama[539850]: print_info: general.name = Workdir
Feb 21 20:36:29 ollama[539850]: print_info: vocab type = BPE
Feb 21 20:36:29 ollama[539850]: print_info: n_vocab = 200064
Feb 21 20:36:29 ollama[539850]: print_info: n_merges = 199744
Feb 21 20:36:29 ollama[539850]: print_info: BOS token = 200034 ']~!b['
Feb 21 20:36:29 ollama[539850]: print_info: EOS token = 200020 '[e~['
Feb 21 20:36:29 ollama[539850]: print_info: UNK token = 200021 ']!d~['
Feb 21 20:36:29 ollama[539850]: print_info: LF token = 10 'Ċ'
Feb 21 20:36:29 ollama[539850]: print_info: FIM PRE token = 200001 '<fim_prefix>'
Feb 21 20:36:29 ollama[539850]: print_info: FIM SUF token = 200003 '<fim_suffix>'
Feb 21 20:36:29 ollama[539850]: print_info: FIM MID token = 200002 '<fim_middle>'
Feb 21 20:36:29 ollama[539850]: print_info: FIM PAD token = 200004 '<fim_pad>'
Feb 21 20:36:29 ollama[539850]: print_info: FIM REP token = 200005 '<reponame>'
Feb 21 20:36:29 ollama[539850]: print_info: EOG token = 200004 '<fim_pad>'
Feb 21 20:36:29 ollama[539850]: print_info: EOG token = 200005 '<reponame>'
Feb 21 20:36:29 ollama[539850]: print_info: EOG token = 200020 '[e~['
Feb 21 20:36:29 ollama[539850]: print_info: max token length = 256
Feb 21 20:36:29 ollama[539850]: llama_model_load: vocab only - skipping tensors
Feb 21 20:36:29 ollama[539850]: time=2026-02-21T20:36:29.386+03:00 level=INFO source=server.go:431 msg="starting runner" cmd="/usr/local/bin/ollama runner --model /ai/llm/models/blobs/sha256-f4f7f4e8e9f04b21d5c5a277223592388f17ae22a6e08f2ae1ab12bef6f9fca3 --port 42699"
Feb 21 20:36:29 ollama[539850]: time=2026-02-21T20:36:29.386+03:00 level=INFO source=sched.go:491 msg="system memory" total="245.1 GiB" free="239.3 GiB" free_swap="6.8 GiB"
Feb 21 20:36:29 ollama[539850]: time=2026-02-21T20:36:29.386+03:00 level=INFO source=sched.go:498 msg="gpu memory" id=GPU-7a420261-3e37-b15a-2f9b-2fd7da322dbe library=CUDA available="23.1 GiB" free="23.5 GiB" minimum="457.0 MiB" overhead="0 B"
Feb 21 20:36:29 ollama[539850]: time=2026-02-21T20:36:29.386+03:00 level=INFO source=sched.go:498 msg="gpu memory" id=GPU-0dcf0ac3-bef6-f019-91c8-276cd8e2c0c0 library=CUDA available="22.7 GiB" free="23.1 GiB" minimum="457.0 MiB" overhead="0 B"
Feb 21 20:36:29 ollama[539850]: time=2026-02-21T20:36:29.386+03:00 level=INFO source=sched.go:498 msg="gpu memory" id=GPU-dfc3d6a8-e942-e903-be73-9b14ce01db39 library=CUDA available="31.0 GiB" free="31.4 GiB" minimum="457.0 MiB" overhead="0 B"
Feb 21 20:36:29 ollama[539850]: time=2026-02-21T20:36:29.386+03:00 level=INFO source=server.go:498 msg="loading model" "model layers"=63 requested=19
Feb 21 20:36:29 ollama[539850]: time=2026-02-21T20:36:29.386+03:00 level=INFO source=device.go:240 msg="model weights" device=CUDA0 size="7.3 GiB"
Feb 21 20:36:29 ollama[539850]: time=2026-02-21T20:36:29.386+03:00 level=INFO source=device.go:240 msg="model weights" device=CUDA1 size="3.6 GiB"
Feb 21 20:36:29 ollama[539850]: time=2026-02-21T20:36:29.387+03:00 level=INFO source=device.go:240 msg="model weights" device=CUDA2 size="58.1 GiB"
Feb 21 20:36:29 ollama[539850]: time=2026-02-21T20:36:29.387+03:00 level=INFO source=device.go:245 msg="model weights" device=CPU size="156.8 GiB"
Feb 21 20:36:29 ollama[539850]: time=2026-02-21T20:36:29.387+03:00 level=INFO source=device.go:251 msg="kv cache" device=CUDA0 size="585.9 MiB"
Feb 21 20:36:29 ollama[539850]: time=2026-02-21T20:36:29.387+03:00 level=INFO source=device.go:251 msg="kv cache" device=CUDA1 size="293.0 MiB"
Feb 21 20:36:29 ollama[539850]: time=2026-02-21T20:36:29.387+03:00 level=INFO source=device.go:251 msg="kv cache" device=CUDA2 size="4.6 GiB"
Feb 21 20:36:29 ollama[539850]: time=2026-02-21T20:36:29.387+03:00 level=INFO source=device.go:256 msg="kv cache" device=CPU size="12.3 GiB"
Feb 21 20:36:29 ollama[539850]: time=2026-02-21T20:36:29.387+03:00 level=INFO source=device.go:262 msg="compute graph" device=CUDA0 size="17.7 GiB"
Feb 21 20:36:29 ollama[539850]: time=2026-02-21T20:36:29.387+03:00 level=INFO source=device.go:262 msg="compute graph" device=CUDA1 size="17.7 GiB"
Feb 21 20:36:29 ollama[539850]: time=2026-02-21T20:36:29.387+03:00 level=INFO source=device.go:262 msg="compute graph" device=CUDA2 size="17.7 GiB"
Feb 21 20:36:29 ollama[539850]: time=2026-02-21T20:36:29.387+03:00 level=INFO source=device.go:272 msg="total memory" size="296.8 GiB"
Feb 21 20:36:29 ollama[539850]: time=2026-02-21T20:36:29.392+03:00 level=INFO source=runner.go:965 msg="starting go runner"
Feb 21 20:36:29 ollama[539850]: load_backend: loaded CPU backend from /usr/local/lib/ollama/libggml-cpu-icelake.so
Feb 21 20:36:29 ollama[539850]: ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
Feb 21 20:36:29 ollama[539850]: ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
Feb 21 20:36:29 ollama[539850]: ggml_cuda_init: found 3 CUDA devices:
Feb 21 20:36:29 ollama[539850]: Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes, ID: GPU-7a420261-3e37-b15a-2f9b-2fd7da322dbe
Feb 21 20:36:29 ollama[539850]: Device 1: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes, ID: GPU-0dcf0ac3-bef6-f019-91c8-276cd8e2c0c0
Feb 21 20:36:29 ollama[539850]: Device 2: Tesla V100-SXM2-32GB, compute capability 7.0, VMM: yes, ID: GPU-dfc3d6a8-e942-e903-be73-9b14ce01db39
Feb 21 20:36:29 ollama[539850]: load_backend: loaded CUDA backend from /usr/local/lib/ollama/cuda_v12/libggml-cuda.so
Feb 21 20:36:29 ollama[539850]: time=2026-02-21T20:36:29.537+03:00 level=INFO source=ggml.go:104 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.BMI2=1 CPU.0.AVX512=1 CPU.0.AVX512_VBMI=1 CPU.0.AVX512_VNNI=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 CUDA.0.ARCHS=500,520,600,610,700,750,800,860,890,900,1200 CUDA.0.USE_GRAPHS=1 CUDA.0.PEER_MAX_BATCH_SIZE=128 CUDA.1.ARCHS=500,520,600,610,700,750,800,860,890,900,1200 CUDA.1.USE_GRAPHS=1 CUDA.1.PEER_MAX_BATCH_SIZE=128 CUDA.2.ARCHS=500,520,600,610,700,750,800,860,890,900,1200 CUDA.2.USE_GRAPHS=1 CUDA.2.PEER_MAX_BATCH_SIZE=128 compiler=cgo(gcc)
Feb 21 20:36:29 ollama[539850]: time=2026-02-21T20:36:29.537+03:00 level=INFO source=runner.go:1001 msg="Server listening on 127.0.0.1:42699"
Feb 21 20:36:29 ollama[539850]: time=2026-02-21T20:36:29.541+03:00 level=INFO source=runner.go:895 msg=load request="{Operation:commit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:Enabled KvSize:75000 KvCacheType: NumThreads:16 GPULayers:19[ID:GPU-dfc3d6a8-e942-e903-be73-9b14ce01db39 Layers:16(43..58) ID:GPU-7a420261-3e37-b15a-2f9b-2fd7da322dbe Layers:2(59..60) ID:GPU-0dcf0ac3-bef6-f019-91c8-276cd8e2c0c0 Layers:1(61..61)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
Feb 21 20:36:29 ollama[539850]: time=2026-02-21T20:36:29.541+03:00 level=INFO source=server.go:1350 msg="waiting for llama runner to start responding"
Feb 21 20:36:29 ollama[539850]: time=2026-02-21T20:36:29.541+03:00 level=INFO source=server.go:1384 msg="waiting for server to become available" status="llm server loading model"
Feb 21 20:36:29 ollama[539850]: ggml_backend_cuda_device_get_memory device GPU-dfc3d6a8-e942-e903-be73-9b14ce01db39 utilizing NVML memory reporting free: 33746845696 total: 34359738368
Feb 21 20:36:29 ollama[539850]: llama_model_load_from_file_impl: using device CUDA2 (Tesla V100-SXM2-32GB) (0000:09:00.0) - 32183 MiB free
Feb 21 20:36:29 ollama[539850]: ggml_backend_cuda_device_get_memory device GPU-7a420261-3e37-b15a-2f9b-2fd7da322dbe utilizing NVML memory reporting free: 24836243456 total: 25757220864
Feb 21 20:36:29 ollama[539850]: llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 4090) (0000:01:00.0) - 23685 MiB free
Feb 21 20:36:29 ollama[539850]: ggml_backend_cuda_device_get_memory device GPU-0dcf0ac3-bef6-f019-91c8-276cd8e2c0c0 utilizing NVML memory reporting free: 24836243456 total: 25757220864
Feb 21 20:36:29 ollama[539850]: llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 4090) (0000:03:00.0) - 23685 MiB free
Feb 21 20:36:29 ollama[539850]: llama_model_loader: loaded meta data with 38 key-value pairs and 809 tensors from /ai/llm/models/blobs/sha256-f4f7f4e8e9f04b21d5c5a277223592388f17ae22a6e08f2ae1ab12bef6f9fca3 (version GGUF V3 (latest))
Feb 21 20:36:29 ollama[539850]: llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
Feb 21 20:36:29 ollama[539850]: llama_model_loader: - kv 0: general.architecture str = minimax-m2
Feb 21 20:36:29 ollama[539850]: llama_model_loader: - kv 1: general.file_type u32 = 7
Feb 21 20:36:29 ollama[539850]: llama_model_loader: - kv 2: general.license str = other
Feb 21 20:36:29 ollama[539850]: llama_model_loader: - kv 3: general.license.link str = https://github.com/MiniMax-AI/MiniMax...
Feb 21 20:36:29 ollama[539850]: llama_model_loader: - kv 4: general.license.name str = modified-mit
Feb 21 20:36:29 ollama[539850]: llama_model_loader: - kv 5: general.name str = Workdir
Feb 21 20:36:29 ollama[539850]: llama_model_loader: - kv 6: general.parameter_count u64 = 228689764864
Feb 21 20:36:29 ollama[539850]: llama_model_loader: - kv 7: general.quantization_version u32 = 2
Feb 21 20:36:29 ollama[539850]: llama_model_loader: - kv 8: general.sampling.temp f32 = 1.000000
Feb 21 20:36:29 ollama[539850]: llama_model_loader: - kv 9: general.sampling.top_k i32 = 40
Feb 21 20:36:29 ollama[539850]: llama_model_loader: - kv 10: general.sampling.top_p f32 = 0.950000
Feb 21 20:36:29 ollama[539850]: llama_model_loader: - kv 11: general.size_label str = 256x4.9B
Feb 21 20:36:29 ollama[539850]: llama_model_loader: - kv 12: general.tags arr[str,1] = ["text-generation"]
Feb 21 20:36:29 ollama[539850]: llama_model_loader: - kv 13: general.type str = model
Feb 21 20:36:29 ollama[539850]: llama_model_loader: - kv 14: minimax-m2.attention.head_count u32 = 48
Feb 21 20:36:29 ollama[539850]: llama_model_loader: - kv 15: minimax-m2.attention.head_count_kv u32 = 8
Feb 21 20:36:29 ollama[539850]: llama_model_loader: - kv 16: minimax-m2.attention.key_length u32 = 128
Feb 21 20:36:29 ollama[539850]: llama_model_loader: - kv 17: minimax-m2.attention.layer_norm_rms_epsilon f32 = 0.000001
Feb 21 20:36:29 ollama[539850]: llama_model_loader: - kv 18: minimax-m2.attention.value_length u32 = 128
Feb 21 20:36:29 ollama[539850]: llama_model_loader: - kv 19: minimax-m2.block_count u32 = 62
Feb 21 20:36:29 ollama[539850]: llama_model_loader: - kv 20: minimax-m2.context_length u32 = 196608
Feb 21 20:36:29 ollama[539850]: llama_model_loader: - kv 21: minimax-m2.embedding_length u32 = 3072
Feb 21 20:36:29 ollama[539850]: llama_model_loader: - kv 22: minimax-m2.expert_count u32 = 256
Feb 21 20:36:29 ollama[539850]: llama_model_loader: - kv 23: minimax-m2.expert_feed_forward_length u32 = 1536
Feb 21 20:36:29 ollama[539850]: llama_model_loader: - kv 24: minimax-m2.expert_gating_func u32 = 2
Feb 21 20:36:29 ollama[539850]: llama_model_loader: - kv 25: minimax-m2.expert_used_count u32 = 8
Feb 21 20:36:29 ollama[539850]: llama_model_loader: - kv 26: minimax-m2.feed_forward_length u32 = 1536
Feb 21 20:36:29 ollama[539850]: llama_model_loader: - kv 27: minimax-m2.rope.dimension_count u32 = 64
Feb 21 20:36:29 ollama[539850]: llama_model_loader: - kv 28: minimax-m2.rope.freq_base f32 = 5000000.000000
Feb 21 20:36:29 ollama[539850]: llama_model_loader: - kv 29: tokenizer.chat_template str = {# ----------‑‑‑ special token ...
Feb 21 20:36:29 ollama[539850]: llama_model_loader: - kv 30: tokenizer.ggml.bos_token_id u32 = 200034
Feb 21 20:36:29 ollama[539850]: llama_model_loader: - kv 31: tokenizer.ggml.eos_token_id u32 = 200020
Feb 21 20:36:29 ollama[539850]: llama_model_loader: - kv 32: tokenizer.ggml.merges arr[str,199744] = ["Ġ Ġ", "Ġ t", "Ġ a", "i n", "e r...
Feb 21 20:36:29 ollama[539850]: llama_model_loader: - kv 33: tokenizer.ggml.model str = gpt2
Feb 21 20:36:29 ollama[539850]: llama_model_loader: - kv 34: tokenizer.ggml.pre str = minimax-m2
Feb 21 20:36:29 ollama[539850]: llama_model_loader: - kv 35: tokenizer.ggml.token_type arr[i32,200064] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
Feb 21 20:36:29 ollama[539850]: llama_model_loader: - kv 36: tokenizer.ggml.tokens arr[str,200064] = ["Ā", "ā", "Ă", "ă", "Ą", "ą", ...
Feb 21 20:36:29 ollama[539850]: llama_model_loader: - kv 37: tokenizer.ggml.unknown_token_id u32 = 200021
Feb 21 20:36:29 ollama[539850]: llama_model_loader: - type f32: 373 tensors
Feb 21 20:36:29 ollama[539850]: llama_model_loader: - type q8_0: 436 tensors
Feb 21 20:36:29 ollama[539850]: print_info: file format = GGUF V3 (latest)
Feb 21 20:36:29 ollama[539850]: print_info: file type = Q8_0
Feb 21 20:36:29 ollama[539850]: print_info: file size = 226.43 GiB (8.51 BPW)
Feb 21 20:36:29 ollama[539850]: load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
Feb 21 20:36:29 ollama[539850]: load: printing all EOG tokens:
Feb 21 20:36:29 ollama[539850]: load: - 200004 ('<fim_pad>')
Feb 21 20:36:29 ollama[539850]: load: - 200005 ('<reponame>')
Feb 21 20:36:29 ollama[539850]: load: - 200020 ('[e~[')
Feb 21 20:36:29 ollama[539850]: load: special tokens cache size = 54
Feb 21 20:36:29 ollama[539850]: load: token to piece cache size = 1.3355 MB
Feb 21 20:36:29 ollama[539850]: print_info: arch = minimax-m2
Feb 21 20:36:29 ollama[539850]: print_info: vocab_only = 0
Feb 21 20:36:29 ollama[539850]: print_info: no_alloc = 0
Feb 21 20:36:29 ollama[539850]: print_info: n_ctx_train = 196608
Feb 21 20:36:29 ollama[539850]: print_info: n_embd = 3072
Feb 21 20:36:29 ollama[539850]: print_info: n_embd_inp = 3072
Feb 21 20:36:29 ollama[539850]: print_info: n_layer = 62
Feb 21 20:36:29 ollama[539850]: print_info: n_head = 48
Feb 21 20:36:29 ollama[539850]: print_info: n_head_kv = 8
Feb 21 20:36:29 ollama[539850]: print_info: n_rot = 64
Feb 21 20:36:29 ollama[539850]: print_info: n_swa = 0
Feb 21 20:36:29 ollama[539850]: print_info: is_swa_any = 0
Feb 21 20:36:29 ollama[539850]: print_info: n_embd_head_k = 128
Feb 21 20:36:29 ollama[539850]: print_info: n_embd_head_v = 128
Feb 21 20:36:29 ollama[539850]: print_info: n_gqa = 6
Feb 21 20:36:29 ollama[539850]: print_info: n_embd_k_gqa = 1024
Feb 21 20:36:29 ollama[539850]: print_info: n_embd_v_gqa = 1024
Feb 21 20:36:29 ollama[539850]: print_info: f_norm_eps = 0.0e+00
Feb 21 20:36:29 ollama[539850]: print_info: f_norm_rms_eps = 1.0e-06
Feb 21 20:36:29 ollama[539850]: print_info: f_clamp_kqv = 0.0e+00
Feb 21 20:36:29 ollama[539850]: print_info: f_max_alibi_bias = 0.0e+00
Feb 21 20:36:29 ollama[539850]: print_info: f_logit_scale = 0.0e+00
Feb 21 20:36:29 ollama[539850]: print_info: f_attn_scale = 0.0e+00
Feb 21 20:36:29 ollama[539850]: print_info: n_ff = 1536
Feb 21 20:36:29 ollama[539850]: print_info: n_expert = 256
Feb 21 20:36:29 ollama[539850]: print_info: n_expert_used = 8
Feb 21 20:36:29 ollama[539850]: print_info: n_expert_groups = 0
Feb 21 20:36:29 ollama[539850]: print_info: n_group_used = 0
Feb 21 20:36:29 ollama[539850]: print_info: causal attn = 1
Feb 21 20:36:29 ollama[539850]: print_info: pooling type = 0
Feb 21 20:36:29 ollama[539850]: print_info: rope type = 2
Feb 21 20:36:29 ollama[539850]: print_info: rope scaling = linear
Feb 21 20:36:29 ollama[539850]: print_info: freq_base_train = 5000000.0
Feb 21 20:36:29 ollama[539850]: print_info: freq_scale_train = 1
Feb 21 20:36:29 ollama[539850]: print_info: n_ctx_orig_yarn = 196608
Feb 21 20:36:29 ollama[539850]: print_info: rope_yarn_log_mul= 0.0000
Feb 21 20:36:29 ollama[539850]: print_info: rope_finetuned = unknown
Feb 21 20:36:29 ollama[539850]: print_info: model type = 230B.A10B
Feb 21 20:36:29 ollama[539850]: print_info: model params = 228.69 B
Feb 21 20:36:29 ollama[539850]: print_info: general.name = Workdir
Feb 21 20:36:29 ollama[539850]: print_info: vocab type = BPE
Feb 21 20:36:29 ollama[539850]: print_info: n_vocab = 200064
Feb 21 20:36:29 ollama[539850]: print_info: n_merges = 199744
Feb 21 20:36:29 ollama[539850]: print_info: BOS token = 200034 ']~!b['
Feb 21 20:36:29 ollama[539850]: print_info: EOS token = 200020 '[e~['
Feb 21 20:36:29 ollama[539850]: print_info: UNK token = 200021 ']!d~['
Feb 21 20:36:29 ollama[539850]: print_info: LF token = 10 'Ċ'
Feb 21 20:36:29 ollama[539850]: print_info: FIM PRE token = 200001 '<fim_prefix>'
Feb 21 20:36:29 ollama[539850]: print_info: FIM SUF token = 200003 '<fim_suffix>'
Feb 21 20:36:29 ollama[539850]: print_info: FIM MID token = 200002 '<fim_middle>'
Feb 21 20:36:29 ollama[539850]: print_info: FIM PAD token = 200004 '<fim_pad>'
Feb 21 20:36:29 ollama[539850]: print_info: FIM REP token = 200005 '<reponame>'
Feb 21 20:36:29 ollama[539850]: print_info: EOG token = 200004 '<fim_pad>'
Feb 21 20:36:29 ollama[539850]: print_info: EOG token = 200005 '<reponame>'
Feb 21 20:36:29 ollama[539850]: print_info: EOG token = 200020 '[e~['
Feb 21 20:36:29 ollama[539850]: print_info: max token length = 256
Feb 21 20:36:29 ollama[539850]: load_tensors: loading model tensors, this can take a while... (mmap = false)
Feb 21 20:36:31 ollama[539850]: time=2026-02-21T20:36:31.997+03:00 level=INFO source=server.go:1384 msg="waiting for server to become available" status="llm server not responding"
Feb 21 20:36:32 ollama[539850]: time=2026-02-21T20:36:32.255+03:00 level=INFO source=server.go:1384 msg="waiting for server to become available" status="llm server loading model"
Feb 21 20:36:51 ollama[539850]: load_tensors: offloading 19 repeating layers to GPU
Feb 21 20:36:51 ollama[539850]: load_tensors: offloaded 19/63 layers to GPU
Feb 21 20:36:51 ollama[539850]: load_tensors: CUDA0 model buffer size = 7439.35 MiB
Feb 21 20:36:51 ollama[539850]: load_tensors: CUDA1 model buffer size = 3719.68 MiB
Feb 21 20:36:51 ollama[539850]: load_tensors: CUDA2 model buffer size = 59514.83 MiB
Feb 21 20:36:51 ollama[539850]: load_tensors: CUDA_Host model buffer size = 161191.63 MiB
Feb 21 20:38:20 ollama[539850]: llama_context: constructing llama_context
Feb 21 20:38:20 ollama[539850]: llama_context: n_seq_max = 1
Feb 21 20:38:20 ollama[539850]: llama_context: n_ctx = 75008
Feb 21 20:38:20 ollama[539850]: llama_context: n_ctx_seq = 75008
Feb 21 20:38:20 ollama[539850]: llama_context: n_batch = 512
Feb 21 20:38:20 ollama[539850]: llama_context: n_ubatch = 512
Feb 21 20:38:20 ollama[539850]: llama_context: causal_attn = 1
Feb 21 20:38:20 ollama[539850]: llama_context: flash_attn = enabled
Feb 21 20:38:20 ollama[539850]: llama_context: kv_unified = false
Feb 21 20:38:20 ollama[539850]: llama_context: freq_base = 5000000.0
Feb 21 20:38:20 ollama[539850]: llama_context: freq_scale = 1
Feb 21 20:38:20 ollama[539850]: llama_context: n_ctx_seq (75008) < n_ctx_train (196608) -- the full capacity of the model will not be utilized
Feb 21 20:38:20 ollama[539850]: llama_context: CPU output buffer size = 0.77 MiB
Feb 21 20:38:20 ollama[539850]: llama_kv_cache: CPU KV buffer size = 12599.00 MiB
Feb 21 20:38:21 ollama[539850]: llama_kv_cache: CUDA0 KV buffer size = 586.00 MiB
Feb 21 20:38:22 ollama[539850]: llama_kv_cache: CUDA1 KV buffer size = 293.00 MiB
Feb 21 20:38:22 ollama[539850]: llama_kv_cache: CUDA2 KV buffer size = 4688.00 MiB
Feb 21 20:38:24 ollama[539850]: llama_kv_cache: size = 18166.00 MiB ( 75008 cells, 62 layers, 1/1 seqs), K (f16): 9083.00 MiB, V (f16): 9083.00 MiB
Feb 21 20:38:24 ollama[539850]: llama_context: CUDA2 compute buffer size = 1760.78 MiB
Feb 21 20:38:24 ollama[539850]: llama_context: CUDA0 compute buffer size = 158.76 MiB
Feb 21 20:38:24 ollama[539850]: llama_context: CUDA1 compute buffer size = 109.26 MiB
Feb 21 20:38:24 ollama[539850]: llama_context: CUDA_Host compute buffer size = 152.51 MiB
Feb 21 20:38:24 ollama[539850]: llama_context: graph nodes = 3975
Feb 21 20:38:24 ollama[539850]: llama_context: graph splits = 651 (with bs=512), 5 (with bs=1)
Feb 21 20:38:25 ollama[539850]: time=2026-02-21T20:38:25.031+03:00 level=INFO source=server.go:1388 msg="llama runner started in 115.64 seconds"
Feb 21 20:38:25 ollama[539850]: time=2026-02-21T20:38:25.031+03:00 level=INFO source=sched.go:566 msg="loaded runners" count=1
Feb 21 20:38:25 ollama[539850]: time=2026-02-21T20:38:25.031+03:00 level=INFO source=server.go:1350 msg="waiting for llama runner to start responding"
Feb 21 20:38:25 ollama[539850]: time=2026-02-21T20:38:25.031+03:00 level=INFO source=server.go:1388 msg="llama runner started in 115.65 seconds"
Feb 21 20:42:15 ollama[539850]: [GIN] 2026/02/21 - 20:42:15 | 200 | 5m46s | 192.168.127.20 | POST "/api/chat"
```

### OS

Linux

### GPU

Nvidia

### CPU

AMD

### Ollama version

0.16.3
GiteaMirror added the bug label 2026-04-22 19:17:33 -05:00
@rick-github commented on GitHub (Feb 21, 2026):

`OLLAMA_SCHED_SPREAD` just tells ollama to spread layers across all devices rather than trying to schedule on the least number of devices. If the model is large enough to spill across multiple devices, then the setting of `OLLAMA_SCHED_SPREAD` becomes irrelevant.

The model spills into system RAM because you have `GGML_CUDA_ENABLE_UNIFIED_MEMORY` set. Remove it from the environment and the model will not allocate system RAM to a GPU.
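
For anyone checking the same thing, a quick way to confirm whether the variable is set on the systemd unit, and to remove it (a sketch assuming the `ollama.service` unit from the logs above; the exact `Environment=` line shown in the comment is a placeholder for whatever your override contains):

```bash
# List the environment configured on the unit; GGML_CUDA_ENABLE_UNIFIED_MEMORY
# will appear here if it is set for the service.
systemctl show ollama.service -p Environment

# Remove the variable from the drop-in override, then restart the service.
sudo systemctl edit ollama.service   # delete the Environment=GGML_CUDA_ENABLE_UNIFIED_MEMORY=1 line
sudo systemctl restart ollama.service
```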

@ka-admin commented on GitHub (Feb 21, 2026):

Thanks, I'll try removing `GGML_CUDA_ENABLE_UNIFIED_MEMORY`. One question though — after removing it, will the layers be distributed more evenly across all three GPUs, or will the scheduler still pile most of them onto the V100 since it has the most VRAM?

@rick-github commented on GitHub (Feb 21, 2026):

Since the model is larger than 80G, ollama will allocate as many layers as it can to the devices. Since the V100 has more VRAM, more layers will be assigned to it. Note that there's a packing factor to the layer assignment that needs to account for other resource requirements like the compute graph.
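
To see where layers actually land after a change like this, both the scheduler's plan and the real allocations can be pulled from the service log (a sketch assuming the systemd journal; the grep patterns match the log lines shown earlier in this issue):

```bash
# Planned per-device weights (device.go), actual ggml buffers (load_tensors),
# and the per-GPU layer ranges (GPULayers) from the most recent load:
journalctl -u ollama --since "1 hour ago" | \
  grep -E 'model weights|load_tensors|GPULayers'
```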

@ka-admin commented on GitHub (Feb 21, 2026):

Thank you, it helps a lot to understand the cause of the problem. Qwen3 with a 100k context length never gave me such an imbalance in offloading layers to GPUs.

@xXMrNidaXx commented on GitHub (Feb 23, 2026):

Multi-GPU MoE layer allocation is tricky. At RevolutionAI (https://revolutionai.io), we've deployed Mixtral and other MoE models across multi-GPU setups.

**What we've found:**

The default layer allocation doesn't account for MoE's uneven compute distribution — expert layers are much heavier than attention layers.

**Workarounds:**

1. **Manual layer assignment** (if supported):

   ```bash
   OLLAMA_NUM_GPU_LAYERS_0=20 OLLAMA_NUM_GPU_LAYERS_1=30 ollama run mixtral
   ```

2. **Use tensor parallelism** instead of pipeline parallelism for MoE:

   - vLLM handles this better for MoE models
   - TGI also has better MoE-aware sharding

3. **Monitor per-GPU utilization** to identify imbalance:

   ```bash
   watch -n 1 nvidia-smi --query-gpu=index,utilization.gpu,memory.used --format=csv
   ```

4. **Consider quantization** to fit more layers per GPU

**Root cause:** MoE routing means some experts get activated more than others, causing load imbalance even with "balanced" layer distribution.

What model and GPU configuration are you running? The optimal split varies significantly by architecture.

Reference: github-starred/ollama#35088