[PR #15207] [CLOSED] llm: add automatic MOE expert weight offloading #25615

Closed
opened 2026-04-19 18:18:45 -05:00 by GiteaMirror · 0 comments
Owner

📋 Pull Request Information

Original PR: https://github.com/ollama/ollama/pull/15207
Author: @sf1tzp
Created: 4/2/2026
Status: Closed

Base: main ← Head: moe-offload


📝 Commits (1)

  • 388e3b2 llm: add automatic MOE expert weight offloading

📊 Changes

6 files changed (+235 additions, -48 deletions)


📝 llm/server.go (+176 -39)
📝 llm/server_test.go (+5 -3)
📝 ml/backend.go (+6 -0)
📝 ml/backend/ggml/ggml.go (+27 -4)
📝 ml/device.go (+20 -2)
📝 runner/ollamarunner/runner.go (+1 -0)

📄 Description

Summary

Adds automatic partial layer offloading for Mixture-of-Experts models on the new engine. When a MOE model doesn't fully fit in VRAM, expert weights (ffn_gate_exps, ffn_up_exps, ffn_down_exps) can be selectively placed on CPU while the layer's
attention, norms, and routing tensors remain on GPU.
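As a rough illustration of that name-based split, here is a minimal sketch; the helper, the suffix matching, and the sample tensor names are assumptions for illustration, not code from this PR:

```go
package main

import (
	"fmt"
	"strings"
)

// Tensor names treated as MOE expert weights. Everything else in a layer
// (attention, norms, the routing/gate tensors) stays on the GPU.
// Hypothetical list, taken from the names mentioned in the PR description.
var expertTensors = []string{"ffn_gate_exps", "ffn_up_exps", "ffn_down_exps"}

// isExpertTensor reports whether a tensor such as "blk.12.ffn_up_exps.weight"
// belongs to the expert weights of its layer.
func isExpertTensor(name string) bool {
	for _, t := range expertTensors {
		if strings.Contains(name, t) {
			return true
		}
	}
	return false
}

func main() {
	for _, n := range []string{"blk.3.ffn_up_exps.weight", "blk.3.attn_q.weight", "blk.3.ffn_gate_inp.weight"} {
		fmt.Printf("%-26s expert=%v\n", n, isExpertTensor(n))
	}
}
```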

This addresses #11772 and the feedback on #12333:

  • Automatic based on available memory — no user configuration
  • Works with the existing multi-GPU layer assignment framework
  • Targets the new engine only (ml/, runner/ollamarunner/, llm/)

How it works

  1. The GGML backend now classifies expert tensors separately, tracking their memory in a new DeviceMemory.ExpertWeights field
  2. After the standard layout assigns full layers to GPUs, remaining VRAM is greedily filled with additional layers at base-only size (expert weights on CPU)
  3. ExpertOffload layer indices are threaded through LoadRequest → BackendParams → the GGML backend, which routes expert tensors to CPU buffers for those layers (see the sketch below)
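
The following is a rough sketch of the two-pass layout described in steps 2–3, under assumed names and types (planExpertOffload, layerSize); the actual change lives in llm/server.go and is more involved (multi-GPU splits, graph and cache overheads):

```go
package main

import "fmt"

// layerSize is a hypothetical per-layer memory breakdown: base covers the
// attention, norm, and routing tensors that must sit on the GPU for a
// partially offloaded layer; experts covers the ffn_*_exps weights that
// may stay on the CPU.
type layerSize struct {
	base    uint64
	experts uint64
}

// planExpertOffload mimics the described layout. Pass 1 places whole layers
// (base+experts) on the GPU while they fit; pass 2 greedily fills the
// remaining VRAM with base-only layers and records their indices so the
// backend can route their expert tensors to CPU buffers.
func planExpertOffload(layers []layerSize, freeVRAM uint64) (gpuLayers, expertOffload []int) {
	i := 0
	for ; i < len(layers); i++ { // pass 1: full layers on GPU
		full := layers[i].base + layers[i].experts
		if full > freeVRAM {
			break
		}
		freeVRAM -= full
		gpuLayers = append(gpuLayers, i)
	}
	for ; i < len(layers); i++ { // pass 2: base-only layers, experts left on CPU
		if layers[i].base > freeVRAM {
			break
		}
		freeVRAM -= layers[i].base
		gpuLayers = append(gpuLayers, i)
		expertOffload = append(expertOffload, i)
	}
	return gpuLayers, expertOffload
}

func main() {
	// Toy example: four identical layers; the budget fits two full layers
	// plus one extra layer at base-only size.
	layers := []layerSize{{100, 400}, {100, 400}, {100, 400}, {100, 400}}
	gpu, offload := planExpertOffload(layers, 1150)
	fmt.Println("GPU layers:", gpu, "expert-offloaded:", offload)
}
```

In step 3, the backend would then consult these indices per tensor: a tensor lands in a CPU buffer only when its layer is in the offload list and its name is one of the expert tensors. Dense models never populate the list, so pass 2 is effectively skipped for them.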

Benchmarks

Tested with qwen3:30b-a3b (Q4_K_M, 19.3 GB) on an RTX 3060 (12 GB VRAM, 24 GB system RAM):

| Metric   | Baseline (main)  | MOE offload      | Delta  |
|----------|------------------|------------------|--------|
| VRAM     | 11.84 GB (61.4%) | 11.98 GB (62.2%) | +0.8%  |
| Generate | 29.41 tok/s      | 26.64 tok/s      | -9.4%  |
| Prefill  | 1264.76 tok/s    | 646.05 tok/s     | -48.9% |

On this hardware, ~61% of the model already fits on the GPU at full layer size, leaving little remaining VRAM for additional base-only layers. The few extra layers add cross-device transfer overhead that outweighs their benefit.

The feature should have more impact on systems where VRAM is significantly more constrained relative to model size — e.g., 6-8 GB GPUs running large MOE models where many more layers would benefit from partial offloading. Looking for feedback from users
with more constrained setups.

Test plan

  • [x] Existing TestLLMServerFitGPU passes
  • [x] Dense models unaffected (pass 2 skipped when no expert weights detected)
  • [x] MOE model loads and runs correctly with expert offloading active
  • [ ] Testing on more VRAM-constrained hardware (6-8 GB GPU)
  • [ ] Multi-GPU testing

🔄 This issue represents a GitHub Pull Request. It cannot be merged through Gitea due to API limitations.

GiteaMirror added the pull-request label 2026-04-19 18:18:45 -05:00

Reference: github-starred/ollama#25615