[PR #12333] feat: add support for MoE offloading #60484

Open
opened 2026-04-29 15:28:34 -05:00 by GiteaMirror · 0 comments

📋 Pull Request Information

Original PR: https://github.com/ollama/ollama/pull/12333
Author: @Readon
Created: 9/18/2025
Status: 🔄 Open

Base: main ← Head: feat-moe-offload


📝 Commits (1)

  • fe7d9c5 feat: add support for MoE offloading

📊 Changes

5 files changed (+62 additions, -22 deletions)


📝 api/types.go (+13 -11)
📝 docs/modelfile.md (+1 -0)
📝 llama/llama.go (+41 -6)
📝 llm/server.go (+2 -1)
📝 runner/llamarunner/runner.go (+5 -4)

📄 Description

This commit introduces a new parameter `num_moe_offload` to the Modelfile, allowing users to offload Mixture-of-Experts (MoE) weights to the CPU to reduce VRAM usage.

The `num_moe_offload` parameter can be set to:

  • A positive integer `N` to offload the first `N` MoE layers.
  • `-1` to offload all MoE layers.
  • `0` (default) to disable offloading.
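
For illustration, a Modelfile using the proposed parameter might look like the following (the base model tag and layer count are placeholders, not taken from the PR):

```
FROM qwen3:30b-a3b
PARAMETER num_moe_offload 8
```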

This is implemented by passing tensor override rules to the underlying `llama.cpp` library, which already supports this functionality. The documentation for the new parameter has also been updated.
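
As a rough sketch of the mechanism (not the PR's actual code), the runner could translate `num_moe_offload` into `llama.cpp`-style tensor override patterns that pin expert tensors to the CPU buffer type. The tensor name shapes (`blk.*.ffn_*_exps`) and the `pattern=CPU` syntax below are assumptions modeled on `llama.cpp`'s `--override-tensor` flag:

```go
package main

import (
	"fmt"
	"strings"
)

// moeOverridePatterns builds a hypothetical llama.cpp-style tensor override
// rule that pins MoE expert weights to the CPU buffer type. numMoeOffload
// follows the PR's semantics: N > 0 offloads the first N MoE layers, -1
// offloads all of them, and 0 disables offloading. Tensor names and the
// "=CPU" syntax are assumptions, not code from this PR.
func moeOverridePatterns(numMoeOffload int) string {
	switch {
	case numMoeOffload == 0:
		return ""
	case numMoeOffload < 0:
		// Match expert tensors (ffn_*_exps) in every layer.
		return `blk\..*\.ffn_.*_exps\.=CPU`
	default:
		// Match expert tensors only in layers 0..N-1.
		layers := make([]string, numMoeOffload)
		for i := range layers {
			layers[i] = fmt.Sprintf("%d", i)
		}
		return fmt.Sprintf(`blk\.(%s)\.ffn_.*_exps\.=CPU`, strings.Join(layers, "|"))
	}
}

func main() {
	fmt.Println(moeOverridePatterns(2)) // blk\.(0|1)\.ffn_.*_exps\.=CPU
}
```

Because expert weights dominate the parameter count of MoE models while attention and dense tensors stay on the GPU, this kind of targeted offload trades some speed for a large VRAM saving.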

This PR is an attempt to use Jules to solve #11772.


🔄 This issue represents a GitHub Pull Request. It cannot be merged through Gitea due to API limitations.

GiteaMirror added the pull-request label 2026-04-29 15:28:34 -05:00
Reference: github-starred/ollama#60484