[PR #12437] [CLOSED] feat: add Mixture of Experts (MoE) dynamic loading optimization #19094

Closed
opened 2026-04-16 06:56:42 -05:00 by GiteaMirror · 0 comments

📋 Pull Request Information

Original PR: https://github.com/ollama/ollama/pull/12437
Author: @Ray0907
Created: 2025-09-28
Status: Closed

Base: main ← Head: main


📝 Commits (1)

  • 790a0f0 feat: add Mixture of Experts (MoE) dynamic loading optimization

📊 Changes

9 files changed (+2377 additions, -7 deletions)

📝 api/types.go (+11 -0)
➕ llama/llama-moe-integration.go (+726 -0)
📝 llama/llama.cpp/src/llama-graph.cpp (+89 -1)
➕ llama/llama.cpp/src/llama-moe-dynamic.cpp (+434 -0)
➕ llama/llama.cpp/src/llama-moe-dynamic.h (+93 -0)
📝 llama/llama.cpp/src/llama.cpp (+14 -0)
➕ llm/moe_optimizer.go (+780 -0)
📝 llm/server.go (+44 -6)
📝 server/routes.go (+186 -0)

📄 Description

Summary

Implements a comprehensive Mixture of Experts (MoE) optimization system
for memory-efficient handling of large MoE models, with dynamic expert
loading/unloading.

Related Issue: #11005

Changes

Core Implementation

  • API Types: Add MoE configuration options to the Runner struct in
    api/types.go (a hedged sketch follows this list)
  • C++ Integration: Implement dynamic expert loading in the llama-graph
    computation (llama/llama.cpp/src/llama-graph.cpp,
    llama/llama.cpp/src/llama-moe-dynamic.cpp/h)
  • Go-C++ Bridge: Create an integration layer for expert request handling
    (llama/llama-moe-integration.go)
  • MoE Optimizer: Add an optimizer with VRAM and CPU memory budget
    management (llm/moe_optimizer.go)
  • HTTP Endpoints: Add API endpoints for MoE stats, cleanup, and
    configuration (server/routes.go)

Key Features

  • Dynamic expert loading/unloading based on memory budgets
  • VRAM and CPU memory management
  • Expert caching with hit/miss tracking (see the cache sketch after this
    list)
  • Configurable expert limits and optimization settings
  • Real-time statistics and monitoring via /api/moe/stats
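
To make the budget-driven loading/unloading and hit/miss tracking
concrete, here is a minimal, self-contained Go sketch of an LRU expert
cache with a byte budget. It illustrates the technique the feature list
describes; it is not the PR's llm/moe_optimizer.go implementation, and
all names are hypothetical.

```go
package moe

import (
	"container/list"
	"sync"
)

// expertCache is an illustrative LRU cache of loaded experts with
// hit/miss counters and a byte budget.
type expertCache struct {
	mu           sync.Mutex
	budget       int64 // max bytes of expert weights kept resident
	used         int64
	lru          *list.List            // front = most recently used
	items        map[int]*list.Element // expert ID -> LRU node
	hits, misses uint64
}

type expertEntry struct {
	id   int
	size int64
}

func newExpertCache(budget int64) *expertCache {
	return &expertCache{
		budget: budget,
		lru:    list.New(),
		items:  make(map[int]*list.Element),
	}
}

// touch records a router request for expert id. It returns true on a
// cache hit; on a miss it "loads" the expert and evicts least recently
// used entries until the byte budget is respected.
func (c *expertCache) touch(id int, size int64) (hit bool) {
	c.mu.Lock()
	defer c.mu.Unlock()

	if el, ok := c.items[id]; ok {
		c.hits++
		c.lru.MoveToFront(el)
		return true
	}
	c.misses++
	// Evict from the cold end until the new expert fits in the budget.
	for c.used+size > c.budget && c.lru.Len() > 0 {
		back := c.lru.Back()
		ev := back.Value.(*expertEntry)
		c.used -= ev.size
		delete(c.items, ev.id)
		c.lru.Remove(back)
		// Real code would unload ev.id's weights from VRAM/CPU here.
	}
	c.items[id] = c.lru.PushFront(&expertEntry{id: id, size: size})
	c.used += size
	return false
}
```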

Known Issues

  • /api/moe/stats endpoint shows zero values for statistics despite MoE
    optimization being active:
    • active_experts, cache_hits, cache_misses show 0
    • cpu_usage, vram_usage show "0 B"
    • total_load_time, total_offload_time show "0s"
  • This suggests that the Go-C++ bridge or the statistics collection
    mechanism needs debugging before runtime metrics are reported correctly
    (see the query example below).
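
For reproduction, the zero-value behavior can be observed by querying the
stats endpoint directly. A minimal Go client is sketched below; the JSON
field names come from the list above, but the exact response shape is an
assumption (11434 is Ollama's default port).

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
)

// moeStats mirrors the fields named in this PR's description; the exact
// JSON shape returned by /api/moe/stats is an assumption.
type moeStats struct {
	ActiveExperts    int    `json:"active_experts"`
	CacheHits        uint64 `json:"cache_hits"`
	CacheMisses      uint64 `json:"cache_misses"`
	CPUUsage         string `json:"cpu_usage"`
	VRAMUsage        string `json:"vram_usage"`
	TotalLoadTime    string `json:"total_load_time"`
	TotalOffloadTime string `json:"total_offload_time"`
}

func main() {
	resp, err := http.Get("http://localhost:11434/api/moe/stats")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	var s moeStats
	if err := json.NewDecoder(resp.Body).Decode(&s); err != nil {
		panic(err)
	}
	// With the bug described above, every field prints its zero value
	// ("0", "0 B", "0s") even while MoE optimization is active.
	fmt.Printf("%+v\n", s)
}
```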

References

  • https://medium.com/@david.sanftenberg/gpu-poor-how-to-configure-offloading-for-the-qwen-3-235b-a22b-moe-model-using-llama-cpp-13dc15287bed

🔄 This issue represents a GitHub Pull Request. It cannot be merged through Gitea due to API limitations.
