[GH-ISSUE #11005] Memory Optimization for MoE Models via Sparse-Activation-Aware Techniques #53768

Open
opened 2026-04-29 04:43:48 -05:00 by GiteaMirror · 1 comment

Originally created by @gffice on GitHub (Jun 7, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/11005

Problem Statement

Mixture-of-Experts (MoE) models (e.g., Mixtral, Switch Transformers) activate only a subset of experts per token (top-k routing). However, current implementations often load all experts into memory during inference/training, leading to:

  • Excessive VRAM consumption (e.g., an 8x7B model requiring >90 GB of VRAM).
  • Barrier to deployment on consumer hardware.

Goal: Exploit sparse activation to reduce the memory footprint while preserving performance.

Proposed Solutions

1. Dynamic Expert Loading

  • Mechanism: Load only activated experts into VRAM after routing.
  • Implementation:
    • Split expert weights into independent blocks (e.g., per-expert .safetensors).
    • Post-routing, load top-k experts via memory-mapped I/O.
  • Benefit: Reduces VRAM from O(total_experts) to O(activated_experts); a minimal loading sketch follows.
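
A minimal sketch of what per-expert loading could look like with the safetensors library; the load_active_experts helper and the experts/expert_{id}.safetensors file layout are illustrative assumptions, not an existing Ollama API:

```python
from safetensors.torch import load_file  # reads tensors via memory-mapped I/O

def load_active_experts(expert_ids, weights_dir="experts"):
    """Hypothetical helper: copy only the routed experts into VRAM.

    Assumes each expert was exported as {weights_dir}/expert_{id}.safetensors.
    """
    active = {}
    for eid in expert_ids:
        # load_file reads from the mmap on CPU; .to("cuda") uploads only
        # the top-k experts the router actually selected
        state = load_file(f"{weights_dir}/expert_{eid}.safetensors")
        active[eid] = {name: w.to("cuda") for name, w in state.items()}
    return active
```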

2. Expert Offloading

  • Strategy:
    • CPU Offloading: Move inactive experts to RAM.
    • Disk Offloading: Store rarely used experts on SSD (via mmap).
  • Optimization: Async I/O + LRU caching for prefetching; a cache sketch follows.
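
The LRU piece of that optimization could look like the sketch below, assuming PyTorch-style expert modules kept in a plain dict; ExpertCache and its capacity policy are hypothetical:

```python
from collections import OrderedDict

class ExpertCache:
    """Hypothetical LRU cache: at most `capacity` experts stay in VRAM;
    the least recently used expert is demoted back to CPU RAM."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.gpu_experts = OrderedDict()  # expert_id -> module on GPU

    def get(self, expert_id, cpu_experts):
        if expert_id in self.gpu_experts:
            self.gpu_experts.move_to_end(expert_id)   # mark recently used
        else:
            if len(self.gpu_experts) >= self.capacity:
                _, victim = self.gpu_experts.popitem(last=False)
                victim.to("cpu")                      # evict LRU expert to RAM
            self.gpu_experts[expert_id] = cpu_experts[expert_id].to("cuda")
        return self.gpu_experts[expert_id]
```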

3. Sparse Computation Graphs

  • Approach: Skip computations for inactive experts at kernel/compiler level.
    • Example: Build conditional execution paths in MLIR/TVM.
    • Fused SparseMoELayer operator:
      ```python
      # Pseudocode: compute only the experts the router selected
      output = zeros_like(input)
      for expert_id in activated_experts:        # inactive experts are skipped
          mask = router_assignment == expert_id  # tokens routed to this expert
          output[mask] += experts[expert_id](input[mask])
      ```

4. Quantization-Aware Sparsity

  • Hybrid Quantization: Apply lower precision (e.g., int8) to inactive experts (sketched below).
  • Compression: Use pruning/structured sparsity for offloaded experts.
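
A minimal sketch of the precision drop for experts leaving VRAM, using symmetric per-tensor int8; the helper names and the scaling scheme are assumptions for illustration:

```python
import torch

def quantize_expert(state):
    """Hypothetical int8 packing applied when an expert goes inactive."""
    packed = {}
    for name, w in state.items():
        scale = w.abs().max().clamp(min=1e-8) / 127.0   # per-tensor scale
        packed[name] = ((w / scale).round().to(torch.int8), scale)
    return packed

def dequantize_expert(packed, dtype=torch.float16):
    """Restore usable weights when the expert is routed to again."""
    return {name: q.to(dtype) * scale for name, (q, scale) in packed.items()}
```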

5. Distributed Expert Parallelism

  • Design: Shard experts across devices, transfer only inputs/outputs for activated experts.
  • Benefit: Near-linear memory scaling with the number of devices (see the dispatch sketch below).
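
A sketch of the dispatch rule under a simple striping assumption (expert i resides on rank i mod world_size). The point-to-point send/recv pair stands in for the batched all_to_all a real system would use; local_experts is a hypothetical per-rank table, and the owning rank is assumed to run a matching serve loop:

```python
import torch
import torch.distributed as dist

def owner_rank(expert_id: int) -> int:
    # Assumed striping: expert i lives on rank i mod world_size.
    return expert_id % dist.get_world_size()

def forward_remote(hidden: torch.Tensor, expert_id: int, local_experts: dict):
    """Move activations, never weights: ship the token batch to the
    expert's home rank and receive the result back."""
    dst = owner_rank(expert_id)
    if dst == dist.get_rank():
        return local_experts[expert_id](hidden)  # expert is local: no I/O
    dist.send(hidden, dst=dst)                   # inputs out to the owner...
    out = torch.empty_like(hidden)
    dist.recv(out, src=dst)                      # ...outputs come back
    return out
```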

Expected Impact

| Technique              | VRAM Reduction | Use Case                  |
|------------------------|----------------|---------------------------|
| Dynamic Expert Loading | 50–90%         | Large MoE (≥8 experts)    |
| CPU/Disk Offloading    | 30–70%         | Resource-constrained envs |
| Sparse Computation     | 20–40%         | Compute-bound workloads   |
| Expert Parallelism     | 1/N scaling    | Multi-GPU/Node            |

Challenges & Mitigations

| Challenge               | Mitigation                                 |
|-------------------------|--------------------------------------------|
| Dynamic loading latency | Async prefetch + pipeline parallelism      |
| Routing overhead        | Lightweight router (e.g., low-dim linear)  |
| System complexity       | Gradual rollout via a --sparse_moe flag    |

Requested Ollama Changes

  1. API Extensions:
    • load_expert(expert_id: int) → nn.Module for dynamic weight management.
    • offload_expert(expert_id: int, device='cpu') context manager (sketched after this list).
  2. Runtime Support:
    • Conditional execution in computation graphs (e.g., JIT-traced branches).
  3. Documentation:
    • Add moe_memory_optimization.md with benchmarks & code samples.
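
One way the proposed offload_expert context manager could behave, sketched with PyTorch modules; the global experts table and the promote-inside/demote-on-exit semantics are assumptions, not an existing Ollama API:

```python
from contextlib import contextmanager

experts = {}  # assumed global table: expert_id -> nn.Module resident on CPU

@contextmanager
def offload_expert(expert_id: int, device: str = "cpu"):
    """Sketch of the proposed API: the expert is promoted to GPU only
    inside the `with` block and demoted to `device` on exit."""
    expert = experts[expert_id].to("cuda")
    try:
        yield expert
    finally:
        expert.to(device)   # demote when the block exits

# Usage sketch:
#   with offload_expert(3) as e:
#       out = e(x)
```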
GiteaMirror added the feature request label 2026-04-29 04:43:48 -05:00

@dullbananas commented on GitHub (Jul 8, 2025):

This should also optionally apply to the initial download of experts. This would allow much bigger models to be used.

Reference: github-starred/ollama#53768