[GH-ISSUE #8861] [Feature Request] Dynamic Hierarchical Memory Management for MOE Models #5743

Open
opened 2026-04-12 17:01:57 -05:00 by GiteaMirror · 2 comments

Originally created by @ipfgao on GitHub (Feb 6, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/8861

The following is a proposal drafted by Deepseek R1:


Title:
[Feature Request] Dynamic Hierarchical Memory Management for MOE Models


Description:
This is a request to implement three-level dynamic memory management for MOE (Mixture of Experts) models. With a hierarchical storage mechanism (VRAM → RAM → SSD), consumer-grade GPUs (8-24 GB VRAM) paired with consumer-grade memory (32-64 GB RAM) could run inference for ultra-large-scale models (200B+ parameters).


Technical Proposal:

  1. Memory Hierarchy Design

    • Level 1: GPU VRAM (stores active expert modules)
    • Level 2: System RAM (stores expert modules likely to be needed soon)
    • Level 3: NVMe SSD (stores cold expert modules)
  2. Dynamic Scheduling Strategy

    # Pseudo-code example: three-level expert cache (VRAM → RAM → SSD)
    class MemoryManager:
        def __init__(self):
            self.gpu_cache = LRUCache(capacity=VRAM_LIMIT)      # hot experts resident in VRAM
            self.ram_cache = LRUCache(capacity=RAM_ALLOCATION)  # warm experts staged in system RAM
            self.ssd_store = DiskStorage(SSD_PATH)              # cold experts on NVMe SSD

        def fetch_expert(self, expert_id):
            # Fast path: expert already resident in VRAM.
            if expert_id in self.gpu_cache:
                return self.gpu_cache.get(expert_id)

            # Warm path: promote the expert from RAM to VRAM.
            if expert_id in self.ram_cache:
                data = self.ram_cache.pop(expert_id)
                self._move_to_gpu(expert_id, data)   # may demote a VRAM-resident expert back to RAM
                return data

            # Cold path: load from SSD, free VRAM if needed, then promote.
            ssd_data = self.ssd_store.load(expert_id)
            self._allocate_space(ssd_data.size)      # evictions cascade: VRAM → RAM → SSD
            self.gpu_cache.put(expert_id, ssd_data)
            return ssd_data
    
  3. Key Technical Points

    • Preloading mechanism based on expert routing prediction
    • Asynchronous data transfer pipeline (using CUDA streams)
    • Memory compression (with automatic quantization format conversion)
    • Access-pattern-aware caching policy (LRU weighted by access frequency; a minimal sketch follows this list)
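
A minimal sketch of the frequency-weighted LRU idea above, which would also stand in for the `LRUCache` referenced in the pseudo-code. The class, its fields, and the byte-count capacities are illustrative assumptions, not an existing Ollama API:

```python
from collections import OrderedDict

class LRUCache:
    """Byte-budgeted cache that evicts by access frequency, breaking ties by recency."""

    def __init__(self, capacity):
        self.capacity = capacity          # total bytes this tier may hold (assumed unit)
        self.used = 0
        self.entries = OrderedDict()      # expert_id -> (data, size, hit_count), in LRU order

    def __contains__(self, expert_id):
        return expert_id in self.entries

    def get(self, expert_id):
        data, size, hits = self.entries.pop(expert_id)
        self.entries[expert_id] = (data, size, hits + 1)   # refresh recency, bump frequency
        return data

    def pop(self, expert_id):
        data, size, _ = self.entries.pop(expert_id)
        self.used -= size
        return data

    def put(self, expert_id, data):
        size = data.size                  # bytes, mirroring the .size used in the pseudo-code
        # Evict the least-frequently-used entry until the new one fits; a fuller
        # implementation would demote victims to the next tier instead of dropping them.
        while self.used + size > self.capacity and self.entries:
            victim = min(self.entries, key=lambda k: self.entries[k][2])
            self.pop(victim)
        self.entries[expert_id] = (data, size, 1)
        self.used += size
```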

Performance Optimization:

  • Use memory-mapped files for fast SSD access (a sketch follows this list)
  • Use the Zstandard compression algorithm (balancing compression ratio against decompression speed)
  • Implement asynchronous transfers over PCIe 4.0 x16 (expected throughput ≥ 15 GB/s)
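
A rough illustration of the memory-mapping point above: the sketch maps an expert shard from an SSD-resident file into a NumPy view, so pages are faulted in on demand rather than read eagerly. The file layout, offsets, and dtype are assumptions made for this example, not an existing Ollama or GGUF layout, and combining this with Zstandard would additionally require a seekable, per-expert framed compression format:

```python
# Sketch: zero-copy access to one expert's weights via mmap (file layout is assumed).
import mmap
import numpy as np

def load_expert_mmap(path, offset_bytes, shape, dtype=np.float16):
    with open(path, "rb") as f:
        mm = mmap.mmap(f.fileno(), length=0, access=mmap.ACCESS_READ)
    count = int(np.prod(shape))
    # frombuffer views the mapped pages directly; no copy is made until the
    # tensor is actually touched (e.g., when staging it for upload to VRAM).
    view = np.frombuffer(mm, dtype=dtype, count=count, offset=offset_bytes)
    return view.reshape(shape)

# Hypothetical usage: expert 42, assuming each expert occupies EXPERT_NBYTES bytes
# in a flat experts.bin file under the proposed ssd_cache_path.
# weights = load_expert_mmap("/home/user/.ollama/cache/experts.bin",
#                            offset_bytes=42 * EXPERT_NBYTES,
#                            shape=(2, 4096, 14336))
```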

Configuration Suggestions:

# Example Ollama configuration (proposed keys)
memory_strategy:
  hierarchy: [vram, ram, ssd]
  allocation_policy: "adaptive"   # or "conservative"
  ssd_cache_path: "~/.ollama/cache"
  max_swap_size: "100GB"
  preload_experts: 3              # number of experts to preload
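
If keys like these were adopted, a loader could consume them roughly as follows; the config path and every key are proposals from this issue, not an existing Ollama configuration schema:

```python
# Sketch: parsing the proposed memory_strategy block with PyYAML.
# The config path and all keys are hypothetical.
import os
import yaml

def load_memory_strategy(path="~/.ollama/config.yaml"):
    with open(os.path.expanduser(path)) as f:
        cfg = yaml.safe_load(f) or {}
    strategy = cfg.get("memory_strategy", {})
    return {
        "hierarchy": strategy.get("hierarchy", ["vram", "ram", "ssd"]),
        "allocation_policy": strategy.get("allocation_policy", "adaptive"),
        "ssd_cache_path": os.path.expanduser(strategy.get("ssd_cache_path", "~/.ollama/cache")),
        "max_swap_size": strategy.get("max_swap_size", "100GB"),
        "preload_experts": int(strategy.get("preload_experts", 3)),
    }
```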

Expected Outcomes:

  • A 24 GB VRAM GPU can serve inference for 200B+ parameter MOE models
  • Expert module switching latency < 50 ms (SSD → RAM → VRAM; a rough estimate follows this list)
  • Memory usage reduced by 60-80% compared to a full-load approach
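
A rough sanity check on the < 50 ms switching target; the expert size and link bandwidths below are illustrative assumptions, not measurements (only the 15 GB/s PCIe figure comes from this proposal):

```python
# Back-of-envelope latency for one expert miss. All sizes/bandwidths are assumed.
expert_nbytes   = 250e6   # ~250 MB per expert shard at 4-bit quantization (assumption)
ssd_to_ram_bps  = 5e9     # ~5 GB/s sequential NVMe read (assumption)
ram_to_vram_bps = 15e9    # ~15 GB/s effective host-to-device, per the proposal

ssd_ms = expert_nbytes / ssd_to_ram_bps * 1e3     # ≈ 50 ms
h2d_ms = expert_nbytes / ram_to_vram_bps * 1e3    # ≈ 17 ms
print(f"SSD→RAM ≈ {ssd_ms:.0f} ms, RAM→VRAM ≈ {h2d_ms:.0f} ms")
# A cold SSD→RAM→VRAM miss already exceeds 50 ms at these sizes, which is why the
# proposal pairs the hierarchy with routing-prediction preloading and async pipelines.
```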

Related Work:
See Microsoft DeepSpeed's ZeRO-Offload design (which would need adapting to MOE characteristics) and the Petals project's distributed loading approach.


Additional Suggestions:

  1. Suggested phased development roadmap:
    Phase 1: Implement VRAM-RAM two-level swapping
    Phase 2: Add SSD storage support
    Phase 3: Optimize the preloading algorithm

  2. Consider integrating with the GGUF format to leverage the existing quantization infrastructure

  3. Suggest adding a performance monitoring interface, for example:

    ollama stat --memory
    # Output VRAM/RAM/SSD usage ratios and swap frequencies
    
GiteaMirror added the feature request label 2026-04-12 17:01:57 -05:00

@rick-github commented on GitHub (Feb 6, 2025):

https://github.com/ggerganov/llama.cpp/issues/11532


@zimdin12 commented on GitHub (Jul 23, 2025):

Any news? This would change the world :D

Reference: github-starred/ollama#5743