[GH-ISSUE #4161] implement an LRU cache for GPU VRAM when running inference on MoE models #28345

Open
opened 2026-04-22 06:27:14 -05:00 by GiteaMirror · 0 comments
Owner

Originally created by @davinwang on GitHub (May 5, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/4161

Pardon me if this has already been implemented.

https://arxiv.org/pdf/2312.17238

According to the article above, adding an LRU cache of experts yields a 2-3x speedup when running an MoE model that does not fit entirely in GPU VRAM. For example, only 12.9B of Mixtral 8x7B's 46.7B total parameters are active for any given token, so keeping the most recently used experts resident in VRAM avoids most host-to-device weight transfers.
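
Not from the original issue, but to make the request concrete: below is a minimal sketch of the expert-level LRU cache the paper describes, in Go since that is ollama's implementation language. Every identifier here (`ExpertID`, `ExpertWeights`, the load/evict callbacks) is hypothetical and stands in for ollama's actual tensor and VRAM management; a real integration would copy expert weight tensors between host memory and the GPU.

```go
package main

import (
	"container/list"
	"fmt"
)

// ExpertID identifies one expert within one MoE layer. (Hypothetical type.)
type ExpertID struct {
	Layer  int
	Expert int
}

// ExpertWeights stands in for a GPU-resident weight buffer; a real
// implementation would wrap a VRAM allocation. (Hypothetical type.)
type ExpertWeights struct {
	SizeBytes uint64
}

// LRUCache keeps the most recently used experts in VRAM and evicts the
// least recently used ones when the byte budget is exceeded.
type LRUCache struct {
	budget  uint64     // VRAM budget in bytes
	used    uint64     // bytes currently resident
	order   *list.List // front = most recently used
	entries map[ExpertID]*list.Element
	loadFn  func(ExpertID) *ExpertWeights  // copies weights host -> VRAM
	evictFn func(ExpertID, *ExpertWeights) // frees the VRAM copy
}

type entry struct {
	id ExpertID
	w  *ExpertWeights
}

func NewLRUCache(budget uint64,
	load func(ExpertID) *ExpertWeights,
	evict func(ExpertID, *ExpertWeights)) *LRUCache {
	return &LRUCache{
		budget:  budget,
		order:   list.New(),
		entries: make(map[ExpertID]*list.Element),
		loadFn:  load,
		evictFn: evict,
	}
}

// Get returns VRAM-resident weights for id, loading and evicting as needed.
func (c *LRUCache) Get(id ExpertID) *ExpertWeights {
	if el, ok := c.entries[id]; ok {
		c.order.MoveToFront(el) // cache hit: mark as most recently used
		return el.Value.(*entry).w
	}
	w := c.loadFn(id) // cache miss: copy weights into VRAM
	c.used += w.SizeBytes
	// Evict least recently used experts until we fit the budget again.
	for c.used > c.budget && c.order.Len() > 0 {
		back := c.order.Back()
		victim := back.Value.(*entry)
		c.order.Remove(back)
		delete(c.entries, victim.id)
		c.used -= victim.w.SizeBytes
		c.evictFn(victim.id, victim.w)
	}
	c.entries[id] = c.order.PushFront(&entry{id: id, w: w})
	return w
}

func main() {
	// Toy budget of 2 "bytes" so evictions are easy to see.
	cache := NewLRUCache(2,
		func(id ExpertID) *ExpertWeights {
			fmt.Println("load", id)
			return &ExpertWeights{SizeBytes: 1}
		},
		func(id ExpertID, _ *ExpertWeights) { fmt.Println("evict", id) },
	)
	cache.Get(ExpertID{0, 3})
	cache.Get(ExpertID{0, 5})
	cache.Get(ExpertID{0, 3}) // hit: no load
	cache.Get(ExpertID{0, 7}) // over budget: evicts {0,5}
}
```

In this scheme the router's per-token expert choices drive `Get`, and because consecutive tokens tend to reuse the same experts, most lookups become cache hits; that reuse is where the paper's speedup comes from.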

GiteaMirror added the feature request label 2026-04-22 06:27:14 -05:00