[GH-ISSUE #8795] Load model into NVME SSD #31471

Closed
opened 2026-04-22 11:55:43 -05:00 by GiteaMirror · 1 comment

Originally created by @phly95 on GitHub (Feb 3, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/8795

Apparently, it is possible to use NVMe SSDs to load model weights, which can then be used directly by the GPU.

https://www.reddit.com/r/LocalLLaMA/comments/1iczucy/running_deepseek_r1_iq2xxs_200gb_from_ssd/

From what I've heard, this does not damage the SSD, since it performs frequent reads rather than writes, and it can allow users with very little VRAM to run much larger models. Apparently there is also another project, UMbreLLa, that can further speed up inference. Perhaps if ollama were to integrate these two approaches, users of small 8 GB cards might be able to expect reasonable performance with models like DeepSeek R1, Mistral Small 24B, Llama 70B, etc.

GiteaMirror added the feature request label 2026-04-22 11:55:43 -05:00

@rick-github commented on GitHub (Feb 3, 2025):

This is just `mmap`. ollama disables it on [some platforms](https://github.com/ollama/ollama/blob/ad22ace439eb3fab7230134e56bb6276a78347e4/llm/server.go#L210) based on performance reports, but that can be overridden by [setting](https://github.com/ollama/ollama/blob/main/docs/api.md#generate-request-with-options:~:text=vocab_only%22%3A%20false%2C%0A%20%20%20%20%22-,use_mmap,-%22%3A%20true%2C%0A%20%20%20%20%22use_mlock) `"use_mmap": true`. One wrinkle: ollama still performs a memory check even when `mmap` is enabled; a [ticket](https://github.com/ollama/ollama/issues/8654) about that includes a workaround for Linux platforms.
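For anyone unfamiliar with the mechanism, here is a minimal Python sketch of what `mmap` buys you; the `model.gguf` filename is just a placeholder for any large weights file:

```python
import mmap

# Minimal sketch: "model.gguf" is a placeholder for any large weights file.
# Mapping it read-only costs almost nothing up front; the kernel pages bytes
# in from the SSD only when they are first touched, and can evict them again
# under memory pressure, so the file never has to fit in RAM all at once.
with open("model.gguf", "rb") as f:
    weights = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    magic = weights[:4]  # first access faults these pages in from disk
    print(magic)
    weights.close()
```

Because nothing is read until it is touched, the mapping can be far larger than physical RAM, which is exactly what lets the SSD stand in for memory here.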

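And a sketch of applying the override per request through the generate API, assuming a locally running ollama server on the default port (the model tag below is just a placeholder for whatever you have pulled):

```python
import requests

# Per-request override: ask the local ollama server (default port 11434) to
# load the model with mmap enabled via the documented "use_mmap" option.
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "llama3.1:8b",          # placeholder: any locally pulled tag
        "prompt": "Why is the sky blue?",
        "stream": False,
        "options": {"use_mmap": True},
    },
)
print(resp.json()["response"])
```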
