[GH-ISSUE #12615] Is use_mmap ineffective for some models? For example, GPT-OSS and Qwen3:30B-A3B? #70434

Closed
opened 2026-05-04 21:32:48 -05:00 by GiteaMirror · 6 comments

Originally created by @ghost on GitHub (Oct 14, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/12615

What is the issue?

After setting `use_mmap` to `true` in Ollama, loading Qwen3:30B-A3B still reads the entire model into memory, which crashes my phone (12 GB of RAM).

However, when I load the same model with llama.cpp, it runs normally on my phone: llama.cpp uses memory mapping, so there is no out-of-memory crash.

I also tried enabling `use_mmap` for some other models in Ollama, and memory mapping worked fine for them.
It seems that some models support it while others don't?
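
For context, a minimal sketch of how `use_mmap` is passed as a request option to Ollama's REST API (the endpoint and the `options` field follow Ollama's documented API; the host, prompt, and model tag here are placeholders):

```python
import json
import urllib.request

# Placeholder endpoint; adjust to your Ollama host/port.
OLLAMA_URL = "http://localhost:11434/api/generate"

payload = {
    "model": "qwen3:30b-a3b",
    "prompt": "Hello",
    "stream": False,
    # The runtime option this issue is about: ask the server to
    # memory-map the weights instead of reading them eagerly.
    "options": {"use_mmap": True},
}

req = urllib.request.Request(
    OLLAMA_URL,
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read())["response"])
```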

Relevant log output

No response
OS

No response

GPU

No response

CPU

No response

Ollama version

No response

GiteaMirror added the bug label 2026-05-04 21:32:48 -05:00

@rick-github commented on GitHub (Oct 14, 2025):

Mmap is not supported for models that use the ollama engine.


@ghost commented on GitHub (Oct 14, 2025):

> Mmap is not supported for models that use the ollama engine.

So does that mean Ollama has to fully load MoE (Mixture of Experts) models into memory as well?
That doesn't seem necessary, since the active parameters are only about 3B.

I tested MiniCPM-V:8B, and it can use memory mapping in Ollama without any problem.
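
A rough back-of-envelope estimate of why this matters on a 12 GB device (the bytes-per-parameter figure assumes a ~4.5-bit quantization and is an approximation, not the exact size of any particular Qwen3 build):

```python
# Rough sizing sketch; assumes ~4.5 bits/param (Q4_K-style quant).
total_params = 30e9    # Qwen3:30B-A3B total parameters
active_params = 3e9    # parameters activated per token via MoE routing
bytes_per_param = 4.5 / 8

file_size_gb = total_params * bytes_per_param / 1e9    # ~17 GB on disk
active_set_gb = active_params * bytes_per_param / 1e9  # ~1.7 GB hot per token

print(f"full weights ~{file_size_gb:.0f} GB, "
      f"per-token active set ~{active_set_gb:.1f} GB")
```

Different tokens route to different experts, so the resident set grows over time, but with mmap the kernel can evict cold expert pages instead of killing the process.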


@rick-github commented on GitHub (Oct 14, 2025):

MoE models are always loaded into memory. There is work in progress to allow selective offload.

MiniCPM-V:8B runs in the llama.cpp engine, and so supports the use of mmap.
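
To illustrate what the two loading strategies mean at the OS level, a generic sketch (not Ollama's or llama.cpp's actual loader code; the file path is a placeholder):

```python
import mmap

PATH = "model.gguf"  # placeholder path to a large weights file

def load_eager(path):
    # Eager load: the whole file is read into process memory up front.
    # For a model larger than RAM, this is what triggers the OOM.
    with open(path, "rb") as f:
        return f.read()

def load_mapped(path):
    # mmap load: the file is mapped read-only; physical pages are
    # faulted in only when touched, and the kernel may evict them
    # again under memory pressure.
    f = open(path, "rb")
    return mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)

weights = load_mapped(PATH)
print(weights[:4])  # touching bytes pages in just that region (e.g. b'GGUF')
```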


@ghost commented on GitHub (Oct 14, 2025):

> MoE models are always loaded into memory. There is work in progress to allow selective offload.
>
> MiniCPM-V:8B runs in the llama.cpp engine, and so supports the use of mmap.

Then why doesn't Ollama simply use llama.cpp for memory mapping when `use_mmap` is explicitly set to `true`, just like it does with MiniCPM-V:8B?

The Qwen3:30B-A3B model I pulled from Ollama can also run on llama.cpp, although some models do seem to be incompatible with llama.cpp.
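
For anyone who wants the workaround described here: models pulled by Ollama are stored as GGUF blobs on disk, and llama.cpp can usually load those blobs directly. A sketch using the llama-cpp-python bindings; the blob path is illustrative (the real GGUF layer hash is listed in the model's manifest under `~/.ollama/models/manifests/`):

```python
import os
from llama_cpp import Llama  # pip install llama-cpp-python

# Illustrative path; substitute the actual sha256 layer hash from
# the model manifest.
MODEL_PATH = os.path.expanduser(
    "~/.ollama/models/blobs/sha256-<layer-hash>"
)

llm = Llama(
    model_path=MODEL_PATH,
    use_mmap=True,  # llama.cpp's default: page weights in on demand
    n_ctx=4096,
)
out = llm("Why is the sky blue?", max_tokens=64)
print(out["choices"][0]["text"])
```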


@ghost commented on GitHub (Oct 14, 2025):

Additionally, I personally think that if a model truly does not support `use_mmap`, Ollama should display an error message and cancel the loading process.

Forcing the entire model to load into memory in a low-memory environment can crash the system, taking down other applications as well.
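
The kind of preflight check being asked for could look roughly like this (a sketch only, not Ollama's actual code; it compares file size against currently available memory, which is a crude proxy for what a real loader would estimate, and the `/proc/meminfo` parsing is Linux-specific):

```python
import os

def mem_available_bytes():
    # Linux-specific: read MemAvailable from /proc/meminfo (in kB).
    with open("/proc/meminfo") as f:
        for line in f:
            if line.startswith("MemAvailable:"):
                return int(line.split()[1]) * 1024
    raise RuntimeError("MemAvailable not found")

def preflight(model_path, headroom=0.9):
    need = os.stat(model_path).st_size
    have = mem_available_bytes()
    if need > have * headroom:
        # Fail fast instead of letting the OOM killer take the
        # whole system down.
        raise MemoryError(
            f"model needs ~{need / 1e9:.1f} GB but only "
            f"{have / 1e9:.1f} GB is available; refusing to load"
        )

preflight("model.gguf")  # placeholder path
```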


@ghost commented on GitHub (Oct 21, 2025):

I directly tested the GPT-OSS 120B model using llama.cpp on my phone, and it runs at about 2 tokens per second. The Qwen3 30B model even generates faster than I can read. It seems that using llama.cpp directly is still the only way to run large MoE models on my phone.

Reference: github-starred/ollama#70434