[GH-ISSUE #10539] Allow "use_mmap" to be set at a global level using environment variables. #6935

Closed
opened 2026-04-12 18:49:40 -05:00 by GiteaMirror · 1 comment
Owner

Originally created by @Slymi on GitHub (May 2, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/10539

Context:

I am running Ollama v0.6.7 using Docker on a system running TrueNAS SCALE with 128 GB of RAM and 2x 16 GB RTX 4060 Ti. The container runs off an NVMe drive.

Current Issue

After loading a model that does not use the new Ollama engine with "use_mmap" = true, the allocated RAM is not fully released when the model is stopped. Not only that, loading times are worse with "use_mmap" = true.
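The "not released" RAM is consistent with memory-mapped file pages staying in the OS page cache after the process exits. A minimal Python sketch of a memory-mapped read (illustrative only, not Ollama's actual loader):

```python
import mmap
import os
import tempfile

# Minimal illustration of a memory-mapped read (not Ollama's loader).
# Pages touched through the mapping are faulted into the OS page cache;
# under a cgroup they are charged to the container, and they can remain
# cached after the mapping is closed, which is why RAM usage may look
# "not released" after the model stops.
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"weights" * 1024)  # stand-in for a model file
    path = f.name

with open(path, "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    first = mm[:7]  # touching this range demand-pages it into the cache
    mm.close()

os.unlink(path)
print(first)  # b'weights'
```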

Example with "use_mmap" = true (Qwen3:32b Q4_K_M @ 16384 Context Size):

Before loading (System RAM Usage on the right):
![Image](https://github.com/user-attachments/assets/31b480e3-b069-44f6-9206-57bd456ed3f4)

Loaded:
![Image](https://github.com/user-attachments/assets/a613c938-347d-4604-91ae-5d572601b274)

Stopped (~450 MB released?):
![Image](https://github.com/user-attachments/assets/c01413aa-d322-45d0-99a9-96aeb76df50f)

Results:
![Image](https://github.com/user-attachments/assets/8743c584-1227-4819-aaaf-b96dad595396)

Logs:
[Qwen3_MMAP_Logs.txt](https://github.com/user-attachments/files/20015338/Qwen3_MMAP_Logs.txt)

Example with "use_mmap" = false (Qwen3:32b Q4_K_M @ 16384 Context Size):

Loaded:
![Image](https://github.com/user-attachments/assets/8ff9f7da-27b9-4f83-b976-cb7e70068208)

Stopped:
![Image](https://github.com/user-attachments/assets/ef076a2d-a809-4488-b1bb-86a1a7d92955)

Results:
![Image](https://github.com/user-attachments/assets/bdd6e4de-b51c-4ed9-ac8d-e7ea963b6500)

Logs:
[Qwen3_NoMMAP_Logs.txt](https://github.com/user-attachments/files/20015446/Qwen3_NoMMAP_Logs.txt)

Notes:

  • Restarting the container does not free up the RAM either, requiring a full system reboot to resolve.
  • In both examples, subsequent model loading times are reduced to 4-5 seconds as the model has been picked up by ZFS ARC (RAM Cache).
  • If the RAM limit of the container is reduced to below the model file size, "use_mmap" = true significantly lengthens the time it takes to load the model. Example shown below.
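A rough back-of-envelope check of why an 8 GB limit thrashes with mmap (the parameter count and bits-per-weight below are assumptions; actual GGUF file sizes vary):

```python
# Rough sizing: ~32.8B parameters at ~4.85 bits/weight for Q4_K_M
# (assumed values; check the real GGUF file size on disk).
params = 32.8e9
bits_per_weight = 4.85
model_gb = params * bits_per_weight / 8 / 1e9
limit_gb = 8

print(round(model_gb, 1))   # ~19.9 GB on disk
# Since the mapped file is far larger than the container's memory limit,
# pages must be evicted and re-faulted repeatedly during a single load,
# so mmap-backed loading thrashes instead of streaming once.
print(model_gb > limit_gb)  # True
```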

Examples with an 8 GB container RAM limit, loading from ZFS ARC:

"use_mmap" = true:
![Image](https://github.com/user-attachments/assets/b1a2f18f-f111-4b0e-bc53-80a42e72e8f3)

"use_mmap" = false:
![Image](https://github.com/user-attachments/assets/b0cdcc1e-0bb3-4242-b0ee-260b34034385)

12x improvement with "use_mmap" = false, as the model is already stored in the ZFS ARC cache.

Conclusion:

I hope the developers of Ollama will look into this and allow "use_mmap" to be set globally via an environment variable in order to resolve this kind of issue. As of now, I'm using Open WebUI to set "use_mmap" = false for all models. However, some other frontends lack the option to do so, which restricts the choice of frontends that can deliver a satisfactory experience.
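As a per-request workaround today, use_mmap can be passed in the "options" field of an Ollama API call. A sketch of building such a request (model name taken from the issue; the commented-out send assumes a local Ollama server on the default port):

```python
import json
from urllib import request

# Per-request workaround: disable mmap via the "options" field of an
# Ollama /api/generate request, as frontends like Open WebUI do.
payload = {
    "model": "qwen3:32b",
    "prompt": "Hello",
    "options": {"use_mmap": False},
}
body = json.dumps(payload).encode()
print(body.decode())

# Sending it (uncomment when an Ollama server is running locally):
# req = request.Request(
#     "http://localhost:11434/api/generate",
#     data=body,
#     headers={"Content-Type": "application/json"},
# )
# with request.urlopen(req) as resp:
#     for line in resp:
#         print(line.decode(), end="")
```

This works per request, but the issue's point stands: a frontend that never exposes the option leaves no way to set it, which is what a global environment variable would solve.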

References that I think are related:
#6854
#10076

GiteaMirror added the feature request label 2026-04-12 18:49:40 -05:00
Author
Owner

@rick-github commented on GitHub (Jun 6, 2025):

The new engine doesn't use mmap, so `use_mmap` is ignored. For the old engine, see #8895.


Reference: github-starred/ollama#6935