[GH-ISSUE #3054] Immense amount of disk reads when paging with mmap #1879

Closed
opened 2026-04-12 11:57:58 -05:00 by GiteaMirror · 7 comments
Owner

Originally created by @hedleyroos on GitHub (Mar 11, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/3054

Originally assigned to: @dhiltgen on GitHub.

Querying the `/generate` API with the llama2 models works as expected. I append `keep_alive=0` to the query string to keep the model in RAM, and from `iotop` I can see it immediately loads the model into RAM (I am in CPU-only mode). The loading also seems to take place in a single sub-process or thread - unsure which, since I don't know the underlying design, but it looks like a sub-process.

However, when switching to the llama2:70b model things change. It does the same single-process loading of the model and takes longer because the model is naturally bigger, but thereafter 16 ollama processes are constantly reading from disk at about 140MB/s, and as a result generation is extremely slow. I have a 7950X, so the 16 processes make sense, but I do have 64GB of RAM, and this should be enough for the model. `top` shows RAM usage at 34GB, but since there is memory mapping behind the scenes, VIRT goes to about 46GB, as expected.

I cannot find a way to prevent ollama from constantly hammering the disk with the llama2:70b model. Unfortunately I don't know Golang, but could it be an issue with the memory mapping not being correctly shared between processes?
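
For reference, the kind of request described above looks roughly like this (a sketch assuming the default local endpoint on port 11434; the model, prompt, and `keep_alive` value are illustrative, shown here in the JSON body):

```shell
# Illustrative /api/generate request; values are placeholders.
curl http://localhost:11434/api/generate -d '{
  "model": "llama2:70b",
  "prompt": "Why is the sky blue?",
  "stream": false,
  "keep_alive": "5m"
}'
```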

GiteaMirror added the bug and linux labels 2026-04-12 11:57:58 -05:00
Author
Owner

@hedleyroos commented on GitHub (Mar 11, 2024):

Answering my own question after spotting https://github.com/ollama/ollama/blob/main/docs/api.md.

Pass "use_mmap": false as an option in the API call. It solves everything.

Author
Owner

@jmorganca commented on GitHub (Mar 11, 2024):

Hi @hedleyroos, thanks for the issue. This is indeed because llama2:70b is quite big and may be paged to disk. I've updated this issue, as we can probably disable mmap if the model is too big to fit into memory. How much RAM do you have on your machine?

Author
Owner

@hedleyroos commented on GitHub (Mar 11, 2024):

Hi @jmorganca. I have 64 GB.

Author
Owner

@sytelus commented on GitHub (Mar 27, 2024):

How can you disable mmap so the model is fully loaded into RAM? I am testing this on a machine with a huge amount of RAM in CPU-only mode. I see there is an option for the API, but is there any way to do this from the CLI?

Author
Owner

@kotatsuyaki commented on GitHub (Jun 22, 2024):

I ran `/set parameter use_mmap 0` (along with `/set parameter num_gpu 0` for CPU-only inference) from the ollama CLI, and from the system monitor it looks like the model is loaded into memory.
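
That interactive session looks roughly like this (a sketch; the model name and prompt are placeholders, the parameters are the ones mentioned above):

```shell
# Interactive ollama session; parameters disable mmap and GPU offload.
ollama run llama2:70b
>>> /set parameter use_mmap 0
>>> /set parameter num_gpu 0
>>> Why is the sky blue?
```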

Author
Owner

@dhiltgen commented on GitHub (Jul 24, 2024):

We've been adjusting our default selection algorithm for when to use mmap vs. when to do normal file reads. Our current algorithm looks at the amount of free memory in the system; if the model is larger than that, we switch to file reads so that other memory usage can be pushed into swap to make room, which should help avoid mmap-induced thrashing. @hedleyroos, you should give the latest version a try and see if the default algorithm works out for you, but you can force the desired behavior with `use_mmap 0`.
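
A minimal sketch of that selection heuristic in Go (hypothetical names and sizes, not the actual ollama implementation):

```go
package main

import "fmt"

// shouldUseMmap illustrates the heuristic described above: mmap the model
// only if it fits within currently free memory, otherwise fall back to
// normal file reads. An explicit use_mmap setting from the user always wins.
func shouldUseMmap(modelSize, freeMemory uint64, userSetting *bool) bool {
	if userSetting != nil {
		return *userSetting
	}
	return modelSize <= freeMemory
}

func main() {
	model := uint64(40) << 30 // illustrative: ~40 GiB of model weights
	free := uint64(20) << 30  // illustrative: 20 GiB of free memory
	fmt.Println(shouldUseMmap(model, free, nil)) // false -> use file reads
}
```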

Author
Owner

@hedleyroos commented on GitHub (Aug 2, 2024):

@dhiltgen I have tested with 0.3.2 and it works correctly now.
