[GH-ISSUE #7798] Is this a bug? (2GB model -> up to 20GB pagefile) #4985

Closed
opened 2026-04-12 16:02:45 -05:00 by GiteaMirror · 8 comments

Originally created by @sebkont on GitHub (Nov 22, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/7798

What is the issue?

My GPU is old (a GTX 1070 with 8 GB), but shouldn't that still be enough to run a model based on Phi 3 Mini? This one: https://huggingface.co/v8karlo/UNCENSORED-Phi-3-mini-4k-geminified-Q4_K_M-GGUF

Unfortunately, 'ollama ps' reports 20 GB at 63%/37% CPU/GPU, and my C:/ drive instantly fills with 10-20 GB of pagefile (which I don't think should be happening at all?). I also rarely see any GPU spikes: it is mostly idle, reaching 10% at best during prompt responses.

This makes me think it might be a bug, or that something weird is going on. How do I stop it from dumping so much into the pagefile?

PS: in case it matters, the MODELS and TMPDIR paths in the Windows environment variables were changed to D:/, and I installed Ollama on my D:/ drive. My C:/ drive is too small, which is why the pagefile bothers me; besides, I wanted to keep everything on D:/.

OS

Windows

GPU

Nvidia

CPU

Intel

Ollama version

0.4.3

GiteaMirror added the bug label 2026-04-12 16:02:45 -05:00

@rick-github commented on GitHub (Nov 22, 2024):

How did you import the model? Server logs (https://github.com/ollama/ollama/blob/main/docs/troubleshooting.md#how-to-troubleshoot-issues) may aid in diagnosis.


@sebkont commented on GitHub (Nov 22, 2024):

ollama run hf.co/

Here is the log: server.log (https://github.com/user-attachments/files/17871241/server.log)


@rick-github commented on GitHub (Nov 22, 2024):

time=2024-11-22T14:18:45.310+01:00 level=INFO source=memory.go:343 msg="offload to cuda" layers.requested=-1 layers.model=33 layers.offload=8 layers.split="" memory.available="[7.4 GiB]" memory.gpu_overhead="0 B" memory.required.full="19.4 GiB" memory.required.partial="7.3 GiB" memory.required.kv="13.9 GiB" memory.required.allocations="[7.3 GiB]" memory.weights.total="16.0 GiB" memory.weights.repeating="15.9 GiB" memory.weights.nonrepeating="77.1 MiB" memory.graph.full="2.3 GiB" memory.graph.partial="2.3 GiB"
time=2024-11-22T14:18:45.315+01:00 level=INFO source=server.go:383 msg="starting llama server" cmd="D:\\Program-nons\\Ollama\\lib\\ollama\\runners\\cuda_v12\\ollama_llama_server.exe --model D:\\SYS\\USER\\Ollama\\models\\blobs\\sha256-4c0f9d3e34119bbdf607e05497b812597543ed2ad64e721dd1454b55f81260c4 --ctx-size 38000 --batch-size 512 --n-gpu-layers 8 --threads 4 --no-mmap --parallel 1 --port 50531"

You are requesting a large context window, 38000 tokens, which requires 13.9GB. Reducing the context window will reduce the memory footprint.
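
A rough back-of-the-envelope check of the 13.9 GB figure, as a sketch only. It assumes an f16 KV cache and the commonly published Phi-3-mini shape (32 transformer layers, 32 KV heads, head dimension 96); none of those numbers appear in this thread, and `kv_cache_bytes` is a hypothetical helper, not an Ollama function.

```python
# Rough KV-cache size estimate for a dense transformer with an f16 cache.
# The default architecture numbers are ASSUMED Phi-3-mini values,
# not something stated in the thread or the log.

GIB = 1024 ** 3


def kv_cache_bytes(n_ctx, n_layers=32, n_kv_heads=32, head_dim=96, bytes_per_elem=2):
    """Bytes needed to hold keys and values for n_ctx tokens."""
    # Factor 2 accounts for storing both K and V in every layer.
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
    return n_ctx * per_token


for n_ctx in (2048, 38000):
    print(f"{n_ctx:>6} tokens -> {kv_cache_bytes(n_ctx) / GIB:.2f} GiB")
```

Under those assumptions each token costs 384 KiB of cache, so 38000 tokens come to about 13.92 GiB, in line with `memory.required.kv` in the log, while 2048 tokens need about 0.75 GiB.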


@sebkont commented on GitHub (Nov 22, 2024):

> You are requesting a large context window, 38000 tokens, which requires 13.9GB. Reducing the context window will reduce the memory footprint.

Shouldn't it be OK, given that I also have 16 GB of RAM? Why 20 GB in the pagefile?


@rick-github commented on GitHub (Nov 22, 2024):

That is 19.4 GB in total. The easiest way to reduce the memory footprint is to reduce the size of the context window. For example, with the default value of 2048 tokens:

$ ollama ps
NAME           ID              SIZE      PROCESSOR    UNTIL
7798:latest    0dbd62e8d5e2    3.9 GB    100% GPU     Forever
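
One way to ask for a smaller context window per request is the `num_ctx` option of Ollama's HTTP API. The sketch below assumes the server is on its default address (`localhost:11434`); the model name passed in is a placeholder, and `build_request` and `generate` are illustrative helpers, not part of Ollama.

```python
import json
import urllib.request

# Assumption: Ollama is listening on its default local address.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_request(model, prompt, num_ctx=2048):
    """Build a non-streaming generate request that caps the context window."""
    payload = {
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": {"num_ctx": num_ctx},  # context window size in tokens
    }
    # Supplying a body makes urllib send this as a POST.
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )


def generate(model, prompt, num_ctx=2048):
    """Send the request and return the generated text (needs a running server)."""
    with urllib.request.urlopen(build_request(model, prompt, num_ctx)) as resp:
        return json.loads(resp.read())["response"]
```

With a server running, `generate("my-model", "Hello", num_ctx=4096)` would load the (placeholder) model with a 4096-token window. The same parameter can also be set with `PARAMETER num_ctx` in a Modelfile.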

@sebkont commented on GitHub (Nov 22, 2024):

> That is 19.4 GB in total. The easiest way to reduce the memory footprint is to reduce the size of the context window. For example, with the default value of 2048 tokens:
>
> $ ollama ps
> NAME           ID              SIZE      PROCESSOR    UNTIL
> 7798:latest    0dbd62e8d5e2    3.9 GB    100% GPU     Forever

OK, thanks. I'm still not quite sure why it puts 20 GB in the pagefile alone when RAM should cover at least 14 GB or so of it.

Can I control which pagefile is used? If I enable a pagefile on my D:/ drive, will ollama use that and stay away from C:/?


@rick-github commented on GitHub (Nov 22, 2024):

Presumably you are running other programs. They get pushed into the pagefile when another program (i.e. ollama) allocates a large amount of RAM, so not all of that 20 GB is ollama. I don't know the fine details of how pagefiles work in Windows, but I'm pretty sure you can't control that from an application, since it's an OS-level operation. Note that if you reduce the context window to the point where the model fits 100% in GPU VRAM, the system RAM requirement will be much lower.
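
To put rough numbers on this, here is a sketch using the figures from the server log above. It makes the simplifying assumption that whatever is not allocated on the GPU has to sit in system RAM, and that the reported 16 GB means 16 GiB; `host_share` is a hypothetical helper.

```python
# Figures from the server log and the thread, all in GiB.
required_full = 19.4   # memory.required.full
gpu_allocation = 7.3   # memory.required.allocations
system_ram = 16.0      # reported by the user; ASSUMED to mean 16 GiB


def host_share(full, gpu):
    """Portion of the footprint that must live in system RAM (simplified)."""
    return round(full - gpu, 1)


cpu_side = host_share(required_full, gpu_allocation)
headroom = round(system_ram - cpu_side, 1)

print(f"host-side footprint: {cpu_side} GiB ({cpu_side / required_full:.0%} of total)")
print(f"RAM left for Windows and other programs: {headroom} GiB")
```

That leaves roughly 12.1 GiB on the host side, close to the 63% CPU share shown by `ollama ps`, and under 4 GiB for Windows and everything else, which is consistent with other programs being paged out.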


@sebkont commented on GitHub (Nov 22, 2024):

OK, thanks. Yeah, but for this use case I need more context. Maybe I will look into whether I can move the pagefile entirely to a drive where I have more space.


Reference: github-starred/ollama#4985