[GH-ISSUE #3837] Don't attempt to load a model larger than physical memory + VRAM which will result in thrashing #64413

Closed
opened 2026-05-03 17:34:03 -05:00 by GiteaMirror · 13 comments

Originally created by @maa105 on GitHub (Apr 23, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/3837

Originally assigned to: @dhiltgen on GitHub.

What is the issue?

I have an Nvidia 3070 GPU with 8 GB of VRAM. When running llama3 I notice the GPU VRAM fills to ~7 GB, but GPU compute stays at 0-1% and 16 cores of my CPU are active, leading me to conclude that the model is running purely on the CPU and not using the GPU.

OS

Linux

GPU

Nvidia

CPU

Intel

Ollama version

0.1.31

GiteaMirror added the feature request label 2026-05-03 17:34:03 -05:00

@markhill343 commented on GitHub (Apr 23, 2024):

The 70B model should be around ~50 GB, so your GPU can only help the CPU. The GPU works with the ~7 GB it loads into VRAM; the remaining part is stored in system RAM, and only your CPU can process that. That's why it looks like your GPU is doing nothing. You should try the 8B model for better performance.


@siakc commented on GitHub (Apr 30, 2024):

> The 70B model should be around ~50 GB, so your GPU can only help the CPU. The GPU works with the ~7 GB it loads into VRAM; the remaining part is stored in system RAM, and only your CPU can process that. That's why it looks like your GPU is doing nothing. You should try the 8B model for better performance.

I have the same issue. If you were right, the memory would have been filled with the other 65 GB and the cores would be under heavy load, but that is not the case.

I see heavy disk I/O by ollama instead; this must be the bottleneck.

It seems that ollama is not using memory efficiently.


@easp commented on GitHub (Apr 30, 2024):

How much RAM do you actually have?


@siakc commented on GitHub (Apr 30, 2024):

I have 32 GB of physical memory, and only 3 GB is filled. 6 GB of VRAM is filled (of the Nvidia 3070's 8 GB). The GPU is idle, htop reports heavy disk I/O, and the CPU is also idle.
These logs may be relevant:

```
llm_load_tensors: ggml ctx size =    0.55 MiB
llm_load_tensors: offloading 11 repeating layers to GPU
llm_load_tensors: offloaded 11/81 layers to GPU
llm_load_tensors:        CPU buffer size = 38110.61 MiB
llm_load_tensors:      CUDA0 buffer size =  5049.69 MiB
...
Device 0: NVIDIA GeForce RTX 3070, compute capability 8.6, VMM: yes
llama_kv_cache_init:  CUDA_Host KV buffer size =   552.00 MiB
llama_kv_cache_init:      CUDA0 KV buffer size =    88.00 MiB
llama_new_context_with_model: KV self size  =  640.00 MiB, K (f16):  320.00 MiB, V (f16):  320.00 MiB
llama_new_context_with_model:  CUDA_Host input buffer size   =    21.02 MiB
llama_new_context_with_model:      CUDA0 compute buffer size =   324.01 MiB
llama_new_context_with_model:  CUDA_Host compute buffer size =   336.00 MiB
llama_new_context_with_model: graph splits (measure): 5
```

@easp commented on GitHub (May 1, 2024):

llm_load_tensors: CPU buffer size = 38110.61 MiB

38110.61 MiB > 32GB.

You have heavy disk I/O because you don't have enough RAM. Your CPU is idle because it's waiting for at least 6GB of model weights to be loaded from disk for every token generated. Your GPU is idle because it is waiting for the CPU.

Your RAM utilization probably looks low because you are looking at the wrong metric. Model weights are memory mapped; on macOS and Linux they are counted in the file cache size.
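
Incidentally, a quick way to see where the weights actually sit is to watch the kernel's page cache rather than per-process RSS. A minimal sketch (not Ollama code; it just reads the standard /proc/meminfo fields on Linux):

```go
// meminfo.go: print free memory vs. page cache, so memory-mapped model
// weights show up under "Cached" instead of a process's resident set.
// Linux only.
package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
)

func main() {
	f, err := os.Open("/proc/meminfo")
	if err != nil {
		panic(err)
	}
	defer f.Close()

	want := map[string]bool{"MemTotal": true, "MemFree": true, "Buffers": true, "Cached": true}
	s := bufio.NewScanner(f)
	for s.Scan() {
		// Lines look like: "Cached:        12345678 kB"
		fields := strings.Fields(s.Text())
		if len(fields) < 2 {
			continue
		}
		key := strings.TrimSuffix(fields[0], ":")
		if want[key] {
			fmt.Printf("%-9s %s kB\n", key, fields[1])
		}
	}
}
```

While a model is being read, Cached should climb toward the size of the weights even though ollama's own RSS stays small.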


@siakc commented on GitHub (May 1, 2024):

38 - 32 = 6, so I have plenty of memory. This would still be a problem if the whole model has to be traversed so many times for each response. And please note that we have 7 GB of VRAM.


@easp commented on GitHub (May 1, 2024):

Your math is dead backwards.

You have 32GB of RAM. Ollama is allocating 38GB for a CPU buffer and 6GB for a CUDA buffer.


@siakc commented on GitHub (May 1, 2024):

OK, but then only about 7 GB can't be present in physical memory. Should that make it behave like this?


@easp commented on GitHub (May 1, 2024):

The entire model is traversed for each token. As I said before, that means that 6+ GB have to be read from storage for each token. That'll probably add at least a couple of seconds per token.
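
(For a rough sense of scale, assuming an NVMe drive with ~3 GB/s of sequential read bandwidth: re-reading ~6 GB of evicted weights takes about 2 seconds per token before any compute happens, and a SATA SSD or spinning disk would be several times slower.)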


@dhiltgen commented on GitHub (May 1, 2024):

We should probably take total physical memory into consideration and, if we're clearly going to start thrashing, refuse to load the model and fail with a clearer error message.
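
A minimal sketch of what such a guard might look like (names and numbers here are hypothetical, not Ollama's actual loader code):

```go
// Hypothetical pre-load check: refuse to load a model whose weights cannot
// fit in system RAM plus total VRAM, since partial residency forces weights
// to be re-read from disk on every token (thrashing).
package main

import (
	"errors"
	"fmt"
)

var errInsufficientMemory = errors.New("model requires more memory than available RAM + VRAM")

func checkFits(modelBytes, totalRAM, totalVRAM uint64) error {
	if modelBytes > totalRAM+totalVRAM {
		return fmt.Errorf("%w: model needs %.1f GiB, system has %.1f GiB RAM + %.1f GiB VRAM",
			errInsufficientMemory,
			float64(modelBytes)/(1<<30),
			float64(totalRAM)/(1<<30),
			float64(totalVRAM)/(1<<30))
	}
	return nil
}

func main() {
	// Roughly the numbers from this thread: ~43 GiB of weights vs.
	// 32 GiB of RAM and 8 GiB of VRAM.
	if err := checkFits(43<<30, 32<<30, 8<<30); err != nil {
		fmt.Println("refusing to load:", err)
	}
}
```

In a real implementation the threshold would need some headroom (the OS, KV cache, and other processes also need RAM), but even this crude comparison would turn the silent thrashing above into an actionable error.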


@siakc commented on GitHub (May 7, 2024):

How can I see that the model file is mmapped? I used lsof but don't see the model file in the output list.


@easp commented on GitHub (May 7, 2024):

@siakc On Linux, make sure you run lsof with sudo, as the model file is opened by the ollama server, which typically runs as the "ollama" user rather than your own.


@siakc commented on GitHub (May 7, 2024):

I did so, but with no luck:

`sudo lsof +D ./.ollama/models`
Reference: github-starred/ollama#64413