[GH-ISSUE #9471] GPU memory requirements are off #6171

Closed
opened 2026-04-12 17:32:35 -05:00 by GiteaMirror · 2 comments
Owner

Originally created by @forReason on GitHub (Mar 3, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/9471

What is the issue?

I don't know exactly how to put this. I am experimenting with llama3.3:70b and I'm seeing something really odd on my personal machine that behaves differently from my work machine, and I can't figure out why:

In ollama, the model sizes are:

NAME                                 ID              SIZE      MODIFIED       
llama3.3:70b                         a6eb4748fd29    42 GB     19 seconds ago    
llama3.3:70b-instruct-q2_K           a6f03da15cbc    26 GB     9 hours ago       
llama3.3:70b-instruct-q3_K_S         84d6ecd40b42    30 GB     9 hours ago

On my work computer, the llama3.3:70b fits conveniently onto one RTX 6000 GPU with 48 GB VRAM.
However, on my homelab, the model is not 42 GB, it's 61:

ollama ps
NAME               ID              SIZE     PROCESSOR          UNTIL              
llama3.3:latest    a6eb4748fd29    61 GB    44%/56% CPU/GPU    4 minutes from now

The GPU memory also seems underutilized, leaving almost 5GB free on each GPU:

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.120                Driver Version: 550.120        CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 3060        On  |   00000000:01:00.0 Off |                  N/A |
| 59%   59C    P2             83W /  100W |    7400MiB /  12288MiB |     99%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA RTX A2000 12GB          On  |   00000000:05:00.0 Off |                  Off |
| 30%   41C    P2             37W /   70W |    8068MiB /  12282MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   2  NVIDIA GeForce RTX 3060        On  |   00000000:0A:00.0 Off |                  N/A |
| 33%   51C    P2             47W /  100W |    7652MiB /  12288MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

This is also the case for the two smaller quantized versions. It's especially notable for the q2_K version, which takes over 45 GB on my GPUs when in theory it should fit into the 36 GB of combined VRAM. Any ideas what could cause this stark difference between my work computer and my home lab?

Relevant log output

No response

OS

No response

GPU

No response

CPU

No response

Ollama version

No response

GiteaMirror added the bug label 2026-04-12 17:32:35 -05:00
Author
Owner

@forReason commented on GitHub (Mar 3, 2025):

Note: this is because of the context length, which I have set quite high.
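
As a rough back-of-the-envelope sketch, assuming Llama 3 70B's published shape (80 layers, 8 grouped-query KV heads, head dimension 128) and an f16 KV cache, here is how num_ctx drives the extra memory:

```python
# Rough KV-cache estimate for llama3.3:70b. The architecture numbers
# below are Llama 3 70B's published shape, not values read from the
# running model.
n_layers = 80      # transformer blocks
n_kv_heads = 8     # grouped-query attention KV heads
head_dim = 128     # per-head dimension
bytes_per = 2      # f16 K/V entries

# K and V each store n_kv_heads * head_dim values per layer per token.
kv_bytes_per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per

for num_ctx in (2048, 8192, 32768, 65536):
    gib = num_ctx * kv_bytes_per_token / 2**30
    print(f"num_ctx={num_ctx:>6}: ~{gib:.1f} GiB of KV cache")
```

At a small context the cache stays under 1 GiB, but at a 64k context it alone adds roughly 20 GiB, which is in line with the jump from 42 GB to 61 GB reported by ollama ps.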

Author
Owner

@rick-github commented on GitHub (Mar 3, 2025):

> However, on my homelab, the model is not 42 GB, it's 61:

There are data structures that are duplicated for each device that the model is running on.

> The GPU memory also seems underutilized, leaving almost 5GB free on each GPU:

The memory estimation is sometimes inaccurate. You can make ollama load more layers into VRAM by setting num_gpu either in an API call or in the Modelfile, see https://github.com/ollama/ollama/issues/6950#issuecomment-2373663650. Note that if you set it too high, you may experience OOMs or performance degradation.
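
For example, a minimal sketch of the API route, assuming a local Ollama server on the default port; the num_gpu value of 60 is illustrative (llama3.3:70b has 80 transformer layers plus the output layer, so 81 would request full offload):

```python
import requests

# Minimal sketch, assuming a local Ollama server on the default port.
# num_gpu is the number of layers to offload to the GPU(s); 60 here is
# illustrative. Raise it until you run out of VRAM, then back off.
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "llama3.3:70b",
        "prompt": "Why is the sky blue?",
        "stream": False,
        "options": {"num_gpu": 60},
    },
    timeout=600,
)
print(resp.json()["response"])
```

The Modelfile equivalent is a PARAMETER num_gpu 60 line. Either way, if you see CUDA OOM errors or a drop in throughput, lower the value.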

Reference: github-starred/ollama#6171