[GH-ISSUE #10756] Is the change in memory usage expected going from 0.6.8 to 0.7.0 #69124

Closed
opened 2026-05-04 17:14:04 -05:00 by GiteaMirror · 7 comments

Originally created by @MarkWard0110 on GitHub (May 17, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/10756

Originally assigned to: @jessegross on GitHub.

What is the issue?

After installing 0.7.0, I noticed that the context sizes I was using now push models into system RAM, where they were 100% in VRAM on 0.6.8.

I am wondering whether the change in memory usage for a given model and context size is expected in 0.7.0.

The original column is the largest context size that fit 100% in VRAM using Ollama 0.6.8. The latest column is the maximum context size that fits in VRAM using Ollama 0.7.0. The % change column is the relative change from original to latest.

What is weird is the llama3.2-vision:11b change in context: it jumped straight to the maximum context size, which is suspicious.

| model_name | original | latest | % change |
|------------|----------|--------|----------|
| bespoke-minicheck:7b-fp16 | 32768 | 32768 | 0.00% |
| gemma3:12b-it-fp16 | 79584 | 60459 | -24.04% |
| gemma3:1b-it-fp16 | 32768 | 32768 | 0.00% |
| gemma3:27b-it-q4_K_M | 86211 | 78660 | -8.77% |
| gemma3:27b-it-q8_0 | 29941 | 16105 | -46.20% |
| gemma3:4b-it-fp16 | 131072 | 131072 | 0.00% |
| llama3.1:8b-instruct-fp16 | 82109 | 78312 | -4.62% |
| llama3.1:8b-instruct-q2_K | 126928 | 124865 | -1.62% |
| llama3.1:8b-instruct-q3_K_L | 122121 | 120258 | -1.53% |
| llama3.1:8b-instruct-q3_K_M | 123435 | 121513 | -1.56% |
| llama3.1:8b-instruct-q3_K_S | 125022 | 123038 | -1.59% |
| llama3.1:8b-instruct-q4_0 | 120932 | 119119 | -1.50% |
| llama3.1:8b-instruct-q4_K_M | 119787 | 118069 | -1.43% |
| llama3.1:8b-instruct-q6_K | 114344 | 112186 | -1.89% |
| llama3.1:8b-instruct-q8_0 | 108771 | 106571 | -2.02% |
| llama3.2-vision:11b-instruct-fp16 | 47114 | 131072 | 178.32% |
| llama3.2-vision:11b-instruct-q4_K_M | 89288 | 131072 | 46.77% |
| llama3.2-vision:11b-instruct-q8_0 | 76032 | 131072 | 72.41% |
| llama3.2:1b-instruct-fp16 | 131072 | 131072 | 0.00% |
| llama3.2:3b-instruct-fp16 | 131072 | 131072 | 0.00% |
| llama3.3:70b-instruct-q2_K | 20183 | 18602 | -7.84% |
| llama3.3:70b-instruct-q3_K_M | 6597 | 4104 | -37.80% |
| llava-llama3:8b-v1.1-q4_0 | 8192 | 8192 | 0.00% |
| mistral-small3.1:24b-instruct-2503-q4_K_M | 36287 | 35209 | -2.97% |
| mistral-small3.1:24b-instruct-2503-q8_0 | 7676 | 7086 | -7.68% |
| mistral-small:24b-instruct-2501-q4_K_M | 32768 | 32768 | 0.00% |
| mistral-small:24b-instruct-2501-q8_0 | 32768 | 32768 | 0.00% |
| nomic-embed-text:137m-v1.5-fp16 | 2048 | 2048 | 0.00% |
| phi4-mini-reasoning:3.8b-fp16 | 113297 | 106863 | -5.68% |
| phi4-mini:3.8b-fp16 | 113297 | 106863 | -5.68% |
| phi4-reasoning:14b-fp16 | 19318 | 18035 | -6.64% |
| phi4-reasoning:14b-plus-fp16 | 19318 | 18035 | -6.64% |
| phi4:14b-fp16 | 16384 | 16384 | 0.00% |
| phi4:14b-q4_K_M | 16384 | 16384 | 0.00% |
| qwen2.5-coder:1.5b-instruct-fp16 | 32768 | 32768 | 0.00% |
| qwen2.5-coder:3b-instruct-fp16 | 32768 | 32768 | 0.00% |
| qwen2.5-coder:7b-instruct-q8_0 | 32768 | 32768 | 0.00% |
| qwen3:0.6b-fp16 | 40960 | 40960 | 0.00% |
| qwen3:1.7b-fp16 | 40960 | 40960 | 0.00% |
| qwen3:14b-fp16 | 21504 | 17821 | -17.16% |
| qwen3:14b-q4_K_M | 40960 | 40960 | 0.00% |
| qwen3:30b-a3b-q4_K_M | 40960 | 40960 | 0.00% |
| qwen3:30b-a3b-q8_0 | 15673 | 15673 | 0.00% |
| qwen3:32b-q4_K_M | 19145 | 18536 | -3.19% |
| qwen3:4b-fp16 | 40960 | 40960 | 0.00% |
| qwen3:8b-fp16 | 40960 | 40960 | 0.00% |
| qwen3:8b-q4_K_M | 40960 | 40960 | 0.00% |
| qwq:32b-q4_K_M | 39447 | 38403 | -2.65% |
| qwq:32b-q8_0 | 8552 | 5649 | -33.91% |
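
For anyone trying to reproduce these numbers, here is a minimal sketch of one way such a table can be generated: binary-search the largest num_ctx for which the loaded model still reports full VRAM residency. The /api/generate and /api/ps endpoints are Ollama's real API; the helper names, search bounds, and probing strategy are illustrative assumptions, not the harness actually used for the table.

```python
import requests

OLLAMA = "http://localhost:11434"

def fits_in_vram(model: str, num_ctx: int) -> bool:
    """Load `model` at the given context size, then check /api/ps to see
    whether the whole footprint (weights + KV cache) sits in VRAM.
    Ollama reloads the model automatically when num_ctx changes."""
    requests.post(f"{OLLAMA}/api/generate", json={
        "model": model,
        "prompt": "hi",
        "options": {"num_ctx": num_ctx},
        "stream": False,
    }, timeout=600)
    ps = requests.get(f"{OLLAMA}/api/ps", timeout=30).json()
    m = next(x for x in ps["models"] if x["name"] == model)
    # "100% GPU" in `ollama ps` corresponds to size_vram == size.
    return m["size_vram"] >= m["size"]

def max_ctx(model: str, lo: int = 2048, hi: int = 131072) -> int:
    """Binary-search the largest context size that still fits entirely in VRAM."""
    if not fits_in_vram(model, lo):
        return 0
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if fits_in_vram(model, mid):
            lo = mid
        else:
            hi = mid - 1
    return lo

def pct_change(original: int, latest: int) -> str:
    """The table's % change column: relative change from original to latest."""
    return f"{(latest - original) / original:+.2%}"

print(max_ctx("gemma3:12b-it-fp16"))
```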

System information
Windows 11 Pro
Intel Core i9-14900K
System RAM: 96 GB
2 GPUs:

  • 3090 24GB
  • 4070 Ti Super 16GB

Primary video output is the motherboard (iGPU from the CPU); no monitors are connected to the GPUs. No applications other than Ollama are using the GPUs.

```
C:\Users\wardm> nvidia-smi
Sat May 17 09:59:25 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 576.28                 Driver Version: 576.28         CUDA Version: 12.9     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                  Driver-Model | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 3090      WDDM  |   00000000:01:00.0 Off |                  N/A |
|  0%   41C    P8              7W /  370W |       0MiB /  24576MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA GeForce RTX 4070 ...  WDDM  |   00000000:08:00.0 Off |                  N/A |
|  0%   38C    P8              3W /  285W |       0MiB /  16376MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+
```

Relevant log output

(none provided)
OS

Windows

GPU

Nvidia

CPU

Intel

Ollama version

0.7.0

GiteaMirror added the bug label 2026-05-04 17:14:04 -05:00

@MarkWard0110 commented on GitHub (May 17, 2025):

I have the following environment variables set:

![Image](https://github.com/user-attachments/assets/0471d8b6-df83-452b-98ca-7e5c08aef6c0)
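
For reproduction it helps to have these in text form as well. A minimal sketch that dumps whatever OLLAMA_* variables are in effect; the variable names in the comments are examples of real Ollama settings, not a claim about what is set in the screenshot above:

```python
import os

# Dump every Ollama-related environment variable currently set.
# Settings such as OLLAMA_FLASH_ATTENTION, OLLAMA_KV_CACHE_TYPE and
# OLLAMA_SCHED_SPREAD all change how much VRAM a given model needs,
# so they matter when comparing 0.6.8 and 0.7.0 side by side.
for key, value in sorted(os.environ.items()):
    if key.startswith("OLLAMA_"):
        print(f"{key}={value}")
```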


@MarkWard0110 commented on GitHub (May 17, 2025):

I think this is related to https://github.com/ollama/ollama/issues/10553, because some of what I am seeing looks similar to what was found with mistral-small in the previous 0.6.x release.


@ACheshirov commented on GitHub (May 17, 2025):

Same here... for some reason there is no ollama process in nvidia-smi and everything is handled by CPU/RAM...
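
(On Windows the WDDM driver model often prevents nvidia-smi from reporting per-process GPU memory, so a more direct check is Ollama's own /api/ps endpoint. A minimal sketch, assuming the default localhost:11434 server:)

```python
import requests

# Ask the Ollama server how each loaded model is placed:
# "size" is the total footprint, "size_vram" the portion on the GPU(s).
resp = requests.get("http://localhost:11434/api/ps", timeout=30)
for m in resp.json()["models"]:
    gpu = m["size_vram"] / m["size"] if m["size"] else 0.0
    print(f"{m['name']}: {gpu:.0%} GPU, "
          f"{(m['size'] - m['size_vram']) / 2**30:.1f} GiB in system RAM")
```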


@rick-github commented on GitHub (May 17, 2025):

https://github.com/ollama/ollama/issues/10726


@DarkCaster commented on GitHub (May 18, 2025):

Same problem. Some models that were carefully selected to fit in VRAM with their context in Ollama <= 0.6.8 no longer fit in Ollama 0.7.0, which leads to a noticeable performance drop.


@JKratto commented on GitHub (May 18, 2025):

I can confirm that I have a similar problem. Previously, llama3.3 q4_K_M split between three 16 GB GPUs with about 30k context (q4 KV cache) fit nicely at about 96% VRAM utilization. Now it just crashes with an OOM...
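
(For scale, a hedged back-of-the-envelope for the KV-cache share of that setup; the layer and head counts are the published Llama 3 70B architecture, and the bytes-per-element figure for a q4-style cache is approximate:)

```python
# Rough KV-cache size for llama3.3:70b with a quantized (q4-style) cache.
n_layers, n_kv_heads, head_dim = 80, 8, 128  # Llama 3.x 70B architecture
n_ctx = 30_000                               # ~30k context from the comment
bytes_per_elem = 0.5625                      # ~4.5 bits/element for q4_0-style blocks

kv_bytes = 2 * n_layers * n_ctx * n_kv_heads * head_dim * bytes_per_elem  # K and V
print(f"KV cache ≈ {kv_bytes / 2**30:.1f} GiB")  # ≈ 2.6 GiB (an fp16 cache would be ~9.2 GiB)
```

At ~96% utilization of 3 × 16 GB there is only about 2 GB of headroom, so even a modest increase in the estimated per-GPU overhead is enough to tip the allocation into an OOM.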


@jessegross commented on GitHub (May 20, 2025):

Fixed in #10773

Reference: github-starred/ollama#69124