[GH-ISSUE #10729] qwen2.5vl:72b-q4_K_M file size (71 GB) appears abnormally large #32806

Closed
opened 2026-04-22 14:38:43 -05:00 by GiteaMirror · 4 comments

Originally created by @gakugaku on GitHub (May 16, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/10729

Originally assigned to: @BruceMacD on GitHub.

### Overview

The Ollama build `qwen2.5vl:72b-q4_K_M` is shown as **71 GB**, while the same 72B model quantized to 4-bit (`q4_K_M`) by unsloth is only **47.4 GB**.

### Model sizes

| Parameters | Provider | Model / File | Size |
|------------|----------|--------------|------|
| 32 B | Ollama | `qwen2.5vl:32b-q4_K_M` | 21 GB |
| | Ollama | `qwen2.5vl:32b-q8_0` | 36 GB |
| | Ollama | `qwen2.5vl:32b-fp16` | 67 GB |
| 72 B | **Ollama** | **`qwen2.5vl:72b-q4_K_M`** | **71 GB** |
| | Ollama | `qwen2.5vl:72b-q8_0` | 79 GB |
| | Ollama | `qwen2.5vl:72b-fp16` | 147 GB |
| | unsloth | `Qwen2.5-VL-72B-Instruct-IQ4_XS.gguf` | 39.7 GB |
| | unsloth | `Qwen2.5-VL-72B-Instruct-IQ4_NL.gguf` | 41.3 GB |
| | unsloth | `Qwen2.5-VL-72B-Instruct-Q4_0.gguf` | 41.4 GB |
| | unsloth | `Qwen2.5-VL-72B-Instruct-Q4_K_S.gguf` | 43.9 GB |
| | unsloth | `Qwen2.5-VL-72B-Instruct-Q4_1.gguf` | 45.7 GB |
| | **unsloth** | **`Qwen2.5-VL-72B-Instruct-Q4_K_M.gguf`** | **47.4 GB** |
| | unsloth | `Qwen2.5-VL-72B-Instruct-UD-Q4_K_XL.gguf` | 47.3 GB |

Ollama sizes are taken from the tag page, unsloth sizes from the Hugging Face file list.

- <https://ollama.com/library/qwen2.5vl/tags>
- <https://huggingface.co/unsloth/Qwen2.5-VL-72B-Instruct-GGUF/tree/main>

### Observations

- The **32B** build (`qwen2.5vl:32b-q4_K_M`, 21 GB) looks normal, so the issue seems **specific to the 72B build**, not to VL models in general.
- None of the unsloth 72B 4-bit quantized files exceed 50 GB.
- I would like to confirm whether 71 GB is expected or whether something went wrong during quantization/export (see the back-of-envelope estimate below).
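As a back-of-envelope check (a sketch, assuming ~73B total parameters for Qwen2.5-VL-72B and the roughly 4.85 bits-per-weight average commonly cited for llama.cpp's Q4_K_M mix; both figures are approximations, not measurements from this issue), a Q4_K_M file should land in the mid-40 GB range:

```python
# Rough Q4_K_M size estimate. Both inputs are assumptions:
# - ~73e9 total parameters for Qwen2.5-VL-72B (approximate published figure)
# - Q4_K_M averages roughly 4.85 bits/weight (Q4_K blocks are 4.5 bpw;
#   mixing in some Q6_K tensors pushes the average slightly higher)
params = 73e9
bits_per_weight = 4.85

size_gb = params * bits_per_weight / 8 / 1e9
print(f"Expected Q4_K_M size: ~{size_gb:.1f} GB")  # ~44.3 GB
```

That lands near unsloth's 47.4 GB file (which also carries the vision tower and metadata), whereas 71 GB works out to roughly 7.8 bits/weight, much closer to an 8-bit export.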

### Questions / Requests

1. Could you please verify whether the 71 GB size is intended?
2. If it is correct, what accounts for the ~24 GB difference compared with other 72B `q4_K_M` builds?
3. If it is unintended, would it be possible to republish a `q4_K_M` build closer to ~47 GB?
4. Would you consider publishing the exact quantization build steps for Ollama models? Full transparency would help users validate and trust the quantized versions.

### Related

- https://github.com/ollama/ollama/issues/6564

### Ollama version

0.7.0
GiteaMirror added the bug label 2026-04-22 14:38:43 -05:00

@gitcob commented on GitHub (May 19, 2025):

Just to point out the obvious: this means the model won't fit in 48 GB of VRAM, which is currently a sweet spot reachable with two consumer GPUs or one pro GPU.


@BruceMacD commented on GitHub (May 22, 2025):

Thanks for reporting this. It was a bug in quantization. I fixed the model and pushed it to ollama.com so it should be better now!
https://ollama.com/library/qwen2.5vl:72b
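For anyone who wants to verify a republished build like this one, the `gguf` Python package maintained alongside llama.cpp can list each tensor's quantization type, which makes a mis-quantized export (for example, large tensors left at F16) easy to spot. A minimal sketch, assuming a local copy of the GGUF file (the filename below is a placeholder):

```python
# Sketch: summarize a GGUF file's bytes by quantization type.
# Requires `pip install gguf`; the filename is a placeholder.
from collections import Counter

from gguf import GGUFReader

reader = GGUFReader("Qwen2.5-VL-72B-Instruct-Q4_K_M.gguf")

bytes_by_type = Counter()
for tensor in reader.tensors:
    bytes_by_type[tensor.tensor_type.name] += int(tensor.n_bytes)

for qtype, nbytes in bytes_by_type.most_common():
    print(f"{qtype:>6}: {nbytes / 1e9:6.1f} GB")
```

In a healthy Q4_K_M export most of the bytes should sit in Q4_K, with smaller Q6_K and F32 shares; a dominant F16 or Q8_0 bucket would explain a 71 GB file.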


@gakugaku commented on GitHub (May 23, 2025):

@BruceMacD
Thank you for the fix.

The updated model is 49 GB, which unfortunately means it still won't fit on many 48 GB GPUs. As @gitcob pointed out, this matters for Ollama users.
What do you think accounts for the difference between the unsloth quantization and Ollama's [qwen2.5:72b](https://ollama.com/library/qwen2.5:72b) (47 GB)?


@gitcob commented on GitHub (May 27, 2025):

This model might be tough to squeeze into 48 GB. Ollama reports 57 GB (14% CPU) using the default context size.
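To put rough numbers on that (a sketch; the architecture values are assumptions based on Qwen2.5-72B's published config of 80 layers and 8 grouped-query KV heads with head dimension 128, not figures from this thread), the fp16 KV cache adds a fixed per-token cost on top of the 49 GB of weights:

```python
# Sketch: fp16 KV-cache size for a Qwen2.5-72B-class model.
# Architecture values are assumptions from the published config.
n_layers = 80    # transformer blocks
n_kv_heads = 8   # grouped-query KV heads
head_dim = 128   # per-head dimension
bytes_fp16 = 2
ctx = 4096       # example context length

# K and V caches per token, summed over all layers
kv_per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_fp16
total_gb = kv_per_token * ctx / 1e9
print(f"{kv_per_token / 1024:.0f} KiB/token, ~{total_gb:.1f} GB at ctx={ctx}")
```

At a 4K context the cache itself is only about 1.3 GB, so the remainder of the reported 57 GB is compute buffers, the vision encoder, and per-GPU overhead; either way, a 49 GB weight file alone leaves no headroom on 48 GB of VRAM.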


Reference: github-starred/ollama#32806