[GH-ISSUE #7289] VPTQ Model Quantization Support in Ollama #51144


Originally created by @YangWang92 on GitHub (Oct 21, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/7289

Hi all,

We recently developed a fully open-source quantization method called VPTQ (Vector Post-Training Quantization, https://github.com/microsoft/VPTQ), which enables fast quantization of large language models (LLMs) down to 1-4 bits. The community has also helped release several models quantized with this method: https://huggingface.co/VPTQ-community. I am personally very interested in integrating VPTQ into ollama/llama.cpp.

One of the key advantages of VPTQ is that the dequantization method is very straightforward, relying only on a simple lookup table.
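
For readers unfamiliar with vector quantization, here is a minimal sketch of what codebook-based dequantization looks like. The shapes, names (`VECTOR_DIM`, `NUM_CENTROIDS`, `dequantize`), and bit width are illustrative assumptions, not the actual VPTQ on-disk layout:

```python
import numpy as np

# Illustrative parameters -- NOT the actual VPTQ format.
VECTOR_DIM = 8        # length of each quantized weight vector (assumed)
NUM_CENTROIDS = 256   # 2^8 centroids -> 8 bits per 8 weights = 1 bit/weight

def dequantize(indices: np.ndarray, codebook: np.ndarray) -> np.ndarray:
    """Reconstruct fp16 weights from centroid indices via a table lookup.

    indices:  (num_vectors,) int32 array of centroid ids
    codebook: (NUM_CENTROIDS, VECTOR_DIM) fp16 centroid table
    """
    # The whole "dequant" is a gather: each index selects one codebook row.
    return codebook[indices].reshape(-1)

# Toy usage with random data.
rng = np.random.default_rng(0)
codebook = rng.standard_normal((NUM_CENTROIDS, VECTOR_DIM)).astype(np.float16)
indices = rng.integers(0, NUM_CENTROIDS, size=1024, dtype=np.int32)
weights = dequantize(indices, codebook)  # shape (8192,), dtype float16
```

Once the codebook is resident, dequantization is a single gather, which is what makes it plausible to implement as a standalone op.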

I would like to ask for guidance on how best to support this quantization method within Ollama, even if it's on my own fork. Specifically, which approach should I take?

  1. Define a series of new models (e.g., vptq-llama3.1) using existing data types (int32, fp16), and hide the model dequantization within a separate dequant op (a rough sketch of this option follows after the list)?

  2. Define a new quantization data type (e.g., a custom lookup table data structure)?
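
To make option 1 concrete, here is a rough sketch (in Python/NumPy, purely illustrative) of a layer that stores its quantized weights as two ordinary tensors, an int32 index tensor and an fp16 codebook, with dequantization as a separate step before the matmul. `VPTQLinear` and all names here are hypothetical, not an existing Ollama or llama.cpp API:

```python
import numpy as np

class VPTQLinear:
    """Option 1 sketch: keep the quantized layer as two plain tensors
    (int32 indices + fp16 codebook) and expose dequantization as a
    separate op. Class and attribute names are hypothetical."""

    def __init__(self, indices, codebook, out_features, in_features):
        self.indices = indices        # int32 -- an existing data type
        self.codebook = codebook      # fp16  -- an existing data type
        self.shape = (out_features, in_features)

    def dequant(self):
        # The separate dequant op: gather centroid rows, then reshape
        # back into the original weight matrix.
        return self.codebook[self.indices].reshape(self.shape)

    def forward(self, x):
        # Naive version: materialize the fp16 weight, then matmul.
        # A fused kernel would instead look up centroids inside the
        # matmul loop, never materializing the full weight.
        return x @ self.dequant().T
```

Option 2 would instead fold the index/codebook pair into a single new quantized data type, closer in spirit to llama.cpp's existing block-quant formats, which pack a scale and the quantized values into one struct that the kernels consume directly.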

I’d love to hear your thoughts or any suggestions on how to proceed!

Thank you!
Yang


@guibirow commented on GitHub (Oct 25, 2024):

#2821


@YangWang92 commented on GitHub (Oct 25, 2024):

> #2821

Hi @guibirow,
cool, thanks for pointing this out!


@guibirow commented on GitHub (Oct 25, 2024):

How does this compare/relate to https://github.com/microsoft/BitNet


@YangWang92 commented on GitHub (Oct 25, 2024):

> How does this compare/relate to https://github.com/microsoft/BitNet

Our VPTQ is a weight-only post-training quantization method, so it clearly cannot achieve the same results as BitNet at the same bit level (BitNet trains low-bit models from scratch, while VPTQ quantizes existing pretrained weights). The advantage of VPTQ is that it provides 1-2 bit quantization for the latest models, such as Llama 3.1 and Qwen2.5: https://huggingface.co/VPTQ-community
