[GH-ISSUE #10021] AWQ support? #68628

Open
opened 2026-05-04 14:39:08 -05:00 by GiteaMirror · 8 comments

Originally created by @ALLMI78 on GitHub (Mar 27, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/10021

Hi guys, not sure if it would help, but I only found this: https://github.com/ollama/ollama/issues/1862

### AutoAWQ speeds up models by 3x and reduces memory requirements by 3x compared to FP16.

@jmorganca wrote: "would it be possible to open an issue with the performance improvements or feature sets AWQ brings"

Integrating AWQ (Activation-aware Weight Quantization) into Ollama could offer several technical advantages:

1. **Memory Efficiency**: AWQ reduces the model size by up to 3x compared to FP16 by quantizing weights to 4 bits, allowing Ollama to run large models with significantly reduced memory usage.

2. **Speed Improvement**: AWQ accelerates inference by optimizing model weights for hardware-friendly operations, potentially speeding up response times and reducing compute requirements. <<< how?

3. **Easy Integration**: AWQ is supported by frameworks like Hugging Face and vLLM, meaning Ollama could leverage pre-quantized models or quantize its own with minimal effort (see the loading sketch below).

4. **Hardware Optimization**: AWQ is designed to work well with specialized hardware that benefits from low-bit precision, making it suitable for efficient deployment on resource-constrained devices.

5. **Cost Reduction**: By decreasing memory and compute requirements, AWQ helps lower operational costs when running large models.

In summary, integrating AWQ into Ollama would enhance model efficiency, speed, and scalability while reducing resource consumption and costs.
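
On the integration point: loading a pre-quantized AWQ checkpoint through the Hugging Face stack is already only a few lines. A minimal sketch, assuming the `autoawq` package (and `accelerate` for `device_map`) is installed; the model ID below is just an example of a community AWQ upload, not a recommendation:

```python
# Minimal sketch: loading a pre-quantized AWQ checkpoint via Hugging Face
# transformers (needs the autoawq package, plus accelerate for device_map).
# The model ID is only an example of a community AWQ upload.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "TheBloke/Mistral-7B-Instruct-v0.2-AWQ"  # example AWQ checkpoint

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

inputs = tokenizer("Explain AWQ in one sentence.", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```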

https://qwen.readthedocs.io/en/latest/quantization/awq.html

https://www.youtube.com/watch?v=GKd92rhTBGo <<< explained

> AWQ (Activation-aware Weight Quantization) accelerates inference by reducing the precision of model weights to 4 bits, significantly lowering memory bandwidth requirements during the memory-bound decoding phase. This reduction allows for faster data transfer between the GPU and memory, leading to improved response times. Moreover, AWQ preserves a small fraction of critical weights essential for model performance, minimizing accuracy degradation while enhancing speed. Empirical evaluations have demonstrated that AWQ-based inference can be over three times faster than FP16 models on GPUs, highlighting its effectiveness in accelerating large language model inference. [Hugging Face](https://huggingface.co/docs/transformers/en/quantization/awq)
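
To make the bandwidth argument concrete, here is a back-of-the-envelope sketch: during decode each generated token has to stream roughly the full set of weights from VRAM, so bytes-per-token scales with weight precision. The parameter count and bandwidth figure below are assumptions (chosen to resemble a 7B model on an RTX 4070-class card), not measurements:

```python
# Back-of-the-envelope sketch of why 4-bit weights speed up the memory-bound
# decode phase. Assumed numbers, not measurements: a 7B-parameter model and
# ~504 GB/s of memory bandwidth (roughly an RTX 4070).
params = 7e9          # assumed parameter count
bandwidth = 504e9     # assumed memory bandwidth in bytes/s

for name, bytes_per_weight in [("FP16", 2.0), ("AWQ 4-bit", 0.5)]:
    bytes_per_token = params * bytes_per_weight   # weights streamed per token
    tps_ceiling = bandwidth / bytes_per_token     # bandwidth-limited ceiling
    print(f"{name}: ~{bytes_per_token / 1e9:.1f} GB/token, "
          f"ceiling ~{tps_ceiling:.0f} tok/s")
```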

https://medium.com/byte-sized-ai/vllm-quantization-awq-activation-aware-weight-quantization-for-llm-compression-and-35894ffd6a9b

maybe it helps ;)

GiteaMirror added the feature request label 2026-05-04 14:39:08 -05:00

@rick-github commented on GitHub (Mar 28, 2025):

![Image](https://github.com/user-attachments/assets/ed34bcdb-bc72-40c1-89a4-aa1de290ef49)

NVIDIA GeForce RTX 4070, compute capability 8.9
ollama 0.6.2
vllm 0.8.1


@ALLMI78 commented on GitHub (Mar 28, 2025):

Hey Rick, thanks for the test.

I'm not sure about the interpretation of the results or what you're trying to convey.

Does the picture prove for you that:

A. Ollama with Q4_K_M is faster, or that
B. AWQ is faster than Ollama, or
C. both? :)


@rick-github commented on GitHub (Mar 28, 2025):

ollama is faster than vllm using GGUF. vllm with AWQ is faster than GGUF.


@ALLMI78 commented on GitHub (Mar 28, 2025):

;) ok C...

it shows >>> "vllm with AWQ is **much** faster than GGUF"

Did you also pay attention to the memory requirements? Does the same model in the VLLM-AWQ version require less VRAM?


@rick-github commented on GitHub (Mar 28, 2025):

The nature of vllm is to use all available VRAM, so it has a larger footprint than ollama. A better metric would be what's the least amount of VRAM you can use and still have the model function. However, I don't use vllm so I don't know how to tweak the service, I just had a bit of spare time and was curious about AWQ.
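
For reference, vLLM does expose a knob for this: its `gpu_memory_utilization` setting (0.9 by default) caps how much VRAM it pre-allocates for weights plus KV cache, so the footprint can be dialed down. A minimal sketch; the model ID and the 0.5 fraction are only illustrative:

```python
# Minimal sketch: capping vLLM's VRAM pre-allocation. gpu_memory_utilization
# defaults to 0.9; lowering it shrinks the pre-allocated KV-cache pool.
# The model ID and the 0.5 fraction are just examples.
from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen2.5-7B-Instruct-AWQ",  # example pre-quantized AWQ model
    quantization="awq",
    gpu_memory_utilization=0.5,            # cap at ~50% of the GPU's VRAM
)

out = llm.generate(["Say hello."], SamplingParams(max_tokens=32))
print(out[0].outputs[0].text)
```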


@rick-github commented on GitHub (Mar 28, 2025):

Note that what this really points out is that AWQ scales better than GGUF - the non-parallel tps is the same; it's only when you start doing parallel operations that AWQ pulls ahead. So the difference in generation may be reduced by optimisations. It will be interesting to see how the new ollama engine performs. Unfortunately the current docker image of vllm doesn't support gemma3, and ollama's new engine only supports gemma3, so there's no apples-to-apples comparison.
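
One rough way to check the scaling claim on the ollama side is to fire concurrent requests at the local API and compare aggregate tokens/s at different concurrency levels. A hedged sketch, assuming ollama's default port, an example model tag, and OLLAMA_NUM_PARALLEL raised on the server so requests actually decode in parallel:

```python
# Rough sketch: aggregate throughput at different concurrency levels against
# a local ollama server (default port 11434). The model tag is an example,
# and the server needs OLLAMA_NUM_PARALLEL set high enough to batch requests.
import time
import requests
from concurrent.futures import ThreadPoolExecutor

URL = "http://localhost:11434/api/generate"
MODEL = "llama3.1:8b"  # example model tag
PROMPT = "Write a haiku about quantization."

def one_request(_):
    r = requests.post(URL, json={"model": MODEL, "prompt": PROMPT, "stream": False})
    r.raise_for_status()
    return r.json()["eval_count"]  # tokens generated for this request

for parallel in (1, 2, 4, 8):
    start = time.time()
    with ThreadPoolExecutor(max_workers=parallel) as pool:
        tokens = sum(pool.map(one_request, range(parallel)))
    elapsed = time.time() - start
    print(f"{parallel} concurrent: {tokens / elapsed:.1f} aggregate tok/s")
```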


@rick-github commented on GitHub (Mar 28, 2025):

![Image](https://github.com/user-attachments/assets/384f2833-b12e-42ce-901b-928d1078a59b)


@pdevine commented on GitHub (Mar 31, 2025):

So TPS is one metric, but there is also perplexity. When I was thinking about AWQ support for the new engine (vs. Kawrakow's k-quants), I had seen one paper which showed AWQ being dominated on the Pareto front by K quantization (although I've yet to see an actual paper for the K quants).

I'll see if I can find my notes for this.


Reference: github-starred/ollama#68628