[PR #15690] Optimize ggml_vec_dot_q4_K_q8_K with AVX-512 implementation #61961

Open
opened 2026-04-29 16:55:50 -05:00 by GiteaMirror · 0 comments

📋 Pull Request Information

Original PR: https://github.com/ollama/ollama/pull/15690
Author: @emad-elsaid
Created: 4/19/2026
Status: 🔄 Open

Base: main ← Head: feature/avx512-q4k-optimization


📝 Commits (1)

  • 6641f5e Optimize ggml_vec_dot_q4_K_q8_K with minimal AVX-512 implementation

📊 Changes

1 file changed (+76 additions, -1 deletions)


📝 ml/backend/ggml/ggml/src/ggml-cpu/arch/x86/quants.c (+76 -1)

📄 Description

Code change assisted by ClaudeCode/OpenCode + Sonnet 4.5

Add an AVX-512 code path for the Q4_K quantization kernel that achieves a 34.5% performance improvement (18.93 → 25.46 tok/s on a Tiger Lake i7-1185G7).

Implementation strategy:

  • Keep the AVX2 algorithm for data processing (64 elements/iteration)
  • Use 512-bit accumulator for final summation to reduce reduction overhead
  • Add hsum_float_16() helper for efficient 512-bit horizontal reduction

Benchmark: qwen2.5-coder:1.5b on Intel i7-1185G7 (Tiger Lake, AVX-512 capable)

  • Baseline (AVX2): 18.93 tok/s
  • Optimized (AVX-512): 25.46 tok/s
  • Improvement: +34.5%

🔄 This issue represents a GitHub Pull Request. It cannot be merged through Gitea due to API limitations.

GiteaMirror added the pull-request label 2026-04-29 16:55:50 -05:00

Reference: github-starred/ollama#61961