[PR #10363] [MERGED] Move quantization to new backend #39094

Closed
opened 2026-04-22 23:44:44 -05:00 by GiteaMirror · 0 comments
Owner

📋 Pull Request Information

Original PR: https://github.com/ollama/ollama/pull/10363
Author: @dhiltgen
Created: 4/21/2025
Status: Merged
Merged: 5/6/2025
Merged by: @dhiltgen

Base: main ← Head: ggml_quant


📝 Commits (2)

  • 3b329d9 Move quantization logic to GGML via new backend
  • ed9acd0 Remove "add model quantizations"

📊 Changes

39 files changed (+1850 additions, -436 deletions)

View changed files

📝 cmd/cmd.go (+5 -1)
📝 convert/convert.go (+9 -9)
📝 convert/convert_bert.go (+3 -3)
📝 convert/convert_commandr.go (+3 -3)
📝 convert/convert_gemma.go (+3 -3)
📝 convert/convert_gemma2_adapter.go (+3 -3)
📝 convert/convert_llama.go (+4 -4)
📝 convert/convert_llama4.go (+5 -5)
📝 convert/convert_llama_adapter.go (+3 -3)
📝 convert/convert_mistral.go (+3 -3)
📝 convert/convert_mixtral.go (+3 -3)
📝 convert/convert_phi3.go (+5 -5)
📝 convert/convert_qwen2.go (+3 -3)
📝 fs/ggml/ggml.go (+47 -39)
📝 fs/ggml/gguf.go (+57 -52)
📝 fs/ggml/gguf_test.go (+1 -1)
📝 fs/ggml/type.go (+286 -130)
📝 integration/model_arch_test.go (+0 -11)
➕ integration/quantization_test.go (+130 -0)
📝 integration/utils_test.go (+11 -0)

...and 19 more files

📄 Description

This change moves the quantization support for model creation to the new ml backend via CGO and implements the model-specific type adjustments in Go. The algorithm balances parallelism against CPU and memory load to reduce the time the conversion takes. It also changes the fs/ggml module to support round-trip serialization and to expose the supported quantization types. Tensors can now be written from a channel, enabling more efficient memory usage during conversion.

This enables removing one of our carried patches.

New CLI UX:

```
% cat test.modelfile
FROM llava:34b-v1.6-fp16
% ./ollama create test4 -f ./test.modelfile -q Q4_0
gathering model components
quantizing F16 model to Q4_0 100% ▕████████████████████████████████████▏  68 GB
verifying conversion
creating new layer sha256:8156890c443d9809ba8e00c2990638462a1ad038c53a93bc0c9dd94fa94c3031
using existing layer sha256:83720bd8438ccdc910deba5efbdc3340820b29258d94a7a60d1addc9a1b5f095
using existing layer sha256:43070e2d4e532684de521b885f385d0841030efa2b1a20bafb76133a5e1379c1
using existing layer sha256:a47b02e00552cd7022ea700b1abf8c572bb26c9bc8c1a37e01b566f2344df5dc
using existing layer sha256:f02dd72bb2423204352eabc5637b44d79d17f109fdb510a7c51455892aa2d216
writing manifest
success
```

🔄 This issue represents a GitHub Pull Request. It cannot be merged through Gitea due to API limitations.

GiteaMirror added the pull-request label 2026-04-22 23:44:44 -05:00

Reference: github-starred/ollama#39094