[GH-ISSUE #2821] Can we have the newest 1-bit model #1713

Open
opened 2026-04-12 11:41:16 -05:00 by GiteaMirror · 20 comments

Originally created by @chuangtc on GitHub (Feb 29, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/2821

The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits
https://thegenerality.com/agi/
https://arxiv.org/abs/2402.17764
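
For context, "b1.58" refers to ternary weights in {-1, 0, +1}, i.e. log2(3) ≈ 1.585 bits of information per weight. Below is a minimal sketch of absmean-style ternary rounding in the spirit of the paper; it is illustrative only (not the training recipe and not a GGUF format), and the function names and tensor shapes are made up:

```python
import numpy as np

def ternary_quantize(w: np.ndarray):
    """Round weights to {-1, 0, +1} with a per-tensor absmean scale.
    Illustrative sketch of BitNet b1.58-style quantization."""
    scale = np.mean(np.abs(w)) + 1e-8            # absmean scale
    q = np.clip(np.round(w / scale), -1, 1)      # ternary values
    return q.astype(np.int8), float(scale)

def ternary_dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

w = np.random.randn(4, 8).astype(np.float32)
q, s = ternary_quantize(w)
print(np.unique(q))                               # subset of [-1, 0, 1]
print(np.abs(w - ternary_dequantize(q, s)).mean())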

GiteaMirror added the model label 2026-04-12 11:41:16 -05:00

@josharian commented on GitHub (Mar 3, 2024):

IIUC the model hasn't been released yet. When it is, I believe it will appear at https://github.com/microsoft/unilm/tree/master/bitnet. Then there'll be some work to get llama.cpp support. Then Ollama can pull it in.

@unclemusclez commented on GitHub (Oct 14, 2024):

I have this GGUF'd and ready to be pushed to Ollama, but I am getting

`Error: invalid file magic`

I need my wizards.
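
For reference, `invalid file magic` generally means the file does not start with the GGUF magic bytes (the ASCII string `GGUF`), for example because the conversion produced an older GGML-format file or the file was truncated. A quick way to check (the path below is a placeholder):

```python
# Sanity check: a valid GGUF file starts with the 4-byte magic b"GGUF".
# "model.gguf" is a placeholder; point it at the file you tried to push.
with open("model.gguf", "rb") as f:
    magic = f.read(4)
print(magic)  # expect b'GGUF'
print("looks like GGUF" if magic == b"GGUF" else "not recognized as GGUF")
```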

@FGRibreau commented on GitHub (Oct 20, 2024):

It's now released https://github.com/microsoft/BitNet \o/

@akashAD98 commented on GitHub (Oct 20, 2024):

It's released.

@ozbillwang commented on GitHub (Oct 22, 2024):

This request is great! Upvote it.

I managed to create a 70MB CPU-only Ollama Docker image (#7184). However, when testing it in real environments like MacBooks and EC2 instances, the response time was too slow, always with high CPU usage. It struggled to handle requests efficiently.

If we could implement a 1-bit model optimized for CPU inference, it would significantly improve performance and allow us to deploy it widely.

@kth8 commented on GitHub (Oct 23, 2024):

@ozbillwang I did some testing regarding AVX for CPU inferencing. With AVX2 CPU runner:

```
ollama run llama3.2:1b-instruct-q4_K_M
>>> /set verbose
Set 'verbose' mode.
>>> /set parameter temperature 0
Set parameter 'temperature' to '0'
>>> /set parameter seed 0
Set parameter 'seed' to '0'
>>> why is the sky blue?
The sky appears blue because of a phenomenon called Rayleigh scattering, named after the British physicist Lord Rayleigh. He discovered that shorter (blue) 
wavelengths of light are scattered more than longer (red) wavelengths by the tiny molecules of gases in the atmosphere.

Here's what happens:

1. Sunlight enters the Earth's atmosphere.
2. The sunlight is made up of a spectrum of colors, including red, orange, yellow, green, blue, indigo, and violet.
3. The shorter (blue) wavelengths are scattered by the tiny molecules of gases such as nitrogen (N2) and oxygen (O2) in the atmosphere.
4. This scattering effect gives the sky its blue color.

The amount of scattering that occurs depends on several factors, including:

* The altitude of the atmosphere: Scattering decreases with increasing altitude.
* The concentration of atmospheric gases: More gas molecules scatter shorter wavelengths.
* The angle of the sunlight: The more direct the sunlight, the more it is scattered.

As a result, the sky appears blue during the daytime when the sun is overhead and the light has to travel through a longer distance in the atmosphere. At 
sunrise and sunset, the light has to travel through a shorter distance, which scatters the shorter wavelengths, making the sky appear redder or orange.

It's worth noting that the color of the sky can also be affected by other factors, such as pollution, dust, and water vapor, but Rayleigh scattering is the 
primary reason for the blue color we see.

total duration:       18.680602651s
load duration:        26.641548ms
prompt eval count:    31 token(s)
prompt eval duration: 712.777ms
prompt eval rate:     43.49 tokens/s
eval count:           306 token(s)
eval duration:        17.899705s
eval rate:            17.10 tokens/s
```

with just the CPU runner:

```
ollama run llama3.2:1b-instruct-q4_K_M
>>> /set verbose
Set 'verbose' mode.
>>> /set parameter temperature 0
Set parameter 'temperature' to '0'
>>> /set parameter seed 0
Set parameter 'seed' to '0'
>>> why is the sky blue?
The sky appears blue because of a phenomenon called Rayleigh scattering, named after the British physicist Lord Rayleigh. He discovered that shorter (blue) 
wavelengths of light are scattered more than longer (red) wavelengths by the tiny molecules of gases in the atmosphere.
...
total duration:       1m55.741991018s
load duration:        38.39719ms
prompt eval count:    31 token(s)
prompt eval duration: 2.987612s
prompt eval rate:     10.38 tokens/s
eval count:           319 token(s)
eval duration:        1m52.674415s
eval rate:            2.83 tokens/s
```

The threads you linked to were using GPU for inference, so the lack of AVX may not have been a big deal, but for CPU inferencing it makes a massive difference. If you don't use AVX, your performance is going to remain terrible regardless of whether the model is 1-bit or not.
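
As an aside, one quick way to see whether the host CPU exposes AVX/AVX2 at all is the sketch below; it is Linux-only (it reads /proc/cpuinfo), so it will not work on macOS or Windows:

```python
# Rough AVX/AVX2 capability check on Linux via /proc/cpuinfo flags.
flags = set()
with open("/proc/cpuinfo") as f:
    for line in f:
        if line.startswith("flags"):
            flags.update(line.split(":", 1)[1].split())
            break
print("AVX :", "avx" in flags)
print("AVX2:", "avx2" in flags)
```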

@ozbillwang commented on GitHub (Oct 23, 2024):

> I did some testing regarding AVX for CPU inferencing. With AVX2 CPU runner:

Could you share how to test with the AVX and AVX2 CPU runners?

I got this result; the speed is similar to your second test:

```
total duration:       1m1.460351303s
load duration:        24.006432ms
prompt eval count:    71 token(s)
prompt eval duration: 1.398321s
prompt eval rate:     50.78 tokens/s
eval count:           309 token(s)
eval duration:        59.907969s
eval rate:            5.16 tokens/s
```

@kth8 commented on GitHub (Oct 23, 2024):

If you include

```
COPY --from=ollama /usr/lib/ollama/runners/cpu_avx /usr/lib/ollama/runners/cpu_avx
COPY --from=ollama /usr/lib/ollama/runners/cpu_avx2 /usr/lib/ollama/runners/cpu_avx2
```

in your Docker image, then if you run `ps aux` inside the container after loading a model, you will see it being used.

@ozbillwang commented on GitHub (Oct 23, 2024):

Thanks, @kth8. It is faster, but CPU usage is still high:

(screenshot of CPU usage: https://github.com/user-attachments/assets/e0a15bce-690f-40b2-bc14-f309efab3f77)
```
total duration:       29.888580911s
load duration:        31.737842ms
prompt eval count:    346 token(s)
prompt eval duration: 151.134ms
prompt eval rate:     2289.36 tokens/s
eval count:           454 token(s)
eval duration:        29.573488s
eval rate:            15.35 tokens/s
```

Compared to BitNet, the CPU usage is lower:

(screenshot of CPU usage with BitNet: https://github.com/user-attachments/assets/d11116ad-4b55-467c-b174-2e27337296de)
```
llama_perf_sampler_print:    sampling time =      17.98 ms /   137 runs   (    0.13 ms per token,  7621.27 tokens per second)
llama_perf_context_print:        load time =    1257.96 ms
llama_perf_context_print: prompt eval time =    1363.48 ms /     9 tokens (  151.50 ms per token,     6.60 tokens per second)
llama_perf_context_print:        eval time =   19435.12 ms /   127 runs   (  153.03 ms per token,     6.53 tokens per second)
llama_perf_context_print:       total time =   20850.38 ms /   136 tokens
```

Seems we still need to wait for this feature in Ollama.

@kth8 commented on GitHub (Oct 24, 2024):

The new granite3-moe 3B model could be good for CPU inferencing since it was designed for low-latency usage and has fewer active parameters than Llama3.2 1B.

@YangWang92 commented on GitHub (Oct 25, 2024):

From https://github.com/ollama/ollama/issues/7289

Hi all,

We recently developed a fully open-source quantization method called VPTQ (Vector Post-Training Quantization) https://github.com/microsoft/VPTQ which enables fast quantization of large language models (LLMs) down to 1-4 bits. The community has also helped release several models using this method https://huggingface.co/VPTQ-community. I am personally very interested in integrating VPTQ into ollama/llama.cpp.

One of the key advantages of VPTQ is that the dequantization method is very straightforward, relying only on a simple lookup table.
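
To illustrate what "only a simple lookup table" means, here is a rough sketch of vector-quantized dequantization; it is not the actual VPTQ kernel, and the codebook size and vector dimension below are made-up numbers:

```python
import numpy as np

# Hypothetical shapes: 256 learned centroids, each a 4-dim vector,
# and one stored index per group of 4 weights.
codebook = np.random.randn(256, 4).astype(np.float32)   # learned centroids
indices = np.random.randint(0, 256, size=1024)           # quantized weights

# Dequantization is a table lookup plus a reshape back to the weight layout.
weights = codebook[indices].reshape(-1)                   # 4096 float32 weights
print(weights.shape)
```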

I would like to ask for guidance on how best to support this quantization method within Ollama, even if it's on my own fork. Specifically, which approach should I take?

1. Define a series of new models (e.g., vptq-llama3.1) using existing data types (int32, fp16), and hide the model dequantization within a separate dequant op.
2. Define a new quantization data type (e.g., a custom lookup table data structure)?

I’d love to hear your thoughts or any suggestions on how to proceed!

Thank you!
Yang

@teaalltr commented on GitHub (Oct 25, 2024):

@YangWang92 @kth8 isn't it already supported in llama.cpp?
https://github.com/ggerganov/llama.cpp/pull/8151

@YangWang92 commented on GitHub (Oct 26, 2024):

I'm still trying to integrate VPTQ into llama.cpp. https://github.com/ggerganov/llama.cpp/discussions/9974 :)

@Y-PLONI commented on GitHub (Nov 23, 2024):

Has there been any progress with this?
Did ollama or llama.cpp do this?
And if so, is there any good model that works with it?
Thanks!

@raymond-infinitecode commented on GitHub (Jan 4, 2025):

Still no progress?

@southwolf commented on GitHub (Jan 14, 2025):

Still not working. `ollama run hf.co/mradermacher/phi-4-i1-GGUF:i1-IQ1_M` fails with:
`Error: llama runner process has terminated: GGML_ASSERT(hparams.n_swa > 0) failed`

@HKMV commented on GitHub (Apr 24, 2025):

Any progress?

@electriquo commented on GitHub (Apr 24, 2025):

relates to #10337

@borja-rojo-ilvento commented on GitHub (Apr 24, 2025):

> relates to #10337

Yup, this is what I want!!!

@gordan-bobic commented on GitHub (Jul 6, 2025):

Given that what seem to be (partly) BitNet ternary-encoded models already appear to be supported in llama.cpp:
https://unsloth.ai/blog/deepseekr1-dynamic
should this already be working in Ollama as is?
