[GH-ISSUE #6294] AirLLM integration? #3944

Open
opened 2026-04-12 14:49:16 -05:00 by GiteaMirror · 25 comments

Originally created by @blankuserrr on GitHub (Aug 9, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/6294

I'd love to see the addition/support of [AirLLM](https://github.com/lyogavin/airllm) in Ollama, as it can massively decrease the amount of VRAM needed to run large models.
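
For context, AirLLM runs a model one transformer layer at a time rather than loading all of the weights into VRAM at once. Typical usage looks roughly like the sketch below, based on the AirLLM README; the model id and generation arguments are illustrative only, and the exact API may have changed since.

```python
# Rough usage sketch based on the AirLLM README; the model id and arguments are
# illustrative, not a tested configuration.
from airllm import AutoModel

MAX_LENGTH = 128
# Point AirLLM at a Hugging Face repo id; weights are split and loaded per layer.
model = AutoModel.from_pretrained("garage-bAInd/Platypus2-70B-instruct")

input_tokens = model.tokenizer(
    ["What is the capital of the United States?"],
    return_tensors="pt",
    return_attention_mask=False,
    truncation=True,
    max_length=MAX_LENGTH,
    padding=False,
)

# During generation only one layer's weights are resident on the GPU at a time.
generation_output = model.generate(
    input_tokens["input_ids"].cuda(),
    max_new_tokens=20,
    use_cache=True,
    return_dict_in_generate=True,
)

print(model.tokenizer.decode(generation_output.sequences[0]))
```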

GiteaMirror added the feature request label 2026-04-12 14:49:16 -05:00

@mdlmarkham commented on GitHub (Aug 11, 2024):

+1 My home lab has grown more or less organically over the last 10 years and includes a lot of castoff gaming hardware. It would be great if Ollama could incorporate features to let me get more out of what I have. Both the compression methods used by [AirLLM](https://github.com/lyogavin/airllm) and features that would allow coordination of multiple instances across a local network would be fantastic. Keep up the good work!


@EkkiBrue commented on GitHub (Sep 3, 2024):

+1 +1 +1 ;)


@Xyz00777 commented on GitHub (Sep 3, 2024):

Do I understand AirLLM correctly? I think the idea is that you give it a model and it kind of recompiles it, so it ends up smaller but holds the same data (I know "recompile" isn't really the right word, but I don't have a better one at the moment). So it would be most useful in the pull process (with an additional option): after checking that the model's SHA256 is correct, hand it to AirLLM to be recompiled, then store the result as a smaller model. Correct?


@blankuserrr commented on GitHub (Sep 5, 2024):

> Do I understand AirLLM correctly? I think the idea is that you give it a model and it kind of recompiles it, so it ends up smaller but holds the same data (I know "recompile" isn't really the right word, but I don't have a better one at the moment). So it would be most useful in the pull process (with an additional option): after checking that the model's SHA256 is correct, hand it to AirLLM to be recompiled, then store the result as a smaller model. Correct?

I think it just streams the weights instead of loading them all? idk
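
For illustration, the core idea is roughly the following; this is not AirLLM's actual code, and `apply_layer` plus the per-layer shard files are hypothetical stand-ins.

```python
# Conceptual sketch of layer streaming: only one layer's weights sit in VRAM at a time.
import torch

def apply_layer(weights, hidden_states):
    # Stand-in for a real transformer layer (attention + MLP); a single matmul
    # keeps the sketch runnable.
    return hidden_states @ weights["w"]

def run_layers_streamed(layer_files, hidden_states, device="cuda"):
    for path in layer_files:
        weights = torch.load(path, map_location="cpu")           # read one layer's shard from disk
        weights = {k: v.to(device) for k, v in weights.items()}  # copy just this layer to the GPU
        hidden_states = apply_layer(weights, hidden_states)      # run the layer
        del weights                                              # free it before loading the next one
        torch.cuda.empty_cache()
    return hidden_states
```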


@danividalg commented on GitHub (Oct 25, 2024):

+1


@limaolin2017 commented on GitHub (Oct 31, 2024):

+1


@sulydeni commented on GitHub (Nov 2, 2024):

+1


@apayne commented on GitHub (Nov 11, 2024):

Incorporation of some of AirLLM's features would be a game changer. For systems with low VRAM, you could run models several times larger than would otherwise fit (think a 70B model on a card with 4 GB of VRAM); see https://huggingface.co/blog/lyogavin/llama3-airllm.

For systems where you want to run something larger than 100B, you could still load slices of the model into limited VRAM and run them. The price you pay is much slower execution, as each layer is copied in. I wouldn't be surprised if there were some clever way to copy more than one layer onto the card at a time, allowing fewer stalls in processing due to memory copies... just a thought.
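
On that last point, overlapping the next layer's host-to-device copy with the current layer's compute is the usual double-buffering trick. A rough sketch with a separate CUDA stream follows; `load_layer` and `apply_layer` continue the hypothetical streaming sketch above and are not real AirLLM or Ollama functions.

```python
# Sketch: prefetch layer i+1 on a side stream while layer i computes.
import torch

copy_stream = torch.cuda.Stream()

def load_layer(path, device):
    weights = torch.load(path, map_location="cpu")
    # Pinned host memory lets the host-to-device copy run asynchronously.
    return {k: v.pin_memory().to(device, non_blocking=True) for k, v in weights.items()}

def run_layers_prefetched(layer_files, hidden_states, device="cuda"):
    next_weights = load_layer(layer_files[0], device)
    for i in range(len(layer_files)):
        weights = next_weights
        if i + 1 < len(layer_files):
            with torch.cuda.stream(copy_stream):         # start copying the next layer now
                next_weights = load_layer(layer_files[i + 1], device)
        hidden_states = apply_layer(weights, hidden_states)    # compute the current layer
        torch.cuda.current_stream().wait_stream(copy_stream)   # make sure the prefetch finished
        del weights
    return hidden_states
```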


@EthraZa commented on GitHub (Nov 28, 2024):

I don't like "+1" comments, but this feature seems so cool that I will lower my head and...

+1


@SerfTheNet commented on GitHub (Dec 11, 2024):

+1


@zzyuzzz commented on GitHub (Jan 8, 2025):

+1


@kaloslazo commented on GitHub (Mar 23, 2025):

+1


@BradKML commented on GitHub (Jun 6, 2025):

Hopping in here as well (and BitNet support): https://github.com/lyogavin/airllm/discussions/234
https://github.com/ollama/ollama/issues/10337 https://github.com/ollama/ollama/issues/2821


@evaced commented on GitHub (Jun 24, 2025):

+1, would be a big game changer


@darrkz commented on GitHub (Jul 9, 2025):

+1


@lustfeind commented on GitHub (Sep 28, 2025):

+1

or oLLM.

We need to run bigger LLMs on potato hardware, even at the cost of slower speeds, instead of more and more cloud-based plans.


@BradKML commented on GitHub (Sep 29, 2025):

  1. Do we have upstream dependencies?
  2. What about Ramalama?

@sathwikreddy56 commented on GitHub (Jan 28, 2026):

+1


@coder8080 commented on GitHub (Jan 31, 2026):

+1


@superswan commented on GitHub (Feb 3, 2026):

+1


@piotroxp commented on GitHub (Feb 8, 2026):

+1 I'm trying to implement this on my local machine; I'll let people know about the hassle if I ever succeed.


@Joe-Ralph commented on GitHub (Feb 21, 2026):

+1


@arthurlacoste commented on GitHub (Feb 22, 2026):

+1


@RadEdje commented on GitHub (Mar 2, 2026):

+1


@niflheimmer commented on GitHub (Mar 13, 2026):

I would like this feature too, but it has several obstacles.

Ollama is written primarily in Go, uses llama.cpp as its backend, and has its own model registry. AirLLM is written in Python, and is tightly coupled with HuggingFace Hub. Something like bitnet.cpp would be easier to integrate into Ollama, but the PR that introduced it was closed, as it "does not meaningfully integrate with Ollama and so would not work for most users". https://github.com/ollama/ollama/pull/11218

There is the alternative of having AirLLM integrated in a Python-based server that uses the Hugging Face Hub, but that could have its own set of downsides; for instance, vLLM is tailored for newer NVIDIA GPUs and has more overhead, which I assume isn't what people here want with old, used gaming + potato hardware.

My suggestion is to update this issue to include "model layer streaming" support, as that is essentially what AirLLM does, or create a separate issue for this. I think that this could be implemented in the Go layers of Ollama without touching llama.cpp (but I could be wrong), and it could be used as a flag when running Ollama models directly: `ollama run MODEL [PROMPT] --stream-layers=[true|false]`, as an environment variable: `OLLAMA_STREAM_LAYERS=[true|false]`, and as a config option when requesting models over Ollama's OpenAI-compatible API.

Just my two cents, I'm not an expert in these areas.
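
If something like that ever lands, the API-side version might look like the request below; the `stream_layers` option is purely hypothetical, while the endpoint and the other fields are Ollama's existing /api/generate API.

```python
# Hypothetical: "stream_layers" does not exist in Ollama today.
# The endpoint and the model/prompt/stream/options fields are the real native API.
import requests

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "llama3.1:70b",
        "prompt": "Why is the sky blue?",
        "stream": False,
        "options": {"stream_layers": True},  # hypothetical layer-streaming toggle
    },
)
print(resp.json()["response"])
```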

Reference: github-starred/ollama#3944