[GH-ISSUE #6294] AirLLM integration? #3944

Open
opened 2026-04-12 14:49:16 -05:00 by GiteaMirror · 25 comments

Originally created by @blankuserrr on GitHub (Aug 9, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/6294

I'd love to see the addition/support of [AirLLM](https://github.com/lyogavin/airllm) in Ollama, as it can massively decrease the amount of VRAM needed to run large models.
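
For context, AirLLM runs a model one transformer layer at a time rather than loading all of the weights into VRAM at once. Typical usage looks roughly like the sketch below, based on the AirLLM README; the model id and generation arguments are illustrative only, and the exact API may have changed since.

```python
# Rough usage sketch based on the AirLLM README; the model id and arguments are
# illustrative, not a tested configuration.
from airllm import AutoModel

MAX_LENGTH = 128
# Point AirLLM at a Hugging Face repo id; weights are split and loaded per layer.
model = AutoModel.from_pretrained("garage-bAInd/Platypus2-70B-instruct")

input_tokens = model.tokenizer(
    ["What is the capital of the United States?"],
    return_tensors="pt",
    return_attention_mask=False,
    truncation=True,
    max_length=MAX_LENGTH,
    padding=False,
)

# During generation only one layer's weights are resident on the GPU at a time.
generation_output = model.generate(
    input_tokens["input_ids"].cuda(),
    max_new_tokens=20,
    use_cache=True,
    return_dict_in_generate=True,
)

print(model.tokenizer.decode(generation_output.sequences[0]))
```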

GiteaMirror added the feature request label 2026-04-12 14:49:16 -05:00

@mdlmarkham commented on GitHub (Aug 11, 2024):

+1 My home lab has grown more or less organically over the last 10 years and includes a lot of castoff gaming hardware. It would be great if Ollama could incorporate features to let me get more out of what I have. Both the compression methods used by [AirLLM](https://github.com/lyogavin/airllm) and features that would allow coordination of multiple instances across a local network would be fantastic. Keep up the good work!


@EkkiBrue commented on GitHub (Sep 3, 2024):

+1 +1 +1 ;)


@Xyz00777 commented on GitHub (Sep 3, 2024):

Do I understand AirLLM correctly? I think the idea is that you give it a model and it kind of recompiles it, so it ends up smaller but holds the same data (I know "recompile" isn't really the right word, but I don't have a better one at the moment). So it would be most useful in the pull process (with an additional option): after checking that the model's SHA256 is correct, hand it to AirLLM to be recompiled, then store the result as a smaller model. Correct?


@blankuserrr commented on GitHub (Sep 5, 2024):

> Do I understand AirLLM correctly? I think the idea is that you give it a model and it kind of recompiles it, so it ends up smaller but holds the same data (I know "recompile" isn't really the right word, but I don't have a better one at the moment). So it would be most useful in the pull process (with an additional option): after checking that the model's SHA256 is correct, hand it to AirLLM to be recompiled, then store the result as a smaller model. Correct?

I think it just streams the weights instead of loading them all? idk
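
For illustration, the core idea is roughly the following; this is not AirLLM's actual code, and `apply_layer` plus the per-layer shard files are hypothetical stand-ins.

```python
# Conceptual sketch of layer streaming: only one layer's weights sit in VRAM at a time.
import torch

def apply_layer(weights, hidden_states):
    # Stand-in for a real transformer layer (attention + MLP); a single matmul
    # keeps the sketch runnable.
    return hidden_states @ weights["w"]

def run_layers_streamed(layer_files, hidden_states, device="cuda"):
    for path in layer_files:
        weights = torch.load(path, map_location="cpu")           # read one layer's shard from disk
        weights = {k: v.to(device) for k, v in weights.items()}  # copy just this layer to the GPU
        hidden_states = apply_layer(weights, hidden_states)      # run the layer
        del weights                                              # free it before loading the next one
        torch.cuda.empty_cache()
    return hidden_states
```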


@danividalg commented on GitHub (Oct 25, 2024):

+1


@limaolin2017 commented on GitHub (Oct 31, 2024):

+1


@sulydeni commented on GitHub (Nov 2, 2024):

+1


@apayne commented on GitHub (Nov 11, 2024):

Incorporation of some of AirLLM's features would be a game changer. For systems with low VRAM, you could run models several times larger than would otherwise fit (think a 70B model on a card with 4 GB of VRAM); see https://huggingface.co/blog/lyogavin/llama3-airllm.

For systems where you want to run something larger than 100B, you could still load slices of the model into limited VRAM and run them. The price you pay is much slower execution, as each layer is copied in. I wouldn't be surprised if there were some clever way to copy more than one layer onto the card at a time, allowing fewer stalls in processing due to memory copies... just a thought.
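
On that last point, overlapping the next layer's host-to-device copy with the current layer's compute is the usual double-buffering trick. A rough sketch with a separate CUDA stream follows; `load_layer` and `apply_layer` continue the hypothetical streaming sketch above and are not real AirLLM or Ollama functions.

```python
# Sketch: prefetch layer i+1 on a side stream while layer i computes.
import torch

copy_stream = torch.cuda.Stream()

def load_layer(path, device):
    weights = torch.load(path, map_location="cpu")
    # Pinned host memory lets the host-to-device copy run asynchronously.
    return {k: v.pin_memory().to(device, non_blocking=True) for k, v in weights.items()}

def run_layers_prefetched(layer_files, hidden_states, device="cuda"):
    next_weights = load_layer(layer_files[0], device)
    for i in range(len(layer_files)):
        weights = next_weights
        if i + 1 < len(layer_files):
            with torch.cuda.stream(copy_stream):         # start copying the next layer now
                next_weights = load_layer(layer_files[i + 1], device)
        hidden_states = apply_layer(weights, hidden_states)    # compute the current layer
        torch.cuda.current_stream().wait_stream(copy_stream)   # make sure the prefetch finished
        del weights
    return hidden_states
```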


@EthraZa commented on GitHub (Nov 28, 2024):

I don't like "+1" comments, but this feature seems so cool that I will lower my head and...

+1


@SerfTheNet commented on GitHub (Dec 11, 2024):

+1


@zzyuzzz commented on GitHub (Jan 8, 2025):

+1


@kaloslazo commented on GitHub (Mar 23, 2025):

+1


@BradKML commented on GitHub (Jun 6, 2025):

Hopping in here as well (and BitNet support): https://github.com/lyogavin/airllm/discussions/234
https://github.com/ollama/ollama/issues/10337 https://github.com/ollama/ollama/issues/2821


@evaced commented on GitHub (Jun 24, 2025):

+1, would be a big game changer


@darrkz commented on GitHub (Jul 9, 2025):

+1


@lustfeind commented on GitHub (Sep 28, 2025):

+1

or oLLM.

We need to run bigger LLMs on potato hardware, even at the cost of slower speeds, instead of more and more cloud-based plans.


@BradKML commented on GitHub (Sep 29, 2025):

  1. Do we have upstream dependencies?
  2. What about Ramalama?

@sathwikreddy56 commented on GitHub (Jan 28, 2026):

+1


@coder8080 commented on GitHub (Jan 31, 2026):

+1


@superswan commented on GitHub (Feb 3, 2026):

+1


@piotroxp commented on GitHub (Feb 8, 2026):

+1 I'm trying to implement this on my local machine; I'll let people know about the hassle if I ever succeed.


@Joe-Ralph commented on GitHub (Feb 21, 2026):

+1


@arthurlacoste commented on GitHub (Feb 22, 2026):

+1


@RadEdje commented on GitHub (Mar 2, 2026):

+1


@niflheimmer commented on GitHub (Mar 13, 2026):

I would like this feature too, but it has several obstacles.

Ollama is written primarily in Go, uses llama.cpp as its backend, and has its own model registry. AirLLM is written in Python, and is tightly coupled with HuggingFace Hub. Something like bitnet.cpp would be easier to integrate into Ollama, but the PR that introduced it was closed, as it "does not meaningfully integrate with Ollama and so would not work for most users". https://github.com/ollama/ollama/pull/11218

There is the alternative of having AirLLM integrated in a Python-based server that uses the Hugging Face Hub, but that could have its own set of downsides; for instance, vLLM is tailored for newer NVIDIA GPUs and has more overhead, which I assume isn't what people here want with old, used gaming + potato hardware.

My suggestion is to update this issue to include "model layer streaming" support, as that is essentially what AirLLM does, or create a separate issue for this. I think that this could be implemented in the Go layers of Ollama without touching llama.cpp (but I could be wrong), and it could be used as a flag when running Ollama models directly: `ollama run MODEL [PROMPT] --stream-layers=[true|false]`, as an environment variable: `OLLAMA_STREAM_LAYERS=[true|false]`, and as a config option when requesting models over Ollama's OpenAI-compatible API.

Just my two cents, I'm not an expert in these areas.
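
If something like that ever lands, the API-side version might look like the request below; the `stream_layers` option is purely hypothetical, while the endpoint and the other fields are Ollama's existing /api/generate API.

```python
# Hypothetical: "stream_layers" does not exist in Ollama today.
# The endpoint and the model/prompt/stream/options fields are the real native API.
import requests

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "llama3.1:70b",
        "prompt": "Why is the sky blue?",
        "stream": False,
        "options": {"stream_layers": True},  # hypothetical layer-streaming toggle
    },
)
print(resp.json()["response"])
```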

Reference: github-starred/ollama#3944