Originally created by @grepin on GitHub (Mar 25, 2026).
Original GitHub issue: https://github.com/ollama/ollama/issues/15051
@rick-github @jessegross jfyi https://research.google/blog/turboquant-redefining-ai-efficiency-with-extreme-compression/ + https://arxiv.org/pdf/2504.19874
@grepin commented on GitHub (Mar 25, 2026):
The meta-algorithm is described in the paper and doesn't seem too hard to implement. An implementation could significantly reduce KV-cache size while keeping quality, and introduce a compute speedup in most cases.
@goedzo commented on GitHub (Mar 25, 2026):
Upvote👍
@OrBeProgrammed commented on GitHub (Mar 25, 2026):
Critical need.
@postEntropy commented on GitHub (Mar 25, 2026):
Need it ASAP
@grepin commented on GitHub (Mar 26, 2026):
@OrBeProgrammed @postEntropy @goedzo guys, don't push the devs; let them decide when and how to implement it (or not). Yes, TQ is a cool thing (and from my POV it will first of all help their business with ollama:cloud), but as in any project there are limited resources and plans. The feature request has been made, so let things take their course.
@OrBeProgrammed commented on GitHub (Mar 26, 2026):
I'm not pushing anyone I'm working on a PR myself :) We're all in this together!
@grepin commented on GitHub (Mar 26, 2026):
btw, @OrBeProgrammed: https://github.com/ggml-org/llama.cpp/issues/20977; enthusiasts are already trying to implement it: https://github.com/ggml-org/llama.cpp/compare/master...mudler:llama.cpp:feat/turbo-quant
https://github.com/TheTom/turboquant_plus (the code seems useful for understanding the technical implementation and for transferring/porting it to Ollama's Go-based codebase)
@grepin commented on GitHub (Mar 26, 2026):
Most likely all the frontier inference engines will implement TQ within a couple of months.
@grepin commented on GitHub (Mar 26, 2026):
Yep, this looks good as a reference Python implementation with post-generation KV-cache analysis & attention quality tests: https://github.com/TheTom/turboquant_plus
@123Haynes commented on GitHub (Mar 26, 2026):
the discussion here also documents some pitfalls during the implementation: https://github.com/ggml-org/llama.cpp/discussions/20969
@grepin commented on GitHub (Mar 26, 2026):
+1 to implementations (PoC with measurement results): https://github.com/vllm-project/vllm/issues/38171#issuecomment-4134002937
@kblood commented on GitHub (Mar 26, 2026):
Oh yes, hoped to see Ollama support for this the first time I read about it :)
@codyseally commented on GitHub (Mar 26, 2026):
Absolute upvote !
@XZzYassin commented on GitHub (Mar 26, 2026):
Oleh! 🌹
@grepin commented on GitHub (Mar 27, 2026):
@mobilexmt commented on GitHub (Mar 28, 2026):
please also consider DGX Spark, thanks!
@richardokonicha commented on GitHub (Mar 28, 2026):
Hurray
@christopheduc-me commented on GitHub (Mar 29, 2026):
100% Upvoted of course! We need it to unlock new possibilities for our local usage.
@dorinsimionescu commented on GitHub (Mar 30, 2026):
upvote too
@QAM commented on GitHub (Mar 30, 2026):
+1 plz
@grepin commented on GitHub (Mar 30, 2026):
in fact, wip: https://github.com/ollama/ollama/pull/15125
If you have enough understanding of the engine plus the paper/algorithm (or are ready to dive into it yourself, or with any AI agent and a non-blind review of everything the AI produces during implementation), you can help. As always in open source, you are on your own and code & tests are the only source of truth, but I hope that Dankguy17 and YKesX could guide your efforts (for example, as usual, many more test runs on different models are needed to find problems and improve the implementation). Many thanks to @Dankguy17 and @YKesX anyway for their contribution.
@Sreekmans commented on GitHub (Mar 30, 2026):
+1
@Reikagilu commented on GitHub (Mar 30, 2026):
+1
@medenijazbec commented on GitHub (Mar 31, 2026):
I took a pass at implementing this on top of Ollama v0.18.3 and published the work here: https://github.com/medenijazbec/ollama-turboquant/tree/turboquant-0.18.3
Current status:
- kv_cache_type
- ollama run --turboquant
- ollama-bench -turboquant
- PARAMETER kv_cache_type ...
Benchmarking
I’ve also been running baseline vs TurboQuant experiments focused on KV pressure and long-context behavior.
I included multiple benchmark variants as well, including lighter / extra-light passes, because the available hardware is limited and some of the longer sweeps are expensive to run reliably.
Documentation
I also wrote up the implementation and benchmark notes in the repo here:
- docs/turboquant_paper_design.md
- docs/turboquant_audit.md
- docs/kvstress-test-commands.txt
Important caveats
So I would treat the current branch as a reference implementation / experimentation branch rather than a finished upstream proposal.
Sharing it in case it helps others compare approaches, reuse some of the API / CLI / benchmark surface work, or validate against their own hardware. If others run similar tests and publish results, that would be very useful too.
Happy to compare notes with anyone working on the engine-side implementation or benchmark methodology.
@jclab-joseph commented on GitHub (Mar 31, 2026):
There seems to be a lively discussion taking place at https://github.com/ggml-org/llama.cpp/discussions/20969! I think it would be good to refer to it.
@msk-one commented on GitHub (Apr 1, 2026):
+1
@Blue-Crescent commented on GitHub (Apr 3, 2026):
vote
@lxdlam commented on GitHub (Apr 4, 2026):
vote for this
@Readon commented on GitHub (Apr 7, 2026):
I think ggml-org/llama.cpp#21038 has already implemented this.
@ZevAlain commented on GitHub (Apr 11, 2026):
vote
@hajhouj commented on GitHub (Apr 11, 2026):
If Ollama actually pulls this off, it’s going to hit hard. It’s wild to think that a setup with just 8GB of VRAM could handle a model originally designed for something like a 48GB GPU. That kind of leap really pushes us closer to a world where fully autonomous, self-hosted AI isn’t just hype… it’s right around the corner.
@mverrilli commented on GitHub (Apr 11, 2026):
I put together this PR if anyone wants to review and build. #15505
Adds tq2/tq3/tq2k/tq3k KV cache types implementing TurboQuant (arXiv 2504.19874) — a GPU-resident compressed K/V path built from Householder QR rotation plus Lloyd-Max scalar quantization, with new CUDA kernels for encode, dequant, and fused flash attention.
Roughly doubles usable context per VRAM dollar on Pascal+ GPUs at near-f16 quality: tq3k matches f16 PPL on llama3.2/gemma3/qwen3-coder with ~40% KV savings, tq3 gives ~80% KV savings for a ~0.5% PPL cost, and the K-only variants (tq3k/tq2k) work even with flash attention disabled.
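For intuition, here is a minimal NumPy sketch of the rotate-then-scalar-quantize idea described above: a random orthogonal rotation obtained via QR decomposition, followed by a Lloyd-Max codebook fitted to the rotated values. The shapes, codebook size, and fitting data are illustrative assumptions, not the PR's actual CUDA path.

```python
# Rotate-then-quantize sketch in the spirit of TurboQuant (arXiv:2504.19874):
# rotate each KV row with a random orthogonal matrix, then map coordinates to
# the nearest level of a Lloyd-Max scalar codebook. Purely illustrative.
import numpy as np

def lloyd_max_codebook(x, levels=8, iters=25):
    """Fit a 1-D Lloyd-Max quantizer: alternate nearest-level assignment and
    recentering each level on the mean of its assigned samples."""
    codebook = np.quantile(x, np.linspace(0.0, 1.0, levels))
    for _ in range(iters):
        idx = np.abs(x[:, None] - codebook[None, :]).argmin(axis=1)
        for j in range(levels):
            members = x[idx == j]
            if members.size:
                codebook[j] = members.mean()
    return np.sort(codebook)

def encode(kv, rot, codebook):
    """Rotate rows (tokens x head_dim) and store nearest-level indices."""
    rotated = kv @ rot
    return np.abs(rotated[..., None] - codebook).argmin(axis=-1).astype(np.uint8)

def decode(indices, rot, codebook):
    """Look up codebook levels and undo the rotation (rot is orthogonal)."""
    return codebook[indices] @ rot.T

rng = np.random.default_rng(0)
tokens, head_dim = 512, 64                      # illustrative KV block shape
kv = rng.standard_normal((tokens, head_dim)).astype(np.float32)

rot, _ = np.linalg.qr(rng.standard_normal((head_dim, head_dim)))  # orthogonal
codebook = lloyd_max_codebook((kv @ rot).ravel(), levels=8)       # ~3-bit levels

recon = decode(encode(kv, rot, codebook), rot, codebook)
print("relative reconstruction error:",
      np.linalg.norm(kv - recon) / np.linalg.norm(kv))
```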
@mverrilli commented on GitHub (Apr 11, 2026):
@hajhouj On any GPU, TurboQuant tq3k gets you ~40% KV cache savings with PPL essentially unchanged from f16 and tq3 gets you ~80% KV savings at a small cost. That doesn’t change what models fit on your card, but it roughly doubles the usable context length for whatever model you’re already running.
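For a sense of scale, here is back-of-envelope arithmetic on what those savings mean at a fixed KV-cache budget. The model shape and the 4 GiB budget below are assumptions chosen for illustration, not measurements from the PR.

```python
# At a fixed VRAM budget, a 40% smaller KV entry fits ~1.7x the tokens and an
# 80% smaller one fits ~5x. Hypothetical model: 32 layers, 8 KV heads,
# head_dim 128, f16 cache; 4 GiB reserved for the KV cache.
layers, kv_heads, head_dim, f16_bytes = 32, 8, 128, 2
per_token_f16 = 2 * layers * kv_heads * head_dim * f16_bytes   # K + V, bytes per token

budget = 4 * 1024**3
for name, savings in (("f16 ", 0.00), ("tq3k", 0.40), ("tq3 ", 0.80)):
    per_token = per_token_f16 * (1 - savings)
    print(f"{name}: {per_token / 1024:6.1f} KiB/token -> "
          f"{budget / per_token:10,.0f} tokens in budget")
```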
@hajhouj commented on GitHub (Apr 11, 2026):
Thanks for the info. I came across a news article about this new algorithm claiming it reduces memory usage by a factor of six, which might be a bit exaggerated. Still, the ability to increase context length without sacrificing speed is genuinely promising. It feels like another big step toward making low-cost, self-hosted AI more practical, potentially opening the door for wider desktop-level use.
@mverrilli commented on GitHub (Apr 12, 2026):
All good. This does that, but it's just not total VRAM. It's KV cache. I was able to get a factor of 5 reduction (80%) with tq3. It's a little bit hyped but I think still very good.
@achraf99999 commented on GitHub (Apr 13, 2026):
did you manage to run turbo quant from this PR ? if yes , can you share your configuration and hardware setup please ?
@OrBeProgrammed commented on GitHub (Apr 13, 2026):
I got this working. I am not sure it is working with all models. It loads the model insanely faster than before. I just told Claude to make the PR happen, so I'm not sure exactly what all it did. Is there a particular piece of info I can share that would help?
@medenijazbec commented on GitHub (Apr 13, 2026):
Please provide some tests. I've got mine ready but can't find the time to run them all. There also already seems to be a full implementation in llama.cpp; I haven't really been following that thread for about two weeks, so there might be a lot of new findings. What I've done is take inspiration from their ideas and credit them in my code for TurboQuant, since the folks over there are way smarter than I am, lmao. Anyway, I think a good test for this would be to connect it to something like Claude Code and make it run 3 runs of https://gist.github.com/ivanfioravanti/98ba7e5d3f7a88c1756b045d3e565630 using native Ollama, then compare the average to the average results of your TurboQuant implementation.
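A minimal sketch of the comparison step suggested above (several runs per build, compare the averages). The scores are placeholders to be filled in from whatever harness actually drives the runs.

```python
# Compare mean benchmark scores of a baseline build vs. a TurboQuant build.
# Placeholder data only; real scores would come from repeated harness runs.
from statistics import mean, stdev

baseline_runs   = [0.0, 0.0, 0.0]   # fill in measured scores, one per run
turboquant_runs = [0.0, 0.0, 0.0]

def summarize(name, runs):
    print(f"{name}: mean={mean(runs):.3f} stdev={stdev(runs):.3f} n={len(runs)}")

summarize("baseline  ", baseline_runs)
summarize("turboquant", turboquant_runs)
print(f"delta (turboquant - baseline): {mean(turboquant_runs) - mean(baseline_runs):+.3f}")
```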
@mverrilli commented on GitHub (Apr 21, 2026):
FYI: Added Metal, and ROCM if anyone wants to test it out and report back.
@Dankguy17 commented on GitHub (Apr 21, 2026):
Yeah, I can try right now - although there is definitely a lower chance that the PR gets accepted because it adds nearly 60k lines of code lol. Did you forget to gitignore something??
@mverrilli commented on GitHub (Apr 21, 2026):
@Dankguy17 vendor patch issue. I thought I fixed it but I think it was in another branch. Pushing the branch after my build test.
@Dankguy17 commented on GitHub (Apr 21, 2026):
cool! will discuss more in your pr
@DATEx2 commented on GitHub (Apr 25, 2026):
So when will it be released? We kind of all need TurboQuant.
@mverrilli commented on GitHub (Apr 26, 2026):
TQ is not a small PR. To be useful, it has to compress the KV cache, avoid slowing down prefill or decode too much, stay off the paths that use the scratch buffer (which would offset the VRAM savings), and keep the output coherent.
The branch I have right now does TQ, but has two issues:
@mverrilli commented on GitHub (Apr 27, 2026):
After really digging in on this, I am starting to think TQ isn't really the best solution despite the claims in the paper. Certainly I was able to get a coherent, highly compressed KV cache. Performance issues aside (some of which can be improved, and some that are much improved on newer hardware), the perplexity scores drift from f16 quite a bit. This does not appear to be as lossless as expected.
It's possible it is due to something in my implementation, however I went back to the paper and noticed some things. First, the paper abstract sounds as if this is a general solution, however the paper itself is pretty specific about the models it selected and the method in which the loss was measured.
In addition, I read several papers that cite or critique TurboQuant. One key finding: the QJL residual component (part of what makes TQ's compression work) has a known accuracy degradation that compounds per layer; the paper only tested on 32-layer models, and the math suggests it would break down badly on larger ones (arXiv:2604.19528). Another paper points out that minimizing reconstruction error (what TQ optimizes for) isn't the same as minimizing perplexity loss, and the two can diverge significantly (arXiv:2602.05367). I also recommend reading arXiv:2604.18555 (I did correct the flaw it describes and benchmarked it; the effect was minimal, though).
I'm doing some more benchmarks and will post them when they finish. I'll update a branch tonight in case anyone wants to also take a look. I do have two other approaches in progress and they are simpler so I may be able to get those benchmarked as well.
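As a toy illustration of the per-layer compounding concern, here is a quick calculation assuming each layer contributes a fixed relative error that propagates multiplicatively. The error value, the layer counts, and the propagation model are assumptions for illustration, not results from TurboQuant or the cited papers.

```python
# If each layer's quantized KV read adds a relative error eps that compounds
# multiplicatively, the accumulated error after L layers is (1 + eps)**L - 1.
# eps = 1% is a made-up figure chosen only to show the growth shape.
eps = 0.01
for layers in (32, 48, 80, 126):
    accumulated = (1 + eps) ** layers - 1
    print(f"{layers:3d} layers -> ~{accumulated:.0%} accumulated error")
```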
@mverrilli commented on GitHub (Apr 27, 2026):
Also, I put together a better perplexity measurement tool than the one I used previously. The previous one measured self-PPL; now I'm using reference PPL (forward passes on WikiText-2, etc.).
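For reference, here is a generic sketch of this kind of reference-PPL measurement: fixed-window forward passes over WikiText-2 with a Hugging Face causal LM, exponentiating the mean token NLL. The model name, window size, and non-overlapping-window simplification are placeholder assumptions; this is the textbook recipe, not the exact tool used for the numbers below.

```python
# Reference perplexity over WikiText-2 via full forward passes with labels.
import math
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-3.2-1B"            # placeholder; any causal LM works
device = "cuda" if torch.cuda.is_available() else "cpu"
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id).to(device).eval()

text = "\n\n".join(load_dataset("wikitext", "wikitext-2-raw-v1", split="test")["text"])
ids = tok(text, return_tensors="pt").input_ids.to(device)

window, total_nll, counted = 2048, 0.0, 0
with torch.no_grad():
    for start in range(0, ids.size(1) - window, window):   # non-overlapping windows
        chunk = ids[:, start:start + window]
        loss = model(chunk, labels=chunk).loss              # mean NLL per predicted token
        total_nll += loss.item() * (window - 1)
        counted += window - 1

print("reference perplexity:", math.exp(total_nll / counted))
```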
@mverrilli commented on GitHub (Apr 29, 2026):
Here's some output from some experimental compressions. https://gist.github.com/mverrilli/dbd9935bdec44495e635a3c5cdf611d0
f16 - baseline, no compression
q4_0 / q8_0 - block quantization borrowed from weight quant.
tq (TurboQuant) - rotation + Lloyd-Max codebook; the *qa variants were tests where I added some extra features (QJL, outlier split, asymmetric).
q8k / q4k - per-group asymmetric int8/int4 (something clean and simple, a modification of an idea common in a few papers)
saw - same as q8k/q4k but with a Hadamard rotation first (arXiv:2604.19157)
This run was really about perplexity, not compression. Larger ctx would have better kv cache compression rates due to overhead.
But you can see TQ really isn't that great PPL-wise. saw8kv and saw4kv are the winners here. Need more tests though.
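As a rough illustration of the per-group asymmetric quantization described above, with and without a Hadamard rotation applied first, here is a NumPy sketch. The group size, bit width, and outlier pattern are assumptions for demonstration, not the benchmarked kernels.

```python
# Per-group asymmetric integer quantization (q8k/q4k-style), plus a variant
# that rotates each group with a normalized Hadamard matrix first ("saw"-style)
# to spread outliers before quantizing. Illustrative parameters only.
import numpy as np
from scipy.linalg import hadamard

def quant_dequant(x, bits=8, group=64):
    """Each group of `group` values gets its own scale and zero point from the
    group's min/max; values are rounded to the integer grid and mapped back."""
    qmax = (1 << bits) - 1
    g = x.reshape(-1, group)
    lo, hi = g.min(axis=1, keepdims=True), g.max(axis=1, keepdims=True)
    scale = np.maximum(hi - lo, 1e-8) / qmax
    q = np.clip(np.round((g - lo) / scale), 0, qmax)
    return (q * scale + lo).reshape(x.shape)

def saw_quant_dequant(x, bits=8, group=64):
    """Same, but rotate each group with an orthogonal Hadamard matrix first and
    rotate back after dequantization (h.T inverts h since it is orthogonal)."""
    h = hadamard(group) / np.sqrt(group)
    rotated = x.reshape(-1, group) @ h
    return (quant_dequant(rotated, bits=bits, group=group) @ h.T).reshape(x.shape)

rng = np.random.default_rng(1)
kv = rng.standard_normal((512, 128)).astype(np.float32)
kv[:, ::16] *= 6                                 # a few outlier channels
for name, fn in (("plain   ", quant_dequant), ("hadamard", saw_quant_dequant)):
    err = np.linalg.norm(kv - fn(kv, bits=4)) / np.linalg.norm(kv)
    print(f"{name} int4 relative error: {err:.4f}")
```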