Mirror of https://github.com/ollama/ollama.git
Originally created by @sammcj on GitHub (Jul 19, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/5800
Howdy fine Ollama folks 👋,
This time last year, llama.cpp added support for speculative decoding using a draft model parameter: https://github.com/ggerganov/llama.cpp/issues/2030
This can massively speed up inference.
I was wondering if there's any chance you could look at adding an option for llama.cpp's `--model-draft` parameter, which enables this? It works by loading a smaller model (with the same tokeniser/family) in front of a larger model.
I know that with exllamav2 you can get 100-200% speed increases (seriously!), and the best part is there's no loss of quality.
For example, you might have:

- Main model: Qwen 72b
- Draft model: Qwen 0.5b

The result would be the memory usage of the main Qwen 72b model plus the tiny 0.5b draft model, but around 4x the tokens/s you'd see with just the main model loaded.
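For intuition on where numbers like that come from (a sketch of the standard analysis, not from the original post): if the draft model proposes k tokens per step and each is accepted with probability α, one verification pass of the main model yields on average

```latex
% Expected tokens produced per main-model pass, with draft length k and
% per-token acceptance rate \alpha (speculative decoding analysis,
% Leviathan et al. 2023):
E[\text{tokens per pass}] = \frac{1 - \alpha^{k+1}}{1 - \alpha}
% e.g. \alpha = 0.8, k = 5: (1 - 0.8^6) / 0.2 \approx 3.7 tokens per pass,
% the regime where ~4x throughput claims come from (ignoring the draft
% model's own, much smaller, cost).
```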
I've been using exllamav2 instead of Ollama for this feature (and the 4bit K/V cache #5091) recently and the performance really is astounding.
Parameters that can be passed to llama.cpp's server:
- `--model-draft` (required): the usage is the same as the existing `--model` used for loading normal models.
- `--draft` (optional, but recommended to make available)
- `--p-split` (optional, but recommended to make available)

There is also the following, but I think the defaults are probably fine 99% of the time:

- `--threads-draft`
- `--threads-batch-draft`

https://github.com/ggerganov/llama.cpp/blob/master/examples/server/README.md?plain=1#L37
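For illustration only, an invocation with those flags might look like the following (hypothetical model file names; whether llama-server honoured these flags at the time is discussed further down this thread):

```
# Hypothetical: a large main model plus a small same-family draft model,
# drafting up to 16 tokens per step.
./llama-server \
  --model models/qwen2-72b-instruct-q4_k_m.gguf \
  --model-draft models/qwen2-0.5b-instruct-q4_k_m.gguf \
  --draft 16
```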
@sammcj commented on GitHub (Jul 19, 2024):
I just found this old issue which, while pretty out of date, seems to be related: https://github.com/ollama/ollama/issues/1292
@jmorganca commented on GitHub (Sep 4, 2024):
@sammcj this is really, really cool. Sorry for the late reply. By the way, check out https://infini-ai-lab.github.io/Sequoia-Page/ if you haven't yet.
@sammcj commented on GitHub (Sep 4, 2024):
Oh that's a neat project too! thanks @jmorganca :)
@sammcj commented on GitHub (Sep 4, 2024):
I could imagine draft models and Ollama Modelfiles being quite a nice combo, e.g. something like the sketch below.
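A minimal sketch of what that could look like (hypothetical syntax; neither a DRAFT instruction nor a num_draft parameter exists in Ollama):

```
FROM qwen2:72b
# Hypothetical: a small same-family draft model for speculative decoding
DRAFT qwen2:0.5b
PARAMETER num_draft 16
```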
@enn-nafnlaus commented on GitHub (Sep 22, 2024):
The best option is generally a wide-but-shallow sheared model rather than a generalist small model. For example, I made some GGUF conversions of the Yan et al. models here:
https://huggingface.co/Nafnlaus/Wide-Sheared-LLaMA-290M-GGUF
Non-GGUFs here:
https://huggingface.co/minghaoyan
The TL;DR is that getting "an" answer out fast is more important than getting a good answer out.
@sammcj commented on GitHub (Sep 22, 2024):
@enn-nafnlaus that's really interesting to see. I'll have to read the paper, but I'm assuming this results in the draft model taking significantly less vRAM, as it's essentially just the one layer being loaded (with the tokenizer)?
I suspect this might be what Exllamav2 does with its draft model loading, as they don't seem to use as much vRAM as loading the model normally.
@sammcj commented on GitHub (Sep 22, 2024):
I should add that I find draft models/speculative decoding with Exllamav2 so useful that I often find myself choosing ExllamaV2 (via TabbyAPI/TabbyLoader) over Ollama when loading models larger than ~30b; the performance improvement when running 70b models is nothing short of amazing.
@oxfighterjet commented on GitHub (Nov 6, 2024):
Is this a feature the ollama maintainers would be interested in implementing? I'm asking because I'm considering giving it a shot.
Are there any pointers/suggestions for someone unfamiliar with the codebase who's looking into implementing this? Thank you.
@sammcj commented on GitHub (Nov 6, 2024):
@oxfighterjet I think this would be amazing to add, combined with #6279 - this would bring Ollama up to speed with the likes of ExllamaV2 / TabbyAPI which have had these as core features for a long time.
I was actually planning on trying to get it merged after #6279 is merged. As such I'd be more than happy to work with you on this (just note I'm only a contributor - not a maintainer).
If you look at #6279 you'll see how I've added parameters that pass down to the underlying llama.cpp.
I would take the same approach, but also make sure there is support for configuring the draft model in the Modelfile and API. This is something I did have in my PR (prior to the latest refactor for the new runners/server) but was asked to remove, as Ollama didn't want to add new features to the API/CLI at the time; for the draft model feature, however, it will be required by design. I still have the code kicking around for this here: https://github.com/sammcj/ollama/pull/26/files.
Again, I was going to work on this after #6279 is merged (assuming it actually is merged soon). I'd still be happy to do the work for this ticket, or to work with you on it, whether that's doing a first pass for you to review/improve or simply helping with peer review.
One thing to set your expectations up front: getting features merged into Ollama is painfully slow, as are the feedback cycles 😅.
@oxfighterjet commented on GitHub (Nov 7, 2024):
@sammcj Thank you for your helpful resources, they will most certainly come in handy!
I see indeed that your #6279 PR has been a rollercoaster of a ride, I hope it can get merged soon.
I'll study the codebase and your PRs and see what I can contribute. I might get back to you with questions! :)
@bsu3338 commented on GitHub (Nov 10, 2024):
Have you guys also considered the below approach? It looks like you could mix and match 2 models. However, it might not be as performant.
https://huggingface.co/blog/universal_assisted_generation
@TheTerrasque commented on GitHub (Nov 12, 2024):
I tried running llama server with speculative decoding to see if I could speed up some models, but found that it's not supported by the server:
https://github.com/ggerganov/llama.cpp/issues/5877
@TheTerrasque commented on GitHub (Nov 25, 2024):
https://github.com/ggerganov/llama.cpp/pull/10455 - this is now in llama.cpp server!
@oxfighterjet commented on GitHub (Nov 25, 2024):
Great, I'm still on the ollama implementation and I'll be able to test it now. Will report back when I have a working prototype.
Edit: I have to admit I was following relevant threads of llama.cpp and didn't get a single notification, so it escaped me. Thanks for bringing it up.
Edit: I'm guessing it might take some time for these changes to propagate to ollama, given #7670 has been open for two weeks and would need to be updated.
@chris-calo commented on GitHub (Dec 3, 2024):
@oxfighterjet looks like #7875 was favoured over #7670, and is moving faster, if it helps any
@sammcj commented on GitHub (Dec 4, 2024):
Looks like performance just got another big bump thanks to https://github.com/ggerganov/llama.cpp/pull/10586 (source: https://www.reddit.com/r/LocalLLaMA/comments/1h5uq43/llamacpp_bug_fixed_speculative_decoding_is_30/)
@cduk commented on GitHub (Dec 6, 2024):
These options were tantalisingly mentioned in the opening post, but they don't seem to be valid options in llama-server. Have they been implemented in any branch, or are they just proposals?
@TheTerrasque commented on GitHub (Dec 6, 2024):
it was merged into master 2 weeks ago. Check the PR link I gave a few posts up.
These are the options of the current llama-server regarding the draft model:
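For reference, the draft-related options look roughly like this (flag names as also referenced later in this thread; exact spellings and help text vary by llama.cpp version):

```
--model-draft FNAME    draft model for speculative decoding
--draft-max N          maximum number of tokens to draft per step
--draft-min N          minimum number of tokens to draft per step
--draft-p-min P        minimum draft probability required to keep a drafted token
--ctx-size-draft N     context size for the draft model
--device-draft DEV     device(s) to run the draft model on
--gpu-layers-draft N   number of draft model layers to offload to the GPU
```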
@cduk commented on GitHub (Dec 6, 2024):
I will check again, but I was referring specifically to the -tbd, -td and -ps flags.
@bfroemel commented on GitHub (Dec 10, 2024):
Agreeing; the draft model (`model-draft`) and most related parameters (`draft-max`, `draft-min`, `draft-p-min`, `ctx-size-draft`) should be specified in the Modelfile. Some parameters could be defined in the environment (`device-draft`, `gpu-layers-draft`) unless there is a good way to derive them automatically.
Ollama manifests already use layers (media type `application/vnd.ollama.image.model`) to reference model blobs. It would be nice to reuse the same mechanism and store the layer reference in the model manifest. Maybe just assume that the first `application/vnd.ollama.image.model` layer is the main model, and an optional additional `application/vnd.ollama.image.model` layer is the draft model?
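A sketch of what such a manifest might look like (digests and sizes are placeholders):

```json
{
  "schemaVersion": 2,
  "mediaType": "application/vnd.docker.distribution.manifest.v2+json",
  "layers": [
    { "mediaType": "application/vnd.ollama.image.model", "digest": "sha256:aaa...", "size": 47000000000 },
    { "mediaType": "application/vnd.ollama.image.model", "digest": "sha256:bbb...", "size": 400000000 }
  ]
}
```

Here the first `application/vnd.ollama.image.model` layer would be the main model and the second, optional one the draft model.

@Steel-skull commented on GitHub (Dec 11, 2024):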
looks like https://github.com/ollama/ollama/pull/7875 was merged
@chris-calo commented on GitHub (Dec 11, 2024):
@oxfighterjet are you still working on this?
@oxfighterjet commented on GitHub (Dec 11, 2024):
@chris-calo yes.
@bfroemel commented on GitHub (Dec 11, 2024):
Just because it wasn't obvious to me: getting this into ollama is going to be more work than just passing down the mentioned parameters.
It appears that we basically have to replicate this as well: 9ca2e67762, and keep track of fixes (for example, there are more): 84e1c33cde, 1da7b76569

@bfroemel commented on GitHub (Dec 11, 2024):
Before moving forward with a prototype implementation, it may be helpful to discuss the necessary changes. Imo we roughly have the following tasks:

1. passing the draft model and its parameters down through the API/Modelfile
2. loading the draft model in the runner (cf4d7c52c4/llama/runner/runner.go (L845)) and using it during inference (cf4d7c52c4/llama/runner/runner.go (L360))
3. interfacing the llama.cpp speculative-decoding code from Go via C wrappers
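To make the shape of the speculative loop concrete, here is a toy, self-contained sketch in Go (my illustration only: the stand-in "models" and all names are invented, and this is not Ollama runner code):

```go
package main

import "fmt"

// model stands in for a greedy next-token prediction given the sequence so far.
type model func(seq []int) int

// speculativeGenerate drafts k tokens with the cheap model, then verifies them
// against the target model, keeping the longest agreeing prefix each round.
func speculativeGenerate(target, draft model, seq []int, k, maxNew int) []int {
	generated := 0
	for generated < maxNew {
		// 1. Draft k candidate tokens autoregressively with the small model.
		cand := append([]int(nil), seq...)
		var proposal []int
		for i := 0; i < k; i++ {
			t := draft(cand)
			proposal = append(proposal, t)
			cand = append(cand, t)
		}
		// 2. Verify: a real engine scores all k positions in one batched
		//    forward pass; here we just query the target once per prefix.
		accepted := 0
		for i := range proposal {
			want := target(seq)
			if want != proposal[i] {
				// 3. First mismatch: keep the target's token and stop.
				seq = append(seq, want)
				generated++
				break
			}
			seq = append(seq, proposal[i])
			accepted++
			generated++
		}
		// 4. If every draft was accepted, the verification pass also yields
		//    one bonus token from the target model for free.
		if accepted == len(proposal) {
			seq = append(seq, target(seq))
			generated++
		}
	}
	return seq
}

func main() {
	// Toy models: both emit (last token + 1) mod 10; the draft deliberately
	// disagrees whenever the next token would be 7, exercising rejection.
	target := func(seq []int) int { return (seq[len(seq)-1] + 1) % 10 }
	draft := func(seq []int) int {
		t := (seq[len(seq)-1] + 1) % 10
		if t == 7 {
			return 0 // deliberate mismatch
		}
		return t
	}
	fmt.Println(speculativeGenerate(target, draft, []int{0}, 4, 20))
}
```

@mspinelli commented on GitHub (Dec 15, 2024):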
Maybe this is not helpful, but perhaps there are some additional ideas on how to easily add this functionality by looking at how the llama-swap project accomplishes this?
@bfroemel commented on GitHub (Dec 16, 2024):
Before having looked at the source I also assumed that ollama just starts llama.cpp server instances similar to llama-swap. I guess there are or have been good reasons why ollama reimplemented that part of llama.cpp; probably because of the added flexibility and maybe being able to implement some features quicker than having to wait for upstream. At least for this feature, upstream was faster; also it is my impression that llama.cpp server nowadays appears more sophisticated than what we have in ollama, so for the long run it might really be a good idea to look into adopting llama.cpp server directly as a runner (and add potentially missing instrumentation/control API to llama.cpp server).
Anyway, since I wanted to understand speculative decoding and get into Go, I tried to move forward with the previously outlined tasks and made progress with 1. and 3. (it turned out ollama already interfaces C++ code via C wrappers, so this was easy to extend). The second task is a bit of a struggle to debug. As soon as I have something of initial proof-of-concept quality to show in a couple of days, and if @oxfighterjet hasn't already done so, I'll open a PR...
@oxfighterjet commented on GitHub (Dec 16, 2024):
@bfroemel Thanks for sharing your thoughts and your intentions. I am mostly interested in this feature being implemented at all, but my personal availability has decreased lately, with my work requiring more of my attention before the end of the year. It seems you are interested in taking over this issue and I'm glad to hand it over, I do not want to claim any exclusivity over it. If you have some ideas of how to implement it, please go ahead. I will anyway follow the progress of this issue closely, and am hoping for this feature to be propagated all the way to the top with open-webui :)
@bfroemel commented on GitHub (Dec 18, 2024):
@oxfighterjet @sammcj Could you take a look at https://github.com/ollama/ollama/pull/8134 ? Testing/reviews/comments very welcome ;)
@sammcj commented on GitHub (Jan 1, 2025):
Seeing more very positive things about the performance, and, surprisingly, the TDP/power usage required with speculative decoding in llama.cpp: https://www.reddit.com/r/LocalLLaMA/comments/1hqlug2/revisting_llamacpp_speculative_decoding_w/
@zjh-nuc-AIOT commented on GitHub (Feb 11, 2025):
Has this feature been implemented yet?
@sammcj commented on GitHub (Feb 12, 2025):
FYI LM Studio just added speculative decoding / draft model support in 0.3.10.
As expected, it makes a fantastic improvement to performance (10+%).
Qwen 2.5 32b q6_k, on my M2 Pro with and without the Qwen 2.5 0.5b q4_k_m draft model for speculative decoding:
@dalisoft commented on GitHub (Feb 13, 2025):
@sammcj For now we're seeing smaller performance improvements (on M1 and M2 chips); later improvements could be bigger.
@StevePierce commented on GitHub (Mar 2, 2025):
Posting here since this seems to be the most active thread, just wanted to ask if speculative decoding is on the roadmap?
@hennas-waifson commented on GitHub (Mar 16, 2025):
It would be so nice to have that feature. It would make a huge difference for many Ollama users.
@yurii-sio2 commented on GitHub (Mar 19, 2025):
My vote for this feature.
@iSevenDays commented on GitHub (Mar 21, 2025):
I vote for this feature too!
@sammcj commented on GitHub (Apr 2, 2025):
Really big improvements from recent llama.cpp versions:
Qwen 2.5 Coder 32b, 32k context:
FYI @jmorganca ^^
@bfroemel commented on GitHub (Apr 2, 2025):
Is this only abd4d0bc4, or did you notice any other changes related to speculative decoding?

@sammcj commented on GitHub (Apr 2, 2025):
@bfroemel unsure of the exact commit that made the difference, but damn, it's impressive; that's up there with ExllamaV2 speculative decoding performance now.
I'm seeing a solid ~33.5% performance increase by loading a tiny 0.5b draft model, with hardly any additional vRAM usage and of course no quality difference.
@Abdulrahman392011 commented on GitHub (Apr 2, 2025):
You could always add the feature and keep it dormant; Apple uses this strategy all the time. Have it under development features, ask people to report anything related to it, and explain that it's still in beta.
This would give a bit of context on how it performs, so as not to be embarrassed if some system is incompatible, crashes, or sees lower performance. In cases like these it's better to have as many beta testers as possible, and thankfully most Ollama users are somewhat experienced with computers and can be proactive in testing such a feature.
@pdevine commented on GitHub (Apr 11, 2025):
I really want to do speculative decoding in Ollama, but my concern is always us trying to take on too much too quickly. Especially now that we have the new engine and we're slowly deprecating the llama.cpp engine: if we add it in the new engine it would only work on a handful of models at first (although maybe that's fine). I also want to make sure we figure out the local vs. hybrid story (i.e. offloading the big model to a different Ollama server).
@pdevine commented on GitHub (Apr 11, 2025):
@sammcj hopefully you don't mind, but I changed the issue title since I think it's a broader topic. We wouldn't enable llama.cpp's draft mode since we're moving away from llama.cpp on the backend anyway.
@Abdulrahman392011 commented on GitHub (Apr 11, 2025):
Well, no worries. Whatever you guys are doing, keep doing it; the results are great.
These things take patience, and rushing won't give us the results we need. So no pressure; after all, we understand the Ollama team is doing this because they want to, not because they need to or have to.
@sammcj commented on GitHub (Apr 11, 2025):
@pdevine no worries at all! Not precious about the title by any means and am fully in support of any method of bringing speculative decoding to Ollama.
Developer & team health and well being > Product vision
I would say that when looking at new features and functionality for Ollama, you'll have to be careful not to fall too far behind performance-wise; there are some very real, significant gains to be had from speculative decoding.
@Master-Pr0grammer commented on GitHub (Apr 28, 2025):
@pdevine just out of curiosity, what is the reason behind wanting to move away from llama.cpp? would it not be more efficient to stick to llama.cpp, and instead of making your own engine to support features, just contribute to llama.cpp to implement your features?
That way you get the added benefit of more community support.
Or has that proven too difficult/inefficient?
@pdevine commented on GitHub (Apr 28, 2025):
@Master-Pr0grammer I have utmost respect for ggml and the llama.cpp project and what Georgi has done, but we were finding that we were diverging too much from llama.cpp and our design philosophies are very different.
@Master-Pr0grammer commented on GitHub (Apr 28, 2025):
ah i see, makes sense. I was just curious since it was brought up.
@Wladastic commented on GitHub (May 4, 2025):
Instead of only adding this feature, why not allow users to split inference between layers?
You could even make a test script that goes through combinations of layers and stitches together a frankenmerge of the bigger and smaller LLM.
@pdevine commented on GitHub (May 16, 2025):
I have some ideas around how to get this going in the new engine. It hinges on getting the logprobs, but should be doable. Hopefully I'll have some more concrete details in a few weeks once I've finished up some other work.
@pdevine commented on GitHub (Jul 24, 2025):
OK, I haven't forgotten about this, but we've been trying to get 0.10.0 out the door. We still need logprobs to be exposed properly to make it work.
@sammcj commented on GitHub (Jul 24, 2025):
Thanks @pdevine , love your work!
@rpeinl commented on GitHub (Aug 2, 2025):
There is a new GLM model, version 4.5, available in a bigger and a smaller version, similar to Llama 4:
https://huggingface.co/zai-org/GLM-4.5-Air
This looks very promising regarding model accuracy and it can do multi-token prediction (MTP).
Unfortunately, there is not much information available about how this works in the inference engine. However, there is a recent paper from Apple that links MTP to speculative decoding.
https://arxiv.org/html/2507.11851v1
Since tools like LM Studio already support GLM 4.5 and also support speculative decoding, maybe MTP only works in combination with speculative decoding.
Anyway, I would be extremely interested in getting this model to work in ollama, including MTP.
@BigArty commented on GitHub (Aug 10, 2025):
Is it possible that there will be a way to do speculative decoding based on n-grams of some given text (or the prompt and dialogue history)? For weaker GPUs it is by far the best approach, and it performs similarly to or faster than 0.5B assistant models for ~8B-14B generator models.
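For what it's worth, the n-gram lookup step is simple; here is a toy sketch in Go (my illustration: prompt lookup decoding replaces the draft model with a search for the trailing n-gram earlier in the context, and the main model verifies the copied tokens as usual):

```go
package main

import "fmt"

// proposeFromLookup returns up to maxDraft tokens copied from just after the
// most recent earlier occurrence of the trailing n-gram, or nil if none exists.
func proposeFromLookup(ctx []int, n, maxDraft int) []int {
	if len(ctx) < n+1 {
		return nil
	}
	tail := ctx[len(ctx)-n:]
	// Scan backwards so the most recent earlier occurrence wins.
	for start := len(ctx) - n - 1; start >= 0; start-- {
		match := true
		for j := 0; j < n; j++ {
			if ctx[start+j] != tail[j] {
				match = false
				break
			}
		}
		if match {
			from, to := start+n, start+n+maxDraft
			if to > len(ctx) {
				to = len(ctx)
			}
			return append([]int(nil), ctx[from:to]...)
		}
	}
	return nil
}

func main() {
	// 1="the", 2="cat", 3="sat", 4="on": the context "the cat sat on the"
	// drafts "cat sat on" after re-seeing the 1-gram "the".
	ctx := []int{1, 2, 3, 4, 1}
	fmt.Println(proposeFromLookup(ctx, 1, 3)) // [2 3 4]
}
```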
@BigArty commented on GitHub (Oct 14, 2025):
@pdevine Is there any chance that this feature is still in development?
@dhirajlochib commented on GitHub (Jan 8, 2026):
Hi, I've been working on implementing speculative decoding support and have completed the foundational infrastructure. Here's the current status.

Implemented:

```
FROM qwen2.5:3b
DRAFT qwen2.5:0.5b
```

- `Draft` field throughout the stack: the `api.CreateRequest` and `api.ShowResponse` types, `ConfigV2` for persistence, `server/create.go` and `server/images.go`
- `loadDraftModel()` for async background loading
- `GetLoadedRunner()` to retrieve the loaded draft model
- speculative decoding scaffolding (`speculative/speculative.go`)

What's not working yet:

The actual 2-4x speedup doesn't activate because the integration needs to go deeper into the runner's token generation loop. Currently:

- `GenerateHandler` still uses standard single-model completion
- a `SpeculativeCompletion` method exists, but it needs integration into the runner's core inference loop

Testing shows identical token generation rates with/without the draft model because speculative decoding isn't engaging.

Need guidance:

The final step requires changes to `runner/ollamarunner/runner.go`, specifically the token-by-token generation logic. This touches critical inference code that I'm less familiar with.

Branch: `feature/speculative-decoding`

Happy to continue working on this with guidance, or to hand off the runner integration to someone more familiar with that codebase. The foundation is solid and ready for the final piece!!!
@Filipp-Druan commented on GitHub (Apr 12, 2026):
Hello!
Please tell me what's going on with speculative decoding?
It's really important to me that this feature works. It's really hard without it! The models are incredibly slow!
Perhaps you could add Prompt Lookup Decoding? I really, really need fast program execution!