[GH-ISSUE #5800] Enable speculative decoding #65653

Open
opened 2026-05-03 22:04:16 -05:00 by GiteaMirror · 63 comments

Originally created by @sammcj on GitHub (Jul 19, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/5800

Howdy fine Ollama folks 👋 ,

Around this time last year, llama.cpp added support for speculative decoding using a draft model parameter: https://github.com/ggerganov/llama.cpp/issues/2030

This can massively speed up inference.

I was wondering if there's any chance you could look at adding support for llama.cpp's --model-draft parameter, which enables this?


It works by loading a smaller model (with the same tokeniser/family) in front of a larger model.

I know that with exllamav2 you can get 100%-200% speed increases (seriously!), and the best part is that it comes with no loss of quality.

For example, you might have:

  • Main model: Qwen 72b Q4_K_M
  • Draft model: Qwen 0.5b Q4_K_M

The result would be the memory usage of the main Qwen 72b model plus the tiny 0.5b draft model, but around 4x the tokens/s you'd see with just the main model loaded.
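
To illustrate the mechanism, here is a minimal sketch of the draft-and-verify loop in Go. The Model interface is hypothetical (neither Ollama's nor llama.cpp's API); with greedy decoding, verification guarantees the output matches what the target model would have produced on its own:

```go
package speculative

// Model is a hypothetical interface standing in for a loaded LLM;
// Next returns the greedy next token for the given context.
type Model interface {
	Next(ctx []int) int
}

// Speculate drafts nDraft tokens with the cheap draft model, then
// verifies them against the target. The speedup comes from the target
// scoring all drafted positions in one batched forward pass (modelled
// token-by-token here for clarity); accepted runs of draft tokens cost
// roughly one target pass per batch instead of one per token.
func Speculate(target, draft Model, ctx []int, nDraft int) []int {
	// 1. Draft candidate tokens with the small model.
	dctx := append([]int(nil), ctx...)
	drafted := make([]int, 0, nDraft)
	for i := 0; i < nDraft; i++ {
		t := draft.Next(dctx)
		drafted = append(drafted, t)
		dctx = append(dctx, t)
	}

	// 2. Verify with the target; keep drafted tokens until the first
	//    mismatch, at which point the target's own token is used.
	out := append([]int(nil), ctx...)
	for _, t := range drafted {
		want := target.Next(out)
		out = append(out, want) // the target's token is always kept
		if want != t {
			break // discard the rest of the draft
		}
	}
	return out[len(ctx):]
}
```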

I've been using exllamav2 instead of Ollama for this feature (and the 4bit K/V cache #5091) recently and the performance really is astounding.


Parameters that can be passed to llama.cpp's server:

  1. --model-draft (required) - the usage is the same as the existing --model used for loading normal models.
  2. --draft (optional, but recommended to make available)
  3. --p-split (optional, but recommended to make available)

There is also the following, but I think the default is probably fine 99% of the time:

  • --threads-draft
  • --threads-batch-draft

https://github.com/ggerganov/llama.cpp/blob/master/examples/server/README.md?plain=1#L37

  -md,   --model-draft FNAME      draft model for speculative decoding (default: unused)
  -td,   --threads-draft N        number of threads to use during generation (default: same as --threads)
  -tbd,  --threads-batch-draft N  number of threads to use during batch and prompt processing (default: same as --threads-draft)
         --draft N                number of tokens to draft for speculative decoding (default: 5)
  -ps,   --p-split N              speculative decoding split probability (default: 0.1)
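
For context, a llama-server invocation combining the flags listed above (the model paths are placeholders) might have looked like:

```
llama-server --model ./qwen2-72b-q4_k_m.gguf \
  --model-draft ./qwen2-0.5b-q4_k_m.gguf \
  --draft 5 --p-split 0.1
```
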
GiteaMirror added the performance and feature request labels 2026-05-03 22:04:17 -05:00

@sammcj commented on GitHub (Jul 19, 2024):

I just found this old issue which, while pretty out of date, seems to be related: https://github.com/ollama/ollama/issues/1292

@jmorganca commented on GitHub (Sep 4, 2024):

@sammcj this is really, really cool. Sorry for the late reply. By the way, check out https://infini-ai-lab.github.io/Sequoia-Page/ if you haven't yet.

@sammcj commented on GitHub (Sep 4, 2024):

Oh that's a neat project too! thanks @jmorganca :)

@sammcj commented on GitHub (Sep 4, 2024):

I could imagine using draft models with Ollama Modelfiles to be quite a nice combo, e.g:

FROM qwen2:72b-q4_k_m
DRAFT qwen2:1.5-q4_k_m

@enn-nafnlaus commented on GitHub (Sep 22, 2024):

> I could imagine using draft models with Ollama Modelfiles to be quite a nice combo, e.g:
>
> FROM qwen2:72b-q4_k_m
> DRAFT qwen2:1.5-q4_k_m

The best option is generally a wide-but-shallow sheared model rather than a generalist small model. For example, I made some GGUF conversions of the Yan et al. models here:

https://huggingface.co/Nafnlaus/Wide-Sheared-LLaMA-290M-GGUF

Non-GGUFs here:

https://huggingface.co/minghaoyan

The TL/DR is that getting "an" answer out fast is more important than getting a good answer out.

@sammcj commented on GitHub (Sep 22, 2024):

@enn-nafnlaus That's really interesting to see. I'll have to read the paper, but I'm assuming this results in the draft model taking significantly less vRAM, as it's essentially just the one layer being loaded (along with the tokenizer)?

I suspect this might be what Exllamav2 does with its draft model loading, as it doesn't seem to use as much vRAM as loading the model normally.

  • Have you documented or scripted the steps you took to generate such models?
  • Have you considered creating some wide/shallow sheared models for Llama 3.1 and Qwen2.5?

@sammcj commented on GitHub (Sep 22, 2024):

I should add that I find draft models/speculative decoding with Exllamav2 so useful that I often find myself choosing ExllamaV2 (via TabbyAPI (https://github.com/theroyallab/tabbyAPI) / TabbyLoader (https://github.com/theroyallab/tabbyAPI-gradio-loader)) over Ollama when loading models larger than ~30b - the performance improvements when running 70b models are nothing short of amazing.

@oxfighterjet commented on GitHub (Nov 6, 2024):

Is this a feature the ollama maintainers would be interested in implementing? I'm asking because I'm considering giving it a shot.

Are there some pointers / suggestions for anyone unfamiliar looking into implementing this? Thank you.

@sammcj commented on GitHub (Nov 6, 2024):

@oxfighterjet I think this would be amazing to add, combined with #6279 - this would bring Ollama up to speed with the likes of ExllamaV2 / TabbyAPI which have had these as core features for a long time.

I was actually planning on trying to get it merged after #6279 is merged. As such I'd be more than happy to work with you on this (just note I'm only a contributor - not a maintainer).

If you look at #6279 you'll see how I've added parameters that pass down to the underlying llama.cpp.

I would take the same approach, but also make sure there is support for configuring the draft model in the Modelfile and API. This is something I did have in my PR (prior to the latest refactor for the new runners/server) but was asked to remove, as Ollama didn't want to add new features to the API / CLI at the time; for the draft model feature, however, it will be required by design. I still have the code kicking around for this here: https://github.com/sammcj/ollama/pull/26/files.

Again - I was going to work on this after #6279 is merged, assuming it actually is merged in soon - I'd still be happy to do the work for this ticket, or work with you on it - be it doing a first pass for you to review/improve, or simply to help with peer review.

One thing to be aware of expectations-wise: getting features merged into Ollama is painfully slow - as are the feedback cycles - just to set your expectations up front 😅.

@oxfighterjet commented on GitHub (Nov 7, 2024):

@sammcj Thank you for your helpful resources, they will most certainly come in handy!
I see indeed that your #6279 PR has been a rollercoaster of a ride, I hope it can get merged soon.

I'll study the codebase and your PRs and see what I can contribute. I might get back to you with questions! :)

@bsu3338 commented on GitHub (Nov 10, 2024):

> @sammcj Thank you for your helpful resources, they will most certainly come in handy! I see indeed that your #6279 PR has been a rollercoaster of a ride, I hope it can get merged soon.
>
> I'll study the codebase and your PRs and see what I can contribute. I might get back to you with questions! :)

Have you guys also considered the below approach? It looks like you could mix and match 2 models. However, it might not be as performant.

https://huggingface.co/blog/universal_assisted_generation

@TheTerrasque commented on GitHub (Nov 12, 2024):

I tried running llama server with speculative decoding to see if I could speed up some model, but I found out it's not supported by the server:

https://github.com/ggerganov/llama.cpp/issues/5877

@TheTerrasque commented on GitHub (Nov 25, 2024):

https://github.com/ggerganov/llama.cpp/pull/10455 - this is now in llama.cpp server!

@oxfighterjet commented on GitHub (Nov 25, 2024):

Great, I'm still on the ollama implementation and I'll be able to test it now. Will report back when I have a working prototype.

Edit: I have to admit I was following relevant threads of llama.cpp and didn't get a single notification, so it escaped me. Thanks for bringing it up.

Edit: I'm guessing it might take some time for these changes to propagate to ollama, given #7670 has been open for two weeks and would need to be updated.

@chris-calo commented on GitHub (Dec 3, 2024):

@oxfighterjet looks like #7875 was favoured over #7670, and is moving faster, if it helps any

@sammcj commented on GitHub (Dec 4, 2024):

Looks like performance just got another big bump thanks to https://github.com/ggerganov/llama.cpp/pull/10586 (source: https://www.reddit.com/r/LocalLLaMA/comments/1h5uq43/llamacpp_bug_fixed_speculative_decoding_is_30/)

@cduk commented on GitHub (Dec 6, 2024):

These options were tantalisingly mentioned in the opening post, but they don't seem to be valid options in llama-server. Have these been implemented in any branch, or are they just proposals?

  -md,   --model-draft FNAME      draft model for speculative decoding (default: unused)
  -td,   --threads-draft N        number of threads to use during generation (default: same as --threads)
  -tbd,  --threads-batch-draft N  number of threads to use during batch and prompt processing (default: same as --threads-draft)
         --draft N                number of tokens to draft for speculative decoding (default: 5)
  -ps,   --p-split N              speculative decoding split probability (default: 0.1)

@TheTerrasque commented on GitHub (Dec 6, 2024):

It was merged into master 2 weeks ago. Check the PR link I gave a few posts up.

--draft-max, --draft, --draft-n N       number of tokens to draft for speculative decoding (default: 16)
--draft-min, --draft-n-min N            minimum number of draft tokens to use for speculative decoding
                                        (default: 5)
--draft-p-min P                         minimum speculative decoding probability (greedy) (default: 0.9)
-cd,   --ctx-size-draft N               size of the prompt context for the draft model (default: 0, 0 = loaded
                                        from model)
-devd, --device-draft <dev1,dev2,..>    comma-separated list of devices to use for offloading the draft model
                                        (none = don't offload)
                                        use --list-devices to see a list of available devices
-ngld, --gpu-layers-draft, --n-gpu-layers-draft N
                                        number of layers to store in VRAM for the draft model
-md,   --model-draft FNAME              draft model for speculative decoding (default: unused)

These are the current llama-server options relating to the draft model.

@cduk commented on GitHub (Dec 6, 2024):

I will check again but I was referring specifically to flags -tbd, -td and -ps.

@bfroemel commented on GitHub (Dec 10, 2024):

> I could imagine using draft models with Ollama Modelfiles to be quite a nice combo, e.g:
>
> FROM qwen2:72b-q4_k_m
> DRAFT qwen2:1.5-q4_k_m

Agreeing - the draft model (model-draft) and most related parameters (draft-max, draft-min, draft-p-min, ctx-size-draft) should be specified in the Modelfile. Some parameters could be defined in the environment (device-draft, gpu-layers-draft), unless there is a good way to derive them automatically.

  • model-draft: [Modelfile, DRAFT] ollama uses layers (media type: application/vnd.ollama.image.model) to reference model blobs. It would be nice to reuse the same mechanism and store the layer reference in the model manifest. Maybe just assume that the first application/vnd.ollama.image.model layer is the main model, and an optional additional application/vnd.ollama.image.model layer is the draft model? (See the manifest sketch after this list.)
  • draft-max, draft-min, draft-p-min, ctx-size-draft: [Modelfile, PARAMETER] Those parameters appear to be just runner options, and can probably be added very easily.
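
To make the manifest idea concrete, a hypothetical manifest along these lines might look like the sketch below (structure simplified, digests are placeholders; only the application/vnd.ollama.image.model media type is taken from the proposal above):

```json
{
  "schemaVersion": 2,
  "layers": [
    { "mediaType": "application/vnd.ollama.image.model", "digest": "sha256:<main-model-blob>" },
    { "mediaType": "application/vnd.ollama.image.model", "digest": "sha256:<draft-model-blob>" }
  ]
}
```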

@Steel-skull commented on GitHub (Dec 11, 2024):

looks like https://github.com/ollama/ollama/pull/7875 was merged

@chris-calo commented on GitHub (Dec 11, 2024):

@oxfighterjet are you still working on this?

@oxfighterjet commented on GitHub (Dec 11, 2024):

@chris-calo yes.

@bfroemel commented on GitHub (Dec 11, 2024):

Just because it wasn't obvious to me: getting this into ollama is going to be more work than just passing down the mentioned parameters.

It appears that we basically have to replicate this as well:
https://github.com/ggerganov/llama.cpp/commit/9ca2e677626fce759d5d95c407c03677b9c87a26
and keep track of fixes (for example, there are more):
https://github.com/ggerganov/llama.cpp/commit/84e1c33cde9e0a7aafcda2d4f21ba51c300482d7
https://github.com/ggerganov/llama.cpp/commit/1da7b765692764a8b33b08da61cbee63812a7bd9

@bfroemel commented on GitHub (Dec 11, 2024):

Before moving forward with a prototype implementation, it may be helpful to discuss the necessary changes.

Imo we roughly have the following tasks:

  1. Parameter handling: how and where to define the parameters, how to reference draft models such that the existing model repository can be used for loading a draft model, and how to pass all required parameters down to a runner.
  2. Reimplementation of the actual draft model feature in the runner, i.e., draft model loading (https://github.com/ollama/ollama/blob/cf4d7c52c47d753bd04a8791b9c6042271c40c1e/llama/runner/runner.go#L845) and using it during inference (https://github.com/ollama/ollama/blob/cf4d7c52c47d753bd04a8791b9c6042271c40c1e/llama/runner/runner.go#L360)
  3. Reimplement, or preferably somehow reuse, the existing utility code (cgo) located in llama.cpp (https://github.com/ggerganov/llama.cpp/blob/master/common/speculative.cpp, https://github.com/ggerganov/llama.cpp/blob/master/common/common.cpp) by appropriately extending https://github.com/ollama/ollama/blob/main/llama/llama.go (a rough sketch of what that surface might look like follows this list)
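
As a rough illustration of task 3, the Go-side surface of such a binding might look like the sketch below. All names are hypothetical stand-ins (not llama.cpp's or Ollama's actual API), and the cgo-backed bodies are omitted:

```go
package llama

// Context stands in for ollama's existing cgo wrapper around a loaded
// llama.cpp context.
type Context struct{}

// Speculative holds draft-model state, mirroring what llama.cpp keeps
// on the C++ side in common/speculative.cpp.
type Speculative struct{}

// NewSpeculative would initialise speculative-decoding state for an
// already-loaded draft context via a C wrapper.
func NewSpeculative(draft *Context) (*Speculative, error) {
	panic("cgo-backed body omitted in this sketch")
}

// GenDraft would ask the draft model for up to max candidate tokens
// continuing the given prompt.
func (s *Speculative) GenDraft(prompt []int32, max int) []int32 {
	panic("cgo-backed body omitted in this sketch")
}
```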

@mspinelli commented on GitHub (Dec 15, 2024):

Maybe this is not helpful, but perhaps there are some additional ideas on how to easily add this functionality by looking at how the llama-swap project (https://github.com/mostlygeek/llama-swap/blob/main/examples/speculative-decoding/README.md) accomplishes this?

@bfroemel commented on GitHub (Dec 16, 2024):

Before having looked at the source, I also assumed that ollama just starts llama.cpp server instances, similar to llama-swap. I guess there are, or have been, good reasons why ollama reimplemented that part of llama.cpp; probably the added flexibility, and maybe being able to implement some features quicker than having to wait for upstream. At least for this feature, upstream was faster. It is also my impression that llama.cpp server nowadays appears more sophisticated than what we have in ollama, so in the long run it might really be a good idea to look into adopting llama.cpp server directly as a runner (and adding any missing instrumentation/control API to llama.cpp server).

Anyway, as I wanted to understand speculative decoding and get into Go, I tried to move forward with the previously outlined tasks and made progress with 1. and 3. (it turned out ollama already interfaces C++ code via C wrappers, so this was easy to extend). The second task is a bit of a struggle to debug. As soon as I have something of initial proof-of-concept quality to show in a couple of days, and if @oxfighterjet hasn't already done so, I'll open a PR...

@oxfighterjet commented on GitHub (Dec 16, 2024):

@bfroemel Thanks for sharing your thoughts and your intentions. I am mostly interested in this feature being implemented at all, but my personal availability has decreased lately, with my work requiring more of my attention before the end of the year. It seems you are interested in taking over this issue and I'm glad to hand it over, I do not want to claim any exclusivity over it. If you have some ideas of how to implement it, please go ahead. I will anyway follow the progress of this issue closely, and am hoping for this feature to be propagated all the way to the top with open-webui :)

@bfroemel commented on GitHub (Dec 18, 2024):

@oxfighterjet @sammcj Could you take a look at https://github.com/ollama/ollama/pull/8134 ? Testing/reviews/comments very welcome ;)

@sammcj commented on GitHub (Jan 1, 2025):

Seeing more very positive things about the performance and, surprisingly, the TDP/power usage required with speculative decoding in llama.cpp: https://www.reddit.com/r/LocalLLaMA/comments/1hqlug2/revisting_llamacpp_speculative_decoding_w/

@zjh-nuc-AIOT commented on GitHub (Feb 11, 2025):

Has this feature been implemented yet?

@sammcj commented on GitHub (Feb 12, 2025):

FYI LM Studio just added speculative decoding / draft model support in 0.3.10.

> We're excited to share LM Studio 0.3.10 (b1) in Beta with... 🥁Speculative Decoding!
>
> Speculative Decoding is a technique to gain inference speed-ups, sometimes up to 1.5x-3x, using a combination of a "main model" and a "draft model". This works best with a large main model and a very small draft model.
>
> How to turn on Speculative Decoding
>
> In the Chat or Server UI, you will see a new section titled "Speculative Decoding". Once you load a model, you will be able to choose a compatible draft model to be used.
>
> For API usage, pass the draft_model field in addition to the model field to pick which draft model to use.

As expected it makes a fantastic improvement to performance (10+%).

Qwen 2.5 32b q6_k, on my M2 Pro with and without the Qwen 2.5 0.5b q4_k_m draft model for speculative decoding:

  • Without speculative decoding: 10.4tk/s
  • With speculative decoding: 11.5tk/s

@dalisoft commented on GitHub (Feb 13, 2025):

@sammcj For now we're seeing smaller performance improvements (on M1 and M2 chips); later improvements could be bigger.

@StevePierce commented on GitHub (Mar 2, 2025):

Posting here since this seems to be the most active thread, just wanted to ask if speculative decoding is on the roadmap?

@hennas-waifson commented on GitHub (Mar 16, 2025):

It would be so nice to have that feature. It would make a huge difference for many Ollama users.

@yurii-sio2 commented on GitHub (Mar 19, 2025):

My vote for this feature.

@iSevenDays commented on GitHub (Mar 21, 2025):

I vote for this feature too!

@sammcj commented on GitHub (Apr 2, 2025):

Really big improvements from recent llama.cpp versions:

Qwen 2.5 Coder 32b, 32k context:

  • Llama.cpp with draft model: 39.69Tk/s
  • Ollama without draft model: 29.75Tk/s

FYI @jmorganca ^^


llama-server --port 12394 -ngl 99 --ctx-size 32768 -fa --cache-type-k q8_0 --cache-type-v q8_0 --host 0.0.0.0 --model-draft ./Qwen2.5-Coder-0.5B-Instruct-Q5_K_M.gguf --draft-max 24 --draft-min 1 --draft-p-min 0.6 --n-gpu-layers-draft 99 --parallel 4 --model ./Qwen2.5-Coder-32B-Instruct-Q5_K_M.gguf

prompt eval time =      90.59 ms /    73 tokens (    1.24 ms per token,   805.79 tokens per second)
       eval time =   20029.22 ms /   795 tokens (   25.19 ms per token,    39.69 tokens per second)

ollama run qwen2.5-coder-32b-instruct-128k:q5_k_m --verbose
>>> /set parameter num_ctx 32768

total duration:       27.74635847s
load duration:        8.672663ms
prompt eval count:    127 token(s)
prompt eval duration: 135.316513ms
prompt eval rate:     938.54 tokens/s
eval count:           821 token(s)
eval duration:        27.600339374s
eval rate:            29.75 tokens/s

@bfroemel commented on GitHub (Apr 2, 2025):

Is this only https://github.com/ggml-org/llama.cpp/commit/abd4d0bc4f1a9a0e429bc8ee0d5ece2a394a0a39, or did you notice any other changes related to speculative decoding?

@sammcj commented on GitHub (Apr 2, 2025):

@bfroemel unsure of the exact commit that made the difference but - damn, it's impressive - that's up there with ExllamaV2 speculative decoding performance now.

I'm seeing a solid 33.5%~ performance increase by loading a tiny 0.5b draft model with hardly any additional vRAM usage and of course no quality difference.

@Abdulrahman392011 commented on GitHub (Apr 2, 2025):

You can always add the feature and make it dormant. Apple uses this strategy all the time: put it under development features, ask people to report anything related to it, and explain that it's still in beta.

This will give a bit of context on how it will perform, so as not to embarrass oneself if some system is incompatible, crashes, or sees lower performance. In cases like these it's better to have as many beta testers as possible, and thankfully most Ollama users are somewhat experienced with computers and can be proactive in testing such a feature.

@pdevine commented on GitHub (Apr 11, 2025):

I really want to do speculative decoding in Ollama, but my concern is always us trying to take too much on too quickly. Especially now that we have the new engine and we're slowly deprecating llama.cpp engine; if we add it in the new engine it would only work on a handful of models at first (although maybe that's fine). I also want to make sure we figure out the local vs. hybrid story (i.e. offloading the big model to a different Ollama server).

@pdevine commented on GitHub (Apr 11, 2025):

@sammcj hopefully you don't mind, but I changed the issue title since I think it's a broader topic. We wouldn't enable llama.cpp's draft mode since we're moving away from llama.cpp on the backend anyway.

@Abdulrahman392011 commented on GitHub (Apr 11, 2025):

well, no worries. whatever you guys are doing, keep doing it. the results are great.

these things take patience and rushing it won't give us the results we need. so no pressure, after all we understand the ollama team is doing this cause they want to, not cause they need to or have to.

@sammcj commented on GitHub (Apr 11, 2025):

@pdevine no worries at all! Not precious about the title by any means and am fully in support of any method of bringing speculative decoding to Ollama.

"my concern is always us trying to take too much on too quickly"

Developer & team health and well being > Product vision

I would say that when looking at new features and functionality for Ollama, you'll have to be careful not to fall too far behind performance-wise; there are some very real, significant gains to be had from speculative decoding.

@Master-Pr0grammer commented on GitHub (Apr 28, 2025):

@pdevine Just out of curiosity, what is the reason behind wanting to move away from llama.cpp? Would it not be more efficient to stick with llama.cpp and, instead of making your own engine to support features, just contribute your features to llama.cpp?

That way you get the added benefit of more community support.

Or has that proven too difficult/inefficient?

@pdevine commented on GitHub (Apr 28, 2025):

@Master-Pr0grammer I have utmost respect for ggml and the llama.cpp project and what Georgi has done, but we were finding that we were diverging too much from llama.cpp and our design philosophies are very different.

@Master-Pr0grammer commented on GitHub (Apr 28, 2025):

Ah I see, makes sense. I was just curious since it was brought up.

@Wladastic commented on GitHub (May 4, 2025):

Instead of only adding this feature, why not allow users to split inference between layers?
You could even make a test script that goes through combinations of layers and stitches together a frankenmerge of the bigger and smaller LLM.

@pdevine commented on GitHub (May 16, 2025):

I have some ideas around how to get this going in the new engine. This hinges on getting the logprobs, but should be doable. Hopefully I'll have some more concrete details in a few weeks once I've finished up some other work.

@pdevine commented on GitHub (Jul 24, 2025):

OK, I haven't forgotten about this, but we've been trying to get 0.10.0 out the door. We still need logprobs to be exposed properly to make it work.

@sammcj commented on GitHub (Jul 24, 2025):

Thanks @pdevine , love your work!

@rpeinl commented on GitHub (Aug 2, 2025):

There is a new GLM model, version 4.5, available in a bigger and a smaller variant, similar to Llama 4:
https://huggingface.co/zai-org/GLM-4.5-Air
This looks very promising regarding model accuracy and it can do multi-token prediction (MTP).
Unfortunately, there is not much information available about how this works in the inference engine. However, there is a recent paper from Apple that links MTP to speculative decoding.
https://arxiv.org/html/2507.11851v1
Since tools like LM Studio already support GLM 4.5 and also support speculative decoding, maybe the two only work together.
Anyway, I would be extremely interested in getting this model to work in ollama, including MTP.

@BigArty commented on GitHub (Aug 10, 2025):

Is it possible that there will be a way to do speculative decoding based on n-grams of some given text (or the prompt and dialogue history)? It is by far the best approach for weaker GPUs, and is similar to or faster than 0.5B assistant models for ~8B-14B generator models.
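
For reference, the n-gram idea (often called prompt-lookup decoding) needs no draft model at all: find the most recent earlier occurrence of the context's trailing n-gram and replay the tokens that followed it as the draft. A minimal Go sketch of that lookup, for illustration only (llama.cpp's implementation is more elaborate):

```go
package speculative

// NgramDraft proposes draft tokens by prompt lookup: find the most
// recent earlier occurrence of the last n context tokens and replay
// what followed them. No draft model is needed, which is why this is
// attractive on weaker GPUs.
func NgramDraft(ctx []int, n, maxDraft int) []int {
	if len(ctx) <= n {
		return nil
	}
	tail := ctx[len(ctx)-n:]
	// Scan backwards for an earlier match of the trailing n-gram.
	for i := len(ctx) - n - 1; i >= 0; i-- {
		if equal(ctx[i:i+n], tail) {
			end := i + n + maxDraft
			if end > len(ctx) {
				end = len(ctx)
			}
			return ctx[i+n : end] // tokens that followed the match
		}
	}
	return nil // no repeat found: fall back to normal decoding
}

func equal(a, b []int) bool {
	for i := range a {
		if a[i] != b[i] {
			return false
		}
	}
	return true
}
```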

@BigArty commented on GitHub (Oct 14, 2025):

@pdevine Is there any chance that this feature is still in development?

@dhirajlochib commented on GitHub (Jan 8, 2026):

Hi, ahm, I've been working on implementing speculative decoding support and have completed the foundational infrastructure... here's the current status:

Implemented:

  1. Modelfile DRAFT Command - Parse and store draft model references

FROM qwen2.5:3b
DRAFT qwen2.5:0.5b

  2. API & Config Support - Added Draft field throughout the stack:
  • api.CreateRequest and api.ShowResponse
  • types.ConfigV2 for persistence
  • Model storage/retrieval in server/create.go and server/images.go
  3. Scheduler Integration - Co-loading of draft and target models:
  • loadDraftModel() for async background loading
  • GetLoadedRunner() to retrieve loaded draft model
  4. Speculative Engine (speculative/speculative.go):
  • Draft token generation
  • Batch verification with target model
  • Acceptance criterion using rejection sampling (per Leviathan et al., 2022) - see the sketch after this list
  • Metrics tracking (acceptance rate, speedup estimation)
  5. Tests & Documentation - Parser tests, unit tests, and Modelfile docs
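
For reference, a minimal Go sketch of that acceptance criterion (hypothetical helper code, not taken from the branch): a drafted token x is accepted with probability min(1, p(x)/q(x)), where p is the target model's probability for x and q is the draft's; rejections are resampled from the renormalised residual max(0, p - q), which keeps the output distribution identical to the target's.

```go
package speculative

import "math/rand"

// AcceptToken implements the stochastic acceptance rule from
// Leviathan et al.: accept a drafted token with probability
// min(1, pTarget/qDraft).
func AcceptToken(pTarget, qDraft float64, rng *rand.Rand) bool {
	if qDraft <= 0 {
		return false
	}
	if pTarget >= qDraft {
		return true
	}
	return rng.Float64() < pTarget/qDraft
}

// Residual builds the renormalised max(0, p-q) distribution from
// which the replacement token is sampled after a rejection.
func Residual(p, q []float64) []float64 {
	out := make([]float64, len(p))
	var sum float64
	for i := range p {
		if d := p[i] - q[i]; d > 0 {
			out[i] = d
			sum += d
		}
	}
	if sum == 0 {
		return out // p == q pointwise; a rejection cannot occur
	}
	for i := range out {
		out[i] /= sum
	}
	return out
}
```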

What's not working yet:

The actual 2-4x speedup doesn't activate because the integration needs to go deeper into the runner's token generation loop. Currently:

  • Draft model loads successfully
  • But GenerateHandler still uses standard single-model completion
  • The SpeculativeCompletion method exists but needs integration into the runner's core inference loop

Testing shows identical token generation rates with/without draft model because speculative decoding isn't engaging.

Need guidance:
The final step requires changes to runner/ollamarunner/runner.go - specifically the token-by-token generation logic. This touches critical inference code that I'm less familiar with.

Questions:

  1. Should the speculative logic live in the runner, or can it wrap the completion flow at a higher level?
  2. Are there specific patterns in the runner for batched token verification I should follow?
  3. Would the maintainers prefer to handle the runner integration, or should I continue working on it?

Branch: feature/speculative-decoding (https://github.com/dhirajlochib/ollama/tree/feature/speculative-decoding)

Happy to continue working on this with guidance, or hand off the runner integration to someone more familiar with that codebase. The foundation is solid and ready for the final piece!!!

@Filipp-Druan commented on GitHub (Apr 12, 2026):

Hello!
Please tell me what's going on with speculative decoding?
It's really important to me that this feature works. It's really hard without it! The models are incredibly slow!

Perhaps you could add Prompt Lookup Decoding? I really, really need fast program execution!

@pdevine commented on GitHub (Apr 13, 2026):

OK, an update on this. Yes, I'm still looking at it, but I've been focusing on the MLX runner. I have a prototype of MTP working w/ MLX and a new multi-token sampler, but we need to land the new batching changes for the MLX runner first - which will also change the sampler - before we can get this in.

Also, for non-Metal users, we're working on getting the MLX runner to work on other platforms (i.e. CUDA), so there's a bit of juggling that needs to happen.

@alexander-potemkin commented on GitHub (Apr 14, 2026):

> for non-Metal users, we're also working on getting the MLX runner to work on other platforms (i.e. CUDA), so there's a bit of juggling that needs to happen.

@pdevine, thanks for sharing! Does the MLX runner have any benefits for non-Mac systems?
I haven't heard anything about that, but it seems like there must be some, since you're considering porting the code?

@Filipp-Druan commented on GitHub (Apr 14, 2026):

@pdevine
Excuse me, but what about speculative decoding based on n-grams? This is really, really, really important to me!
Llama.cpp already has this feature! You just need to add a command line option to Ollama! This can speed up inference significantly!

@ucffool commented on GitHub (Apr 14, 2026):

meh. It can, but with MoE models doing some of the same heavy lifting built-in, that seems to be the current direction for speeding up inference.

@Filipp-Druan commented on GitHub (Apr 14, 2026):

> meh. It can, but with MoE models doing some of the same heavy lifting built-in, that seems to be the current direction for speeding up inference.

That's not it. I need other models, not MoE.
I need to speed up the inference of regular models! You see, simple n-gram-based acceleration can improve speed at a very small cost!

@Filipp-Druan commented on GitHub (Apr 14, 2026):

Using MoE reduces the model's capabilities compared to dense versions of the same size.
But n-grams don't!
