[GH-ISSUE #5800] Enable speculative decoding #65653

Open
opened 2026-05-03 22:04:16 -05:00 by GiteaMirror · 63 comments

Originally created by @sammcj on GitHub (Jul 19, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/5800

Howdy fine Ollama folks 👋 ,

Around this time last year, llama.cpp added support for speculative decoding using a draft model parameter: https://github.com/ggerganov/llama.cpp/issues/2030

This can massively speed up inference.

I was wondering if there's any chance you could look at adding support for llama.cpp's --model-draft parameter, which enables this?


It works by loading a smaller model (with the same tokeniser/family) in front of a larger model.

I know that with exllamav2 you can get 100%-200% speed increases (seriously!), and the best part is that it comes with no loss of quality.

For example, you might have:

  • Main model: Qwen 72b Q4_K_M
  • Draft model: Qwen 0.5b Q4_K_M

The result would be the memory usage of the main Qwen 72b model plus the tiny 0.5b draft model, but around 4x the tokens/s you'd see with just the main model loaded.
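
To illustrate the mechanism, here is a minimal sketch of the draft-and-verify loop in Go. The Model interface is hypothetical (neither Ollama's nor llama.cpp's API); with greedy decoding, verification guarantees the output matches what the target model would have produced on its own:

```go
package speculative

// Model is a hypothetical interface standing in for a loaded LLM;
// Next returns the greedy next token for the given context.
type Model interface {
	Next(ctx []int) int
}

// Speculate drafts nDraft tokens with the cheap draft model, then
// verifies them against the target. The speedup comes from the target
// scoring all drafted positions in one batched forward pass (modelled
// token-by-token here for clarity); accepted runs of draft tokens cost
// roughly one target pass per batch instead of one per token.
func Speculate(target, draft Model, ctx []int, nDraft int) []int {
	// 1. Draft candidate tokens with the small model.
	dctx := append([]int(nil), ctx...)
	drafted := make([]int, 0, nDraft)
	for i := 0; i < nDraft; i++ {
		t := draft.Next(dctx)
		drafted = append(drafted, t)
		dctx = append(dctx, t)
	}

	// 2. Verify with the target; keep drafted tokens until the first
	//    mismatch, at which point the target's own token is used.
	out := append([]int(nil), ctx...)
	for _, t := range drafted {
		want := target.Next(out)
		out = append(out, want) // the target's token is always kept
		if want != t {
			break // discard the rest of the draft
		}
	}
	return out[len(ctx):]
}
```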

I've been using exllamav2 instead of Ollama for this feature (and the 4bit K/V cache #5091) recently and the performance really is astounding.


Parameters that can be passed to llama.cpp's server:

  1. --model-draft (required) - the usage is the same as the existing --model used for loading normal models.
  2. --draft (optional, but recommended to make available)
  3. --p-split (optional, but recommended to make available)

There is also the following, but I think the default is probably fine 99% of the time:

  • --threads-draft
  • --threads-batch-draft

https://github.com/ggerganov/llama.cpp/blob/master/examples/server/README.md?plain=1#L37

  -md,   --model-draft FNAME      draft model for speculative decoding (default: unused)
  -td,   --threads-draft N        number of threads to use during generation (default: same as --threads)
  -tbd,  --threads-batch-draft N  number of threads to use during batch and prompt processing (default: same as --threads-draft)
         --draft N                number of tokens to draft for speculative decoding (default: 5)
  -ps,   --p-split N              speculative decoding split probability (default: 0.1)
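
For context, a llama-server invocation combining the flags listed above (the model paths are placeholders) might have looked like:

```
llama-server --model ./qwen2-72b-q4_k_m.gguf \
  --model-draft ./qwen2-0.5b-q4_k_m.gguf \
  --draft 5 --p-split 0.1
```
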
GiteaMirror added the performance and feature request labels 2026-05-03 22:04:17 -05:00

@sammcj commented on GitHub (Jul 19, 2024):

I just found this old issue which, while pretty out of date, seems to be related: https://github.com/ollama/ollama/issues/1292

@jmorganca commented on GitHub (Sep 4, 2024):

@sammcj this is really, really cool. Sorry for the late reply. By the way, check out https://infini-ai-lab.github.io/Sequoia-Page/ if you haven't yet.

@sammcj commented on GitHub (Sep 4, 2024):

Oh that's a neat project too! thanks @jmorganca :)

@sammcj commented on GitHub (Sep 4, 2024):

I could imagine using draft models with Ollama Modelfiles to be quite a nice combo, e.g:

FROM qwen2:72b-q4_k_m
DRAFT qwen2:1.5-q4_k_m

@enn-nafnlaus commented on GitHub (Sep 22, 2024):

> I could imagine using draft models with Ollama Modelfiles to be quite a nice combo, e.g:
>
> FROM qwen2:72b-q4_k_m
> DRAFT qwen2:1.5-q4_k_m

The best option is generally a wide-but-shallow sheared model rather than a generalist small model. For example, I made some GGUF conversions of the Yan et al. models here:

https://huggingface.co/Nafnlaus/Wide-Sheared-LLaMA-290M-GGUF

Non-GGUFs here:

https://huggingface.co/minghaoyan

The TL/DR is that getting "an" answer out fast is more important than getting a good answer out.

@sammcj commented on GitHub (Sep 22, 2024):

@enn-nafnlaus That's really interesting to see. I'll have to read the paper, but I'm assuming this results in the draft model taking significantly less vRAM, as it's essentially just the one layer being loaded (along with the tokenizer)?

I suspect this might be what Exllamav2 does with its draft model loading, as it doesn't seem to use as much vRAM as loading the model normally.

  • Have you documented or scripted the steps you took to generate such models?
  • Have you considered creating some wide/shallow sheared models for Llama 3.1 and Qwen2.5?

@sammcj commented on GitHub (Sep 22, 2024):

I should add that I find draft models/speculative decoding with Exllamav2 so useful that I often find myself choosing ExllamaV2 (via TabbyAPI (https://github.com/theroyallab/tabbyAPI) / TabbyLoader (https://github.com/theroyallab/tabbyAPI-gradio-loader)) over Ollama when loading models larger than ~30b - the performance improvements when running 70b models are nothing short of amazing.

@oxfighterjet commented on GitHub (Nov 6, 2024):

Is this a feature the ollama maintainers would be interested in implementing? I'm asking because I'm considering giving it a shot.

Are there some pointers / suggestions for anyone unfamiliar looking into implementing this? Thank you.

@sammcj commented on GitHub (Nov 6, 2024):

@oxfighterjet I think this would be amazing to add, combined with #6279 - this would bring Ollama up to speed with the likes of ExllamaV2 / TabbyAPI which have had these as core features for a long time.

I was actually planning on trying to get it merged after #6279 is merged. As such I'd be more than happy to work with you on this (just note I'm only a contributor - not a maintainer).

If you look at #6279 you'll see how I've added parameters that pass down to the underlying llama.cpp.

I would take the same approach, but also make sure there is support for configuring the draft model in the Modelfile and API. This is something I did have in my PR (prior to the latest refactor for the new runners/server) but was asked to remove, as Ollama didn't want to add new features to the API / CLI at the time; for the draft model feature, however, it will be required by design. I still have the code kicking around for this here: https://github.com/sammcj/ollama/pull/26/files.

Again - I was going to work on this after #6279 is merged, assuming it actually is merged in soon - I'd still be happy to do the work for this ticket, or work with you on it - be it doing a first pass for you to review/improve, or simply to help with peer review.

One thing to be aware of expectations-wise: getting features merged into Ollama is painfully slow - as are the feedback cycles - just to set your expectations up front 😅.

@oxfighterjet commented on GitHub (Nov 7, 2024):

@sammcj Thank you for your helpful resources, they will most certainly come in handy!
I see indeed that your #6279 PR has been a rollercoaster of a ride, I hope it can get merged soon.

I'll study the codebase and your PRs and see what I can contribute. I might get back to you with questions! :)

@bsu3338 commented on GitHub (Nov 10, 2024):

> @sammcj Thank you for your helpful resources, they will most certainly come in handy! I see indeed that your #6279 PR has been a rollercoaster of a ride, I hope it can get merged soon.
>
> I'll study the codebase and your PRs and see what I can contribute. I might get back to you with questions! :)

Have you guys also considered the below approach? It looks like you could mix and match 2 models. However, it might not be as performant.

https://huggingface.co/blog/universal_assisted_generation

@TheTerrasque commented on GitHub (Nov 12, 2024):

I tried running llama server with speculative decoding to see if I could speed up some model, but I found out it's not supported by the server:

https://github.com/ggerganov/llama.cpp/issues/5877

@TheTerrasque commented on GitHub (Nov 25, 2024):

https://github.com/ggerganov/llama.cpp/pull/10455 - this is now in llama.cpp server!

@oxfighterjet commented on GitHub (Nov 25, 2024):

Great, I'm still on the ollama implementation and I'll be able to test it now. Will report back when I have a working prototype.

Edit: I have to admit I was following relevant threads of llama.cpp and didn't get a single notification, so it escaped me. Thanks for bringing it up.

Edit: I'm guessing it might take some time for these changes to propagate to ollama, given #7670 has been open for two weeks and would need to be updated.

@chris-calo commented on GitHub (Dec 3, 2024):

@oxfighterjet looks like #7875 was favoured over #7670, and is moving faster, if it helps any

@sammcj commented on GitHub (Dec 4, 2024):

Looks like performance just got another big bump thanks to https://github.com/ggerganov/llama.cpp/pull/10586 (source: https://www.reddit.com/r/LocalLLaMA/comments/1h5uq43/llamacpp_bug_fixed_speculative_decoding_is_30/)

@cduk commented on GitHub (Dec 6, 2024):

These options were tantalisingly mentioned in the opening post, but they don't seem to be valid options in llama-server. Have these been implemented in any branch, or are they just proposals?

  -md,   --model-draft FNAME      draft model for speculative decoding (default: unused)
  -td,   --threads-draft N        number of threads to use during generation (default: same as --threads)
  -tbd,  --threads-batch-draft N  number of threads to use during batch and prompt processing (default: same as --threads-draft)
         --draft N                number of tokens to draft for speculative decoding (default: 5)
  -ps,   --p-split N              speculative decoding split probability (default: 0.1)

@TheTerrasque commented on GitHub (Dec 6, 2024):

It was merged into master 2 weeks ago. Check the PR link I gave a few posts up.

--draft-max, --draft, --draft-n N       number of tokens to draft for speculative decoding (default: 16)
--draft-min, --draft-n-min N            minimum number of draft tokens to use for speculative decoding
                                        (default: 5)
--draft-p-min P                         minimum speculative decoding probability (greedy) (default: 0.9)
-cd,   --ctx-size-draft N               size of the prompt context for the draft model (default: 0, 0 = loaded
                                        from model)
-devd, --device-draft <dev1,dev2,..>    comma-separated list of devices to use for offloading the draft model
                                        (none = don't offload)
                                        use --list-devices to see a list of available devices
-ngld, --gpu-layers-draft, --n-gpu-layers-draft N
                                        number of layers to store in VRAM for the draft model
-md,   --model-draft FNAME              draft model for speculative decoding (default: unused)

These are the current llama-server options relating to the draft model.

@cduk commented on GitHub (Dec 6, 2024):

I will check again but I was referring specifically to flags -tbd, -td and -ps.

@bfroemel commented on GitHub (Dec 10, 2024):

> I could imagine using draft models with Ollama Modelfiles to be quite a nice combo, e.g:
>
> FROM qwen2:72b-q4_k_m
> DRAFT qwen2:1.5-q4_k_m

Agreeing - the draft model (model-draft) and most related parameters (draft-max, draft-min, draft-p-min, ctx-size-draft) should be specified in the Modelfile. Some parameters could be defined in the environment (device-draft, gpu-layers-draft), unless there is a good way to derive them automatically.

  • model-draft: [Modelfile, DRAFT] ollama uses layers (media type: application/vnd.ollama.image.model) to reference model blobs. It would be nice to reuse the same mechanism and store the layer reference in the model manifest. Maybe just assume that the first application/vnd.ollama.image.model layer is the main model, and an optional additional application/vnd.ollama.image.model layer is the draft model? (See the manifest sketch after this list.)
  • draft-max, draft-min, draft-p-min, ctx-size-draft: [Modelfile, PARAMETER] Those parameters appear to be just runner options, and can probably be added very easily.
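
To make the manifest idea concrete, a hypothetical manifest along these lines might look like the sketch below (structure simplified, digests are placeholders; only the application/vnd.ollama.image.model media type is taken from the proposal above):

```json
{
  "schemaVersion": 2,
  "layers": [
    { "mediaType": "application/vnd.ollama.image.model", "digest": "sha256:<main-model-blob>" },
    { "mediaType": "application/vnd.ollama.image.model", "digest": "sha256:<draft-model-blob>" }
  ]
}
```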

@Steel-skull commented on GitHub (Dec 11, 2024):

looks like https://github.com/ollama/ollama/pull/7875 was merged

@chris-calo commented on GitHub (Dec 11, 2024):

@oxfighterjet are you still working on this?

@oxfighterjet commented on GitHub (Dec 11, 2024):

@chris-calo yes.

@bfroemel commented on GitHub (Dec 11, 2024):

Just because it wasn't obvious to me: getting this into ollama is going to be more work than just passing down the mentioned parameters.

It appears that we basically have to replicate this as well:
https://github.com/ggerganov/llama.cpp/commit/9ca2e677626fce759d5d95c407c03677b9c87a26
and keep track of fixes (for example, there are more):
https://github.com/ggerganov/llama.cpp/commit/84e1c33cde9e0a7aafcda2d4f21ba51c300482d7
https://github.com/ggerganov/llama.cpp/commit/1da7b765692764a8b33b08da61cbee63812a7bd9

@bfroemel commented on GitHub (Dec 11, 2024):

Before moving forward with a prototype implementation, it may be helpful to discuss the necessary changes.

Imo we roughly have the following tasks:

  1. Parameter handling: how and where to define the parameters, how to reference draft models such that the existing model repository can be used for loading a draft model, and how to pass all required parameters down to a runner.
  2. Reimplementation of the actual draft model feature in the runner, i.e., draft model loading (https://github.com/ollama/ollama/blob/cf4d7c52c47d753bd04a8791b9c6042271c40c1e/llama/runner/runner.go#L845) and using it during inference (https://github.com/ollama/ollama/blob/cf4d7c52c47d753bd04a8791b9c6042271c40c1e/llama/runner/runner.go#L360)
  3. Reimplement, or preferably somehow reuse, the existing utility code (cgo) located in llama.cpp (https://github.com/ggerganov/llama.cpp/blob/master/common/speculative.cpp, https://github.com/ggerganov/llama.cpp/blob/master/common/common.cpp) by appropriately extending https://github.com/ollama/ollama/blob/main/llama/llama.go (a rough sketch of what that surface might look like follows this list)
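
As a rough illustration of task 3, the Go-side surface of such a binding might look like the sketch below. All names are hypothetical stand-ins (not llama.cpp's or Ollama's actual API), and the cgo-backed bodies are omitted:

```go
package llama

// Context stands in for ollama's existing cgo wrapper around a loaded
// llama.cpp context.
type Context struct{}

// Speculative holds draft-model state, mirroring what llama.cpp keeps
// on the C++ side in common/speculative.cpp.
type Speculative struct{}

// NewSpeculative would initialise speculative-decoding state for an
// already-loaded draft context via a C wrapper.
func NewSpeculative(draft *Context) (*Speculative, error) {
	panic("cgo-backed body omitted in this sketch")
}

// GenDraft would ask the draft model for up to max candidate tokens
// continuing the given prompt.
func (s *Speculative) GenDraft(prompt []int32, max int) []int32 {
	panic("cgo-backed body omitted in this sketch")
}
```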

@mspinelli commented on GitHub (Dec 15, 2024):

Maybe this is not helpful, but perhaps there are some additional ideas on how to easily add this functionality by looking at how the llama-swap project (https://github.com/mostlygeek/llama-swap/blob/main/examples/speculative-decoding/README.md) accomplishes this?

@bfroemel commented on GitHub (Dec 16, 2024):

Before having looked at the source, I also assumed that ollama just starts llama.cpp server instances, similar to llama-swap. I guess there are, or have been, good reasons why ollama reimplemented that part of llama.cpp; probably the added flexibility, and maybe being able to implement some features quicker than having to wait for upstream. At least for this feature, upstream was faster. It is also my impression that llama.cpp server nowadays appears more sophisticated than what we have in ollama, so in the long run it might really be a good idea to look into adopting llama.cpp server directly as a runner (and adding any missing instrumentation/control API to llama.cpp server).

Anyway, as I wanted to understand speculative decoding and get into Go, I tried to move forward with the previously outlined tasks and made progress with 1. and 3. (it turned out ollama already interfaces C++ code via C wrappers, so this was easy to extend). The second task is a bit of a struggle to debug. As soon as I have something of initial proof-of-concept quality to show in a couple of days, and if @oxfighterjet hasn't already done so, I'll open a PR...

@oxfighterjet commented on GitHub (Dec 16, 2024):

@bfroemel Thanks for sharing your thoughts and your intentions. I am mostly interested in this feature being implemented at all, but my personal availability has decreased lately, with my work requiring more of my attention before the end of the year. It seems you are interested in taking over this issue and I'm glad to hand it over, I do not want to claim any exclusivity over it. If you have some ideas of how to implement it, please go ahead. I will anyway follow the progress of this issue closely, and am hoping for this feature to be propagated all the way to the top with open-webui :)

@bfroemel commented on GitHub (Dec 18, 2024):

@oxfighterjet @sammcj Could you take a look at https://github.com/ollama/ollama/pull/8134 ? Testing/reviews/comments very welcome ;)

@sammcj commented on GitHub (Jan 1, 2025):

Seeing more very positive things about the performance and, surprisingly, the TDP/power usage required with speculative decoding in llama.cpp: https://www.reddit.com/r/LocalLLaMA/comments/1hqlug2/revisting_llamacpp_speculative_decoding_w/

@zjh-nuc-AIOT commented on GitHub (Feb 11, 2025):

Has this feature been implemented yet?

@sammcj commented on GitHub (Feb 12, 2025):

FYI LM Studio just added speculative decoding / draft model support in 0.3.10.

> We're excited to share LM Studio 0.3.10 (b1) in Beta with... 🥁Speculative Decoding!
>
> Speculative Decoding is a technique to gain inference speed-ups, sometimes up to 1.5x-3x, using a combination of a "main model" and a "draft model". This works best with a large main model and a very small draft model.
>
> How to turn on Speculative Decoding
>
> In the Chat or Server UI, you will see a new section titled "Speculative Decoding". Once you load a model, you will be able to choose a compatible draft model to be used.
>
> For API usage, pass the draft_model field in addition to the model field to pick which draft model to use.

As expected it makes a fantastic improvement to performance (10+%).

Qwen 2.5 32b q6_k, on my M2 Pro with and without the Qwen 2.5 0.5b q4_k_m draft model for speculative decoding:

  • Without speculative decoding: 10.4tk/s
  • With speculative decoding: 11.5tk/s

@dalisoft commented on GitHub (Feb 13, 2025):

@sammcj For now we're seeing smaller performance improvements (on M1 and M2 chips); later improvements could be bigger.

@StevePierce commented on GitHub (Mar 2, 2025):

Posting here since this seems to be the most active thread, just wanted to ask if speculative decoding is on the roadmap?

@hennas-waifson commented on GitHub (Mar 16, 2025):

It would be so nice to have that feature. It would make a huge difference for many Ollama users.

@yurii-sio2 commented on GitHub (Mar 19, 2025):

My vote for this feature.

@iSevenDays commented on GitHub (Mar 21, 2025):

I vote for this feature too!

@sammcj commented on GitHub (Apr 2, 2025):

Really big improvements from recent llama.cpp versions:

Qwen 2.5 Coder 32b, 32k context:

  • Llama.cpp with draft model: 39.69Tk/s
  • Ollama without draft model: 29.75Tk/s

FYI @jmorganca ^^


llama-server --port 12394 -ngl 99 --ctx-size 32768 -fa --cache-type-k q8_0 --cache-type-v q8_0 --host 0.0.0.0 --model-draft ./Qwen2.5-Coder-0.5B-Instruct-Q5_K_M.gguf --draft-max 24 --draft-min 1 --draft-p-min 0.6 --n-gpu-layers-draft 99 --parallel 4 --model ./Qwen2.5-Coder-32B-Instruct-Q5_K_M.gguf

prompt eval time =      90.59 ms /    73 tokens (    1.24 ms per token,   805.79 tokens per second)
       eval time =   20029.22 ms /   795 tokens (   25.19 ms per token,    39.69 tokens per second)

ollama run qwen2.5-coder-32b-instruct-128k:q5_k_m --verbose
>>> /set parameter num_ctx 32768

total duration:       27.74635847s
load duration:        8.672663ms
prompt eval count:    127 token(s)
prompt eval duration: 135.316513ms
prompt eval rate:     938.54 tokens/s
eval count:           821 token(s)
eval duration:        27.600339374s
eval rate:            29.75 tokens/s

@bfroemel commented on GitHub (Apr 2, 2025):

Is this only https://github.com/ggml-org/llama.cpp/commit/abd4d0bc4f1a9a0e429bc8ee0d5ece2a394a0a39, or did you notice any other changes related to speculative decoding?

@sammcj commented on GitHub (Apr 2, 2025):

@bfroemel unsure of the exact commit that made the difference but - damn, it's impressive - that's up there with ExllamaV2 speculative decoding performance now.

I'm seeing a solid 33.5%~ performance increase by loading a tiny 0.5b draft model with hardly any additional vRAM usage and of course no quality difference.

@Abdulrahman392011 commented on GitHub (Apr 2, 2025):

You can always add the feature and make it dormant. Apple uses this strategy all the time: put it under development features, ask people to report anything related to it, and explain that it's still in beta.

This will give a bit of context on how it will perform, so as not to embarrass oneself if some system is incompatible, crashes, or sees lower performance. In cases like these it's better to have as many beta testers as possible, and thankfully most Ollama users are somewhat experienced with computers and can be proactive in testing such a feature.

@pdevine commented on GitHub (Apr 11, 2025):

I really want to do speculative decoding in Ollama, but my concern is always us trying to take too much on too quickly. Especially now that we have the new engine and we're slowly deprecating llama.cpp engine; if we add it in the new engine it would only work on a handful of models at first (although maybe that's fine). I also want to make sure we figure out the local vs. hybrid story (i.e. offloading the big model to a different Ollama server).

@pdevine commented on GitHub (Apr 11, 2025):

@sammcj hopefully you don't mind, but I changed the issue title since I think it's a broader topic. We wouldn't enable llama.cpp's draft mode since we're moving away from llama.cpp on the backend anyway.

@Abdulrahman392011 commented on GitHub (Apr 11, 2025):

well, no worries. whatever you guys are doing, keep doing it. the results are great.

these things take patience and rushing it won't give us the results we need. so no pressure, after all we understand the ollama team is doing this cause they want to, not cause they need to or have to.

@sammcj commented on GitHub (Apr 11, 2025):

@pdevine no worries at all! Not precious about the title by any means and am fully in support of any method of bringing speculative decoding to Ollama.

"my concern is always us trying to take too much on too quickly"

Developer & team health and well being > Product vision

I would say that when looking at new features and functionality for Ollama, you'll have to be careful not to fall too far behind performance-wise; there are some very real, significant gains to be had from speculative decoding.

@Master-Pr0grammer commented on GitHub (Apr 28, 2025):

@pdevine Just out of curiosity, what is the reason behind wanting to move away from llama.cpp? Would it not be more efficient to stick with llama.cpp and, instead of making your own engine to support features, just contribute your features to llama.cpp?

That way you get the added benefit of more community support.

Or has that proven too difficult/inefficient?

@pdevine commented on GitHub (Apr 28, 2025):

@Master-Pr0grammer I have utmost respect for ggml and the llama.cpp project and what Georgi has done, but we were finding that we were diverging too much from llama.cpp and our design philosophies are very different.

@Master-Pr0grammer commented on GitHub (Apr 28, 2025):

Ah I see, makes sense. I was just curious since it was brought up.

@Wladastic commented on GitHub (May 4, 2025):

Instead of only adding this feature, why not allow users to split inference between layers?
You could even make a test script that goes through combinations of layers and stitches together a frankenmerge of the bigger and smaller LLM.

@pdevine commented on GitHub (May 16, 2025):

I have some ideas around how to get this going in the new engine. This hinges on getting the logprobs, but should be doable. Hopefully I'll have some more concrete details in a few weeks once I've finished up some other work.

@pdevine commented on GitHub (Jul 24, 2025):

OK, I haven't forgotten about this, but we've been trying to get 0.10.0 out the door. We still need logprobs to be exposed properly to make it work.

@sammcj commented on GitHub (Jul 24, 2025):

Thanks @pdevine , love your work!

@rpeinl commented on GitHub (Aug 2, 2025):

There is a new GLM model, version 4.5, available in a bigger and a smaller variant, similar to Llama 4:
https://huggingface.co/zai-org/GLM-4.5-Air
This looks very promising regarding model accuracy and it can do multi-token prediction (MTP).
Unfortunately, there is not much information available about how this works in the inference engine. However, there is a recent paper from Apple that links MTP to speculative decoding.
https://arxiv.org/html/2507.11851v1
Since tools like LM Studio already support GLM 4.5 and also support speculative decoding, maybe the two only work together.
Anyway, I would be extremely interested in getting this model to work in ollama, including MTP.

@BigArty commented on GitHub (Aug 10, 2025):

Is it possible that there will be a way to do speculative decoding based on n-grams of some given text (or the prompt and dialogue history)? It is by far the best approach for weaker GPUs, and is similar to or faster than 0.5B assistant models for ~8B-14B generator models.
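
For reference, the n-gram idea (often called prompt-lookup decoding) needs no draft model at all: find the most recent earlier occurrence of the context's trailing n-gram and replay the tokens that followed it as the draft. A minimal Go sketch of that lookup, for illustration only (llama.cpp's implementation is more elaborate):

```go
package speculative

// NgramDraft proposes draft tokens by prompt lookup: find the most
// recent earlier occurrence of the last n context tokens and replay
// what followed them. No draft model is needed, which is why this is
// attractive on weaker GPUs.
func NgramDraft(ctx []int, n, maxDraft int) []int {
	if len(ctx) <= n {
		return nil
	}
	tail := ctx[len(ctx)-n:]
	// Scan backwards for an earlier match of the trailing n-gram.
	for i := len(ctx) - n - 1; i >= 0; i-- {
		if equal(ctx[i:i+n], tail) {
			end := i + n + maxDraft
			if end > len(ctx) {
				end = len(ctx)
			}
			return ctx[i+n : end] // tokens that followed the match
		}
	}
	return nil // no repeat found: fall back to normal decoding
}

func equal(a, b []int) bool {
	for i := range a {
		if a[i] != b[i] {
			return false
		}
	}
	return true
}
```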

@BigArty commented on GitHub (Oct 14, 2025):

@pdevine Is there any chance that this feature is still in development?

@dhirajlochib commented on GitHub (Jan 8, 2026):

Hi, ahm, I've been working on implementing speculative decoding support and have completed the foundational infrastructure... here's the current status:

Implemented:

  1. Modelfile DRAFT Command - Parse and store draft model references

FROM qwen2.5:3b
DRAFT qwen2.5:0.5b

  2. API & Config Support - Added Draft field throughout the stack:
  • api.CreateRequest and api.ShowResponse
  • types.ConfigV2 for persistence
  • Model storage/retrieval in server/create.go and server/images.go
  3. Scheduler Integration - Co-loading of draft and target models:
  • loadDraftModel() for async background loading
  • GetLoadedRunner() to retrieve loaded draft model
  4. Speculative Engine (speculative/speculative.go):
  • Draft token generation
  • Batch verification with target model
  • Acceptance criterion using rejection sampling (per Leviathan et al., 2022) - see the sketch after this list
  • Metrics tracking (acceptance rate, speedup estimation)
  5. Tests & Documentation - Parser tests, unit tests, and Modelfile docs
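
For reference, a minimal Go sketch of that acceptance criterion (hypothetical helper code, not taken from the branch): a drafted token x is accepted with probability min(1, p(x)/q(x)), where p is the target model's probability for x and q is the draft's; rejections are resampled from the renormalised residual max(0, p - q), which keeps the output distribution identical to the target's.

```go
package speculative

import "math/rand"

// AcceptToken implements the stochastic acceptance rule from
// Leviathan et al.: accept a drafted token with probability
// min(1, pTarget/qDraft).
func AcceptToken(pTarget, qDraft float64, rng *rand.Rand) bool {
	if qDraft <= 0 {
		return false
	}
	if pTarget >= qDraft {
		return true
	}
	return rng.Float64() < pTarget/qDraft
}

// Residual builds the renormalised max(0, p-q) distribution from
// which the replacement token is sampled after a rejection.
func Residual(p, q []float64) []float64 {
	out := make([]float64, len(p))
	var sum float64
	for i := range p {
		if d := p[i] - q[i]; d > 0 {
			out[i] = d
			sum += d
		}
	}
	if sum == 0 {
		return out // p == q pointwise; a rejection cannot occur
	}
	for i := range out {
		out[i] /= sum
	}
	return out
}
```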

What's not working yet:

The actual 2-4x speedup doesn't activate because the integration needs to go deeper into the runner's token generation loop. Currently:

  • Draft model loads successfully
  • But GenerateHandler still uses standard single-model completion
  • The SpeculativeCompletion method exists but needs integration into the runner's core inference loop

Testing shows identical token generation rates with/without draft model because speculative decoding isn't engaging.

Need guidance:
The final step requires changes to runner/ollamarunner/runner.go - specifically the token-by-token generation logic. This touches critical inference code that I'm less familiar with.

Questions:

  1. Should the speculative logic live in the runner, or can it wrap the completion flow at a higher level?
  2. Are there specific patterns in the runner for batched token verification I should follow?
  3. Would the maintainers prefer to handle the runner integration, or should I continue working on it?

Branch: feature/speculative-decoding (https://github.com/dhirajlochib/ollama/tree/feature/speculative-decoding)

Happy to continue working on this with guidance, or hand off the runner integration to someone more familiar with that codebase. The foundation is solid and ready for the final piece!!!

@Filipp-Druan commented on GitHub (Apr 12, 2026):

Hello!
Please tell me what's going on with speculative decoding?
It's really important to me that this feature works. It's really hard without it! The models are incredibly slow!

Perhaps you could add Prompt Lookup Decoding? I really, really need fast program execution!

@pdevine commented on GitHub (Apr 13, 2026):

OK, an update on this. Yes, I'm still looking at it, but I've been focusing on the MLX runner. I have a prototype of MTP working w/ MLX and a new multi-token sampler, but we need to land the new batching changes for the MLX runner first - which will also change the sampler - before we can get this in.

Also, for non-Metal users, we're working on getting the MLX runner to work on other platforms (i.e. CUDA), so there's a bit of juggling that needs to happen.

@alexander-potemkin commented on GitHub (Apr 14, 2026):

> for non-Metal users, we're also working on getting the MLX runner to work on other platforms (i.e. CUDA), so there's a bit of juggling that needs to happen.

@pdevine, thanks for sharing! Does the MLX runner have any benefits for non-Mac systems?
I haven't heard anything about that, but it seems like there must be some, since you're considering porting the code?

@Filipp-Druan commented on GitHub (Apr 14, 2026):

@pdevine
Excuse me, but what about speculative decoding based on n-grams? This is really, really, really important to me!
Llama.cpp already has this feature! You just need to add a command line option to Ollama! This can speed up inference significantly!

@ucffool commented on GitHub (Apr 14, 2026):

meh. It can, but with MoE models doing some of the same heavy lifting built-in, that seems to be the current direction for speeding up inference.

@Filipp-Druan commented on GitHub (Apr 14, 2026):

> meh. It can, but with MoE models doing some of the same heavy lifting built-in, that seems to be the current direction for speeding up inference.

That's not it. I need other models, not MoE.
I need to speed up the inference of regular models! You see, simple n-gram-based acceleration can improve speed at a very small cost!

@Filipp-Druan commented on GitHub (Apr 14, 2026):

Using MoE reduces the model's capabilities compared to dense versions of the same size.
But n-grams don't!
