[GH-ISSUE #9216] Speculative Decoding for faster inference. #31766

Closed
opened 2026-04-22 12:31:45 -05:00 by GiteaMirror · 3 comments
Owner

Originally created by @Meshwa428 on GitHub (Feb 19, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/9216

[llama.cpp](https://github.com/ggml-org/llama.cpp/pull/2926) supports speculative decoding, which loads the main model alongside a smaller draft model that runs at lower latency.

Will it be implemented in Ollama? Is it in your future plans?

I saw a PR that was [rejected](https://github.com/ollama/ollama/pull/8134), but that was because of the backend rewrite. The rewrite has already been done, right? So why not add speculative decoding?
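For context, the draft-then-verify loop behind speculative decoding can be sketched as below. This is a toy illustration, not llama.cpp's or Ollama's implementation: `draft_next` and `main_next` are hypothetical stand-ins for the small and large models, and real implementations batch the verification into a single forward pass of the main model.

```python
def draft_next(prefix):
    """Hypothetical small/fast model: cheaply guesses the next token."""
    return (prefix[-1] + 1) % 10 if prefix else 0

def main_next(prefix):
    """Hypothetical large/accurate model: the token that must be matched."""
    return (prefix[-1] + 1) % 10 if prefix and prefix[-1] != 4 else 0

def speculative_step(prefix, k=4):
    """One speculative-decoding step.

    The draft model proposes k tokens autoregressively; the main model
    then verifies them. The longest agreeing prefix is accepted, plus one
    corrected token from the main model at the first mismatch, so every
    step emits at least one main-model-quality token.
    """
    # 1. Draft phase: the small model proposes k tokens.
    drafted, ctx = [], list(prefix)
    for _ in range(k):
        t = draft_next(ctx)
        drafted.append(t)
        ctx.append(t)

    # 2. Verify phase: the main model checks each drafted position.
    accepted, ctx = [], list(prefix)
    for t in drafted:
        expected = main_next(ctx)
        if t == expected:
            accepted.append(t)   # draft token confirmed "for free"
            ctx.append(t)
        else:
            accepted.append(expected)  # main model's correction
            break
    return accepted
```

When the draft model agrees with the main model most of the time, each step accepts several tokens for roughly the cost of one main-model pass, which is where the speedup comes from.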

GiteaMirror added the feature request label 2026-04-22 12:31:45 -05:00

@Meshwa428 commented on GitHub (Feb 19, 2025):

@jmorganca ??


@definitiontv commented on GitHub (Mar 9, 2025):

Sounds like a useful feature.


@thiswillbeyourgithub commented on GitHub (May 13, 2025):

Addressed in #5800


Reference: github-starred/ollama#31766