[PR #8134] [CLOSED] feat: Introduce speculative decoding #74936

Closed
opened 2026-05-05 07:15:49 -05:00 by GiteaMirror · 0 comments
Owner

📋 Pull Request Information

Original PR: https://github.com/ollama/ollama/pull/8134
Author: @bfroemel
Created: 12/17/2024
Status: Closed

Base: main ← Head: feature/draft-model


📝 Commits (10+)

  • 22200c9 added vendor wrappers for draft model feature
  • 119db22 runner.go: separated token processing from processBatch()
  • 888ffec feat: intro draft models and speculative decoding
  • 58d0569 server/sched.go: use ShortName instead of ModelPath as key for Scheduler.loaded map
  • 4c06617 Merge remote-tracking branch 'upstream/main' into feature/draft-model
  • d539e92 make/Makefile.sync: separated speculative decoding vendoring from llava vendoring
  • 90db844 docs/faq.md: style fixes
  • ea724a1 server/sched_test.go: added valid model ShortName
  • 5dbaf51 llama: style fixes
  • 59e036b speculative decoding: handle draft model device allocation automatically

📊 Changes

22 files changed (+1394 additions, -373 deletions)

View changed files

📝 api/types.go (+25 -18)
📝 cmd/cmd.go (+2 -2)
📝 docs/faq.md (+17 -0)
📝 docs/modelfile.md (+13 -0)
📝 llama/llama.go (+96 -0)
📝 llama/runner/runner.go (+243 -40)
📝 llama/sampling_ext.cpp (+24 -0)
📝 llama/sampling_ext.h (+5 -0)
➕ llama/speculative.cpp (+300 -0)
➕ llama/speculative.h (+54 -0)
➕ llama/speculative_ext.cpp (+50 -0)
➕ llama/speculative_ext.h (+35 -0)
📝 llm/memory.go (+297 -226)
📝 llm/memory_test.go (+46 -13)
📝 llm/server.go (+79 -21)
📝 make/Makefile.sync (+19 -2)
📝 parser/parser.go (+6 -2)
📝 parser/parser_test.go (+5 -0)
📝 server/images.go (+17 -3)
📝 server/routes_generate_test.go (+4 -4)

...and 2 more files

📄 Description

This PR aims to replicate speculative decoding as implemented in https://github.com/ggerganov/llama.cpp/blob/master/examples/server/server.cpp.
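For context, the vendored loop (llama/speculative.cpp) follows the standard draft-then-verify pattern: a small draft model cheaply proposes a few tokens, the target model scores them in a single batch, and the longest accepted prefix is kept. Below is a minimal, self-contained Go sketch of that control flow only; the model calls are stubs, and names like `draftTokens`/`verifyTokens` are illustrative, not this PR's actual API.

```go
// Illustrative sketch of a speculative-decoding loop. Both "models" are
// stubs; the function names are hypothetical, not taken from the PR.
package main

import "fmt"

func last(s []int) int { return s[len(s)-1] }

// draftTokens stands in for the small draft model proposing k tokens greedily.
func draftTokens(ctx []int, k int) []int {
	out := make([]int, 0, k)
	for i := 0; i < k; i++ {
		next := (last(ctx) + 1) % 100 // stub: deterministic next token
		out = append(out, next)
		ctx = append(ctx, next)
	}
	return out
}

// verifyTokens stands in for the target model scoring all drafted tokens in
// one batch: it returns how many it accepts plus its own next token, so each
// target pass yields accepted+1 tokens instead of 1.
func verifyTokens(ctx []int, draft []int) (accepted int, next int) {
	for _, t := range draft {
		want := (last(ctx) + 1) % 100 // stub: target's own prediction
		if t != want {
			return accepted, want // reject the rest, emit target's token
		}
		ctx = append(ctx, t)
		accepted++
	}
	return accepted, (last(ctx) + 1) % 100
}

func main() {
	ctx := []int{1}
	for len(ctx) < 12 {
		draft := draftTokens(append([]int(nil), ctx...), 4) // copy: no aliasing
		n, next := verifyTokens(ctx, draft)
		ctx = append(ctx, draft[:n]...) // accepted prefix, one target pass
		ctx = append(ctx, next)         // target's token always advances
	}
	fmt.Println(ctx)
}
```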

See hints in the documentation (docs/faq.md) for trying it out.
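The FAQ additions themselves aren't reproduced in this mirror. Purely as a hypothetical sketch of what attaching a draft model could look like, assuming the PR exposes it as a Modelfile parameter (the parameter name and model tags below are assumptions, not confirmed syntax; see docs/modelfile.md on the PR branch):

```
# Hypothetical Modelfile -- "draft_model" is an assumed parameter name,
# not confirmed by this mirror.
FROM llama3.1:70b
PARAMETER draft_model llama3.1:8b
```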

Kindly asking for feedback to get this ready for merging.

Fixes #5800.


🔄 This issue represents a GitHub Pull Request. It cannot be merged through Gitea due to API limitations.

GiteaMirror added the pull-request label 2026-05-05 07:15:49 -05:00

Reference: github-starred/ollama#74936