[GH-ISSUE #1292] add 'prompt lookup decoding' for faster inference #667

Open · opened 2026-04-12 10:21:22 -05:00 by GiteaMirror · 0 comments

Originally created by @0xdevalias on GitHub (Nov 27, 2023).
Original GitHub issue: https://github.com/ollama/ollama/issues/1292

  • https://github.com/apoorvumang/prompt-lookup-decoding (@apoorvumang)
    • Prompt Lookup Decoding

    • TLDR: We modify speculative decoding by replacing the draft model with simple string matching against the prompt to generate candidate token sequences. This yields significant speedups (2x-4x) on input-grounded tasks with no effect on output quality. The method works with any decoder model without model changes or an external datastore, and with both greedy and sampling techniques.
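The core of the method is the candidate-generation step: instead of querying a draft model, search the existing token sequence for an earlier occurrence of the n-gram that was just generated, and propose the tokens that followed it as the speculative draft. Below is a minimal, illustrative sketch of that step, assuming tokens are plain integer IDs; the function name, default parameters, and backwards scan order are assumptions for this example, not the repository's actual API.

```python
def find_candidates(input_ids, max_ngram=3, num_pred_tokens=10):
    """Return draft tokens by matching the tail of `input_ids` against an
    earlier occurrence in the same sequence (prompt + generated so far)."""
    for n in range(max_ngram, 0, -1):          # prefer longer n-gram matches
        if len(input_ids) <= n:
            continue
        tail = input_ids[-n:]                  # the n most recently generated tokens
        # Scan backwards so the most recent earlier occurrence wins
        # (a design choice for this sketch; the reference code may differ).
        for i in range(len(input_ids) - n - 1, -1, -1):
            if input_ids[i:i + n] == tail:
                start = i + n
                return input_ids[start:start + num_pred_tokens]
    return []                                  # no match: fall back to normal decoding


# Demo: the tail [5, 6, 7] also appears at the start of the sequence,
# so the tokens that followed it there become the draft candidates.
ids = [5, 6, 7, 8, 9, 5, 6, 7]
print(find_candidates(ids))                    # -> [8, 9, 5, 6, 7]
```

As in standard speculative decoding, the candidates are then verified in a single forward pass of the target model and the longest accepted prefix is kept, which is why the technique preserves output quality while speeding up input-grounded tasks (e.g. summarization or code editing) where the output copies spans of the input.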

Related Issues:

  • https://github.com/ggerganov/llama.cpp/issues/4226
Reference: github-starred/ollama#667