[GH-ISSUE #792] Implement Streaming LLM #26139

Closed
opened 2026-04-22 02:11:00 -05:00 by GiteaMirror · 4 comments

Originally created by @Liuxyly on GitHub (Oct 15, 2023).
Original GitHub issue: https://github.com/ollama/ollama/issues/792

I read the following llama.cpp issue and I want to use this feature. How can I do that?

https://github.com/ggerganov/llama.cpp/issues/3440

GiteaMirror added the feature request label 2026-04-22 02:11:00 -05:00

@jploski commented on GitHub (Oct 31, 2023):

> I read the following llama.cpp issue and I want to use this feature. How can I do that?
>
> ggerganov/llama.cpp#3440

All you need to do is use the option `--keep` to specify how many tokens from the initial prompt you want to retain (the default is -1, meaning the entire prompt, which is not a bad idea for many use cases). In the case of StreamingLLM they suggest something like "4", but that is only to avoid the purported degradation of generation quality after `--ctx-size` tokens. However, if you don't want your generation to "forget" what was in the initial prompt, you may wish to set this higher (as llama.cpp already does for you).

In short, the StreamingLLM paper describes what llama.cpp has already done for a while. It adds some quantitative measurements and explains why keeping some initial anchor tokens in the cache is important when generating sequences longer than the context window length. The key point from the paper is that if you set `--keep 0`, your generation quality would suffer after the initial tokens slide out of the KV cache (i.e. after `--ctx-size` tokens).

Although it is not entirely clear whether the quality "improvement" reported in the paper comes from keeping the fixed initial tokens, or simply from applying the positional embedding based on a token's context-window position rather than its absolute position in the document (which, again, llama.cpp has always done this way; I'm not sure about other KV cache implementations). There is a sentence in the paper hinting that the initial tokens don't really matter, only the assignment of their positions does ("This suggests that the absolute position of the starting tokens, rather than their semantic value, holds greater significance.")

Overall, this paper seems to solve an implementation problem which, as far as I'm aware, has never existed in llama.cpp's "infinite generation" (otherwise users would have noticed: the paper reports a dramatic explosion of perplexity after the context window length).
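
For illustration, here is a minimal Python sketch of the mechanism described above (hypothetical code, not llama.cpp's actual C++; the names `evict`, `n_keep`, and `n_ctx` are chosen only to mirror the flags): once the cache is full, the first `n_keep` anchor tokens are retained, the oldest tokens after them are evicted, and positions follow the cache slot rather than the absolute document position.

```python
# Hypothetical sketch of StreamingLLM-style cache eviction (not real
# llama.cpp code): keep the first n_keep anchor tokens, slide out the
# oldest tokens after them, and position tokens by cache slot.

def evict(cache: list[str], n_keep: int, n_ctx: int) -> list[str]:
    """Drop the oldest non-anchor token while the cache exceeds n_ctx."""
    while len(cache) > n_ctx:
        # Anchors cache[:n_keep] survive; the token right after them goes.
        cache = cache[:n_keep] + cache[n_keep + 1:]
    return cache

cache: list[str] = []
for tok in (f"t{i}" for i in range(8)):   # t0..t7, absolute positions 0..7
    cache = evict(cache + [tok], n_keep=2, n_ctx=6)
    # Positional embeddings use the slot index, not the absolute position.
    print(cache, "positions:", list(range(len(cache))))
```

Setting `n_keep=0` here would evict the anchor tokens too, which is exactly the case the paper reports as causing the perplexity explosion.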


@sandangel commented on GitHub (Dec 7, 2023):

@jploski

> All you need to do is use the option `--keep` to specify how many tokens from the initial prompt you want to retain (the default is -1, meaning the entire prompt, which is not a bad idea for many use cases)

Do you mean we can specify that `--keep` flag when starting the ollama server? Can you share a bit more detail on how I can enable it?


@MoonRide303 commented on GitHub (Mar 29, 2024):

@jmorganca It would be really nice to have support for [attention sinks](https://arxiv.org/abs/2309.17453) available in ollama - it looks like it can prevent perplexity from degrading in longer chats:
![image](https://github.com/ollama/ollama/assets/130458190/6a72d310-fb91-4d2c-88cd-d4a3b1e5a99a)

I am observing this problem pretty often in the current version of ollama when chats get long - implementing this feature could be a good solution for that.

A reference implementation in transformers was added via https://github.com/huggingface/transformers/issues/26553.
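
For context, a hedged sketch of the cache that grew out of that issue; this assumes the `SinkCache(window_length=..., num_sink_tokens=...)` API as originally added to transformers (it may have changed or been deprecated in newer versions), and the model name is only an example.

```python
# Hedged sketch: transformers' SinkCache keeps a few initial "attention
# sink" tokens plus a sliding window of recent tokens, capping KV growth.
# API details may differ across transformers versions.
from transformers import AutoModelForCausalLM, AutoTokenizer, SinkCache

model_id = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"  # example model
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

inputs = tokenizer("Tell me a very long story.", return_tensors="pt")
# Retain 4 sink tokens within a 1024-token cache, as the paper suggests.
cache = SinkCache(window_length=1024, num_sink_tokens=4)
out = model.generate(**inputs, past_key_values=cache, max_new_tokens=256)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```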


@jmorganca commented on GitHub (Sep 4, 2024):

I believe this should be supported now (the num_keep option). Let me know if that isn't the case. Thanks for the issue!
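
A hedged sketch of passing `num_keep` per request through ollama's REST API; `num_keep` appears among the runtime options in ollama's API documentation, and the model name and value below are only examples.

```python
# Hedged sketch: setting num_keep via ollama's /api/generate endpoint.
# num_keep is among the documented runtime options; model name and the
# value 24 are examples only.
import json
import urllib.request

payload = {
    "model": "llama3",
    "prompt": "Summarize the plot of Moby-Dick.",
    "stream": False,
    # Retain the first 24 prompt tokens as anchors when the context slides.
    "options": {"num_keep": 24},
}
req = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read())["response"])
```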

Reference: github-starred/ollama#26139