[GH-ISSUE #5896] Linear-time chat API #50189

Closed
opened 2026-04-28 14:37:34 -05:00 by GiteaMirror · 2 comments

Originally created by @MostAwesomeDude on GitHub (Jul 23, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/5896

The `/api/chat` endpoint provokes quadratic-time behavior when used for an extended chat session. This is a design issue, not an implementation issue.

The standard analogy we use to understand this issue is known as "Schlemiel the painter", after a traditional Yiddish joke. Imagine a painter whose paintcan is fixed and whose canvas is large; they must traverse the distance between the paintcan and the canvas repeatedly. Here, we face a similar issue; the initial state of the model is fixed with respect to the prompt and we must traverse the entire intermediate body of the chat before we can generate fresh tokens at the end.
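
To make the cost concrete, here is a minimal sketch of an N-turn loop against `/api/chat`, assuming a local server on the default port and a placeholder model name. Because every request carries the full history, turn n re-evaluates roughly n prior messages, so total work over N turns is O(N²):

```python
import requests

BASE = "http://localhost:11434"  # default Ollama address (assumption)
messages = []

for turn in range(50):
    messages.append({"role": "user", "content": f"Question number {turn}"})
    # The entire history is shipped (and re-evaluated) on every call:
    # turn n carries ~n prior messages, hence O(N^2) total work.
    resp = requests.post(
        f"{BASE}/api/chat",
        json={"model": "llama3", "messages": messages, "stream": False},
    )
    messages.append(resp.json()["message"])  # assistant reply joins the history
```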

One possible solution is to set up multiple paintcans. There is a probabilistic data structure called a [skip list](https://en.wikipedia.org/wiki/Skip_list) which could be used to cache intermediate fragments of chats. This would decrease the average time taken to something acceptable, but it would still vary depending on a PRNG. I don't like the privacy implications of this either, but they might be acceptable.
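
A rough sketch of that idea, with hypothetical names throughout (`PrefixStateCache`, `maybe_store`, and an opaque `state` standing in for a KV-cache snapshot; none of this exists in Ollama):

```python
import random

class PrefixStateCache:
    """Hypothetical skip-list-style cache of model states by prefix length.

    After evaluating a chat prefix, keep its state with probability p,
    so checkpoints thin out geometrically and storage stays modest
    while lookups usually find a checkpoint near the end of the chat.
    """

    def __init__(self, p=0.5):
        self.p = p
        self.states = {}  # prefix token count -> opaque model state

    def maybe_store(self, n_tokens, state):
        # Coin-flip insertion, as in a skip list's level assignment;
        # this is the PRNG dependence mentioned above.
        if random.random() < self.p:
            self.states[n_tokens] = state

    def longest_cached_prefix(self, n_tokens):
        # Resume from the deepest checkpoint at or before n_tokens;
        # only the remaining suffix needs fresh evaluation.
        usable = [k for k in self.states if k <= n_tokens]
        return max(usable, default=0)
```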

Another approach is to always move the paintcan to the end of the canvas, requiring a per-canvas paintcan. A session-based chat API would work; when a new chat is started, a session ID is returned representing a server-side stored model state, and subsequent calls to the same session reuse that state. This would require keeping runners alive for much longer than a single request, and also require some sort of eviction and serialization of model state from GPUs; it's not a trivial feature to implement.
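
To illustrate, a client for such an API might look like the sketch below. The `/api/session` endpoints and `session` field are entirely hypothetical; nothing like them exists in Ollama today:

```python
import requests

BASE = "http://localhost:11434"

# Hypothetical: opening a session would pin a runner and its model
# state server-side and hand back an ID for reuse.
session_id = requests.post(
    f"{BASE}/api/session", json={"model": "llama3"}
).json()["session_id"]

# Each turn sends only the new message; the server resumes from the
# stored state, so per-turn cost tracks the new tokens, not the history.
for turn in range(50):
    requests.post(
        f"{BASE}/api/chat",
        json={
            "session": session_id,
            "messages": [{"role": "user", "content": f"Question {turn}"}],
            "stream": False,
        },
    )

# Explicit teardown so the server can evict the pinned state.
requests.delete(f"{BASE}/api/session/{session_id}")
```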

As it currently stands, this is a showstopping issue for me when compared to my creaky old pile of HuggingFace wrappers.

See also #1556 for other folks being affected by this issue.

GiteaMirror added the feature request label 2026-04-28 14:37:34 -05:00

@jessegross commented on GitHub (Sep 12, 2024):

@MostAwesomeDude Sorry for the delay in following up on this.

Does this slowdown still happen on the current version? If so, does it occur with a single session or with multiple sessions going to the chat endpoint?

Recent versions implement caching for evaluation of the chat history, so I wouldn't expect this to be a problem anymore, at least for simpler scenarios.
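
For anyone who wants to check, a crude timing probe (assuming the same local setup and placeholder model as the sketches above) would be something like:

```python
import time
import requests

BASE = "http://localhost:11434"

def chat(messages):
    t0 = time.time()
    r = requests.post(
        f"{BASE}/api/chat",
        json={"model": "llama3", "messages": messages, "stream": False},
    )
    return r.json()["message"], time.time() - t0

history = [{"role": "user", "content": "Summarize the plot of Hamlet."}]
reply, t1 = chat(history)
history += [reply, {"role": "user", "content": "Now in one sentence."}]
_, t2 = chat(history)
# With history caching, the follow-up should pay only for the new
# tokens; without it, prompt evaluation time grows with each turn.
print(f"first turn: {t1:.2f}s, follow-up: {t2:.2f}s")
```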


@MostAwesomeDude commented on GitHub (Sep 14, 2024):

I don't have the setup for testing this anymore; I really did go back to the pile of HF wrappers. That said, I'm sure that you can see the design issue, so if you said you fixed it, then I believe you.
