[GH-ISSUE #2653] Ollama serve fails silently when an input is too long #63613

Open
opened 2026-05-03 14:28:28 -05:00 by GiteaMirror · 6 comments

Originally created by @logancyang on GitHub (Feb 21, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/2653

When I use `ollama serve` and provide a context of ~30k tokens with a Mistral model that has a max context window of 32768, the server doesn't show any error and proceeds to return as usual. That gave me the impression that it successfully took in the entire context.

But after digging a bit deeper, I can see it isn't.
![SCR-20240221-lpyt](https://github.com/ollama/ollama/assets/4860545/8caef175-f97d-4304-9f19-1a8103770427)

So when I did the following, it started working fine:

```
ollama run <model>
/set parameter num_ctx 32768
/save
```

Perhaps there are flags for `ollama serve` that I'm missing even after reading the docs. Is there a better way to set the context window for `ollama serve`?

In my mind, the expected behavior is to show an error message when the input exceeds the set context window length. LM Studio does this:

![SCR-20240221-lsnn](https://github.com/ollama/ollama/assets/4860545/ee4f2408-bbce-4fb8-bd74-6306aca08b3c)

Please let me know whether I'm just not using the right flags, or whether this is a legitimate concern.

GiteaMirror added the bug label 2026-05-03 14:28:28 -05:00

@vividfog commented on GitHub (Feb 21, 2024):

This variable and many others are settings per model, not per server. And they must be per model because every model needs a different setup. When the server starts, it doesn't even know which model you will run, and you may run 10 different models back to back.

Doing it once via `/save` (or via the Modelfile approach, see the docs) then applies it permanently for you.
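For reference, a rough sketch of the Modelfile approach (the model name and tag here are just examples; check the Modelfile docs for the exact syntax):

```
# Sketch: create a model variant with a larger context window, then use it like any other model
cat > Modelfile <<'EOF'
FROM mistral
PARAMETER num_ctx 32768
EOF
ollama create mistral-32k -f Modelfile
# then: ollama run mistral-32k, or request "mistral-32k" through the API
```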

It sounds like you may be conflating "serve" and "run" as the same thing. Once you start flipping between more than a few models, I believe you'll end up preferring that these are not "global" variables for all models at once. That would lead to all sorts of errors when switching from Mistral to the new Gemma, for example.

Or maybe I misunderstood your (mis)use case :)


@logancyang commented on GitHub (Feb 21, 2024):

@vividfog I understand serve and run are different things; serve of course doesn't know which model the user is going to call. However, my original point is that **it's failing silently without showing that the input is too long**.

As for the UX of serve, I do believe there are better and clearer options than running every model and issuing `/set parameter ...` and `/save`. It's quite tedious and error-prone. If there were a server-side config where you could set this once for all models, that would be a UX improvement IMO.

At the very least, the docs should be clear about what's expected of `ollama serve` for these use cases or (mis)use cases.


@vividfog commented on GitHub (Feb 22, 2024):

@logancyang I see. Sorry about the pun, couldn't resist when it came to mind.

Failing silently when the input goes past some threshold, I agree that's not optimal. I'll have to test that too when I can; a 32k context would overwhelm my whole laptop if I tried it now.

In the meantime, I did `/set parameter num_ctx 5` for Mistral and then wrote more than 5 tokens. In this case it didn't fail silently; it failed by producing nonsense. Same for Qwen. I wonder why. Here too it'd be nice to have a heads-up from the app, if it can catch this.

```
>>> /set parameter num_ctx 5
Set parameter 'num_ctx' to '5'
>>> This is probably more than five tokens, is it?
: Question: Given the function `count_ Q(x) = QLabel("")
 QSizePolicy::ExpandRows: QUERYDSL, QuestionUtils. QuestionUtils is a class with Question and Answer pairs ( Question->text );
 QTextEdit *m_ Q: How does the FCA's approach to Question 11 in Question 2 in Figure~\ref{fig: QCD vacuum instabilities and Question Marks in QR code?
 Q: Why are you afraid of Qarib Shirin, Questioner [5
```

@logancyang commented on GitHub (Feb 22, 2024):

@vividfog that's interesting; with a 5-token context length I guess anything is possible since it doesn't have much to work with? In any case, I think it's better to have an explicit error message. When I was testing my long prompts I knew something was off but didn't know what. The docs didn't have anything about `ollama serve` and context length configuration, but your comment from the other issue helped me pinpoint the problem, so thanks for that!


@Shawneau commented on GitHub (Feb 22, 2024):

I think this is why I was having crashes too. Open WebUI and Ollama in serve mode don't seem to talk to each other to set the context window? Even if I set the context to 8K in the Open WebUI settings, it doesn't tell `ollama serve` to set up Mixtral, for example, with an 8K context...?


@logancyang commented on GitHub (Feb 22, 2024):

> I think this is why I was having crashes too. Open WebUI and Ollama in serve mode don't seem to talk to each other to set the context window? Even if I set the context to 8K in the Open WebUI settings, it doesn't tell `ollama serve` to set up Mixtral, for example, with an 8K context...?

Your UI most likely doesn't send the context length parameter to Ollama in the form it accepts. Check your server log and see whether it shows the correct context length value.
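For comparison, a per-request override through the API should look roughly like this (assuming the server honors `num_ctx` in the `options` object, per the API docs; model name and prompt are placeholders):

```
# Sketch: pass num_ctx per request so the server loads the model with a larger context
curl http://localhost:11434/api/generate -d '{
  "model": "mistral",
  "prompt": "your long prompt here",
  "stream": false,
  "options": { "num_ctx": 32768 }
}'
```

If the UI isn't sending something equivalent, that would explain the mismatch.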

The issue with Ollama is that it should let us know if the input is overflowing or truncated instead of silently moving on.


Reference: github-starred/ollama#63613