[GH-ISSUE #8528] don't show the thinking process #31259

Closed
opened 2026-04-22 11:33:02 -05:00 by GiteaMirror · 18 comments

Originally created by @sunburst-yz on GitHub (Jan 22, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/8528

When I use DeepSeek-R1, the thinking process shown does not make sense to me; I only want to see the final result.

![Image](https://github.com/user-attachments/assets/d56c95ba-1fbc-4a4a-83e0-4ae1d7899325)

GiteaMirror added the feature request label 2026-04-22 11:33:02 -05:00

@sunburst-yz commented on GitHub (Jan 22, 2025):

LM Studio already supports folding and hiding the thinking process in its UI as of version 0.3.8 beta.


@sunburst-yz commented on GitHub (Jan 22, 2025):

Similar issue on open-webui: https://github.com/open-webui/open-webui/issues/8706


@JaviLib commented on GitHub (Jan 26, 2025):

Very annoying; it should be disabled by default, with an option to enable it. Watching the model think is interesting at first, but when you need to use it heavily it becomes distracting.


@ethanfischer commented on GitHub (Jan 30, 2025):

Yeah, it's extremely verbose. Would love a way to collapse it by default, or to hide it outright and only show it when I need to debug something.


@ghost commented on GitHub (Feb 2, 2025):

Is there no way to disable this 'thinking' output?


@rick-github commented on GitHub (Feb 6, 2025):

It's a feature of the model. If you don't want to see "thinking", don't use the model, or use a client that folds the output.


@VMinB12 commented on GitHub (Mar 19, 2025):

@rick-github I understand your reasoning, but is it not reasonable for Ollama to return the thinking tokens separately from the final output tokens in the response?


@rick-github commented on GitHub (Mar 19, 2025):

The model just generates tokens. ollama has no idea what the tokens mean; it just converts them into UTF-8 and sends the result to the client. Each reasoning model uses a different set of strings to denote "thinking" (deepseek `<think></think>`, granite3.2 `'Here is my thought process:'` / `'Here is my response'`, exaone `<thought></thought>`, etc.), so there is currently no consensus on how to process the output. OpenAI have it easier: they have a limited number of models and can train them to output appropriate markers so that the API can break the stream of tokens into `reasoning` vs `response`. So currently the client (which selected the model and hence knows how to deal with its output) is responsible for distinguishing `reasoning` from `response`. I'm sure that down the track there will be a meeting of minds on how to deal with this for published models and a set of guidelines drawn up, but for the moment that doesn't exist and clients need to deal with it.

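To illustrate the client-side approach described above, here is a minimal sketch that splits reasoning from the response using a small, assumed map of per-model markers. The model names and marker strings are taken from the examples in the comment, not from any official mapping:

import re

# Assumed marker formats, based on the examples named above;
# not an official or exhaustive mapping.
THINKING_MARKERS = {
    "deepseek-r1": ("<think>", "</think>"),
    "exaone": ("<thought>", "</thought>"),
}

def split_thinking(model: str, text: str) -> tuple[str, str]:
    """Return (reasoning, response); falls back to ('', text) for unknown models."""
    markers = THINKING_MARKERS.get(model)
    if markers is None:
        return "", text
    start, end = markers
    match = re.search(re.escape(start) + r"(.*?)" + re.escape(end), text, flags=re.DOTALL)
    if match is None:
        return "", text
    reasoning = match.group(1).strip()
    response = (text[:match.start()] + text[match.end():]).strip()
    return reasoning, response

A chat client could then show `response` by default and tuck `reasoning` behind a collapsed section.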

@VMinB12 commented on GitHub (Mar 19, 2025):

Thanks for the context, I agree that the different formats for thinking tokens are a complicating factor here. Yet isn't this issue also present for tool calling, where each model has its own chat prompt template? And yet Ollama does support tool calling (and it is much appreciated!). There, the models come packaged with the prompt template, so you can effectively parse the model response using the provided template. Perhaps you could consider similarly packaging a thinking configuration with each model? A `start_thinking_token` and `end_thinking_token` should suffice, I think. Let me know your thoughts.


@VMinB12 commented on GitHub (Mar 19, 2025):

To add to that, you already have `params` for each model today. These starting and ending thinking tokens could be added there, e.g. https://ollama.com/library/qwq/blobs/e5229acc2492:

{
    "repeat_penalty": 1,
    "stop": [
        "<|im_start|>",
        "<|im_end|>"
    ],
    "temperature": 0.6,
    "top_k": 40,
    "top_p": 0.95,
    "thinking_start": "<think>",
    "thinking_end": "</think>"
}

If a model is uploaded to your hub without `thinking_start` or `thinking_end`, then you just default to the current behaviour, which is to show all tokens.

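As a rough sketch of how a client might consume such hypothetical `thinking_start`/`thinking_end` params (these keys are part of the proposal above, not existing Ollama params):

import re

# Hypothetical params, as proposed above; not real Ollama params today.
params = {"thinking_start": "<think>", "thinking_end": "</think>"}

def strip_thinking(output: str) -> str:
    start, end = params.get("thinking_start"), params.get("thinking_end")
    if not start or not end:
        return output  # no markers declared: keep current behaviour and show all tokens
    pattern = re.escape(start) + r".*?" + re.escape(end)
    return re.sub(pattern, "", output, flags=re.DOTALL).strip()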

@rick-github commented on GitHub (Mar 19, 2025):

> I'm sure that down the track there will be a meeting of minds on how to deal with this for published models and a set of guidelines drawn up, but for the moment that doesn't exist and clients need to deal with it.


@sunburst-yz commented on GitHub (Mar 19, 2025):

> Thanks for the context, I agree that the different formats for thinking tokens are a complicating factor here. Yet isn't this issue also present for tool calling, where each model has its own chat prompt template? And yet Ollama does support tool calling (and it is much appreciated!). There, the models come packaged with the prompt template, so you can effectively parse the model response using the provided template. Perhaps you could consider similarly packaging a thinking configuration with each model? A `start_thinking_token` and `end_thinking_token` should suffice, I think. Let me know your thoughts.

I agree. Reasoning models outperform others significantly. However, if the thinking tokens are not separated from the result output, it will pose challenges for developers when using the Ollama API to test local models. Official LLM APIs do not intermingle thinking tokens and result tokens. As a result, agent libraries typically do not grapple with this problem either.


@mikhail-shevtsov-wiregate commented on GitHub (May 11, 2025):

For those who want to try thinking models right now, I've created a simple proxy that strips chain-of-thought from the LLM output. It's a temporary solution until the community finds a better way to handle thinking tokens: https://gitlab.com/wiregate-public/ollama-unthink-proxy


@rick-github commented on GitHub (May 11, 2025):

#10584


@caesarhtx commented on GitHub (Jun 20, 2025):

Use `re` if you just don't want the `<think>xxx</think>` part in the response:

import re
import ollama

response = ollama.chat(model='qwen3:8b', messages=[{'role': 'user', 'content': prompt}])
answer = response['message']['content'].strip().lower()
# remove the <think>...</think> part:
cleaned_content = re.sub(r"<think>.*?</think>\n?", "", answer, flags=re.DOTALL).strip()
answer = cleaned_content if cleaned_content else answer

@rocketedtech commented on GitHub (Jul 17, 2025):

> Use `re` if you just don't want the `<think>xxx</think>` part in the response:
>
> import re
> import ollama
>
> response = ollama.chat(model='qwen3:8b', messages=[{'role': 'user', 'content': prompt}])
> answer = response['message']['content'].strip().lower()
> # remove the <think>...</think> part:
> cleaned_content = re.sub(r"<think>.*?</think>\n?", "", answer, flags=re.DOTALL).strip()
> answer = cleaned_content if cleaned_content else answer

Where do you insert this code? Newb here...


@rick-github commented on GitHub (Jul 18, 2025):

> Where do you insert this code? Newb here...

Is your code using the ollama-python SDK? Use this:

import ollama

prompt = "Why is the sky blue?"
# think=True asks the server to return the reasoning separately, so 'content' holds only the final answer
response = ollama.chat(model='qwen3:8b', messages=[{'role': 'user', 'content': prompt}], think=True)
answer = response['message']['content']
print(answer)

@rocketedtech commented on GitHub (Jul 18, 2025):

No, I have Ollama running in Docker. I've also added the Ollama AI chat to n8n.

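For a setup like this, where Ollama runs in Docker rather than behind the Python SDK, the same idea can be tried against the HTTP API directly. A minimal sketch, assuming the default port and an Ollama version recent enough to accept a `think` field on `/api/chat`:

import requests

# Adjust the host/port for your Docker setup; 11434 is Ollama's default.
resp = requests.post(
    "http://localhost:11434/api/chat",
    json={
        "model": "qwen3:8b",
        "messages": [{"role": "user", "content": "Why is the sky blue?"}],
        "think": True,   # ask the server to keep reasoning out of 'content'
        "stream": False,
    },
)
message = resp.json()["message"]
print(message["content"])            # final answer only
# message.get("thinking") holds the reasoning, if present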
Reference: github-starred/ollama#31259