[GH-ISSUE #6877] OpenAI o1-like Chain-of-thought (CoT) inference workflow #30107

Open · opened 2026-04-22 09:34:48 -05:00 by GiteaMirror · 7 comments

Originally created by @kozuch on GitHub (Sep 19, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/6877

Well, I am surprised that the "main" and "great" new feature of the new OpenAI o1 model is essentially a more sophisticated inference workflow employing something like a chain-of-thought process. As I understand it, even a "dumb" model can perform much better when it "thinks more" during inference. The great news they are telling us is that by "thinking more" you can get smarter, which is probably very true for humans as well.

The o1 model is probably trained to come up with its own CoT workflow for any given prompt, but I think it could be interesting to hardcode some kind of workflow that any standard LLM might try to follow during inference. Basically, let the model analyze the prompt from various perspectives first, then judge what type of "inference workflow" it should employ.

The hardcoded workflow could look like this (a minimal sketch follows the list):

  1. The prompt is submitted to the model.
  2. The model asks itself a couple of hard-coded questions about the prompt, maybe:
  • is this light conversation (needing soft skills like empathy, etc.)?
  • does it look like a science problem (math, physics, etc.)?
  • can the prompt be broken down into subtasks? If yes, the workflow feeds each subtask into the model separately, then combines the results.
  • is the problem easy or hard?
  • do I have all the information I need (or do I need to ask the user for further input/clarification)?
  3. The workflow runs, possibly in multiple iterations at its various levels, perhaps applying some "quality checks" to the answer.
  4. The output is presented to the user (the "hidden" thinking may optionally be viewed by the user).
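
As a rough illustration, here is a minimal sketch of that loop, assuming the official `ollama` Python client (`pip install ollama`) and a locally pulled model such as `qwen2.5`. The classification categories, prompt strings, and single quality-check pass are all illustrative choices, not an established design:

```python
# Minimal sketch of a hard-coded CoT-style inference loop.
# Assumes a running local ollama server and a pulled qwen2.5 model.
import ollama

MODEL = "qwen2.5"

def ask(prompt: str) -> str:
    """One round-trip to the model; the 'hidden thinking' steps reuse this."""
    resp = ollama.chat(model=MODEL, messages=[{"role": "user", "content": prompt}])
    return resp["message"]["content"]

def classify(prompt: str) -> str:
    # Step 2: the model asks itself a hard-coded question about the prompt.
    return ask(
        "Answer with one word (chat, science, decomposable, or unclear): "
        "what kind of request is this?\n\n" + prompt
    )

def solve(prompt: str) -> str:
    kind = classify(prompt).strip().lower()
    thinking = [f"classified as: {kind}"]

    if "unclear" in kind:
        # Last bullet of step 2: ask the user for clarification instead of answering.
        return ask("Ask one clarifying question about this request:\n" + prompt)

    if "decomposable" in kind:
        # Break the prompt into subtasks, solve each separately, then combine.
        subtasks = ask("List the subtasks of this request, one per line:\n" + prompt)
        partials = [ask(t) for t in subtasks.splitlines() if t.strip()]
        draft = ask("Combine these partial answers into one response:\n"
                    + "\n---\n".join(partials))
    else:
        draft = ask(prompt)

    # Step 3: a crude "quality check" iteration before presenting the answer.
    verdict = ask(f"Question:\n{prompt}\n\nDraft answer:\n{draft}\n\n"
                  "Is the draft correct and complete? Reply OK or REVISE.")
    if "REVISE" in verdict.upper():
        draft = ask(f"Improve this draft answer.\nQuestion: {prompt}\nDraft: {draft}")
        thinking.append("revised once after quality check")

    # Step 4: the hidden thinking may optionally be shown to the user.
    print("[hidden thinking]", "; ".join(thinking))
    return draft

if __name__ == "__main__":
    print(solve("What is the sum of the first 100 positive integers?"))
```

Nothing here is o1-specific; it is just steps 1-4 wired together out of plain model calls, trading extra inference calls for answer quality.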

Does anyone else have the same feeling about the CoT thing? It looks like even a hard-coded process may give some interesting results.

GiteaMirror added the feature request label 2026-04-22 09:34:48 -05:00

@rezzie-rich commented on GitHub (Sep 19, 2024):

Groq offers something similar called "Mixture of Agents". Another great way to achieve this is through mental models.

reference blog: https://jamesclear.com/feynman-mental-models

There could be an option where executing `ollama run qwen2.5` launches just the model, while `ollama run -cot qwen2.5` launches the model under an agentic framework replicating the o1 functionality.
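
Purely as an illustration of that proposal (no `-cot` flag exists in ollama today, and `cot_workflow.solve` here just refers to the workflow sketch from the opening post saved as a module), a thin wrapper could route between the two modes:

```python
# Hypothetical wrapper showing how a `-cot` switch could route requests.
# ollama itself has no such flag; the plain path shells out to the real CLI.
import argparse
import subprocess

from cot_workflow import solve  # the sketch above, saved as cot_workflow.py

def main() -> None:
    parser = argparse.ArgumentParser(prog="ollama-cot")
    parser.add_argument("-cot", action="store_true",
                        help="wrap the model in the agentic CoT workflow")
    parser.add_argument("model")
    parser.add_argument("prompt")
    args = parser.parse_args()

    if args.cot:
        # Agentic path: run the hard-coded CoT workflow around the model.
        # (The sketch pins MODEL as a constant; a real version would pass
        # args.model through instead.)
        print(solve(args.prompt))
    else:
        # Plain path: defer to the stock ollama CLI.
        subprocess.run(["ollama", "run", args.model, args.prompt], check=True)

if __name__ == "__main__":
    main()
```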


@codelion commented on GitHub (Sep 21, 2024):

You can do this with my open-source optimizing inference proxy, optillm: https://github.com/codelion/optillm. There are several other SOTA techniques implemented in that proxy, all of which trade inference-time compute for accuracy.


@MichaelFomenko commented on GitHub (Sep 21, 2024):

> You can do this with my open-source optimizing inference proxy, optillm: https://github.com/codelion/optillm. There are several other SOTA techniques implemented in that proxy, all of which trade inference-time compute for accuracy.

How can I use your solution as a proxy that exposes exactly the same API as ollama, so I can use it as a direct ollama replacement while it uses ollama in the background? Maybe some extra configuration files would be needed to define which mode it should work in.


@codelion commented on GitHub (Sep 21, 2024):

@MichaelFomenko ollama has an OpenAI API compatible endpoint, so just run the optillm proxy with `python optillm.py --base_url http://localhost:11434/v1`. Make sure to set the OPENAI_API_KEY with `export OPENAI_API_KEY=ollama`; it is not used, but the OpenAI client expects it. Then you can call the proxy in your own code as shown [here](https://github.com/codelion/optillm?tab=readme-ov-file#usage) and it will use ollama in the background.
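
For example, a minimal sketch of such a call, assuming optillm's default port 8000 and its convention of selecting a technique via a model-name prefix such as `moa-` (check the optillm README for the current options):

```python
# Sketch: calling ollama through the optillm proxy with the OpenAI client.
# Assumes optillm is running on its default port 8000 (started as above)
# and that qwen2.5 has been pulled in ollama.
import os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ.get("OPENAI_API_KEY", "ollama"),  # dummy key, but required
    base_url="http://localhost:8000/v1",                 # the optillm proxy, not ollama
)

response = client.chat.completions.create(
    # The prefix selects the inference technique; `moa-` (mixture of agents)
    # is one example from the optillm README.
    model="moa-qwen2.5",
    messages=[{"role": "user", "content": "How many r's are in strawberry?"}],
)
print(response.choices[0].message.content)
```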


@MichaelFomenko commented on GitHub (Sep 21, 2024):

> @MichaelFomenko ollama has an OpenAI API compatible endpoint, so just run the optillm proxy with `python optillm.py --base_url http://localhost:11434/v1`. Make sure to set the OPENAI_API_KEY with `export OPENAI_API_KEY=ollama`; it is not used, but the OpenAI client expects it. Then you can call the proxy in your own code as shown [here](https://github.com/codelion/optillm?tab=readme-ov-file#usage) and it will use ollama in the background.

Thank you, but I don't want to use it in my own code; I want to use it in the Open WebUI app.


@codelion commented on GitHub (Sep 21, 2024):

You can do that by just pointing Open WebUI at the proxy endpoint as the backend. There is documentation on how to do it: https://docs.openwebui.com/getting-started/env-configuration#openai

If you are just looking to compare and see how it works, you can try my HF Space: https://huggingface.co/spaces/codelion/optillm


@Donno191 commented on GitHub (Sep 23, 2024):

> You can do that by just pointing Open WebUI at the proxy endpoint as the backend. There is documentation on how to do it: https://docs.openwebui.com/getting-started/env-configuration#openai
>
> If you are just looking to compare and see how it works, you can try my HF Space: https://huggingface.co/spaces/codelion/optillm

https://www.reddit.com/r/LocalLLaMA/comments/1fnjnm0/visual_tree_of_thoughts_for_webui/
