[GH-ISSUE #11768] Access Ollama Turbo through the local Ollama API #33557

Closed
opened 2026-04-22 16:24:20 -05:00 by GiteaMirror · 14 comments

Originally created by @owenzhao on GitHub (Aug 7, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/11768

Originally assigned to: @pdevine on GitHub.

Ollama currently offers two access methods: the locally deployed Ollama, which can be accessed without an API key, and Ollama Turbo, which requires an API key. My suggestion is to add a way for the locally deployed Ollama, still accessed without an API key, to automatically call Ollama Turbo when needed. The reasons are as follows:

  1. A unified API lets third-party developers support all Ollama features with minimal code changes.
  2. Not requiring API keys improves security for third-party applications, avoiding unsafe practices such as developers hard-coding API keys.
  3. It increases Ollama's installation rate and usability. Adoption by third-party applications will drive installations of Ollama itself, because even users who don't download local models still need a local Ollama instance to act as a proxy when using Turbo. In the future, Ollama could offer a minimal out-of-the-box experience by pre-installing a small open-source local model, such as Qwen3 at 1.7B or 4B parameters.
  4. It better serves developers. Under current App Store guidelines, macOS applications that require users to enter API keys are likely to be rejected during Apple's review, while iOS applications do not face this limitation. With this approach, developers wouldn't need macOS users to enter API keys. Previously, macOS developers had to relay requests through their own web servers, which is costlier and demands more technical capability.

The specific method I'm proposing: Ollama should determine whether the model the user calls is available locally. If it is not, and the user has Turbo access, Ollama should automatically call the Turbo model. That way, users don't need to enter Ollama Turbo credentials in third-party applications and can use them directly. Ollama could then provide settings that let users control whether a specific application is allowed to access Turbo; the detailed design can be left to the Ollama team.
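
To make the proposal concrete, here is a rough sketch of the kind of fallback being described, seen from outside the daemon. The Turbo host, the token handling, and the `has_local_model` helper are illustrative assumptions, not Ollama's actual implementation:

```python
# Rough sketch of the proposed local-first / Turbo-fallback routing.
# LOCAL/TURBO hosts, token handling, and has_local_model() are assumptions.
import requests

LOCAL = "http://localhost:11434"
TURBO = "https://ollama.com"  # assumed upstream for Turbo

def has_local_model(name: str) -> bool:
    """Check the local model list via the documented /api/tags endpoint."""
    tags = requests.get(f"{LOCAL}/api/tags", timeout=5).json()
    return any(m["name"] == name for m in tags.get("models", []))

def chat(model: str, messages: list, turbo_token: str | None = None) -> dict:
    """Use the local server if the model is present; otherwise fall back to Turbo."""
    if has_local_model(model) or turbo_token is None:
        url, headers = f"{LOCAL}/api/chat", {}
    else:
        url, headers = f"{TURBO}/api/chat", {"Authorization": f"Bearer {turbo_token}"}
    body = {"model": model, "messages": messages, "stream": False}
    return requests.post(url, json=body, headers=headers, timeout=120).json()
```

In the proposal this decision would live inside the Ollama daemon itself, so applications would only ever talk to the local endpoint.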

GiteaMirror added the cloud and feature request labels 2026-04-22 16:24:20 -05:00

@BumpyClock commented on GitHub (Aug 7, 2025):

+1. I feel pretty dumb subbing to Turbo; I took the access via the GitHub CLI to mean that it would also be accessible via the local Ollama API. Without that I don't quite see the point of Turbo, since that's the most common use case for me. I know I can use it with the Python library, but then I have to migrate all my code away from the OpenAI SDK instead of just switching the baseURL to Ollama.
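
For reference, the baseURL switch in question is the usual pattern against the local daemon's OpenAI-compatible endpoint (the model name here is just an example):

```python
# Unmodified OpenAI SDK code pointed at local Ollama's OpenAI-compatible endpoint.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")  # key required by the SDK, ignored locally
resp = client.chat.completions.create(
    model="gpt-oss:20b",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(resp.choices[0].message.content)
```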


@pdevine commented on GitHub (Aug 7, 2025):

Hey guys, this is something I've already been looking at and have a working prototype. I'll see if I can get it into shippable shape soon. The cool thing is you can just use a Modelfile and set whatever parameters you want so you can tweak the default Turbo mode settings for the model.


@LivioGama commented on GitHub (Aug 7, 2025):

Great @pdevine!
I also wrote this to one of your colleagues who reached out by email:

> After thinking about it, that's actually inaccurate. SSH keys cannot be used to communicate directly over HTTP; the protocol isn't made for that.
> Since you have probably faced this issue, I decided to write to you, because there is still a clever way:
> It's a bit of code, but since users are required to send their SSH key, just like in the tutorial (https://github.com/ollama/ollama/blob/main/docs/turbo.md), you could assume that this key (or part of it) IS the first bearer, and set it in the profile. The Ollama client then simply needs an update to treat the SSH key as the actual bearer and challenge auth with it.
> It's creative, but not perfect. There is still the question of expiration and rotation, but the current token implementation doesn't solve that either, anyway :)


@pdevine commented on GitHub (Aug 7, 2025):

@LivioGama The way that the ed25519 keys work is that you use your private key to sign the request, and then the pubkey and the signature are sent as the bearer token. Your pubkey is then matched and the signature is verified against the request being made. This is between your local ollama server and ollama.com.

I have another change which implements this same method between the local client and the local server. It uses an `authorized_keys` file with simple RBAC for any of the API endpoints. The draft PR is up at #11574.
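
A rough sketch of the signing flow described here; the exact payload that gets signed and the "pubkey:signature" bearer encoding are assumptions, not Ollama's real wire format:

```python
# Illustrative: sign the request with the ed25519 private key, send
# pubkey + signature as the bearer. Payload and encoding are assumed.
import base64
from cryptography.hazmat.primitives import serialization
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey

private_key = Ed25519PrivateKey.generate()  # in practice: the user's existing Ollama key
public_raw = private_key.public_key().public_bytes(
    encoding=serialization.Encoding.Raw, format=serialization.PublicFormat.Raw)

payload = b"POST /api/chat <nonce or body hash>"  # assumed canonical request form
signature = private_key.sign(payload)

bearer = base64.b64encode(public_raw).decode() + ":" + base64.b64encode(signature).decode()
headers = {"Authorization": f"Bearer {bearer}"}
# The server matches the pubkey (e.g. against an authorized_keys entry) and
# verifies the signature over the same payload before serving the request.
```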


@LivioGama commented on GitHub (Aug 8, 2025):

Thank you very much, I really appreciate your responsiveness.
I can see from your answer that you know what you're doing. Could you explain concisely, then, what went wrong with this release, such that the remote Turbo models don't show up? From what I understand it should have worked with the ed25519 key. Unless you are actually talking about the implementation you recently prepared to fix the issue?


@pdevine commented on GitHub (Aug 8, 2025):

@LivioGama Yes, the feature isn't ready yet. Instead, what was released was the Ollama API running on ollama.com. Using the CLI you can run `OLLAMA_HOST=ollama.com ollama ls` and see each of the Turbo models, and then use `OLLAMA_HOST=ollama.com ollama run gpt-oss:120b` to run the 120b model remotely.
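
The same host switch also works from the Python library; the API-key bearer header below is an assumption about how Turbo keys are supplied, so treat this as a sketch rather than official usage:

```python
# Point the Ollama Python client at ollama.com instead of localhost.
# The Authorization header value is a placeholder for a Turbo API key.
from ollama import Client

client = Client(
    host="https://ollama.com",
    headers={"Authorization": "Bearer <your ollama.com API key>"},
)
resp = client.chat(
    model="gpt-oss:120b",
    messages=[{"role": "user", "content": "Hello from the remote API"}],
)
print(resp["message"]["content"])
```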


@LivioGama commented on GitHub (Aug 8, 2025):

Alright, so for now the OpenAI-compatible route I found is the only solution for IDEs 😊 https://github.com/LivioGama/gpt-oss-120b-MAX


@BumpyClock commented on GitHub (Aug 8, 2025):

What a legend! Thank you for sharing that


@BumpyClock commented on GitHub (Aug 11, 2025):

> @LivioGama Yes, the feature isn't ready yet. Instead, what was released was the Ollama API running on ollama.com. Using the CLI you can run `OLLAMA_HOST=ollama.com ollama ls` and see each of the Turbo models, and then use `OLLAMA_HOST=ollama.com ollama run gpt-oss:120b` to run the 120b model remotely.

The problem with doing

`OLLAMA_HOST=ollama.com ollama run gpt-oss:120b`

is that it's not really accessible via the local API that way. To me that was the biggest appeal of Turbo: being able to just point at Ollama for local testing, use a larger model, and see how that works. That not being available in Turbo is the biggest bummer. I know you're working on it. Looking forward to it; until then I'm using the workaround that @LivioGama posted and it's working okay so far.


@LivioGama commented on GitHub (Aug 13, 2025):

@BumpyClock I have phenomenally good news!
It turns out my implementation was only working in "non stream" mode. I reworked it completely to support streaming, as well as tool calling, etc. Now it works fully with RooCode/KiloCode and other LLM tools! Just pull and you're good to go.

![Image](https://github.com/user-attachments/assets/4505ef80-b33b-40a0-8099-69a1435db638)
![Image](https://github.com/user-attachments/assets/a89e69d6-2af9-4e10-b2fa-c85b2eebd1d7)

@BumpyClock commented on GitHub (Aug 14, 2025):

Love it! In my local branch I made it a transparent proxy for the Ollama endpoints so I could use it with OpenWebUI, and it worked in streaming mode. I'll check out the latest. Appreciate it!

Edit: @LivioGama [here's my fork](https://github.com/BumpyClock/gpt-oss-120b-MAX); I (well, Claude) updated it to be a transparent proxy that combines the local and cloud models.


@LivioGama commented on GitHub (Aug 15, 2025):

@BumpyClock I also did it in my project!
I run both:

  • Ollama proxy on localhost:3305
  • OpenAI compatible on localhost:3304

What I noticed:

  • [Roo Code will not work with the Ollama proxy](https://discord.com/channels/1128867683291627614/1128867684130508875/1405352237081038899) but does with OpenAI; I filed a bug and it was auto-fixed by a bot and auto-reviewed by the same bot 😅 https://github.com/RooCodeInc/Roo-Code/issues/7070
  • Kilo Code runs correctly with both approaches (which is funny, since Kilo Code is a fork of Roo Code)
  • JetBrains AI will not work with Ollama, but [does with OpenAI](https://discord.com/channels/1318600112242561076/1318600112695677031/1405538818979008634)
  • [Cline supports the bearer header](https://discord.com/channels/1128867683291627614/1128867684130508875/1405356128778457219) directly, which means no need for this proxy. [Example video](https://discord.com/channels/1318600112242561076/1318600112695677031/1405681117159358584)

Don't hesitate to share if you have feedback from your side too; curious to see how this evolves.

PS: Still getting some nasty errors:

Unexpected API Response: The language model did not provide any assistant
messages. This may indicate an issue with the API or the model's output.

It's really that the tools expect a non-empty assistant message, which is not always provided...

![Image](https://github.com/user-attachments/assets/9b385da6-38c5-424a-9b9c-988f871ed34f)
I tried to enforce it with a rule/guideline:

ALWAYS ALWAYS INCLUDE A NON EMPTY MESSAGE FROM THE ROLE ASSISTANT IN ALL OF OUR INTERACTIONS AS THE FIRST MESSAGE OF YOUR ANSWER. FAILING TO COMPLY TO THIS RULE IS ACTUALLY THE WORSE MISTAKE TO DO, IT BREAKS EVERYTHING AND I GET "Unexpected API Response: The language model did not provide any assistant messages. This may indicate an issue with the API or the model's output."

It's a bit better, but not a complete fix... Maybe there is a way to enforce a non-empty assistant message on the proxy side, but that seems complicated with streaming...
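
One possible proxy-side mitigation, sketched below under the assumption that the proxy relays Ollama-style streaming chunks; the chunk shapes and the `patch_stream` helper are illustrative, not part of any shipped proxy:

```python
# Watch the streamed chunks; if the model never produced assistant text
# (e.g. it only emitted tool calls), inject a minimal non-empty assistant
# delta before the final "done" chunk so strict clients don't error out.
import json

def patch_stream(upstream_lines):
    """upstream_lines yields newline-delimited JSON chunks from the upstream server."""
    saw_text = False
    for line in upstream_lines:
        chunk = json.loads(line)
        if chunk.get("message", {}).get("content"):
            saw_text = True
        if chunk.get("done") and not saw_text:
            filler = dict(chunk, done=False,
                          message={"role": "assistant", "content": "(no text response)"})
            yield json.dumps(filler)  # placeholder assistant delta
        yield json.dumps(chunk)
```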

🎁 As a bonus if you read until here (thanks 🎉): I managed to make cline a bit more "agentic" and take initiative like cursor, crafting this rule/guideline:

Try to take initiative, don't assume I only want the task I asked for, guess what I would like after that (in term of actions non related to code), and run them. For example, if I ask for code, it's very likely that I don't only want code, I wanted to be from up to date libraries using mcp context7. I also want you to check that it's compiling. And I also want you to run the program if possible. Then the more important is that you need to analyze the output of that run in order to detect possible errors and fix them.

🎁 As bonus 2: I also added local file logging, and it is very interesting to see how Roo Code / Kilo Code were built: they simply created their own language with tags on top of the LLM:

![Image](https://github.com/user-attachments/assets/531e4fc1-902e-468c-a9e3-32305f091c42)

Enjoy 🎉


@mdlmarkham commented on GitHub (Sep 3, 2025):

I was able to connect OpenWebUI to Ollama Turbo following the docs... and then use OpenWebUI to proxy both of the GPT-OSS models locally through the /api/v1 endpoint. GPT-OSS works with tools in n8n.


@jmorganca commented on GitHub (Sep 21, 2025):

This is now possible with [cloud models](https://ollama.com/blog/cloud-models)! With Ollama 0.12.0, you can now run:

`ollama pull qwen3-coder:480b-cloud`

And then sign in by running:

`ollama signin`

Then refer to `qwen3-coder:480b-cloud` in the API or other tools.

Let me know if you have any trouble getting up and running.
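
A minimal example of what "refer to it in the API" looks like with the Python library, assuming the pull and sign-in above have already been done:

```python
# The cloud model is addressed through the ordinary local API,
# exactly like any locally downloaded model.
import ollama

resp = ollama.chat(
    model="qwen3-coder:480b-cloud",
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
)
print(resp["message"]["content"])
```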

Reference: github-starred/ollama#33557