* don't require pulling stubs for cloud models
This is the first in a series of PRs that will better integrate Ollama's
cloud into the API and CLI. Previously there was a layer of indirection
where you'd first have to pull a "stub" model containing a reference to
a cloud model. With this change you don't have to pull first; you can
just use a cloud model in various routes like `/api/chat` and
`/api/show`. This change respects
<https://github.com/ollama/ollama/pull/14221>, so if cloud is disabled,
these models won't be accessible.
There's also a new, simpler pass-through proxy that doesn't convert
requests before they hit the cloud models, since the cloud models
themselves already support various formats (e.g., `v1/chat/completions`,
Open Responses, etc.). This will help prevent issues caused by double
conversion (e.g., `v1/chat/completions` being converted to `api/chat` on
the client, then calling cloud and converting back to a
`v1/chat/completions` response, instead of the cloud model handling the
original `v1/chat/completions` request in the first place).
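For illustration, the core of such a pass-through proxy can be a thin reverse
proxy that forwards the body untouched; the target URL and the 30-second
header timeout below are assumptions, not the actual implementation:
```go
import (
	"net/http"
	"net/http/httputil"
	"net/url"
	"time"
)

// newCloudProxy forwards requests to ollama.com on the same endpoint without
// re-encoding the body; header filtering and auth handling are omitted here.
func newCloudProxy() *httputil.ReverseProxy {
	target, _ := url.Parse("https://ollama.com")
	proxy := httputil.NewSingleHostReverseProxy(target)
	proxy.Transport = &http.Transport{
		// bounded response-header timeout (value is illustrative)
		ResponseHeaderTimeout: 30 * time.Second,
	}
	return proxy
}
```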
There's now a notion of "source tags", which can be mixed with existing
tags. So instead of having different formats like `gpt-oss:20b-cloud` vs.
`kimi-k2.5:cloud` (`-cloud` suffix vs. `:cloud`), you can now specify
cloud by simply appending `:cloud`. This PR doesn't change model
resolution yet, but it sets us up to allow things like omitting the
non-source tag, which would make something like `ollama run
gpt-oss:cloud` work the same way that `ollama run gpt-oss` already works
today.
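As a rough sketch of how such a selector might be parsed (names and behavior
here are illustrative, not the exact `types/modelselector` API):
```go
package modelselector

import (
	"fmt"
	"strings"
)

type Source int

const (
	SourceUnspecified Source = iota
	SourceCloud
	SourceLocal
)

// Parse splits a reference such as "gpt-oss:20b:cloud" into the base model
// name and an explicit source, accepting the source tag in any position and
// rejecting conflicting tags.
func Parse(name string) (string, Source, error) {
	src := SourceUnspecified
	var kept []string
	for _, part := range strings.Split(name, ":") {
		var next Source
		switch part {
		case "cloud":
			next = SourceCloud
		case "local":
			next = SourceLocal
		default:
			kept = append(kept, part)
			continue
		}
		if src != SourceUnspecified && src != next {
			return "", SourceUnspecified, fmt.Errorf("conflicting source tags in %q", name)
		}
		src = next
	}
	return strings.Join(kept, ":"), src, nil
}
```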
More detailed changes:
- Added a shared model selector parser in `types/modelselector`:
- supports `:cloud` and `:local`
- accepts source tags in any position
- supports legacy `:<tag>-cloud`
- rejects conflicting source tags
- Integrated selector handling across server inference/show routes:
- `GenerateHandler`, `ChatHandler`, `EmbedHandler`,
`EmbeddingsHandler`, `ShowHandler`
- Added explicit-cloud passthrough proxy for ollama.com:
- same-endpoint forwarding for `/api/*`, `/v1/*`, and `/v1/messages`
- normalizes `model` (and `name` for `/api/show`) before forwarding
- forwards request headers except hop-by-hop/proxy-managed headers
- uses bounded response-header timeout
- handles auth failures in a friendly way
- Preserved cloud-disable behavior (`OLLAMA_NO_CLOUD`)
- Updated create flow to support `FROM ...:cloud` model sources (this
  flow still uses the legacy proxy, since supporting Modelfile overrides
  is more complicated with the direct proxy approach)
- Updated CLI/TUI/config cloud detection to use shared selector logic
- Updated CLI preflight behavior so explicit cloud requests do not
auto-pull local stubs
What's next?
- Cloud discovery/listing and cache-backed `ollama ls` / `/api/tags`
- Modelfile overlay support for virtual cloud models on OpenAI/Anthropic
request families
- Recommender/default-selection behavior for ambiguous model families
- Fully remove the legacy flow
Fixes: https://github.com/ollama/ollama/issues/13801
* consolidate pull logic into confirmAndPull helper
pullIfNeeded and ShowOrPull shared identical confirm-and-pull logic.
Extract confirmAndPull to eliminate the duplication.
* skip local existence checks for cloud models
ModelExists and the TUI's modelExists both check the local model list,
which causes cloud models to appear missing. Return true early for
explicit cloud models so the TUI displays them beside the integration
name and skips re-prompting the model picker on relaunch.
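Sketched out (with `isCloudModel` standing in for the shared selector check),
the early return looks roughly like:
```go
import "slices"

// modelExists reports whether a model should be treated as present locally.
func modelExists(localModels []string, name string, isCloudModel func(string) bool) bool {
	if isCloudModel(name) {
		// explicit cloud models never appear in the local list; report them
		// as present so the TUI shows them and doesn't re-prompt the picker
		return true
	}
	return slices.Contains(localModels, name)
}
```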
* support optionally pulling stubs for new-style names
We now normalize names like `<family>:<size>:cloud` into legacy-style
names like `<family>:<size>-cloud` for pulling and deleting (this also
supports stripping `:local`). Support for pulling cloud models is
temporary; once we integrate properly into `/api/tags` we won't need
this anymore.
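Roughly, the temporary normalization works like this (helper name is
hypothetical):
```go
import "strings"

// normalizeForPull maps new-style names onto the legacy stub names used by
// pull/delete, and drops an explicit ":local" tag.
func normalizeForPull(name string) string {
	name = strings.TrimSuffix(name, ":local")
	if base, ok := strings.CutSuffix(name, ":cloud"); ok {
		if strings.Contains(base, ":") {
			return base + "-cloud" // e.g. "gpt-oss:20b:cloud" -> "gpt-oss:20b-cloud"
		}
		return base + ":cloud" // no non-source tag, e.g. "kimi-k2.5:cloud" stays as-is
	}
	return name
}
```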
* Fix server alias syncing
* Update cmd/cmd.go
Co-authored-by: Parth Sareen <parth.sareen@ollama.com>
* address comments
* improve some naming
---------
Co-authored-by: ParthSareen <parth.sareen@ollama.com>
* add ability to disable cloud
Users can now easily opt out of cloud inference and web search by
setting
```
"disable_ollama_cloud": true
```
in their `~/.ollama/server.json` settings file. After a setting update,
the server must be restarted.
Alternatively, setting the environment variable `OLLAMA_NO_CLOUD=1` will
also disable cloud features. While users were previously able to avoid
cloud models by simply not pulling or `ollama run`ing them, this gives
them an easy way to enforce that decision. Any attempt to run a cloud
model when cloud is disabled will fail.
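Conceptually, the check combines both knobs, something like the following
(names are illustrative, not the real `internal/cloud` code):
```go
import (
	"os"
	"strings"
)

// cloudDisabled reports whether cloud features should be turned off, based on
// the OLLAMA_NO_CLOUD environment variable or the server.json setting.
func cloudDisabled(settings map[string]any) bool {
	if v := os.Getenv("OLLAMA_NO_CLOUD"); v != "" && v != "0" && !strings.EqualFold(v, "false") {
		return true
	}
	disabled, _ := settings["disable_ollama_cloud"].(bool)
	return disabled
}
```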
The app's old "airplane mode" setting, which did a similar thing by
hiding cloud models within the app, is now unified with this new
cloud-disabled mode. That setting has been replaced with a "Cloud"
toggle, which behind the scenes edits `server.json` and then restarts
the server.
* gate cloud models across TUI and launch flows when cloud is disabled
Block cloud models from being selected, launched, or written to
integration configs when cloud mode is turned off (a filtering sketch
follows this list):
- TUI main menu: open model picker instead of launching with a
disabled cloud model
- cmd.go: add IsCloudModelDisabled checks for all Selection* paths
- LaunchCmd: filter cloud models from saved Editor configs before
launch, fall through to picker if none remain
- Editor Run() methods (droid, opencode, openclaw): filter cloud
models before calling Edit() and persist the cleaned list
- Export SaveIntegration, remove SaveIntegrationModel wrapper that
was accumulating models instead of replacing them
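The filtering in the launch paths above is conceptually just (sketch; the
real code uses the shared selector logic rather than suffix checks):
```go
import "strings"

// filterCloudModels drops cloud models from a saved model list when cloud is
// disabled; the remaining models are persisted back to the config.
func filterCloudModels(models []string) []string {
	kept := make([]string, 0, len(models))
	for _, m := range models {
		if strings.HasSuffix(m, ":cloud") || strings.HasSuffix(m, "-cloud") {
			continue
		}
		kept = append(kept, m)
	}
	return kept
}
```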
* rename saveIntegration to SaveIntegration in config.go and tests
* cmd/config: add --model guarding and empty model list fixes
* Update docs/faq.mdx
Co-authored-by: Jeffrey Morgan <jmorganca@gmail.com>
* Update internal/cloud/policy.go
Co-authored-by: Jeffrey Morgan <jmorganca@gmail.com>
* Update internal/cloud/policy.go
Co-authored-by: Jeffrey Morgan <jmorganca@gmail.com>
* Update server/routes.go
Co-authored-by: Jeffrey Morgan <jmorganca@gmail.com>
* Revert "Update internal/cloud/policy.go"
This reverts commit 8bff8615f9.
Since this error shows up in other integrations, we want it to be
prefixed with Ollama
* rename cloud status
* more status renaming
* fix tests that weren't updated after rename
---------
Co-authored-by: ParthSareen <parth.sareen@ollama.com>
Co-authored-by: Jeffrey Morgan <jmorganca@gmail.com>
When trying to use a cloud model with OLLAMA_HOST="ollama.com" while not signed in, a helpful error message is now displayed telling the user they must sign in to use cloud models. This should be the same experience for models which specify a remote instance.
When a browser is available, open it to the connect URL automatically when running the `ollama signin` command. The browser is not opened in any other unauthorized scenario.
- Fix panic in ollama show for image gen models (safe type assertion)
- Add vision capability for Flux2KleinPipeline models at create time
- Flatten transparent PNG images onto white background for better results
Move the unload check (empty prompt + KeepAlive=0) before the image
generation model dispatch in GenerateHandler. This prevents models like
flux from being loaded into memory just to be immediately unloaded when
running `ollama rm`.
Also fix a bug in DeleteHandler where `args[0]` was used instead of
`arg` in the delete loop, causing only the first model to be unloaded
when deleting multiple models.
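The DeleteHandler fix in a nutshell (sketch; `unload`/`remove` stand in for
the real client calls):
```go
// deleteModels stops each model before removing it, using the loop variable
// rather than args[0], so every model gets unloaded, not just the first.
func deleteModels(args []string, unload, remove func(string) error) error {
	for _, arg := range args {
		if err := unload(arg); err != nil { // previously this used args[0]
			return err
		}
		if err := remove(arg); err != nil {
			return err
		}
	}
	return nil
}
```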
* x: make `ollama create --experimental` import from safetensors
This change allows importing safetensors models into the new experimental model format, and also
fixes the `ollama show` command so it correctly displays the model information.
* gofumpt the linter
* gofumpt the linter again
* validate the model name
TeaCache:
- Timestep embedding similarity caching for diffusion models
- Polynomial rescaling with configurable thresholds
- Reduces transformer forward passes by ~30-50%
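A very rough sketch of the TeaCache skip decision (the real implementation
works on tensors inside the diffusion loop; the polynomial coefficients and
threshold below are placeholders):
```go
import "math"

type teaCache struct {
	threshold   float64   // configurable skip threshold
	coeffs      []float64 // polynomial rescaling coefficients
	accumulated float64
	prevEmbed   []float64
}

// shouldSkip reports whether this step's transformer forward pass can be
// skipped and the cached residual from the previous step reused, based on how
// much the timestep embedding has changed.
func (t *teaCache) shouldSkip(embed []float64) bool {
	if t.prevEmbed == nil {
		t.prevEmbed = append([]float64(nil), embed...)
		return false
	}
	var diff, norm float64
	for i := range embed {
		diff += math.Abs(embed[i] - t.prevEmbed[i])
		norm += math.Abs(t.prevEmbed[i])
	}
	t.prevEmbed = append(t.prevEmbed[:0], embed...)

	// rescale the relative change with a fitted polynomial (Horner's method)
	x, y := diff/norm, 0.0
	for _, c := range t.coeffs {
		y = y*x + c
	}
	t.accumulated += y

	if t.accumulated < t.threshold {
		return true // change is still small; reuse cached output
	}
	t.accumulated = 0
	return false
}
```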
FP8 quantization:
- Support for FP8 quantized models (8-bit weights with scales)
- QuantizedMatmul on Metal, Dequantize on CUDA
- Client-side quantization via ollama create --quantize fp8
Other bug fixes:
- Fix `/api/show` API for image generation models
- Server properly returns model info (architecture, parameters, quantization)
- Memory allocation optimizations
- CLI improvements for image generation
This change:
* fixes rope scaling in the mistral converter
* updates ministral to include llama4 scaling
* includes a new ministral parser for parsing reasoning and tool calling
---------
Co-authored-by: jmorganca <jmorganca@gmail.com>
Co-authored-by: A-Akhil <akhilrahul70@gmail.com>
This PR introduces a new `ollama embed` command that allows users to generate embeddings directly from the command line.
- Added `ollama embed MODEL [TEXT...]` command for generating text embeddings
- Supports both direct text arguments and stdin piping for scripted workflows
- Outputs embeddings as JSON arrays (one per line)
This change fixes two bugs with `ollama rm`:
1. Before a model is removed, it is first stopped, but this only
happens for the first argument and is skipped for all other models.
2. Models are unloaded indiscriminately. This errors for cloud models
and should be skipped for them.
There are two bugs when using `/load <model>` for a model that doesn't exist, namely:
1. it will not restore the current model settings if the current model is a thinking model; and
2. it will crash if the current model is a non-thinking model.
This bug fix saves the current runOptions and then restores them if the model load
doesn't happen. It also fixes the crash for non-thinking models.
* auth: fix problems with the ollama keypairs
This change adds several fixes including:
- reading in the pubkey files correctly
- fixing the push unit test to create a keypair file in a temp directory
- not returning 500 errors for normal status errors
* bf16
* tests
* gpt-oss
* enable gptoss for engine
* rough estimate
* convert to mxfp4
* handle safetensors U8
* clamp glu/linear
* update tokenizer
* MXFP4 support
This implements the Open Compute Microscaling (MX) FP4 format
as a tensor type with backend implementations focusing
on mulmat and mulmatid on CPU, CUDA, and Metal.
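For reference, dequantizing a single MXFP4 block (32 FP4 E2M1 elements sharing
one E8M0 scale) looks roughly like this; the nibble ordering and names are
assumptions, not the actual ggml kernels:
```go
import "math"

// fp4Values are the magnitudes representable by an E2M1 element.
var fp4Values = [8]float32{0, 0.5, 1, 1.5, 2, 3, 4, 6}

func fp4(nib byte) float32 {
	v := fp4Values[nib&0x7]
	if nib&0x8 != 0 {
		return -v
	}
	return v
}

// dequantMXFP4 expands one block: 16 packed bytes (two FP4 values each) and a
// shared E8M0 scale encoding 2^(e-127).
func dequantMXFP4(scaleE8M0 byte, packed [16]byte) [32]float32 {
	scale := float32(math.Exp2(float64(int(scaleE8M0) - 127)))
	var out [32]float32
	for i, b := range packed {
		out[2*i] = fp4(b&0x0F) * scale
		out[2*i+1] = fp4(b>>4) * scale
	}
	return out
}
```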
* Unit tests for MXFP4 support
This exercises various operations and shapes on both CPU and GPU (if detected
on the system)
* cuda graph
* unit test adjustments
* cuda: optimize memory access
Read 4 bytes at a time (8 elements) when performing mul_mat_vec_mxfp4
* mac: fix crash on old macos versions
cblas_sgemm is only supported on v13.3 and up; however, bf16 is
only supported on v14+, so we were falling back to ggml-blas and
crashing on bf16 tensors. Checking for the function being null
seems to be the simplest way to conditionally avoid registering the
backend.
* server: Minimum context length for gptoss
This model requires a minimum context length of 8192 to function
effectively. Users can set higher values through all normal mechanisms
but lower values will be silently reset.
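In sketch form, the silent reset amounts to (constant name illustrative):
```go
const gptossMinContext = 8192

// effectiveContext clamps a requested context length up to the model minimum.
func effectiveContext(requested int) int {
	return max(requested, gptossMinContext)
}
```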
* ggml: Multiply by numParallel for gptoss sliding window
When computing the graph size estimate, the context size is already
multiplied by numParallel so estimates reflect that. However, since
sliding window models use a smaller, fixed context size, they need
to manually take numParallel into account.
* gpt-oss integration
includes the harmony parser, thinking levels, etc.
* fix sync
* fix tests
* fix lint
---------
Co-authored-by: Daniel Hiltgen <daniel@ollama.com>
Co-authored-by: Jesse Gross <jesse@ollama.com>
Co-authored-by: Devon Rifkin <drifkin@drifkin.net>
- Both `/api/generate` and `/api/chat` now accept a `"think"`
option that allows specifying whether thinking mode should be on or
not
- Templates get passed this new option so, e.g., qwen3's template can
put `/think` or `/no_think` in the system prompt depending on the
value of the setting
- Models' thinking support is inferred by inspecting model templates.
The prefix and suffix the parser uses to identify thinking support is
also automatically inferred from templates
- Thinking control & parsing is opt-in via the API to prevent breaking
existing API consumers. If the `"think"` option is not specified, the
behavior is unchanged from previous versions of ollama
- Add parsing for thinking blocks in both streaming/non-streaming mode
in both `/generate` and `/chat`
- Update the CLI to make use of these changes. Users can pass `--think`
or `--think=false` to control thinking, or during an interactive
session they can use the commands `/set think` or `/set nothink`
- A `--hidethinking` option has also been added to the CLI. This makes
it easy to use thinking in scripting scenarios like
`ollama run qwen3 --think --hidethinking "my question here"` where you
just want to see the answer but still want the benefits of thinking
models
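For example, enabling thinking on `/api/chat` looks roughly like this (a
minimal sketch; only the `think` field is new in this PR, and anything beyond
the fields shown is an assumption about the existing API):
```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"io"
	"net/http"
)

func main() {
	body, _ := json.Marshal(map[string]any{
		"model": "qwen3",
		"messages": []map[string]string{
			{"role": "user", "content": "Why is the sky blue?"},
		},
		"think":  true, // omit this field to keep the pre-existing behavior
		"stream": false,
	})
	resp, err := http.Post("http://localhost:11434/api/chat", "application/json", bytes.NewReader(body))
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	out, _ := io.ReadAll(resp.Body)
	fmt.Println(string(out)) // parsed thinking is returned separately from the final answer
}
```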