This change adds a new MLX-based runner, which includes:
* Method-based MLX bindings
* Subprocess-based MLX runner (x/mlxrunner)
* KV cache with tree management
* A basic sampler
The GLM4-MoE-Lite model has been ported to use the new bindings.
---------
Co-authored-by: Michael Yang <git@mxy.ng>
This change includes:
- changes to the safetensors metadata format
- changes to the create command to properly create the blobs with the new format
- changes to load the new format
- fixes ollama show to properly show each tensor
This adds a new PowerShell install script suitable for running via
irm https://ollama.com/install.ps1 | iex
If you download the script and run it with '-?', it reports basic usage
information, as well as usage examples for common customization
options. The script is signed as part of the release process
to ensure it can run on a typically configured Windows system.
This does not include doc updates - we can merge those after a release
ships to avoid user confusion.
Set ANTHROPIC_DEFAULT_OPUS_MODEL, ANTHROPIC_DEFAULT_SONNET_MODEL,
ANTHROPIC_DEFAULT_HAIKU_MODEL, and CLAUDE_CODE_SUBAGENT_MODEL when
launching Claude Code so all model tiers route through Ollama.
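
A minimal sketch of how the environment for the launched process could be built. The helper name `claudeEnv` and the model name are hypothetical, and setting all four variables to the same model is an assumption made for illustration only.

```go
package main

import (
	"fmt"
	"os"
)

// claudeModelVars lists the variables named in this change.
var claudeModelVars = []string{
	"ANTHROPIC_DEFAULT_OPUS_MODEL",
	"ANTHROPIC_DEFAULT_SONNET_MODEL",
	"ANTHROPIC_DEFAULT_HAIKU_MODEL",
	"CLAUDE_CODE_SUBAGENT_MODEL",
}

// claudeEnv (hypothetical helper) returns a copy of base with every tier
// variable appended and pointing at model. Appending after base means these
// values win when the slice is used as exec.Cmd.Env, because os/exec keeps
// the last value for a duplicated key.
func claudeEnv(base []string, model string) []string {
	env := make([]string, 0, len(base)+len(claudeModelVars))
	env = append(env, base...)
	for _, key := range claudeModelVars {
		env = append(env, key+"="+model)
	}
	return env
}

func main() {
	// "example-model" is a placeholder, not a real model name.
	env := claudeEnv(os.Environ(), "example-model")
	for _, kv := range env[len(env)-len(claudeModelVars):] {
		fmt.Println(kv)
	}
}
```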
Allow installing Ollama on macOS directly from the command line. This is in line with other CLI tools and results in a more streamlined experience when the user is looking to use the CLI specifically.
When numPredict is set, the user will receive one fewer token
than the requested limit. In addition, the stats will incorrectly
show the number of tokens returned as the limit. In cases where
numPredict is not set, the number of tokens is reported correctly.
This occurs because numPredict is checked when setting up the next
batch but hitting the limit will terminate the current batch as well.
Instead, it is better to check the limit as we actually predict tokens.
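
The idea can be sketched as a loop that counts each token at the moment it is predicted. This is an illustration under stated assumptions, not the runner's code: `generate` and `sliceSource` are hypothetical names, and treating a non-positive numPredict as "no limit" is an assumption.

```go
package main

import "fmt"

// sliceSource is a hypothetical stand-in for the model: it yields the given
// tokens one at a time and reports false once they are exhausted.
func sliceSource(tokens []string) func() (string, bool) {
	i := 0
	return func() (string, bool) {
		if i >= len(tokens) {
			return "", false
		}
		tok := tokens[i]
		i++
		return tok, true
	}
}

// generate collects tokens until the source ends or numPredict tokens have
// been produced. numPredict <= 0 is treated as "no limit" (an assumption).
func generate(next func() (string, bool), numPredict int) []string {
	var out []string
	for {
		tok, ok := next()
		if !ok {
			return out
		}
		out = append(out, tok)
		// Check the limit right after the token is predicted, so the count
		// returned to the user and the count reported in stats agree.
		if numPredict > 0 && len(out) >= numPredict {
			return out
		}
	}
}

func main() {
	out := generate(sliceSource([]string{"a", "b", "c", "d", "e"}), 3)
	fmt.Println(len(out), out) // prints "3 [a b c]"
}
```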
When using a cloud model with OLLAMA_HOST="ollama.com" while not signed in, a helpful error message tells the user they must sign in to use cloud models. This should be the same experience for models which specify a remote instance.
If a sequence is replaced in s.seqs while a batch is computing, the old logits can be decoded into the new sequence. This change rechecks the sequence pointer after compute and skips decoding for replaced entries, preventing stale results from being applied.
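
A simplified sketch of the recheck follows. The `server`, `sequence`, and `processBatch` names and the mutex layout are assumptions for illustration; only the `seqs` slot idea comes from the description above.

```go
package main

import (
	"fmt"
	"sync"
)

// sequence is a hypothetical, stripped-down request slot.
type sequence struct {
	id     int
	logits []float32
}

type server struct {
	mu   sync.Mutex
	seqs []*sequence
}

// processBatch snapshots the slots, runs compute without holding the lock,
// then applies each result only if the slot still holds the same pointer.
func (s *server) processBatch(compute func() [][]float32) {
	s.mu.Lock()
	snapshot := make([]*sequence, len(s.seqs))
	copy(snapshot, s.seqs)
	s.mu.Unlock()

	logits := compute() // slots may be replaced while this runs

	s.mu.Lock()
	defer s.mu.Unlock()
	for i, seq := range snapshot {
		if seq == nil || i >= len(logits) || i >= len(s.seqs) {
			continue
		}
		if s.seqs[i] != seq {
			continue // slot was replaced during compute; these logits are stale
		}
		seq.logits = logits[i]
	}
}

func main() {
	s := &server{seqs: []*sequence{{id: 0}, {id: 1}}}
	replacement := &sequence{id: 2}
	s.processBatch(func() [][]float32 {
		// Simulate a new request taking over slot 1 mid-compute.
		s.mu.Lock()
		s.seqs[1] = replacement
		s.mu.Unlock()
		return [][]float32{{0.1}, {0.9}}
	})
	fmt.Println(s.seqs[0].logits != nil, replacement.logits == nil)
}
```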
Change the truncation algorithm to start with all messages and remove
from the front until it fits, rather than adding messages one at a time
from the back. This reduces tokenization calls from O(n) to O(1) in the
common case where all messages fit in context.
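
A rough sketch of the front-trimming approach. `countTokens` is a hypothetical word-count stand-in for the real tokenizer, and always keeping the most recent message is an assumption of this sketch rather than something stated above.

```go
package main

import (
	"fmt"
	"strings"
)

// countTokens is a hypothetical stand-in for a tokenizer call; it counts
// whitespace-separated words across all messages.
func countTokens(msgs []string) int {
	n := 0
	for _, m := range msgs {
		n += len(strings.Fields(m))
	}
	return n
}

// truncate starts with every message and drops from the front until the
// remainder fits within limit. The newest message is always kept (an
// assumption). calls reports how many tokenizer invocations were needed.
func truncate(msgs []string, limit int) (kept []string, calls int) {
	for start := 0; start < len(msgs); start++ {
		calls++
		if countTokens(msgs[start:]) <= limit || start == len(msgs)-1 {
			return msgs[start:], calls
		}
	}
	return nil, calls
}

func main() {
	msgs := []string{"a b c", "d e", "f"}

	kept, calls := truncate(msgs, 10) // everything fits: one tokenizer call
	fmt.Println(len(kept), calls)

	kept, calls = truncate(msgs, 3) // oldest message is dropped
	fmt.Println(kept, calls)
}
```

When everything fits, the loop returns after a single tokenization pass; the cost only grows with the number of messages that actually have to be dropped.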
When a browser is available, open it to the connect URL automatically when running the `ollama signin` command. The browser is not opened in any other unauthorized scenario.
When context length is clamped to the model's trained context length,
ollama ps now shows the actual clamped value instead of the originally
configured value.
When launching OpenClaw without prior onboarding, run the onboarding
wizard instead of going straight to gateway. This ensures proper
gateway configuration (mode, token, etc.) before first use.
- Add onboarded() to check for wizard.lastRunAt marker in config
- Run onboard with --auth-choice skip --gateway-token ollama for fresh installs
- Existing installs (onboarding completed) run gateway directly
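
The branching above could look roughly like the following. This is a sketch under assumptions: it treats the config as plain JSON with a `wizard.lastRunAt` field, takes the config bytes as an argument instead of reading a real path, and `launchArgs` is a hypothetical helper that returns only the arguments, not the binary name.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// onboarded reports whether the config carries a wizard.lastRunAt marker.
// The JSON layout is an assumption; missing, null, or empty values and
// unparsable input all count as "not onboarded".
func onboarded(config []byte) bool {
	var cfg struct {
		Wizard struct {
			LastRunAt any `json:"lastRunAt"`
		} `json:"wizard"`
	}
	if err := json.Unmarshal(config, &cfg); err != nil {
		return false
	}
	if s, ok := cfg.Wizard.LastRunAt.(string); ok {
		return s != ""
	}
	return cfg.Wizard.LastRunAt != nil
}

// launchArgs (hypothetical) picks the arguments for the launch: existing
// installs go straight to the gateway, fresh installs run onboarding first.
func launchArgs(config []byte) []string {
	if onboarded(config) {
		return []string{"gateway"}
	}
	return []string{"onboard", "--auth-choice", "skip", "--gateway-token", "ollama"}
}

func main() {
	fresh := []byte(`{}`)
	existing := []byte(`{"wizard": {"lastRunAt": "2026-01-01T00:00:00Z"}}`)
	fmt.Println(launchArgs(fresh))
	fmt.Println(launchArgs(existing))
}
```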