ollama

mirror of https://github.com/ollama/ollama.git synced 2026-03-08 23:04:13 -05:00

Author	SHA1	Message	Date
Patrick Devine	d126467d5d	x/mlxrunner: replace sampler interface chain with single stateful Sampler (#14652 ) - Collapse MLX sampling state into a single sample.Sampler struct (options + history). - Replace interface-based sampler chain (TopP, TopK, penalty, etc.) with function-based transforms. - Update request/pipeline wiring to use *sample.Sampler, seed history from prompt tokens, and append generated tokens each step. - Implement top_p, min_p, repeat_penalty, and frequency_penalty	2026-03-07 17:50:57 -08:00
Devon Rifkin	afb4c62fbf	cloud_proxy: handle stream disconnects gracefully (#14685 ) Previously we were printing out bad errors for expected cases like clients disconnecting. Now we only debug log when that happens (which still might help in cases where we're figuring out why an integration isn't working). For other errors, we print out a proper warning now	2026-03-06 19:18:52 -08:00
Patrick Devine	e790dc435b	mlx: int4 groupsize 64 (#14682 ) Change affine 4bit integers to use groupsize 64	2026-03-06 16:39:47 -08:00
Daniel Hiltgen	288077c3a3	build: smarter docker parallelism (#14653 ) Our Dockerfile leverages parallel stages for more efficient builds. However, our old parallel settings were naive and lead to under/over utilization depending on the capabilities of your build system. This change switches to using Ninja for all our docker cmake builds to leverage its smarter parallel logic. We tell Ninja to target a load of nproc so each of the build stages will share the load on the system aiming for full CPU use without oversaturation. The GPU parallelism settings are also adjusted to 4 to avoid a long-tail for the last few GPU targets as they work through the long list of GPU architectures. This also fixes the Dockerfile to move Vulkan install to just the stage that needs it instead of blocking most other GPU installs. This should speed up CI which always has a clean build cache.	2026-03-06 16:36:22 -08:00
Daniel Hiltgen	4425c54eda	create: fix localhost handling (#14681 )	2026-03-06 16:35:58 -08:00
Michael Yang	778899a5d2	docs: format compat docs (#14678 )	2026-03-06 14:53:17 -08:00
Jeffrey Morgan	4eab60c1e2	Reapply "don't require pulling stubs for cloud models" again (#14608 ) * Revert "Revert "Reapply "don't require pulling stubs for cloud models"" (#14606)" This reverts commit `39982a954e`. * fix test + do cloud lookup only when seeing cloud models --------- Co-authored-by: ParthSareen <parth.sareen@ollama.com>	2026-03-06 14:27:47 -08:00
Bruce MacDonald	1af850e6e3	parsers: repair unclosed arg_value tags in GLM tool calls (#14656 ) GLM models sometimes omits </arg_value> closing tags in tool call XML, causing xml.Unmarshal to fail with "element <arg_value> closed by </tool_call>". This is a known issue across the GLM family. Sanitize the input to fix closing arg_key values so encoding/xml can handle it.	2026-03-06 14:08:34 -08:00
Parth Sareen	9b0c7cc7b9	cmd: override stale entries for context window pi (#14655 ) v0.17.7-rc2 v0.17.7	2026-03-05 16:30:24 -08:00
Daniel Hiltgen	6928630601	mlx: prevent remote creation mismatch (#14651 ) If the user is pointing at a remote OLLAMA_HOST, fail experimental safetensor based create operations as we only support local creation currently.	2026-03-05 14:59:00 -08:00
Parth Sareen	9896e3627f	cmd/config: fix cloud model limit lookups in integrations (#14650 ) v0.17.7-rc1	2026-03-05 13:57:28 -08:00
Bruce MacDonald	15732f0ea7	cmd: use native Ollama API endpoint for OpenClaw (#14649 ) Remove the /v1 suffix from the OpenClaw provider baseUrl so it uses the native Ollama API instead of the OpenAI-compatible endpoint. The /v1 endpoint my break tool calling in OpenClaw.	2026-03-05 13:29:17 -08:00
Parth Sareen	562c76d7cc	cmd: add qwen3.5 context length for launch (#14626 ) v0.17.7-rc0	2026-03-04 14:10:52 -08:00
Parth Sareen	122c68c151	server: loosen thinking level constraint (#14625 )	2026-03-04 13:42:18 -08:00
Jeffrey Morgan	82848a7806	model: fix renderer and parser for qwen3.5 (#14605 ) v0.17.6	2026-03-03 20:58:29 -08:00
Jeffrey Morgan	39982a954e	Revert "Reapply "don't require pulling stubs for cloud models"" (#14606 ) This reverts commit `799e51d419`.	2026-03-03 20:56:10 -08:00
Patrick Devine	e9f6ea232f	Add qwen3.5-next-moe support to MLX runner and models (#14417 ) This change adds support for qwen3.5-next-moe models (qwen3-next/qwen3.5-next/qwen3-coder) to the MLX runner. It also: * introduces recurrent cache support and related MLX ops * updates pipeline/runner integration and adds tests * properly quantizes stacked expert tensors * a Gated Delta Metal kernel for fast SSM inference * adds new MLX calls for Conv1d, DepthwideConv1d, Contiguous, Exp, Log, SoftmaxAxis	2026-03-03 16:39:22 -08:00
Patrick Devine	110eff01a9	chore: remove old imagegen LLMs models (#14597 ) These models are implemented in the x/mlxrunner instead.	2026-03-03 13:23:40 -08:00
Jeffrey Morgan	799e51d419	Reapply "don't require pulling stubs for cloud models" This reverts commit `97d2f05a6d`.	2026-03-03 13:17:10 -08:00
Victor-Quqi	e8fcb29586	model/renderers: fix glm-ocr image tags in renderer prompts (#14584 )	2026-03-03 12:51:34 -08:00
Jeffrey Morgan	97d2f05a6d	Revert "don't require pulling stubs for cloud models (#14574 )" (#14596 ) This reverts commit `8207e55ec7`.	2026-03-03 12:51:23 -08:00
Devon Rifkin	8207e55ec7	don't require pulling stubs for cloud models (#14574 ) * don't require pulling stubs for cloud models This is a first in a series of PRs that will better integrate Ollama's cloud into the API and CLI. Previously we used to have a layer of indirection where you'd first have to pull a "stub" model that contains a reference to a cloud model. With this change, you don't have to pull first, you can just use a cloud model in various routes like `/api/chat` and `/api/show`. This change respects <https://github.com/ollama/ollama/pull/14221>, so if cloud is disabled, these models won't be accessible. There's also a new, simpler pass-through proxy that doesn't convert the requests ahead of hitting the cloud models, which they themselves already support various formats (e.g., `v1/chat/completions` or Open Responses, etc.). This will help prevent issues caused by double converting (e.g., `v1/chat/completions` converted to `api/chat` on the client, then calling cloud and converting back to a `v1/chat/completions` response instead of the cloud model handling the original `v1/chat/completions` request first). There's now a notion of "source tags", which can be mixed with existing tags. So instead of having different formats like`gpt-oss:20b-cloud` vs. `kimi-k2.5:cloud` (`-cloud` suffix vs. `:cloud`), you can now specify cloud by simply appending `:cloud`. This PR doesn't change model resolution yet, but sets us up to allow for things like omitting the non-source tag, which would make something like `ollama run gpt-oss:cloud` work the same way that `ollama run gpt-oss` already works today. More detailed changes: - Added a shared model selector parser in `types/modelselector`: - supports `:cloud` and `:local` - accepts source tags in any position - supports legacy `:<tag>-cloud` - rejects conflicting source tags - Integrated selector handling across server inference/show routes: - `GenerateHandler`, `ChatHandler`, `EmbedHandler`, `EmbeddingsHandler`, `ShowHandler` - Added explicit-cloud passthrough proxy for ollama.com: - same-endpoint forwarding for `/api/`, `/v1/`, and `/v1/messages` - normalizes `model` (and `name` for `/api/show`) before forwarding - forwards request headers except hop-by-hop/proxy-managed headers - uses bounded response-header timeout - handles auth failures in a friendly way - Preserved cloud-disable behavior (`OLLAMA_NO_CLOUD`) - Updated create flow to support `FROM ...:cloud` model sources (though this flow uses the legacy proxy still, supporting Modelfile overrides is more complicated with the direct proxy approach) - Updated CLI/TUI/config cloud detection to use shared selector logic - Updated CLI preflight behavior so explicit cloud requests do not auto-pull local stubs What's next? - Cloud discovery/listing and cache-backed `ollama ls` / `/api/tags` - Modelfile overlay support for virtual cloud models on OpenAI/Anthropic request families - Recommender/default-selection behavior for ambiguous model families - Fully remove the legacy flow Fixes: https://github.com/ollama/ollama/issues/13801 * consolidate pull logic into confirmAndPull helper pullIfNeeded and ShowOrPull shared identical confirm-and-pull logic. Extract confirmAndPull to eliminate the duplication. * skip local existence checks for cloud models ModelExists and the TUI's modelExists both check the local model list, which causes cloud models to appear missing. Return true early for explicit cloud models so the TUI displays them beside the integration name and skips re-prompting the model picker on relaunch. * support optionally pulling stubs for newly-style names We now normalize names like `<family>:<size>:cloud` into legacy-style names like `<family>:<size>-cloud` for pulling and deleting (this also supports stripping `:local`). Support for pulling cloud models is temporary, once we integrate properly into `/api/tags` we won't need this anymore. * Fix server alias syncing * Update cmd/cmd.go Co-authored-by: Parth Sareen <parth.sareen@ollama.com> * address comments * improve some naming --------- Co-authored-by: ParthSareen <parth.sareen@ollama.com>	2026-03-03 10:46:33 -08:00
Jesse Gross	ad16bffc7d	mlx: Remove peak memory from the API This is still in flux so it is better to just log it for now.	2026-03-02 15:56:18 -08:00
Jesse Gross	c1e3ef4bcc	mlxrunner: Refcount pinned tensors Otherwise, it is error prone to manage multiple components working with the same tensor.	2026-03-02 15:56:06 -08:00
Parth Sareen	a3093cd5e5	cmd/opencode: rename provider from "Ollama (local)" to "Ollama" (#14566 ) The "(local)" qualifier is unnecessary since there's only one Ollama provider. Existing configs with the old name are migrated automatically; custom names are left unchanged.	2026-03-02 14:17:18 -08:00
Bruce MacDonald	23d4cad1a2	server: verify digest is not empty on create (#14555 ) An empty digest is not a valid digest for an incoming create request. Reject empty digests at the api level.	2026-03-02 13:43:35 -08:00
Jeffrey Morgan	86513cb697	runner: add token history sampling parameters to ollama runner (#14537 ) v0.17.5	2026-03-01 19:16:07 -08:00
Jeffrey Morgan	3490e9590b	model/qwen3next: avoid crash in in DeltaNet when offloading (#14541 ) Co-authored-by: Yossi Ovadia <jabadia@gmail.com>	2026-03-01 18:44:04 -08:00
Jeffrey Morgan	8da09b1e7e	qwen3next: add compatibility with imported GGUF models (#14517 )	2026-02-28 14:21:42 -08:00
Jesse Gross	a60b9adcce	mlxrunner: Fix prompt eval timing and count metrics Only the last token's processing time is included in prompt processing, giving an artificially high rate. In addition, the number of tokens only included the tokens that miss the cache, instead of our historic total tokens.	2026-02-27 17:29:47 -08:00
Jesse Gross	a16f96658b	mlxrunner: Enforce model context limit Currently, context length is unbounded - the cache will keep growing forever independent of the model's trained context length. This caps it and enforces semantics similar to most cloud services: - Long prompts will result in an error, not truncation. - Generation that exceeds the context will be stopped	2026-02-27 17:29:47 -08:00
Jesse Gross	18ab09b431	mlxrunner: Propagate pipeline errors to client via api.StatusError Errors that occur during pipeline processing are currently only logged but not sent back to the client. Rather than using HTTP status codes as we have historically done, this serializes errors as messages to allow sending them at any time during the stream.	2026-02-27 17:29:47 -08:00
Jesse Gross	638faeac54	mlxrunner: Report actual memory usage from runner The MLX runner previously reported a static VRAM estimate that was computed at load time and consisted only of the weights. This is strictly less than the actual memory usage, as it does not include the KV cache or compute graph.	2026-02-27 17:29:47 -08:00
Jesse Gross	dd5eb6337d	mlxrunner: Fix panic on full KV cache hit When the entire prompt was already cached (e.g. repeated prompt), findRemaining returned an empty slice, causing FromValues to panic on an index-out-of-range accessing a zero-length byte slice. Fix by always keeping at least one token to re-evaluate so the pipeline can seed token generation. Also reject empty prompts early rather than panicking.	2026-02-27 11:07:03 -08:00
Patrick Devine	79917cf80b	show peak memory usage (#14485 )	2026-02-26 18:38:27 -08:00
Parth Sareen	cc90a035a0	model/parsers: add stable tool call indexing for glm47 and qwen3 parsers (#14484 ) v0.17.4	2026-02-26 18:14:29 -08:00
Jeffrey Morgan	d98dda4676	model: fix qwen3 tool calling in thinking (#14477 ) Align Qwen parser behavior with Transformers serve by allowing <tool_call> parsing while still in thinking collection. Changes: - qwen3vl: detect <tool_call> before </think> in thinking state and transition to tool parsing - qwen3: same thinking-state tool detection and partial-tag overlap handling - tests: update qwen3vl thinking/tool interleaving expectations - tests: add qwen3 cases for tool call before </think> and split <tool_call> streaming v0.17.3	2026-02-26 16:13:18 -08:00
Eva H	d69ddc1edc	fix: window app crash on startup when update is pending (#14451 ) v0.17.2	2026-02-26 16:47:12 -05:00
Eva H	9bf41969f0	app: fix first update check delayed by 1 hour (#14427 ) v0.17.1	2026-02-25 18:29:55 -05:00
Jesse Gross	0f23b7bff5	mlxrunner: Cancel in-flight requests when the client disconnects Currently, a canceled request can result in computation continuing in the background to completion. It can also trigger a deadlock when there is nobody to read the output tokens and the pipeline cannot continue to the next request.	2026-02-25 14:00:42 -08:00
Jesse Gross	4e57d2094e	mlxrunner: Simplify pipeline memory and cache management Particularly in error cases, it can be difficult to ensure that all pinned memory is unpinned, MLX buffers are released and cache state is consistent. This encapsulates those pieces and sets up proper deferrals so that this happens automatically on exit.	2026-02-25 14:00:42 -08:00
Jeffrey Morgan	7f9efd53df	model: add support for qwen3.5-27b model (#14415 ) v0.17.1-rc2	2026-02-25 01:09:58 -08:00
Jeffrey Morgan	da70c3222e	model: support for qwen3.5 architecture (#14378 ) v0.17.1-rc1	2026-02-24 20:08:05 -08:00
Bruce MacDonald	9d902d63ce	ggml: ensure tensor size is valid (#14406 ) When quantizing tensors during model creation validate that the resulting sizes match what is expected based on the shape.	2026-02-24 21:52:44 -04:00
Daniel Hiltgen	f4f0a4a471	update mlx-c bindings to 0.5.0 (#14380 ) * chore: update mlx-c bindings to 0.5.0 (#14303) * linux: use gcc 11 --------- Co-authored-by: Patrick Devine <patrick@infrahq.com> v0.17.1-rc0	2026-02-23 16:44:29 -08:00
Eva H	3323c1d319	app: add upgrade configuration to settings page (#13512 )	2026-02-23 18:08:52 -05:00
Jesse Gross	f20dc6b698	mlx: don't default to affine quantization for unquantized models Otherwise the BF16 version of models trigger segfaults when they call into quantized kernels.	2026-02-23 15:03:53 -08:00
Jeffrey Morgan	4b2ac1f369	model: improvements to LFM architectures (#14368 )	2026-02-23 14:38:10 -08:00
Jesse Gross	8daf47fb3a	mlxrunner: Fix duplicate log prefixes and reduce log noise Pass subprocess stdout/stderr through to the parent's stderr directly instead of re-wrapping each line with slog. The subprocess already writes structured slog output, so the re-wrapping produced nested timestamps, levels, and message fields that were hard to read. Also downgrade verbose KV cache debug logs to trace level.	2026-02-23 14:09:20 -08:00
Eva H	6c980579cd	ui: use capability-based detection for web search (#14336 )	2026-02-23 15:00:09 -05:00

1 2 3 4 5 ...

5153 Commits