ollama/x/mlxrunner/mlx at main - ollama - Computersurge

github-starred/ollama

mirror of https://github.com/ollama/ollama.git synced 2026-05-10 15:44:38 -05:00

Patrick Devine 15e6076d79 mlx: Gemma4 MTP speculative decoding (#15980)

This change adds support for MTP (multi-token prediction) speculative decoding for the
gemma4 model family.

It includes:
  * support for importing safetensors based gemma4 draft models with `ollama create`
  * a new DRAFT command in the Modelfile for specifying draft models
  * a --quantize-draft flag for the ollama create command to quantize the draft model
  * cache support for speculation
  * changes to the rotating cache to be able to handle MTP correctly
  * sampling support for draft model token prediction

---------

Co-authored-by: Daniel Hiltgen <daniel@ollama.com>

2026-05-05 08:55:04 -07:00
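
The commit above concerns speculative decoding, where a small draft model proposes several tokens and the main model verifies them. The following is a minimal Go sketch of the general greedy draft-and-verify idea only. It is not taken from this repository: the `NextToken` type, the `SpeculateStep` function, and the toy models are hypothetical names introduced for illustration, and a real runner would verify all drafted tokens in one batched forward pass and manage KV caches rather than call the target model once per token.

```go
package main

import "fmt"

// NextToken is a hypothetical greedy model interface (an assumption for this
// sketch): given the context so far, it returns the next token id.
type NextToken func(ctx []int) int

// SpeculateStep drafts up to k tokens with the draft model, then checks them
// against the target model. It returns the tokens to append to the context
// and how many drafted tokens were accepted.
func SpeculateStep(target, draft NextToken, ctx []int, k int) (out []int, accepted int) {
	// Draft phase: the cheap model proposes k tokens autoregressively.
	proposal := make([]int, 0, k)
	work := append([]int(nil), ctx...)
	for i := 0; i < k; i++ {
		t := draft(work)
		proposal = append(proposal, t)
		work = append(work, t)
	}

	// Verify phase: accept drafted tokens while the target model agrees.
	work = append([]int(nil), ctx...)
	for _, t := range proposal {
		want := target(work)
		if want != t {
			// First disagreement: keep the target's token and stop.
			return append(out, want), accepted
		}
		out = append(out, t)
		work = append(work, t)
		accepted++
	}

	// Every drafted token was accepted, so the target adds one extra token.
	return append(out, target(work)), accepted
}

func main() {
	// Toy models: the target counts up by one; the draft agrees except
	// when the last token is 3, where it guesses wrong.
	target := func(ctx []int) int { return ctx[len(ctx)-1] + 1 }
	draft := func(ctx []int) int {
		last := ctx[len(ctx)-1]
		if last == 3 {
			return 0
		}
		return last + 1
	}

	out, n := SpeculateStep(target, draft, []int{1}, 4)
	fmt.Println(out, n) // prints "[2 3 4] 2"
}
```

In this toy run two drafted tokens are accepted and the third is replaced by the target's own prediction, so the output is always identical to what greedy decoding with the target alone would produce.
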

ci: Fix windows build (#14754)

2026-03-09 19:27:59 -07:00

Update MLX and MLX-C with threading fixes (#15845)

2026-05-03 10:03:14 -07:00

.gitignore

Add MLX runner with GLM4-MoE-Lite model support (#14185)

2026-02-10 14:57:57 -08:00

act.go

New models (#15861)

2026-04-28 11:50:12 -07:00

array_test.go

mlx: Gemma4 MTP speculative decoding (#15980)

2026-05-05 08:55:04 -07:00

array.go

mlx: improve thread safety of array management

2026-04-21 14:38:49 -07:00

CMakeLists.txt

mlx: update as of 3/23 (#14789)

2026-03-23 11:28:44 -07:00

compile_test.go

mlx: add compiled closure support

2026-04-14 16:38:32 -07:00

compile.go

mlx: improve thread safety of array management

2026-04-21 14:38:49 -07:00

dtype.go

MLX: add header vendoring and remove go build tag (#14642)

2026-03-09 17:24:45 -07:00

dynamic_darwin.go

mlx: Improve M5 performance with NAX (#15345)

2026-04-07 08:12:24 -07:00

dynamic_other.go

mlx: Improve M5 performance with NAX (#15345)

2026-04-07 08:12:24 -07:00

dynamic.c

mlx: remove noisy error output from dynamic library loading (#14346)

2026-02-20 23:46:07 -08:00

dynamic.go

mlx: Improve M5 performance with NAX (#15345)

2026-04-07 08:12:24 -07:00

dynamic.h

MLX: add header vendoring and remove go build tag (#14642)

2026-03-09 17:24:45 -07:00

fast.go

mlxrunner: decouple models from attention cache storage layout

2026-04-27 20:04:46 -07:00

gated_delta.go

mlxrunner: decouple models from attention cache storage layout

2026-04-27 20:04:46 -07:00

generated.c

Update MLX and MLX-C with threading fixes (#15845)

2026-05-03 10:03:14 -07:00

generated.h

Update MLX and MLX-C with threading fixes (#15845)

2026-05-03 10:03:14 -07:00

io.go

mlx: Support NVIDIA TensorRT Model Optimizer import (#15566)

2026-04-27 18:28:10 -07:00

memory.go

MLX: add header vendoring and remove go build tag (#14642)

2026-03-09 17:24:45 -07:00

mlx.go

mlx: add compiled closure support

2026-04-14 16:38:32 -07:00

nn.go

mlx: improve thread safety of array management

2026-04-21 14:38:49 -07:00

ops_extra.go

mlxrunner: decouple models from attention cache storage layout

2026-04-27 20:04:46 -07:00

ops.go

mlx: Gemma4 MTP speculative decoding (#15980)

2026-05-05 08:55:04 -07:00

random.go

mlx: Gemma4 MTP speculative decoding (#15980)

2026-05-05 08:55:04 -07:00

slice.go

mlxrunner: fix Slice(0, 0) returning full dimension instead of empty

2026-03-18 16:06:33 -07:00

stream.go

Update MLX and MLX-C with threading fixes (#15845)

2026-05-03 10:03:14 -07:00

thread_test.go

Update MLX and MLX-C with threading fixes (#15845)

2026-05-03 10:03:14 -07:00