ollama

mirror of https://github.com/ollama/ollama.git synced 2026-05-07 16:40:08 -05:00

Jesse Gross 2bbe2405fe mlxrunner: decouple models from attention cache storage layout

Models build their own attention masks and read K/V directly from
the cache's buffers, which ties them to the cache's storage layout.
That blocks multi-sequence batching, because right-padded rows need
a query-padding mask that every model would have to compose itself,
and it rules out variants like paged attention, where K/V isn't one
contiguous tensor.

Caches now hand back a per-layer KVHistory holding post-update K, V,
and a MaskApplier that merges the cache's storage restrictions into
the model's logical mask. Models describe their mask in logical
terms; SDPA composes model, padding, and applier contributions and
dispatches to the kernel's causal or no-mask fast path when it can.
KVHistory still exposes K, V, and the composed mask for manual
attention paths (e.g. CUDA prefill at head_dim > 128).
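
As a rough illustration of the composition described above, here is a
minimal, self-contained Go sketch. It is not the code from this commit:
the names MaskApplier, windowApplier, and compose, along with their
signatures, are hypothetical, and masks are plain boolean tables rather
than MLX arrays. It assumes a single right-padded row, a cache whose
restriction can be written as a predicate over logical positions, and
that padded query rows are simply masked out (how the real code treats
them is not shown here).

```go
package main

import "fmt"

// MaskApplier is a hypothetical stand-in for the cache-side hook: it reports
// whether the cache's storage lets a query at qPos see the key at kPos.
type MaskApplier interface {
	Allow(qPos, kPos int) bool
}

// windowApplier models a cache that retains only the last `window` keys.
type windowApplier struct{ window int }

func (w windowApplier) Allow(qPos, kPos int) bool { return qPos-kPos < w.window }

// compose merges the model's logical mask (causal or none), right-padding,
// and the cache's applier. It returns the dispatch path and, only when an
// explicit mask is required, a [query][key] table where true means "attend".
//
// offset is the number of tokens already cached, qLen the padded query
// length, and realLen the number of non-padding queries (realLen <= qLen).
func compose(causal bool, offset, qLen, realLen int, applier MaskApplier) (string, [][]bool) {
	// Fast paths: nothing beyond the model's own logical mask applies.
	if realLen == qLen && applier == nil {
		if causal {
			return "causal", nil
		}
		return "none", nil
	}

	kvLen := offset + qLen // post-update K/V includes the new tokens
	mask := make([][]bool, qLen)
	for i := range mask {
		mask[i] = make([]bool, kvLen)
		qPos := offset + i
		for k := 0; k < kvLen; k++ {
			ok := i < realLen                                     // padded queries attend to nothing
			ok = ok && k < offset+realLen                         // padded keys are hidden
			ok = ok && (!causal || k <= qPos)                     // model's logical mask
			ok = ok && (applier == nil || applier.Allow(qPos, k)) // cache restriction
			mask[i][k] = ok
		}
	}
	return "explicit", mask
}

func main() {
	// Unpadded causal attention with no cache restriction: fast path.
	path, _ := compose(true, 4, 2, 2, nil)
	fmt.Println(path)

	// One padded query plus a two-key window: an explicit mask is built.
	path, mask := compose(true, 0, 3, 2, windowApplier{window: 2})
	fmt.Println(path)
	for _, row := range mask {
		fmt.Println(row)
	}
}
```

The point of the sketch is that the model only states "causal" or "no
mask"; padding and cache restrictions are folded in at one place, and an
explicit mask is materialized only when one of them actually applies.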

Performance for single-sequence inference is unchanged.

2026-04-27 20:04:46 -07:00

batch.go

mlxrunner: decouple models from attention cache storage layout

2026-04-27 20:04:46 -07:00