Models build their own attention masks and read K/V directly from the cache's buffers, which ties them to the cache's storage layout. That blocks multi-sequence batching (right-padded rows need a query-padding mask composed onto every model) and rules out variants like paged attention, where K/V isn't one contiguous tensor.

Caches now hand back a per-layer KVHistory holding post-update K, V, and a MaskApplier that merges the cache's storage restrictions into the model's logical mask. Models describe their mask in logical terms; SDPA composes the model, padding, and applier contributions and dispatches to the kernel's causal or no-mask fast path when it can; a sketch of this shape follows below. KVHistory still exposes K, V, and the composed mask for manual attention paths (e.g. CUDA prefill at head_dim > 128).

Performance for single-sequence inference is unchanged.
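The following Go sketch shows one plausible shape of this design. Only the names KVHistory and MaskApplier come from the change itself; everything else (the Context and Tensor stand-ins, the Mask method, the sdpa dispatcher, and the kernel entry points) is hypothetical and stands in for ollama's real ml and kvcache types.

```go
package kvsketch

// Self-contained stand-ins for ollama's ml.Context and ml.Tensor so the
// sketch compiles on its own; additive (-inf) masks are assumed.
type Context interface{}

type Tensor interface {
	Add(ctx Context, other Tensor) Tensor
}

// MaskApplier merges the cache's storage restrictions (right-padding,
// paging, and so on) into the model's logical mask. A cache with no
// restrictions can pass the mask through unchanged.
type MaskApplier func(ctx Context, logical Tensor) Tensor

// KVHistory is what a cache hands back per layer after an update:
// the post-update K and V plus the applier for this layer's storage.
// Field names besides K and V are illustrative.
type KVHistory struct {
	K, V    Tensor
	Applier MaskApplier
}

// Mask returns the fully composed mask (model + padding + applier) for
// manual attention paths, e.g. CUDA prefill at head_dim > 128.
func (h KVHistory) Mask(ctx Context, logical, padding Tensor) Tensor {
	mask := logical
	if padding != nil {
		if mask == nil {
			mask = padding
		} else {
			mask = mask.Add(ctx, padding)
		}
	}
	if h.Applier != nil {
		mask = h.Applier(ctx, mask)
	}
	return mask
}

// Hypothetical kernel entry points standing in for the fused paths.
var (
	kernelSDPACausal func(ctx Context, q, k, v Tensor) Tensor
	kernelSDPANoMask func(ctx Context, q, k, v Tensor) Tensor
	kernelSDPAMasked func(ctx Context, q, k, v, mask Tensor) Tensor
)

// sdpa composes the contributions, then picks a kernel path. "causal"
// is the model's logical description of its own mask: when nothing else
// contributes, the kernel's causal fast path runs without ever
// materializing a mask tensor. (If other contributions exist, the
// causal constraint would be folded into the mask too; omitted here.)
func sdpa(ctx Context, q Tensor, h KVHistory, causal bool, logical, padding Tensor) Tensor {
	mask := h.Mask(ctx, logical, padding)
	switch {
	case mask == nil && causal:
		return kernelSDPACausal(ctx, q, h.K, h.V)
	case mask == nil:
		return kernelSDPANoMask(ctx, q, h.K, h.V)
	default:
		return kernelSDPAMasked(ctx, q, h.K, h.V, mask)
	}
}
```

Keeping the applier as a per-layer hook owned by the cache is what decouples the two sides: a padded or paged layout injects its restrictions there, and the model never learns the layout exists.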