Files
Jesse Gross 1108d8b34e ggml: Enable flash attention for vision encoders
Although the vision component of multimodal models typically already
calls the optimized nn.Attention, it is converted into non-fused
operations. That is because the backend-specific fused kernels may
have requirements, such as padding, that are normally satisfied by the
cache, which vision encoders don't use.

This implements a fallback path in the backend, softening the
requirements into optimizations. In turn, this allows flash attention
to be used for vision encoders, saving a significant amount of VRAM
and improving performance.
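
A minimal Go sketch of the idea, under assumptions: the tensor type,
padTo helper, and kernelAlign value below are hypothetical and not part
of the actual ggml or Ollama API. It only illustrates turning a hard
padding requirement into an optimization the backend applies itself, so
the fused flash-attention path stays usable for unpadded vision inputs.

    package main

    import "fmt"

    // Hypothetical tensor shape; a real backend tracks far more state.
    type tensor struct {
            seqLen  int
            headDim int
    }

    // padTo rounds n up to the nearest multiple of align.
    func padTo(n, align int) int {
            return ((n + align - 1) / align) * align
    }

    // attention keeps the fused kernel usable: if the input does not meet
    // the kernel's alignment requirement, the backend pads it on demand
    // instead of requiring the caller (e.g. the KV cache) to do so.
    func attention(q, k, v tensor, kernelAlign int) string {
            if k.seqLen%kernelAlign != 0 {
                    padded := padTo(k.seqLen, kernelAlign)
                    return fmt.Sprintf("fused flash attention, keys/values padded %d -> %d",
                            k.seqLen, padded)
            }
            return fmt.Sprintf("fused flash attention, length %d already aligned", k.seqLen)
    }

    func main() {
            // Vision encoder sequence lengths rarely match kernel alignment.
            t := tensor{seqLen: 577, headDim: 64}
            fmt.Println(attention(t, t, t, 256))
    }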
2025-12-04 15:19:06 -08:00