* model: add MLA absorption for glm4moelite
Split the combined KV_B tensor into separate K_B and V_B tensors
during conversion, enabling MLA (Multi-head Latent Attention)
absorption which compresses the KV cache for improved efficiency.
* ggml: enable MLA flash attention for GLM-4.7-flash
Add support for gqa_ratio 4 in MLA flash attention kernels. GLM-4.7-flash
uses head size 576 with gqa_ratio 4, which was previously only supported
for gqa_ratio 16 (DeepSeek).
Metal changes:
- Enable head size 576 for flash attention
- Increase simdgroups to 8 for large heads (>=512)
- Add case 8 kernel dispatch for 8 simdgroups
CUDA changes:
- Add gqa_ratio 4 support for head 576/512
- Add tile configs for (576, 512, 4) and (576, 512, 8)
- Add MMA config cases for ncols 4
- Add template instances for ncols2=4
* model: add compatibility validation for glm4moelite architecture
Remove static VRAM estimation (EstimateVRAM, CheckMemoryRequirements)
which wasn't helpful. Instead, report the actual tensor weight size
from the manifest for ollama ps.
- Remove memory estimation check from runner startup
- Remove EstimateVRAM, CheckMemoryRequirements, modelVRAMEstimates
- Add TotalTensorSize() to get actual weight size from manifest
- Use weight size for Server.vramSize instead of estimates
Note: This is better than showing 0 or inaccurate estimates, but the
weight size is a drastic underestimation of actual memory usage since
it doesn't account for activations, intermediate tensors, or MLX
overhead. Future work should query real-time memory from MLX
(e.g., MetalGetActiveMemory) for accurate reporting.
Remove the Qwen image generation and image editing model packages
to clean up the codebase. These models will be reintroduced later.
- Delete x/imagegen/models/qwen_image/ (10 files)
- Delete x/imagegen/models/qwen_image_edit/ (5 files)
- Remove related CLI flags and imports from cmd/engine/main.go
- Update comments in cache/step.go to remove Qwen-specific references
For `/api/show`, a fully missing `model_info` field trips up various
integrators (including a recent Android Studio integration).
The primary source of missing info tends to come from models with a
remote that are also missing other data. It seems better to me to return
an empty `model_info` than making up some other fields within
`model_info` (like saying the architecture is `remote` or something like
that). So this does slightly change `/api/show`'s behavior that possibly
someone is relying on, but it seems more important to ensure the field
is always there (from a quick sampling integrations seem to be robust to
missing fields _within_ it).
Fixes: https://github.com/ollama/ollama/issues/13783
Move the unload check (empty prompt + KeepAlive=0) before the image
generation model dispatch in GenerateHandler. This prevents models like
flux from being loaded into memory just to be immediately unloaded when
running `ollama rm`.
Also fix a bug in DeleteHandler where `args[0]` was used instead of
`arg` in the delete loop, causing only the first model to be unloaded
when deleting multiple models.
Add --quantize fp4 support to ollama create for image generation models
(flux2, z-image-turbo), using MLX's affine 4-bit quantization.
Changes:
- Add fp4 to validation in CreateImageGenModel
- Add FP4 case to quantizeTensor (group_size=32, bits=4, affine mode)
- Add GetQuantization() to WeightSource interface for dynamic params
- Update LoadLinearLayer to use quantization params from model metadata
The loadImageGen function was not setting Options on the runnerRef,
causing needsReload() to always return true (since it checks if
runner.Options == nil). This resulted in the image generation
subprocess being killed and restarted for every request.
Simplify Nemotron3NanoParser by delegating tool call parsing to
Qwen3CoderParser instead of duplicating the parsing logic. The
Nemotron parser now only handles the thinking state machine and
transitions to Qwen3CoderParser for content and tool call parsing.
This also fixes an issue where tool calls without </think> would
cause the parser to get stuck in thinking mode.
* MLX - dynamic loading of mlx-c
Create a wrapper layer to indirect the dependency on mlx-c so
the main ollama binary does not have a load-time dependency on mlx-c, mlx, and on linux, cuda. Lazy load the library via dlopen
so we can adjust the path to ensure the dependencies are found
and fail gracefully if not present.
* review comments
* fix broken tests
* x: make `ollama create --experimental` import from safetensors
This change allows pulling in safetensors models into the new experimental model format, and also
fixes the `ollama show` command to be able to correctly display the model information.
* gofumpt the linter
* gofumpt the linter again
* validate the model name
Added validation to ensure auth redirects stay on the same host as the original request. The fix is a single check in getAuthorizationToken comparing the realm URL's host against the request host. Added tests for the auth flow.
Co-Authored-By: Gecko Security <188164982+geckosecurity@users.noreply.github.com>
* gofmt
---------
Co-authored-by: Gecko Security <188164982+geckosecurity@users.noreply.github.com>
Add --norsrc flag to ditto commands when creating Ollama-darwin.zip
to exclude AppleDouble resource fork files (._* files) from the archive.
The mlx.metallib file has extended attributes, which causes ditto to
include a ._mlx.metallib AppleDouble file in the zip. Since this file
is not part of the code signature seal, macOS rejects the bundle during
auto-update verification with:
"a sealed resource is missing or invalid"
"file added: .../._mlx.metallib"
The --norsrc flag prevents ditto from preserving resource forks and
extended attributes, ensuring only signed files are included in the
release archive.
The CMake condition for installing mlx.metallib checks
CMAKE_OSX_ARCHITECTURES, but this variable is only set when explicitly
passed - not auto-detected. The arm64 build was missing this flag,
causing the metallib to not be installed, which then caused codesign
to fail on the unexpanded glob pattern.
- Install mlx.metallib for arm64 builds (required for Metal GPU acceleration)
- Apply rpath settings to all macOS builds, not just x86_64
- Add CMAKE_BUILD_WITH_INSTALL_RPATH to avoid install_name_tool errors
- Update build_darwin.sh to copy, sign, and package the metallib
TeaCache:
- Timestep embedding similarity caching for diffusion models
- Polynomial rescaling with configurable thresholds
- Reduces transformer forward passes by ~30-50%
FP8 quantization:
- Support for FP8 quantized models (8-bit weights with scales)
- QuantizedMatmul on Metal, Dequantize on CUDA
- Client-side quantization via ollama create --quantize fp8
Other bug fixes:
- Fix `/api/show` API for image generation models
- Server properly returns model info (architecture, parameters, quantization)
- Memory allocation optimizations
- CLI improvements for image generation