[PR #13683] [MERGED] x/imagegen: add naive TeaCache and FP8 quantization support #19608

Closed
opened 2026-04-16 07:11:39 -05:00 by GiteaMirror · 0 comments

📋 Pull Request Information

Original PR: https://github.com/ollama/ollama/pull/13683
Author: @jmorganca
Created: 1/12/2026
Status: Merged
Merged: 1/12/2026
Merged by: @jmorganca

Base: main ← Head: jmorganca/diffusion-optimizations


📝 Commits (6)

  • 47b8def x/imagegen: add TeaCache and FP8 quantization support
  • 73ad90b fix
  • 6b4fb18 undo readme
  • 6ca8a91 remove additional file
  • a5b166c cmd: add tests for cmd changes
  • 52857b0 fix linter

📊 Changes

26 files changed (+1216 additions, -257 deletions)

View changed files

📝 api/client.go (+1 -1)
📝 cmd/cmd.go (+15 -15)
📝 cmd/cmd_test.go (+73 -0)
📝 server/routes.go (+13 -0)
📝 x/imagegen/README.md (+14 -0)
📝 x/imagegen/api/handler.go (+14 -18)
➕ x/imagegen/cache/teacache.go (+197 -0)
📝 x/imagegen/cli.go (+88 -86)
📝 x/imagegen/client/create.go (+70 -10)
➕ x/imagegen/client/quantize.go (+120 -0)
➕ x/imagegen/client/quantize_stub.go (+18 -0)
📝 x/imagegen/cmd/engine/main.go (+14 -7)
📝 x/imagegen/create.go (+35 -5)
📝 x/imagegen/image.go (+6 -3)
📝 x/imagegen/mlx/mlx.go (+83 -4)
📝 x/imagegen/models/qwen_image/qwen_image.go (+23 -3)
📝 x/imagegen/models/qwen_image_edit/qwen_image_edit.go (+22 -4)
📝 x/imagegen/models/zimage/text_encoder.go (+9 -9)
📝 x/imagegen/models/zimage/transformer.go (+111 -27)
📝 x/imagegen/models/zimage/vae.go (+15 -2)

...and 6 more files

📄 Description

This improves the performance of z-image on macOS and CUDA. There is still work to do to simplify this and to merge the image generation pipeline runtime across different diffusion models. FP4 was also explored, but will be revisited in a follow-up.

TeaCache:

  • Timestep embedding similarity caching for diffusion models
  • Polynomial rescaling with configurable thresholds
  • Reduces transformer forward passes by ~30-50%
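The caching idea above can be sketched as follows. This is a minimal Go sketch of the TeaCache decision logic, not the actual x/imagegen/cache/teacache.go implementation; the threshold, the polynomial coefficients, and the struct shape are placeholders. The key observation is that the change in the timestep embedding between denoising steps predicts how much the transformer output will change, so when the accumulated (rescaled) change stays below a threshold, the previous step's output can be reused instead of running a full forward pass.

```go
// Sketch of a TeaCache-style skip decision (assumed shape, placeholder values).
package main

import (
	"fmt"
	"math"
)

type TeaCache struct {
	threshold   float64   // skip while the accumulated distance stays below this
	coeffs      []float64 // polynomial rescaling coefficients (model-specific)
	accumulated float64
	prevEmb     []float64
}

// relativeL1 returns sum(|a-b|) / sum(|b|), the relative change in the embedding.
func relativeL1(a, b []float64) float64 {
	var num, den float64
	for i := range a {
		num += math.Abs(a[i] - b[i])
		den += math.Abs(b[i])
	}
	return num / den
}

// rescale evaluates the calibration polynomial at x (Horner's method).
func (t *TeaCache) rescale(x float64) float64 {
	y := 0.0
	for _, c := range t.coeffs {
		y = y*x + c
	}
	return y
}

// ShouldSkip reports whether this step's transformer forward pass can be
// skipped, reusing the cached residual from the previous step.
func (t *TeaCache) ShouldSkip(emb []float64) bool {
	if t.prevEmb == nil {
		t.prevEmb = append([]float64(nil), emb...)
		return false // always compute the first step
	}
	t.accumulated += t.rescale(relativeL1(emb, t.prevEmb))
	t.prevEmb = append(t.prevEmb[:0], emb...)
	if t.accumulated < t.threshold {
		return true
	}
	t.accumulated = 0 // recompute and reset the accumulator
	return false
}

func main() {
	tc := &TeaCache{threshold: 0.15, coeffs: []float64{1, 0}} // identity rescale
	fmt.Println(tc.ShouldSkip([]float64{1, 1}))       // first step always computes: false
	fmt.Println(tc.ShouldSkip([]float64{1.01, 1.01})) // tiny embedding change: true (skip)
	fmt.Println(tc.ShouldSkip([]float64{2, 2}))       // large change: false (recompute)
}
```

In the real mechanism the polynomial coefficients are fitted per model so that embedding distance tracks output distance; the identity polynomial here is only for illustration.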

FP8 quantization:

  • Support for FP8 quantized models (8-bit weights with scales)
  • QuantizedMatmul on Metal, Dequantize on CUDA
  • Client-side quantization via ollama create --quantize fp8

Other improvements:

  • Fix Show API for image generation models
  • Server properly returns model info (architecture, parameters, quantization)
  • Memory allocation optimizations
  • CLI improvements for image generation

🔄 This issue represents a GitHub Pull Request. It cannot be merged through Gitea due to API limitations.

GiteaMirror added the pull-request label 2026-04-16 07:11:39 -05:00

Reference: github-starred/ollama#19608