[PR #11685] [MERGED] ggml: Prevent kv cache quantization on gpt-oss #75898

Closed
opened 2026-05-05 08:19:17 -05:00 by GiteaMirror · 0 comments

📋 Pull Request Information

Original PR: https://github.com/ollama/ollama/pull/11685
Author: @jessegross
Created: 8/5/2025
Status: Merged
Merged: 8/5/2025
Merged by: @jessegross

Base: main ← Head: jessegross/kv_cache


📝 Commits (1)

  • 01fa47e ggml: Prevent kv cache quantization on gpt-oss

📊 Changes

1 file changed (+4 additions, -0 deletions)


📝 fs/ggml/ggml.go (+4 -0)

📄 Description

KV cache quantization depends on the flash attention kernel. We currently cannot use flash attention with gpt-oss because the model requires additional operations that the kernel does not yet support.

The model definition does not call flash attention, so it works regardless of the setting, but the cache still picks up the quantization type. This change applies the flash attention setting earlier in the loading flow so that all downstream settings, including the KV cache type, are set correctly.
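
For illustration only, here is a minimal sketch of the gating logic described above, using hypothetical names (the actual four-line change in fs/ggml/ggml.go may differ): flash attention is decided per architecture first, and when it is unavailable the KV cache falls back to f16 rather than a quantized type.

```go
package main

import "fmt"

// kvCacheType is a hypothetical helper illustrating the intent of the fix:
// resolve flash attention support for the architecture first, then derive
// the KV cache type from it, so a quantized cache type is never applied to
// a model that cannot use flash attention.
func kvCacheType(arch, requested string, flashAttention bool) string {
	// gpt-oss needs attention operations the flash attention kernel does
	// not provide yet, so flash attention is disabled for it regardless
	// of the user setting. (Architecture name here is an assumption.)
	if arch == "gptoss" {
		flashAttention = false
	}

	// Quantized KV cache types (e.g. q8_0, q4_0) depend on flash
	// attention; without it, fall back to the default f16 cache.
	if !flashAttention || requested == "" {
		return "f16"
	}

	return requested
}

func main() {
	// With the flash attention setting applied early, a quantized cache
	// request on gpt-oss resolves to f16 instead of silently sticking.
	fmt.Println(kvCacheType("gptoss", "q8_0", true)) // f16
	fmt.Println(kvCacheType("llama", "q8_0", true))  // q8_0
}
```

Deciding the flash attention flag once, early in loading, keeps every downstream consumer of that flag (including the cache type) consistent with it.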

Fixes: #11671


🔄 This issue represents a GitHub Pull Request. It cannot be merged through Gitea due to API limitations.

GiteaMirror added the pull-request label 2026-05-05 08:19:17 -05:00

Reference: github-starred/ollama#75898