[PR #15902] create: keep linear_attn in_proj_qkv and in_proj_z in BF16 for NVFP4/MXFP4/MXFP8 #77645

Open
opened 2026-05-05 10:19:19 -05:00 by GiteaMirror · 0 comments

📋 Pull Request Information

Original PR: https://github.com/ollama/ollama/pull/15902
Author: @ArkaD171717
Created: 4/30/2026
Status: 🔄 Open

Base: main ← Head: fix/qwen35-nvfp4-qkv-bf16


📝 Commits (1)

  • 0e1a7b4 create: keep linear_attn in_proj_qkv and in_proj_z in BF16 for NVFP4/MXFP4/MXFP8

📊 Changes

2 files changed (+8 additions, -0 deletions)


📝 x/create/create_test.go (+4 -0)
📝 x/create/qwen35.go (+4 -0)

📄 Description

Summary

Adds in_proj_qkv.weight and in_proj_z.weight to the BF16 exemption
list in qwen35ShouldKeepBF16ForDirectNonAffine. The upstream source
(RedHatAI/Qwen3.6-35B-A3B-NVFP4) keeps all linear_attn projections in
BF16; in_proj_a and in_proj_b were already exempted, but in_proj_qkv
and in_proj_z were missing, so they were being NVFP4-quantized.
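For context, a minimal sketch of what the exemption check might look like after this change. The actual signature and surrounding code in x/create/qwen35.go are not shown on this page, so everything here except the four tensor-name suffixes is an assumption:

```go
package create

import "strings"

// Sketch of the exemption predicate: tensors whose names match these
// suffixes stay in BF16 rather than being quantized to NVFP4/MXFP4/MXFP8.
func qwen35ShouldKeepBF16ForDirectNonAffine(name string) bool {
	for _, suffix := range []string{
		"in_proj_a.weight",   // already exempted
		"in_proj_b.weight",   // already exempted
		"in_proj_qkv.weight", // added by this PR
		"in_proj_z.weight",   // added by this PR
	} {
		if strings.HasSuffix(name, suffix) {
			return true
		}
	}
	return false
}
```

A suffix match would keep the check layer-agnostic, so the same rule covers the affected early layers and later layers alike.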

The K-projection rows in in_proj_qkv have small magnitudes (0.01-0.04)
in early layers, small enough to fall below the smallest representable
FP4 codepoint at group_size=16, so the quantized tensor comes out
all-zero in layers 0-1.
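To make the failure mode concrete, here is a small self-contained Go demo (not ollama's quantizer) of round-to-nearest FP4 over one 16-element group. It assumes E2M1 codepoints {0, 0.5, 1, 1.5, 2, 3, 4, 6} and the usual NVFP4 per-group scale of max|w|/6:

```go
package main

import (
	"fmt"
	"math"
)

// E2M1 (FP4) positive codepoints; sign is handled separately.
var fp4 = []float64{0, 0.5, 1, 1.5, 2, 3, 4, 6}

// quantizeGroup round-to-nearest quantizes one group: scale = max|w|/6,
// then each |w|/scale snaps to the nearest codepoint.
func quantizeGroup(w []float64) []float64 {
	maxAbs := 0.0
	for _, v := range w {
		maxAbs = math.Max(maxAbs, math.Abs(v))
	}
	out := make([]float64, len(w))
	if maxAbs == 0 {
		return out // all-zero group; nothing to quantize
	}
	scale := maxAbs / 6
	for i, v := range w {
		x := math.Abs(v) / scale
		best := fp4[0]
		for _, c := range fp4 {
			if math.Abs(x-c) < math.Abs(x-best) {
				best = c
			}
		}
		out[i] = math.Copysign(best*scale, v)
	}
	return out
}

func main() {
	// One moderate outlier plus K-row-like magnitudes in the 0.01-0.04 range.
	group := []float64{
		0.30, 0.010, 0.011, 0.009, 0.012, 0.010, 0.013, 0.011,
		0.010, 0.012, 0.009, 0.011, 0.010, 0.013, 0.012, 0.010,
	}
	// scale = 0.30/6 = 0.05; everything below 0.25*0.05 = 0.0125 rounds to
	// the zero codepoint, wiping out 13 of the 15 small weights.
	fmt.Println(quantizeGroup(group))
}
```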

Weight magnitude analysis (source BF16, Qwen/Qwen3.6-35B-A3B)

| Tensor | Layer | Median abs | % < 0.01 | % < 0.05 |
|-------------|-------|------------|----------|----------|
| in_proj_qkv | 0 | 0.0093 | 52.7% | 99.5% |
| in_proj_z | 0 | 0.0109 | 46.3% | 99.4% |
| in_proj_qkv | 1 | 0.0100 | 49.8% | 99.4% |
| in_proj_z | 1 | 0.0101 | 49.4% | 99.5% |

Both tensors have ~50% of weights below the smallest FP4 codepoint at group_size=16. in_proj_z has the same zero-collapse risk as in_proj_qkv.
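A quick back-of-envelope check of that claim, again assuming the max|w|/6 group-scale convention rather than quoting ollama's quantizer: under round-to-nearest, a weight lands on the zero codepoint exactly when it falls below half the smallest nonzero code (0.5) times the scale,

$$
\hat{w} = 0 \iff |w| < \frac{0.5}{2}\,s
\quad\text{with}\quad s = \frac{\max_{g}|w|}{6},
\quad\text{i.e.}\quad |w| < \frac{\max_{g}|w|}{24} \approx 0.042\,\max_{g}|w|.
$$

So a 16-weight group whose largest magnitude is around 0.24-0.30 zeroes everything below roughly 0.010-0.0125, which is right where the medians above sit; with ~50% of weights under 0.01, about half of each such group collapses.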

The test is updated and passes locally for nvfp4, mxfp8, and mxfp4.
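The test change itself is not visible on this page; a plausible table-driven form (the test name, tensor-name scheme, and predicate name below are assumptions for illustration) would be:

```go
package create

import "testing"

// Hypothetical shape of the create_test.go addition: both newly exempted
// linear_attn projections must be reported as keep-in-BF16.
func TestQwen35KeepsLinearAttnProjectionsBF16(t *testing.T) {
	for _, name := range []string{
		"blk.0.linear_attn.in_proj_qkv.weight",
		"blk.0.linear_attn.in_proj_z.weight",
	} {
		if !qwen35ShouldKeepBF16ForDirectNonAffine(name) {
			t.Errorf("%s should stay in BF16, not be quantized", name)
		}
	}
}
```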

Fixes #15866


🔄 This issue represents a GitHub Pull Request. It cannot be merged through Gitea due to API limitations.

GiteaMirror added the pull-request label 2026-05-05 10:19:20 -05:00

Reference: github-starred/ollama#77645