[PR #14506] [CLOSED] model: Add qwen35moe architecture support #61402

Closed
opened 2026-04-29 16:28:10 -05:00 by GiteaMirror · 0 comments
Owner

📋 Pull Request Information

Original PR: https://github.com/ollama/ollama/pull/14506
Author: @ElViejoPaulino
Created: 2026-02-28
Status: Closed

Base: main ← Head: feat/qwen35moe-architecture


📝 Commits (1)

  • 64c8b5c model: Add qwen35moe architecture support

📊 Changes

11 files changed (+961 additions, -8 deletions)


📝 fs/ggml/ggml.go (+27 -3)
📝 llama/llama.cpp/src/llama-arch.cpp (+37 -0)
📝 llama/llama.cpp/src/llama-arch.h (+3 -0)
📝 llama/llama.cpp/src/llama-context.cpp (+1 -1)
📝 llama/llama.cpp/src/llama-kv-cache.cpp (+1 -1)
📝 llama/llama.cpp/src/llama-model.cpp (+97 -1)
📝 llama/llama.cpp/src/llama-model.h (+4 -0)
📝 llama/llama.cpp/src/models/models.h (+52 -0)
➕ llama/llama.cpp/src/models/qwen35moe.cpp (+734 -0)
📝 model/models/qwen3next/deltanet.go (+2 -2)
📝 model/models/qwen3next/model.go (+3 -0)

📄 Description

Problem:
Ollama v0.17.4 fails with "unknown model architecture: qwen35moe" when attempting to load any Qwen3.5 model that uses the MoE architecture.

Solution:
- C++ runner (vendored llama.cpp):
1. Register the LLM_ARCH_QWEN35MOE enum, tensor mappings, and hyperparameters
2. New qwen35moe.cpp graph builder handling the hybrid delta_net + full-attention + MoE layout (256 experts, 8 active; see the routing sketch after this list)
3. Model loading with correct tensor shapes (separate ssm_beta/ssm_alpha, 1D ssm_a)
4. Increased max_nodes for qwen35moe
5. KV cache get_can_shift() fix for IMROPE models (n_pos_per_embd > 1)
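
For reference on item 2, below is a minimal sketch in Go of top-k expert routing: pick the highest-scoring experts (8 of 256 in this model) and renormalize their weights. It only illustrates the routing idea, not the actual ggml graph code in qwen35moe.cpp, and whether the softmax runs before or after the top-k cut varies between MoE implementations; this sketch normalizes after the cut.

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// selectExperts picks the top-k experts by router logit and returns
// softmax-normalized weights over just those k experts. Illustrative only:
// the real graph builder expresses the same computation as tensor ops.
func selectExperts(routerLogits []float64, k int) (idx []int, weights []float64) {
	idx = make([]int, len(routerLogits))
	for i := range idx {
		idx[i] = i
	}
	// Sort expert indices by descending logit and keep the top k.
	sort.Slice(idx, func(a, b int) bool { return routerLogits[idx[a]] > routerLogits[idx[b]] })
	idx = idx[:k]

	// Softmax over the selected logits only (subtract the max for stability).
	maxLogit := routerLogits[idx[0]]
	sum := 0.0
	weights = make([]float64, k)
	for i, e := range idx {
		weights[i] = math.Exp(routerLogits[e] - maxLogit)
		sum += weights[i]
	}
	for i := range weights {
		weights[i] /= sum
	}
	return idx, weights
}

func main() {
	// Toy router output for 8 experts; the real model routes 8 of 256.
	logits := []float64{0.1, 2.3, -0.5, 1.7, 0.0, 3.1, -1.2, 0.9}
	idx, w := selectExperts(logits, 2)
	fmt.Println(idx, w) // approximately [5 1] [0.69 0.31]
}
```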

- Go scheduler (memory estimation):
1. HeadCount()/HeadCountKV() in ggml.go detect full_attention_interval and zero out the recurrent layers when the head counts are scalar, which fixes memory estimation for any hybrid architecture (see the sketch after this list)
2. Fix the ssm_dt tensor name to ssm_dt.bias in deltanet.go (matches the GGUF naming from the llama.cpp converter)
3. Defer the qwen3next Go engine to the C++ runner in model.go
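
A minimal sketch of the per-layer KV-head expansion described in item 1, assuming GGUF-style metadata keys; the key names and the exact phase of the attention layers are assumptions for illustration, not ollama's actual ggml.go API. With 40 layers and a full_attention_interval of 4, only 10 layers keep a nonzero KV-head count, which matches the numbers in the update below.

```go
package main

import "fmt"

// headCountsKV expands a scalar KV-head count into a per-layer slice and
// zeroes the layers that use the recurrent delta-net path, so the scheduler
// only budgets KV cache for the full-attention layers. The metadata key
// names below follow common GGUF conventions but are assumptions here.
func headCountsKV(kv map[string]any, arch string, blockCount int) []uint64 {
	heads := make([]uint64, blockCount)

	scalar, _ := kv[fmt.Sprintf("%s.attention.head_count_kv", arch)].(uint64)
	interval, ok := kv[fmt.Sprintf("%s.full_attention_interval", arch)].(uint64)
	if !ok || interval == 0 {
		// No hybrid layout declared: every layer is full attention.
		for i := range heads {
			heads[i] = scalar
		}
		return heads
	}

	// Hybrid layout: only every interval-th layer is full attention; the
	// recurrent layers get zero KV heads and therefore no KV-cache budget.
	// (Which layer within each group is the attention one depends on the
	// converter; i+1 is assumed here.)
	for i := range heads {
		if uint64(i+1)%interval == 0 {
			heads[i] = scalar
		}
	}
	return heads
}

func main() {
	kv := map[string]any{
		"qwen35moe.attention.head_count_kv": uint64(4),
		"qwen35moe.full_attention_interval": uint64(4),
	}
	fmt.Println(headCountsKV(kv, "qwen35moe", 40)) // 10 of 40 entries are nonzero
}
```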

Testing:
Qwen3.5-35B-A3B Q4_K_M on RTX 4090 + RTX 3090 (48 GB VRAM total)
262K context: 30 GB, 100% on GPU (was 60 GB with CPU spill before the memory fix)
Decode: 101 t/s (was 16 t/s, roughly a 6x improvement)
Prefill: 716 t/s, TTFT: 245 ms

Files changed (11):
1. qwen35moe.cpp | New: graph builder
2. llama/llama.cpp/src/llama-arch.cpp | Architecture registration
3. llama/llama.cpp/src/llama-arch.h | Enum + tensor types
4. llama-model.cpp | Model loading
5. llama-model.h | Tensor declarations
6. llama/llama.cpp/src/llama-context.cpp | max_nodes
7. llama-kv-cache.cpp | IMROPE shift fix
8. llama/llama.cpp/src/models/models.h | Graph builder registration
9. ggml.go | HeadCount/HeadCountKV memory fix
10. deltanet.go | ssm_dt.bias tensor name
11. model.go | Defer to C++ runner

Update: My first push only added C++ runner support. After further testing I found that, without a Go-side fix, the scheduler was still allocating a full-context KV cache for all 40 layers (instead of only the 10 full-attention layers), causing roughly 2x memory overuse and GPU spill at long contexts (a back-of-the-envelope calculation follows below).
This push adds the memory estimation fix to HeadCount()/HeadCountKV() and the ssm_dt.bias tensor name fix.
Comment sections were also cleaned up in the code.
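
For context on the ~2x figure, here is a back-of-the-envelope KV-cache calculation; the head counts and f16 cache type are hypothetical placeholders, not the confirmed Qwen3.5 dimensions.

```go
package main

import "fmt"

func main() {
	// KV cache bytes ≈ 2 (K and V) * n_ctx * n_kv_heads * head_dim * bytes per element.
	// Hypothetical dimensions chosen only to show the order of magnitude.
	const (
		nCtx     = 262144 // 262K context
		nKVHeads = 4
		headDim  = 128
		bytesF16 = 2
	)
	perLayer := 2 * nCtx * nKVHeads * headDim * bytesF16
	fmt.Printf("per layer: %.1f GiB\n", float64(perLayer)/(1<<30))    // 0.5 GiB
	fmt.Printf("40 layers: %.1f GiB\n", float64(40*perLayer)/(1<<30)) // 20.0 GiB
	fmt.Printf("10 layers: %.1f GiB\n", float64(10*perLayer)/(1<<30)) // 5.0 GiB
}
```

On this toy configuration, budgeting the cache for all 40 layers instead of the 10 full-attention ones adds about 15 GiB, which is the right order of magnitude for the 60 GB to 30 GB drop reported above (the remainder of the footprint is weights and compute buffers).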


🔄 This issue represents a GitHub Pull Request. It cannot be merged through Gitea due to API limitations.

GiteaMirror added the pull-request label 2026-04-29 16:28:10 -05:00
