[GH-ISSUE #15898] qwen35moe architecture missing from vendored llama.cpp -- mmproj/vision loading fails #72189

Open
opened 2026-05-05 03:36:36 -05:00 by GiteaMirror · 2 comments

Originally created by @ArkaD171717 on GitHub (Apr 30, 2026).
Original GitHub issue: https://github.com/ollama/ollama/issues/15898

Bug

Attaching an mmproj (vision projector) GGUF to a qwen35moe model fails with:

llama_model_load: error loading model: error loading model architecture:
unknown model architecture: 'qwen35moe'

This blocks ALL inference (text + vision) when an mmproj is attached via a dual-FROM Modelfile.

Reproduction

Reproduced on Kaggle T4x2 (2026-04-30) using:

  • Text GGUF: bartowski/Qwen_Qwen3.6-35B-A3B-GGUF (IQ2_XS)
  • mmproj: Youseff1987/Qwen3.6-35B-A3B-Claude-4.6-Opus-Reasoning-Distilled-GGUF-with-mmproj (mmproj-F16)

Modelfile:

FROM /path/to/Qwen_Qwen3.6-35B-A3B-IQ2_XS.gguf
FROM /path/to/mmproj-F16.gguf

ollama create succeeds, but any /api/generate call returns "unable to load model". The server log shows the architecture error above.

Full reproduction notebook: https://github.com/ArkaD171717/Qwen3.6-Compat/blob/main/ollama/test_mmproj_clip_runner.ipynb

Root cause

PR #14517 added qwen35moe to the Go engine's text runner. But the Go engine does not support split vision models for this architecture -- when projectors are present, it falls back to the C++ llama.cpp runner. Ollama's vendored llama.cpp fork does not have qwen35 or qwen35moe in its architecture table, so the fallback fails.

Upstream ggml-org/llama.cpp already supports both architectures (LLM_ARCH_QWEN35, LLM_ARCH_QWEN35MOE).

Proposed fix

Sync qwen35/qwen35moe architecture support from upstream ggml-org/llama.cpp into:

  • llama/llama.cpp/src/llama-arch.h (enum entries)
  • llama/llama.cpp/src/llama-arch.cpp (name map + tensor maps)
  • llama/llama.cpp/src/llama-model.cpp (hparams + graph building)
Related issues

  • #14730 (same error, closed as dup of #14575)
  • #14575 (open, Qwen3.5 loading failures)
  • #15747 (same error on Ollama 0.21.0)
  • #15499 (same error, closed as dup of #14575)
  • #14517 (text runner fix, merged)

@ArkaD171717 commented on GitHub (Apr 30, 2026):

Upstream references for the sync

The exact upstream commits/files in ggml-org/llama.cpp that already implement qwen35/qwen35moe:

Headers:

  • src/llama-arch.h -- LLM_ARCH_QWEN35, LLM_ARCH_QWEN35MOE enum entries, plus LLM_TENSOR_SSM_ALPHA, LLM_TENSOR_SSM_BETA
  • src/llama-arch.cpp -- name maps, tensor name maps, llm_arch_is_hybrid() entries

Model loading:

  • src/llama-model.cpp -- load_hparams() cases, tensor creation blocks, IMROPE rope type entries, build_graphs() dispatch

Model implementations:

  • src/models/qwen35.cpp -- dense (27B) graph building
  • src/models/qwen35moe.cpp -- MoE (35B-A3B) graph building

Key adaptation note for Ollama's vendored copy:
Upstream refactored to llm_build_delta_net_base (with a built-in chunking/autoregressive dispatcher). Ollama's fork doesn't have this base class -- it uses llm_graph_context_mamba instead (see models/qwen3next.cpp). The qwen35 code would need the same adaptation pattern that qwen3next uses: inline build_delta_net_chunking() and build_delta_net_autoregressive() as class methods.

Other differences to watch:

  • Separate ssm_alpha + ssm_beta tensors (upstream) vs combined ssm_beta_alpha (ollama's qwen3next)
  • LLAMA_ROPE_TYPE_IMROPE with ggml_rope_multi() (upstream) vs LLAMA_ROPE_TYPE_NEOX with ggml_rope_ext() (ollama's qwen3next)
  • build_lora_mm() signature differences (upstream has LoRA adapter params)
  • build_moe_ffn() parameter ordering differences

Happy to submit a PR with the adapted code if that would be helpful.


@ArkaD171717 commented on GitHub (Apr 30, 2026):

Opened #15899 to address this

Reference: github-starred/ollama#72189