[PR #13278] [CLOSED] feat: support split multimodal models with M-RoPE (Qwen3-VL) #40018

Closed
opened 2026-04-23 01:01:29 -05:00 by GiteaMirror · 0 comments
Owner

📋 Pull Request Information

Original PR: https://github.com/ollama/ollama/pull/13278
Author: @iosub
Created: 11/30/2025
Status: Closed

Base: main ← Head: feat/mrope-split-models


📝 Commits (9)

  • a9c6818 Revert "vulkan: temporary cary of vulkan fixes (#12971)"
  • d2917b7 ggml update to b7087
  • 366ed3e fix argsort on metal
  • 9a4271c update to b7108
  • af56743 fix bakllava regression
  • 4fd4574 fix lint logic to only compare against merge base and ignore files that aren't touched in this PR.
  • 18a8e92 Merge remote-tracking branch 'dhiltgen/ggml_bump' into feat/mrope-split-models
  • 773d9c0 feat: support split multimodal models with M-RoPE (Qwen3-VL)
  • fc49110 fix: address Copilot review comments

📊 Changes

287 files changed (+27550 additions, -22424 deletions)

View changed files

📝 .github/workflows/test.yaml (+1 -1)
📝 Makefile.sync (+1 -1)
📝 discover/runner.go (+1 -0)
📝 fs/ggml/ggml.go (+180 -3)
📝 fs/ggml/gguf.go (+4 -1)
📝 llama/build-info.cpp (+1 -1)
📝 llama/llama.cpp/.rsync-filter (+3 -0)
📝 llama/llama.cpp/common/common.cpp (+34 -5)
📝 llama/llama.cpp/common/common.h (+15 -1)
📝 llama/llama.cpp/common/json-schema-to-grammar.cpp (+21 -3)
📝 llama/llama.cpp/common/json-schema-to-grammar.h (+2 -0)
📝 llama/llama.cpp/common/log.cpp (+6 -0)
📝 llama/llama.cpp/common/log.h (+2 -0)
📝 llama/llama.cpp/include/llama.h (+7 -3)
📝 llama/llama.cpp/src/llama-arch.cpp (+140 -0)
📝 llama/llama.cpp/src/llama-arch.h (+13 -0)
📝 llama/llama.cpp/src/llama-batch.cpp (+63 -31)
📝 llama/llama.cpp/src/llama-batch.h (+12 -1)
📝 llama/llama.cpp/src/llama-chat.cpp (+32 -0)
📝 llama/llama.cpp/src/llama-chat.h (+1 -0)

...and 80 more files

📄 Description

This PR adds support for loading and running split GGUF multimodal models (separate text and vision files) with proper M-RoPE position encoding.

Features

Split Model Loading

  • Extended MetaGGML to handle ForeignTensors from vision projector files
  • Added split model detection via gguf.general.split.* metadata
  • Server loads both text and vision GGUF files when split is detected
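The detection step above can be sketched as follows. This is a hypothetical Go illustration, not the PR's actual code: the exact metadata key names (`general.split.count`, `general.split.no`) and the `detectSplit` helper are assumptions based on the `gguf.general.split.*` convention mentioned above.

```go
package main

import "fmt"

// detectSplit is a hypothetical sketch of split-model detection:
// a GGUF file whose metadata carries general.split.* keys is one
// shard of a split model. Key names are assumed, not taken from
// the PR's actual implementation.
func detectSplit(kv map[string]any) (isSplit bool, count int) {
	c, ok := kv["general.split.count"].(int)
	if !ok || c <= 1 {
		return false, 1
	}
	return true, c
}

func main() {
	kv := map[string]any{
		"general.split.count": 2, // e.g. text model + vision projector
		"general.split.no":    0, // this file's shard index
	}
	split, n := detectSplit(kv)
	fmt.Println(split, n)
}
```

When detection succeeds, the server would then resolve and load the sibling GGUF file (the vision projector) alongside the text model.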

M-RoPE (Multi-dimensional Rotary Position Embedding)

Qwen2-VL/Qwen3-VL models use M-RoPE, which requires 4 position values per image token instead of 1:

  • pos[0]: temporal position (constant for images)
  • pos[1]: y position (row in image grid)
  • pos[2]: x position (column in image grid)
  • pos[3]: unused (always 0)

Implementation:

  • NewBatchMRoPE() allocates batches with 4 positions per token
  • AddImageMRoPE() sets 2D positions based on image grid (nx x ny)
  • Position advance uses max(nx,ny) per llama.cpp mtmd conventions
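The position layout and advance rule above can be sketched as a small Go function. This is an illustrative approximation of what `AddImageMRoPE()` does, assuming the four-stream layout listed above; the function name `mropeImagePositions` and its return shape are hypothetical.

```go
package main

import "fmt"

// mropeImagePositions sketches how the four M-RoPE position streams
// might be filled for an nx x ny image token grid starting at p0.
// Layout per the PR description: [temporal, y, x, unused].
func mropeImagePositions(p0, nx, ny int) (pos [4][]int, next int) {
	for y := 0; y < ny; y++ {
		for x := 0; x < nx; x++ {
			pos[0] = append(pos[0], p0)   // temporal: constant for images
			pos[1] = append(pos[1], p0+y) // row in the image grid
			pos[2] = append(pos[2], p0+x) // column in the image grid
			pos[3] = append(pos[3], 0)    // unused, always 0
		}
	}
	// Position advance uses max(nx, ny), per llama.cpp mtmd conventions,
	// so the next text token starts past the larger grid dimension.
	next = p0 + max(nx, ny)
	return pos, next
}

func main() {
	pos, next := mropeImagePositions(10, 3, 2)
	fmt.Println(len(pos[0]), next) // 6 image tokens; text resumes at 10+max(3,2)=13
}
```

Note that a 3×2 grid consumes 6 tokens in the batch but advances the position counter by only 3, which is why M-RoPE batches cannot reuse the ordinary one-position-per-token accounting.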

Multimodal Embedding Fixes

  • Fixed buffer allocation using n_embd_inp() instead of n_embd (8192 vs 2048 for vision projector)
  • Fixed tensor reads using actual t_embd->ne[0] dimension
  • Prevents GGML_ASSERT failures and access violations with images
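The sizing bug above comes down to which dimension the read buffer is allocated from. A minimal Go sketch, assuming a stand-in `tensor` struct for a ggml tensor (the real code reads `t_embd->ne[0]` through cgo):

```go
package main

import "fmt"

// tensor is a minimal stand-in for a ggml tensor; only the shape matters here.
type tensor struct{ ne [4]int64 }

// embdBuffer sizes the embedding read buffer from the tensor's actual
// first dimension rather than the text model's n_embd. With a vision
// projector, n_embd_inp can be larger than n_embd (8192 vs 2048 in the
// PR's example), so sizing from n_embd under-allocates and triggers
// GGML_ASSERT failures or access violations when images are present.
func embdBuffer(tEmbd *tensor, nTokens int) []float32 {
	return make([]float32, int(tEmbd.ne[0])*nTokens)
}

func main() {
	tEmbd := &tensor{ne: [4]int64{8192, 1, 1, 1}} // projector output width
	buf := embdBuffer(tEmbd, 4)
	fmt.Println(len(buf))
}
```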

Stability Improvements

  • Clear KV cache when prompt contains image embeddings
  • Added repetition loop detection to stop infinite generation
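Repetition-loop detection of the kind described above can be sketched as a tail check over recent token IDs. This is a hypothetical illustration, not the PR's implementation; the `period` and `repeats` thresholds are assumed tuning parameters.

```go
package main

import "fmt"

// looping reports whether the tail of the generated token stream
// repeats the same short cycle of length `period` at least `repeats`
// times, the signal used to stop runaway generation. Thresholds are
// illustrative assumptions, not values from the PR.
func looping(tokens []int, period, repeats int) bool {
	need := period * repeats
	if len(tokens) < need {
		return false
	}
	tail := tokens[len(tokens)-need:]
	for i := period; i < need; i++ {
		if tail[i] != tail[i-period] {
			return false // cycle broken; generation is still progressing
		}
	}
	return true
}

func main() {
	fmt.Println(looping([]int{5, 1, 2, 1, 2, 1, 2}, 2, 3)) // "1 2" repeated 3x
	fmt.Println(looping([]int{1, 2, 3, 4, 5, 6, 7}, 2, 3)) // no repetition
}
```

In practice a runner would check this after each decoded token, scanning a small range of periods, and terminate the response once any period trips the threshold.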

Testing

Tested with Qwen3-VL models (2B, 4B, 8B) in Q4_K_M and Q8_0 quantizations.
Both single-file and split model formats work correctly.

Built on top of PR #12992 (ggml_bump) which adds qwen3vl architecture to llama.cpp.


🔄 This issue represents a GitHub Pull Request. It cannot be merged through Gitea due to API limitations.

GiteaMirror added the pull-request label 2026-04-23 01:01:29 -05:00

Reference: github-starred/ollama#40018