[PR #13387] [CLOSED] feat: add support for split GGUF models (separate vision encoder) #14190

Closed
opened 2026-04-13 00:47:56 -05:00 by GiteaMirror · 0 comments

📋 Pull Request Information

Original PR: https://github.com/ollama/ollama/pull/13387
Author: @iosub
Created: 12/9/2025
Status: Closed

Base: main ← Head: qwen3vl-split-pr


📝 Commits (4)

  • 2502bc0 feat: add support for split GGUF models (separate vision encoder)
  • 9abd638 fix: M-RoPE position encoding and address PR review feedback
  • 0d0f759 fix: add [img] tokens for split GGUF models without renderer
  • e3599f2 model/qwen3vl: fix split vision deepstack outputs

📊 Changes

20 files changed (+2576 additions, -248 deletions)

View changed files

📝 fs/ggml/gguf.go (+0 -8)
📝 llama/llama.go (+212 -20)
📝 llm/server.go (+32 -28)
📝 ml/backend.go (+14 -0)
📝 ml/backend/ggml/ggml.go (+585 -10)
➕ ml/nn/fast/rope.go (+21 -0)
📝 model/model.go (+18 -0)
📝 model/models/qwen3vl/imageprocessor.go (+66 -13)
📝 model/models/qwen3vl/model.go (+434 -9)
📝 model/models/qwen3vl/model_text.go (+5 -4)
📝 model/models/qwen3vl/model_vision.go (+533 -59)
➕ model/vision_bridge.go (+150 -0)
📝 runner/llamarunner/cache.go (+42 -2)
📝 runner/llamarunner/image.go (+54 -3)
📝 runner/llamarunner/runner.go (+284 -27)
📝 runner/ollamarunner/cache.go (+19 -0)
📝 runner/ollamarunner/runner.go (+27 -10)
📝 server/images.go (+14 -2)
📝 server/prompt.go (+22 -0)
📝 server/routes.go (+44 -53)

📄 Description

Summary

This PR adds support for loading split GGUF models where the vision encoder is stored in a separate file (e.g., model.gguf + vision.gguf). This enables Ollama to run models like Qwen3-VL from Unsloth and similar providers that distribute multimodal models as separate files.

Motivation

Several model providers (notably Unsloth) distribute vision-language models with the vision encoder in a separate GGUF file. Currently, Ollama only supports unified GGUF files, which limits compatibility with these popular model distributions.

Models that benefit from this change:

  • unsloth/Qwen3-VL-4B-Instruct-GGUF
  • unsloth/Qwen3-VL-8B-Instruct-GGUF
  • Other split multimodal models

Changes

Core Infrastructure

  • ml/backend.go: Added LoadSecondary() and RegisterTensorAlias() interfaces to support loading secondary model files
  • ml/backend/ggml/ggml.go: Implemented secondary model loading with tensor aliasing; fixed Vulkan device filtering based on OLLAMA_VULKAN env var
  • llm/server.go: Added logic to detect and load vision.gguf alongside the main model
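
The idea of loading a secondary file and aliasing its tensors can be sketched as follows. This is a minimal illustration, not the PR's code: the `tensorTable` type, its methods, and the tensor names in the usage note are hypothetical, and the real `LoadSecondary()` and `RegisterTensorAlias()` signatures are not shown in this description.

```go
package main

import "fmt"

// tensorTable is a hypothetical stand-in for a backend's tensor lookup.
// It is NOT the type used in ml/backend/ggml; all names are illustrative.
type tensorTable struct {
	tensors map[string][]float32 // canonical tensor name -> data
	aliases map[string]string    // alias -> canonical tensor name
}

func newTensorTable() *tensorTable {
	return &tensorTable{
		tensors: make(map[string][]float32),
		aliases: make(map[string]string),
	}
}

// loadSecondary merges tensors from a second file into the table.
// It checks for collisions first so a failed load leaves the table unchanged.
func (t *tensorTable) loadSecondary(extra map[string][]float32) error {
	for name := range extra {
		if _, exists := t.tensors[name]; exists {
			return fmt.Errorf("tensor %q already loaded from primary model", name)
		}
	}
	for name, data := range extra {
		t.tensors[name] = data
	}
	return nil
}

// registerAlias lets model code look up a tensor under the name it expects,
// even if the secondary file stores it under a different name.
func (t *tensorTable) registerAlias(alias, canonical string) error {
	if _, ok := t.tensors[canonical]; !ok {
		return fmt.Errorf("cannot alias %q: tensor %q not loaded", alias, canonical)
	}
	t.aliases[alias] = canonical
	return nil
}

// get resolves a name through the alias table, then the tensor table.
func (t *tensorTable) get(name string) ([]float32, bool) {
	if canonical, ok := t.aliases[name]; ok {
		name = canonical
	}
	data, ok := t.tensors[name]
	return data, ok
}
```

In this sketch, a caller would populate the table from the main model, call `loadSecondary` with the vision tensors, then call `registerAlias("vision.patch_embed.weight", "v.patch_embd.weight")` (both names are examples only) so that either name resolves to the same data.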

Model Support

  • model/models/qwen3vl/*: Updated Qwen3-VL implementation to work with split models
  • model/vision_bridge.go: New helper for vision model integration
  • ml/nn/fast/rope.go: New fast RoPE implementation for vision models
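
For readers unfamiliar with RoPE, the sketch below applies the standard interleaved rotary position embedding to a single vector. It is an assumption-laden illustration only: `applyRoPE` is a hypothetical helper, and the PR's `ml/nn/fast/rope.go` and its M-RoPE handling for Qwen3-VL may pair dimensions and positions differently.

```go
package main

import "math"

// applyRoPE rotates each pair (x[i], x[i+1]) by the angle
// pos * base^(-i/d), where d is the vector length and i is even.
// This is the textbook interleaved form, not necessarily the variant
// used by the PR. A trailing unpaired element is copied unchanged.
func applyRoPE(x []float64, pos int, base float64) []float64 {
	d := len(x)
	out := make([]float64, d)
	copy(out, x)
	for i := 0; i+1 < d; i += 2 {
		theta := float64(pos) * math.Pow(base, -float64(i)/float64(d))
		sin, cos := math.Sincos(theta)
		out[i] = x[i]*cos - x[i+1]*sin
		out[i+1] = x[i]*sin + x[i+1]*cos
	}
	return out
}
```

Because each pair is only rotated, the vector's length is preserved, and position 0 leaves the input unchanged.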

Cache Fixes

  • runner/ollamarunner/cache.go: Clear KV cache when prompt contains multimodal embeddings (ported from llamarunner)
  • runner/llamarunner/cache.go: Same fix for llama runner
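
The cache behavior described above can be sketched as a prefix-reuse check that gives up whenever the prompt carries embeddings. This is a simplified illustration under stated assumptions: the `input` type and the `reusablePrefix` function are hypothetical and do not reflect the runners' actual cache structures.

```go
package main

// input is a hypothetical prompt element: a text token, or an image
// embedding when embedding is non-nil.
type input struct {
	token     int
	embedding []float32
}

// hasEmbedding reports whether any input carries a multimodal embedding.
func hasEmbedding(inputs []input) bool {
	for _, in := range inputs {
		if in.embedding != nil {
			return true
		}
	}
	return false
}

// reusablePrefix returns how many cached inputs can be kept for a new prompt.
// If the prompt contains any embedding, it returns 0, which corresponds to
// clearing the KV cache and re-evaluating the whole prompt.
func reusablePrefix(cached, prompt []input) int {
	if hasEmbedding(prompt) {
		return 0
	}
	n := 0
	for n < len(cached) && n < len(prompt) {
		if cached[n].embedding != nil || cached[n].token != prompt[n].token {
			break
		}
		n++
	}
	return n
}
```

The trade-off in this sketch is simplicity over speed: image prompts always pay for a full re-evaluation, but stale image embeddings can never be matched against a new image by accident.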

Server

  • server/images.go: Handle vision blob detection and loading
  • server/routes.go: Minor updates for split model support

Testing

Tested with:

  • hf.co/unsloth/Qwen3-VL-4B-Instruct-GGUF:Q4_K_M (split model)
  • hf.co/unsloth/Qwen3-VL-8B-Instruct-GGUF:Q4_K_M (split model)
  • qwen3-vl:latest (unified model - regression test)

Test commands:

ollama run hf.co/unsloth/Qwen3-VL-8B-Instruct-GGUF:Q4_K_M "describe this image" --images photo.jpg

Breaking Changes

None. Unified GGUF models continue to work as before.


🔄 This issue represents a GitHub Pull Request. It cannot be merged through Gitea due to API limitations.

GiteaMirror added the pull-request label 2026-04-13 00:47:56 -05:00
Reference: github-starred/ollama#14190