[PR #13456] [CLOSED] feat: support split GGUF vision models #76511

Closed
opened 2026-05-05 09:06:42 -05:00 by GiteaMirror · 0 comments
Owner

📋 Pull Request Information

Original PR: https://github.com/ollama/ollama/pull/13456
Author: @iosub
Created: 12/13/2025
Status: Closed

Base: main ← Head: qwen3vl-split-clean-pr


📝 Commits (6)

  • 0b6f0ef feat: add support for split GGUF models (separate vision encoder)
  • f458abc fix: M-RoPE position encoding and address PR review feedback
  • f4bcc34 fix: add [img] tokens for split GGUF models without renderer
  • c29a2f0 model/qwen3vl: fix split vision deepstack outputs
  • 314f7fd fix: keep embedding API and GGUF writer compatible with main
  • 6c6fee4 fix: address Copilot review

📊 Changes

21 files changed (+2611 additions, -223 deletions)

View changed files

📝 fs/ggml/gguf.go (+2 -2)
📝 llama/llama.go (+232 -20)
📝 llm/server.go (+37 -17)
📝 llm/server_test.go (+3 -0)
📝 ml/backend.go (+14 -0)
📝 ml/backend/ggml/ggml.go (+620 -10)
➕ ml/nn/fast/rope.go (+21 -0)
📝 model/model.go (+18 -0)
📝 model/models/qwen3vl/imageprocessor.go (+66 -13)
📝 model/models/qwen3vl/model.go (+434 -9)
📝 model/models/qwen3vl/model_text.go (+5 -4)
📝 model/models/qwen3vl/model_vision.go (+533 -59)
➕ model/vision_bridge.go (+150 -0)
📝 runner/llamarunner/cache.go (+5 -0)
📝 runner/llamarunner/image.go (+54 -3)
📝 runner/llamarunner/runner.go (+283 -25)
📝 runner/ollamarunner/cache.go (+19 -0)
📝 runner/ollamarunner/runner.go (+26 -8)
📝 server/images.go (+14 -2)
📝 server/prompt.go (+22 -0)

...and 1 more file

📄 Description

Summary
This PR adds support for loading split GGUF models where the vision encoder is stored in a separate file (e.g., model.gguf + vision.gguf). This enables Ollama to run models like Qwen3-VL from Unsloth and similar providers that distribute multimodal models as separate files.

Motivation
Several model providers (notably Unsloth) distribute vision-language models with the vision encoder in a separate GGUF file. Currently, Ollama supports only unified GGUF files, which limits compatibility with these popular model distributions.

Models that benefit from this change:

  • unsloth/Qwen3-VL-4B-Instruct-GGUF
  • unsloth/Qwen3-VL-8B-Instruct-GGUF
  • Other split multimodal models

Changes
Core Infrastructure

  • ml/backend.go: Added LoadSecondary() and RegisterTensorAlias() interfaces to support loading secondary model files
  • ml/backend/ggml/ggml.go: Implemented secondary model loading with tensor aliasing; skip Vulkan devices unless OLLAMA_VULKAN is enabled
  • llm/server.go: Plumb projector/vision GGUF path into the runner load request for split vision models
  • fs/ggml/gguf.go: Keep GGUF KV writing compatible with current main
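The PR names the LoadSecondary() and RegisterTensorAlias() interfaces but does not show their signatures, so the following is only a hedged sketch of what such a secondary-load path could look like. The signatures, the toyBackend type, and the tensor names are illustrative assumptions, not the actual ml/backend.go API.

```go
package main

import "fmt"

// Hypothetical shape of the interfaces named in the PR; the real
// signatures in ml/backend.go may differ.
type SecondaryLoader interface {
	// LoadSecondary loads an additional GGUF file (e.g. vision.gguf)
	// into the same backend as the primary model.
	LoadSecondary(path string) error
	// RegisterTensorAlias maps a tensor name expected by the model
	// code to the name actually stored in the secondary GGUF.
	RegisterTensorAlias(alias, target string)
}

// toyBackend demonstrates alias resolution with plain maps.
type toyBackend struct {
	aliases map[string]string
	tensors map[string][]float32
}

func (b *toyBackend) LoadSecondary(path string) error {
	// A real implementation would parse the GGUF file; here we just
	// pretend one vision tensor was read from it.
	b.tensors["v.patch_embd.weight"] = []float32{0.1, 0.2}
	return nil
}

func (b *toyBackend) RegisterTensorAlias(alias, target string) {
	b.aliases[alias] = target
}

// Get resolves any registered alias before looking the tensor up.
func (b *toyBackend) Get(name string) []float32 {
	if target, ok := b.aliases[name]; ok {
		name = target
	}
	return b.tensors[name]
}

func main() {
	b := &toyBackend{aliases: map[string]string{}, tensors: map[string][]float32{}}
	b.LoadSecondary("vision.gguf")
	b.RegisterTensorAlias("vision.patch_embedding.weight", "v.patch_embd.weight")
	fmt.Println(len(b.Get("vision.patch_embedding.weight"))) // found via the alias
}
```

The point of the alias layer is that the split vision GGUF may use different tensor names than the unified file, so one lookup path can serve both layouts.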

Model Support

  • model/models/qwen3vl/*: Updated the Qwen3-VL implementation to work with split models and corrected the split-vision deepstack outputs
  • model/vision_bridge.go: Helper for vision model integration
  • ml/nn/fast/rope.go: Fast RoPE support used by vision paths
  • llama/llama.go, model/model.go: Split-model wiring and load path support

Cache / Runner

  • runner/ollamarunner/cache.go: Clear KV cache when prompt contains multimodal embeddings to prevent stale reuse across image inputs
  • runner/llamarunner/cache.go: Minor cache behavior adjustment to keep current semantics/tests consistent
  • runner/llamarunner/runner.go, runner/ollamarunner/runner.go: Include prompt_eval_count in embedding responses for compatibility
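The ollamarunner cache change can be pictured as a guard: token ids in a cached prefix can be compared for reuse, but opaque image embeddings cannot, so any multimodal input forces a clear. The input type and function below are illustrative assumptions, not the runner's real types.

```go
package main

import "fmt"

// input stands in for a runner prompt element: either a token id or a
// multimodal embedding. The real runner types differ.
type input struct {
	token     int
	embedding []float32 // non-nil for image inputs
}

// needsCacheClear reports whether the KV cache must be dropped before
// processing the prompt. Embeddings cannot be checked for equality the
// way token ids can, so any multimodal input forces a full clear,
// preventing stale image state from a previous request being reused.
func needsCacheClear(prompt []input) bool {
	for _, in := range prompt {
		if in.embedding != nil {
			return true
		}
	}
	return false
}

func main() {
	textOnly := []input{{token: 1}, {token: 2}}
	withImage := []input{{token: 1}, {embedding: []float32{0.5}}}
	fmt.Println(needsCacheClear(textOnly), needsCacheClear(withImage)) // false true
}
```

This trades some prefix-reuse efficiency on image prompts for correctness across consecutive requests with different images.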

Server

  • server/images.go: Handle vision blob detection/loading for split models
  • server/prompt.go: Ensure required [img] tokens are present for split models without a renderer
  • server/routes.go: Minor route updates for embedding compatibility and split support
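For the [img] token fix: split GGUF models without a renderer have no template logic to emit image markers, so the server has to guarantee one marker per attached image itself. A minimal sketch of that idea follows; the function name and placement of the inserted markers are assumptions, and the real server/prompt.go logic is more involved.

```go
package main

import (
	"fmt"
	"strings"
)

// ensureImgTokens guarantees the prompt carries one "[img]" marker per
// attached image, prepending any that are missing. Sketch only.
func ensureImgTokens(prompt string, numImages int) string {
	have := strings.Count(prompt, "[img]")
	for ; have < numImages; have++ {
		prompt = "[img]" + prompt
	}
	return prompt
}

func main() {
	fmt.Println(ensureImgTokens("describe this image", 1)) // marker added
	fmt.Println(ensureImgTokens("[img] describe", 1))      // already present, unchanged
}
```

Without such a guard, the runner would receive image data but no placeholder position to splice the vision embeddings into.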

Testing
Tested with:

  • hf.co/unsloth/Qwen3-VL-4B-Instruct-GGUF:Q4_K_M (split model)
  • hf.co/unsloth/Qwen3-VL-8B-Instruct-GGUF:Q4_K_M (split model)
  • qwen3-vl:* (unified model - regression test)

Example command
ollama run hf.co/unsloth/Qwen3-VL-8B-Instruct-GGUF:Q4_K_M "describe this image" --images photo.jpg

Breaking Changes
None. Unified GGUF models continue to work as before.


🔄 This issue represents a GitHub Pull Request. It cannot be merged through Gitea due to API limitations.

GiteaMirror added the pull-request label 2026-05-05 09:06:42 -05:00

Reference: github-starred/ollama#76511