[PR #12962] feat: Add Support for Qwen3-vl and Qwen2.5-vl Video Mode #12750

Open
opened 2025-11-12 17:06:01 -06:00 by GiteaMirror · 0 comments

📋 Pull Request Information

Original PR: https://github.com/ollama/ollama/pull/12962
Author: @vincentkoc
Created: 11/5/2025
Status: 🔄 Open

Base: main ← Head: feat/video-support


📝 Commits (10+)

📊 Changes

21 files changed (+1601 additions, -57 deletions)


📝 .github/workflows/test.yaml (+12 -0)
📝 api/types.go (+2 -1)
📝 app/cmd/app/webview.go (+2 -0)
📝 app/ui/app/src/utils/fileValidation.ts (+9 -1)
📝 app/ui/ui.go (+12 -0)
📝 docs/README.md (+1 -0)
📝 docs/api.md (+41 -0)
📝 docs/api/openai-compatibility.mdx (+29 -2)
➕ docs/video-support.md (+345 -0)
➕ model/imageproc/video.go (+155 -0)
➕ model/imageproc/video_test.go (+248 -0)
📝 model/models/qwen25vl/model.go (+47 -0)
📝 model/models/qwen25vl/process_image.go (+159 -0)
📝 model/models/qwen3vl/imageprocessor.go (+271 -3)
📝 model/models/qwen3vl/model.go (+63 -3)
📝 model/models/qwen3vl/model_vision.go (+82 -41)
📝 model/renderers/qwen3vl.go (+19 -1)
📝 openai/openai.go (+80 -0)
📝 openai/openai_test.go (+4 -4)
📝 runner/ollamarunner/runner.go (+1 -1)

...and 1 more files

📄 Description


Adds video understanding support for the Qwen3-VL and Qwen2.5-VL multimodal models in Ollama. The models can now analyze video content: frames are extracted with temporal awareness and processed through the vision encoder with temporal patch grouping, as per Qwen's official implementation (https://github.com/QwenLM/Qwen3-VL). This PR also resolves https://github.com/ollama/ollama/issues/12926.

Changes:

  • Updated documentation
  • Added new video.go in model/imageproc to support video mode on supported models
  • Added model-specific embedding and tokenizer support for video and temporal embeddings in qwen3-vl and qwen2.5-vl
  • Updated the API spec to accept video in several formats: base64, video URL, and image frames (with temporal information)
  • Token usage now accounts for video tokens
  • UI now supports video upload on supported models

Technical Details:

  • Qwen3-VL
    • Uses Interleaved-MRoPE (interleaved Multimodal Rotary Position Embedding) with an explicit temporal dimension
    • Position embeddings are laid out as [time, height, width, extra] in 4D, because GGML tensors are limited to four dimensions and cannot represent a 5D layout
    • Better suited to long videos, precise timestamp localization, and complex temporal reasoning
  • Qwen2.5-VL
    • Uses standard RoPE with spatial-only position embeddings
    • Encodes temporal information through patch embeddings (3D convolutions)
    • Faster processing and a simpler architecture for general video understanding

API Usage Examples:

```bash
# Generate API with base64 video
curl http://localhost:11434/api/generate -d '{
  "model": "qwen3-vl",
  "prompt": "What is happening in this video?",
  "videos": ["<base64-encoded-video>"]
}'

# Chat API with video
curl http://localhost:11434/api/chat -d '{
  "model": "qwen2.5-vl",
  "messages": [{
    "role": "user",
    "content": "Describe the actions in this video",
    "videos": ["<base64-encoded-video>"]
  }]
}'

# OpenAI-compatible endpoint with video URL
curl http://localhost:11434/v1/chat/completions -d '{
  "model": "qwen3-vl",
  "messages": [{
    "role": "user",
    "content": [
      {"type": "text", "text": "What is in this video?"},
      {"type": "video_url", "video_url": {"url": "https://example.com/video.mp4"}}
    ]
  }]
}'
```

---

🔄 This issue represents a GitHub Pull Request. It cannot be merged through Gitea due to API limitations.
### 📝 Commits (10+)

- [`db6858b`](https://github.com/ollama/ollama/commit/db6858b2f49155443d01fea28269770b1cb6f97e) Update types.go
- [`052abfd`](https://github.com/ollama/ollama/commit/052abfdc52bb6adc8a2b1b6c04d0b2b4a5a594b4) Update qwen3vl.go
- [`299d9c2`](https://github.com/ollama/ollama/commit/299d9c2cb106662a9835b4ab40d7280f3cfc721e) Update openai.go
- [`0550994`](https://github.com/ollama/ollama/commit/05509941d771d90d6495634debdc616253a87ebf) feat: image processor for video with temporal sequencing
- [`3bc2e9c`](https://github.com/ollama/ollama/commit/3bc2e9c0ce6e7abfdadbd17452bdd5a245b23aff) Merge branch 'main' into feat/video-support
- [`50ebf29`](https://github.com/ollama/ollama/commit/50ebf29f57a650c03ae2ae0a46d82b94013a64c9) Update runner.go
- [`ea48db7`](https://github.com/ollama/ollama/commit/ea48db7a530faf982bcc398fd831b2bdca8a4c12) Update imageprocessor.go
- [`88aea3e`](https://github.com/ollama/ollama/commit/88aea3ebd4816f7ca258737559613a1c692aec35) Update model.go
- [`b3be9ec`](https://github.com/ollama/ollama/commit/b3be9ecad2ee46b0cd9a28f08f722c5414c93b7b) Update model_vision.go
- [`5495ddb`](https://github.com/ollama/ollama/commit/5495ddbea4fa3571f01d256cf0be1e0856eab76b) Update prompt.go
GiteaMirror added the pull-request label 2025-11-12 17:06:01 -06:00
Reference: github-starred/ollama-ollama#12750