feat: Vision filter - Route images to a configured vision model while keeping the conversation on the original model #6190

Closed
opened 2025-11-11 16:47:32 -06:00 by GiteaMirror · 3 comments
Owner

Originally created by @GilGedje on GitHub (Aug 23, 2025).

Check Existing Issues

  • I have searched the existing issues and discussions.

Problem Description

•	Many strong chat models (coding/agentic) are text-only. Today, if the user adds an image, they must switch models or lose vision altogether.
•	Switching models mid-thread breaks continuity (different system prompts, safety profiles, tools).
•	Users need a simple way to extract understanding (caption/OCR/objects/layout) from images once, then proceed with the original model using the extracted text.

Desired Solution

Introduce a Vision Router that:
1. Detects image attachments in a message.
2. If the active model is not vision-capable (or if the user enabled “always pre-process images”), calls a configured vision model.
3. Receives a structured result (caption, OCR text, optional objects/layout JSON).
4. Injects the result into the prompt/context for the original model (no image content forwarded).
5. Marks the image as processed so future turns don’t re-hit the vision model automatically (opt-in reprocess button available).
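
The steps above can be sketched as a pre-processing filter. This is a minimal illustration only, not the real Open WebUI filter API: the `inlet` signature, the `call_vision_model` helper, the `VISION_MODEL_ID` value, and the message-part shapes are all assumptions for the sake of the example.

```python
# Hypothetical sketch of the proposed Vision Router as a pre-processing filter.
# Assumes OpenAI-style multimodal messages where content is a list of parts
# like {"type": "text", ...} and {"type": "image_url", ...}.

VISION_MODEL_ID = "llava:latest"  # hypothetical configured vision model


def call_vision_model(image_url: str) -> dict:
    # Stub: a real implementation would send the image to VISION_MODEL_ID
    # and return its structured result (caption, OCR text, objects/layout).
    return {"caption": "a stub caption", "ocr_text": "", "objects": []}


def inlet(body: dict, model_is_vision_capable: bool = False) -> dict:
    """Replace image parts with extracted text before the original model sees them."""
    if model_is_vision_capable:
        # Step 2: only intercept when the active model lacks vision
        # (or when an "always pre-process images" toggle is on).
        return body
    for message in body.get("messages", []):
        content = message.get("content")
        if not isinstance(content, list):
            continue  # plain text message, nothing to route
        new_parts = []
        for part in content:
            if part.get("type") == "image_url":
                # Steps 2-3: call the configured vision model for this image.
                result = call_vision_model(part["image_url"]["url"])
                # Step 4: inject the structured result as text; the image
                # itself is not forwarded to the original model.
                summary = f"[Image description: {result['caption']}"
                if result["ocr_text"]:
                    summary += f"; OCR: {result['ocr_text']}"
                summary += "]"
                new_parts.append({"type": "text", "text": summary})
                # Step 5: in this sketch, replacing the image part with text
                # means later turns never re-hit the vision model; a real
                # filter would persist a "processed" marker in chat metadata
                # and offer an opt-in reprocess button.
            else:
                new_parts.append(part)
        message["content"] = new_parts
    return body
```

The key design point is in step 4: because only extracted text reaches the original model, the conversation never switches models, which avoids the continuity breakage described above.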

Alternatives Considered

I have tried many filter functions, yet none of them delivered reliable results with open-source models.

I was able to create a fallback to a vision model when a text model fails, but then the conversation continues with the vision model rather than the intended model, since the image remains in the prompt.

Additional Context

No response


@badbubi commented on GitHub (Aug 25, 2025):

Take a look at the Dynamic Vision Router, perhaps you can adapt it for yourself.
https://openwebui.com/f/hub/dynamic_vision_router


@GilGedje commented on GitHub (Aug 25, 2025):

> Take a look at the Dynamic Vision Router, perhaps you can adapt it for yourself. https://openwebui.com/f/hub/dynamic_vision_router

I’m familiar with this function, yet it still doesn’t do what I suggested.
It routes the conversation to the vision model and stays with the vision model for the rest of it, since it sees the image in the context.
Modifying it would be quite difficult to do—I’m not much of a coder.
Still, I think it’s a really good feature that people would love.


@edacul commented on GitHub (Aug 25, 2025):

+1


Reference: github-starred/open-webui#6190