feat: Vision filter - Route images to a configured vision model while keeping the conversation on the original model #6190

Closed
opened 2025-11-11 16:47:32 -06:00 by GiteaMirror · 3 comments
Owner

Originally created by @GilGedje on GitHub (Aug 23, 2025).

Check Existing Issues

  • I have searched the existing issues and discussions.

Problem Description

•	Many strong chat models (coding/agentic) are text-only. Today, if the user adds an image, they must switch models or lose vision altogether.
•	Switching models mid-thread breaks continuity (different system prompts, safety profiles, tools).
•	Users need a simple way to extract understanding (caption/OCR/objects/layout) from images once, then proceed with the original model using the extracted text.

Desired Solution

Introduce a Vision Router that:
1. Detects image attachments in a message.
2. If the active model is not vision-capable (or if the user enabled “always pre-process images”), calls a configured vision model.
3. Receives a structured result (caption, OCR text, optional objects/layout JSON).
4. Injects the result into the prompt/context for the original model (no image content forwarded).
5. Marks the image as processed so future turns don’t re-hit the vision model automatically (opt-in reprocess button available).
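
The steps above can be sketched as a pre-processing filter. This is a minimal illustration only, not the real Open WebUI filter API: the `inlet` signature, the `call_vision_model` helper, the `VISION_MODEL_ID` value, and the message-part shapes are all assumptions for the sake of the example.

```python
# Hypothetical sketch of the proposed Vision Router as a pre-processing filter.
# Assumes OpenAI-style multimodal messages where content is a list of parts
# like {"type": "text", ...} and {"type": "image_url", ...}.

VISION_MODEL_ID = "llava:latest"  # hypothetical configured vision model


def call_vision_model(image_url: str) -> dict:
    # Stub: a real implementation would send the image to VISION_MODEL_ID
    # and return its structured result (caption, OCR text, objects/layout).
    return {"caption": "a stub caption", "ocr_text": "", "objects": []}


def inlet(body: dict, model_is_vision_capable: bool = False) -> dict:
    """Replace image parts with extracted text before the original model sees them."""
    if model_is_vision_capable:
        # Step 2: only intercept when the active model lacks vision
        # (or when an "always pre-process images" toggle is on).
        return body
    for message in body.get("messages", []):
        content = message.get("content")
        if not isinstance(content, list):
            continue  # plain text message, nothing to route
        new_parts = []
        for part in content:
            if part.get("type") == "image_url":
                # Steps 2-3: call the configured vision model for this image.
                result = call_vision_model(part["image_url"]["url"])
                # Step 4: inject the structured result as text; the image
                # itself is not forwarded to the original model.
                summary = f"[Image description: {result['caption']}"
                if result["ocr_text"]:
                    summary += f"; OCR: {result['ocr_text']}"
                summary += "]"
                new_parts.append({"type": "text", "text": summary})
                # Step 5: in this sketch, replacing the image part with text
                # means later turns never re-hit the vision model; a real
                # filter would persist a "processed" marker in chat metadata
                # and offer an opt-in reprocess button.
            else:
                new_parts.append(part)
        message["content"] = new_parts
    return body
```

The key design point is in step 4: because only extracted text reaches the original model, the conversation never switches models, which avoids the continuity breakage described above.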

Alternatives Considered

I have tried many filter functions, yet none of them delivered reliable results with open-source models.

I was able to create a fallback to a vision model when a text model fails, but then the conversation continues with the vision model rather than the intended model, since the image remains in the prompt.

Additional Context

No response


@badbubi commented on GitHub (Aug 25, 2025):

Take a look at the Dynamic Vision Router, perhaps you can adapt it for yourself.
https://openwebui.com/f/hub/dynamic_vision_router


@GilGedje commented on GitHub (Aug 25, 2025):

> Take a look at the Dynamic Vision Router, perhaps you can adapt it for yourself. https://openwebui.com/f/hub/dynamic_vision_router

I’m familiar with this function, yet it still doesn’t do what I suggested.
It routes the conversation to the vision model and stays with the vision model for the rest of it, since it sees the image in the context.
Modifying it would be quite difficult to do—I’m not much of a coder.
Still, I think it’s a really good feature that people would love.


@edacul commented on GitHub (Aug 25, 2025):

+1


Reference: github-starred/open-webui#6190