Feature request: Selector for low, high, or auto fidelity image understanding in vision models #2040

Closed
opened 2025-11-11 14:59:13 -06:00 by GiteaMirror · 0 comments

Originally created by @Jonseed on GitHub (Sep 9, 2024).

**Is your feature request related to a problem? Please describe.**
Currently you can only enable or disable vision capability for a model. But models like OpenAI's vision models can accept several different options for the [fidelity of the image](https://platform.openai.com/docs/guides/vision/low-or-high-fidelity-image-understanding) processed by the API: `low`, `high`, or `auto`. If none is specified, it defaults to `auto`, which looks at the input image size and decides whether to use the `low` or `high` setting. `auto` is likely the current default in Open WebUI, either explicitly coded or implicitly omitted, but I was not able to find the code in the repo to verify.

Some users might prefer to use the `low` setting no matter how large the input image is, in order to save tokens on the API, as this mode currently has a fixed cost of **85 tokens**. The `high` setting costs 85 tokens _plus_ a charge for each 512x512 tile of the image: one tile adds another 170 tokens, which triples the cost to **255 tokens** (although, if `auto` works as specified, a 512x512 image should be processed in `low` mode anyway). More likely, most images larger than 512x512 need four tiles, which adds 680 tokens for a total of **765 tokens**, _9 times_ more expensive than `low` mode, and a non-square image might need six tiles, for a total of **1,105 tokens** for one image. For many use cases where high-fidelity image understanding isn't needed, `low` mode is likely sufficient and can [save users a lot on API costs](https://platform.openai.com/docs/guides/vision/calculating-costs) for vision tasks.
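The tile arithmetic above can be sketched as a small helper. The 85-token base and 170-tokens-per-tile figures are the ones cited in this request; actual OpenAI pricing may differ or change:

```python
def vision_token_cost(detail: str, tiles: int = 1) -> int:
    """Estimate per-image token cost using the figures cited above:
    `low` is a flat 85 tokens; `high` adds 170 tokens per 512x512 tile
    on top of the 85-token base."""
    if detail == "low":
        return 85
    return 85 + 170 * tiles

# One tile: 255 tokens (3x low); four tiles: 765 (9x low); six tiles: 1105.
```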

**Describe the solution you'd like**
Provide a selector in the Model Builder config screen, shown when "Vision" is enabled, for `low`, `high`, or `auto` fidelity, defaulting to `auto`. This is controlled by adding a `detail` [parameter in the API call](https://platform.openai.com/docs/guides/vision/low-or-high-fidelity-image-understanding). Another option would be to let the user specify directly in the UI, when they add an image to a message, whether it should be processed in `low`, `high`, or `auto` mode (defaulting to `auto`, or to whatever is set in the model config), although this might add unnecessary clutter to the UI.
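As a rough sketch of what the request body would look like: the `detail` field inside `image_url` is the documented OpenAI parameter; the helper function and the idea of threading a per-model fidelity setting through it are hypothetical, not existing Open WebUI code:

```python
def build_image_content(image_url: str, fidelity: str = "auto") -> dict:
    """Build the image part of a chat message, passing the model's
    configured fidelity through as the OpenAI `detail` parameter.
    (Hypothetical helper; the `detail` key itself is per OpenAI's docs.)"""
    if fidelity not in ("low", "high", "auto"):
        raise ValueError(f"invalid fidelity: {fidelity!r}")
    return {
        "type": "image_url",
        "image_url": {"url": image_url, "detail": fidelity},
    }
```

Omitting `detail` entirely (today's behavior) is equivalent to `auto`, which is why this change should be backward compatible.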

**Describe alternatives you've considered**
Another option is for the user to resize the image before adding it to a message: assuming `auto` works as it should, a 512x512 image should be processed in `low` mode. But this adds an inconvenient extra resizing step for every image the user wants to use for vision tasks.
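For reference, the pre-resize the user would have to do amounts to fitting the image inside a 512x512 box while preserving aspect ratio. A minimal sketch of that size calculation (assuming, per the docs linked above, that images within 512x512 are the ones `auto` routes to `low` mode):

```python
def fit_within_512(width: int, height: int) -> tuple[int, int]:
    """Return the target dimensions that fit the image inside a 512x512
    box, preserving aspect ratio and never upscaling."""
    scale = min(512 / width, 512 / height, 1.0)
    return (round(width * scale), round(height * scale))

# e.g. a 1024x768 photo would be resized to 512x384 before upload.
```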

**Additional context**
I'm thinking the option could be added here in the Model Builder, perhaps in a `Fidelity` dropdown menu selector next to `Vision`.
![Screenshot 2024-09-09 110656](https://github.com/user-attachments/assets/449ca6a4-8059-4f50-8547-807aa74fe743)
Reference: github-starred/open-webui#2040