CLIP on Open-WebUI #3332

Closed
opened 2025-11-11 15:29:24 -06:00 by GiteaMirror · 1 comment
Owner

Originally created by @Tanker365 on GitHub (Jan 21, 2025).

hi everyone! I am trying to incorporate CLIP into Open-WebUI. I want to process both text and images, and CLIP looks like a good fit as the embedder for my use case. Is it possible to do this?


@JPC612 commented on GitHub (Jan 21, 2025):

I’m also trying to incorporate CLIP in Open-WebUI for processing both text and images. Chroma and Qdrant both support multimodal embedding models like CLIP. It would also be great to leverage something like [sentence-transformers/clip-ViT-B-32-multilingual-v1](https://huggingface.co/sentence-transformers/clip-ViT-B-32-multilingual-v1) or [llamaindex/vdr-2b-multi-v1](https://huggingface.co/llamaindex/vdr-2b-multi-v1). However, from what I can tell right now, Open-WebUI only directly supports Ollama and OpenAI embeddings.

Keep in mind that if you’re not using a multimodal LLM, only the retrieval aspect (returning matching image and text results) will work. Open-WebUI does support multimodal LLMs, but to get full image+text interaction you need to send the retrieved images along with the prompts or messages to the multimodal model during RAG chat completion.
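For reference, the message shape an OpenAI-compatible multimodal endpoint expects when you attach a retrieved image to the prompt looks roughly like this (a sketch; `image_message` is a hypothetical helper of mine and the bytes are fake, but the `content` list structure with `text` and `image_url` parts is the standard OpenAI-style format):

```python
import base64

def image_message(prompt: str, image_bytes: bytes) -> dict:
    """Build an OpenAI-style multimodal chat message pairing a text
    prompt with a base64 data-URL image (hypothetical helper)."""
    b64 = base64.b64encode(image_bytes).decode()
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {"type": "image_url",
             "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
        ],
    }

# Fake JPEG bytes purely for illustration.
msg = image_message("What is in this retrieved image?", b"\xff\xd8fake-jpeg")
```

You would pass a list of such messages as `messages` in the chat-completion request to whichever multimodal model you have configured.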

Reference: github-starred/open-webui#3332