CLIP on Open-WebUI #3332

Closed
opened 2025-11-11 15:29:24 -06:00 by GiteaMirror · 1 comment
Owner

Originally created by @Tanker365 on GitHub (Jan 21, 2025).

hi everyone! I am trying to incorporate CLIP into Open-WebUI. I want to process both text and images, and CLIP looks like a good fit as the embedder for my use case. Is it possible to do this?


@JPC612 commented on GitHub (Jan 21, 2025):

I’m also trying to incorporate CLIP in Open-WebUI for processing both text and images. Chroma and Qdrant both support multimodal embedding models like CLIP. It would also be great to leverage something like [sentence-transformers/clip-ViT-B-32-multilingual-v1](https://huggingface.co/sentence-transformers/clip-ViT-B-32-multilingual-v1) or [llamaindex/vdr-2b-multi-v1](https://huggingface.co/llamaindex/vdr-2b-multi-v1). However, from what I can tell right now, Open-WebUI only directly supports Ollama and OpenAI embeddings.

Keep in mind that if you’re not using a multimodal LLM, only the retrieval aspect (returning matching image and text results) will work. Open-WebUI does support multimodal LLMs, but to get full image+text interaction you need to send the retrieved images along with the prompts or messages to the multimodal model during RAG chat completion.
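For reference, the message shape an OpenAI-compatible multimodal endpoint expects when you attach a retrieved image to the prompt looks roughly like this (a sketch; `image_message` is a hypothetical helper of mine and the bytes are fake, but the `content` list structure with `text` and `image_url` parts is the standard OpenAI-style format):

```python
import base64

def image_message(prompt: str, image_bytes: bytes) -> dict:
    """Build an OpenAI-style multimodal chat message pairing a text
    prompt with a base64 data-URL image (hypothetical helper)."""
    b64 = base64.b64encode(image_bytes).decode()
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {"type": "image_url",
             "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
        ],
    }

# Fake JPEG bytes purely for illustration.
msg = image_message("What is in this retrieved image?", b"\xff\xd8fake-jpeg")
```

You would pass a list of such messages as `messages` in the chat-completion request to whichever multimodal model you have configured.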

Reference: github-starred/open-webui#3332