[GH-ISSUE #17217] feat: support video in chat messages #18210

Open
opened 2026-04-20 00:25:24 -05:00 by GiteaMirror · 3 comments

Originally created by @wei0623kb on GitHub (Sep 5, 2025).
Original GitHub issue: https://github.com/open-webui/open-webui/issues/17217

Check Existing Issues

  • I have searched the existing issues and discussions.
  • I am using the latest version of Open WebUI.

Installation Method

Pip Install

Open WebUI Version

v0.6.26

Ollama Version (if applicable)

N/A — we use vLLM, version 0.10.1.1.

Operating System

Ubuntu 22.04

Browser (if applicable)

Chrome

Confirmation

  • I have read and followed all instructions in README.md.
  • I am using the latest version of both Open WebUI and Ollama.
  • I have included the browser console logs.
  • I have included the Docker container logs.
  • I have provided every relevant configuration, setting, and environment variable used in my setup.
  • I have clearly listed every relevant configuration, custom setting, environment variable, and command-line option that influences my setup (such as Docker Compose overrides, .env values, browser settings, authentication configurations, etc).
  • I have documented step-by-step reproduction instructions that are precise, sequential, and leave nothing to interpretation. My steps:
  • Start with the initial platform/version/OS and dependencies used,
  • Specify exact install/launch/configure commands,
  • List URLs visited, user input (incl. example values/emails/passwords if needed),
  • Describe all options and toggles enabled or changed,
  • Include any files or environmental changes,
  • Identify the expected and actual result at each stage,
  • Ensure any reasonably skilled user can follow and hit the same issue.

Expected Behavior

The model should be able to analyze the visual content in the video and generate descriptive responses relevant to the actual content of the video.

Actual Behavior

The model's responses do not match the video content at all; they are random and meaningless, showing no understanding of the video.

Steps to Reproduce

pip install vllm==0.10.1.1 flash-attn transformers sentencepiece

pip install open-webui

nohup python -m vllm.entrypoints.openai.api_server \
--model /media/vlm_model/Qwen2.5-VL-72B-Instruct \
--served-model-name Qwen2.5-VL-72B \
--tensor-parallel-size 8 \
--dtype bfloat16 \
--host 0.0.0.0 \
--port 8888 \
--max-model-len 128000 \
--swap-space 16 \
--enable-auto-tool-choice \
--tool-call-parser hermes \
> Qwen2-5-VL-72B.log 2>&1 &

export OPENAI_API_BASE_URL="http://localhost:8888/v1"
export DEFAULT_MODELS="Qwen2.5-VL-72B"
source ~/.bashrc
open-webui serve
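
To isolate whether the failure is in Open WebUI or in the model, the vLLM endpoint can be queried directly. A minimal sketch, assuming vLLM's OpenAI-compatible server accepts a "video_url" content part for video-capable models such as Qwen2.5-VL (the video URL is a placeholder):

# Query vLLM directly, bypassing Open WebUI, to check whether the
# model itself can describe the video.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8888/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="Qwen2.5-VL-72B",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe what happens in this video."},
            # Placeholder URL; vLLM fetches the video server-side.
            {"type": "video_url", "video_url": {"url": "https://example.com/sample.mp4"}},
        ],
    }],
)
print(resp.choices[0].message.content)

If this direct call produces a sensible description while the same video fails through Open WebUI, the problem lies in the chat-message path rather than in the model.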

Logs & Screenshots

Image: https://github.com/user-attachments/assets/de9b35b4-a9f7-4771-a0a8-73e135c0a734

https://github.com/user-attachments/assets/25b0ceb6-fc6d-4710-ab08-08f8e7c4522f

https://github.com/user-attachments/assets/173190c1-41de-4a4a-a8ae-54e35cd864c0

Image: https://github.com/user-attachments/assets/3ac932ff-8a50-43b0-86cc-61415ffe8a5b

Additional Information

I have deployed Qwen2.5-VL-72B with vLLM and Open WebUI, aiming to support understanding and analysis of uploaded video files. In actual use, however, when a video file is uploaded and the model is asked to interpret it, the model fails to understand the video: its responses are unrelated to the actual content and appear random.

GiteaMirror added the bug label 2026-04-20 00:25:24 -05:00

@tjbck commented on GitHub (Sep 5, 2025):

The video gets uploaded to Open WebUI and processed there; it does not get sent to the inference engine. This should be considered a feature request.

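Given that Open WebUI processes the video itself rather than forwarding it, one interim workaround is to sample frames client-side and send them to the model as ordinary images. A minimal sketch, assuming OpenCV is installed and the model can reason over a few ordered frames (sample.mp4 is a hypothetical local file):

# Sample evenly spaced frames from a video and send them as base64
# image_url parts to the same OpenAI-compatible endpoint.
import base64
import cv2
from openai import OpenAI

def sample_frames(path: str, n: int = 4) -> list[str]:
    """Return up to n evenly spaced frames as base64-encoded JPEGs."""
    cap = cv2.VideoCapture(path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    frames = []
    for i in range(n):
        cap.set(cv2.CAP_PROP_POS_FRAMES, i * total // n)
        ok, frame = cap.read()
        if not ok:
            break
        ok, buf = cv2.imencode(".jpg", frame)
        if ok:
            frames.append(base64.b64encode(buf.tobytes()).decode())
    cap.release()
    return frames

client = OpenAI(base_url="http://localhost:8888/v1", api_key="EMPTY")
content = [{"type": "text",
            "text": "These are frames from one video, in order. Describe what happens."}]
content += [
    {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{b64}"}}
    for b64 in sample_frames("sample.mp4")  # hypothetical file path
]
resp = client.chat.completions.create(
    model="Qwen2.5-VL-72B",
    messages=[{"role": "user", "content": content}],
)
print(resp.choices[0].message.content)

This loses temporal resolution compared with true video input, which is why native video support in chat messages would still be valuable.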

@AlvinNorin commented on GitHub (Sep 14, 2025):

I second this very hard


@warshanks commented on GitHub (Sep 21, 2025):

Direct video support would be amazing

Reference: github-starred/open-webui#18210