[PR #14678] feat: Enhance Multi-Modal QA for Uploaded Documents with Docling File Parser and OpenAI-Compatible API #23548
📋 Pull Request Information
Original PR: https://github.com/open-webui/open-webui/pull/14678
Author: @MingLin-home
Created: 6/4/2025
Status: 🔄 Open
Base: dev
Head: docling-multi-modal-qa-dev

📝 Commits (2)
- e143816 Enhance image handling in DoclingLoader and OpenAI router
- 2756d24 Add conditional image export mode based on environment variable

📊 Changes
2 files changed (+161 additions, -5 deletions)
📝 backend/open_webui/retrieval/loaders/main.py (+28 −4)
📝 backend/open_webui/routers/openai.py (+133 −1)

📄 Description
The Docling parser now extracts base64-encoded images and layout metadata and appends them to the OpenAI payload, enabling users to query image content and placement through the model's multimodal API.
Discussion thread is here
Pull Request Checklist
Note to first-time contributors: Please open a discussion post in Discussions and describe your changes before submitting a pull request.
Before submitting, make sure you've checked the following:
- Target branch: the pull request targets the dev branch.

Changelog Entry
Description
This PR enhances the Docling parser to extract embedded images when parsing user-uploaded documents, and appends each image's base64 encoding and metadata to the payload sent to the OpenAI-compatible API, allowing the model to interpret image content and layout. As a result, users can ask questions about the images in their documents, such as identifying what an image shows or counting how many images appear on a specific page.
This PR leverages the built-in multimodal capabilities of OpenAI models to interpret images directly. It does not rely on any external image-to-caption models or require converting images to text.
This PR has been tested in “Full Context Mode” with OpenAI-compatible API.
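For context, the second commit gates image export behind an environment variable. A minimal sketch of how such a toggle can look when driving the docling library directly is shown below; the variable name DOCLING_EXTRACT_IMAGES is illustrative, and Open WebUI's actual DoclingLoader (which talks to a Docling server) may wire this differently.

```python
import os

from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions
from docling.document_converter import DocumentConverter, PdfFormatOption
from docling_core.types.doc import ImageRefMode

# Illustrative flag name; the PR gates image export behind an
# environment variable, but the exact name is an assumption here.
EXTRACT_IMAGES = os.environ.get("DOCLING_EXTRACT_IMAGES", "false").lower() == "true"

def convert_to_markdown(path: str) -> str:
    opts = PdfPipelineOptions()
    # Keep rendered images of pictures so they can be exported later.
    opts.generate_picture_images = EXTRACT_IMAGES
    converter = DocumentConverter(
        format_options={InputFormat.PDF: PdfFormatOption(pipeline_options=opts)}
    )
    result = converter.convert(path)
    # EMBEDDED inlines each picture as a base64 data URI in the markdown;
    # PLACEHOLDER keeps only a textual marker and drops the image data.
    mode = ImageRefMode.EMBEDDED if EXTRACT_IMAGES else ImageRefMode.PLACEHOLDER
    return result.document.export_to_markdown(image_mode=mode)
```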
Added
- When the Docling parser is used, embedded images are extracted from uploaded files as base64-encoded data, along with their associated metadata.
- When images are extracted, their base64 encodings and metadata are appended to markdown_content as special comment lines.
- Before the request is sent to the OpenAI API, these comment lines are parsed and removed from the final payload.
- Each parsed image's base64 encoding and metadata are then attached to the OpenAI payload and sent to the OpenAI-compatible API, enabling image QA (a sketch of both helpers follows this list).
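Concretely, the round trip can be sketched as below. The marker syntax (open_webui_image) and helper names are assumptions for illustration; the description only says the images travel as special comment lines inside markdown_content.

```python
import base64
import json
import re

# Assumed marker format; the PR only states that images are carried as
# "special comment lines" inside markdown_content.
IMAGE_COMMENT = "<!-- open_webui_image: {payload} -->"
IMAGE_COMMENT_RE = re.compile(r"<!-- open_webui_image: (\{.*?\}) -->")

def append_image_comment(markdown: str, image_bytes: bytes, meta: dict) -> str:
    """Loader side: append one embedded image as a hidden comment line."""
    payload = json.dumps(
        {"b64": base64.b64encode(image_bytes).decode("ascii"), **meta}
    )
    return markdown + "\n" + IMAGE_COMMENT.format(payload=payload)

def extract_image_comments(markdown: str) -> tuple[str, list[dict]]:
    """Router side: parse the comment lines and strip them, returning
    clean text plus one dict per image (base64 data and layout metadata)."""
    images = [json.loads(m) for m in IMAGE_COMMENT_RE.findall(markdown)]
    clean = IMAGE_COMMENT_RE.sub("", markdown).strip()
    return clean, images
```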
Changed
Deprecated
Removed
Fixed
Security
Breaking Changes
Additional Information
Current Behavior
When Docling is used as the file parser and OpenAI serves as the LLM backend, Open-WebUI currently ignores embedded images in uploaded documents. As a result, the OpenAI API does not have access to these images and cannot process them in its responses.
What This PR Changes
This PR addresses the issue by appending extracted image base64 encodings and layout metadata to the document content. With this enhancement, the OpenAI API can now access both the image data and its positioning within the document. Users are therefore able to ask questions not only about image content, but also about image layout—for example: “What is in the image on page 2, left-hand side?”
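For reference, a message in the OpenAI chat-completions format that carries both the document text and the extracted images might be assembled as below. This is a sketch of the payload shape, not the PR's actual code in backend/open_webui/routers/openai.py; the metadata keys (page, bbox) and the PNG media type are assumptions.

```python
def build_multimodal_message(clean_text: str, images: list[dict]) -> dict:
    """Assemble one user message with standard OpenAI content parts."""
    content = [{"type": "text", "text": clean_text}]
    for img in images:
        # Surface layout metadata as text so the model can answer
        # placement questions like "the image on page 2, left side".
        content.append({
            "type": "text",
            "text": f"Embedded image (page {img.get('page')}, bbox {img.get('bbox')}):",
        })
        content.append({
            # Assumes the image was exported as PNG.
            "type": "image_url",
            "image_url": {"url": f"data:image/png;base64,{img['b64']}"},
        })
    return {"role": "user", "content": content}
```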
To test this PR
Screenshots or Videos
PDF file used in this test: (screenshots of pages 3 and 4)
Current behavior (without this PR): (screenshot)
With this PR: (screenshot)
Contributor License Agreement
By submitting this pull request, I confirm that I have read and fully agree to the Contributor License Agreement (CLA), and I am providing my contributions under its terms.