[PR #14678] feat: Enhance Multi-Modal QA for Uploaded Documents with Docling File Parser and OpenAI-Compatible API #39178

Open
opened 2026-04-25 11:54:03 -05:00 by GiteaMirror · 0 comments
Owner

📋 Pull Request Information

Original PR: https://github.com/open-webui/open-webui/pull/14678
Author: @MingLin-home
Created: 6/4/2025
Status: 🔄 Open

Base: dev ← Head: docling-multi-modal-qa-dev


📝 Commits (2)

  • e143816 Enhance image handling in DoclingLoader and OpenAI router
  • 2756d24 Add conditional image export mode based on environment variable

📊 Changes

2 files changed (+161 additions, -5 deletions)

View changed files

📝 backend/open_webui/retrieval/loaders/main.py (+28 -4)
📝 backend/open_webui/routers/openai.py (+133 -1)

📄 Description

Docling parser now extracts base64-encoded images and layout metadata and appends them to the OpenAI payload, enabling users to query image content and placement via its multimodal API.

Discussion thread: https://github.com/open-webui/open-webui/discussions/14677

Pull Request Checklist

Note to first-time contributors: Please open a discussion post in Discussions and describe your changes before submitting a pull request.

Before submitting, make sure you've checked the following:

  • Target branch: Please verify that the pull request targets the dev branch.
  • Description: Provide a concise description of the changes made in this pull request.
  • Changelog: Ensure a changelog entry following the format of Keep a Changelog is added at the bottom of the PR description.
  • Documentation: Have you updated the relevant documentation (Open WebUI Docs or other documentation sources)?
  • Dependencies: Are there any new dependencies? Have you updated the dependency versions in the documentation?
  • Testing: Have you written and run sufficient tests to validate the changes?
  • Code review: Have you performed a self-review of your code, addressing any coding standard issues and ensuring adherence to the project's coding standards?
  • Prefix: To clearly categorize this pull request, prefix the pull request title using one of the following:
    • BREAKING CHANGE: Significant changes that may affect compatibility
    • build: Changes that affect the build system or external dependencies
    • ci: Changes to our continuous integration processes or workflows
    • chore: Refactor, cleanup, or other non-functional code changes
    • docs: Documentation update or addition
    • feat: Introduces a new feature or enhancement to the codebase
    • fix: Bug fix or error correction
    • i18n: Internationalization or localization changes
    • perf: Performance improvement
    • refactor: Code restructuring for better maintainability, readability, or scalability
    • style: Changes that do not affect the meaning of the code (white space, formatting, missing semi-colons, etc.)
    • test: Adding missing tests or correcting existing tests
    • WIP: Work in progress, a temporary label for incomplete or ongoing work

Changelog Entry

Description

  • [Concisely describe the changes made in this pull request, including any relevant motivation and impact (e.g., fixing a bug, adding a feature, or improving performance)]

This PR enhances the Docling parser to extract images when parsing user-uploaded documents and appends the base64 encodings and metadata of embedded images to the OpenAI-compatible API payload, allowing the API to interpret image content and layout. As a result, users can query the images within their documents, for example identifying image content or counting how many images appear on a specific page.

This PR leverages the built-in multimodal capabilities of OpenAI models to interpret images directly. It does not rely on any external image-to-caption models or require converting images to text.

This PR has been tested in “Full Context Mode” with an OpenAI-compatible API.

Added

  • [List any new features, functionalities, or additions]

  • When using the Docling parser, embedded images are extracted from uploaded files as base64-encoded data, along with their associated metadata.

  • When images are extracted, their base64 encodings and metadata are appended to markdown_content as special comment lines.

  • Before sending data to the OpenAI API, these comment lines are parsed and removed from the final payload.

  • The parsed images' base64 encodings and metadata are included in the OpenAI payload and sent to the OpenAI-compatible API for processing, enabling image QA.
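The comment-line round trip described above can be sketched roughly as follows. This is an illustrative sketch only: the marker syntax (`docling-image`), the helper names, and the image dict keys are assumptions, not the PR's actual identifiers.

```python
import re

# Hypothetical marker format; the PR's actual comment syntax may differ.
IMG_COMMENT = "<!-- docling-image page={page} b64={b64} -->"
IMG_RE = re.compile(r"<!-- docling-image page=(\d+) b64=(\S+) -->\n?")

def embed_images(markdown: str, images: list[dict]) -> str:
    """Append each extracted image to the document text as a comment line."""
    lines = [markdown]
    for img in images:
        lines.append(IMG_COMMENT.format(page=img["page"], b64=img["b64"]))
    return "\n".join(lines)

def extract_images(markdown: str) -> tuple[str, list[dict]]:
    """Parse the comment lines back out and strip them before building
    the final payload, returning (cleaned_text, images)."""
    images = [{"page": int(p), "b64": b} for p, b in IMG_RE.findall(markdown)]
    cleaned = IMG_RE.sub("", markdown)
    return cleaned.rstrip(), images
```

Storing the images inline in markdown_content lets the loader stay compatible with code paths that only expect text, since downstream consumers that don't know about the markers simply see comments.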

Changed

  • [List any changes, updates, refactorings, or optimizations]

Deprecated

  • [List any deprecated functionality or features that have been removed]

Removed

  • [List any removed features, files, or functionalities]

Fixed

  • [List any fixes, corrections, or bug fixes]

Security

  • [List any new or updated security-related changes, including vulnerability fixes]

Breaking Changes

  • BREAKING CHANGE: [List any breaking changes affecting compatibility or functionality]

Additional Information

  • [Insert any additional context, notes, or explanations for the changes]
    • [Reference any related issues, commits, or other relevant information]

Current Behavior

When Docling is used as the file parser and OpenAI serves as the LLM backend, Open-WebUI currently ignores embedded images in uploaded documents. As a result, the OpenAI API does not have access to these images and cannot process them in its responses.

What This PR Changes

This PR addresses the issue by appending extracted image base64 encodings and layout metadata to the document content. With this enhancement, the OpenAI API can access both the image data and its positioning within the document. Users are therefore able to ask questions not only about image content, but also about image layout, for example: “What is in the image on page 2, left-hand side?”
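Concretely, the cleaned document text and the extracted images end up as parts of a single multimodal chat message. The sketch below shows the general shape of such a payload using the standard OpenAI image_url content part with a base64 data URI; the `build_message` helper and the image dict keys are illustrative, not the PR's actual code.

```python
def build_message(text: str, images: list[dict]) -> dict:
    """Combine document text and extracted images into one OpenAI-style
    multimodal chat message (list of content parts)."""
    content = [{"type": "text", "text": text}]
    for img in images:
        # OpenAI-compatible APIs accept inline images as base64 data URIs.
        content.append({
            "type": "image_url",
            "image_url": {"url": f"data:image/png;base64,{img['b64']}"},
        })
    return {"role": "user", "content": content}
```

Because the images travel inside the normal chat payload, no external captioning model is needed; the model's own multimodal capability interprets them.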

To test this PR

  • Start the Docling server, for example: docker run -p 5001:5001 quay.io/docling-project/docling-serve
  • Run export ENABLE_OPENAI_IMAGE_URL=True to enable passing image_url in the payload.
    • Not all OpenAI models support image_url. Known working models: GPT-4.1, GPT-4o.
  • Open Open-WebUI and navigate to Settings → Documents.
  • Select "Docling" as the Context Extraction Engine.
  • Update the Docling server URL (for example, http://localhost:5001/).
  • Enable the “Bypass Embedding and Retrieval” option.
  • Use OpenAI as the LLM backend with your API key.
  • Start a new chat, choose "GPT-4.1" or "GPT-4o".
  • Upload your PDF file, and submit your queries about image content.

Screenshots or Videos

  • [Attach any relevant screenshots or videos demonstrating the changes]

PDF file used in this test:

  • Page 3: https://github.com/user-attachments/assets/9052ca6f-2021-4211-b502-825325cac8fb
  • Page 4: https://github.com/user-attachments/assets/5cc4e4e3-212b-4560-8c4d-9bc8cec683e1

Current behavior (without this PR):
https://github.com/user-attachments/assets/2f35042a-a988-4d40-9560-855f8f7ea46e

With this PR:
https://github.com/user-attachments/assets/f5e63062-256d-4852-a90c-823cf6b40470

Contributor License Agreement

By submitting this pull request, I confirm that I have read and fully agree to the Contributor License Agreement (CLA), and I am providing my contributions under its terms.


🔄 This issue represents a GitHub Pull Request. It cannot be merged through Gitea due to API limitations.


Reference: github-starred/open-webui#39178