[PR #14311] [MERGED] feat: Marker api content extraction support #46488

Closed
opened 2026-04-29 21:19:22 -05:00 by GiteaMirror · 0 comments
📋 Pull Request Information

Original PR: https://github.com/open-webui/open-webui/pull/14311
Author: @Hisma
Created: 5/25/2025
Status: Merged
Merged: 5/28/2025
Merged by: @tjbck

Base: dev ← Head: marker-api-content-extraction


📝 Commits (5)

  • 9faa4c6 Merge pull request #14194 from open-webui/dev
  • b8e1621 Merge pull request #14364 from open-webui/dev
  • a9405cc feat: Marker api content extraction support
  • e12a79c fix: handle json output format correctly
  • 19bb358 fix: add Datalab Marker API to Content Extraction Engine Dropdown

📊 Changes

6 files changed (+517 additions, -1 deletions)

View changed files

📝 backend/open_webui/config.py (+54 -0)
📝 backend/open_webui/main.py (+18 -0)
➕ backend/open_webui/retrieval/loaders/datalab_marker_loader.py (+200 -0)
📝 backend/open_webui/retrieval/loaders/main.py (+18 -1)
📝 backend/open_webui/routers/retrieval.py (+81 -0)
📝 src/lib/components/admin/Settings/Documents.svelte (+146 -0)

📄 Description

Pull Request Checklist

Note to first-time contributors: Please open a discussion post in Discussions and describe your changes before submitting a pull request.

Before submitting, make sure you've checked the following:

  • Target branch: Please verify that the pull request targets the dev branch.

  • Description: Provide a concise description of the changes made in this pull request.

  • Changelog: Ensure a changelog entry following the format of Keep a Changelog is added at the bottom of the PR description.

  • Documentation: Have you updated relevant documentation Open WebUI Docs, or other documentation sources? - Detailed documentation is attached.
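    Datalab_Marker_API_Quick_Reference.md (https://github.com/user-attachments/files/20451322/Datalab_Marker_API_Quick_Reference.md)
    Datalab_Marker_API_User_Guide.md (https://github.com/user-attachments/files/20451323/Datalab_Marker_API_User_Guide.md)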

  • Dependencies: Are there any new dependencies? Have you updated the dependency versions in the documentation? - no

  • Testing: Have you written and run sufficient tests to validate the changes? - tested in dev container - docker.io/hisma/openwebui:dev

  • Code review: Have you performed a self-review of your code, addressing any coding standard issues and ensuring adherence to the project's coding standards? yes

  • Prefix: To clearly categorize this pull request, prefix the pull request title using one of the following:

    • BREAKING CHANGE: Significant changes that may affect compatibility
    • build: Changes that affect the build system or external dependencies
    • ci: Changes to our continuous integration processes or workflows
    • chore: Refactor, cleanup, or other non-functional code changes
    • docs: Documentation update or addition
    • feat: Introduces a new feature or enhancement to the codebase
    • fix: Bug fix or error correction
    • i18n: Internationalization or localization changes
    • perf: Performance improvement
    • refactor: Code restructuring for better maintainability, readability, or scalability
    • style: Changes that do not affect the meaning of the code (white space, formatting, missing semi-colons, etc.)
    • test: Adding missing tests or correcting existing tests
    • WIP: Work in progress, a temporary label for incomplete or ongoing work

Changelog Entry

Description

  • Marker is one of the most powerful open-source PDF parsers available, reportedly even beating Mistral OCR.
    Marker can optionally use Google Gemini Flash to perform PDF OCR, which is enabled via a "Use LLM" toggle. The user can also select multiple OCR languages via a multi-select window and choose the document output format (markdown, json, or html). Other optional features can be toggled on or off, such as force_ocr, paginate, strip_existing_ocr, disable_image_extraction, and skip_cache, giving the user a lot of flexibility over the content extraction.

The Marker repo is here:
https://github.com/VikParuchuri/marker
This feature uses Marker's official hosted API to access the Marker OCR engine:
https://www.datalab.to/

This addon specifically implements the Marker API:
https://www.datalab.to/app/docs#marker
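
As a rough illustration of the options listed in the description, here is a minimal sketch of how the toggles, languages, and output format could be gathered into form fields for a request. The option names come from the description; the `MarkerOptions` and `build_form_fields` names, the `langs` field name, the comma-joined language list, the "true"/"false" string encoding, and the markdown default are assumptions for illustration, not the code merged in this PR or a confirmed API contract.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Output formats named in the PR description.
ALLOWED_OUTPUT_FORMATS = {"markdown", "json", "html"}


@dataclass
class MarkerOptions:
    """Hypothetical container for the settings exposed in the admin UI."""

    langs: List[str] = field(default_factory=list)
    use_llm: bool = False
    force_ocr: bool = False
    paginate: bool = False
    strip_existing_ocr: bool = False
    disable_image_extraction: bool = False
    skip_cache: bool = False
    output_format: str = "markdown"  # assumed default


def build_form_fields(opts: MarkerOptions) -> Dict[str, str]:
    """Turn the options into string form fields (encoding is an assumption)."""
    if opts.output_format not in ALLOWED_OUTPUT_FORMATS:
        raise ValueError(f"unsupported output format: {opts.output_format!r}")

    def flag(value: bool) -> str:
        return "true" if value else "false"

    fields = {
        "output_format": opts.output_format,
        "use_llm": flag(opts.use_llm),
        "force_ocr": flag(opts.force_ocr),
        "paginate": flag(opts.paginate),
        "strip_existing_ocr": flag(opts.strip_existing_ocr),
        "disable_image_extraction": flag(opts.disable_image_extraction),
        "skip_cache": flag(opts.skip_cache),
    }
    # Only send a language list when the user selected at least one language.
    if opts.langs:
        fields["langs"] = ",".join(opts.langs)
    return fields
```

Validating the output format before building the fields keeps an unsupported value from reaching the hosted service, which is one plausible way to surface configuration mistakes early.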

Added

  • Marker API support for RAG document content extraction. Uses the official hosted version of Marker, datalab.to. The user signs up on Datalab and creates an API key. Then, in Open WebUI, select "Datalab Marker API" from the "Content Extraction Engine" list, enter the API key, and press Save.
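
The setup steps above imply that the Marker engine is only usable once an API key has been entered. A minimal sketch of that check follows, assuming hypothetical engine identifiers (`"datalab_marker"`, `"default"`) and a hypothetical `pick_extraction_engine` helper; the merged loader code may be structured differently.

```python
from typing import Optional

# Hypothetical identifiers, used here for illustration only.
MARKER_ENGINE = "datalab_marker"
FALLBACK_ENGINE = "default"


def pick_extraction_engine(selected: str, api_key: Optional[str]) -> str:
    """Return the engine to use, falling back when Marker has no API key."""
    if selected == MARKER_ENGINE:
        # Marker is a hosted service, so a non-empty key is required.
        if api_key and api_key.strip():
            return MARKER_ENGINE
        return FALLBACK_ENGINE
    # Any other engine is passed through unchanged.
    return selected
```

Falling back rather than raising is a design choice in this sketch: document uploads keep working even if the key was never saved.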

Changed

  • 6 files changed:
  • backend/open_webui/config.py
  • backend/open_webui/main.py
  • backend/open_webui/retrieval/loaders/datalab_marker_loader.py
  • backend/open_webui/retrieval/loaders/main.py
  • backend/open_webui/routers/retrieval.py
  • src/lib/components/admin/Settings/Documents.svelte

Deprecated

  • none

Removed

  • none

Fixed

  • none

Security

  • none

Breaking Changes

  • BREAKING CHANGE: none

Additional Information

  • This is designed to be an "out-of-the-box" solution for those who want to use Marker without having to deal with complex setup and integration into OWUI.

Screenshots or Videos

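image: https://github.com/user-attachments/assets/af5c182a-3788-46f7-afa9-8b4f00aff937

image: https://github.com/user-attachments/assets/6851d750-79b4-4152-b376-69d3786fa494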

Contributor License Agreement

By submitting this pull request, I confirm that I have read and fully agree to the Contributor License Agreement (CLA), and I am providing my contributions under its terms.


🔄 This issue represents a GitHub Pull Request. It cannot be merged through Gitea due to API limitations.

GiteaMirror added the pull-request label 2026-04-29 21:19:22 -05:00

Reference: github-starred/open-webui#46488