[PR #15902] [CLOSED] feat: Add configurable API URL and additional_config for Datalab Marker API Doc Parser #62834

Closed
opened 2026-05-06 07:15:04 -05:00 by GiteaMirror · 0 comments

📋 Pull Request Information

Original PR: https://github.com/open-webui/open-webui/pull/15902
Author: @Hisma
Created: 7/21/2025
Status: Closed

Base: dev ← Head: marker-api-content-extraction


📝 Commits (8)

  • 5fbfe2b Merge pull request #15879 from open-webui/dev
  • b234b20 Update catalan translation.json
  • 241d2c8 refac: memory handling
  • 413b19a fix: dev.sh
  • ba5b554 refac/fix: channel messages
  • 5e91375 feat: add additional_config parameter
  • bcc9e43 feat: add datalab_marker_api_base_url feature
  • 32f290f fix: url not being passed through backend flow

📊 Changes

11 files changed (+197 additions, -118 deletions)


📝 backend/dev.sh (+1 -1)
📝 backend/open_webui/config.py (+10 -4)
📝 backend/open_webui/main.py (+4 -2)
📝 backend/open_webui/models/memories.py (+13 -4)
📝 backend/open_webui/retrieval/loaders/datalab_marker.py (+9 -5)
📝 backend/open_webui/retrieval/loaders/main.py (+6 -1)
📝 backend/open_webui/routers/channels.py (+18 -14)
📝 backend/open_webui/routers/memories.py (+4 -0)
📝 backend/open_webui/routers/retrieval.py (+16 -8)
📝 src/lib/components/admin/Settings/Documents.svelte (+52 -15)
📝 src/lib/i18n/locales/ca-ES/translation.json (+64 -64)

📄 Description

Before submitting, make sure you've checked the following:

  • [x] Target branch: Please verify that the pull request targets the dev branch.
  • [x] Description: Provide a concise description of the changes made in this pull request.
  • [x] Changelog: Ensure a changelog entry following the format of Keep a Changelog is added at the bottom of the PR description.
  • [ ] Documentation: Have you updated relevant documentation Open WebUI Docs, or other documentation sources?
  • [ ] Dependencies: Are there any new dependencies? Have you updated the dependency versions in the documentation?
  • [x] Testing: Have you written and run sufficient tests to validate the changes?
  • [x] Code review: Have you performed a self-review of your code, addressing any coding standard issues and ensuring adherence to the project's coding standards?
  • [x] Prefix: To clearly categorize this pull request, prefix the pull request title using one of the following:

Changelog Entry

Description

Reference issue - https://github.com/open-webui/open-webui/issues/13137#issuecomment-3014486758

This PR enhances the Datalab Marker API integration in two ways: it adds a configurable API URL so that self-hosted Marker API instances can be used, and it replaces the deprecated language selection feature with the new additional_config parameter. Users can now specify a custom Datalab Marker API endpoint and control the Marker API's current processing options.

Added

  • Configurable API Base URL field allowing users to specify custom Datalab Marker API endpoints
    • Defaults to https://www.datalab.to/api/v1/marker when left empty
    • Includes tooltip showing the default endpoint
  • Additional Config field for Datalab Marker API supporting all documented configuration options:
    • disable_links, keep_pageheader_in_output, keep_pagefooter_in_output
    • filter_blank_pages, drop_repeated_text, layout_coverage_threshold
    • merge_threshold, height_tolerance, gap_threshold, image_threshold
    • min_line_length, level_count, default_level
  • JSON validation for additional_config input field
  • Fallback logic for empty API base URL handling
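
The fallback and JSON validation described in the list above could look roughly like the following. This is a minimal sketch, not the PR's actual code: the function names `resolve_api_base_url` and `parse_additional_config` are hypothetical, and rejecting non-object JSON is an assumption rather than something the description states.

```python
import json
from typing import Any, Optional

# Default endpoint quoted in the PR description.
DEFAULT_MARKER_API_URL = "https://www.datalab.to/api/v1/marker"


def resolve_api_base_url(configured: Optional[str]) -> str:
    """Return the configured URL, or the default when it is empty or blank."""
    url = (configured or "").strip()
    return url if url else DEFAULT_MARKER_API_URL


def parse_additional_config(raw: Optional[str]) -> dict[str, Any]:
    """Parse the Additional Config field; an empty field means no overrides."""
    if raw is None or not raw.strip():
        return {}
    # json.loads raises json.JSONDecodeError (a ValueError) on malformed input.
    parsed = json.loads(raw)
    # Assumption: only a JSON object is meaningful as a set of options.
    if not isinstance(parsed, dict):
        raise ValueError("additional_config must be a JSON object")
    return parsed
```

Under these assumptions an empty URL field resolves to the default endpoint, and input such as `[1]` or `{not json` is rejected before any request is sent.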

Changed

  • Use LLM tooltip now correctly shows "Defaults to False" instead of "Defaults to True"
  • API URL handling now validates the configured URL and falls back to the default endpoint when the field is empty

Deprecated

  • Language selection feature for Datalab Marker API (replaced by additional_config)

Removed

  • Deprecated language selection UI components and backend handling

Security

  • Input validation for additional_config JSON parameter
  • Proper URL validation and sanitization for custom API endpoints
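
As an illustration of what URL validation for a custom endpoint might involve, here is a small sketch using the standard library. The helper name `is_valid_api_url` and the rule it applies (http or https scheme plus a non-empty host) are assumptions; the description does not say which checks the PR performs.

```python
from urllib.parse import urlsplit


def is_valid_api_url(url: str) -> bool:
    """Accept only http(s) URLs that include a host component."""
    try:
        parts = urlsplit(url.strip())
    except ValueError:
        # urlsplit raises ValueError for some malformed inputs (e.g. bad IPv6).
        return False
    return parts.scheme in ("http", "https") and bool(parts.netloc)
```

A check like this rejects values such as `javascript:alert(1)` or a bare path, while accepting the default Datalab endpoint and typical self-hosted addresses.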

Additional Information

  • Enhancement: Users can now configure custom Datalab Marker API endpoints for enterprise/self-hosted deployments
  • Replaces the deprecated language feature with the more comprehensive additional_config parameter, per the Datalab API documentation
  • Maintains backward compatibility - existing configurations will continue to work
  • Tested in development environment using Docker container with successful document processing
  • Default API endpoint: https://www.datalab.to/api/v1/marker (can be overridden with self-hosted API)
  • Configuration flow: Frontend → Backend Router → Config → Main Loader → Datalab Marker Loader
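
The configuration flow in the last bullet can be pictured as each layer forwarding the same two values to the next. The sketch below is illustrative only: `MarkerSettings`, `MarkerLoaderSketch`, `update_settings`, and `build_loader` are hypothetical stand-ins, not names from the Open WebUI codebase.

```python
from dataclasses import dataclass

# Default endpoint quoted in the PR description.
DEFAULT_MARKER_API_URL = "https://www.datalab.to/api/v1/marker"


@dataclass
class MarkerSettings:
    """Hypothetical stand-in for the persisted config values."""
    api_base_url: str = ""
    additional_config: str = ""


@dataclass
class MarkerLoaderSketch:
    """Hypothetical stand-in for the Datalab Marker loader."""
    api_base_url: str
    additional_config: str


def update_settings(settings: MarkerSettings, form: dict) -> MarkerSettings:
    """Router step: copy submitted form values into the config."""
    return MarkerSettings(
        api_base_url=form.get("api_base_url", settings.api_base_url),
        additional_config=form.get("additional_config", settings.additional_config),
    )


def build_loader(settings: MarkerSettings) -> MarkerLoaderSketch:
    """Main loader step: forward both values, applying the URL fallback."""
    return MarkerLoaderSketch(
        api_base_url=settings.api_base_url.strip() or DEFAULT_MARKER_API_URL,
        additional_config=settings.additional_config,
    )
```

If any layer omits one of the values, the loader silently falls back to the default, which is the kind of problem the commit "fix: url not being passed through backend flow" appears to address.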

Key Enhancements:

  1. Custom API URL Configuration - Supports enterprise/self-hosted Datalab Marker instances
  2. Comprehensive Parameter Control - Full access to all Datalab API configuration options
  3. Robust Error Handling - Proper fallbacks and validation throughout the stack

Testing Environment:

  • Built with: docker build --build-arg USE_CUDA=true --build-arg USE_CUDA_VER=cu121 --build-arg USE_OLLAMA=false -t openwebui-custom:dev .
  • Container available at: docker.io/hisma/openwebui:dev
  • Successfully tested document upload and processing with custom additional_config parameters and a configurable API URL

Backend Logs Confirm Success:

INFO | open_webui.retrieval.loaders.datalab_marker:load:104 - Datalab Marker POST request parameters: {'filename': '36d5ee52-a5b4-47b0-bd04-cec9c821276a_81250143.pdf', 'mime_type': 'application/pdf', **{'use_llm': 'true', 'skip_cache': 'false', 'force_ocr': 'false', 'paginate': 'false', 'strip_existing_ocr': 'false', 'disable_image_extraction': 'false', 'output_format': 'markdown', 'additional_config': '{"keep_pageheader_in_output": true, "keep_pagefooter_in_output": true}'}}

INFO | open_webui.retrieval.loaders.datalab_marker:load:173 - Marker processing completed successfully
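
The logged parameters show booleans sent as lowercase strings and additional_config sent as a JSON string. A minimal sketch of how such form fields might be assembled follows. The function names are hypothetical, and the defaults other than use_llm (which the PR states defaults to False) are assumptions.

```python
import json
from typing import Any, Optional


def to_form_bool(value: bool) -> str:
    """Serialize a boolean the way it appears in the logged request."""
    return "true" if value else "false"


def build_form_fields(
    use_llm: bool = False,
    skip_cache: bool = False,
    force_ocr: bool = False,
    paginate: bool = False,
    strip_existing_ocr: bool = False,
    disable_image_extraction: bool = False,
    output_format: str = "markdown",
    additional_config: Optional[dict[str, Any]] = None,
) -> dict[str, str]:
    """Build string-valued form fields matching the shape seen in the log."""
    fields = {
        "use_llm": to_form_bool(use_llm),
        "skip_cache": to_form_bool(skip_cache),
        "force_ocr": to_form_bool(force_ocr),
        "paginate": to_form_bool(paginate),
        "strip_existing_ocr": to_form_bool(strip_existing_ocr),
        "disable_image_extraction": to_form_bool(disable_image_extraction),
        "output_format": output_format,
    }
    # Assumption: the field is omitted entirely when no overrides are set.
    if additional_config:
        fields["additional_config"] = json.dumps(additional_config)
    return fields
```

Calling `build_form_fields(use_llm=True, additional_config={"keep_pageheader_in_output": True, "keep_pagefooter_in_output": True})` yields the same `use_llm` and `additional_config` values as the log line above.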

Screenshots

[Screenshot: https://github.com/user-attachments/assets/71d30b37-e0bb-4e91-9bfe-8a9372b1d8aa]

Contributor License Agreement

By submitting this pull request, I confirm that I have read and fully agree to the Contributor License Agreement (CLA), and I am providing my contributions under its terms.


🔄 This issue represents a GitHub Pull Request. It cannot be merged through Gitea due to API limitations.

GiteaMirror added the pull-request label 2026-05-06 07:15:04 -05:00
Reference: github-starred/open-webui#62834