[PR #20809] [CLOSED] fix: Docling page number extraction for citations #41421

Closed
opened 2026-04-25 13:40:26 -05:00 by GiteaMirror · 0 comments

📋 Pull Request Information

Original PR: https://github.com/open-webui/open-webui/pull/20809
Author: @jannikstdl
Created: 1/20/2026
Status: Closed

Base: main ← Head: fix/docling-page-extraction


📝 Commits (10+)

  • b464b48 Merge pull request #20581 from Classic298/fix/db-pool-memory-update
  • 3fc8661 fix(db): CRITICAL - prevent pool exhaustion in memory /reset (#20580)
  • 182d5e8 fix(db): release connection before embedding in process_files_batch (#20576)
  • 826e9ab fix(db): release connection before embeddings in knowledge /metadata/reindex (#20577)
  • 2426257 fix(db): release connection before embedding in memory /add (#20578)
  • d0c2bfd fix(db): release connection before LLM call in OpenAI /chat/completions (#20572)
  • 0b5aa6d fix(db): release connection before LLM call in Ollama /api/chat (#20571)
  • 2faab40 i18n(pl-PL): Add missing keys and update existing translations (#20562)
  • 84263fc i18n: Updated the Catalan translation file (#20566)
  • 24044b4 fix(db): release connection before LLM call in Ollama /v1/chat/completions (#20569)

📊 Changes

43 files changed (+1073 additions, -760 deletions)

View changed files

📝 backend/open_webui/config.py (+4 -0)
📝 backend/open_webui/env.py (+7 -1)
📝 backend/open_webui/models/groups.py (+7 -6)
📝 backend/open_webui/models/knowledge.py (+3 -0)
📝 backend/open_webui/models/models.py (+3 -0)
📝 backend/open_webui/retrieval/loaders/main.py (+35 -5)
📝 backend/open_webui/retrieval/vector/dbs/weaviate.py (+11 -3)
📝 backend/open_webui/routers/auths.py (+4 -0)
📝 backend/open_webui/routers/channels.py (+2 -6)
📝 backend/open_webui/routers/files.py (+2 -0)
📝 backend/open_webui/routers/knowledge.py (+22 -13)
📝 backend/open_webui/routers/memories.py (+24 -8)
📝 backend/open_webui/routers/models.py (+2 -4)
📝 backend/open_webui/routers/ollama.py (+15 -9)
📝 backend/open_webui/routers/openai.py (+5 -3)
📝 backend/open_webui/routers/retrieval.py (+6 -2)
📝 backend/open_webui/routers/users.py (+2 -4)
📝 backend/open_webui/tools/builtin.py (+180 -1)
📝 backend/open_webui/utils/auth.py (+2 -1)
📝 backend/open_webui/utils/tools.py (+11 -0)

...and 23 more files

📄 Description

Summary

  • Fix page number extraction from Docling-processed PDFs so CitationModal displays page numbers
  • Use the md_page_break_placeholder parameter to split markdown content by page while preserving formatting

Problem

Page numbers from Docling-processed PDFs were not showing in CitationModal because the markdown output had no page boundary information.

Solution

Request Docling to insert page break markers (<!-- DOCLING_PAGE_BREAK -->) between pages in the markdown output, then split on those markers to create one document per page, each carrying its page number in the page metadata field.

This preserves:

  • Markdown formatting (headers, lists, etc.) for the Markdown Header Text Splitter
  • Page numbers in metadata for CitationModal display
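
The split step described above can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: the `PageDocument` class and the `split_markdown_by_page` function are hypothetical names, the marker string is the one quoted in the description, and 1-based page numbering is an assumption (the description does not say which base CitationModal expects, so it is exposed as a parameter).

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Marker text as quoted in the PR description. How it is passed to Docling
# (via md_page_break_placeholder) is not shown in this sketch.
PAGE_BREAK = "<!-- DOCLING_PAGE_BREAK -->"


@dataclass
class PageDocument:
    """Hypothetical stand-in for the loader's document type."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def split_markdown_by_page(
    markdown: str,
    base_metadata: Optional[dict] = None,
    first_page: int = 1,  # assumption: pages are numbered from 1
) -> List[PageDocument]:
    """Split Docling markdown on the page-break marker, one document per page."""
    base_metadata = base_metadata or {}
    docs = []
    for offset, chunk in enumerate(markdown.split(PAGE_BREAK)):
        text = chunk.strip()
        if not text:
            # Skip empty pages, but keep counting so later pages stay aligned
            # with the source PDF.
            continue
        docs.append(
            PageDocument(text, {**base_metadata, "page": first_page + offset})
        )
    return docs
```

Because each chunk is left as markdown, headers and lists inside a page remain available to a downstream header-based splitter, while the page value travels with the chunk in its metadata.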

Test plan

  • Upload a multi-page PDF with the Docling engine enabled
  • Verify citations show "(page X)" in CitationModal
  • Verify markdown formatting is preserved (headers work with Markdown Header Text Splitter)

🔄 This issue represents a GitHub Pull Request. It cannot be merged through Gitea due to API limitations.

GiteaMirror added the pull-request label 2026-04-25 13:40:26 -05:00
Reference: github-starred/open-webui#41421