issue: Docling and Tika are not adding page numbers to vector store document metadata #5216

Closed
opened 2025-11-11 16:14:48 -06:00 by GiteaMirror · 1 comment

Originally created by @sreesdas on GitHub (May 18, 2025).

Check Existing Issues

  • [x] I have searched the existing issues and discussions.
  • [x] I am using the latest version of Open WebUI.

Installation Method

Docker

Open WebUI Version

0.6.9

Ollama Version (if applicable)

0.7.0

Operating System

macOS Sonoma

Browser (if applicable)

No response

Confirmation

  • [x] I have read and followed all instructions in README.md.
  • [x] I am using the latest version of both Open WebUI and Ollama.
  • [ ] I have included the browser console logs.
  • [ ] I have included the Docker container logs.
  • [x] I have listed steps to reproduce the bug in detail.

Expected Behavior

All content extraction engines should store page number metadata in the vector store. Currently Docling and Tika do not, whereas the default engine and Mistral OCR retain page number information.

Actual Behavior

The "page", "page_label", and "total_pages" metadata fields are absent from the vector store when the content extraction engine is set to Docling or Tika.

Steps to Reproduce

  1. Select Docling (or Tika) as the content extraction engine
  2. Upload a document
  3. Query the resulting point in the vector store
  4. Inspect its metadata
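
For step 4, a small check like the one below can make the comparison between engines explicit. This is only a sketch: the helper name `missing_page_fields` and both sample metadata dicts are made up for illustration and are not taken from Open WebUI or from any real vector store payload. Only the three field names come from this report.

```python
# Hypothetical helper for illustration; not part of Open WebUI.
# The three field names are the ones listed under "Actual Behavior".
PAGE_FIELDS = ("page", "page_label", "total_pages")


def missing_page_fields(metadata):
    """Return the page-related keys that are absent from a metadata dict."""
    return [field for field in PAGE_FIELDS if field not in metadata]


# Made-up metadata shaped like the two cases described in this report.
with_pages = {"source": "example.pdf", "page": 0, "page_label": "1", "total_pages": 3}
without_pages = {"source": "example.pdf"}

print(missing_page_fields(with_pages))
print(missing_page_fields(without_pages))
```

An empty list means all three fields are present. Running it on the metadata of a point stored with each engine should show which engines drop the page information.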

Logs & Screenshots

Vector store snapshot with Mistral OCR selected:

Image: https://github.com/user-attachments/assets/1556a96b-5195-4ff9-838e-7d54ada48af8

Vector store snapshot with Docling or Tika selected:

Image: https://github.com/user-attachments/assets/41357825-6552-4c77-8f3f-73c4979006b7

Additional Information

No response

GiteaMirror added the bug label 2025-11-11 16:14:48 -06:00

@tjbck commented on GitHub (May 18, 2025):

Intended behaviour, however PR welcome.

Reference: github-starred/open-webui#5216