[GH-ISSUE #21821] issue: Qdrant points payload text "eating" spaces #58247

Closed
opened 2026-05-05 22:39:30 -05:00 by GiteaMirror · 1 comment
Owner

Originally created by @Atrocraz on GitHub (Feb 24, 2026).
Original GitHub issue: https://github.com/open-webui/open-webui/issues/21821

Check Existing Issues

  • I have searched for any existing and/or related issues.
  • I have searched for any existing and/or related discussions.
  • I have also searched in the CLOSED issues AND CLOSED discussions and found no related items (your issue might already be addressed on the development branch!).
  • I am using the latest version of Open WebUI.

Installation Method

Docker

Open WebUI Version

0.8.5

Ollama Version (if applicable)

No response

Operating System

Ubuntu 22.04

Browser (if applicable)

No response

Confirmation

  • I have read and followed all instructions in README.md.
  • I am using the latest version of both Open WebUI and Ollama.
  • I have included the browser console logs.
  • I have included the Docker container logs.
  • I have provided every relevant configuration, setting, and environment variable used in my setup.
  • I have clearly listed every relevant configuration, custom setting, environment variable, and command-line option that influences my setup (such as Docker Compose overrides, .env values, browser settings, authentication configurations, etc).
  • I have documented step-by-step reproduction instructions that are precise, sequential, and leave nothing to interpretation. My steps:
      • Start with the initial platform/version/OS and dependencies used,
      • Specify exact install/launch/configure commands,
      • List URLs visited, user input (incl. example values/emails/passwords if needed),
      • Describe all options and toggles enabled or changed,
      • Include any files or environmental changes,
      • Identify the expected and actual result at each stage,
      • Ensure any reasonably skilled user can follow and hit the same issue.

Expected Behavior

After loading a file into a knowledge base, the text in both the payload and the embeddings remains mostly untouched and readable.

Actual Behavior

For some reason, the text field in the payload (and, I suppose, in the embeddings too) loses some spaces.

Steps to Reproduce

Load a .docx or .pdf file (I didn't test other formats) into a knowledge base via the OWUI GUI.

Logs & Screenshots

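Qdrant point:
Image: https://github.com/user-attachments/assets/65bf785c-2097-4d47-8f59-a814a95d4224

Docx file with the same chunk:
Image: https://github.com/user-attachments/assets/ccf82915-26ea-4e2c-bd12-af4d86841581

OWUI knowledge file:
Image: https://github.com/user-attachments/assets/595ccc12-d428-4af0-93f5-3dab37542d9f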

Additional Information

I tried nomic-embed v1.5 and v2.0, but I doubt it's an embedding model problem, since the payload itself contains the corrupted text.

At the same time, the same text in the knowledge GUI remains unchanged.
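
One way to confirm that only whitespace is being lost (rather than characters being altered) is to compare a chunk copied from the source document with the text copied from the Qdrant payload. The sketch below is a hypothetical helper, not part of Open WebUI or Qdrant; the function names and the sample strings are invented for illustration and are not taken from the screenshots above.

```python
import re


def strip_whitespace(text: str) -> str:
    """Remove every whitespace character so only the visible characters remain."""
    return re.sub(r"\s+", "", text)


def count_whitespace_runs(text: str) -> int:
    """Count runs of consecutive whitespace (each run separates two words)."""
    return len(re.findall(r"\s+", text))


def compare_chunk(source_text: str, payload_text: str) -> dict:
    """Report whether two texts differ only in whitespace, and by how many runs."""
    return {
        "same_non_whitespace": strip_whitespace(source_text) == strip_whitespace(payload_text),
        "lost_whitespace_runs": count_whitespace_runs(source_text) - count_whitespace_runs(payload_text),
    }


# Invented sample strings standing in for the source chunk and the stored payload.
source = "After loading the file the text stays readable."
payload = "After loadingthe file thetext stays readable."

print(compare_chunk(source, payload))
```

If `same_non_whitespace` is true and `lost_whitespace_runs` is positive, the stored text differs from the source only by missing separators, which would point at the extraction or chunking step rather than at the embedding model.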

GiteaMirror added the bug label 2026-05-05 22:39:30 -05:00
Author
Owner

@Atrocraz commented on GitHub (Feb 24, 2026):

Well, I guess the thing is that, for whatever reason, even plain-text Word files are being processed via OCR/Tika/etc., which makes the initial text severely worse and hurts BM25 a lot.

Switching to .txt files fixes the problem; however, I wouldn't call it a long-term solution, because users will rarely upload .txt files.

Is there any other way to mitigate this problem?
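
To illustrate why missing spaces would hurt BM25 in particular: lexical ranking relies on exact term matches, so words fused together by extraction no longer match the query terms. The sketch below assumes a simple word tokenizer; the tokenizer that Open WebUI's search actually uses is not shown in this thread, and the function names and sample sentences are hypothetical.

```python
import re


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into word tokens (a simplifying assumption)."""
    return re.findall(r"\w+", text.lower())


def matched_terms(query: str, document: str) -> list[str]:
    """Return the query terms that appear as exact tokens in the document."""
    doc_terms = set(tokenize(document))
    return [term for term in tokenize(query) if term in doc_terms]


# Invented examples: the same sentence with and without its spaces intact.
clean = "The vector store keeps the chunk text in the payload."
fused = "The vectorstore keeps thechunk text in thepayload."
query = "vector store payload"

print(matched_terms(query, clean))  # ['vector', 'store', 'payload']
print(matched_terms(query, fused))  # []
```

With the fused text, none of the query terms match any document token, so a term-based scorer has nothing to rank on, while an embedding model may still tolerate the damage to some degree.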


Reference: github-starred/open-webui#58247