mirror of
https://github.com/open-webui/open-webui.git
synced 2026-05-27 03:48:37 -05:00
Content of documents with markdowns end up stripped out and with missing content #1363
Originally created by @gaspardpetit on GitHub (Jun 25, 2024).
Bug Report
Description
Hello, this may not be the right place to post this issue, as I imagine the problem is in an underlying dependency, but please let me know and I'll be happy to forward it to the right place. In the meantime, it might be worth warning users about this issue.
When importing documents with markdown, some text will be completely stripped out from the processed document, leading to the documents never being identified as containing the answer or never providing the answer.
Bug Summary:
Upon importing a document containing markdown, content may end up being stripped out.
Steps to Reproduce:
nomic-embed-text
In my tests, this was done against llama3 and qwen2.
Expected Behavior:
All three quotes should be provided.
Actual Behavior:
Only two are provided, one is missing, ex.:
Additionally, if you hover over the referenced document in the quote, you will find that the chunk found is actually missing the second quote:
It appears that any quote in a markdown document starting with ">" and followed by 5 spaces will be stripped out.
Environment
Open WebUI Version: v0.3.5
Ollama (if applicable): 0.1.45
Operating System: Windows 11 (ollama) + Docker 26.1.4 running on Ubuntu 22.04.4 LTS (Open WebUI)
Browser (if applicable): Edge 126.0.2592.68
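To check a document for the pattern described above before importing it, a minimal sketch follows. It only flags quote lines where ">" is followed by five or more spaces; the regular expression and the helper name `find_suspect_quotes` are illustrative assumptions, not part of Open WebUI or unstructured, and the real trigger inside the parsing library may be broader or narrower. One possible explanation, not confirmed in this thread: in CommonMark, the ">" marker absorbs one optional space, so five spaces leave four, which is the indentation of an indented code block rather than ordinary quoted text.

```python
import re

# A blockquote marker followed by at least five spaces before the text.
# Assumption: this mirrors the pattern described in the report, nothing more.
SUSPECT_QUOTE = re.compile(r"^[ ]{0,3}>[ ]{5,}\S")


def find_suspect_quotes(markdown_text):
    """Return (line_number, line) pairs for quote lines matching the pattern."""
    return [
        (number, line)
        for number, line in enumerate(markdown_text.splitlines(), start=1)
        if SUSPECT_QUOTE.match(line)
    ]


# Three quotes; only the second has five spaces after the marker.
sample = "\n".join([
    "> First quote",
    ">" + " " * 5 + "Second quote",
    "> Third quote",
])

suspects = find_suspect_quotes(sample)
for number, line in suspects:
    print(f"line {number}: {line!r}")
```

Collapsing the extra spaces in any flagged line to a single space before import is a possible workaround, though that has not been verified here.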
Reproduction Details
Confirmation:
Logs and Screenshots
Browser Console Logs:
Logs from the browser:
Docker Container Logs:
Backend logs:
Notice how the document is already stripped in the INFO:apps.rag.main:store_data_in_vector_db log, where it is printed without the second quote (and without the markdown formatting):
Screenshots (if applicable):
@tjbck commented on GitHub (Jun 25, 2024):
Truly bizarre issue. I was able to reproduce it, but unfortunately nothing much can be done on our side. You might want to raise this issue with unstructured or langchain_community, as we use their libraries directly for our default RAG pipeline. Keep us updated if you manage to find the solution!
@gaspardpetit commented on GitHub (Jun 27, 2024):
The bug has been opened with unstructured:
https://github.com/Unstructured-IO/unstructured/issues/3309
@gaspardpetit commented on GitHub (Jul 21, 2024):
A new version of unstructured was released which should now fix this issue: https://github.com/Unstructured-IO/unstructured/releases/tag/0.15.0
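To see whether a given environment already has that release, a small sketch using only the standard library follows. It assumes the package is installed under the distribution name `unstructured` and compares version numbers only; it does not confirm that the stripped-quote behaviour is actually gone. The helper names are hypothetical.

```python
import re
from importlib import metadata

# Release that the comment above says should contain the fix.
FIXED_IN = (0, 15, 0)


def parse_version(text):
    """Turn a string such as '0.15.0' into a tuple of three integers."""
    numbers = []
    for piece in text.split(".")[:3]:
        match = re.match(r"\d+", piece)
        numbers.append(int(match.group()) if match else 0)
    while len(numbers) < 3:
        numbers.append(0)
    return tuple(numbers)


def has_fix(version_text):
    """True if the version is at or above the release named in the thread."""
    return parse_version(version_text) >= FIXED_IN


def installed_unstructured_version():
    """Return the installed version string, or None if it is not installed."""
    try:
        return metadata.version("unstructured")
    except metadata.PackageNotFoundError:
        return None


installed = installed_unstructured_version()
if installed is None:
    print("unstructured is not installed in this environment")
else:
    print(f"unstructured {installed}: at or above 0.15.0 = {has_fix(installed)}")
```

With the Docker setup described in the report, this would need to run inside the Open WebUI container, since that is where the library is loaded.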