Content of documents with markdowns end up stripped out and with missing content #1363

Closed
opened 2025-11-11 14:43:39 -06:00 by GiteaMirror · 3 comments

Originally created by @gaspardpetit on GitHub (Jun 25, 2024).

Bug Report

Description

Hello, this may not be the right place to post this issue, as I imagine the problem lies in an underlying dependency. If so, please let me know and I'll be happy to forward it to the right place. In the meantime, it might be worth warning users about this issue.

When importing documents with markdown, some text will be completely stripped out from the processed document, leading to the documents never being identified as containing the answer or never providing the answer.

Bug Summary:

Upon importing a document containing markdown, content may end up being stripped out.

Steps to Reproduce:

  1. Save the following as "quotes.md"
The first quote is:

> “Not all those who wander are lost.”

The second quote is:

>     “Moonlight drowns out all but the brightest stars”


The third quote is:

>    “Courage is found in unlikely places.”
  2. Import the document in open-webui. In my case, this was tested with ollama 0.1.45 + nomic-embed-text
  3. Open a new chat
  4. Refer to the document in the chat and ask what the three quotes are, ex.:

#quotes
What are the three quotes?

In my tests, this was done against llama3 and qwen2.

Expected Behavior:

All three quotes should be provided.

Actual Behavior:

Only two are provided; one is missing, e.g.:

qwen2:latest
The three quotes are as follows:
"Not all those who wander are lost."
(The second quote was not explicitly provided in the context, so it remains unspecified.)
"Courage is found in unlikely places."

Additionally, if you hover over the referenced document in the quote, you will find that the chunk found is actually missing the second quote:

Citation
Source
quotemd
Content
The first quote is:

“Not all those who wander are lost.”

The second quote is:

The third quote is:

“Courage is found in unlikely places.”

It appears that any quote in a markdown document starting with > and followed by 5 spaces will be stripped out.
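
One possible explanation, offered as an assumption rather than a confirmed diagnosis: under CommonMark rules the `>` marker absorbs one optional space, and any remaining indentation of four or more spaces starts an indented code block inside the quote. That would make the second quote (five spaces) a code block, while the first and third (one and four spaces) stay ordinary paragraphs. The sketch below uses only the standard library; `classify_blockquote_line` and `normalize_blockquotes` are hypothetical helper names, not part of Open WebUI or unstructured.

```python
import re

# A blockquote line: up to 3 spaces of indent, '>', the run of spaces
# that follows it, then the rest of the line.
_QUOTE = re.compile(r"^ {0,3}>( *)(.*)$")


def classify_blockquote_line(line: str) -> str:
    """Classify one line as 'not_quote', 'empty', 'paragraph' or 'indented_code'.

    Line-by-line approximation: it ignores the rule that an indented code
    block cannot interrupt a paragraph, which does not matter when every
    quote is preceded by a blank line, as in the sample file.
    """
    m = _QUOTE.match(line)
    if m is None:
        return "not_quote"
    spaces, rest = m.groups()
    if not rest:
        return "empty"
    # The '>' marker consumes one optional space; what remains is indentation.
    remaining = max(len(spaces) - 1, 0)
    return "indented_code" if remaining >= 4 else "paragraph"


def normalize_blockquotes(text: str) -> str:
    """Rewrite quote lines so exactly one space follows '>'.

    Caveat: this also flattens intentionally indented code inside quotes
    and does not skip fenced code blocks.
    """
    out = []
    for line in text.splitlines():
        m = _QUOTE.match(line)
        if m and m.group(2):
            out.append("> " + m.group(2))
        else:
            out.append(line)
    return "\n".join(out)
```

Running `normalize_blockquotes` over a file before importing it would be one workaround to try on affected versions, at the cost of the caveats noted in the docstring.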

Environment

  • Open WebUI Version: v0.3.5

  • Ollama (if applicable): 0.1.45

  • Operating System: Windows 11 (ollama) + Docker 26.1.4 running on Ubuntu 22.04.4 LTS (Open WebUI)

  • Browser (if applicable): Edge 126.0.2592.68

Reproduction Details

Confirmation:

  • I have read and followed all the instructions provided in the README.md.
  • I am on the latest version of both Open WebUI and Ollama.
  • I have included the browser console logs.
  • I have included the Docker container logs.

Logs and Screenshots

Browser Console Logs:
Logs from the browser:

File {name: 'quote.md', lastModified: 1719281541473, lastModifiedDate: Mon Jun 24 2024 22:12:21 GMT-0400 (Eastern Daylight Time), webkitRelativePath: '', size: 231, …}
lastModified: 1719281541473
lastModifiedDate: Mon Jun 24 2024 22:12:21 GMT-0400 (Eastern Daylight Time) {}
name: "quote.md"
size: 231
type: ""
webkitRelativePath: ""
[[Prototype]]: File

Docker Container Logs:
Backend logs:

INFO:     "GET /api/v1/documents/ HTTP/1.1" 200 OK
INFO:apps.rag.main:file.content_type: application/octet-stream
INFO:unstructured:Reading document from string ...
INFO:unstructured:Reading document ...
INFO:apps.rag.main:store_data_in_vector_db [Document(page_content='The first quote is:\n\n“Not all those who wander are lost.”\n\nThe second quote is:\n\nThe third quote is:\n\n“Courage is found in unlikely places.”', metadata={'source': '/app/backend/data/uploads/quote.md', 'start_index': 0})]
INFO:apps.rag.main:store_docs_in_vector_db [Document(page_content='The first quote is:\n\n“Not all those who wander are lost.”\n\nThe second quote is:\n\nThe third quote is:\n\n“Courage is found in unlikely places.”', metadata={'source': '/app/backend/data/uploads/quote.md', 'start_index': 0})] 1715926191f7fc05130775c6535f0ea68c8a585ab57968986793c679d3a89e8
INFO:apps.ollama.main:generate_ollama_embeddings model='nomic-embed-text:latest' prompt='The first quote is:  “Not all those who wander are lost.”  The second quote is:  The third quote is:  “Courage is found in unlikely places.”' options=None keep_alive=None
INFO:apps.ollama.main:url: http://192.168.0.200:11434
INFO:apps.ollama.main:generate_ollama_embeddings {'embedding': [0.03500867635011673, ..., -0.4094814658164978, -0.5583905577659607, -0.2819729149341583]}
INFO:     "POST /rag/api/v1/doc HTTP/1.1" 200 OK
INFO:     "POST /api/v1/documents/create HTTP/1.1" 200 OK
INFO:     "GET /api/v1/documents/ HTTP/1.1" 200 OK

Notice that the document is already stripped by the time it reaches the INFO:apps.rag.main:store_data_in_vector_db log line, where it is printed without the second quote (and without the markdown formatting):

Document(page_content='The first quote is:\n\n“Not all those who wander are lost.”\n\nThe second quote is:\n\nThe third quote is:\n\n“Courage is found in unlikely places.”', metadata={'source': '/app/backend/data/uploads/quote.md', 'start_index': 0})
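
A rough way to catch this kind of loss automatically is to compare the source file against the extracted `page_content` before it is stored. The sketch below is a naive substring check using only the standard library; `dropped_lines` is a hypothetical helper, and it would also flag any line the loader legitimately rewrites, so treat its output as a hint rather than proof.

```python
def dropped_lines(source_md: str, extracted: str) -> list[str]:
    """Return source lines whose text does not appear in the extracted content."""
    missing = []
    for raw in source_md.splitlines():
        # Drop leading blockquote markers and indentation before comparing.
        text = raw.lstrip(" >").strip()
        if text and text not in extracted:
            missing.append(text)
    return missing


# The sample file from the steps above.
source = "\n".join([
    "The first quote is:",
    "",
    "> “Not all those who wander are lost.”",
    "",
    "The second quote is:",
    "",
    ">     “Moonlight drowns out all but the brightest stars”",
    "",
    "The third quote is:",
    "",
    ">    “Courage is found in unlikely places.”",
])

# The page_content reported in the backend log.
extracted = (
    "The first quote is:\n\n“Not all those who wander are lost.”\n\n"
    "The second quote is:\n\nThe third quote is:\n\n"
    "“Courage is found in unlikely places.”"
)

print(dropped_lines(source, extracted))
```

With the inputs from this report, the only line reported as missing is the second quote.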

Screenshots (if applicable):

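image: https://github.com/open-webui/open-webui/assets/9883156/f9274be0-d923-4213-9558-474bcb72f046
image: https://github.com/open-webui/open-webui/assets/9883156/153476ae-593d-4319-a392-f41c96ca2d7c
image: https://github.com/open-webui/open-webui/assets/9883156/23433aec-f42d-4765-8db6-3dc8edc46994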


@tjbck commented on GitHub (Jun 25, 2024):

Truly bizarre issue. I was able to reproduce it, but unfortunately nothing much can be done on our side. You might want to raise this issue with unstructured or langchain_community, as we use their libraries directly for our default RAG pipeline. Keep us updated if you manage to find the solution!


@gaspardpetit commented on GitHub (Jun 27, 2024):

The bug has been opened with unstructured:

https://github.com/Unstructured-IO/unstructured/issues/3309


@gaspardpetit commented on GitHub (Jul 21, 2024):

> Truly bizarre issue, I was able to reproduce it but nothing much can be done on our side unfortunately. You might want to raise this issue with unstructured or langchain_community, as we use their libraries directly for our default RAG pipeline. Keep us updated if you manage to find the solution!

A new version of unstructured was released which should now fix this issue: https://github.com/Unstructured-IO/unstructured/releases/tag/0.15.0
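
To see whether a given environment already has that release, something like the following could be used. It is a minimal sketch assuming the fix did land in 0.15.0 as stated above; `parse_release`, `has_fix` and `installed_unstructured_has_fix` are hypothetical helper names, and the version parsing is deliberately simple (pre-release suffixes such as `rc1` are ignored).

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def parse_release(v: str) -> tuple[int, ...]:
    """Parse leading numeric components, e.g. '0.15.0' -> (0, 15, 0)."""
    parts = []
    for piece in v.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break  # stop at the first non-numeric component, e.g. 'dev3'
        parts.append(int(digits))
    return tuple(parts)


def has_fix(installed: str, fixed_in: str = "0.15.0") -> bool:
    """True if the installed version is at or above the assumed fix release."""
    return parse_release(installed) >= parse_release(fixed_in)


def installed_unstructured_has_fix() -> Optional[bool]:
    """Check the local environment; None means unstructured is not installed."""
    try:
        return has_fix(version("unstructured"))
    except PackageNotFoundError:
        return None
```

For an Open WebUI deployment in Docker, this would need to run inside the backend container, since that is where unstructured is installed.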
