[Feature Request] Read markdown file contents as is to maintain structure of Markdown files; remove UnstructuredMarkdownLoader. #2487

Closed
opened 2025-11-11 15:08:25 -06:00 by GiteaMirror · 2 comments
Owner

Originally created by @wbste on GitHub (Oct 27, 2024).

Feature Request

Is your feature request related to a problem? Please describe.
Currently it appears that the loader at https://github.com/open-webui/open-webui/blob/21b8ca345904005f4b87666e4e0ac5bb8df309ad/backend/open_webui/apps/retrieval/loaders/main.py#L162 is used to parse Markdown files. It strips out the structural context (e.g. headers, tables), so that information is lost before the retrieval step.

Describe the solution you'd like
See below.

Recommendation

Is there any issue with just reading markdown files as-is? That way nothing is lost in terms of formatting, and I don't think that hurts any downstream stuff...

```python
# Read the Markdown file directly
with open("Markdown.md", "r", encoding="utf-8") as f:
    markdown_content = f.read()

print(markdown_content)
```

In Chroma the content shows up as expected, but the web GUI may need tweaking to render the Markdown properly. ProseMirror might be an option, but I'm not sure how to render tables with it...
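If the as-is read should keep the same shape as the loader output shown further down (a `page_content` string plus a `source` entry in `metadata`), a minimal sketch could look like this. `load_markdown_raw` is a hypothetical helper name, and the plain dict only mirrors that shape; it is not a LangChain `Document`.

```python
from pathlib import Path


def load_markdown_raw(path: str) -> dict:
    """Read a Markdown file verbatim, keeping headers, tables, and spacing.

    Hypothetical helper: returns a plain dict that mirrors the
    page_content/metadata pair printed by the loader, nothing more.
    """
    text = Path(path).read_text(encoding="utf-8")
    return {"page_content": text, "metadata": {"source": path}}


# Usage (assumes Markdown.md exists in the working directory):
# doc = load_markdown_raw("./Markdown.md")
# print(doc["page_content"])
```

Because nothing is parsed, the `#` markers and table pipes stay in the stored text, so any later step can still see them.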

Langchain

Below are examples of both the current processing (`mode="single"`) and a potential alternative (`mode="elements"`). The alternative would require more code to convert the `elements` back into a correctly merged string.

```markdown
# I am a header over a table

Below table is a thing.

| Name | Number |
|------|--------|
| Bat  | 1      |
| Tom  | 2      |

## Header 2

This stuff is under Header 2

# Header 1

This stuff is under Header 1
```

Quick test code:

```python
from langchain_community.document_loaders import UnstructuredMarkdownLoader

# Create a loader instance
loader = UnstructuredMarkdownLoader(
    "./Markdown.md",
    mode="single"
)

# Load documents
docs = loader.load()

# Print each document in the list
for i, doc in enumerate(docs):
    print(f"Document {i + 1}:")
    print(doc)
    print("-" * 40)  # Optional: for better separation between documents
```

Output from above markdown file and code snippet:

```
Document 1:
page_content='I am a header over a table

Below table is a thing.

Name Number Bat 1 Tom 2

Header 2

This stuff is under Header 2

Header 1

This stuff is under Header 1' metadata={'source': './Markdown.md'}
```
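One thing the flattened output can no longer support is splitting by heading, because the `#` markers are gone. As a rough sketch of what keeping the raw text would allow, the hypothetical `split_by_headings` helper below groups raw Markdown into sections and records the heading path for each one. It only handles ATX-style (`#`) headings and is an illustration, not how Open WebUI chunks documents.

```python
import re

HEADING = re.compile(r"^(#{1,6})\s+(.*)$")
FENCE = "`" * 3  # code-fence marker, so '#' lines inside code are ignored


def split_by_headings(markdown: str) -> list[dict]:
    """Split raw Markdown into sections, each tagged with its heading path."""
    sections = []
    path = []        # stack of (level, title) for the current heading path
    current = None   # section currently collecting lines
    in_fence = False
    for line in markdown.splitlines():
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence
        match = None if in_fence else HEADING.match(line)
        if match:
            level, title = len(match.group(1)), match.group(2).strip()
            # Drop headings at the same or a deeper level before descending.
            while path and path[-1][0] >= level:
                path.pop()
            path.append((level, title))
            current = {"headings": [t for _, t in path], "lines": []}
            sections.append(current)
        else:
            if current is None:  # text before the first heading
                current = {"headings": [], "lines": []}
                sections.append(current)
            current["lines"].append(line)
    return [
        {"headings": s["headings"], "text": "\n".join(s["lines"]).strip()}
        for s in sections
    ]
```

Run on the sample file above, this would yield three sections, with the "Header 2" section carrying both its own title and its parent's, and the table staying intact inside the first section.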

Now change `mode` in the code snippet to `"elements"`. Each element comes back as its own document:

```
Document 1:
page_content='I am a header over a table' metadata={'source': './Markdown.md', 'category_depth': 0, 'languages': ['eng'], 'file_directory': '.', 'filename': 'Markdown.md', 'filetype': 'text/markdown', 'last_modified': '2024-10-27T11:22:29', 'category': 'Title', 'element_id': 'ebd32158353f35e43dd0d95eabe0ca31'}
----------------------------------------
Document 2:
page_content='Below table is a thing.' metadata={'source': './Markdown.md', 'languages': ['eng'], 'file_directory': '.', 'filename': 'Markdown.md', 'filetype': 'text/markdown', 'last_modified': '2024-10-27T11:22:29', 'parent_id': 'ebd32158353f35e43dd0d95eabe0ca31', 'category': 'NarrativeText', 'element_id': 'cf103349ea77943cf3e77295cf3ffac3'}
----------------------------------------
Document 3:
page_content='Name Number Bat 1 Tom 2' metadata={'source': './Markdown.md', 'text_as_html': '<table><tr><td>Name</td><td>Number</td></tr><tr><td>Bat</td><td>1</td></tr><tr><td>Tom</td><td>2</td></tr></table>', 'languages': ['eng'], 'file_directory': '.', 'filename': 'Markdown.md', 'filetype': 'text/markdown', 'last_modified': '2024-10-27T11:22:29', 'parent_id': 'ebd32158353f35e43dd0d95eabe0ca31', 'category': 'Table', 'element_id': 'd5bc4ac3108cc0b2933c2aaa2094db91'}
----------------------------------------
Document 4:
page_content='Header 2' metadata={'source': './Markdown.md', 'category_depth': 1, 'languages': ['eng'], 'file_directory': '.', 'filename': 'Markdown.md', 'filetype': 'text/markdown', 'last_modified': '2024-10-27T11:22:29', 'parent_id': 'ebd32158353f35e43dd0d95eabe0ca31', 'category': 'Title', 'element_id': '2ad721a5db12be07a0b0635f4f32163f'}
----------------------------------------
Document 5:
page_content='This stuff is under Header 2' metadata={'source': './Markdown.md', 'languages': ['eng'], 'file_directory': '.', 'filename': 'Markdown.md', 'filetype': 'text/markdown', 'last_modified': '2024-10-27T11:22:29', 'parent_id': '2ad721a5db12be07a0b0635f4f32163f', 'category': 'NarrativeText', 'element_id': 'a9e9bdaa9d96c16e20f4f82b0237a912'}
----------------------------------------
Document 6:
page_content='Header 1' metadata={'source': './Markdown.md', 'category_depth': 0, 'languages': ['eng'], 'file_directory': '.', 'filename': 'Markdown.md', 'filetype': 'text/markdown', 'last_modified': '2024-10-27T11:22:29', 'category': 'Title', 'element_id': 'ebb313d9d8d8ea0bfb57be4efbd4cb0c'}
----------------------------------------
Document 7:
page_content='This stuff is under Header 1' metadata={'source': './Markdown.md', 'languages': ['eng'], 'file_directory': '.', 'filename': 'Markdown.md', 'filetype': 'text/markdown', 'last_modified': '2024-10-27T11:22:29', 'parent_id': 'ebb313d9d8d8ea0bfb57be4efbd4cb0c', 'category': 'NarrativeText', 'element_id': '77f474db60acc46eb184c819bc2d0ceb'}
```
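The "more coding" needed to merge `elements` output back into one string might look roughly like the sketch below. It relies only on the metadata keys visible in the output above (`category`, `category_depth`, `text_as_html`). The helper names are hypothetical, elements are passed as plain dicts rather than LangChain `Document` objects, and the first table row is assumed to be the header because `text_as_html` marks every cell as `<td>`.

```python
from html.parser import HTMLParser


class _TableRows(HTMLParser):
    """Collect cell text from a simple <table><tr><td> HTML fragment."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in ("td", "th"):
            if not self.rows:  # tolerate a cell that appears before any <tr>
                self.rows.append([])
            self.rows[-1].append("")
            self._in_cell = True

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self.rows[-1][-1] += data


def html_table_to_markdown(html: str) -> str:
    """Convert text_as_html into a pipe table (first row assumed to be the header)."""
    parser = _TableRows()
    parser.feed(html)
    parser.close()
    rows = [row for row in parser.rows if row]
    if not rows:
        return ""
    header, body = rows[0], rows[1:]
    lines = [
        "| " + " | ".join(header) + " |",
        "| " + " | ".join("---" for _ in header) + " |",
    ]
    lines += ["| " + " | ".join(row) + " |" for row in body]
    return "\n".join(lines)


def elements_to_markdown(elements: list[dict]) -> str:
    """Merge element dicts (page_content + metadata) into one Markdown string."""
    blocks = []
    for element in elements:
        meta = element["metadata"]
        text = element["page_content"]
        category = meta.get("category")
        if category == "Title":
            # category_depth 0 maps to '#', 1 to '##', and so on.
            blocks.append("#" * (meta.get("category_depth", 0) + 1) + " " + text)
        elif category == "Table" and "text_as_html" in meta:
            blocks.append(html_table_to_markdown(meta["text_as_html"]))
        else:
            blocks.append(text)
    return "\n\n".join(blocks)
```

With real loader output, each item would be built from `doc.page_content` and `doc.metadata`. Even then the result is a reconstruction: original spacing, table alignment, and any formatting the parser did not capture are not recovered, which is the argument for reading the file as-is.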
Author
Owner

@tjbck commented on GitHub (Oct 27, 2024):

Agreed, PR welcome!

Author
Owner

@tjbck commented on GitHub (Oct 28, 2024):

Updated on dev!


Reference: github-starred/open-webui#2487