mirror of
https://github.com/open-webui/open-webui.git
synced 2026-03-22 14:13:08 -05:00
[Feature Request] Read markdown file contents as is to maintain structure of Markdown files; remove UnstructuredMarkdownLoader. #2487
Originally created by @wbste on GitHub (Oct 27, 2024).
Feature Request
Is your feature request related to a problem? Please describe.
Currently it appears that the code at 21b8ca3459/backend/open_webui/apps/retrieval/loaders/main.py (L162) is used for parsing Markdown files. This strips out all of the structural context (headers, tables, etc.), so that information is lost and unavailable during the retrieval step.
Describe the solution you'd like
See below.
Recommendation
Is there any issue with just reading markdown files as-is? That way nothing is lost in terms of formatting, and I don't think that hurts any downstream stuff...
In Chroma the raw Markdown shows up as expected, but you may have to tweak the web GUI to render the Markdown properly. I think it uses ProseMirror? But I'm not sure how to properly show tables with it...
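A minimal sketch of what "reading as-is" could look like, using only the Python standard library. The names `RawDocument` and `load_markdown_as_is` are hypothetical and are not taken from the Open WebUI codebase; the real loader would presumably return whatever document type the retrieval pipeline already expects.

```python
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class RawDocument:
    """Hypothetical stand-in for the document object a loader returns."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def load_markdown_as_is(path: str, encoding: str = "utf-8") -> RawDocument:
    """Read a Markdown file without parsing it.

    Headers, tables, and lists are kept exactly as written, so the
    chunking and embedding steps see the original structure.
    """
    text = Path(path).read_text(encoding=encoding)
    return RawDocument(page_content=text, metadata={"source": str(path)})
```

Because nothing is parsed, the stored text is identical to the file contents; any rendering of tables or headers would then be left to the front end.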
Langchain
Some examples of both the current processing and a potential alternative, although the alternative would require more coding to convert the elements into the correct merged string.
Quick test code:
Output from above markdown file and code snippet:
Now change the code snippet mode to elements:
@tjbck commented on GitHub (Oct 27, 2024):
Agreed, PR welcome!
@tjbck commented on GitHub (Oct 28, 2024):
Updated on dev!