mirror of
https://github.com/open-webui/open-webui.git
synced 2026-06-05 00:10:27 -05:00
[GH-ISSUE #5924] BUG: "punctuation" split should also split at newlines #52839
Originally created by @thiswillbeyourgithub on GitHub (Oct 5, 2024).
Original GitHub issue: https://github.com/open-webui/open-webui/issues/5924
Bug Report
Confirmation:
Expected Behavior:
The speech model should consider two lines with no punctuation as two sentences, instead of a single sentence containing a newline between them.
Actual Behavior:
Two lines with no punctuation are considered a single sentence, so the speech model receives unsplit text when I use markdown bullet point lists.
Description
If I have the choice between splitting on paragraphs or on punctuation, it is implied that the latter will always result in smaller chunks. But as we can see at this line:
1d225dd804/src/lib/utils/index.ts (L543)
The chunking is done only after punctuation, ignoring newlines. I noticed this by looking at the logs from my openedai speech instance. The issue is that markdown bullet points frequently end abruptly with no punctuation.
A newline, although not strictly "punctuation", should be a splitting delimiter when we set "punctuation" as the splitter. Also, each split sentence should have "str.strip()" applied, of course.
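To illustrate the behavior I'm asking for, here is a minimal sketch. It is not the current open-webui code: the function name `splitOnPunctuationAndNewlines` is hypothetical, and the set of sentence-ending characters (`.`, `!`, `?`) is an assumption chosen for the example.

```typescript
// Hypothetical helper for illustration only; not the actual open-webui implementation.
// Assumes ".", "!" and "?" are the sentence-ending punctuation marks.
function splitOnPunctuationAndNewlines(text: string): string[] {
  return text
    // Turn "punctuation followed by whitespace" into "punctuation + newline",
    // so that both kinds of boundary end up as newlines.
    .replace(/([.!?])\s+/g, "$1\n")
    // Split on every run of newlines (bullet points with no punctuation split here).
    .split(/\n+/)
    // Equivalent of str.strip(): drop leading/trailing whitespace, including "\r".
    .map((chunk) => chunk.trim())
    // Discard chunks that were only whitespace.
    .filter((chunk) => chunk.length > 0);
}

const chunks = splitOnPunctuationAndNewlines(
  "- first item\n- second item\nA sentence. Another one!"
);
console.log(chunks);
```

With this approach each bullet point becomes its own chunk even when it has no trailing punctuation, while ordinary sentences are still split after their punctuation.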
Here's a particularly broken text:

Additional Information
Pinging @kiosion as they introduced the punctuation split in #4886
@kiosion commented on GitHub (Oct 5, 2024):
Hm, I'll look into this tomorrow. It'd be nice to do a second pass tidying that logic anyhow