[GH-ISSUE #5924] BUG: "punctuation" split should also split at newlines #68777

Closed
opened 2026-05-13 01:10:14 -05:00 by GiteaMirror · 1 comment

Originally created by @thiswillbeyourgithub on GitHub (Oct 5, 2024).
Original GitHub issue: https://github.com/open-webui/open-webui/issues/5924

Bug Report

Confirmation:

  • I have read and followed all the instructions provided in the README.md.
  • I am on the latest version of both Open WebUI and Ollama.

Expected Behavior:

The speech model should consider two lines with no punctuation as two sentences, instead of a single sentence containing a newline between them.

Actual Behavior:

Two lines with no punctuation are treated as a single sentence, so the speech model receives unsplit text when I use markdown bullet point lists.

Description

If I have the choice between splitting on paragraphs or on punctuation, it is implied that the latter will always result in smaller chunks. But as we can see at this line:
https://github.com/open-webui/open-webui/blob/1d225dd804575af9ae5981528dfdce695f7f7040/src/lib/utils/index.ts#L543

The chunking is done only after punctuation, ignoring newlines. I noticed this by looking at the logs from my openedai speech (https://github.com/matatonic/openedai-speech/) instance. The issue is that markdown bullet points frequently end abruptly with no punctuation.

A newline, albeit not strictly "punctuation", should be a splitting delimiter when we set "punctuation" as the splitter. Also, each split sentence should have "str.strip()" applied, of course.
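To make the request concrete, here is a minimal sketch of the behaviour I have in mind. It is not the code at the line linked above: the function name `splitOnPunctuationAndNewlines` and the exact regex are my own assumptions, and the real implementation may need to handle more cases (abbreviations, code blocks, and so on).

```typescript
// Hypothetical helper for illustration only; not the existing function in
// src/lib/utils/index.ts.
function splitOnPunctuationAndNewlines(text: string): string[] {
  return (
    text
      // Split on whitespace that follows '.', '!' or '?', or on any run of
      // newline characters (so unpunctuated bullet points are separated too).
      .split(/(?<=[.!?])\s+|[\r\n]+/)
      // Equivalent of Python's str.strip() on every chunk.
      .map((chunk) => chunk.trim())
      // Drop chunks that are empty after trimming.
      .filter((chunk) => chunk.length > 0)
  );
}

const chunks = splitOnPunctuationAndNewlines(
  "- first item\n- second item\nDone. Really!"
);
console.log(chunks);
// → [ '- first item', '- second item', 'Done.', 'Really!' ]
```

With something like this, each bullet point reaches the speech model as its own chunk even when it ends without punctuation.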

Here's a particularly broken text:
image: https://github.com/user-attachments/assets/3bc070f5-62fd-463b-9b1e-ee16681cdb1c

Additional Information

Pinging @kiosion as they introduced the punctuation split in #4886

GiteaMirror added the bug label 2026-05-13 01:10:14 -05:00

@kiosion commented on GitHub (Oct 5, 2024):

Hm, I'll look into this tomorrow. It'd be nice to do a second pass tidying that logic anyhow


Reference: github-starred/open-webui#68777