mirror of
https://github.com/open-webui/open-webui.git
synced 2026-05-07 19:38:46 -05:00
[PR #24395] [CLOSED] fix: bundle averaged_perceptron_tagger_eng NLTK resource alongside punkt_tab #66497
📋 Pull Request Information
Original PR: https://github.com/open-webui/open-webui/pull/24395
Author: @vzd3v
Created: 5/5/2026
Status: ❌ Closed
Base: dev ← Head: fix/bundle-nltk-averaged-perceptron-tagger-eng
📝 Commits (1)
1b1e60e fix: bundle averaged_perceptron_tagger_eng NLTK resource alongside punkt_tab
📊 Changes
3 files changed (+4 additions, -4 deletions)
📝 Dockerfile (+2 -2)
📝 backend/start.sh (+1 -1)
📝 backend/start_windows.bat (+1 -1)
📄 Description
Why
unstructured 0.18.x partitioning for PPTX/Word/etc. requires two NLTK resources at runtime, punkt_tab and averaged_perceptron_tagger_eng, both referenced in unstructured/nlp/tokenize.py.

Today only punkt_tab is pre-downloaded in the image (PR #21165). The first upload of a .pptx/.docx/.pps/etc. file in a freshly built or recreated container fails with an exception. The exception path doesn't currently transition data->>'status' to failed, so the file row stays pending indefinitely and the SSE stream /api/v1/files/{id}/process/status?stream=true keeps spinning; see #24393 for full reproduction.

After running nltk.download('averaged_perceptron_tagger_eng', download_dir='/root/nltk_data') once inside the running container, identical files are processed in ~1.5s.

What
Add averaged_perceptron_tagger_eng to the same one-liner that already bundles punkt_tab. This mirrors PR #21165 in scope and intent: it keeps airgapped/cold-start environments self-sufficient. Three files:
- Dockerfile (CUDA branch + non-CUDA branch)
- backend/start.sh (playwright-engine fallback)
- backend/start_windows.bat (playwright-engine fallback)
Net diff: +4/-4.
Closes #24393. Replaces the closed-on-wrong-base #24394.
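For illustration only, the bundling step could be sketched in Python roughly as follows. This is not the one-liner from the diff: the helper names missing_resources and ensure_resources and the REQUIRED mapping are hypothetical, and the tokenizers/punkt_tab location is an assumption based on where NLTK normally unpacks that package. Only nltk.download and the /root/nltk_data and taggers/averaged_perceptron_tagger_eng paths come from the description above.

```python
from pathlib import Path

# Resource name -> directory NLTK unpacks it into, relative to the data dir.
# The taggers/ path is quoted in the PR; the tokenizers/ path is an assumption.
REQUIRED = {
    "punkt_tab": "tokenizers/punkt_tab",
    "averaged_perceptron_tagger_eng": "taggers/averaged_perceptron_tagger_eng",
}


def missing_resources(data_dir):
    """Return the names of required resources not yet unpacked under data_dir."""
    root = Path(data_dir)
    return [name for name, rel in REQUIRED.items() if not (root / rel).is_dir()]


def ensure_resources(data_dir="/root/nltk_data"):
    """Download whatever is missing; return the names that needed downloading."""
    missing = missing_resources(data_dir)
    if not missing:
        return []
    # Imported lazily so the filesystem check above works without NLTK installed.
    import nltk

    for name in missing:
        nltk.download(name, download_dir=str(data_dir), quiet=True)
    return missing


if __name__ == "__main__":
    # At build time this needs network access; afterwards the check is offline.
    print("downloaded:", ensure_resources())
    print("still missing:", missing_resources("/root/nltk_data"))
```

Running missing_resources against the built image is one way to confirm that the tagger directory exists post-build, since it only inspects the filesystem and needs no network access.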
Test plan
- /root/nltk_data/taggers/averaged_perceptron_tagger_eng/ exists post-build.
- Upload a .pptx to a fresh container without internet access; it should reach status='completed' and produce chunks.
- punkt_tab behaviour from #21165 is unchanged.

🔄 This issue represents a GitHub Pull Request. It cannot be merged through Gitea due to API limitations.