mirror of
https://github.com/open-webui/open-webui.git
synced 2026-05-08 04:16:03 -05:00
[PR #24396] fix: bundle averaged_perceptron_tagger_eng NLTK resource alongside punkt_tab #66498
📋 Pull Request Information
Original PR: https://github.com/open-webui/open-webui/pull/24396
Author: @vzd3v
Created: 5/5/2026
Status: 🔄 Open
Base: dev ← Head: fix/bundle-nltk-averaged-perceptron-tagger-eng
📝 Commits (1)
1b1e60e fix: bundle averaged_perceptron_tagger_eng NLTK resource alongside punkt_tab
📊 Changes
3 files changed (+4 additions, -4 deletions)
📝 Dockerfile (+2 -2)
📝 backend/start.sh (+1 -1)
📝 backend/start_windows.bat (+1 -1)
📄 Description
Before submitting, make sure you've checked the following:
dev.
unstructured==0.18.31 / nltk==3.9.3 already pinned in backend/requirements.txt.
ghcr.io/open-webui/open-webui:v0.9.2; confirmed identical PPTX file is processed in ~1.5s once averaged_perceptron_tagger_eng is present in /root/nltk_data/. See #24393 for the full repro and trace.
unstructured/nlp/tokenize.py requires.
upstream/dev after the wrong-base auto-close of #24394 and the missing-CLA close of #24395.
fix:.
Changelog Entry
Description
unstructured 0.18.x partitioning for PPTX/Word/etc. requires both punkt_tab and averaged_perceptron_tagger_eng (see unstructured/nlp/tokenize.py). PR #21165 pre-downloaded punkt_tab only. As a result, the first upload of any unstructured-handled format in a freshly built or recreated container hits LookupError: Resource 'averaged_perceptron_tagger_eng' not found. The file row then stays at data->>'status'='pending', and the /api/v1/files/{id}/process/status?stream=true SSE keeps spinning. This PR pre-downloads the second resource alongside punkt_tab so cold-start / airgapped images work out of the box.
Fixed
.pptx/.docx (and any other format routed through unstructured partitioners) hanging at status=pending in fresh containers due to the missing averaged_perceptron_tagger_eng NLTK resource.
Additional Information
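For illustration only, the pre-download step described above could be sketched in Python roughly as follows. This is not the code in the PR, which edits the Dockerfile and start scripts. The helper names missing_resources and ensure_resources are hypothetical, and the tokenizers/ and taggers/ category prefixes are assumed from NLTK's usual data layout. The lookup and download callables are injectable so the logic can be exercised without network access; by default they fall back to nltk.data.find and nltk.download.

```python
from typing import Callable, Dict, List, Optional

# Resource names come from the PR description; the category prefixes
# ("tokenizers/", "taggers/") are assumed from NLTK's usual data layout.
REQUIRED: Dict[str, str] = {
    "punkt_tab": "tokenizers/punkt_tab",
    "averaged_perceptron_tagger_eng": "taggers/averaged_perceptron_tagger_eng",
}


def missing_resources(find: Callable[[str], object],
                      required: Dict[str, str] = REQUIRED) -> List[str]:
    """Return the names of required resources that `find` cannot locate."""
    missing = []
    for name, path in required.items():
        try:
            find(path)  # nltk.data.find raises LookupError when absent
        except LookupError:
            missing.append(name)
    return missing


def ensure_resources(find: Optional[Callable] = None,
                     download: Optional[Callable] = None) -> List[str]:
    """Download whatever is missing; return the names that were fetched."""
    if find is None or download is None:
        import nltk  # imported lazily so the helpers can be tested offline
        find = find or nltk.data.find
        download = download or nltk.download
    needed = missing_resources(find)
    for name in needed:
        download(name, quiet=True)
    return needed
```

Calling ensure_resources() at image build time (or at startup when network access is available) would fetch only what is absent, so an image that already contains both resources performs no downloads.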
main) and #24395 (auto-closed for missing CLA section).
Dockerfile branches and in backend/start.sh / backend/start_windows.bat. Net diff: +4/-4.
Contributor License Agreement
🔄 This issue represents a GitHub Pull Request. It cannot be merged through Gitea due to API limitations.