[PR #24395] [CLOSED] fix: bundle averaged_perceptron_tagger_eng NLTK resource alongside punkt_tab #66497

Closed
opened 2026-05-06 12:53:50 -05:00 by GiteaMirror · 0 comments

📋 Pull Request Information

Original PR: https://github.com/open-webui/open-webui/pull/24395
Author: @vzd3v
Created: 5/5/2026
Status: Closed

Base: dev ← Head: fix/bundle-nltk-averaged-perceptron-tagger-eng


📝 Commits (1)

  • 1b1e60e fix: bundle averaged_perceptron_tagger_eng NLTK resource alongside punkt_tab

📊 Changes

3 files changed (+4 additions, -4 deletions)

📝 Dockerfile (+2 -2)
📝 backend/start.sh (+1 -1)
📝 backend/start_windows.bat (+1 -1)

📄 Description

Why

unstructured 0.18.x partitioning for PPTX/Word/etc. requires two NLTK resources at runtime — punkt_tab and averaged_perceptron_tagger_eng — referenced in unstructured/nlp/tokenize.py:

nltk.download("averaged_perceptron_tagger_eng", quiet=True)
nltk.download("punkt_tab", quiet=True)

Today only punkt_tab is pre-downloaded in the image (PR #21165). The first upload of a .pptx/.docx/.pps/etc. file in a freshly built or recreated container hits:

LookupError: Resource 'averaged_perceptron_tagger_eng' not found.

The exception path doesn't currently transition data->>'status' to failed, so the file row stays pending indefinitely and the SSE stream /api/v1/files/{id}/process/status?stream=true keeps spinning — see #24393 for full reproduction.
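The stuck-`pending` behaviour can be illustrated with a minimal sketch. Everything here is hypothetical (`process_file`, the `record` dict, the `partition` callable); it is not Open WebUI's actual processing code, only an illustration of catching the `LookupError` so the row reaches a terminal status.

```python
from typing import Callable, Dict, List


def process_file(record: Dict[str, str], partition: Callable[[], List[str]]) -> Dict[str, str]:
    """Run `partition` and leave `record` in a terminal status either way.

    Hypothetical helper: `record` stands in for the file row and `partition`
    for the document-partitioning call that needs the NLTK resources.
    """
    try:
        chunks = partition()
    except LookupError as exc:
        # Without this branch the status would stay "pending" and a client
        # polling the status stream would never see the job finish.
        record["status"] = "failed"
        record["error"] = str(exc)
    else:
        record["status"] = "completed"
        record["chunks"] = str(len(chunks))
    return record
```

With a handler of this shape, a missing resource would surface as a `failed` row with the error text rather than an endless spinner; whether the real fix for #24393 looks like this is not stated here.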

After running nltk.download('averaged_perceptron_tagger_eng', download_dir='/root/nltk_data') once inside the running container, identical files are processed in ~1.5s.
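The manual workaround above can be made idempotent. The sketch below is an assumption-laden illustration, not code from this PR: `ensure_resources` and `RESOURCES` are hypothetical names, the `taggers/` path is taken from the test plan, and the `tokenizers/punkt_tab` path is assumed. It only calls `nltk.download` for resources whose directory is missing.

```python
import os
from typing import Callable, Dict, List, Optional

# Resource id -> directory relative to the NLTK data root.
# The taggers/ path appears in the test plan; the tokenizers/ path is assumed.
RESOURCES: Dict[str, str] = {
    "punkt_tab": os.path.join("tokenizers", "punkt_tab"),
    "averaged_perceptron_tagger_eng": os.path.join(
        "taggers", "averaged_perceptron_tagger_eng"
    ),
}


def ensure_resources(
    download_dir: str, download: Optional[Callable[..., bool]] = None
) -> List[str]:
    """Download every resource whose directory is missing; return the ids fetched."""
    if download is None:
        import nltk  # imported lazily so the on-disk check needs no nltk install

        download = nltk.download
    fetched: List[str] = []
    for resource_id, rel_path in RESOURCES.items():
        if not os.path.isdir(os.path.join(download_dir, rel_path)):
            download(resource_id, download_dir=download_dir, quiet=True)
            fetched.append(resource_id)
    return fetched
```

Calling `ensure_resources("/root/nltk_data")` inside a container with network access would fetch only what is absent; the `download` parameter exists so the logic can be exercised offline with a stub.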

What

Add averaged_perceptron_tagger_eng to the same one-liner that already bundles punkt_tab. Mirrors PR #21165 in scope and intent — keeps airgapped/cold-start environments self-sufficient. Three files:

  • Dockerfile (CUDA branch + non-CUDA branch)
  • backend/start.sh (playwright-engine fallback)
  • backend/start_windows.bat (playwright-engine fallback)

Net diff: +4/-4.

Closes #24393. Replaces the closed-on-wrong-base #24394.

Test plan

  • Build image and confirm /root/nltk_data/taggers/averaged_perceptron_tagger_eng/ exists post-build.
  • Upload a .pptx to a fresh container without internet access — should reach status='completed' and produce chunks.
  • Verify punkt_tab behaviour from #21165 is unchanged.
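The first test-plan item could be automated with a small post-build smoke check. This is a sketch under stated assumptions: `EXPECTED_DIRS`, `missing_dirs`, and `check` are hypothetical names, and the `tokenizers/punkt_tab` location is assumed (only the `taggers/` path is given above).

```python
import os
from typing import List

# Directories expected under the NLTK data root after the image build.
# The taggers/ entry is quoted from the test plan; tokenizers/ is an assumption.
EXPECTED_DIRS: List[str] = [
    os.path.join("taggers", "averaged_perceptron_tagger_eng"),
    os.path.join("tokenizers", "punkt_tab"),
]


def missing_dirs(root: str) -> List[str]:
    """Return the expected directories that are absent or empty under `root`."""
    missing: List[str] = []
    for rel in EXPECTED_DIRS:
        path = os.path.join(root, rel)
        if not os.path.isdir(path) or not os.listdir(path):
            missing.append(rel)
    return missing


def check(root: str = "/root/nltk_data") -> int:
    """Print each missing directory and return a shell-style exit code."""
    missing = missing_dirs(root)
    for rel in missing:
        print(f"missing: {rel}")
    return 1 if missing else 0
```

Running `check()` as the last step of an image build (and failing the build on a non-zero result) would catch a regression where either resource stops being bundled.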

🔄 This issue represents a GitHub Pull Request. It cannot be merged through Gitea due to API limitations.

GiteaMirror added the pull-request label 2026-05-06 12:53:50 -05:00

Reference: github-starred/open-webui#66497