[PR #24396] fix: bundle averaged_perceptron_tagger_eng NLTK resource alongside punkt_tab #66498

opened 2026-05-06 12:53:54 -05:00 by GiteaMirror · 0 comments

📋 Pull Request Information

Original PR: https://github.com/open-webui/open-webui/pull/24396
Author: @vzd3v
Created: 5/5/2026
Status: 🔄 Open

Base: dev ← Head: fix/bundle-nltk-averaged-perceptron-tagger-eng


📝 Commits (1)

  • 1b1e60e fix: bundle averaged_perceptron_tagger_eng NLTK resource alongside punkt_tab

📊 Changes

3 files changed (+4 additions, -4 deletions)

📝 Dockerfile (+2 -2)
📝 backend/start.sh (+1 -1)
📝 backend/start_windows.bat (+1 -1)

📄 Description

Before submitting, make sure you've checked the following:

  • Target branch: Targets dev.
  • Description: Below.
  • Changelog: Below.
  • Documentation: N/A — no user-facing behaviour change; restores parity with PR #21165 for the second NLTK resource the same library already requires.
  • Dependencies: No new or upgraded dependencies. Same unstructured==0.18.31 / nltk==3.9.3 already pinned in backend/requirements.txt.
  • Testing: Reproduced the failure end-to-end on ghcr.io/open-webui/open-webui:v0.9.2; confirmed identical PPTX file is processed in ~1.5s once averaged_perceptron_tagger_eng is present in /root/nltk_data/. See #24393 for the full repro and trace.
  • Agentic AI Code: Patch was drafted with AI assistance and has gone through human review and manual testing on a real running instance — failure reproduced before the change, disappears with the resource bundled, no other behaviour affected.
  • Code review: Self-reviewed; the change mirrors PR #21165 line-for-line for the second resource that unstructured/nlp/tokenize.py requires.
  • Design & Architecture: No design change.
  • Git Hygiene: One atomic commit; rebased on upstream/dev after the wrong-base auto-close of #24394 and the missing-CLA close of #24395.
  • Title Prefix: fix:.

Changelog Entry

Description

unstructured 0.18.x partitioning for PPTX/Word/etc. requires both punkt_tab and averaged_perceptron_tagger_eng (see unstructured/nlp/tokenize.py). PR #21165 pre-downloaded punkt_tab only. The first upload of any unstructured-handled format in a freshly built/recreated container hits LookupError: Resource 'averaged_perceptron_tagger_eng' not found., the file row stays at data->>'status'='pending', and the /api/v1/files/{id}/process/status?stream=true SSE keeps spinning. This PR pre-downloads the second resource alongside punkt_tab so cold-start / airgapped images work out of the box.
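The pre-download step described above can be sketched roughly as follows. This is a minimal illustration, not the actual one-liner from the diff (which is not reproduced in this description). The helper names `missing_resources`, `ensure_resources`, and `bundle_with_nltk` are hypothetical, and the `tokenizers/` and `taggers/` lookup paths are assumed to be where NLTK resolves these two resources.

```python
from typing import Callable, Dict, List

# Assumed NLTK lookup paths (category/name) for the two resources
# that unstructured's tokenizer needs.
REQUIRED_RESOURCES: Dict[str, str] = {
    "punkt_tab": "tokenizers/punkt_tab",
    "averaged_perceptron_tagger_eng": "taggers/averaged_perceptron_tagger_eng",
}


def missing_resources(
    find: Callable[[str], object],
    resources: Dict[str, str] = REQUIRED_RESOURCES,
) -> List[str]:
    """Return the names of resources that `find` cannot resolve locally."""
    missing = []
    for name, path in resources.items():
        try:
            find(path)
        except LookupError:
            # nltk.data.find raises LookupError for an absent resource.
            missing.append(name)
    return missing


def ensure_resources(
    find: Callable[[str], object],
    download: Callable[[str], object],
    resources: Dict[str, str] = REQUIRED_RESOURCES,
) -> List[str]:
    """Download every missing resource and return the names that were fetched."""
    fetched = missing_resources(find, resources)
    for name in fetched:
        download(name)
    return fetched


def bundle_with_nltk() -> List[str]:
    """Hypothetical build-time entry point; needs nltk and network access."""
    import nltk  # imported lazily so the helpers above stay dependency-free

    return ensure_resources(nltk.data.find, nltk.download)
```

Calling `bundle_with_nltk()` once at image build time (or from a start script) would leave both resources on disk, so nothing has to be fetched when the first file is uploaded. Passing `find` and `download` in as parameters keeps the check testable without NLTK installed.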

Fixed

  • Knowledge-base ingestion of .pptx / .docx (and any other format routed through unstructured partitioners) hanging at status=pending in fresh containers due to the missing averaged_perceptron_tagger_eng NLTK resource.

Additional Information

  • Closes #24393.
  • Replaces #24394 (auto-closed for targeting main) and #24395 (auto-closed for missing CLA section).
  • Mirrors PR #21165 in scope; same one-liner pattern in both Dockerfile branches and in backend/start.sh / backend/start_windows.bat. Net diff: +4/-4.
  • Repro inside a fresh container:
    from langchain_community.document_loaders import UnstructuredPowerPointLoader
    UnstructuredPowerPointLoader('/path/to/file.pptx').load()
    # -> LookupError: Resource 'averaged_perceptron_tagger_eng' not found.
    
    After this change, the loader returns the parsed document in ~1.5s without any network access at runtime.
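To compare the before and after behaviour of the repro above in one place, a small harness along these lines may help. It is only a sketch: `try_load` is a hypothetical helper, and it assumes the failure surfaces as a `LookupError`, as in the trace quoted above.

```python
import time
from typing import Any, Callable, Dict


def try_load(load: Callable[[], Any]) -> Dict[str, Any]:
    """Run a loader callable, reporting elapsed time and any LookupError."""
    start = time.perf_counter()
    try:
        result = load()
    except LookupError as exc:
        # A missing NLTK resource is reported here instead of propagating.
        return {
            "ok": False,
            "elapsed": time.perf_counter() - start,
            "error": str(exc),
        }
    return {"ok": True, "elapsed": time.perf_counter() - start, "result": result}
```

Inside a container this could be called as `try_load(UnstructuredPowerPointLoader('/path/to/file.pptx').load)`, with the loader imported as in the repro. Before the change the returned `error` would name the missing resource; afterwards `ok` should be true.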

Contributor License Agreement

  • By submitting this pull request, I confirm that I have read and fully agree to the Contributor License Agreement (CLA) (https://github.com/open-webui/open-webui/blob/main/CONTRIBUTOR_LICENSE_AGREEMENT), and I am providing my contributions under its terms.

🔄 This issue represents a GitHub Pull Request. It cannot be merged through Gitea due to API limitations.

GiteaMirror added the pull-request label 2026-05-06 12:53:54 -05:00
Reference: github-starred/open-webui#66498