[PR #4038] [MERGED] chore(deps): bump unstructured from 0.14.10 to 0.15.0 in /backend #8182

Closed
opened 2025-11-11 17:47:06 -06:00 by GiteaMirror · 0 comments

📋 Pull Request Information

Original PR: https://github.com/open-webui/open-webui/pull/4038
Author: @dependabot[bot]
Created: 7/22/2024
Status: Merged
Merged: 7/24/2024
Merged by: @tjbck

Base: dev ← Head: dependabot/pip/backend/dev/unstructured-0.15.0


📝 Commits (1)

  • 659bc24 chore(deps): bump unstructured from 0.14.10 to 0.15.0 in /backend

📊 Changes

1 file changed (+1 additions, -1 deletions)

📝 backend/requirements.txt (+1 -1)

📄 Description

Bumps unstructured from 0.14.10 to 0.15.0.
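After a bump like this, it can be useful to confirm that the running environment actually picked up the new release. The following is a minimal sketch using only the standard library; the helper names `parse_version` and `installed_at_least` are hypothetical, and the parser assumes plain dotted numeric versions such as `0.15.0` (no pre-release suffixes).

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional, Tuple


def parse_version(text: str) -> Tuple[int, ...]:
    """Turn a plain dotted release string such as '0.15.0' into (0, 15, 0)."""
    return tuple(int(part) for part in text.split("."))


def installed_at_least(package: str, minimum: str) -> Optional[bool]:
    """Return None if the package is not installed, else whether it meets the minimum."""
    try:
        found = version(package)
    except PackageNotFoundError:
        return None
    return parse_version(found) >= parse_version(minimum)


if __name__ == "__main__":
    # Reports True/False when unstructured is installed, None otherwise.
    print(installed_at_least("unstructured", "0.15.0"))
```

Comparing integer tuples rather than raw strings avoids ordering mistakes once a component reaches two digits.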

Release notes

Sourced from unstructured's releases.

0.15.0

Enhancements

  • Improve text clearing process in email partitioning. Updated the email partitioner to remove both =\n and =\r\n characters during the clearing process. Previously, only =\n characters were removed.
  • Bump unstructured.paddleocr to 2.8.0.1.
  • Refine HTML parser to accommodate block element nested in phrasing. HTML parser no longer raises on a block element (e.g. <p>, <div>) nested inside a phrasing element (e.g. <strong> or <cite>). Instead it breaks the phrasing run (and therefore element) at the block-item start and begins a new phrasing run after the block-item. This is consistent with how the browser determines element boundaries in this situation.
  • Install rewritten HTML parser to fix 12 existing bugs and provide headroom for refinement and growth. A rewritten HTML parser resolves a collection of outstanding bugs with HTML partitioning and provides a firm foundation for further elaborating that important partitioner.
  • CI check for dependency licenses. Adds a CI check to ensure dependencies are appropriately licensed.
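The `=\n` and `=\r\n` sequences mentioned in the email item are quoted-printable soft line breaks: an `=` at the end of a wrapped line, followed by LF or CRLF. The sketch below illustrates the clearing step as described in the release notes; it is not the library's own code, and the function name `strip_soft_line_breaks` is hypothetical.

```python
import re

# A soft line break is "=" followed by either LF or CRLF.
_SOFT_BREAK = re.compile(r"=\r?\n")


def strip_soft_line_breaks(text: str) -> str:
    r"""Remove "=\n" and "=\r\n" so that wrapped lines are rejoined."""
    return _SOFT_BREAK.sub("", text)


if __name__ == "__main__":
    wrapped = "Hello wo=\nrld and mo=\r\nre"
    print(strip_soft_line_breaks(wrapped))
```

A plain `text.replace("=\n", "")` would leave the CRLF variant untouched, which matches the "previously" behaviour the notes describe.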

Features

  • Add support for specifying OCR language to partition_pdf(). Extend language specification capability to PaddleOCR in addition to TesseractOCR. Users can now specify OCR languages for both OCR engines when using partition_pdf().
  • Add AstraDB source connector. Adds support for ingesting documents from AstraDB.
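As a rough illustration of the OCR-language feature, the sketch below passes a `languages` list to `partition_pdf()`. The helper names, the file path, and the choice of the `"ocr_only"` strategy are assumptions made for the example; which OCR engine (Tesseract or PaddleOCR) actually runs depends on environment configuration that is outside this sketch.

```python
from typing import Any, Dict, List


def build_pdf_ocr_kwargs(path: str, languages: List[str]) -> Dict[str, Any]:
    """Collect keyword arguments for partition_pdf() (hypothetical helper)."""
    return {
        "filename": path,
        "strategy": "ocr_only",  # assumption: force OCR so the languages apply
        "languages": languages,  # e.g. ["eng"] or ["eng", "deu"]
    }


def partition_with_languages(path: str, languages: List[str]):
    """Run partition_pdf() with the requested OCR languages."""
    # Imported lazily so this module loads even without unstructured installed.
    from unstructured.partition.pdf import partition_pdf

    return partition_pdf(**build_pdf_ocr_kwargs(path, languages))


if __name__ == "__main__":
    # "scan.pdf" is a placeholder path, not a file from this PR.
    for element in partition_with_languages("scan.pdf", ["eng"]):
        print(element.text)
```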

Fixes

  • Remedy error on Windows when nltk binaries are downloaded. Work around a quirk in the Windows implementation of tempfile.NamedTemporaryFile where accessing the temporary file by name raises PermissionError.
  • Move Astra embedded_dimension to write config
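The Windows quirk in the first fix can be sketched with the standard library alone. On Windows, a file created by `tempfile.NamedTemporaryFile` generally cannot be opened a second time by name while the original handle is still open. One common workaround, shown here as an illustration and not necessarily the approach taken upstream, is to create the file with `delete=False`, close it, reopen it by name, and remove it manually. The function name `write_then_reopen` is hypothetical.

```python
import os
import tempfile


def write_then_reopen(payload: bytes) -> bytes:
    """Write payload to a temp file, then read it back by name."""
    # delete=False lets us close the handle before reopening by name,
    # which avoids PermissionError on Windows.
    tmp = tempfile.NamedTemporaryFile(delete=False)
    try:
        tmp.write(payload)
        tmp.close()
        with open(tmp.name, "rb") as reopened:
            return reopened.read()
    finally:
        tmp.close()  # safe to call again if already closed
        os.unlink(tmp.name)  # manual cleanup, since delete=False


if __name__ == "__main__":
    print(write_then_reopen(b"example data"))
```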

Changelog

Sourced from unstructured's changelog.

0.15.0

Enhancements

  • Improve text clearing process in email partitioning. Updated the email partitioner to remove both =\n and =\r\n characters during the clearing process. Previously, only =\n characters were removed.
  • Bump unstructured.paddleocr to 2.8.0.1.
  • Refine HTML parser to accommodate block element nested in phrasing. HTML parser no longer raises on a block element (e.g. <p>, <div>) nested inside a phrasing element (e.g. <strong> or <cite>). Instead it breaks the phrasing run (and therefore element) at the block-item start and begins a new phrasing run after the block-item. This is consistent with how the browser determines element boundaries in this situation.
  • Install rewritten HTML parser to fix 12 existing bugs and provide headroom for refinement and growth. A rewritten HTML parser resolves a collection of outstanding bugs with HTML partitioning and provides a firm foundation for further elaborating that important partitioner.
  • CI check for dependency licenses. Adds a CI check to ensure dependencies are appropriately licensed.
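The block-in-phrasing rule from the HTML parser item can be illustrated with a toy splitter built on the standard library's `html.parser`. This is a simplified sketch of the idea (end the current text run when a block element starts, and start a new run after it ends), not the rewritten parser itself; `BLOCK_TAGS`, `RunSplitter`, and `split_runs` are hypothetical names, and only `<p>` and `<div>` are treated as block elements.

```python
from html.parser import HTMLParser
from typing import List

BLOCK_TAGS = {"p", "div"}  # assumption: a tiny subset of block elements


class RunSplitter(HTMLParser):
    """Collect text runs, breaking a run at every block start and end."""

    def __init__(self) -> None:
        super().__init__()
        self.runs: List[str] = []
        self._buffer: List[str] = []

    def flush(self) -> None:
        # Normalize whitespace and keep only non-empty runs.
        text = " ".join("".join(self._buffer).split())
        if text:
            self.runs.append(text)
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self.flush()  # a block start ends the current phrasing run

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS:
            self.flush()  # a block end begins a fresh phrasing run

    def handle_data(self, data):
        self._buffer.append(data)


def split_runs(html: str) -> List[str]:
    parser = RunSplitter()
    parser.feed(html)
    parser.close()
    parser.flush()  # emit any trailing run
    return parser.runs


if __name__ == "__main__":
    print(split_runs("<strong>before <p>inside</p> after</strong>"))
```

Here the `<p>` nested inside `<strong>` yields three separate runs instead of an error.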

Features

  • Add support for specifying OCR language to partition_pdf(). Extend language specification capability to PaddleOCR in addition to TesseractOCR. Users can now specify OCR languages for both OCR engines when using partition_pdf().
  • Add AstraDB source connector. Adds support for ingesting documents from AstraDB.

Fixes

  • Remedy error on Windows when nltk binaries are downloaded. Work around a quirk in the Windows implementation of tempfile.NamedTemporaryFile where accessing the temporary file by name raises PermissionError.
  • Move Astra embedded_dimension to write config

Commits

  • ec59abf enhancement: improve text clearing process in email partitioning (#3422)
  • 1df7908 feat: save file id for all fsspec connectors if present (#3405)
  • 0eb461a refactor: restructure PDF/Image example document organization (#3410)
  • 5d38703 bugfix: google drive connector metadata safegaurds (#3407)
  • e99e5a8 rfctr(file): make FileType enum a file-type descriptor (#3411)
  • 35ee6bf bugfix: conform all connectors to be added to registry (#3408)
  • a5c9a36 rfctr(file): improve file-type auto-detect (#3409)
  • 48bdf94 feat: partition_pdf() support language specification for PaddleOCR (#3400)
  • 6b1d5f2 rfctr: move astra arg (#3383)
  • 56ca39c rfctr(file): improve filetype tests (#3402)
  • Additional commits viewable in compare view

Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting @dependabot rebase.


Dependabot commands and options

You can trigger Dependabot actions by commenting on this PR:

  • @dependabot rebase will rebase this PR
  • @dependabot recreate will recreate this PR, overwriting any edits that have been made to it
  • @dependabot merge will merge this PR after your CI passes on it
  • @dependabot squash and merge will squash and merge this PR after your CI passes on it
  • @dependabot cancel merge will cancel a previously requested merge and block automerging
  • @dependabot reopen will reopen this PR if it is closed
  • @dependabot close will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually
  • @dependabot show <dependency name> ignore conditions will show all of the ignore conditions of the specified dependency
  • @dependabot ignore this major version will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself)
  • @dependabot ignore this minor version will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself)
  • @dependabot ignore this dependency will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself)

🔄 This issue represents a GitHub Pull Request. It cannot be merged through Gitea due to API limitations.

GiteaMirror added the pull-request label 2025-11-11 17:47:06 -06:00

Reference: github-starred/open-webui#8182