[PR #22037] [CLOSED] chore(deps): bump chardet from 5.2.0 to 6.0.0.post1 #49478

Closed
opened 2026-04-30 01:46:47 -05:00 by GiteaMirror · 0 comments

📋 Pull Request Information

Original PR: https://github.com/open-webui/open-webui/pull/22037
Author: @dependabot[bot]
Created: 3/1/2026
Status: Closed

Base: dev ← Head: dependabot/uv/dev/chardet-6.0.0.post1


📝 Commits (1)

  • d5f08a4 chore(deps): bump chardet from 5.2.0 to 6.0.0.post1

📊 Changes

4 files changed (+1368 additions, -1013 deletions)

View changed files

📝 backend/requirements-min.txt (+1 -1)
📝 backend/requirements.txt (+1 -1)
📝 pyproject.toml (+1 -1)
📝 uv.lock (+1365 -1010)

📄 Description

Bumps chardet from 5.2.0 to 6.0.0.post1.

Release notes

Sourced from chardet's releases.

6.0.0.post1

  • Fixed version number in chardet/version.py still being set to 6.0.0dev0. Otherwise identical to 6.0.0.

6.0.0

Features

  • Unified single-byte charset detection: Instead of only having trained language models for a handful of languages (Bulgarian, Greek, Hebrew, Hungarian, Russian, Thai, Turkish) and relying on special-case Latin1Prober and MacRomanProber heuristics for Western encodings, chardet now treats all single-byte charsets the same way: every encoding gets proper language-specific bigram models trained on CulturaX corpus data. This means chardet can now accurately detect both the encoding and the language for all supported single-byte encodings.
  • 38 new languages: Arabic, Belarusian, Breton, Croatian, Czech, Danish, Dutch, English, Esperanto, Estonian, Farsi, Finnish, French, German, Icelandic, Indonesian, Irish, Italian, Kazakh, Latvian, Lithuanian, Macedonian, Malay, Maltese, Norwegian, Polish, Portuguese, Romanian, Scottish Gaelic, Serbian, Slovak, Slovene, Spanish, Swedish, Tajik, Ukrainian, Vietnamese, and Welsh. Existing models for Bulgarian, Greek, Hebrew, Hungarian, Russian, Thai, and Turkish were also retrained with the new pipeline.
  • EncodingEra filtering: The new encoding_era parameter to detect() takes an EncodingEra flag enum (MODERN_WEB, LEGACY_ISO, LEGACY_MAC, LEGACY_REGIONAL, DOS, MAINFRAME, ALL), which lets callers restrict detection to encodings from a specific era. detect() and detect_all() default to MODERN_WEB. The new MODERN_WEB default should drastically improve accuracy for users who are not working with legacy data. The tiers are:
    • MODERN_WEB: UTF-8/16/32, Windows-125x, CP874, CJK multi-byte (widely used on the web)
    • LEGACY_ISO: ISO-8859-x, KOI8-R/U (legacy but well-known standards)
    • LEGACY_MAC: Mac-specific encodings (MacRoman, MacCyrillic, etc.)
    • LEGACY_REGIONAL: Uncommon regional/national encodings (KOI8-T, KZ1048, CP1006, etc.)
    • DOS: DOS/OEM code pages (CP437, CP850, CP866, etc.)
    • MAINFRAME: EBCDIC variants (CP037, CP500, etc.)
  • --encoding-era CLI flag: The chardetect CLI now accepts -e/--encoding-era to control which encoding eras are considered during detection.
  • max_bytes and chunk_size parameters: detect(), detect_all(), and UniversalDetector now accept max_bytes (default 200KB) and chunk_size (default 64KB) parameters for controlling how much data is examined. (#314, @​bysiber)
  • Encoding era preference tie-breaking: When multiple encodings have very close confidence scores, the detector now prefers more modern/Unicode encodings over legacy ones.
  • Charset metadata registry: New chardet.metadata.charsets module provides structured metadata about all supported encodings, including their era classification and language filter.
  • should_rename_legacy now defaults intelligently: When set to None (the new default), legacy renaming is automatically enabled when encoding_era is MODERN_WEB.
  • Direct GB18030 support: Replaced the redundant GB2312 prober with a proper GB18030 prober.
  • EBCDIC detection: Added CP037 and CP500 EBCDIC model registrations for mainframe encoding detection.
  • Binary file detection: Added basic binary file detection to abort analysis earlier on non-text files.
  • Python 3.12, 3.13, and 3.14 support (#283, @​hugovk; #311)
  • GitHub Codespace support (#312, @​oxygen-dioxide)
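
The era tiers listed above behave like bit flags that can be combined. The following is a minimal, self-contained sketch of that idea using the standard library only. It is not chardet's implementation: `EraSketch`, `CHARSET_ERA`, and `allowed_charsets` are hypothetical names, the member values are assumptions, and the charset-to-tier mapping only repeats the examples given in the release notes.

```python
from enum import Flag, auto


class EraSketch(Flag):
    """Hypothetical stand-in for the EncodingEra tiers named in the release notes."""

    MODERN_WEB = auto()
    LEGACY_ISO = auto()
    LEGACY_MAC = auto()
    LEGACY_REGIONAL = auto()
    DOS = auto()
    MAINFRAME = auto()
    # Composite member covering every tier.
    ALL = MODERN_WEB | LEGACY_ISO | LEGACY_MAC | LEGACY_REGIONAL | DOS | MAINFRAME


# Illustrative mapping only; tiers follow the examples quoted in the release notes.
CHARSET_ERA = {
    "utf-8": EraSketch.MODERN_WEB,
    "windows-1252": EraSketch.MODERN_WEB,
    "iso-8859-1": EraSketch.LEGACY_ISO,
    "macroman": EraSketch.LEGACY_MAC,
    "cp437": EraSketch.DOS,
    "cp037": EraSketch.MAINFRAME,
}


def allowed_charsets(era: EraSketch) -> list[str]:
    """Return the charsets whose tier overlaps the requested era mask."""
    return sorted(name for name, tier in CHARSET_ERA.items() if tier & era)


print(allowed_charsets(EraSketch.MODERN_WEB))
print(allowed_charsets(EraSketch.MODERN_WEB | EraSketch.DOS))
```

Under this model, a caller who restricts detection to MODERN_WEB never sees ISO-8859-x or DOS candidates, which is why the new default can change results for legacy data.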

Fixes

  • Fix CP949 state machine: Corrected the state machine for Korean CP949 encoding detection. (#268, @​nenw)
  • Fix SJIS distribution analysis: Fixed SJISDistributionAnalysis discarding valid second-byte range >= 0x80. (#315, @​bysiber)
  • Fix UTF-16/32 detection for non-ASCII-heavy text: Improved detection of UTF-16/32 encoded CJK and other non-ASCII text by adding a MIN_RATIO threshold alongside the existing EXPECTED_RATIO.
  • Fix get_charset crash: Resolved a crash when looking up unknown charset names.
  • Fix GB18030 char_len_table: Corrected the character length table for GB18030 multi-byte sequences.
  • Fix UTF-8 state machine: Updated to be more spec-compliant.
  • Fix detect_all() returning inactive probers: Results from probers that determined "definitely not this encoding" are now excluded.
  • Fix early cutoff bug: Resolved an issue where detection could terminate prematurely.
  • Default UTF-8 fallback: If UTF-8 has not been ruled out and nothing else is above the minimum threshold, UTF-8 is now returned as the default.
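
The "Default UTF-8 fallback" rule above can be read as a small decision procedure. The sketch below shows one plausible reading, not chardet's actual code: `choose_encoding` and `MIN_CONFIDENCE` are hypothetical names, and the threshold value is an assumption chosen for illustration.

```python
from typing import Optional

MIN_CONFIDENCE = 0.2  # assumed threshold, for illustration only


def choose_encoding(candidates: dict[str, float], utf8_ruled_out: bool) -> Optional[str]:
    """Pick the most confident candidate, falling back to UTF-8 when none qualifies."""
    above = {name: conf for name, conf in candidates.items() if conf > MIN_CONFIDENCE}
    if above:
        # At least one prober cleared the threshold: return the most confident one.
        return max(above, key=above.get)
    if not utf8_ruled_out:
        # Nothing cleared the threshold, but UTF-8 is still possible.
        return "utf-8"
    return None


print(choose_encoding({"windows-1252": 0.73, "utf-8": 0.10}, utf8_ruled_out=False))
print(choose_encoding({"windows-1252": 0.05}, utf8_ruled_out=False))
print(choose_encoding({"windows-1252": 0.05}, utf8_ruled_out=True))
```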

Breaking changes

  • Dropped Python 3.7, 3.8, and 3.9 support: Now requires Python 3.10+. (#283, @​hugovk)
  • Removed Latin1Prober and MacRomanProber: These special-case probers have been replaced by the unified model-based approach described above. Latin-1, MacRoman, and all other single-byte encodings are now detected by SingleByteCharSetProber with trained language models, giving better accuracy and language identification.
  • Removed EUC-TW support: EUC-TW encoding detection has been removed as it is extremely rare in practice.
  • LanguageFilter.NONE removed: Use specific language filters or LanguageFilter.ALL instead.
  • Enum types changed: InputState, ProbingState, MachineState, SequenceLikelihood, and CharacterCategory are now IntEnum (previously plain classes or Enum). LanguageFilter values changed from hardcoded hex to auto().
  • detect() default behavior change: detect() now defaults to encoding_era=EncodingEra.MODERN_WEB and should_rename_legacy=None (auto-enabled for MODERN_WEB), whereas previously it defaulted to considering all encodings with no legacy renaming.
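
Because detect() gains keyword arguments and changes its defaults in 6.0.0, code that must run against both 5.x and 6.x may want to pass only the keywords the installed version accepts. The sketch below shows one way to do that with signature inspection. `call_detect` is a hypothetical helper, and `detect_5x` and `detect_6x` are stand-ins written for this example, not chardet's real functions; the string "ALL" is a placeholder for whatever era value the caller supplies.

```python
import inspect
from typing import Any, Callable


def call_detect(detect_fn: Callable[..., Any], data: bytes, **optional_kwargs: Any) -> Any:
    """Call detect_fn, passing only the keyword arguments its signature accepts."""
    params = inspect.signature(detect_fn).parameters
    accepts_any = any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values())
    kwargs = {k: v for k, v in optional_kwargs.items() if accepts_any or k in params}
    return detect_fn(data, **kwargs)


# Stand-in with a 5.x-like signature (assumption: no encoding_era parameter).
def detect_5x(byte_str, should_rename_legacy=False):
    return {"should_rename_legacy": should_rename_legacy}


# Stand-in with a 6.x-like signature (defaults taken from the release notes).
def detect_6x(byte_str, should_rename_legacy=None, encoding_era="MODERN_WEB"):
    return {"should_rename_legacy": should_rename_legacy, "encoding_era": encoding_era}


# The same call works against both stand-ins; unsupported keywords are dropped.
print(call_detect(detect_5x, b"abc", encoding_era="ALL", should_rename_legacy=False))
print(call_detect(detect_6x, b"abc", encoding_era="ALL", should_rename_legacy=False))
```

Passing both arguments explicitly, rather than relying on defaults, is what keeps behavior stable across the upgrade in this sketch.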

Misc changes

  • Switched from Poetry/setuptools to uv + hatchling: Build system modernized with hatch-vcs for version management.

... (truncated)

Commits
  • 2fa72d8 Update version to 6.0.0.post1
  • 8a4636b docs: modernize usage examples and reorganize table of contents
  • 20da71e docs: fix copyright start year and remove first-person reference
  • b45ae91 docs: update copyright to 2015-2026 chardet contributors
  • 3f9910d Add .readthedocs.yaml to fix RTD builds
  • 7ef7cd0 Fix pyright type errors in chardetect.py and test.py
  • 4025dfa Update documentation for 6.0.0 release
  • 1170829 Add LEGACY_REGIONAL encoding era and reclassify misplaced encodings
  • 19379ac Add --encoding-era CLI flag and improve heuristic selection
  • 61308e2 Pre-release fixes: bump to 6.0.0, fix get_charset crash, cleanup
  • Additional commits viewable in compare view

Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting @dependabot rebase.


Dependabot commands and options

You can trigger Dependabot actions by commenting on this PR:

  • @dependabot rebase will rebase this PR
  • @dependabot recreate will recreate this PR, overwriting any edits that have been made to it
  • @dependabot show <dependency name> ignore conditions will show all of the ignore conditions of the specified dependency
  • @dependabot ignore this major version will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself)
  • @dependabot ignore this minor version will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself)
  • @dependabot ignore this dependency will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself)

🔄 This issue represents a GitHub Pull Request. It cannot be merged through Gitea due to API limitations.

GiteaMirror added the pull-request label 2026-04-30 01:46:47 -05:00
Reference: github-starred/open-webui#49478