[PR #22037] [CLOSED] chore(deps): bump chardet from 5.2.0 to 6.0.0.post1 #49478

2026-04-30T01:46:47-05:00

GiteaMirror commented

2026-04-30 01:46:47 -05:00

📋 Pull Request Information

Original PR: https://github.com/open-webui/open-webui/pull/22037
Author: @dependabot[bot]
Created: 3/1/2026
Status: ❌ Closed

Base: dev ← Head: dependabot/uv/dev/chardet-6.0.0.post1

📝 Commits (1)

d5f08a4 chore(deps): bump chardet from 5.2.0 to 6.0.0.post1

📊 Changes

4 files changed (+1368 additions, -1013 deletions)

📝 backend/requirements-min.txt (+1 -1)
📝 backend/requirements.txt (+1 -1)
📝 pyproject.toml (+1 -1)
📝 uv.lock (+1365 -1010)

📄 Description

Bumps chardet from 5.2.0 to 6.0.0.post1.

Release notes

Sourced from chardet's releases.

6.0.0.post1

Fixed version number in chardet/version.py still being set to 6.0.0dev0. Otherwise identical to 6.0.0.

6.0.0

Features

Unified single-byte charset detection: Instead of only having trained language models for a handful of languages (Bulgarian, Greek, Hebrew, Hungarian, Russian, Thai, Turkish) and relying on special-case Latin1Prober and MacRomanProber heuristics for Western encodings, chardet now treats all single-byte charsets the same way: every encoding gets proper language-specific bigram models trained on CulturaX corpus data. This means chardet can now accurately detect both the encoding and the language for all supported single-byte encodings.

38 new languages: Arabic, Belarusian, Breton, Croatian, Czech, Danish, Dutch, English, Esperanto, Estonian, Farsi, Finnish, French, German, Icelandic, Indonesian, Irish, Italian, Kazakh, Latvian, Lithuanian, Macedonian, Malay, Maltese, Norwegian, Polish, Portuguese, Romanian, Scottish Gaelic, Serbian, Slovak, Slovene, Spanish, Swedish, Tajik, Ukrainian, Vietnamese, and Welsh. Existing models for Bulgarian, Greek, Hebrew, Hungarian, Russian, Thai, and Turkish were also retrained with the new pipeline.

EncodingEra filtering: The new encoding_era parameter to detect() takes an EncodingEra flag enum (MODERN_WEB, LEGACY_ISO, LEGACY_MAC, LEGACY_REGIONAL, DOS, MAINFRAME, ALL), which lets callers restrict detection to encodings from a specific era. detect() and detect_all() default to MODERN_WEB. The new MODERN_WEB default should drastically improve accuracy for users who are not working with legacy data. The tiers are:

MODERN_WEB: UTF-8/16/32, Windows-125x, CP874, CJK multi-byte (widely used on the web)

LEGACY_ISO: ISO-8859-x, KOI8-R/U (legacy but well-known standards)

LEGACY_MAC: Mac-specific encodings (MacRoman, MacCyrillic, etc.)

LEGACY_REGIONAL: Uncommon regional/national encodings (KOI8-T, KZ1048, CP1006, etc.)

DOS: DOS/OEM code pages (CP437, CP850, CP866, etc.)

MAINFRAME: EBCDIC variants (CP037, CP500, etc.)
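
The tiers above behave like a flag enum whose members can be combined with `|`. The following is a stand-in sketch built on the standard library's `enum.Flag`, not chardet's own code: the names `Era`, `CANDIDATES`, and `allowed_encodings` are hypothetical, and the table holds only one example encoding per tier, taken from the list above.

```python
from enum import Flag, auto


class Era(Flag):
    """Hypothetical stand-in for the era tiers named in the release notes."""

    MODERN_WEB = auto()
    LEGACY_ISO = auto()
    LEGACY_MAC = auto()
    LEGACY_REGIONAL = auto()
    DOS = auto()
    MAINFRAME = auto()
    ALL = MODERN_WEB | LEGACY_ISO | LEGACY_MAC | LEGACY_REGIONAL | DOS | MAINFRAME


# One example encoding per tier (illustrative table, not chardet's registry).
CANDIDATES = {
    "UTF-8": Era.MODERN_WEB,
    "KOI8-R": Era.LEGACY_ISO,
    "MacRoman": Era.LEGACY_MAC,
    "KZ1048": Era.LEGACY_REGIONAL,
    "CP437": Era.DOS,
    "CP037": Era.MAINFRAME,
}


def allowed_encodings(eras: Era) -> list[str]:
    """Return the candidate encodings whose tier overlaps the requested eras."""
    return [name for name, era in CANDIDATES.items() if era & eras]


print(allowed_encodings(Era.MODERN_WEB))            # ['UTF-8']
print(allowed_encodings(Era.MODERN_WEB | Era.DOS))  # ['UTF-8', 'CP437']
print(len(allowed_encodings(Era.ALL)))              # 6
```

Because the eras are flags, a caller who needs both web and DOS-era data can pass a combined value rather than falling back to ALL.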

--encoding-era CLI flag: The chardetect CLI now accepts -e/--encoding-era to control which encoding eras are considered during detection.

max_bytes and chunk_size parameters: detect(), detect_all(), and UniversalDetector now accept max_bytes (default 200KB) and chunk_size (default 64KB) parameters for controlling how much data is examined. (#314, @bysiber)
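
To illustrate how a byte cap and a chunk size interact, here is a minimal sketch of chunked reading under a limit. It is not chardet's implementation: `iter_chunks` is a hypothetical helper, and the constants assume that "200KB" and "64KB" mean multiples of 1024 bytes, which the release notes do not specify.

```python
import io
from typing import BinaryIO, Iterator

MAX_BYTES = 200 * 1024   # assumed meaning of "200KB"
CHUNK_SIZE = 64 * 1024   # assumed meaning of "64KB"


def iter_chunks(
    stream: BinaryIO,
    max_bytes: int = MAX_BYTES,
    chunk_size: int = CHUNK_SIZE,
) -> Iterator[bytes]:
    """Yield chunks of at most chunk_size bytes until max_bytes have been read."""
    remaining = max_bytes
    while remaining > 0:
        chunk = stream.read(min(chunk_size, remaining))
        if not chunk:
            break  # stream exhausted before the cap was reached
        remaining -= len(chunk)
        yield chunk


# A 300 KiB input is cut off at the 200 KiB cap.
data = io.BytesIO(b"x" * (300 * 1024))
print([len(chunk) for chunk in iter_chunks(data)])  # [65536, 65536, 65536, 8192]
```

A smaller cap bounds the work done on large files, at the cost of ignoring evidence that appears only later in the data.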

Encoding era preference tie-breaking: When multiple encodings have very close confidence scores, the detector now prefers more modern/Unicode encodings over legacy ones.
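
The tie-breaking rule can be sketched as follows. This is an illustration under stated assumptions rather than chardet's actual logic: `Candidate`, `pick`, the `era_rank` ordering, and the `TIE_MARGIN` value are all hypothetical, since the release notes do not say how close "very close" is.

```python
from typing import NamedTuple


class Candidate(NamedTuple):
    encoding: str
    confidence: float
    era_rank: int  # hypothetical ordering: 0 = most modern, larger = more legacy


TIE_MARGIN = 0.05  # hypothetical definition of "very close" confidence scores


def pick(candidates: list[Candidate]) -> Candidate:
    """Among candidates within TIE_MARGIN of the top score, prefer the most modern."""
    best = max(c.confidence for c in candidates)
    close = [c for c in candidates if best - c.confidence <= TIE_MARGIN]
    # Lowest era_rank wins; higher confidence breaks any remaining tie.
    return min(close, key=lambda c: (c.era_rank, -c.confidence))


near_tie = [
    Candidate("ISO-8859-1", 0.73, 1),
    Candidate("Windows-1252", 0.71, 0),
    Candidate("CP437", 0.40, 4),
]
print(pick(near_tie).encoding)  # Windows-1252

clear_winner = [
    Candidate("KOI8-R", 0.95, 1),
    Candidate("Windows-1251", 0.60, 0),
]
print(pick(clear_winner).encoding)  # KOI8-R
```

The era preference only applies inside the margin, so a legacy encoding with a clearly higher score still wins.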

Charset metadata registry: New chardet.metadata.charsets module provides structured metadata about all supported encodings, including their era classification and language filter.

should_rename_legacy now defaults intelligently: When set to None (the new default), legacy renaming is automatically enabled when encoding_era is MODERN_WEB.
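
The tri-state default described here can be sketched as a small resolver. `resolve_rename_legacy` is a hypothetical helper, not a chardet function, and it assumes that `None` resolves to "disabled" for every era other than MODERN_WEB, which the release notes imply but do not state outright.

```python
from typing import Optional


def resolve_rename_legacy(
    should_rename_legacy: Optional[bool],
    era_is_modern_web: bool,
) -> bool:
    """Resolve the tri-state flag: None means 'decide from the encoding era'."""
    if should_rename_legacy is None:
        # Assumption: auto mode enables renaming only for MODERN_WEB.
        return era_is_modern_web
    # An explicit True or False always overrides the automatic choice.
    return should_rename_legacy


print(resolve_rename_legacy(None, era_is_modern_web=True))    # True
print(resolve_rename_legacy(None, era_is_modern_web=False))   # False
print(resolve_rename_legacy(False, era_is_modern_web=True))   # False
```

Callers that depend on the 5.x encoding names would pass an explicit `False` rather than rely on the automatic choice.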

Direct GB18030 support: Replaced the redundant GB2312 prober with a proper GB18030 prober.

EBCDIC detection: Added CP037 and CP500 EBCDIC model registrations for mainframe encoding detection.

Binary file detection: Added basic binary file detection to abort analysis earlier on non-text files.

Python 3.12, 3.13, and 3.14 support (#283, @hugovk; #311)

GitHub Codespace support (#312, @oxygen-dioxide)

Fixes

Fix CP949 state machine: Corrected the state machine for Korean CP949 encoding detection. (#268, @nenw)

Fix SJIS distribution analysis: Fixed SJISDistributionAnalysis discarding valid second-byte range >= 0x80. (#315, @bysiber)

Fix UTF-16/32 detection for non-ASCII-heavy text: Improved detection of UTF-16/32 encoded CJK and other non-ASCII text by adding a MIN_RATIO threshold alongside the existing EXPECTED_RATIO.

Fix get_charset crash: Resolved a crash when looking up unknown charset names.

Fix GB18030 char_len_table: Corrected the character length table for GB18030 multi-byte sequences.

Fix UTF-8 state machine: Updated to be more spec-compliant.

Fix detect_all() returning inactive probers: Results from probers that determined "definitely not this encoding" are now excluded.

Fix early cutoff bug: Resolved an issue where detection could terminate prematurely.

Default UTF-8 fallback: If UTF-8 has not been ruled out and nothing else is above the minimum threshold, UTF-8 is now returned as the default.
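
The UTF-8 fallback described in the last fix can be sketched as below. This is a simplified illustration, not the detector's code: `choose_encoding` is a hypothetical helper and the `MIN_THRESHOLD` value is an assumption.

```python
from typing import Optional

MIN_THRESHOLD = 0.20  # assumed minimum confidence; the real value may differ


def choose_encoding(scores: dict[str, float], utf8_ruled_out: bool) -> Optional[str]:
    """Pick the best-scoring encoding, falling back to UTF-8 when nothing qualifies."""
    above = {enc: conf for enc, conf in scores.items() if conf > MIN_THRESHOLD}
    if above:
        return max(above, key=above.get)
    if not utf8_ruled_out:
        return "utf-8"  # fallback: nothing cleared the threshold, UTF-8 still possible
    return None


print(choose_encoding({"windows-1252": 0.10}, utf8_ruled_out=False))  # utf-8
print(choose_encoding({"windows-1252": 0.10}, utf8_ruled_out=True))   # None
print(choose_encoding({"windows-1252": 0.60}, utf8_ruled_out=False))  # windows-1252
```

The fallback matters mostly for short or nearly pure-ASCII inputs, where no prober gathers enough evidence to clear the threshold.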

Breaking changes

Dropped Python 3.7, 3.8, and 3.9 support: Now requires Python 3.10+. (#283, @hugovk)

Removed Latin1Prober and MacRomanProber: These special-case probers have been replaced by the unified model-based approach described above. Latin-1, MacRoman, and all other single-byte encodings are now detected by SingleByteCharSetProber with trained language models, giving better accuracy and language identification.

Removed EUC-TW support: EUC-TW encoding detection has been removed as it is extremely rare in practice.

LanguageFilter.NONE removed: Use specific language filters or LanguageFilter.ALL instead.

Enum types changed: InputState, ProbingState, MachineState, SequenceLikelihood, and CharacterCategory are now IntEnum (previously plain classes or Enum). LanguageFilter values changed from hardcoded hex to auto().
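
The practical effect of the enum change shows up in comparisons against raw integers. The sketch below uses only the standard library; `OldState`, `NewState`, and `ExampleFilter` are hypothetical stand-ins rather than chardet's classes, and the member names are illustrative.

```python
from enum import Enum, Flag, IntEnum, auto


class OldState(Enum):      # stand-in for a plain Enum
    START = 0
    ERROR = 1


class NewState(IntEnum):   # stand-in for the same states as an IntEnum
    START = 0
    ERROR = 1


# Plain Enum members never compare equal to raw integers; IntEnum members do.
print(OldState.ERROR == 1)  # False
print(NewState.ERROR == 1)  # True
print(NewState.ERROR + 1)   # 2


class ExampleFilter(Flag):  # stand-in for a flag whose values come from auto()
    FIRST = auto()
    SECOND = auto()
    THIRD = auto()


# auto() assigns successive powers of two in definition order.
print([member.value for member in ExampleFilter])  # [1, 2, 4]
```

Since values assigned by `auto()` follow definition order, code that stores or compares raw integer values is safer comparing against the enum members themselves.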

detect() default behavior change: detect() now defaults to encoding_era=EncodingEra.MODERN_WEB and should_rename_legacy=None (auto-enabled for MODERN_WEB), whereas previously it defaulted to considering all encodings with no legacy renaming.

Misc changes

Switched from Poetry/setuptools to uv + hatchling: Build system modernized with hatch-vcs for version management.

... (truncated)

Commits

2fa72d8 Update version to 6.0.0.post1
8a4636b docs: modernize usage examples and reorganize table of contents
20da71e docs: fix copyright start year and remove first-person reference
b45ae91 docs: update copyright to 2015-2026 chardet contributors
3f9910d Add .readthedocs.yaml to fix RTD builds
7ef7cd0 Fix pyright type errors in chardetect.py and test.py
4025dfa Update documentation for 6.0.0 release
1170829 Add LEGACY_REGIONAL encoding era and reclassify misplaced encodings
19379ac Add --encoding-era CLI flag and improve heuristic selection
61308e2 Pre-release fixes: bump to 6.0.0, fix get_charset crash, cleanup
Additional commits viewable in compare view

Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting @dependabot rebase.

Dependabot commands and options

You can trigger Dependabot actions by commenting on this PR:

@dependabot rebase will rebase this PR
@dependabot recreate will recreate this PR, overwriting any edits that have been made to it
@dependabot show <dependency name> ignore conditions will show all of the ignore conditions of the specified dependency
@dependabot ignore this major version will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself)
@dependabot ignore this minor version will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself)
@dependabot ignore this dependency will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself)

🔄 This issue represents a GitHub Pull Request. It cannot be merged through Gitea due to API limitations.


GiteaMirror added the pull-request label 2026-04-30 01:46:47 -05:00

GiteaMirror closed this issue

2026-04-30 01:46:49 -05:00

Reference: github-starred/open-webui#49478