
Bump chardet from 3.0.4 to 4.0.0

Norman Ziegner requested to merge dependabot/pip/chardet-4.0.0 into master

Bumps chardet from 3.0.4 to 4.0.0.

Release notes

Sourced from chardet's releases.

chardet 4.0.0

This will be the last release of chardet to support Python 2.7. chardet 5.0 will only support Python 3.6+.

Major Changes

This release is multiple years in the making, and provides some quality-of-life improvements to chardet. The primary user-facing changes are:

  1. Single-byte charset probers now use nested dictionaries under the hood, so they are usually a little faster than before. (See #121 for details)
  2. The CharsetGroupProber class now properly short-circuits when one of the probers in the group is considered a definite match. This led to a substantial speedup.
  3. There is now a chardet.detect_all function that returns a list of possible encodings for the input with associated confidences.
  4. We have dropped support for Python 2.6, 3.4, and 3.5 as they are all past end-of-life.
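
As a rough illustration of the new chardet.detect_all function from item 3, the sketch below ranks the candidates it returns and keeps the best one only if it clears a threshold. The helper pick_encoding and its min_confidence cutoff are hypothetical names introduced here, not part of chardet. The "encoding" and "confidence" keys are assumed to match the dictionaries that chardet.detect returns.

```python
# Sketch only: pick_encoding and min_confidence are hypothetical, not chardet API.
def pick_encoding(candidates, min_confidence=0.5):
    """Return the most confident encoding name, or None if none is confident enough."""
    # Drop entries without an encoding (chardet reports None when it finds nothing).
    ranked = sorted(
        (c for c in candidates if c.get("encoding")),
        key=lambda c: c["confidence"],
        reverse=True,
    )
    if ranked and ranked[0]["confidence"] >= min_confidence:
        return ranked[0]["encoding"]
    return None


if __name__ == "__main__":
    try:
        import chardet  # detect_all needs chardet >= 4.0.0
    except ImportError:
        chardet = None

    if chardet is not None:
        data = "Привет, мир! Это проверка определения кодировки.".encode("windows-1251")
        candidates = chardet.detect_all(data)
        for candidate in candidates:
            print(candidate)
        print("picked:", pick_encoding(candidates))
```

Keeping the ranking in a separate helper means the choice of threshold can be tested without chardet installed.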

The changes in this release have also laid the groundwork for retraining the models to make them more accurate, and to support some more encodings/languages (see #99 for progress). This is our main focus for chardet 5.0 (beyond dropping Python 2 support).

Benchmarks

Running on a MacBook Pro (15-inch, 2018) with 2.2GHz 6-core i7 processor and 32GB RAM

old version (chardet 3.0.4)

Benchmarking chardet 3.0.4 on CPython 3.7.5 (default, Sep  8 2020, 12:19:42)
[Clang 11.0.3 (clang-1103.0.32.62)]
--------------------------------------------------------------------------------
Calls per second for each encoding:
ascii: 25559.439366240098
big5: 7.187002209518091
cp932: 4.71090956645177
cp949: 2.937256786994428
euc-jp: 4.870580412090848
euc-kr: 6.6910755971933416
euc-tw: 87.71098043480079
gb2312: 6.614302607154443
ibm855: 27.595893549680685
ibm866: 29.93483661732791
iso-2022-jp: 3379.5052775763434
iso-2022-kr: 26181.67290886392
iso-8859-1: 120.63424740403983
iso-8859-5: 32.65106262196898
iso-8859-7: 62.480089080556084
koi8-r: 13.72481001727257
maccyrillic: 33.018537255804496
shift_jis: 4.996013583677438
tis-620: 14.323112928341818
utf-16: 166771.53081510935
utf-32: 198782.18009478672
utf-8: 13.966236809766901
utf-8-sig: 193732.28637413395
windows-1251: 23.038910006925768
... (truncated)
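
For readers who want to reproduce numbers of this kind locally, a calls-per-second figure can be approximated with a simple timing loop. This is a minimal sketch and not chardet's own benchmark script. The function calls_per_second and its min_seconds parameter are hypothetical names, and the sample payload is arbitrary.

```python
import time


# Sketch only: calls_per_second is a hypothetical helper, not part of chardet.
def calls_per_second(func, payload, min_seconds=0.2):
    """Call func(payload) repeatedly for at least min_seconds and return the call rate."""
    calls = 0
    elapsed = 0.0
    start = time.perf_counter()
    while elapsed < min_seconds:
        func(payload)
        calls += 1
        elapsed = time.perf_counter() - start
    return calls / elapsed


if __name__ == "__main__":
    try:
        import chardet
    except ImportError:
        chardet = None

    if chardet is not None:
        sample = ("This is a plain ASCII sample sentence. " * 50).encode("ascii")
        print("ascii:", calls_per_second(chardet.detect, sample))
```

Results depend heavily on the input data and the machine, so figures from a loop like this are only comparable between runs on the same setup.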
Commits
  • a808ed1 Merge pull request #140 from chardet/master
  • 53854fb Add language to detect_all output
  • 1e208b7 Properly set CharsetGroupProber.state to FOUND_IT (#203)
  • a9286f7 Try to switch from Travis to GitHub Actions (#204)
  • 1db0347 Handle weird logging edge case in universaldetector.py
  • 056a2a4 Remove shebang and executable bit from chardet/cli/chardetect.py (#171)
  • 55ef330 Update links (#152)
  • e4290b6 Remove unnecessary numeric placeholders from format strings (#176)
  • 6a59c4b Remove use of deprecated 'setup.py test' (#187)
  • 4650dbf Remove shebang from nonexecutable script (#192)
  • Additional commits viewable in compare view
