Bump chardet from 3.0.4 to 4.0.0
Bumps chardet from 3.0.4 to 4.0.0.
Release notes
Sourced from chardet's releases.
chardet 4.0.0
⚠️ This will be the last release of chardet to support Python 2.7. chardet 5.0 will only support 3.6+ ⚠️
Major Changes
This release is multiple years in the making and provides some quality-of-life improvements to chardet. The primary user-facing changes are:
- Single-byte charset probers now use nested dictionaries under the hood, so they are usually a little faster than before. (See #121 for details)
- The CharsetGroupProber class now properly short-circuits when one of the probers in the group is considered a definite match. This led to a substantial speedup.
- There is now a chardet.detect_all function that returns a list of possible encodings for the input with associated confidences.
- We have dropped support for Python 2.6, 3.4, and 3.5 as they are all past end-of-life.
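For reviewers who want to try the new function after this bump, here is a minimal sketch of how chardet.detect_all might be used to list the most likely encodings for some bytes. The helper name `top_candidates` and its `detect_all` hook are hypothetical, introduced only for this illustration (the hook lets the helper be exercised without chardet installed). The sketch assumes each result is a dict with at least `encoding` and `confidence` keys.

```python
from typing import Callable, Dict, List, Optional


def top_candidates(
    data: bytes,
    limit: int = 3,
    detect_all: Optional[Callable[[bytes], List[Dict]]] = None,
) -> List[Dict]:
    """Return up to `limit` candidate encodings, highest confidence first.

    `top_candidates` and the `detect_all` hook are hypothetical names for
    this sketch; by default the hook resolves to chardet.detect_all.
    """
    if detect_all is None:
        import chardet  # needs chardet >= 4.0.0, where detect_all was added

        detect_all = chardet.detect_all

    results = detect_all(data)
    # Sort defensively; a missing or None confidence is treated as 0.0.
    ranked = sorted(
        results,
        key=lambda r: r.get("confidence") or 0.0,
        reverse=True,
    )
    return ranked[:limit]
```

With chardet 4.0.0 installed, calling `top_candidates(some_bytes)` returns the leading guesses, and each entry's `encoding` and `confidence` can be logged or compared before choosing how to decode.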
The changes in this release have also laid the groundwork for retraining the models to make them more accurate, and to support some more encodings/languages (see #99 for progress). This is our main focus for chardet 5.0 (beyond dropping Python 2 support).
Benchmarks
Running on a MacBook Pro (15-inch, 2018) with 2.2GHz 6-core i7 processor and 32GB RAM
old version (chardet 3.0.4)
Benchmarking chardet 3.0.4 on CPython 3.7.5 (default, Sep 8 2020, 12:19:42)
[Clang 11.0.3 (clang-1103.0.32.62)]
--------------------------------------------------------------------------------
Calls per second for each encoding:
ascii: 25559.439366240098
big5: 7.187002209518091
cp932: 4.71090956645177
cp949: 2.937256786994428
euc-jp: 4.870580412090848
euc-kr: 6.6910755971933416
euc-tw: 87.71098043480079
gb2312: 6.614302607154443
ibm855: 27.595893549680685
ibm866: 29.93483661732791
iso-2022-jp: 3379.5052775763434
iso-2022-kr: 26181.67290886392
iso-8859-1: 120.63424740403983
iso-8859-5: 32.65106262196898
iso-8859-7: 62.480089080556084
koi8-r: 13.72481001727257
maccyrillic: 33.018537255804496
shift_jis: 4.996013583677438
tis-620: 14.323112928341818
utf-16: 166771.53081510935
utf-32: 198782.18009478672
utf-8: 13.966236809766901
utf-8-sig: 193732.28637413395
windows-1251: 23.038910006925768
... (truncated)
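To compare throughput locally before and after the upgrade, a "calls per second" figure like the ones above could be approximated with a small timing loop. This is a rough sketch, not the project's actual benchmark script; the function name `calls_per_second` and the `repeat` parameter are assumptions made for illustration.

```python
import time
from typing import Callable


def calls_per_second(
    func: Callable[[bytes], object],
    sample: bytes,
    repeat: int = 100,
) -> float:
    """Call `func(sample)` `repeat` times and return the average call rate.

    Hypothetical helper for this sketch; results depend on the machine,
    the sample, and the Python build, so only relative comparisons are
    meaningful.
    """
    start = time.perf_counter()
    for _ in range(repeat):
        func(sample)
    elapsed = time.perf_counter() - start
    # Guard against a zero reading on very fast calls or coarse clocks.
    return repeat / elapsed if elapsed > 0 else float("inf")
```

For example, `calls_per_second(chardet.detect, sample_bytes)` could be run once under chardet 3.0.4 and once under 4.0.0 with the same `sample_bytes` to see whether the speedup applies to a given workload.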
Commits
- a808ed1 Merge pull request #140 from chardet/master
- 53854fb Add language to detect_all output
- 1e208b7 Properly set CharsetGroupProber.state to FOUND_IT (#203)
- a9286f7 Try to switch from Travis to GitHub Actions (#204)
- 1db0347 Handle weird logging edge case in universaldetector.py
- 056a2a4 Remove shebang and executable bit from chardet/cli/chardetect.py (#171)
- 55ef330 Update links (#152)
- e4290b6 Remove unnecessary numeric placeholders from format strings (#176)
- 6a59c4b Remove use of deprecated 'setup.py test' (#187)
- 4650dbf Remove shebang from nonexecutable script (#192)
- Additional commits viewable in compare view