Fix most of `unicodedata`, unmask multiple tests, and remove `ucd` by joshuamegnauth54 · Pull Request #7947 · RustPython/RustPython
joshuamegnauth54 added a commit to joshuamegnauth54/RustPython that referenced this pull request
I removed an embedded table of non-ASCII numbers in favor of using `icu_decimal`. The benefits of using `icu4x` here are consistency plus Unicode updates. As Unicode is updated, we automatically reap the benefits without having to modify the table. `icu_decimal` is also useful beyond `float()`. I'm also using it to clean up `unicodedata` in RustPython#7947.
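To illustrate the maintenance cost of an embedded table, here is a minimal sketch of that style of lookup. It is not the table that was removed from RustPython, and it does not use the `icu_decimal` API. The names `ZEROS` and `decimal_value` are hypothetical, and the table deliberately lists only a handful of digit blocks.

```rust
/// Code point of the digit zero in a few decimal digit blocks.
/// Hypothetical, hand-maintained table: every block added by a new
/// Unicode release would have to be appended here manually.
const ZEROS: &[u32] = &[
    0x0030, // ASCII digits
    0x0660, // Arabic-Indic digits
    0x06F0, // Extended Arabic-Indic digits
    0x0966, // Devanagari digits
    0xFF10, // Fullwidth digits
];

/// Returns the decimal value of `c` if it falls in one of the listed blocks.
fn decimal_value(c: char) -> Option<u32> {
    let cp = c as u32;
    ZEROS.iter().find_map(|&zero| {
        // Offset from the block's zero; `None` if `cp` is below it.
        let offset = cp.checked_sub(zero)?;
        (offset < 10).then_some(offset)
    })
}
```

Any digit block missing from such a table is silently rejected, which is the kind of drift that delegating to a maintained Unicode library avoids.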
joshuamegnauth54 changed the title from "Fix isprintable() and fix Unicode 3.2" to "Fix most of unicodedata, unmask multiple tests, and remove ucd"
Python bundles an old version of Unicode for compatibility. RustPython tries to mimic support for that old version by checking the age of each individual char. This is a problem for several reasons. First, the age check adds an extra lookup for every char looked up in the Unicode data. Second, the check is outdated because the `unic-ucd-age` crate is several versions behind the current Unicode version, so it rejects valid chars. Third, the check is subtly wrong: on the Unicode 3.2.0 path it returns properties from Unicode 16.0.0 while checking the age against a Unicode 10.0.0 database.

Unfortunately, there isn't a crate that can help us here. `icu4x` targets modern Unicode versions, and writing an `icu4x` data provider for Unicode 3.2.0 is a lot of work for a legacy path. I opted to parse the Unicode 3.2.0 data myself and, instead of going through `icu4x` (mostly), to write small lookup tables. As of this commit, Unicode names are still wrong for 3.2.0. Luckily, the crate RustPython uses is fast and robust for modern Unicode.
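A small lookup table of this kind can be sketched as a sorted list of code point ranges searched with a binary search. This is only an illustration of the general technique, not the code in this PR. The names `LEGACY_RANGES` and `in_ranges` are hypothetical, and the ranges are placeholder values, not data parsed from the Unicode 3.2.0 files.

```rust
use std::cmp::Ordering;

/// Inclusive code point ranges, sorted and non-overlapping.
/// PLACEHOLDER DATA for illustration only; a real table would be
/// generated from the Unicode 3.2.0 data files.
const LEGACY_RANGES: &[(u32, u32)] = &[
    (0x0000, 0x007F),
    (0x00A0, 0x00FF),
    (0x0400, 0x0486),
];

/// Returns true if `c` falls inside one of the ranges in `table`.
fn in_ranges(table: &[(u32, u32)], c: char) -> bool {
    let cp = c as u32;
    table
        .binary_search_by(|&(lo, hi)| {
            if hi < cp {
                Ordering::Less // range lies entirely below the target
            } else if lo > cp {
                Ordering::Greater // range lies entirely above the target
            } else {
                Ordering::Equal // target is inside this range
            }
        })
        .is_ok()
}
```

Under this assumed design, the table is consulted only when the legacy version is requested, so lookups against modern Unicode would not pay for a per-char age check.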