Fix most of `unicodedata`, unmask multiple tests, and remove `ucd` by joshuamegnauth54 · Pull Request #7947 · RustPython/RustPython

ShaharNaveh

ShaharNaveh

joshuamegnauth54 added a commit to joshuamegnauth54/RustPython that referenced this pull request

May 30, 2026
I removed an embedded table of non-ASCII numbers in favor of using
`icu_decimal`. The benefits of using `icu4x` here are consistency plus
Unicode updates. As Unicode is updated, we automatically reap the
benefits without having to modify the table.

`icu_decimal` is also useful beyond `float()`. I'm also using it to
clean up `unicodedata` in RustPython#7947.

coderabbitai[bot]

joshuamegnauth54 changed the title from `Fix isprintable() and fix Unicode 3.2` to `Fix most of unicodedata, unmask multiple tests, and remove ucd`

Jun 9, 2026

youknowone

ShaharNaveh

auto-merge was automatically disabled

June 11, 2026 19:05

Pull request was converted to draft

coderabbitai[bot]

ShaharNaveh

Python bundles an old version of Unicode (3.2.0) for compatibility.
RustPython tries to mimic that old version by checking the age of each
individual char. This is a problem for a few reasons. First, the age
check adds an extra lookup to every char lookup in the Unicode data.
Second, the check is outdated because the `unic-ucd-age` crate is
several versions behind the current Unicode version. As a result, the
check rejects valid chars because of that version gap.

The check is also subtly wrong: for a Unicode 3.2.0 lookup it returns
properties from Unicode 16.0.0 while checking the char's age against a
Unicode 10.0.0 database.
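
To illustrate the failure mode, here is a minimal sketch of an age-gated
lookup. Everything in it is hypothetical: `Version`, `AGE_TABLE`,
`age_of`, and `is_assigned_in` are illustration names, not RustPython or
`unic-ucd-age` APIs, and the table holds only three sample entries. It
shows how a stale age table rejects a char that a newer Unicode version
does assign.

```rust
/// Hypothetical (major, minor) Unicode version pair.
type Version = (u8, u8);

/// Toy age table: (first code point, last code point, version that added it).
/// Assumption: this stands in for an age database that stops at an older
/// Unicode version and so knows nothing about chars added later.
const AGE_TABLE: &[(u32, u32, Version)] = &[
    (0x0041, 0x005A, (1, 1)),   // LATIN CAPITAL LETTER A..Z
    (0x20AC, 0x20AC, (2, 1)),   // EURO SIGN
    (0x1F600, 0x1F600, (6, 1)), // GRINNING FACE
];

/// Return the version in which `c` was assigned, if the table knows it.
fn age_of(c: char) -> Option<Version> {
    let cp = c as u32;
    AGE_TABLE
        .iter()
        .find(|&&(lo, hi, _)| lo <= cp && cp <= hi)
        .map(|&(_, _, version)| version)
}

/// Age gate: accept `c` only if it was assigned at or before `target`.
/// Note that this is one extra table lookup on top of the property lookup.
fn is_assigned_in(c: char, target: Version) -> bool {
    matches!(age_of(c), Some(version) if version <= target)
}
```

With this table, `is_assigned_in('A', (3, 2))` is true and
`is_assigned_in('\u{1F600}', (3, 2))` is false, as intended. However,
`is_assigned_in('\u{1FAE8}', (16, 0))` is also false: U+1FAE8 was added
in Unicode 15.0, which the stale table has never heard of, so a valid
char is rejected.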

Unfortunately, there isn't a crate that can help us here. `icu4x`
targets modern Unicode versions, and writing an `icu4x` data provider
for Unicode 3.2.0 is a lot of work for a legacy path. I opted to parse
the Unicode 3.2.0 data myself and, instead of going through `icu4x`
(mostly), to write small lookup tables.
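
As a rough idea of what a small lookup table can look like, here is a
sketch that stores sorted, non-overlapping code point ranges and
searches them with a binary search. It is not the table layout used in
this PR: `DECIMAL_RANGES` and `decimal_value` are hypothetical names,
and the three ranges are a hand-picked sample of decimal digit blocks.

```rust
use std::cmp::Ordering;

/// Hypothetical sample table: inclusive ranges of decimal digits,
/// sorted by start and non-overlapping. Each range runs from digit
/// zero to digit nine of one script.
const DECIMAL_RANGES: &[(u32, u32)] = &[
    (0x0030, 0x0039), // DIGIT ZERO..NINE
    (0x0660, 0x0669), // ARABIC-INDIC DIGIT ZERO..NINE
    (0x0966, 0x096F), // DEVANAGARI DIGIT ZERO..NINE
];

/// Return the decimal value of `c`, or `None` if it is not in the table.
fn decimal_value(c: char) -> Option<u32> {
    let cp = c as u32;
    DECIMAL_RANGES
        .binary_search_by(|&(lo, hi)| {
            if hi < cp {
                Ordering::Less // range lies entirely below cp
            } else if lo > cp {
                Ordering::Greater // range lies entirely above cp
            } else {
                Ordering::Equal // cp falls inside this range
            }
        })
        .ok()
        // The value is the offset from the range's digit zero.
        .map(|index| cp - DECIMAL_RANGES[index].0)
}
```

For example, `decimal_value('7')` and `decimal_value('\u{0663}')` return
`Some(7)` and `Some(3)`, while `decimal_value('x')` returns `None`. A
range table like this stays small because whole blocks collapse into a
single entry.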

As of this commit, Unicode names are still wrong for 3.2.0.
Luckily, the crate RustPython uses is fast and robust for modern
Unicode.