Fix most of unicodedata, unmask multiple tests, and remove ucd#7947
Conversation
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughMigrates unicode handling from ChangesUnicode Data Infrastructure Refactor
Sequence Diagram(s)sequenceDiagram
participant build_rs as build.rs
participant UCDfiles as unicode/ucd32 & unicode/latest
participant Parser as parse_unicode_3_2
participant Emitter as codegen (write_derived / emit)
participant Runtime as unicodedata.rs
build_rs->>UCDfiles: read vendored UCD 3.2.0 + latest files
build_rs->>Parser: parse derived-property & normalization files
Parser->>Emitter: produce table entries and decomposition updates
Emitter->>build_rs: write generated Rust sources to OUT_DIR/generated
Runtime->>Emitter: include generated lookup tables (lookup_property / lookup_numeric_val)
Runtime->>Runtime: call ICU APIs for newer UnicodeVersion paths
Estimated code review effort🎯 4 (Complex) | ⏱️ ~75 minutes Possibly related PRs
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
Sorry, something went wrong.
📦 Library DependenciesThe following Lib/ modules were modified. Here are their dependencies: [x] test: cpython/Lib/test/test_str.py (TODO: 6) dependencies: dependent tests: (no tests depend on str) [ ] test: cpython/Lib/test/test_unicodedata.py (TODO: 13) dependencies: dependent tests: (no tests depend on unicode) Legend:
|
Sorry, something went wrong.
ShaharNaveh
left a comment
There was a problem hiding this comment.
TYSM!
Overall I really like this change.
The only thing rhat might be worth is to a README.md under crates/stdlib/unicode/ucd32/README.md with links for the soure of each file so we can upsate them in the future if needed
Sorry, something went wrong.
|
Also, the Unicode 3.2 files will never change but I agree about linking the sources somewhere. CPython seems to provide 3.2 as a compatibility for an old IDNA version (which is also provided for compatibility 😹 ). All of this work is essentially to fix a 20+ year old module that is likely not used anywhere. 😅 Unicode 3.2 - here is the source for now. I'll add it to a README after I finish the patch. 😁 |
Sorry, something went wrong.
5479c61 to
e3ad1ed
Compare
May 28, 2026 01:03
I removed an embedded table of non-ASCII numbers in favor of using `icu_decimal`. The benefits of using `icu4x` here are consistency plus Unicode updates. As Unicode is updated, we automatically reap the benefits without having to modify the table. `icu_decimal` is also useful beyond `float()`. I'm also using it to clean up `unicodedata` in RustPython#7947.
I removed an embedded table of non-ASCII numbers in favor of using `icu_decimal`. The benefits of using `icu4x` here are consistency plus Unicode updates. As Unicode is updated, we automatically reap the benefits without having to modify the table. `icu_decimal` is also useful beyond `float()`. I'm also using it to clean up `unicodedata` in RustPython#7947.
I removed an embedded table of non-ASCII numbers in favor of using `icu_decimal`. The benefits of using `icu4x` here are consistency plus Unicode updates. As Unicode is updated, we automatically reap the benefits without having to modify the table. `icu_decimal` is also useful beyond `float()`. I'm also using it to clean up `unicodedata` in RustPython#7947.
de85853 to
dc1e675
Compare
June 7, 2026 04:28
dc1e675 to
da04480
Compare
June 7, 2026 04:32
|
I undrafted for CodeRabbit because I couldn't trigger it. There will likely be a lot of suggestions. 😅 |
Sorry, something went wrong.
There was a problem hiding this comment.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
crates/stdlib/src/unicodedata.rs (1)
500-519: ⚠️ Potential issue | 🔴 Critical | ⚡ Quick winIncorrect NumericType check in
numericmethod.Line 509 checks
NumericType::Decimal, but thenumeric()method should accept characters with any numeric type (Decimal, Digit, or Numeric), not just Decimal. This will incorrectly reject valid numeric characters like Roman numerals (NumericType::Numeric) and circled digits (NumericType::Digit).🐛 Proposed fix
fn numeric( &self, character: PyStrRef, default: OptionalArg<PyObjectRef>, vm: &VirtualMachine, ) -> PyResult { self.extract_char(character, vm).and_then(|c| { c.to_char() .and_then(|c| { - if NumericType::for_char(c) != NumericType::Decimal { + let num_type = NumericType::for_char(c); + if num_type == NumericType::None { return None; } let value = lookup_numeric_val(c, self.unic_version)?; Some(vm.ctx.new_float(value).into()) }) .or_else(|| default.into_option()) .ok_or_else(|| vm.new_value_error("not a numeric character")) }) }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@crates/stdlib/src/unicodedata.rs` around lines 500 - 519, The numeric method is incorrectly filtering only Decimal numeric types; update the check in numeric (inside the closure from extract_char) to accept any numeric type by testing that NumericType::for_char(c) is not NumericType::None (or simply remove the Decimal-only guard), so characters with Digit or Numeric are allowed; locate the conditional that currently compares to NumericType::Decimal and change it to compare against NumericType::None (or delete it) and keep the subsequent lookup_numeric_val(c, self.unic_version) path.
🧹 Nitpick comments (1)
crates/stdlib/build.rs (1)
581-582: Use explicit Unicode input files forcargo:rerun-if-changedincrates/stdlib/build.rs.Cargo does scan directory paths recursively for
cargo:rerun-if-changed, but change detection is mtime-based, so directory-level tracking can be brittle. This build script only reads a fixed set of files from those directories, so track the exact inputs instead ofunicode/ucd32andunicode/latest.Replace:
println!("cargo:rerun-if-changed=unicode/ucd32"); println!("cargo:rerun-if-changed=unicode/latest");With
cargo:rerun-if-changedfor:
unicode/ucd32/DerivedGeneralCategory-3.2.0.txtunicode/ucd32/DerivedEastAsianWidth-3.2.0.txtunicode/ucd32/DerivedBidiClass-3.2.0.txtunicode/ucd32/DerivedBinaryProperties-3.2.0.txtunicode/ucd32/DerivedCombiningClass-3.2.0.txtunicode/ucd32/DerivedNumericValues-3.2.0.txtunicode/latest/UnicodeData.txtunicode/latest/NormalizationCorrections.txtQuick check: run
cargo build -p rustpython-stdlib -vv,touch crates/stdlib/unicode/ucd32/DerivedGeneralCategory-3.2.0.txt, then run again and confirm Cargo re-runs therustpython-stdlibbuild script.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@crates/stdlib/build.rs` around lines 581 - 582, Replace the two directory-level rerun directives that call println!("cargo:rerun-if-changed=unicode/ucd32") and println!("cargo:rerun-if-changed=unicode/latest") with explicit cargo:rerun-if-changed entries for each input file the build script actually reads: unicode/ucd32/DerivedGeneralCategory-3.2.0.txt, unicode/ucd32/DerivedEastAsianWidth-3.2.0.txt, unicode/ucd32/DerivedBidiClass-3.2.0.txt, unicode/ucd32/DerivedBinaryProperties-3.2.0.txt, unicode/ucd32/DerivedCombiningClass-3.2.0.txt, unicode/ucd32/DerivedNumericValues-3.2.0.txt, unicode/latest/UnicodeData.txt, and unicode/latest/NormalizationCorrections.txt so Cargo will reliably rerun the build script when any of those exact files change.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@extra_tests/snippets/builtin_str_unicode.py`:
- Around line 27-29: The test asserts the wrong versioned decomposition for
U+F951: update the expectation for unicodedata.ucd_3_2_0.decomposition("\uf951")
to the pre-correction mapping "96FB" instead of "964B" so the versioned path
unicodedata.ucd_3_2_0.decomposition matches the Unicode 3.2 behavior (leave the
modern unicodedata.decomposition("\uf951") assertion as-is).
---
Outside diff comments:
In `@crates/stdlib/src/unicodedata.rs`:
- Around line 500-519: The numeric method is incorrectly filtering only Decimal
numeric types; update the check in numeric (inside the closure from
extract_char) to accept any numeric type by testing that
NumericType::for_char(c) is not NumericType::None (or simply remove the
Decimal-only guard), so characters with Digit or Numeric are allowed; locate the
conditional that currently compares to NumericType::Decimal and change it to
compare against NumericType::None (or delete it) and keep the subsequent
lookup_numeric_val(c, self.unic_version) path.
---
Nitpick comments:
In `@crates/stdlib/build.rs`:
- Around line 581-582: Replace the two directory-level rerun directives that
call println!("cargo:rerun-if-changed=unicode/ucd32") and
println!("cargo:rerun-if-changed=unicode/latest") with explicit
cargo:rerun-if-changed entries for each input file the build script actually
reads: unicode/ucd32/DerivedGeneralCategory-3.2.0.txt,
unicode/ucd32/DerivedEastAsianWidth-3.2.0.txt,
unicode/ucd32/DerivedBidiClass-3.2.0.txt,
unicode/ucd32/DerivedBinaryProperties-3.2.0.txt,
unicode/ucd32/DerivedCombiningClass-3.2.0.txt,
unicode/ucd32/DerivedNumericValues-3.2.0.txt, unicode/latest/UnicodeData.txt,
and unicode/latest/NormalizationCorrections.txt so Cargo will reliably rerun the
build script when any of those exact files change.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yml
Review profile: CHILL
Plan: Pro
Run ID: cd95b557-2363-4034-b21e-322853c602ab
⛔ Files ignored due to path filters (2)
Cargo.lockis excluded by!**/*.lockLib/test/test_str.pyis excluded by!Lib/**
📒 Files selected for processing (16)
Cargo.tomlcrates/stdlib/Cargo.tomlcrates/stdlib/build.rscrates/stdlib/src/unicodedata.rscrates/stdlib/unicode/README.mdcrates/stdlib/unicode/latest/DerivedNumericValues.txtcrates/stdlib/unicode/latest/NormalizationCorrections.txtcrates/stdlib/unicode/latest/UnicodeData.txtcrates/stdlib/unicode/ucd32/DerivedBidiClass-3.2.0.txtcrates/stdlib/unicode/ucd32/DerivedBinaryProperties-3.2.0.txtcrates/stdlib/unicode/ucd32/DerivedCombiningClass-3.2.0.txtcrates/stdlib/unicode/ucd32/DerivedEastAsianWidth-3.2.0.txtcrates/stdlib/unicode/ucd32/DerivedGeneralCategory-3.2.0.txtcrates/stdlib/unicode/ucd32/DerivedNumericType-3.2.0.txtcrates/stdlib/unicode/ucd32/DerivedNumericValues-3.2.0.txtextra_tests/snippets/builtin_str_unicode.py
💤 Files with no reviewable changes (1)
- Cargo.toml
Sorry, something went wrong.
4db4918 to
da427d8
Compare
June 9, 2026 03:11
d982b32 to
0df2f8e
Compare
June 11, 2026 00:59
youknowone
left a comment
There was a problem hiding this comment.
Thank you so much!
Sorry, something went wrong.
ShaharNaveh
left a comment
There was a problem hiding this comment.
this is great!
tysm for work you've put into this<3
Sorry, something went wrong.
Pull request was converted to draft
|
Sorry, I forgot to draft this again. 😂 I'm almost finished with it though. |
Sorry, something went wrong.
@coderabbitai Can a CodeRabbit review be triggered by saying |
Sorry, something went wrong.
|
✅ Action performedReview finished.
|
Sorry, something went wrong.
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@crates/stdlib/build.rs`:
- Around line 436-440: The code uses field.saturating_sub(1) when selecting the
nth semicolon field, which makes field==0 and field==1 both select the first
column; change the selection to use the intended zero-based index and validate
bounds: replace fields.nth(field.saturating_sub(1)) with fields.nth(field)
(treat field as zero-based) or explicitly handle 1-based inputs by validating
and converting (e.g., check if field==0 -> error or convert 1-based via
field.checked_sub(1).expect(...)); update the selection in parse_unicode_3_2
(and the other occurrences referenced) and add a clear panic/error message when
the requested field index is out of range so the wrong column cannot be silently
parsed.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yml
Review profile: CHILL
Plan: Pro
Run ID: 91782a19-e0f3-44db-b06b-81992ecd3580
⛔ Files ignored due to path filters (3)
Cargo.lockis excluded by!**/*.lockLib/test/test_str.pyis excluded by!Lib/**Lib/test/test_unicodedata.pyis excluded by!Lib/**
📒 Files selected for processing (16)
Cargo.tomlcrates/stdlib/Cargo.tomlcrates/stdlib/build.rscrates/stdlib/src/unicodedata.rscrates/stdlib/unicode/README.mdcrates/stdlib/unicode/latest/DerivedNumericValues.txtcrates/stdlib/unicode/latest/NormalizationCorrections.txtcrates/stdlib/unicode/latest/UnicodeData.txtcrates/stdlib/unicode/ucd32/DerivedBidiClass-3.2.0.txtcrates/stdlib/unicode/ucd32/DerivedBinaryProperties-3.2.0.txtcrates/stdlib/unicode/ucd32/DerivedCombiningClass-3.2.0.txtcrates/stdlib/unicode/ucd32/DerivedEastAsianWidth-3.2.0.txtcrates/stdlib/unicode/ucd32/DerivedGeneralCategory-3.2.0.txtcrates/stdlib/unicode/ucd32/DerivedNumericType-3.2.0.txtcrates/stdlib/unicode/ucd32/DerivedNumericValues-3.2.0.txtextra_tests/snippets/builtin_str_unicode.py
💤 Files with no reviewable changes (1)
- Cargo.toml
✅ Files skipped from review due to trivial changes (7)
- crates/stdlib/unicode/README.md
- crates/stdlib/unicode/ucd32/DerivedNumericType-3.2.0.txt
- crates/stdlib/unicode/ucd32/DerivedNumericValues-3.2.0.txt
- crates/stdlib/unicode/ucd32/DerivedCombiningClass-3.2.0.txt
- crates/stdlib/unicode/latest/NormalizationCorrections.txt
- crates/stdlib/unicode/ucd32/DerivedEastAsianWidth-3.2.0.txt
- crates/stdlib/unicode/ucd32/DerivedBidiClass-3.2.0.txt
🚧 Files skipped from review as they are similar to previous changes (3)
- crates/stdlib/Cargo.toml
- extra_tests/snippets/builtin_str_unicode.py
- crates/stdlib/src/unicodedata.rs
Sorry, something went wrong.
37e363a to
52330d2
Compare
June 13, 2026 01:33
|
I got as far as I could get with these fixes. The bidirectional test should pass after the tests are bumped. CPython's |
Sorry, something went wrong.
🫡 |
Sorry, something went wrong.
ShaharNaveh
left a comment
There was a problem hiding this comment.
TYSM!
Sorry, something went wrong.
Python bundles an old version of Unicode for compatibility. RustPython tries to mimic supporting that old version by checking the version of individual chars. This is a problem for a few reasons. The first is that the age check adds an additional hit per each char lookup in Unicode data. The check is outdated because the `unic-ucd-age` crate is several versions behind the current Unicode version. The check rejects valid chars because of the version differences. The check is subtly wrong because it returns properties for Unicode 16.0.0 for Unicode 3.2.0 while checking against a Unicode 10.0.0 database. Unfortunately, there isn't a crate that can help us here. `icu4x` targets modern Unicode versions. Writing a data provider for `icu4x` for Unicode 3.2.0 is a lot of work for a legacy path. I opted to parse the Unicode 3.2.0 data myself but to skip `icu4x` (mostly) to instead write small lookup tables. As of this commit, Unicode names is still wrong for 3.2.0. Luckily, the crate RustPython uses is fast and robust for modern Unicode.
52330d2 to
3cf6039
Compare
June 13, 2026 02:05
e423160
into
RustPython:main
Jun 14, 2026
Python bundles an old version of Unicode for compatibility. RustPython tries to mimic supporting that old version by checking the version of individual chars. This is a problem for a few reasons. The first is that the age check adds an additional hit per each char lookup in Unicode data. The check is outdated because the
unic-ucd-agecrate is several versions behind the current Unicode version. The check rejects valid chars because of the version differences.The check is subtly wrong because it returns properties for Unicode 16.0.0 for Unicode 3.2.0 while checking against a Unicode 10.0.0 database.
Unfortunately, there isn't a crate that can help us here.
icu4xtargets modern Unicode versions. Writing a data provider foricu4xfor Unicode 3.2.0 is a lot of work for a legacy path. I opted to parse the Unicode 3.2.0 data myself but to skipicu4x(mostly) to instead write small lookup tables.As of this commit, Unicode names is still wrong for 3.2.0. Luckily, the crate RustPython uses is fast and robust for modern Unicode.
Summary by CodeRabbit
New Features
Improvements
Bug Fixes
Documentation
Tests