Replace unmaintained unic crates#7555
Conversation
📝 WalkthroughWalkthroughMigrates Unicode handling from several Changes
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Poem
🚥 Pre-merge checks | ✅ 3✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
Sorry, something went wrong.
📦 Library DependenciesThe following Lib/ modules were modified. Here are their dependencies: (module 'str test_unicodedata' not found) Legend:
|
Sorry, something went wrong.
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (1)
crates/stdlib/src/unicodedata.rs (1)
194-217: Deduplicate the normalization pipeline.Both methods repeat the same
map_utf8(...).collect()body four times and only vary by the normalizer constructor. Please extract the form-specific choice first and keep the shared path in one place. As per coding guidelines, "When branches differ only in a value but share common logic, extract the differing value first, then call the common logic once to avoid duplicate code."Also applies to: 222-245
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@crates/stdlib/src/unicodedata.rs` around lines 194 - 217, The normalize function duplicates the map_utf8(...).collect() pipeline for each NormalizeForm branch; instead, choose the appropriate normalizer constructor based on the form (use the variants Nfc, Nfkc, Nfd, Nfkd and the constructors ComposingNormalizerBorrowed::new_nfc(), ComposingNormalizerBorrowed::new_nfkc(), DecomposingNormalizerBorrowed::new_nfd(), DecomposingNormalizerBorrowed::new_nfkd()) and store it in a single variable (or closure) and then call text.map_utf8(|s| normalizer.normalize_iter(s.chars())).collect() once; apply the same refactor to the later duplicate block as well so only the normalizer choice varies and the shared map_utf8(...).collect() path is centralized.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@crates/stdlib/src/unicodedata.rs`:
- Around line 44-55: The code mixes Unicode data sources:
Age::of()/UNICODE_VERSION (unic_ucd_age, Unicode 10) vs. ICU-based APIs
(icu_properties/icu_normalizer, Unicode 17), so update the module to use one
consistent source—preferably the ICU data already used by category(),
bidirectional(), east_asian_width(), normalize(), is_normalized(), and
mirrored(): remove references to unic_ucd_age::Age and UNICODE_VERSION, derive
unidata_version from the ICU provider/metadata (the same runtime data backing
icu_properties and icu_normalizer), and change check_age() to use ICU's age
information (or remove age-based filtering) so that unidata_version accurately
reflects the data used by those functions. Ensure all places that previously
consulted Age::of() now query the ICU data provider or use the ICU-provided
UnicodeVersion so behavior and reported unidata_version remain in sync.
---
Nitpick comments:
In `@crates/stdlib/src/unicodedata.rs`:
- Around line 194-217: The normalize function duplicates the
map_utf8(...).collect() pipeline for each NormalizeForm branch; instead, choose
the appropriate normalizer constructor based on the form (use the variants Nfc,
Nfkc, Nfd, Nfkd and the constructors ComposingNormalizerBorrowed::new_nfc(),
ComposingNormalizerBorrowed::new_nfkc(),
DecomposingNormalizerBorrowed::new_nfd(),
DecomposingNormalizerBorrowed::new_nfkd()) and store it in a single variable (or
closure) and then call text.map_utf8(|s|
normalizer.normalize_iter(s.chars())).collect() once; apply the same refactor to
the later duplicate block as well so only the normalizer choice varies and the
shared map_utf8(...).collect() path is centralized.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yml
Review profile: CHILL
Plan: Pro
Run ID: 2897286b-ea41-43ef-acc5-3fe407520f70
⛔ Files ignored due to path filters (3)
Cargo.lockis excluded by!**/*.lockLib/test/test_str.pyis excluded by!Lib/**Lib/test/test_unicodedata.pyis excluded by!Lib/**
📒 Files selected for processing (7)
Cargo.tomlcrates/literal/Cargo.tomlcrates/literal/src/char.rscrates/stdlib/Cargo.tomlcrates/stdlib/src/unicodedata.rscrates/vm/Cargo.tomlcrates/vm/src/builtins/str.rs
Sorry, something went wrong.
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
crates/stdlib/src/unicodedata.rs (2)
179-190: ⚠️ Potential issue | 🟡 MinorRefresh the stale version note above
east_asian_width().The note on Line 179 still says this path uses the 9.0.0 database, but the implementation now goes through ICU property data. Leaving it in place is misleading.
Suggested update
- /// NOTE: This function uses 9.0.0 database instead of 3.2.0 + /// NOTE: This function uses ICU property data rather than the 3.2.0 database.As per coding guidelines, "Do not delete or rewrite existing comments unless they are factually wrong or directly contradict the new code."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@crates/stdlib/src/unicodedata.rs` around lines 179 - 190, The comment above the east_asian_width method is outdated (mentions "uses 9.0.0 database") but the implementation now relies on ICU property data; update the note above fn east_asian_width to accurately state that EastAsianWidth values are derived from ICU property data (or similar current source) instead of referencing the old 9.0.0 database, ensuring the comment matches the implementation without changing code logic.
164-176: ⚠️ Potential issue | 🟠 MajorPreserve CPython's empty-string bidi result for U+FFFE.
U+FFFE (a noncharacter) has
BidiClass::BoundaryNeutral, soBidiClass::short_name()returns"BN". However,Lib/test/test_unicodedata.py:183expectsunicodedata.bidirectional('\uFFFE')to return''(empty string). The current implementation exposes ICU abbreviations directly, violating Python's API contract. Add explicit handling to return empty strings for characters that should not have a bidi classification.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@crates/stdlib/src/unicodedata.rs` around lines 164 - 176, The bidirectional method currently returns ICU abbreviations via BidiClass::short_name(), which yields "BN" for U+FFFE; update bidirectional (in the bidirectional function where you call self.extract_char and BidiClass::for_char/short_name) to explicitly return an empty string for U+FFFE (or other noncharacter codepoints that CPython treats as no bidi) instead of the ICU short name — i.e., after extracting the char, check its codepoint (c.to_char().map or similar) and if it equals 0xFFFE (or otherwise identified as a noncharacter per CPython behavior) return "" before calling BidiClass::short_name().
🧹 Nitpick comments (1)
crates/stdlib/src/unicodedata.rs (1)
194-247: Extract the normalization-form dispatch once.
normalize()andis_normalized()now duplicate the same form-to-normalizer mapping. Pulling that into one helper will keep the two paths from drifting on the next change.As per coding guidelines, "When branches differ only in a value but share common logic, extract the differing value first, then call the common logic once to avoid duplicate code."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@crates/stdlib/src/unicodedata.rs` around lines 194 - 247, The normalize()/is_normalized() functions duplicate the same NormalizeForm -> normalizer construction; extract that dispatch into a private helper (e.g., fn get_normalizer(form: super::NormalizeForm) -> impl Fn(&str) -> impl Iterator<Item=char> or return an enum/trait object representing the chosen normalizer) that maps Nfc/Nfkc to ComposingNormalizerBorrowed::new_nfc()/new_nfkc and Nfd/Nfkd to DecomposingNormalizerBorrowed::new_nfd()/new_nfkd, then call that helper from both normalize() and is_normalized() to run text.map_utf8(|s| normalizer.normalize_iter(s.chars())).collect() and compare results, removing the duplicated match blocks.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Outside diff comments:
In `@crates/stdlib/src/unicodedata.rs`:
- Around line 179-190: The comment above the east_asian_width method is outdated
(mentions "uses 9.0.0 database") but the implementation now relies on ICU
property data; update the note above fn east_asian_width to accurately state
that EastAsianWidth values are derived from ICU property data (or similar
current source) instead of referencing the old 9.0.0 database, ensuring the
comment matches the implementation without changing code logic.
- Around line 164-176: The bidirectional method currently returns ICU
abbreviations via BidiClass::short_name(), which yields "BN" for U+FFFE; update
bidirectional (in the bidirectional function where you call self.extract_char
and BidiClass::for_char/short_name) to explicitly return an empty string for
U+FFFE (or other noncharacter codepoints that CPython treats as no bidi) instead
of the ICU short name — i.e., after extracting the char, check its codepoint
(c.to_char().map or similar) and if it equals 0xFFFE (or otherwise identified as
a noncharacter per CPython behavior) return "" before calling
BidiClass::short_name().
---
Nitpick comments:
In `@crates/stdlib/src/unicodedata.rs`:
- Around line 194-247: The normalize()/is_normalized() functions duplicate the
same NormalizeForm -> normalizer construction; extract that dispatch into a
private helper (e.g., fn get_normalizer(form: super::NormalizeForm) -> impl
Fn(&str) -> impl Iterator<Item=char> or return an enum/trait object representing
the chosen normalizer) that maps Nfc/Nfkc to
ComposingNormalizerBorrowed::new_nfc()/new_nfkc and Nfd/Nfkd to
DecomposingNormalizerBorrowed::new_nfd()/new_nfkd, then call that helper from
both normalize() and is_normalized() to run text.map_utf8(|s|
normalizer.normalize_iter(s.chars())).collect() and compare results, removing
the duplicated match blocks.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yml
Review profile: CHILL
Plan: Pro
Run ID: 4b3e58a4-f77c-4b38-8fc8-84b1a65afe1f
📒 Files selected for processing (3)
Cargo.tomlcrates/stdlib/src/unicodedata.rscrates/vm/src/builtins/str.rs
✅ Files skipped from review due to trivial changes (1)
- Cargo.toml
Sorry, something went wrong.
3d96884
into
RustPython:main
Apr 2, 2026
ref: #7529
Summary by CodeRabbit
Chores
User-visible changes
Breaking Changes