Replace unmaintained `unic` crates #7555

ShaharNaveh

Summary by CodeRabbit

Chores
- Switched the Unicode data and normalization backend from the `unic-*` crates to ICU (`icu_properties` and `icu_normalizer`) for improved standards compliance.
- Updated normalization and mirroring behavior to use the new provider.

User-visible changes
- Character classification and string methods (printability, whitespace, decimal, identifier checks, and width/bidi labels) may yield different results under the new provider.

Breaking changes
- The `combining()` return type changed from `i32` to `u8`.
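As a rough sanity check on the narrower return type, the sketch below scans CPython's own `unicodedata.combining()` over every code point and confirms the values fit in an unsigned 8-bit range. It uses CPython's bundled database rather than the ICU data this PR adopts, and the helper name `combining_class_range` is hypothetical.

```python
import sys
import unicodedata


def combining_class_range():
    """Return (min, max) of unicodedata.combining() over all code points."""
    values = [unicodedata.combining(chr(cp)) for cp in range(sys.maxunicode + 1)]
    return min(values), max(values)


low, high = combining_class_range()
# Canonical combining classes are small non-negative integers,
# so an unsigned 8-bit type can hold every one of them.
print(low, high)
```

Python callers still see a plain `int`; the type change should only matter to Rust code that calls the function directly.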

coderabbitai

Walkthrough

Migrates Unicode handling from several unic-* crates to ICU libraries: adds icu_properties and icu_normalizer, removes multiple unic-* and unicode-bidi-mirroring workspace dependencies, and updates code to use ICU APIs for classification, normalization, bidi, mirroring, and combining class.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Workspace manifests: `Cargo.toml`, `crates/literal/Cargo.toml`, `crates/stdlib/Cargo.toml`, `crates/vm/Cargo.toml` | Removed `unic-char-property`, `unic-normal`, `unic-ucd-bidi`, `unic-ucd-category`, `unic-ucd-ident`, and `unicode-bidi-mirroring` entries; added `icu_properties = "2"` and `icu_normalizer = "2"` to workspace deps. |
| Literal char utilities: `crates/literal/src/char.rs` | Replaced `unic_ucd_category::GeneralCategory::of` with `icu_properties::props::GeneralCategory::for_char` and updated `is_printable` to explicit `GeneralCategory` matching. |
| Stdlib Unicode data: `crates/stdlib/src/unicodedata.rs` | Switched category/bidi/width lookups to ICU (`for_char`) and `short_name()`; migrated `normalize()`/`is_normalized()` to `icu_normalizer` (composing/decomposing normalizers); replaced mirroring check with ICU `CodePointSetData::new::<BidiMirrored>().contains(ch)`; changed `combining()` return type `i32` → `u8` and use `CanonicalCombiningClass::for_char(...).to_icu4c_value()`; removed local width-abbreviation trait. |
| VM string builtins: `crates/vm/src/builtins/str.rs` | Replaced `unic_ucd_*` imports and helpers with `icu_properties` equivalents (`GeneralCategory::for_char`, `BidiClass::for_char`, `XidStart::for_char`, `XidContinue::for_char`); updated `isdecimal`, `isspace`, and `isidentifier` logic accordingly. |
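To make the affected string methods concrete, here is a small sketch that records what CPython reports for a few characters whose answers depend on Unicode property data. It is reference behavior to compare against, not the Rust code in this PR; the helper `classify` is hypothetical.

```python
import unicodedata


def classify(ch):
    """Collect the property-driven answers CPython gives for one character."""
    return {
        "category": unicodedata.category(ch),
        "isdecimal": ch.isdecimal(),
        "isspace": ch.isspace(),
        "isidentifier": ch.isidentifier(),
        "isprintable": ch.isprintable(),
    }


samples = {
    "ARABIC-INDIC DIGIT THREE": "\u0663",
    "NO-BREAK SPACE": "\u00a0",
    "LATIN SMALL LETTER E WITH ACUTE": "\u00e9",
    "LINE FEED": "\n",
}

for label, ch in samples.items():
    print(label, classify(ch))
```

Running the same script under both interpreters is one way to spot classification differences introduced by a data-provider change.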

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Poem

🐇 I hopped from unic to ICU with glee,
Swapping crates so characters roam free,
Categories, bidi, and normalization too,
I stitched each byte until behavior grew true,
A tiny rabbit cheers the Unicode spree! 🎉

🚥 Pre-merge checks | ✅ 3

✅ Passed checks (3 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title clearly and concisely summarizes the main change: replacing unmaintained `unic` crates with maintained alternatives, which is the core objective evident across all modified files. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%. |


github-actions

📦 Library Dependencies

The following Lib/ modules were modified. Here are their dependencies:

(module 'str test_unicodedata' not found)

Legend:

[+] path exists in CPython
[x] up-to-date, [ ] outdated

coderabbitai

Actionable comments posted: 1

🧹 Nitpick comments (1)

crates/stdlib/src/unicodedata.rs (1)
194-217: Deduplicate the normalization pipeline.

Both methods repeat the same map_utf8(...).collect() body four times and only vary by the normalizer constructor. Please extract the form-specific choice first and keep the shared path in one place. As per coding guidelines, "When branches differ only in a value but share common logic, extract the differing value first, then call the common logic once to avoid duplicate code."

Also applies to: 222-245
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/stdlib/src/unicodedata.rs` around lines 194 - 217, The normalize
function duplicates the map_utf8(...).collect() pipeline for each NormalizeForm
branch; instead, choose the appropriate normalizer constructor based on the form
(use the variants Nfc, Nfkc, Nfd, Nfkd and the constructors
ComposingNormalizerBorrowed::new_nfc(), ComposingNormalizerBorrowed::new_nfkc(),
DecomposingNormalizerBorrowed::new_nfd(),
DecomposingNormalizerBorrowed::new_nfkd()) and store it in a single variable (or
closure) and then call text.map_utf8(|s|
normalizer.normalize_iter(s.chars())).collect() once; apply the same refactor to
the later duplicate block as well so only the normalizer choice varies and the
shared map_utf8(...).collect() path is centralized.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@crates/stdlib/src/unicodedata.rs`:
- Around line 44-55: The code mixes Unicode data sources:
Age::of()/UNICODE_VERSION (unic_ucd_age, Unicode 10) vs. ICU-based APIs
(icu_properties/icu_normalizer, Unicode 17), so update the module to use one
consistent source—preferably the ICU data already used by category(),
bidirectional(), east_asian_width(), normalize(), is_normalized(), and
mirrored(): remove references to unic_ucd_age::Age and UNICODE_VERSION, derive
unidata_version from the ICU provider/metadata (the same runtime data backing
icu_properties and icu_normalizer), and change check_age() to use ICU's age
information (or remove age-based filtering) so that unidata_version accurately
reflects the data used by those functions. Ensure all places that previously
consulted Age::of() now query the ICU data provider or use the ICU-provided
UnicodeVersion so behavior and reported unidata_version remain in sync.
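For context on why the reported version and the age filter need to agree, the sketch below shows how CPython exposes them: `unidata_version` names the main database, while the `ucd_3_2_0` object answers as of Unicode 3.2.0 and treats later additions as unassigned. This is CPython's behavior only; how RustPython should derive the same information from ICU is left open in the comment above.

```python
import unicodedata

# U+1F600 GRINNING FACE was added to Unicode long after version 3.2.0.
grinning_face = "\U0001F600"

# The main database reports its own version and knows the character.
print(unicodedata.unidata_version, unicodedata.category(grinning_face))

# The legacy object reports 3.2.0 and filters the character out by age.
legacy = unicodedata.ucd_3_2_0
print(legacy.unidata_version, legacy.category(grinning_face))
```

If the version string came from one dataset and the property lookups from another, these two lines could contradict each other, which is the inconsistency the comment flags.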



ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yml

Review profile: CHILL

Plan: Pro

Run ID: 2897286b-ea41-43ef-acc5-3fe407520f70

📥 Commits

Reviewing files that changed from the base of the PR and between d9c4c95 and 273ad65.

⛔ Files ignored due to path filters (3)

Cargo.lock is excluded by !**/*.lock
Lib/test/test_str.py is excluded by !Lib/**
Lib/test/test_unicodedata.py is excluded by !Lib/**

📒 Files selected for processing (7)

Cargo.toml
crates/literal/Cargo.toml
crates/literal/src/char.rs
crates/stdlib/Cargo.toml
crates/stdlib/src/unicodedata.rs
crates/vm/Cargo.toml
crates/vm/src/builtins/str.rs

coderabbitai

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)

crates/stdlib/src/unicodedata.rs (2)
179-190: ⚠️ Potential issue | 🟡 Minor

Refresh the stale version note above east_asian_width().

The note on Line 179 still says this path uses the 9.0.0 database, but the implementation now goes through ICU property data. Leaving it in place is misleading.
Suggested update
-        /// NOTE: This function uses 9.0.0 database instead of 3.2.0
+        /// NOTE: This function uses ICU property data rather than the 3.2.0 database.
As per coding guidelines, "Do not delete or rewrite existing comments unless they are factually wrong or directly contradict the new code."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/stdlib/src/unicodedata.rs` around lines 179 - 190, The comment above
the east_asian_width method is outdated (mentions "uses 9.0.0 database") but the
implementation now relies on ICU property data; update the note above fn
east_asian_width to accurately state that EastAsianWidth values are derived from
ICU property data (or similar current source) instead of referencing the old
9.0.0 database, ensuring the comment matches the implementation without changing
code logic.
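As a reference for the labels `east_asian_width()` is expected to return, this sketch lists CPython's abbreviations for one character from each of four width classes. It shows the Python-level contract only; the helper `width_labels` is hypothetical.

```python
import unicodedata


def width_labels(text):
    """Return the East Asian Width abbreviation for each character."""
    return [unicodedata.east_asian_width(ch) for ch in text]


# Narrow Latin letter, wide CJK ideograph, fullwidth Latin letter,
# halfwidth katakana letter.
print(width_labels("a\u4e2d\uff21\uff71"))
```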
164-176: ⚠️ Potential issue | 🟠 Major

Preserve CPython's empty-string bidi result for U+FFFE.

U+FFFE (a noncharacter) has BidiClass::BoundaryNeutral, so BidiClass::short_name() returns "BN". However, Lib/test/test_unicodedata.py:183 expects unicodedata.bidirectional('\uFFFE') to return '' (empty string). The current implementation exposes ICU abbreviations directly, violating Python's API contract. Add explicit handling to return empty strings for characters that should not have a bidi classification.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/stdlib/src/unicodedata.rs` around lines 164 - 176, The bidirectional
method currently returns ICU abbreviations via BidiClass::short_name(), which
yields "BN" for U+FFFE; update bidirectional (in the bidirectional function
where you call self.extract_char and BidiClass::for_char/short_name) to
explicitly return an empty string for U+FFFE (or other noncharacter codepoints
that CPython treats as no bidi) instead of the ICU short name — i.e., after
extracting the char, check its codepoint (c.to_char().map or similar) and if it
equals 0xFFFE (or otherwise identified as a noncharacter per CPython behavior)
return "" before calling BidiClass::short_name().
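The expected behavior can be checked directly in CPython, and the suggested guard can be sketched as a thin wrapper around whatever label the provider returns. The wrapper `python_bidirectional` is hypothetical, and treating every `Cn` code point this way is an assumption generalized from the single U+FFFE case named in the review.

```python
import unicodedata


def python_bidirectional(ch, provider_short_name):
    """Map a provider's bidi abbreviation to the value CPython would report.

    Assumption: code points with general category "Cn" (unassigned or
    noncharacter) report an empty string, as the U+FFFE test case does.
    """
    if unicodedata.category(ch) == "Cn":
        return ""
    return provider_short_name


# CPython's own answer for the code point named in the review.
print(repr(unicodedata.bidirectional("\ufffe")))
# The wrapper suppresses the provider's "BN" label for the same code point.
print(repr(python_bidirectional("\ufffe", "BN")))
```

Keying the guard on the general category rather than hard-coding 0xFFFE avoids a one-off special case, but it should be verified against the CPython test suite before relying on it.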

🧹 Nitpick comments (1)

crates/stdlib/src/unicodedata.rs (1)
194-247: Extract the normalization-form dispatch once.

normalize() and is_normalized() now duplicate the same form-to-normalizer mapping. Pulling that into one helper will keep the two paths from drifting on the next change.

As per coding guidelines, "When branches differ only in a value but share common logic, extract the differing value first, then call the common logic once to avoid duplicate code."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/stdlib/src/unicodedata.rs` around lines 194 - 247, The
normalize()/is_normalized() functions duplicate the same NormalizeForm ->
normalizer construction; extract that dispatch into a private helper (e.g., fn
get_normalizer(form: super::NormalizeForm) -> impl Fn(&str) -> impl
Iterator<Item=char> or return an enum/trait object representing the chosen
normalizer) that maps Nfc/Nfkc to
ComposingNormalizerBorrowed::new_nfc()/new_nfkc and Nfd/Nfkd to
DecomposingNormalizerBorrowed::new_nfd()/new_nfkd, then call that helper from
both normalize() and is_normalized() to run text.map_utf8(|s|
normalizer.normalize_iter(s.chars())).collect() and compare results, removing
the duplicated match blocks.
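The shape of the suggested refactor, with one form-to-normalizer dispatch shared by both entry points, can be sketched in Python on top of CPython's `unicodedata.normalize`. This illustrates the structure only; the Rust version would return ICU normalizers instead, and `get_normalizer` here is a hypothetical helper mirroring the name used in the comment.

```python
import unicodedata
from functools import partial

FORMS = ("NFC", "NFKC", "NFD", "NFKD")


def get_normalizer(form):
    """Pick the form-specific normalizer once; callers share the rest."""
    if form not in FORMS:
        raise ValueError(f"invalid normalization form: {form!r}")
    return partial(unicodedata.normalize, form)


def normalize(form, text):
    """Normalize text using the shared dispatch."""
    return get_normalizer(form)(text)


def is_normalized(form, text):
    """Text is normalized when normalizing it leaves it unchanged."""
    return get_normalizer(form)(text) == text


print(normalize("NFD", "\u00e9") == "e\u0301")
print(is_normalized("NFC", "\u00e9"))
```

Because both functions go through `get_normalizer`, adding or changing a form touches one place only.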


ℹ️ Review info


📥 Commits

Reviewing files that changed from the base of the PR and between 273ad65 and 3245cd1.

📒 Files selected for processing (3)

Cargo.toml
crates/stdlib/src/unicodedata.rs
crates/vm/src/builtins/str.rs

✅ Files skipped from review due to trivial changes (1)

Cargo.toml

ShaharNaveh added 9 commits April 1, 2026 02:47

Use maintained crates

6abe80a

Fix literal

36c3886

Fix some of vm

23335bd

Fix vm

de8cd0a

Fix stdlib

15b4a82

Migrate more at stdlib

4789fc0

Use shortform

b0583ab

Mark failing test

47dadaf

Fix test marks

273ad65

ShaharNaveh commented Apr 1, 2026


ShaharNaveh added 3 commits April 1, 2026 11:44

cargo shear

5a776f5

Clippy

0cc47eb

clippy

3245cd1

coderabbitai Bot reviewed Apr 1, 2026


ShaharNaveh mentioned this pull request Apr 1, 2026

Replace unmaintained unic-ucd-category crate with icu_properties astral-sh/ruff#24344

Merged

youknowone approved these changes Apr 2, 2026


youknowone pushed a commit to youknowone/RustPython that referenced this pull request Apr 5, 2026

Replace unmaintained unic crates (RustPython#7555)

2c88743

This was referenced Apr 15, 2026

fix: Python-Rust combining char diff in isalnum #7612

Merged

Use Unicode properties for alnum, alpha, etc. #7626

Merged

Match CPython's islower/isupper exactly #7646

Merged

coderabbitai Bot mentioned this pull request May 2, 2026

Fix title() and capitalize() #7717

Merged

coderabbitai Bot mentioned this pull request May 15, 2026

Use icu4x for casefold() #7780

Merged

coderabbitai Bot mentioned this pull request Jun 7, 2026

Fix most of unicodedata, unmask multiple tests, and remove ucd #7947

Merged
