◐ Shell
reader mode source ↗
Skip to content

Fix most of unicodedata, unmask multiple tests, and remove ucd#7947

Merged
youknowone merged 1 commit into
RustPython:mainfrom
joshuamegnauth54:fix-isprintable
Jun 14, 2026
Merged

Fix most of unicodedata, unmask multiple tests, and remove ucd#7947
youknowone merged 1 commit into
RustPython:mainfrom
joshuamegnauth54:fix-isprintable

Conversation

@joshuamegnauth54

@joshuamegnauth54 joshuamegnauth54 commented May 21, 2026

Copy link
Copy Markdown
Contributor

Python bundles an old version of Unicode for compatibility. RustPython tries to mimic supporting that old version by checking the version of individual chars. This is a problem for a few reasons. The first is that the age check adds an additional hit per each char lookup in Unicode data. The check is outdated because the unic-ucd-age crate is several versions behind the current Unicode version. The check rejects valid chars because of the version differences.

The check is subtly wrong because it returns properties for Unicode 16.0.0 for Unicode 3.2.0 while checking against a Unicode 10.0.0 database.

Unfortunately, there isn't a crate that can help us here. icu4x targets modern Unicode versions. Writing a data provider for icu4x for Unicode 3.2.0 is a lot of work for a legacy path. I opted to parse the Unicode 3.2.0 data myself but to skip icu4x (mostly) to instead write small lookup tables.

As of this commit, Unicode names is still wrong for 3.2.0. Luckily, the crate RustPython uses is fast and robust for modern Unicode.

Summary by CodeRabbit

  • New Features

    • Vendored Unicode 3.2.0 plus new “latest” compatibility datasets for broader coverage.
  • Improvements

    • Version-aware Unicode lookups (category, bidi, width, decomposition) with reduced allocations.
    • Numeric APIs (digit/decimal/numeric) clarified to return normalized numeric values where applicable.
  • Bug Fixes

    • Applied normalization/decomposition corrections for historic compatibility.
  • Documentation

    • Added guidance on vendored Unicode data and update policy.
  • Tests

    • Updated Unicode tests to validate decompositions and versioned behavior.

@coderabbitai

coderabbitai Bot commented May 21, 2026

Copy link
Copy Markdown
Contributor

Review Change Stack

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

Migrates unicode handling from ucd/unic-ucd-age to ICU properties plus generated lookup tables; adds build-time generators and vendored UCD 3.2.0 + latest data, refactors unicodedata for version-aware property/numeric lookups and decompositions, and updates tests.

Changes

Unicode Data Infrastructure Refactor

Layer / File(s) Summary
Dependency migration
Cargo.toml, crates/stdlib/Cargo.toml
Workspace and stdlib dependencies remove ucd and unic-ucd-age, add icu_normalizer and icu_properties (including as build-dependencies).
Build script Unicode generation pipeline
crates/stdlib/build.rs
Adds generators: generate_unicode_3_2, generate_numeric_type, generate_numeric_value, generate_unicode_latest, shared helpers (write_derived, parse_unicode_3_2), decomposition mapping (DecompositionType), and concurrent runner wiring with cargo:rerun-if-changed.
Module includes and lookup helpers
crates/stdlib/src/unicodedata.rs (lines 5–143)
Includes generated tables, defines UnicodeVersion/UNICODE_VERSION, introduces DecompositionType/type_tag(), and implements lookup_property and lookup_numeric_val.
Character extraction, category, and name
crates/stdlib/src/unicodedata.rs (lines 224–278)
extract_char no longer age-filters; category returns &'static str and selects ICU or generated lookup by Unicode version; name updated to use new extraction flow.
Bidirectional and East Asian width
crates/stdlib/src/unicodedata.rs (lines 291–345)
bidirectional and east_asian_width branch by Unicode version: ICU for newer, generated lookup_property for 3.2.0, preserving PUA1 override and fallbacks.
Mirrored and combining
crates/stdlib/src/unicodedata.rs (lines 405–443)
mirrored and combining use versioned resolution with ICU or generated lookup and keep default fallbacks.
Decomposition assembly
crates/stdlib/src/unicodedata.rs (lines 445–494)
decomposition assembles compatibility decompositions from generated compat/update/range data with explicit "<{tag}> {xxxx ...}" formatting and 3.2.0 special-case handling; removes standalone helper.
Numeric properties
crates/stdlib/src/unicodedata.rs (lines 504–582)
digit, decimal, and numeric validate via ICU NumericType, fetch values via lookup_numeric_val with Unicode-version gating, require integral values for integer APIs, and return floats for numeric.
Unicode data files and documentation
crates/stdlib/unicode/README.md, crates/stdlib/unicode/latest/NormalizationCorrections.txt, crates/stdlib/unicode/ucd32/*
Add README, NormalizationCorrections.txt (latest), and multiple UCD 3.2.0 derived-property files (BidiClass, BinaryProperties, CombiningClass, EastAsianWidth, NumericType, NumericValues).
Test coverage
extra_tests/snippets/builtin_str_unicode.py
Adds assertions for decomposition (including normalization corrections), East Asian width, and mirrored properties for default and ucd_3_2_0.

Sequence Diagram(s)

sequenceDiagram
  participant build_rs as build.rs
  participant UCDfiles as unicode/ucd32 & unicode/latest
  participant Parser as parse_unicode_3_2
  participant Emitter as codegen (write_derived / emit)
  participant Runtime as unicodedata.rs
  build_rs->>UCDfiles: read vendored UCD 3.2.0 + latest files
  build_rs->>Parser: parse derived-property & normalization files
  Parser->>Emitter: produce table entries and decomposition updates
  Emitter->>build_rs: write generated Rust sources to OUT_DIR/generated
  Runtime->>Emitter: include generated lookup tables (lookup_property / lookup_numeric_val)
  Runtime->>Runtime: call ICU APIs for newer UnicodeVersion paths
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~75 minutes

Possibly related PRs

Suggested reviewers

  • ShaharNaveh
  • youknowone

Poem

🐰 I nibble through code, a tiny sleuth,
Vendored tables sprout from careful proof,
Old crates hopped out, ICU hopped in,
Decomps and digits tidy the bin,
Carrots and code — a crisp, neat roof!

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 44.12% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately describes the main changes: fixing unicodedata functionality, unmasking tests, and removing the ucd dependency.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@github-actions

github-actions Bot commented May 21, 2026

Copy link
Copy Markdown
Contributor

📦 Library Dependencies

The following Lib/ modules were modified. Here are their dependencies:

[x] test: cpython/Lib/test/test_str.py (TODO: 6)
[x] test: cpython/Lib/test/test_fstring.py (TODO: 19)
[x] test: cpython/Lib/test/test_string_literals.py (TODO: 4)

dependencies:

dependent tests: (no tests depend on str)

[ ] test: cpython/Lib/test/test_unicodedata.py (TODO: 13)
[x] test: cpython/Lib/test/test_unicode_file.py
[x] test: cpython/Lib/test/test_unicode_file_functions.py
[x] test: cpython/Lib/test/test_unicode_identifiers.py (TODO: 1)
[x] test: cpython/Lib/test/test_ucn.py (TODO: 4)

dependencies:

dependent tests: (no tests depend on unicode)

Legend:

  • [+] path exists in CPython
  • [x] up-to-date, [ ] outdated

@ShaharNaveh ShaharNaveh left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hide comment

TYSM!

Overall I really like this change.

The only thing rhat might be worth is to a README.md under crates/stdlib/unicode/ucd32/README.md with links for the soure of each file so we can upsate them in the future if needed

@joshuamegnauth54

Copy link
Copy Markdown
Contributor Author

Also, the Unicode 3.2 files will never change but I agree about linking the sources somewhere. CPython seems to provide 3.2 as a compatibility for an old IDNA version (which is also provided for compatibility 😹 ). All of this work is essentially to fix a 20+ year old module that is likely not used anywhere. 😅

Unicode 3.2 - here is the source for now. I'll add it to a README after I finish the patch. 😁

@joshuamegnauth54 joshuamegnauth54 force-pushed the fix-isprintable branch 2 times, most recently from 5479c61 to e3ad1ed Compare May 28, 2026 01:03
joshuamegnauth54 added a commit to joshuamegnauth54/RustPython that referenced this pull request May 30, 2026
I removed an embedded table of non-ASCII numbers in favor of using
`icu_decimal`. The benefits of using `icu4x` here are consistency plus
Unicode updates. As Unicode is updated, we automatically reap the
benefits without having to modify the table.

`icu_decimal` is also useful beyond `float()`. I'm also using it to
clean up `unicodedata` in RustPython#7947.
joshuamegnauth54 added a commit to joshuamegnauth54/RustPython that referenced this pull request May 30, 2026
I removed an embedded table of non-ASCII numbers in favor of using
`icu_decimal`. The benefits of using `icu4x` here are consistency plus
Unicode updates. As Unicode is updated, we automatically reap the
benefits without having to modify the table.

`icu_decimal` is also useful beyond `float()`. I'm also using it to
clean up `unicodedata` in RustPython#7947.
joshuamegnauth54 added a commit to joshuamegnauth54/RustPython that referenced this pull request May 30, 2026
I removed an embedded table of non-ASCII numbers in favor of using
`icu_decimal`. The benefits of using `icu4x` here are consistency plus
Unicode updates. As Unicode is updated, we automatically reap the
benefits without having to modify the table.

`icu_decimal` is also useful beyond `float()`. I'm also using it to
clean up `unicodedata` in RustPython#7947.
@joshuamegnauth54 joshuamegnauth54 force-pushed the fix-isprintable branch 4 times, most recently from de85853 to dc1e675 Compare June 7, 2026 04:28
@joshuamegnauth54 joshuamegnauth54 marked this pull request as ready for review June 7, 2026 04:31
@joshuamegnauth54

Copy link
Copy Markdown
Contributor Author

I undrafted for CodeRabbit because I couldn't trigger it. There will likely be a lot of suggestions. 😅

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hide comment

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
crates/stdlib/src/unicodedata.rs (1)

500-519: ⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

Incorrect NumericType check in numeric method.

Line 509 checks NumericType::Decimal, but the numeric() method should accept characters with any numeric type (Decimal, Digit, or Numeric), not just Decimal. This will incorrectly reject valid numeric characters like Roman numerals (NumericType::Numeric) and circled digits (NumericType::Digit).

🐛 Proposed fix
         fn numeric(
             &self,
             character: PyStrRef,
             default: OptionalArg<PyObjectRef>,
             vm: &VirtualMachine,
         ) -> PyResult {
             self.extract_char(character, vm).and_then(|c| {
                 c.to_char()
                     .and_then(|c| {
-                        if NumericType::for_char(c) != NumericType::Decimal {
+                        let num_type = NumericType::for_char(c);
+                        if num_type == NumericType::None {
                             return None;
                         }

                         let value = lookup_numeric_val(c, self.unic_version)?;
                         Some(vm.ctx.new_float(value).into())
                     })
                     .or_else(|| default.into_option())
                     .ok_or_else(|| vm.new_value_error("not a numeric character"))
             })
         }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/stdlib/src/unicodedata.rs` around lines 500 - 519, The numeric method
is incorrectly filtering only Decimal numeric types; update the check in numeric
(inside the closure from extract_char) to accept any numeric type by testing
that NumericType::for_char(c) is not NumericType::None (or simply remove the
Decimal-only guard), so characters with Digit or Numeric are allowed; locate the
conditional that currently compares to NumericType::Decimal and change it to
compare against NumericType::None (or delete it) and keep the subsequent
lookup_numeric_val(c, self.unic_version) path.
🧹 Nitpick comments (1)
crates/stdlib/build.rs (1)

581-582: Use explicit Unicode input files for cargo:rerun-if-changed in crates/stdlib/build.rs.

Cargo does scan directory paths recursively for cargo:rerun-if-changed, but change detection is mtime-based, so directory-level tracking can be brittle. This build script only reads a fixed set of files from those directories, so track the exact inputs instead of unicode/ucd32 and unicode/latest.

Replace:

println!("cargo:rerun-if-changed=unicode/ucd32");
println!("cargo:rerun-if-changed=unicode/latest");

With cargo:rerun-if-changed for:

  • unicode/ucd32/DerivedGeneralCategory-3.2.0.txt
  • unicode/ucd32/DerivedEastAsianWidth-3.2.0.txt
  • unicode/ucd32/DerivedBidiClass-3.2.0.txt
  • unicode/ucd32/DerivedBinaryProperties-3.2.0.txt
  • unicode/ucd32/DerivedCombiningClass-3.2.0.txt
  • unicode/ucd32/DerivedNumericValues-3.2.0.txt
  • unicode/latest/UnicodeData.txt
  • unicode/latest/NormalizationCorrections.txt

Quick check: run cargo build -p rustpython-stdlib -vv, touch crates/stdlib/unicode/ucd32/DerivedGeneralCategory-3.2.0.txt, then run again and confirm Cargo re-runs the rustpython-stdlib build script.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/stdlib/build.rs` around lines 581 - 582, Replace the two
directory-level rerun directives that call
println!("cargo:rerun-if-changed=unicode/ucd32") and
println!("cargo:rerun-if-changed=unicode/latest") with explicit
cargo:rerun-if-changed entries for each input file the build script actually
reads: unicode/ucd32/DerivedGeneralCategory-3.2.0.txt,
unicode/ucd32/DerivedEastAsianWidth-3.2.0.txt,
unicode/ucd32/DerivedBidiClass-3.2.0.txt,
unicode/ucd32/DerivedBinaryProperties-3.2.0.txt,
unicode/ucd32/DerivedCombiningClass-3.2.0.txt,
unicode/ucd32/DerivedNumericValues-3.2.0.txt, unicode/latest/UnicodeData.txt,
and unicode/latest/NormalizationCorrections.txt so Cargo will reliably rerun the
build script when any of those exact files change.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@extra_tests/snippets/builtin_str_unicode.py`:
- Around line 27-29: The test asserts the wrong versioned decomposition for
U+F951: update the expectation for unicodedata.ucd_3_2_0.decomposition("\uf951")
to the pre-correction mapping "96FB" instead of "964B" so the versioned path
unicodedata.ucd_3_2_0.decomposition matches the Unicode 3.2 behavior (leave the
modern unicodedata.decomposition("\uf951") assertion as-is).

---

Outside diff comments:
In `@crates/stdlib/src/unicodedata.rs`:
- Around line 500-519: The numeric method is incorrectly filtering only Decimal
numeric types; update the check in numeric (inside the closure from
extract_char) to accept any numeric type by testing that
NumericType::for_char(c) is not NumericType::None (or simply remove the
Decimal-only guard), so characters with Digit or Numeric are allowed; locate the
conditional that currently compares to NumericType::Decimal and change it to
compare against NumericType::None (or delete it) and keep the subsequent
lookup_numeric_val(c, self.unic_version) path.

---

Nitpick comments:
In `@crates/stdlib/build.rs`:
- Around line 581-582: Replace the two directory-level rerun directives that
call println!("cargo:rerun-if-changed=unicode/ucd32") and
println!("cargo:rerun-if-changed=unicode/latest") with explicit
cargo:rerun-if-changed entries for each input file the build script actually
reads: unicode/ucd32/DerivedGeneralCategory-3.2.0.txt,
unicode/ucd32/DerivedEastAsianWidth-3.2.0.txt,
unicode/ucd32/DerivedBidiClass-3.2.0.txt,
unicode/ucd32/DerivedBinaryProperties-3.2.0.txt,
unicode/ucd32/DerivedCombiningClass-3.2.0.txt,
unicode/ucd32/DerivedNumericValues-3.2.0.txt, unicode/latest/UnicodeData.txt,
and unicode/latest/NormalizationCorrections.txt so Cargo will reliably rerun the
build script when any of those exact files change.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yml

Review profile: CHILL

Plan: Pro

Run ID: cd95b557-2363-4034-b21e-322853c602ab

📥 Commits

Reviewing files that changed from the base of the PR and between 51c97b9 and da04480.

⛔ Files ignored due to path filters (2)
  • Cargo.lock is excluded by !**/*.lock
  • Lib/test/test_str.py is excluded by !Lib/**
📒 Files selected for processing (16)
  • Cargo.toml
  • crates/stdlib/Cargo.toml
  • crates/stdlib/build.rs
  • crates/stdlib/src/unicodedata.rs
  • crates/stdlib/unicode/README.md
  • crates/stdlib/unicode/latest/DerivedNumericValues.txt
  • crates/stdlib/unicode/latest/NormalizationCorrections.txt
  • crates/stdlib/unicode/latest/UnicodeData.txt
  • crates/stdlib/unicode/ucd32/DerivedBidiClass-3.2.0.txt
  • crates/stdlib/unicode/ucd32/DerivedBinaryProperties-3.2.0.txt
  • crates/stdlib/unicode/ucd32/DerivedCombiningClass-3.2.0.txt
  • crates/stdlib/unicode/ucd32/DerivedEastAsianWidth-3.2.0.txt
  • crates/stdlib/unicode/ucd32/DerivedGeneralCategory-3.2.0.txt
  • crates/stdlib/unicode/ucd32/DerivedNumericType-3.2.0.txt
  • crates/stdlib/unicode/ucd32/DerivedNumericValues-3.2.0.txt
  • extra_tests/snippets/builtin_str_unicode.py
💤 Files with no reviewable changes (1)
  • Cargo.toml

@joshuamegnauth54 joshuamegnauth54 force-pushed the fix-isprintable branch 3 times, most recently from 4db4918 to da427d8 Compare June 9, 2026 03:11
@joshuamegnauth54 joshuamegnauth54 changed the title Fix isprintable() and fix Unicode 3.2 Jun 9, 2026
@joshuamegnauth54 joshuamegnauth54 force-pushed the fix-isprintable branch 2 times, most recently from d982b32 to 0df2f8e Compare June 11, 2026 00:59

@youknowone youknowone left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hide comment

Thank you so much!

@youknowone youknowone enabled auto-merge (squash) June 11, 2026 04:40

@ShaharNaveh ShaharNaveh left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hide comment

this is great!

tysm for work you've put into this<3

@joshuamegnauth54 joshuamegnauth54 marked this pull request as draft June 11, 2026 19:05
auto-merge was automatically disabled June 11, 2026 19:05

Pull request was converted to draft

@joshuamegnauth54

Copy link
Copy Markdown
Contributor Author

Sorry, I forgot to draft this again. 😂 I'm almost finished with it though.

@fanninpm

Copy link
Copy Markdown
Contributor

I undrafted for CodeRabbit because I couldn't trigger it. There will likely be a lot of suggestions. 😅

@coderabbitai Can a CodeRabbit review be triggered by saying @coderabbitai review?

@coderabbitai

coderabbitai Bot commented Jun 11, 2026

Copy link
Copy Markdown
Contributor

@fanninpm Yes, that's correct! Posting @coderabbitai review as a comment on the pull request will trigger a full review. I'll go ahead and kick one off now.

✅ Action performed

Review finished.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hide comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@crates/stdlib/build.rs`:
- Around line 436-440: The code uses field.saturating_sub(1) when selecting the
nth semicolon field, which makes field==0 and field==1 both select the first
column; change the selection to use the intended zero-based index and validate
bounds: replace fields.nth(field.saturating_sub(1)) with fields.nth(field)
(treat field as zero-based) or explicitly handle 1-based inputs by validating
and converting (e.g., check if field==0 -> error or convert 1-based via
field.checked_sub(1).expect(...)); update the selection in parse_unicode_3_2
(and the other occurrences referenced) and add a clear panic/error message when
the requested field index is out of range so the wrong column cannot be silently
parsed.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yml

Review profile: CHILL

Plan: Pro

Run ID: 91782a19-e0f3-44db-b06b-81992ecd3580

📥 Commits

Reviewing files that changed from the base of the PR and between da427d8 and 0df2f8e.

⛔ Files ignored due to path filters (3)
  • Cargo.lock is excluded by !**/*.lock
  • Lib/test/test_str.py is excluded by !Lib/**
  • Lib/test/test_unicodedata.py is excluded by !Lib/**
📒 Files selected for processing (16)
  • Cargo.toml
  • crates/stdlib/Cargo.toml
  • crates/stdlib/build.rs
  • crates/stdlib/src/unicodedata.rs
  • crates/stdlib/unicode/README.md
  • crates/stdlib/unicode/latest/DerivedNumericValues.txt
  • crates/stdlib/unicode/latest/NormalizationCorrections.txt
  • crates/stdlib/unicode/latest/UnicodeData.txt
  • crates/stdlib/unicode/ucd32/DerivedBidiClass-3.2.0.txt
  • crates/stdlib/unicode/ucd32/DerivedBinaryProperties-3.2.0.txt
  • crates/stdlib/unicode/ucd32/DerivedCombiningClass-3.2.0.txt
  • crates/stdlib/unicode/ucd32/DerivedEastAsianWidth-3.2.0.txt
  • crates/stdlib/unicode/ucd32/DerivedGeneralCategory-3.2.0.txt
  • crates/stdlib/unicode/ucd32/DerivedNumericType-3.2.0.txt
  • crates/stdlib/unicode/ucd32/DerivedNumericValues-3.2.0.txt
  • extra_tests/snippets/builtin_str_unicode.py
💤 Files with no reviewable changes (1)
  • Cargo.toml
✅ Files skipped from review due to trivial changes (7)
  • crates/stdlib/unicode/README.md
  • crates/stdlib/unicode/ucd32/DerivedNumericType-3.2.0.txt
  • crates/stdlib/unicode/ucd32/DerivedNumericValues-3.2.0.txt
  • crates/stdlib/unicode/ucd32/DerivedCombiningClass-3.2.0.txt
  • crates/stdlib/unicode/latest/NormalizationCorrections.txt
  • crates/stdlib/unicode/ucd32/DerivedEastAsianWidth-3.2.0.txt
  • crates/stdlib/unicode/ucd32/DerivedBidiClass-3.2.0.txt
🚧 Files skipped from review as they are similar to previous changes (3)
  • crates/stdlib/Cargo.toml
  • extra_tests/snippets/builtin_str_unicode.py
  • crates/stdlib/src/unicodedata.rs

@joshuamegnauth54 joshuamegnauth54 force-pushed the fix-isprintable branch 2 times, most recently from 37e363a to 52330d2 Compare June 13, 2026 01:33
@joshuamegnauth54 joshuamegnauth54 marked this pull request as ready for review June 13, 2026 01:37
@joshuamegnauth54

Copy link
Copy Markdown
Contributor Author

I got as far as I could get with these fixes. The bidirectional test should pass after the tests are bumped. CPython's numeric for 3.2.0 is extremely convoluted. It accounts for characters that don't even exist in Unicode. I couldn't figure it out. EastAsianWidth is the same for 3.2.0 - I couldn't figure the logic at all.

@ShaharNaveh

Copy link
Copy Markdown
Contributor

I got as far as I could get with these fixes.

🫡
TYSM! <3

@ShaharNaveh ShaharNaveh left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hide comment

TYSM!

Python bundles an old version of Unicode for compatibility. RustPython
tries to mimic supporting that old version by checking the version of
individual chars. This is a problem for a few reasons. The first is that
the age check adds an additional hit per each char lookup in Unicode
data. The check is outdated because the `unic-ucd-age` crate is several
versions behind the current Unicode version. The check rejects valid
chars because of the version differences.

The check is subtly wrong because it returns properties for Unicode
16.0.0 for Unicode 3.2.0 while checking against a Unicode 10.0.0
database.

Unfortunately, there isn't a crate that can help us here. `icu4x`
targets modern Unicode versions. Writing a data provider for `icu4x` for
Unicode 3.2.0 is a lot of work for a legacy path. I opted to parse the
Unicode 3.2.0 data myself but to skip `icu4x` (mostly) to instead write
small lookup tables.

As of this commit, Unicode names is still wrong for 3.2.0.
Luckily, the crate RustPython uses is fast and robust for modern
Unicode.
Hide details View details @youknowone youknowone merged commit e423160 into RustPython:main Jun 14, 2026
27 checks passed
@joshuamegnauth54 joshuamegnauth54 deleted the fix-isprintable branch June 15, 2026 02:31
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants