◐ Shell
clean mode source ↗

Use icu4x for UTF-8 float(); remove an alloc by joshuamegnauth54 · Pull Request #7998 · RustPython/RustPython

@joshuamegnauth54

I removed an embedded table of non-ASCII numbers in favor of using icu_decimal. The benefits of using icu4x here are consistency plus Unicode updates. As Unicode is updated, we automatically reap the benefits without having to modify the table.

icu_decimal is also useful beyond float(). I'm also using it to clean up unicodedata in #7947.

Summary

  • Removes a hand rolled table in lieu of icu_decimal
  • Removes an unneeded heap allocation

Summary by CodeRabbit

  • Refactor

    • Improved float-from-string handling to better parse Unicode/locale decimal inputs with more reliable fallback behavior.
  • Chores

    • Added a decimal-parsing dependency to support the improved parsing.
  • Performance

    • Introduced a compact buffer writer to reduce allocations during string-to-number conversion.
  • Breaking Changes

    • Removed a legacy character-to-decimal conversion helper (public API removed).

Review Change Stack

@coderabbitai

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yml

Review profile: CHILL

Plan: Pro

Run ID: 9633f91a-533f-4ed1-9e54-f0995f622fbd

📥 Commits

Reviewing files that changed from the base of the PR and between a2b5735 and 892b897.

⛔ Files ignored due to path filters (1)
  • Cargo.lock is excluded by !**/*.lock
📒 Files selected for processing (5)
  • Cargo.toml
  • crates/common/src/str.rs
  • crates/vm/Cargo.toml
  • crates/vm/src/builtins/float.rs
  • crates/vm/src/utils.rs
💤 Files with no reviewable changes (1)
  • crates/common/src/str.rs
🚧 Files skipped from review as they are similar to previous changes (4)
  • Cargo.toml
  • crates/vm/src/utils.rs
  • crates/vm/src/builtins/float.rs
  • crates/vm/Cargo.toml

📝 Walkthrough

Walkthrough

Integrates icu_decimal for UTF‑8 decimal parsing in float conversion, adds SliceFmtWriter for bounded formatting into a byte buffer, and removes the old char_to_decimal lookup and export.

Changes

ICU Decimal Parsing Integration

Layer / File(s) Summary
Workspace and crate dependency setup
Cargo.toml, crates/vm/Cargo.toml
Adds icu_decimal = "2" to workspace dependencies and icu_decimal = { workspace = true } to crates/vm dependencies.
Remove char_to_decimal helper
crates/common/src/str.rs
Removes UNICODE_DECIMAL_VALUES table and pub fn char_to_decimal(), and adds pub use ascii; re-export.
SliceFmtWriter buffer utility
crates/vm/src/utils.rs
Introduces SliceFmtWriter<'slice> implementing core::fmt::Write with new(), remainder(), written(), and bounded write_str() that errors if a write does not fully fit.
Float parsing with icu_decimal
crates/vm/src/builtins/float.rs
Imports icu_decimal::input::Decimal and SliceFmtWriter; UTF‑8 branch now parses s.trim() with Decimal::try_from_str(), serializes into a fixed 1024‑byte buffer via SliceFmtWriter, and falls back to s.as_bytes() on parse or write failure.

Sequence Diagram(s)

sequenceDiagram
  participant FloatStr as float_from_string
  participant Decimal as icu_decimal::Decimal
  participant Writer as SliceFmtWriter
  participant FloatParse as std::parse
  FloatStr->>Decimal: try_from_str(trimmed_utf8)
  alt parse succeeds
    Decimal-->>FloatStr: decimal object
    FloatStr->>Writer: write_str(formatted)
    Writer-->>FloatStr: formatted buffer
    FloatStr->>FloatParse: parse buffer
  else parse fails
    Decimal-->>FloatStr: error
    FloatStr->>FloatParse: parse original bytes
  end
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~22 minutes

Suggested reviewers

  • youknowone
  • ShaharNaveh

Poem

🐰 I nibble digits snug and neat,
icu parses each tiny beat,
SliceFmtWriter stacks them tight,
Old tables hop away from sight,
Floats now leap on airy feet ✨

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 33.33% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately summarizes the main changes: replacing an embedded Unicode decimal table with icu4x's icu_decimal library and removing a heap allocation from UTF-8 float parsing.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

ShaharNaveh

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TYSM!

coderabbitai[bot]

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@crates/vm/src/builtins/float.rs`:
- Around line 219-223: The ICU Decimal branch currently ignores
decimal.write_to's Result and always passes the full mapped_str_buf (including
unwritten trailing bytes) into parse_bytes; change the code in the
Decimal::try_from_str branch to capture the result of decimal.write_to(&mut
writer), handle/propagate any Err from write_to, read the actual number of bytes
written from the SliceFmtWriter (e.g., writer.written() or expose a written
field), and pass only &mapped_str_buf[..written] to
crate::literal::float::parse_bytes so trailing zero bytes are not fed into
lexical-core parsing.

In `@crates/vm/src/utils.rs`:
- Around line 113-118: SliceFmtWriter::write_str is using remainder as an index
and never advancing self.written; change it to treat remainder as the remaining
capacity, compute let to_write = remainder.min(s.len()), copy
s.as_bytes()[..to_write] into self.buf[self.written..self.written + to_write],
then increment self.written by to_write; return Err(fmt::Error) if to_write <
s.len(), otherwise Ok(()). This fixes bounds, limits the copied bytes to
available space, and advances the writer state (refer to the write_str method,
self.buf, self.written, and remainder()).
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yml

Review profile: CHILL

Plan: Pro

Run ID: 52267e3c-a8da-4d0e-b64e-0c094ff82b04

📥 Commits

Reviewing files that changed from the base of the PR and between 26f817a and 95c9965.

⛔ Files ignored due to path filters (1)
  • Cargo.lock is excluded by !**/*.lock
📒 Files selected for processing (5)
  • Cargo.toml
  • crates/common/src/str.rs
  • crates/vm/Cargo.toml
  • crates/vm/src/builtins/float.rs
  • crates/vm/src/utils.rs
💤 Files with no reviewable changes (1)
  • crates/common/src/str.rs

coderabbitai[bot]

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

♻️ Duplicate comments (1)
crates/vm/src/utils.rs (1)

115-117: ⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

Fix source slice bounds in partial copy logic.

On Line 116, &s.as_bytes()[to_copy..] is the wrong half of the input and can trigger copy_from_slice length mismatch panics when truncating. Use the prefix slice instead.

Suggested fix
-        self.buf[self.written..self.written + to_copy].copy_from_slice(&s.as_bytes()[to_copy..]);
+        self.buf[self.written..self.written + to_copy].copy_from_slice(&s.as_bytes()[..to_copy]);
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/vm/src/utils.rs` around lines 115 - 117, The partial-copy currently
takes the wrong suffix slice (&s.as_bytes()[to_copy..]) causing length
mismatches; change the source to the prefix slice of length to_copy (use
&s.as_bytes()[..to_copy]) when calling copy_from_slice so the copied length
equals to_copy, and keep updating self.written as before in the function that
computes to_copy via self.remainder().
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Duplicate comments:
In `@crates/vm/src/utils.rs`:
- Around line 115-117: The partial-copy currently takes the wrong suffix slice
(&s.as_bytes()[to_copy..]) causing length mismatches; change the source to the
prefix slice of length to_copy (use &s.as_bytes()[..to_copy]) when calling
copy_from_slice so the copied length equals to_copy, and keep updating
self.written as before in the function that computes to_copy via
self.remainder().

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yml

Review profile: CHILL

Plan: Pro

Run ID: f43ddc22-cecd-46a4-bdbb-7b876f4871d7

📥 Commits

Reviewing files that changed from the base of the PR and between 95c9965 and 0c3a9c5.

⛔ Files ignored due to path filters (1)
  • Cargo.lock is excluded by !**/*.lock
📒 Files selected for processing (5)
  • Cargo.toml
  • crates/common/src/str.rs
  • crates/vm/Cargo.toml
  • crates/vm/src/builtins/float.rs
  • crates/vm/src/utils.rs
💤 Files with no reviewable changes (1)
  • crates/common/src/str.rs
🚧 Files skipped from review as they are similar to previous changes (3)
  • crates/vm/Cargo.toml
  • Cargo.toml
  • crates/vm/src/builtins/float.rs

coderabbitai[bot]

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

♻️ Duplicate comments (1)
crates/vm/src/utils.rs (1)

117-127: ⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

Source slice index is inverted — will panic on every call.

Line 120 uses &s.as_bytes()[to_copy..] which yields the bytes after to_copy, with length s.len() - to_copy. The destination slice has length to_copy. This mismatch causes copy_from_slice to panic in nearly all cases (e.g., s = "hello", to_copy = 5 → source is empty, destination is 5 bytes).

The slice should be [..to_copy] (the first to_copy bytes).

Proposed fix
 impl fmt::Write for SliceFmtWriter<'_> {
     fn write_str(&mut self, s: &str) -> fmt::Result {
         let to_copy = self.remainder().min(s.len());
-        self.buf[self.written..self.written + to_copy].copy_from_slice(&s.as_bytes()[to_copy..]);
+        self.buf[self.written..self.written + to_copy].copy_from_slice(&s.as_bytes()[..to_copy]);
         self.written += to_copy;
         if to_copy == s.len() {
             Ok(())
         } else {
             Err(fmt::Error)
         }
     }
 }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/vm/src/utils.rs` around lines 117 - 127, The write_str implementation
in impl fmt::Write for SliceFmtWriter<'_> copies the wrong source slice (using
&s.as_bytes()[to_copy..]) which produces a length mismatch and panics; change
the source slice to the first to_copy bytes (&s.as_bytes()[..to_copy]) so it
matches the destination length, keeping the remainder() call and updates to
self.buf and self.written intact in the write_str method.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Duplicate comments:
In `@crates/vm/src/utils.rs`:
- Around line 117-127: The write_str implementation in impl fmt::Write for
SliceFmtWriter<'_> copies the wrong source slice (using
&s.as_bytes()[to_copy..]) which produces a length mismatch and panics; change
the source slice to the first to_copy bytes (&s.as_bytes()[..to_copy]) so it
matches the destination length, keeping the remainder() call and updates to
self.buf and self.written intact in the write_str method.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yml

Review profile: CHILL

Plan: Pro

Run ID: 66467756-7fae-4975-92bd-e1829f2a3631

📥 Commits

Reviewing files that changed from the base of the PR and between 0c3a9c5 and a2b5735.

⛔ Files ignored due to path filters (1)
  • Cargo.lock is excluded by !**/*.lock
📒 Files selected for processing (5)
  • Cargo.toml
  • crates/common/src/str.rs
  • crates/vm/Cargo.toml
  • crates/vm/src/builtins/float.rs
  • crates/vm/src/utils.rs
💤 Files with no reviewable changes (1)
  • crates/common/src/str.rs
✅ Files skipped from review due to trivial changes (1)
  • Cargo.toml
🚧 Files skipped from review as they are similar to previous changes (2)
  • crates/vm/Cargo.toml
  • crates/vm/src/builtins/float.rs
I removed an embedded table of non-ASCII numbers in favor of using
`icu_decimal`. The benefits of using `icu4x` here are consistency plus
Unicode updates. As Unicode is updated, we automatically reap the
benefits without having to modify the table.

`icu_decimal` is also useful beyond `float()`. I'm also using it to
clean up `unicodedata` in RustPython#7947.

@joshuamegnauth54

Okay, it turns out there isn't a nice way to do this without complicating the code. 😵‍💫

A string like:

float("0" * 100000 + "1e50")

Would still need to parse.