Use icu4x for UTF-8 float(); remove an alloc by joshuamegnauth54 · Pull Request #7998 · RustPython/RustPython
I removed an embedded table of non-ASCII numbers in favor of using icu_decimal. The benefits of using icu4x here are consistency plus Unicode updates. As Unicode is updated, we automatically reap the benefits without having to modify the table.
icu_decimal is also useful beyond float(). I'm also using it to clean up unicodedata in #7947.
- Closes #xxxx
- This PR follows our AI policy
Summary
- Removes a hand rolled table in lieu of
icu_decimal - Removes an unneeded heap allocation
Summary by CodeRabbit
-
Refactor
- Improved float-from-string handling to better parse Unicode/locale decimal inputs with more reliable fallback behavior.
-
Chores
- Added a decimal-parsing dependency to support the improved parsing.
-
Performance
- Introduced a compact buffer writer to reduce allocations during string-to-number conversion.
-
Breaking Changes
- Removed a legacy character-to-decimal conversion helper (public API removed).
No actionable comments were generated in the recent review. 🎉
ℹ️ Recent review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yml
Review profile: CHILL
Plan: Pro
Run ID: 9633f91a-533f-4ed1-9e54-f0995f622fbd
⛔ Files ignored due to path filters (1)
Cargo.lockis excluded by!**/*.lock
📒 Files selected for processing (5)
Cargo.tomlcrates/common/src/str.rscrates/vm/Cargo.tomlcrates/vm/src/builtins/float.rscrates/vm/src/utils.rs
💤 Files with no reviewable changes (1)
- crates/common/src/str.rs
🚧 Files skipped from review as they are similar to previous changes (4)
- Cargo.toml
- crates/vm/src/utils.rs
- crates/vm/src/builtins/float.rs
- crates/vm/Cargo.toml
📝 Walkthrough
Walkthrough
Integrates icu_decimal for UTF‑8 decimal parsing in float conversion, adds SliceFmtWriter for bounded formatting into a byte buffer, and removes the old char_to_decimal lookup and export.
Changes
ICU Decimal Parsing Integration
| Layer / File(s) | Summary |
|---|---|
Workspace and crate dependency setup Cargo.toml, crates/vm/Cargo.toml |
Adds icu_decimal = "2" to workspace dependencies and icu_decimal = { workspace = true } to crates/vm dependencies. |
Remove char_to_decimal helper crates/common/src/str.rs |
Removes UNICODE_DECIMAL_VALUES table and pub fn char_to_decimal(), and adds pub use ascii; re-export. |
SliceFmtWriter buffer utility crates/vm/src/utils.rs |
Introduces SliceFmtWriter<'slice> implementing core::fmt::Write with new(), remainder(), written(), and bounded write_str() that errors if a write does not fully fit. |
Float parsing with icu_decimal crates/vm/src/builtins/float.rs |
Imports icu_decimal::input::Decimal and SliceFmtWriter; UTF‑8 branch now parses s.trim() with Decimal::try_from_str(), serializes into a fixed 1024‑byte buffer via SliceFmtWriter, and falls back to s.as_bytes() on parse or write failure. |
Sequence Diagram(s)
sequenceDiagram
participant FloatStr as float_from_string
participant Decimal as icu_decimal::Decimal
participant Writer as SliceFmtWriter
participant FloatParse as std::parse
FloatStr->>Decimal: try_from_str(trimmed_utf8)
alt parse succeeds
Decimal-->>FloatStr: decimal object
FloatStr->>Writer: write_str(formatted)
Writer-->>FloatStr: formatted buffer
FloatStr->>FloatParse: parse buffer
else parse fails
Decimal-->>FloatStr: error
FloatStr->>FloatParse: parse original bytes
end
Estimated code review effort
🎯 3 (Moderate) | ⏱️ ~22 minutes
Suggested reviewers
- youknowone
- ShaharNaveh
Poem
🐰 I nibble digits snug and neat,
icu parses each tiny beat,
SliceFmtWriter stacks them tight,
Old tables hop away from sight,
Floats now leap on airy feet ✨
🚥 Pre-merge checks | ✅ 4 | ❌ 1
❌ Failed checks (1 warning)
| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | Docstring coverage is 33.33% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |
✅ Passed checks (4 passed)
| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately summarizes the main changes: replacing an embedded Unicode decimal table with icu4x's icu_decimal library and removing a heap allocation from UTF-8 float parsing. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
✏️ Tip: You can configure your own custom pre-merge checks in the settings.
✨ Finishing Touches
🧪 Generate unit tests (beta)
- Create PR with unit tests
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.
Comment @coderabbitai help to get the list of available commands and usage tips.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TYSM!
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@crates/vm/src/builtins/float.rs`:
- Around line 219-223: The ICU Decimal branch currently ignores
decimal.write_to's Result and always passes the full mapped_str_buf (including
unwritten trailing bytes) into parse_bytes; change the code in the
Decimal::try_from_str branch to capture the result of decimal.write_to(&mut
writer), handle/propagate any Err from write_to, read the actual number of bytes
written from the SliceFmtWriter (e.g., writer.written() or expose a written
field), and pass only &mapped_str_buf[..written] to
crate::literal::float::parse_bytes so trailing zero bytes are not fed into
lexical-core parsing.
In `@crates/vm/src/utils.rs`:
- Around line 113-118: SliceFmtWriter::write_str is using remainder as an index
and never advancing self.written; change it to treat remainder as the remaining
capacity, compute let to_write = remainder.min(s.len()), copy
s.as_bytes()[..to_write] into self.buf[self.written..self.written + to_write],
then increment self.written by to_write; return Err(fmt::Error) if to_write <
s.len(), otherwise Ok(()). This fixes bounds, limits the copied bytes to
available space, and advances the writer state (refer to the write_str method,
self.buf, self.written, and remainder()).
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yml
Review profile: CHILL
Plan: Pro
Run ID: 52267e3c-a8da-4d0e-b64e-0c094ff82b04
⛔ Files ignored due to path filters (1)
Cargo.lockis excluded by!**/*.lock
📒 Files selected for processing (5)
Cargo.tomlcrates/common/src/str.rscrates/vm/Cargo.tomlcrates/vm/src/builtins/float.rscrates/vm/src/utils.rs
💤 Files with no reviewable changes (1)
- crates/common/src/str.rs
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
♻️ Duplicate comments (1)
crates/vm/src/utils.rs (1)
115-117:⚠️ Potential issue | 🔴 Critical | ⚡ Quick winFix source slice bounds in partial copy logic.
On Line 116,
&s.as_bytes()[to_copy..]is the wrong half of the input and can triggercopy_from_slicelength mismatch panics when truncating. Use the prefix slice instead.Suggested fix
- self.buf[self.written..self.written + to_copy].copy_from_slice(&s.as_bytes()[to_copy..]); + self.buf[self.written..self.written + to_copy].copy_from_slice(&s.as_bytes()[..to_copy]);🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@crates/vm/src/utils.rs` around lines 115 - 117, The partial-copy currently takes the wrong suffix slice (&s.as_bytes()[to_copy..]) causing length mismatches; change the source to the prefix slice of length to_copy (use &s.as_bytes()[..to_copy]) when calling copy_from_slice so the copied length equals to_copy, and keep updating self.written as before in the function that computes to_copy via self.remainder().
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Duplicate comments:
In `@crates/vm/src/utils.rs`:
- Around line 115-117: The partial-copy currently takes the wrong suffix slice
(&s.as_bytes()[to_copy..]) causing length mismatches; change the source to the
prefix slice of length to_copy (use &s.as_bytes()[..to_copy]) when calling
copy_from_slice so the copied length equals to_copy, and keep updating
self.written as before in the function that computes to_copy via
self.remainder().
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yml
Review profile: CHILL
Plan: Pro
Run ID: f43ddc22-cecd-46a4-bdbb-7b876f4871d7
⛔ Files ignored due to path filters (1)
Cargo.lockis excluded by!**/*.lock
📒 Files selected for processing (5)
Cargo.tomlcrates/common/src/str.rscrates/vm/Cargo.tomlcrates/vm/src/builtins/float.rscrates/vm/src/utils.rs
💤 Files with no reviewable changes (1)
- crates/common/src/str.rs
🚧 Files skipped from review as they are similar to previous changes (3)
- crates/vm/Cargo.toml
- Cargo.toml
- crates/vm/src/builtins/float.rs
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
♻️ Duplicate comments (1)
crates/vm/src/utils.rs (1)
117-127:⚠️ Potential issue | 🔴 Critical | ⚡ Quick winSource slice index is inverted — will panic on every call.
Line 120 uses
&s.as_bytes()[to_copy..]which yields the bytes afterto_copy, with lengths.len() - to_copy. The destination slice has lengthto_copy. This mismatch causescopy_from_sliceto panic in nearly all cases (e.g.,s = "hello",to_copy = 5→ source is empty, destination is 5 bytes).The slice should be
[..to_copy](the firstto_copybytes).Proposed fix
impl fmt::Write for SliceFmtWriter<'_> { fn write_str(&mut self, s: &str) -> fmt::Result { let to_copy = self.remainder().min(s.len()); - self.buf[self.written..self.written + to_copy].copy_from_slice(&s.as_bytes()[to_copy..]); + self.buf[self.written..self.written + to_copy].copy_from_slice(&s.as_bytes()[..to_copy]); self.written += to_copy; if to_copy == s.len() { Ok(()) } else { Err(fmt::Error) } } }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@crates/vm/src/utils.rs` around lines 117 - 127, The write_str implementation in impl fmt::Write for SliceFmtWriter<'_> copies the wrong source slice (using &s.as_bytes()[to_copy..]) which produces a length mismatch and panics; change the source slice to the first to_copy bytes (&s.as_bytes()[..to_copy]) so it matches the destination length, keeping the remainder() call and updates to self.buf and self.written intact in the write_str method.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Duplicate comments:
In `@crates/vm/src/utils.rs`:
- Around line 117-127: The write_str implementation in impl fmt::Write for
SliceFmtWriter<'_> copies the wrong source slice (using
&s.as_bytes()[to_copy..]) which produces a length mismatch and panics; change
the source slice to the first to_copy bytes (&s.as_bytes()[..to_copy]) so it
matches the destination length, keeping the remainder() call and updates to
self.buf and self.written intact in the write_str method.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yml
Review profile: CHILL
Plan: Pro
Run ID: 66467756-7fae-4975-92bd-e1829f2a3631
⛔ Files ignored due to path filters (1)
Cargo.lockis excluded by!**/*.lock
📒 Files selected for processing (5)
Cargo.tomlcrates/common/src/str.rscrates/vm/Cargo.tomlcrates/vm/src/builtins/float.rscrates/vm/src/utils.rs
💤 Files with no reviewable changes (1)
- crates/common/src/str.rs
✅ Files skipped from review due to trivial changes (1)
- Cargo.toml
🚧 Files skipped from review as they are similar to previous changes (2)
- crates/vm/Cargo.toml
- crates/vm/src/builtins/float.rs
I removed an embedded table of non-ASCII numbers in favor of using `icu_decimal`. The benefits of using `icu4x` here are consistency plus Unicode updates. As Unicode is updated, we automatically reap the benefits without having to modify the table. `icu_decimal` is also useful beyond `float()`. I'm also using it to clean up `unicodedata` in RustPython#7947.
Okay, it turns out there isn't a nice way to do this without complicating the code. 😵💫
A string like:
float("0" * 100000 + "1e50")
Would still need to parse.