Refactor PyUtf8Str by youknowone · Pull Request #6047 · RustPython/RustPython

coderabbitai

vm/src/utils.rs (1)

5-5: LGTM! Properly extends ToCString trait for PyUtf8Str.

The import addition and trait implementation correctly enable PyUtf8Str to be converted to C strings, maintaining consistency with the existing PyStr implementation. This is essential for C API interoperability in modules like SQLite.

Also applies to: 38-38
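As a rough illustration of the pattern described for vm/src/utils.rs, the sketch below implements one conversion trait for two string wrappers. `PlainStr`, `Utf8Str`, and this `ToCString` definition are hypothetical stand-ins and may not match RustPython's actual signatures; only `std::ffi::CString` is a real API here.

```rust
use std::ffi::{CString, NulError};

// Hypothetical stand-ins for the two string types; not RustPython's definitions.
pub struct PlainStr(pub String);
pub struct Utf8Str(pub String);

// Assumed shape of the conversion trait. It is fallible because a C string
// cannot contain an interior NUL byte.
pub trait ToCString {
    fn to_cstring(&self) -> Result<CString, NulError>;
}

impl ToCString for PlainStr {
    fn to_cstring(&self) -> Result<CString, NulError> {
        CString::new(self.0.as_bytes())
    }
}

// Implementing the same trait for the validated type lets C-facing code
// accept either string type through one interface.
impl ToCString for Utf8Str {
    fn to_cstring(&self) -> Result<CString, NulError> {
        CString::new(self.0.as_bytes())
    }
}
```

With both implementations in place, a C-facing caller can accept any `T: ToCString` and does not need to know which string type it was given.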

vm/src/format.rs (1)

115-115: LGTM! Correctly uses WTF-8 string view.

The change from as_ref() to as_wtf8() is more explicit about the encoding and aligns with the broader refactor to handle UTF-8/WTF-8 string representations consistently across the codebase.

vm/src/builtins/slice.rs (1)

295-295: LGTM! Beneficial code simplification.

Removing the unnecessary .as_str() calls simplifies the code since PyStrRef implements the Display trait and can be used directly in the format! macro. This improves readability while maintaining the same functionality.

Also applies to: 300-300
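The simplification described for slice.rs relies on a general Rust property: any type that implements `Display` can be passed to `format!` directly. A minimal sketch, assuming a hypothetical `Name` wrapper in place of the real string reference type:

```rust
use std::fmt;

// Hypothetical wrapper standing in for a string reference type.
pub struct Name(pub String);

impl fmt::Display for Name {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        f.write_str(&self.0)
    }
}

// Because `Name` implements `Display`, it goes straight into `format!`
// without first borrowing a `&str`.
pub fn describe_step(step: &Name) -> String {
    format!("slice step: {}", step)
}
```

The same `Display` implementation also provides `to_string()` through the standard library's blanket `ToString` implementation.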

vm/src/stdlib/ctypes/structure.rs (1)

42-42: LGTM! Streamlined string conversion.

The direct call to to_string() eliminates the unnecessary intermediate .as_str() step since PyStr implements the ToString trait directly. This simplifies the code while maintaining the same functionality.

vm/src/builtins/mod.rs (1)

62-62: LGTM! Properly exposes new UTF-8 string types.

Adding PyUtf8Str and PyUtf8StrRef to the public exports enables their use across the codebase as part of the UTF-8 validation refactor. The naming follows established conventions and placement with other string types is appropriate.

vm/src/vm/mod.rs (1)

475-475: LGTM! Consistent with UTF-8 string handling improvements.

This change aligns with the broader refactoring to use UTF-8 validated string types and removes the need for explicit .as_str() conversion when printing error messages.

vm/src/builtins/builtin_func.rs (2)

5-5: LGTM! Necessary import for WTF-8 string handling.

The import of Wtf8 supports the updated debug formatting that uses WTF-8 string types consistently.

31-31: LGTM! Consistent WTF-8 formatting for debug output.

The change ensures consistent type handling in debug formatting by using Wtf8::new("<unknown>") instead of a plain string literal, matching the Wtf8 type returned by m.as_wtf8() when the module is present.

vm/src/stdlib/pwd.rs (2)

9-9: LGTM! Updated import for UTF-8 validated string types.

The import change supports the function signature update to use PyUtf8StrRef for better type safety.

57-57: LGTM! Improved type safety with UTF-8 validated strings.

Changing the parameter type from PyStrRef to PyUtf8StrRef enforces UTF-8 validation at the API boundary, which is appropriate for usernames and aligns with the broader migration to UTF-8 validated string types.
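The idea of validating at the API boundary can be sketched with a small newtype. `ValidUtf8` and `lookup_user` below are hypothetical names used only for illustration; they are not part of the PR.

```rust
use std::str::Utf8Error;

// Hypothetical validated wrapper: holding one proves the bytes were valid UTF-8.
pub struct ValidUtf8(String);

impl ValidUtf8 {
    // The only constructor validates, so the check happens once, at the boundary.
    pub fn from_bytes(bytes: Vec<u8>) -> Result<Self, Utf8Error> {
        String::from_utf8(bytes)
            .map(ValidUtf8)
            .map_err(|e| e.utf8_error())
    }

    // Infallible: validity was established at construction.
    pub fn as_str(&self) -> &str {
        &self.0
    }
}

// A function that takes the validated type needs no encoding error path.
pub fn lookup_user(name: &ValidUtf8) -> String {
    format!("looking up {}", name.as_str())
}
```

Callers that hold unvalidated bytes must go through `from_bytes` first, which moves the failure case out of `lookup_user` and into the argument conversion.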

vm/src/builtins/namespace.rs (1)

92-94: LGTM! Improved string representation handling with WTF-8.

The change from as_str() to as_wtf8() improves robustness by handling strings that contain lone surrogate code points, which a strict UTF-8 &str cannot represent, while maintaining the same representation logic for namespace objects.

vm/src/stdlib/posix.rs (2)

27-27: LGTM: Import addition supports UTF-8 validated strings.

The addition of PyUtf8StrRef to the import list is consistent with the PR's objective to introduce UTF-8 validated string types across the codebase.

2245-2245: LGTM: UTF-8 validated string conversion improves type safety.

The change from PyStrRef::try_from_object to PyUtf8StrRef::try_from_object ensures UTF-8 validity at the parameter level for system configuration name parsing. This eliminates the need for fallible UTF-8 conversions while maintaining the same parsing logic.

vm/src/stdlib/codecs.rs (2)

10-10: LGTM: Import addition supports UTF-8 validated strings.

The addition of PyUtf8StrRef to the import list enables the use of UTF-8 validated string references in the codec lookup functionality.

26-29: LGTM: UTF-8 validated encoding parameter simplifies codec lookup.

The changes improve the lookup function by:

Accepting PyUtf8StrRef parameter to ensure UTF-8 validity of encoding names
Using encoding.as_str() directly instead of the fallible try_to_str(vm)? conversion
Eliminating potential error handling for UTF-8 validation

This is consistent with the broader refactoring to use UTF-8 validated string references across the codebase.

vm/src/builtins/type.rs (1)

343-343: LGTM: Efficient byte-level comparison for ASCII slot names.

This change replaces string slice comparisons with direct byte comparisons when checking for Python special method names (dunder methods). Since only the ASCII characters __ are being checked, the byte comparison is safe and gives the same result, and it does not require a UTF-8 validated &str view of the name.

Also applies to: 349-349

vm/src/builtins/super.rs (1)

107-107: LGTM: Consistent byte-level comparison for __class__ checks.

Both changes replace string slice comparisons with byte slice comparisons when checking for the __class__ identifier. Since __class__ is pure ASCII, the byte comparison is safe and behaves identically, without requiring a UTF-8 validated &str view of the name.

Also applies to: 165-165
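The byte-level checks described for type.rs and super.rs can be sketched as follows. Both helper names are hypothetical, and the exact conditions used in RustPython may differ.

```rust
// Hypothetical helper: detects a Python-style "dunder" name by looking only
// at the raw bytes, so no `&str` view of the name is required.
pub fn is_dunder(name: &[u8]) -> bool {
    name.len() > 4 && name.starts_with(b"__") && name.ends_with(b"__")
}

// Hypothetical helper: exact comparison against an ASCII identifier.
pub fn is_class_cell(name: &[u8]) -> bool {
    name == b"__class__"
}
```

Because the patterns are pure ASCII, comparing bytes gives the same answer as comparing text, including for names that contain non-ASCII characters elsewhere.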

vm/src/stdlib/operator.rs (3)

8-9: LGTM: Clean import organization.

Good organization of imports: adding Wtf8 and grouping Either with the other function imports improves readability and prepares for WTF-8 string handling.

327-327: LGTM: Simplified ASCII check.

Direct call to is_ascii() on PyStrRef is cleaner and more readable than the previous approach, maintaining the same functionality with better code clarity.

374-382: LGTM: Optimized attribute name handling with WTF-8 support.

This change optimizes dotted attribute name processing by:

Working at the byte level for splitting (more efficient)
Using WTF-8 conversion to handle potentially ill-formed UTF-8 sequences
Maintaining compatibility with the existing API

The approach is consistent with the broader string handling optimizations throughout the codebase and properly handles edge cases with non-standard string encodings.
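Splitting a dotted attribute path at the byte level can be sketched like this. `split_dotted` is a hypothetical helper, not the function in operator.rs.

```rust
// Hypothetical helper: splits a dotted attribute path on the ASCII '.' byte.
// In UTF-8 and WTF-8, every byte of a multi-byte sequence is >= 0x80, so
// splitting on an ASCII byte never cuts a code point in half.
pub fn split_dotted(path: &[u8]) -> Vec<&[u8]> {
    path.split(|&b| b == b'.').collect()
}
```

Each returned piece is a sub-slice of the input, so no copying takes place and pieces that are not valid UTF-8 pass through untouched.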

vm/src/vm/vm_ops.rs (2)

4-5: LGTM: Necessary imports for UTF-8 string support.

Adding PyRef and PyUtf8Str imports to support the new format_utf8 method. The imports are properly organized and necessary for the new functionality.

521-523: LGTM: Well-designed UTF-8 formatting method.

The new format_utf8 method provides a convenient API for formatting objects into UTF-8 validated strings. The implementation:

Reuses the existing format method appropriately
Uses fallible conversion (try_into_utf8) to ensure UTF-8 validity
Follows the established pattern for UTF-8 string type usage
Provides clear error handling through the PyResult return type

This complements the broader UTF-8 string type introduction across the codebase.
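The "delegate, then convert fallibly" shape described above can be sketched with standard library types. `Obj` and both methods are hypothetical stand-ins; the real method works with VM objects and returns a PyResult.

```rust
use std::string::FromUtf8Error;

// Hypothetical object whose formatted form is raw bytes that may not be UTF-8.
pub struct Obj(pub Vec<u8>);

impl Obj {
    // Stand-in for the general formatting method: no UTF-8 guarantee.
    pub fn format(&self) -> Vec<u8> {
        self.0.clone()
    }

    // Validated variant: reuse `format`, then convert fallibly so the
    // caller receives either a valid string or an error.
    pub fn format_utf8(&self) -> Result<String, FromUtf8Error> {
        String::from_utf8(self.format())
    }
}
```

Keeping the validated variant as a thin wrapper means the formatting logic lives in one place and only the conversion step can introduce the new error.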

vm/src/stdlib/time.rs (3)

38-38: LGTM: Import addition is correct.

The addition of PyUtf8StrRef to the imports is necessary and appropriate for the strftime function signature change.

347-351: LGTM: Function signature update is semantically correct.

Changing the format parameter from PyStrRef to PyUtf8StrRef is appropriate since strftime format strings should be valid UTF-8. This ensures the function receives UTF-8 validated input, eliminating the need for redundant validation.

362-363: LGTM: Direct UTF-8 string access leverages type guarantees.

The changes correctly utilize the UTF-8 validity guarantee of PyUtf8StrRef:

format.as_str() safely provides direct access to the UTF-8 string slice
format.to_string() appropriately handles the error fallback case

This simplifies the code by eliminating redundant UTF-8 validation that was previously required with PyStrRef.

vm/src/stdlib/sys.rs (1)

715-715: LGTM: Improved error handling for strings with surrogates.

The change from str.as_str() to str.to_str().unwrap_or("<str with surrogate>") is a defensive programming improvement that:

Safely handles strings containing surrogates without panicking
Provides a clear, descriptive fallback message for debugging
Prevents crashes in error handling paths when dealing with malformed UTF-8

This aligns well with the broader goal of safer string handling throughout the codebase.
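The fallback pattern can be sketched with `std::str::from_utf8`, which rejects the byte encoding of a lone surrogate. `display_or_placeholder` is a hypothetical helper; the placeholder text is the one quoted in the comment above.

```rust
// Hypothetical helper: returns a `&str` view when the bytes are valid UTF-8
// and a fixed placeholder otherwise, so the caller never panics.
pub fn display_or_placeholder(bytes: &[u8]) -> &str {
    std::str::from_utf8(bytes).unwrap_or("<str with surrogate>")
}
```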

vm/src/warn.rs (5)

125-125: LGTM: UTF-8 validation for warning action strings.

Changing from str(vm) to str_utf8(vm) is appropriate for warning action strings since they should be valid UTF-8 (e.g., "error", "ignore", "default"). This provides the necessary UTF-8 guarantee for downstream processing.

204-205: LGTM: Robust filename handling for potentially invalid UTF-8.

The changes improve filename handling robustness:

Line 204: Using as_bytes().ends_with(b".py") provides byte-level comparison that doesn't require UTF-8 validity
Line 205: Using as_wtf8() for slicing safely handles filenames with potentially invalid UTF-8 sequences

This prevents crashes when processing filenames containing invalid UTF-8, which can occur in real-world scenarios.
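A minimal sketch of suffix handling that never needs a `&str` view, assuming the goal is to drop a trailing ".py". `module_name` is a hypothetical helper and does not reproduce the logic in warn.rs.

```rust
// Hypothetical helper: strips a trailing ".py" using only byte comparisons,
// so it also works when the name is not valid UTF-8.
pub fn module_name(filename: &[u8]) -> &[u8] {
    if filename.ends_with(b".py") {
        // The suffix is three ASCII bytes, so this cut cannot split a code point.
        &filename[..filename.len() - 3]
    } else {
        filename
    }
}
```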

235-235: LGTM: WTF-8 handling for warning message text.

Changing from as_str() to as_wtf8() appropriately handles warning messages that contain lone surrogates. WTF-8 extends UTF-8 to encode unpaired surrogate code points, so the message can be processed without the panic that a strict &str conversion could trigger.

281-281: LGTM: Consistent UTF-8 validation for action comparisons.

The changes from str(vm)?.as_str() to str_utf8(vm)?.as_str() are consistent with the UTF-8 validation approach for warning actions. Since these comparisons are against string literals ("error", "ignore"), UTF-8 validation is appropriate and ensures type consistency.

Also applies to: 285-285

348-348: LGTM: Simplified string formatting without explicit conversion.

Removing the .as_str() call is a good simplification since writeln! can handle PyStrRef directly. This also avoids potential panics if the warning text contains surrogates, making the code more robust.

vm/src/object/payload.rs (2)

29-31: LGTM: Simple and correct downcast viability check.

The downcastable_from method provides a clean interface for checking if an object can be downcasted to the payload type. The implementation correctly uses payload_type_id() for type comparison, following established patterns in the codebase.

33-49: LGTM: Well-designed error handling for safe downcasting.

The try_downcast_from method implementation demonstrates excellent Rust practices:

Type Safety: Uses downcastable_from check before proceeding
Performance: Cold function optimization prevents inlining the error path
Consistency: Leverages vm.new_downcast_type_error for uniform error messaging
Ergonomics: Returns PyResult<()> following established VM patterns

This provides a safe, efficient foundation for downcasting operations throughout the VM with proper error reporting.
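The check-then-report structure, with the error path kept out of line, can be sketched using `std::any::Any`. `DowncastError` and this free function are hypothetical; the real method lives on the payload trait and builds its error through the VM.

```rust
use std::any::Any;

// Hypothetical error type standing in for a VM type error.
#[derive(Debug, PartialEq)]
pub struct DowncastError(pub String);

// Checks whether `obj` holds a `T`. The error is built in a separate
// `#[cold]` function to hint that this path is rarely taken.
pub fn try_downcast_from<T: Any>(obj: &dyn Any) -> Result<(), DowncastError> {
    if obj.is::<T>() {
        return Ok(());
    }

    #[cold]
    fn raise(expected: &str) -> DowncastError {
        DowncastError(format!("expected {}", expected))
    }

    Err(raise(std::any::type_name::<T>()))
}
```

Returning `Result<(), _>` separates the question "is this cast allowed?" from the cast itself, so a caller can perform an unchecked cast only after this function has succeeded.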

vm/src/protocol/object.rs (6)

8-8: LGTM! Import addition aligns with UTF-8 string type introduction.

The addition of PyUtf8Str to the imports is consistent with the broader refactoring to introduce UTF-8 validated string types across the codebase.

331-333: LGTM! UTF-8 validated repr method implementation is correct.

The repr_utf8 method properly delegates to the existing repr method and converts the result to a UTF-8 validated string using try_into_utf8. This follows the established pattern for providing UTF-8 validated variants.

335-335: LGTM! Explicit return type clarification.

Making the return type explicitly PyRef<PyStr> instead of implicit helps distinguish between the generic string method and its UTF-8 validated counterpart (repr_utf8).

353-353: LGTM! Consistent usage of UTF-8 validated repr.

The ascii method now correctly uses repr_utf8() instead of repr(), which is appropriate since ASCII conversion requires valid UTF-8 input. This ensures type safety and UTF-8 validation.

358-360: LGTM! UTF-8 validated str method implementation is correct.

The str_utf8 method follows the same pattern as repr_utf8, properly delegating to the existing str method and converting to UTF-8 validated string. This provides a clean API for callers who need UTF-8 guarantees.

361-361: LGTM! Explicit return type clarification.

Similar to the repr method, making the return type explicitly PyRef<PyStr> helps distinguish this generic method from its UTF-8 validated counterpart (str_utf8).

vm/src/exceptions.rs (2)

201-205: LGTM! Cleaner filename handling with direct match.

The refactoring simplifies the filename handling by using a direct match statement instead of chaining unwrap_or_else and map. This is more readable and maintainable while preserving the same behavior - using the provided filename or defaulting to "<string>".

1497-1497: LGTM! Improved control flow with single return point.

The refactoring consolidates the return logic to use a single return point by assigning the result to a local str variable. This is a good practice in Rust as it:

Reduces code duplication
Makes the control flow clearer
Maintains the same functionality while improving readability

The logic remains identical - return the formatted OS error string for 2-argument cases, otherwise delegate to the base __str__ method.

Also applies to: 1517-1517, 1519-1521

vm/src/types/slot.rs (3)

172-172: LGTM! Type alias improvement for better clarity.

The change from PyStrRef to PyRef<PyStr> improves type explicitness and aligns with the broader refactoring effort to distinguish string types in the codebase.

253-253: LGTM! Function signature updated consistently.

The return type change aligns with the StringifyFunc type alias update and maintains consistency throughout the codebase.

980-980: LGTM! Trait method signatures updated consistently.

All Representable trait methods have been updated to use the more explicit PyRef<PyStr> return type, maintaining consistency with the StringifyFunc type alias and the broader string type refactoring effort.

Also applies to: 989-989, 994-994

vm/src/object/core.rs (4)

544-547: LGTM! Safe downcasting method with proper error handling.

The new try_downcast method provides a safe alternative to the existing downcast method by using T::try_downcast_from for validation and VM-based error generation. The unsafe downcast_unchecked is only called after successful validation, which is the correct pattern.

728-731: LGTM! Internal method for exposing TypeId.

The typeid method provides necessary access to the object's internal TypeId for downcasting operations. The pub(crate) visibility is appropriate for this internal implementation detail.

736-737: LGTM! Delegating downcast check to payload type.

The updated downcastable method delegates to T::downcastable_from(self), allowing payload types to customize their downcasting logic. This is particularly useful for types like PyUtf8Str that need additional validation beyond type ID matching.
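Delegating the check to the payload type can be sketched as a trait method with a default body that one implementor overrides. All names below (`Object`, `Payload`, `Bytes`, `Utf8View`) are hypothetical and only mirror the idea, not RustPython's definitions.

```rust
use std::any::Any;

// Hypothetical object holding a type-erased payload.
pub struct Object(pub Box<dyn Any>);

// Assumed shape of the payload trait: the default check compares types only,
// and an implementor may override it to add further validation.
pub trait Payload: Any + Sized {
    fn downcastable_from(obj: &Object) -> bool {
        obj.0.is::<Self>()
    }
}

// Plain byte-string payload: the default type check is enough.
pub struct Bytes(pub Vec<u8>);
impl Payload for Bytes {}

// Validated view of the same payload: a `Bytes` object qualifies only
// when its contents are valid UTF-8.
pub struct Utf8View;
impl Payload for Utf8View {
    fn downcastable_from(obj: &Object) -> bool {
        obj.0
            .downcast_ref::<Bytes>()
            .map_or(false, |b| std::str::from_utf8(&b.0).is_ok())
    }
}

impl Object {
    // The object no longer decides by type identity alone; it asks the target type.
    pub fn downcastable<T: Payload>(&self) -> bool {
        T::downcastable_from(self)
    }
}
```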

740-746: LGTM! Safe reference-based downcasting method.

The new try_downcast_ref method provides a safe reference-based alternative to downcast_ref with VM-based error handling. The implementation correctly validates through T::try_downcast_from before performing the unsafe cast, maintaining consistency with the other safe downcasting methods.

vm/src/stdlib/io.rs (6)

123-123: LGTM! Import addition is correct.

The addition of PyUtf8StrRef to the imports is necessary for the UTF-8 string refactoring and is properly placed alphabetically.

1582-1582: LGTM! Consistent return type updates for repr methods.

The changes from PyStrRef to PyRef<PyStr> for both slot_repr and __repr__ methods are consistent with the broader string type refactoring. The implementation logic remains unchanged, maintaining functionality while improving type safety.

Also applies to: 1595-1595

2317-2317: LGTM! Improved string access pattern.

The change from encoding.try_to_str(vm)? to encoding.as_str() is correct and more efficient. Since the parameter is now PyUtf8StrRef (UTF-8 validated), the fallible conversion is no longer needed, and direct string slice access is appropriate.

2417-2417: LGTM! Consistent UTF-8 string access pattern.

The change from encoding.try_to_str(vm)? to encoding.as_str() follows the same correct pattern as other changes in the file, taking advantage of UTF-8 pre-validation to eliminate unnecessary error handling.

3895-3895: LGTM! Systematic UTF-8 string type adoption.

The changes systematically replace PyStrRef with PyUtf8StrRef for mode and encoding parameters in I/O-related structs:

IoOpenArgs.mode: Now uses UTF-8 validated strings for file modes
OpenArgs.encoding: Now uses UTF-8 validated strings for encoding parameters
FileIOArgs.mode: Consistent with the pattern
Import addition in fileio module enables the new types

These changes are consistent, well-structured, and align with the UTF-8 validation goals of the refactoring.

Also applies to: 3921-3921, 4133-4133, 4260-4260

4298-4299: LGTM! Completion of UTF-8 string refactoring pattern.

The changes correctly complete the UTF-8 string adoption:

Line 4298: Using PyUtf8Str::from("rb") instead of PyStr::from("rb") is appropriate since "rb" is valid UTF-8
Line 4299: Direct string access with as_str() replaces fallible conversion, taking advantage of UTF-8 pre-validation

These changes are consistent with the systematic refactoring throughout the file.

vm/src/builtins/str.rs (7)

18-18: LGTM: Import addition for trait implementation

The addition of MaybeTraverse, Traverse, and TraverseFn imports is necessary for the new PyUtf8Str type implementation.

67-69: Type aliases properly defined

The type aliases follow the established pattern and provide convenient shorthand for both regular and UTF-8 validated string references.

354-354: Constructor return type correctly updated

The change from PyResult<PyRef<Self>> to PyResult allows for proper subclassing support.

438-455: Enhanced UTF-8 validation with better error reporting

The method is correctly made private and now provides detailed error information including the exact position of invalid surrogates.
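Reporting where validation failed can be sketched with `Utf8Error::valid_up_to`, which gives the byte offset of the first invalid sequence. `InvalidAt` and `ensure_utf8` are hypothetical names; the real method raises a Python exception instead.

```rust
// Hypothetical error carrying the byte offset where validation failed.
#[derive(Debug, PartialEq)]
pub struct InvalidAt {
    pub position: usize,
}

// Returns the validated view, or the offset of the first invalid byte.
pub fn ensure_utf8(bytes: &[u8]) -> Result<&str, InvalidAt> {
    std::str::from_utf8(bytes).map_err(|e| InvalidAt {
        position: e.valid_up_to(),
    })
}
```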

954-974: Format method properly updated for UTF-8 handling

The changes ensure format specifications are valid strings and the formatted result is UTF-8 validated before processing.

1475-1483: UTF-8 validation method correctly implemented

The try_as_utf8 method properly validates UTF-8 before performing the safe cast to PyUtf8Str.

1931-2087: Well-structured PyUtf8Str implementation

The repr(transparent) wrapper design ensures zero overhead while maintaining type safety. The custom downcastable_from implementation correctly validates UTF-8 before allowing downcasts.
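The transparent-wrapper design can be sketched as below. `LooseStr` and `StrictStr` are hypothetical stand-ins for the general and the validated string types, and the sketch assumes the inner bytes are never mutated after construction.

```rust
// Hypothetical general string type: bytes that are usually, but not always,
// valid UTF-8.
pub struct LooseStr {
    bytes: Vec<u8>,
}

// Validated view. `repr(transparent)` guarantees the same layout as
// `LooseStr`, which is what makes the reference cast below sound.
#[repr(transparent)]
pub struct StrictStr(LooseStr);

impl LooseStr {
    pub fn new(bytes: Vec<u8>) -> Self {
        LooseStr { bytes }
    }

    // Validate once, then reinterpret the reference at no runtime cost.
    pub fn try_as_utf8(&self) -> Option<&StrictStr> {
        std::str::from_utf8(&self.bytes).ok()?;
        // SAFETY: `StrictStr` is a transparent wrapper around `LooseStr`,
        // and the contents were validated on the line above.
        Some(unsafe { &*(self as *const LooseStr as *const StrictStr) })
    }
}

impl StrictStr {
    pub fn as_str(&self) -> &str {
        // SAFETY: a `StrictStr` reference is only handed out after
        // validation, and `LooseStr` offers no way to mutate its bytes.
        unsafe { std::str::from_utf8_unchecked(&self.0.bytes) }
    }
}
```

Because the wrapper's field is private, code outside the module can obtain a `StrictStr` only through `try_as_utf8`, which keeps the unchecked access in `as_str` justified.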

stdlib/src/sqlite.rs (13)

62-62: LGTM: Correct import addition for UTF-8 validated string types.

The addition of PyUtf8Str and PyUtf8StrRef imports is necessary and appropriate for the systematic refactoring to use UTF-8 validated string types in SQL contexts.

855-855: LGTM: Correct parameter type change for SQL statements.

Changing from PyStrRef to PyUtf8StrRef is appropriate for SQL statement parameters, ensuring UTF-8 validation at the type level.

990-990: LGTM: Appropriate type change for SQL parameter.

Using PyUtf8StrRef for the SQL parameter ensures UTF-8 validation, which is essential for SQL statement execution.

1002-1002: LGTM: Consistent type change for SQL parameter.

The change to PyUtf8StrRef for the SQL parameter maintains consistency with other execute methods and ensures UTF-8 validation.

1014-1014: LGTM: Appropriate type change for SQL script parameter.

Using PyUtf8StrRef for the script parameter ensures UTF-8 validation for SQL scripts, which is necessary for proper execution.

1163-1163: LGTM: Correct type change for collation name.

Using PyUtf8StrRef for the collation name parameter ensures UTF-8 validation, which is appropriate for database collation names.

1494-1494: LGTM: Consistent type change for SQL parameter.

The change to PyUtf8StrRef maintains consistency with Connection execute methods and ensures UTF-8 validation for SQL statements.

1566-1566: LGTM: Appropriate type change for SQL parameter.

Using PyUtf8StrRef for the SQL parameter ensures UTF-8 validation and maintains consistency across execute methods.

1640-1640: LGTM: Consistent type change for SQL script parameter.

The change to PyUtf8StrRef maintains consistency with Connection::executescript and ensures UTF-8 validation for SQL scripts.

2376-2376: LGTM: Appropriate type change for Statement constructor.

Using PyUtf8StrRef for the SQL parameter in Statement::new ensures UTF-8 validation at the statement creation level, which is foundational for proper SQL handling.

2731-2731: LGTM: Correct UTF-8 conversion for string binding.

The explicit conversion to UTF-8 using try_as_utf8(vm)? is appropriate when binding PyStr parameters that haven't been upgraded to PyUtf8StrRef, ensuring UTF-8 validation is maintained.

2991-2991: LGTM: Consistent UTF-8 conversion for result handling.

The explicit UTF-8 conversion using try_as_utf8(vm)? maintains consistency with parameter binding and ensures UTF-8 validation for string results.

3072-3078: LGTM: Excellent improvement to eliminate redundant UTF-8 validation.

The change from PyStrRef to &PyUtf8Str is a significant improvement that:

Encodes the UTF-8 guarantee in the parameter type, so validation happens before the call
Eliminates the need for runtime UTF-8 checks within the function
Makes the code more robust and efficient

This represents the core benefit of the refactoring effort.