gh-145234: Fix SystemError in parser when \r is introduced after code… by gourijain029-del · Pull Request #145276 · python/cpython

gourijain029-del

Description
This PR fixes a SystemError: Parser/string_parser.c:286: bad argument to internal function that occurred when a Python file used an encoding (like UTF-7) that introduced \r characters after decoding.
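
As a minimal sketch of how decoding can introduce a \r that is absent from the raw bytes, consider the snippet below. The byte string is a hypothetical input built for illustration, not the reproducer from the issue; it relies on `+AA0-` being the UTF-7 encoding of U+000D.

```python
# Hypothetical source bytes (not taken from the issue). There is no raw
# carriage return here, only the UTF-7 shift sequence "+AA0-".
source_bytes = b"# coding: utf-7\n# comment+AA0-'string literal'\n"
assert b"\r" not in source_bytes

# Decoding with the declared codec produces the text the parser works on.
decoded = source_bytes.decode("utf-7")

# A lone "\r" now sits between the comment and the string literal.
assert "\r" in decoded
print(repr(decoded))
```

The carriage return only exists after decoding, which is why a check on the raw bytes cannot catch it.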

Root Cause
The crash was caused by a synchronization failure between the tokenizer, the lexer, and the string parser:

Tokenizer: When the file tokenizer recoded a line (e.g., from UTF-7 to UTF-8), it was not normalizing newlines. If the codec introduced a \r, it remained in the buffer.
Lexer: The lexer skipped \r characters but did not correctly trigger "beginning-of-line" (atbol) logic. This meant that if a \r followed a comment (#...), the lexer would remain in a state where it thought it was still on the same line, causing it to merge the comment and the subsequent string literal into a single, invalid token.
String Parser: When _PyPegen_parse_string received this broken token (which didn't start with a quote character), it raised a SystemError.
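
The effect of the line-boundary mismatch can be illustrated with plain Python strings. This is only a model of where lines end, not the C lexer; the sample text is hypothetical.

```python
# Hypothetical decoded text: a comment, a lone "\r", then a string literal.
decoded = "# comment\r'string literal'\n"

# If only "\n" ends a line, the comment and the string literal share one
# line, so the comment appears to run into the literal.
lf_only_first_line = decoded.split("\n")[0]

# If a lone "\r" also ends a line (universal newlines), they are separate.
universal_lines = decoded.splitlines()

print(repr(lf_only_first_line))
print(universal_lines)
```

In the first view the string literal is part of the comment's line; in the second it begins a line of its own, which is the behavior the description says the fix restores.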
Changes

Parser/lexer/lexer.c: Updated the lexer to treat a standalone \r as a full newline. It now correctly sets atbol = 1 and resets the current token start, preventing the "merging" of tokens across lines.

Parser/tokenizer/file_tokenizer.c:
Updated tok_readline_recode to explicitly call _PyTokenizer_translate_newlines on the UTF-8 decoded buffer.
Optimized tok_underflow_file to immediately discard and re-decode the buffer as soon as a coding spec is identified, preventing raw bytes from leaking into the parser.
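
The newline normalization applied to the recoded buffer can be sketched in Python as follows. The function name `translate_newlines` is a hypothetical stand-in used for illustration; it mirrors the idea of mapping both \r\n and a lone \r to \n and is not the C helper itself.

```python
def translate_newlines(text: str) -> str:
    # Illustrative only: collapse "\r\n" first so it becomes one "\n",
    # then map any remaining lone "\r" to "\n".
    return text.replace("\r\n", "\n").replace("\r", "\n")


# Hypothetical recoded line containing a codec-introduced "\r".
recoded = "# comment\r'string literal'\n"
normalized = translate_newlines(recoded)
print(repr(normalized))
```

Replacing "\r\n" before the lone "\r" matters: doing it in the other order would turn each "\r\n" into two newlines.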

Lib/test/test_parser_utf7_r.py: Added a new regression test that uses a UTF-7 encoded \r to reproduce the original crash.

Issue: \rs introduced after codec decoding cause SystemError #145234