bpo-34454: Fix issue with non-UTF8 separator strings by pganssle · Pull Request #8862 · python/cpython
It is possible to pass a non-UTF-8 string (for example, one containing a surrogate code point) as the separator in datetime.isoformat, but the current fromisoformat implementation starts by encoding the input to UTF-8, which fails even for some valid strings. In the special case of non-UTF-8 separators, we take a performance hit by encoding the string as ASCII and replacing any invalid characters with ?.
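A minimal sketch of the situation described above, assuming an interpreter that already includes this fix. The sample date and the choice of `"\ud800"` as separator are illustrative assumptions; the ASCII step only mirrors the "replace invalid characters with ?" idea and is not the C code itself.

```python
from datetime import datetime

dt = datetime(2018, 8, 22, 12, 30, 45)   # illustrative value
sep = "\ud800"  # lone surrogate: a valid str character, but not encodable as UTF-8

# isoformat() accepts any single character as the separator.
formatted = dt.isoformat(sep=sep)

# Encoding the result to UTF-8 fails because of the surrogate.
try:
    formatted.encode("utf-8")
    encodable = True
except UnicodeEncodeError:
    encodable = False

# Encoding as ASCII with replacement turns the surrogate into "?".
ascii_view = formatted.encode("ascii", errors="replace")

# With the fix in place, the string is expected to parse back to the original.
round_tripped = datetime.fromisoformat(formatted)
```

Here `encodable` ends up `False`, while `ascii_view` keeps every date and time field intact because only the separator was replaced.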
Previously this would dereference a NULL pointer if the PyUnicode_AsUTF8AndSize call failed; now the same error is raised as for any other parsing failure.
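From the Python side, the intended behaviour can be sketched as follows, assuming an interpreter that includes this fix. The helper `try_parse` and the sample inputs are hypothetical and exist only for illustration.

```python
from datetime import datetime

def try_parse(text):
    """Hypothetical helper: return the parsed datetime or the raised exception."""
    try:
        return datetime.fromisoformat(text)
    except ValueError as exc:  # note: UnicodeEncodeError is a ValueError subclass
        return exc

bad_inputs = [
    "2018-08-22T12:30:\ud800",  # surrogate outside the separator position
    "not a date",               # ordinary malformed input
]
results = [try_parse(text) for text in bad_inputs]

# Both failures are expected to surface as a plain ValueError,
# not as a crash and not as a UnicodeEncodeError.
same_error = all(type(r) is ValueError for r in results)
```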
This increases performance for valid non-UTF-8 strings by avoiding an error condition, and minimizes the impact on the rest of the algorithm.
Rather than splitting the string at position 10 and re-joining it with PyUnicode_Format, this copies the original unicode object and overwrites the separator character.
Co-Authored-By: Alexey Izbyshev <izbyshev@ispras.ru>
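The copy-and-overwrite idea can be sketched in pure Python. This is an illustration under stated assumptions, not the C code from the PR: the function name `sanitize_isoformat_str` is hypothetical, the replacement character `'T'` is taken from the backported commit message, and because Python strings are immutable the "copy" here is a list of characters.

```python
def sanitize_isoformat_str(dtstr: str) -> str:
    """Hypothetical sketch: replace a surrogate separator at index 10 with 'T'."""
    # A date-only string has no separator to fix.
    if len(dtstr) <= 10:
        return dtstr

    # Only surrogate code points (U+D800..U+DFFF) cannot be encoded as UTF-8.
    if not (0xD800 <= ord(dtstr[10]) <= 0xDFFF):
        return dtstr  # common case: no copy needed

    chars = list(dtstr)  # copy the original string
    chars[10] = "T"      # overwrite only the separator
    return "".join(chars)
```

For example, `sanitize_isoformat_str("2018-08-22\ud80012:30:45")` returns `"2018-08-22T12:30:45"`, while a string with an ordinary separator is returned unchanged, so the extra copy is only paid in the surrogate case.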
miss-islington pushed a commit to miss-islington/cpython that referenced this pull request
…gate code points (pythonGH-8862)

The current C implementations **crash** if the input includes a surrogate Unicode code point, which is not possible to encode in UTF-8.

Important notes:

1. It is possible to pass a non-UTF-8 string as a separator to the `.isoformat()` methods.
2. The pure-Python `datetime.fromisoformat()` implementation accepts strings with a surrogate as the separator.

In `datetime.fromisoformat()`, in the special case of non-UTF-8 separators, this implementation will take a performance hit by making a copy of the input string and replacing the separator with 'T'.

Co-authored-by: Alexey Izbyshev <izbyshev@ispras.ru>
Co-authored-by: Paul Ganssle <paul@ganssle.io>
(cherry picked from commit 096329f)
Co-authored-by: Paul Ganssle <pganssle@users.noreply.github.com>