bpo-34454: Fix issue with non-UTF8 separator strings by pganssle · Pull Request #8862 · python/cpython


It is possible to pass a string that cannot be encoded as UTF-8 (such as
a lone surrogate) as a separator in datetime.isoformat, but the current
fromisoformat implementation starts by encoding its input to UTF-8,
which will fail even for some valid strings.
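
As a rough illustration of what "non-UTF-8" means here, the sketch below uses a lone surrogate code point as the separator. The helper `is_utf8_encodable` and the sample date are assumptions made up for this example and are not part of the patch.

```python
from datetime import datetime

# A lone surrogate is a legal element of a Python str,
# but it has no UTF-8 encoding.
SURROGATE_SEP = "\ud800"


def is_utf8_encodable(text: str) -> bool:
    """Hypothetical helper: report whether text can be encoded as UTF-8."""
    try:
        text.encode("utf-8")
    except UnicodeEncodeError:
        return False
    return True


dt = datetime(2018, 8, 23, 12, 30)

# isoformat() places the separator between the date and time parts,
# so the surrogate ends up at index 10 of the result.
formatted = dt.isoformat(sep=SURROGATE_SEP)

print(is_utf8_encodable(SURROGATE_SEP))
print(is_utf8_encodable(formatted))
```

Any code path that needs a UTF-8 view of `formatted` therefore has to handle the separator specially or fail.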

In the special case of non-UTF-8 separators, we take a performance hit
by encoding the string as ASCII and replacing any invalid characters
with ?.

Previously this would end up dereferencing a NULL pointer if the
PyUnicode_AsUTF8AndSize call failed. With this change, the same error
is raised as for any other parsing failure.
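
A small sketch of the intended behavior on a fixed interpreter: a surrogate outside the separator position should be reported like any other malformed input. The helper `parse_error` and both sample strings are hypothetical and exist only for this example.

```python
from datetime import datetime


def parse_error(text: str):
    """Hypothetical helper: return the exception type raised by
    datetime.fromisoformat(text), or None if parsing succeeds."""
    try:
        datetime.fromisoformat(text)
    except Exception as exc:
        return type(exc)
    return None


# Surrogate inside the date portion (index 9), not at the separator.
misplaced_surrogate = "2018-08-2\ud800 12:30:00"

# Ordinary malformed input for comparison.
plain_garbage = "not a date"

print(parse_error(misplaced_surrogate))
print(parse_error(plain_garbage))
```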

This increases performance for valid non-UTF-8 strings by avoiding
an error condition, and minimizes the impact on the rest of the
algorithm.

Co-authored-by: Alexey Izbyshev <izbyshev@ispras.ru>
Co-authored-by: Paul Ganssle <paul@ganssle.io>

Rather than splitting the string at position 10 and re-joining it with
PyUnicode_Format, this copies the original unicode object and overwrites
the separator character.
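
The copy-and-overwrite idea can be sketched in Python roughly as follows. This is only an analogy to the C change, not the patch itself: the function name `sanitize_isoformat_str`, the fixed separator index, and the use of a character list (Python strings are immutable) are all assumptions for illustration.

```python
SEPARATOR_INDEX = 10  # assumes the "YYYY-MM-DD<sep>..." layout


def is_surrogate(ch: str) -> bool:
    """Return True if ch is a UTF-16 surrogate code point."""
    return 0xD800 <= ord(ch) <= 0xDFFF


def sanitize_isoformat_str(dtstr: str) -> str:
    """Hypothetical helper: replace a surrogate separator with 'T'.

    Strings without a surrogate at the separator position are
    returned unchanged, so the common case makes no copy.
    """
    if len(dtstr) <= SEPARATOR_INDEX:
        return dtstr
    if not is_surrogate(dtstr[SEPARATOR_INDEX]):
        return dtstr

    chars = list(dtstr)           # copy the original characters
    chars[SEPARATOR_INDEX] = "T"  # overwrite only the separator
    return "".join(chars)


print(sanitize_isoformat_str("2018-08-23\ud80012:30:00"))
```

Only the rare surrogate case pays for the copy, which is the trade-off the commit message describes.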

Co-Authored-By: Alexey Izbyshev <izbyshev@ispras.ru>

miss-islington pushed a commit to miss-islington/cpython that referenced this pull request

Aug 23, 2018

…gate code points (pythonGH-8862)

The current C implementations **crash** if the input includes a surrogate
Unicode code point, which is not possible to encode in UTF-8.

Important notes:

1.  It is possible to pass a non-UTF-8 string as a separator to the
    `.isoformat()` methods.
2.  The pure-Python `datetime.fromisoformat()` implementation accepts
    strings with a surrogate as the separator.

In `datetime.fromisoformat()`, in the special case of non-UTF-8 separators,
this implementation will take a performance hit by making a copy of the
input string and replacing the separator with 'T'.
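
A hedged round-trip check of the behavior described above, assuming an interpreter that includes this fix. The sample timestamp is arbitrary.

```python
from datetime import datetime

original = datetime(2018, 8, 23, 12, 30, 15)

# Format with a surrogate separator, then parse the result back.
text = original.isoformat(sep="\ud800")
parsed = datetime.fromisoformat(text)

# Parsing the same string with 'T' as the separator should yield
# the same value, since the separator carries no information.
parsed_with_t = datetime.fromisoformat(text.replace("\ud800", "T"))

print(parsed == original)
print(parsed_with_t == original)
```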

Co-authored-by: Alexey Izbyshev <izbyshev@ispras.ru>
Co-authored-by: Paul Ganssle <paul@ganssle.io>
(cherry picked from commit 096329f)

Co-authored-by: Paul Ganssle <pganssle@users.noreply.github.com>

miss-islington added a commit that referenced this pull request

Aug 23, 2018

…gate code points (GH-8862)
