bpo-34454: Fix issue with non-UTF8 separator strings#8862
Conversation
|
@taleinat If you want you want to use this feel free to rebase against my branch. This PR is mainly because as part of figuring out how to make your PR fast, I actually re-wrote your PR, so it seemed easier to just push my changes. |
Sorry, something went wrong.
0baa78c to
71eeb20
Compare
August 22, 2018 20:45
|
I've merged in the tests and NEWS from #8859, but I now think this PR should be merged instead of that one. Comparing performance (using the script from this comment) of this PR (updated after sanitize_isoformat_str refactor): Compared with #8859: It's much faster in at least one common(ish) case (utf-8) and essentially the same performance in all other cases. IMO, this one also is more readable, since it's essentially equivalent to: def new_isoformat(dtstr):
if len(dtstr) > 10 and is_surrogate(dtstr[10]):
dtstr = "%sT%s" % (dtstr[0:10], dtstr[11:])
return old_isoformat_minus_segfaults(dtstr)It does not require the more complicated fast-path/slow-path branching in #8859 and proliferation of intermediate PyObjects (and associated refcounts) is kept to an absolute minimum. |
Sorry, something went wrong.
Sorry, something went wrong.
71eeb20 to
b5eeba0
Compare
August 22, 2018 21:04
taleinat
left a comment
There was a problem hiding this comment.
Looks good, just a few small details to amend.
Sorry, something went wrong.
|
A Python core developer has requested some changes be made to your pull request before we can consider merging it. If you could please address their requests along with any other requests in other reviews from core developers that would be appreciated. Once you have made the requested changes, please leave a comment on this pull request containing the phrase |
Sorry, something went wrong.
2162a43 to
f73b230
Compare
August 22, 2018 21:31
|
I believe all the changes are fixed. Thanks for the review @taleinat |
Sorry, something went wrong.
|
@pganssle, for your consideration: If you also do the UTF-8 encoding in |
Sorry, something went wrong.
|
@taleinat I'm not entirely sure, but I think that wouldn't work; when the RC of the temporary |
Sorry, something went wrong.
|
@pganssle, you're right, I hadn't considered that. Better to leave it as it is then. Just remove the Also, you're welcome to add yourself to the NEWS section, i.e. "Patch by Paul Ganssle". |
Sorry, something went wrong.
e45c28a to
b89d4f5
Compare
August 23, 2018 13:11
It is possible to pass a non-UTF-8 string as a separator in datetime.isoformat, but the current implementation starts by decoding to UTF-8, which will fail even for some valid strings. In the special case of non-UTF-8 separators, we take a performance hit by encoding the string as ASCII and replacing any invalid characters with ?.
Previously this would end up dereferencing a NULL pointer if the PyUnicode_AsUTF8AndSize call failed, this makes it so that the same error as any other parsing error is raised.
This increases performance for valid non-UTF-8 strings by avoiding an error condition, and minimizes the impact on the rest of the algorithm.
Co-authored-by: Alexey Izbyshev <izbyshev@ispras.ru> Co-authored-by: Paul Ganssle <paul@ganssle.io>
Rather than splitting the string at position 10 and re-joining it with PyUnicode_Format, this copies the original unicode object and overwrites the separator character. Co-Authored-By: Alexey Izbyshev <izbyshev@ispras.ru>
|
@taleinat Fixed the duplicate |
Sorry, something went wrong.
|
@pganssle, looks good! I'm a core dev and will merge this so my name will be on it anyways. |
Sorry, something went wrong.
…gate code points (pythonGH-8862) The current C implementations **crash** if the input includes a surrogate Unicode code point, which is not possible to encode in UTF-8. Important notes: 1. It is possible to pass a non-UTF-8 string as a separator to the `.isoformat()` methods. 2. The pure-Python `datetime.fromisoformat()` implementation accepts strings with a surrogate as the separator. In `datetime.fromisoformat()`, in the special case of non-UTF-8 separators, this implementation will take a performance hit by making a copy of the input string and replacing the separator with 'T'. Co-authored-by: Alexey Izbyshev <izbyshev@ispras.ru> Co-authored-by: Paul Ganssle <paul@ganssle.io> (cherry picked from commit 096329f) Co-authored-by: Paul Ganssle <pganssle@users.noreply.github.com>
…gate code points (GH-8862) The current C implementations **crash** if the input includes a surrogate Unicode code point, which is not possible to encode in UTF-8. Important notes: 1. It is possible to pass a non-UTF-8 string as a separator to the `.isoformat()` methods. 2. The pure-Python `datetime.fromisoformat()` implementation accepts strings with a surrogate as the separator. In `datetime.fromisoformat()`, in the special case of non-UTF-8 separators, this implementation will take a performance hit by making a copy of the input string and replacing the separator with 'T'. Co-authored-by: Alexey Izbyshev <izbyshev@ispras.ru> Co-authored-by: Paul Ganssle <paul@ganssle.io> (cherry picked from commit 096329f) Co-authored-by: Paul Ganssle <pganssle@users.noreply.github.com>
|
@serhiy-storchaka I'll make a second PR with the cleanup. |
Sorry, something went wrong.
It is possible to pass a non-UTF-8 string as a separator in
datetime.isoformat, but the current implementation starts by decoding to UTF-8, which will fail even for some valid strings.In the special case of non-UTF-8 separators, we replace the separator character with
Tbefore encoding as UTF-8, so that encoding errors only occur on invalid ISO 8601 strings, and are handled as a standardValueError(as would occur in the pure Python version).bpo-34454: Implementation of the fix without significant performance problems.
https://bugs.python.org/issue34454