bpo-34454: Fix issue with non-UTF8 separator strings #8862

pganssle

It is possible to pass a non-UTF-8 string as a separator in datetime.isoformat, but the current fromisoformat implementation starts by encoding the input as UTF-8, which will fail even for some valid strings.

In the special case of non-UTF-8 separators, we replace the separator character with T before encoding as UTF-8, so that encoding errors only occur on invalid ISO 8601 strings, and are handled as a standard ValueError (as would occur in the pure Python version).
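A minimal Python sketch of the situation described above, assuming a CPython build that already includes this fix. The separator value "\udc80" is just one illustrative lone surrogate; any code point in the surrogate range behaves the same way with respect to UTF-8 encoding.

```python
from datetime import datetime

dt = datetime(2018, 8, 22, 12, 30, 45)
sep = "\udc80"  # a lone surrogate: a valid str element, but not UTF-8 encodable

# isoformat() accepts any single character as the separator.
text = dt.isoformat(sep=sep)

# Encoding the whole string as UTF-8 fails because of that one character.
try:
    text.encode("utf-8")
    encodable = True
except UnicodeEncodeError:
    encodable = False

# With the fix in place, parsing the string back is expected to succeed.
roundtrip = datetime.fromisoformat(text)
print(encodable, roundtrip == dt)
```

On an interpreter without the fix, the last call is the one that would hit the crash this PR addresses.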

bpo-34454: Implementation of the fix without significant performance problems.

https://bugs.python.org/issue34454

pganssle

@taleinat If you want to use this, feel free to rebase against my branch. This PR exists mainly because, while figuring out how to make your PR fast, I actually rewrote it, so it seemed easier to just push my changes.

pganssle

I've merged in the tests and NEWS from #8859, but I now think this PR should be merged instead of that one.

Comparing performance (using the script from this comment) of this PR (updated after sanitize_isoformat_str refactor):

datetime constructor:                1192.5ns
fromisoformat:                       561.3ns
fromisoformat (special characters):  599.7ns
fromisoformat (non-utf8):            1289.5ns
fromisoformat (fail, non-utf8):      3501.5ns
fromisoformat (fail, utf8):          1738.7ns

Compared with #8859:

datetime constructor:                1165.1ns
fromisoformat:                       520.7ns
fromisoformat (special characters):  1153.1ns
fromisoformat (non-utf8):            1165.8ns
fromisoformat (fail, non-utf8):      2815.5ns
fromisoformat (fail, utf8):          1648.3ns
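The timing script itself is not reproduced in this thread. As a rough sketch of how per-call numbers like these could be collected, assuming the standard timeit module is adequate: the input strings and the helper names swallow_errors and per_call_ns are hypothetical, and absolute results will differ by machine and Python version.

```python
import timeit
from datetime import datetime


def swallow_errors(text):
    """Wrap a parse that is expected to fail so the timing covers the error path."""
    def run():
        try:
            datetime.fromisoformat(text)
        except ValueError:
            pass
    return run


# Input strings are illustrative guesses, not the ones behind the numbers above.
CASES = {
    "datetime constructor": lambda: datetime(2014, 12, 14, 9, 30, 45),
    "fromisoformat": lambda: datetime.fromisoformat("2014-12-14T09:30:45"),
    "fromisoformat (special characters)": lambda: datetime.fromisoformat("2014-12-14\u20ac09:30:45"),
    "fromisoformat (non-utf8)": lambda: datetime.fromisoformat("2014-12-14\udc8009:30:45"),
    "fromisoformat (fail, non-utf8)": swallow_errors("2014-12-14T09:30:4\udc80"),
    "fromisoformat (fail, utf8)": swallow_errors("2014-12-14T09:30:4x"),
}


def per_call_ns(func, number=2000, repeat=3):
    """Best-of-`repeat` mean time per call, in nanoseconds."""
    best = min(timeit.repeat(func, number=number, repeat=repeat))
    return best / number * 1e9


results = {label: per_call_ns(func) for label, func in CASES.items()}
for label, ns in results.items():
    print(f"{label + ':':<37}{ns:8.1f}ns")
```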

It's much faster in at least one common(ish) case (UTF-8 encodable special characters as the separator) and essentially the same performance in all other cases. IMO, this one also is more readable, since it's essentially equivalent to:

def new_isoformat(dtstr):
    if len(dtstr) > 10 and is_surrogate(dtstr[10]):
        dtstr = "%sT%s" % (dtstr[0:10], dtstr[11:])
    return old_isoformat_minus_segfaults(dtstr)

It does not require the more complicated fast-path/slow-path branching in #8859 and proliferation of intermediate PyObjects (and associated refcounts) is kept to an absolute minimum.
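For readers who want to run the idea, here is a self-contained Python rendering of the pseudo-code above. It is only a sketch: the real change lives in the C implementation, and the Python bodies of is_surrogate and sanitize_isoformat_str here are assumptions made for illustration.

```python
from datetime import datetime


def is_surrogate(ch: str) -> bool:
    """True if ch is a single code point in the surrogate range U+D800..U+DFFF."""
    return 0xD800 <= ord(ch) <= 0xDFFF


def sanitize_isoformat_str(dtstr: str) -> str:
    """Replace a surrogate separator at index 10 with 'T'; leave other strings alone."""
    if len(dtstr) > 10 and is_surrogate(dtstr[10]):
        return "%sT%s" % (dtstr[0:10], dtstr[11:])
    return dtstr


weird = datetime(2018, 8, 22, 12, 30, 45).isoformat(sep="\udc80")
clean = sanitize_isoformat_str(weird)

# After sanitization the string encodes as UTF-8 and parses as usual.
encoded = clean.encode("utf-8")
parsed = datetime.fromisoformat(clean)
print(clean, parsed)
```

Only index 10 is inspected, so a surrogate anywhere else is left in place and still surfaces later as an ordinary parsing error.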

pganssle

CC @abalkin @serhiy-storchaka

taleinat

Looks good, just a few small details to amend.

bedevere-bot

A Python core developer has requested some changes be made to your pull request before we can consider merging it. If you could please address their requests along with any other requests in other reviews from core developers that would be appreciated.

Once you have made the requested changes, please leave a comment on this pull request containing the phrase I have made the requested changes; please review again. I will then notify any core developers who have left a review that you're ready for them to take another look at this pull request.

pganssle

I believe all the requested changes have been made. Thanks for the review @taleinat

taleinat

@pganssle, for your consideration: If you also do the UTF-8 encoding in _sanitize_isoformat_str (renamed appropriately) and return a const char * and length, you can avoid the conditional Py_DECREF() in the fast-path at the end of datetime_fromisoformat().

pganssle

@taleinat I'm not entirely sure, but I think that wouldn't work; when the RC of the temporary dtstr reaches 0 it would be deleted, and I think that the temporary dtstr is managing the memory for dt_ptr. If that's not how it works, I'm not sure what would be managing that memory, since I'm not allocating any memory for it, or freeing it later.

taleinat

@pganssle, you're right, I hadn't considered that. Better to leave it as it is then.

Just remove the PyUnicode_GetLength() call and it should be ready to go in.

Also, you're welcome to add yourself to the NEWS section, i.e. "Patch by Paul Ganssle".

It is possible to pass a non-UTF-8 string as a separator in datetime.isoformat, but the current fromisoformat implementation starts by encoding the input as UTF-8, which will fail even for some valid strings. In the special case of non-UTF-8 separators, we take a performance hit by encoding the string as ASCII and replacing any invalid characters with ?.

Previously this would end up dereferencing a NULL pointer if the PyUnicode_AsUTF8AndSize call failed; this change makes it raise the same error as any other parsing failure.

This increases performance for valid non-UTF-8 strings by avoiding an error condition, and minimizes the impact on the rest of the algorithm.

Co-authored-by: Alexey Izbyshev <izbyshev@ispras.ru> Co-authored-by: Paul Ganssle <paul@ganssle.io>

Rather than splitting the string at position 10 and re-joining it with PyUnicode_Format, this copies the original unicode object and overwrites the separator character. Co-Authored-By: Alexey Izbyshev <izbyshev@ispras.ru>
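The error behavior these commit messages describe can be checked from Python. The sketch below assumes an interpreter that includes the fix; the malformed input strings are made up for illustration and are not taken from the PR's test suite.

```python
from datetime import date, datetime, time

# Each input carries a surrogate somewhere other than a date/time separator,
# so it is not a valid ISO 8601 string and should be rejected cleanly.
bad_inputs = [
    (datetime.fromisoformat, "2018-08-22T12:30:4\udc80"),
    (date.fromisoformat, "2018-08-2\udc80"),
    (time.fromisoformat, "12:30:4\udc80"),
]

outcomes = []
for parse, text in bad_inputs:
    try:
        parse(text)
    except ValueError as exc:
        # The expected path: an ordinary exception rather than a crash.
        outcomes.append(type(exc).__name__)
    else:
        outcomes.append("parsed")

print(outcomes)
```

UnicodeEncodeError is a subclass of ValueError, so the except clause covers either form of the error.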

pganssle

@taleinat Fixed the duplicate PyUnicode_GetLength and the missing NULL check. I don't really need to be mentioned in the NEWS, plus I think it would be complicated to properly assign credit for this patch, as it was a collaborative effort between me, you and @izbyshev.

taleinat

@pganssle, looks good!

I'm a core dev and will merge this so my name will be on it anyways.

miss-islington

Thanks @pganssle for the PR, and @taleinat for merging it 🌮🎉.. I'm working now to backport this PR to: 3.7.
🐍🍒⛏🤖

…gate code points (pythonGH-8862)

The current C implementations **crash** if the input includes a surrogate Unicode code point, which is not possible to encode in UTF-8.

Important notes:

1. It is possible to pass a non-UTF-8 string as a separator to the `.isoformat()` methods.
2. The pure-Python `datetime.fromisoformat()` implementation accepts strings with a surrogate as the separator.

In `datetime.fromisoformat()`, in the special case of non-UTF-8 separators, this implementation will take a performance hit by making a copy of the input string and replacing the separator with 'T'.

Co-authored-by: Alexey Izbyshev <izbyshev@ispras.ru>
Co-authored-by: Paul Ganssle <paul@ganssle.io>
(cherry picked from commit 096329f)

Co-authored-by: Paul Ganssle <pganssle@users.noreply.github.com>

bedevere-bot

GH-8877 is a backport of this pull request to the 3.7 branch.

pganssle

@serhiy-storchaka I'll make a second PR with the cleanup.

the-knights-who-say-ni added the CLA signed label Aug 22, 2018

bedevere-bot added the awaiting review label Aug 22, 2018

pganssle mentioned this pull request Aug 22, 2018

bpo-34454: fix crash in .fromisoformat() methods when given inputs with surrogate code points #8859

Closed

pganssle force-pushed the fromisoformat_fix_nonutf8_crash branch 3 times, most recently from 0baa78c to 71eeb20 Compare August 22, 2018 20:45

pganssle force-pushed the fromisoformat_fix_nonutf8_crash branch from 71eeb20 to b5eeba0 Compare August 22, 2018 21:04

taleinat requested changes Aug 22, 2018

bedevere-bot added awaiting changes and removed awaiting review labels Aug 22, 2018

taleinat reviewed Aug 22, 2018

pganssle force-pushed the fromisoformat_fix_nonutf8_crash branch 2 times, most recently from 2162a43 to f73b230 Compare August 22, 2018 21:31

izbyshev reviewed Aug 22, 2018

taleinat reviewed Aug 23, 2018

izbyshev reviewed Aug 23, 2018

pganssle force-pushed the fromisoformat_fix_nonutf8_crash branch from e45c28a to b89d4f5 Compare August 23, 2018 13:11

pganssle and others added 5 commits August 23, 2018 09:11

Fix non-UTF8 crash for (date|time)_fromisoformat …

85a99ca

Refactor non-UTF-8 sanitization …

dd82aa0

Add tests for surrogate code points …

c24388e

Add news entry for bpo-34454

a0246a0

Refactor sanitize_isoformat_str …

b89d4f5

taleinat approved these changes Aug 23, 2018

bedevere-bot added awaiting merge and removed awaiting changes labels Aug 23, 2018

added mention of patch author in NEWS

160e779

taleinat added the needs backport to 3.7 and type-bug labels Aug 23, 2018

taleinat merged commit 096329f into python:master Aug 23, 2018

bedevere-bot removed the awaiting merge label Aug 23, 2018

bedevere-bot removed the needs backport to 3.7 label Aug 23, 2018

taleinat mentioned this pull request Aug 23, 2018

bpo-34454: datetime: Fix crash on PyUnicode_AsUTF8AndSize() failure #8850

Closed

serhiy-storchaka reviewed Aug 24, 2018

pganssle mentioned this pull request Aug 27, 2018

bpo-34454: Clean up datetime.fromisoformat surrogate handling #8959

Merged

taleinat mentioned this pull request Oct 23, 2018

bpo-34482: Add tests for proper handling of non-UTF-8-encodable strin… #8878

Merged
