bpo-37966: Fully implement the UAX #15 quick-check algorithm. #15558

gnprice

The purpose of the unicodedata.is_normalized function is to answer
the question str == unicodedata.normalize(form, str) more
efficiently than writing just that, by using the "quick check"
optimization described in the Unicode standard in UAX #15.

However, it turns out the code doesn't implement the full algorithm
from the standard, and as a result we often miss the optimization and
end up having to compute the whole normalized string after all.

Implement the standard's algorithm. This greatly speeds up
unicodedata.is_normalized in many cases where our partial variant
of quick-check had been returning MAYBE and the standard algorithm
returns NO.
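
For readers who want to see the shape of the algorithm, here is a minimal Python sketch of the quick-check loop as UAX #15 describes it. It is not the C code from this patch. The helper `nfd_quick_check_property` is a hypothetical stand-in: `unicodedata` does not expose the NFD_Quick_Check property, so the sketch assumes it can be derived by normalizing one character at a time, which is only reasonable for illustration.

```python
import unicodedata

YES, MAYBE, NO = "YES", "MAYBE", "NO"


def nfd_quick_check_property(ch):
    """Hypothetical stand-in for the NFD_Quick_Check property.

    Assumption of this sketch: a single character is allowed in NFD
    exactly when NFD normalization leaves it unchanged.
    """
    return YES if unicodedata.normalize("NFD", ch) == ch else NO


def quick_check(s, qc_property=nfd_quick_check_property):
    """Return YES, NO, or MAYBE for whether s is normalized."""
    last_ccc = 0
    result = YES
    for ch in s:
        ccc = unicodedata.combining(ch)
        # A nonzero combining class lower than the previous one means the
        # combining marks are out of canonical order.
        if ccc != 0 and last_ccc > ccc:
            return NO
        check = qc_property(ch)
        if check == NO:
            # Definite answer: no need to normalize and compare.
            return NO
        if check == MAYBE:
            result = MAYBE
        last_ccc = ccc
    return result
```

Under these assumptions, `quick_check("\uf900" * 500000)` stops at the first character with a definite NO, which is the early exit the description above relies on.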

In a quick test on my desktop, the existing code takes about 4.4 ms/MB
(so 4.4 ns per byte) when the partial quick-check returns MAYBE and it
has to do the slow normalize-and-compare:

$ build.base/python -m timeit -s 'import unicodedata; s = "\uf900"*500000' \
    -- 'unicodedata.is_normalized("NFD", s)'
50 loops, best of 5: 4.39 msec per loop

With this patch, it gets the answer instantly (58 ns) on the same 1 MB
string:

$ build.dev/python -m timeit -s 'import unicodedata; s = "\uf900"*500000' \
    -- 'unicodedata.is_normalized("NFD", s)'
5000000 loops, best of 5: 58.2 nsec per loop
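
The same comparison can be roughly reproduced from within Python. The sketch below assumes Python 3.8 or later (where `unicodedata.is_normalized` exists) and uses a hypothetical helper, `is_normalized_slow`, for the normalize-and-compare approach. The string is shorter than in the runs above to keep it quick, and the timings will vary by machine and build.

```python
import timeit
import unicodedata


def is_normalized_slow(form, s):
    """Hypothetical baseline: normalize the whole string, then compare."""
    return s == unicodedata.normalize(form, s)


# U+F900 has a canonical decomposition, so this string is not in NFD.
s = "\uf900" * 50_000

fast = timeit.timeit(lambda: unicodedata.is_normalized("NFD", s), number=20)
slow = timeit.timeit(lambda: is_normalized_slow("NFD", s), number=20)

print(f"is_normalized:         {fast / 20 * 1e6:.1f} usec per call")
print(f"normalize-and-compare: {slow / 20 * 1e6:.1f} usec per call")
```

Both approaches should return the same answer for any input; only the cost differs.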

https://bugs.python.org/issue37966

This link doesn't work. Going back through the UAX's history to the version that was current when this code was added (commit 7a0fedf, 2009-04), we find that the anchor still works there: https://www.unicode.org/reports/tr15/tr15-29.html#Annex8 It points to the section heading "14. Detecting Normalization Forms". Happily, the anchor that the corresponding section heading now offers looks much more reasonable: it's the title of the section, and so is likely to be stable long-term. ("Annex 8" seems like some kind of editing error.) Switch to that.

This restores a small optimization that the original version of this code had for the `unicodedata.normalize` use case. With this, that case is actually faster than in master!

$ build.base/python -m timeit -s 'import unicodedata; s = "\u0338"*500000' \
    -- 'unicodedata.normalize("NFD", s)'
500 loops, best of 5: 561 usec per loop

$ build.dev/python -m timeit -s 'import unicodedata; s = "\u0338"*500000' \
    -- 'unicodedata.normalize("NFD", s)'
500 loops, best of 5: 512 usec per loop

benjaminp

Thanks. Looks good. Here are a few nitty comments.

miss-islington

Thanks @gnprice for the PR, and @benjaminp for merging it 🌮🎉.. I'm working now to backport this PR to: 3.8.
🐍🍒⛏🤖

bedevere-bot

GH-15671 is a backport of this pull request to the 3.8 branch.

gnprice

Thanks for all the reviews and the merge!

@benjaminp, I'll go on and send next some of those followups you suggested.

gnprice

> @benjaminp, I'll go on and send next some of those followups you suggested.

Filed bpo-38043 for that, and sent PRs for two of them.

gnprice added 3 commits August 27, 2019 19:39

the-knights-who-say-ni added the CLA signed label Aug 28, 2019

bedevere-bot added the awaiting review label Aug 28, 2019

benjaminp reviewed Aug 29, 2019

gnprice added 2 commits August 28, 2019 20:43

Move UAX #15 link to doc-comment.

27e8122

Use bool for a boolean.

3762787

vstinner requested a review from serhiy-storchaka August 29, 2019 13:19

benjaminp added the needs backport to 3.8 label Sep 4, 2019

benjaminp approved these changes Sep 4, 2019

bedevere-bot added awaiting merge and removed awaiting review labels Sep 4, 2019

benjaminp merged commit 2f09413 into python:master Sep 4, 2019

bedevere-bot removed the awaiting merge label Sep 4, 2019

bedevere-bot removed the needs backport to 3.8 label Sep 4, 2019

gnprice deleted the pr-quickcheck branch September 6, 2019 03:23

gnprice mentioned this pull request Sep 9, 2019

bpo-38043: Move unicodedata.normalize tests into test_unicodedata. #15712

Merged

gnprice mentioned this pull request Apr 10, 2022

small cleanups in Unicode normalization code #82224

Closed

benjaminp merged 5 commits into python:master from gnprice:pr-quickcheck

gnprice commented Aug 28, 2019 (edited by bedevere-bot)

benjaminp left a comment

miss-islington commented Sep 4, 2019

bedevere-bot commented Sep 4, 2019

gnprice commented Sep 6, 2019

gnprice commented Sep 6, 2019

6 participants