gh-149079: Fix O(n^2) canonical ordering in unicodedata.normalize()#149080
gh-149079: Fix O(n^2) canonical ordering in unicodedata.normalize()#149080encukou merged 2 commits into
Conversation
…ze() Replace the insertion sort used for canonical ordering of combining characters with a hybrid approach: insertion sort for short runs (< 20) and counting sort for longer runs, reducing worst-case complexity from O(n^2) to O(n). This prevents denial of service via crafted Unicode strings with many combining characters in alternating CCC order. Co-authored-by: Seokchan Yoon <13852925+ch4n3-yoon@users.noreply.github.com>
|
Reviewers: Note that there are pending changes from previous reviews. |
Sorry, something went wrong.
serhiy-storchaka
left a comment
There was a problem hiding this comment.
There is a potential for optimization, but in general LGTM. 👍
Sorry, something went wrong.
|
Ideas for optimization:
|
Sorry, something went wrong.
tim-one
left a comment
There was a problem hiding this comment.
I'm not sure there's ever an end to suggestions, so I'd prefer to ship this already. Good work, good enough,, and thank you for your care and patience!
Sorry, something went wrong.
|
@sethmlarson a little ping, there are some suggestions above that need applying. |
Sorry, something went wrong.
|
Most of my suggestions can wait to the following PRs. Although some of them can be easy to implement, so you can include them in this PR if you wish. |
Sorry, something went wrong.
encukou
left a comment
There was a problem hiding this comment.
Converting @maurycy's comments to GitHub suggestions:
Sorry, something went wrong.
|
Following Petr's example, I wrote sethmlarson#1 to apply all open suggestions (except #149080 (comment), which I think is fine as-is, and Serhiy's future performance plans). |
Sorry, something went wrong.
Co-authored-by: Bénédikt Tran <10796600+picnixz@users.noreply.github.com> Co-authored-by: Petr Viktorin <encukou@gmail.com> Co-authored-by: Serhiy Storchaka <storchaka@gmail.com> Co-authored-by: Maurycy Pawłowski-Wieroński <maurycy@maurycy.com>
|
Sorry for the delay folks, I've merged the changes from @StanFromIreland 🙏 |
Sorry, something went wrong.
Documentation build overview
118 files changed ·
|
Sorry, something went wrong.
991224b
into
python:main
Jun 2, 2026
|
Thanks @sethmlarson for the PR, and @encukou for merging it 🌮🎉.. I'm working now to backport this PR to: 3.10. |
Sorry, something went wrong.
|
Thanks @sethmlarson for the PR, and @encukou for merging it 🌮🎉.. I'm working now to backport this PR to: 3.11. |
Sorry, something went wrong.
|
Thanks @sethmlarson for the PR, and @encukou for merging it 🌮🎉.. I'm working now to backport this PR to: 3.12. |
Sorry, something went wrong.
|
Thanks @sethmlarson for the PR, and @encukou for merging it 🌮🎉.. I'm working now to backport this PR to: 3.13. |
Sorry, something went wrong.
|
Thanks @sethmlarson for the PR, and @encukou for merging it 🌮🎉.. I'm working now to backport this PR to: 3.14. |
Sorry, something went wrong.
|
Thanks @sethmlarson for the PR, and @encukou for merging it 🌮🎉.. I'm working now to backport this PR to: 3.15. |
Sorry, something went wrong.
|
Sorry, @sethmlarson and @encukou, I could not cleanly backport this to |
Sorry, something went wrong.
|
Sorry, @sethmlarson and @encukou, I could not cleanly backport this to |
Sorry, something went wrong.
|
Sorry, @sethmlarson and @encukou, I could not cleanly backport this to |
Sorry, something went wrong.
|
Sorry, @sethmlarson and @encukou, I could not cleanly backport this to |
Sorry, something went wrong.
…ize() (GH-149080) (#150776) Replace the insertion sort used for canonical ordering of combining characters with a hybrid approach: insertion sort for short runs (< 20) and counting sort for longer runs, reducing worst-case complexity from O(n^2) to O(n). This prevents denial of service via crafted Unicode strings with many combining characters in alternating CCC order. (cherry picked from commit 991224b) Co-authored-by: Seth Larson <seth@python.org> Co-authored-by: ch4n3-yoon <ch4n3.yoon@gmail.com> Co-authored-by: Seokchan Yoon <13852925+ch4n3-yoon@users.noreply.github.com> Co-authored-by: Stan Ulbrych <stan@python.org> Co-authored-by: Bénédikt Tran <10796600+picnixz@users.noreply.github.com> Co-authored-by: Petr Viktorin <encukou@gmail.com> Co-authored-by: Serhiy Storchaka <storchaka@gmail.com> Co-authored-by: Maurycy Pawłowski-Wieroński <maurycy@maurycy.com>
…ize() (GH-149080) Replace the insertion sort used for canonical ordering of combining characters with a hybrid approach: insertion sort for short runs (< 20) and counting sort for longer runs, reducing worst-case complexity from O(n^2) to O(n). This prevents denial of service via crafted Unicode strings with many combining characters in alternating CCC order. (cherry picked from commit 991224b) Co-authored-by: Seth Larson <seth@python.org> Co-authored-by: ch4n3-yoon <ch4n3.yoon@gmail.com> Co-authored-by: Seokchan Yoon <13852925+ch4n3-yoon@users.noreply.github.com> Co-authored-by: Stan Ulbrych <stan@python.org> Co-authored-by: Bénédikt Tran <10796600+picnixz@users.noreply.github.com> Co-authored-by: Petr Viktorin <encukou@gmail.com> Co-authored-by: Serhiy Storchaka <storchaka@gmail.com> Co-authored-by: Maurycy Pawłowski-Wieroński <maurycy@maurycy.com>
…ize() (GH-149080) (#150780) Replace the insertion sort used for canonical ordering of combining characters with a hybrid approach: insertion sort for short runs (< 20) and counting sort for longer runs, reducing worst-case complexity from O(n^2) to O(n). This prevents denial of service via crafted Unicode strings with many combining characters in alternating CCC order. (cherry picked from commit 991224b) Co-authored-by: Seth Larson <seth@python.org> Co-authored-by: ch4n3-yoon <ch4n3.yoon@gmail.com> Co-authored-by: Seokchan Yoon <13852925+ch4n3-yoon@users.noreply.github.com> Co-authored-by: Stan Ulbrych <stan@python.org> Co-authored-by: Bénédikt Tran <10796600+picnixz@users.noreply.github.com> Co-authored-by: Serhiy Storchaka <storchaka@gmail.com> Co-authored-by: Maurycy Pawłowski-Wieroński <maurycy@maurycy.com>
Replace the insertion sort used for canonical ordering of combining characters with a hybrid approach: insertion sort for short runs (< 20) and counting sort for longer runs, reducing worst-case complexity from O(n^2) to O(n). This prevents denial of service via crafted Unicode strings with many combining characters with a large number of inversions in combing class order.
unicodedata.normalize("NFC")canonical ordering #149079