◐ Shell
reader mode source ↗
Skip to content

gh-149079: Fix O(n^2) canonical ordering in unicodedata.normalize()#149080

Merged
encukou merged 2 commits into
python:mainfrom
sethmlarson:quadratic-unicodedata-normalize
Jun 2, 2026
Merged

gh-149079: Fix O(n^2) canonical ordering in unicodedata.normalize()#149080
encukou merged 2 commits into
python:mainfrom
sethmlarson:quadratic-unicodedata-normalize

Conversation

@sethmlarson

@sethmlarson sethmlarson commented Apr 27, 2026

Copy link
Copy Markdown
Contributor

Replace the insertion sort used for canonical ordering of combining characters with a hybrid approach: insertion sort for short runs (< 20) and counting sort for longer runs, reducing worst-case complexity from O(n^2) to O(n). This prevents denial of service via crafted Unicode strings with many combining characters with a large number of inversions in combing class order.

…ze()

Replace the insertion sort used for canonical ordering of combining
characters with a hybrid approach: insertion sort for short runs (< 20)
and counting sort for longer runs, reducing worst-case complexity from
O(n^2) to O(n). This prevents denial of service via crafted Unicode
strings with many combining characters in alternating CCC order.

Co-authored-by: Seokchan Yoon <13852925+ch4n3-yoon@users.noreply.github.com>
@StanFromIreland

Copy link
Copy Markdown
Member

Reviewers: Note that there are pending changes from previous reviews.

@serhiy-storchaka serhiy-storchaka self-requested a review April 27, 2026 22:16

@serhiy-storchaka serhiy-storchaka left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hide comment

There is a potential for optimization, but in general LGTM. 👍

@serhiy-storchaka

Copy link
Copy Markdown
Member

Ideas for optimization:

  • We already have the Py_UCS4 output buffer. It is better to sort it, without using more costly PyUnicode_READ and PyUnicode_WRITE.
  • It is perhaps possible to combine sorting routines with the code that determines the length. This will reduce the number of costly _getrecord_ex() calls but requires heavy rewriting.
  • Since Unicode characters only need 21 bits of 32, they can be combined with 8-bit combining in the temporary buffer, reducing the number of costly _getrecord_ex() calls. But this will make the code more difficult to read.

@tim-one tim-one left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hide comment

I'm not sure there's ever an end to suggestions, so I'd prefer to ship this already. Good work, good enough,, and thank you for your care and patience!

@StanFromIreland

Copy link
Copy Markdown
Member

@sethmlarson a little ping, there are some suggestions above that need applying.

@serhiy-storchaka

serhiy-storchaka commented May 13, 2026

Copy link
Copy Markdown
Member

Most of my suggestions can wait to the following PRs. Although some of them can be easy to implement, so you can include them in this PR if you wish.

@encukou encukou left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hide comment

Converting @maurycy's comments to GitHub suggestions:

@StanFromIreland

Copy link
Copy Markdown
Member

Following Petr's example, I wrote sethmlarson#1 to apply all open suggestions (except #149080 (comment), which I think is fine as-is, and Serhiy's future performance plans).

Co-authored-by: Bénédikt Tran <10796600+picnixz@users.noreply.github.com>
Co-authored-by: Petr Viktorin <encukou@gmail.com>
Co-authored-by: Serhiy Storchaka <storchaka@gmail.com>
Co-authored-by: Maurycy Pawłowski-Wieroński <maurycy@maurycy.com>
@sethmlarson

Copy link
Copy Markdown
Contributor Author

Sorry for the delay folks, I've merged the changes from @StanFromIreland 🙏

@sethmlarson sethmlarson requested a review from encukou May 26, 2026 21:29
@read-the-docs-community

Copy link
Copy Markdown

Hide details View details @encukou encukou merged commit 991224b into python:main Jun 2, 2026
56 checks passed
20 hidden items Load more…
@StanFromIreland StanFromIreland added needs backport to 3.13 bugs and security fixes needs backport to 3.14 needs backport to 3.15 pre-release feature fixes, bugs and security fixes labels Jun 2, 2026
@miss-islington-app

Copy link
Copy Markdown

Thanks @sethmlarson for the PR, and @encukou for merging it 🌮🎉.. I'm working now to backport this PR to: 3.10.
🐍🍒⛏🤖

@miss-islington-app

Copy link
Copy Markdown

Thanks @sethmlarson for the PR, and @encukou for merging it 🌮🎉.. I'm working now to backport this PR to: 3.11.
🐍🍒⛏🤖

@miss-islington-app

Copy link
Copy Markdown

Thanks @sethmlarson for the PR, and @encukou for merging it 🌮🎉.. I'm working now to backport this PR to: 3.12.
🐍🍒⛏🤖

@miss-islington-app

Copy link
Copy Markdown

Thanks @sethmlarson for the PR, and @encukou for merging it 🌮🎉.. I'm working now to backport this PR to: 3.13.
🐍🍒⛏🤖

@miss-islington-app

Copy link
Copy Markdown

Thanks @sethmlarson for the PR, and @encukou for merging it 🌮🎉.. I'm working now to backport this PR to: 3.14.
🐍🍒⛏🤖

@miss-islington-app

Copy link
Copy Markdown

Thanks @sethmlarson for the PR, and @encukou for merging it 🌮🎉.. I'm working now to backport this PR to: 3.15.
🐍🍒⛏🤖

@miss-islington-app

Copy link
Copy Markdown

Sorry, @sethmlarson and @encukou, I could not cleanly backport this to 3.10 due to a conflict.
Please backport using cherry_picker on command line.

cherry_picker 991224b1e8311c85f198f6dd8208bf8cff7fc26f 3.10

@miss-islington-app

miss-islington-app Bot commented Jun 2, 2026

Copy link
Copy Markdown

Sorry, @sethmlarson and @encukou, I could not cleanly backport this to 3.11 due to a conflict.
See the Backporting merged changes section of the Developer's Guide for instructions on backporting using the cherry_picker tool:

cherry_picker 991224b1e8311c85f198f6dd8208bf8cff7fc26f 3.11

@miss-islington-app

Copy link
Copy Markdown

Sorry, @sethmlarson and @encukou, I could not cleanly backport this to 3.12 due to a conflict.
Please backport using cherry_picker on command line.

cherry_picker 991224b1e8311c85f198f6dd8208bf8cff7fc26f 3.12

@miss-islington-app

Copy link
Copy Markdown

Sorry, @sethmlarson and @encukou, I could not cleanly backport this to 3.13 due to a conflict.
Please backport using cherry_picker on command line.

cherry_picker 991224b1e8311c85f198f6dd8208bf8cff7fc26f 3.13

@bedevere-app

bedevere-app Bot commented Jun 2, 2026

Copy link
Copy Markdown

GH-150775 is a backport of this pull request to the 3.14 branch.

@bedevere-app bedevere-app Bot removed the needs backport to 3.14 bugs and security fixes label Jun 2, 2026
@bedevere-app

bedevere-app Bot commented Jun 2, 2026

Copy link
Copy Markdown

GH-150776 is a backport of this pull request to the 3.15 branch.

@bedevere-app bedevere-app Bot removed the needs backport to 3.15 pre-release feature fixes, bugs and security fixes label Jun 2, 2026
StanFromIreland added a commit that referenced this pull request Jun 2, 2026
…ize() (GH-149080) (#150776)

Replace the insertion sort used for canonical ordering of combining
characters with a hybrid approach: insertion sort for short runs (< 20)
and counting sort for longer runs, reducing worst-case complexity from
O(n^2) to O(n). This prevents denial of service via crafted Unicode
strings with many combining characters in alternating CCC order.
(cherry picked from commit 991224b)

Co-authored-by: Seth Larson <seth@python.org>
Co-authored-by: ch4n3-yoon <ch4n3.yoon@gmail.com>
Co-authored-by: Seokchan Yoon <13852925+ch4n3-yoon@users.noreply.github.com>
Co-authored-by: Stan Ulbrych <stan@python.org>
Co-authored-by: Bénédikt Tran <10796600+picnixz@users.noreply.github.com>
Co-authored-by: Petr Viktorin <encukou@gmail.com>
Co-authored-by: Serhiy Storchaka <storchaka@gmail.com>
Co-authored-by: Maurycy Pawłowski-Wieroński <maurycy@maurycy.com>
StanFromIreland added a commit that referenced this pull request Jun 2, 2026
…ize() (GH-149080)

Replace the insertion sort used for canonical ordering of combining
characters with a hybrid approach: insertion sort for short runs (< 20)
and counting sort for longer runs, reducing worst-case complexity from
O(n^2) to O(n). This prevents denial of service via crafted Unicode
strings with many combining characters in alternating CCC order.
(cherry picked from commit 991224b)

Co-authored-by: Seth Larson <seth@python.org>
Co-authored-by: ch4n3-yoon <ch4n3.yoon@gmail.com>
Co-authored-by: Seokchan Yoon <13852925+ch4n3-yoon@users.noreply.github.com>
Co-authored-by: Stan Ulbrych <stan@python.org>
Co-authored-by: Bénédikt Tran <10796600+picnixz@users.noreply.github.com>
Co-authored-by: Petr Viktorin <encukou@gmail.com>
Co-authored-by: Serhiy Storchaka <storchaka@gmail.com>
Co-authored-by: Maurycy Pawłowski-Wieroński <maurycy@maurycy.com>
@bedevere-app

bedevere-app Bot commented Jun 2, 2026

Copy link
Copy Markdown

GH-150780 is a backport of this pull request to the 3.13 branch.

@bedevere-app bedevere-app Bot removed the needs backport to 3.13 bugs and security fixes label Jun 2, 2026
@bedevere-app

bedevere-app Bot commented Jun 2, 2026

Copy link
Copy Markdown

GH-150781 is a backport of this pull request to the 3.13 branch.

encukou added a commit that referenced this pull request Jun 2, 2026
…ize() (GH-149080) (#150780)

Replace the insertion sort used for canonical ordering of combining
characters with a hybrid approach: insertion sort for short runs (< 20)
and counting sort for longer runs, reducing worst-case complexity from
O(n^2) to O(n). This prevents denial of service via crafted Unicode
strings with many combining characters in alternating CCC order.
(cherry picked from commit 991224b)

Co-authored-by: Seth Larson <seth@python.org>
Co-authored-by: ch4n3-yoon <ch4n3.yoon@gmail.com>
Co-authored-by: Seokchan Yoon <13852925+ch4n3-yoon@users.noreply.github.com>
Co-authored-by: Stan Ulbrych <stan@python.org>
Co-authored-by: Bénédikt Tran <10796600+picnixz@users.noreply.github.com>
Co-authored-by: Serhiy Storchaka <storchaka@gmail.com>
Co-authored-by: Maurycy Pawłowski-Wieroński <maurycy@maurycy.com>
@bedevere-app

bedevere-app Bot commented Jun 3, 2026

Copy link
Copy Markdown

GH-150843 is a backport of this pull request to the 3.12 branch.

@bedevere-app bedevere-app Bot removed the needs backport to 3.12 only security fixes label Jun 3, 2026
@bedevere-app

bedevere-app Bot commented Jun 17, 2026

Copy link
Copy Markdown

GH-151570 is a backport of this pull request to the 3.11 branch.

@bedevere-app bedevere-app Bot removed the needs backport to 3.11 only security fixes label Jun 17, 2026
@bedevere-app

bedevere-app Bot commented Jun 17, 2026

Copy link
Copy Markdown

GH-151571 is a backport of this pull request to the 3.10 branch.

@bedevere-app bedevere-app Bot removed the needs backport to 3.10 only security fixes label Jun 17, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

8 participants