◐ Shell
reader mode source ↗
Skip to content

gh-91404: Improve re performance#91495

Merged
brandtbucher merged 10 commits into
python:mainfrom
brandtbucher:sre-goto
Apr 15, 2022
Merged

gh-91404: Improve re performance#91495
brandtbucher merged 10 commits into
python:mainfrom
brandtbucher:sre-goto

Conversation

@brandtbucher

@brandtbucher brandtbucher commented Apr 13, 2022

Copy link
Copy Markdown
Member

This makes a few performance improvements in sre_lib.h:

  • Like the main VM, the regex engine now uses computed gotos on supported platforms.
  • Two read- and write-heavy members of the ctx pointer (ctx->pattern and ctx->ptr) are lifted into local variables.

(The diff looks gnarly, but everything inside the main switch is just mechanical replacements to support these two changes. It's not really that bad.)

It yields nice improvements on all of the expected benchmarks, and a 1% improvement overall:

Slower (6):
- scimark_monte_carlo: 398 ms +- 14 ms -> 406 ms +- 15 ms: 1.02x slower
- chaos: 407 ms +- 13 ms -> 413 ms +- 11 ms: 1.01x slower
- scimark_fft: 1.95 sec +- 0.02 sec -> 1.97 sec +- 0.02 sec: 1.01x slower
- tornado_http: 570 ms +- 17 ms -> 577 ms +- 17 ms: 1.01x slower
- sympy_expand: 2.77 sec +- 0.02 sec -> 2.79 sec +- 0.02 sec: 1.01x slower
- meteor_contest: 618 ms +- 11 ms -> 622 ms +- 12 ms: 1.01x slower

Faster (14):
- regex_v8: 158 ms +- 9 ms -> 132 ms +- 11 ms: 1.20x faster
- regex_effbot: 19.2 ms +- 1.0 ms -> 17.5 ms +- 0.9 ms: 1.10x faster
- regex_dna: 1.36 sec +- 0.01 sec -> 1.24 sec +- 0.01 sec: 1.09x faster
- pycparser: 7.07 sec +- 0.10 sec -> 6.73 sec +- 0.10 sec: 1.05x faster
- pidigits: 1.20 sec +- 0.01 sec -> 1.16 sec +- 0.01 sec: 1.04x faster
- pickle_pure_python: 1.86 ms +- 0.13 ms -> 1.80 ms +- 0.13 ms: 1.03x faster
- scimark_sparse_mat_mult: 28.8 ms +- 1.5 ms -> 27.9 ms +- 1.5 ms: 1.03x faster
- fannkuch: 2.35 sec +- 0.03 sec -> 2.28 sec +- 0.02 sec: 1.03x faster
- float: 449 ms +- 13 ms -> 439 ms +- 11 ms: 1.02x faster
- xml_etree_iterparse: 625 ms +- 15 ms -> 613 ms +- 13 ms: 1.02x faster
- spectral_norm: 594 ms +- 11 ms -> 587 ms +- 13 ms: 1.01x faster
- xml_etree_parse: 943 ms +- 14 ms -> 934 ms +- 17 ms: 1.01x faster
- 2to3: 1.56 sec +- 0.01 sec -> 1.55 sec +- 0.01 sec: 1.01x faster
- sympy_sum: 946 ms +- 14 ms -> 938 ms +- 13 ms: 1.01x faster

Benchmark hidden because not significant (42): chameleon, crypto_pyaes, deltablue, django_template, dulwich_log, go, hexiom, html5lib, json, json_dumps, json_loads, logging_format, logging_silent, logging_simple, mako, nbody, nqueens, pathlib, pickle, pickle_dict, pickle_list, pyflate, python_startup, python_startup_no_site, raytrace, regex_compile, richards, scimark_lu, scimark_sor, sqlalchemy_declarative, sqlalchemy_imperative, sqlite_synth, sympy_integrate, sympy_str, telco, thrift, unpack_sequence, unpickle, unpickle_list, unpickle_pure_python, xml_etree_generate, xml_etree_process

Geometric mean: 1.01x faster

Maybe re won't be slower in 3.11 after all! 🙃

@markshannon

markshannon commented Apr 13, 2022

Copy link
Copy Markdown
Member

Nice speedup.

I wonder how much of the speedup would be achieved by just moving ctx->pattern and ctx->ptr into local variables and leaving the dispatch alone. I'm wondering this for a couple of reasons:

  1. This paper suggests that it is the memory read, not the branch prediction, that has the larger impact on performance
  2. We are concerned about dispatching on Windows, and it might be good to apply the results of that here as well.

How easy would it be to apply the changes that move ctx->pattern and ctx->ptr, without the dispatching changes?

@serhiy-storchaka

Copy link
Copy Markdown
Member

How easy would it be to apply the changes that move ctx->pattern and ctx->ptr, without the dispatching changes?

@brandtbucher, could you show results with disabled computed gotos? Only regex-related benchmarks are interesting.

@brandtbucher brandtbucher self-assigned this Apr 13, 2022
@brandtbucher

brandtbucher commented Apr 13, 2022

Copy link
Copy Markdown
Member Author

With only computed gotos:

- regex_v8: 158 ms +- 9 ms -> 144 ms +- 11 ms: 1.09x faster
- regex_effbot: 19.2 ms +- 1.0 ms -> 18.2 ms +- 0.9 ms: 1.05x faster
- regex_dna: 1.36 sec +- 0.01 sec -> 1.30 sec +- 0.01 sec: 1.05x faster

With only new locals:

- regex_v8: 158 ms +- 9 ms -> 142 ms +- 12 ms: 1.11x faster
- regex_effbot: 19.2 ms +- 1.0 ms -> 18.2 ms +- 1.0 ms: 1.05x faster
- regex_dna: 1.36 sec +- 0.01 sec -> 1.22 sec +- 0.01 sec: 1.11x faster

Combined:

- regex_v8: 158 ms +- 9 ms -> 132 ms +- 11 ms: 1.20x faster
- regex_effbot: 19.2 ms +- 1.0 ms -> 17.5 ms +- 0.9 ms: 1.10x faster
- regex_dna: 1.36 sec +- 0.01 sec -> 1.24 sec +- 0.01 sec: 1.09x faster

For regex_v8 and regex_effbot, it looks like each optimization contributes about 50% of the speedup. regex_dna seems to benefit mostly from the new locals, for some reason (perhaps because the switch dispatch for that one is already highly stable/predicable?).

I think we should keep both.

@serhiy-storchaka

Copy link
Copy Markdown
Member

Excellent! Before merging, could you compare it with 3.10? If there is the same difference, it would be worth to add a note about speed up 10-20% in the NEWS and What's New files.

@brandtbucher brandtbucher merged commit 1b34b56 into python:main Apr 15, 2022
undingen added a commit to undingen/pyston_v2 that referenced this pull request Apr 26, 2022
This is a backport of the upstream 3.11 improvement:
python/cpython#91495

I only backported the ctx->pattern -> pattern and ctx->ptr -> ptr
part because using computed goto actually decreased perf slightly on the opt build.
@serhiy-storchaka serhiy-storchaka added the 3.11 only security fixes label May 19, 2022
@brandtbucher brandtbucher deleted the sre-goto branch July 21, 2022 19:54
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

3.11 only security fixes performance Performance or resource usage stdlib Standard Library Python modules in the Lib/ directory topic-regex

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants