bpo-40521: Make tuple free list per-interpreter by vstinner · Pull Request #20247 · python/cpython

vstinner

local_tuple() has an overhead of interpreting bytecode and function call. Could you use some code which creates and destroys tuples in a tight loop in C?

tl; dr IMO the overhead is not significant. It's so small that it is really hard to measure.

I attached microbench_tuple.py to https://bugs.python.org/issue40521 which requires to apply bench_tuple.patch (also attached to the same bpo).

It is really hard to measure differences with my change since differences are around 1 ns, knowning that creating a tuple of 1 item takes around 20 ns... See my previous benchmark ("The overhead is between +0.7 ns and +2.1 ns per tuple creation"): #20247 (comment)

Depending on how Python is compiled, results say "faster" or "slower". My understanding is that differences are so small (less than 5% faster or slower), the difference is not significant.

pyperf requires 2^25 loops to get a run longer than 100 ms on my machine.

I ran benchmarks on Fedora 32 with CPU isolation.

Results of microbench_tuple.py with CPU isolation, LTO and PGO. It is the most reliable way to measure benchmarks in my knowledge.

+--------------------------------+---------+-----------------------------+
| Benchmark                      | ref     | patch                       |
+================================+=========+=============================+
| ()                             | 4.84 ns | 4.47 ns: 1.08x faster (-8%) |
+--------------------------------+---------+-----------------------------+
| (None,)                        | 18.2 ns | 19.0 ns: 1.04x slower (+4%) |
+--------------------------------+---------+-----------------------------+
| (None, None, None, None, None) | 30.1 ns | 31.3 ns: 1.04x slower (+4%) |
+--------------------------------+---------+-----------------------------+

Creating an empty tuple is "faster": that's likely noise in the benchmark since numbers are too small for pyperf. Or maybe it is because PGO decided to optimize the code differently and next build will get different performance...

std dev:

(): Mean +- std dev: [ref] 4.84 ns +- 0.00 ns -> [patch] 4.47 ns +- 0.01 ns: 1.08x faster (-8%)
(None,): Mean +- std dev: [ref] 18.2 ns +- 0.2 ns -> [patch] 19.0 ns +- 0.2 ns: 1.04x slower (+4%)
(None, None, None, None, None): Mean +- std dev: [ref] 30.1 ns +- 0.1 ns -> [patch] 31.3 ns +- 0.2 ns: 1.04x slower (+4%)

Differences:

Empty tuple: -0.4 ns
tuple of 1 item: +0.8 ns
tuple of 5 items: +1.2 ns

Note: building Python with LTO+PGO takes around 10 min, so the overall benchmark takes me around 30 min.

Second benchmark run, without PGO, only LTO and CPU isolation, to check results.

+--------------------------------+---------+------------------------------+
| Benchmark                      | ref-lto | patch-lto                    |
+================================+=========+==============================+
| ()                             | 3.72 ns | 3.96 ns: 1.06x slower (+6%)  |
+--------------------------------+---------+------------------------------+
| (None,)                        | 22.5 ns | 19.8 ns: 1.14x faster (-12%) |
+--------------------------------+---------+------------------------------+
| (None, None, None, None, None) | 33.3 ns | 31.8 ns: 1.05x faster (-5%)  |
+--------------------------------+---------+------------------------------+

std dev:

(): Mean +- std dev: [ref-lto] 3.72 ns +- 0.03 ns -> [patch-lto] 3.96 ns +- 0.11 ns: 1.06x slower (+6%)
(None,): Mean +- std dev: [ref-lto] 22.5 ns +- 0.1 ns -> [patch-lto] 19.8 ns +- 0.1 ns: 1.14x faster (-12%)
(None, None, None, None, None): Mean +- std dev: [ref-lto] 33.3 ns +- 0.1 ns -> [patch-lto] 31.8 ns +- 0.2 ns: 1.05x faster (-5%)

Differences:

Empty tuple: +0.2 ns
tuple of 1 item: -2.7 ns
tuple of 5 items: -1.5 ns

The benchmark () measures PyTuple_New(0) which does almost nothing. In short, it measures:

obj = free_list[0];
Py_INCREF(obj);
Py_DECREF(obj);

and the cost of a function call in C: clal PyTuple_New().

The benchmark (None,) mostly measures PyTuple_New(1) which does:

obj = free_list[1];
 numfree[1]--;
_Py_NewReference(obj);
obj->ob_item[0] = NULL;
_PyObject_GC_TRACK(obj);

Py_INCREF(Py_None);
obj->ob_item[0] = Py_None;

PyObject_GC_UnTrack(obj);
Py_XDECREF(obj->ob_item[0]);

obj->ob_item[0] = free_list[1];
numfree[1]++;
free_list[1] = obj;