bpo-39465: Fix _PyUnicode_FromId() for subinterpreters#20058
Conversation
|
The advantage of the hash table approach (PR #20048) is that it avoids the need of a unique index. It simply uses the variable adddress as the key. |
Sorry, something went wrong.
|
Oh. I'm not longer able to reproduce this benchmark :-( I rer-run a benchmark with gcc -O3, and then again with gcc -O3 and LTO. I got two times similar results:
I also checked: volatile has no impact on performances. |
Sorry, something went wrong.
|
I rebased my PR and squashed commits to be able to update the commit message (especially the benchmark result). |
Sorry, something went wrong.
|
I rebased my PR on master which became Python 3.10. I failed to make the hashtable as fast as an array, so I closed the PR #20048 in favor of this PR which uses an array. |
Sorry, something went wrong.
and
We can use something like a pre-processor to initialize some identifiers a build time, but I'm not sure that it's a good idea to allocate in advance space for all possible identifiers. Python has a large standard library. Many applications will never room some C extensions. I like the approach to assigning identifiers dynamically and only allocate more space on demand. Only the first call to Let's say that Python has 200 identifier objects. With this PR, if an application only use 10 identifiers, Python will only allocates an array of 16 items, instead of 200 items. I chose to always allocate at least 16 items, to reduce the number of realloc. We might adjust that later if needed. Maybe for identifiers, it's not critical, but I plan to use a similar approach for other objects which should be made "per-interpreter". If we pre-allocate "everything", we will likely waste a lot of memory which will never be used. |
Sorry, something went wrong.
|
cc @encukou @ericsnowcurrently: Would you mind to review this PR? @serhiy-storchaka: Apart your two remarks, are you ok with the overall approach? Does it sound like a reasonable overhead (+1.21 ns per function call)? If we consider that subinterpreters with per-interpreter GIL can make Python (code written for subinterpreters) at least 4x faster on machines with at least 4 CPUs, IMO it's worth it. |
Sorry, something went wrong.
|
Performance impact: Context for these numbers:
(copy of my #20048 (comment) comment.) |
Sorry, something went wrong.
|
Just for curiosity, how many identifiers are allocated by |
Sorry, something went wrong.
(I'm running "./python -m test" which is quite long :-p) I used this patch: |
Sorry, something went wrong.
serhiy-storchaka
left a comment
There was a problem hiding this comment.
Concurrent programming without GIL is hard.
Sorry, something went wrong.
https://bugs.python.org/issue39465 doesn't try to remove the GIL but having one GIL per interpreter: see https://bugs.python.org/issue40512 |
Sorry, something went wrong.
|
It seems like currently, CPython uses around 523 _Py_Identifier instances: |
Sorry, something went wrong.
|
One GIL per interpreter does not help when work with a data shared between interpreters. |
Sorry, something went wrong.
Only _PyRuntime is shared by multiple interpreters: access to _PyRuntime is protected by a new lock. |
Sorry, something went wrong.
|
Is it |
Sorry, something went wrong.
Would you mind to elaborate which shared data is not guarded by rt_ids->lock? Globals (shared by all interpreters):
Per-interpreter:
|
Sorry, something went wrong.
|
My apologies. You are right, now I see that But I think there is still a problem with non-atomic |
Sorry, something went wrong.
vstinner
left a comment
There was a problem hiding this comment.
Oops, note for myself: I must revert the _testcapi changes, only there for benchmarks.
Sorry, something went wrong.
ericsnowcurrently
left a comment
There was a problem hiding this comment.
Other than globally-locking around id->index before we're sure it's been set, LGTM.
Sorry, something went wrong.
|
When you're done making the requested changes, leave the comment: |
Sorry, something went wrong.
|
I wrote PR #20390 to check if C11 |
Sorry, something went wrong.
I could modify my PR to only access _Py_Identifier.index when the runtime lock is acquired. The problem is that it may become a new performance bottleneck if many threads of different subinterpreters call _PyUnicode_FromId() in parallel. Threads would have to sequentially execute _PyUnicode_FromId(), rather than being able to run it in parallel. The code protected by the lock is very short and very fast, maybe it's not an issue? A "global" lock for all identifiers may defeat the purpose of per-interpreter GIL. Well, at least, it makes _PyUnicode_FromId() "less parallel" :-) If C11 _Atomic specifier cannot be used, maybe we can identify a subset of functions available on all C compilers supported by CPython. For example, MSC (Visual Studio) provides "Interlocked" functions for atomic operations on LONG or on 64-bit variables: https://docs.microsoft.com/en-us/windows/win32/sync/synchronization-functions?redirectedfrom=MSDN |
Sorry, something went wrong.
|
Another alternative is to use a Read/Write lock which allows parallel read access:
|
Sorry, something went wrong.
|
I wrote PR #20766 which adds functions to access variables atomically without having to declare variables as atomic. I rebased this PR on master and included PR #20766 in this PR to access _Py_Identifier.index atomically. Microbenchmark on the PR using atomic functions: It seems like reading _Py_Identifier.index doesn't use any memory fence, it's just a regular MOV on x86. So the fast path doesn't pay any overhead of an atomic read. |
Sorry, something went wrong.
|
Currently, my PR uses But |
Sorry, something went wrong.
Make _PyUnicode_FromId() function compatible with subinterpreters. Each interpreter now has an array of identifier objects (interned strings decoded from UTF-8). * Add PyInterpreterState.unicode.identifiers: array of identifiers objects. * Add _PyRuntimeState.unicode_ids used to allocate unique indexes to _Py_Identifier. * Rewrite _Py_Identifier structure. Benchmark _PyUnicode_FromId(&PyId_a) with _Py_IDENTIFIER(a): [ref] 2.42 ns +- 0.00 ns -> [atomic] 3.39 ns +- 0.00 ns: 1.40x slower This change adds 1 ns per _PyUnicode_FromId() call in average.
|
I plan to merge this PR next days. cc @serhiy-storchaka @ericsnowcurrently IMO the latest version of the PR is now correct (no race condition) and its performance slowdown is acceptable. This PR has a long history:
I rebased this PR on master and I re-run benchmarks: It adds 1 ns per _PyUnicode_FromId() call in average. IMO it's reasonable and no better approach was found to fix https://bugs.python.org/issue39465 (fix _PyUnicode_FromId() for subinterpreters). Context for these numbers:
I already pushed non-controlversial changes to make this PR as short as possible (to ease reviews). About the |
Sorry, something went wrong.
I fixed the issue spotted by Eric
|
Oh, running |
Sorry, something went wrong.
I was a mistake during my latest rebase. It's now fixed. |
Sorry, something went wrong.
|
This PR is needed to fix https://bugs.python.org/issue40521 : see PR #20085 "Per-interpreter interned strings". |
Sorry, something went wrong.
|
Bisecting history, git tells me this PR is causing an issue, detected in pybind11's embedding tests (pybind/pybind11#2774): https://bugs.python.org/issue42882 I'm happy to debug or help out, but I don't immediately see how to approach this easily. |
Sorry, something went wrong.
Make _PyUnicode_FromId() function compatible with subinterpreters. Each interpreter now has an array of identifier objects (interned strings decoded from UTF-8). * Add PyInterpreterState.unicode.identifiers: array of identifiers objects. * Add _PyRuntimeState.unicode_ids used to allocate unique indexes to _Py_Identifier. * Rewrite the _Py_Identifier structure. Microbenchmark on _PyUnicode_FromId(&PyId_a) with _Py_IDENTIFIER(a): [ref] 2.42 ns +- 0.00 ns -> [atomic] 3.39 ns +- 0.00 ns: 1.40x slower This change adds 1 ns per _PyUnicode_FromId() call in average.
Make _PyUnicode_FromId() function compatible with subinterpreters.
Each interpreter now has an array of identifier objects (interned
strings decoded from UTF-8).
objects.
to _Py_Identifier.
Benchmark _PyUnicode_FromId(&PyId_a) with _Py_IDENTIFIER(a):
[ref] 2.42 ns +- 0.00 ns -> [atomic] 3.39 ns +- 0.00 ns: 1.40x slower
This change adds 1 ns per _PyUnicode_FromId() call in average.
https://bugs.python.org/issue39465