bpo-37448: Add radix tree implementation for obmalloc address_in_range().#14474
bpo-37448: Add radix tree implementation for obmalloc address_in_range().#14474nascheme merged 17 commits into
Conversation
|
Food for thought: consider making L1 fatter? My understanding is that Intel and AMD chips are limited to at most 48 physical address bits now, and they agreed that, for now, the top 16 bits of virtual addresses must be copies of bit 2**47. Which was intentional, to stop "clever" OS authors and users from using the higher-order bits for something else, thus making future expansion of physical address bits essentially impossible. Jerks - we could have used the sign bit to mean "address in range" 😉. So the top 17 bits will all be 0 or all be 1, and any L1 width <= 17 will never see more than those 2 values. On Windows, I expect they'll only see "all 0", because the OS reserves the "all 1" patterns for its own use - don't know about Linux. I bet the next step will be 56-bit physical addresses. |
Sorry, something went wrong.
|
Note that on a 32-bit box, with 256K arenas there are only 2**(32-18) = 2**14 = 16K ideal arena addresses. So a static 1-level tree would be fine. |
Sorry, something went wrong.
|
Regarding making the root node larger, that sounds like a good idea. I'm not sure how large is good though. It increases the virtual set size of the process. I compared VSS and RSS of the base Python (current address_in_range()), small root radix node (MAP1_BITS = 16) and MAP1_BITS = 20, using a program that does basically nothing.
Below is my quick and dirty change to use 20 bits for the root node. I'm hesitant to increase the VSS. On Linux it is not much of a problem because Linux typically is very willing to overcommit on RAM. However, other operating systems might be different. E.g. I think FreeBSD is more conservative. Not sure about Windows. The idea is to avoid having to kill programs due to an out-of-memory situation. If they don't overcommit, they will refuse to start new programs if the VSS is too large. At least, that's my understanding. |
Sorry, something went wrong.
|
Regarding a 32-bit version, I'm not sure what to do there. We should at least do some benchmarking on common 32-bit platforms. I would guess most common would be 32-bit Windows on an amd64 processor or Linux on a 32-bit ARM processor. In those cases, maybe we could keep POOL_SIZE <= 4096 and keep the existing address_in_range()? |
Sorry, something went wrong.
|
I had in mind more widening L1 by narrowing it ;-) That is, throw away the top 16 address bits entirely (in debug mode verifying they're the same as the top bit that remains). Then we only have 48 - 18 = 30 arena prefix bits for the tree levels to resolve. Yes, that's aggressive. But then, best I can tell, the entire useful effect of widening L1 to 20 bits as-is could be gotten by shrinking it to 4 bits instead. Of course the tree would have to be reworked a bit when physical address bits increase. Keep it 3 levels, and perhaps it could just be a matter of changing a FYI, there is no overcommit on Windows, and no OOM killer. While doing a "commit" with So increasing (the moral equivalent of) VSS there is even less atractive. But there's something to be said for writing code that exploits hardware realities: there are only 48 address bits 😉. Then VSS can shrink. |
Sorry, something went wrong.
Ah, that is very pragmatic. Thank you for explaining it to me. ;-P I have made a patch and done some testing. Here is an updated table. Obviously the radix tree is tiny in the case that it only uses 30 bits. I have root node using 10 bits.
|
Sorry, something went wrong.
|
Cool! OTOH ... we're making much stronger assumptions now, and it's unclear how long they'll last. Or even if they all apply even now. For example, I know that the 64-bit PowerPC architecture still lives on in some very high-end systems, but don't know whether Python is used on them, or what their virtual address conventions are. FYI, looks like "the next step" in x64-land will be 57-bit physical addresses rather than 56: https://en.wikipedia.org/wiki/Intel_5-level_paging And Linux has apparently already been patched for that, but in anticipation - no chips can use it yet. Prudent: yes, leave 32-bit Python entirely alone. But also make it very easy to revert 64-bit Python to the status quo too (e.g., if |
Sorry, something went wrong.
I found this recent bug fix: powerpc/mm/64s/hash: Reallocate context ids on fork I think that means with certain setups, user mode programs on POWER9 processors can see addresses in the 4 PB range (52 bits?).
Yes, the patch is in the current release: You will be amused to read that you are not the first to think of hiding flags in the high bits of pointers.
There is a define for that: WITH_RADIX_TREE. Easy to turn off. |
Sorry, something went wrong.
The radix tree approach is a relatively simple and memory sanitary alternative to the current (slightly) unsanitary address_in_range(). The radix tree is currently only implemented for 64-bit platforms. Adding a 32-bit version would be relatively easy.
Current 64-bit chips only have 48 bits of physical addresses. Exploit that to further shrink the size of the radix tree.
1e8297a to
4df4225
Compare
February 21, 2021 23:06
|
The branch has been updated to add 32-bit support. For 32-bit pointers, two levels of the tree have been eliminated vs the 64-bit version. Now the radix tree is enabled unconditionally for all CPython platforms. If desired, we can restore the I went ahead an also increased the arena and pool sizes by 4x. That's based on advice from Tim Peters. He says:
Tim's big-pool patch is here: https://bugs.python.org/issue37211 I've run pyperformance benchmarks. Without increasing pool and arena sizes, the radix version of obmalloc is 1.01x or 1.02x slower on some of the benchmarks. With the size increases, it is about that much faster. So, I would say basically a wash. |
Sorry, something went wrong.
If disabled, the old version of address_in_range() is used.
Configure option is not required. If some Python user find memory usage regression, they can build Python with |
Sorry, something went wrong.
|
Only my own caution, I think. It is easy to back out or disable if we decide it's not good. I have a small cleanup to variable names to make but I can merge after that. |
Sorry, something went wrong.
|
Should probably get a review from @pablogsal (Pablo, you can also delegate to someone else). |
Sorry, something went wrong.
|
Hey everyone, wanted to add an extra data point here. We tested this change with the Instagram workload and all the health metrics looked fine (i.e no weird crashes / behavior). Also, performance-wise, this looks like a net-neutral change. |
Sorry, something went wrong.
Thanks a lot @eduardo-elizondo, that will help us a lot when making a review and a final decision. Seems that the main thing we need to consider is maintenance cost and possible semantic change, but I am optimistic about both. I still need to find time to review it in detail. |
Sorry, something went wrong.
|
One question: does valgrind complains the same way it used to with the old |
Sorry, something went wrong.
|
Pablo, I never ran valgrind myself, but a primary point of this different approach is that it never references uninitialized memory (well, assuming users don't pass insane addresses to |
Sorry, something went wrong.
|
🤖 New build scheduled with the buildbot fleet by @pablogsal for commit b41a77a 🤖 If you want to schedule another build, you need to add the ":hammer: test-with-buildbots" label again. |
Sorry, something went wrong.
⚠️⚠️⚠️ Buildbot failure ⚠️⚠️⚠️Hi! The buildbot s390x RHEL8 3.x has failed when building commit 85b6b70. What do you need to do:
You can take a look at the buildbot page here: https://buildbot.python.org/all/#builders/509/builds/926 Summary of the results of the build (if available): == Tests result: ENV CHANGED == 412 tests OK. 10 slowest tests:
1 test altered the execution environment: 14 tests skipped: Total duration: 5 min 35 sec Click to see traceback logsTraceback (most recent call last):
File "/home/dje/cpython-buildarea/3.x.edelsohn-rhel8-z/build/Lib/asyncio/sslproto.py", line 321, in __del__
self.close()
File "/home/dje/cpython-buildarea/3.x.edelsohn-rhel8-z/build/Lib/asyncio/sslproto.py", line 316, in close
self._ssl_protocol._start_shutdown()
File "/home/dje/cpython-buildarea/3.x.edelsohn-rhel8-z/build/Lib/asyncio/sslproto.py", line 590, in _start_shutdown
self._abort()
File "/home/dje/cpython-buildarea/3.x.edelsohn-rhel8-z/build/Lib/asyncio/sslproto.py", line 731, in _abort
self._transport.abort()
File "/home/dje/cpython-buildarea/3.x.edelsohn-rhel8-z/build/Lib/asyncio/selector_events.py", line 680, in abort
self._force_close(None)
File "/home/dje/cpython-buildarea/3.x.edelsohn-rhel8-z/build/Lib/asyncio/selector_events.py", line 731, in _force_close
self._loop.call_soon(self._call_connection_lost, exc)
File "/home/dje/cpython-buildarea/3.x.edelsohn-rhel8-z/build/Lib/asyncio/base_events.py", line 745, in call_soon
self._check_closed()
File "/home/dje/cpython-buildarea/3.x.edelsohn-rhel8-z/build/Lib/asyncio/base_events.py", line 510, in _check_closed
raise RuntimeError('Event loop is closed')
RuntimeError: Event loop is closed
|
Sorry, something went wrong.
|
Buildbot failure on S390x seems to be unrelated to this change. See https://bugs.python.org/issue37368 . |
Sorry, something went wrong.
…n_range(). (pythonGH-14474) This is a backport of the radix tree implementation for pymalloc. It is not platform specific but it solves a segmentation fault on aarch64 platforms when MTE (Memory Tag Extension) is enabled. The github issue is pythongh-87759. Original commit message: The radix tree approach is a relatively simple and memory sanitary alternative to the old (slightly) unsanitary address_in_range(). To disable the radix tree map, set a preprocessor flag as follows: -DWITH_PYMALLOC_RADIX_TREE=0. (cherry picked from commit 85b6b70) Co-authored-by: Tim Peters <tim.peters@gmail.com> Jira: ENTLLT-7122 Change-Id: Ifbc8c1290077b78c85ac30abe0bcbb7c8ea0b959
…n_range(). (pythonGH-14474) This is a backport of the radix tree implementation for pymalloc. It is not platform specific but it solves a segmentation fault on aarch64 platforms when MTE (Memory Tag Extension) is enabled. The github issue is pythongh-87759. Original commit message: The radix tree approach is a relatively simple and memory sanitary alternative to the old (slightly) unsanitary address_in_range(). To disable the radix tree map, set a preprocessor flag as follows: -DWITH_PYMALLOC_RADIX_TREE=0. (cherry picked from commit 85b6b70) Co-authored-by: Tim Peters <tim.peters@gmail.com> Change-Id: I0a3c2979c207f997c707c5f798941426c8d50004
The radix tree approach is a relatively simple and memory sanitary
alternative to the current (slightly) unsanitary address_in_range().
The radix tree is currently only implemented for 64-bit platforms.
Adding a 32-bit version would be relatively easy.
https://bugs.python.org/issue37448