Was this abandoned just because nobody had the time, or was there a problem with the approach? I independently wanted this optimisation, and have ended up implementing something very similar to what was reverted in https://hg.python.org/lookup/dff6b4b61cac.
In a benchmark that creates a large bytearray and then fills it with socket.readinto, I'm seeing a 2x performance improvement on Linux. From some quick benchmarking, it also seems to be just as fast as the old code for small arrays that are allocated from the pool.
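
A minimal sketch of a benchmark of this shape, for anyone who wants to try it. This is an assumption about what such a test could look like, not the script behind the numbers above. The helper name fill_bytearray, the chunk size, and the use of socket.socketpair with a sender thread are all made up for illustration, and it calls recv_into on the socket object (readinto is what the file object from socket.makefile exposes).

```python
import socket
import threading
import time


def fill_bytearray(size, chunk=1 << 16):
    """Allocate a bytearray of `size` bytes and fill it from a local socket.

    Hypothetical helper: returns (buffer, bytes_received, elapsed_seconds).
    The timed region covers both the allocation and the receive loop.
    """
    reader, writer = socket.socketpair()

    def send_all():
        # Push exactly `size` bytes of filler data, then close the sending end.
        payload = b"x" * chunk
        remaining = size
        while remaining > 0:
            n = min(remaining, chunk)
            writer.sendall(payload[:n])
            remaining -= n
        writer.close()

    sender = threading.Thread(target=send_all)
    sender.start()

    start = time.perf_counter()
    buf = bytearray(size)        # the allocation the optimisation targets
    view = memoryview(buf)       # lets us receive into slices without copying
    received = 0
    while received < size:
        n = reader.recv_into(view[received:])
        if n == 0:               # peer closed early
            break
        received += n
    elapsed = time.perf_counter() - start

    sender.join()
    reader.close()
    return buf, received, elapsed


if __name__ == "__main__":
    size = 64 * 1024 * 1024  # arbitrary "large" size chosen for illustration
    _, received, elapsed = fill_bytearray(size)
    print(f"received {received} bytes in {elapsed:.3f}s")
```

Running it with a small size (a few hundred bytes) as well as a large one gives a rough comparison between pool-allocated arrays and large ones. The absolute timings include the sender thread and socket overhead, so only the relative difference between interpreter builds is meaningful.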