gh-114264: Optimize performance of copy.deepcopy by adding a fast path for atomic types by eendebakpt · Pull Request #114266 · python/cpython

bedevere-app

import pyperf
runner = pyperf.Runner()

setup="""
import copy

a={'list': [1,2,3,43], 't': (1,2,3), 'str': 'hello', 'subdict': {'a': True}}

from dataclasses import dataclass

@dataclass
class A:
    a : list
    b : str
    c : bool
    
dc=A([1,2,3], 'hello', True)

@dataclass
class A:
    a : int
    
dc_small = A(123)

small_tuple = (1, )

l = {'hi': 100}
repeating_atomic = [ [1] * 100]
repeating = [dc_small] * 100
"""

runner.timeit(name="deepcopy dict", stmt=f"b=copy.deepcopy(a)", setup=setup)
runner.timeit(name="deepcopy dataclass", stmt=f"b=copy.deepcopy(dc)", setup=setup)
runner.timeit(name="deepcopy small dataclass", stmt=f"b=copy.deepcopy(dc_small)", setup=setup)
runner.timeit(name="deepcopy small tuple", stmt=f"b=copy.deepcopy(small_tuple)", setup=setup)
runner.timeit(name="deepcopy repeating", stmt=f"b=copy.deepcopy(repeating)", setup=setup)
runner.timeit(name="deepcopy repeating_atomic", stmt=f"b=copy.deepcopy(repeating_atomic)", setup=setup)

Comparison of the three alternative versions to main. This PR:

deepcopy dict: Mean +- std dev: [main] 6.82 us +- 0.07 us -> [pr] 4.68 us +- 0.08 us: 1.46x faster
deepcopy dataclass: Mean +- std dev: [main] 6.59 us +- 0.07 us -> [pr] 5.23 us +- 0.07 us: 1.26x faster
deepcopy small dataclass: Mean +- std dev: [main] 4.30 us +- 0.07 us -> [pr] 3.88 us +- 0.08 us: 1.11x faster
deepcopy small tuple: Mean +- std dev: [main] 1.07 us +- 0.01 us -> [pr] 925 ns +- 12 ns: 1.16x faster
deepcopy repeating: Mean +- std dev: [main] 21.8 us +- 0.2 us -> [pr] 24.9 us +- 0.4 us: 1.15x slower
deepcopy repeating_atomic: Mean +- std dev: [main] 24.6 us +- 0.5 us -> [pr] 10.1 us +- 0.2 us: 2.44x faster

Geometric mean: 1.31x faster

deepcopy dict: Mean +- std dev: [main] 6.82 us +- 0.07 us -> [v3] 6.34 us +- 0.22 us: 1.08x faster
deepcopy dataclass: Mean +- std dev: [main] 6.59 us +- 0.07 us -> [v3] 6.21 us +- 0.11 us: 1.06x faster
deepcopy small dataclass: Mean +- std dev: [main] 4.30 us +- 0.07 us -> [v3] 4.16 us +- 0.04 us: 1.03x faster
deepcopy small tuple: Mean +- std dev: [main] 1.07 us +- 0.01 us -> [v3] 1.03 us +- 0.02 us: 1.03x faster
deepcopy repeating: Mean +- std dev: [main] 21.8 us +- 0.2 us -> [v3] 22.0 us +- 0.6 us: 1.01x slower
deepcopy repeating_atomic: Mean +- std dev: [main] 24.6 us +- 0.5 us -> [v3] 21.2 us +- 0.6 us: 1.16x faster

Geometric mean: 1.06x faster

deepcopy dict: Mean +- std dev: [main] 6.82 us +- 0.07 us -> [v4] 4.93 us +- 0.05 us: 1.38x faster
deepcopy dataclass: Mean +- std dev: [main] 6.59 us +- 0.07 us -> [v4] 5.35 us +- 0.06 us: 1.23x faster
deepcopy small dataclass: Mean +- std dev: [main] 4.30 us +- 0.07 us -> [v4] 3.89 us +- 0.05 us: 1.11x faster
deepcopy small tuple: Mean +- std dev: [main] 1.07 us +- 0.01 us -> [v4] 938 ns +- 25 ns: 1.14x faster
deepcopy repeating: Mean +- std dev: [main] 21.8 us +- 0.2 us -> [v4] 26.4 us +- 0.5 us: 1.21x slower
deepcopy repeating_atomic: Mean +- std dev: [main] 24.6 us +- 0.5 us -> [v4] 12.4 us +- 0.4 us: 1.99x faster

Geometric mean: 1.23x faster

In this PR we avoid the cost of looking into the memo for atomic types which provides the main speedup. This comes at a performance degradation for objects containing many identical (same id) values. Based on the numbers above there is no peformance reason to pick version v4. For clarify or uniformity of the code it would not hurt a lot though to replace the pr with v4. I think one could construct benchmarks where v3 is faster than this pr (thinking about non-atomic types which have a custom __copy__ and recursive calls to deepcopy such as numpy arrays), but I expect differences to be small. Version v3 has the same performance on the "deepcopy repeating" as main (the 1.01x slower is a random fluctuation I believe, on average it will be 1.00x) and a small performance improvement for some other cases.

I also considered aligning the implementations of copy.copy and copy.deepcopy (for uniformity of the code), but decided against this initially. My line of thought:

There is no memo that can be skipped (which was the main performance benefit)
Calls of copy.copy on atomic types should be relative rare (and cheap anyway), so there is less to gain.
For calls on container types (like list, dict or set) there are no recursive calls to the components. This is in contrast with deepcopy where are call on a dataclass can recurse into atomic-types.

I have not updated the news entry yet with a description of the behaviour changes, because I think the behavior changes are not really something one can notice with normal use of deepcopy.
i) The first behavior change is that the number of invocations of memo.get is less. But memo.get is a read-only operation (on normal dicts), so this is not something visible from the public interface.
ii) In the example given above the memo passed is memo= {id(s2): s}. This memo is not really a valid memo, since we require for each key-value pair in the memo id(value)==key, which is not true in the example.
If desired I can add this to the news entry though.

The results above are only micro benchmarks and it will depend on the application which version will perform best. Applications where I have found the deepcopy to take a significant amount of time are lmfit, qiskit and quantify-scheduler, but I cannot run those with current main (I could with 3.12 though).

There is another PR improving the speed of deepcopy (gh-72793: C implementation of parts of copy.deepcopy #91610). That approach converts the python implementation to C but is more complex and will take time to review (and might not be accepted).