gh-72793: C implementation of parts of copy.deepcopy by eendebakpt · Pull Request #91610 · python/cpython

bedevere-bot

Conversation

The original idea and implementation are from @Villemoes. The original issue has been inactive for a long time.

Benchmark on deepcopy of a dict and dataclass:

deepcopy dict: Mean +- std dev: [b] 7.28 us +- 0.27 us -> [d] 982 ns +- 24 ns: 7.41x faster
deepcopy dataclass: Mean +- std dev: [b] 6.49 us +- 0.07 us -> [d] 3.69 us +- 0.11 us: 1.76x faster

Geometric mean: 3.61x faster

Updated benchmark: (23-12-2022, main at 1ecfd1e)

deepcopy dict: Mean +- std dev: [main] 6.56 us +- 0.07 us -> [pr] 821 ns +- 24 ns: 7.99x faster
deepcopy dataclass: Mean +- std dev: [main] 6.12 us +- 0.07 us -> [pr] 3.10 us +- 0.09 us: 1.97x faster

Geometric mean: 3.97x faster

Updated benchmark: (16-9-2023, main at e57ecf6)

deepcopy dict: Mean +- std dev: [main] 7.91 us +- 0.22 us -> [pr] 958 ns +- 49 ns: 8.25x faster
deepcopy dataclass: Mean +- std dev: [main] 7.74 us +- 0.26 us -> [pr] 3.65 us +- 0.27 us: 2.12x faster

Geometric mean: 4.19x faster

Updated benchmark (10-01-2024)

deepcopy dict: Mean +- std dev: [main] 6.82 us +- 0.07 us -> [c_impl] 922 ns +- 19 ns: 7.40x faster
deepcopy dataclass: Mean +- std dev: [main] 6.59 us +- 0.07 us -> [c_impl] 3.34 us +- 0.06 us: 1.98x faster
deepcopy small dataclass: Mean +- std dev: [main] 4.30 us +- 0.07 us -> [c_impl] 3.01 us +- 0.04 us: 1.43x faster
deepcopy small tuple: Mean +- std dev: [main] 1.07 us +- 0.01 us -> [c_impl] 115 ns +- 1 ns: 9.30x faster
deepcopy repeating: Mean +- std dev: [main] 21.8 us +- 0.2 us -> [c_impl] 8.66 us +- 0.28 us: 2.51x faster
deepcopy repeating_atomic: Mean +- std dev: [main] 24.6 us +- 0.5 us -> [c_impl] 1.28 us +- 0.02 us: 19.16x faster

Geometric mean: 4.59x faster

Benchmark details

Test script

Updated test script

import pyperf
runner = pyperf.Runner()

setup="""
import copy

a={'list': [1,2,3,43], 't': (1,2,3), 'str': 'hello', 'subdict': {'a': True}}

from dataclasses import dataclass

@dataclass
class A:
    a : list
    b : str
    c : bool
    
dc=A([1,2,3], 'hello', True)

@dataclass
class A:
    a : int
    
dc_small = A(123)

small_tuple = (1, )

l = {'hi': 100}
repeating_atomic = [ [1] * 100]
repeating = [dc_small] * 100
"""

runner.timeit(name="deepcopy dict", stmt=f"b=copy.deepcopy(a)", setup=setup)
runner.timeit(name="deepcopy dataclass", stmt=f"b=copy.deepcopy(dc)", setup=setup)
runner.timeit(name="deepcopy small dataclass", stmt=f"b=copy.deepcopy(dc_small)", setup=setup)
runner.timeit(name="deepcopy small tuple", stmt=f"b=copy.deepcopy(small_tuple)", setup=setup)
runner.timeit(name="deepcopy repeating", stmt=f"b=copy.deepcopy(repeating)", setup=setup)
runner.timeit(name="deepcopy repeating_atomic", stmt=f"b=copy.deepcopy(repeating_atomic)", setup=setup)

Old test script:

import pyperf
runner = pyperf.Runner()

setup="""
import copy

a={'list': [1,2,3,43], 't': (1,2,3), 'str': 'hello', 'subdict': {'a': True}}

from dataclasses import dataclass

@dataclass
class A:
    a : list
    b : str
    c : bool
    
dc=A([1,2,3], 'hello', True)
"""

runner.timeit(name=f"deepcopy dict", stmt=f"b=copy.deepcopy(a)", setup=setup)
runner.timeit(name=f"deepcopy dataclass", stmt=f"b=copy.deepcopy(dc)", setup=setup)

Fixes #72793

Pyperformance results

Pyperformance results show small speedup, although this could very well be a random fluctuation. (there are no explicit calls to deepcopy in the pyperformance tests)

2to3: Mean +- std dev: [base] 325 ms +- 5 ms -> [patch] 322 ms +- 4 ms: 1.01x faster
chaos: Mean +- std dev: [base] 87.8 ms +- 0.8 ms -> [patch] 88.3 ms +- 1.1 ms: 1.01x slower
float: Mean +- std dev: [base] 103 ms +- 2 ms -> [patch] 98.5 ms +- 2.5 ms: 1.04x faster
go: Mean +- std dev: [base] 163 ms +- 1 ms -> [patch] 165 ms +- 2 ms: 1.01x slower
json_dumps: Mean +- std dev: [base] 15.1 ms +- 0.5 ms -> [patch] 14.9 ms +- 0.1 ms: 1.01x faster
json_loads: Mean +- std dev: [base] 28.0 us +- 0.3 us -> [patch] 28.1 us +- 0.3 us: 1.01x slower
logging_format: Mean +- std dev: [base] 8.11 us +- 0.14 us -> [patch] 8.18 us +- 0.12 us: 1.01x slower
logging_silent: Mean +- std dev: [base] 133 ns +- 1 ns -> [patch] 130 ns +- 1 ns: 1.02x faster
meteor_contest: Mean +- std dev: [base] 128 ms +- 1 ms -> [patch] 126 ms +- 1 ms: 1.01x faster
nbody: Mean +- std dev: [base] 111 ms +- 2 ms -> [patch] 113 ms +- 2 ms: 1.01x slower
nqueens: Mean +- std dev: [base] 106 ms +- 1 ms -> [patch] 106 ms +- 1 ms: 1.00x slower
pathlib: Mean +- std dev: [base] 23.0 ms +- 0.4 ms -> [patch] 23.2 ms +- 0.7 ms: 1.01x slower
pickle: Mean +- std dev: [base] 10.5 us +- 0.1 us -> [patch] 10.6 us +- 0.4 us: 1.01x slower
pickle_dict: Mean +- std dev: [base] 30.9 us +- 0.2 us -> [patch] 31.3 us +- 0.3 us: 1.02x slower
pickle_list: Mean +- std dev: [base] 4.41 us +- 0.04 us -> [patch] 4.55 us +- 0.06 us: 1.03x slower
pyflate: Mean +- std dev: [base] 532 ms +- 8 ms -> [patch] 535 ms +- 4 ms: 1.01x slower
regex_compile: Mean +- std dev: [base] 163 ms +- 1 ms -> [patch] 165 ms +- 1 ms: 1.01x slower
regex_dna: Mean +- std dev: [base] 238 ms +- 2 ms -> [patch] 213 ms +- 1 ms: 1.12x faster
regex_effbot: Mean +- std dev: [base] 3.50 ms +- 0.03 ms -> [patch] 3.08 ms +- 0.03 ms: 1.14x faster
regex_v8: Mean +- std dev: [base] 29.2 ms +- 0.7 ms -> [patch] 25.8 ms +- 0.9 ms: 1.13x faster
richards: Mean +- std dev: [base] 57.4 ms +- 1.5 ms -> [patch] 58.2 ms +- 1.3 ms: 1.01x slower
scimark_fft: Mean +- std dev: [base] 447 ms +- 10 ms -> [patch] 452 ms +- 2 ms: 1.01x slower
scimark_lu: Mean +- std dev: [base] 142 ms +- 2 ms -> [patch] 145 ms +- 1 ms: 1.02x slower
scimark_sor: Mean +- std dev: [base] 155 ms +- 1 ms -> [patch] 156 ms +- 2 ms: 1.01x slower
scimark_sparse_mat_mult: Mean +- std dev: [base] 6.18 ms +- 0.11 ms -> [patch] 6.03 ms +- 0.06 ms: 1.02x faster
spectral_norm: Mean +- std dev: [base] 150 ms +- 7 ms -> [patch] 146 ms +- 3 ms: 1.02x faster
unpack_sequence: Mean +- std dev: [base] 52.6 ns +- 1.0 ns -> [patch] 52.2 ns +- 0.6 ns: 1.01x faster
unpickle_list: Mean +- std dev: [base] 5.77 us +- 0.06 us -> [patch] 5.88 us +- 0.10 us: 1.02x slower
xml_etree_parse: Mean +- std dev: [base] 176 ms +- 3 ms -> [patch] 173 ms +- 2 ms: 1.02x faster

Benchmark hidden because not significant (17): deltablue, fannkuch, hexiom, logging_simple, pickle_pure_python, pidigits, python_startup, python_startup_no_site, raytrace, scimark_monte_carlo, sqlite_synth, telco, unpickle, unpickle_pure_python, xml_etree_iterparse, xml_etree_generate, xml_etree_process

Geometric mean: 1.01x faster

Notes

Issue: C implementation of parts of copy.deepcopy #72793

We keep the original python code for the implementation of copy.deepcopy (see https://peps.python.org/pep-0399/). In the CI we test both the pure python and accelerator version of deepcopy.
The performance improvement of deepcopy has a measurable impact for certain operations in projects such as lmfit and qiskit.

This was referenced

May 12, 2022

@serhiy-storchaka As the latest core develop working on this file, would you be able to review this PR?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't feel comfortable reviewing the C code, but you will need to add tests to ensure the fallback and C implementation have the same inputs/outputs/etc -- the easiest way would be to duplicate the tests and run one set for the C accelerator and one for the Python version.

I don't feel comfortable reviewing the C code, but you will need to add tests to ensure the fallback and C implementation have the same inputs/outputs/etc -- the easiest way would be to duplicate the tests and run one set for the C accelerator and one for the Python version.

A

@AA-Turner The _deepcopy_fallback is only used for objects not handled by the C deepcopy method (and vice versa). therefore we cannot test both the fallback and C on the same inputs. The deepcopy is the public method (and _deepcopy_fallback is called via deepcopy), so I think we only need to test on deepcopy. If there are any tests you would like me to add, let me know.

Note: in earlier versions of the PR the fallback there was a funny try-except statement to fix a build problem. This I resolved by making _copy a builtin module.

The _deepcopy_fallback is only used for objects not handled by the C deepcopy method (and vice versa). therefore we cannot test both the fallback and C on the same inputs.

If both functions exist and can be theoretically used, both must be tested. You can import copy._deepcopy_fallback and import _copy.deepcopy to test the functions independently. If I understand correctly, in your current patch _deepcopy_fallback is only ever called from the C layer. If so, you should make this more clear -- oftentimes the C accelerator has a pure Python fallback which implements the same method when the extension module can't be loaded.

A Python core developer has requested some changes be made to your pull request before we can consider merging it. If you could please address their requests along with any other requests in other reviews from core developers that would be appreciated.

Once you have made the requested changes, please leave a comment on this pull request containing the phrase I have made the requested changes; please review again. I will then notify any core developers who have left a review that you're ready for them to take another look at this pull request.

There are some issues with the copy module. I prefer to resolve them prior to adding the C implementation.

The _deepcopy_fallback is only used for objects not handled by the C deepcopy method (and vice versa). therefore we cannot test both the fallback and C on the same inputs.

If both functions exist and can be theoretically used, both must be tested. You can import copy._deepcopy_fallback and import _copy.deepcopy to test the functions independently. If I understand correctly, in your current patch _deepcopy_fallback is only ever called from the C layer. If so, you should make this more clear -- oftentimes the C accelerator has a pure Python fallback which implements the same method when the extension module can't be loaded.

A

@tiran noted the fallback will be used by PyPy, so it must indeed be tested. I will add the required tests

We have helper code to block / force imports to test both pure and C accelerated features.

from test.support.import_helper import import_fresh_module

copy_py = import_fresh_module('copy', blocked=['_copy'])
try:
    copy_c = import_fresh_module('copy', fresh=['_copy'])
except ImportError:
    copy_c = None

I addressed some of the review comments.

@serhiy-storchaka Could you indicate which issues with the copy module there are, and whether there already is a timeline on addressing them?

@erlend-aasland The failing tests where due to the recent addition of copy.replace. I added the unit tests for new functionality in the same structure as the other unit tests. This means that the pure python method copy.replace is tested twice (which is a bit redundant, but with better consistency overall).

The PR is ready for review, although I am not sure the item Serhiy raised should be resolved first. See #103035 (comment) and #109498.

For any reviewers: I also created a performance improvement for the python implementation: #114266.

Depending on the outcome of that PR and some (minor) changes I want to do to make this work in free-threading I will update the benchmarks.

@eendebakpt, also consider adapting to free-threading in a follow-up PR; this PR is already a huge diff.

The following commit authors need to sign the Contributor License Agreement:

unknow@example.com

Click the button to sign:

Update of the benchmarks with recent changes to main:

deepcopy dict: Mean +- std dev: [main] 3.75 us +- 0.17 us -> [pr] 587 ns +- 25 ns: 6.40x faster
deepcopy dataclass: Mean +- std dev: [main] 4.21 us +- 0.08 us -> [pr] 2.57 us +- 0.11 us: 1.64x faster
deepcopy small dataclass: Mean +- std dev: [main] 3.03 us +- 0.12 us -> [pr] 2.28 us +- 0.10 us: 1.33x faster
deepcopy small tuple: Mean +- std dev: [main] 732 ns +- 21 ns -> [pr] 72.8 ns +- 2.5 ns: 10.05x faster
deepcopy repeating: Mean +- std dev: [main] 20.7 us +- 0.7 us -> [pr] 6.49 us +- 0.32 us: 3.18x faster
deepcopy repeating_atomic: Mean +- std dev: [main] 10.2 us +- 0.3 us -> [pr] 801 ns +- 33 ns: 12.79x faster

Geometric mean: 4.23x faster

To make a fair comparison between current main and the C implementation in this PR I looked at the python implementation again to see whether there are optimizations we do in the C implementation that could also be applied in the Python main. There are quite some options:

Remove the redundant _nil argument to deepcopy
Replace try-except constructions with dictionaries with a d.get(key, sentinel) to avoid expensive creation of exceptions
Use comprehensions in the deepcopy_list and deepcopy_tuple
In deepcopy_tuple combine the creation of y and the check for k, j in zip(x, y): into a single for loop
Inline the if cls in _atomic_types: return x part at various locations
Inline the call to the (only once used) _keep_alive

I will not make PRs for these, because each of them has a very high likelihood of being rejected because either the performance gain is small or the change would reduce readability or maintainability of the code (for the interested reader: I would give the last option the highest probability of success).

If you're looking for a faster deepcopy today, you may find copium useful.

It's a complete rewrite of deepcopy in C, compatible with Python 3.10-3.14(t), available on PyPI.

It was initially based on this PR, but quickly grew into a much larger project. It's still WIP, but already has extensive test coverage and provides real speed boost.

Benchmarks

Benchmark	copy_3.14	copium_3.14
deepcopy dict	3.77 us	396 ns: 9.52x faster
deepcopy dataclass	4.01 us	857 ns: 4.68x faster
deepcopy small dataclass	3.01 us	681 ns: 4.43x faster
deepcopy small tuple	826 ns	75.7 ns: 10.91x faster
deepcopy repeating	20.8 us	1.67 us: 12.42x faster
deepcopy repeating_atomic	8.69 us	686 ns: 12.68x faster
Geometric mean	(ref)	8.35x faster

On 3.13 and lower, there's even larger speedup.

Benchmark	copy_3.13	copium_3.13
deepcopy dict	5.77 us	394 ns: 14.63x faster
deepcopy dataclass	5.33 us	780 ns: 6.83x faster
deepcopy small dataclass	3.29 us	620 ns: 5.31x faster
deepcopy small tuple	882 ns	66.1 ns: 13.34x faster
deepcopy repeating	18.1 us	1.80 us: 10.03x faster
deepcopy repeating_atomic	21.0 us	675 ns: 31.19x faster
Geometric mean	(ref)	11.42x faster

Reproduction

cat > /tmp/benchmark.py << 'PY'
import pyperf
runner = pyperf.Runner()

setup="""
import copy

a={'list': [1,2,3,43], 't': (1,2,3), 'str': 'hello', 'subdict': {'a': True}}

from dataclasses import dataclass

@dataclass
class A:
    a : list
    b : str
    c : bool

dc=A([1,2,3], 'hello', True)

@dataclass
class A:
    a : int

dc_small = A(123)

small_tuple = (1, )

l = {'hi': 100}
repeating_atomic = [ [1] * 100]
repeating = [dc_small] * 100
"""

runner.timeit(name="deepcopy dict", stmt=f"b=copy.deepcopy(a)", setup=setup)
runner.timeit(name="deepcopy dataclass", stmt=f"b=copy.deepcopy(dc)", setup=setup)
runner.timeit(name="deepcopy small dataclass", stmt=f"b=copy.deepcopy(dc_small)", setup=setup)
runner.timeit(name="deepcopy small tuple", stmt=f"b=copy.deepcopy(small_tuple)", setup=setup)
runner.timeit(name="deepcopy repeating", stmt=f"b=copy.deepcopy(repeating)", setup=setup)
runner.timeit(name="deepcopy repeating_atomic", stmt=f"b=copy.deepcopy(repeating_atomic)", setup=setup)
PY

py=3.14
rm -f /tmp/copium_$py.json \
&& uv run --python $py --with 'copium[autopatch]' --with pyperf python /tmp/benchmark.py -o /tmp/copium_$py.json \
&& rm -f /tmp/copy_$py.json \
&& uv run --python $py --with pyperf python /tmp/benchmark.py -o /tmp/copy_$py.json \
&& uvx pyperf compare_to /tmp/copy_$py.json /tmp/copium_$py.json --table --table-format md

@Bobronium Thanks for the link! I tried out copium, and it works as advertised!

Historically there have been two reasons for me to pursue this PR:

If included in CPython it is available for everyone (batteries included!)
When creating a pypi package as an alternative for deepcopy, it would not automatically be used by other pip packages making use of copy.deepcopy.

The second point is handled by copium by patching the vectorcall function of the deepcopy object (see remark Zero overhead patch). Not all projects will be comfortable with such patching, but it works!

Unless there is interest from a core dev to review the PR (in which case I will rebase) I am going to close the PR.

Reviewers

tiran tiran left review comments

AA-Turner AA-Turner left review comments

kumaraditya303 kumaraditya303 requested changes

corona10 Awaiting requested review from corona10 corona10 is a code owner

erlend-aasland Awaiting requested review from erlend-aasland erlend-aasland is a code owner

+1 more reviewer

gh-72793: C implementation of parts of copy.deepcopy by eendebakpt · Pull Request #91610 · python/cpython

Conversation

Choose a reason for hiding this comment

Benchmarks

Reviewers

Labels