Issue 38015: inline function generates slightly inefficient machine code
Created on 2019-09-03 04:02 by malin, last changed 2022-04-11 14:59 by admin. This issue is now closed.
Messages (12)
msg351052 - (view)
Author: Ma Lin (malin) *
Date: 2019-09-03 04:02
Date: 2019-09-03 04:27
Date: 2019-09-09 09:45
Commit 5e63ab0 replaces macro with this inline function:
static inline int
is_small_int(long long ival)
{
return -NSMALLNEGINTS <= ival && ival < NSMALLPOSINTS;
}
(by default, NSMALLNEGINTS is 5, NSMALLPOSINTS is 257)
However, when invoking this function, and `sizeof(value) < sizeof(long long)`, there is an unnecessary type casting.
For example, on 32-bit platform, if `value` is `Py_ssize_t`, it needs to be converted to 8-byte `long long` type.
The following assembly code is the beginning part of `PyLong_FromSsize_t(Py_ssize_t v)` function.
(32-bit x86 build generated by GCC 9.2, with `-m32 -O2` option)
Use macro before commit 5e63ab0:
mov eax, DWORD PTR [esp+4]
add eax, 5
cmp eax, 261
ja .L2
sal eax, 4
add eax, OFFSET FLAT:small_ints
add DWORD PTR [eax], 1
ret
.L2: jmp PyLong_FromSsize_t_rest(int)
Use inlined function:
push ebx
mov eax, DWORD PTR [esp+8]
mov edx, 261
mov ecx, eax
mov ebx, eax
sar ebx, 31
add ecx, 5
adc ebx, 0
cmp edx, ecx
mov edx, 0
sbb edx, ebx
jc .L7
cwde
sal eax, 4
add eax, OFFSET FLAT:small_ints+80
add DWORD PTR [eax], 1
pop ebx
ret
.L7: pop ebx
jmp PyLong_FromSsize_t_rest(int)
On 32-bit x86 platform, 8-byte `long long` is implemented in using two registers, so the machine code is much longer than macro version.
At least these hot functions are suffered from this:
PyObject* PyLong_FromSsize_t(Py_ssize_t v)
PyObject* PyLong_FromLong(long v)
Replacing the inline function with a macro version will fix this:
#define IS_SMALL_INT(ival) (-NSMALLNEGINTS <= (ival) && (ival) < NSMALLPOSINTS)
If you want to see assembly code generated by major compilers, you can paste attached file demo.c to https://godbolt.org/
- demo.c was original written by Greg Price.
- use `-m32 -O2` to generate 32-bit build.
msg351053 - (view)
Author: Raymond Hettinger (rhettinger) *
Date: 2019-09-03 04:27
Do you see a way to salvage the commit? If not, I'm fine with reverting it. Its benefit was only aesthetic.msg351054 - (view) Author: Ma Lin (malin) * Date: 2019-09-03 04:35
There will always be a new commit, replacing with a macro version also looks good. I have no opinion, both are fine.msg351228 - (view) Author: Ma Lin (malin) * Date: 2019-09-06 04:31
Revert commit 5e63ab0 or use PR 15710, both are fine.msg351255 - (view) Author: Sergey Fedoseev (sir-sigurd) * Date: 2019-09-06 14:28
I added similar patch that replaces get_small_int() with macro version, since it also induces unnecessary casts and makes machine code less efficient. Example assembly can be checked at https://godbolt.org/z/1SjG3E. This change produces tiny, but measurable speed-up for handling small ints: $ python -m pyperf timeit -s "from collections import deque; consume = deque(maxlen=0).extend; r = range(256)" "consume(r)" --compare-to=../cpython-master/venv/bin/python --duplicate=1000 /home/sergey/tmp/cpython-master/venv/bin/python: ..................... 1.03 us +- 0.08 us /home/sergey/tmp/cpython-dev/venv/bin/python: ..................... 973 ns +- 18 ns Mean +- std dev: [/home/sergey/tmp/cpython-master/venv/bin/python] 1.03 us +- 0.08 us -> [/home/sergey/tmp/cpython-dev/venv/bin/python] 973 ns +- 18 ns: 1.05x faster (-5%)msg351278 - (view) Author: Ma Lin (malin) * Date: 2019-09-07 03:57
This range has not been changed since "preallocated small integer pool" was introduced: #define NSMALLPOSINTS 257 #define NSMALLNEGINTS 5 The commit (Jan 2007): https://github.com/python/cpython/commit/ddefaf31b366ea84250fc5090837c2b764a04102 Is it worth increase the range? FYI, build with MSVC 2017, the `small_ints` size: 32-bit build: sizeof(PyLongObject) 16 bytes sizeof(small_ints) 4192 bytes 64-bit build: sizeof(PyLongObject) 32 bytes sizeof(small_ints) 8384 bytesmsg351315 - (view) Author: Ma Lin (malin) * Date: 2019-09-08 05:53
> This change produces tiny, but measurable speed-up for handling small ints
I didn't get measurable change, I run this command a dozen times and take the best result:
D:\dev\cpython\PCbuild\amd64\python.exe -m pyperf timeit -s "from collections import deque; consume = deque(maxlen=0).extend; r = range(256)" "consume(r)" --duplicate=1000
before: Mean +- std dev: 771 ns +- 16 ns
after: Mean +- std dev: 770 ns +- 10 ns
Environment:
64-bit release build by MSVC 2017
CPU: i3 4160, System: latest Windows 10 64-bit
Check the machine code from godbolt.org, x64 MSVC v19.14 only saves one instruction:
movsxd rax, ecx
x86-64 GCC 9.2 saves two instructions:
lea eax, [rdi+5]
cdqe
msg351316 - (view)
Author: Sergey Fedoseev (sir-sigurd) *
Date: 2019-09-08 05:59
I use GCC 9.2.msg351348 - (view) Author: Greg Price (Greg Price) * Date: 2019-09-09 05:41
(Just to help keep discussions together: some earlier discussion was on GH-15216 .) Because is_small_int / IS_SMALL_INT is so small, there's not much cost in the source code to making it a macro (as GH-15710 did). But I think it'd be a mistake to go a lot farther than that and convert significantly larger chunks of code like get_small_int to macros (as GH-15718 would), unless the payoff were really commensurate. It'd be better to focus optimization efforts where profiling shows a lot of time is really being spent.msg351387 - (view) Author: Benjamin Peterson (benjamin.peterson) *
Date: 2019-09-09 09:45
I agree with msg351348. The performance wins are very context-dependent and even then very small. It's not worth further disturbing the small int code. So, we should close this issue out.msg351564 - (view) Author: Ma Lin (malin) * Date: 2019-09-10 02:18
PR 15710 has been merged into the master, but the merge message is not shown here. Commit: https://github.com/python/cpython/commit/6b519985d23bd0f0bd072b5d5d5f2c60a81a19f2 Maybe this issue can be closed.msg352692 - (view) Author: Greg Price (Greg Price) * Date: 2019-09-18 05:11
See followup at https://bugs.python.org/issue38205 and https://bugs.python.org/issue37812#msg352670 . The patch in GH-15710 had a side effect of introducing a call to `Py_UNREACHABLE` inside a comma-expression. A subsequent commit 3ab61473b changed `Py_UNREACHABLE` in a way that made that a syntax error; the issue wasn't caught then because the code is only compiled if `NSMALLNEGINTS + NSMALLPOSINTS <= 0`. Detailed discussion on those threads.
History
Date
User
Action
Args
2022-04-11 14:59:19adminsetgithub: 82196
2019-09-18 05:11:29Greg Pricesetmessages:
+ msg352692
2019-09-10 02:18:12malinsetstatus: open -> closed
resolution: fixed
messages: + msg351564
messages: + msg351387
2019-09-09 05:41:42Greg Pricesetmessages: + msg351348 2019-09-08 05:59:21sir-sigurdsetmessages: + msg351316 2019-09-08 05:53:17malinsetmessages: + msg351315 2019-09-07 03:57:06malinsetmessages: + msg351278 2019-09-06 14:28:25sir-sigurdsetmessages: + msg351255 2019-09-06 14:21:20sir-sigurdsetpull_requests: + pull_request15372 2019-09-06 04:31:10malinsetmessages: + msg351228 2019-09-06 04:26:53malinsetkeywords: + patch
stage: patch review
pull_requests: + pull_request15365 2019-09-03 04:35:31malinsetmessages: + msg351054 2019-09-03 04:27:46rhettingersetassignee: rhettinger
messages: + msg351053 2019-09-03 04:02:44malincreate
resolution: fixed
messages: + msg351564
stage: patch review -> resolved
2019-09-09 09:45:59benjamin.petersonsetnosy: + benjamin.petersonmessages: + msg351387
2019-09-09 05:41:42Greg Pricesetmessages: + msg351348 2019-09-08 05:59:21sir-sigurdsetmessages: + msg351316 2019-09-08 05:53:17malinsetmessages: + msg351315 2019-09-07 03:57:06malinsetmessages: + msg351278 2019-09-06 14:28:25sir-sigurdsetmessages: + msg351255 2019-09-06 14:21:20sir-sigurdsetpull_requests: + pull_request15372 2019-09-06 04:31:10malinsetmessages: + msg351228 2019-09-06 04:26:53malinsetkeywords: + patch
stage: patch review
pull_requests: + pull_request15365 2019-09-03 04:35:31malinsetmessages: + msg351054 2019-09-03 04:27:46rhettingersetassignee: rhettinger
messages: + msg351053 2019-09-03 04:02:44malincreate