bpo-37802: Slightly improve performance of PyLong_FromSize_t() #15192

sir-sigurd

https://bugs.python.org/issue37802

mdickinson

LGTM

gnprice

BTW, I think the behavior of CHECK_SMALL_INT with its implicit return (and similarly the new CHECK_SMALL_UINT) is pretty surprising when reading the code. I wrote up a patch the other day that replaces it with an explicit return; I'll take this as a prompt to go make an issue for it and send a PR.

That's certainly not a reason not to merge this, though. I'll happily rebase my changes on top of this once it's in.
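To illustrate the concern about a hidden return, here is a minimal sketch contrasting a macro that returns from its enclosing function with an explicit early return. All names here (`IS_SMALL`, `CHECK_SMALL`, `get_small`, `box_implicit`, `box_explicit`) and the cache bounds are hypothetical stand-ins chosen for the example; they are not CPython's actual definitions.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical small-value cache covering [-5, 256]. */
#define NSMALLNEG 5
#define NSMALLPOS 257

#define IS_SMALL(v) (-NSMALLNEG <= (v) && (v) < NSMALLPOS)

static int small_cache[NSMALLNEG + NSMALLPOS];

/* Return the cache slot for a small value (precondition: v is small). */
static int *get_small(long v) {
    assert(IS_SMALL(v));
    int *slot = &small_cache[v + NSMALLNEG];
    *slot = (int)v;
    return slot;
}

/* Implicit style: the macro returns from the *enclosing* function. */
#define CHECK_SMALL(v) \
    do { if (IS_SMALL(v)) return get_small(v); } while (0)

static int *box_implicit(long v) {
    CHECK_SMALL(v);   /* may return here; not visible at the call site */
    return NULL;      /* stands in for the slow allocation path */
}

/* Explicit style: the early return is visible where it happens. */
static int *box_explicit(long v) {
    if (IS_SMALL(v)) {
        return get_small(v);
    }
    return NULL;      /* stands in for the slow allocation path */
}
```

Both functions behave identically; the difference is only in whether a reader of the function body can see the early exit.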

gnprice

> BTW, I think the behavior of CHECK_SMALL_INT with its implicit return (and similarly the new CHECK_SMALL_UINT) is pretty surprising when reading the code. I wrote up a patch the other day that replaces it with an explicit return; I'll take this as a prompt to go make an issue for it and send a PR.
>
> That's certainly not a reason not to merge this, though. I'll happily rebase my changes on top of this once it's in.

(Posted; see https://bugs.python.org/issue37812 and #15203 .)

gnprice

This looks good to me... except it's not ambitious enough! 🙂 Details below.

(Happily the more ambitious version is only a line or two more.)

sir-sigurd

@mdickinson is there anything that should be done to get this merged?

sir-sigurd

I decided to deduplicate PyLong_FromUnsigned* functions in this PR, to apply this optimization to all of them at once.
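As a rough illustration of what deduplicating the unsigned conversions could look like, the sketch below routes two per-type entry points through one shared helper. It only splits a value into fixed-width digits; the names (`split_uint`, `digits_from_ulong`, `digits_from_size_t`), the 15-bit digit width, and the buffer size are assumptions made for the example and do not reproduce the code in this PR.

```c
#include <assert.h>
#include <stddef.h>

#define DIGIT_SHIFT 15                        /* assumed digit width */
#define DIGIT_MASK  ((1U << DIGIT_SHIFT) - 1)
#define MAX_DIGITS  8                         /* room for a 64-bit value */

/* Shared helper: write the little-endian digits of ival into out and
   return how many digits were written (0 for ival == 0). */
static size_t split_uint(unsigned long long ival, unsigned out[MAX_DIGITS]) {
    size_t n = 0;
    while (ival != 0) {
        assert(n < MAX_DIGITS);
        out[n++] = (unsigned)(ival & DIGIT_MASK);
        ival >>= DIGIT_SHIFT;
    }
    return n;
}

/* Thin per-type wrappers: one line each instead of one full copy each. */
static size_t digits_from_ulong(unsigned long v, unsigned out[MAX_DIGITS]) {
    return split_uint(v, out);
}

static size_t digits_from_size_t(size_t v, unsigned out[MAX_DIGITS]) {
    return split_uint(v, out);
}
```

With a single helper, any optimization applied to it benefits every wrapper at once, at the price of widening each argument to the helper's parameter type.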

gnprice

Neat! This simplifies the code and it's faster (based on the measurements you posted in the issue thread) -- that's a nice combination. 😄

I'd be glad to see the signed versions get a similar treatment. (As a separate PR.) Might not deliver a performance win at the same level, since your hypothesis is that the speedup here is from cutting out the sign logic...

but it should be performance-neutral at worst, and it's plausible it could be an improvement by generating less code so it's friendlier to the cache. Also could speed things up just as a natural consequence of deduplicating the code, by making a single place to apply all the optimizations we have in one function or another, plus any new ones you think of while staring at it.

(And even if it does turn out only performance-neutral: I think opinions will vary, but for code deduplication at this scale I'd certainly be in favor.)

gnprice

Wise words! Yet I think for doc-comments the risk is worth it. If the comment simply describes the interface of the function, i.e. the things a caller needs to know about it... then if that ever needs an update, there will be much bigger burdens than updating the comment. Especially for functions in the public API like these. But that is why I would like the comments to describe the interface directly (like they do in master), rather than have a dependency on something more fluid like the arrangement of this code. :-)

On Thu, Aug 29, 2019, 11:16 Sergey Fedoseev wrote:

> In Objects/longobject.c:
>
> @@ -410,6 +414,30 @@ PyLong_FromUnsignedLong(unsigned long ival) return (PyObject *)v; } +/* Same as above, but for unsigned long. */
>
> Non-existing comments never lie and don't require updates 🙂.

sir-sigurd

The current approach using an inline function has the same problems as described here (https://bugs.python.org/issue38015). I'll rework it once #15718 is merged.

gnprice

> The current approach using an inline function has the same problems as described here (https://bugs.python.org/issue38015). I'll rework it once #15718 is merged.

Hmm, you measured this as being a significant speedup in your microbenchmark, right? That hardly seems like a problem 😉 even if there's a way to make the microbenchmark go even faster still.

I like this PR as it is 🙂 and I'd be glad to see it merged. It actually simplifies the code at the same time as optimizing it, so there's no need to weigh one value against another.

As I commented over on #15718, turning get_small_int into a macro makes the code inside it quite a bit more complex to read. For this one... is the idea that you would write long_from_uint as a macro? That's a 20-line function, so that sounds probably worse :-/. I think that'd be much too big a cost to pay in the code for the payoff of this microbenchmark speedup.

If that is your plan, then it's also basically additive upon the change in this PR. So I think it'd still be good to merge this PR, even if a subsequent PR goes on to turn the function into a macro.

sir-sigurd

> Hmm, you measured this as being a significant speedup in your microbenchmark, right? That hardly seems like a problem 😉 even if there's a way to make the microbenchmark go even faster still.

It induces the same problems as your is_small_int() function on 32-bit platforms.

gnprice

> It induces the same problems as your is_small_int() function on 32-bit platforms.

Right but that "problem" is that a few extra instructions are emitted, right? And so the microbenchmark is a few percent slower than it would be.

If the microbenchmark is nevertheless faster than before, then this is still an optimization.

sir-sigurd

> > It induces the same problems as your is_small_int() function on 32-bit platforms.
>
> Right but that "problem" is that a few extra instructions are emitted, right? And so the microbenchmark is a few percent slower than it would be.
>
> If the microbenchmark is nevertheless faster than before, then this is still an optimization.

https://bugs.python.org/msg351052

I leave it to you to benchmark it on a 32-bit platform.

gnprice

> I leave it to you to benchmark it on a 32-bit platform.

Hmm -- do you mean you haven't tried it on a 32-bit platform? Because you said that it had a problem there, I assumed that meant you'd seen something empirically.

If you have, I think it would be helpful to say what results you got.

sir-sigurd

I mean I checked how it compiles on godbolt.org, and that was sufficient for me not to benchmark it.

gnprice

> I mean I checked how it compiles on godbolt.org, and that was sufficient for me not to benchmark it.

Cool, got it.

Well -- as I said upthread, if this PR were to end up making something as big as long_from_uint into a macro, I think that would be much more of a cost than the micro-optimization is worth.

If you can find a way to get all the desired speedups without that kind of complexity in the code, that'd be great.

As I see it, I would also be glad to see a version of this PR merged that delivered something like:

- a nice simplification to the code (like your current version here)
- ~10% speedup in microbenchmark on x86_64 (like you quoted for this version on the issue thread)
- ~5% slowdown in microbenchmark on x86_32 (like you found for get_small_ints as function vs. macro)

because that's good for the source code and no worse than a wash on performance.

(Note also that anyone who's working to squeeze out the last drop of performance in running their code is much more likely to be using a 64-bit platform already.)

aeros

@gnprice:

> Note also that anyone who's working to squeeze out the last drop of performance in running their code is much more likely to be using a 64-bit platform already

Agreed, I don't think anyone using a 32-bit platform is going to be overly concerned about a 5% drop in performance. A more significant performance loss may be an issue, but not when it's that small. IMO, an equivalent or larger increase in the performance for 64-bit outweighs an equal or smaller loss for 32-bit (within a reasonable amount).

sir-sigurd

> If you can find a way to get all the desired speedups without that kind of complexity in the code, that'd be great.

I agree that it'd be great to write some generic function, but C doesn't allow writing type-generic functions, and using an inline function here induces unwanted type casts that make the emitted code less efficient in unpredictable ways. Such an inline function creates a false feeling that it works like a type-generic function (as it was with is_small_int()), but it doesn't.

There are platforms besides x86, and I don't want to spend time assessing how an inline function degrades performance on each of these platforms. Do you?

(update):

> ~5% slowdown in microbenchmark on x86_32 (like you found for get_small_ints as function vs. macro)

Where did you get this ~5%? Did you run a benchmark?

sir-sigurd

Here's an inline function vs. macro demo: https://godbolt.org/z/K5tbF2.
On x86-32:

- inline function version: 53 instructions
- macro version: 38 instructions
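The inline-function-versus-macro trade-off being debated can be sketched as follows. This is a simplified illustration rather than the code behind the godbolt link: the names `is_small_fn`, `IS_SMALL_MACRO`, `check_u32_fn`, `check_u32_macro`, and the bound `SMALL_MAX` are hypothetical. The function version converts every argument to its single parameter type, while the macro compares in the caller's own type; whether the generated code actually differs depends on the compiler and target.

```c
#include <stdint.h>

#define SMALL_MAX 256   /* hypothetical upper bound of the small range */

/* Function version: the argument is converted to uint64_t at the call,
   so a 32-bit caller may end up with a 64-bit comparison unless the
   compiler inlines the call and narrows it again. */
static inline int is_small_fn(uint64_t v) {
    return v <= SMALL_MAX;
}

/* Macro version: the comparison is done in the argument's own type. */
#define IS_SMALL_MACRO(v) ((v) <= SMALL_MAX)

/* Two 32-bit callers with identical results but possibly different
   generated code. */
static int check_u32_fn(uint32_t v) {
    return is_small_fn(v);
}

static int check_u32_macro(uint32_t v) {
    return IS_SMALL_MACRO(v);
}
```

The results are the same for every input; only the instruction sequence a given compiler emits may differ, which is what the instruction counts above measure.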

vstinner

> On x86-32

Do you mean x86 (32-bit) or x32 (64-bit integers, but 32-bit pointers)?

sir-sigurd

@vstinner x86 (32 bits).

gnprice

> > ~5% slowdown in microbenchmark on x86_32 (like you found for get_small_ints as function vs. macro)
>
> Where did you get this ~5%? Did you run a benchmark?

(That's from here:
https://bugs.python.org/issue38015#msg351255

Just as you quoted me saying: it's what you found for get_small_ints as a function vs. as a macro.)

gnprice

> There are platforms besides x86, and I don't want to spend time assessing how an inline function degrades performance on each of these platforms. Do you?

Consider this the other way around: I don't think it's a good use of time to investigate what small unnecessary bits of work each compiler on each platform emits, or to go to great lengths in our code to coax them into doing a bit better.

(Remember that there's absolutely nothing in the language that means a compiler can't emit the exact same code in the static-function case as you're seeing in the macro case -- it's purely a matter of how smart the compiler gets. See e.g. this demo: #15216 (comment) )

That can be well worth it when the payoff is large -- when there's an opportunity for a significant win on ordinary Python code. (Or as a good proxy for that: a large win on code that shows up in profiles of ordinary Python code.) But many-line macros and other code complications have a real cost, and the payoff has to be worth the cost.

Compilers will emit less-than-optimal code. That's a fact of life, no matter what we do to our code. We have to keep going anyway and write the code in a way that helps it make sense to each other as humans.

When the compiler emits code that's much slower than it could be, in a place where that significantly matters, that's one thing... but if we insisted on squeezing the last cycle out of every function, we'd never get anything else done.

sir-sigurd

> > > ~5% slowdown in microbenchmark on x86_32 (like you found for get_small_ints as function vs. macro)
>
> > Where did you get this ~5%? Did you run a benchmark?
>
> (That's from here:
> https://bugs.python.org/issue38015#msg351255
>
> Just as you quoted me saying: it's what you found for get_small_ints as a function vs. as a macro.)

This 5% difference is the cost of 2 extra instructions on x86_64, and if you check the demo, you can see that in this case the difference is more substantial (53 vs. 38 instructions, some of them in a loop).

sir-sigurd

> But many-line macros and other code complications have a real cost, and the payoff has to be worth the cost.

This code is trivial; making it a macro doesn't make it less trivial. It makes it less "beautiful", but this code was modified once in the last 10 years, so I doubt there are many people interested in reading it.

> (Remember that there's absolutely nothing in the language that means a compiler can't emit the exact same code in the static-function case as you're seeing in the macro case -- it's purely a matter of how smart the compiler gets. See e.g. this demo: #15216 (comment) )

My PR provides a speed-up by eliminating these lines:

cpython/Objects/longobject.c, lines 390 to 391 in c59295a:

    if (ival < PyLong_BASE)
        return PyLong_FromLong(ival);

I guess the person who wrote these lines many years ago was thinking the same way as you. Years have passed, but compilers are still not that smart.
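To make the effect of removing that delegation concrete, here is a toy sketch that only counts digits instead of building an object. The names (`ndigits_signed`, `ndigits_unsigned_old`, `ndigits_unsigned_new`) and the 15-bit digit width are assumptions for illustration; the real functions in longobject.c allocate and fill a PyLongObject.

```c
#include <stddef.h>

#define SHIFT 15                 /* assumed digit width */
#define BASE  (1UL << SHIFT)

/* Signed path: has to work out sign and magnitude before counting. */
static size_t ndigits_signed(long ival, int *negative) {
    unsigned long abs_ival;
    size_t n = 0;
    *negative = ival < 0;
    abs_ival = *negative ? 0UL - (unsigned long)ival : (unsigned long)ival;
    for (; abs_ival != 0; abs_ival >>= SHIFT) {
        n++;
    }
    return n;
}

/* Old shape: small unsigned values detour through the signed path. */
static size_t ndigits_unsigned_old(unsigned long ival) {
    size_t n = 0;
    if (ival < BASE) {
        int negative;
        return ndigits_signed((long)ival, &negative);
    }
    for (; ival != 0; ival >>= SHIFT) {
        n++;
    }
    return n;
}

/* New shape: no detour, so no sign handling on the unsigned path. */
static size_t ndigits_unsigned_new(unsigned long ival) {
    size_t n = 0;
    for (; ival != 0; ival >>= SHIFT) {
        n++;
    }
    return n;
}
```

Both unsigned variants return the same counts; the second simply never executes the sign handling, which an unsigned input cannot need.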

vstinner

LGTM, but I have one last request.

vstinner

Thanks @sir-sigurd, that's a nice micro-optimization ;-)

the-knights-who-say-ni added the CLA signed label Aug 9, 2019

bedevere-bot added the awaiting review label Aug 9, 2019

mdickinson reviewed Aug 9, 2019

aeros reviewed Aug 10, 2019

mdickinson approved these changes Aug 10, 2019

bedevere-bot added awaiting merge and removed awaiting review labels Aug 10, 2019

aeros reviewed Aug 10, 2019

sir-sigurd mentioned this pull request Aug 13, 2019

bpo-37837: Add internal _PyLong_FromUnsignedChar() function #15251

Closed

sir-sigurd force-pushed the long-from-sizet-2 branch from 0ba7b3c to 9c52079 Compare August 24, 2019 18:11

gnprice reviewed Aug 24, 2019

aeros added the performance label Aug 25, 2019

sir-sigurd force-pushed the long-from-sizet-2 branch from 9c52079 to b4c286f Compare August 25, 2019 08:35

gnprice mentioned this pull request Aug 28, 2019

bpo-37812: Expand confusing CHECK_SMALL_INT so return is explicit. #15216

Merged

gnprice approved these changes Aug 28, 2019

sir-sigurd force-pushed the long-from-sizet-2 branch from 0d33a8e to d0fdead Compare September 9, 2019 10:19

sir-sigurd requested a review from mdickinson September 9, 2019 11:41

vstinner reviewed Sep 10, 2019

vstinner approved these changes Sep 10, 2019

bpo-37802: Slightly improve performance of PyLong_FromUnsigned*()

62d3ba8

sir-sigurd force-pushed the long-from-sizet-2 branch from d0fdead to 62d3ba8 Compare September 10, 2019 18:16

gpshead approved these changes Sep 12, 2019

gpshead merged commit c6734ee into python:master Sep 12, 2019

bedevere-bot removed the awaiting merge label Sep 12, 2019

sir-sigurd deleted the long-from-sizet-2 branch September 12, 2019 15:06

Conversation

sir-sigurd commented Aug 9, 2019 (edited by bedevere-bot)

mdickinson left a comment

gnprice commented Aug 10, 2019

gnprice commented Aug 10, 2019

gnprice left a comment

sir-sigurd commented Aug 25, 2019

sir-sigurd commented Aug 25, 2019

gnprice left a comment

gnprice commented Aug 29, 2019 via email

sir-sigurd commented Sep 7, 2019

gnprice commented Sep 9, 2019

sir-sigurd commented Sep 9, 2019

gnprice commented Sep 9, 2019

sir-sigurd commented Sep 9, 2019 (edited)

gnprice commented Sep 9, 2019

sir-sigurd commented Sep 9, 2019

gnprice commented Sep 9, 2019

aeros commented Sep 9, 2019 (edited)

sir-sigurd commented Sep 9, 2019 (edited)

sir-sigurd commented Sep 9, 2019

vstinner commented Sep 9, 2019

sir-sigurd commented Sep 9, 2019

gnprice commented Sep 10, 2019

gnprice commented Sep 10, 2019

sir-sigurd commented Sep 10, 2019

sir-sigurd commented Sep 10, 2019 (edited)

vstinner left a comment

vstinner commented Sep 12, 2019

8 participants
