◐ Shell
reader mode source ↗
Skip to content

gh-87135: Hang non-main threads that attempt to acquire the GIL during finalization#28525

Closed
jbms wants to merge 11 commits into
python:mainfrom
jbms:fix-issue-42969
Closed

gh-87135: Hang non-main threads that attempt to acquire the GIL during finalization#28525
jbms wants to merge 11 commits into
python:mainfrom
jbms:fix-issue-42969

Conversation

@jbms

@jbms jbms commented Sep 22, 2021

Copy link
Copy Markdown
Contributor

@the-knights-who-say-ni

Copy link
Copy Markdown

Hello, and thanks for your contribution!

I'm a bot set up to make sure that the project can legally accept this contribution by verifying everyone involved has signed the PSF contributor agreement (CLA).

CLA Missing

Our records indicate the following people have not signed the CLA:

@jbms

For legal reasons we need all the people listed to sign the CLA before we can look at your contribution. Please follow the steps outlined in the CPython devguide to rectify this issue.

If you have recently signed the CLA, please wait at least one business day
before our records are updated.

You can check yourself to see if the CLA has been received.

Thanks again for the contribution, we look forward to reviewing it!

@jbms

jbms commented Sep 23, 2021

Copy link
Copy Markdown
Contributor Author

Can this possibly go into a point release? I think it would be unfortunate for this bug to remain until 3.11, even though presumably it has been around since forever.

@gpshead

gpshead commented Sep 23, 2021

Copy link
Copy Markdown
Member

Can this possibly go into a point release? I think it would be unfortunate for this bug to remain until 3.11, even though presumably it has been around since forever.

The API deprecation must say 3.11. but the bugfix itself could be backported. We should let this one bake on the main branch for a bit first.

@gpshead

gpshead commented Sep 23, 2021

Copy link
Copy Markdown
Member

PS please do the CLA signing dance (see go/patching#cla internally, it's approved).

@jbms

jbms commented Sep 23, 2021

Copy link
Copy Markdown
Contributor Author

I already signed the CLA earlier today, may take until tomorrow to register.

@jbms

jbms commented Sep 23, 2021

Copy link
Copy Markdown
Contributor Author

I'm working on adding a mechanism that extension code can use to safely avoid threads hanging, and updating the documentation. I'll update this PR shortly once that is done.

@vstinner

Copy link
Copy Markdown
Member

There are legit use cases for pthread_exit(). If PyThread_exit_thread() is deprecated, does it mean that developers using it should call directly pthread_exit(0) and _endthreadex(0)?

@vstinner vstinner left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hide comment

I dislike this PR. When Python is embedded in an application, it means that Python leaks hang threads in some cases, whereas currently these threads exit once they attempt to acquire the GIL.

I get that there are some use cases where calling pthread_exit() is bad, but other cases, calling pthread_exit() sounds less bad than hang these threads.

Would it be possible to make it possible to choose the behavior, and keep the current behavior by default?

@jbms jbms force-pushed the fix-issue-42969 branch 3 times, most recently from 099dd01 to 8aab6c0 Compare September 24, 2021 00:51

@vstinner vstinner left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hide comment

This PR becomes more and more problems. I'm not sure if it is will fix more problems than introducing more problems.

I don't understand why Python should now start handling Windows GUI messages. Why only Windows GUI messages? Do you expect that other operating systems will also need special code in "hang threads"?

@gpshead gpshead left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hide comment

leaving a few more review comments here. i'll followup on the bug with larger picture PR-existential strategy comments.

@jbms

jbms commented Sep 30, 2021

Copy link
Copy Markdown
Contributor Author

I have updated the documentation to document the new functions, and changed to use SleepEx as suggested by @eryksun.

I have not yet made the change suggested by @gpshead to block signals in the pthread version --- I can do that if necessary, but I am not sure what the benefit is. Signal handlers already can't safely attempt to acquire the GIL or use any Python APIs that require the GIL, and therefore I would assume there is no harm in letting them run.

Other than that, what are the next steps for this pull request?

@github-actions

Copy link
Copy Markdown

This PR is stale because it has been open for 30 days with no activity.

@github-actions github-actions Bot added the stale Stale PR or inactive for long period of time. label Oct 31, 2021
@Skylion007

Copy link
Copy Markdown

116 hidden items Load more…

@gpshead gpshead left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hide comment

from your docs:

To avoid non-Python threads becoming blocked, or Python-created threads becoming
blocked while executing C extension code, you can use
:c:func:PyThread_TryAcquireFinalizeBlock and
:c:func:PyThread_ReleaseFinalizeBlock.

I think this phrasing undersells the magnitude of the problem and who can and should "fix" it... The "you can use" wording implies that it is the application user who debugged a hang is capable of resolving the problem by adjusting code they own.

But I suspect the reality is that until all Python C API using code in the world is updated to use these APIs (read: never/8-years)... hangs are going to be most frequently attributed to the huge pile of third party code that everybody's application uses. (Probably including a lot within the CPython internals itself).

There isn't really a solution for this. The existing GIL APIs and thus all existing Python C API using code in the world are not structured to check for and handle failure.

So it's mostly a matter of documentation semantics. We should seek to set expectations of who and when these new APIs should explicitly be used: In C code calling back into Python from potentially non-CPython created threads.

(I'd ignore the daemon thread aspect in our advice - anything could be one of those - we ultimately want to get rid of the daemon concept entirely and separately should document it as a "do not use / fundamentally broken / unfixable" feature in the threading module)


Other than the messaging, I think this overall PR looks good and want @Yhg1s to answer whether we backport to 3.12beta or not upon merge.

gpshead and others added 2 commits May 25, 2023 15:02
The finalized blocks is now protected by the main interpreters GIL.
@jbms

jbms commented May 26, 2023

Copy link
Copy Markdown
Contributor Author

from your docs:

To avoid non-Python threads becoming blocked, or Python-created threads becoming
blocked while executing C extension code, you can use
:c:func:PyThread_TryAcquireFinalizeBlock and
:c:func:PyThread_ReleaseFinalizeBlock.

I think this phrasing undersells the magnitude of the problem and who can and should "fix" it... The "you can use" wording implies that it is the application user who debugged a hang is capable of resolving the problem by adjusting code they own.

But I suspect the reality is that until all Python C API using code in the world is updated to use these APIs (read: never/8-years)... hangs are going to be most frequently attributed to the huge pile of third party code that everybody's application uses. (Probably including a lot within the CPython internals itself).

There isn't really a solution for this. The existing GIL APIs and thus all existing Python C API using code in the world are not structured to check for and handle failure.

So it's mostly a matter of documentation semantics. We should seek to set expectations of who and when these new APIs should explicitly be used: In C code calling back into Python from potentially non-CPython created threads.

(I'd ignore the daemon thread aspect in our advice - anything could be one of those - we ultimately want to get rid of the daemon concept entirely and separately should document it as a "do not use / fundamentally broken / unfixable" feature in the threading module)

Other than the messaging, I think this overall PR looks good and want @Yhg1s to answer whether we backport to 3.12beta or not upon merge.

Not sure exactly what the best change to the docs is --- but as I see it, the key message around the hangs is:

"In most cases this is harmless"

I think it is a rare case indeed where the hang will not be harmless. Aside from the use case of wanting to cleanly finalize and initialize multiple times in a program, cases where hanging is a problem would likely result in a hang or crash without this PR as well.

@gpshead gpshead added the needs backport to 3.12 only security fixes label May 26, 2023

@Yhg1s Yhg1s left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hide comment

FWIW, I think we should get this into 3.12 (preferably before beta 2, which is scheduled next week). I think it's safe to do so because only an internal struct changes, and that struct is not used from any macros.

@jbms

jbms commented May 26, 2023

Copy link
Copy Markdown
Contributor Author

I'm not sure I understand the point about PyGILState_Release() being confusing. I think the main question is whether there's a reasonable use-case for the finalize block outside of acquiring the GIL. For the use-case of making sure it's safe to acquire the GIL (and preventing finalization from starting while the GILState is held), I think the PyGILState API makes more sense. At that point it isn't exposed as a separate lock but as a new error condition (and a new safety guarantee when calling TryEnsure).

Framing it as "PyGILState_TryEnsure() is like PyGILState_Ensure() except it also prevents finalization from starting while the GILState is held" makes sense to me. I don't think most users of the C API care if their threads are blocked forever when finalizing, so they can just keep using PyGILState_Ensure. I don't think that's more confusing.

One example where it may be useful to acquire the finalize block separately from the GIL:

  1. Acquire finalize block
  2. Acquire a lock on some internal mutex owned by third-party code, finalize block ensures the thread doesn't hang while holding the mutex.
  3. Do various things with the Python API, which may involve repeatedly acquiring and releasing the GIL
  4. Release internal mutex
  5. Release finalize block

In any case having a single enum and PyGILState_Release seems reasonable. I don't think TryEnsure is a great name to replace PyGILState_AcquireFinalizeBlockAndGIL because it doesn't clearly indicate that a finalize block is being acquired, and users should not acquire finalize blocks casually (e.g. while doing any sort of blocking operation), as that may unexpectedly prevent Python from exiting.

@ericsnowcurrently ericsnowcurrently left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hide comment

There are a few things I haven't looked through (e.g. the details of how the "blocks" are implemented). Hopefully this is enough review to get things rolling though.

@bedevere-bot

Copy link
Copy Markdown

A Python core developer has requested some changes be made to your pull request before we can consider merging it. If you could please address their requests along with any other requests in other reviews from core developers that would be appreciated.

Once you have made the requested changes, please leave a comment on this pull request containing the phrase I have made the requested changes; please review again. I will then notify any core developers who have left a review that you're ready for them to take another look at this pull request.

Co-authored-by: Eric Snow <ericsnowcurrently@gmail.com>
@AlexWaygood AlexWaygood changed the title bpo-42969: Hang non-main threads that attempt to acquire the GIL during finalization Jun 9, 2023
@jbms

jbms commented Jun 14, 2023

Copy link
Copy Markdown
Contributor Author

I have made the requested changes; please review again

This PR can now be considered "on hold" or abandoned given the API questions regarding multiple interpreters. Please see the separate PR #105805 which contains just the minimal change to hang threads without adding any new public APIs.

@bedevere-bot

Copy link
Copy Markdown

Thanks for making the requested changes!

@gpshead, @ericsnowcurrently: please review the changes made to this pull request.

@ericsnowcurrently

Copy link
Copy Markdown
Member

It looks like this PR was replaced by #105805, which has been merged now. Is there anything to be done here?

@jbms jbms closed this Oct 24, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

awaiting change review needs backport to 3.12 only security fixes needs backport to 3.13 bugs and security fixes stale Stale PR or inactive for long period of time. type-bug An unexpected behavior, bug, or error

Projects

None yet

Development

Successfully merging this pull request may close these issues.