[WIP] bpo-38323: Fix MultiLoopChildWatcher hangs#20142
Conversation
1279008 to
5d10132
Compare
May 17, 2020 01:45
This changes asyncio.MultiLoopChildWatcher's attach_loop() method to call loop.add_signal_handler() instead of calling only signal.signal(). This should eliminate some rare hangs since loop.add_signal_handler() calls signal.set_wakeup_fd(). Without this, the main thread sometimes wasn't getting awakened if a signal occurred during an await.
|
@cjerdonek Are you still interested/able to work on this? I'm mainly asking since it's been a few months since there was activity, and I'm fairly confident that my suggestion will resolve the CI failures. If you're not currently able to, I can re-open a separate PR applying this changes in addition to mine with you as the co-author, titling it something like "fix MultiLoopChildWatcher and ThreadChildWatcher hangs". @vstinner recently brought to my attention that this issue has some degree of urgency, with it having caused intermittent failures in the buildbots for quite some time now. So, there's some definitely an advantage to having it resolved sooner than later. |
Sorry, something went wrong.
|
@aeros As I said in the reply above to the review comment, I think the Regarding the rest of the PR, sure, you're welcome to continue what I started. However, I definitely don't think the changes I made to For example, I don't think |
Sorry, something went wrong.
|
Something else to be aware of is that # Implementation note:
# The class keeps compatibility with AbstractChildWatcher ABC
# To achieve this it has empty attach_loop() method
# and doesn't accept explicit loop argument
# for add_child_handler()/remove_child_handler()
# but retrieves the current loop by get_running_loop()The part mentioning its "empty |
Sorry, something went wrong.
From the git blame, it looks like Andrew wrote that note when initially implementing the |
Sorry, something went wrong.
|
I reverted the |
Sorry, something went wrong.
|
Can I suggest an improved documentation-comment? I noticed a similar constraint in the existing code. MultiLoopChildWatcher could cause hangs if the main thread is waiting in a system call. Because the watcher's signal handler would not get run, until the system call returns. It only works properly, if the main thread can be woken up by [EDIT: a signal that happens to be received on the main thread, or] the fd passed to signal.set_wakeup_fd(). This PR tightens the requirements slightly. MultiLoopChildWatcher now needs the main thread to run an asyncio event loop. MultiLoopChildWatcher will cause hangs if the main thread waits elsewhere in a system call. Or if the main thread runs a very long computation without returning to the event loop. (Code looks plausible. Good analysis of a nasty problem. The signal arrives e.g. within the libc select() function. The python signal handler, MultiLoopChildWatcher._sig_chld(), doesn't get a chance to run before the low-level system call. signal.set_wakeup_fd() is the solution. It gets the select() system call to wake up. The python-level SIGCHLD handler gets a chance to run.). I would drop |
Sorry, something went wrong.
|
Thanks @sourcejedi for the second pair of eyes and great additional comments! I like your doc suggestion and agree it should be incorporated in some form into this PR. A couple unrelated things:
|
Sorry, something went wrong.
|
I just noticed that changing I guess this means |
Sorry, something went wrong.
|
@asvetlov "MultiLoopChildWatcher problems can and should be fixed; there is nothing bad in the idea but slightly imperfect implementation." This fix needs to switch from using signal.signal(), to loop.add_signal_handler(). This requires the main thread to run an asyncio event loop. Previously, this requirement was only documented for SafeChildWatcher and FastChildWatcher. So we need to decide - Q. Is it acceptable to add this new requirement? Or do we deprecate MultiLoopChildWatcher, and warn people that it occasionally hangs? In other words, it is possible an existing program used the "imperfect" MultiLoopChildWatcher, while the main thread ran a non-asyncio event loop. (And the non-asyncio event loop was using signal.set_wakeup_fd()). This fix would break any such program (hang in signal.set_wakeup_fd() is roughly the only way python programs can use signal handlers correctly. The python-level signal handler is run in the main thread. If the main thread is waiting in a system call, it needs something to wake it up after a signal was delivered to a different thread. POSIX says that process signals can be delivered to any thread. (I say "process signals" to exclude pthread_kill(), and faults like
I don't know how that would help, sorry. We absolutely need to interrupt the epoll_wait() call in the main thread. We need that epoll to be watching the signal wakeup fd. Hence we need to talk to the main thread event loop. We may as well use loop.add_signal_handler(). |
Sorry, something went wrong.
|
I said that because if cpython/Lib/asyncio/unix_events.py Lines 1205 to 1206 in 95ad890 So I was wondering if there might be any options for using In the meantime, we might also want to think about whether a band-aid could be added to the tests, at least to prevent hangs as long as |
Sorry, something went wrong.
|
Yeah, I would agree that changing
My only concern though is that by bandaging the tests temporarily, it may encourage usage in production in situations that could lead to the hangs (although admittedly, I'm not aware of specific real-world situations where it would currently be used). That being said, it's seeming like we may have to go that route for now since this issue seems like it might be blocked on feedback from @asvetlov (since it seems like it might require some fundamental design changes). This would at least ensure that other test failures aren't obfuscated due to intermittent |
Sorry, something went wrong.
I was just meaning in master (e.g. for the next few months) while a resolution is developed. By the way, @aeros can you review my other PR #22756, which is independent of (but related to) this issue? |
Sorry, something went wrong.
jstasiak
left a comment
There was a problem hiding this comment.
LGTM, this fixes some hanging sets on Solaris/illumos/OpenIndiana.
Sorry, something went wrong.
|
I'm closing this PR which was mainly exploratory. See here for my conclusions (also noted above): https://bugs.python.org/issue38323#msg395113 |
Sorry, something went wrong.
https://bugs.python.org/issue38323