[WIP] bpo-38323: Fix MultiLoopChildWatcher hangs by cjerdonek · Pull Request #20142 · python/cpython
Calling super().tearDown() last here should not cause dangling threads, so I think this points to a more fundamental issue.
In a previous PR, I added an internal _join_threads() method to ThreadedChildWatcher that joins all non-daemon threads. It was there to ensure all threads are joined when the watcher is closed. However, the _do_waitpid() threads created in add_child_handler() are set as daemon, so _join_threads() never joins them: it only selects threads that are not daemon.
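To illustrate the filtering described above, here is a minimal sketch. It is a simplified stand-in and not the asyncio source: the helper name joinable_threads and the Event-based worker are assumptions made only for this example.

```python
import threading


def joinable_threads(threads):
    """Mimic the described filter: keep only live, non-daemon threads."""
    return [t for t in threads if t.is_alive() and not t.daemon]


# Hypothetical stand-in for a _do_waitpid() worker: blocks until released.
release = threading.Event()
worker = threading.Thread(target=release.wait, name="waitpid-0", daemon=True)
worker.start()

# The daemon worker is alive but filtered out, so nothing would join it.
skipped = joinable_threads([worker])

# Clean up so the example itself leaves no dangling thread.
release.set()
worker.join()
```

With a filter like this, a daemon worker that is still running when the watcher closes is left behind, which matches the "Dangling thread" warnings for waitpid-0.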
Based on the error message in Travis, it definitely seems to be those waitpid threads created in ThreadedChildWatcher causing the hangs:
0:11:41 load avg: 3.89 [375/424/1] test_asyncio failed (env changed) (2 min 3 sec) -- running: test_multiprocessing_spawn (1 min 2 sec)
Warning -- threading_cleanup() failed to cleanup 1 threads (count: 1, dangling: 2)
Warning -- Dangling thread: <_MainThread(MainThread, started 140164662380352)>
Warning -- Dangling thread: <Thread(waitpid-0, started daemon 140164437370624)>
Warning -- threading_cleanup() failed to cleanup 1 threads (count: 1, dangling: 2)
Warning -- Dangling thread: <Thread(waitpid-0, started daemon 140164437370624)>
Warning -- Dangling thread: <_MainThread(MainThread, started 140164662380352)>
Warning -- threading_cleanup() failed to cleanup 1 threads (count: 1, dangling: 2)
Warning -- Dangling thread: <_MainThread(MainThread, started 140164662380352)>
Warning -- Dangling thread: <Thread(waitpid-0, started daemon 140164437370624)>
Warning -- threading_cleanup() failed to cleanup 1 threads (count: 1, dangling: 2)
Warning -- Dangling thread: <_MainThread(MainThread, started 140164662380352)>
Warning -- Dangling thread: <Thread(waitpid-0, started daemon 140164437370624)>
As a result, I think a viable fix might be simply removing the "and not thread.daemon" condition in ThreadedChildWatcher._join_threads(), so that it also cleans up the waitpid threads when the watcher is closed. Based on my local testing, this seems to resolve the issue, but test-with-buildbots should probably be used to make sure it doesn't cause side effects on other platforms.
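A rough sketch of what that change amounts to, under stated assumptions: SketchWatcher, add_worker() and join_threads() are hypothetical names for a simplified model of the watcher's thread bookkeeping, and the worker waits on an Event rather than on a real child process.

```python
import threading


class SketchWatcher:
    """Hypothetical, simplified model of the watcher's thread bookkeeping."""

    def __init__(self):
        self._threads = {}

    def add_worker(self, key, target):
        # Daemon worker, mirroring how the waitpid threads are created.
        thread = threading.Thread(target=target, name=f"waitpid-{key}",
                                  daemon=True)
        self._threads[key] = thread
        thread.start()

    def join_threads(self):
        # Proposed behaviour: no "and not thread.daemon" condition,
        # so daemon workers are joined as well.
        threads = [t for t in list(self._threads.values()) if t.is_alive()]
        for thread in threads:
            thread.join()


release = threading.Event()
watcher = SketchWatcher()
watcher.add_worker(0, release.wait)

# Release the worker shortly after join_threads() starts waiting on it.
threading.Timer(0.05, release.set).start()
watcher.join_threads()

remaining = [t.name for t in watcher._threads.values() if t.is_alive()]
```

One thing to keep in mind with this approach is that join() blocks until each worker returns. In the sketch the timer releases the worker; in the real watcher the worker would presumably return only once its waitpid call does.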
A viable alternative might be changing add_child_handler() to spawn non-daemon threads instead. However, I suspect that changing the internal _join_threads() method is less likely to disrupt user code or cause side effects, since some code may rely on those threads being daemon.