bpo-33613, test_semaphore_tracker_sigint: fix race condition#7850
Conversation
Fail `test_semaphore_tracker_sigint` if no warnings are expected and one is received. Fix race condition when the child receives SIGINT before it can register signal handlers for it. The race condition occurs when the parent calls `_semaphore_tracker.ensure_running()` (which in turn spawns the semaphore_tracker using `_posixsubprocess.fork_exec`), the child registers the signal handlers and the parent tries to kill the child. What seem to happen is that in some slow systems, the parent sends the signal to kill the child before the child protects against the signal. There is no reliable and portable solution for the parent to wait until the child has register the signal handlers to send the signal to kill the child so a `sleep` is introduced between the spawning of the child and the parent sending the signal to give time to the child to register the handlers.
|
Regarding the unrelated change of the warning check, FWIW I find the existing code (without the change in this PR) clearer. |
Sorry, something went wrong.
|
@taleinat I am happy to undo that change. The problem is that the test was silently failing when I think in order to check that the test works as intended without the need to run the suite with -Wall, we need to modify the test. |
Sorry, something went wrong.
4b8f76f to
9408236
Compare
June 24, 2018 18:35
… the same process
…runs in the same process
|
@pitrou Could you take a look at this? Here is a summary of all the changes to make reviewing this easier:
|
Sorry, something went wrong.
|
From a high-level POV:
Is that important? If we have very slow buildbots, we could skip the test on them. Testing this piece of functionality on all buildbots is not critical. |
Sorry, something went wrong.
pitrou
left a comment
There was a problem hiding this comment.
Assuming we validate this approach, there are still problems with the implementation.
Sorry, something went wrong.
|
When you're done making the requested changes, leave the comment: |
Sorry, something went wrong.
…esting runs in the same process
e16e027 to
9460da4
Compare
July 24, 2018 12:29
That was one of the original problems in the issue. It does not happen only on the slowest buildbots, but these are a reliable place to test for the race condition to happen. In my humble opinion, the existence of the race makes the tests more unreliable and we should fix that, but I you think there is a better approach or a different compromise, I am more than happy to go for that. |
Sorry, something went wrong.
|
@pitrou wrote: "I'm a bit uneasy with this. What if spawnv_passfds takes a very long time for some reason? The user will try to stop it using ^C... and nothing will happen." You are true that SIGINT is blocked while spawnv_passfds() is running, but spawnv_passfds() should be quick in the parent: to oversimplify, it just calls fork() which should be very quick (it's not a O(n) operating thanks to copy-on-write). I prefer to see this race condition fixed. IMHO the short time window where CTRL+c is blocked is small enough to be acceptable. Note: signals are not lost or ignored, it's just that signals are only handled once spawnv_passfds() completes (once the signals are unblocked). I suggest to backport the change to 2.7, 3.6 and 3.7 branches to make our buildbots more reliable. |
Sorry, something went wrong.
|
@pitrou: @pablogsal updated his PR, and it now LGTM. Would you mind to have a second look? Note: this PR has a long history since @pablogsal chose to rewrite his PR with a different approach ("ping" then pthread_sigmask) rather than creating a new PR. IMHO pthread_sigmask is smpler and more reliable than the "ping" idea. There is no need to establish (and then close, reliably) a new communication channel with pthread_sigmask. |
Sorry, something went wrong.
|
Fair enough. I think this is good to go. |
Sorry, something went wrong.
|
@pitrou: Please replace |
Sorry, something went wrong.
|
Thanks @pablogsal for the PR, and @pitrou for merging it 🌮🎉.. I'm working now to backport this PR to: 3.6. |
Sorry, something went wrong.
|
Thanks @pablogsal for the PR, and @pitrou for merging it 🌮🎉.. I'm working now to backport this PR to: 2.7. |
Sorry, something went wrong.
|
Thanks @pablogsal for the PR, and @pitrou for merging it 🌮🎉.. I'm working now to backport this PR to: 3.7. |
Sorry, something went wrong.
|
Sorry, @pablogsal and @pitrou, I could not cleanly backport this to |
Sorry, something went wrong.
…H-7850) Fail `test_semaphore_tracker_sigint` if no warnings are expected and one is received. Fix race condition when the child receives SIGINT before it can register signal handlers for it. The race condition occurs when the parent calls `_semaphore_tracker.ensure_running()` (which in turn spawns the semaphore_tracker using `_posixsubprocess.fork_exec`), the child registers the signal handlers and the parent tries to kill the child. What seem to happen is that in some slow systems, the parent sends the signal to kill the child before the child protects against the signal. (cherry picked from commit ec74d18) Co-authored-by: Pablo Galindo <Pablogsal@gmail.com>
|
Sorry, @pablogsal and @pitrou, I could not cleanly backport this to |
Sorry, something went wrong.
|
Please don't backport this. This isn't fixing a user-visible bug. |
Sorry, something went wrong.
It hurts on the buildbots, and so it makes our life harder (Pablo and me who watch the random failures on buildbots). |
Sorry, something went wrong.
|
Right, but we shouldn't modify library code to heal the buildbots (and risk introducing regressions). |
Sorry, something went wrong.
|
If you don't want to backport the code, I suggest to remove or skip the test in 3.6 and 3.7 branches. I don't want to have known race condition in our test suite. |
Sorry, something went wrong.
|
I'm ok with skipping the tests on the buildbots, as long as it's not skipped unconditionally. |
Sorry, something went wrong.
Fail
test_semaphore_tracker_sigintif no warnings are expected andone is received.
Fix race condition when the child receives SIGINT
before it can register signal handlers for it.
The race condition occurs when the parent calls
_semaphore_tracker.ensure_running()(which in turn spawns thesemaphore_tracker using
_posixsubprocess.fork_exec), the childregisters the signal handlers and the parent tries to kill the child.
What seems to happen is that in some slow systems, the parent sends the
signal to kill the child before the child protects against the signal.
There is no reliable and portable solution for the parent to wait until
the child has register the signal handlers to send the signal to kill
the child so a
sleepis introduced between the spawning of the childand the parent sending the signal to give time to the child to register
the handlers.
https://bugs.python.org/issue33613