gh-108342: Make ssl TestPreHandshakeClose more reliable#108370
Conversation
|
Non-Linux platforms have this issue because of the Without this change, test_ssl fails with "env changed" on Windows (on an idle machine): With this change, the test pass. I tested my change by stressing my Windows VM:
I cannot reproduce the issue with this method on Windows anymore with this change. |
Sorry, something went wrong.
|
This issue also impacts Linux machines, see buildbot failures starting at this comment: #108344 (comment) |
Sorry, something went wrong.
|
I plan to schedule buildbot jobs on this PR to see if my fix works as expected. I tried to write the smallest change to fix the issue, without putting |
Sorry, something went wrong.
|
On Windows x64, test_ssl failed with ENV CHANGED, but it's a different test (unrelated to this PR): |
Sorry, something went wrong.
|
Oh, on "AMD64 Fedora Stable Clang Installed PR", the bug is still there :-( Oh. But here, the bug occurred in test_https_client_non_tls_response_ignored(). |
Sorry, something went wrong.
* In preauth tests of test_ssl, explicitly break reference cycles invoving SingleConnectionTestServerThread to make sure that the thread is deleted. Otherwise, the test marks the environment as altered because the threading module sees a "dangling thread" (SingleConnectionTestServerThread). This test leak was introduced by the test added for the fix of issue pythongh-108310. * Use support.SHORT_TIMEOUT instead of hardcoded 1.0 or 2.0 seconds timeout. * SingleConnectionTestServerThread.run() catchs TimeoutError * Fix a race condition (missing synchronization) in test_preauth_data_to_tls_client(): the server now waits until the client connect() completed in call_after_accept(). * Replace socket.send() with socket.sendall() * test_https_client_non_tls_response_ignored() calls server.join() explicitly. * Replace "localhost" with server.listener.getsockname()[0].
|
It also still fails on Windows: |
Sorry, something went wrong.
ff74fd7 to
51e1e09
Compare
August 23, 2023 18:55
|
Maybe let's not merge this until it really fixes the issue now? |
Sorry, something went wrong.
Yeah, I didn't touch test_https_client_non_tls_response_ignored() whereas this test was also fragile :-/ I rewrote my PR to make the 3 tests of TestPreHandshakeClose more reliable. The overall change is more complicated than expected, so I fixed multiple issues (more or less important). |
Sorry, something went wrong.
|
I stress-tested the updated PR (commit 51e1e09) on the 3 tests on Linux on Windows. Linux: Windows: Results:
|
Sorry, something went wrong.
Sorry, something went wrong.
I don't plan to merge before I'm sure that the test is very reliable :-) |
Sorry, something went wrong.
The challenge with these is that having an environment that the failure can be reliably reproduced in is hard. So "fixed" isn't well defined. I'm satisfied that this PR improves things and that Victor has found reproducers that appear to be healthier. But a concrete "fixed" can really only be declared after weeks of diverse buildbot data even when we're 99% confident "this is the last one for real this time". It'd be fine to do the releases without this or the related preceeding test refleak fix PR in FWIW. Our refleaks checks are pedantry we impose upon ourselves that also catch things that are not actual CPython bugs as is the case here. I found and fixed a one reference cycle from the test that I saw come up during initial development. I had no way to notice or find others given I couldn't use public buildbots and I don't maintain a pile of unusually overloaded diverse OS VMs. |
Sorry, something went wrong.
For me, it was quite easy to reproduce the 2 issues on Windows, in a few seconds:
Moreover, on Linux, using |
Sorry, something went wrong.
|
Right now, there are many buildbots failing on these 3 test_ssl tests because of "dangling threads" (need my fix)! It's hard to notice that a test is not reliable in advance. In my experience, only shipping a fix and launching the heavy buildbot farm discovers race conditions in tests. |
Sorry, something went wrong.
|
I updated my PR to address @gpshead's review. @gpshead @ambv: I propose that you review and approve it, and then merge it, rather than waiting for a 3rd batch of buildbot jobs on this PR. We can use the regular buildbot process: merge the PR and then watch buildbots. Then only add backport labels to this PR once enough buildbots completed successfully. |
Sorry, something went wrong.
|
OK, let me just run this PR branch without Greg's original fix to see if the changed tests still trigger the issue. |
Sorry, something went wrong.
|
Yup, it still works: |
Sorry, something went wrong.
gpshead
left a comment
There was a problem hiding this comment.
If the https cllient test still proves flaky, disabling or removing it is fine by me. (it wasn't a regression test)
Sorry, something went wrong.
|
Buildbots are more happy with this change. Only one Windows Refleak buildbot still fails, just because it didn't run with the new code yet. The build 494 is running and it looks good as well ("test_ssl passed")! |
Sorry, something went wrong.
…ythonGH-108370) * In preauth tests of test_ssl, explicitly break reference cycles invoving SingleConnectionTestServerThread to make sure that the thread is deleted. Otherwise, the test marks the environment as altered because the threading module sees a "dangling thread" (SingleConnectionTestServerThread). This test leak was introduced by the test added for the fix of issue pythongh-108310. * Use support.SHORT_TIMEOUT instead of hardcoded 1.0 or 2.0 seconds timeout. * SingleConnectionTestServerThread.run() catchs TimeoutError * Fix a race condition (missing synchronization) in test_preauth_data_to_tls_client(): the server now waits until the client connect() completed in call_after_accept(). * test_https_client_non_tls_response_ignored() calls server.join() explicitly. * Replace "localhost" with server.listener.getsockname()[0]. (cherry picked from commit 592bacb) Co-authored-by: Victor Stinner <vstinner@python.org>
…ythonGH-108370) * In preauth tests of test_ssl, explicitly break reference cycles invoving SingleConnectionTestServerThread to make sure that the thread is deleted. Otherwise, the test marks the environment as altered because the threading module sees a "dangling thread" (SingleConnectionTestServerThread). This test leak was introduced by the test added for the fix of issue pythongh-108310. * Use support.SHORT_TIMEOUT instead of hardcoded 1.0 or 2.0 seconds timeout. * SingleConnectionTestServerThread.run() catchs TimeoutError * Fix a race condition (missing synchronization) in test_preauth_data_to_tls_client(): the server now waits until the client connect() completed in call_after_accept(). * test_https_client_non_tls_response_ignored() calls server.join() explicitly. * Replace "localhost" with server.listener.getsockname()[0]. (cherry picked from commit 592bacb) Co-authored-by: Victor Stinner <vstinner@python.org>
…ythonGH-108370) * In preauth tests of test_ssl, explicitly break reference cycles invoving SingleConnectionTestServerThread to make sure that the thread is deleted. Otherwise, the test marks the environment as altered because the threading module sees a "dangling thread" (SingleConnectionTestServerThread). This test leak was introduced by the test added for the fix of issue pythongh-108310. * Use support.SHORT_TIMEOUT instead of hardcoded 1.0 or 2.0 seconds timeout. * SingleConnectionTestServerThread.run() catchs TimeoutError * Fix a race condition (missing synchronization) in test_preauth_data_to_tls_client(): the server now waits until the client connect() completed in call_after_accept(). * test_https_client_non_tls_response_ignored() calls server.join() explicitly. * Replace "localhost" with server.listener.getsockname()[0]. (cherry picked from commit 592bacb) Co-authored-by: Victor Stinner <vstinner@python.org>
…ythonGH-108370) * In preauth tests of test_ssl, explicitly break reference cycles invoving SingleConnectionTestServerThread to make sure that the thread is deleted. Otherwise, the test marks the environment as altered because the threading module sees a "dangling thread" (SingleConnectionTestServerThread). This test leak was introduced by the test added for the fix of issue pythongh-108310. * Use support.SHORT_TIMEOUT instead of hardcoded 1.0 or 2.0 seconds timeout. * SingleConnectionTestServerThread.run() catchs TimeoutError * Fix a race condition (missing synchronization) in test_preauth_data_to_tls_client(): the server now waits until the client connect() completed in call_after_accept(). * test_https_client_non_tls_response_ignored() calls server.join() explicitly. * Replace "localhost" with server.listener.getsockname()[0]. (cherry picked from commit 592bacb) Co-authored-by: Victor Stinner <vstinner@python.org>
…ythonGH-108370) * In preauth tests of test_ssl, explicitly break reference cycles invoving SingleConnectionTestServerThread to make sure that the thread is deleted. Otherwise, the test marks the environment as altered because the threading module sees a "dangling thread" (SingleConnectionTestServerThread). This test leak was introduced by the test added for the fix of issue pythongh-108310. * Use support.SHORT_TIMEOUT instead of hardcoded 1.0 or 2.0 seconds timeout. * SingleConnectionTestServerThread.run() catchs TimeoutError * Fix a race condition (missing synchronization) in test_preauth_data_to_tls_client(): the server now waits until the client connect() completed in call_after_accept(). * test_https_client_non_tls_response_ignored() calls server.join() explicitly. * Replace "localhost" with server.listener.getsockname()[0]. (cherry picked from commit 592bacb) Co-authored-by: Victor Stinner <vstinner@python.org>
…8370) (#108404) * In preauth tests of test_ssl, explicitly break reference cycles invoving SingleConnectionTestServerThread to make sure that the thread is deleted. Otherwise, the test marks the environment as altered because the threading module sees a "dangling thread" (SingleConnectionTestServerThread). This test leak was introduced by the test added for the fix of issue gh-108310. * Use support.SHORT_TIMEOUT instead of hardcoded 1.0 or 2.0 seconds timeout. * SingleConnectionTestServerThread.run() catchs TimeoutError * Fix a race condition (missing synchronization) in test_preauth_data_to_tls_client(): the server now waits until the client connect() completed in call_after_accept(). * test_https_client_non_tls_response_ignored() calls server.join() explicitly. * Replace "localhost" with server.listener.getsockname()[0]. (cherry picked from commit 592bacb) Co-authored-by: Victor Stinner <vstinner@python.org>
…8370) (#108405) * In preauth tests of test_ssl, explicitly break reference cycles invoving SingleConnectionTestServerThread to make sure that the thread is deleted. Otherwise, the test marks the environment as altered because the threading module sees a "dangling thread" (SingleConnectionTestServerThread). This test leak was introduced by the test added for the fix of issue gh-108310. * Use support.SHORT_TIMEOUT instead of hardcoded 1.0 or 2.0 seconds timeout. * SingleConnectionTestServerThread.run() catchs TimeoutError * Fix a race condition (missing synchronization) in test_preauth_data_to_tls_client(): the server now waits until the client connect() completed in call_after_accept(). * test_https_client_non_tls_response_ignored() calls server.join() explicitly. * Replace "localhost" with server.listener.getsockname()[0]. (cherry picked from commit 592bacb) Co-authored-by: Victor Stinner <vstinner@python.org>
…8370) (#108406) * In preauth tests of test_ssl, explicitly break reference cycles invoving SingleConnectionTestServerThread to make sure that the thread is deleted. Otherwise, the test marks the environment as altered because the threading module sees a "dangling thread" (SingleConnectionTestServerThread). This test leak was introduced by the test added for the fix of issue gh-108310. * Use support.SHORT_TIMEOUT instead of hardcoded 1.0 or 2.0 seconds timeout. * SingleConnectionTestServerThread.run() catchs TimeoutError * Fix a race condition (missing synchronization) in test_preauth_data_to_tls_client(): the server now waits until the client connect() completed in call_after_accept(). * test_https_client_non_tls_response_ignored() calls server.join() explicitly. * Replace "localhost" with server.listener.getsockname()[0]. (cherry picked from commit 592bacb) Co-authored-by: Victor Stinner <vstinner@python.org>
) (#108407) * In preauth tests of test_ssl, explicitly break reference cycles invoving SingleConnectionTestServerThread to make sure that the thread is deleted. Otherwise, the test marks the environment as altered because the threading module sees a "dangling thread" (SingleConnectionTestServerThread). This test leak was introduced by the test added for the fix of issue gh-108310. * Use support.SHORT_TIMEOUT instead of hardcoded 1.0 or 2.0 seconds timeout. * SingleConnectionTestServerThread.run() catchs TimeoutError * Fix a race condition (missing synchronization) in test_preauth_data_to_tls_client(): the server now waits until the client connect() completed in call_after_accept(). * test_https_client_non_tls_response_ignored() calls server.join() explicitly. * Replace "localhost" with server.listener.getsockname()[0]. (cherry picked from commit 592bacb) Co-authored-by: Victor Stinner <vstinner@python.org>
) (#108408) * In preauth tests of test_ssl, explicitly break reference cycles invoving SingleConnectionTestServerThread to make sure that the thread is deleted. Otherwise, the test marks the environment as altered because the threading module sees a "dangling thread" (SingleConnectionTestServerThread). This test leak was introduced by the test added for the fix of issue gh-108310. * Use support.SHORT_TIMEOUT instead of hardcoded 1.0 or 2.0 seconds timeout. * SingleConnectionTestServerThread.run() catchs TimeoutError * Fix a race condition (missing synchronization) in test_preauth_data_to_tls_client(): the server now waits until the client connect() completed in call_after_accept(). * test_https_client_non_tls_response_ignored() calls server.join() explicitly. * Replace "localhost" with server.listener.getsockname()[0]. (cherry picked from commit 592bacb)
…ythonGH-108370) (python#108407) * In preauth tests of test_ssl, explicitly break reference cycles invoving SingleConnectionTestServerThread to make sure that the thread is deleted. Otherwise, the test marks the environment as altered because the threading module sees a "dangling thread" (SingleConnectionTestServerThread). This test leak was introduced by the test added for the fix of issue pythongh-108310. * Use support.SHORT_TIMEOUT instead of hardcoded 1.0 or 2.0 seconds timeout. * SingleConnectionTestServerThread.run() catchs TimeoutError * Fix a race condition (missing synchronization) in test_preauth_data_to_tls_client(): the server now waits until the client connect() completed in call_after_accept(). * test_https_client_non_tls_response_ignored() calls server.join() explicitly. * Replace "localhost" with server.listener.getsockname()[0]. (cherry picked from commit 592bacb) Co-authored-by: Victor Stinner <vstinner@python.org>


invoving SingleConnectionTestServerThread to make sure that the
thread is deleted. Otherwise, the test marks the environment as
altered because the threading module sees a "dangling thread"
(SingleConnectionTestServerThread). This test leak was introduced
by the test added for the fix of issue CVE-2023-40217: Bypass TLS handshake on closed sockets #108310.
timeout.
test_preauth_data_to_tls_client(): the server now waits until the
client connect() completed in call_after_accept().
explicitly.