`test_dtrace` can hang indefinitely when bpftrace usability check times out
Bug report
Bug description:
From this CentOS Buildbot run:
0:25:18 load avg: 2.68 [504/505] test_zipfile passed (3 min 58 sec) -- running (1): test_dtrace (22 min 49 sec)
0:25:48 load avg: 1.63 running (1): test_dtrace (23 min 19 sec)
0:26:18 load avg: 0.98 running (1): test_dtrace (23 min 49 sec)
0:26:48 load avg: 0.60 running (1): test_dtrace (24 min 19 sec)
0:27:18 load avg: 0.36 running (1): test_dtrace (24 min 49 sec)
0:27:48 load avg: 0.22 running (1): test_dtrace (25 min 19 sec)
0:28:18 load avg: 0.13 running (1): test_dtrace (25 min 49 sec)
0:28:48 load avg: 0.08 running (1): test_dtrace (26 min 19 sec)
0:29:18 load avg: 0.05 running (1): test_dtrace (26 min 49 sec)
0:29:48 load avg: 0.03 running (1): test_dtrace (27 min 19 sec)
0:30:18 load avg: 0.02 running (1): test_dtrace (27 min 49 sec)
0:30:48 load avg: 0.01 running (1): test_dtrace (28 min 19 sec)
0:31:18 load avg: 0.00 running (1): test_dtrace (28 min 49 sec)
0:31:48 load avg: 0.00 running (1): test_dtrace (29 min 19 sec)
0:32:18 load avg: 0.00 running (1): test_dtrace (29 min 49 sec)
0:32:48 load avg: 0.00 running (1): test_dtrace (30 min 19 sec)
0:33:18 load avg: 0.00 running (1): test_dtrace (30 min 49 sec)
0:33:48 load avg: 0.14 running (1): test_dtrace (31 min 19 sec)
0:34:18 load avg: 0.08 running (1): test_dtrace (31 min 49 sec)
0:34:48 load avg: 0.05 running (1): test_dtrace (32 min 19 sec)
0:35:18 load avg: 0.03 running (1): test_dtrace (32 min 49 sec)
0:35:48 load avg: 0.02 running (1): test_dtrace (33 min 19 sec)
0:36:18 load avg: 0.01 running (1): test_dtrace (33 min 49 sec)
0:36:48 load avg: 0.00 running (1): test_dtrace (34 min 19 sec)
0:37:18 load avg: 0.00 running (1): test_dtrace (34 min 49 sec)
0:37:48 load avg: 0.00 running (1): test_dtrace (35 min 19 sec)
0:38:18 load avg: 0.00 running (1): test_dtrace (35 min 49 sec)
0:38:48 load avg: 0.00 running (1): test_dtrace (36 min 19 sec)
0:39:18 load avg: 0.00 running (1): test_dtrace (36 min 49 sec)
0:39:48 load avg: 0.45 running (1): test_dtrace (37 min 19 sec)
0:40:18 load avg: 2.92 running (1): test_dtrace (37 min 49 sec)
0:40:48 load avg: 3.79 running (1): test_dtrace (38 min 19 sec)
0:41:18 load avg: 4.79 running (1): test_dtrace (38 min 49 sec)
0:41:48 load avg: 4.97 running (1): test_dtrace (39 min 19 sec)
0:42:18 load avg: 5.75 running (1): test_dtrace (39 min 49 sec)
0:42:48 load avg: 6.48 running (1): test_dtrace (40 min 19 sec)
0:43:18 load avg: 6.59 running (1): test_dtrace (40 min 49 sec)
0:43:48 load avg: 7.63 running (1): test_dtrace (41 min 19 sec)
0:44:18 load avg: 8.18 running (1): test_dtrace (41 min 49 sec)
0:44:48 load avg: 8.63 running (1): test_dtrace (42 min 19 sec)
0:45:18 load avg: 9.51 running (1): test_dtrace (42 min 49 sec)
0:45:48 load avg: 9.46 running (1): test_dtrace (43 min 19 sec)
0:46:18 load avg: 9.73 running (1): test_dtrace (43 min 49 sec)
0:46:48 load avg: 9.57 running (1): test_dtrace (44 min 19 sec)
0:47:18 load avg: 10.71 running (1): test_dtrace (44 min 49 sec)
0:47:29 load avg: 10.19 [505/505/1] test_dtrace worker non-zero exit code (Exit code 1)
beginning 6 repetitions. Showing number of leaks (. for 0 or less, X for 10 or more)
123:456
setUpClass (test.test_dtrace.BPFTraceNormalTests) ... skipped 'bpftrace(1) failed with code 1: ERROR: bpftrace currently only supports running as the root user.\n'
Timeout (0:45:00)!
Thread 0x00007fc88325ab80 [python] (most recent call first):
File "/home/buildbot/buildarea/3.15.itamaro-centos-aws.refleak.nogil/build/Lib/selectors.py", line 398 in select
File "/home/buildbot/buildarea/3.15.itamaro-centos-aws.refleak.nogil/build/Lib/subprocess.py", line 291 in _communicate_io_posix
File "/home/buildbot/buildarea/3.15.itamaro-centos-aws.refleak.nogil/build/Lib/subprocess.py", line 2385 in _communicate
File "/home/buildbot/buildarea/3.15.itamaro-centos-aws.refleak.nogil/build/Lib/subprocess.py", line 1389 in communicate
File "/home/buildbot/buildarea/3.15.itamaro-centos-aws.refleak.nogil/build/Lib/test/test_dtrace.py", line 255 in assert_usable
File "/home/buildbot/buildarea/3.15.itamaro-centos-aws.refleak.nogil/build/Lib/test/test_dtrace.py", line 296 in setUpClass
File "/home/buildbot/buildarea/3.15.itamaro-centos-aws.refleak.nogil/build/Lib/unittest/suite.py", line 166 in _handleClassSetUp
File "/home/buildbot/buildarea/3.15.itamaro-centos-aws.refleak.nogil/build/Lib/unittest/suite.py", line 114 in run
File "/home/buildbot/buildarea/3.15.itamaro-centos-aws.refleak.nogil/build/Lib/unittest/suite.py", line 84 in __call__
File "/home/buildbot/buildarea/3.15.itamaro-centos-aws.refleak.nogil/build/Lib/unittest/suite.py", line 122 in run
File "/home/buildbot/buildarea/3.15.itamaro-centos-aws.refleak.nogil/build/Lib/unittest/suite.py", line 84 in __call__
File "/home/buildbot/buildarea/3.15.itamaro-centos-aws.refleak.nogil/build/Lib/unittest/runner.py", line 256 in run
File "/home/buildbot/buildarea/3.15.itamaro-centos-aws.refleak.nogil/build/Lib/test/libregrtest/single.py", line 84 in _run_suite
File "/home/buildbot/buildarea/3.15.itamaro-centos-aws.refleak.nogil/build/Lib/test/libregrtest/single.py", line 42 in run_unittest
File "/home/buildbot/buildarea/3.15.itamaro-centos-aws.refleak.nogil/build/Lib/test/libregrtest/single.py", line 162 in test_func
File "/home/buildbot/buildarea/3.15.itamaro-centos-aws.refleak.nogil/build/Lib/test/libregrtest/refleak.py", line 145 in runtest_refleak
File "/home/buildbot/buildarea/3.15.itamaro-centos-aws.refleak.nogil/build/Lib/test/libregrtest/single.py", line 114 in regrtest_runner
File "/home/buildbot/buildarea/3.15.itamaro-centos-aws.refleak.nogil/build/Lib/test/libregrtest/single.py", line 165 in _load_run_test
File "/home/buildbot/buildarea/3.15.itamaro-centos-aws.refleak.nogil/build/Lib/test/libregrtest/single.py", line 210 in _runtest_env_changed_exc
File "/home/buildbot/buildarea/3.15.itamaro-centos-aws.refleak.nogil/build/Lib/test/libregrtest/single.py", line 319 in _runtest
File "/home/buildbot/buildarea/3.15.itamaro-centos-aws.refleak.nogil/build/Lib/test/libregrtest/single.py", line 348 in run_single_test
File "/home/buildbot/buildarea/3.15.itamaro-centos-aws.refleak.nogil/build/Lib/test/libregrtest/worker.py", line 99 in worker_process
File "/home/buildbot/buildarea/3.15.itamaro-centos-aws.refleak.nogil/build/Lib/test/libregrtest/worker.py", line 137 in main
File "/home/buildbot/buildarea/3.15.itamaro-centos-aws.refleak.nogil/build/Lib/test/libregrtest/worker.py", line 141 in <module>
File "/home/buildbot/buildarea/3.15.itamaro-centos-aws.refleak.nogil/build/Lib/runpy.py", line 87 in _run_code
File "/home/buildbot/buildarea/3.15.itamaro-centos-aws.refleak.nogil/build/Lib/runpy.py", line 201 in _run_module_as_main
Looking at the test:
| # Check if bpftrace is available and can attach to USDT probes | |
| program = f'usdt:{sys.executable}:python:function__entry {{ printf("probe: success\\n"); exit(); }}' | |
| try: | |
| proc = subprocess.Popen( | |
| ["bpftrace", "-e", program, "-c", f"{sys.executable} -c pass"], | |
| stdout=subprocess.PIPE, | |
| stderr=subprocess.PIPE, | |
| universal_newlines=True, | |
| ) | |
| stdout, stderr = proc.communicate(timeout=10) | |
| except subprocess.TimeoutExpired: | |
| proc.kill() | |
| proc.communicate() # Clean up | |
| raise unittest.SkipTest("bpftrace timed out during usability check") | |
| except OSError as e: | |
| raise unittest.SkipTest(f"bpftrace not available: {e}") |
I'm guessing, after proc.communicate(timeout=10) times out, proc.kill() only kills bpftrace while its forked child survives holding the pipe open, so the unbounded cleanup proc.communicate() blocks until the test times out.
Likely introduced by 9a268e3, CC @pablogsal
CPython versions tested on:
CPython main branch
Operating systems tested on:
No response