bpo-42130: Fix swallowing of cancellation by wait_for#26097
Conversation
|
Hello, and thanks for your contribution! I'm a bot set up to make sure that the project can legally accept this contribution by verifying everyone involved has signed the PSF contributor agreement (CLA). CLA MissingOur records indicate the following people have not signed the CLA: For legal reasons we need all the people listed to sign the CLA before we can look at your contribution. Please follow the steps outlined in the CPython devguide to rectify this issue. If you have recently signed the CLA, please wait at least one business day You can check yourself to see if the CLA has been received. Thanks again for the contribution, we look forward to reviewing it! |
Sorry, something went wrong.
|
Does this actually solve the original issue where a connection/resource can be leaked, or are we just trading one variant of the bug for another? I agree this fixes some odd cancellation behaviour, but are we back to having a likelihood of leaked resources, if I remember correctly. Following from one of the comments in the second link: I'm still not sure why As a concept, why do we not have the waiter be a simple call_later handle and add a done callback to fut to cancel it? Likewise, the waiter func just checks if the future is done and cancels it if not. My impression is that this would avoid much of the complexity within # NOT TESTED: PoC pseudocode!
def _optionally_cancel_fut(fut, event):
if not fut.done():
event.set()
fut.cancel()
async def wait_for(fut, timeout):
loop = events.get_running_loop()
fut = ensure_future(fut, loop=loop)
timeout_occurred = asyncio.Event() # Likely not optimal
if timeout is not None and timeout > 0:
timeout_handle = loop.call_later(timeout, _optionally_cancel_fut, fut, timeout_occurred)
fut.add_done_callback(lambda fut: timeout_handle.cancel())
elif timeout is not None:
# Timeout must be negative or zero, cancel immediately
_optionally_cancel_fut(fut, timeout_occurred)
try:
return await fut
except exceptions.CancelledError as exc:
if timeout_occurred.is_set():
raise exceptions.TimeoutError() from exc
raise exc |
Sorry, something went wrong.
No, it doesn't solve bpo-37658. But reverted change doesn't solve it too, at least not completely. And we certainly have to work on leaked resources again. The main purpose of this PR is the test case. |
Sorry, something went wrong.
It's worth a try at least, I'll take it tomorrow. |
Sorry, something went wrong.
👍 I have checked that code in my comment above and it does what is expected at a high level (cancels on timeout etc), but I haven't checked it against the test suite nor against any of the bugs that have led to this. If it works then it's a whole lot easier to reason about than the current impl, but I've probably forgotten something important... 🙄 |
Sorry, something went wrong.
Based on pseudocode suggested by @aaliddell python#26097 (comment)
Here is my first attempt. With this change the following tests fail:
Let's put aside arguable incompatible changes in behaviour and elaborate on main goal. In test case |
Sorry, something went wrong.
|
👍 Regarding the 'badly behaved' coroutines: I'd argue that it is behaving as best as can be expected, since
If we try to patch a poorly behaved coroutine by ensuring we raise the timeout regardless of if the cancellation succeeds, we're masking the fact that the task underneath has been poorly behaved and expose ourselves to the resource leak, but I can see this is is perhaps going to be a point of debate 😄. I guess it comes down to two options:
Re |
Sorry, something went wrong.
IMO, both ways are acceptable. But the important thing is that it's incompatible change that may break someone's code.
To me the behaviour it tests shouldn't be guaranteed. Such things should be done with patterns like RAII or context managers. But it's quite problematic with current API and the whole point of trying the new approach was to remove the race in this particular case. And this goal is not reached. |
Sorry, something went wrong.
|
This PR is stale because it has been open for 30 days with no activity. |
Sorry, something went wrong.
|
Re: the aiokafka library issue tagged above by @ods. I hope this will add some perspective to the issue. When we upgrade to Python 3.8.10, we're no longer able to reliably stop the consumer (close connection and release resources). We wind up needing to toss the stop call into a task and move on but this leaves us with unreleased resources. Our only solution has been to downgrade to Python 3.8.5 which isn't very sustainable in the long run. |
Sorry, something went wrong.
|
I think each time async def wait_for(fut, timeout):
"""Wait for the single Future or coroutine to complete, with timeout.
Coroutine will be wrapped in Task.
Returns result of the Future or coroutine. When a timeout occurs,
it cancels the task and raises TimeoutError. To avoid the task
cancellation, wrap it in shield().
If the wait is cancelled, the task is also cancelled.
This function is a coroutine.
"""
loop = events.get_running_loop()
if timeout is None:
return await fut
timed_out = False
event = asyncio.Event()
fut = ensure_future(fut, loop=loop)
set_event = event.set
fut.add_done_callback(set_event)
if timeout <= 0:
timed_out = True
fut.cancel()
while not fut.done():
try:
await event.wait()
except asyncio.CancelledError as e:
timed_out = False
fut.cancel()
else:
def on_timeout():
nonlocal timed_out
timed_out = True
fut.cancel()
timeout_handle = loop.call_later(timeout, on_timeout)
try:
while not fut.done():
try:
await event.wait()
except asyncio.CancelledError as e:
timed_out = False
timeout_handle.cancel()
fut.cancel()
finally:
timeout_handle.cancel()
if timed_out:
try:
return fut.result()
except exceptions.CancelledError as exc:
raise exceptions.TimeoutError() from exc
return fut.result() |
Sorry, something went wrong.
I think you may have missed the issue. There are various timing issues, that result in situations where the future may already be complete when handling the cancellation. By only changing the code in So far, the only proposal that might work will require making some changes to how the event loop runs tasks. |
Sorry, something went wrong.
That might be true for 3.11 but 3.8 - 3.10 still should be fixed. |
Sorry, something went wrong.
and in https://www.python.org/downloads/release/python-3913/
so this could only be backported as far as 3.10 |
Sorry, something went wrong.
@Dreamsorcerer I also had a go here bf594bd but found a bug in asyncio.timeout |
Sorry, something went wrong.
Related:
https://bugs.python.org/issue42130