gh-87389: avoid treating path as URI with netloc#93894
Conversation
still needs test for urlunsplit() change
|
When you're done making the requested changes, leave the comment: |
Sorry, something went wrong.
|
I have made the requested changes; please review again. I added |
Sorry, something went wrong.
|
Thanks for making the requested changes! @gpshead: please review the changes made to this pull request. |
Sorry, something went wrong.
The pathsplit() function will work correctly on relative paths too so don't say "absolute paths". Improve comment for _get_redirect_url().
This avoids "manual parsing" of the Request-URI part of the request and matches what _get_redirect_url() does.
Possible that someone could override this so a method is nicer.
Since pathsplit() doesn't seem like a generally useful public API, remove it. Instead, add a _request_path_split() method. This ensures that the redirect logic and the translate_path() method use the same path parsing.
|
@gpshead what do you think about landing this change for 3.12 only as an enhancement? |
Sorry, something went wrong.
Yeah I think this landing in 3.12 as a feature makes sense. this PR branch needs syncing now that the other PR has gone in. |
Sorry, something went wrong.
|
A few thoughts about this. I can rework the PR so that changes to urlib.parse are separated from the httpserver change. The fix made is 3.11 is sufficient for fixing the security bug so the change in this PR isn't strictly required. The change to urllib.parse.urlunsplit needs careful consideration. Not merging it into 3.11 was the right move. Thinking about it more, I'm having doubts that we can change this. Maybe we need a new API or maybe we just can't change it. It seems almost certain that some code expects the existing (IMHO insecure) behaviour of We could try changing it an alpha and see what breaks. Or, we can try to analyze publicly available code and see how Ideally we should analyze all the arguments of |
Sorry, something went wrong.
Agreed, it's a challenge. I'm not personally motivated to try and push for this change to the level of doing code analysis of code-at-large's urlunsplit API calls. I like your thinking that further analysis of all the safety checks required being good. There are probably useful things to learn from what other libraries (in any language) do for URL construction from their parts libraries. Adding this is always easier as a I suggest this get tracked as its own Enhancement request Issue. This PR can presumably be tied to that and closed for now given more not yet fully defined work is needed if it is going to be done at all. |
Sorry, something went wrong.
There was a problem hiding this comment.
There seem to be three behaviour changes proposed:
- Changing urlunsplit(('', '', '//path', '', '')) to return '/path' instead of '//path'
Another option is to prefix with double slash, representing an empty host e.g. '////path'. This is proposed in issue #78457 and pull request #113563. I think I prefer that four-slash option, because it also fixes some urlsplit → urlunsplit round-trip cases.
- Changing urlunsplit(('', '', 'colon:path', '', '')) → './colon:path'
This seems a reasonable change, and it is kind of suggested in RFC 3986. (Another option might be to encode the first colon, and return 'colon%3Apath'.)
- SimpleHTTPRequestHandler’s handling of
GET https://example.net/dir
This is a legal HTTP 1.0 and 1.1 request, but is mainly for proxy servers, which is not what SimpleHTTPRequestHandler does. In this case the server looks up https:/example.net/dir as a path in its filesystem (which is not in spirit of HTTP), and decides to redirect with a trailing slash.
Currently it looks like the code sends Location: https://example.net/dir/. I don’t think there is anything really wrong with that.
The proposed changes would send Location: ./https://example.net/dir/. This new redirect is a path-relative URL. The base URL is supposed to be the original target https://example.net/dir, so the redirect would resolve to https://example.net/https://example.net/dir/, which is not intended.
If you want to fix anything in the HTTP server, I would make the server ignore the scheme and authority components, and just look up the path component. But I don’t think anyone is complaining about that, so it may not be worth fixing.
If the urllib.parse changes are too disruptive, perhaps a deprecation warning is the best way forward, and either add an opt-in way to get the new behaviour, or change the warning to an exception in the future?
Sorry, something went wrong.
|
This PR is stale because it has been open for 30 days with no activity. |
Sorry, something went wrong.
I believe this is a more correct fix for gh-87389 and may fix security bugs in other applications using
urlunsplit(). We should not be confusing thepathargument as a netloc, IMHO. I fixedhttp.serverby not usingurllib.parseon thepath. Forhttp.server, that part of the HTTP request is not treated as a full URL but instead a path + optional query + optional fragment. So just parsing it "by hand" and not usingurllib.parseavoids the bug.