bpo-30500: Fix urllib.parse.splithost() to parse correctly fragments#1849
bpo-30500: Fix urllib.parse.splithost() to parse correctly fragments#1849vstinner merged 1 commit into
Conversation
|
Hello, and thanks for your contribution! I'm a bot set up to make sure that the project can legally accept your contribution by verifying you have signed the PSF contributor agreement (CLA). Unfortunately we couldn't find an account corresponding to your GitHub username on bugs.python.org (b.p.o) to verify you have signed the CLA (this might be simply due to a missing "GitHub Name" entry in your b.p.o account settings). This is necessary for legal reasons before we can look at your contribution. Please follow the steps outlined in the CPython devguide to rectify this issue. Thanks again to your contribution and we look forward to looking at it! |
Sorry, something went wrong.
|
@postmasters, thanks for your PR! By analyzing the history of the files in this pull request, we identified @orsenthil, @ncoghlan and @Yhg1s to be potential reviewers. |
Sorry, something went wrong.
|
@postmasters Can you create an issue for this at https://bugs.python.org and then reference the issue number here in the title? Thanks |
Sorry, something went wrong.
|
This looks like this will change the behaviour in corner cases. E.g., what happens with empty URL components like splithost("///path;?#")? See https://bugs.python.org/issue22852 about this problem with urlparse and urlsplit. A dramatic change in implementation like this is not good for a bug fix. Also, keep in mind that splithost is barely documented, and borderline unsupported, especially in Python 3. See my summary at https://bugs.python.org/issue27485#msg270215. |
Sorry, something went wrong.
|
What I meant was that splithost is basically deprecated as an external API. Bug fixes are okay, especially where they fix problems elsewhere in Python’s library, but unnecessary changes in behaviour or features could be a problem. A quick search in my computer’s 2.7 directory shows splithost is used in various external packages. Many (bittorrent, m2crypto, feedparser, . . .) look like forks of internal Python libraries (urllib, xmlrpclib), and a few (subvertpy, twisted) may be original code. |
Sorry, something went wrong.
|
@vadmium I'm unsure of deprecation policy. If a function is deprecated, do we care if its behavior changes? After all, it's going to be removed (the purpose of deprecation), right? Others should normally not depend on deprecated functions. Besides, do we want to keep the broken behavior (as compared to regular web browsers) of a deprecated function? Also, where does it say that these functions are deprecated? My plan is to base all these |
Sorry, something went wrong.
|
The documentation doesn’t say much about these functions. The best I know of is at the bottom of https://docs.python.org/2.7/library/urllib.html#utility-functions, which “recommends” against directly using the functions. But judging by https://bugs.python.org/issue27485 I think there is a consensus to formally deprecate them. I think there are policies about waiting one or more versions between deprecation and removal, and maintaining compatibility with 2.7 until it is dead. I think it is best to avoid changing behaviour of public APIs, including deprecated APIs, unless the behaviour is incorrect. So it may be okay to fix the splithost("//abc#@def") behaviour, but avoid spurious changes to corner cases if practical. Although I haven’t actually tried your code to see if it changes corner cases, or thought if there are other practical ways fix the bug. |
Sorry, something went wrong.
|
@vadmium so it looks like you're half-okay with this, pending on corner cases. The best I know is all existing test cases, including a few new ones, pass. That kinda gives some sort of assurance. Do let me know if you find out otherwise. Other reviewers, could I have a pair of eyes on this again please? Thanks so much. |
Sorry, something went wrong.
|
Can you try out the following test cases (or even add them to the test suite if they don’t exist) with your implementation? The results I give are from the existing Python 3.5 implementation: >>> urllib.parse.splithost("///file") # Is the host the empty string or None?
('', '/file')
>>> urllib.parse.splithost("//example.net/file;") # Is the semicolon retained?
('example.net', '/file;')
>>> urllib.parse.splithost("//example.net/file?") # Question mark retained?
('example.net', '/file?')
>>> urllib.parse.splithost("//example.net/file#") # Hash symbol retained?
('example.net', '/file#') |
Sorry, something went wrong.
vstinner
left a comment
There was a problem hiding this comment.
The change lacks an entry in Misc/NEWS. You may add a [Security] prefix in your entry.
Sorry, something went wrong.
|
[ First-time contributor ] Oh, it's your first contrib to CPython: in that case, please also add yourself to Misc/ACKS. |
Sorry, something went wrong.
|
Since this change fixes a major security vulnerability, I took the liberty of pushing directly to your repository @postmasters. I hope that it is ok with you. I would like to fix this issue quickly. |
Sorry, something went wrong.
|
@vadmium: "Can you try out the following test cases (or even add them to the test suite if they don’t exist) with your implementation? The results I give are from the existing Python 3.5 implementation:" I added your splithost() test cases to test_urlparse, but the results are different from Python 3.5. Is it a backward incompatible change? Do you see a way to fix it? |
Sorry, something went wrong.
vadmium
left a comment
There was a problem hiding this comment.
See comment against the existing regular expression for my (untested) solution
Sorry, something went wrong.
|
If someone was relying on splithost to faithfully retain the original URL’s path + parameters + query + fragment, dropping empty components may not be compatible. I don’t know of a specific use case, but it seems plausable that this could e.g. break code that maintains a cache based on the URL, if it cannot find the cache because part of the URL was stripped. Also, changing the data type for empty hosts from str to None seems problematic. Also, being unable to distinguish an empty host (e.g. ////evil.com with four slashes) from a missing host (/path) has been reported as a security problem: https://bugs.python.org/issue23505. It would be good to avoid opening one hole as we close another hole. |
Sorry, something went wrong.
|
Martin, it seems like you understand well the issue. I like your idea of
fixing the regex to reduce the risk of backward compatibility regressions.
Would you mind to create a new PR based on this one?
|
Sorry, something went wrong.
|
The regex fix is the quickest and easiest to pull. But I'm afraid there might be further corner cases that we are not aware of. Besides, having different versions of utilities like this (split* and urlsplit/urlparse) clashes with "there should be one obvious way". I really do think that split* functions should preferably be refactored to use urlparse rather than continuing down the regex path. Then again, we could also fix the regex first, then work on refactoring later too. |
Sorry, something went wrong.
|
The regex fix is the quickest and easiest to pull. But I'm afraid there
might be further corner cases that we are not aware of.
Security is a thing, backward compatibilty is another thing. I would feel
bad to have to make a choice between the two. I prefer to be pragmatic: fix
exactly the reported security issue and keep backward compatibilty.
A security fix that can break applications is risky or even not acceptable.
|
Sorry, something went wrong.
|
@Haypo, I probably won’t be making a pull request in the near future (limited time & priorities). But I am happy if you or someone else makes the change, as long as it works. |
Sorry, something went wrong.
|
I've updated the code to do what @vadmium suggested. Please take another look. Thanks. |
Sorry, something went wrong.
The current regex based splitting produces a wrong result. For example:: http://abc#@def Web browsers parse that URL as ``http://abc/#@def``, that is, the host is ``abc``, the path is ``/``, and the fragment is ``#@def``.
|
Oh nice @postmasters, you rewrite your urllib change to only modify the regex: I like the new version! I took the liberty of fixing @vadmium requested change on the NEWS entry (parse correctly => correctly parse), since I wrote the NEWS entry and that I would like to get this change merged as soon as possible. I will take care of backports. |
Sorry, something went wrong.
…on#1849) (python#2289) The current regex based splitting produces a wrong result. For example:: http://abc#@def Web browsers parse that URL as ``http://abc/#@def``, that is, the host is ``abc``, the path is ``/``, and the fragment is ``#@def``. (cherry picked from commit 90e01e5) (cherry picked from commit 536c1f1)
…#1849) (#2292) The current regex based splitting produces a wrong result. For example:: http://abc#@def Web browsers parse that URL as ``http://abc/#@def``, that is, the host is ``abc``, the path is ``/``, and the fragment is ``#@def``. (cherry picked from commit 90e01e5) (cherry picked from commit cc54c1c)
The current regex based splitting produces a wrong result. For example::
http://abc#@def
Web browsers parse that URL as
http://abc/#@def, that is, the hostis
abc, the path is/, and the fragment is#@def.