bpo-30500: Fix urllib.parse.splithost() to parse correctly fragments#1849

postmasters

The current regex based splitting produces a wrong result. For example::

Web browsers parse that URL as http://abc/#@def, that is, the host
is abc, the path is /, and the fragment is #@def.

the-knights-who-say-ni

Hello, and thanks for your contribution!

I'm a bot set up to make sure that the project can legally accept your contribution by verifying you have signed the PSF contributor agreement (CLA).

Unfortunately we couldn't find an account corresponding to your GitHub username on bugs.python.org (b.p.o) to verify you have signed the CLA (this might be simply due to a missing "GitHub Name" entry in your b.p.o account settings). This is necessary for legal reasons before we can look at your contribution. Please follow the steps outlined in the CPython devguide to rectify this issue.

Thanks again to your contribution and we look forward to looking at it!

mention-bot

@postmasters, thanks for your PR! By analyzing the history of the files in this pull request, we identified @orsenthil, @ncoghlan and @Yhg1s to be potential reviewers.

Mariatta

@postmasters Can you create an issue for this at https://bugs.python.org and then reference the issue number here in the title? Thanks

vadmium

This looks like this will change the behaviour in corner cases. E.g., what happens with empty URL components like splithost("///path;?#")? See https://bugs.python.org/issue22852 about this problem with urlparse and urlsplit. A dramatic change in implementation like this is not good for a bug fix.

Also, keep in mind that splithost is barely documented, and borderline unsupported, especially in Python 3. See my summary at https://bugs.python.org/issue27485#msg270215.

postmasters

@Mariatta Done.

@vadmium You rightly mentioned that splithost is not well documented. Does that not contradict your first statement that this is a "dramatic change"? Do you know of any other user of splithost than urllib itself? Thanks.

vadmium

What I meant was that splithost is basically deprecated as an external API. Bug fixes are okay, especially where they fix problems elsewhere in Python’s library, but unnecessary changes in behaviour or features could be a problem.

A quick search in my computer’s 2.7 directory shows splithost is used in various external packages. Many (bittorrent, m2crypto, feedparser, . . .) look like forks of internal Python libraries (urllib, xmlrpclib), and a few (subvertpy, twisted) may be original code.

postmasters

@vadmium I'm unsure of deprecation policy. If a function is deprecated, do we care if its behavior changes? After all, it's going to be removed (the purpose of deprecation), right? Others should normally not depend on deprecated functions. Besides, do we want to keep the broken behavior (as compared to regular web browsers) of a deprecated function? Also, where does it say that these functions are deprecated?

My plan is to base all these split* on urlparse. Then any fix to urlparse is automatically enjoyed by these functions. But I'm sure this hasty fix might break some other code. I'd love to hear your suggestion, Martin.

postmasters

@Haypo any further comment?

vadmium

The documentation doesn’t say much about these functions. The best I know of is at the bottom of https://docs.python.org/2.7/library/urllib.html#utility-functions, which “recommends” against directly using the functions. But judging by https://bugs.python.org/issue27485 I think there is a consensus to formally deprecate them. I think there are policies about waiting one or more versions between deprecation and removal, and maintaining compatibility with 2.7 until it is dead.

I think it is best to avoid changing behaviour of public APIs, including deprecated APIs, unless the behaviour is incorrect. So it may be okay to fix the splithost("//abc#@def") behaviour, but avoid spurious changes to corner cases if practical. Although I haven’t actually tried your code to see if it changes corner cases, or thought if there are other practical ways fix the bug.

postmasters

@vadmium so it looks like you're half-okay with this, pending on corner cases. The best I know is all existing test cases, including a few new ones, pass. That kinda gives some sort of assurance. Do let me know if you find out otherwise.

Other reviewers, could I have a pair of eyes on this again please? Thanks so much.

vadmium

Can you try out the following test cases (or even add them to the test suite if they don’t exist) with your implementation? The results I give are from the existing Python 3.5 implementation:

>>> urllib.parse.splithost("///file")  # Is the host the empty string or None?
('', '/file')
>>> urllib.parse.splithost("//example.net/file;")  # Is the semicolon retained?
('example.net', '/file;')
>>> urllib.parse.splithost("//example.net/file?")  # Question mark retained?
('example.net', '/file?')
>>> urllib.parse.splithost("//example.net/file#")  # Hash symbol retained?
('example.net', '/file#')

vstinner

The change lacks an entry in Misc/NEWS. You may add a [Security] prefix in your entry.

vstinner

[ First-time contributor ]

Oh, it's your first contrib to CPython: in that case, please also add yourself to Misc/ACKS.

vstinner

Since this change fixes a major security vulnerability, I took the liberty of pushing directly to your repository @postmasters. I hope that it is ok with you. I would like to fix this issue quickly.

I made the changes that I requested.

postmasters

Not at all. Thanks for doing that, @Haypo. I still don't know how best to address @vadmium test cases. They seem to be related, yet unrelated. I don't really know if the difference in behavior is important enough.

vstinner

@vadmium: "Can you try out the following test cases (or even add them to the test suite if they don’t exist) with your implementation? The results I give are from the existing Python 3.5 implementation:"

I added your splithost() test cases to test_urlparse, but the results are different from Python 3.5. Is it a backward incompatible change? Do you see a way to fix it?

vadmium

See comment against the existing regular expression for my (untested) solution

vadmium

If someone was relying on splithost to faithfully retain the original URL’s path + parameters + query + fragment, dropping empty components may not be compatible. I don’t know of a specific use case, but it seems plausable that this could e.g. break code that maintains a cache based on the URL, if it cannot find the cache because part of the URL was stripped. Also, changing the data type for empty hosts from str to None seems problematic.

Also, being unable to distinguish an empty host (e.g. ////evil.com with four slashes) from a missing host (/path) has been reported as a security problem: https://bugs.python.org/issue23505. It would be good to avoid opening one hole as we close another hole.

vstinner

Martin, it seems like you understand well the issue. I like your idea of fixing the regex to reduce the risk of backward compatibility regressions. Would you mind to create a new PR based on this one?

postmasters

The regex fix is the quickest and easiest to pull. But I'm afraid there might be further corner cases that we are not aware of. Besides, having different versions of utilities like this (split* and urlsplit/urlparse) clashes with "there should be one obvious way".

I really do think that split* functions should preferably be refactored to use urlparse rather than continuing down the regex path.

Then again, we could also fix the regex first, then work on refactoring later too.

vstinner

The regex fix is the quickest and easiest to pull. But I'm afraid there

might be further corner cases that we are not aware of. Security is a thing, backward compatibilty is another thing. I would feel bad to have to make a choice between the two. I prefer to be pragmatic: fix exactly the reported security issue and keep backward compatibilty. A security fix that can break applications is risky or even not acceptable.

vadmium

@Haypo, I probably won’t be making a pull request in the near future (limited time & priorities). But I am happy if you or someone else makes the change, as long as it works.

postmasters

I've updated the code to do what @vadmium suggested. Please take another look. Thanks.

@def

The current regex based splitting produces a wrong result. For example:: http://abc#@def Web browsers parse that URL as ``http://abc/#@def``, that is, the host is ``abc``, the path is ``/``, and the fragment is ``#@def``.

vstinner

Oh nice @postmasters, you rewrite your urllib change to only modify the regex: I like the new version! I took the liberty of fixing @vadmium requested change on the NEWS entry (parse correctly => correctly parse), since I wrote the NEWS entry and that I would like to get this change merged as soon as possible.

I will take care of backports.

bedevere-bot

GH-2289 is a backport of this pull request to the 3.6 branch.

bedevere-bot

GH-2290 is a backport of this pull request to the 3.5 branch.

@def

… (#2290) The current regex based splitting produces a wrong result. For example:: http://abc#@def Web browsers parse that URL as ``http://abc/#@def``, that is, the host is ``abc``, the path is ``/``, and the fragment is ``#@def``. (cherry picked from commit 90e01e5)

@def

… (#2289) The current regex based splitting produces a wrong result. For example:: http://abc#@def Web browsers parse that URL as ``http://abc/#@def``, that is, the host is ``abc``, the path is ``/``, and the fragment is ``#@def``. (cherry picked from commit 90e01e5)

bedevere-bot

GH-2294 is a backport of this pull request to the 2.7 branch.

@def

… (#2294) The current regex based splitting produces a wrong result. For example:: http://abc#@def Web browsers parse that URL as ``http://abc/#@def``, that is, the host is ``abc``, the path is ``/``, and the fragment is ``#@def``. (cherry picked from commit 90e01e5)

@def

…on#1849) (python#2289) The current regex based splitting produces a wrong result. For example:: http://abc#@def Web browsers parse that URL as ``http://abc/#@def``, that is, the host is ``abc``, the path is ``/``, and the fragment is ``#@def``. (cherry picked from commit 90e01e5) (cherry picked from commit 536c1f1)

@def

… (#2291) The current regex based splitting produces a wrong result. For example:: http://abc#@def Web browsers parse that URL as ``http://abc/#@def``, that is, the host is ``abc``, the path is ``/``, and the fragment is ``#@def``. (cherry picked from commit 90e01e5)

@def

…#1849) (#2292) The current regex based splitting produces a wrong result. For example:: http://abc#@def Web browsers parse that URL as ``http://abc/#@def``, that is, the host is ``abc``, the path is ``/``, and the fragment is ``#@def``. (cherry picked from commit 90e01e5) (cherry picked from commit cc54c1c)

the-knights-who-say-ni added the label May 29, 2017

postmasters mentioned this pull request May 29, 2017

[2.7] bpo-30500: urllib: Simplify splithost by calling into urlparse. #1850

Closed

Mariatta removed the CLA not signed label May 29, 2017

the-knights-who-say-ni added the CLA signed label May 29, 2017

postmasters changed the title ~~urllib: Simplify splithost by calling into urlparse.~~ May 29, 2017

Mariatta changed the title ~~[b.p.o 30500] urllib: Simplify splithost by calling into urlparse.~~ May 29, 2017

Mariatta added needs backport to 2.7 labels May 29, 2017

vstinner reviewed May 29, 2017

View reviewed changes

vstinner added the type-security A security issue label May 29, 2017

vstinner previously requested changes May 29, 2017

View reviewed changes

vstinner reviewed Jun 7, 2017

View reviewed changes

vstinner changed the title ~~bpo-30500: urllib: Simplify splithost by calling into urlparse.~~ Jun 14, 2017

vadmium reviewed Jun 17, 2017

View reviewed changes

urllib: Simplify splithost by calling into urlparse. …

b036db6

The current regex based splitting produces a wrong result. For example:: http://abc#@def Web browsers parse that URL as ``http://abc/#@def``, that is, the host is ``abc``, the path is ``/``, and the fragment is ``#@def``.

vstinner merged commit 90e01e5 into python:master Jun 20, 2017

bedevere-bot removed the needs backport to 3.6 label Jun 20, 2017

bedevere-bot removed the needs backport to 3.5 label Jun 20, 2017

bedevere-bot removed the needs backport to 2.7 label Jun 20, 2017

Conversation

postmasters commented May 29, 2017

Uh oh!

the-knights-who-say-ni commented May 29, 2017

Uh oh!

mention-bot commented May 29, 2017

Uh oh!

Mariatta commented May 29, 2017

Uh oh!

vadmium commented May 29, 2017

Uh oh!

postmasters commented May 29, 2017

Uh oh!

vadmium commented May 29, 2017

Uh oh!

postmasters commented May 29, 2017

Uh oh!

postmasters commented May 31, 2017

Uh oh!

vadmium commented Jun 3, 2017

Uh oh!

postmasters commented Jun 5, 2017

Uh oh!

vadmium commented Jun 5, 2017

Uh oh!

vstinner left a comment

Choose a reason for hiding this comment

Uh oh!

vstinner commented Jun 14, 2017

Uh oh!

vstinner commented Jun 14, 2017

Uh oh!

postmasters commented Jun 14, 2017

Uh oh!

vstinner commented Jun 14, 2017

Uh oh!

vadmium left a comment

Choose a reason for hiding this comment

Uh oh!

vadmium commented Jun 17, 2017

Uh oh!

vstinner commented Jun 17, 2017 via email

Uh oh!

postmasters commented Jun 17, 2017

Uh oh!

vstinner commented Jun 17, 2017 via email

Uh oh!

vadmium commented Jun 17, 2017

Uh oh!

postmasters commented Jun 19, 2017

Uh oh!

vstinner commented Jun 20, 2017

Uh oh!

bedevere-bot commented Jun 20, 2017

Uh oh!

bedevere-bot commented Jun 20, 2017

Uh oh!

bedevere-bot commented Jun 20, 2017

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

7 participants