bpo-36338: urllib.urlparse rejects invalid IPv6 addresses #16780

vstinner

The urllib.parse module now rejects invalid IPv6 addresses and
invalid port numbers when parsing a URL.

https://bugs.python.org/issue36338

vstinner

@corona10: I wrote the scope test differently to make it more readable.

corona10

@corona10: I wrote the scope test differently to make it more readable.

@vstinner Awesome!

vstinner

@vadmium, @zooba, @tirkarthi: Would you mind reviewing this change? What do you think of my approach of using ipaddress and excluding some characters from the IPv6 scope ("zone identifier")?
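As a rough illustration of the approach asked about here (not the actual patch), the text between the brackets can be split at the first % and the address part handed to ipaddress. The helper name check_ipv6_host and the FORBIDDEN_ZONE_CHARS set are hypothetical; which characters to reject in a zone identifier is exactly the open question in this thread.

```python
import ipaddress

# Hypothetical set of characters to reject in a zone identifier.
# The thread leaves the exact set open; this is only a placeholder.
FORBIDDEN_ZONE_CHARS = set("[]/?#@%")


def check_ipv6_host(host):
    """Validate the text found between '[' and ']' in a URL netloc.

    Returns (address, zone), where zone is None when absent.
    Raises ValueError for an invalid address or zone.
    """
    addr, sep, zone = host.partition("%")
    # IPv6Address raises AddressValueError (a ValueError subclass)
    # when addr is not a valid IPv6 address.
    address = ipaddress.IPv6Address(addr)
    if not sep:
        return address, None
    if not zone or FORBIDDEN_ZONE_CHARS & set(zone):
        raise ValueError(f"invalid IPv6 zone identifier: {zone!r}")
    return address, zone
```

Splitting on % by hand, instead of passing the whole string to ipaddress, keeps the zone check explicit and independent of how a given Python version treats scoped addresses.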

vstinner

I modified my PR to also fix https://bugs.python.org/issue33342 : "urllib IPv6 parsing fails with special characters in passwords".

vstinner

@orsenthil: Would you mind reviewing this change? Any idea which characters should be allowed in an IPv6 scope?

orsenthil

@vstinner - Sure, I will review. I will refer to the RFC for the valid characters for the IPv6 scope.

vstinner

I tried to allow [ and ] in the user:password part, but then the URL parser is confused by the URL:

http://[::1%sc[o]pe]

It reads it as IPv6 ::1%sc[o.
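One way a parser could end up with that result is sketched below. This is a deliberately naive strategy written for illustration, with a hypothetical helper name; it is not urllib's actual code. It takes everything between the first "[" and the next "]".

```python
def naive_bracketed_host(netloc):
    """Return the text between the first '[' and the next ']'.

    Deliberately naive: nested brackets inside the scope are not handled.
    """
    _, _, rest = netloc.partition("[")
    host, _, _ = rest.partition("]")
    return host


print(naive_bracketed_host("[::1%sc[o]pe]"))  # ::1%sc[o
```

Once "[" and "]" may also appear elsewhere in the netloc, the first closing bracket no longer reliably marks the end of the IPv6 literal.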

* bpo-36338: The urllib.parse module now rejects invalid IPv6 addresses and invalid port numbers when parsing a URL.
* bpo-33342: Fix urlparse() for IPv6 address with user:password when user and/or password contain "[" and/or "]" characters.

vstinner

I rebased my PR to fix the merge conflict.

@orsenthil: Ping for review.

vstinner

Firefox doesn't seem to accept % in the IPv6 part of a URL. When I type the following URL, it opens Google with the URL as a search...

http://[::1%1]:8000/

vstinner

Firefox doesn't seem to accept % in the IPv6 part of a URL. When I type the following URL, it opens Google with the URL as a search...

Same behavior in Chromium.

orsenthil

LGTM. Thank you.

orsenthil

Ping for review.

Done. Thank you, Victor.

orsenthil

Firefox doesn't seem to accept % in the IPv6 part of a URL.

Same behavior in Chromium.

I expected these browsers to percent-encode these and work.

https://bugzilla.mozilla.org/show_bug.cgi?id=700999

Also, "Microsoft Edge (as well as Microsoft Explorer) works well with link local IPV6 addresses."

This is the most relevant information I found:

https://en.wikipedia.org/wiki/IPv6_address#Use_of_zone_indices_in_URIs

When used in uniform resource identifiers (URI), the use of the percent sign causes a syntax conflict, therefore it must be escaped via percent-encoding, e.g.:

http://[fe80::1ff:fe23:4567:890a%25eth0]/
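A minimal sketch of producing that escaped form with urllib.parse.quote follows. The helper name format_zoned_host is hypothetical and only illustrates the escaping step.

```python
from urllib.parse import quote


def format_zoned_host(address, zone):
    """Build a bracketed IPv6 host with a percent-encoded zone separator."""
    # quote("%") yields "%25"; safe="" also escapes "/" inside the zone.
    return f"[{address}{quote('%')}{quote(zone, safe='')}]"


url = f"http://{format_zoned_host('fe80::1ff:fe23:4567:890a', 'eth0')}/"
print(url)  # http://[fe80::1ff:fe23:4567:890a%25eth0]/
```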

vstinner

The living URL Standard doesn't implement IPv6 scope on purpose:

Support for <zone_id> is intentionally omitted.

This comment points to https://www.w3.org/Bugs/Public/show_bug.cgi?id=27234#c2 which is a comment written by Ryan Sleevi on 2015-08-14:

Yes, we're especially not keen to support these in Chrome and have repeatedly decided not to. The platform-specific nature of <zone_id> makes it difficult to impossible to validate the well-formedness of the URL (see https://tools.ietf.org/html/rfc4007#section-11.2 , as referenced in 6874, to fully appreciate this special hell). Even if we could reliably parse these (from a URL spec standpoint), it then has to be handed 'somewhere', and that opens a new can of worms.

Even 6874 notes how unlikely it is to encounter these in practice

   Thus, URIs including a
   ZoneID are unlikely to be encountered in HTML documents.  However, if
   they do (for example, in a diagnostic script coded in HTML), it would
   be appropriate to treat them exactly as above.

Note that a 'dumb' parser may not be sufficient, as the Security Considerations of 6874 note:

   To limit this risk, implementations MUST NOT allow use of this format
   except for well-defined usages, such as sending to link-local
   addresses under prefix fe80::/10.  At the time of writing, this is
   the only well-defined usage known.

And also

   An HTTP client, proxy, or other intermediary MUST remove any ZoneID
   attached to an outgoing URI, as it has only local significance at the
   sending host.

This requires a transformative rewrite of any URLs going out the wire. That's pretty substantial. Anne, do you recall the bug talking about IP canonicalization (e.g. http://127.0.0.1 vs http://[::127.0.0.1] vs http://012345 and friends?) This is conceptually a similar issue - except it's explicitly required in the context of <zone_id> that the <zone_id> not be emitted.
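The two RFC 6874 requirements quoted above can be sketched in a few lines, assuming the host has already been extracted from the brackets. Both helper names are hypothetical, and this is only an illustration of the rules, not code from any browser or from urllib.

```python
import ipaddress


def strip_zone_id(host):
    """Drop anything after the first '%' in an IPv6 host string."""
    addr, _, _ = host.partition("%")
    return addr


def zone_allowed(host):
    """Return True only if a zone is attached to a link-local address."""
    addr, sep, zone = host.partition("%")
    # is_link_local is True for addresses under fe80::/10.
    return bool(sep and zone) and ipaddress.IPv6Address(addr).is_link_local
```

Even this small sketch shows the rewrite step the comment refers to: the host sent on the wire differs from the host the user typed.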

There's also the issue that zone_id precludes/requires the use of APIs that user agents would otherwise prefer to avoid, in order to 'properly' handle the zone_id interpretation. For example, Chromium on some platforms uses a built in DNS resolver, and so our address lookup functions would need to define and support <zone_id>'s and map them to system concepts. In doing so, you could end up with weird situations where a URL works in Firefox but not Chrome, even though both 'hypothetically' supported <zone_id>'s, because FF may use an OS routine and Chrome may use a built-in routine and they diverge.

Overall, our internal consensus is that <zone_id>'s are bonkers on many grounds - the technical ambiguity (and RFC 6874 doesn't really resolve the ambiguity as much as it fully owns it and just says #YOLOSWAG) - and supporting them would add a lot of complexity for what is explicitly and admittedly a limited value use case.

Firefox feature request https://bugzilla.mozilla.org/show_bug.cgi?id=700999 was also rejected on 2015-08-14, citing this comment.

Currently, only Microsoft Edge supports IPv6 scope: Firefox and Chromium don't.

I suggest following the example of Firefox, Chromium and the living URL Standard: don't support IPv6 scope.

My current implementation doesn't implement RFC 6874, which specifies %25 as the separator between the IPv6 address and the scope. For example, address ::1 with scope eth0 should be written ::1%25eth0. This syntax is hard to read with numeric scopes, which are common: ::1 with scope 2 should be written ::1%252 :-(
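For illustration only, reading the RFC 6874 form back could look like the sketch below. The helper name split_rfc6874_host is hypothetical and is not part of this PR.

```python
from urllib.parse import unquote


def split_rfc6874_host(host):
    """Split 'addr%25zone' into (addr, zone); zone is None if absent."""
    addr, sep, zone = host.partition("%25")
    if not sep:
        return host, None
    # The zone itself may contain further percent-escapes.
    return addr, unquote(zone)


print(split_rfc6874_host("::1%25eth0"))  # ('::1', 'eth0')
print(split_rfc6874_host("::1%252"))     # ('::1', '2')
```

The second call shows the readability problem with numeric scopes: the trailing 2 is the zone, while %25 is only the escaped separator.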

vstinner

It is supposed to get the unquoted URL. I relied on tests of urlparse to state this.

This makes me even more uncomfortable about supporting IPv6 scope: it is not well defined whether urlsplit() is expected to be used on a quoted or an unquoted URL. This is a major difference for RFC 6874, which is tied to quoted characters. Not well defined means: we should not have to dig into tests to reverse engineer the "expected" function behavior. It should be well documented and well tested.
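A small sketch of why the quoted/unquoted question matters for zone identifiers. It uses only str.partition and urllib.parse.unquote, so it does not depend on how any particular urlsplit() version behaves.

```python
from urllib.parse import unquote

# Read as an unquoted host: address ::1 with zone "25".
# Read as a quoted host: "%25" decodes to a bare "%", leaving an empty zone.
host = "::1%25"
print(host.partition("%"))  # ('::1', '%', '25')
print(unquote(host))        # ::1%
```

The same input yields two different zones depending on which convention the caller assumed, which is the ambiguity described above.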

I mean that if someone wants to support IPv6 scope in URLs, I suggest first clarifying what urlsplit() expects. IMHO fixing this is outside the scope of fixing the https://bugs.python.org/issue36338 security vulnerability.

vstinner

I failed to find time to finish the PR, so I prefer to abandon it.

the-knights-who-say-ni added the CLA signed label Oct 14, 2019

bedevere-bot added the awaiting core review label Oct 14, 2019

vstinner mentioned this pull request Oct 14, 2019

bpo-36338: Reject hostname with [ at position > 0 #14896

Closed

corona10 approved these changes Oct 15, 2019

tirkarthi reviewed Oct 15, 2019

orsenthil approved these changes Oct 21, 2019

bedevere-bot added awaiting merge and removed awaiting core review labels Oct 21, 2019

miguendes mentioned this pull request Jul 10, 2021

gh-88037: Move port validation logic to parsing time #25774

Open

vstinner closed this Sep 21, 2021

vstinner deleted the urlparse_ipv6 branch September 21, 2021 21:58

sanebow mannequin mentioned this pull request Apr 10, 2022

urlparse of urllib returns wrong hostname #80519

Open

vstinner commented Oct 14, 2019 • edited by bedevere-bot

vstinner commented Oct 14, 2019

corona10 left a comment

vstinner commented Oct 15, 2019

vstinner commented Oct 15, 2019

vstinner commented Oct 18, 2019

orsenthil commented Oct 18, 2019

vstinner commented Oct 21, 2019

vstinner commented Oct 21, 2019

vstinner commented Oct 21, 2019

vstinner commented Oct 21, 2019

orsenthil left a comment

orsenthil commented Oct 21, 2019

orsenthil commented Oct 23, 2019 • edited

vstinner commented Oct 23, 2019

vstinner commented Oct 23, 2019

vstinner commented Sep 21, 2021

7 participants