bpo-46337: Urllib.parse scheme-specific behavior without reliance on URL scheme#30520
bpo-46337: Urllib.parse scheme-specific behavior without reliance on URL scheme#30520oldaccountdeadname wants to merge 23 commits into
Conversation
Some features in urllib are dependent on schemes, (i.e., preserving the netloc in url joining). Prior to this patch, this was governed by the uses_* lists (uses_relative, uses_netloc, uses_params) which hard code these attributes for certain schemes. Providing an enum interface and a 'constructor' that allows overrides makes this mechanism a bit more flexible for future modifications.
This allows the callers of urljoin and urlparse to add guaranteed scheme classes to the url regardless of the actual scheme, which may not be in the default uses_* lists of schemes. This call-time behavior is done through an optional parameter that preserves backwards compatibility. A test case is added for this, and requires the change present in test_urlparse.checkJoin.
627b732 to
53c6ccc
Compare
January 10, 2022 21:57
|
This PR is stale because it has been open for 30 days with no activity. |
Sorry, something went wrong.
MaxwellDupre
left a comment
There was a problem hiding this comment.
Could you add info to docs Please?
Otherwise how will users know how to use or that it exists?
Sorry, something went wrong.
|
Could you add info to docs Please?
Ah, definitely, I totally forgot to do that - I just pushed a draft of
some docs (e1082f8). Thanks for pointing that out!
|
Sorry, something went wrong.
This functionality was exposed in 53c6ccc.
e1082f8 to
f9b59dd
Compare
February 26, 2022 21:41
|
Ah, CI was failing due to my editor inserting tabs - should be fixed. The docs commit is now f9b59dd. |
Sorry, something went wrong.
It looks like CI expects this when building documentation.
|
@orsenthil - sorry for the extra email/ping, but would you be able to give this a review when you've got some spare time? It's been sitting stale for roughly a month now. thanks! |
Sorry, something went wrong.
|
@lincolnauster - sure, I will review. |
Sorry, something went wrong.
urljoin will not treat `..` as moving up one directory rather than moving up one file, thus causing the doctests to fail due to a missing trailing slash. Both changes are of the form: http://example.org/post/x -> http://example.org/post/x/ Additionally, the my-protocol example's expected output had the wrong scheme.
|
sigh - code was right, docs were wrong. Stupid question, but I couldn't find this in the devguide. How do I actually run the doctests on my local checkout? I was performing them manually in a shell but evidently I accidentally corrected them while typing them in. I'm running this in gmake doctest PYTHON=../python SPHINXOPTS="-j24 --keep-going" ...and I'm getting a ton of errors from code I haven't touched. (A bunch of stuff in the ast module, _tkinter wasn't found, etc.) Is this a misconfiguration on my part? How can I actually run doctests? Also, I pushed a fix for the current cause of failure. Would it be helpful to squash/rebase it into the original doc commit (f9b59dd) and force-push? Thanks so much! |
Sorry, something went wrong.
Minor style things: + _scheme_classes' docstring's summary was made explicit. + _scheme_classes was prepended with and followed by two newlines.
ethanfurman
left a comment
There was a problem hiding this comment.
I think SchemeFlag works better than SchemeClass. Either way, use an enum.Flag for it, and consider using __repr__ similar to the one in re.RegexFlag.
Sorry, something went wrong.
|
A Python core developer has requested some changes be made to your pull request before we can consider merging it. If you could please address their requests along with any other requests in other reviews from core developers that would be appreciated. Once you have made the requested changes, please leave a comment on this pull request containing the phrase |
Sorry, something went wrong.
|
I have made the requested changes; please review again. Again, thanks so much for the thorough review, and please tell me if there's anything missing! |
Sorry, something went wrong.
|
Thanks for making the requested changes! @ethanfurman: please review the changes made to this pull request. |
Sorry, something went wrong.
|
(copying from #90495) - we can continue discussion here as lot of context is preserved here. Hi @lincolnauster , I was -1 and was thinking much on introducing a flag with the enum in the parse module. This API signature is going to confuse people and will be huge blocker for further adoption and change (even if the default arguments are specified). I was thinking how best to mitigate that.
If design design required, we can bring it to a wider forum too. |
Sorry, something went wrong.
|
Hi @lincolnauster , I was -1 and was thinking much on introducing a
flag with the enum in the parse module.
urlparse(url, scheme='', allow_fragments=True, flags=SchemeFlag(0)):
This API signature is going to confuse people and will be huge blocker
for further adoption and change (even if the default arguments are
specified). I was thinking how best to mitigate that.
Tbh, I'm not too bothered by the additional complexity in the function
signature. As I see it, the complexity is already present in the
module's uses_* lists. If those lists align with your goals, you don't
need to care about them, but if they don't, you need to deal with that
implicit complexity. Documenting it and making it explicit feels like
the best path to take, imo.
1 Did we ever consider not going to flag as as parameter?
There were a few alternatives considered in this PR/issue and a few more
a few years ago. Correct me if I'm wrong, but I'm not sure anyone is
really in favor of the current scheme-based parsing. That said,
backwards compat is valuable, so it looks like we're trying to find some
way to augment the current parse-behavior selection with a solution that
doesn't involve magic strings.
Currently, client code can modify the lists that determine the behavior
for each scheme, but that's got two problems, as I understand:
1. It's a bag of global state. Who owns what data? What if there are
conflicting modifications? This is a pretty good way to cause messy
problems.
2. What if we don't know the schemes we're parsing in advance? We'd
need a *lot* of interactions with the messy global state, and thus
potentially cause a lot of problems.
One proposed solution was formalizing that global state into a global
registry, which would reduce the impact of the above problems, but not
fully solve them. It's certainly a simpler solution, but it also feels
harder to use in non-trivial cases. Some form of per-parse/join context
seems to be required to address the above issues and keep compat.
2 Any other default for the flag rather than an enum?
I'm not sure what you mean by that. Could you explain further?
If design design required, we can bring it to a wider forum too.
This seems like a good idea if necessary. Thanks for all the thoughts!
|
Sorry, something went wrong.
gpshead
left a comment
There was a problem hiding this comment.
I agree with Senthil about the API growing something unobvious. If we add a parameter, it should be keyword only and well named to indicate that it is unusual to provide it. I'd also set its default value to None rather than introducing the casual user to a new concept (SchemeFlag).
The name flags= is probably not sufficient. Something more wordy like behavior_overrides= would communicate better that these are not necessary in most cases as reasonable default behaviors exist. Similarly the name SchemeFlags doesn't necessarily communicate the right thing when seen in code as it technically has nothing to do with a scheme itself. Perhaps something like ParseBehaviors?
Sorry, something went wrong.
|
A Python core developer has requested some changes be made to your pull request before we can consider merging it. If you could please address their requests along with any other requests in other reviews from core developers that would be appreciated. Once you have made the requested changes, please leave a comment on this pull request containing the phrase |
Sorry, something went wrong.
|
Hi, sorry for getting back to this late, and thanks for all the feedback.
Agreed, I'll push these API clarifications shortly! |
Sorry, something went wrong.
|
Okay, I finally had some time to do a thorough review of the module and, as much as I love enums, I don't think this is the right solution to this problem:
I like the attempt to unify the three class lists, but I don't think this approach is the best choice for the Given that we already have an @lincolnauster Are you interested in converting this PR to one of the two above choices? |
Sorry, something went wrong.
ethanfurman
left a comment
There was a problem hiding this comment.
See previous comment.
Sorry, something went wrong.
|
Hi, very sorry for the very late reply - a number of things just stacked
up for the past few months.
Okay, I finally had some time to do a thorough review of the module
and, as much as I love enums, I don't think this is the right solution
to this problem:
* allow_fragments should be part of the enum, but that would just be
confusing at this point
Agreed in hindsight. There are too many flags for a flag to be
comprehensible.
* specifying NETLOC or RELATIVE to urlparse() does nothing
Yeah, the parse/join interface is heterogeneous. Definitely an oversight
in the proposed patches as-is (and an oversight in the scheme lists
being part of the public API).
I like the attempt to unify the three class lists, but I don't think
this approach is the best choice for the urlparse module as it exists.
Given that we already have an allow_fragments flag, I think the best
path forward is to add an allow_params flag, with a default of
False...
What about joining? I believe that when given a string, urljoin will
urlparse it. How do we determine the keyword arguments for urlparse in
that case? I assume we use additional parameters. What about netloc and
relative joining? I believe that gives us this rather unwieldy
signature:
```python
def urljoin(base, url, allow_fragments=True, allow_params=False, allow_netloc=False, allow_relative=False)
```
Maybe it could be written as the following:
```python
def urljoin(base, url, **kwargs)
```
...where kwargs are passed to urljoin, but I'm still smelling something
off here. I'd expect that stacking overrides on overrides on global
state is going to get us unpredictable and untestable behavior, no
matter how we write out the signature.
-- or add a separate universal parsing function, as has been suggested
earlier.
I think I agree on this - universal parsing and joining is probably a
good thing.
Would it be acceptable to build off the code currently in this PR? That
is, simply define some new universal parse and universal join lambdas as
proposed above[0], but without exposing the parse flags to the public
API? It's kind of messy, but at least it partially moves the mess from
the public API to the private implementation.
[0]: #30520 (comment)
Thanks so much for the detailed discussion, and, again, sorry for such a
late response :)
|
Sorry, something went wrong.
|
CI looks like it's running, it's red only because there's a requested
change.
I believe there was an issue with the previous push during the check of
generated files:
https://github.com/python/cpython/runs/5747980058
|
Sorry, something went wrong.
|
Hi, very sorry for the very late reply - a number of things just stacked
up for the past few months.
Okay, I finally had some time to do a thorough review of the module
and, as much as I love enums, I don't think this is the right solution
to this problem:
* allow_fragments should be part of the enum, but that would just be
confusing at this point
Agreed in hindsight. There are too many flags for a flag to be
comprehensible.
* specifying NETLOC or RELATIVE to urlparse() does nothing
Yeah, the parse/join interface is heterogeneous. Definitely an oversight
in the proposed patches as-is (and an oversight in the scheme lists
being part of the public API).
I like the attempt to unify the three class lists, but I don't think
this approach is the best choice for the urlparse module as it exists.
Given that we already have an allow_fragments flag, I think the best
path forward is to add an allow_params flag, with a default of
False...
What about joining? I believe that when given a string, urljoin will
urlparse it. How do we determine the keyword arguments for urlparse in
that case? I assume we use additional parameters. What about netloc and
relative joining? I believe that gives us this rather unwieldy
signature:
```python
def urljoin(base, url, allow_fragments=True, allow_params=False, allow_netloc=False, allow_relative=False)
```
Maybe it could be written as the following:
```python
def urljoin(base, url, **kwargs)
```
...where kwargs are passed to urljoin, but I'm still smelling something
off here. I'd expect that stacking overrides on overrides on global
state is going to get us unpredictable and untestable behavior, no
matter how we write out the signature.
-- or add a separate universal parsing function, as has been suggested
earlier.
I think I agree on this - universal parsing and joining is probably a
good thing.
Would it be acceptable to build off the code currently in this PR? That
is, simply define some new universal parse and universal join lambdas as
proposed above[0], but without exposing the parse flags to the public
API? It's kind of messy, but at least it partially moves the mess from
the public API to the private implementation.
[0]: #30520 (comment)
Thanks so much for the detailed discussion, and, again, sorry for such a
late response :)
|
Sorry, something went wrong.
|
I think this change is stale now. We really do not want to introduce additional complexity to the parsing here, at least many changes in one go. I am closing this request, and we can reopen it to add individual changes (minor refactors in a planned manner) if we want a generic solution. Current diff can introduce complexity that many reviewers have observed above. Thank you for your contribution. |
Sorry, something went wrong.
See bpo-46337. Basically, this allows code like this:
https://bugs.python.org/issue46337