◐ Shell
clean mode source ↗

bpo-40334: Refactor peg_generator to receive a Tokens file when building c code by pablogsal · Pull Request #19745 · python/cpython

@pablogsal

https://bugs.python.org/issue40334

This PR does the following:

  • Fix a bunch of (very minor) mypy stuff that was missing.
  • Separate the C parser and the Python parser in pegen main (because both receive different arguments). Thread down all these changes to the generator build module.
  • Add a new option to the C parser command line to receive the Tokens file.
  • Thread down the Tokens file and add code to parse it and calculate the required token information.
  • Use the new tokens in the c_generator (and simplify some code that was hardcoding some token names).
  • Update the build files (Makefile and the Windows one) to use the new option.
  • Run black over the source.

@pablogsal

This is how the command line looks now with the sub-parsers:

Main CL

~/github/python/master/Tools/peg_generator [bpo-40334](https://bugs.python.org/issue40334)-use-tokens
❯ python -m pegen
usage: pegen [-h] [-q] [-v] [--skip-actions] {c,python} ...

Experimental PEG-like parser generator

positional arguments:
  {c,python}      target language for the generated code
    c             Generate C code for inclusion into CPython
    python        Generate Python code

optional arguments:
  -h, --help      show this help message and exit
  -q, --quiet     Don't print the parsed grammar
  -v, --verbose   Print timing stats; repeat for more debug output

C subparser

~/github/python/master/Tools/peg_generator [bpo-40334](https://bugs.python.org/issue40334)-use-tokens
❯ python -m pegen c -h
usage: pegen c [-h] [--compile-extension] [-o OUT] [--optimized] [--skip-actions] grammar_filename tokens_filename

positional arguments:
  grammar_filename      Grammar description
  tokens_filename       Tokens description

optional arguments:
  -h, --help            show this help message and exit
  -o OUT, --output OUT  Where to write the generated parser
  --compile-extension   Compile generated C code into an extension module
  --optimized           Compile the extension in optimized mode
  --skip-actions        Suppress code emission for rule actions

Python subparser

~/github/python/master/Tools/peg_generator [bpo-40334](https://bugs.python.org/issue40334)-use-tokens
❯ python -m pegen python -h
usage: pegen python [-h] [-o OUT] grammar_filename

positional arguments:
  grammar_filename      Grammar description

optional arguments:
  -h, --help            show this help message and exit
  -o OUT, --output OUT  Where to write the generated parser
  --skip-actions  Suppress code emission for rule actions

lysnikolaou

Comment on lines +64 to 66

if name in self.non_exact_tokens:
name = name.lower()
return f"{name}_var", f"_PyPegen_{name}_token(p)"

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm wondering if we should special-case NAME here and call _PyPegen_name_token for it and then call _PyPegen_expect_token for all the others (which is what the "named" functions already do anyway). This way we could get rid of these "named" expect functions (e.g. _PyPegen_indent_token).

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmmm, I think I still like to have the abstraction in case we need to intercede something like with the NAME token in the future (this also allows us to not have "hardcoded" names in the generator) but if you think is better to re-add that in the future and eliminate the wrappers I can totally do it as I don't feel strongly about it.

What do you think?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I still like to have the abstraction in case we need to intercede something like with the NAME token in the future

It feels, that that's only the case with STRINGs. I don't really expect to do anything more complicated that just returning the token with all the whitespace ones (INDENT, DEDENT etc.) and I'd re-add those functions for async and await, if it were ever needed.

With that said, I really don't feel strongly about either one, so it's your call! 😃

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I am going to merge this and we can revisit this in further refactors of this code. I have some improvements in mind to avoid checking for sub-strings (like var.startswith("name") when distinguishing the types and we can explore doing this in that PR. 😉

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I experimented with your suggestion and the main blocker is that _PyPegen_lookahead is called with these functions and it imposes that they may take the parser as the only argument and there is not a simple way to make _PyPegen_lookahead allow to forward the arguments :(

Edit: I am continuing experimenting with this idea....

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't think of this problem. Thanks a lot for doing the investigation!

Co-Authored-By: Lysandros Nikolaou <lisandrosnik@gmail.com>
Co-Authored-By: Lysandros Nikolaou <lisandrosnik@gmail.com>

lysnikolaou

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Thanks!

@bedevere-bot

This comment has been minimized.

@pablogsal

Hi! The buildbot AMD64 Fedora Stable 3.x has failed when building commit 5b9f498.

Unrelated failure:

1 test failed:
test_concurrent_futures