bpo-33416: Add end positions to Python AST #11605

ilevkivskyi

Most of this PR tediously passes end_lineno and end_col_offset everywhere. Here are the non-trivial points:

It is not possible to reconstruct end positions in the AST "on the fly": some information is lost after an AST node is constructed, so every AST node needs two more attributes, end_lineno and end_col_offset.
I add end position information to both the CST and the AST. Although it may be technically possible to avoid adding end positions to the CST, the code would become more cumbersome and less efficient.
Since the end position of a non-leaf CST node is not known until the next token is added, this requires a bit of extra care (see _PyNode_FinalizeEndPos). Unless I made some mistake, the algorithm should be linear.
For statements, I "trim" the end position of suites so that it does not include the terminal newlines and dedent (this seems to be what people would expect). For example, in
```
class C:
    pass

pass
```
the end line and end column for the class definition are (2, 8).
For end_col_offset I use the common Python indexing convention: for example, for pass the end_col_offset is 4 (not 3), so that slicing the line with [0:4] gives the source code that corresponds to the node.
I added a helper function, ast.get_source_segment(), to get the source text segment corresponding to a given AST node. It is also useful for testing.
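
The points above can be checked with a small sketch. It assumes a Python version that includes this change (3.8 or later) and reuses the class C example from the description; the variable names are only illustrative.

```python
import ast

# The example from the description: a class followed by a blank line and `pass`.
source = "class C:\n    pass\n\npass\n"
tree = ast.parse(source)
class_node, pass_node = tree.body

# Suites are trimmed: the class ends where its last statement ends,
# not at the trailing newlines or the dedent.
class_end = (class_node.end_lineno, class_node.end_col_offset)

# end_col_offset is exclusive, so slicing the line recovers the node's text.
pass_line = source.splitlines()[pass_node.lineno - 1]
pass_text = pass_line[pass_node.col_offset:pass_node.end_col_offset]

# The helper does the same job for nodes that span several lines.
class_text = ast.get_source_segment(source, class_node)

print(class_end, repr(pass_text), repr(class_text))
```

Here the class definition ends at line 2, column 8, and the helper returns the two lines of the class without the trailing blank line.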

An (inevitable) downside of this PR is that the AST now takes almost 25% more memory. I think, however, that this is probably justified by the benefits.

https://bugs.python.org/issue33416

Conflicts: Parser/parsetok.c

gvanrossum

Never mind, that was SyntaxError. The AST uses 0-based column offsets.
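
As a rough illustration of the convention mentioned here (a sketch, assuming Python 3.8 or later): line numbers in the AST are 1-based while column offsets are 0-based, and the end column is exclusive. The source string below is made up for the example.

```python
import ast

source = "x = 1\ny = foo(x)\n"
call = ast.parse(source).body[1].value  # the `foo(x)` call on the second line

start = (call.lineno, call.col_offset)        # 1-based line, 0-based column
end = (call.end_lineno, call.end_col_offset)  # end column is exclusive

# Offsets are UTF-8 byte offsets; slicing the str directly is only safe
# here because the line is pure ASCII.
line = source.splitlines()[call.lineno - 1]
segment = line[call.col_offset:call.end_col_offset]

print(start, end, repr(segment))
```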

asottile

I'm very excited for this change

gvanrossum

Phew! I don't have much to add, this all looks great. I don't think that these days 20% extra space to represent the CST+AST for a single module is a big deal (though it would be nice to see what the difference is in absolute number of bytes for Lib/tkinter/__init__.py, which AFAIK is still the largest stdlib module).

serhiy-storchaka

This is only part of my comments.

ilevkivskyi

@asottile @gvanrossum @serhiy-storchaka Thank you for reviewing! I think I addressed all your comments.

ilevkivskyi

@gvanrossum I just tried to compare the sizes of the Lib/tkinter/__init__.py AST before and after this PR. Here are the numbers: before 5.8 Mbytes, after 7.2 Mbytes (24% increase).

gvanrossum

> @gvanrossum I just tried to compare the sizes of Lib/tkinter/__init__.py AST before and after this PR. Here are the numbers: before 5.8 Mbytes, after 7.2 Mbytes (24% increase).

How did you measure this? IIUC there's also a size increase for the CST, and during translation from CST to AST both are in memory. Is it possible to measure the peak memory usage during this translation?

ilevkivskyi

I just looked at the difference in allocated memory reported by sys._debugmallocstats(). I also tried another way, using tracemalloc. It reports that the line in question:

compile(source, filename, mode, PyCF_ONLY_AST)

allocates 3.7 Mbytes before and 4.6 Mbytes after this PR (still the same 24% relative increase). Also this number (24%) is quite reasonable taking into account that all nodes in CST have exactly the same size: 40 bytes before, 48 bytes after (at least on my 64-bit Linux).
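
A measurement along these lines could look like the sketch below. It is not the script used for the numbers above: the helper name measure_ast_allocation and the generated source are hypothetical, and tracemalloc only sees memory obtained through Python's allocators. Current CPython releases also use a different parser from the one discussed in this thread, so absolute numbers will differ.

```python
import ast
import tracemalloc


def measure_ast_allocation(source, filename="<example>"):
    """Return (retained, peak) bytes traced while building the AST.

    Hypothetical helper for illustration only.
    """
    tracemalloc.start()
    try:
        before, _ = tracemalloc.get_traced_memory()
        # Keep a reference to the tree so it is counted as retained memory.
        tree = compile(source, filename, "exec", ast.PyCF_ONLY_AST)
        current, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return current - before, peak - before


# A synthetic module; substitute the text of a real file to compare builds.
source = "".join(f"x{i} = {i} + 1\n" for i in range(1000))
retained, peak = measure_ast_allocation(source)
print(f"retained: {retained} bytes, peak: {peak} bytes")
```

The peak figure is the closer match to the question about memory use while both trees are alive during translation; the retained figure only covers what is still referenced after compile() returns.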

gvanrossum

I do think that's a very reasonable increase (we're not talking MicroPython here :-).

> Also this number (24%) is quite reasonable taking into account that all nodes in CST have exactly the same size: 40 bytes before, 48 bytes after (at least on my 64-bit Linux).

That's sizeof what exactly? While the node types indeed are big unions, there are several different ones: struct _stmt, struct _expr and some minor ones (all defined in Include/Python-ast.h).

(I tried this myself on my Mac, and it looks like sizeof(struct _stmt) is 72 bytes, and sizeof(struct _expr) is 48 bytes. An int is 4 bytes.)

ilevkivskyi

> That's sizeof what exactly?

This is the size of _node. There is only one "main" struct in the CST: struct _node from node.h.

gvanrossum

I like this! Let's land this. I combed through the code (both the tedious parts and the interesting parts) and it all looks good to me. I am not worried about the 25% increase in CST and AST tree sizes.

ilevkivskyi

@gvanrossum
Great, thanks! I will merge this now so that we can move on with the typed_ast PR, but there is one question that Serhiy raised that I would like to double-check: does the fact that PyNode_AddChild and PyParser_AddToken are not documented on docs.python.org mean they are not part of the C API? Also, I just checked that the files where they are declared (node.h and parser.h) are not included in Python.h.

gvanrossum

> does the fact that PyNode_AddChild and PyParser_AddToken are not documented on docs.python.org mean they are not part of C-API?

I'm not sure. Their names don't have _ prefixes, so I think they live in some nebulous area. It would be good to bring this up in another forum where more core devs can think about the issue (some of whom have probably thought about it more).

In the worst-case scenario, if it's deemed an unacceptable backwards incompatibility, we can always add new functions that take end_lineno and end_col_offset arguments, and make the old ones call the new ones with some kind of computed default values. But I would only resort to this if we get actual complaints during the alpha/beta releases.

ilevkivskyi

OK, I will post to Python-Dev about this.

ilevkivskyi added 29 commits January 6, 2019 21:16

Some initial infra

ba4ba82

Regenerate nodes

3e343e3

Mindless implementation: known bugs, notably in fstrings

1684c17

Some test fixes

514d4ea

More test fixes

3ab2516

Add a TODO

1d3e352

Switch to better algorithm for finding end position

dbf9cc9

Merge remote-tracking branch 'upstream/master' into add-end-line-col …

a44207b

Conflicts: Parser/parsetok.c

Be consistent for line_num

5af33da

Minor fixes; start adding tests

ce7f5ce

Update two failing tests

2171eb9

Fix multiline strings

10cf4bd

Fix end position for if statement

ed05305

Adjust end positions in while and for

58fbfa6

Add also with

f2589ff

Fix try end position (concludes fixing suites)

7d5ca5e

Some formatting plus minor fixes

aa62e3c

More formatting; fix import from

96a0ec0

Fix f-strings

c169025

Add few more tests

553a772

Add final bunch of tests

5cc01e9

Update docstrings

dce260e

Update docs

9ba6604

Add get_source_segment() helper

69a6280

Consistent formatting in docstring; use new helper in tests

4af426f

Fix bug

e5a12c3

Split few long lines

0275a93

Add tests and docs for the helper

f20635b

Fix missing comma

c9da8f5

ilevkivskyi requested a review from gvanrossum January 18, 2019 09:46

asottile reviewed Jan 18, 2019

gvanrossum reviewed Jan 18, 2019

serhiy-storchaka reviewed Jan 19, 2019

ilevkivskyi and others added 5 commits January 19, 2019 17:26

Fix get_source_segment

48936b9

More CR

ac5b5cb

rst fixes

4726f17

Remove unused vars

eeea87d

📜🤖 Added by blurb_it.

027a4ca

Remove old NEWS file

ff361f2

ilevkivskyi commented Jan 19, 2019

gvanrossum mentioned this pull request Jan 22, 2019

bpo-35766: Merge typed_ast back into CPython #11645

Merged

gvanrossum approved these changes Jan 22, 2019

ilevkivskyi merged commit 9932a22 into python:master Jan 22, 2019

bedevere-bot removed the awaiting merge label Jan 22, 2019

ilevkivskyi deleted the add-end-line-col branch January 22, 2019 11:18

iamdefinitelyahuman mentioned this pull request Aug 1, 2019

Add end offsets to AST and source map vyperlang/vyper#1557

Closed

theodoretliu mentioned this pull request Aug 16, 2019

Pylint not underlining full expressions in VS Code pylint-dev/pylint#3061

Closed

iamdefinitelyahuman mentioned this pull request Aug 24, 2019

AST end offsets and compressed source map vyperlang/vyper#1580

Merged

jacobtylerwalls mentioned this pull request Jun 27, 2023

AST nodes for PEP 695 type param syntax do not require end_lineno nor end_col_offset #106145

Closed
