
Issue 46555: Unicode-mangled names refer inconsistently to constants

Created on 2022-01-27 21:57 by Kodiologist, last changed 2022-04-11 14:59 by admin.

Messages (8)
msg411930 - (view) Author: (Kodiologist) * Date: 2022-01-27 21:57
I'm not sure if this is a bug, but it certainly surprised me. Most reserved words, when Unicode-mangled, as in "𝕕𝕖𝕗", act like ordinary identifiers (see e.g. bpo-46520). `True`, `False`, and `None` are weird in that Unicode-mangled versions of them refer to those same constants initially, but can take on their own identity as variables if assigned to:

    Python 3.9.7 (default, Sep 10 2021, 14:59:43) 
    [GCC 11.2.0] on linux
    Type "help", "copyright", "credits" or "license" for more information.
    >>> 𝕋𝕣𝕦𝕖
    True
    >>> True = 0
      File "<stdin>", line 1
        True = 0
        ^
    SyntaxError: cannot assign to True
    >>> 𝕋𝕣𝕦𝕖 = 0
    >>> True
    True
    >>> 𝕋𝕣𝕦𝕖
    0

I think that `𝕋𝕣𝕦𝕖 = 1` should probably be forbidden. The fact that `𝕋𝕣𝕦𝕖` doesn't always mean the same thing as `True` seems to break the rule in PEP 3131 that "comparison of identifiers is based on NFKC".
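
For readers who want to check the NFKC claim directly, here is a minimal sketch using the standard `unicodedata` module. The variable names are illustrative only and are not part of the original report.

```python
import unicodedata

# The mangled spelling uses MATHEMATICAL DOUBLE-STRUCK letters (all non-ASCII).
mangled = "𝕋𝕣𝕦𝕖"
normalized = unicodedata.normalize("NFKC", mangled)

print(mangled == "True")     # False: the raw strings differ
print(normalized == "True")  # True: NFKC folds the letters to ASCII
```

So the two spellings compare equal as identifiers under NFKC, which is why the different behavior in the session above is surprising.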
msg412070 - (view) Author: Carl Friedrich Bolz-Tereick (Carl.Friedrich.Bolz) * Date: 2022-01-29 11:42
hah, this is "great":

>>> 𝕋𝕣𝕦𝕖 = 1
>>> globals()
{'__name__': '__main__', '__doc__': None, '__package__': None, '__loader__': <class '_frozen_importlib.BuiltinImporter'>, '__spec__': None, '__annotations__': {}, '__builtins__': <module 'builtins' (built-in)>, 'True': 1}

The problem is that the lexer assumes that anything that is not ASCII cannot be a keyword and lexes 𝕋𝕣𝕦𝕖 as an identifier.
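
One way to observe this from Python itself is to compare the AST nodes for the two spellings. This is a sketch that assumes CPython behaves as in the 3.9 session reported above; other implementations or versions might differ.

```python
import ast

# Parse each spelling as a bare expression and inspect the resulting node.
keyword_node = ast.parse("True", mode="eval").body
mangled_node = ast.parse("𝕋𝕣𝕦𝕖", mode="eval").body

print(type(keyword_node).__name__)  # Constant: the keyword became a literal
print(type(mangled_node).__name__)  # Name: the mangled form is a variable lookup
print(mangled_node.id)              # True: the identifier was NFKC-normalized
```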
msg412071 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2022-01-29 11:53
True is a keyword which is compiled to an expression whose value is True, 𝕋𝕣𝕦𝕖 is an identifier which refers to the builtin variable "True" which has a value True by default. You can change the value of a builtin variable, but the value of expression True is always True.

I do not see a problem here. Don't use 𝕋𝕣𝕦𝕖 if your intention is not using a variable.
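
A small sketch of the distinction described here, assuming CPython semantics as in the session at the top of the issue. The namespace dictionaries `fresh` and `shadowed` are illustrative names.

```python
import builtins

# The builtins module really does carry an entry named "True".
print(vars(builtins)["True"])  # True

# With no assignment, the mangled identifier falls through to that builtin.
fresh = {}
exec("value = 𝕋𝕣𝕦𝕖", fresh)
print(fresh["value"])  # True

# After assignment, a global named "True" shadows the builtin for the
# identifier, while the keyword True is unaffected.
shadowed = {}
exec("𝕋𝕣𝕦𝕖 = 0\nseen = (True, 𝕋𝕣𝕦𝕖)", shadowed)
print(shadowed["True"])  # 0
print(shadowed["seen"])  # (True, 0)
```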
msg412150 - (view) Author: (Kodiologist) * Date: 2022-01-30 14:47
> the builtin variable "True"

Is the existence of this entity, as separate from the constant `True`, documented anywhere? constants.rst doesn't seem to acknowledge it. Indeed, is its existence a feature, or is it a CPython quirk?
msg412167 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2022-01-30 18:15
https://docs.python.org/3/library/constants.html#built-in-constants
msg412169 - (view) Author: Carl Friedrich Bolz-Tereick (Carl.Friedrich.Bolz) * Date: 2022-01-30 18:58
Ok, I can definitely agree with Serhiy's pov: "True" is a keyword that always evaluates to the object that you get when you call bool(1). There is usually no name "True" and directly assigning to it is forbidden. But there are various other ways to assign a name "True". One is e.g. globals()["True"] = 5, another one (discussed in this issue) is using identifiers that NFKC-normalize to the string "True".
msg412170 - (view) Author: Eryk Sun (eryksun) * (Python triager) Date: 2022-01-30 19:09
Why was it decided to not raise a syntax error when the NFKC normalization of a non-ASCII token matches a keyword? I don't see a use for cases such as `𝕚𝕗 = 1` and `𝕚𝕗 + 1`. It seems the cost in terms of confusion far outweighs any potential benefit.
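
The cases mentioned can be reproduced with a short sketch like the following, again assuming CPython behaves as it does in the 3.9 session above. The names `source` and `ns` are illustrative.

```python
import keyword
import unicodedata

# Both statements use a mangled spelling of the keyword "if" as a variable.
source = "𝕚𝕗 = 1\nresult = 𝕚𝕗 + 1\n"
ns = {}
exec(source, ns)

name = unicodedata.normalize("NFKC", "𝕚𝕗")
print(name, keyword.iskeyword(name))  # if True
print(ns["if"], ns["result"])         # 1 2
```

The namespace ends up with a variable literally named `if`, which can only be reached through a mangled spelling or a dictionary lookup.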
msg412226 - (view) Author: James Gerity (SnoopJeDi) Date: 2022-02-01 00:41
> Why was it decided to not raise a syntax error...

I'm not sure if such a decision was even ever made, the error happens before normalization is applied. I.e. the parser is doing two things here: (1) validating the syntax against the grammar and (2) building the AST. Normalization happens after (1), and `𝕋𝕣𝕦𝕖 = 0` is valid syntax because the grammar is NOT defined in terms of normalized identifiers, it's describing the valid (but confusing!) assignment that Carl described.
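
The difference shows up at compile time, before any code runs. The sketch below assumes CPython as in the session at the top of the issue; `compiles` is a hypothetical helper written for this illustration.

```python
def compiles(source):
    """Return True if the source compiles, False on SyntaxError."""
    try:
        compile(source, "<example>", "exec")
    except SyntaxError:
        return False
    return True

print(compiles("True = 0"))  # False: the keyword cannot be an assignment target
print(compiles("𝕋𝕣𝕦𝕖 = 0"))  # True: the mangled form passes the grammar check

# The compiled expression records which names it will look up at runtime.
print(compile("True", "<example>", "eval").co_names)  # ()
print(compile("𝕋𝕣𝕦𝕖", "<example>", "eval").co_names)  # ('True',)
```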

I agree that this doesn't seem like a bug, but it IS my new favorite quirk of identifier normalization.
History
Date                 User                 Action  Args
2022-04-11 14:59:55  admin                set     github: 90713
2022-02-01 00:41:18  SnoopJeDi            set     messages: + msg412226
2022-01-30 19:09:28  eryksun              set     nosy: + eryksun; messages: + msg412170
2022-01-30 18:58:53  Carl.Friedrich.Bolz  set     messages: + msg412169
2022-01-30 18:15:52  serhiy.storchaka     set     messages: + msg412167
2022-01-30 14:47:35  Kodiologist          set     messages: + msg412150
2022-01-29 17:56:53  jack1142             set     nosy: + jack1142
2022-01-29 11:53:33  serhiy.storchaka     set     nosy: + serhiy.storchaka; messages: + msg412071
2022-01-29 11:42:21  Carl.Friedrich.Bolz  set     nosy: + Carl.Friedrich.Bolz; messages: + msg412070
2022-01-29 03:39:07  SnoopJeDi            set     nosy: + SnoopJeDi
2022-01-27 21:57:22  Kodiologist          create