Issue 39337: codecs.lookup() ignores non-ASCII characters, whereas encodings.normalize_encoding() copies them
Created on 2020-01-14 21:54 by vstinner, last changed 2022-04-11 14:59 by admin. This issue is now closed.
| Pull Requests | |||
|---|---|---|---|
| URL | Status | Linked | Edit |
| PR 18845 | closed | shihai1991, 2020-03-08 09:50 | |
| PR 18987 | closed | shihai1991, 2020-03-14 15:03 | |
| PR 19069 | merged | shihai1991, 2020-03-19 10:40 | |
| PR 22219 | merged | shihai1991, 2020-09-12 18:10 | |
| PR 22360 | merged | shihai1991, 2020-09-22 14:51 | |
| PR 17997 | vstinner, 2020-10-01 16:23 | ||
| Messages (11) | |||
|---|---|---|---|
| msg360004 - (view) | Author: STINNER Victor (vstinner) * ![]() |
Date: 2020-01-14 21:54 | |
bpo-37751 changed codecs.lookup() in a subtle way: non-ASCII characters are now ignored, whereas they were copied unmodified previously. I would prefer that codecs.lookup() and encodings.normalize_encoding() behave the same. Either always ignore or always copy. Moreover, it seems like there is no test on how the encoding names are normalized in codecs.register(). I recall that using codecs.register() in an unit test causes troubles since there is no API to unregister a search function. Maybe we should just add a private function for test in _testcapi. Serhiy Storchaka wrote an example on my PR: https://github.com/python/cpython/pull/17997/files > There are other differences. For example, normalize_encoding("КОИ-8") returns "кои_8", but codecs.lookup normalizes it to "8". > The comment in the sources is also not correct. |
|||
| msg363645 - (view) | Author: Hai Shi (shihai1991) * ![]() |
Date: 2020-03-08 09:55 | |
> I would prefer that codecs.lookup() and encodings.normalize_encoding() behave the same. Either always ignore or always copy. How about calling `encodings.normalize_encoding() in codecs.normalizestring()` to keep same behavior?(I create PR18845) > Maybe we should just add a private function for test in _testcapi I can try to add some test cases in next weekend ;) |
|||
| msg364180 - (view) | Author: Hai Shi (shihai1991) * ![]() |
Date: 2020-03-14 15:02 | |
> How about calling `encodings.normalize_encoding() in codecs.normalizestring()` to keep same behavior?(I create PR18845) I have try this idea, but it make the testcase of test_io.py failed because some object will call `codecs.Lookup()` in `__del__()`.-->extension module will be cleaned before calling `__del__().` > I would prefer that codecs.lookup() and encodings.normalize_encoding() behave the same. Either always ignore or always copy. I try to add a `_Py_normalize_unicode_encoding()` in unicodeobject.c to support non-ASCII encoding names' normalization(PR18987), but this PR caused many testcases failed. For example: In master: python3.9 -c "print('a\xac\u1234\u20ac\u8000\U0010ffff'.encode('iso-8859-15', 'namereplace'))" result: b'a\xac\\N{ETHIOPIC SYLLABLE SEE}\xa4\\N{CJK UNIFIED IDEOGRAPH-8000}\\U0010ffff' after PR18987: ./python -c "print('a\xac\u1234\u20ac\u8000\U0010ffff'.encode('iso-8859-15', 'namereplace'))" result: b'a\xac\\N{ETHIOPIC SYLLABLE SEE}\\N{EURO SIGN}\\N{CJK UNIFIED IDEOGRAPH-8000}\\U0010ffff' |
|||
| msg364449 - (view) | Author: Hai Shi (shihai1991) * ![]() |
Date: 2020-03-17 16:57 | |
> Maybe we should just add a private function for test in _testcapi. Oh, there have a problem with this idea: struct _is is defined in internal/pycore_pystate.h. _testcapi must test the public Python C API, not CPython internal C API |
|||
| msg364452 - (view) | Author: STINNER Victor (vstinner) * ![]() |
Date: 2020-03-17 17:03 | |
> _testcapi must test the public Python C API, not CPython internal C API Use _testinternalcapi in this case. |
|||
| msg364456 - (view) | Author: Hai Shi (shihai1991) * ![]() |
Date: 2020-03-17 17:16 | |
> Use _testinternalcapi in this case. oh, forgive me. I don't atttention this extension module before :( |
|||
| msg377772 - (view) | Author: STINNER Victor (vstinner) * ![]() |
Date: 2020-10-01 16:22 | |
See also bpo-37751 and my PR 17997. |
|||
| msg378281 - (view) | Author: STINNER Victor (vstinner) * ![]() |
Date: 2020-10-08 19:21 | |
New changeset 3f342376ab0da3b4c8a38a27edfafba8e8cdf52d by Hai Shi in branch 'master': bpo-39337: Add a test case for normalizing of codec names (GH-19069) https://github.com/python/cpython/commit/3f342376ab0da3b4c8a38a27edfafba8e8cdf52d |
|||
| msg378626 - (view) | Author: STINNER Victor (vstinner) * ![]() |
Date: 2020-10-14 15:43 | |
New changeset c5b049b91ca50c615f9a5425055c2b79a82ac547 by Hai Shi in branch 'master': bpo-39337: encodings.normalize_encoding() now ignores non-ASCII characters (GH-22219) https://github.com/python/cpython/commit/c5b049b91ca50c615f9a5425055c2b79a82ac547 |
|||
| msg378628 - (view) | Author: STINNER Victor (vstinner) * ![]() |
Date: 2020-10-14 15:46 | |
Thanks Hai Shi for the change and for new codecs and encodings tests. |
|||
| msg378630 - (view) | Author: Hai Shi (shihai1991) * ![]() |
Date: 2020-10-14 15:55 | |
Thanks for everyone's continus review :) |
|||
| History | |||
|---|---|---|---|
| Date | User | Action | Args |
| 2022-04-11 14:59:25 | admin | set | github: 83518 |
| 2020-10-14 15:55:50 | shihai1991 | set | messages: + msg378630 |
| 2020-10-14 15:46:55 | vstinner | set | status: open -> closed versions: + Python 3.10, - Python 3.9 messages: + msg378628 resolution: fixed |
| 2020-10-14 15:43:38 | vstinner | set | messages: + msg378626 |
| 2020-10-08 19:21:16 | vstinner | set | messages: + msg378281 |
| 2020-10-01 16:23:20 | vstinner | set | pull_requests: + pull_request21499 |
| 2020-10-01 16:22:44 | vstinner | set | messages: + msg377772 |
| 2020-09-22 14:51:27 | shihai1991 | set | pull_requests: + pull_request21401 |
| 2020-09-12 18:10:22 | shihai1991 | set | pull_requests: + pull_request21274 |
| 2020-03-19 10:40:43 | shihai1991 | set | pull_requests: + pull_request18424 |
| 2020-03-17 17:16:32 | shihai1991 | set | messages: + msg364456 |
| 2020-03-17 17:03:30 | vstinner | set | messages: + msg364452 |
| 2020-03-17 16:57:57 | shihai1991 | set | messages: + msg364449 |
| 2020-03-14 15:03:52 | shihai1991 | set | pull_requests: + pull_request18344 |
| 2020-03-14 15:02:55 | shihai1991 | set | messages: + msg364180 |
| 2020-03-08 09:55:01 | shihai1991 | set | messages: + msg363645 |
| 2020-03-08 09:50:41 | shihai1991 | set | keywords:
+ patch nosy: + shihai1991 pull_requests:
+ pull_request18202 |
| 2020-01-14 21:54:42 | vstinner | create | |

