bpo-29240: PEP 540: Add a new UTF-8 mode#855
Conversation
|
@Haypo, thanks for your PR! By analyzing the history of the files in this pull request, we identified @zooba, @benjaminp and @serhiy-storchaka to be potential reviewers. |
Sorry, something went wrong.
zooba
left a comment
There was a problem hiding this comment.
Would like to see some improved documentation before this is merged, but I think the rest of the change looks okay.
Sorry, something went wrong.
|
FYI I didn't look again at my code. I only recreated this pull request to not loose the code, since my old PR was on the old CPython repository. I still have to update the PEP and post it to python-dev. Questions like "wcsicmp? Any need to be case-sensitive here?" should be asked on the PEP directly, not on the implementation. You should wait until the PEP is accepted before reviewing the implementation. The PEP may still change. |
Sorry, something went wrong.
|
@Haypo Ah fair enough :) I haven't been paying attention to all the email recently, so I figured it was accepted already. |
Sorry, something went wrong.
|
The implementation is not complete: the command line arguments and environment variables are not re-decoded from UTF-8 once Py_Main() detects that the user requested the UTF-8 mode (-X utf8 or PYTHONUTF8). But thanks to my refactoring work on Modules/main.c ( bugs.python.org/issue32030 ), fixing this issue should now be much simpler. |
Sorry, something went wrong.
* Add -X utf8 command line option, PYTHONUTF8 environment variable and a new sys.flags.utf8_mode flag. * If the LC_CTYPE locale is "C" at startup: enable automatically the UTF-8 mode. * Add _winapi.GetACP(). encodings._alias_mbcs() now calls _winapi.GetACP() to get the ANSI code page * locale.getpreferredencoding() now returns 'UTF-8' in the UTF-8 mode. As a side effect, open() now uses the UTF-8 encoding by default in this mode. * Py_DecodeLocale() and Py_EncodeLocale() now use the UTF-8 encoding in the UTF-8 mode. * Update subprocess._args_from_interpreter_flags() to handle -X utf8 * Skip some tests relying on the current locale if the UTF-8 mode is enabled. * Add test_utf8mode.py. * _Py_DecodeUTF8_surrogateescape() gets a new optional parameter to return also the length (number of wide characters). * pymain_get_global_config() and pymain_set_global_config() now always copy flag values, rather than only copying if the new value is greater than the old value.
|
I updated the PR for the 4th version of the PEP 540: getpreferredencoding() now returns UTF-8 in the UTF-8 Mode. |
Sorry, something went wrong.
Fix conflict in Modules/main.c: pymain_set_global_config() and _Py_CheckHashBasedPycsMode.
Mock _winapi.GetACP(), not _bootlocale.getpreferredencoding().
|
@methane: You approved my PEP. This is implementation, all tests pass on Linux and Windows. While the implementation is not perfect, I propose to merge it right now to be get more time before the Python 3.7 final to test it. Remaining issue: when the UTF-8 mode is enabled manually (-X utf8), the command line arguments and environment variables are decoded from the locale encoding instead of being decoded from UTF-8. The fix should be easy, but I didn't have time yet to implement it. Another less important issue: the code to check if the LC_CTYPE locale is the POSIX locale is not currently shared between locale coercion (PEP 538) and UTF-8 Mode (PEP 540). |
Sorry, something went wrong.
|
I feel adding more code to And when |
Sorry, something went wrong.
It is.
main() and Py_Main() are very complex. With the PEP 432, Nick Coghlan, Eric Snow and me are working on making this code better. See for example https://bugs.python.org/issue32030 Currently, Py_Main() (Modules/main.c) and _PyPathConfig_Calculate() (Modules/getpath.c and PC/getpathp.c) are fully implemented with wchar_t*. Parsing the command line options using char* on Unix but wchar_t* would make the code more complex. I expect that many lines of code would have to be duplicate, one version for char*, one version for wchar_t*. For all these reasons, I propose to merge this uncomplete PR and write a different PR for the most complex part, re-encode wchar_t* command line arguments, implement Py_UnixMain() or another even better option? |
Sorry, something went wrong.
|
Oops, I forgot to push my main.c change. It's now done. |
Sorry, something went wrong.
mode
using -X utf8 or -X utf8=strict command line options.
enabled.
return also the length (number of wide characters).
mode
https://bugs.python.org/issue29240