Issue 8242: Improve support of PEP 383 (surrogates) in Python3: meta-issue
Created on 2010-03-27 01:12 by vstinner, last changed 2022-04-11 14:56 by admin. This issue is now closed.
| Files | | |
|---|---|---|
| File name | Uploaded | Description |
| surrogates-7.patch | vstinner, 2010-04-20 00:25 | |
| Messages (13) | |||
|---|---|---|---|
| msg101815 | Author: STINNER Victor (vstinner) |
Date: 2010-03-27 01:12 | |
If the full path to the python3 binary contains a non-ASCII character and the file system encoding is ASCII, Python fails with:

---
Could not find platform independent libraries <prefix>
Could not find platform dependent libraries <exec_prefix>
Consider setting $PYTHONHOME to <prefix>[:<exec_prefix>]
Fatal Python error: Py_Initialize: can't initialize sys standard streams
ImportError: No module named encodings.utf_8
Abandon
---

The file system encoding is set to ASCII if there is no locale (e.g. LANG=C). The problem is that the command line arguments, especially argv[0], are stored in a wchar_t* string that uses surrogates to represent undecodable bytes.

The attached patch fixes calculate_path() and the import functions to support surrogates. Details:
* Initialize Py_FileSystemDefaultEncoding earlier in Py_InitializeEx(), because its value is required to encode unicode using surrogates to bytes
* Rename char2wchar() to _Py_char2wchar() (the function is no longer static) and create the function _Py_wchar2char()
* Escape surrogates (reimplement the surrogateescape decoder) in the calculate_path() subfunctions (_wstat, _wgetcwd, _Py_wreadlink)
* Use the surrogateescape error handler in find_module(), NullImporter_init() and zipimporter_init()
* Write a "fastpath" (I don't know the right term: is it a hack?) for the utf-8 encoding with the surrogateescape error handler in PyUnicode_AsEncodedObject() and PyUnicode_AsEncodedString(): required because these functions are called before the codecs module is initialized

The patch is a work in progress: there are some FIXMEs (I don't know if the string should be encoded/decoded using surrogates or not).

I only tested the ASCII and UTF-8 file system encodings. I don't know if we can support more encodings. Python has few built-in encodings. Other encodings are implemented in Python: we have to import them, but we need the codec to import a module, so...

I don't think that Windows is affected by this issue because it has a better API for unicode filenames and command line arguments, and most patched functions are surrounded by #ifndef WINDOWS ... #endif |
|||
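For readers unfamiliar with PEP 383, the round trip that the message above relies on can be sketched in pure Python. This is only an illustration of the surrogateescape error handler, not the C code from the patch; the path and the helper names `decode_path` and `encode_path` are hypothetical.

```python
# Sketch of the PEP 383 round trip for an undecodable path.
# The path and the helper names are made up for illustration.

raw_path = b"/home/caf\xe9/bin/python3"  # 0xE9 is not valid ASCII


def decode_path(raw: bytes, encoding: str = "ascii") -> str:
    """Decode bytes, mapping each undecodable byte 0xNN to U+DCNN."""
    return raw.decode(encoding, "surrogateescape")


def encode_path(text: str, encoding: str = "ascii") -> bytes:
    """Encode text, turning each lone surrogate U+DCNN back into byte 0xNN."""
    return text.encode(encoding, "surrogateescape")


decoded = decode_path(raw_path)
# The byte 0xE9 is now stored as the lone surrogate U+DCE9.
assert decoded == "/home/caf\udce9/bin/python3"
# Encoding with the same handler restores the original bytes exactly.
assert encode_path(decoded) == raw_path
```

Any code path that encodes such a string without the surrogateescape handler raises UnicodeEncodeError, which is the kind of failure the patch tries to remove from the startup and import code.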
| msg101816 | Author: STINNER Victor (vstinner) |
Date: 2010-03-27 01:17 | |
If I understood correctly, my patch is also required to import a module having a non-ASCII full path if the file system encoding is ASCII. |
|||
| msg101818 | Author: STINNER Victor (vstinner) |
Date: 2010-03-27 01:51 | |
> Initialize Py_FileSystemDefaultEncoding earlier in Py_InitializeEx(),
> because its value is required to encode unicode using surrogates to bytes

Oh, it doesn't work: get_codeset() returns NULL, because the codec registry is empty when get_codeset() is called (with my patch). |
|||
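The dependency described above (the locale's codeset name has to go through the codec registry before it can be used) can be approximated in Python. This is a rough analogue under the assumption that get_codeset() queries the locale and then normalizes the name through a codec lookup; `normalize_codeset` and `get_codeset_sketch` are hypothetical names, not CPython functions.

```python
import codecs
import locale


def normalize_codeset(name):
    """Return the registry's canonical codec name, or None if the lookup fails."""
    try:
        return codecs.lookup(name).name
    except LookupError:
        return None


def get_codeset_sketch():
    """Ask the C library for the locale codeset and normalize its name."""
    try:
        name = locale.nl_langinfo(locale.CODESET)
    except AttributeError:
        # nl_langinfo() is not available on every platform (e.g. Windows).
        return None
    if not name:
        return None
    return normalize_codeset(name)


print(get_codeset_sketch())
```

If the codec lookup cannot succeed, the function has nothing to return, which mirrors the NULL result reported in the message.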
| msg102960 | Author: STINNER Victor (vstinner) |
Date: 2010-04-12 17:14 | |
New patch fixing more issues about undecodable filenames.

 Lib/test/test_subprocess.py | 4 -
 Lib/unittest/runner.py | 4 +
 Modules/_posixsubprocess.c | 21 ++++++++--
 Modules/getpath.c | 90 +++++++++++++++++++++++++++++++++++++++-----
 Modules/posixmodule.c | 5 +-
 Modules/python.c | 6 +-
 Modules/zipimport.c | 11 ++++-
 Objects/fileobject.c | 6 +-
 Objects/unicodeobject.c | 22 ++++++++--
 Parser/tokenizer.c | 14 ++++--
 Python/_warnings.c | 7 +++
 Python/ast.c | 10 +++-
 Python/ceval.c | 2
 Python/errors.c | 2
 Python/import.c | 37 +++++++++++++-----
 Python/traceback.c | 38 ++++++++++++++----
 16 files changed, 225 insertions(+), 54 deletions(-)

TODO:
- Remove assert(PyBytes_Check(opath)); from NullImporter_init() and zipimporter_init()
- Fix setup_context() (_warnings.c)
- Reencode module filenames if the system default encoding changes
- Lib/unittest/runner.py and Lib/test/test_subprocess.py contain hacks to fix tests. They might be rewritten
- Fix the 3 "FIXME: use _Py_char2wchar" in getpath.c

I restored the code setting the system encoding. The patch also fixes _posixsubprocess.fork_exec() to support an undecodable current working directory. |
|||
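Several of the files touched above (traceback.c, errors.c, _warnings.c) deal with showing filenames to the user. A minimal Python sketch of that display problem follows: a surrogate-escaped name cannot be encoded strictly, so it has to be rendered with a lossy but safe handler. The filename and the helper `display_name` are hypothetical, and this is not the code from the patch.

```python
def display_name(name: str) -> str:
    """Return an ASCII-only rendering of a possibly surrogate-escaped name."""
    return name.encode("ascii", "backslashreplace").decode("ascii")


# A filename whose bytes are not valid UTF-8, decoded the PEP 383 way.
undecodable = b"caf\xe9.py".decode("utf-8", "surrogateescape")

# Strict encoding refuses lone surrogates.
try:
    undecodable.encode("utf-8")
except UnicodeEncodeError:
    strict_failed = True
else:
    strict_failed = False

shown = display_name(undecodable)
print(strict_failed, shown)
```

The backslashreplace rendering is meant for messages only; to hand the name back to the operating system, it should be encoded with surrogateescape instead so that the original bytes are recovered.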
| msg103104 | Author: STINNER Victor (vstinner) |
Date: 2010-04-14 00:34 | |
New version of the patch: all tests pass except 3 (test_ftplib, test_pep3120, test_traceback). |
|||
| msg103550 | Author: STINNER Victor (vstinner) |
Date: 2010-04-18 23:29 | |
I committed the platform.py patch as r80166 (trunk) and r80167 (py3k), but quickly reverted it because the patch on trunk broke the Python bootstrap. The patch might be applied, but only on py3k and with more tests (ensure that it doesn't break the bootstrap on any OS) :-) |
|||
| msg103662 | Author: STINNER Victor (vstinner) |
Date: 2010-04-20 00:25 | |
Updated patch:
- Some parts have been applied in other issues
- Remove assert(PyBytes_Check(x)): support PyByteArray type
- use PyErr_Format() instead of sprintf+PyErr_SetString in tokenizer.c
- don't convert message to bytes and then back to unicode in err_input(): keep the unicode object |
|||
| msg103663 | Author: STINNER Victor (vstinner) |
Date: 2010-04-20 00:28 | |
$ diffstat ~/surrogates-7.patch
 Doc/library/tarfile.rst | 15 +--
 Include/moduleobject.h | 1
 Lib/platform.py | 12 +-
 Lib/subprocess.py | 2
 Lib/tarfile.py | 14 --
 Lib/test/regrtest.py | 5 -
 Lib/test/test_import.py | 5 +
 Lib/test/test_reprlib.py | 4
 Lib/test/test_subprocess.py | 4
 Lib/test/test_tarfile.py | 4
 Lib/test/test_urllib.py | 8 +
 Lib/test/test_urllib2.py | 4
 Lib/test/test_xml_etree.py | 6 +
 Lib/traceback.py | 10 +-
 Lib/unittest/runner.py | 4
 Modules/_ctypes/callproc.c | 12 +-
 Modules/_ssl.c | 10 +-
 Modules/_tkinter.c | 6 -
 Modules/getpath.c | 100 ++++++++++++++++++--
 Modules/main.c | 46 +++++----
 Modules/posixmodule.c | 18 ++-
 Modules/pyexpat.c | 11 +-
 Modules/zipimport.c | 210 ++++++++++++++++++++++++++++++++------------
 Objects/codeobject.c | 7 +
 Objects/exceptions.c | 49 ++++++----
 Objects/fileobject.c | 6 -
 Objects/moduleobject.c | 22 +++-
 Objects/unicodeobject.c | 22 +++-
 Parser/tokenizer.c | 18 ++-
 Python/_warnings.c | 26 ++++-
 Python/ast.c | 10 +-
 Python/bltinmodule.c | 33 ++++--
 Python/ceval.c | 4
 Python/compile.c | 12 ++
 Python/errors.c | 4
 Python/import.c | 88 ++++++++++++------
 Python/pythonrun.c | 39 ++++----
 Python/traceback.c | 39 ++++++--
 38 files changed, 625 insertions(+), 265 deletions(-) |
|||
| msg103671 | Author: Martin v. Löwis (loewis) |
Date: 2010-04-20 05:45 | |
I haven't reviewed the patch in detail yet, but it seems to me that it fixes independent issues. -1000 on that. One problem, one bug report in the tracker, one commit. If this issue is about the import machinery not working anymore if there is a non-ASCII character in the path, then why the heck does it touch posixmodule.c???? As for modules that have non-ASCII characters in their module name: this is, again, an unrelated issue (ISTM), so if you want to deal with it, please create a new issue. |
|||
| msg103697 | Author: STINNER Victor (vstinner) |
Date: 2010-04-20 12:20 | |
> I haven't reviewed the patch in detail yet, but it seems to me that
> it fixes independent issues.

Right. First I only wanted to fix the import machinery, but then I fixed a lot of "independent" issues to test the patch on import. All fixes are related to surrogates. I'm splitting the big patch into small parts: see the dependency list of this issue. I will open a new issue for the import machinery. But this patch requires extra changes which are now discussed in new issues.

> (...) why the heck does it
> touch posixmodule.c?

I opened issue #8391 for this change: "os.execvpe() doesn't support surrogates in env". |
|||
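The posixmodule.c change mentioned above concerns environment variables whose values contain surrogates. A small Python sketch of the idea, assuming the environment is converted to bytes before being handed to the operating system: `encode_env` is a hypothetical helper, the variable name and value are made up, and this is not the actual posixmodule.c logic.

```python
def encode_env(env, encoding="utf-8"):
    """Encode an environment mapping to bytes, preserving undecodable bytes."""
    return {
        key.encode(encoding, "surrogateescape"):
            value.encode(encoding, "surrogateescape")
        for key, value in env.items()
    }


# A value that was originally raw bytes and was decoded with surrogateescape.
env = {"DATA_DIR": b"/srv/caf\xe9".decode("utf-8", "surrogateescape")}

encoded = encode_env(env)
print(encoded)
```

With a strict error handler the same encoding step would raise UnicodeEncodeError as soon as one variable contained a surrogate.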
| msg104933 | Author: STINNER Victor (vstinner) |
Date: 2010-05-04 13:32 | |
I opened a different issue to use surrogates in Python module path: #8611, but the issue is not specific to surrogates ("Python3 doesn't support locale different than utf8 and an non-ASCII path (POSIX)"). |
|||
| msg112019 | Author: STINNER Victor (vstinner) |
Date: 2010-07-29 22:26 | |
I created a new svn branch for my work on import in unicode. I will open a new issue and so I close this one. |
|||
| msg112020 | Author: STINNER Victor (vstinner) |
Date: 2010-07-29 22:27 | |
Remove dependency on #6697 to be able to close this issue. |
|||
| History | |||
|---|---|---|---|
| Date | User | Action | Args |
| 2022-04-11 14:56:59 | admin | set | github: 52489 |
| 2010-07-29 22:27:45 | vstinner | set | status: open -> closed resolution: not a bug dependencies: - Check that _PyUnicode_AsString() result is not NULL messages: + msg112020 |
| 2010-07-29 22:26:23 | vstinner | set | messages: + msg112019 |
| 2010-05-04 13:32:11 | vstinner | set | messages: + msg104933 |
| 2010-04-23 21:00:33 | vstinner | set | dependencies: + Check that _PyUnicode_AsString() result is not NULL |
| 2010-04-23 11:38:00 | vstinner | set | dependencies: + Don't accept bytearray as filenames, or simplify the API |
| 2010-04-20 23:38:50 | vstinner | set | dependencies: + _ssl: support surrogates in filenames, and bytes/bytearray filenames |
| 2010-04-20 12:20:43 | vstinner | set | messages: + msg103697 title: Support surrogates in import ; install Python in a non-ASCII directory -> Improve support of PEP 383 (surrogates) in Python3: meta-issue |
| 2010-04-20 12:12:56 | vstinner | set | dependencies: + bz2: support surrogates in filename, and bytes/bytearray filename |
| 2010-04-20 12:03:16 | vstinner | set | dependencies: + subprocess: surrogates of the error message (Python implementation on non-Windows) |
| 2010-04-20 11:16:16 | vstinner | set | dependencies: + utf8, backslashreplace and surrogates |
| 2010-04-20 05:45:57 | loewis | set | messages: + msg103671 |
| 2010-04-20 00:28:19 | vstinner | set | messages: + msg103663 |
| 2010-04-20 00:27:55 | vstinner | set | files: - surrogates-6.patch |
| 2010-04-20 00:25:29 | vstinner | set | files: + surrogates-7.patch messages: + msg103662 |
| 2010-04-18 23:29:48 | vstinner | set | messages: + msg103550 |
| 2010-04-18 23:27:35 | vstinner | set | dependencies: + tarfile: use surrogates for undecode fields |
| 2010-04-16 01:14:54 | vstinner | set | dependencies: + pickle is unable to encode unicode surrogates |
| 2010-04-16 01:10:46 | vstinner | set | dependencies: + os.system() doesn't support surrogates nor bytes |
| 2010-04-14 01:18:17 | vstinner | set | files: - surrogates-5.patch |
| 2010-04-14 01:16:36 | vstinner | set | dependencies: + ctypes.dlopen() doesn't support surrogates |
| 2010-04-14 01:09:07 | vstinner | set | dependencies: + subprocess: support undecodable current working directory on POSIX OS |
| 2010-04-14 00:34:31 | vstinner | set | files: + surrogates-6.patch messages: + msg103104 |
| 2010-04-14 00:02:40 | vstinner | set | dependencies: + os.execvpe() doesn't support surrogates in env |
| 2010-04-13 23:37:47 | vstinner | set | dependencies: + test_xmlrpc fails with non-ascii path |
| 2010-04-12 17:20:54 | vstinner | set | files: - surrogates_bootstrap-4.patch |
| 2010-04-12 17:14:28 | vstinner | set | files: + surrogates-5.patch messages: + msg102960 |
| 2010-03-27 13:39:06 | pitrou | set | nosy: + loewis |
| 2010-03-27 01:51:05 | vstinner | set | messages: + msg101818 |
| 2010-03-27 01:17:33 | vstinner | set | messages: + msg101816 |
| 2010-03-27 01:12:36 | vstinner | create | |
