`compile()` raises `UnicodeEncodeError` for docstrings with lone surrogates
Bug report
Bug description:
Found by fuzzing.
The simplest repro:
2026-05-03T01:46:20.460163000+0200 maurycy@gimel /Users/maurycy/src/github.com/maurycy/cpython (main 6b632ce?) % ./python.exe repro.py
UnicodeEncodeError: 'utf-8' codec can't encode character '\ud800' in position 0: surrogates not allowed
[1] 2026-05-03T01:46:21.697294000+0200 maurycy@gimel /Users/maurycy/src/github.com/maurycy/cpython (main 6b632ce?) %
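The contents of `repro.py` are not shown in the transcript above. As a rough sketch only, assuming the script is a module whose docstring is the escape `\ud800` (the character named in the error message), the same code path can probably be exercised through `compile()` directly. The source text and the `outcome` variable are illustrative, not taken from the original script:

```python
# Hypothetical stand-in for repro.py: a module whose docstring is a lone surrogate.
# The source itself is pure ASCII; the surrogate only appears once the
# "\ud800" escape is evaluated into the docstring constant.
source = '"\\ud800"\n'

try:
    compile(source, "repro.py", "exec")
except UnicodeEncodeError as exc:
    outcome = f"UnicodeEncodeError: {exc}"
else:
    outcome = "compiled without error"

print(outcome)
```

If the assumption holds, affected builds should print the `UnicodeEncodeError` message, while unaffected versions should print `compiled without error`.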
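I believe this is the root cause: `_PyCompile_CleanDoc()` converts the docstring with `PyUnicode_AsUTF8AndSize()`, which fails for strings containing lone surrogates: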
    // C implementation of inspect.cleandoc()
    //
    // Difference from inspect.cleandoc():
    // - Do not remove leading and trailing blank lines to keep lineno.
    PyObject *
    _PyCompile_CleanDoc(PyObject *doc)
    {
        doc = PyObject_CallMethod(doc, "expandtabs", NULL);
        if (doc == NULL) {
            return NULL;
        }

        Py_ssize_t doc_size;
        const char *doc_utf8 = PyUnicode_AsUTF8AndSize(doc, &doc_size);
        if (doc_utf8 == NULL) {
            Py_DECREF(doc);
            return NULL;
        }
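The failing conversion can be illustrated from Python. This is a sketch that assumes `str.encode("utf-8")` is a close enough stand-in for the strict UTF-8 conversion done by `PyUnicode_AsUTF8AndSize()`; the variable names are illustrative:

```python
import inspect

doc = "\ud800"  # lone high surrogate, the character from the error message

# Strict UTF-8 encoding rejects lone surrogates.
try:
    doc.encode("utf-8")
except UnicodeEncodeError as exc:
    strict_error = str(exc)
else:
    strict_error = None

# The "surrogatepass" error handler encodes the surrogate and round-trips it.
raw = doc.encode("utf-8", "surrogatepass")
round_tripped = raw.decode("utf-8", "surrogatepass")

# The pure-Python inspect.cleandoc() never encodes, so it accepts the string.
cleaned = inspect.cleandoc(doc)

print(strict_error)
print(raw, round_tripped == doc, cleaned == doc)
```

The `surrogatepass` lines are there only to show that the string itself is representable. They are not a proposed patch for the C function.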
It used to work. 3.11 and 3.12 run the script without error, and 3.13 is the first version that fails:
2026-05-03T01:50:40.264145000+0200 maurycy@gimel /Users/maurycy/src/github.com/maurycy/cpython (main 6b632ce?) % uv run --python 3.11 ./repro.py
2026-05-03T01:50:46.764500000+0200 maurycy@gimel /Users/maurycy/src/github.com/maurycy/cpython (main 6b632ce?) % uv run --python 3.12 ./repro.py
2026-05-03T01:50:51.752907000+0200 maurycy@gimel /Users/maurycy/src/github.com/maurycy/cpython (main 6b632ce?) % uv run --python 3.13 ./repro.py
UnicodeEncodeError: 'utf-8' codec can't encode character '\ud800' in position 0: surrogates not allowed
[1] 2026-05-03T01:50:54.414813000+0200 maurycy@gimel /Users/maurycy/src/github.com/maurycy/cpython (main 6b632ce?) %
CPython versions tested on:
CPython main branch
Operating systems tested on:
macOS