gh-146192: Add base32 support to binascii (#146193)

kangtastic

Synopsis

Add base32 encoder and decoder functions implemented in C to binascii and use them to greatly improve the performance and reduce the memory usage of the existing base32 codec functions in base64.

No API or documentation changes are necessary with respect to any functions in base64, and all existing unit tests for those functions continue to pass without modification.

Resolves: gh-146192

Discussion

The base32-related functions in base64 are now wrappers for the new functions in binascii, as envisioned in the docs:

The binascii module contains a number of methods to convert between binary and various ASCII-encoded binary representations. Normally, you will not use these functions directly but use wrapper modules like uu or base64 instead. The binascii module contains low-level functions written in C for greater speed that are used by the higher-level modules.

Comments and questions are welcome.

Benchmarks

Benchmark script

# bench_b32.py

# Note: Can be EXTREMELY SLOW on unmodified mainline CPython.

import base64
import sys
import timeit
import tracemalloc

funcs = [(base64.b64encode, base64.b64decode), # sanity check/comparison
         (base64.b32encode, base64.b32decode),
         (base64.b32hexencode, base64.b32hexdecode)]

def mb(n):
    return f"{n / 1024 / 1024:.3f}"

def stats(func, data, t, m):
    name, n, bps = func.__qualname__, len(data), len(data) / t
    print(f"{name:<16}{n:<16}{t:<11.3f}{mb(bps):<13}{mb(m)}")

if __name__ == "__main__":
    print(f"Python {sys.version}\n")
    print("function        processed (b)   time (s)   avg (MB/s)   mem (MB)\n")
    data = b"a" * int(sys.argv[1]) * 1024 * 1024
    for fenc, fdec in funcs:
        tracemalloc.start()
        enc = fenc(data)
        menc = tracemalloc.get_traced_memory()[1] - len(enc)
        tracemalloc.stop()
        tenc = timeit.timeit("fenc(data)", number=1, globals=globals())
        stats(fenc, data, tenc, menc)

        tracemalloc.start()
        dec = fdec(enc)  # decode the encoded data so the decoder's memory is measured
        mdec = tracemalloc.get_traced_memory()[1] - len(dec)
        tracemalloc.stop()
        tdec = timeit.timeit("fdec(enc)", number=1, globals=globals())
        stats(fdec, enc, tdec, mdec)

Unmodified mainline CPython

$ ./python bench_b32.py 16
Python 3.15.0a7+ (heads/main:d357a7dbf38, Mar 19 2026, 23:22:25) [GCC 15.2.0]

function        processed (b)   time (s)   avg (MB/s)   mem (MB)

b64encode       16777216        0.015      1088.370     0.000
b64decode       22369624        0.017      1264.389     0.000
b32encode       16777216        2.308      6.933        17.382
b32decode       26843552        3.389      7.553        27.787
b32hexencode    16777216        2.338      6.843        17.379
b32hexdecode    26843552        3.388      7.557        27.787

With this PR

$ ./python bench_b32.py 16
Python 3.15.0a7+ (heads/base32-accel:72fd0f0302a, Mar 20 2026, 00:04:23) [GCC 15.2.0]

function        processed (b)   time (s)   avg (MB/s)   mem (MB)

b64encode       16777216        0.015      1084.957     0.000
b64decode       22369624        0.016      1363.524     0.000
b32encode       16777216        0.017      967.528      0.000
b32decode       26843552        0.016      1581.002     0.000
b32hexencode    16777216        0.016      995.277      0.000
b32hexdecode    26843552        0.016      1588.353     0.000

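In the runs above, Base32 encoding throughput improves by roughly 140x to 145x (from about 7 MB/s to about 970 to 995 MB/s) and decoding throughput by roughly 210x (from about 7.6 MB/s to about 1580 to 1590 MB/s). The measured auxiliary memory drops from 17 to 28 MB to effectively zero.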
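The figures in the two tables can be cross-checked with a little arithmetic. The sketch below copies the throughput values from the benchmark output above. `encoded_len` is a helper introduced here for illustration only; it is not part of the PR.

```python
def encoded_len(n, in_group, out_group):
    """Length of padded output when every in_group input bytes become out_group characters."""
    return -(-n // in_group) * out_group  # ceiling division, then scale

n = 16 * 1024 * 1024                # 16 MiB of input, as in `bench_b32.py 16`
b64_len = encoded_len(n, 3, 4)      # Base64: 3 bytes -> 4 characters
b32_len = encoded_len(n, 5, 8)      # Base32: 5 bytes -> 8 characters
print(n, b64_len, b32_len)          # 16777216 22369624 26843552

# Average throughput in MB/s, copied from the tables above.
before = {"b32encode": 6.933, "b32decode": 7.553,
          "b32hexencode": 6.843, "b32hexdecode": 7.557}
after = {"b32encode": 967.528, "b32decode": 1581.002,
         "b32hexencode": 995.277, "b32hexdecode": 1588.353}

speedup = {name: after[name] / before[name] for name in before}
for name, ratio in speedup.items():
    print(f"{name:<14}{ratio:7.1f}x")
```

The computed lengths match the "processed (b)" column for the decoders, which confirms that the decode rows operate on the padded encoder output.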

📚 Documentation preview 📚: https://cpython-previews--146193.org.readthedocs.build/

serhiy-storchaka

You can now update your PR, @kangtastic.

kangtastic

@serhiy-storchaka Already on it 😄

- Use the new `alphabet` parameter in `binascii`
- Remove `binascii.a2b_base32hex()` and `binascii.b2a_base32hex()`
- Change value for `.. versionadded::` ReST directive in docs for new `binascii` functions to "next" instead of "3.15"

serhiy-storchaka

I added some suggestions, but the core LGTM.

Please add assertions for new alphabets in test_constants.

- Update docs to refer to "Base 32" and "Base32"
- Update docs to better explain `binascii.a2b_base32()`
- Inline helper function in `base64`
- Add forgotten tests for presence of alphabet module globals

serhiy-storchaka

Please add also the What's New entry.

- Revise docs
- Add whatsnew entry
- Minor whitespace change in tests

Referring to a group of 8 bytes as an "octet" may cause confusion, because the term is already commonly used in some languages to refer to a group of 8 bits (i.e. a byte). "Octa" is a suitable preexisting alternative for a group of 64 bits [1] (used by Knuth himself, at that). "Octad" was considered, but it, too, historically refers to a byte.

Also rename "quintet" to "quint". "Pentad" was considered, but it historically refers to a group of 5 bits.

[1] https://en.wikipedia.org/wiki/Units_of_information
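For readers unfamiliar with the grouping, here is a pure-Python reference sketch of the encoding. It assumes "quint" means the 5-byte input group and "octa" the 8-character output group; the commit message does not spell this out. It is not the C implementation from this PR, and `b32encode_reference` is a name invented for the example.

```python
import base64

B32_ALPHABET = b"ABCDEFGHIJKLMNOPQRSTUVWXYZ234567"

def b32encode_reference(data: bytes) -> bytes:
    """Encode 5 input bytes at a time into 8 output characters (slow, for illustration)."""
    out = bytearray()
    for i in range(0, len(data), 5):
        chunk = data[i:i + 5]
        # Zero-pad a short final group to 40 bits and read it as one integer.
        bits = int.from_bytes(chunk.ljust(5, b"\0"), "big")
        # Split the 40 bits into eight 5-bit indices, most significant first.
        group = bytes(B32_ALPHABET[(bits >> shift) & 0x1F]
                      for shift in range(35, -1, -5))
        # Only ceil(8 * len(chunk) / 5) characters carry data; the rest is padding.
        keep = (len(chunk) * 8 + 4) // 5
        out += group[:keep] + b"=" * (8 - keep)
    return bytes(out)

samples = [b"", b"f", b"fo", b"foo", b"foob", b"fooba", b"foobar", bytes(range(256))]
agrees = all(b32encode_reference(s) == base64.b32encode(s) for s in samples)
print(agrees)
```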

serhiy-storchaka

LGTM. 👍

- Reword NEWS.d entry to "Base32" instead of "base-32". No prior entries have ever mentioned "base-64", etc., but they have mentioned "Base64", etc., so this is more consistent.
- Reword whatsnew entry to "Base32" instead of "Base 32". No prior entries have ever mentioned "Base 64", etc., and there is an entry a little further up mentioning "Ascii85, Base85, and Z85", so this is more consistent.
- Add a whatsnew entry in Optimizations > base64 & binascii section.
- Whitespace change in `binascii.c`.

When decoding invalid length (1, 3 or 6 mod 8) + no padding, mention the invalid length instead of the improper padding in the exception message to match what the base64 decoder does. Additionally, move the logic for setting the exception message (back) outside the "slow path" loop; if we do end up checking canonicity of decoder input, it will feel (subjectively) better to have several checks grouped together after the loop.
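The "1, 3 or 6 mod 8" rule follows from how many characters a partial final group needs. The short derivation below is plain arithmetic on the Base32 grouping, not the decoder code from this PR.

```python
# A final group of r leftover bytes (r = 0..4) needs ceil(8 * r / 5) characters.
valid_remainders = {(8 * r + 4) // 5 for r in range(5)}

# Any other unpadded length modulo 8 cannot come from a whole number of bytes.
invalid_remainders = set(range(8)) - valid_remainders

print(sorted(valid_remainders))    # [0, 2, 4, 5, 7]
print(sorted(invalid_remainders))  # [1, 3, 6]
```

For example, a single leftover character carries only 5 bits, which is less than one byte, so a length of 1 mod 8 can never be valid regardless of padding.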

gpshead

nice work!

kangtastic

@serhiy-storchaka, @gpshead, thanks for the quick review! Doing more of this sort of thing might be fun. Stay safe out there.

bedevere-app Bot mentioned this pull request Mar 20, 2026

C accelerator for Base32 character encoding #146192

Closed

serhiy-storchaka requested review from gpshead and serhiy-storchaka March 20, 2026 09:00

Update PR for python#145981 …

bf1308f

kangtastic force-pushed the base32-accel branch from db96a3f to bf1308f Compare March 20, 2026 16:01

kangtastic marked this pull request as ready for review March 20, 2026 16:03

bedevere-app Bot added the awaiting review label Mar 20, 2026

serhiy-storchaka reviewed Mar 21, 2026

kangtastic added 2 commits March 21, 2026 07:56

Address reviewer feedback …

a9a7d26

Update generated files

6f80c54

gpshead reviewed Mar 22, 2026

serhiy-storchaka reviewed Mar 22, 2026

kangtastic added 2 commits March 22, 2026 02:20

Address more reviewer feedback …

4c82070

kangtastic requested a review from AA-Turner as a code owner March 22, 2026 09:43

serhiy-storchaka approved these changes Mar 22, 2026

bedevere-app Bot added awaiting merge and removed awaiting review labels Mar 22, 2026

kangtastic added 2 commits March 22, 2026 07:26

serhiy-storchaka approved these changes Mar 22, 2026

gpshead approved these changes Mar 22, 2026

bedevere-app Bot removed the awaiting merge label Mar 22, 2026

pablogsal mentioned this pull request Apr 4, 2026

gh-130273: Fix traceback color output with unicode characters #142529

Merged
