gh-146192: Add base32 support to binascii#146193

Merged
serhiy-storchaka merged 8 commits into python:main from kangtastic:base32-accel on Mar 22, 2026

Conversation

@kangtastic kangtastic commented Mar 20, 2026

Contributor

Synopsis

Add base32 encoder and decoder functions implemented in C to binascii and use them to greatly improve the performance and reduce the memory usage of the existing base32 codec functions in base64.

No API or documentation changes are necessary with respect to any functions in base64, and all existing unit tests for those functions continue to pass without modification.

Resolves: gh-146192

Discussion

The base32-related functions in base64 are now wrappers for the new functions in binascii, as envisioned in the docs:

The binascii module contains a number of methods to convert between binary and various ASCII-encoded binary representations. Normally, you will not use these functions directly but use wrapper modules like uu or base64 instead. The binascii module contains low-level functions written in C for greater speed that are used by the higher-level modules.
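
To make the wrapper relationship concrete, here is a minimal sketch. `base64.b32encode()` and `base64.b32decode()` are the long-standing public functions; `binascii.b2a_base32()` is assumed to be the name of the new low-level encoder (inferred from commit messages later in this thread), and its exact signature is an assumption, so the sketch only calls it when the running interpreter provides it.

```python
import base64
import binascii


def b32_roundtrip(data: bytes) -> tuple[bytes, bytes]:
    """Encode and then decode through the public base64 wrappers."""
    encoded = base64.b32encode(data)
    decoded = base64.b32decode(encoded)
    return encoded, decoded


encoded, decoded = b32_roundtrip(b"hello")
print(encoded, decoded)

# Assumption: builds that include this PR expose binascii.b2a_base32().
# Older interpreters do not have it, so look it up defensively.
low_level = getattr(binascii, "b2a_base32", None)
if low_level is not None:
    print(low_level(b"hello"))
else:
    print("binascii.b2a_base32 is not available on this interpreter")
```

Code that already calls the `base64` functions needs no changes to benefit, which is the point of keeping the public API untouched.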

Comments and questions are welcome.

Benchmarks

Benchmark script

# bench_b32.py

# Note: Can be EXTREMELY SLOW on unmodified mainline CPython.

import base64
import sys
import timeit
import tracemalloc

funcs = [(base64.b64encode, base64.b64decode), # sanity check/comparison
         (base64.b32encode, base64.b32decode),
         (base64.b32hexencode, base64.b32hexdecode)]

def mb(n):
    return f"{n / 1024 / 1024:.3f}"

def stats(func, data, t, m):
    name, n, bps = func.__qualname__, len(data), len(data) / t
    print(f"{name:<16}{n:<16}{t:<11.3f}{mb(bps):<13}{mb(m)}")

if __name__ == "__main__":
    print(f"Python {sys.version}\n")
    print(f"function        processed (b)   time (s)   avg (MB/s)   mem (MB)\n")
    data = b"a" * int(sys.argv[1]) * 1024 * 1024
    for fenc, fdec in funcs:
        tracemalloc.start()
        enc = fenc(data)
        menc = tracemalloc.get_traced_memory()[1] - len(enc)
        tracemalloc.stop()
        tenc = timeit.timeit("fenc(data)", number=1, globals=globals())
        stats(fenc, data, tenc, menc)

        tracemalloc.start()
        dec = fdec(enc)
        mdec = tracemalloc.get_traced_memory()[1] - len(dec)
        tracemalloc.stop()
        tdec = timeit.timeit("fdec(enc)", number=1, globals=globals())
        stats(fdec, enc, tdec, mdec)

Unmodified mainline CPython

$ ./python bench_b32.py 16
Python 3.15.0a7+ (heads/main:d357a7dbf38, Mar 19 2026, 23:22:25) [GCC 15.2.0]

function        processed (b)   time (s)   avg (MB/s)   mem (MB)

b64encode       16777216        0.015      1088.370     0.000
b64decode       22369624        0.017      1264.389     0.000
b32encode       16777216        2.308      6.933        17.382
b32decode       26843552        3.389      7.553        27.787
b32hexencode    16777216        2.338      6.843        17.379
b32hexdecode    26843552        3.388      7.557        27.787
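
The decode rows report the length of the encoded input, which is why they differ from the 16777216 bytes fed to the encoders. A small sketch of that arithmetic, assuming padded output (Base64 maps every 3 bytes to 4 characters, Base32 maps every 5 bytes to 8 characters); the helper name `padded_len` is made up for this illustration:

```python
def padded_len(n: int, group_in: int, group_out: int) -> int:
    """Length of padded output when each group_in bytes become group_out characters."""
    groups = -(-n // group_in)  # ceiling division
    return groups * group_out


n = 16 * 1024 * 1024            # 16 MiB of input, as in the benchmark run
b64_len = padded_len(n, 3, 4)   # Base64: 3 bytes -> 4 characters
b32_len = padded_len(n, 5, 8)   # Base32: 5 bytes -> 8 characters
print(n, b64_len, b32_len)
```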

With this PR

$ ./python bench_b32.py 16
Python 3.15.0a7+ (heads/base32-accel:72fd0f0302a, Mar 20 2026, 00:04:23) [GCC 15.2.0]

function        processed (b)   time (s)   avg (MB/s)   mem (MB)

b64encode       16777216        0.015      1084.957     0.000
b64decode       22369624        0.016      1363.524     0.000
b32encode       16777216        0.017      967.528      0.000
b32decode       26843552        0.016      1581.002     0.000
b32hexencode    16777216        0.016      995.277      0.000
b32hexdecode    26843552        0.016      1588.353     0.000

Encoding performance is improved by ~150x, decoding performance is improved by ~200x,
and no auxiliary memory is used.
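
For reference, the ratios can be recomputed from the two tables above. This sketch only copies the "avg (MB/s)" figures from the runs shown and divides them; it works out to roughly 140x to 145x for encoding and about 210x for decoding, in line with the rounded figures quoted.

```python
# Throughput in MB/s, copied from the benchmark tables above.
mainline = {
    "b32encode": 6.933,
    "b32decode": 7.553,
    "b32hexencode": 6.843,
    "b32hexdecode": 7.557,
}
with_pr = {
    "b32encode": 967.528,
    "b32decode": 1581.002,
    "b32hexencode": 995.277,
    "b32hexdecode": 1588.353,
}

# Speedup is the ratio of throughput with the PR to throughput on mainline.
speedup = {name: with_pr[name] / mainline[name] for name in mainline}
for name, ratio in speedup.items():
    print(f"{name:<14}{ratio:6.1f}x")
```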


📚 Documentation preview 📚: https://cpython-previews--146193.org.readthedocs.build/

Add base32 encoder and decoder functions implemented in
C to `binascii` and use them to greatly improve the
performance and reduce the memory usage of the existing
base32 codec functions in `base64`.

No API or documentation changes are necessary with
respect to any functions in `base64`, and all existing
unit tests for those functions continue to pass without
modification.

Resolves: pythongh-146192
@serhiy-storchaka

Member

You can now update your PR, @kangtastic.

@kangtastic

Contributor Author

@serhiy-storchaka Already on it 😄

- Use the new `alphabet` parameter in `binascii`
- Remove `binascii.a2b_base32hex()` and `binascii.b2a_base32hex()`
- Change value for `.. versionadded::` ReST directive in docs for
  new `binascii` functions to "next" instead of "3.15"
@kangtastic kangtastic marked this pull request as ready for review March 20, 2026 16:03

@serhiy-storchaka serhiy-storchaka left a comment

Member

I added some suggestions, but the core LGTM.

Please add assertions for new alphabets in test_constants.

- Update docs to refer to "Base 32" and "Base32"
- Update docs to better explain `binascii.a2b_base32()`
- Inline helper function in `base64`
- Add forgotten tests for presence of alphabet module globals

@serhiy-storchaka serhiy-storchaka left a comment

Member

Please also add a What's New entry.

- Revise docs
- Add whatsnew entry
- Minor whitespace change in tests
Referring to a group of 8 bytes as an "octet" may cause
confusion, because the term is already commonly used in
some languages to refer to a group of 8 bits (i.e. a byte).

"Octa" is a suitable preexisting alternative for a group of
64 bits [1] (used by Knuth himself, at that). "Octad" was
considered, but it, too, historically refers to a byte.

Also rename "quintet" to "quint". "Pentad" was considered,
but it historically refers to a group of 5 bits.

[1] https://en.wikipedia.org/wiki/Units_of_information
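
To show the two group sizes these names refer to, here is a pure-Python sketch of encoding one group. It assumes a "quint" means 5 decoded bytes and an "octa" means the 8 encoded characters they map to, which is a reading of the commit message above rather than something the thread spells out. It is an illustration only, not the C implementation from this PR.

```python
import base64

B32_ALPHABET = b"ABCDEFGHIJKLMNOPQRSTUVWXYZ234567"


def encode_quint(quint: bytes) -> bytes:
    """Encode exactly 5 bytes (40 bits) as 8 Base32 characters."""
    if len(quint) != 5:
        raise ValueError("expected exactly 5 bytes")
    bits = int.from_bytes(quint, "big")
    # Take eight 5-bit slices, most significant first, and map each
    # slice to its character in the alphabet.
    return bytes(
        B32_ALPHABET[(bits >> shift) & 0x1F] for shift in range(35, -1, -5)
    )


octa = encode_quint(b"hello")
print(octa)
```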
@kangtastic kangtastic requested a review from AA-Turner as a code owner March 22, 2026 09:43

@serhiy-storchaka serhiy-storchaka left a comment

Member

LGTM. 👍

- Reword NEWS.d entry to "Base32" instead of "base-32".
  No prior entries have ever mentioned "base-64", etc.,
  but they have mentioned "Base64", etc., so this is
  more consistent.

- Reword whatsnew entry to "Base32" instead of "Base 32".
  No prior entries have ever mentioned "Base 64", etc.,
  and there is an entry a little further up mentioning
  "Ascii85, Base85, and Z85", so this is more consistent.

- Add a whatsnew entry in Optimizations > base64 & binascii
  section.

- Whitespace change in `binascii.c`.
When decoding invalid length (1, 3 or 6 mod 8) + no padding,
mention the invalid length instead of the improper padding in
the exception message to match what the base64 decoder does.

Additionally, move the logic for setting the exception message
(back) outside the "slow path" loop; if we do end up checking
canonicity of decoder input, it will feel (subjectively) better
to have several checks grouped together after the loop.
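
The residues 1, 3 and 6 follow from the encoding arithmetic: a trailing partial group of 1 to 4 bytes needs 2, 4, 5 or 7 characters, so an unpadded input can never have any other length mod 8. A short sketch of that derivation, plus a check that the existing `base64.b32decode()` wrapper rejects such input with `binascii.Error`; the helper `rejects` is made up for this example, and the exception message is deliberately not inspected because its wording is what the commit above changes.

```python
import base64
import binascii
import math

# Characters needed to encode 0..4 leftover bytes at 5 bits per character.
valid_residues = {math.ceil(8 * n / 5) for n in range(5)}
invalid_residues = set(range(8)) - valid_residues
print(sorted(valid_residues), sorted(invalid_residues))


def rejects(encoded: bytes) -> bool:
    """Return True if base64.b32decode() raises binascii.Error for the input."""
    try:
        base64.b32decode(encoded)
    except binascii.Error:
        return True
    return False


print(rejects(b"A"))         # one character: length is 1 mod 8
print(rejects(b"MY======"))  # properly padded encoding of b"f"
```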

@gpshead gpshead left a comment

Member

nice work!

@serhiy-storchaka serhiy-storchaka merged commit b4e5bc2 into python:main Mar 22, 2026
50 of 51 checks passed
@kangtastic

Contributor Author

@serhiy-storchaka, @gpshead, thanks for the quick review! Doing more of this sort of thing might be fun. Stay safe out there.

CuriousLearner added a commit to CuriousLearner/cpython that referenced this pull request Mar 23, 2026
…8577

* 'main' of github.com:python/cpython:
  pythongh-146197: Run -m test.pythoninfo on the Emscripten CI (python#146332)
  pythongh-146325: Use `test.support.requires_fork` in test_fastpath_cache_cleared_in_forked_child (python#146330)
  pythongh-146197: Add Emscripten to CI (python#146198)
  pythongh-143387: Raise an exception instead of returning None when metadata file is missing. (python#146234)
  pythongh-108907: ctypes: Document _type_ codes (pythonGH-145837)
  pythongh-146175: Soft-deprecate outdated macros; convert internal usage (pythonGH-146178)
  pythongh-146056: Rework ref counting in treebuilder_handle_end() (python#146167)
  Add a warning about untrusted input to `configparser` docs (python#146276)
  pythongh-145264: Do not ignore excess Base64 data after the first padded quad (pythonGH-145267)
  pythongh-146308: Fix error handling issues in _remote_debugging module (python#146309)
  pythongh-146192: Add base32 support to binascii (pythonGH-146193)
  pythongh-135953: Properly obtain main thread identifier in Gecko Collector (python#146045)
  pythongh-143414: Implement unique reference tracking for JIT, optimize unpacking of such tuples (pythonGH-144300)
  pythongh-146261: Fix bug in `_Py_uop_sym_set_func_version` (pythonGH-146291)
  pythongh-145144: Add more tests for UserList, UserDict, etc (pythonGH-145145)
  pythongh-143959: Fix test_datetime if _datetime is unavailable (pythonGH-145248)
  pythongh-146245: Fix reference and buffer leaks via audit hook in socket module (pythonGH-146248)
  pythongh-140049: Colorize exception notes in `traceback.py` (python#140051)
  Update docs for pythongh-146056 (pythonGH-146213)
ljfp pushed a commit to ljfp/cpython that referenced this pull request Apr 25, 2026
Add base32 encoder and decoder functions implemented in
C to the binascii module and use them to greatly improve the
performance and reduce the memory usage of the existing
base32 codec functions in the base64 module.

Development

Successfully merging this pull request may close these issues.

C accelerator for Base32 character encoding
