I changed the Cython script slightly to use a more naive implementation without memset.
It is now consistently and significantly faster than bytes(sorted(my_bytes)):
$ python -m timeit -c "from bytes_sort import bytes_sort" "bytes_sort(b'')"
500000 loops, best of 5: 495 nsec per loop
$ python -m timeit -c "from bytes_sort import bytes_sort" "bytes_sort(b'abc')"
500000 loops, best of 5: 519 nsec per loop
$ python -m timeit -c "from bytes_sort import bytes_sort" "bytes_sort(b'Let\'s test a proper string now. One that has some value to be sorted.')"
500000 loops, best of 5: 594 nsec per loop
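
For readers without the Cython module at hand, here is a minimal pure-Python sketch of what a "naive" byte sort could look like. The actual bytes_sort source is not shown in this message, so the counting-sort approach and the name counting_sort_bytes are assumptions for illustration only. It is useful for checking results against bytes(sorted(my_bytes)), not for reproducing the timings above.

```python
def counting_sort_bytes(data: bytes) -> bytes:
    """Hypothetical stand-in for bytes_sort; not the Cython code itself."""
    # One counter per possible byte value (0..255).
    counts = [0] * 256
    # Iterating over a bytes object yields ints in range(256).
    for value in data:
        counts[value] += 1
    # Emit each byte value as many times as it occurred, in ascending order.
    return b"".join(
        bytes([value]) * count
        for value, count in enumerate(counts)
        if count
    )


if __name__ == "__main__":
    sample = b"Let's test a proper string now. One that has some value to be sorted."
    # Sanity check against the built-in approach from the comparison above.
    print(counting_sort_bytes(sample) == bytes(sorted(sample)))
```

A counting sort fits this case because the key space is fixed at 256 values, so the work is one pass over the input plus one pass over the counters, with no comparisons.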