GH-98363: Have batched() return tuples by rhettinger · Pull Request #100118 · python/cpython

rhettinger

No worries. I agree that there is no genuine "safety" concern, just a question of memory overuse here and there.

This was my fear: someone may be doing an operation that involves very high per-batch overhead, and so they choose a large batch size, say 0.1% of working memory. I've written code roughly like this before, and probably would have used batched() had it existed:

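from itertools import batched  # Python 3.12+

BATCHSIZE = 1_000_000

def get_data(db_cursor, ids):
    # One round trip per batch amortizes the high per-batch overhead.
    for batch in batched(ids, BATCHSIZE):
        db_cursor.executemany(SQL_GET_DATA, batch)
        yield from db_cursor.fetchall()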

def file_pairs():
    ...

for in_file, out_file in file_pairs():
    ids = get_data_from_file(in_file)
    for row in get_data(db_cursor, ids):
        append_data_to_file(out_file, row)

If many files have len(ids) % BATCHSIZE < 20 (which includes len(ids) < 20), the short final batches can land in the tuple freelists while still holding a full-size allocation, so there could be up to around 2_000*20*BATCHSIZE extra words caught up in various tuples.
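To put a number on that bound, here is a rough back-of-the-envelope calculation. It assumes 8-byte words (a 64-bit build) and reads the 2_000 and 20 above as the per-size freelist capacity and the number of freelist-eligible tuple lengths. The constant and function names are made up for illustration.

```python
# Illustrative names; the values come from the figures quoted above.
FREELIST_CAPACITY = 2_000  # assumed: cached tuples per size class
SIZE_CLASSES = 20          # assumed: tuple lengths eligible for a freelist
BATCHSIZE = 1_000_000
WORD_BYTES = 8             # assumed: 64-bit build


def worst_case_retained_bytes(batchsize=BATCHSIZE):
    """Upper bound if every cached tuple keeps a full batch-sized allocation."""
    words = FREELIST_CAPACITY * SIZE_CLASSES * batchsize
    return words * WORD_BYTES


print(worst_case_retained_bytes() / 10**9, "GB")  # 320.0 GB
```

This is only the ceiling: it is reached only if every freelist slot happens to be filled by an over-allocated leftover batch.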

Some mitigating factors though:

I imagine BATCHSIZE will often be hand-tuned and hard-coded, so if someone notices too much memory usage, they can shrink the batch size.
Freelists will often be full, so the stored tuples may not be added to the freelist anyway. Only specific allocation patterns matter.

But this data-dependence leads to another issue: the code seems like it ought to take a large constant amount of memory, but for some obscure inputs (many small leftover batches), it could take 40_000x as much memory, caught up in mysterious places like the file_pairs() tuples or database rows. Even if every file happens to have exactly 1_000_007 ids, we wind up with 2_000*1_000_000 allocated words in the length-7 freelist.
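The 1_000_007 case can be checked with a scaled-down sketch. `batched_sketch` is a hypothetical pure-Python stand-in for batched(); it only illustrates the batch lengths, not the C-level allocation behavior under discussion. The byte figure again assumes 8-byte words.

```python
from itertools import islice


def batched_sketch(iterable, n):
    """Pure-Python stand-in for batched(): yields tuples of up to n items."""
    it = iter(iterable)
    while batch := tuple(islice(it, n)):
        yield batch


def batch_lengths(n_ids, batchsize):
    """Lengths of the batches produced for n_ids items."""
    return [len(b) for b in batched_sketch(range(n_ids), batchsize)]


# Scaled down 1000x: 1_007 ids with a batch size of 1_000.
lengths = batch_lengths(1_007, 1_000)
print(lengths)  # [1000, 7]

# At full scale: 2_000 cached length-7 tuples, each assumed to keep
# its 1_000_000-slot allocation.
retained_words = 2_000 * 1_000_000
retained_bytes = retained_words * 8  # assumed 8-byte words
print(retained_bytes / 10**9, "GB")  # 16.0 GB
```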

I don't really know how to assess how probable these are, especially with the LIFO nature of our freelists, so my thought was to do the more predictable thing and be as conservative as possible with memory usage, giving things back to the operating system as soon as possible.

But maybe my usage of huge batches is playing with fire anyway, so it would be understandable to keep the code as is for that reason.

Another option would be to only do a _PyTuple_Resize if the result will have a length that belongs in a freelist, though I suppose that makes the code uglier. Or do as list_resize does and only realloc if we're under half full.
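The two alternatives can be modeled as simple predicates. This is a Python sketch of the decision logic only, not the C code; the function names are hypothetical and the threshold of 20 is assumed from the "< 20" used earlier.

```python
TUPLE_FREELIST_MAX_LEN = 20  # assumed threshold, matching the "< 20" above


def shrink_if_freelist_eligible(length, allocated):
    """Option 1: resize only when the short tuple could land in a freelist."""
    return length < allocated and length < TUPLE_FREELIST_MAX_LEN


def shrink_if_under_half_full(length, allocated):
    """Option 2: in the spirit of list_resize, realloc only when under half full."""
    return length < allocated // 2


for length in (7, 400_000, 900_000):
    print(
        length,
        shrink_if_freelist_eligible(length, 1_000_000),
        shrink_if_under_half_full(length, 1_000_000),
    )
# 7 True True
# 400000 False True
# 900000 False False
```

The first rule targets only the freelist problem and leaves a 400_000-item leftover over-allocated until it is freed; the second also trims that case, but skips the realloc when the batch is nearly full.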

Thoughts?