Bytewax materialization can run indefinitely

james-crabtree-sp

Expected Behavior

Bytewax materialization should run each pod to successful completion once and then mark the job status as successful.

Current Behavior

If a node crashes, the records of pods that already succeeded can be lost, and the job reruns all of those pods. If node crashes occur often enough, the job keeps rerunning pods that had already succeeded and never completes.

Steps to reproduce

Run a materialization job against a multi-node Kubernetes cluster. Terminate one of the nodes while the job is running, and observe that the pods on that node are lost and rerun.

Specifications

Version: 0.31
Platform: Fedora Linux
Subsystem: bytewax batch_engine

Possible Solution

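For safety, the job should have a configurable activeDeadlineSeconds so that Kubernetes terminates it after a fixed time instead of letting it run forever. It should also be possible to split the larger job into smaller batches, so that a node crash only forces a rerun of the pods in the affected batch.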
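
To make the proposal concrete, here is a rough sketch of how the two ideas could fit together. It is not the actual batch_engine code: the function names (`chunk`, `build_job_manifest`, `build_batched_jobs`), the `batch_size` and `active_deadline_seconds` options, and the container name are all hypothetical. It also assumes one pod per input path in an Indexed Job. Only the manifest fields themselves (`activeDeadlineSeconds`, `completions`, `parallelism`, `completionMode`) are standard Kubernetes `batch/v1` Job fields.

```python
from typing import Any, Dict, List, Optional


def chunk(items: List[str], batch_size: int) -> List[List[str]]:
    """Split items into consecutive batches of at most batch_size."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]


def build_job_manifest(
    name: str,
    image: str,
    num_pods: int,
    active_deadline_seconds: Optional[int] = None,
) -> Dict[str, Any]:
    """Build a batch/v1 Job manifest as a plain dict (hypothetical helper)."""
    spec: Dict[str, Any] = {
        "completions": num_pods,
        "parallelism": num_pods,
        "completionMode": "Indexed",
        "template": {
            "spec": {
                "restartPolicy": "Never",
                "containers": [{"name": "worker", "image": image}],
            }
        },
    }
    # Only set the deadline when configured, so the default behavior is unchanged.
    if active_deadline_seconds is not None:
        spec["activeDeadlineSeconds"] = active_deadline_seconds
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": name},
        "spec": spec,
    }


def build_batched_jobs(
    job_name: str,
    image: str,
    paths: List[str],
    batch_size: int,
    active_deadline_seconds: Optional[int] = None,
) -> List[Dict[str, Any]]:
    """Create one smaller Job per batch of paths instead of one large Job."""
    return [
        build_job_manifest(
            f"{job_name}-{i}", image, len(batch), active_deadline_seconds
        )
        for i, batch in enumerate(chunk(paths, batch_size))
    ]
```

With five input paths and a batch size of two, this would yield three Jobs with 2, 2, and 1 pods, each bounded by the same deadline. A node crash would then affect at most the pods of the batch running at that time rather than the whole materialization.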