feat: Add deduplicate pushdown to clickhouse - improve materialize performance by astronautas · Pull Request #5709 · feast-dev/feast
What this PR does / why we need it:
#5707. We have observed significant slowdown in heavy materialization jobs. Heavy = pulling in significant period of {start_ts, end_ts}, when essentially you end up with many values to be deduplicated by Feast's compute engine. Unfortunately, Polars local engine did not show any significant speed-up. Spark is out of question due to sheer complexity of adding yet another cluster engine. We also observed nearly twice increased memory usage...
Not being a fan of altering the core, I would like to introduce a local change to Clickhouse provider to pushdown the deduplication logic to offline store. This assumes you don't have any Feast compute transformations or aggregation. I would like to just use a flag for now and not make it overcomplicated i.e. try to infer it's Pandas, if it has compute engine transformations. Keeping it simple, for now, and disabled by default.