fix: Rewrite Spark materialization engine to use mapInPandas by tokoko · Pull Request #3936 · feast-dev/feast
What this PR does / why we need it:
The Spark materialization engine currently converts the data in each partition to a list of dicts before trying to write it to the online store. This is terribly inefficient and all but guarantees OOM errors on executors for sufficiently large datasets. This PR changes the implementation to use mapInPandas, which sidesteps the conversion to a list of dicts entirely.
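P.S. Unfortunately, a pyspark DataFrame has no pandas-based method that runs a function on each partition purely for its side effects (applyInPandas exists only on grouped data), so the implementation simulates one by using mapInPandas and returning dummy results back to the driver.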
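For reviewers unfamiliar with the pattern, here is a minimal sketch of the mapInPandas workaround described above. It is not the code in this PR: `make_partition_writer`, `write_batch`, `materialize`, and the `status int` dummy schema are hypothetical names chosen for illustration, and `spark_df` is assumed to be a `pyspark.sql.DataFrame`.

```python
from typing import Callable, Iterator

import pandas as pd


def make_partition_writer(
    write_batch: Callable[[pd.DataFrame], None],
) -> Callable[[Iterator[pd.DataFrame]], Iterator[pd.DataFrame]]:
    """Wrap a side-effecting writer so it fits the mapInPandas contract.

    `write_batch` is a hypothetical callback, e.g. a function that writes
    one pandas batch to an online store.
    """

    def _write_partition(batches: Iterator[pd.DataFrame]) -> Iterator[pd.DataFrame]:
        for batch in batches:
            # Side effect only: the batch stays a pandas DataFrame and is
            # never expanded into a list of dicts.
            write_batch(batch)
        # mapInPandas requires DataFrames that match the declared schema,
        # so yield a single dummy row per partition.
        yield pd.DataFrame({"status": [0]})

    return _write_partition


def materialize(spark_df, write_batch: Callable[[pd.DataFrame], None]) -> None:
    """Run `write_batch` over every partition of an assumed pyspark DataFrame."""
    dummy = spark_df.mapInPandas(make_partition_writer(write_batch), schema="status int")
    # mapInPandas is lazy; an action is needed to actually trigger the writes.
    dummy.count()
```

The dummy rows carry no information. They exist only because mapInPandas must return a DataFrame, and the final action is what forces Spark to execute the per-partition writes.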
P.P.S. Starting from 3.3.0, pyspark also has mapInArrow, which would simplify the operation even more, but using it would mean that feast no longer supports pyspark versions older than 3.3.0, so I decided not to use it for now.