
feat: Add delta format to `FileSource`, add support for it in ibis/duckdb by tokoko · Pull Request #4123 · feast-dev/feast

@tokoko

What this PR does / why we need it:

This PR adds the delta format to FileSource (previously only parquet was supported) and implements the logic for handling delta sources in the ibis offline store. It also adds integration tests for the duckdb offline store with delta sources; the duckdb tests are effectively run twice, once for parquet and once for delta.

Signed-off-by: tokoko <togurg14@freeuni.edu.ge>

HaoXuAI


Nice.
I was wondering how we could use Delta Lake. One question: how can the delta format be used with the spark offline store or spark materialization?

@tokoko

SparkSource already supports it afaik. SparkSource can also be used to just describe the source as a table name without actually caring about the format (I think). While that is probably fine for the spark offline store, one downside is that it's tied to a single implementation.

What I'm hoping to achieve with FileSource is that we can have a single source type that can be read by multiple offline stores (dask, duckdb, spark and others) so you can define a set of sources in your feature repository that won't lock you to use a specific offline store only.
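The idea of one source definition shared by several offline stores can be sketched in plain Python. Everything here is hypothetical (`SourceSketch`, `STORE_READERS`, `plan_read` are illustration-only names, and the reader functions just return descriptive strings); it is not how Feast implements this, only a picture of the decoupling being described.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass(frozen=True)
class SourceSketch:
    """Hypothetical stand-in for a store-agnostic file source."""
    path: str
    file_format: str  # "parquet" or "delta"


Reader = Callable[[str], str]

# Each store brings its own reader per format; the source itself never changes.
# The strings are placeholders for real read logic.
STORE_READERS: Dict[str, Dict[str, Reader]] = {
    "duckdb": {
        "parquet": lambda p: f"duckdb reads parquet at {p}",
        "delta": lambda p: f"duckdb reads delta at {p}",
    },
    "spark": {
        "parquet": lambda p: f"spark reads parquet at {p}",
        "delta": lambda p: f"spark reads delta at {p}",
    },
}


def plan_read(store: str, source: SourceSketch) -> str:
    """Pick the reader for this store and format, or fail loudly."""
    readers = STORE_READERS[store]
    if source.file_format not in readers:
        raise ValueError(f"{store} cannot read format {source.file_format!r}")
    return readers[source.file_format](source.path)


source = SourceSketch(path="data/driver_stats", file_format="delta")
plans = [plan_read(store, source) for store in ("duckdb", "spark")]
```

The same `source` object is handed to both stores, so swapping the offline store would not require touching the source definitions.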

@HaoXuAI

> SparkSource already supports it afaik. SparkSource can also be used to just describe the source as a table name without actually caring about the format (I think). While that is probably fine for the spark offline store, one downside is that it's tied to a single implementation.
>
> What I'm hoping to achieve with FileSource is that we can have a single source type that can be read by multiple offline stores (dask, duckdb, spark and others) so you can define a set of sources in your feature repository that won't lock you to use a specific offline store only.

Makes sense. I guess that to use the delta format with pyspark, a delta spark plugin needs to be installed?

@HaoXuAI

LGTM!
It would be great to update the documentation as well.

@tokoko tokoko deleted the delta-file-source branch

April 20, 2024 17:07

@tokoko

> Makes sense. I guess that to use the delta format with pyspark, a delta spark plugin needs to be installed?

Yeah, delta data source needs to be configured beforehand in some way.
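As a rough illustration of what "configured beforehand" can mean for Spark, the settings below are the ones Delta Lake commonly asks for. This is a sketch, not something taken from this PR: the package version and Scala suffix are assumptions and must match your Spark installation, and `to_submit_args` is a hypothetical helper.

```python
# Spark settings commonly used to enable Delta Lake in a session.
DELTA_SPARK_CONF = {
    # Assumed artifact version; pick the one matching your Spark/Scala build.
    "spark.jars.packages": "io.delta:delta-spark_2.12:3.1.0",
    "spark.sql.extensions": "io.delta.sql.DeltaSparkSessionExtension",
    "spark.sql.catalog.spark_catalog": "org.apache.spark.sql.delta.catalog.DeltaCatalog",
}


def to_submit_args(conf):
    """Render the settings as `--conf key=value` arguments for spark-submit."""
    args = []
    for key, value in sorted(conf.items()):
        args += ["--conf", f"{key}={value}"]
    return args


submit_args = to_submit_args(DELTA_SPARK_CONF)
```

The same key/value pairs could instead be passed one by one to `SparkSession.builder.config(key, value)` when building a session in code.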

@ion-elgreco

@tokoko nice PR!

I had kind of put this on the back burner :P and focused my energy more on delta-rs.

@tokoko

thanks, this wouldn't be possible without delta-rs.