feat: Incorporate substrait ODFVs into ibis-based offline store queries by tokoko · Pull Request #4102 · feast-dev/feast

@tokoko

What this PR does / why we need it:

This PR changes the offline flow for ODFV execution in ibis-based offline stores (currently only duckdb). Instead of collecting raw offline store output into a pyarrow table and applying a substrait transformation with acero, ibis-based offline stores now receive ibis functions and directly extend the ibis logical plan, meaning ODFVs are executed by the offline store engine itself.

The feature necessitated a couple of underlying changes in substrait ODFVs:

  1. Substrait ODFVs now also store ibis Python functions in protos so that these functions can be applied directly in transform_ibis. This can probably be avoided in the future if a good substrait-to-ibis compiler becomes available, but it is currently the simplest solution. Note: there are other problems here as well. Substrait plans need to know precisely which input columns they will consume, which is problematic in the offline flow because there's no way to know beforehand which features the user will request in a get_historical_features call. To apply ODFVs correctly with substrait plans, one would need to apply each ODFV to the relevant subset of columns and then join all the tables back together (much like in the online flow), which is a horrible waste of resources.
  2. Unlike pandas transformations, ibis functions must now return all the columns that are passed as an input to the function (in addition to on demand features) as opposed to returning only the on demand features that they generate. This is to avoid an extra join back to the main table after ibis functions are applied to the output. This way multiple ODFVs can easily be chained, each mutating the offline store output to append the relevant columns.

With this PR, all of the planned ibis/substrait features will be in place in the Python SDK. The next step will be to enable native substrait ODFV execution in non-Python SDKs as well.

Which issue(s) this PR fixes:

Fixes #3979

Signed-off-by: tokoko <togurg14@freeuni.edu.ge>

HaoXuAI

LGTM. Maybe a bad question, but does this mean the substrait transformation is only available in the ibis offline store? Is it possible to use it with other offline stores? Suppose I have an S3 file source and want to use the substrait transform directly.

@tokoko

@HaoXuAI No, it's applicable to all offline stores. The difference is that for non-ibis offline stores the whole dataset is collected into a single process as an arrow table, and the transform_arrow method is used to apply the transformation. (This is true for both substrait and pandas transformations.)

def transform_arrow(self, pa_table: pyarrow.Table) -> pyarrow.Table:

What this PR adds on top of this is that for ibis-based offline stores collecting to arrow is no longer necessary. Instead of the transform_arrow method, the newly added transform_ibis is used to apply the transformation directly with the offline store engine.

P.S. The duckdb offline store, which is for now the only ibis-based implementation, can already work with S3 file sources.

HaoXuAI

lgtm!

@tokoko tokoko deleted the duckdb-substrait-odfv branch

April 19, 2024 17:51
