◐ Shell
clean mode source ↗

feat: Enable Arrow-based columnar data transfers by ElliotNguyen68 · Pull Request #3996 · feast-dev/feast

@ElliotNguyen68

@sudohainguyen

can we eliminate type hint changes and leave it for another PR?
it's better one thing focused only 😄

@sudohainguyen

another comment, have you tested the performance before and after enabling arrow transfer? how was it?
genuinely wondering

@ElliotNguyen68

can we eliminate type hint changes and leave it for another PR? it's better one thing focused only 😄

got it

@ElliotNguyen68

another comment, have you tested the performance before and after enabling arrow transfer? how was it? genuinely wondering

yes I tested it using databrick (3 worker nodes), the data contains over 6 milions rows, with the config enable, this take about 11sec, when not even over 6 minutes still cannot get the result

@ElliotNguyen68

HaoXuAI

spark_session = get_spark_session_or_start_new_with_repoconfig(
self._config.offline_store
)
spark_session.conf.set("spark.sql.execution.arrow.fallback.enabled", "true")

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we can explicitly claim this in the document, the spark offline store is leveraging arrow to process data. And it's probably better to have a config for it to enable or not as wished by the user.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So how can we achieve this @HaoXuAI ?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @HaoXuAI , I check on snowflake onffline store source code, snowflake.py and can see that the code owner also use spark.conf.set('spark.sql.execution.arrow.pyspark.enabled', 'true') within the codebase, without from a config, so I think we can also follow this, because this is a good thing for feast performance, (1 more thinkg to notice is that our requirement for feast is pyspark version >=3.0.0, and the default value for this config is false for spark session (https://spark.apache.org/docs/latest/configuration.html#:~:text=3.0%2C%20please%20set%20%27-,spark.sql.execution.arrow.pyspark.enabled,-%27.)), what do you think ?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suggest turning it into a offline store configuration

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So you mean we will let the user to config this when they init FeatureStore object using repo config or feature_store.yml , isnt it ?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ElliotNguyen68 Yes, I agree. But I still don't like the idea of having two methods of setting the same config. We will have to handle cases when user sets both, for example and stuff like that. Maybe we should just document this option better on spark page. Let's include it in the example and also add a sentence about why it might be useful.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok so how to include it in the document 🙂?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @HaoXuAI , @tokoko , I change those changes to md and template file already, do you guys think that is ok for now ?

Signed-off-by: tanlocnguyen <tanlocnguyen296@gmail.com>

HaoXuAI

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!