feat(spark): Add compute-on-read support for BatchFeatureView in get_…#6357

SIDDHESH1564

What this PR does / why we need it:

When using @batch_feature_view with TransformationMode.PYTHON in the Spark offline store, get_historical_features() fails with UNRESOLVED_COLUMN errors. This occurs because the PIT join SQL reads directly from the raw batch_source and expects transformed feature columns (e.g., aggregated outputs) to already exist in the source data. However, BFV transformations are only executed during feast materialize, not during offline retrieval.

This PR introduces compute-on-read support for BatchFeatureView in SparkOfflineStore. Before generating the PIT join SQL, BFVs with a UDF are detected and their transformations are applied:

1. Read the raw source into a Spark DataFrame
2. Invoke the BFV's feature_transformation.udf() (same function used during materialization)
3. Register the transformed DataFrame as a Spark temporary view
4. Replace the table_subquery in the query context with the temp view name
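
The steps above can be sketched roughly as follows. This is an illustration only, not the code in this PR: `QueryContext`, `ViewSpec`, `apply_bfv_transformations`, the `source_query` field, and the `feast_bfv_` temp view prefix are hypothetical stand-ins for Feast's real query context and feature view classes. The only calls it makes on `spark` and on the DataFrame are `sql()` and `createOrReplaceTempView()`, which exist in PySpark.

```python
from dataclasses import dataclass, replace
from typing import Any, Callable, List, Optional


@dataclass(frozen=True)
class QueryContext:
    """Hypothetical stand-in for the per-view context used to build the PIT join SQL."""
    name: str
    table_subquery: str


@dataclass
class ViewSpec:
    """Hypothetical stand-in for a FeatureView or BatchFeatureView."""
    name: str
    source_query: str  # assumed: a SQL string that selects the raw batch source
    udf: Optional[Callable[[Any], Any]] = None  # None models a view with no transformation


def apply_bfv_transformations(
    spark: Any, contexts: List[QueryContext], views: List[ViewSpec]
) -> List[QueryContext]:
    """Return contexts whose table_subquery points at a transformed temp view."""
    views_by_name = {view.name: view for view in views}
    result = []
    for ctx in contexts:
        view = views_by_name.get(ctx.name)
        if view is None or view.udf is None:
            # Plain FeatureView, or BFV without a UDF: pass through unchanged.
            result.append(ctx)
            continue
        raw_df = spark.sql(view.source_query)           # 1. read the raw source
        transformed_df = view.udf(raw_df)               # 2. invoke the UDF
        temp_view = f"feast_bfv_{view.name}"            # hypothetical naming scheme
        transformed_df.createOrReplaceTempView(temp_view)  # 3. register the temp view
        result.append(replace(ctx, table_subquery=temp_view))  # 4. swap the subquery
    return result
```

Using `dataclasses.replace` copies every other context field as-is, which matches the intent that non-transformation fields are preserved. The sketch omits the time-range filter that a later commit in this PR applies before the UDF.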

This enables reuse of BFV definitions during offline training without requiring pre-materialization or external ETL pipelines. The entire pipeline remains fully distributed in Spark.

Which issue(s) this PR fixes:

Fixes #6345

Checks

I've made sure the tests are passing.
My commits are signed off (git commit -s)
My PR title follows conventional commits format

Testing Strategy

Unit tests
Integration tests
Manual tests
Testing is not required for this change

Added 7 unit tests covering:

- BFV with UDF → table_subquery replaced with temp view
- UDF invoked with source DataFrame
- Transformed DataFrame registered as temp view
- Plain FeatureView passes through unchanged
- BFV without UDF passes through unchanged
- Mixed BFV + plain FeatureView scenarios
- All non-transformation context fields preserved
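
For readers who want a feel for how cases like these can be checked without a running Spark cluster, here is a minimal mock-based sketch. It is not taken from the PR's test file: `swap_in_temp_view` is a hypothetical helper written only so the example is self-contained, and the table and view names are made up. `MagicMock` stands in for both the Spark session and the DataFrames.

```python
from unittest.mock import MagicMock


def swap_in_temp_view(spark, table_subquery, udf, view_name):
    """Hypothetical helper under test: returns the name the PIT join should read from."""
    if udf is None:
        # No transformation to apply: keep reading from the original subquery.
        return table_subquery
    transformed = udf(spark.sql(f"SELECT * FROM {table_subquery}"))
    transformed.createOrReplaceTempView(view_name)
    return view_name


def test_bfv_with_udf_replaces_subquery():
    spark = MagicMock()
    source_df = spark.sql.return_value  # the DataFrame the mocked session will return
    udf = MagicMock()

    result = swap_in_temp_view(spark, "raw_events", udf, "bfv_view")

    assert result == "bfv_view"
    # The UDF receives the source DataFrame.
    udf.assert_called_once_with(source_df)
    # The UDF's output is what gets registered as the temp view.
    udf.return_value.createOrReplaceTempView.assert_called_once_with("bfv_view")


def test_view_without_udf_passes_through():
    spark = MagicMock()

    assert swap_in_temp_view(spark, "raw_events", None, "bfv_view") == "raw_events"
    # Nothing is read from Spark when there is no UDF.
    spark.sql.assert_not_called()
```

Both functions follow the pytest naming convention, so they would be collected automatically if placed in a test module.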

Misc

Changes:

sdk/python/feast/infra/offline_stores/contrib/spark_offline_store/spark.py:
- Added BatchFeatureView import
- Added _apply_bfv_transformations() helper function
- Integrated call into get_historical_features() between query context construction and PIT join SQL generation
sdk/python/tests/unit/infra/offline_stores/contrib/spark_offline_store/test_spark_bfv_compute_on_read.py (new):
- 7 unit tests for compute-on-read behavior

franciscojavierarceo

@copilot can you apply make format-python on this PR?

SIDDHESH1564 requested a review from a team as a code owner May 1, 2026 19:12

franciscojavierarceo added the ok-to-test label May 1, 2026

franciscojavierarceo approved these changes May 1, 2026

franciscojavierarceo changed the title ~~feat(spark): add compute-on-read support for BatchFeatureView in get_…~~ May 1, 2026

feat(spark): add compute-on-read support for BatchFeatureView in get_… …

11d69be

…historical_features Signed-off-by: Siddhesh Khairnar <khairnarsiddhesh4057@gmail.com>

SIDDHESH1564 force-pushed the feature/spark-bfv-compute-on-read branch from fcdf0e6 to 11d69be Compare May 2, 2026 04:28

SIDDHESH1564 and others added 2 commits May 2, 2026 10:22

Merge branch 'master' into feature/spark-bfv-compute-on-read

c2a3d7a

Add None check for batch_source in _apply_bfv_transformations …

9390de9

Signed-off-by: Siddhesh Khairnar <khairnarsiddhesh4057@gmail.com>

ntkathole reviewed May 2, 2026

ntkathole mentioned this pull request May 2, 2026

Wire ComputeEngine.get_historical_features() into the standard retrieval path to replace per-store BFV transformation duplication #6359

Open

refactor(spark): use feature_view_utils helpers for BFV transformatio… …

a424bbf

…n logic Signed-off-by: Siddhesh Khairnar <khairnarsiddhesh4057@gmail.com>

SIDDHESH1564 requested a review from ntkathole May 2, 2026 16:52

ntkathole reviewed May 3, 2026

SIDDHESH1564 and others added 2 commits May 3, 2026 14:43

fix(spark): apply time-range filter before BFV UDF and update temp vi… …

c9e72de

…ew naming Signed-off-by: Siddhesh Khairnar <khairnarsiddhesh4057@gmail.com>

Merge branch 'master' into feature/spark-bfv-compute-on-read

502fad2

ntkathole reviewed May 3, 2026

refactor(spark): use resolve_feature_view_source_with_fallback for BF… …

e1e1764

…V source resolution Signed-off-by: Siddhesh Khairnar <khairnarsiddhesh4057@gmail.com>

ntkathole approved these changes May 3, 2026

abhijeet-dhumal mentioned this pull request May 6, 2026

fix(spark): Replace mapInArrow with foreachPartition for materialization #6370

Closed

3 tasks
