
feat(spark): SparkSource query+path and pre-computed offline read for BatchFeatureView #6440

Open
abhijeet-dhumal wants to merge 8 commits into feast-dev:master from abhijeet-dhumal:feat/spark-bfv-offline-historical-features

@abhijeet-dhumal abhijeet-dhumal commented May 27, 2026

Contributor

What this PR does / why we need it

get_historical_features() on a BatchFeatureView re-runs the full UDF on raw data on every call. For embedding pipelines, that's 20–40 min of compute per training run, even though the features already exist from the last materialize().

Fix: Route get_historical_features() to read pre-computed parquet from batch_source.path instead of re-executing the UDF.

To support this, SparkSource now accepts query + path together:

  • query: raw data read during materialize()
  • path: write-back target and pre-computed read source for get_historical_features()

SparkSource(
    query="SELECT id, text, event_timestamp FROM bronze.documents",
    path="s3://my-bucket/feast/features/document_embeddings/",
)
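
The relaxed argument rule can be sketched as below. This is a minimal illustration, not the actual Feast code: `validate_source_args` is a hypothetical helper, and the only facts taken from this PR are that exactly one of table/query/path was required before and that query + path together is now also accepted.

```python
from typing import Optional


def validate_source_args(
    table: Optional[str] = None,
    query: Optional[str] = None,
    path: Optional[str] = None,
) -> None:
    """Hypothetical stand-in for the SparkSource argument check.

    Accepts exactly one of table/query/path, or the query + path pair.
    Raises ValueError for every other combination.
    """
    # Collect the names of the arguments that were actually supplied.
    provided = {
        name
        for name, value in (("table", table), ("query", query), ("path", path))
        if value
    }
    if len(provided) == 1:
        return  # the original "exactly one" rule
    if provided == {"query", "path"}:
        return  # the relaxed rule described in this PR
    raise ValueError(
        "Expected exactly one of table/query/path, or query + path together; "
        f"got: {sorted(provided) or 'nothing'}"
    )
```

Under this sketch, `validate_source_args(query="SELECT ...", path="s3://...")` passes, while `table` combined with `path` is still rejected.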

This PR also allows a BatchFeatureView with online=False, offline=True (offline-only) to skip the online validation check in get_historical_features(), so it can be used purely for training data without configuring an online store.

Falls back to the live query if path doesn't exist yet (first run, before any materialization).
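
The routing decision described above might look roughly like the following. It is a sketch under stated assumptions: `choose_read_source` is a hypothetical name, the path is treated as a local directory containing `*.parquet` files, and an `s3://` location would need a Spark or Hadoop filesystem check instead of `pathlib`.

```python
from pathlib import Path
from typing import Optional, Tuple


def choose_read_source(query: str, path: Optional[str]) -> Tuple[str, str]:
    """Pick where get_historical_features() would read from (illustrative).

    Returns ("precomputed", path) when the path holds materialized parquet
    files, otherwise ("live_query", query).
    """
    if path is not None:
        directory = Path(path)
        # Assumption: "materialized" means at least one parquet file exists.
        if directory.is_dir() and any(directory.glob("*.parquet")):
            return ("precomputed", path)
    # First run, or no path configured: fall back to the raw query.
    return ("live_query", query)
```

Before the first materialization the directory is missing or empty, so the function returns the live query; once parquet files are present it returns the pre-computed path.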

Which issue(s) this PR fixes

N/A. Enables efficient training data retrieval for BatchFeatureView embedding pipelines without re-running UDFs.

Checks

  • I've made sure the tests are passing.
  • My commits are signed off (git commit -s)
  • My PR title follows conventional commits format

Testing Strategy

  • Unit tests — offline path routing, SparkSource constraint, graceful fallback
  • Manual tests — get_historical_features() reads from parquet, not UDF, after materialization

@abhijeet-dhumal abhijeet-dhumal changed the title Feat/spark bfv offline historical features May 27, 2026
@abhijeet-dhumal abhijeet-dhumal force-pushed the feat/spark-bfv-offline-historical-features branch from a98e23b to 57b2489 Compare May 27, 2026 14:50
@abhijeet-dhumal abhijeet-dhumal marked this pull request as ready for review May 28, 2026 06:23
@abhijeet-dhumal abhijeet-dhumal requested a review from a team as a code owner May 28, 2026 06:23

@devin-ai-integration devin-ai-integration Bot left a comment

Contributor

Devin Review found 1 potential issue.

View 5 additional findings in Devin Review.

@abhijeet-dhumal abhijeet-dhumal force-pushed the feat/spark-bfv-offline-historical-features branch from 57b2489 to e30e146 Compare May 29, 2026 08:57
abhijeet-dhumal added a commit to abhijeet-dhumal/feast that referenced this pull request May 29, 2026
…orical_features

The function and its call were removed in this PR but the replacement
(_apply_bfv_transformations_for_historical) lives in a separate PR (feast-dev#6440).
Removing it here would silently return raw untransformed features for any
BatchFeatureView with a Python UDF via the standard get_historical_features()
API path (FeatureStore → passthrough_provider → SparkOfflineStore).

Restoring the function and its call until feast-dev#6440 lands.

Signed-off-by: abhijeet-dhumal <abhijeetdhumal652@gmail.com>
SparkSource previously required exactly one of table/query/path.
This relaxes the constraint to allow query + path together:
- query: used for reading raw data during materialization
- path: used for offline write-back (offline=True) and as
  pre-computed read source in get_historical_features

Signed-off-by: abhijeet-dhumal <abhijeetdhumal652@gmail.com>
Signed-off-by: abhijeet-dhumal <abhijeetdhumal652@gmail.com>
Signed-off-by: abhijeet-dhumal <abhijeetdhumal652@gmail.com>
Signed-off-by: abhijeet-dhumal <abhijeetdhumal652@gmail.com>
… get_historical_features

Signed-off-by: abhijeet-dhumal <abhijeetdhumal652@gmail.com>
abhijeet-dhumal added a commit to abhijeet-dhumal/feast that referenced this pull request Jun 1, 2026
@abhijeet-dhumal abhijeet-dhumal force-pushed the feat/spark-bfv-offline-historical-features branch from e30e146 to 4316349 Compare June 1, 2026 07:48
@abhijeet-dhumal

Contributor Author

@ntkathole May I request your review here too?

@jyejare jyejare left a comment

Collaborator


This PR adds support for SparkSource with combined query+path configuration and pre-computed offline reads for BatchFeatureView. The changes enable reading from materialized offline stores to avoid expensive UDF re-execution. While the feature is useful, there are several security vulnerabilities and error handling gaps that need attention.

@abhijeet-dhumal abhijeet-dhumal force-pushed the feat/spark-bfv-offline-historical-features branch from 4316349 to 5312075 Compare June 2, 2026 12:45
Signed-off-by: abhijeet-dhumal <abhijeetdhumal652@gmail.com>
Catch FileNotFoundError and PermissionError separately for the expected
fallback cases (path not yet materialized, or no access). Unexpected
errors now emit a distinct RuntimeWarning instead of being silently
swallowed by a bare except Exception.

Signed-off-by: abhijeet-dhumal <abhijeetdhumal652@gmail.com>
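
The error-handling change in the commit message above could be sketched like this. It is an illustration only: `read_with_fallback` and its two callables are hypothetical, and falling back to the live query after an unexpected error (rather than re-raising) is an assumption, not something the commit message states.

```python
import warnings
from typing import Callable, TypeVar

T = TypeVar("T")


def read_with_fallback(
    read_precomputed: Callable[[], T],
    run_live_query: Callable[[], T],
) -> T:
    """Try the pre-computed read first; fall back to the live query."""
    try:
        return read_precomputed()
    except FileNotFoundError:
        # Expected case: the path has not been materialized yet.
        return run_live_query()
    except PermissionError:
        # Expected case: the path exists but is not accessible.
        return run_live_query()
    except Exception as exc:
        # Unexpected case: surface it instead of swallowing it silently.
        warnings.warn(
            f"Unexpected error reading pre-computed features ({exc!r}); "
            "falling back to live query.",
            RuntimeWarning,
            stacklevel=2,
        )
        return run_live_query()
```

Keeping the two expected exceptions in separate clauses leaves room to log or handle them differently, while the final clause makes any other failure visible to the caller.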
@abhijeet-dhumal abhijeet-dhumal requested a review from jyejare June 2, 2026 13:38
@ntkathole ntkathole force-pushed the feat/spark-bfv-offline-historical-features branch from 2f6910e to 9eb3d29 Compare June 3, 2026 08:03
The 1.5x speedup assertion for convert_response_to_dict is consistently
flaky on macOS CI runners (getting 1.26-1.34x) due to variable load.
1.2x is still a meaningful regression guard without being brittle.

Signed-off-by: abhijeet-dhumal <abhijeetdhumal652@gmail.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
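
To illustrate the threshold change described in the commit message above, here is a minimal sketch of a speedup guard. `assert_min_speedup` is a hypothetical helper, not the test in the repository; only the 1.5x and 1.2x thresholds and the 1.26-1.34x observed range come from the commit message.

```python
def assert_min_speedup(
    baseline_seconds: float,
    optimized_seconds: float,
    min_speedup: float = 1.2,
) -> float:
    """Fail if the optimized path is not at least `min_speedup` times faster."""
    if optimized_seconds <= 0:
        raise ValueError("optimized_seconds must be positive")
    speedup = baseline_seconds / optimized_seconds
    if speedup < min_speedup:
        raise AssertionError(
            f"speedup {speedup:.2f}x is below the {min_speedup:.2f}x guard"
        )
    return speedup
```

With timings in the reported range (for example a ratio of 1.26), the 1.2x guard passes while a 1.5x guard would fail, which matches the flakiness the commit describes.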
@abhijeet-dhumal abhijeet-dhumal force-pushed the feat/spark-bfv-offline-historical-features branch from 9eb3d29 to e7fc883 Compare June 9, 2026 07:35

3 participants