fix: Invalid column names in get_historical_features when there are field mappings on join keys by aloysius-lim · Pull Request #4886 · feast-dev/feast
What this PR does / why we need it:
When there are field mappings on an entity's join keys, get_historical_features() fails because it does not use the original column name in the data source.
This is needed when the source data cannot be modified, and a field mapping is used to map the source data's column to the join key.
Example:
from feast import Entity, Field, FeatureStore, FeatureView from feast.types import Float32, String from feast.infra.offline_stores.contrib.spark_offline_store.spark_source import SparkSource # Initialize Feature Store. store = FeatureStore(...) # "driver_id" is used in the Feature Store. driver = Entity(name="driver", join_keys=["driver_id"]) # Using SparkSource as an example, but this applies to other sources. # Source data contains a primary key called "id". This is mapped to the join key "driver_id". driver_stats_src = SparkSource( name="driver_stats", field_mapping={"id": "driver_id"}, path=..., file_format=..., ) driver_stats_fv = FeatureView( name="driver_stats", source= driver_stats_src, entities=[driver], schema=[ # join key must be specified in the schema, else it is not included in driver_stats_fv.entity_columns Field(name="driver_id", dtype=String), Field(name="stat1", dtype=Float32), Field(name="stat2", dtype=Float32), ] ) # Get historical features store.get_historical_features( entity_df=..., features=[ "driver_stats:stat1", "driver_stats:stat2", ]
Without this fix (Spark example):
pyspark.errors.exceptions.captured.AnalysisException: [UNRESOLVED_COLUMN.WITH_SUGGESTION] A column or function parameter with name `driver_id` cannot be resolved. Did you mean one of the following? [`id`, `stat1`, `stat2`]
Underlying Spark query:
driver_stats__subquery AS ( SELECT event_timestamp as event_timestamp, created as created_timestamp, -- Here is the problem. driver_id AS driver_id, stat1 as stat1, stat2 as stat2 FROM `feast_entity_df_677a1a6fd13443c6b0e8ccc059b25f01` WHERE event_timestamp <= '2025-01-05T14:00:00' )
With this fix, the field mapping is respected and the query becomes:
driver_stats__subquery AS ( SELECT event_timestamp as event_timestamp, created as created_timestamp, id AS driver_id, stat1 as stat1, stat2 as stat2 FROM `feast_entity_df_677a1a6fd13443c6b0e8ccc059b25f01` WHERE event_timestamp <= '2025-01-05T14:00:00' )
Which issue(s) this PR fixes:
Misc
This only works if the entity join key is specified in the schema of the FeatureView (see above).