feat: Support s3gov schema by snowflake offline store during materialization by alex-vinnik-sp · Pull Request #3891 · feast-dev/feast
However, it is a different story when it comes to AWS GovCloud.
It turns out that you have to use the special Snowflake URL scheme s3gov when Snowflake exports data with COPY INTO location during Feast materialization into the online store.
blob_export_location is set to 's3gov://bucket/path' so that the temporary data dump is written to S3. The export part works just fine.
However, the subsequent upload into the online store fails because pyarrow doesn't recognize the s3gov scheme of the temporary data dump. Here is the actual error:
[2024-01-10, 19:45:52 UTC] {pod_manager.py:367} INFO - File "/usr/local/lib/python3.10/site-packages/feast/infra/materialization/contrib/bytewax/bytewax_materialization_dataflow.py", line 36, in process_path
[2024-01-10, 19:45:52 UTC] {pod_manager.py:367} INFO - dataset = pq.ParquetDataset(path, use_legacy_dataset=False)
[2024-01-10, 19:45:52 UTC] {pod_manager.py:367} INFO - File "/usr/local/lib64/python3.10/site-packages/pyarrow/parquet/core.py", line 1724, in __new__
[2024-01-10, 19:45:52 UTC] {pod_manager.py:367} INFO - return _ParquetDatasetV2(
[2024-01-10, 19:45:52 UTC] {pod_manager.py:367} INFO - File "/usr/local/lib64/python3.10/site-packages/pyarrow/parquet/core.py", line 2401, in __init__
[2024-01-10, 19:45:52 UTC] {pod_manager.py:367} INFO - if filesystem.get_file_info(path_or_paths).is_file:
[2024-01-10, 19:45:52 UTC] {pod_manager.py:367} INFO - File "pyarrow/_fs.pyx", line 571, in pyarrow._fs.FileSystem.get_file_info
[2024-01-10, 19:45:52 UTC] {pod_manager.py:367} INFO - File "pyarrow/error.pxi", line 144, in pyarrow.lib.pyarrow_internal_check_status
[2024-01-10, 19:45:52 UTC] {pod_manager.py:367} INFO - File "pyarrow/error.pxi", line 100, in pyarrow.lib.check_status
[2024-01-10, 19:45:52 UTC] {pod_manager.py:367} INFO - pyarrow.lib.ArrowInvalid: Expected a local filesystem path, got a URI: 's3gov://bucket/path_to/temporary_1e8c3ddfb3104a229b89a04a22718d42_0_0_0.snappy.parquet'
The proposed solution is to strip gov from the data dump paths (s3gov:// becomes s3://), so that pyarrow can read the data using a native S3 path.
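For illustration, the path rewrite can be sketched roughly as below. This is a minimal sketch, not the exact code in this PR: the helper name `to_native_s3_path` and the sample file names are hypothetical, and it assumes only the leading `s3gov://` prefix needs to change.

```python
def to_native_s3_path(path: str) -> str:
    """Rewrite an s3gov:// URI to s3://; leave any other path unchanged.

    Hypothetical helper for illustration only, not part of Feast's API.
    """
    gov_prefix = "s3gov://"
    if path.startswith(gov_prefix):
        # Replace only the scheme prefix; bucket and key stay untouched.
        return "s3://" + path[len(gov_prefix):]
    return path


# Example: paths as they might come back from the Snowflake export
# (file names here are made up).
exported = [
    "s3gov://bucket/path/part_0.snappy.parquet",
    "s3://bucket/path/part_1.snappy.parquet",
]
native = [to_native_s3_path(p) for p in exported]
print(native)
```

Replacing only the leading prefix, rather than every occurrence of "gov" in the string, avoids mangling bucket names or keys that happen to contain it.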
Added unit tests for the snowflake.to_remote_storage method to ensure that the native s3 scheme is always passed to the file uploader.