Milvus Online Store Dimension Mismatch Error in Push API and Materialization
Summary
Feast's Milvus online store integration has a critical dimension mismatch bug that affects both the push API and materialization approaches. When storing embeddings with correct dimensions (384), Feast internally transforms the data incorrectly, causing Milvus to reject the data with dimension errors.
Environment
- Feast version: 0.51.0
- Python version: 3.12.11
- pymilvus version: 2.3.0+
- OS: macOS (Darwin 24.5.0)
- Milvus: milvus-lite (via path: data/online_store.db)
Bug Description
Error Message
ERROR:pymilvus.decorators:RPC error: [upsert_rows], <MilvusException: (code=65535, message=the length(7695) of float data should divide the dim(384): )>
Expected Behavior
- Input: 5 embeddings × 384 dimensions = 1920 total elements
- Feast should store these embeddings correctly in Milvus
- Expected elements sent to Milvus: 1920
Actual Behavior
- Input: 5 embeddings × 384 dimensions = 1920 total elements
- Feast transforms this to 7695 elements (factor of ~4x)
- Milvus rejects the data because 7695 ÷ 384 = 20.04... (not integer)
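The rejection can be checked by hand. The sketch below mirrors the divisibility condition implied by the error message (the flat float buffer length must be a multiple of the dimension). The function name is hypothetical and this is not Milvus's actual implementation.

```python
def is_valid_float_vector_buffer(length: int, dim: int) -> bool:
    """Return True when a flat float buffer splits evenly into dim-sized vectors."""
    return length % dim == 0


DIM = 384
expected_len = 5 * DIM  # 5 embeddings x 384 dimensions
observed_len = 7695     # length reported in the error message

print(expected_len, is_valid_float_vector_buffer(expected_len, DIM))  # 1920 True
print(observed_len, is_valid_float_vector_buffer(observed_len, DIM))  # 7695 False
print(divmod(observed_len, DIM))                                      # (20, 15)
```

The remainder of 15 is why Milvus cannot split the buffer into whole 384-dimensional vectors.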
Steps to Reproduce
1. Feature Store Configuration
# feast_feature_repo/feature_store.yaml
project: rag
provider: local
registry: data/registry.db
online_store:
  type: milvus
  path: data/online_store.db
  vector_enabled: true
  embedding_dim: 384
  index_type: "FLAT"
  metric_type: "L2"
offline_store:
  type: file
entity_key_serialization_version: 3
auth:
  type: no_auth
2. Feature Definitions
from datetime import timedelta

from feast import Entity, FeatureView, Field, FileSource, PushSource
from feast.types import Array, Float32, String, Int64
from feast.value_type import ValueType

document = Entity(
    name="document_id",
    value_type=ValueType.STRING,
    description="Unique identifier for document chunks",
)

document_embeddings_source = FileSource(
    name="document_embeddings_source",
    path="data/document_embeddings.parquet",
    timestamp_field="event_timestamp",
    created_timestamp_column="created_timestamp",
)

document_embeddings_push_source = PushSource(
    name="document_embeddings_push_source",
    batch_source=document_embeddings_source,
)

document_embeddings = FeatureView(
    name="document_embeddings",
    entities=[document],
    ttl=timedelta(days=365),
    schema=[
        Field(name="embedding", dtype=Array(Float32), vector_index=True),
        Field(name="chunk_text", dtype=String),
        Field(name="document_title", dtype=String),
        Field(name="chunk_index", dtype=Int64),
        Field(name="file_path", dtype=String),
        Field(name="chunk_length", dtype=Int64),
    ],
    online=True,
    source=document_embeddings_push_source,
    tags={"team": "rag", "version": "v3"},
)
3. Reproduce with Push API
import pandas as pd
import numpy as np
from datetime import datetime
from sentence_transformers import SentenceTransformer
from feast import FeatureStore
from feast.data_source import PushMode  # PushMode is defined in feast.data_source

# Generate test embeddings (384 dimensions)
model = SentenceTransformer('all-MiniLM-L6-v2')
texts = [
    'Test document 1',
    'Test document 2',
    'Test document 3',
    'Test document 4',
    'Test document 5'
]
embeddings = model.encode(texts)  # Shape: (5, 384)

# Create DataFrame
feature_data = []
for i, (text, embedding) in enumerate(zip(texts, embeddings)):
    feature_data.append({
        "document_id": f"test_doc_{i}",
        "embedding": embedding.tolist(),  # Convert to list as per docs
        "chunk_text": text,
        "document_title": "test_document.md",
        "chunk_index": i,
        "file_path": "test_path",
        "chunk_length": len(text),
        "event_timestamp": pd.Timestamp.now(tz='UTC'),
        "created_timestamp": pd.Timestamp.now(tz='UTC')
    })

df = pd.DataFrame(feature_data)
print(f"Input data: {len(df)} rows, {len(df) * 384} total elements")

# Initialize Feast store
fs = FeatureStore(repo_path="feast_feature_repo")

# This will fail with dimension mismatch
fs.push(
    push_source_name="document_embeddings_push_source",
    df=df,
    to=PushMode.ONLINE_AND_OFFLINE
)
4. Reproduce with Materialization
# Save to parquet file
df.to_parquet('feast_feature_repo/data/document_embeddings.parquet', index=False)

# Try materialization
from datetime import timedelta

end_time = datetime.now()
start_time = end_time - timedelta(hours=1)

# This will also fail with same dimension mismatch
fs.materialize(
    start_date=start_time,
    end_date=end_time,
    feature_views=["document_embeddings"]
)
Investigation Results
Data Validation
Our debugging confirmed:
- ✅ Input embeddings are exactly 384 dimensions each
- ✅ DataFrame contains 5 rows × 384 = 1920 total elements
- ✅ Embeddings converted to Python lists correctly
- ✅ Data types are correct (Array(Float32))
- ❌ Feast somehow transforms 1920 → 7695 elements internally
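For reference, the input-side checks listed above can be expressed as a small pre-push validation. This is a minimal sketch, not the exact debugging code used for this report: the helper name `find_bad_embeddings` is hypothetical, and the sample data is a stand-in for the `embedding` column of the DataFrame.

```python
from typing import Iterable, Sequence


def find_bad_embeddings(rows: Iterable[Sequence[float]], dim: int) -> list[int]:
    """Return indexes of rows that are not flat sequences of exactly `dim` numbers."""
    bad = []
    for i, emb in enumerate(rows):
        # After embedding.tolist(), elements are plain Python floats.
        is_flat = all(isinstance(x, (int, float)) for x in emb)
        if len(emb) != dim or not is_flat:
            bad.append(i)
    return bad


# Stand-in for df["embedding"]: five flat lists of 384 floats.
embeddings = [[0.0] * 384 for _ in range(5)]

print(find_bad_embeddings(embeddings, 384))  # []
print(sum(len(e) for e in embeddings))       # 1920
```

With the reproduction data, the same check would be called as `find_bad_embeddings(df["embedding"], 384)`. An empty result means the 1920-element input is well formed before it reaches Feast.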
Affected Methods
- Push API: store.push() with PushMode.ONLINE_AND_OFFLINE
- Materialization: store.materialize() from parquet files
- Both fail with identical dimension mismatch errors
Expected Fix
Feast should correctly handle Array(Float32) fields when:
- Pushing data via the push API
- Materializing data from parquet files
The dimension transformation logic in the Milvus online store path needs debugging and fixing.
Potential Root Cause
The issue appears to be in Feast's internal serialization/transformation of Array(Float32) fields when interfacing with Milvus. The ~4x multiplication factor (1920 → 7695) suggests there might be:
- Incorrect flattening of nested arrays
- Multiple serialization passes
- Data type conversion issues in the Milvus online store adapter
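The numbers can be broken down a little further. The sketch below is plain arithmetic on the figures from the error message. It assumes, without verification, that the inflated buffer is spread evenly across the five rows.

```python
rows, dim = 5, 384
observed_len = 7695  # from the Milvus error message

# Floats per row, if the buffer is split evenly across the rows (assumption).
per_row, leftover = divmod(observed_len, rows)

# How many whole 384-float vectors fit in one row's share, and what remains.
copies, extra = divmod(per_row, dim)

print(per_row, leftover)  # 1539 0
print(copies, extra)      # 4 3
```

Under that assumption each row contributes 1539 floats, which is 4 × 384 + 3. The inflation is therefore not an exact 4x copy (4 × 1920 = 7680, not 7695), which may help narrow down which of the causes above applies.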
Workaround
We currently write to Milvus directly through pymilvus.MilvusClient, which works perfectly with the same data. This indicates that the issue is within Feast's Milvus adapter rather than in the data or in Milvus itself.
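A direct write of this kind might look like the sketch below. It is illustrative only and is not the exact workaround code: the helper names, the collection name, and the `data/direct_milvus.db` path are hypothetical, and it assumes a pymilvus release whose `MilvusClient` accepts a local milvus-lite file path and whose quick-setup collection accepts extra fields such as `chunk_text` as dynamic fields.

```python
from typing import Any


def build_rows(ids, embeddings, texts) -> list[dict[str, Any]]:
    """Build one dict per document in the shape MilvusClient.insert() accepts."""
    return [
        {"id": i, "vector": list(vec), "chunk_text": text}
        for i, vec, text in zip(ids, embeddings, texts)
    ]


def write_direct(rows, db_path="data/direct_milvus.db",
                 collection="document_embeddings_direct", dim=384):
    """Insert rows into a milvus-lite file, bypassing Feast entirely."""
    # Imported lazily so build_rows() can be used without pymilvus installed.
    from pymilvus import MilvusClient

    client = MilvusClient(db_path)
    if not client.has_collection(collection):
        # Quick-setup collection: integer "id" primary key plus a "vector" field.
        client.create_collection(
            collection_name=collection, dimension=dim, metric_type="L2"
        )
    return client.insert(collection_name=collection, data=rows)


# Stand-in data with the same shape as the reproduction (5 rows x 384 floats).
rows = build_rows(range(5), [[0.0] * 384 for _ in range(5)], ["Test document"] * 5)
print(len(rows), len(rows[0]["vector"]))  # 5 384
```

Calling `write_direct(rows)` then performs the insert. Integer ids are used here to match the quick-setup defaults, whereas the Feast entity in this report uses string document ids.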