feat: Kubernetes materialization engine written based on bytewax by sudohainguyen · Pull Request #4087 · feast-dev/feast

sudohainguyen

What this PR does / why we need it:

As our efforts to prune bytewax dependencies to support python 3.11, in this PR I've rewritten bytewax engine as a new one called kubernetes engine.
Most of the code is unchanged, genuinely modify image entrypoint to process paths from remote storage

Which issue(s) this PR fixes:

Relates to #4046

Additional notes

Docs will be provided in a separate PR and also bytewax removal

sudohainguyen changed the title ~~feat: kubernetes materialization engine written based on bytewax~~ feat: Kubernetes materialization engine written based on bytewax

Apr 9, 2024

Signed-off-by: Harry <quanghai.ng1512@gmail.com>

Signed-off-by: Harry <quanghai.ng1512@gmail.com>

Signed-off-by: Harry <quanghai.ng1512@gmail.com>

Thanks, this looks really cool, especially as a transition path for people who use bytewax engine today. @alex-vinnik-sp

I'm all for merging this, but I would still like to note that I was actually talking with @lokeshrangineni about this the other day and I'm not sure it will be a good idea to keep this engine around if/when we will have good alternatives based on spark, dask or something similar, for example a spark engine that's not limited to working with just spark offline store and also has an adequate performance. The maintenance burden of a homegrown simple distributed engine will be too much then. I may be wrong of course... maybe people will still prefer to use this just to avoid spark overhead.

@tokoko yep spark overhead sometimes frustrates user cause I was in that situation before 😃
I believe users need an engine somewhere between local and spark.

And IIRC spark engine only available to spark offline store 🤔

@sudohainguyen only from a functionality perspective, you're probably right, materialization isn't so complicated after all. But if you consider that users will also want to have better monitoring, resilience and so on, using some other off-the-shelf engine makes more sense imho.

yup, spark only works with spark offline store, but there's nothing stopping us from adding a first step in spark engine as well similar to the one in here that exports offline store output to some storage first and only then spark would kick in to parallelize reading and writing.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Thank you for submitting this @sudohainguyen. This will unblock the PR for now. I do agree with all the feedback on this PR.

@sudohainguyen - We have an existing PR to remove bytewax dependencies and upgrading to 3.11 here. Just making sure you are aware of it so that we can avoid duplicate of work. Once this PR merged then we can merge the other PR.

I don't think combining 3.11 works and bytewax pruning is a good move.
let's do one by one 😄

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sudohainguyen LGTM. left comments and I think can serve as future work.

I like this feature. it feels both powerful and lightweight. 👍

		"labels": {pod_labels, self.batch_engine_config.labels},
		},
		"spec": {
		"restartPolicy": "Never",

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"restartPolicy": "Never", This can be also a config i think


		job_labels = {"feast-materializer": "job"}
		pod_labels = {"feast-materializer": "pod"}
		job_definition = {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can take this out as a template file, also make it a config option for the user to override the template