feat: Adding get_online_features_async to feature store sdk by breno-costa · Pull Request #4172 · feast-dev/feast

breno-costa

What this PR does / why we need it:

Which issue(s) this PR fixes:

@breno-costa thanks, I'll take a closer look tomorrow. One comment right now is that we will probably need to add a test for async under online_store in integration tests as well. Tests will have to marked to be skipped for every online store other than redis for now. We can probably defer it for separate PR (?) if you like but it'll be essential eventually.

I'm running some benchmarks to check latency and throughput for this new implementation.

Just created a benchmark repo with a FastAPI application with sync and async endpoints.
https://github.com/breno-costa/feast-benchmark-async-sdk

Here are some initial results. Don't take them as final results, as I'm still analyzing them. If you have any comments on benchmark repository, please let me know.

Sync implementation

~ wrk --threads 4 --duration 1m --latency "http://localhost:8000/get_online_features"
Running 1m test @ http://localhost:8000/get_online_features
  4 threads and 10 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    30.91ms    7.87ms 176.23ms   80.32%
    Req/Sec    65.00      9.38    90.00     78.16%
  Latency Distribution
     50%   30.49ms
     75%   35.95ms
     90%   39.54ms
     99%   47.80ms
  15572 requests in 1.00m, 31.51MB read
Requests/sec:    259.22
Transfer/sec:    537.18KB

Async implementation

~ wrk --threads 4 --duration 1m --latency "http://localhost:8000/get_online_features_async"
Running 1m test @ http://localhost:8000/get_online_features_async
  4 threads and 10 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    32.30ms    2.95ms  66.73ms   81.85%
    Req/Sec    62.03      5.87    80.00     69.22%
  Latency Distribution
     50%   32.10ms
     75%   33.14ms
     90%   35.50ms
     99%   40.79ms
  14878 requests in 1.00m, 30.11MB read
Requests/sec:    247.59
Transfer/sec:    513.07KB

that's really useful, I had something like that at the back of my mind. Can I suggest to change fastapi endpoint for sync get_online_features from async to sync? Doing a sync processing behind an async endpoint comes with a performance penalty.

that's really useful, I had something like that at the back of my mind. Can I suggest to change fastapi endpoint for sync get_online_features from async to sync? Doing a sync processing behind an async endpoint comes with a performance penalty.

Thanks for the heads up. I changed and made a lot of difference. The event loop added some overhead and degraded the performance for sync endpoint. I replaced old values with new ones in last comment.

That said, I didn't see many difference for sync and async implementations for this basic benchmark case. I'll run same script with more features and entities to increase time taken to fetch data from Redis. Maybe add some CPU-bound operation after fetching data to simulate more ML like scenario.

That's strange 😄 cpu-bound operations will degrade async performance even more, actually. And it's not that realistic of a scenario anyway, one would normally put this async retrieval behind a separate feature server where model inference doesn't take place. one caveat here is that redis in your benchmarks is a local one, so network latency doesn't play as much role. With a remote redis, async will do better, not sure how significantly that would change the picture though.

P.S. another idea that might help... can you also benchmark pure OnlineStore.read(..) calls rather than get_online_features? There's a lot of crazy stuff going on in get_online_features, maybe they are the culprit.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice lgtm

@tokoko I've materialized data into a Redis instance on Google Cloud to test this implementation. The server is accessing the Redis instance through a ssh tunnel intentionally to increase network latency. Also increased the number of connections (or users) in my benchmark. When I simulated such kind of scenario, I started to see the benefits of async implementation. Example here:

~ wrk --threads 4 --connections 200 --duration 1m --latency "http://localhost:8000/get_online_features"
Running 1m test @ http://localhost:8000/get_online_features
  4 threads and 200 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     2.83s   981.20ms   4.65s    71.63%
    Req/Sec    42.35     35.19   160.00     55.26%
  Latency Distribution
     50%    2.32s
     75%    3.55s
     90%    4.31s
     99%    4.49s
  4160 requests in 1.00m, 8.41MB read
  Socket errors: connect 0, read 149, write 0, timeout 0
Requests/sec:     69.23
Transfer/sec:    143.32KB

~ wrk --threads 4 --connections 200 --duration 1m --latency "http://localhost:8000/get_online_features_async"
Running 1m test @ http://localhost:8000/get_online_features_async
  4 threads and 200 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     1.29s    66.51ms   1.56s    72.30%
    Req/Sec    42.91     24.80   151.00     69.87%
  Latency Distribution
     50%    1.28s
     75%    1.32s
     90%    1.38s
     99%    1.51s
  9206 requests in 1.00m, 18.61MB read
  Socket errors: connect 0, read 142, write 0, timeout 0
Requests/sec:    153.21
Transfer/sec:    317.18KB

Tests will have to marked to be skipped for every online store other than redis for now. We can probably defer it for separate PR (?) if you like but it'll be essential eventually.

I can manage that in another PR if you don't mind.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure, lgtm. thanks

Could you add that label to run tests? I can't add it.

pretty sure integration tests are still broken, so no point yet :) (#4171). @jeremyary Can you add the label here once credentials are in order?

@tokoko yep, will do. Credentials PR is green now, so hopefully shouldn't be much longer!

@breno-costa looks like you'll have to rebase on master to get Jeremy's latest PR

@tokoko actually should just be able to rerun jobs. I just kicked it off.

looked quickly so could be wrong, but assertion errors aren't consistent with other action/PR runs, so may be a legit issue to check out here

Signed-off-by: Breno Costa <brenocosta0901@gmail.com>

Signed-off-by: Breno Costa <brenocosta0901@gmail.com>

I could finally reproduce the error locally.
There is a difference between sync and async results for timestamp values. I'm investigating that right now.

Signed-off-by: Breno Costa <brenocosta0901@gmail.com>

Signed-off-by: Breno Costa <brenocosta0901@gmail.com>

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm. we can merge this