[Data][Feature Request] Integration with Apache Arrow Datafusion #32032

Closed
andreapiso opened this issue Jan 29, 2023 · 4 comments
Labels
data Ray Data-related issues enhancement Request for new feature and/or capability P2 Important issue, but not time-critical

Comments

@andreapiso
Contributor

andreapiso commented Jan 29, 2023

Description

DataFusion (https://arrow.apache.org/datafusion/) is a modular query engine developed as a subproject of Apache Arrow.

It is written in Rust (providing very high performance), has official Python bindings that pass data between Rust and Python with minimal overhead and zero-copy, and provides both a DataFrame API and SQL support. Being part of the Arrow project, it uses Arrow as its in-memory format.

Given the positioning of Ray Dataset (correct me if I am wrong) as a standard for distributed computation on Arrow data, it seems the two projects would complement each other quite well.

Use case

I can think of three ways I would use this integration (in order of perceived difficulty of implementation):

  1. Running DataFusion transformations within Ray Dataset map_batches: adding DataFusion as an engine would grant Arrow-level performance while exposing a high-level DataFrame API and the possibility to transform Ray Dataset batches using SQL (something the other engines do not offer out of the box today).

  2. Implementing aggregations (global and grouped) using DataFusion - this is probably the toughest sell as there is already some awesome work done by @clarkzinzow implementing aggregations using Polars, a different Rust-based dataframe library. If anything (it's a bit of a stretch), maintenance might be easier with DataFusion, as its contributors largely overlap with contributors to Apache Arrow (which Ray Data already uses heavily). It would also be nice to be able to write custom AggregateFn aggregations with a choice between a DataFrame API and SQL.

  3. (More difficult to implement) Having a full-fledged "DataFusion on Ray", with Ray essentially replacing Ballista, similar to how Ray replaces the Dask DataFrame distributed scheduler in Dask on Ray. Given the zero-copy capability of DataFusion, this integration looks particularly attractive. Being able to natively execute distributed DataFusion plans on Ray Datasets would be a great way to extend the framework where capabilities are not readily available (e.g. joins).

I recognise implementing this integration would not be straightforward and would require significant buy-in (for what it's worth, I'd be happy to contribute to the development efforts if the community decided to take this up).

@andreapiso andreapiso added the enhancement Request for new feature and/or capability label Jan 29, 2023
@zhe-thoughts zhe-thoughts added the data Ray Data-related issues label Feb 14, 2023
@zhe-thoughts
Collaborator

Thanks for the feature ask, @andreapiso. Assigning to @c21 to track.

@zhe-thoughts zhe-thoughts added the triage Needs triage (eg: priority, bug/not-bug, and owning component) label Feb 14, 2023
@c21 c21 added P2 Important issue, but not time-critical and removed triage Needs triage (eg: priority, bug/not-bug, and owning component) labels Feb 17, 2023
@anyscalesam
Contributor

Reviewed - no current ETA.

@c21 c21 removed their assignment Aug 15, 2024
@edmondop

@andreapiso thank you for opening this issue. Funny that the third point in the list you created almost two years ago is not too far from being a reality.

We are looking forward to using the Ray object store to persist shuffle results across query stages; see apache/datafusion-ray#55

@richardliaw
Contributor

Awesome! I will close this for now as we can move this discussion onto the official datafusion-ray repo, which is (3) on your list.
