[Data][Feature Request] Integration with Apache Arrow Datafusion #32032

andreapiso · 2023-01-29T11:29:47Z

Description

DataFusion (https://arrow.apache.org/datafusion/) is a modular query engine developed as a subproject of Apache Arrow.

It is written in Rust (providing very high performance), has official Python bindings that pass data between Rust and Python with minimal overhead and zero-copy, and provides both a DataFrame API and SQL support. Being part of the Arrow project, it uses Arrow as its in-memory format.

Given the positioning of Ray Dataset (correct me if I am wrong) as a standard for distributed computation on Arrow data, it seems the two projects would complement each other quite well.

Use case

I can think of three ways I would use this integration (in order of perceived difficulty of implementation):

1. Running DataFusion transformations within Ray Dataset map_batches: adding DataFusion as an engine would grant Arrow-level performance while exposing a high-level DataFrame API and the possibility of transforming Ray Dataset batches using SQL (which is something the other engines today do not do out of the box).
2. Implementing aggregations (global and grouped) using DataFusion. This is probably the toughest sell, as there is already some awesome work done by @clarkzinzow implementing aggregations using Polars, a different Rust-based dataframe library. If anything (it's a bit of a stretch), maintenance might be easier with DataFusion, as the contributors there largely overlap with the contributors to Apache Arrow (which Ray Data already uses heavily). It would also be nice to be able to write custom AggregateFn aggregations, choosing between a DataFrame API and SQL.
3. (More difficult to implement) Having a full-fledged "DataFusion on Ray", with Ray essentially replacing Ballista, similar to how Ray replaces the Dask DataFrame distributed scheduler for Dask on Ray. Given the zero-copy capability of DataFusion, this integration looks particularly attractive. Being able to natively execute distributed DataFusion plans on Ray Datasets would be a great way to extend the framework where capabilities are not readily available (e.g. joins).

I recognise implementing this integration would not be straightforward and would require significant buy-in (for what it's worth, I'd be happy to contribute to the development efforts if the community decided to take this up).


zhe-thoughts · 2023-02-14T04:24:33Z

Thanks for the feature ask, @andreapiso. Assigning to @c21 to track.

anyscalesam · 2023-12-19T21:59:35Z

Reviewed - no current ETA.

edmondop · 2024-12-17T23:04:56Z

@andreapiso thank you for opening this issue. Funny that the third point in the list you created almost two years ago is not too far from being a reality.

We are looking forward to using the Ray object store to persist shuffle results across query stages; see apache/datafusion-ray#55.

richardliaw · 2024-12-27T07:43:15Z

Awesome! I will close this for now as we can move this discussion onto the official datafusion-ray repo, which is (3) on your list.

andreapiso added the enhancement Request for new feature and/or capability label Jan 29, 2023

zhe-thoughts added the data Ray Data-related issues label Feb 14, 2023

zhe-thoughts assigned c21 Feb 14, 2023

zhe-thoughts added the triage Needs triage (eg: priority, bug/not-bug, and owning component) label Feb 14, 2023

c21 added P2 Important issue, but not time-critical and removed triage Needs triage (eg: priority, bug/not-bug, and owning component) labels Feb 17, 2023

c21 removed their assignment Aug 15, 2024

richardliaw closed this as completed Dec 27, 2024
