[Data][Feature Request] Integration with Apache Arrow Datafusion #32032
Labels
data
Ray Data-related issues
enhancement
Request for new feature and/or capability
P2
Important issue, but not time-critical
Description
Datafusion (https://arrow.apache.org/datafusion/) is a modular query engine developed as a subproject of Apache Arrow.
It is written in Rust (providing very high performance), has Official Python Bindings which pass data between Rust to Python with minimal overhead and zero-copy, and provides both a DataFrame API as well as SQL support. Being part of the Arrow project, it uses it as its in-memory format.
Given the positioning of Ray Dataset (correct me if I am wrong) as a standard for distributed computation on Arrow data, it seems the two projects would complement each other quite well.
Use case
I can think of three ways i would use this integration (in order of perceived difficulty of implementation)
Running DataFusion transformation within Ray Dataset
map_batches
: adding Datafusion as an engine would grant arrow-level performance while at the same time exposing a high-level DataFrame API and the possibility to transform Ray Datasets batches using SQL (which is something that the other engines today do not do out of the box).Implementing aggregations (global and grouped) using DataFusion - this is probably the toughest sell as there is already some awesome work done by @clarkzinzow implementing aggregations using Polars - a different Rust-based dataframe library. If anything (it's a bit of a stretch) maintenance might be easier with DataFusion as the contributors there largely intersect with contributors to Apache Arrow (which Ray Data is already heavily using). It would also be nice to be able to write
AggregateFn
custom aggregations choosing between a Dataframe API and SQL.(More difficult to implement) having a full fledged "Datafusion on Ray" with Ray essentially replacing Ballista similarly to how Ray replaces the Dask Dataframe distributed scheduler for Dask on Ray. Given the zero-copy capability of Datafusion this integration looks particularly attractive. Being able to natively execute distributed DataFusion plans on Ray Datasets would be great in extending the framework where capabilities are not readily available (e.g. joins).
I recognise implementing this integration would not be straightforward and would require significant buy-in (for what it's worth, I'd be happy to contribute to the development efforts if the community decided to take this up).
The text was updated successfully, but these errors were encountered: