This week in Comet (Jan 18) #1305

Open
andygrove opened this issue Jan 18, 2025 · 2 comments

Comments

@andygrove

Introduction

Inspired by @alamb's weekly updates in DataFusion, I thought it would be a good idea to do something similar in Comet to keep contributors updated on what is happening in the project. These notes reflect things I am personally involved in or thinking about and may not cover all activities. Feel free to add comments for anything that I missed.

News

Comet 0.5.0 has been released. It shows a 1.9x speedup for single node TPC-H @ 100GB, up from 1.7x in the previous release. Thank you to everyone who contributed.

Blog post: https://datafusion.apache.org/blog/2025/01/17/datafusion-comet-0.5.0/

comet-parquet-exec

@mbutrovich and @parthchandra have been working in the comet-parquet-exec branch on adding support for using DataFusion's ParquetExec as an alternative to Comet's current native Parquet reader. This has the advantage of supporting reading complex types from Parquet and may also provide some performance improvements, although we won't know for sure until it is fully implemented. We are now at a point where we would like to merge this work into main and are working on fixing some test regressions so that we can do that. I'm hoping that we can get this merged in the next week.

Array support

There are multiple community PRs, either in draft or in review, that add more array functions. The epic for tracking this effort is #1042. There is also a PR (#1292) that adds array data generation to the fuzz testing tool to help find edge cases that are not currently handled, although none have been found so far.

Quality

It is really important to make sure that Comet produces the same results as Spark. We currently rely on Spark's tests as well as additional unit and integration tests in Comet, but it is difficult to cover every possible edge case when adding new expressions. I am starting to think about how we can improve and simplify our testing efforts to increase test coverage. Comet has a fuzz testing tool that has been helpful, but it is not very sophisticated yet and we only run it occasionally. I plan to experiment with automated fuzz testing that runs as part of the integration test suite.
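
To make the idea concrete, here is a minimal sketch of differential fuzz testing in Python. It is not Comet's fuzz testing tool and does not call Spark or Comet; `reference_array_remove` and `candidate_array_remove` are hypothetical stand-ins for the trusted engine and the engine under test. The null handling (null array or null element yields null, null entries inside the array are kept) is an assumption made for illustration and should be checked against Spark's documented behavior.

```python
import random


def reference_array_remove(arr, elem):
    """Hypothetical stand-in for the trusted engine's array_remove."""
    # Assumed semantics: a null array or a null element yields null.
    if arr is None or elem is None:
        return None
    # Null entries are kept; only entries equal to elem are dropped.
    return [x for x in arr if x is None or x != elem]


def candidate_array_remove(arr, elem):
    """Hypothetical stand-in for the implementation under test."""
    if arr is None or elem is None:
        return None
    out = []
    for x in arr:
        if x is not None and x == elem:
            continue  # drop matching entries
        out.append(x)
    return out


def random_array(rng, max_len=8, null_rate=0.2):
    """Generate a random nullable array of small integers with null entries."""
    if rng.random() < 0.1:
        return None  # the whole array is null
    length = rng.randint(0, max_len)
    return [
        None if rng.random() < null_rate else rng.randint(-3, 3)
        for _ in range(length)
    ]


def fuzz(seed, iterations=1000):
    """Run both implementations on random inputs and collect any mismatches."""
    rng = random.Random(seed)  # fixed seed keeps failures reproducible
    mismatches = []
    for _ in range(iterations):
        arr = random_array(rng)
        elem = None if rng.random() < 0.1 else rng.randint(-3, 3)
        expected = reference_array_remove(arr, elem)
        actual = candidate_array_remove(arr, elem)
        if expected != actual:
            mismatches.append((arr, elem, expected, actual))
    return mismatches


if __name__ == "__main__":
    print(f"mismatches found: {len(fuzz(seed=42))}")
```

The small value range (-3 to 3) is deliberate: it makes it likely that the element to remove actually appears in the array, and the fixed seed means any mismatch can be replayed.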

Community

alamb commented Jan 18, 2025

Some other potentially interesting things:

The point being that there is still quite a way to go performance-wise.

@andygrove

I have a PR that adds a data generator for random Parquet files, including complex types, and demonstrates using it to test array_remove for all supported types. I found several bugs while working on this and have filed issues.

#1308
