Introduction
Inspired by @alamb's weekly updates in DataFusion, I thought it would be a good idea to do something similar in Comet to keep contributors updated on what is happening in the project. These notes reflect things I am personally involved in or thinking about and may not cover all activities. Feel free to add comments for anything that I missed.
News
Comet 0.5.0 has been released. It shows a 1.9x speedup for single-node TPC-H @ 100GB, up from 1.7x in the previous release. Thank you to everyone who contributed.
Blog post: https://datafusion.apache.org/blog/2025/01/17/datafusion-comet-0.5.0/
comet-parquet-exec
@mbutrovich and @parthchandra have been working in the comet-parquet-exec branch on adding support for using DataFusion's ParquetExec as an alternative to Comet's current native Parquet reader. This adds support for reading complex types from Parquet and may also provide some performance improvements, although we won't know for sure until it is fully implemented. We would now like to merge this work into main and are fixing some test regressions so that we can do that. I'm hoping that we can get this merged in the next week.
Array support
There are multiple community PRs, either in draft or in review, that add array functions. The epic for tracking this effort is #1042. There is also a PR (#1292) that adds array data generation to the fuzz testing tool to help find edge cases that are not currently handled, although none have been found so far.
Quality
It is really important to make sure that Comet produces the same results as Spark. We currently rely on Spark's tests as well as additional unit/integration tests in Comet, but it is difficult to cover every possible edge case when adding new expressions. I am starting to think about how we can improve and simplify our testing efforts to increase test coverage. Comet has a fuzz testing tool that has been helpful, but it is not very sophisticated yet and we only run it occasionally. I plan on experimenting with some automated fuzz testing that runs as part of the integration test suite.
I have a PR that adds a data generator for generating random Parquet files, including complex types, and demonstrates using it to test array_remove for all supported types. I found several bugs while working on this and have filed issues.
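To make the idea of automated fuzz testing concrete, here is a minimal, self-contained sketch of differential testing in Python. It is not Comet's fuzz testing tool and does not use Spark or Comet at all: `reference_array_remove` is a hypothetical stand-in for the trusted engine, `candidate_array_remove` is a hypothetical implementation under test with a deliberate bug, and the null-handling rules in the reference are my assumption about how an array_remove-style function behaves, not something taken from this update.

```python
import random


def reference_array_remove(arr, elem):
    """Hypothetical stand-in for the trusted engine.

    Assumed semantics: a null array or null element yields null;
    otherwise every value equal to elem is dropped and nulls are kept.
    """
    if arr is None or elem is None:
        return None
    return [x for x in arr if x is None or x != elem]


def candidate_array_remove(arr, elem):
    """Hypothetical implementation under test.

    Contains a deliberate bug: it also drops null entries.
    """
    if arr is None or elem is None:
        return None
    return [x for x in arr if x is not None and x != elem]


def random_array(rng, max_len=5, null_rate=0.2):
    """Generate a small random integer array that may be null or contain nulls."""
    if rng.random() < 0.1:
        return None
    length = rng.randint(0, max_len)
    return [None if rng.random() < null_rate else rng.randint(0, 3)
            for _ in range(length)]


def fuzz(reference, candidate, iterations=1000, seed=42):
    """Run both implementations on random inputs and collect any mismatches."""
    rng = random.Random(seed)  # fixed seed so failures are reproducible
    mismatches = []
    for _ in range(iterations):
        arr = random_array(rng)
        elem = None if rng.random() < 0.1 else rng.randint(0, 3)
        expected = reference(arr, elem)
        actual = candidate(arr, elem)
        if expected != actual:
            mismatches.append((arr, elem, expected, actual))
    return mismatches


if __name__ == "__main__":
    found = fuzz(reference_array_remove, candidate_array_remove)
    print(f"{len(found)} mismatches found")
    if found:
        print("first mismatch (input array, element, expected, actual):", found[0])
```

The small value range and the explicit null rate are chosen so that collisions and null entries show up often, which is where edge cases tend to hide. A real version would replace the two stand-in functions with queries run against Spark with Comet disabled and enabled.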
Community