This week in Comet (Jan 18) #1305

andygrove · 2025-01-18T16:34:13Z

Introduction

Inspired by @alamb's weekly updates in DataFusion, I thought it would be a good idea to do something similar in Comet to keep contributors updated on what is happening in the project. These notes reflect things I am personally involved in or thinking about and may not cover all activities. Feel free to add comments for anything that I missed.

News

Comet 0.5.0 has been released. It shows a 1.9x speedup for single node TPC-H @ 100GB, up from 1.7x in the previous release. Thank you to everyone who contributed.

Blog post: https://datafusion.apache.org/blog/2025/01/17/datafusion-comet-0.5.0/

comet-parquet-exec

@mbutrovich and @parthchandra have been working in the comet-parquet-exec branch on adding support for using DataFusion's ParquetExec as an alternative to Comet's current native Parquet reader. This has the advantage of supporting reading complex types from Parquet and may also provide some performance improvements, although we won't know for sure until it is fully implemented. We are now at a point where we would like to merge this work into main and are working on fixing some test regressions so that we can do that. I'm hoping that we can get this merged in the next week.

Array support

There are multiple community PRs, either in draft or in review, that add additional array functions. The epic for tracking this effort is #1042. There is also a PR (#1292) that adds array data generation to the fuzz testing tool to help find edge cases that are not currently handled, although none have been found so far.

Quality

It is really important to make sure that Comet produces the same results as Spark. We currently rely on Spark's tests as well as additional unit/integration tests in Comet but it is difficult to cover every possible edge case when adding new expressions. I am starting to think about how we can improve and simplify our testing efforts to increase test coverage. Comet has a fuzz testing tool that has been helpful but it is not very sophisticated yet, and we only run it occasionally. I plan on experimenting with some automated fuzz testing that runs as part of the integration test suite.
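To make the idea of automated fuzz testing concrete, here is a minimal sketch of differential fuzzing in Python. It is not Comet's fuzz testing tool and does not use any Comet or Spark API. The names `random_array`, `reference_remove`, `buggy_remove` and `fuzz` are hypothetical, and the "drop matching elements, keep nulls" behaviour of the reference function is an assumption chosen for illustration, not a statement of Spark's `array_remove` semantics. The sketch generates random nullable arrays, runs a trusted implementation and an implementation under test on the same input, and records every input where the two disagree.

```python
import random
from typing import Callable, List, Optional, Tuple

# A nullable array of nullable integers (hypothetical test data shape).
Row = Optional[List[Optional[int]]]
Mismatch = Tuple[Row, int, Row, Row]


def random_array(rng: random.Random, max_len: int = 5, null_rate: float = 0.2) -> Row:
    """Generate a random array that may itself be null and may contain nulls."""
    if rng.random() < null_rate:
        return None
    length = rng.randint(0, max_len)
    return [None if rng.random() < null_rate else rng.randint(-2, 2)
            for _ in range(length)]


def reference_remove(arr: Row, target: int) -> Row:
    """Stand-in for the trusted engine (assumed semantics: keep nulls)."""
    if arr is None:
        return None
    return [x for x in arr if x is None or x != target]


def buggy_remove(arr: Row, target: int) -> Row:
    """Stand-in for the implementation under test, with a deliberate bug:
    it also drops null elements."""
    if arr is None:
        return None
    return [x for x in arr if x is not None and x != target]


def fuzz(reference: Callable[[Row, int], Row],
         candidate: Callable[[Row, int], Row],
         seed: int = 42,
         iterations: int = 1000) -> List[Mismatch]:
    """Run both implementations on the same random inputs and collect mismatches."""
    rng = random.Random(seed)  # fixed seed so failures are reproducible
    mismatches: List[Mismatch] = []
    for _ in range(iterations):
        arr = random_array(rng)
        target = rng.randint(-2, 2)
        expected = reference(arr, target)
        actual = candidate(arr, target)
        if expected != actual:
            mismatches.append((arr, target, expected, actual))
    return mismatches


if __name__ == "__main__":
    failures = fuzz(reference_remove, buggy_remove)
    print(f"{len(failures)} mismatching inputs found")
    for arr, target, expected, actual in failures[:3]:
        print(f"input={arr} target={target} expected={expected} actual={actual}")
```

The fixed seed keeps a failing run reproducible, which matters if something like this runs as part of an integration test suite. In a real setup the two callables would wrap query execution in the two engines rather than plain Python functions.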

Community

Weekly Call
Slack/Discord: info links

alamb · 2025-01-18T21:31:47Z

Some other potentially interesting things:

Work on improving spill file performance, via #1190 ("feat: Implement custom RecordBatch serde for shuffle for improved performance")
Upstream improvements in arrow-rs for reading Arrow files (e.g. using mmap; see arrow-rs#6986, "Add example reading data from an mmaped IPC file")

The point being that there is still quite a way to go performance-wise.

andygrove · 2025-01-20T16:40:26Z

I have a PR that adds a data generator for random Parquet files, including complex types, and demonstrates using it to test array_remove for all supported types. I found several bugs while working on this and have filed issues.

#1308
