[FEA] Add support for Parquet rowgroup and ORC stripe cudf size configs #11799

Open
ustcfy opened this issue Dec 2, 2024 · 3 comments
Labels
feature request New feature or request

Comments

@ustcfy
Collaborator

ustcfy commented Dec 2, 2024

Is your feature request related to a problem? Please describe.
Please add support for the orc.stripe.row.count parameter. It would make testing easier by allowing precise control over the number of stripes generated.
The only ORC stripe parameter I found is orc.stripe.size, which is inconvenient for this purpose.

@ustcfy added the "? - Needs Triage" (Need team to review and classify) and "feature request" (New feature or request) labels on Dec 2, 2024
@revans2
Collaborator

revans2 commented Dec 2, 2024

@ustcfy we currently do not honor orc.stripe.row.count or orc.stripe.size. cuDF does have some options we can expose:

https://github.com/rapidsai/cudf/blob/d1bad33caef34b8fa95543c7494780f2084ee603/cpp/include/cudf/io/orc.hpp#L41-L42
https://github.com/rapidsai/cudf/blob/d1bad33caef34b8fa95543c7494780f2084ee603/cpp/include/cudf/io/orc.hpp#L889-L911

But those do not behave the same way as the Spark ORC writer configs.

The size limit in cuDF applies to the pre-compression data size. Spark/ORC's size limit is, I believe, applied post-compression and checked periodically.

Beyond that, the RAPIDS Accelerator will split the input data into batches at an arbitrary point (targeting about 1 GiB uncompressed by default). The cuDF ORC writer will also not produce stripes that span these batches.
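
To make the batching effect concrete, here is a minimal sketch in Python of a simplified model, not the cuDF implementation. It assumes stripes are cut only by a row limit and never span batches; the byte-size limit is ignored. The function name `estimate_stripe_count` and its parameters are hypothetical.

```python
from typing import Iterable


def estimate_stripe_count(batch_row_counts: Iterable[int], stripe_size_rows: int) -> int:
    """Estimate the stripes written when no stripe may span two batches.

    Simplified model (an assumption, not cuDF's actual logic): each batch is
    cut into stripes of at most `stripe_size_rows` rows, and the byte-size
    limit is ignored.
    """
    if stripe_size_rows <= 0:
        raise ValueError("stripe_size_rows must be positive")
    total = 0
    for rows in batch_row_counts:
        if rows < 0:
            raise ValueError("batch row counts must be non-negative")
        # Ceiling division without floats; an empty batch writes no stripe.
        total += -(-rows // stripe_size_rows)
    return total


# The same 5000 rows yield more stripes when split across two batches.
single_batch = estimate_stripe_count([5000], 1000)
two_batches = estimate_stripe_count([2500, 2500], 1000)
```

Under this model the row limit alone does not determine the stripe count, because each batch boundary also ends a stripe. That is one reason a cuDF row limit would not behave exactly like orc.stripe.row.count.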

Because of all of those differences we decided not to expose these configs. I would really like to understand your use case so that we can produce the correct solution. I am happy to expose these configs in a non-standard way, because they are not the same. But I am not sure that is what you really want.

@ustcfy
Collaborator Author

ustcfy commented Dec 3, 2024

I want to address issue #11735. I need to generate multiple stripes in the test, but with the default configuration the stripes seem too large. I feel that exposing the stripe_size_rows setting would make the test easier to write. Do you have any suggestions?🥺
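
For sizing the test data, a rough sketch in Python follows. It assumes a row-count limit is exposed and that the whole input arrives in a single batch; both are assumptions. `min_rows_for_stripes` is a hypothetical helper, not part of any library.

```python
def min_rows_for_stripes(target_stripes: int, stripe_size_rows: int) -> int:
    """Smallest row count that needs at least `target_stripes` stripes.

    Assumes stripes are cut only by the row limit, so the first
    target_stripes - 1 stripes are full and the last holds at least one row.
    """
    if target_stripes < 1 or stripe_size_rows < 1:
        raise ValueError("both arguments must be at least 1")
    return (target_stripes - 1) * stripe_size_rows + 1


# With a 100-row limit, this is the fewest rows that force three stripes.
rows_needed = min_rows_for_stripes(3, 100)
```

A test would still check the stripe count of the written file rather than rely on this estimate, since a size limit or batch splitting can add stripes.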

@revans2
Collaborator

revans2 commented Dec 3, 2024

Okay, then we can expose the size and row count configs, but let's do them as RAPIDS-specific configs for now. We can then decide if we want to honor the standard ORC ones. While we are at it, we should do the same for Parquet.
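
As a sketch of how RAPIDS-specific settings might be used from a test, the Python below builds a Spark conf dictionary. Every key name in it is hypothetical; the thread does not name the final configs. The standard orc.stripe.* keys are deliberately left out because they are not honored.

```python
from typing import Dict, Optional

# Hypothetical key names, chosen only for illustration.
ORC_STRIPE_ROWS_KEY = "spark.rapids.sql.format.orc.write.stripeSizeRows"
ORC_STRIPE_BYTES_KEY = "spark.rapids.sql.format.orc.write.stripeSizeBytes"
PARQUET_ROW_GROUP_ROWS_KEY = "spark.rapids.sql.format.parquet.write.rowGroupSizeRows"
PARQUET_ROW_GROUP_BYTES_KEY = "spark.rapids.sql.format.parquet.write.rowGroupSizeBytes"


def writer_size_confs(
    orc_stripe_rows: Optional[int] = None,
    orc_stripe_bytes: Optional[int] = None,
    parquet_row_group_rows: Optional[int] = None,
    parquet_row_group_bytes: Optional[int] = None,
) -> Dict[str, str]:
    """Build a conf dict containing only the limits that were set."""
    requested = {
        ORC_STRIPE_ROWS_KEY: orc_stripe_rows,
        ORC_STRIPE_BYTES_KEY: orc_stripe_bytes,
        PARQUET_ROW_GROUP_ROWS_KEY: parquet_row_group_rows,
        PARQUET_ROW_GROUP_BYTES_KEY: parquet_row_group_bytes,
    }
    confs: Dict[str, str] = {}
    for key, value in requested.items():
        if value is None:
            continue  # leave the writer's own default in place
        if value <= 0:
            raise ValueError(f"{key} must be positive, got {value}")
        confs[key] = str(value)  # Spark conf values are passed as strings
    return confs


# Small row limits so a modest test table produces several stripes/row groups.
confs = writer_size_confs(orc_stripe_rows=1000, parquet_row_group_rows=1000)
```

Keeping these under a separate namespace leaves room to decide later whether the standard ORC keys should be honored, without changing what the RAPIDS-specific ones mean.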

@mattahrens changed the title from "[FEA] Add support for orc.stripe.row.count config" to "[FEA] Add support for Parquet rowgroup and ORC stripe cudf size configs" on Dec 3, 2024
@mattahrens removed the "? - Needs Triage" (Need team to review and classify) label on Dec 3, 2024