Is your feature request related to a problem? Please describe.
I hope support for the orc.stripe.row.count parameter can be added, which would facilitate testing by allowing precise control over the number of stripes generated.
The only parameter related to ORC stripes that I found is orc.stripe.size, which is not very convenient to use.
But those do not behave the same way as what the Spark ORC writer configs do.
The size limit in CUDF applies to pre-compression data sizes. I believe Spark/ORC's size limit applies post-compression and is checked periodically.
Beyond that, the RAPIDS Accelerator will split the input data into batches at an arbitrary point (targeting about 1 GiB uncompressed by default). The CUDF ORC writer will also not produce stripes that span these batches.
Because of all of those differences we decided not to expose these configs. I would really like to understand your use case so that we can produce the correct solution. I am happy to expose these configs in a non-standard way, because they are not the same. But I am not sure that is what you really want.
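To make the batching point concrete, here is a rough sketch of how stripe counts could work out when stripes never span batches. It assumes that only a row-count limit is binding (the byte-size limit is ignored) and that each batch is cut into stripes independently. The helper `estimate_stripe_count` and the batch sizes are hypothetical and used for illustration only; they are not part of CUDF or the RAPIDS Accelerator.

```python
import math

def estimate_stripe_count(batch_row_counts, stripe_size_rows):
    """Estimate the number of stripes, assuming only a row limit applies
    and no stripe spans two batches (hypothetical helper)."""
    if stripe_size_rows <= 0:
        raise ValueError("stripe_size_rows must be positive")
    # Each batch is split on its own, so the last stripe of a batch may be short.
    return sum(math.ceil(rows / stripe_size_rows) for rows in batch_row_counts)

# 2,500 rows arriving as three batches, with a 400-row stripe limit.
batches = [1000, 1000, 500]
print(estimate_stripe_count(batches, 400))   # 3 + 3 + 2 = 8
print(math.ceil(sum(batches) / 400))         # 7 if stripes could span batches
```

Under these assumptions, a test that asserts an exact stripe count would also need to control how the input is batched, not just the stripe row limit.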
I want to address issue #11735. I need to generate multiple stripes in the test, but with the default configuration the stripes seem too large. I feel that exposing the stripe_size_rows setting might make writing the test easier. Do you have any suggestions? 🥺
Okay, then we can expose the size and row count configs, but let's do them as RAPIDS-specific configs for now. We can then decide if we want to honor the standard ORC ones. While we are at it, we should do the same for Parquet.
mattahrens changed the title from "[FEA] Add support for orc.stripe.row.count config" to "[FEA] Add support for Parquet rowgroup and ORC stripe cudf size configs" on Dec 3, 2024