parquet support for structured output #15311
base: main
Thanks for getting started on this!
Signed-off-by: Max <[email protected]>
Segmentation Fault Fixes
@mschrader15 Thanks for this one! It already looks quite advanced. Maybe we should split it into two parts and do the unique_ptr refactoring first? Also, is C++17 really required for using parquet?
Re the C++17: unfortunately, yes. See apache/arrow#32415.
FYI, I put together an informal speed comparison here: https://github.com/mschrader15/sumo-parquet-test?tab=readme-ov-file#results
I have a busy next month, but will try to find time to break this apart. Agree that it is probably best as several smaller PRs for reviewability.
To be clear @behrisch, you want this in two MRs?
Is it okay to proceed w/ C++17? If not, no point in tackling 1 or 2.
Can we help to push this PR? I think protobuf for I/O would be very helpful to speed up output parsing for large networks.
Quickly drafted an implementation of parquet using the OutputDevice base type. I had to abstract out the dependence on `std::ostream` into a `StreamDevice` to get it to work somewhat nicely.
Using templating + polymorphism (and mimicking the behavior of `std::ostream`) tested my C++ skills, and I'm not happy with the heavy use of static_cast.
Wanted to open this in draft to see if the SUMO team has any immediate feedback / suggested design changes before I clean it up and add in tests?
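For anyone skimming, here is a rough sketch of the general idea, not the code in this PR. It assumes a `StreamDevice` base with a templated `operator<<` that forwards to one virtual `write`, so callers keep the `std::ostream`-style syntax while the sink can be swapped. `OStreamDevice`, `RowBufferDevice`, and `writeSample` are hypothetical names made up for illustration, and `RowBufferDevice` only collects strings where a real backend would hand values to a Parquet writer.

```cpp
#include <ostream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical interface (an assumption, not SUMO's actual class):
// one virtual sink plus a templated operator<< that mimics std::ostream.
class StreamDevice {
public:
    virtual ~StreamDevice() = default;

    // The single customization point each backend implements.
    virtual void write(const std::string& text) = 0;

    // Format any streamable value, then hand it to the virtual sink.
    template <typename T>
    StreamDevice& operator<<(const T& value) {
        std::ostringstream tmp;
        tmp << value;
        write(tmp.str());
        return *this;
    }
};

// Text backend: forwards everything to an existing std::ostream.
class OStreamDevice : public StreamDevice {
public:
    explicit OStreamDevice(std::ostream& out) : myOut(out) {}
    void write(const std::string& text) override { myOut << text; }

private:
    std::ostream& myOut;
};

// Stand-in for a columnar backend: it only buffers the pieces it receives.
class RowBufferDevice : public StreamDevice {
public:
    void write(const std::string& text) override { myCells.push_back(text); }
    const std::vector<std::string>& cells() const { return myCells; }

private:
    std::vector<std::string> myCells;
};

// Caller code depends only on the base type, so no static_cast is needed here.
inline void writeSample(StreamDevice& dev) {
    dev << "speed=" << 13.5 << ";id=" << 7;
}
```

Passing an `OStreamDevice` wrapping a `std::ostringstream` to `writeSample` yields the text `speed=13.5;id=7`, while a `RowBufferDevice` receives the same four pieces as separate entries. Keeping the virtual part to one non-template function is one way to limit casting, at the cost of formatting every value through a string first.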
Current Issues:
- ... `closeTag` is called. This works for outputs like emissions export, but won't for every output type.
- ... `HAS_PARQUET` defined fails per the pipelines