Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

parquet support for structured output #15311

Draft
wants to merge 14 commits into
base: main
Choose a base branch
from

Conversation

mschrader15
Copy link
Contributor

@mschrader15 mschrader15 commented Aug 2, 2024

Quickly drafted an implementation of parquet using the OutputDevice base type. I had to abstract out the dependence on std::ostream into a StreamDevice to get it to work somewhat nicely.

Using templating + polymorphism (and mimicking the behavior of std::ostream) tested my c++ skills, and I'm not happy with heavy use of static_cast.

Wanted to open this in draft to see if the SUMO team has any immediate feedback / suggested design changes before I clean it up and add in tests?

Current Issues:

  • It currently expects all attributes to have appeared by the first time closeTag is called. This works for outputs like emissions export, but won't for every output type.
    • This also fails when (in the emissions export case) a vehicle is not present in the first timestep. We need to add a device-specific expected depth or similar.
    • We need a runtime check for compatibility or to use more advanced features of parquet like Dremel encoding
  • Arrow/Parquet requires cpp17
  • Build w/o HAS_PARQUET defined fails per the pipelines

@namdre namdre requested a review from behrisch August 2, 2024 10:43
@namdre
Copy link
Contributor

namdre commented Aug 2, 2024

Thanks for getting started on this!
Summer holidays are coming around so expect the review to be delayed until september 😅

@behrisch
Copy link
Contributor

@mschrader15 Thanks for this one! It already looks quite advanced. Maybe we should split it into two parts and do the unique_ptr refactoring first? Also is c++17 really required for using parquet?

@mschrader15
Copy link
Contributor Author

Re the C++17, unfortunately yes. See apache/arrow#32415

FYI, I put together an informal speed comparison here: https://github.com/mschrader15/sumo-parquet-test?tab=readme-ov-file#results

I have a busy next month, but will try to find time to break this apart. Agree that it is probably best as several smaller PRs for review-ability.

@mschrader15
Copy link
Contributor Author

mschrader15 commented Oct 18, 2024

To be clear @behrisch, you want this in two MRs?

  1. Refactor of existing output functionality + std::ptr functionality
  2. Addition of parquet support

Is it okay to proceed w/ C++17 ? If not, no point in tackling 1 or 2.

@aminissn
Copy link
Contributor

can we help to push this PR? I think protobuf for I/O would be very helpful to speed up outputs parsing for large networks

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants