
Adopt SF1000+ data sets #173

Open
szarnyasg opened this issue Jun 8, 2021 · 11 comments · Fixed by #292

Comments

@szarnyasg
Member

szarnyasg commented Jun 8, 2021

The SNB Interactive benchmark is currently limited to:

  • Data sets up to SF1000
  • Append-only workloads without deletions

These could be amended by backporting the improvements made for the BI workload.

Larger data sets

Scaling the Interactive workload to SF3000 is not trivial: the Hadoop-based Datagen breaks for SF1000+ data sets (with an NPE) and the old parameter generator has scalability issues (it's a single-threaded Python 2 script; for SF1000, it already takes about a day to finish). Therefore, we should use the new Spark-based generator. However, this creates at least three development tasks:

  • The existing Cypher and SQL solutions need to be updated to work with the new schemas produced by the Spark-based Datagen.
  • The Interactive parameter generator has to be ported (effectively reimplemented) in Spark/SparkSQL (Factor generation for Interactive ldbc_snb_datagen_spark#219).
  • The inserts generated by the new data generator (e.g. inserts/dynamic/Person/part-*.csv) use a different format than the update streams produced by the old generator. To work around this, we would need to either adjust the driver or implement an "insert file to update stream converter". The latter seems simpler and mostly doable in SQL; a sketch follows below.
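
For illustration, a minimal sketch of such a converter in DuckDB SQL could look as follows. This is not taken from any existing implementation: the column names (creationDate, id, firstName, ...), the event type code, and the exact update stream layout are assumptions that would have to be checked against the Spark-based Datagen output and the format expected by the Interactive driver.

-- Hypothetical "insert file to update stream" converter sketch (DuckDB SQL).
-- All column names and the event type code are assumptions.
CREATE OR REPLACE TABLE person_inserts AS
    SELECT *
    FROM read_csv_auto('inserts/dynamic/Person/part-*.csv', delim='|', header=true);

-- Emit a file in the spirit of the old updateStream_0_0_person.csv:
-- scheduled time | dependency time | event type | payload columns ...
COPY (
    SELECT
        epoch_ms(creationDate) AS scheduled_time,   -- assumed column name
        epoch_ms(creationDate) AS dependency_time,  -- an ADD_PERSON event has no earlier dependency
        1                      AS event_type,       -- assumed code for ADD_PERSON
        id, firstName, lastName, gender, birthday, locationIP, browserUsed
    FROM person_inserts
    ORDER BY scheduled_time
) TO 'updateStream_0_0_person.csv' (DELIMITER '|', HEADER false);

Analogous queries would be needed for the other dynamic entities (Forum, Post, Comment) and edges.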

Introducing deletions

Deletions would be a realistic addition to an OLTP benchmark. The generator is capable of producing them, so it's only a matter of integrating them into the workload. The key challenges here are (1) figuring out the format (maybe the deletes/dynamic/Person/part-*.csv files work well as-is, or maybe an updateStream-like delete stream would work better), (2) integrating them into the driver, (3) tuning their ratio, and (4) determining how they should be reported in the benchmark results (e.g. a delete can be counted simply as another operation, contributing one operation to the throughput). A sketch of the delete stream option follows below.
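
To make the updateStream-like delete stream option more concrete, here is a hypothetical DuckDB SQL sketch in the same spirit as the insert converter above. The deletionDate/creationDate column names and the event type value are assumptions rather than the Datagen's actual schema.

-- Hypothetical delete stream derived from the Datagen's delete files (DuckDB SQL).
COPY (
    SELECT
        epoch_ms(deletionDate) AS scheduled_time,   -- assumed column name
        epoch_ms(creationDate) AS dependency_time,  -- a delete must be scheduled after the matching insert
        'DELETE_PERSON'        AS event_type,       -- placeholder; the driver would need a real event code
        id
    FROM read_csv_auto('deletes/dynamic/Person/part-*.csv', delim='|', header=true)
    ORDER BY scheduled_time
) TO 'deleteStream_0_0_person.csv' (DELIMITER '|', HEADER false);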

Timeline

These are plans for the mid-term future (late 2021 or early 2022), depending on the interest in such a benchmark.

@arvindshmicrosoft

@szarnyasg I have a question - the main part of the datagen itself would still scale to SF1000+, correct? Other than param generation and associated breaking changes described above?

@szarnyasg
Member Author

@arvindshmicrosoft unfortunately, it doesn't - I tried running it for SF3000 (with a numPersons value that would yield ~3 TB of data) but it crashed with an NPE.

@chinyajie

Hello,
I would like to know if it is currently possible to generate the SF1000+ Interactive v1 mode dataset. I noticed that the Spark version no longer supports Interactive mode. Could you please provide guidance on how to proceed with generating this dataset?

Thank you!


@szarnyasg
Member Author

szarnyasg commented Oct 19, 2024 via email

@szarnyasg
Member Author

I looked into this in more detail. The top comment in this issue states that the Hadoop Datagen throws a NullPointerException (NPE) for data sets larger than SF1000. While this is true, a NullPointerException in the Hadoop Datagen can be a symptom of running out of memory, so using a machine/cluster with more memory resolves the problem.

# Install prerequisites.
sudo apt install zip unzip maven silversearcher-ag python2 fzf wget
# Install SDKMAN and a Java 8 JDK.
curl -s "https://get.sdkman.io" | bash
sdk install java 8.0.422.fx-zulu
cd ldbc_snb_datagen_hadoop/
# Download and unpack Hadoop 3.2.1.
wget https://archive.apache.org/dist/hadoop/core/hadoop-3.2.1/hadoop-3.2.1.tar.gz
tar xf hadoop-3.2.1.tar.gz
# Point JAVA_HOME at the SDKMAN-installed JDK and HADOOP_HOME at the unpacked distribution.
export JAVA_HOME=${SDKMAN_CANDIDATES_DIR}/java/${CURRENT}
export HADOOP_HOME=`pwd`/hadoop-3.2.1
# Allow the Hadoop client JVM to use almost all of the instance's 1.5 TB of RAM.
export HADOOP_CLIENT_OPTS="-Xmx1530G"
# Run the Datagen and record the peak memory usage.
/usr/bin/time -v ./run.sh

I generated a data set with the following settings:

  • Instance: r6a.48xlarge (1.5 TB RAM)
  • Serializer: CsvBasic
  • numPersons value: 9800000
  • This setup used EBS storage only (no instance-attached storage), so changing the location of the Hadoop temporary directory was not required (it is required when instance-attached storage is available).

Results:

  • The generation took ~80 hours (!).

  • According to /usr/bin/time -v, the maximum memory used was 1186 GB.

  • The runtime does not include parameter generation, which crashed and needs to be performed separately (likely with a portion of it rewritten in DuckDB; see the sketch below).

  • The peak disk usage was about 6.5 TB (!), more than twice the scale factor's nominal size.

  • The generated initial data set was 2.8 TB, which is too small (especially because scale factors are determined using the CsvMergeForeign serializer, which produces more compact files).

    $ du -hd0 social_network/updateStream*.csv
    745G    social_network/updateStream_0_0_forum.csv
    302M    social_network/updateStream_0_0_person.csv
    $ du -hd0 social_network/{static,dynamic}
    2.2M    social_network/static
    2.8T    social_network/dynamic
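
To illustrate the DuckDB-based rework of the parameter generator mentioned in the list above, a factor table such as "number of friends per Person" could be computed directly from the generated CSVs along the following lines. This is a hypothetical sketch: the file pattern and column names are guesses at the CsvBasic layout, and the real parameter generator involves considerably more logic than a single aggregation.

-- Hypothetical fragment of a DuckDB-based parameter generator.
CREATE OR REPLACE TABLE person_num_friends AS
    SELECT Person1Id AS person_id, count(*) AS num_friends   -- assumed column name
    FROM read_csv_auto('social_network/dynamic/person_knows_person_*.csv', delim='|', header=true)
    GROUP BY Person1Id;

-- Query parameters can then be sampled from a selectivity band of the factor table.
SELECT person_id
FROM person_num_friends
WHERE num_friends BETWEEN 50 AND 100   -- illustrative band
ORDER BY random()
LIMIT 500;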

In any case, this experiment proves that the Datagen can generate SF3000 data sets; it just needs a lot of memory to do so, and some fine-tuning to get the size right.

@szarnyasg szarnyasg reopened this Nov 2, 2024
@szarnyasg szarnyasg changed the title Backport BI improvements and adopt SF1000+ data sets Adopt SF1000+ data sets Nov 2, 2024
@chinyajie

I apologize for the delayed response. Thank you for sharing your detailed experience and configuration for generating SF3000 datasets. I conducted a similar test and successfully generated SF3000 scale datasets by increasing the swap memory. This has been very helpful for my research.

@chinyajie

(Quoting @szarnyasg's reply, sent via email:) Hi @chinyajie, thanks for reaching out. Indeed, the Hadoop-based generator is limited to SF1000 and Interactive v2 is still under development. We can attempt to increase the range of supported data sets for Interactive v1 to SF3000 if there is interest in getting audits for these data sets. Are you interested in obtaining audited results for Interactive v1? If so, please reach out to @.*** Gabor


Hi @szarnyasg, could you add src/main/resources/configuration/ldbc/snb/interactive/sf3000.properties to ldbc_snb_interactive_v1_driver? ☺

@szarnyasg
Member Author

Hello @chinyajie, I'm working on this at the moment. First, I have to find the correct value for ldbc.snb.datagen.generator.numPersons in the Data generator. I expect this to be finished by early next week. Then, I can work on the parameter generator, which will also require some rework, as the current one does not scale to SF3000.

@chinyajie

Thank you.

@szarnyasg
Member Author

Hi @chinyajie, I set the ldbc.snb.datagen.generator.numPersons value in the https://github.com/ldbc/ldbc_snb_datagen_hadoop repository. Note that it needs approximately 850 GB of total memory to run.

The parameter generator needs some more work to be able to scale for these sizes. As it's the holiday period for me now, this is something that will likely be completed early next year.

@chinyajie

Hi @szarnyasg
Thank you for your amazing contributions to the project! Your work is incredibly helpful.
I hope you have a wonderful holiday season filled with joy and relaxation.
