
Adopt SF1000+ data sets #173

Open
szarnyasg opened this issue Jun 8, 2021 · 11 comments · Fixed by #292

Comments

@szarnyasg
Member

szarnyasg commented Jun 8, 2021

The SNB Interactive benchmark is currently limited to:

  • Data sets up to SF1000
  • Append-only workloads without deletions

These could be amended by backporting the improvements made for the BI workload.

Larger data sets

Scaling the Interactive workload to SF3000 is not trivial: the Hadoop-based Datagen breaks for SF1000+ data sets (with an NPE) and the old parameter generator has scalability issues (it's a single-threaded Python 2 script; for SF1000, it already takes about a day to finish). Therefore, we should use the new Spark-based generator. However, this creates at least three development tasks:

  • The existing Cypher and SQL solutions need to be updated to work with the new schemas produced by the Spark-based Datagen.
  • The Interactive parameter generator has to be ported (effectively reimplemented) in Spark/SparkSQL (Factor generation for Interactive ldbc_snb_datagen_spark#219).
  • The inserts generated by the new data generator (e.g. inserts/dynamic/Person/part-*.csv) use a different format than the update streams produced by the old generator. To work around this, we would need to either adjust the driver or implement an "insert file to update stream converter". The latter seems simpler and mostly doable in SQL; a sketch follows below.
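
For illustration, a minimal sketch of such a converter in DuckDB SQL could look as follows. This is not taken from any existing implementation: the column names (creationDate, id, firstName, ...), the event type code, and the exact update stream layout are assumptions that would have to be checked against the Spark-based Datagen output and the format expected by the Interactive driver.

-- Hypothetical "insert file to update stream" converter sketch (DuckDB SQL).
-- All column names and the event type code are assumptions.
CREATE OR REPLACE TABLE person_inserts AS
    SELECT *
    FROM read_csv_auto('inserts/dynamic/Person/part-*.csv', delim='|', header=true);

-- Emit a file in the spirit of the old updateStream_0_0_person.csv:
-- scheduled time | dependency time | event type | payload columns ...
COPY (
    SELECT
        epoch_ms(creationDate) AS scheduled_time,   -- assumed column name
        epoch_ms(creationDate) AS dependency_time,  -- an ADD_PERSON event has no earlier dependency
        1                      AS event_type,       -- assumed code for ADD_PERSON
        id, firstName, lastName, gender, birthday, locationIP, browserUsed
    FROM person_inserts
    ORDER BY scheduled_time
) TO 'updateStream_0_0_person.csv' (DELIMITER '|', HEADER false);

Analogous queries would be needed for the other dynamic entities (Forum, Post, Comment) and edges.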

Introducing deletions

Deletions would be a realistic addition to an OLTP benchmark. The generator is capable of producing them, so it's only a matter of integrating them into the workload. The key challenges here are (1) figuring out the format (maybe the deletes/dynamic/Person/part-*.csv files work well as-is, or maybe an updateStream-like delete stream would work better), (2) integrating them into the driver, (3) tuning their ratio, and (4) determining how they should be reported in the benchmark results (e.g. a delete can be counted simply as another operation, contributing one operation to the throughput). A sketch of the delete stream option follows below.
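
To make the updateStream-like delete stream option more concrete, here is a hypothetical DuckDB SQL sketch in the same spirit as the insert converter above. The deletionDate/creationDate column names and the event type value are assumptions rather than the Datagen's actual schema.

-- Hypothetical delete stream derived from the Datagen's delete files (DuckDB SQL).
COPY (
    SELECT
        epoch_ms(deletionDate) AS scheduled_time,   -- assumed column name
        epoch_ms(creationDate) AS dependency_time,  -- a delete must be scheduled after the matching insert
        'DELETE_PERSON'        AS event_type,       -- placeholder; the driver would need a real event code
        id
    FROM read_csv_auto('deletes/dynamic/Person/part-*.csv', delim='|', header=true)
    ORDER BY scheduled_time
) TO 'deleteStream_0_0_person.csv' (DELIMITER '|', HEADER false);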

Timeline

These are plans for the mid-term future (late 2021 or early 2022), depending on the interest in such a benchmark.

@arvindshmicrosoft

@szarnyasg I have a question - the main part of the datagen itself would still scale to SF1000+, correct? Other than param generation and associated breaking changes described above?

@szarnyasg
Member Author

@arvindshmicrosoft unfortunately, it doesn't - I tried running it for SF3000 (with a numPersons value that would yield ~3 TB of data) but it crashed with an NPE.

@chinyajie

Hello,
I would like to know if it is currently possible to generate the SF1000+ Interactive v1 mode dataset. I noticed that the Spark version no longer supports Interactive mode. Could you please provide guidance on how to proceed with generating this dataset?

Thank you!


@szarnyasg
Member Author

szarnyasg commented Oct 19, 2024 via email

@szarnyasg
Member Author

I looked into this in more detail. The top comment in this issue states that the Hadoop Datagen throws a NullPointerException (NPE) for data sets larger than SF1000. While this is true, a NullPointerException in the Hadoop Datagen can be a symptom of running out of memory, so using a machine/cluster with more memory resolves the problem.

# Install prerequisites.
sudo apt install zip unzip maven silversearcher-ag python2 fzf wget
# Install SDKMAN and a Java 8 JDK.
curl -s "https://get.sdkman.io" | bash
sdk install java 8.0.422.fx-zulu
cd ldbc_snb_datagen_hadoop/
# Download and unpack Hadoop 3.2.1.
wget https://archive.apache.org/dist/hadoop/core/hadoop-3.2.1/hadoop-3.2.1.tar.gz
tar xf hadoop-3.2.1.tar.gz
# Point JAVA_HOME at the SDKMAN-installed JDK and HADOOP_HOME at the unpacked distribution.
export JAVA_HOME=${SDKMAN_CANDIDATES_DIR}/java/${CURRENT}
export HADOOP_HOME=`pwd`/hadoop-3.2.1
# Allow the Hadoop client JVM to use almost all of the instance's 1.5 TB of RAM.
export HADOOP_CLIENT_OPTS="-Xmx1530G"
# Run the Datagen and record the peak memory usage.
/usr/bin/time -v ./run.sh

I generated a data set with the following settings:

  • Instance: r6a.48xlarge (1.5 TB RAM)
  • Serializer: CsvBasic
  • numPersons value: 9800000
  • This setup used EBS storage only (no instance-attached storage), so changing the location of the Hadoop temporary directory was not required (it is required when instance-attached storage is available).

Results:

  • The generation took ~80 hours (!).

  • According to /usr/bin/time -v, the maximum memory used was 1186 GB.

  • The runtime does not include parameter generation, which crashed and needs to be performed separately (likely with a portion of it rewritten in DuckDB; see the sketch below).

  • The peak disk usage was about 6.5 TB (!), more than twice the scale factor's nominal size.

  • The generated initial data set was 2.8 TB, which is too small (especially because scale factors are determined using the CsvMergeForeign serializer, which produces more compact files).

    $ du -hd0 social_network/updateStream*.csv
    745G    social_network/updateStream_0_0_forum.csv
    302M    social_network/updateStream_0_0_person.csv
    $ du -hd0 social_network/{static,dynamic}
    2.2M    social_network/static
    2.8T    social_network/dynamic
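
To illustrate the DuckDB-based rework of the parameter generator mentioned in the list above, a factor table such as "number of friends per Person" could be computed directly from the generated CSVs along the following lines. This is a hypothetical sketch: the file pattern and column names are guesses at the CsvBasic layout, and the real parameter generator involves considerably more logic than a single aggregation.

-- Hypothetical fragment of a DuckDB-based parameter generator.
CREATE OR REPLACE TABLE person_num_friends AS
    SELECT Person1Id AS person_id, count(*) AS num_friends   -- assumed column name
    FROM read_csv_auto('social_network/dynamic/person_knows_person_*.csv', delim='|', header=true)
    GROUP BY Person1Id;

-- Query parameters can then be sampled from a selectivity band of the factor table.
SELECT person_id
FROM person_num_friends
WHERE num_friends BETWEEN 50 AND 100   -- illustrative band
ORDER BY random()
LIMIT 500;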

In any case, this experiment proves that the Datagen can generate SF3000 data sets; it just needs a lot of memory to do so, and some fine-tuning to get the size right.

@szarnyasg szarnyasg reopened this Nov 2, 2024
@szarnyasg szarnyasg changed the title Backport BI improvements and adopt SF1000+ data sets Adopt SF1000+ data sets Nov 2, 2024
@chinyajie

I apologize for the delayed response. Thank you for sharing your detailed experience and configuration for generating SF3000 datasets. I conducted a similar test and successfully generated SF3000 scale datasets by increasing the swap memory. This has been very helpful for my research.

@chinyajie

(Quoting @szarnyasg's reply, sent via email:) Hi @chinyajie, thanks for reaching out. Indeed, the Hadoop-based generator is limited to SF1000 and Interactive v2 is still under development. We can attempt to increase the range of supported data sets for Interactive v1 to SF3000 if there is interest in getting audits for these data sets. Are you interested in obtaining audited results for Interactive v1? If so, please reach out to @.*** Gabor


Hi @szarnyasg, could you add src/main/resources/configuration/ldbc/snb/interactive/sf3000.properties to ldbc_snb_interactive_v1_driver? ☺

@szarnyasg
Member Author

Hello @chinyajie, I'm working on this at the moment. First, I have to find the correct value for ldbc.snb.datagen.generator.numPersons in the Data generator. I expect this to be finished by early next week. Then, I can work on the parameter generator, which will also require some rework, as the current one does not scale to SF3000.

@chinyajie

Thank you.

@szarnyasg
Member Author

Hi @chinyajie, I set the ldbc.snb.datagen.generator.numPersons value in the https://github.com/ldbc/ldbc_snb_datagen_hadoop repository. Note that it needs approximately 850 GB of total memory to run.

The parameter generator needs some more work to be able to scale for these sizes. As it's the holiday period for me now, this is something that will likely be completed early next year.

@chinyajie

Hi @szarnyasg
Thank you for your amazing contributions to the project! Your work is incredibly helpful.
I hope you have a wonderful holiday season filled with joy and relaxation.
