Adopt SF1000+ data sets #173
Comments
@szarnyasg I have a question - the main part of the datagen itself would still scale to SF1000+, correct? Other than param generation and associated breaking changes described above? |
@arvindshmicrosoft unfortunately, it doesn't - I tried running it for SF3000 (with a numPerson number that would yield ~3TB of data) but it crashed with an NPE. |
Hello,
I would like to know if it is currently possible to generate the SF1000+ Interactive v1 mode dataset. I noticed that the Spark version no longer supports Interactive mode. Could you please provide guidance on how to proceed with generating this dataset?
Thank you!
|
Hi @chinyajie,
Thanks for reaching out. Indeed, the Hadoop-based generator is limited to SF1000 and Interactive v2 is still under development. We can attempt to increase the range of supported data sets for Interactive v1 to SF3000 if there is interest in getting audits for these data sets. Are you interested in obtaining audited results for Interactive v1? If so, please reach out to ***@***.***
Gabor
|
I looked into this in more detail. The top comment in this issue states that the Hadoop Datagen throws a NullPointerException (NPE) for data sets larger than SF1000. While this is true, a NullPointerException in the Hadoop Datagen can be the symptom of running out of memory, so using a machine/cluster with more memory resolves this problem.

```bash
sudo apt install zip unzip maven silversearcher-ag python2 fzf wget
curl -s "https://get.sdkman.io" | bash
# open a new shell or source "$HOME/.sdkman/bin/sdkman-init.sh" so that sdk is available
sdk install java 8.0.422.fx-zulu
cd ldbc_snb_datagen_hadoop/
wget https://archive.apache.org/dist/hadoop/core/hadoop-3.2.1/hadoop-3.2.1.tar.gz
tar xf hadoop-3.2.1.tar.gz
# ${CURRENT} as posted; SDKMAN's active Java usually lives under ${SDKMAN_CANDIDATES_DIR}/java/current
export JAVA_HOME=${SDKMAN_CANDIDATES_DIR}/java/${CURRENT}
export HADOOP_HOME=`pwd`/hadoop-3.2.1
export HADOOP_CLIENT_OPTS="-Xmx1530G"
/usr/bin/time -v ./run.sh
```

I generated a data set with the following settings:

Results:
In any case, this experiment proves that the Datagen can generate SF3000 data sets, it just needs a lot of memory to do so, and some fine-tuning to get the size right. |
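As a side note on the size fine-tuning: a quick sanity check is to count entities in the generated CSVs, for example with DuckDB-style SQL. The path below assumes the Hadoop Datagen's social_network/ CSV layout with pipe-delimited, headered files; adjust it to the serializer actually used.

```sql
-- Rough size check: count the generated Persons (file layout assumed, see above).
SELECT count(*) AS numPersons
FROM read_csv_auto('social_network/person_0_0.csv', delim='|', header=true);
```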
I apologize for the delayed response. Thank you for sharing your detailed experience and configuration for generating SF3000 datasets. I conducted a similar test and successfully generated SF3000 scale datasets by increasing the swap memory. This has been very helpful for my research. |
Hi szarnyasg, could you add src/main/resources/configuration/ldbc/snb/interactive/sf3000.properties for ldbc_snb_interactive_v1_driver ☺ |
Hello @chinyajie, I'm working on this at the moment. First have to find the correct value for the |
Thank you. |
Hi @chinyajie, I set the The parameter generator needs some more work to be able to scale for these sizes. As it's the holiday period for me now, this is something that will likely be completed early next year. |
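For context, the planned parameter generator port builds on factor tables computed over the generated graph. Below is a minimal sketch of one such factor in Spark SQL, assuming a raw CSV layout with a Person_knows_Person directory and Person1Id/Person2Id columns; these names are assumptions for illustration, not the project's actual factor definitions.

```sql
-- Expose the generated knows edges as a view (path, separator and column names assumed).
CREATE TEMPORARY VIEW person_knows_person
USING csv
OPTIONS (path 'graphs/csv/raw/dynamic/Person_knows_Person', sep '|', header 'true');

-- Example factor: number of friends per Person, usable for picking query
-- parameters with comparable selectivity.
CREATE OR REPLACE TEMPORARY VIEW person_num_friends AS
SELECT Person1Id AS personId, count(*) AS numFriends
FROM person_knows_person
GROUP BY Person1Id;
```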
Hi @szarnyasg |
The SNB Interactive benchmark is currently limited to:
- Data sets up to SF1000
- Append-only workloads without deletions
These could be amended by backporting the improvements made for the BI workload.
Larger data sets
Scaling the Interactive workload to SF3000 is not trivial: the Hadoop-based Datagen breaks for SF1000+ data sets (with an NPE) and the old parameter generator has scalability issues (it's a single-threaded Python2 script – for SF1000, it already requires about a day to finish). Therefore, we should use the new Spark-based generator. However, this creates at least three development tasks:
- The existing Cypher and SQL solutions need to be updated to work with the new schemas produced by the Spark-based Datagen.
- The Interactive parameter generator has to be ported (effectively reimplemented) in Spark/SparkSQL (Factor generation for Interactive, ldbc/ldbc_snb_datagen_spark#219).
- The inserts generated by the new data generator (e.g. `inserts/dynamic/Person/part-*.csv`) use a different format than the update streams produced by the old generator. To work around this, we would need to either adjust the driver or implement an "insert file to update stream converter". (The latter seems simpler and mostly doable in SQL; see the sketch after this list.)
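To make the converter idea more concrete, here is a rough sketch in DuckDB-flavoured SQL. It is an illustration only: the insert-file column names (creationDate, id, firstName, ...), the event-type code, and the exact update-stream layout are assumptions, not the formats actually produced by the generators or expected by the driver.

```sql
-- Load one of the new insert files (column names assumed for illustration).
CREATE TABLE person_inserts AS
SELECT *
FROM read_csv_auto('inserts/dynamic/Person/part-*.csv', delim='|', header=true);

-- Emit rows shaped like the old update stream:
-- scheduled time | dependency time | event type | operation parameters ...
COPY (
    SELECT
        epoch_ms(creationDate) AS scheduled_time,
        epoch_ms(creationDate) AS dependency_time, -- placeholder; real dependency tracking is more involved
        1                      AS event_type,      -- assumed code for an "add person" operation
        id,
        firstName,
        lastName                                   -- remaining operation parameters would follow here
    FROM person_inserts
    ORDER BY scheduled_time
) TO 'updateStream_0_0_person.csv' (DELIMITER '|', HEADER false);
```

One part this sketch glosses over is computing correct dependency times between operations, which is what makes adjusting the driver the main alternative.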
Introducing deletions
Deletions would be a realistic addition to an OLTP benchmark. The generator is capable of producing them, so it's only a matter of integrating them into the workload. The key challenges here are (1) figuring out the format -- maybe the `deletes/dynamic/Person/part-*.csv` files work well, maybe an `updateStream`-like delete stream would work better, (2) integrating them into the driver, (3) tuning their ratio, (4) determining how they should be reported in the benchmark results (e.g. a delete can be counted simply as another operation, contributing one operation to the throughput).
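For challenge (1), the updateStream-like option could be derived from the delete files with a query much like the insert converter sketched above. Again only an illustration in DuckDB-flavoured SQL, with assumed column names and a hypothetical event-type code:

```sql
-- Build a delete stream for Persons (file layout, columns and event code assumed).
COPY (
    SELECT
        epoch_ms(deletionDate) AS scheduled_time,
        epoch_ms(deletionDate) AS dependency_time, -- placeholder; the real dependency is at least the entity's own insert
        100                    AS event_type,      -- hypothetical code for a "delete person" operation
        id                     AS personId
    FROM read_csv_auto('deletes/dynamic/Person/part-*.csv', delim='|', header=true)
    ORDER BY scheduled_time
) TO 'deleteStream_0_0_person.csv' (DELIMITER '|', HEADER false);
```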
Timeline
These are plans for the mid-term future (late 2021 or early 2022), depending on the interest in such a benchmark.