spark-rapids-examples

A repo for Spark-related utilities and examples using the RAPIDS Accelerator, including ETL, ML/DL, etc.

Enterprise AI is built on ETL pipelines and relies on AI infrastructure to integrate and process large amounts of data effectively. One of the fundamental purposes of the RAPIDS Accelerator is to integrate large ETL and ML/DL pipelines. The RAPIDS Accelerator for Apache Spark integrates with machine learning frameworks and algorithms such as XGBoost and PCA. Users can use an Apache Spark cluster with NVIDIA GPUs to accelerate ETL pipelines and then use the same infrastructure to load the data frame into one or more GPUs across multiple nodes and train with GPU-accelerated XGBoost or PCA. In addition, if you use a deep learning framework to train on tabular data with the same Apache Spark cluster, we have leveraged NVIDIA’s NVTabular library to load and train the data across multiple nodes with GPUs. NVTabular is a feature engineering and preprocessing library for tabular data, designed to quickly and easily manipulate terabyte-scale datasets used to train deep-learning-based recommender systems. We have also added MIG support to YARN so that CSPs can split an A100/A30 into multiple MIG devices, each of which appears as a normal GPU.

Please see the RAPIDS Accelerator for Apache Spark documentation for supported Spark versions and requirements. We recommend setting up the Spark cluster with JDK 8.
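
As a rough illustration of the setup described above, the sketch below assembles a `spark-submit` command line that enables the RAPIDS Accelerator plugin. The configuration keys `spark.plugins` (set to `com.nvidia.spark.SQLPlugin`), `spark.rapids.sql.enabled`, `spark.executor.resource.gpu.amount` and `spark.task.resource.gpu.amount` are existing Spark and RAPIDS settings. The helper name `build_submit_command`, the jar path, the application file and the GPU amounts are hypothetical placeholders chosen for this example; the right values depend on your cluster and on the supported versions listed in the documentation.

```python
import shlex

# Settings commonly used to turn on the RAPIDS Accelerator. The values are
# illustrative assumptions, not tuned recommendations.
RAPIDS_CONF = {
    "spark.plugins": "com.nvidia.spark.SQLPlugin",
    "spark.rapids.sql.enabled": "true",
    "spark.executor.resource.gpu.amount": "1",
    "spark.task.resource.gpu.amount": "0.25",
}


def build_submit_command(app, jar, extra_conf=None):
    """Return a spark-submit argument list with the RAPIDS settings applied.

    `extra_conf` entries override the defaults in RAPIDS_CONF.
    """
    conf = dict(RAPIDS_CONF)
    conf.update(extra_conf or {})
    cmd = ["spark-submit", "--jars", jar]
    # Sort keys so the generated command is stable across runs.
    for key, value in sorted(conf.items()):
        cmd += ["--conf", f"{key}={value}"]
    cmd.append(app)
    return cmd


if __name__ == "__main__":
    # Hypothetical paths: replace with your own jar and application.
    args = build_submit_command("etl_job.py", "/path/to/rapids-4-spark.jar")
    print(shlex.join(args))
```

Building the command as a list rather than a single string avoids shell-quoting mistakes when a path or value contains spaces; `shlex.join` is used only for display.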

Getting Started Guides

1. Microbenchmark guide

The microbenchmark for the RAPIDS Accelerator for Apache Spark is intended to identify, test, and analyze the queries that are best accelerated on the GPU. For detailed information, please refer to this guide.

2. XGBoost examples guide

We provide three similar XGBoost benchmarks: Mortgage, Taxi, and Agaricus. Try one of the "Getting Started Guides". Note that the guides target the Mortgage dataset as written; with a few changes to EXAMPLE_CLASS and dataPath, they can easily be adapted to the other datasets.

3. TensorFlow training on Horovod Spark example guide

We provide a Criteo benchmark to demonstrate ETL and deep learning training on Horovod Spark. Please refer to this guide.

4. PCA example guide

This is an example of the GPU-accelerated PCA algorithm running on Spark. For detailed information, please refer to this guide.

5. MIG support

We provide guides for the Multi-Instance GPU (MIG) feature available on NVIDIA Ampere architecture GPUs (such as the NVIDIA A100 and A30).

API

1. XGBoost examples API

These guides focus on the GPU-related Scala and Python API interfaces.

Troubleshooting

You can troubleshoot issues with the following guides.

Troubleshooting XGBoost

Contributing

See the Contributing guide.

Contact Us

Please see the RAPIDS website for contact information.

License

This content is licensed under the Apache License 2.0.
