A repo for Spark-related utilities and examples using the RAPIDS Accelerator, including ETL, ML/DL, etc.
Enterprise AI is built on ETL pipelines and relies on AI infrastructure to effectively integrate and process large amounts of data. One of the fundamental purposes of the RAPIDS Accelerator is to effectively integrate large ETL and ML/DL pipelines. The RAPIDS Accelerator for Apache Spark offers seamless integration with machine learning frameworks such as XGBoost and PCA. Users can leverage an Apache Spark cluster with NVIDIA GPUs to accelerate ETL pipelines, and then use the same infrastructure to load the data frame into single or multiple GPUs across multiple nodes to train with GPU-accelerated XGBoost or PCA. In addition, if you are using a deep learning framework to train on tabular data with the same Apache Spark cluster, we have leveraged NVIDIA’s NVTabular library to load and train the data across multiple nodes with GPUs. NVTabular is a feature engineering and preprocessing library for tabular data, designed to quickly and easily manipulate terabyte-scale datasets used to train deep learning based recommender systems. We also add MIG support to YARN to allow CSPs to split an A100/A30 into multiple MIG devices and have each of them appear as a normal GPU.
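As a rough illustration of what running ETL on a GPU-enabled Spark cluster involves, the sketch below assembles a few Spark settings commonly used to enable the RAPIDS Accelerator plugin and turns them into `--conf` arguments. The helper names `rapids_spark_conf` and `to_submit_args` are hypothetical, and the values shown are illustrative assumptions rather than tuned recommendations; the RAPIDS Accelerator documentation remains the reference for the settings your version supports.

```python
def rapids_spark_conf(gpus_per_executor=1, tasks_per_gpu=4):
    """Return a dict of Spark settings that enable the RAPIDS Accelerator.

    Hypothetical helper: the default values are illustrative assumptions.
    """
    if gpus_per_executor < 1 or tasks_per_gpu < 1:
        raise ValueError("gpus_per_executor and tasks_per_gpu must be >= 1")
    return {
        # Load the RAPIDS Accelerator SQL plugin.
        "spark.plugin": "com.nvidia.spark.SQLPlugin",
        # Allow supported SQL/DataFrame operations to run on the GPU.
        "spark.rapids.sql.enabled": "true",
        # Number of GPUs each executor requests from the cluster manager.
        "spark.executor.resource.gpu.amount": str(gpus_per_executor),
        # Fraction of a GPU each task claims, so several tasks share one GPU.
        "spark.task.resource.gpu.amount": str(1 / tasks_per_gpu),
    }


def to_submit_args(conf):
    """Flatten a settings dict into spark-submit style '--conf key=value' pairs."""
    args = []
    for key, value in conf.items():
        args.extend(["--conf", f"{key}={value}"])
    return args


if __name__ == "__main__":
    for arg in to_submit_args(rapids_spark_conf()):
        print(arg)
```

The resulting list can be appended to a `spark-submit` invocation; the plugin jar itself still has to be made available to the cluster separately.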
Please see the RAPIDS Accelerator for Spark documentation for supported Spark versions and requirements. It is recommended to set up the Spark cluster with JDK 8.
The microbenchmark for the RAPIDS Accelerator for Apache Spark identifies, tests, and analyzes the queries that benefit most from GPU acceleration. For detailed information, please refer to this guide.
We provide three similar XGBoost benchmarks: Mortgage, Taxi, and Agaricus.
Try one of the "Getting Started Guides".
Please note that they target the Mortgage dataset as written, but with a few changes to EXAMPLE_CLASS and dataPath, they can be easily adapted to the other datasets.
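To make the role of EXAMPLE_CLASS and dataPath concrete, here is a minimal sketch that builds a `spark-submit` command line from those two inputs, so switching datasets only means changing the class name and the data locations. Everything specific in it is an assumption for illustration: the class names, jar name, and paths are hypothetical placeholders, and the `-dataPath=<role>::<path>` argument format is assumed, so check the Getting Started Guide for your platform for the real values.

```python
import shlex


def build_submit_command(example_class, jar, data_paths, spark_submit="spark-submit"):
    """Build a spark-submit argument list for one benchmark.

    data_paths maps a role (for example "train") to a dataset location.
    The '-dataPath=<role>::<path>' format is an assumption for illustration.
    """
    cmd = [spark_submit, "--class", example_class, jar]
    for role, path in data_paths.items():
        cmd.append(f"-dataPath={role}::{path}")
    return cmd


# Hypothetical class names, jar name, and paths: replace with real values.
mortgage_cmd = build_submit_command(
    example_class="com.example.mortgage.Main",
    jar="sample_xgboost_apps.jar",
    data_paths={"train": "/data/mortgage/train", "trans": "/data/mortgage/eval"},
)

# Adapting to another dataset only changes EXAMPLE_CLASS and dataPath.
taxi_cmd = build_submit_command(
    example_class="com.example.taxi.Main",
    jar="sample_xgboost_apps.jar",
    data_paths={"train": "/data/taxi/train", "trans": "/data/taxi/eval"},
)

if __name__ == "__main__":
    print(shlex.join(mortgage_cmd))
    print(shlex.join(taxi_cmd))
```

Keeping the class name and data paths as parameters mirrors how the guides treat them as the only dataset-specific parts of the command.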
We provide a Criteo benchmark to demonstrate ETL and deep learning training on Horovod Spark. Please refer to this guide.
This is an example of the GPU-accelerated PCA algorithm running on Spark. For detailed information, please refer to this guide.
We provide some guides about the Multi-Instance GPU (MIG) feature on GPUs based on the NVIDIA Ampere architecture (such as the NVIDIA A100 and A30).
These guides focus on GPU-related Scala and Python APIs.
You can troubleshoot issues using the following guides.
See the Contributing guide.
Please see the RAPIDS website for contact information.
This content is licensed under the Apache License 2.0.