date | duration | maintainer | order | title |
---|---|---|---|---|
w11d2 | 60 | zwmiller | 10 | Spark Intro |
- (30 min) Intro to Spark
- (30 min) Intro to Spark API
- (20 min) Word Count Exercise
- (20 min) Spark SQL Exercise (a short sketch of both exercises appears after this list)
- (45-60 min) ML with Spark
Optional, if there's time:
- (20 min) Spam Classification with Spark
- (20 min) Recommendations in Spark
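To make the word count and Spark SQL exercises concrete, here is a minimal, hypothetical sketch of one way they could look; the file path, view name, and column handling are placeholders, not taken from the course materials.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("word_count").getOrCreate()

# spark.read.text gives one row per line of the file, in a column named "value".
# "sample_text.txt" is a placeholder path.
lines = spark.read.text("sample_text.txt")

# Split each line into words, drop empties, and count occurrences.
word_counts = (
    lines
    .select(F.explode(F.split(F.lower(F.col("value")), r"\s+")).alias("word"))
    .where(F.col("word") != "")
    .groupBy("word")
    .count()
    .orderBy(F.desc("count"))
)
word_counts.show(10)

# The same result through Spark SQL: register the DataFrame as a temporary view
# and query it with plain SQL.
word_counts.createOrReplaceTempView("word_counts")
spark.sql("SELECT word, `count` FROM word_counts ORDER BY `count` DESC LIMIT 10").show()
```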
Note that there are two days for Spark. You can spread this content across both days: focus on theory and the API on the first day and get as far as possible into the exercises, then on the second day revisit the API and cover as much of the remaining material as possible.
The goal of this set of lectures and exercises is to build a lot of hands-on experience with Spark and its API. The slides go into more depth on DAGs and the lazy-evaluation style that Spark uses. On top of that, they introduce why Spark is faster than Hadoop and how it manages to be, by keeping data in RAM and scheduling work through DAGs. The slides also discuss Spark 1.0, even though we won't use it for any exercises.
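A quick, hypothetical illustration of the lazy evaluation the slides describe: transformations only add steps to the DAG, and nothing actually runs until an action is called. The specific calls below are just an example, not from the slides.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("lazy_eval_demo").getOrCreate()

# Transformations: each call just adds a node to the DAG; nothing executes yet.
numbers = spark.range(1_000_000)                        # single "id" column
evens = numbers.where(F.col("id") % 2 == 0)
squared = evens.withColumn("square", F.col("id") ** 2)

# Action: only now does Spark turn the DAG into a physical plan and run it.
print(squared.count())

# explain() shows the plan Spark built from the chain of transformations.
squared.explain()
```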
After that, we jump straight into getting used to the Spark API. We'll focus exclusively on the DataFrame version of Spark (Spark 2.0). The goal of the exercises and the Intro to Spark API is simply to give the students practice with how the API looks. If the students need more time during any of the exercises, grant it; a lot of good learning happens during those sections.
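As a reference point, here is a minimal sketch of the kind of DataFrame API calls the exercises practice; the column names and data are made up for illustration.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("dataframe_api_demo").getOrCreate()

# Build a small DataFrame in memory (in the exercises this would come from a file).
people = spark.createDataFrame(
    [("alice", 34, "nyc"), ("bob", 29, "sf"), ("carol", 41, "nyc")],
    schema=["name", "age", "city"],
)

# Typical DataFrame operations: filter, group, aggregate, sort, display.
(people
 .where(F.col("age") > 30)
 .groupBy("city")
 .agg(F.avg("age").alias("avg_age"), F.count("*").alias("n"))
 .orderBy("city")
 .show())
```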
The ML part will also require a lot of time. Spending time on VectorAssembler and Pipeline is mandatory, as is explaining the difference between Spark ML and Spark MLlib and how to use the different evaluation metrics. It's also important to highlight the idiosyncratic differences between scikit-learn and Spark ML.
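A minimal sketch of the VectorAssembler/Pipeline pattern this section centers on, assuming a toy binary-classification setup; the feature names, data, and model choice are invented for illustration.

```python
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator

spark = SparkSession.builder.appName("spark_ml_demo").getOrCreate()

# Toy data: two numeric features and a binary label.
df = spark.createDataFrame(
    [(1.0, 3.2, 0.0), (2.5, 1.1, 1.0), (0.3, 4.8, 0.0), (3.1, 0.9, 1.0)],
    schema=["feat_a", "feat_b", "label"],
)

# Unlike scikit-learn, Spark ML estimators expect a single vector column of
# features; VectorAssembler builds it from the individual columns.
assembler = VectorAssembler(inputCols=["feat_a", "feat_b"], outputCol="features")
lr = LogisticRegression(featuresCol="features", labelCol="label")

# Pipeline chains the stages so fit/transform run them in order.
pipeline = Pipeline(stages=[assembler, lr])
model = pipeline.fit(df)

predictions = model.transform(df)
evaluator = BinaryClassificationEvaluator(labelCol="label")  # areaUnderROC by default
print(evaluator.evaluate(predictions))
```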
At the end of this lesson the students should:
- Understand how Spark differs from MapReduce/Hadoop
- Know what a DAG is and why it makes Spark go
- Know what an RDD is
- Be familiar with the Spark DataFrame API
- Know how to use Spark SQL
- Know how Spark ML works
- Know what VectorAssembler and Pipeline are and why they're important
- Have a sense of how flexible Spark is in dealing with tabular and text data
- Setting up a Spark cluster on Amazon EMR: walkthrough
- Reading and writing directly to S3 with Spark: example
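As a companion to the S3 example above, here is a minimal, hypothetical sketch of reading from and writing back to S3 in PySpark; the bucket name and paths are placeholders, and the cluster (e.g. EMR) is assumed to already have S3 credentials configured.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("s3_io_demo").getOrCreate()

# Read a CSV straight from S3. "my-example-bucket" and the key are placeholders;
# on EMR the s3:// scheme works out of the box (elsewhere you may need the
# hadoop-aws package and the s3a:// scheme instead).
df = spark.read.csv("s3://my-example-bucket/raw/data.csv", header=True, inferSchema=True)

# ... transformations would go here ...

# Write the result back to S3 as Parquet, overwriting any previous output.
df.write.mode("overwrite").parquet("s3://my-example-bucket/processed/data.parquet")
```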