date | duration | maintainer | order | title |
---|---|---|---|---|
w11d2 | 60 | zwmiller | 10 | Spark Intro |
- (30 min) Intro to Spark
- (30 min) Intro to Spark API
- (20 min) Word Count Exercise
- (20 min) Spark SQL Exercise (a short sketch of both exercises appears after this list)
- (45-60 min) ML with Spark
Optional, if there's time:
- (20 min) Spam Classification with Spark
- (20 min) Recommendations in Spark
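To make the word count and Spark SQL exercises concrete, here is a minimal, hypothetical sketch of one way they could look; the file path, view name, and column handling are placeholders, not taken from the course materials.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("word_count").getOrCreate()

# spark.read.text gives one row per line of the file, in a column named "value".
# "sample_text.txt" is a placeholder path.
lines = spark.read.text("sample_text.txt")

# Split each line into words, drop empties, and count occurrences.
word_counts = (
    lines
    .select(F.explode(F.split(F.lower(F.col("value")), r"\s+")).alias("word"))
    .where(F.col("word") != "")
    .groupBy("word")
    .count()
    .orderBy(F.desc("count"))
)
word_counts.show(10)

# The same result through Spark SQL: register the DataFrame as a temporary view
# and query it with plain SQL.
word_counts.createOrReplaceTempView("word_counts")
spark.sql("SELECT word, `count` FROM word_counts ORDER BY `count` DESC LIMIT 10").show()
```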
Note that there are two days for Spark. You can spread this content across both days: focus on theory and the API on the first day and get as far as possible into the exercises, then on the second day revisit the API and cover as much of the remaining material as possible.
The goal of this set of lectures and exercises is to build a lot of hands-on experience with Spark and its API. The slides go into more depth on DAGs and the lazy-evaluation style that Spark uses. On top of that, they introduce why Spark is faster than Hadoop and how it manages to be, by keeping data in RAM and scheduling work through DAGs. The slides also discuss Spark 1.0, even though we won't use it for any exercises.
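A quick, hypothetical illustration of the lazy evaluation the slides describe: transformations only add steps to the DAG, and nothing actually runs until an action is called. The specific calls below are just an example, not from the slides.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("lazy_eval_demo").getOrCreate()

# Transformations: each call just adds a node to the DAG; nothing executes yet.
numbers = spark.range(1_000_000)                        # single "id" column
evens = numbers.where(F.col("id") % 2 == 0)
squared = evens.withColumn("square", F.col("id") ** 2)

# Action: only now does Spark turn the DAG into a physical plan and run it.
print(squared.count())

# explain() shows the plan Spark built from the chain of transformations.
squared.explain()
```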
After that, we jump straight into getting used to the Spark API. We'll focus exclusively on the DataFrame version of Spark (Spark 2.0). The goal of the exercises and the Intro to Spark API is simply to give the students practice with how the API looks. If the students need more time during any of the exercises, grant it; a lot of good learning happens during those sections.
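As a reference point, here is a minimal sketch of the kind of DataFrame API calls the exercises practice; the column names and data are made up for illustration.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("dataframe_api_demo").getOrCreate()

# Build a small DataFrame in memory (in the exercises this would come from a file).
people = spark.createDataFrame(
    [("alice", 34, "nyc"), ("bob", 29, "sf"), ("carol", 41, "nyc")],
    schema=["name", "age", "city"],
)

# Typical DataFrame operations: filter, group, aggregate, sort, display.
(people
 .where(F.col("age") > 30)
 .groupBy("city")
 .agg(F.avg("age").alias("avg_age"), F.count("*").alias("n"))
 .orderBy("city")
 .show())
```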
The ML part will also require a lot of time. Spending time on VectorAssembler and Pipeline is mandatory, as is explaining the difference between Spark ML and Spark MLlib and how to use the different evaluation metrics. It's also important to highlight the idiosyncratic differences between scikit-learn and Spark ML.
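A minimal sketch of the VectorAssembler/Pipeline pattern this section centers on, assuming a toy binary-classification setup; the feature names, data, and model choice are invented for illustration.

```python
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator

spark = SparkSession.builder.appName("spark_ml_demo").getOrCreate()

# Toy data: two numeric features and a binary label.
df = spark.createDataFrame(
    [(1.0, 3.2, 0.0), (2.5, 1.1, 1.0), (0.3, 4.8, 0.0), (3.1, 0.9, 1.0)],
    schema=["feat_a", "feat_b", "label"],
)

# Unlike scikit-learn, Spark ML estimators expect a single vector column of
# features; VectorAssembler builds it from the individual columns.
assembler = VectorAssembler(inputCols=["feat_a", "feat_b"], outputCol="features")
lr = LogisticRegression(featuresCol="features", labelCol="label")

# Pipeline chains the stages so fit/transform run them in order.
pipeline = Pipeline(stages=[assembler, lr])
model = pipeline.fit(df)

predictions = model.transform(df)
evaluator = BinaryClassificationEvaluator(labelCol="label")  # areaUnderROC by default
print(evaluator.evaluate(predictions))
```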
At the end of this lesson the students should:
- Understand how Spark differs from MapReduce/Hadoop
- Know what a DAG is and why it makes Spark go
- Know what an RDD is
- Be familiar with the Spark DataFrame API
- Know how to use Spark SQL
- Know how Spark ML works
- Know what VectorAssembler and Pipeline are and why they're important
- Have a sense of how flexible Spark is in dealing with tabular and text data
- Setting up a Spark cluster on Amazon EMR: walkthrough
- Reading and writing directly to S3 with Spark: example
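As a companion to the S3 example above, here is a minimal, hypothetical sketch of reading from and writing back to S3 in PySpark; the bucket name and paths are placeholders, and the cluster (e.g. EMR) is assumed to already have S3 credentials configured.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("s3_io_demo").getOrCreate()

# Read a CSV straight from S3. "my-example-bucket" and the key are placeholders;
# on EMR the s3:// scheme works out of the box (elsewhere you may need the
# hadoop-aws package and the s3a:// scheme instead).
df = spark.read.csv("s3://my-example-bucket/raw/data.csv", header=True, inferSchema=True)

# ... transformations would go here ...

# Write the result back to S3 as Parquet, overwriting any previous output.
df.write.mode("overwrite").parquet("s3://my-example-bucket/processed/data.parquet")
```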