This course provides in-depth coverage of big data topics, from data generation, storage, management, and transfer to analytics, with a focus on the state-of-the-art technologies, tools, architectures, and systems that constitute big-data computing solutions in high-performance networks. Real-life big-data applications and workflows in various domains (particularly in the sciences) are introduced as use cases to illustrate the development, deployment, and execution of a wide spectrum of emerging big-data solutions.
The first 1/4 of the course is devoted to the foundations of functional programming. This includes an introduction to the Scala language, functional and object-oriented data types, Scala collections, pure functions, referential transparency, and anonymous, higher-order, and recursive functions.
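As a rough illustration of the ideas named above, the following minimal sketch shows a pure function, a higher-order function, an anonymous function, and a tail-recursive function in Scala. The object and function names (`FunctionalBasics`, `square`, `applyTwice`, `sumList`) are hypothetical and chosen only for this example; they are not taken from the course materials.

```scala
import scala.annotation.tailrec

object FunctionalBasics {
  // Pure function: the result depends only on the argument; no side effects.
  def square(x: Int): Int = x * x

  // Higher-order function: takes another function as a parameter.
  def applyTwice(f: Int => Int, x: Int): Int = f(f(x))

  // Recursive function in tail-recursive form, using an accumulator.
  def sumList(xs: List[Int]): Int = {
    @tailrec
    def loop(rest: List[Int], acc: Int): Int = rest match {
      case Nil          => acc
      case head :: tail => loop(tail, acc + head)
    }
    loop(xs, 0)
  }

  def main(args: Array[String]): Unit = {
    val nums = List(1, 2, 3, 4)        // immutable list
    val doubled = nums.map(n => n * 2) // anonymous function passed to map
    println(doubled)                   // List(2, 4, 6, 8)
    println(applyTwice(square, 3))     // 81
    println(sumList(nums))             // 10
  }
}
```

Because `square` is pure, any call such as `square(3)` can be replaced by its value `9` without changing the program's behavior, which is what referential transparency means.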
The middle 1/2 of the course is devoted to learning Spark. Students will learn Spark's approach to distributed data parallelism in the context of prior technologies like Hadoop and MapReduce. In particular, we focus on the impact that data movement has on the computational complexity of big data analysis jobs.
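To make the map-and-reduce pattern concrete, here is a small sketch that uses only plain Scala collections, not the Spark API. Each inner list stands in for one partition of a larger dataset; the object name `LocalWordCount`, its functions, and the sample data are all hypothetical. On a real cluster, the merge step is where partial results would have to travel between machines.

```scala
object LocalWordCount {
  // Hypothetical input: each inner list stands in for one partition of a dataset.
  val partitions: List[List[String]] = List(
    List("spark scala", "scala"),
    List("spark hadoop")
  )

  // "Map" phase: each partition counts its own words independently.
  def countPartition(lines: List[String]): Map[String, Int] =
    lines
      .flatMap(line => line.split(" ").toList)
      .groupBy(identity)
      .map { case (word, occurrences) => (word, occurrences.size) }

  // "Reduce" phase: merge two partial results by adding counts per word.
  def merge(a: Map[String, Int], b: Map[String, Int]): Map[String, Int] =
    b.foldLeft(a) { case (acc, (word, n)) =>
      acc.updated(word, acc.getOrElse(word, 0) + n)
    }

  // Combine the per-partition counts into one overall result.
  def wordCount(parts: List[List[String]]): Map[String, Int] =
    parts.map(countPartition).foldLeft(Map.empty[String, Int])(merge)

  def main(args: Array[String]): Unit =
    println(wordCount(partitions))
}
```

Counting within a partition needs no communication, while merging does; this is a simplified picture of why data movement figures so heavily in the cost of a distributed job.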
The final 1/4 of the course is devoted to cloud computing. In this part, students will learn how to deploy their Scala/Spark programs in the cloud on platforms such as AWS, Databricks, and Azure.
Upon completion, students will know:
- how to write functional programs using pure functions and immutable data types;
- how to analyze large datasets using modern tools;
- the factors that are most important to the success of a big data architecture/stack;
- how to use big data architectures such as Spark and its ecosystem;
- the fundamental concepts, strategies and pitfalls of processing data at scale.
Part 1. Scala
Week | HW Due | Topics | Readings |
---|---|---|---|
1 | | Introduction to Big Data; programming paradigms; history and evolution of Big Data; functional programming | |
2 | HW 1: 2/7 | Functional data types, lists, functions on lists, higher-order functions, map, filter, currying | |
3 | Project 0: 2/17 | The substitution model, evaluation strategies (CBV, CBN), termination, tail recursion, higher-order list functions, flatMap | |
4 | HW 2: 3/1 | Evaluation and operators; class hierarchies; polymorphism; pattern matching; other collections | |
Part 2. Spark
Week | HW Due | Topics | Readings |
---|---|---|---|
5 | Project 1: 3/10 | Distributed data parallelism, latency, Spark's Structured APIs, RDDs | LS Ch 1, 2 |
6 | | RDD transformations and actions, evaluation in Spark | LS Ch 3 |
7 | | Cluster topology, reduction operations, pair RDDs, transformations and actions on pair RDDs | LS Ch 7 |
8 | | Joins, shuffling, partitioning, optimizing with partitioners | LS Ch 4 |
9 | | Wide vs narrow dependencies, structured vs unstructured data | LS Ch 5 |
10 | | DataFrames, Spark SQL | LS Ch 6 |
11 | | Datasets | LS Ch 6 |
Part 3. Cloud
Week | HW Due | Topics | Readings |
---|---|---|---|
13 | | Obtaining, cleaning, and storing data; Kaggle | |
14 | | AWS, Databricks, Azure | |
15 | | TBA | |
16 | | TBA | |