DS 644: Introduction to Big Data (§102, §104)

Course Syllabus and Schedule

Course Description

This course provides in-depth coverage of big data topics, from data generation, storage, management, and transfer to analytics, with a focus on the state-of-the-art technologies, tools, architectures, and systems that constitute big-data computing solutions in high-performance networks. Real-life big-data applications and workflows in various domains (particularly in the sciences) are introduced as use cases to illustrate the development, deployment, and execution of a wide spectrum of emerging big-data solutions.

The first 1/4 of the course is devoted to the foundations of functional programming. This includes an introduction to the Scala language, functional and object-oriented data types, Scala collections, pure functions, referential transparency, and anonymous, higher-order, and recursive functions.
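
As a rough illustration of the terms listed above, the following minimal sketch shows a pure function, a higher-order function, an anonymous function, and a recursive function over an immutable list. The object and function names (`FpBasics`, `square`, `applyTwice`, `sum`) are hypothetical and chosen only for this example; they are not taken from the course materials.

```scala
object FpBasics {
  // Pure function: the result depends only on the argument and there are no side effects,
  // so any call square(3) can be replaced by 9 (referential transparency).
  def square(x: Int): Int = x * x

  // Higher-order function: takes another function as a parameter.
  def applyTwice(f: Int => Int, x: Int): Int = f(f(x))

  // Recursive function over an immutable List, using pattern matching.
  def sum(xs: List[Int]): Int = xs match {
    case Nil          => 0
    case head :: tail => head + sum(tail)
  }

  def main(args: Array[String]): Unit = {
    val nums = List(1, 2, 3, 4)                              // immutable collection
    val evensSquared = nums.filter(_ % 2 == 0).map(square)   // anonymous function passed to filter
    println(evensSquared)             // prints List(4, 16)
    println(applyTwice(square, 3))    // prints 81
    println(sum(nums))                // prints 10
  }
}
```

Nothing here mutates state: `nums` is never modified, and `filter` and `map` each return a new list.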

The middle 1/2 of the course is devoted to learning Spark. Students will learn Spark's approach to distributed data parallelism in the context of prior technologies like Hadoop and MapReduce. In particular, we focus on the impact that data movement has on the computational complexity of big data analysis jobs.
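
To make the point about data movement concrete, here is a small sketch written with plain Scala collections rather than Spark itself. It assumes each inner list stands in for one partition of a distributed dataset; the data and all names (`WordCountSketch`, `mapPhase`, `reducePhase`) are hypothetical. The per-partition step needs no communication, while grouping by key is the step that, on a real cluster, would require moving records between machines.

```scala
object WordCountSketch {
  // Each inner list stands in for one partition of a distributed dataset (made-up data).
  val partitions: List[List[String]] = List(
    List("spark is fast", "hadoop is reliable"),
    List("spark is in memory")
  )

  // "Map" phase: runs independently on each partition, so no data has to move.
  def mapPhase(part: List[String]): List[(String, Int)] =
    part.flatMap(line => line.split(" ").toList).map(word => (word, 1))

  // "Shuffle and reduce" phase: pairs with the same key must be brought together
  // before they can be summed. On a cluster this grouping is where data crosses the network.
  def reducePhase(pairs: List[(String, Int)]): Map[String, Int] =
    pairs.groupBy(_._1).map { case (word, ps) => (word, ps.map(_._2).sum) }

  // Combine the per-partition results, then group and sum by word.
  def wordCount: Map[String, Int] = reducePhase(partitions.flatMap(mapPhase))
}
```

In this sketch `WordCountSketch.wordCount("spark")` evaluates to 2 and `WordCountSketch.wordCount("is")` to 3.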

The final 1/4 of the course is devoted to cloud computing. In this part, students will learn how to deploy their Scala/Spark programs in the cloud on platforms such as AWS, Databricks, and Azure.

Learning Objectives

Upon completion, students will know

how to write functional programs using pure functions and immutable data types;
how to analyze large datasets using modern tools;
the factors that are most important to the success of a big data architecture/stack;
how to use big data architectures such as Spark and its ecosystem;
the fundamental concepts, strategies and pitfalls of processing data at scale.

Schedule

Part 1. Scala

Week	HW Due	Topics
1		Introduction to Big Data; programming paradigms; history and evolution of Big Data; functional programming
2	HW 1: 2/7	Functional data types, lists, functions on lists, higher-order functions, map, filter, currying
3	Project 0: 2/17	The substitution model, evaluation strategies (CBV, CBN), termination, tail recursion, higher-order list functions, flatMap
4	HW 2: 3/1	Evaluation and operators; class hierarchies; polymorphism; pattern matching; other collections
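
Several of the Part 1 topics in the table above (tail recursion, call-by-value versus call-by-name, currying, and flatMap) can be sketched in a few lines. This is an illustrative example only, not course material; the names `Part1Sketch`, `factorial`, `first`, `add`, `addFive`, and `pairsUpTo` are hypothetical.

```scala
import scala.annotation.tailrec

object Part1Sketch {
  // Tail recursion: the recursive call is the last action, so the compiler can
  // turn it into a loop. @tailrec makes compilation fail if that is not possible.
  def factorial(n: Int): BigInt = {
    @tailrec
    def loop(k: Int, acc: BigInt): BigInt =
      if (k <= 1) acc else loop(k - 1, acc * k)
    loop(n, BigInt(1))
  }

  // Call-by-value (x: Int) evaluates the argument once, before the call.
  // Call-by-name (y: => Int) evaluates it each time it is used, possibly never.
  def first(x: Int, y: => Int): Int = x

  // Currying: multiple parameter lists that can be applied one at a time.
  def add(a: Int)(b: Int): Int = a + b
  val addFive: Int => Int = add(5)

  // flatMap: map each element to a list, then flatten the results into one list.
  def pairsUpTo(n: Int): List[(Int, Int)] =
    (1 to n).toList.flatMap(i => (1 to i).toList.map(j => (i, j)))
}
```

Because `y` is passed by name and never used, `Part1Sketch.first(1, sys.error("never evaluated"))` returns 1 instead of throwing; with a call-by-value parameter the same call would fail before `first` ran.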

Part 2. Spark

Week	HW Due	Topics	Readings
5	Project 1: 3/10	Distributed data parallelism, latency, Spark's Structured APIs, RDDs	LS Ch 1, 2
6		RDD transformations and actions, evaluation in Spark	LS Ch 3
7		Cluster topology, reduction operations, pair RDDs, transformations and actions on pair RDDs	LS Ch 7
8		Joins, shuffling, partitioning, optimizing with partitioners	LS Ch 4
9		Wide vs narrow dependencies, structured vs unstructured data	LS Ch 5
10		DataFrames, Spark SQL	LS Ch 6
11		Datasets	LS Ch 6

Part 3. Cloud

Week	HW Due	Topics	Readings
13		Obtaining, cleaning, and storing data; Kaggle
14		AWS, Databricks, Azure
15		TBA
16		TBA
