DS 644: Introduction to Big Data (§102, §104)

Course Syllabus and Schedule


Course Description

This course provides in-depth coverage of big data topics, from data generation, storage, management, and transfer to analytics, with a focus on the state-of-the-art technologies, tools, architectures, and systems that constitute big-data computing solutions in high-performance networks. Real-life big-data applications and workflows in various domains (particularly in the sciences) are introduced as use cases to illustrate the development, deployment, and execution of a wide spectrum of emerging big-data solutions.

The first 1/4 of the course is devoted to the foundations of functional programming. This includes an introduction to the Scala language, functional and object-oriented data types, Scala collections, pure functions, referential transparency, and anonymous, higher-order, and recursive functions.
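
As a rough illustration of the vocabulary in this part of the course, the sketch below shows a pure function, a higher-order function, an anonymous function, and a tail-recursive function in Scala. The object and function names (`FunctionalBasics`, `square`, `applyTwice`, `sumList`) are hypothetical and chosen only for this example; they are not taken from the course materials.

```scala
import scala.annotation.tailrec

// Hypothetical example object; names are illustrative, not from the course.
object FunctionalBasics {
  // Pure function: the result depends only on the argument, with no side effects.
  def square(x: Int): Int = x * x

  // Higher-order function: takes another function as a parameter.
  def applyTwice(f: Int => Int, x: Int): Int = f(f(x))

  // Recursive function, written tail-recursively so it uses constant stack space.
  def sumList(xs: List[Int]): Int = {
    @tailrec
    def loop(rest: List[Int], acc: Int): Int = rest match {
      case Nil          => acc
      case head :: tail => loop(tail, acc + head)
    }
    loop(xs, 0)
  }

  def main(args: Array[String]): Unit = {
    val nums = List(1, 2, 3, 4)          // immutable collection
    val doubled = nums.map(n => n * 2)   // anonymous function passed to map
    println(doubled)                     // List(2, 4, 6, 8)
    println(applyTwice(square, 3))       // 81
    println(sumList(nums))               // 10
    // Referential transparency: square(3) can be replaced by its value, 9.
    println(square(3) + square(3) == 9 + 9) // true
  }
}
```

Because `square` is pure, any call such as `square(3)` can be substituted by its result without changing the program's behavior, which is what referential transparency means here.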

The middle 1/2 of the course is devoted to learning Spark. Students will learn Spark's approach to distributed data parallelism in the context of prior technologies like Hadoop and MapReduce. In particular, we focus on the impact that data movement has on the computational complexity of big data analysis jobs.

The final 1/4 of the course is devoted to cloud computing. In this part, students will learn how to deploy their Scala/Spark programs in the cloud on platforms such as AWS, Databricks, and Azure.

Learning Objectives

Upon completion, students will know

  • how to write functional programs using pure functions and immutable data types;
  • how to analyze large datasets using modern tools;
  • the factors that are most important to the success of a big data architecture/stack;
  • how to use big data architectures such as Spark and its ecosystem; and
  • the fundamental concepts, strategies, and pitfalls of processing data at scale.

Schedule

Part 1. Scala

| Week | HW Due | Topics | Readings |
|------|--------|--------|----------|
| 1 | | Introduction to Big Data; programming paradigms; history and evolution of Big Data; functional programming | |
| 2 | HW 1: 2/7 | Functional data types, lists, functions on lists, higher-order functions, map, filter, currying | |
| 3 | Project 0: 2/17 | The substitution model, evaluation strategies (CBV, CBN), termination, tail recursion, higher-order list functions, flatMap | |
| 4 | HW 2: 3/1 | Evaluation and operators; class hierarchies; polymorphism; pattern matching; other collections | |

Part 2. Spark

| Week | HW Due | Topics | Readings |
|------|--------|--------|----------|
| 5 | Project 1: 3/10 | Distributed data parallelism, latency, Spark's Structured APIs, RDDs | LS Ch 1, 2 |
| 6 | | RDD transformations and actions, evaluation in Spark | LS Ch 3 |
| 7 | | Cluster topology, reduction operations, pair RDDs, transformations and actions on pair RDDs | LS Ch 7 |
| 8 | | Joins, shuffling, partitioning, optimizing with partitioners | LS Ch 4 |
| 9 | | Wide vs narrow dependencies, structured vs unstructured data | LS Ch 5 |
| 10 | | DataFrames, Spark SQL | LS Ch 6 |
| 11 | | Datasets | LS Ch 6 |

Part 3. Cloud

| Week | HW Due | Topics | Readings |
|------|--------|--------|----------|
| 13 | | Obtaining, cleaning, and storing data; Kaggle | |
| 14 | | AWS, Databricks, Azure | |
| 15 | | TBA | |
| 16 | | TBA | |