# ETL flow using Spark script and Relational database on Cloud

You need to replicate the scenario taught in the classroom. You may use the cloud provider of your
choice (AWS or GCP).

1. Push the crimes data file to an S3/GCS bucket.
2. Create a Python/Scala project locally with the spark-core and spark-sql dependencies.
3. Build an ETL pipeline that fetches the crimes committed in 2007 from the data loaded in the bucket
   and loads the result into a relational database table with a (crime, count) schema.
   (The relational database must be on the cloud.)

The code runs in the local environment and connects to the cloud components or services.
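For the local-run setup, a minimal PySpark session sketch is shown below. The hadoop-aws/MySQL connector versions, the credential placeholders, and the use of the s3a connector are assumptions for illustration, not the exact configuration used in this repository.

```python
from pyspark.sql import SparkSession

# Local SparkSession able to read from S3 (s3a) and write to MySQL over JDBC.
# Placeholder credentials and package versions; match hadoop-aws to your
# local Spark/Hadoop build.
spark = (
    SparkSession.builder
    .appName("crimes-etl")
    .config("spark.jars.packages",
            "org.apache.hadoop:hadoop-aws:3.3.4,mysql:mysql-connector-java:8.0.33")
    .config("spark.hadoop.fs.s3a.access.key", "<AWS_ACCESS_KEY_ID>")
    .config("spark.hadoop.fs.s3a.secret.key", "<AWS_SECRET_ACCESS_KEY>")
    .getOrCreate()
)
```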

BONUS (1 Absolute Mark): Deploy your code on the cloud and execute it from there.

Note:
To execute this scenario, the scripting must follow a distributed programming paradigm (MapReduce).

I have done two implementations:

1 - Using S3, PySpark, and RDS (MySQL)
    Reference directory: Code Base/AWS-Spark-RDS Based Implementation
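A minimal sketch of this approach, reusing the SparkSession configured above. The column names ("Year", "Primary Type") follow the public Chicago crimes dataset and the bucket, endpoint, and credential values are placeholders; the actual script in the reference directory may differ.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

# Reuses the S3/JDBC configuration from the session sketch above.
spark = SparkSession.builder.appName("crimes-etl").getOrCreate()

# Read the crimes file from the S3 bucket.
crimes = spark.read.csv("s3a://<your-bucket>/crimes.csv",
                        header=True, inferSchema=True)

# Keep only crimes committed in 2007 and count them per crime type,
# producing the (crime, count) schema required by the task.
crime_counts = (
    crimes.filter(col("Year") == 2007)
          .groupBy(col("Primary Type").alias("crime"))
          .count()
)

# Load the result into a MySQL table on RDS via JDBC.
(crime_counts.write
    .format("jdbc")
    .option("url", "jdbc:mysql://<rds-endpoint>:3306/<database>")
    .option("dbtable", "crime_counts")
    .option("user", "<db-user>")
    .option("password", "<db-password>")
    .option("driver", "com.mysql.cj.jdbc.Driver")
    .mode("overwrite")
    .save())
```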

## The second approach is the one followed in industry

2 - Using S3, Athena, and RDS (MySQL)
    Reference directory: Code Base/Athena Based Implementation
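A minimal sketch of the Athena-based approach, assuming the crimes data in S3 has been registered as a table (here `crimes` in database `crimes_db`) in the Athena/Glue catalog; the table, column, bucket, and connection names are placeholders, not the ones used in the repository.

```python
import time
import boto3
import pymysql

# Run the aggregation in Athena over the crimes table.
athena = boto3.client("athena", region_name="us-east-1")

query = """
    SELECT primary_type AS crime, COUNT(*) AS count
    FROM crimes
    WHERE year = 2007
    GROUP BY primary_type
"""
execution = athena.start_query_execution(
    QueryString=query,
    QueryExecutionContext={"Database": "crimes_db"},
    ResultConfiguration={"OutputLocation": "s3://<your-bucket>/athena-results/"},
)
query_id = execution["QueryExecutionId"]

# Poll until the query finishes.
while True:
    state = athena.get_query_execution(QueryExecutionId=query_id)[
        "QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(2)

# Read the result rows (the first row holds the column headers) and
# insert them into the (crime, count) table on RDS.
rows = athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"][1:]
records = [(r["Data"][0]["VarCharValue"], int(r["Data"][1]["VarCharValue"]))
           for r in rows]

conn = pymysql.connect(host="<rds-endpoint>", user="<db-user>",
                       password="<db-password>", database="<database>")
with conn.cursor() as cur:
    cur.executemany("INSERT INTO crime_counts (crime, count) VALUES (%s, %s)", records)
conn.commit()
conn.close()
```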

The Outputs folder contains screenshots of both implementations.
