You need to replicate the scenario taught today in the classroom. You may use the cloud provider (AWS or
GCP) of your choice.
1. Push the crimes data file to an S3/GCS bucket (a minimal upload sketch follows this list).
2. Create a Python/Scala project locally with the spark-core and spark-sql dependencies.
3. Create an ETL pipeline that fetches the crimes committed in the year 2007 from the data loaded in the bucket
and loads the result into any relational DB table with a (crime, count) schema
(the relational DB must be hosted on the cloud); see the PySpark sketch under approach 1 below.
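For step 1, a minimal upload sketch using boto3 is shown below. It assumes AWS credentials are already configured locally (e.g. via `aws configure`); the bucket name, object key, and local file name are illustrative placeholders, not part of the assignment.

```python
# Minimal sketch for step 1: push the crimes data file to an S3 bucket.
# Assumes AWS credentials are configured; bucket/key/file names are placeholders.
import boto3

s3 = boto3.client("s3")
s3.upload_file(
    Filename="crimes.csv",        # local crimes data file (placeholder)
    Bucket="crimes-data-bucket",  # your bucket name (placeholder)
    Key="raw/crimes.csv",         # object key inside the bucket
)
```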
Your code will be executed in the local environment and will connect to cloud components or services.
BONUS (1 absolute mark): Deploy your code on the cloud and execute it from there.
Note:
To execute this scenario, the scripting must be done using a distributed programming paradigm (MapReduce).
1 - Using S3, PySpark, and RDS (MySQL)
Reference Directory: Code Base/AWS-Spark-RDS Based Implementation
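A minimal sketch of this approach is given below; it is not the reference implementation itself. It assumes the crimes file exposes `Year` and `Primary Type` columns (as in the Chicago crimes dataset), that the hadoop-aws package and the MySQL Connector/J driver are available on the Spark classpath, and that the bucket name, RDS endpoint, database, table, and credentials are placeholders you replace with your own.

```python
# Minimal sketch of approach 1: S3 -> PySpark -> RDS (MySQL).
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, count

spark = (
    SparkSession.builder
    .appName("crimes-2007-etl")
    .getOrCreate()
)

# Read the raw crimes file straight from the bucket (s3a:// path assumes hadoop-aws).
crimes = (
    spark.read
    .option("header", "true")
    .option("inferSchema", "true")
    .csv("s3a://crimes-data-bucket/raw/crimes.csv")
)

# Keep only crimes committed in 2007 and aggregate to the (crime, count) schema.
crimes_2007 = (
    crimes.filter(col("Year") == 2007)
    .groupBy(col("Primary Type").alias("crime"))
    .agg(count("*").alias("count"))
)

# Write the result into the cloud-hosted MySQL table over JDBC.
crimes_2007.write.format("jdbc").options(
    url="jdbc:mysql://<rds-endpoint>:3306/crimesdb",  # placeholder endpoint/database
    driver="com.mysql.cj.jdbc.Driver",
    dbtable="crime_counts",                           # placeholder table name
    user="<user>",
    password="<password>",
).mode("overwrite").save()
```

The filter/groupBy/count runs as a distributed Spark job, which is what satisfies the MapReduce-style requirement in the note above.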
## The second approach is the one followed in industry
2 - Using S3, Athena, and RDS (MySQL)
Reference Directory: Code Base/Athena Based Implementation
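A minimal sketch of the Athena route is shown below, again not the reference implementation. It assumes the crimes file has already been registered as an Athena external table (called `crimes_db.crimes` here, with lowercase `year` and `primary_type` columns; both names are assumptions), that `s3fs` is installed so pandas can read the Athena result directly from S3, and that the bucket, region, and RDS connection details are placeholders.

```python
# Minimal sketch of approach 2: S3 -> Athena -> RDS (MySQL).
import time

import boto3
import pandas as pd
from sqlalchemy import create_engine

athena = boto3.client("athena", region_name="us-east-1")  # region is illustrative

# Aggregate crimes committed in 2007 into the (crime, count) shape.
query = """
    SELECT primary_type AS crime, COUNT(*) AS count
    FROM crimes_db.crimes
    WHERE year = 2007
    GROUP BY primary_type
"""

run = athena.start_query_execution(
    QueryString=query,
    ResultConfiguration={"OutputLocation": "s3://crimes-data-bucket/athena-results/"},
)
query_id = run["QueryExecutionId"]

# Poll until Athena finishes writing the result CSV to the output location.
while True:
    state = athena.get_query_execution(QueryExecutionId=query_id)["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(2)

if state != "SUCCEEDED":
    raise RuntimeError(f"Athena query ended in state {state}")

# Read the result CSV from S3 and load it into the RDS MySQL table.
result = pd.read_csv(f"s3://crimes-data-bucket/athena-results/{query_id}.csv")
engine = create_engine("mysql+pymysql://<user>:<password>@<rds-endpoint>:3306/crimesdb")
result.to_sql("crime_counts", engine, if_exists="replace", index=False)
```

Here Athena does the distributed scan and aggregation over the data in S3, so the local script only orchestrates the query and loads the small (crime, count) result into RDS.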