We built an automated pipeline that takes in new data, performs the appropriate transformations, and loads the data into existing tables. We wrote a function that takes three files as input (Wikipedia data, Kaggle metadata, and MovieLens rating data), which is the Extract step. The function then cleans and merges the data as needed (Transform) and loads the result into a PostgreSQL database (Load).
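The Extract step described above can be sketched as follows. This is a minimal illustration, not the project's actual code: the function name `extract_three_files`, its parameters, and the `max_ratings` cap are hypothetical, and it assumes the Wikipedia file is a JSON array while the Kaggle metadata and ratings files are CSVs with a header row. It uses only the Python standard library so it runs without extra packages.

```python
import csv
import json


def extract_three_files(wiki_path, kaggle_path, ratings_path, max_ratings=None):
    """Read the three input files and return their raw records.

    Assumptions (not confirmed by the project): the Wikipedia file is a
    JSON array of objects, and the other two files are CSVs with headers.
    """
    # Wikipedia data: assumed to be a JSON array of movie dictionaries.
    with open(wiki_path, encoding="utf-8") as f:
        wiki_movies = json.load(f)

    # Kaggle metadata: one dictionary per CSV row, keyed by the header.
    with open(kaggle_path, newline="", encoding="utf-8") as f:
        kaggle_metadata = list(csv.DictReader(f))

    # Ratings: the file is large, so optionally stop after max_ratings rows.
    ratings = []
    with open(ratings_path, newline="", encoding="utf-8") as f:
        for i, row in enumerate(csv.DictReader(f)):
            if max_ratings is not None and i >= max_ratings:
                break
            ratings.append(row)

    return wiki_movies, kaggle_metadata, ratings
```

Capping the number of ratings rows is a convenience for trying the function on a sample before running it on the full file; all CSV values come back as strings and would still need type conversion in the Transform step.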
This project consists of four technical analysis deliverables. We will submit the following:
Deliverable 1: Write an ETL Function to Read Three Data Files
Click the link to view the code of Deliverable 1
Deliverable 2: Extract and Transform the Wikipedia Data
Click the link to view the code of Deliverable 2
Deliverable 3: Extract and Transform the Kaggle data
Click the link to view the code of Deliverable 3
Deliverable 4: Create the Movie Database
Note for the reader:
In this project we extract, transform, and load the data using Jupyter Notebook and PostgreSQL.
Data extracted from the Wikipedia movie data and Kaggle are used as inputs; the output data is stored in PostgreSQL as two tables.
The input file ratings.csv has about 26 x 10^6 data entries. If you open it in Excel, you will see only about 1.04 x 10^6 of them, since an Excel worksheet can hold at most 1,048,576 rows.
After downloading and storing the file, check its size; this helps prevent mistakes.
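One way to check the file without opening it in Excel is to read its size and count its rows directly. The sketch below is only an illustration: the helper names `describe_csv` and `fits_in_excel` are hypothetical, and it assumes the CSV has a single header line and no line breaks inside quoted fields (reasonable for a numeric ratings file, but an assumption nonetheless).

```python
import os

# Row limit of a single Excel worksheet.
EXCEL_MAX_ROWS = 1_048_576


def describe_csv(path):
    """Return (size_in_bytes, data_row_count) for a CSV with one header line."""
    size_bytes = os.path.getsize(path)
    # Count physical lines in binary mode so the file is streamed, not loaded.
    with open(path, "rb") as f:
        line_count = sum(1 for _ in f)
    # Subtract the header line; never report a negative count.
    return size_bytes, max(line_count - 1, 0)


def fits_in_excel(data_rows):
    """True if the data rows plus one header row fit in one worksheet."""
    return data_rows + 1 <= EXCEL_MAX_ROWS
```

For example, `size, rows = describe_csv("ratings.csv")` followed by `fits_in_excel(rows)` would show whether a spreadsheet view of the file can be trusted; the path here is just the file name mentioned above and depends on where you stored the download.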