Exploring-Realtime-Streaming

Introduction

This repository is a learning-focused adaptation of an end-to-end data engineering pipeline. It demonstrates how to ingest, process, and store data using Apache Airflow, Kafka, Zookeeper, Spark, Cassandra, and PostgreSQL. The entire stack is containerized with Docker for easy deployment and scalability.

System Architecture

[Architecture diagram: data engineering pipeline]

The pipeline is built using the following components:

  • Data Source: Random user profiles are generated with the randomuser.me API.
  • Apache Airflow: Orchestrates the pipeline and stores raw data in a PostgreSQL database.
  • Apache Kafka & Zookeeper: Handle real-time data streaming from PostgreSQL to the processing layer (see the ingestion sketch after this list).
  • Control Center & Schema Registry: Facilitate monitoring and schema management for the Kafka streams.
  • Apache Spark: Processes the streamed data with its distributed computing framework.
  • Cassandra: Serves as the final storage layer for the processed data.
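
As a rough illustration of the ingestion step described above, the sketch below fetches one profile from the randomuser.me API and publishes it to a Kafka topic with the kafka-python client. The broker address, the topic name users_created, and the selected fields are illustrative assumptions, not the repository's actual configuration.

```python
import json

import requests
from kafka import KafkaProducer

# Assumed broker address and topic name -- adjust to match the actual setup.
BROKER = "localhost:9092"
TOPIC = "users_created"


def fetch_user() -> dict:
    """Fetch one random user profile from the randomuser.me API."""
    response = requests.get("https://randomuser.me/api/", timeout=10)
    response.raise_for_status()
    raw = response.json()["results"][0]
    # Keep a small, flat subset of the profile for streaming.
    return {
        "first_name": raw["name"]["first"],
        "last_name": raw["name"]["last"],
        "email": raw["email"],
        "country": raw["location"]["country"],
    }


def main() -> None:
    producer = KafkaProducer(
        bootstrap_servers=BROKER,
        value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    )
    producer.send(TOPIC, fetch_user())
    producer.flush()  # Block until the message is actually delivered.


if __name__ == "__main__":
    main()
```

In the real pipeline this fetch-and-publish logic would run inside a scheduled Airflow task rather than as a one-off script.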

Key Learning Outcomes

By working on this project, you will:

  • Understand how to set up and manage a data pipeline using Apache Airflow.
  • Gain hands-on experience with real-time data streaming using Apache Kafka.
  • Learn distributed synchronization techniques with Apache Zookeeper.
  • Explore advanced data processing methods with Apache Spark.
  • Dive into data storage solutions using Cassandra and PostgreSQL.
  • Master containerization of a full data engineering setup using Docker.

Technologies Used

This project employs the following tools and frameworks:

  • Apache Airflow
  • Python
  • Apache Kafka
  • Apache Zookeeper
  • Apache Spark
  • Cassandra
  • PostgreSQL
  • Docker

Getting Started

Follow these steps to set up and run the project on your local machine:

Set Up the Project

  1. Clone the repository:
     git clone https://github.com/Moiz101-ch/Exploring-Realtime-Streaming.git
  2. Navigate to the project directory:
     cd Exploring-Realtime-Streaming
  3. Start the services and explore the components (see the note after this list).
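
Because the stack is containerized, the services can most likely be brought up with Docker Compose. This assumes a docker-compose.yml at the repository root, which is inferred from the project description rather than confirmed:

```sh
# Assumes a docker-compose.yml in the repository root.
docker compose up -d
```

Once the containers are running, the component UIs (for example the Airflow webserver and the Kafka Control Center) are typically reachable on localhost ports defined in the compose file.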

Explore the Pipeline

  • The pipeline fetches data from the randomuser.me API.
  • The data flows through PostgreSQL, Kafka, and Spark before being stored in Cassandra; a minimal processing sketch follows below.
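
To make the Spark-to-Cassandra leg concrete, here is a minimal Structured Streaming sketch that reads the user events from Kafka, parses the JSON payload, and writes rows to a Cassandra table. The topic, keyspace, and table names are illustrative assumptions, and it presumes the spark-sql-kafka and spark-cassandra-connector packages are on the Spark classpath.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StringType, StructField, StructType

# Assumed names -- align these with the actual topic, keyspace, and table.
TOPIC = "users_created"
KEYSPACE = "spark_streams"
TABLE = "created_users"

spark = (
    SparkSession.builder.appName("UserStream")
    # Requires the spark-cassandra-connector package on the classpath.
    .config("spark.cassandra.connection.host", "localhost")
    .getOrCreate()
)

# Schema matching the JSON produced by the ingestion step.
schema = StructType([
    StructField("first_name", StringType()),
    StructField("last_name", StringType()),
    StructField("email", StringType()),
    StructField("country", StringType()),
])

users = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", TOPIC)
    .load()
    # Kafka delivers raw bytes; decode and parse the JSON payload.
    .select(from_json(col("value").cast("string"), schema).alias("data"))
    .select("data.*")
)

query = (
    users.writeStream.format("org.apache.spark.sql.cassandra")
    .option("keyspace", KEYSPACE)
    .option("table", TABLE)
    .option("checkpointLocation", "/tmp/checkpoint")
    .start()
)
query.awaitTermination()
```

The target keyspace and table need to exist before the stream starts; they would typically be created up front with a short CQL script.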

Acknowledgments

This project draws inspiration from the original implementation by Yusuf Ganiyu (airscholar). It has been adapted for learning purposes with some modifications.
