This project is built on Apache Airflow, Python, Apache Kafka, Apache Zookeeper, Apache Spark, and Cassandra, all containerized with Docker for straightforward deployment and scaling.
- Data Source: Uses the randomuser.me API to generate random user data for the pipeline (a fetch-and-format sketch follows this list).
- Apache Airflow: Responsible for orchestrating the pipeline and storing fetched data in a PostgreSQL database.
- Apache Kafka and Zookeeper: Used for streaming data from PostgreSQL to the processing engine.
- Control Center and Schema Registry: Provide monitoring and schema management for the Kafka streams.
- Apache Spark: Handles data processing with its master and worker nodes.
- Cassandra: Stores the processed data (a sketch of the Spark-to-Cassandra path follows the technology list below).
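For illustration, here is a minimal sketch of the fetch-and-format step, assuming the standard randomuser.me JSON layout. The function names and the columns kept are hypothetical; the authoritative version is the DAG code in this repository.

```python
# Hypothetical sketch: fetch one random user and flatten it for downstream storage.
import json

import requests

API_URL = "https://randomuser.me/api/"  # public randomuser.me endpoint


def fetch_user() -> dict:
    """Fetch a single random user record from the API."""
    response = requests.get(API_URL, timeout=10)
    response.raise_for_status()
    return response.json()["results"][0]


def format_user(raw: dict) -> dict:
    """Flatten the nested API payload into the fields the pipeline stores."""
    location = raw["location"]
    return {
        "first_name": raw["name"]["first"],
        "last_name": raw["name"]["last"],
        "gender": raw["gender"],
        "address": f"{location['street']['number']} {location['street']['name']}, "
                   f"{location['city']}, {location['country']}",
        "email": raw["email"],
        "username": raw["login"]["username"],
        "dob": raw["dob"]["date"],
        "phone": raw["phone"],
    }


if __name__ == "__main__":
    print(json.dumps(format_user(fetch_user()), indent=2))
```

Run once, this prints a single flattened record, which is the shape the downstream producer and Spark job would consume.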
- Apache Airflow
- Python
- Apache Kafka
- Apache Zookeeper
- Apache Spark
- Cassandra
- PostgreSQL
- Docker
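To make the streaming half of the stack concrete, here is a hedged sketch of a Spark Structured Streaming job that reads records from Kafka and writes them to Cassandra. The topic name (`users_created`), keyspace (`spark_streams`), table (`created_users`), broker address, and connector versions are all assumptions; check docker-compose.yml and the repository's Spark job for the real values.

```python
# Hypothetical sketch of the Kafka -> Spark -> Cassandra path.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StringType, StructField, StructType

spark = (
    SparkSession.builder
    .appName("RandomuserStreaming")
    # Connector versions are assumptions; match them to your Spark version.
    .config(
        "spark.jars.packages",
        "org.apache.spark:spark-sql-kafka-0-10_2.12:3.4.1,"
        "com.datastax.spark:spark-cassandra-connector_2.12:3.4.1",
    )
    # Inside the Compose network this would be the Cassandra service name.
    .config("spark.cassandra.connection.host", "localhost")
    .getOrCreate()
)

# Schema for the flattened records; trim or extend to match the real payload.
schema = StructType([
    StructField("first_name", StringType()),
    StructField("last_name", StringType()),
    StructField("email", StringType()),
])

stream = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")  # assumed broker address
    .option("subscribe", "users_created")                 # assumed topic name
    .load()
    .selectExpr("CAST(value AS STRING)")
    .select(from_json(col("value"), schema).alias("data"))
    .select("data.*")
)

# The keyspace and table are assumed and must exist before the job starts.
query = (
    stream.writeStream
    .format("org.apache.spark.sql.cassandra")
    .option("checkpointLocation", "/tmp/checkpoint")
    .option("keyspace", "spark_streams")
    .option("table", "created_users")
    .start()
)
query.awaitTermination()
```

The checkpoint location matters here: the Cassandra sink relies on it to recover Kafka offsets after a restart, so point it at durable storage in anything beyond a local experiment.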
- Clone the repository:

  ```bash
  git clone https://github.com/luan-hillne/Randomuser-ETL-Airflow.git
  ```

- Navigate to the project directory:

  ```bash
  cd Randomuser-ETL-Airflow
  ```

- Run Docker Compose to spin up the services:

  ```bash
  docker-compose up -d
  ```
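Once the containers are up, it is worth confirming that every service started cleanly. The commands below are standard Docker Compose subcommands; the service names depend on what docker-compose.yml defines.

```bash
# List the services and their current state.
docker-compose ps

# Follow the logs of one service (replace <service-name> with a name from `docker-compose ps`).
docker-compose logs -f <service-name>
```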