This project is built on Apache Airflow, Python, Apache Kafka, Apache Zookeeper, Apache Spark, and Cassandra, all containerized with Docker for straightforward deployment and scaling.
- Data Source: Uses the randomuser.me API to generate random user data for the pipeline (a fetch-and-format sketch follows this list).
- Apache Airflow: Responsible for orchestrating the pipeline and storing fetched data in a PostgreSQL database.
- Apache Kafka and Zookeeper: Used for streaming data from PostgreSQL to the processing engine.
- Control Center and Schema Registry: Provide monitoring and schema management for the Kafka streams.
- Apache Spark: Handles data processing with its master and worker nodes.
- Cassandra: Stores the processed data (a sketch of the Spark-to-Cassandra path follows the technology list below).
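For illustration, here is a minimal sketch of the fetch-and-format step, assuming the standard randomuser.me JSON layout. The function names and the columns kept are hypothetical; the authoritative version is the DAG code in this repository.

```python
# Hypothetical sketch: fetch one random user and flatten it for downstream storage.
import json

import requests

API_URL = "https://randomuser.me/api/"  # public randomuser.me endpoint


def fetch_user() -> dict:
    """Fetch a single random user record from the API."""
    response = requests.get(API_URL, timeout=10)
    response.raise_for_status()
    return response.json()["results"][0]


def format_user(raw: dict) -> dict:
    """Flatten the nested API payload into the fields the pipeline stores."""
    location = raw["location"]
    return {
        "first_name": raw["name"]["first"],
        "last_name": raw["name"]["last"],
        "gender": raw["gender"],
        "address": f"{location['street']['number']} {location['street']['name']}, "
                   f"{location['city']}, {location['country']}",
        "email": raw["email"],
        "username": raw["login"]["username"],
        "dob": raw["dob"]["date"],
        "phone": raw["phone"],
    }


if __name__ == "__main__":
    print(json.dumps(format_user(fetch_user()), indent=2))
```

Run once, this prints a single flattened record, which is the shape the downstream producer and Spark job would consume.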
- Apache Airflow
- Python
- Apache Kafka
- Apache Zookeeper
- Apache Spark
- Cassandra
- PostgreSQL
- Docker
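To make the streaming half of the stack concrete, here is a hedged sketch of a Spark Structured Streaming job that reads records from Kafka and writes them to Cassandra. The topic name (`users_created`), keyspace (`spark_streams`), table (`created_users`), broker address, and connector versions are all assumptions; check docker-compose.yml and the repository's Spark job for the real values.

```python
# Hypothetical sketch of the Kafka -> Spark -> Cassandra path.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StringType, StructField, StructType

spark = (
    SparkSession.builder
    .appName("RandomuserStreaming")
    # Connector versions are assumptions; match them to your Spark version.
    .config(
        "spark.jars.packages",
        "org.apache.spark:spark-sql-kafka-0-10_2.12:3.4.1,"
        "com.datastax.spark:spark-cassandra-connector_2.12:3.4.1",
    )
    # Inside the Compose network this would be the Cassandra service name.
    .config("spark.cassandra.connection.host", "localhost")
    .getOrCreate()
)

# Schema for the flattened records; trim or extend to match the real payload.
schema = StructType([
    StructField("first_name", StringType()),
    StructField("last_name", StringType()),
    StructField("email", StringType()),
])

stream = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")  # assumed broker address
    .option("subscribe", "users_created")                 # assumed topic name
    .load()
    .selectExpr("CAST(value AS STRING)")
    .select(from_json(col("value"), schema).alias("data"))
    .select("data.*")
)

# The keyspace and table are assumed and must exist before the job starts.
query = (
    stream.writeStream
    .format("org.apache.spark.sql.cassandra")
    .option("checkpointLocation", "/tmp/checkpoint")
    .option("keyspace", "spark_streams")
    .option("table", "created_users")
    .start()
)
query.awaitTermination()
```

The checkpoint location matters here: the Cassandra sink relies on it to recover Kafka offsets after a restart, so point it at durable storage in anything beyond a local experiment.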
- Clone the repository:

  ```bash
  git clone https://github.com/luan-hillne/Randomuser-ETL-Airflow.git
  ```

- Navigate to the project directory:

  ```bash
  cd Randomuser-ETL-Airflow
  ```

- Run Docker Compose to spin up the services:

  ```bash
  docker-compose up -d
  ```
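Once the containers are up, it is worth confirming that every service started cleanly. The commands below are standard Docker Compose subcommands; the service names depend on what docker-compose.yml defines.

```bash
# List the services and their current state.
docker-compose ps

# Follow the logs of one service (replace <service-name> with a name from `docker-compose ps`).
docker-compose logs -f <service-name>
```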