
Analytics Engineering Bootcamp Capstone Project


Table of Contents

  1. Project Introduction
  2. Steps
  3. Data Sources
  4. Exploratory Data Analysis
  5. Data Models
  6. Data Dictionary
  7. Tech Stack
  8. Pipeline
  9. Visualization
  10. Next Steps
  11. Conclusion

Project Introduction


The Problem

European airspace faces several significant challenges: high levels of traffic congestion, flight delays, and growing environmental concerns such as noise pollution.

  • Noise Pollution: As air traffic volumes increase, noise pollution has become a growing issue, especially for communities near airports.
  • Congestion and Capacity Limits: European airspace is one of the busiest globally, with peak times often leading to bottlenecks.

Description

The Goal

The goal of the project is to analyze open-source data to provide insights into the issues above:

  1. Noise pollution analysis to identify areas experiencing high traffic and high noise levels.
  2. Data that could also feed predictive models to anticipate congestion and delays.

The primary output of the project is a visualization that enables the public to identify areas of low and high noise pollution due to air traffic.

Target Audience

  • Individuals can use this data to assess exposure to flight-traffic-related noise pollution for any area of interest, e.g. before deciding to move to a new home.
  • Local Communities affected by noise pollution can advocate for noise mitigation policies and engage in constructive discussions with decision makers.
  • Air Traffic Management (ATM) Agencies could use these insights to explore alternative paths for flights to mitigate noise pollution, refine air traffic management and reduce delays.

Personal Motivation

I have always lived under a flight corridor, and I thought I was used to the noise pollution, but my current neighborhood is severely impacted. The noise from the aircraft affects my personal life, health, and work. Using this app, individuals who are sensitive to noise pollution can easily look up the city and neighborhood they are interested in before deciding to rent a flat, buy a new home, or organize an important event.


Steps


This is a rough list of the steps I went through building this project.

1. Researching available Data Sources

  • Given the requirements of the Capstone Project, I started with the static and dynamic data sources shared in the Bootcamp
  • Searched the internet for interesting free and paid data sources and APIs

2. Evaluated the most interesting data sources against the following factors:

  • The availability of the data, e.g. API limitations
  • The format of the data, e.g. the file format
  • The quantity of data
  • The price of data
  • Decided on the data sources to use and the use case

3. Performed Exploratory Data Analysis

  • I chose Google Colab's Jupyter Notebook environment to experiment with the data and share it with others
  • I used DuckDB to easily process the files and run some basic statistics
  • Identified data quality issues, e.g. missing data, mostly in the flights and aircraft data sources
  • I used Leafmap within the notebook to visualize Geospatial data as a proof of concept

4. Conceptual Data Model and Architecture Diagram

  • Created the Conceptual Data Model
  • Created an initial architectural overview based on the array of tools and services available to us within the Bootcamp

5. Created a Data Dictionary

  • From the Conceptual Data Model I created the Logical Data Model
  • After some experimentation with different tools I decided on the final tech stack
  • Created the Physical Data Model
  • Created a Data Dictionary listing all tables and fields with their information

6. Wrote the ETL pipelines to transform the data.

  • Created extract tasks to read Parquet files from the web and load them into the data warehouse
  • Created transformation models in dbt
  • Defined tests for each source and table

7. Ran the ETL pipelines and QA checks

  • Created DAGs for data sources with varying extract intervals
  • Tested and optimized the DAG runs

8. Created Visualizations

  • Built Streamlit app with custom visualizations and interactive widgets
  • Tested and improved user experience
  • Added instructions and contextual information for end users

Data Sources


Summary of data used

Shortlist of Data sources and their types
  1. OPDI Flights data from OpenSky Network in parquet files
  2. Airport database from OurAirports.com in CSV files
  3. Aircraft database from OpenSky Network in CSV files
  4. Aircraft type data from OpenSky Network in CSV file
  5. Country data from OurAirports.com in CSV file

Flight Data from the Open Performance Data Initiative

Description


About the Open Performance Data Initiative (OPDI) dataset

This flight dataset is published under the Open Performance Data Initiative (OPDI) as sponsored by the Performance Review Commission and in collaboration with the OpenSky Network (OSN). More information can be found here.

Comparison of sources

| API | Hist Pos data | Live pos data | Hist schedule | Live schedule | Static data | Hist Price data | Pricing |
|---|---|---|---|---|---|---|---|
| FR24 | Yes | Yes | Yes | Yes | Yes | No | $90 / month |
| Aviation Edge | No | Yes | Yes | Yes | Yes | No | first $7/m, $300/m, 30000 call |
| Aviationstack | No | No | Yes | Yes | Yes | No | 100 call free, $50/m |
| Amadeus Airfare prices | No | No | No | No | No | Yes | first 10000 call free, 0,0025 EUR |
| OPDI | Yes* | No | Yes | No | Yes | No | Free |
Why I chose this data:

It is the most complete flights dataset which is freely available.

  • Covers a period of 30 months
  • 30+ million flights
  • Free to use
  • Data is in a series of parquet files
Contents
  1. Flights data
  2. Events data
  3. Measurements data
  4. Airport data
  5. Runway data

Live Flight Data API by OpenSky Network

Description


About the OpenSky Network (OSN)

The OpenSky Network was initiated in 2012 by researchers from armasuisse (Switzerland), University of Kaiserslautern (Germany), and University of Oxford (UK). The objective was (and still is!) to provide high quality air traffic data to researchers.

Why I chose this data:

It is a free API for data collected by the OSN

  • Provides detailed track information for specific flights (experimental)
  • Good coverage for ADS-B
  • Free
Contents
  1. Departures
  2. Arrivals
  3. Flight Tracks (Only for the last 30 days)
  4. State Vectors (Limited calls)
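
As an illustration, a departures query against the public OSN REST API might look like the sketch below; the airport and date are placeholder values, and anonymous access is rate limited:

```python
import requests
from datetime import datetime, timezone

# Departures from Frankfurt (EDDF) for one day, via the public OSN REST API.
BASE_URL = "https://opensky-network.org/api/flights/departure"

begin = int(datetime(2024, 1, 1, tzinfo=timezone.utc).timestamp())
end = int(datetime(2024, 1, 2, tzinfo=timezone.utc).timestamp())

resp = requests.get(
    BASE_URL,
    params={"airport": "EDDF", "begin": begin, "end": end},
    timeout=30,
)
resp.raise_for_status()

for flight in resp.json():
    print(flight["callsign"], flight["estDepartureAirport"], flight["estArrivalAirport"])
```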
Comparison of sources

| API | Hist Pos data | Live pos data | Hist schedule | Live schedule | Static data | Hist Price data | Pricing |
|---|---|---|---|---|---|---|---|
| FR24 | Yes | Yes | Yes | Yes | Yes | No | $90 / month |
| Aviation_Edge | No | Yes | Yes | Yes | Yes | No | first $7/m, $300/m, 30000 call |
| Aviationstack | No | No | Yes | Yes | Yes | No | 100 call free, $50/m |
| Amadeus | No | No | No | No | No | Yes | first 10000 call free, 0,0025 EUR |
| OSN | Yes* | Yes | Yes | Yes | No | No | Free |

Airports and Runway Data by OurAirports

Description


Why I chose this data:
  • Most comprehensive dataset around airports
  • Includes a table for countries and runways as well
  • Free
Comparison of sources

| Name | Up to date | Runway data | Country | Downloadable | Free |
|---|---|---|---|---|---|
| AirportDatabase | Yes | No | Yes | No | Yes |
| OurAirports | Yes | Yes | Yes | Yes | Yes |
| Global Airport Database | Yes | No | Yes | Yes | Yes |

Aircraft Data by OpenSky Network

Why I chose this data:
  • Most comprehensive dataset of aircraft
  • Has airline information
  • Free
Comparison of sources

| Name | Up to date | Types | Manuf. | Airline | Country |  | File Format | Free |
|---|---|---|---|---|---|---|---|---|
| OSN | Yes | Yes | Yes | Yes | Yes | Yes | .csv | Yes |
| Airframes.org | Yes | Yes | Yes | No | No | No | N/A | Yes |
| Airfleets.net | Yes | Yes | Yes | Yes | Yes | Yes | .xls | No |

Exploratory Data Analysis


The goal of the EDA was to:

  • Understand the data structure
  • Identify patterns and anomalies in the data
  • Understand what data is relevant for the use case

Tools used:

  • Google Colab, a cloud-based Python environment ideal for handling large datasets
  • DuckDB, an in-process SQL database for querying large-scale datasets efficiently
  • Matplotlib, Seaborn for creating plots and charts
  • Leafmap and Geopandas with h3 for geo data visualization

Setup Process:

  • First mounted Google Drive storage in the Colab environment
  • Added the files to the drive
  • Loaded the files into DuckDB and created tables

Writing queries

  • Queried the tables with SQL
  • Saved the query results in Pandas dataframes
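
A minimal sketch of this setup, assuming the Parquet files have already been copied to Google Drive; the paths, table, and column names below are illustrative, not the notebook's exact code:

```python
import duckdb
from google.colab import drive

# Make the files stored on Google Drive visible to the Colab runtime.
drive.mount("/content/drive")

con = duckdb.connect()

# Register the Parquet files as a table (placeholder path).
con.sql("""
    CREATE OR REPLACE TABLE flights AS
    SELECT * FROM read_parquet('/content/drive/MyDrive/opdi/flight_list/*.parquet')
""")

# Run basic statistics and pull the result into a pandas DataFrame for plotting.
df = con.sql("""
    SELECT dof, COUNT(*) AS flight_count
    FROM flights
    GROUP BY dof
    ORDER BY dof
""").df()

print(df.head())
```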


Data Cleaning

  • Checked data types to ensure consistency in temporal and numerical fields
  • Identified missing data, incomplete records
  • Checked for duplicates


Visual Analysis

  • Created plots and charts for visually exploring the data
  • Created Geospatial visualizations to understand how the data could be used


Link to the Notebook


Data Models


Conceptual Data Model


Link

Logical Data Model


Link

Physical Data Model


Link

Flow Diagram


Link


Data Dictionary


Flights

Source of the table

  • OPDI Flights dataset
| Field | Type | Format | Size | Values | Description | Constraints |
|---|---|---|---|---|---|---|
| FlightID | Varchar |  |  |  | Unique ID | PK, Unique |
| iataFlightNumber | Varchar | AB123 | 5-6 |  | Primarily used in booking and ticketing, so passengers see the IATA code on their tickets and in airport displays. |  |
| ICAO24 | Varchar |  |  |  | ICAO 24-bit address, Mode S equipped aircraft are assigned this unique address (informally Mode-S "hex code") upon national registration. | FK |
| CallSign | Varchar |  | 6 |  | An identifier used by pilots and air traffic control for communication and flight tracking. Airlines tend to use their flight numbers as their call signs. |  |
| DepartureICAO | Varchar | A0F9 | 3-7 |  | ICAO24 code of the aerodrome of departure. |  |
| DestinationICAO | Varchar | A0F9 | 3-7 |  | ICAO24 code of the aerodrome of destination. |  |
| DOF | Date |  |  |  | Date of the flight. |  |
| PlannedDeparture | DateTime |  |  |  | Planned time for departure. |  |
| PlannedArrival | DateTime |  |  |  | Planned time for arrival. |  |
| ActualDeparture | DateTime |  |  |  | Actual time for departure. |  |
| ActualArrival | DateTime |  |  |  | Actual time for arrival. |  |

Aggregated Flight Counts

| Field | Type | Format | Size | Values | Description | Constraints |
|---|---|---|---|---|---|---|
| HEX | Varchar |  |  |  | The ID of the H3 hexagonal bin | PK, Unique |
| Year_start | DateTime |  |  |  | The start of the year that the row contains data for | PK |
| Metric_array | Array |  |  |  | Array of the count of flights flying through the hexbin in each period. The period is every 10 days and is based on the cadence of the flight events data source. |  |

Events

Source of the table

  • OPDI Events dataset
  • OPDI Measurements dataset
| Field | Type | Format | Size | Values | Description | Constraints |
|---|---|---|---|---|---|---|
| FlightID | Varchar |  |  |  | Unique ID of the flight | PK |
| TimeStamp | DateTime |  |  |  | Time dimension. | PK |
| Type | Varchar |  |  |  | The description of the event connected to the timestamp, e.g. take-off, touch-down, reaching a specific altitude or part of the trajectory. |  |
| Latitude | Float | NN.NNNNNN |  | +/-90 | Geographic coordinate specifying the north–south position of the location. |  |
| Longitude | Float | NNN.NNNNNN |  | +/-180 | Geographic coordinate specifying the east-west position of the location. |  |
| Altitude | Float |  |  |  | Geographic coordinate specifying the distance from sea level. |  |
| TimePassed | Integer |  |  |  | Time passed since the first waypoint, in seconds. |  |
| Distance | Number |  |  |  | Distance completed since the first waypoint, in nautical miles. |  |
| Waypoints | Array |  |  |  | The list of 4D coordinates interpolated from the previous event |  |

Aircrafts

Source of the table

  • OSN Aircraft database
  • OSN Aircraft types dataset
| Field | Type | Format | Size | Values | Description | Constraints |
|---|---|---|---|---|---|---|
| ICAO24 | Varchar |  | 4 | hex | ICAO 24-bit address, Mode S equipped aircraft are assigned this unique address (informally Mode-S "hex code") upon national registration. | PK, Unique |
| OperatorICAO | Varchar |  | 3 | alphanumeric | The ICAO airline designator is a code assigned by the International Civil Aviation Organization (ICAO) to aircraft operating agencies. |  |
| Country | Varchar | AA | 4-35 |  | Country of registration. |  |
| Registration | Varchar | AZ-019 | 4-7 |  | An aircraft registration is a code unique to a single aircraft, required by international convention to be marked on the exterior of every civil aircraft. | Unique |
| ManufacturerICAO | Varchar |  | 2-27 |  | The ICAO uses a naming convention for aircraft manufacturers in order to be specific when mentioning an aircraft manufacturer's name. |  |
| ManufacturerName | Varchar |  |  |  | The common name of the aircraft manufacturer. |  |
| ModelName | Varchar |  |  |  | The model of the aircraft. |  |
| AircraftType | Varchar |  | 1 | L - Landplane; S - Seaplane; A - Amphibian; G - Gyrocopter; H - Helicopter; T - Tiltrotor | Type of aircraft. |  |
| EngineCount | Integer |  | 1 |  | The number of engines on the aircraft. |  |
| EngineType | Varchar |  |  | J - jet; T - turboprop, turboshaft; P - piston; E - electric; R - rocket | The power component of an aircraft propulsion system. |  |
| TypeDesignator | Varchar |  | 2-4 | hex | An aircraft type designator is a 2 to 4 character alphanumeric code designating every aircraft type/subtype. |  |
| WTC | Varchar |  | 1 | L - Light, <=7 t; M - Medium, >7 t, <=136 t; H - Heavy, >136 t, except Super; J - Super, specified in ICAO Doc 8643 | Wake turbulence category, based on the disturbance in the atmosphere that forms behind an aircraft. |  |

Airports

Source of the table

  • OurAirports Airports dataset
  • OurAirports Countries dataset
| Field | Type | Format | Size | Values | Description | Constraints |
|---|---|---|---|---|---|---|
| ICAO24 (ident) | Varchar | A0F9 | 3-7 |  | The ICAO airport code or location indicator is a four-letter code designating aerodromes around the world. | PK, Unique |
| Name | Varchar |  |  |  | Name of the aerodrome. |  |
| Type | Varchar |  |  |  | Type of the aerodrome. |  |
| Latitude | Float | NN.NNNNNN |  | +/-90 | Geographic coordinate specifying the north–south position of the location. |  |
| Longitude | Float | NNN.NNNNNN |  | +/-180 | Geographic coordinate specifying the east-west position of the location. |  |
| Elevation | Signed Integer |  |  |  | The elevation of the location in feet. |  |
| Municipality | Varchar |  |  |  | The name of the area the aerodrome is located in. |  |
| CountryCode | Varchar | AZ | 2 |  | Internationally recognized standard codes of 2 letters that refer to countries. |  |
| CountryName | Varchar |  |  |  | The name of the country the aerodrome is located in. |  |
| Continent | Varchar | AZ | 2 |  | The internationally recognized code of the continent the aerodrome is situated in. |  |

Tech Stack


High Level Architecture


Snowflake

Centralized storage for all data, allowing performant querying and integration with other tools.

Why Snowflake?
  • Highly scalable data warehouse that handles large datasets efficiently
  • Supports Parquet and CSV file ingestion with custom file format definitions
  • Has general availability of H3 functions within Snowflake's SQL scripting

Snowpark

Used for ingesting data into Snowflake

Why Snowpark?
  • Provides an easy solution to ingest data into Snowflake
  • Enables a wide range of operations to complement SQL

dbt

To define and orchestrate transformations on the raw data and create aggregations.

Why dbt?
  • Simple to create and manage SQL-based transformations.
  • Provides version control, documentation, and testing for data pipelines.
  • Ensures modular and reusable transformation logic

Airflow

To schedule and orchestrate ingestion, transformation, and analytics workflows.

Why Airflow?
  • Airflow allows scheduling and managing workflows with clear dependencies between tasks.
  • Can be used with dbt through Cosmos

Cosmos

Bridges Airflow orchestration with dbt's transformation logic, creating a unified ETL pipeline.

Why Cosmos?
  • Integrates dbt with Airflow
  • Executes dbt models as part of the Airflow pipeline

Streamlit

To build an interactive dashboard for visualizing data.

Why Streamlit?
  • Simplifies sharing insights with stakeholders via a web-based interface
  • Uses Python
  • Built into Snowflake

DuckDB

To conduct exploratory analysis and test queries before creating them in Snowflake.

Why DuckDB?
  • Lightweight and fast for prototyping
  • Ideal for quickly exploring and querying Parquet and CSV files without loading the data into a warehouse

Pipeline


The pipeline follows modern data engineering practices to extract, clean, and model data, using a Medallion architecture with Staging, Intermediate, and Datamart layers.

Data sources
  • Some of the data sources consist of several parquet files, broken down into intervals of either a month or 10 days
  • Other sources are CSV files, updated at a daily or monthly cadence.
  • All sources are available for download from different websites.
  • At the moment there is no live API available to get the same complete dataset, so I decided to consume the available sources in this project with a time delay, starting from the beginning of 2024
Extracting the Data
  • The most efficient way I found to ingest the data was to upload it to Snowflake's internal stage and copy it into tables with Snowflake's Snowpark API (a minimal sketch follows the task list below)
  • I also opted for this solution because it will be easy to adjust when the dataset is migrated to an S3 bucket in the future
Tasks:
  • Python script to get the different files for each time period
  • Using the Snowpark API to upload the file to an internal stage within Snowflake
  • Running a Snowflake query to copy the data from the stage into a newly created table
  • A Python script to clean up temporary files
  • Running a Snowflake query for basic Data Quality check
  • If the checks pass, the next task deletes the file from the temporary stage
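
As a rough illustration of the stage-and-copy pattern described above, here is a minimal Snowpark sketch; the connection parameters, stage, table, and file names are placeholders, not the project's actual identifiers:

```python
from snowflake.snowpark import Session

# Placeholder connection parameters; in practice these come from env vars or a config file.
connection_parameters = {
    "account": "<account>",
    "user": "<user>",
    "password": "<password>",
    "warehouse": "<warehouse>",
    "database": "<database>",
    "schema": "RAW",
}
session = Session.builder.configs(connection_parameters).create()

local_file = "/tmp/flight_list_2024_01.parquet"   # downloaded by the previous task
stage = "@RAW.FLIGHTS_STAGE"                      # internal stage (placeholder name)

# Upload the file to the internal stage.
session.file.put(local_file, stage, auto_compress=False, overwrite=True)

# Copy the staged Parquet data into the raw table, matching columns by name.
session.sql(f"""
    COPY INTO RAW.FLIGHTS
    FROM {stage}
    FILE_FORMAT = (TYPE = PARQUET)
    MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE
""").collect()

# Basic data quality check before the stage cleanup task runs.
row_count = session.sql("SELECT COUNT(*) AS N FROM RAW.FLIGHTS").collect()[0]["N"]
assert row_count > 0, "No rows were loaded"
```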
Transformation with dbt

The transformation process follows the Medallion Architecture:

1. Staging Layer
  • Data is ingested "as-is," retaining all fields and formats to ensure data traceability.
  • Light transformations, type casting, renaming fields.
2. Intermediate Layer
  • Perform data enrichment, cleaning, and transformations.
  • Joining different datasets.
3. Datamart Layer
  • Analysis-ready datasets for the dashboard.
  • Creating aggregations and final metrics (see the H3 binning sketch below).
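
In the warehouse these aggregations use Snowflake's H3 SQL functions; purely as an illustration of the binning idea behind the Aggregated Flight Counts table, the equivalent in Python with the h3 package could look like this, with made-up sample coordinates:

```python
import h3
import pandas as pd

# Sample flight event positions (latitude, longitude), as in the Events table.
events = pd.DataFrame({
    "latitude": [47.4979, 47.5012, 52.5200],
    "longitude": [19.0402, 19.2556, 13.4050],
})

# Assign each event to a resolution-8 H3 cell (h3 v4 API; v3 uses h3.geo_to_h3).
events["hex"] = [
    h3.latlng_to_cell(lat, lon, 8)
    for lat, lon in zip(events["latitude"], events["longitude"])
]

# Count events per hexbin - the same idea the Aggregated Flight Counts table stores per period.
counts = events.groupby("hex").size().reset_index(name="flight_count")
print(counts)
```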
Data Quality Checks in dbt

All dbt models have some kind of testing included:

  • Built-in generic tests, e.g.:
    • unique
    • not_null
    • accepted_values
  • Tests from the dbt_expectations package, e.g.:
    • expect_column_values_to_be_between
    • expect_column_values_to_be_of_type
Documentation

I used dbt's built in commands to generate documentation and create a static website with an overview of the models.

dbt docs generate
dbt docs serve


Orchestration by Airflow

In the end I grouped the tasks into task groups and created two DAGs running at different frequencies.

Important: There is an additional task at the beginning of each DAG that checks whether the previous DAG run was successful. This task needs to be marked as successful manually for the very first run of both DAGs.
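
A minimal sketch of such a gate task, assuming Airflow 2 with the TaskFlow API and a check against the DagRun metadata; the DAG name, schedule, and downstream task are illustrative, not the project's actual DAGs:

```python
from airflow.decorators import dag, task
from airflow.exceptions import AirflowFailException
from airflow.models import DagRun
from airflow.utils.state import DagRunState
from pendulum import datetime


@dag(schedule="@monthly", start_date=datetime(2024, 1, 1), catchup=False)
def monthly_flights():

    @task
    def check_previous_run_succeeded(dag_run=None):
        # Fail fast unless an earlier run of this DAG finished successfully.
        # For the very first run there is nothing to check, so the task is
        # marked successful manually (as noted above).
        successful_runs = DagRun.find(dag_id=dag_run.dag_id, state=DagRunState.SUCCESS)
        earlier = [r for r in successful_runs if r.execution_date < dag_run.execution_date]
        if not earlier:
            raise AirflowFailException("No successful previous run found")

    @task
    def extract_flights():
        ...  # download the files and load them via Snowpark, as described above

    check_previous_run_succeeded() >> extract_flights()


monthly_flights()
```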
Monthly DAG:
  • Monthly Flights data
  • Aircraft data
  • Aircraft types data


Weekly DAG:
  • Flight Events data
  • Flight Measurements data
  • Airports data



Visualization


I used Streamlit within Snowflake to build an interactive web application and pydeck for geospatial visualization. The link to the application is here.


Main visualizations:

  1. Air traffic heatmap for Europe based on H3 hexagonal bins (with an H3 resolution of 8)


  2. An airport heatmap for any airport in the continent, selected by the user after filtering on the country (resolution can be set by the user between 6 and 10)


  3. A map with the most frequently used connections between airports (using pydeck's GreatCircleLayer)


  4. A map showing the paths of the 10 aircraft with the most flights (using pydeck's PathLayer)


The visualizations were particularly challenging because not every tool is supported by Streamlit within Snowflake.
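
As an illustration of how the H3 heatmap is wired together, here is a minimal Streamlit-in-Snowflake sketch; the datamart table and column names are placeholders for the project's actual objects:

```python
import pydeck as pdk
import streamlit as st
from snowflake.snowpark.context import get_active_session

# In Streamlit within Snowflake the Snowpark session is already available.
session = get_active_session()

# One row per H3 cell with its aggregated flight count (placeholder table name).
df = session.sql(
    "SELECT hex, flight_count FROM DATAMART.AGG_FLIGHT_COUNTS"
).to_pandas()

# Scale the counts into a 0-255 colour channel for a simple heatmap effect.
df["COLOR"] = (255 * df["FLIGHT_COUNT"] / df["FLIGHT_COUNT"].max()).astype(int)

layer = pdk.Layer(
    "H3HexagonLayer",
    data=df,
    get_hexagon="HEX",          # column holding the H3 cell id
    get_fill_color="[COLOR, 90, 140, 160]",
    extruded=False,
    pickable=True,
)

view = pdk.ViewState(latitude=50.0, longitude=10.0, zoom=4)
st.pydeck_chart(pdk.Deck(layers=[layer], initial_view_state=view))
```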


Next Steps


There was only limited time for the project, so there are still a lot of ideas for improvements and new functionality.

  • The main datasets are currently only available as parquet files downloadable from a website, but hopefully they will be available through an Amazon S3 bucket in the near future
  • There are plans for more frequent releases of the data, which, when integrated, would offer a more up-to-date picture
  • I wanted to integrate flight schedule data as well, to analyze flight delays
  • Compare schedule data with weather data from the Open-Meteo API to discover the effects of weather on flight delays
  • Currently the dataset is limited to the start of the year to be mindful of the associated costs, but the time scope can be extended easily
  • Create similar visualizations within Tableau or Power BI

Conclusion


This capstone project for the Analytics Engineering Bootcamp was a challenging experience, but I learned a lot. I had to design and implement a complex data pipeline, integrating several data sources and utilizing various modern data engineering tools.

The main highlights for me working on the project:

  • Working with datasets of many millions of rows each
  • Doing EDA in Google Colab with DuckDB
  • Including a reduced fact table in the data model, using complex data structures within each row
  • Adding specific file format definitions for the different CSV variants I imported into Snowflake
  • Creating Snowflake Python User Defined Functions for some special use cases
  • Using Astronomer's Cosmos to integrate dbt tasks into Airflow
  • Leveraging Uber's H3 hexagonal hierarchical geospatial indexing system for aggregating data
  • Using geospatial visualization tools and libraries, such as Leafmap and pydeck
  • Building a Streamlit app for interactive visualization

In this project I tried to apply the various methods and technologies we learned in the Bootcamp. Through designing and implementing this complex data pipeline I gained valuable experience and improved my technical proficiency and problem-solving skills.

