
Datacroft: Analytics Stack

Datacroft: Analytics Stack was developed by FELD M to help you easily set up a pipeline for reporting and analytics. The stack includes the following components:

  • 🚂 a data loading/EL tool (Airbyte) to help you get data from different sources into your data warehouse
  • 🗄 a simple data warehouse (a PostgreSQL database) where you can store all your data
  • 🪄 a data modeling tool (dbt) to help you transform your raw data into ready-to-use tables
  • 🔭 a visualization tool (Metabase) where you can build dashboards and explore your data

We assembled this stack to be easy to spin up, with minimal required resources so you can pull your data and experiment with it as fast as possible. If you have your own machine to run this on (such as a private server, or even just your own laptop), the whole stack is free to run and use, since all the tools are open-source.

The Stack

The stack runs these components as Docker containers:

(architecture diagram)

The containers are orchestrated with Docker Compose, so the stack assumes you have a machine to run it on, such as a private server or your own laptop. If you prefer a cloud-based setup, all the tools above are also available as cloud subscriptions with free trials.

You can also swap out any of the tools above if you prefer a different one. For example, if you want to use Tableau instead of Metabase, comment out the Metabase service in the Docker Compose file and connect your Tableau instance to the data warehouse. You could also use a cloud-based warehouse such as BigQuery.

Requirements

Since we use Docker Compose to run the stack, you will need Docker installed on your machine. We also provide a Makefile so you can run the stack with a few simple commands.

This stack was tested on macOS and Ubuntu.
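As a quick sanity check, you can verify the prerequisites from a terminal (a minimal sketch; depending on your installation, Compose may be available as the docker compose plugin or the standalone docker-compose binary):

```sh
# Verify the prerequisites are installed
docker --version           # Docker engine
docker compose version     # Compose plugin (or: docker-compose --version)
make --version             # make, used for the stack's shortcut commands
```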

Quickstart

To set up the pipeline on your machine:

  1. Make sure that Docker is running
  2. Edit the following configuration files (this step is optional -- if you skip it, the stack will still run, and the tools will simply use the default values/credentials set in the .env files; see the sketch after this list)
    • .env: credentials and configuration for the PostgreSQL data warehouse, dbt, and Metabase
    • airbyte.env: credentials and configuration preferences for Airbyte
  3. Open a terminal in the project directory and run make run to start the containers
  4. Access the applications as described below
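For illustration, the .env files are plain KEY=value files. The variable names below are hypothetical placeholders, not necessarily the keys used by this repository; the real keys and their defaults are the ones defined in the .env file that ships with the project:

```sh
# Hypothetical sketch of a .env file -- the real variable names and
# defaults are whatever the repository's own .env defines.
POSTGRES_USER=warehouse_user    # data warehouse user (illustrative)
POSTGRES_PASSWORD=change_me     # data warehouse password (illustrative)
POSTGRES_DB=warehouse           # data warehouse database name (illustrative)
```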

To access the applications:

  • PostgreSQL: connect to [host]:5432 using a PostgreSQL client
  • Airbyte: open [host]:8000 on a web browser
  • Metabase: open [host]:3000 on a web browser

If you're running this on your own machine, [host] will be localhost, so Airbyte will be available at localhost:8000, Metabase at localhost:3000, and PostgreSQL at localhost:5432.
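For example, to connect to the warehouse locally with psql (swapping in whatever user, password, and database you configured in .env; the names below are placeholders):

```sh
# Connect to the local PostgreSQL data warehouse
# (warehouse_user / warehouse are placeholders for your .env values)
psql -h localhost -p 5432 -U warehouse_user -d warehouse

# Open the web UIs (on Linux, use xdg-open instead of open)
open http://localhost:8000   # Airbyte
open http://localhost:3000   # Metabase
```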

To stop the containers:

Go to the project folder in the terminal and run make stop.

Some notes

Note that the first time you run make run, the command might take a while, since Docker has to pull all the images. Even after the command finishes, Metabase will not be immediately available, since it is still setting itself up. In our experience, this can take around 15 minutes or more.

Once Airbyte and Metabase are accessible from the browser, Airbyte may prompt you for a username and password. If you did not change these settings in airbyte.env, the defaults are username airbyte and password password.
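For reference, Airbyte's Docker deployments typically expose these basic-auth credentials through environment variables along the following lines; treat the exact names as an illustrative sketch, since they depend on the Airbyte version bundled with the stack:

```sh
# Illustrative sketch of the basic-auth settings in airbyte.env --
# the exact variable names depend on the bundled Airbyte version.
BASIC_AUTH_USERNAME=airbyte
BASIC_AUTH_PASSWORD=password
```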

Where is the stack best used?

The whole stack runs from two Docker Compose files (one for Airbyte, one for the other tools) on a single machine. Thus, it's best for fast, exploratory work, where you want to get your data and experiment with it as soon as you can, or deliver proofs of concept ASAP. It's not as suitable for heavy data syncs or many data connections running frequently.

Note that this limitation is not due to the tools themselves. If you want to scale this stack and use it for production workloads, you can take one of two paths:

  1. install each of the tools natively on one or more machines dedicated to your production workload (Airbyte Open Source, a PostgreSQL database, dbt Core, and Metabase Open Source), and, where needed (for example with Metabase), configure each of them for production as described in their docs, or
  2. use the cloud-based versions of the tools.

If you go for path #2, the stack we commonly use is Airbyte Cloud + GCP BigQuery + dbt Cloud/dbt Core (which you can also run on GCP if you prefer) + Metabase Cloud/Power BI/Looker Studio/any BI tool of your choice.

Want some help setting this up?

You can contact us here, or at https://www.feld-m.de/. 😀

About

Blueprint for a basic reporting pipeline, from data ingestion up to analytics
