CommunityLab is an open-source, ready-to-use configuration system similar to Amazon Elastic MapReduce (EMR) and Azure HDInsight.
Build your own highly available and scalable open-source IDE using Ansible. The relevant Ansible Collections of the IDE are tested with Ansible Molecule.
CommunityLab consists of the following components:
Component | Version |
---|---|
JupyterHub | 5.2.0 |
JupyterLab | 4.3.0 |
Apache Hadoop | 3.4.1 |
Apache Spark | 3.5.3 |
Apache Zookeeper | 3.9.3 |
PostgreSQL | 16 |
Each component can be deployed with Ansible Molecule on your local notebook using Docker containers and self-signed certificates (see step 2).
If you want to set up your own VMs for the IDE, you can use Terraform (Hetzner Cloud). The Terraform deployment process is also tested with Terratest (see step 3).
If you want to use a different Cloud provider or on-premises machines, you can specify a custom Ansible inventory file (see step 4).
High-level design:
The IDE can be installed as a highly available (HA) system or in non-HA mode. By default, non-HA mode is used to save costs when installing and running the IDE in a Cloud environment.
Low-level design (Non-HA setup):
If you want to set up the IDE in HA mode, change the relevant Terraform variable for Hetzner Cloud (see step 3.2.1) or define 3 master nodes in your custom Ansible inventory file when using on-premises machines or a different Cloud provider.
Low-level design (HA setup):
The conda environments used for JupyterHub and JupyterLab contain the following packages:
When using Hetzner Cloud, the following costs have to be considered (non-HA):
Server Name | Server Type | CPU (cores) | RAM (GB) | Costs (day) | Costs (month) |
---|---|---|---|---|---|
hub1 | CPX31 | 4 | 8 | 0,60 € | 15,59 € |
master1 | CPX41 | 8 | 16 | 1,18 € | 29,39 € |
worker1 | CPX51 | 16 | 32 | 2,50 € | 64,74 € |
worker2 | CPX51 | 16 | 32 | 2,50 € | 64,74 € |
worker3 | CPX51 | 16 | 32 | 2,50 € | 64,74 € |
security1 | CPX11 | 2 | 2 | 0,17 € | 4,58 € |
CommunityLab (total) | | 62 | 122 | 9,45 € | 243,78 € |
If you are a German speaker, you may be interested in my related academic work: Thesis.pdf
- Ubuntu (was tested on Ubuntu 24.04 LTS)
- Ansible (was tested on Ansible version 2.17.4)
- Python (was tested on Python version 3.12.6)
- Molecule (was tested on Molecule version 24.9.0)
- Docker (was tested on Docker version 27.1.2, required for Ansible Molecule)
- Terraform (was tested on Terraform version v1.9.3)
- Go (was tested on Go version go1.18.1)
- A valid domain name
- Hetzner Account, Hetzner Cloud API Token (Read/Write) and Hetzner DNS Token
The installation process was tested on Ubuntu 24.04 LTS and on the Windows Subsystem for Linux (WSL) with Ubuntu.
georg@notebook:~/git/CommunityLab$ bash requirements.sh
georg@notebook:~/git/CommunityLab$ find . -name molecule
./collections/ansible_collections/jupyter/hub/extensions/molecule
./collections/ansible_collections/authentication/kerberos/extensions/molecule
./collections/ansible_collections/hadoop/hdfs/extensions/molecule
./collections/ansible_collections/hadoop/yarn/extensions/molecule
./collections/ansible_collections/bigdata/spark/extensions/molecule
./collections/ansible_collections/bigdata/zookeeper/extensions/molecule
./collections/ansible_collections/rdbms/postgres/extensions/molecule
./collections/ansible_collections/authorization/ldap/extensions/molecule
(The Ansible Collection bigdata.spark is an exception, since its only purpose is the installation of common Apache Spark libraries.)
- default (simple installation process of the Ansible Collection without High Availability)
- ha_setup (more complex installation process of the Ansible Collection with High Availability; see the example below)
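For example, to run the complete Molecule test sequence (create, converge, verify, destroy) for the HA scenario of a collection, you can use `molecule test`; the hadoop.hdfs collection is shown here purely as an illustration:

```bash
georg@notebook:~/git/CommunityLab$ cd collections/ansible_collections/hadoop/hdfs/extensions/
georg@notebook:~/git/CommunityLab/collections/ansible_collections/hadoop/hdfs/extensions$ molecule test -s ha_setup
```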
georg@notebook:~/git/CommunityLab$ cd collections/ansible_collections/jupyter/hub/extensions/
georg@notebook:~/git/CommunityLab/collections/ansible_collections/jupyter/hub/extensions$ molecule converge -s default
georg@notebook:~/git/CommunityLab/collections/ansible_collections/jupyter/hub/extensions$ docker container ls
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
a33331864a0e geerlingguy/docker-ubuntu2404-ansible "/usr/lib/systemd/sy…" About an hour ago Up About an hour instance-6
9ea4bebf4d72 geerlingguy/docker-ubuntu2404-ansible "/usr/lib/systemd/sy…" About an hour ago Up About an hour instance-5
24f7cf1a2789 geerlingguy/docker-ubuntu2404-ansible "/usr/lib/systemd/sy…" About an hour ago Up About an hour instance-4
7c8c3790565b geerlingguy/docker-ubuntu2404-ansible "/usr/lib/systemd/sy…" About an hour ago Up About an hour instance-3
55440ed1459e geerlingguy/docker-ubuntu2404-ansible "/usr/lib/systemd/sy…" About an hour ago Up About an hour instance-2
16591b433003 geerlingguy/docker-ubuntu2404-ansible "/usr/lib/systemd/sy…" About an hour ago Up About an hour instance-1
georg@notebook:~/git/CommunityLab/collections/ansible_collections/jupyter/hub/extensions$ docker inspect -f '{{range.NetworkSettings.Networks}}{{.IPAddress}}{{end}}' instance-1
172.23.27.3
georg@notebook:~/git/CommunityLab/collections/ansible_collections/jupyter/hub/extensions$ firefox
2.8 Log in to JupyterHub here using the credentials from the ldap_users variable:
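If the link above is not available in your setup, the URL can be built from the container IP determined before; the port 8443 is an assumption here, carried over from the JupyterHub login URL used in step 3:

```bash
georg@notebook:~/git/CommunityLab/collections/ansible_collections/jupyter/hub/extensions$ firefox https://172.23.27.3:8443
```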
georg@notebook:~/git/CommunityLab/collections/ansible_collections/jupyter/hub/extensions$ molecule destroy -s default
georg@notebook:~/git/CommunityLab$ bash requirements.sh all
3.2.1 Define the variables for your custom infrastructure (mandatory: hetzner_token, hetznerdns_token, ssh_public_key_file, ssh_private_key_file, user, domain; optional: ide_ha_setup, set to true for the IDE in HA mode)
georg@notebook:~/git/CommunityLab$ cd terraform/
georg@notebook:~/git/CommunityLab/terraform$ vim terraform.tfvars
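A minimal terraform.tfvars covering the mandatory variables might look like this; all values below are placeholders:

```bash
cat > terraform.tfvars <<'EOF'
hetzner_token        = "<your Hetzner Cloud API token>"
hetznerdns_token     = "<your Hetzner DNS token>"
ssh_public_key_file  = "~/.ssh/id_ed25519.pub"
ssh_private_key_file = "~/.ssh/id_ed25519"
user                 = "georg"
domain               = "example.com"
# optional, set to true for the IDE in HA mode
ide_ha_setup         = false
EOF
```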
georg@notebook:~/git/CommunityLab/terraform$ cd test/
georg@notebook:~/git/CommunityLab/terraform/test$ go mod init hcloud.tf
georg@notebook:~/git/CommunityLab/terraform/test$ go mod tidy
georg@notebook:~/git/CommunityLab/terraform/test$ go test deployment_test.go -v
3.2.4 Verify that the DNS and reverse DNS entries created for the VMs in Hetzner Cloud are correct and that an SSH connection to the VMs is possible
georg@notebook:~/git/CommunityLab/terraform/test$ go test connection_test.go -v
georg@notebook:~/git/CommunityLab/terraform/test$ go test destruction_test.go -v
georg@notebook:~/git/CommunityLab/terraform/test$ cd ../
georg@notebook:~/git/CommunityLab/terraform$ terraform init
georg@notebook:~/git/CommunityLab/terraform$ terraform apply
georg@notebook:~/git/CommunityLab/terraform$ terraform apply
georg@notebook:~/git/CommunityLab$ vim group_vars/all.yml
georg@notebook:~/git/CommunityLab$ ansible-playbook setup.yml
Use the credentials from the ldap_users variable and log in here: https://hub1.example.com:8443
Use the credentials from the ldap_users variable and log in here: https://jupyterhub.example.com
georg@notebook:~/git/CommunityLab$ cd terraform
georg@notebook:~/git/CommunityLab/terraform$ terraform destroy
4.1 Copy the Terraform inventory template and change the relevant variables for your custom environment if necessary
georg@notebook:~/git/CommunityLab$ cp terraform/inventory_non_ha_ide.tpl inventory
georg@notebook:~/git/CommunityLab$ cp terraform/inventory_ha_ide.tpl inventory
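For on-premises machines or a different Cloud provider you can also write the inventory yourself. The group names below are illustrative, derived from the server names in the cost table above; keep the group structure of the template you copied:

```bash
cat > inventory <<'EOF'
[hub]
hub1.example.com

# HA mode requires 3 master nodes (non-HA: just master1)
[master]
master1.example.com
master2.example.com
master3.example.com

[worker]
worker1.example.com
worker2.example.com
worker3.example.com

[security]
security1.example.com
EOF
```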
georg@notebook:~/git/CommunityLab$ ansible-playbook setup.yml
After entering JupyterLab, you can work with your GitHub or GitLab repositories by cloning them using the provided Git integration on the left side:
Having successfully cloned your repository, you can now interact with it directly:
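If you prefer the terminal over the Git extension, cloning works there as well; the repository URL below is a placeholder:

```bash
git clone https://github.com/<your-user>/<your-repo>.git
```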
Besides classical Data Science analysis using Python packages like NumPy, Pandas or scikit-learn, you may want to use Apache Spark in JupyterLab. Since you are already connected to the Hadoop ecosystem when logging into JupyterLab, you can use the provided kernels Python 3, R and Apache Toree to interact with HDFS as follows. The SPARK_HOME environment variable (/opt/apache-spark/spark) is already set for your container after logging into JupyterLab:
Python 3 kernel:
R kernel:
Apache Toree kernel:
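Whichever kernel you use, the environment can be verified beforehand in a JupyterLab terminal; a minimal sketch, assuming the Hadoop client tools are on the PATH of your container:

```bash
echo $SPARK_HOME   # should print /opt/apache-spark/spark
hdfs dfs -ls /     # lists the HDFS root, e.g. the /share folder
```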
After finishing your work in JupyterLab, you can persist all your files in HDFS and stop your running YARN container.
You can copy your files to HDFS using the hdfs dfs -copyFromLocal command in the JupyterLab terminal. If you want to provide files to other members of your Data Science project, just copy them to the /share folder in HDFS. Files in this folder can be changed and deleted by all IDE users:
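For example (the file name is illustrative):

```bash
hdfs dfs -copyFromLocal analysis.ipynb /share/
```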
Other team members can easily access them using the hdfs dfs -copyToLocal command in the JupyterLab terminal and download them into their running YARN container:
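For example, matching the upload above:

```bash
hdfs dfs -copyToLocal /share/analysis.ipynb .
```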