This repo contains the development and experimental codebase of AutoFeat: Transitive Feature Discovery over Join Paths
The code can be run locally or inside Docker.
- Python 3.8
- Java (for data discovery only - Valentine)
- neo4j 5.1.0 or 5.3.0
- Create virtual environment
python -m venv {env-name}
- Activate environment
source {env-name}/bin/activate
- Install requirements
pip install -e .
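If the installation succeeded, the feature-discovery-cli entry point used throughout this README should now be available; a quick check:
feature-discovery-cli --help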
LightGBM on AutoGluon gives a segmentation fault or won't run unless you install the correct libomp, as described here. Steps:
wget https://raw.githubusercontent.com/Homebrew/homebrew-core/fb8323f2b170bd4ae97e1bac9bf3e2983af3fdb0/Formula/libomp.rb
brew uninstall libomp
brew install libomp.rb
rm libomp.rb
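To verify that the pinned libomp formula is the one now installed, you can ask Homebrew (a standard brew command, not specific to this setup):
brew list --versions libomp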
Working with neo4j is easier using the Neo4j Desktop application.
- First, download Neo4j Desktop.
- Open the app
- "Add" > "Local DBMS"
- Give a name to the DBMS, add a password, and choose Version 5.1.0.
- Change the "password" in config.py:
NEO4J_PASS = os.getenv("NEO4J_PASS", "password")
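Since the config falls back to "password" only when the NEO4J_PASS environment variable is unset (as the os.getenv call above shows), you can also export the password instead of editing the file:
export NEO4J_PASS="your-dbms-password"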
- "Start" the DBMS
- Once it started, "Open"
- Now you can see the neo4j browser, where you can query the database or create new ones, as we will do in the next steps.
The Docker image already contains everything necessary for development.
- Open a terminal and go to the project root (where the docker-compose.yml is located).
- Build the necessary Docker containers (note: this step takes a while):
docker-compose up -d --build
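To confirm the containers came up (the feature-discovery-runner container is used in later steps), you can list them with a standard Compose command:
docker-compose ps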
- Download our experimental datasets and put them in data/benchmark.
To ingest the data in local development, it is necessary to follow the steps from the Neo4j Desktop setup beforehand. For Docker, the neo4j browser is available at localhost:7474; no user or password is required.
- Create database benchmark in neo4j:
  - Local development - Follow the steps from the Neo4j Desktop setup beforehand.
  - Docker - Go to localhost:7474 to access the neo4j browser.
Input in neo4j browser console:
create database benchmark
Wait about a minute until the database becomes available, then switch to it:
:use benchmark
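Instead of waiting a fixed minute, you can check readiness from the same console; SHOW DATABASES is standard Neo4j 5 Cypher (not specific to this repo) and lists each database's currentStatus:
SHOW DATABASES;
Proceed once benchmark reports currentStatus online.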
- Ingest data
- (Docker) Bash into the container:
docker exec -it feature-discovery-runner /bin/bash
- (Local development) Open a terminal and go to the project root.
- Ingest the data using the following command:
feature-discovery-cli ingest-kfk-data
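As a rough sanity check that ingestion populated the graph, you can count nodes in the neo4j browser while benchmark is the active database (plain Cypher, nothing repo-specific); a non-zero count means the ingest wrote data:
MATCH (n) RETURN count(n);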
1. Go to config.py and set
NEO4J_DATABASE = 'lake'
2. If Docker is running, restart it.
3. Create database lake in neo4j:
   - Local development - Follow the steps from the Neo4j Desktop setup beforehand.
   - Docker - Go to localhost:7474 to access the neo4j browser.
Input in neo4j browser console:
create database lake
Wait about a minute until the database becomes available, then switch to it:
:use lake
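Note on step 1 above: if config.py follows the same os.getenv pattern shown earlier for NEO4J_PASS (an assumption — check your config.py), the database name could also be supplied via the environment instead of editing the file:
export NEO4J_DATABASE=lake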
- Ingest data. Depending on how many cores you have, this step can take up to 1-2 hours.
- (Docker) Bash into the container:
docker exec -it feature-discovery-runner /bin/bash
- (Local development) Open a terminal and go to the project root.
- Ingest the data using the following command:
feature-discovery-cli ingest-data --data-discovery-threshold=0.55 --discover-connections-data-lake
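As with the experiment commands below, the CLI documents its own flags, so the options used above (and any others) can be inspected with:
feature-discovery-cli ingest-data --help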
To run the experiments in Docker, first bash into the container:
docker exec -it feature-discovery-runner /bin/bash
feature-discovery-cli --help
will show the commands for running experiments:
run-all
Runs all experiments (ARDA + base + AutoFeat).
feature-discovery-cli run-all --help
will show you the parameters needed to run it.
run-arda
Runs the ARDA experiments.
feature-discovery-cli run-arda --help
will show you the parameters needed to run it.
--dataset-labels
has to be the label of one of the datasets from the datasets.csv file, which resides in data/benchmark.
--results-file
by default, the results are saved as CSV with a predefined filename in the results folder.
Example:
feature-discovery-cli run-arda --dataset-labels steel
will run the experiments on the steel dataset and save the results in the results folder.
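If you want the output under a custom name, the --results-file option described above can be combined with the same command (the filename below is just an illustration):
feature-discovery-cli run-arda --dataset-labels steel --results-file arda-steel.csv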
run-base
Runs the base experiments.
feature-discovery-cli run-base --help
will show you the parameters needed to run it.
--dataset-labels
has to be the label of one of the datasets from the datasets.csv file, which resides in data/benchmark.
--results-file
by default, the results are saved as CSV with a predefined filename.
Example:
feature-discovery-cli run-base --dataset-labels steel
will run the experiments on the steel dataset and save the results in the results folder.
run-tfd
Runs the AutoFeat experiments.
feature-discovery-cli run-tfd --help
will show you the parameters needed to run it.
--dataset-labels
has to be the label of one of the datasets from the datasets.csv file, which resides in data/benchmark.
--results-file
by default, the results are saved as CSV with a predefined filename.
--value-ratio
one of the hyper-parameters of our approach; it represents a data quality metric: the percentage of null values allowed in the datasets. Default: 0.55.
--top-k
one of the hyper-parameters of our approach; it represents the number of features to select from each dataset and the number of paths. Default: 15.
Example:
feature-discovery-cli run-tfd --dataset-labels steel
will run the experiments on the steel dataset and save the results in the results folder.
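The hyper-parameters can also be passed explicitly together with a custom results file; this sketch simply spells out the documented defaults (the filename is illustrative):
feature-discovery-cli run-tfd --dataset-labels steel --value-ratio 0.55 --top-k 15 --results-file tfd-steel.csv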
Main source for finding datasets.
- To recreate our plots, first download the results from here.
- Add the results to the results folder.
- Then open the Jupyter notebook; run in the root folder of the project:
jupyter notebook
- Open the file Visualisations.ipynb.
- Run every cell.
We conducted an empirical analysis of the most popular feature selection strategies based on relevance and redundancy.
These experiments are documented at: https://github.com/delftdata/bsc_research_project_q4_2023/tree/main/autofeat_experimental_analysis
This repository was created and is maintained by Andra Ionescu.