The Backbone Application is a client-server application, based on a machine learning algorithm, that matches instances from different datasets and stores them in a database.
The instance matching algorithm is used to match data about companies. It receives a configuration file and two input datasets (.csv files), each containing data about different companies, and tries to find and create links between two entities that refer to the same company. The 'links' are represented by clusters, i.e., if the algorithm matched two companies, they will be put in the same cluster. The output is made of two .csv files, which are composed of the data that was in the input files and two new columns: 'cluster_id' (the id of the cluster the company was assigned to) and 'link_score' (score representing how similar that company is with the others that were assigned to the same cluster).
Notes:
- companies that do not match with other companies from the other dataset are assigned to their own cluster (a 1 element cluster)
- the algorithm does a one-to-one match between the two datasets, i.e., it matches at most two companies from the two datasets; if there are two entities in the same dataset that refer to the same company, they will for sure not end up in the same cluster
The server side is a RESTful application that:
- can receive all the neccessary input files that the matching algorithm needs
- if the user only sent one dataset with company data, the server can extract company entities from the database (based on the jurisdiction the user specifies in the configuration file) and create the second dataset that the algorithm needs
- can insert into the database the results of the matching algorithm
- can run the matching algorithm when all the neccessary input files were provided
- can search by name or address in the database for companies
The client side is a desktop application where the user can:
- select and upload the input files that the algorithm needs
- create a training file for the algorithm (the file is automatically sent to the server after it was created)
- start the algorithm (after all the neccessary files were uploaded)
- search for companies, by their names or addresses, in the database
Run the api.py script in an IDE that supports Python or from a terminal like in the example below:
python3 api.py
Run the client_app.py script in an IDE that supports Python or from a terminal like in the example below
python3 client_app.py
To run only the instance matching algorithm, one needs to have Jupyter Notebook installed and open the dedupe_interlinking_data.ipynb file, that can be found in the server_app folder, with Jupyter notebook.
- Jupyter notebook (needed only if the matching algorithm is to be run individually) - intallation guide
- Python 3 - installation guide
- pandas (if Jupyter notebook and Anaconda are not installed) - installation guide
- numPy (if Jupyter notebook and Anaconda are not installed) - installation guide
- dedupe - Dedupe's GitHub page can be found here
pip install dedupe
- unidecode (used in the instance matching algorithm for preprocessing data)
pip install unidecode
- simplejson (e.g.: used in the instance matching algorithm to read the JSON configuration file)
pip install simplejson
- flask - installation guide
- requests
- Database:
- PostgreSQL database - installation guide
- psycopg - installation guide
- (OPTIONAL) pgAdmin - tool for managing and visualizing the postgreSQL database; download here
The versions of the modules at the development time can be accessed here
More detailed documentation can be found in the wiki pages.