Server Application
The server application is composed of 4 files:
- api.py: this file represents the API of the server
- backbone.py: this file contains the Backbone class that is used for almost all the back end functionality
- utilities.py: a Python module containing helper functions; these functions are used by the Backbone class and, some of them, by the API
- dedupe_interlinking_data.ipynb: this Jupyter notebook file contains the Dedupe matching algorithm
The input files that the app can receive are:
- .csv file, which is the first input data set containing data about companies
- (OPTIONAL) .csv file, which is the second input data set; if this file is not given, a data set will be created by extracting companies from the database, based on the jurisdiction specified in configuration_file_bs.json
- configuration_file_bs.json: this is the configuration file that the algorithm needs; an example of the configuration file can be found here
- (OPTIONAL) training file: the training file is needed by the Dedupe matching algorithm; if the user has a training file, then it can be uploaded to the server; otherwise, the user can create one in the client application or when he/she runs the Jupyter notebook independently; an example of a training file can be found here
- (OPTIONAL) settings file: this is a binary file that is created when the Jupyter notebook is run (with the name settings_file), and contains the learned model from the training part; the user can download this file via the URL for downloading files and upload it again later when he/she wants to rerun the algorithm
The configuration file must be in JSON format and its name must always be "configuration_file_bs.json" (bs stands for backbone script).
It contains the parameters used by the algorithm; a sketch of an example configuration appears after the list below. These parameters are:
- input_file_1 and input_file_2 (type: string) -> names of the two input data sets; the value for input_file_2 is optional (if the user doesn't upload a second input file, this field's value should be null)
- provider_1_name and provider_2_name (type: string) -> names of the providers that gave the two input data sets; the value for provider_2_name is optional (if the user doesn't upload a second input file, this field's value should be null)
- jurisdiction (type: string) -> if the user doesn't upload a second input file, he/she needs to specify the jurisdiction of the companies that are in the first input data set; based on that jurisdiction, the algorithm will extract companies from the database and use them to create a second input data set, which will be fed to the Dedupe matching algorithm
- training -> contains the parameters used in the training part:
- nr_of_examples_for_training (type: integer) -> this parameter is used in the sample method of Dedupe (this parameter's name is sample_size in the sample method from Dedupe)
- field_definitions (type: list of dictionaries) -> a list of dictionaries, each with the keys "field" and "type"; "field" is the name of a column used for matching entities and "type" is that column's type. Note: there are no INT or FLOAT types, so declare such columns as String; to see the valid types that Dedupe accepts, click here.
- create_training_file_by_client (type: boolean) -> specifies whether the user has created a training file using the interface provided in the client application.
- training_file (type: string) -> the name of the training file uploaded to the server by the user. Note: if this field's value is null and create_training_file_by_client is false, the Dedupe algorithm will create a training file by asking the user to label pairs of companies that it is unsure about; this interactive process takes place only on the server side, so users of the client application must either upload a training file or create one in the client application; labeling means stating whether two records match, i.e., whether they refer to the same company or not
- settings_file (type: string) -> represents the name of the settings_file; if this value is null then a new settings file will be created by the matching algorithm.
- threshold (type: float - between 0 and 1 inclusive) -> this is a parameter of the match method of Dedupe; if the value is close to 0, then the algorithm is expected to have a high recall rate, otherwise, if the value is close to 1, then the algorithm is expected to have a high precision rate; this method is called in the clustering part; if the threshold value is null and compute_threshold value is null, then the default value of the method is used.
- compute_threshold -> if threshold is different from null, this parameter is not taken into consideration; if this parameter is specified and threshold is null, the following parameters are used for computing the threshold with the threshold method of Dedupe:
- recall_weight (type: float) -> it is a parameter of Dedupe's threshold method (more info if you click on the 'threshold' hyperlink above)
- nr_of_sample_data_for_threshold (type: integer) -> the threshold method needs a sample of examples from each data set (together with the recall_weight parameter) to compute the threshold; this field represents the number of examples that each sample will have.
- database_config -> this object is used for configuring the database used in the evaluation part. The configuration parameters required for the database are:
- database_name (type: string)
- username (type: string)
- password (type: string)
- host (type: string)
- port (type: integer)
- evaluation -> if the value of this object is null, the evaluation part won't be executed. If it is different from null, the user should specify in label_column_name (type: string) the name of the column (from the data set) that represents the ground truth (the label). For example, the ground truth can be the id column, which stores the ids of the companies.
The field types should respect the convention format found here.
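A minimal sketch of what such a configuration could look like, written as a Python dict and dumped to configuration_file_bs.json. All file names, provider names, jurisdictions, and database credentials below are hypothetical placeholders, and the exact nesting of the threshold-related fields should be checked against the example configuration linked above:

```python
import json

# Hypothetical example values; replace them with your own data sets,
# providers, and database credentials.
config = {
    "input_file_1": "companies_provider_a.csv",
    "input_file_2": None,            # no second data set uploaded
    "provider_1_name": "Provider A",
    "provider_2_name": None,
    "jurisdiction": "NO",            # used to build the second data set from the database
    "training": {
        "nr_of_examples_for_training": 15000,
        "field_definitions": [
            {"field": "legal_name", "type": "String"},
            {"field": "thoroughfare", "type": "String"},
        ],
        "create_training_file_by_client": False,
        "training_file": "training_file.json",
        "settings_file": None,       # a new settings file will be created
        "threshold": 0.5,            # passed to Dedupe's match method
    },
    "compute_threshold": None,       # ignored here because threshold is set
    "database_config": {
        "database_name": "backbone_db",
        "username": "db_user",
        "password": "db_password",
        "host": "localhost",
        "port": 5432,
    },
    "evaluation": None,              # or e.g. {"label_column_name": "id"}
}

with open("configuration_file_bs.json", "w") as f:
    json.dump(config, f, indent=4)
```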
The API exposes several HTTP endpoints, all of them being GET or POST requests.
This is both a POST and a GET method:
- if it is used as a POST method, the user can upload one or more files at a time, and the files are stored on the server
- this request can be used as a GET method only when the URL is accessed from a browser: in that case the method returns HTML code that lets the user search the file system for a file and upload it to the server
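For example, uploading the configuration file programmatically might look like the sketch below; the server address, endpoint path, and multipart form field name are assumptions for illustration, so substitute the actual upload URL exposed by api.py:

```python
import requests

# Hypothetical server address and upload route.
UPLOAD_URL = "http://localhost:5000/upload"

# The multipart form field name ("file") is also an assumption.
with open("configuration_file_bs.json", "rb") as f:
    response = requests.post(UPLOAD_URL, files={"file": f})

print(response.status_code, response.text)
```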
This URL is a GET request. The user can download a file from the server if he/she accesses this URL and replaces <filename>
with the actual name of the file he/she wants to retrieve.
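A possible way to retrieve a file from Python, assuming the server runs on localhost:5000 and the download route is /uploads/<filename> (both assumptions; use the actual download URL):

```python
import requests

# Hypothetical base URL and route; substitute the real download URL.
filename = "settings_file"
response = requests.get(f"http://localhost:5000/uploads/{filename}")

# Save the downloaded bytes locally under the same name.
with open(filename, "wb") as f:
    f.write(response.content)
```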
thoroughfare represents the street address in the euBusinessGraph ontology (more info can be found here)
This URL is a GET request that will return all the companies stored in the database whose street address (thoroughfare) matches the value specified instead of <thoroughfare> in the URL
legal_name represents the name of the company in the euBusinessGraph ontology (more info can be found here)
This URL is a GET request that will return all the companies stored in the database whose name (legal_name) matches the value specified instead of <legal_name> in the URL
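For illustration, querying these two endpoints from Python might look like the sketch below; the base URL and route patterns are assumptions, so substitute the actual URLs exposed by api.py:

```python
import requests
from urllib.parse import quote

BASE_URL = "http://localhost:5000"   # hypothetical server address

# Companies whose street address (thoroughfare) matches the given value;
# the route pattern is an assumption for illustration.
by_street = requests.get(f"{BASE_URL}/companies/thoroughfare/{quote('Main Street 1')}")

# Companies whose legal name (legal_name) matches the given value.
by_name = requests.get(f"{BASE_URL}/companies/legal_name/{quote('SINTEF')}")

print(by_street.json())
print(by_name.json())
```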
This URL is a POST request that creates a binary file on the server called uncertain_pairs_file, containing a list of 200 pairs of companies that the Dedupe matching algorithm is unsure about. These pairs are used by the users (in the client application) to create a training file for the matching algorithm: the users label the examples, i.e., say which pairs of companies are a match (the two records refer to the same company) and which are not (the records refer to different companies).
This URL is a POST request, which will start running the main algorithm (matching companies + inserting them in the database). It assumes all the necessary files were uploaded beforehand.
The execution flow is the following (a simplified sketch appears after the list):
- Create a Backbone object (which will also create the configuration file for Dedupe)
- If the user has not provided the 2nd input data set, a temporary file (data set) containing companies extracted by 'jurisdiction' from the database will be created using the Backbone object; NOTE 1: if some of the extracted companies belong to the same cluster (of at least 2 elements), this file will keep only 1 company per cluster; we don't keep 2 companies from the same cluster, because the Dedupe matching algorithm does a 1-to-1 match between the data sets, so it will never create clusters of 3 or more elements; NOTE 2: in the database we can have clusters (links/connections) of 3 or more companies
- Execute the Jupyter notebook (run all its cells)
- If a data set with companies from the database was created, and the matching algorithm created links between the new companies and the ones from the database, then, the algorithm will update the cluster_ids of the new companies that matched with the existing companies. The matching algorithm assigns cluster_ids (to the new clusters) that are not in the backbone_index table. This is why, in case old companies are in the same cluster with new companies, we need to update that cluster_id to be the same as the one of the company that was already in the database.
- Insert the new cluster_ids into the backbone_index table
- Create table(s) in the database and insert the data set(s) resulted from the Dedupe matching algorithm.
- Remove all the files that were used in the process, except for the configuration file provided by the user. We do not remove this file because, if the user later wants to see results stored in the database, he/she would need to provide the configuration file again (the system needs the database configuration data), so we leave it there for convenience.
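The sketch below summarizes this flow; the Backbone method names are hypothetical stand-ins for whatever the actual implementation in backbone.py calls them:

```python
import os
from backbone import Backbone  # the class from backbone.py

def run_matching(config_path="configuration_file_bs.json"):
    """Simplified sketch of the run endpoint; the method names are hypothetical."""
    backbone = Backbone(config_path)  # also writes configuration_file_dedupe.json

    # If no second input data set was uploaded, build one from the database
    # by jurisdiction, keeping a single company per existing cluster.
    if backbone.config.get("input_file_2") is None:
        backbone.create_dataset_from_database("tmp_input_file_2.csv")

    # Run all cells of the Dedupe notebook.
    backbone.execute_notebook("dedupe_interlinking_data.ipynb")

    # Align cluster_ids of new companies that matched existing ones,
    # record the new cluster_ids, and persist the results.
    backbone.update_cluster_ids()
    backbone.insert_cluster_ids_into_backbone_index()
    backbone.insert_results_into_database()

    # Remove everything except the user's configuration file.
    for temporary in ("tmp_input_file_2.csv", "settings_file", "uncertain_pairs_file"):
        if os.path.exists(temporary):
            os.remove(temporary)
```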
The Backbone class incorporates the main logic of the server application and can do the following things:
- create a data set with companies from the database extracted by jurisdiction; the data set will NOT contain multiple companies from the same cluster, because the Dedupe matching algorithm does a 1-to-1 match between the input data sets and can therefore create clusters of at most 2 elements (companies); a small sketch of this filtering step follows the list below
- extract the maximum idx (cluster_id) from the backbone_index table
- create the configuration file for the Dedupe matching algorithm
- execute cells from the Jupyter notebook where the Dedupe matching algorithm is written
- search the database for a value in a certain field; e.g., it can search the database for all companies whose name is SINTEF or contains the substring SINTEF
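As an illustration of the "one company per cluster" filtering mentioned in the first item above, here is a minimal sketch using pandas; the column names and example values are hypothetical:

```python
import pandas as pd

# Hypothetical columns: cluster_id identifies the cluster a company belongs to
# in the database, the other columns describe the company.
companies = pd.DataFrame({
    "cluster_id":   [1, 1, 2, 3, 3],
    "legal_name":   ["ACME AS", "ACME A/S", "SINTEF", "Foo Ltd", "Foo Limited"],
    "jurisdiction": ["NO"] * 5,
})

# Keep a single company per cluster before feeding the data set to Dedupe,
# since the matching algorithm only links pairs (clusters of at most 2 elements).
one_per_cluster = companies.drop_duplicates(subset="cluster_id", keep="first")
one_per_cluster.to_csv("tmp_input_file_2.csv", index=False)
```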
Note: the Backbone class has some hard-coded file names:
- configuration_file_bs.json: this is how the configuration file, that the user uploads to the server before running the algorithm, must be named
- configuration_file_dedupe.json: this is how the Dedupe matching algorithm's configuration file will be named when a Backbone object creates it
- tmp_input_file_2.csv: if the user uploads only one input data set and the algorithm has to extract companies by jurisdiction from the database to create the second input file for the matching algorithm, this is the name given to that data set
- dedupe_interlinking_data.ipynb: this is how the Jupyter notebook, that contains the Dedupe matching algorithm, must be named
- training_file.json: if the user creates a training file in the client application, that file is automatically uploaded to the server under the name training_file.json; so, when the Backbone object creates the configuration file for the Dedupe matching algorithm and the create_training_file_by_client field in configuration_file_bs.json is set to true, the Backbone object writes this name into the matching algorithm's configuration file
The utilities module contains the helper functions used by the API and the Backbone class. Every function is commented, so, for more information about each individual function, see the utilities.py file here