Dedupe
This page describes how the Dedupe library was integrated in this project.
The algorithm requires the following input files:
- two datasets (files)
- configuration file
- (OPTIONAL) training file - if no training file is given, the user will have to create one in the Jupyter notebook, in the training part
- (OPTIONAL) settings file
These datasets should be in the CSV format.
Their column names should be lowercase. If a name is composed of two or more words, the words should be separated with underscores (the snake case convention) rather than written in camel case.
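The naming convention above can be applied automatically. The helper below is a minimal sketch (it is not part of the project) that converts camel-case or space-separated column names into the expected lowercase, underscore-separated form:

```python
import re

def to_snake_case(name: str) -> str:
    """Convert a column name such as 'CompanyName' to 'company_name'."""
    # Insert an underscore before each uppercase letter that follows a
    # lowercase letter or digit, replace spaces, then lowercase everything.
    s = re.sub(r'(?<=[a-z0-9])([A-Z])', r'_\1', name)
    return s.replace(' ', '_').lower()

print(to_snake_case('CompanyName'))   # company_name
print(to_snake_case('postal code'))   # postal_code
```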
The configuration file should be in the JSON format and its name should always be "configuration_file_dedupe.json".
An example of a configuration file can be found here.
It contains the parameters used by the algorithm:
- input_file_1 and input_file_2 (type: string) -> the names of the two input data sets; input_file_2 is optional (if the user doesn't upload a second input file, this field's value should be null)
- training (contains the parameters used in the training part)
- nr_of_examples_for_training (type: integer) -> used in dedupe's sample method (it corresponds to the sample_size parameter of that method).
- field_definitions (type: list of dictionaries) -> a list of dictionaries, each with the keys "field" and "type". "field" is the name of a column used for matching entities and "type" is the column's type. Note: there are no INT or FLOAT types, so declare such columns as String; to see the valid types that Dedupe accepts click here.
- create_training_file_by_client (type: boolean) -> specifies whether the user has created a training file using the interface provided in the application.
- training_file (type: string) -> the name of an existing training file that was uploaded to the server by the user. Note: if this field's value is null and create_training_file_by_client is false, then the Dedupe algorithm will start creating a training file by asking the user to label pairs of companies the algorithm is unsure about. This process takes place only on the server side, so users on the client side must either upload a training file or create one in the client application. Labeling means deciding whether two records match, i.e., whether they refer to the same company.
- settings_file (type: string) -> the name of the settings file; if this value is null, a new settings file will be created by the matching algorithm.
- threshold (type: float between 0 and 1 inclusive) -> a parameter of dedupe's match method; values close to 0 favor high recall, while values close to 1 favor high precision. This method is called in the clustering part. If both threshold and compute_threshold are null, the method's default value is used.
- compute_threshold -> if threshold is different from null, this parameter is not taken into consideration. If this parameter is specified and threshold is null, the following parameters are used to compute the threshold via dedupe's threshold method:
- recall_weight (type: float) -> it is a parameter of Dedupe's method threshold (more info if you click on the 'threshold' hyperlink above).
- nr_of_sample_data_for_threshold (type: integer) -> the threshold method needs a sample of examples from each data set (together with the recall_weight parameter) to compute the threshold; this field represents the number of examples that each sample will have.
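The interplay between threshold and compute_threshold described above can be sketched as a small decision function. This is an illustration of the configuration rules only, not code from the project; the function name and the default value 0.5 (standing in for dedupe's own default) are assumptions:

```python
def resolve_threshold(config: dict, default: float = 0.5) -> tuple:
    """Decide how the match threshold is obtained, following the
    configuration rules: an explicit threshold wins, otherwise a
    non-null compute_threshold object is used, otherwise the default.
    Returns ('fixed', value), ('computed', params) or ('default', value).
    """
    threshold = config.get('threshold')
    compute = config.get('compute_threshold')
    if threshold is not None:
        # An explicit threshold wins; compute_threshold is ignored.
        return ('fixed', threshold)
    if compute is not None:
        # threshold is null: recall_weight and the sample size are
        # handed to dedupe's threshold method to pick a value.
        return ('computed', compute)
    # Both null: fall back to the match method's default.
    return ('default', default)

print(resolve_threshold({'threshold': 0.8, 'compute_threshold': None}))
# ('fixed', 0.8)
```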
- database_config -> this object configures the database used in the evaluation part. The required configuration parameters are:
- database_name (type: string)
- username (type: string)
- password (type: string)
- host (type: string)
- port (type: integer).
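As an illustration, the database_config object can be turned into a connection string. The snippet below is a sketch only: the PostgreSQL-style DSN format, the build_dsn helper, and all the values are assumptions, not taken from the project:

```python
import json

def build_dsn(db: dict) -> str:
    """Build a PostgreSQL-style connection string from the
    database_config object (the exact DBMS is an assumption here)."""
    return (f"dbname={db['database_name']} user={db['username']} "
            f"password={db['password']} host={db['host']} port={db['port']}")

# Hypothetical configuration fragment with placeholder values.
config = json.loads("""{
  "database_config": {
    "database_name": "companies",
    "username": "dedupe_user",
    "password": "secret",
    "host": "localhost",
    "port": 5432
  }
}""")
print(build_dsn(config["database_config"]))
# dbname=companies user=dedupe_user password=secret host=localhost port=5432
```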
- evaluation -> if the value of this object is null, the evaluation part won't be executed. Otherwise, the user should specify in label_column_name (type: string) the name of the column that represents the ground truth. In our case the ground truth was the id column, which contains the official IDs of companies.
- last_cluster_id (type: integer) -> the maximum cluster_id (idx) present in the backbone_index table; the algorithm will assign cluster ids starting from last_cluster_id + 1.
The types of the fields should respect the convention format found [here](https://simplejson.readthedocs.io/en/latest/#simplejson.JSONDecoder).
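Putting the parameters above together, a configuration file could look like the following. This is an illustrative sketch with placeholder values (file names, fields, and credentials are assumptions, not taken from the project):

```json
{
  "input_file_1": "companies_a.csv",
  "input_file_2": "companies_b.csv",
  "training": {
    "nr_of_examples_for_training": 15000,
    "field_definitions": [
      {"field": "company_name", "type": "String"},
      {"field": "postal_code", "type": "String"}
    ],
    "create_training_file_by_client": false,
    "training_file": "training_file.json"
  },
  "settings_file": null,
  "threshold": null,
  "compute_threshold": {
    "recall_weight": 1.0,
    "nr_of_sample_data_for_threshold": 10000
  },
  "database_config": {
    "database_name": "companies",
    "username": "dedupe_user",
    "password": "secret",
    "host": "localhost",
    "port": 5432
  },
  "evaluation": {
    "label_column_name": "id"
  },
  "last_cluster_id": 0
}
```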
The training file is in the JSON format. It contains two categories: "match" - examples from which the library learns to identify when two companies are the same entity - and "distinct" - examples from which the library learns to identify when two companies are not the same entity.
Notes:
- the library knows to match cases where all the fields of the two examples are identical
- the library knows that if all the fields of the two examples are different, then the companies are distinct
An example of a training file can be found here.
You can create one manually, but you should follow a specific template. Alternatively, the library can create one for you automatically: in the training part it presents example pairs for you to label because it is uncertain about them.
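For illustration, a training file with the two categories could look like the sketch below. The field names and values are placeholders, and the exact record layout can vary between dedupe versions, so treat this only as a shape of the "match"/"distinct" structure:

```json
{
  "match": [
    [
      {"company_name": "acme inc", "postal_code": "10001"},
      {"company_name": "acme incorporated", "postal_code": "10001"}
    ]
  ],
  "distinct": [
    [
      {"company_name": "acme inc", "postal_code": "10001"},
      {"company_name": "globex ltd", "postal_code": "90210"}
    ]
  ]
}
```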
The settings file is a binary file which contains information about the model. This model is the result of the training part.
More details about the algorithm's steps can be found in the Jupyter notebook file.