Link prediction is the task of predicting whether a link exists between two nodes in a network. Here we predict whether one research publication cites another. To do so, we have access to a citation network of hundreds of thousands of research publications, along with their abstracts and author lists.
The pipeline is the same as for any classification problem: learn the parameters of a classifier from edge information, then use the classifier to predict whether two nodes are connected by an edge. Our goal in this project is to transform the different types of data (abstracts, authors and citation graph) into a feature matrix that we can feed to the classifier tackling the link prediction problem. Model performance is evaluated with the log loss metric.
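As a rough illustration of this pipeline, the sketch below builds a tiny feature matrix for candidate node pairs, fits a classifier on it, and scores the predictions with log loss. Everything in it is an assumption made for the example: the toy graph, the two structural features (common neighbors and Jaccard similarity), and the hand-written logistic regression are stand-ins, not the features or models used in this repository.

```python
import math

def pair_features(adj, u, v):
    """Bias term plus two structural features for the candidate pair (u, v)."""
    # Ignore the pair's own edge so the label does not leak into the features.
    nu = adj.get(u, set()) - {v}
    nv = adj.get(v, set()) - {u}
    common = len(nu & nv)
    union = len(nu | nv)
    jaccard = common / union if union else 0.0
    return [1.0, float(common), jaccard]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def log_loss(y_true, y_prob, eps=1e-15):
    """Mean negative log-likelihood of binary labels under predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / len(y_true)

def fit_logistic(X, y, lr=0.1, epochs=500):
    """Plain batch gradient descent on the log loss (illustrative only)."""
    w = [0.0] * len(X[0])
    n = len(X)
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi))) - yi
            for j, xj in enumerate(xi):
                grad[j] += err * xj / n
        w = [wj - lr * gj for wj, gj in zip(w, grad)]
    return w

def predict_proba(w, X):
    return [sigmoid(sum(wj * xj for wj, xj in zip(w, xi))) for xi in X]

# Toy graph (hypothetical): two fully connected groups of five nodes each.
groups = [range(0, 5), range(5, 10)]
edges = [(u, v) for g in groups for u in g for v in g if u < v]
adj = {}
for u, v in edges:
    adj.setdefault(u, set()).add(v)
    adj.setdefault(v, set()).add(u)

# Positive pairs are existing edges; negative pairs connect the two groups.
non_edges = [(u, v) for u in groups[0] for v in groups[1]]
pairs = edges + non_edges
y = [1] * len(edges) + [0] * len(non_edges)

# Feature matrix: one row per candidate pair.
X = [pair_features(adj, u, v) for u, v in pairs]

w = fit_logistic(X, y)
loss = log_loss(y, predict_proba(w, X))
print(f"training log loss: {loss:.4f}")
```

For simplicity the sketch scores the same pairs it was trained on; a real evaluation would use held-out pairs, as with the validation dataset mentioned in this README.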
This model was created for the Kaggle competition of the 2021/2022 Advanced learning for text and graph data course. It ranked first on both the public and private leaderboards.
The team OverTen is composed of Xavier Jiménez, Jean Quentin and Sacha Revol.
The best submission and the results on the validation dataset can be reproduced using the best_submission.ipynb file.
File Preprocessing.ipynb handles preprocessing for abstracts, authors and graph data.
File ALTEGRAD_project_v2.ipynb handles the different steps of feature matrix creation and model evaluation (LR, RF, XGBoost, LGBM, CatBoost).
File nn-classifier.ipynb implements the MLP classifier.
Files weighted_co_authors_graph.py, utils.py and citation_graph.py handle author graph creation.
Files *_embedding.py/ipynb handle abstract and graph node embeddings.
Files *_optimization.ipynb find the best hyperparameters for a given model using the HyperOpt package.