The Kaggle challenge has officially finished. Our best model is the chemprop model, which received a public score of 0.410 (rank 911/1984) and a private score of 0.232 (rank 973/1984). The public score covered 2 groups of data (seen vs. unseen building blocks) for each of the three proteins, so 6 groups in total. The private score is expected to be lower because it also tested the model on an additional group of data from a different experimental library. Interestingly, the 5 teams with the highest public scores all fell out of the top 5 on the private leaderboard, which suggests how difficult it is for a model to learn a generalized pattern for binding. Those models may have overfit to the known library and generalized poorly to the new one.
We also explored a Random Forest model, a BERT model, a neural network, and a Graph Neural Network (GNN). We learned how much feature generation and engineering matter: the atom-level and bond-level featurization used by the GNNs gave better results than Morgan fingerprints, and feeding SMILES strings directly into BERT gave the worst performance. Because of the limited time remaining, our own GNN model only extracted very basic atom and bond features (final feature dimension 22), compared with more than 80 features in chemprop. This could be the main reason why chemprop outperformed our GNN model. However, chemprop was only trained on a subset of the complete training data because of an out-of-memory issue, whereas our own GNN model could train on the complete dataset because we could easily process the data in chunks. As we increased the size of the training subset for chemprop, we saw its performance keep improving. Therefore, a possibly better model would be a GNN that uses a sufficiently rich set of atom and bond features and is trained on the complete training dataset.
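For anyone revisiting this, the "process the data in chunks" idea can be sketched roughly as below. This is only an illustration, not our actual training code: the helper names `iter_chunks` and `read_smiles_labels` and the column names `smiles` and `binds` are made up for the example, and the small in-memory CSV stands in for the real training file.

```python
import csv
import io
from itertools import islice
from typing import Iterable, Iterator, List, Tuple


def iter_chunks(rows: Iterable, chunk_size: int) -> Iterator[List]:
    """Yield successive lists of at most chunk_size items from rows."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    it = iter(rows)
    while True:
        chunk = list(islice(it, chunk_size))
        if not chunk:
            return
        yield chunk


def read_smiles_labels(handle) -> Iterator[Tuple[str, int]]:
    """Lazily parse (smiles, label) pairs; the column names are hypothetical."""
    for record in csv.DictReader(handle):
        yield record["smiles"], int(record["binds"])


# Tiny stand-in for the real training file (assumed layout, made-up rows).
demo_csv = io.StringIO(
    "smiles,binds\nCCO,0\nc1ccccc1,1\nCC(=O)O,0\nCCN,1\nCCC,0\n"
)

chunk_sizes = []
positives = 0
for chunk in iter_chunks(read_smiles_labels(demo_csv), chunk_size=2):
    # A real loop would featurize the chunk and run training steps here,
    # so only chunk_size molecules need to be held in memory at once.
    chunk_sizes.append(len(chunk))
    positives += sum(label for _, label in chunk)

print(chunk_sizes, positives)
```

Because both the reader and the chunker are generators, memory use is bounded by the chunk size rather than by the size of the file, which is the property that let a model train on the full dataset.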
Overall, we did a pretty great job in the limited time of this competition: we explored many different models and learned ways to process and train on a very large dataset. Moving forward, we should be more familiar and confident with similar challenges.
Cheers!! 🎉