Summary of our Kaggle Competition 🎉 #11

wangwpi commented Jul 9, 2024

The Kaggle challenge has officially finished. Our best model was the chemprop model, which received a 0.410 public score (rank 911/1984) and a 0.232 private score (rank 973/1984). The public score covered 2 groups of data (seen vs. unseen building blocks) for each protein, so 6 groups of data in total. The private score was expected to be lower because it tested the model on an additional group of data from a different experimental library. Interestingly, the top 5 teams on the public leaderboard all fell out of the top 5 on the private leaderboard, suggesting how difficult it is for a model to learn a generalized binding pattern. Those models likely overfit to the known library and generalized poorly to the new one.

We also explored a Random Forest model, a BERT model, a plain Neural Network, and a Graph Neural Network (GNN). We learned the importance of feature generation and engineering: the atom-level and bond-level featurization used by the GNN gave better results than Morgan fingerprints, and directly using SMILES strings in BERT gave the worst performance. Due to the limited time remaining, our own GNN model extracted only very basic atom and bond features (final feature dimension 22), compared to chemprop's >80 features; this could be the main reason chemprop outperformed our GNN model. However, the chemprop model was trained on only a subset of the complete training data because of out-of-memory issues, while our own GNN model could train on the complete dataset because we could easily handle the data in chunks. Even so, as we increased the size of chemprop's training set we saw steadily better performance. Therefore, a potentially better model would be a GNN that considers enough atom and bond features and trains on the complete training dataset; see the featurization sketch below.
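To illustrate the featurization gap described above, here is a minimal sketch in the spirit of our pipeline (not the actual competition code; the specific feature choices and the example SMILES are assumptions for illustration):

```python
# Sketch: Morgan fingerprint vs. basic atom/bond featurization (requires rdkit).
import numpy as np
from rdkit import Chem
from rdkit.Chem import AllChem

smiles = "CC(=O)Oc1ccccc1C(=O)O"  # aspirin, a stand-in example molecule
mol = Chem.MolFromSmiles(smiles)

# Whole-molecule Morgan fingerprint: a fixed-length bit vector that
# discards per-atom graph structure once hashed.
morgan = AllChem.GetMorganFingerprintAsBitVect(mol, radius=2, nBits=2048)

# Per-atom and per-bond features: small vectors attached to the molecular
# graph, which a GNN can exploit. Our model used only ~22 dimensions like
# these; chemprop uses >80.
def atom_features(atom):
    return [
        atom.GetAtomicNum(),
        atom.GetTotalDegree(),
        atom.GetFormalCharge(),
        int(atom.GetIsAromatic()),
        atom.GetTotalNumHs(),
    ]

def bond_features(bond):
    return [
        bond.GetBondTypeAsDouble(),
        int(bond.GetIsConjugated()),
        int(bond.IsInRing()),
    ]

atom_mat = np.array([atom_features(a) for a in mol.GetAtoms()], dtype=np.float32)
bond_mat = np.array([bond_features(b) for b in mol.GetBonds()], dtype=np.float32)
print(morgan.GetNumOnBits(), atom_mat.shape, bond_mat.shape)
```

The point of the comparison: the fingerprint collapses the molecule into one hashed vector, while the per-atom/per-bond matrices keep the graph structure the GNN learns from.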

Overall, we did a pretty great job in the limited time of this competition: we explored many different models and learned ways to process and train on a very large dataset. Moving forward, we should be more familiar and confident with similar challenges.
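For reference, a minimal sketch of the kind of chunked processing that let our GNN train on the complete dataset; the file name and column name here are placeholders, not necessarily the actual competition files:

```python
# Stream a large CSV in chunks so the full dataset never sits in memory.
import pandas as pd

n_rows = 0
for chunk in pd.read_csv("train.csv", chunksize=1_000_000):
    smiles = chunk["molecule_smiles"]  # placeholder column name
    # Featurize this chunk, then train incrementally or write the
    # features to disk before moving on to the next chunk.
    n_rows += len(chunk)
print(f"processed {n_rows} rows without loading the file at once")
```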

Cheers!! 🎉

kaichop commented Jul 10, 2024

Thank you for the summary. Cheers!!!
