Commit be5642f (1 parent: 95cee8d), showing 6 changed files with 606 additions and 3 deletions:

* Add molformer's trainer skeleton
* Clean up parameters
* Enable GPU inference
* Add molformer to training pipelines
* Proper handling of list parameters
* Inherit from pl training pipeline
* Rename arguments
* Copy measure_name from dataset to model args
* Fix alignments
* Add missing parameters
* Add aug in dataset args
* Add pretrained path for regression
* Fix parameters
* Apply style
* Fix parameter types
* Update parameter's metadata
* Add missing parameters
* Add examples for molformer
* Update parameter's metadata
# Molformer

A simple example of how to train or finetune the Molformer model.

Make sure to activate the conda environment:

```console
conda activate gt4sd
```

## Pretraining

An example of Molformer pretraining. The `data_path` parameter contains the path where one or both of the `pubchem` and `ZINC` directories are located. A link to the dataset and further details about it can be found in the [original Molformer repo](https://github.com/IBM/molformer).
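Assuming both corpora have been downloaded, the pretraining data directory might be laid out as follows (the directory names come from the command below; the files inside each corpus are not shown):

```
molformer/data/pretrained/
├── pubchem/
└── ZINC/
```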

```console
gt4sd-trainer --training_pipeline_name molformer \
  --type pretraining \
  --batch_size 1200 \
  --n_head 12 \
  --n_layer 12 \
  --n_embd 768 \
  --d_dropout 0.1 \
  --lr_start 3e-5 \
  --num_workers 8 \
  --max_epochs 4 \
  --num_feats 32 \
  --grad_acc 1 \
  --data_path molformer/data/pretrained \
  --model_arch BERT_both_rotate
```
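As a quick sanity check on the architecture hyperparameters above (`n_layer 12`, `n_embd 768`, i.e. a BERT-base-like encoder), the following back-of-the-envelope sketch estimates the encoder's weight count. This is an illustration only, not GT4SD's or Molformer's actual code: it counts just the attention and feed-forward weight matrices, ignores embeddings, biases, and layer norms, and assumes the conventional `4 * n_embd` feed-forward width.

```python
def approx_encoder_params(n_layer: int, n_embd: int, ffn_mult: int = 4) -> int:
    """Rough weight count for a standard transformer encoder stack."""
    attention = 4 * n_embd * n_embd                 # Q, K, V and output projections
    feed_forward = 2 * ffn_mult * n_embd * n_embd   # up- and down-projection
    return n_layer * (attention + feed_forward)

print(approx_encoder_params(12, 768))  # 84934656, i.e. roughly 85M weights
```

This matches the expectation that the configured model is in the BERT-base size class.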

## Finetuning

### Classification

An example of classification finetuning using the HIV dataset. A link to the dataset can be found in the [original Molformer repo](https://github.com/IBM/molformer).

```console
gt4sd-trainer --training_pipeline_name molformer \
  --type classification \
  --batch_size 128 \
  --n_head 12 \
  --n_layer 12 \
  --n_embd 768 \
  --d_dropout 0.1 \
  --dropout 0.1 \
  --lr_start 3e-5 \
  --num_workers 8 \
  --max_epochs 500 \
  --num_feats 32 \
  --every_n_epochs 10 \
  --data_root molformer/data/hiv \
  --pretrained_path pretrained_molformer/checkpoints/N-Step-Checkpoint_3_30000.ckpt \
  --dataset_name hiv \
  --measure_name HIV_active \
  --dims 768 768 768 1 \
  --num_classes 2 \
  --save_dir test_hiv
```
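The `--dims 768 768 768 1` flag describes the layer sizes of the classification head. As a hedged sketch of the convention (not GT4SD's actual implementation), consecutive sizes pair up into the `(in_features, out_features)` shapes of fully connected layers:

```python
def head_layer_shapes(dims: list[int]) -> list[tuple[int, int]]:
    """Pair consecutive sizes into linear-layer (in, out) shapes."""
    return list(zip(dims[:-1], dims[1:]))

print(head_layer_shapes([768, 768, 768, 1]))
# [(768, 768), (768, 768), (768, 1)]
```

So the configured head is two 768-to-768 layers followed by a 768-to-1 output layer.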

### Multiclass classification

An example of multiclass finetuning using the ClinTox dataset. A link to the dataset can be found in the [original Molformer repo](https://github.com/IBM/molformer).

```console
gt4sd-trainer --training_pipeline_name molformer \
  --type multitask_classification \
  --batch_size 128 \
  --n_head 12 \
  --n_layer 12 \
  --n_embd 768 \
  --d_dropout 0.1 \
  --dropout 0.1 \
  --lr_start 3e-5 \
  --num_workers 8 \
  --max_epochs 500 \
  --num_feats 32 \
  --every_n_epochs 10 \
  --data_root molformer/data/clintox \
  --pretrained_path pretrained_molformer/checkpoints/N-Step-Checkpoint_3_30000.ckpt \
  --dataset_name clintox \
  --dims 768 768 768 1 \
  --measure_names FDA_APPROVED CT_TOX \
  --save_dir test_clintox
```
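Passing several `--measure_names` means each named column of the ClinTox table becomes one binary task, so every molecule carries a vector of labels, one per task. A minimal sketch of that idea, with a made-up record (this is an illustration of the flag's meaning, not GT4SD's data loader):

```python
measure_names = ["FDA_APPROVED", "CT_TOX"]

# Hypothetical ClinTox-style record: a SMILES string plus one binary column per task.
row = {"smiles": "CCO", "FDA_APPROVED": 1, "CT_TOX": 0}

labels = [row[name] for name in measure_names]
print(labels)  # [1, 0]
```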

### Regression

An example of regression finetuning using the QM9 dataset. A link to the dataset can be found in the [original Molformer repo](https://github.com/IBM/molformer).

```console
gt4sd-trainer --training_pipeline_name molformer \
  --type regression \
  --n_batch 128 \
  --n_head 12 \
  --n_layer 12 \
  --n_embd 768 \
  --d_dropout 0.1 \
  --dropout 0.1 \
  --lr_start 3e-5 \
  --n_workers 8 \
  --max_epochs 500 \
  --num_feats 32 \
  --every_n_epochs 10 \
  --data_root molformer/data/qm9 \
  --pretrained_path pretrained_molformer/checkpoints/N-Step-Checkpoint_3_30000.ckpt \
  --dataset_name qm9 \
  --measure_name alpha \
  --dims 768 768 768 1 \
  --save_dir test_alpha
```
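A regression finetune like this one, predicting the QM9 `alpha` target, is typically judged by mean absolute error between predictions and reference values. A self-contained sketch of that metric, with illustrative numbers rather than real model outputs:

```python
def mean_absolute_error(preds: list[float], targets: list[float]) -> float:
    """Average absolute deviation between predictions and references."""
    return sum(abs(p - t) for p, t in zip(preds, targets)) / len(preds)

# Made-up alpha predictions vs reference values for three molecules.
preds = [75.2, 60.1, 82.4]
targets = [74.8, 61.0, 80.9]

print(round(mean_absolute_error(preds, targets), 3))  # 0.933
```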
24 changes: 24 additions & 0 deletions in `src/gt4sd/training_pipelines/pytorch_lightning/molformer/__init__.py`:
```python
#
# MIT License
#
# Copyright (c) 2023 GT4SD team
#
# Permission is hereby granted, free of charge, to any person obtaining a copy
# of this software and associated documentation files (the "Software"), to deal
# in the Software without restriction, including without limitation the rights
# to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
# copies of the Software, and to permit persons to whom the Software is
# furnished to do so, subject to the following conditions:
#
# The above copyright notice and this permission notice shall be included in all
# copies or substantial portions of the Software.
#
# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
# AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
# SOFTWARE.
#
"""Molformer training pipeline initialization."""
```