Training Details of DiffDock-L #261

Open
ShuoAndy opened this issue Oct 24, 2024 · 1 comment

Comments

@ShuoAndy

Hello! In the original DiffDock paper you state, "We trained our final score model on four 48GB RTX A6000 GPUs for 850 epochs (around 18 days)." The DiffDock-L paper, however, reports neither the GPUs used nor the training time. I would like to ask about the training of the workdir/v1.1/score_model you provided: how many GPUs were used, and how many days did training take? If possible, could you also share the per-epoch training time? On my side, a single epoch on the PDBBind dataset alone takes at least one and a half hours. Is that within expectations?
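For reference, the figures quoted above can be turned into rough per-epoch numbers. This is a back-of-envelope sketch that assumes "around 18 days" means exactly 18 days of wall-clock time and that all four GPUs were busy throughout. Note that these figures describe the original DiffDock score model, not DiffDock-L, and the thread does not say what hardware produced the 1.5 h/epoch figure, so the comparison is only indicative.

```python
# Back-of-envelope per-epoch timing from the figures quoted in this thread.
# Assumption: "around 18 days" is treated as exactly 18 days of wall-clock time.

PAPER_EPOCHS = 850   # original DiffDock score model (not DiffDock-L)
PAPER_DAYS = 18      # "around 18 days"
PAPER_GPUS = 4       # four 48GB RTX A6000 GPUs


def minutes_per_epoch(total_days: float, epochs: int) -> float:
    """Average wall-clock minutes per epoch."""
    return total_days * 24 * 60 / epochs


def gpu_hours_per_epoch(total_days: float, epochs: int, gpus: int) -> float:
    """Average GPU-hours per epoch, assuming every GPU was busy the whole time."""
    return total_days * 24 * gpus / epochs


paper_min = minutes_per_epoch(PAPER_DAYS, PAPER_EPOCHS)
paper_gpu_h = gpu_hours_per_epoch(PAPER_DAYS, PAPER_EPOCHS, PAPER_GPUS)
reported_min = 1.5 * 60  # "at least one and a half hours" per epoch, PDBBind only

print(f"paper: {paper_min:.1f} min/epoch wall clock, {paper_gpu_h:.2f} GPU-h/epoch")
print(f"reported: {reported_min:.0f} min/epoch "
      f"({reported_min / paper_min:.1f}x the paper's per-epoch time)")
```

Under these assumptions the paper's run averages roughly 30 minutes of wall-clock time (about 2 GPU-hours) per epoch, so 1.5 h per epoch is within a small factor of it; whether it is expected depends on GPU count and data-loading setup, which the thread does not specify.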

@PhillipLo

PhillipLo commented Dec 2, 2024

I'm running into a related issue. I tried training from scratch on PDBBind with the following command:

python -m train --run_name _debug --test_sigma_intervals --pdbbind_esm_embeddings_path data/esm2_embeddings.pt --log_dir workdir --lr 1e-3 --tr_sigma_min 0.1 --tr_sigma_max 19 --rot_sigma_min 0.03 --rot_sigma_max 1.55 --batch_size 16 --ns 48 --nv 10 --num_conv_layers 6 --dynamic_max_cross --scheduler plateau --scale_by_sigma --dropout 0.1 --sampling_alpha 1 --sampling_beta 1 --remove_hs --c_alpha_max_neighbors 24 --receptor_radius 15 --num_dataloader_workers 1 --cudnn_benchmark --val_inference_freq 5 --num_inference_complexes 500 --use_ema --distance_embed_dim 64 --cross_distance_embed_dim 64 --sigma_embed_dim 64 --scheduler_patience 30 --n_epochs 5

I am running the most recent commit (b4704d9 on the main branch). Training 5 epochs took only 90 seconds, which seems implausibly fast to me.
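To put the 90-second figure in context, it can be compared with the per-epoch times quoted earlier in the thread. This is a rough sketch that assumes the paper's "around 18 days" is exactly 18 days and takes the first comment's 1.5 h/epoch at face value; the hardware behind each number differs and is not fully stated.

```python
# Compare the reported 5-epoch run with the per-epoch times quoted earlier in the thread.
RUN_SECONDS = 90   # "90 seconds to train 5 epochs"
RUN_EPOCHS = 5

FIRST_COMMENT_SECONDS = 1.5 * 3600      # ~1.5 h per epoch on PDBBind (first comment)
PAPER_SECONDS = 18 * 24 * 3600 / 850    # original DiffDock: ~18 days over 850 epochs

seconds_per_epoch = RUN_SECONDS / RUN_EPOCHS

print(f"{seconds_per_epoch:.0f} s/epoch")
print(f"{FIRST_COMMENT_SECONDS / seconds_per_epoch:.0f}x faster than ~1.5 h/epoch")
print(f"{PAPER_SECONDS / seconds_per_epoch:.0f}x faster than the paper's average epoch")
```

That works out to about 18 s per epoch, roughly 300 times faster than the first comment's figure and about 100 times faster than the paper's average, which is why the number looks suspicious; the thread itself does not establish the cause.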

It would be helpful if the README could include details on how to retrain the models from scratch to replicate the paper.
