Tensor size issue #1
Hi! I have not experienced this error, so I suspect it has something to do with our different training setups or package versions. To help debug, can you try the following:
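The suggested checks are not quoted in this scrape, but the follow-up comments (`dataset[0]`, `next(iter(train_loader))`, varying `batch_size`) indicate sanity checks along these lines. A self-contained sketch, using a stand-in `TensorDataset` in place of the repo's actual PyG dataset:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Stand-in dataset so the snippet is self-contained; in the thread this
# would be the repo's own dataset and train_loader.
dataset = TensorDataset(torch.arange(10, dtype=torch.float32).unsqueeze(1))

# 1. Index a single example directly from the dataset.
sample = dataset[0]

# 2. Draw one batch from the dataloader outside of training (batch_size = 1).
batch1 = next(iter(DataLoader(dataset, batch_size=1, num_workers=0)))[0]

# 3. Repeat with batch_size > 1 to exercise batch collation.
batch4 = next(iter(DataLoader(dataset, batch_size=4, num_workers=0)))[0]

print(batch1.shape, batch4.shape)
```

If step 1 fails, the problem is in the dataset itself; if step 3 fails while steps 1–2 succeed, it is in the collation of variable-sized examples.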
If all of that works, then I would guess it is related to an issue with DDPM in PyTorch Lightning on your particular system setup. Are you trying to train with 1 GPU? On a CPU? On multiple GPUs? (The parameters in `pl.Trainer()` control these settings.) Also, does this error occur at the start of the training epochs, or midway through training? Additionally, make sure that the versions of your packages are the same as those listed in the README, particularly your PyTorch Lightning, PyTorch, and PyG versions. It would also help if you could provide the complete error traceback.
Hi @keiradams, thanks for your quick reply. I did not make any changes to the code in the repo. I am able to run the RUNME notebooks without issue in a new virtual environment I have set up. However, for train.py I ran into the following error, which I think might be due to the PyTorch Geometric version. I had to choose slightly different PyTorch and PyTorch Geometric versions from yours, as my CUDA version is different.
Here is the YAML of my local virtual env:
Hi @arunraja-hub, sorry for the delay. Can you try adding the indicated line of code? I've updated the file on this GitHub for your reference. PyTorch / PyG changed the function names between versions, which may be causing this issue. Let me know if this solves your issue, or if there are other fixes that need to be implemented!
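The exact renamed function is not shown in this scrape, but the general PyTorch/PyG version-rename problem is usually handled with a compatibility shim. A sketch of the pattern, using `torch.solve` vs. `torch.linalg.solve` purely as an illustration (this is not the repo's actual fix):

```python
import torch

# Illustrative compatibility shim: newer PyTorch moved `torch.solve` to
# `torch.linalg.solve` (and swapped the argument order). The function
# renamed between the PyG versions discussed in this thread is not shown,
# so this only demonstrates the pattern.
if hasattr(torch.linalg, "solve"):
    def solve(A, b):
        return torch.linalg.solve(A, b)
else:
    def solve(A, b):
        return torch.solve(b, A).solution  # legacy API took (B, A)

A = torch.tensor([[2.0, 0.0], [0.0, 4.0]])
b = torch.tensor([[2.0], [8.0]])
x = solve(A, b)
print(x.tolist())
```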
Hi @keiradams, this error has been resolved now, but I am still facing the same original tensor size issue. Here is the complete error traceback. I have had to change my Lightning, PyTorch, and PyG versions to fit my CUDA version (11.5).
@arunraja-hub Can you confirm that these steps work prior to calling train.py?
@keiradams I can call `dataset[0]` and `next(iter(train_loader))` when `batch_size > 0`, but as expected, for `batch_size = 0` I got the following error:
Sorry, I meant `batch_size = 1` and `batch_size > 1`.
Yes, `batch_size = 1` and `batch_size > 1` both work for me.
@arunraja-hub This error is quite odd to me, then. Can you train without issue on a CPU with `num_workers = 0`? On a CPU with `num_workers > 1`? On 1 GPU with `num_workers = 0` and with `num_workers > 1`? You will have to change the parameters in `trainer = pl.Trainer()` to make these changes.
@keiradams The training seems to work when `batch_size = 1`. The tensor size issue might be occurring due to the batching of graphs of various sizes, though PyG should have taken care of this, since it creates a batch-level adjacency matrix when dealing with a batch of graphs of varying sizes (https://pytorch-geometric.readthedocs.io/en/2.6.1/notes/batching.html).
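For context, the batch-level adjacency structure PyG builds can be sketched by hand (this mimics what `torch_geometric.data.Batch` does internally, using only plain torch so it is self-contained; it is not PyG's actual code): node features are concatenated, edge indices are offset by the running node count, and a `batch` vector maps each node back to its graph.

```python
import torch

# Two toy graphs of different sizes.
g1_x = torch.randn(3, 4)                   # graph 1: 3 nodes, 4 features
g1_edges = torch.tensor([[0, 1], [1, 2]])  # edge_index, shape [2, E1]
g2_x = torch.randn(2, 4)                   # graph 2: 2 nodes
g2_edges = torch.tensor([[0], [1]])        # edge_index, shape [2, E2]

# Concatenate node features; offset graph 2's node indices by graph 1's size
# so the result is one big block-diagonal graph.
x = torch.cat([g1_x, g2_x], dim=0)                             # [5, 4]
edge_index = torch.cat([g1_edges, g2_edges + g1_x.size(0)], dim=1)
batch = torch.tensor([0, 0, 0, 1, 1])      # node -> graph assignment

print(x.shape, edge_index.tolist(), batch.tolist())
```

Because the merged graph is block-diagonal, message passing never mixes nodes from different graphs, which is why varying graph sizes should not by themselves cause a tensor size mismatch.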
@arunraja-hub If you can sample from the dataloader when `batch_size > 1` (outside of training) by calling `next(iter(train_loader))`, then PyG's batching is likely not the problem. Can you confirm again whether you have tested this?
When I was trying to run the training using `python train.py params_x1x3x4_diffusion_mosesaq_20240824 0`, as suggested in the README, I got the following error:

According to lucidrains/denoising-diffusion-pytorch#248, the solution is to change `num_workers` in the dataloader to 0, but that resulted in the following error:

Could you please provide some guidance on this?
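The workaround from the linked issue amounts to disabling worker subprocesses so data is loaded in the main process. A minimal sketch with a stand-in dataset (the repo's actual dataloader construction is not shown here):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Stand-in for the repo's dataset; the workaround from
# lucidrains/denoising-diffusion-pytorch#248 is just num_workers=0,
# which avoids multiprocessing workers entirely.
dataset = TensorDataset(torch.arange(8, dtype=torch.float32))

loader = DataLoader(dataset, batch_size=4, num_workers=0, shuffle=False)
batch = next(iter(loader))[0]
print(batch.tolist())  # first batch of the un-shuffled dataset
```

If the error changes (rather than disappears) with `num_workers = 0`, that usually means the worker processes were masking the real traceback, and the new error is the one worth debugging.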