Tensor size issue #1

arunraja-hub opened this issue Nov 28, 2024 · 11 comments


When I was just trying to run the training using python params_x1x3x4_diffusion_mosesaq_20240824 0, as suggested in the readme, I got the following error:

RuntimeError: Trying to resize storage that is not resizable

According to lucidrains/denoising-diffusion-pytorch#248 the solution is to change num_workers in the dataloader to 0 but that resulted in the following error:

RuntimeError: Sizes of tensors must match except in dimension 0. Expected size 1176 but got size 595 for tensor number 1 in the list.

Could you please provide some guidance on this?

keiradams commented Nov 28, 2024

Hi! I have not experienced this error, so I suspect it has something to do with our different training setups or package versions.

To help debug, can you try the following:

  • Make sure you can successfully run inference code provided by the RUNME_{}.ipynb notebooks.
  • Make sure you can call dataset[0] after initializing dataset = HeteroDataset(...).
  • Make sure you can call next(iter(train_loader)) after initializing train_loader = torch_geometric.loader.DataLoader(...), with batch_size = 1 and with batch_size > 1.
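
The checks in this list can be scripted so they all run in one go before training. This is only a sketch, not code from the repository: `smoke_test_loading` and `make_loader` are hypothetical names, and `make_loader` is assumed to wrap whatever loader construction the training script uses.

```python
from typing import Any, Callable, Iterable, Sequence


def smoke_test_loading(
    dataset: Sequence[Any],
    make_loader: Callable[[Sequence[Any], int], Iterable[Any]],
    batch_sizes: Sequence[int] = (1, 2),
) -> dict:
    """Try dataset[0] and one batch per batch size; record what fails.

    Returns a dict mapping a check name to "ok" or to the error message.
    """
    report = {}

    # Check 1: a single sample can be built at all.
    try:
        dataset[0]
        report["dataset[0]"] = "ok"
    except Exception as err:  # broad on purpose: this is a diagnostic
        report["dataset[0]"] = f"{type(err).__name__}: {err}"

    # Remaining checks: the first batch can be collated for each batch size.
    for batch_size in batch_sizes:
        name = f"batch_size={batch_size}"
        try:
            next(iter(make_loader(dataset, batch_size)))
            report[name] = "ok"
        except Exception as err:
            report[name] = f"{type(err).__name__}: {err}"

    return report
```

With PyG, `make_loader` could be something like `lambda ds, bs: torch_geometric.loader.DataLoader(ds, batch_size=bs, shuffle=False, num_workers=0)`. Note that this only draws the first batch for each batch size, so a failure tied to specific samples later in the dataset would not show up here.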

If all of that works, then I would guess it is related to an issue with DDPM in Pytorch-Lightning with your particular system set-up. Are you trying to train with 1 GPU? On a CPU? On multiple GPUs? The parameters in parameters/ specify 'num_gpus': 2 and 'multiprocessing_spawn': True. Either of those could be causing issues with your specific setup.

Also, does this error occur at the start of the training epochs? Or mid-way through training?

Additionally, make sure that the versions of your packages are the same as those listed in the README, particularly your Pytorch-Lightning, Pytorch, and PyG versions.

It would also help if you could provide the complete error traceback.

Hi @keiradams, thanks for your quick reply. I did not make any changes to the code in the repo. I am able to run the RUNME notebooks without issue in a new virtual environment I have set up. However, I ran into the following error, which I think might be due to the PyTorch Geometric version. I had to choose slightly different PyTorch and PyTorch Geometric versions from yours because my CUDA version is different.

Seed set to 0
Traceback (most recent call last):
  File "/mnt/data/slurm-storage/aruraj/opig/shepherd/", line 98, in <module>
    dataset = HeteroDataset(
TypeError: Can't instantiate abstract class HeteroDataset with abstract method get

Hi @arunraja-hub, sorry for the delay.

Can you try adding the line of code def get(self, k): return self.__getitem__(k) to the class HeteroDataset definition in your local clone?
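
To illustrate why that one-line shim removes the TypeError, here is a minimal sketch using Python's standard `abc` module. `BaseDataset`, `BrokenDataset`, and `FixedDataset` are hypothetical stand-ins, not the PyG or repository classes; the only assumption is that the base class declares `get` as an abstract method, as the error message suggests.

```python
from abc import ABC, abstractmethod


class BaseDataset(ABC):
    """Stand-in for a dataset base class that declares `get` abstract."""

    @abstractmethod
    def get(self, k):
        ...


class BrokenDataset(BaseDataset):
    """Only defines __getitem__, so instantiation raises TypeError."""

    def __init__(self, items):
        self.items = list(items)

    def __getitem__(self, k):
        return self.items[k]


class FixedDataset(BrokenDataset):
    """Same class with the suggested one-line shim added."""

    def get(self, k):
        return self.__getitem__(k)


try:
    BrokenDataset([10, 20])
except TypeError as err:
    print("broken:", err)

fixed = FixedDataset([10, 20])
print("fixed:", fixed.get(1))  # prints "fixed: 20"
```

The shim simply delegates `get` to the indexing logic the class already has, so behavior is unchanged; it only satisfies the abstract-method check at instantiation time.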

I've updated the file in this GitHub repository for your reference. PyTorch / PyG changed the function names between versions, which may be causing this issue.

Let me know if this solves your issues, or if there are other fixes that need to be implemented!

Hi @keiradams, this error has been resolved now, but I am still facing the original tensor size issue. Here is the complete error traceback. I had to change the lightning, torch, and PyG versions to fit my CUDA version (11.5).

Seed set to 0
/slurm-storage/aruraj/.conda/envs/airs/lib/python3.11/site-packages/lightning_fabric/plugins/environments/ The `srun` command is available on your system but is not used. HINT: If your intention is to run Lightning on SLURM, prepend your python command with `srun` like so: srun python params_x1x3x4_diffusion_mosesaq_20240824 0 ...
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
/slurm-storage/aruraj/.conda/envs/airs/lib/python3.11/site-packages/torch/ UserWarning: 'has_cuda' is deprecated, please use 'torch.backends.cuda.is_built()'
/slurm-storage/aruraj/.conda/envs/airs/lib/python3.11/site-packages/torch/ UserWarning: 'has_cudnn' is deprecated, please use 'torch.backends.cudnn.is_available()'
/slurm-storage/aruraj/.conda/envs/airs/lib/python3.11/site-packages/torch/ UserWarning: 'has_mps' is deprecated, please use 'torch.backends.mps.is_built()'
/slurm-storage/aruraj/.conda/envs/airs/lib/python3.11/site-packages/torch/ UserWarning: 'has_mkldnn' is deprecated, please use 'torch.backends.mkldnn.is_available()'
/slurm-storage/aruraj/.conda/envs/airs/lib/python3.11/site-packages/torch/jit/ UserWarning: The TorchScript type system doesn't support instance-level annotations on empty non-base types in `__init__`. Instead, either 1) use a type annotation in the class body, or 2) wrap the type in `torch.jit.Attribute`.
beginning to train...
You are using a CUDA device ('NVIDIA A10') that has Tensor Cores. To properly utilize them, you should set `torch.set_float32_matmul_precision('medium' | 'high')` which will trade-off precision for performance. For more details, read

  | Name  | Type  | Params
0 | model | Model | 6.0 M 
6.0 M     Trainable params
0         Non-trainable params
6.0 M     Total params
24.042    Total estimated model params size (MB)
/slurm-storage/aruraj/.conda/envs/airs/lib/python3.11/site-packages/pytorch_lightning/trainer/connectors/ The 'train_dataloader' does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` to `num_workers=63` in the `DataLoader` to improve performance.
/slurm-storage/aruraj/.conda/envs/airs/lib/python3.11/site-packages/pytorch_lightning/loops/ The number of training batches (143) is smaller than the logging interval Trainer(log_every_n_steps=1000). Set a lower value for log_every_n_steps if you want to see logs for the training epoch.
Epoch 0:   0%|                                                                                                                                    | 0/143 [00:00<?, ?it/s]Traceback (most recent call last):
  File "/mnt/data/slurm-storage/aruraj/opig/shepherd/", line 231, in <module>, train_loader, ckpt_path = ckpt_path)
  File "/slurm-storage/aruraj/.conda/envs/airs/lib/python3.11/site-packages/pytorch_lightning/trainer/", line 545, in fit
  File "/slurm-storage/aruraj/.conda/envs/airs/lib/python3.11/site-packages/pytorch_lightning/trainer/", line 44, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
  File "/slurm-storage/aruraj/.conda/envs/airs/lib/python3.11/site-packages/pytorch_lightning/trainer/", line 581, in _fit_impl
    self._run(model, ckpt_path=ckpt_path)
  File "/slurm-storage/aruraj/.conda/envs/airs/lib/python3.11/site-packages/pytorch_lightning/trainer/", line 990, in _run
    results = self._run_stage()
  File "/slurm-storage/aruraj/.conda/envs/airs/lib/python3.11/site-packages/pytorch_lightning/trainer/", line 1036, in _run_stage
  File "/slurm-storage/aruraj/.conda/envs/airs/lib/python3.11/site-packages/pytorch_lightning/loops/", line 202, in run
  File "/slurm-storage/aruraj/.conda/envs/airs/lib/python3.11/site-packages/pytorch_lightning/loops/", line 359, in advance
  File "/slurm-storage/aruraj/.conda/envs/airs/lib/python3.11/site-packages/pytorch_lightning/loops/", line 136, in run
  File "/slurm-storage/aruraj/.conda/envs/airs/lib/python3.11/site-packages/pytorch_lightning/loops/", line 202, in advance
    batch, _, __ = next(data_fetcher)
  File "/slurm-storage/aruraj/.conda/envs/airs/lib/python3.11/site-packages/pytorch_lightning/loops/", line 127, in __next__
    batch = super().__next__()
  File "/slurm-storage/aruraj/.conda/envs/airs/lib/python3.11/site-packages/pytorch_lightning/loops/", line 56, in __next__
    batch = next(self.iterator)
  File "/slurm-storage/aruraj/.conda/envs/airs/lib/python3.11/site-packages/pytorch_lightning/utilities/", line 326, in __next__
    out = next(self._iterator)
  File "/slurm-storage/aruraj/.conda/envs/airs/lib/python3.11/site-packages/pytorch_lightning/utilities/", line 74, in __next__
    out[i] = next(self.iterators[i])
  File "/slurm-storage/aruraj/.conda/envs/airs/lib/python3.11/site-packages/torch/utils/data/", line 630, in __next__
    data = self._next_data()
  File "/slurm-storage/aruraj/.conda/envs/airs/lib/python3.11/site-packages/torch/utils/data/", line 674, in _next_data
    data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
  File "/slurm-storage/aruraj/.conda/envs/airs/lib/python3.11/site-packages/torch/utils/data/_utils/", line 54, in fetch
    return self.collate_fn(data)
  File "/slurm-storage/aruraj/.conda/envs/airs/lib/python3.11/site-packages/torch_geometric/loader/", line 55, in collate_fn
    return self(batch)
  File "/slurm-storage/aruraj/.conda/envs/airs/lib/python3.11/site-packages/torch_geometric/loader/", line 28, in __call__
    return Batch.from_data_list(
  File "/slurm-storage/aruraj/.conda/envs/airs/lib/python3.11/site-packages/torch_geometric/data/", line 93, in from_data_list
    batch, slice_dict, inc_dict = collate(
  File "/slurm-storage/aruraj/.conda/envs/airs/lib/python3.11/site-packages/torch_geometric/data/", line 92, in collate
    value, slices, incs = _collate(attr, values, data_list, stores,
  File "/slurm-storage/aruraj/.conda/envs/airs/lib/python3.11/site-packages/torch_geometric/data/", line 177, in _collate
    value =, dim=cat_dim or 0, out=out)
RuntimeError: Sizes of tensors must match except in dimension 0. Expected size 1176 but got size 595 for tensor number 1 in the list.

keiradams commented Dec 10, 2024

@arunraja-hub Can you confirm that these steps work prior to calling

  • make sure you can call dataset[0] after initializing dataset = HeteroDataset(...)
  • make sure you can call next(iter(train_loader)) after initializing train_loader = torch_geometric.loader.DataLoader(...), with batch_size = 1.
  • make sure you can call next(iter(train_loader)) after initializing train_loader = torch_geometric.loader.DataLoader(...), with batch_size > 1


@keiradams I can call dataset[0] and next(iter(train_loader)) when batch_size > 0, but, as expected, for batch_size = 0 I got the following error:

Traceback (most recent call last):
  File "/mnt/data/slurm-storage/aruraj/opig/shepherd/", line 159, in <module>
    train_loader = torch_geometric.loader.DataLoader(
  File "/slurm-storage/aruraj/.conda/envs/airs/lib/python3.11/site-packages/torch_geometric/loader/", line 98, in __init__
  File "/slurm-storage/aruraj/.conda/envs/airs/lib/python3.11/site-packages/torch/utils/data/", line 355, in __init__
    batch_sampler = BatchSampler(sampler, batch_size, drop_last)
  File "/slurm-storage/aruraj/.conda/envs/airs/lib/python3.11/site-packages/torch/utils/data/", line 263, in __init__
    raise ValueError(f"batch_size should be a positive integer value, but got batch_size={batch_size}")
ValueError: batch_size should be a positive integer value, but got batch_size=0

keiradams commented Dec 10, 2024

sorry, I meant batch_size = 1 and batch_size > 1


Yes batch_size = 1 and batch_size > 1 work for me

keiradams commented Dec 10, 2024

@arunraja-hub this error is quite odd to me, then. Can you train without an issue on a CPU with num_workers = 0? On a CPU with num_workers > 1? On 1 GPU with num_workers = 0 and num_workers > 1 ?

You will have to change the parameters in trainer = pl.Trainer() to make these changes.
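
The combinations asked about above can be enumerated so that each one is tried in turn, simplest first. This is a sketch under assumptions: `debug_configs` is a hypothetical helper, and the keys `accelerator` and `devices` assume a Lightning version whose `Trainer` accepts those arguments. How they map onto the repository's 'num_gpus' and 'multiprocessing_spawn' parameters is not shown here.

```python
from itertools import product


def debug_configs(accelerators=("cpu", "gpu"), worker_counts=(0, 2)):
    """Enumerate single-device runs to try, simplest first."""
    configs = []
    for accelerator, num_workers in product(accelerators, worker_counts):
        configs.append(
            {
                # Keyword arguments intended for the trainer.
                "trainer": {"accelerator": accelerator, "devices": 1},
                # Keyword arguments intended for the data loader.
                "loader": {"num_workers": num_workers},
            }
        )
    return configs


for cfg in debug_configs():
    print(cfg)
```

Each entry could then be applied as `pl.Trainer(**cfg["trainer"])` and `torch_geometric.loader.DataLoader(dataset, **cfg["loader"])`. Running the four cases separately should show whether the failure depends on the device, on worker processes, or on neither.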

@keiradams The training seems to work when batch_size = 1. The tensor size issue might be caused by batching graphs of various sizes, although PyG should take care of this, since it creates a batch-level adjacency matrix when dealing with a batch of graphs of varying sizes.

@arunraja-hub If you can sample from the dataloader when batch_size > 1 (outside of training) by calling next(iter(train_loader)), then the issue shouldn't be with batching through PyG.

Can you confirm again whether you have tested this?
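
Since next(iter(train_loader)) exercises only one batch, a mismatch tied to particular samples could go unnoticed. One way to test the batching hypothesis across the whole dataset is to compare attribute shapes sample by sample. The sketch below is hypothetical: `find_uncollatable_attrs` is not part of PyG or the repository, and it assumes each sample has already been flattened into a mapping from attribute name to an object with a `.shape`.

```python
from collections import defaultdict
from typing import Dict, Iterable, Mapping, Set, Tuple


def find_uncollatable_attrs(
    samples: Iterable[Mapping[str, object]], cat_dim: int = 0
) -> Dict[str, Set[Tuple[int, ...]]]:
    """Return attributes whose shapes differ outside `cat_dim`.

    Each sample maps an attribute name to any object with a `.shape`
    (for example a tensor). Concatenation along `cat_dim` only works
    if all other dimensions agree across samples.
    """
    seen = defaultdict(set)
    for sample in samples:
        for name, value in sample.items():
            shape = tuple(getattr(value, "shape", ()))
            # Drop the concatenation dimension; the rest must match.
            rest = shape[:cat_dim] + shape[cat_dim + 1:]
            seen[name].add(rest)
    return {name: shapes for name, shapes in seen.items() if len(shapes) > 1}
```

Any attribute this returns would fail with "Sizes of tensors must match except in dimension 0" as soon as two differing samples land in the same batch, which would also explain why batch_size = 1 trains without error.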

