Multi-GPU Training for Large Datasets #188
Replies: 2 comments · 5 replies
-
10,000 NEB calculations shouldn't pose a problem. We have trained potentials on 180,000 data points with no issues. There are already example files showing how to do multi-GPU training with pytorch-lightning and matgl; please review those. Yes, creating the graphs first before actually running the training would be a good idea, though the latest version of matgl should already not store the structures after conversion.
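For reference, here is a minimal sketch of what such a Lightning multi-GPU run could look like, assuming matgl's `MGLDataset`, `MGLDataLoader`, `M3GNet`, and `PotentialLightningModule` APIs (names and required flags vary between matgl versions; `collate_fn_pes` was called `collate_fn_efs` in older releases, and `load_my_neb_data()` is a hypothetical stand-in for your own parsing code):

```python
import pytorch_lightning as pl
from dgl.data.utils import split_dataset

from matgl.ext.pymatgen import Structure2Graph, get_element_list
from matgl.graph.data import MGLDataset, MGLDataLoader, collate_fn_pes
from matgl.models import M3GNet
from matgl.utils.training import PotentialLightningModule

# Hypothetical helper: returns a list of pymatgen Structures from the NEB
# runs plus matching energy and force labels.
structures, energies, forces = load_my_neb_data()

element_types = get_element_list(structures)
converter = Structure2Graph(element_types=element_types, cutoff=5.0)

# Graphs are built up front, so the pymatgen structures need not stay in
# memory during training. Depending on your matgl version, extra options
# (e.g. line-graph flags) may be required for M3GNet.
dataset = MGLDataset(
    structures=structures,
    converter=converter,
    labels={"energies": energies, "forces": forces},
    threebody_cutoff=4.0,
)
train_data, val_data, test_data = split_dataset(
    dataset, frac_list=[0.9, 0.05, 0.05], shuffle=True, random_state=42
)
train_loader, val_loader, test_loader = MGLDataLoader(
    train_data=train_data,
    val_data=val_data,
    test_data=test_data,
    collate_fn=collate_fn_pes,
    batch_size=32,
    num_workers=1,
)

model = M3GNet(element_types=element_types, is_intensive=False)
lit_model = PotentialLightningModule(model=model, lr=1e-3)

# Lightning spawns one process per GPU and shards the batches via DDP.
trainer = pl.Trainer(max_epochs=100, accelerator="gpu", devices=4, strategy="ddp")
trainer.fit(lit_model, train_dataloaders=train_loader, val_dataloaders=val_loader)
```

With `strategy="ddp"` each of the four processes sees a quarter of the batches per epoch, so the effective batch size is `4 * batch_size`; you may want to scale the learning rate accordingly.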
-
Hi! You mention example files for multi-GPU training, but I can't find any in the repository, as @mstapelberg also said above. Do you possibly have these examples locally or on a private branch that could be shared?
-
Hi there,
I am looking to train a potential on around 10,000 NEB VASP simulations, and I've noticed that running on a single GPU simply doesn't cut it.
I'm fairly new to DGL and Torch, so I was wondering if anyone had suggestions on how to set up multi-GPU training with matgl?
My current idea is to use the matgl functions to create the graphs first (to save memory as well) and then use DDP in torch to spread the dataset over four GPUs, following this example: https://huggingface.co/blog/pytorch-ddp-accelerate-transformers
Is there an existing example that does this? If not, I'm happy to try myself and post a working example, roughly along the lines of the sketch below.
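A rough sketch of that idea, assuming the graphs have already been converted once with matgl's `Structure2Graph` and saved via `dgl.save_graphs` (the file name `neb_graphs.bin`, the label key `"energies"`, and `build_model()` are hypothetical placeholders for your own data and model), launched with `torchrun --nproc_per_node=4 train.py`:

```python
import os

import dgl
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler

# Load the pre-built graphs; dgl.load_graphs returns (graph_list, label_dict).
# "energies" is an assumed key from whenever the graphs were saved.
graphs, label_dict = dgl.load_graphs("neb_graphs.bin")
energies = label_dict["energies"]
dataset = list(zip(graphs, energies))

def collate(batch):
    # Merge individual DGL graphs into one batched graph per minibatch.
    g, e = zip(*batch)
    return dgl.batch(g), torch.stack(e)

# torchrun sets RANK, WORLD_SIZE, and LOCAL_RANK in the environment.
dist.init_process_group("nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

# DistributedSampler shards the dataset across the four ranks.
sampler = DistributedSampler(dataset)
loader = DataLoader(dataset, batch_size=32, sampler=sampler, collate_fn=collate)

model = build_model().to(local_rank)  # hypothetical: e.g. a matgl M3GNet
model = DDP(model, device_ids=[local_rank])

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = torch.nn.MSELoss()

for epoch in range(100):
    sampler.set_epoch(epoch)  # reshuffle the shards each epoch
    for batched_graph, target in loader:
        pred = model(batched_graph.to(local_rank))
        loss = loss_fn(pred, target.to(local_rank))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

dist.destroy_process_group()
```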
Thanks,
Myles