DeepSpeedPlugin with activation checkpoint fails #9144
-
Hi there,

Minimal code to reproduce:

```python
import os

import deepspeed
import pytorch_lightning as pl
import torch
from deepspeed.ops.adam import FusedAdam
from pytorch_lightning.plugins import DeepSpeedPlugin
from pytorch_lightning.utilities.types import STEP_OUTPUT
from torch import nn
from torch.utils.data import DataLoader, RandomSampler


class PlModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.model = nn.Linear(1, 1)

    def forward(self, batch):
        return self.model(batch)

    def training_step(self, batch, batch_idx) -> STEP_OUTPUT:
        res = deepspeed.checkpointing.checkpoint(self.model, batch)
        return nn.MSELoss()(res, torch.zeros_like(res, device=res.device))

    def configure_optimizers(self):
        return FusedAdam(self.parameters(), lr=0.1)


if __name__ == '__main__':
    trainer = pl.Trainer(gpus=-1, precision=16, plugins=DeepSpeedPlugin(stage=3, partition_activations=True))
    model = PlModel()
    dataset = torch.rand(100, 1)
    dl = DataLoader(dataset, batch_size=1, num_workers=os.cpu_count(),
                    sampler=RandomSampler(dataset))
    trainer.fit(model, dl)
```

pytorch-lightning version: 1.3.3
-
Thanks @nachshonc! I've managed to reproduce the same case without Deepspeed, using `torch.utils.checkpoint` and our bug report model:

```python
import torch
import torch.utils.checkpoint
from pytorch_lightning import LightningModule, Trainer
from torch.utils.data import DataLoader, Dataset


class RandomDataset(Dataset):
    def __init__(self, size, length):
        self.len = length
        self.data = torch.randn(length, size)

    def __getitem__(self, index):
        return self.data[index]

    def __len__(self):
        return self.len


class BoringModel(LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def forward(self, x):
        return torch.utils.checkpoint.checkpoint(self.layer, x)

    def training_step(self, batch, batch_idx):
        loss = self(batch).sum()
        self.log("train_loss", loss)
        return {"loss": loss}

    def validation_step(self, batch, batch_idx):
        loss = self(batch).sum()
        self.log("valid_loss", loss)

    def test_step(self, batch, batch_idx):
        loss = self(batch).sum()
        self.log("test_loss", loss)

    def configure_optimizers(self):
        return torch.optim.SGD(self.layer.parameters(), lr=0.1)


def run():
    train_data = DataLoader(RandomDataset(32, 64), batch_size=2)
    val_data = DataLoader(RandomDataset(32, 64), batch_size=2)

    model = BoringModel()
    trainer = Trainer(max_epochs=1)
    trainer.fit(model, train_dataloaders=train_data, val_dataloaders=val_data)


if __name__ == "__main__":
    run()
```

I think the issue arises because the entire model's forward pass is checkpointed while the input tensors don't require gradients, so the autograd engine has nothing to propagate gradients through. Activation checkpointing only makes sense if you have intermediate layers that produce expensive activations. For example, swap the model out to look like this:

```python
class BoringModel(LightningModule):
    def __init__(self):
        super().__init__()
        self.layer_h = torch.nn.Linear(32, 32)
        self.layer = torch.nn.Linear(32, 2)

    def forward(self, x):
        x = torch.utils.checkpoint.checkpoint(self.layer_h, x)
        return self.layer(x)
```

Activation checkpointing just means that on the backward pass we'll need to re-compute the activations (unless you use CPU checkpointing with Deepspeed, where the activations are instead moved to CPU memory). In particular, there is no point checkpointing the final layer, as it would immediately need to be re-computed anyway:

```python
class BoringModel(LightningModule):
    def __init__(self):
        super().__init__()
        self.layer_h = torch.nn.Linear(32, 32)
        self.layer = torch.nn.Linear(32, 2)

    def forward(self, x):
        x = self.layer_h(x)
        return torch.utils.checkpoint.checkpoint(self.layer, x)  # no point doing this!
```

We should definitely make the docs clearer for this, I'll make this an issue :)
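A related workaround, sketched here for illustration and not part of the reply above, is to flag the checkpointed input as requiring gradients: `torch.utils.checkpoint` only participates in the backward pass when at least one of its tensor inputs requires grad, so marking the input keeps the checkpointed block connected to the autograd graph. The class name below is hypothetical.

```python
import torch
import torch.utils.checkpoint
from pytorch_lightning import LightningModule


class BoringModelGradInput(LightningModule):  # hypothetical name, for illustration only
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def forward(self, x):
        # Batches from a DataLoader arrive with requires_grad=False; flagging the
        # input connects the checkpointed output to the autograd graph, so the
        # layer's parameters receive gradients on backward.
        x = x.requires_grad_(True)
        return torch.utils.checkpoint.checkpoint(self.layer, x)
```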
-
@SeanNaren Detailed answer! 👍 But I wonder how to wrap an inner model such as a HuggingFace transformer:

```python
import deepspeed
import torch
from pytorch_lightning import LightningModule
from transformers import AutoModel


class BoringModel(LightningModule):
    def __init__(self):
        super().__init__()
        self.bert_layer = AutoModel.from_pretrained("bert-base-uncased")
        self.mlp_layer = torch.nn.Linear(768, 2)

    def forward(self, input_ids, attention_mask, token_type_ids):
        bert_output = deepspeed.checkpointing.checkpoint(
            self.bert_layer, input_ids, attention_mask, token_type_ids
        )
        # ⬆️ this API returns a tuple of strings: ('last_hidden_state', 'pooler_output'), not the embeddings
        cls_output = bert_output.last_hidden_state[:, 0, :]
        mlp_output = self.mlp_layer(cls_output)
        return mlp_output
```
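One possible way around the tuple issue, sketched here rather than taken from this thread, is not to wrap the whole HuggingFace model in a checkpoint call at all and instead let transformers checkpoint its own encoder layers. This assumes a transformers version that provides `gradient_checkpointing_enable()`; the class name below is hypothetical.

```python
import torch
from pytorch_lightning import LightningModule
from transformers import AutoModel


class BertClassifier(LightningModule):  # hypothetical name, for illustration only
    def __init__(self):
        super().__init__()
        self.bert_layer = AutoModel.from_pretrained("bert-base-uncased")
        # Checkpoint each transformer block internally instead of wrapping the whole
        # model, so forward still returns the usual HuggingFace output object.
        self.bert_layer.gradient_checkpointing_enable()
        self.mlp_layer = torch.nn.Linear(768, 2)

    def forward(self, input_ids, attention_mask, token_type_ids):
        bert_output = self.bert_layer(
            input_ids=input_ids,
            attention_mask=attention_mask,
            token_type_ids=token_type_ids,
        )
        cls_output = bert_output.last_hidden_state[:, 0, :]
        return self.mlp_layer(cls_output)
```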
-
Deepspeed itself only provides pipeline parallelism (PP), and Deepspeed PP is incompatible with ZeRO stage 2 and stage 3 (ref: https://deepspeed.readthedocs.io/en/latest/pipeline.html). Furthermore, ZeRO stage 3 computes activations independently on each GPU for its own sub-batch, which makes activation partitioning itself meaningless. In conclusion, if you're using the DeepSpeedStrategy in PyTorch Lightning, applying activation partitioning doesn't offer much benefit.
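For reference, here is a sketch of how the activation-checkpointing options discussed in this thread are passed through Lightning's DeepSpeed plugin. Only `partition_activations` appears in the original repro; the other argument names are assumptions about the `DeepSpeedPlugin` signature of that era, so check your installed version before relying on them.

```python
from pytorch_lightning import Trainer
from pytorch_lightning.plugins import DeepSpeedPlugin

# Assumed arguments beyond partition_activations: cpu_checkpointing and
# contiguous_memory_optimization map onto DeepSpeed's activation_checkpointing config.
trainer = Trainer(
    gpus=-1,
    precision=16,
    plugins=DeepSpeedPlugin(
        stage=3,
        partition_activations=True,        # shard checkpointed activations across GPUs
        cpu_checkpointing=True,            # offload checkpointed activations to CPU memory
        contiguous_memory_optimization=False,
    ),
)
```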