
A FusionDefinition wrapper that takes/produces DTensors. #3703

Open
wants to merge 7 commits into wjy/dist
Conversation

wujingyue
Collaborator

@wujingyue wujingyue commented Jan 14, 2025

This is a proof of concept for integrating nvFuser's model parallelism into the framework.
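For context, a minimal usage sketch of the intended flow (assuming a recent PyTorch where DTensor lives under torch.distributed.tensor, and a script launched with torchrun with one GPU per rank; define_add is an illustrative fusion, not part of this PR):

import torch
import torch.distributed as dist
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.tensor import DTensor, Shard
from nvfuser import DataType, FusionDefinition

def define_add(fd: FusionDefinition) -> None:
    # A plain fusion definition; no multidevice_schedule here. The wrapper
    # derives the schedule from the placements of the input DTensors.
    t0 = fd.define_tensor(shape=[-1, -1], dtype=DataType.Float)
    t1 = fd.define_tensor(shape=[-1, -1], dtype=DataType.Float)
    fd.add_output(fd.ops.add(t0, t1))

dist.init_process_group("nccl")
torch.cuda.set_device(dist.get_rank())
mesh = init_device_mesh("cuda", (dist.get_world_size(),))  # 1D mesh over all ranks

x = DTensor.from_local(torch.randn(4, 8, device="cuda"), mesh, [Shard(0)])
y = DTensor.from_local(torch.randn(4, 8, device="cuda"), mesh, [Shard(0)])

op = FusionDefinitionWrapper(define_add)
(z,) = op([x, y])  # z is a DTensor; its placement is inferred from nvFuser's output sharding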

@wujingyue wujingyue marked this pull request as draft January 14, 2025 06:18
@wujingyue
Collaborator Author

!test

@wujingyue
Collaborator Author

Cc @jjsjann123

wujingyue added a commit that referenced this pull request Jan 14, 2025

github-actions bot commented Jan 16, 2025

PR Reviewer Guide 🔍

(Review updated until commit f9082a9)

Here are some key observations to aid the review process:

⏱️ Estimated effort to review: 4 🔵🔵🔵🔵⚪
🧪 PR contains tests
⚡ Recommended focus areas for review

Possible Logic Change

The new local property added to the FusionDefinition class may change the behavior of the axis_sharded_on method. Reviewers should verify that the logic is correct and consistent with the rest of the codebase.

@property
def local(self) -> torch.Tensor:
    """Returns the underlying local tensor."""
    return self._dtensor.local
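As a point of reference, this is roughly how the property is consumed when an nvFuser output is rewrapped as a DTensor (a sketch mirroring the wrapper code below; out_tensor stands for an nvfuser.DistributedTensor):

# Sketch: rewrap an nvFuser DistributedTensor's local data as a DTensor.
mesh = init_device_mesh("cuda", (out_tensor.mesh.size,))
axis = out_tensor.axis_sharded_on(nvfuser.ParallelType.mesh_x)
placement = Replicate() if axis == -1 else Shard(axis)
dtensor = DTensor.from_local(out_tensor.local, mesh, [placement])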
Potential Bug

The FusionDefinitionWrapper class assumes that the define_fusion function will always return a FusionDefinition object. However, if the function returns None or an object of a different type, the wrapper will fail. Reviewers should add error handling to ensure that the wrapper can handle such cases.

# Imports assumed by this snippet (the DTensor types live under
# torch.distributed._tensor in older PyTorch releases):
from typing import Callable, Iterable, cast

import torch
import torch.distributed as dist
from torch.distributed.tensor import DTensor, Placement, Replicate, Shard

import nvfuser
from nvfuser import FusionDefinition


class FusionDefinitionWrapper:
    def __init__(self, define_fusion: Callable[[FusionDefinition], None]):
        """Wraps a function that defines a fusion without `multidevice_schedule`."""
        self._define_fusion = define_fusion

    def __call__(self, in_dtensors: Iterable[DTensor]) -> list[DTensor]:
        define_fn = self._define_fusion

        class Model(FusionDefinition):
            def definition(self):
                define_fn(self)

            def _find_tensor_by_index(self, index: int) -> nvfuser.Tensor:
                for t in self.sched.tensors():
                    if t.index == index:
                        return t
                return None

            def multidevice_schedule(self):
                for in_tensor_index, in_dtensor in zip(self.inputs(), in_dtensors):
                    in_tensor = self._find_tensor_by_index(in_tensor_index)

                    # Set the device mesh.
                    assert (
                        in_dtensor.device_mesh.ndim == 1
                    ), "nvFuser's Python API only supports 1D meshes."
                    mesh = nvfuser.DeviceMesh(
                        in_dtensor.device_mesh.mesh.view(-1).tolist()
                    )
                    self.sched._set_device_mesh(in_tensor, mesh)

                    # Parallelize.
                    assert len(in_dtensor.placements) == 1, "Expect a 1D mesh"
                    placement: Placement = in_dtensor.placements[0]
                    if placement.is_shard():
                        dim = cast(Shard, placement).dim
                        self.sched.parallelize(
                            in_tensor, dim, nvfuser.ParallelType.mesh_x
                        )

        in_tensors = [in_dtensor.to_local() for in_dtensor in in_dtensors]
        model = Model()
        out_tensors = model.execute(in_tensors)

        for i, out_tensor in enumerate(out_tensors):
            if isinstance(out_tensor, nvfuser.DistributedTensor):
                mesh = dist.device_mesh.init_device_mesh("cuda", [out_tensor.mesh.size])
                placements: list[Placement] = []
                for parallel_type in [nvfuser.ParallelType.mesh_x]:
                    axis: int = out_tensor.axis_sharded_on(parallel_type)
                    placements.append(Replicate() if axis == -1 else Shard(axis))
                out_tensors[i] = DTensor.from_local(out_tensor.local, mesh, placements)
        return out_tensors
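One illustrative way to address the error-handling concern raised above (not part of this PR) is to fail fast on a non-callable define_fusion and on missing outputs; these helper names are hypothetical:

def _validate_define_fusion(define_fusion) -> None:
    # Illustrative guard: the wrapper ignores the callback's return value,
    # so the realistic failure mode is passing something that is not callable.
    if not callable(define_fusion):
        raise TypeError(
            f"define_fusion must be a callable taking a FusionDefinition, "
            f"got {type(define_fusion).__name__}"
        )

def _validate_outputs(out_tensors) -> None:
    # Illustrative guard: surface an empty or missing result from execute()
    # with a clear message instead of failing later during DTensor conversion.
    if not out_tensors:
        raise RuntimeError("FusionDefinition.execute() returned no outputs.")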

@wujingyue wujingyue changed the title A custom op that wraps a FusionDefinition and takes/produces DTensors. A FusionDefinition wrapper that takes/produces DTensors. Jan 16, 2025
@wujingyue wujingyue changed the base branch from main to wjy/dist January 19, 2025 08:03
@wujingyue wujingyue marked this pull request as ready for review January 19, 2025 17:40
@wujingyue
Collaborator Author

!test
