Optimize MX4 padding to minimize need for tuning (#3040) · pytorch/FBGEMM@dde4c7f

Optimize MX4 padding to minimize need for tuning (#3040)

Summary:
X-link: facebookresearch/FBGEMM#137

Pull Request resolved: #3040

D61447274 introduced a very cool way of doing 2D indexing over input tensors during MX4 quantization. However, it relies heavily on tuning configurations to get good performance. It turns out the use case for MX4 has highly dynamic shapes, so we spend a huge amount of time tuning for those shapes.

After deep meditation I realized there's a much simpler indexing scheme we can use, which is similar to the 1D accesses we used previously but adds shifts for padding.

With this approach we should get the best of both worlds: support for padding rows that are not divisible by the group size, and minimal tuning while maintaining good performance.
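To illustrate the idea of 1D accesses with padding shifts, here is a minimal pure-Python sketch. It is not the Triton kernel from this change; the function names `padded_offsets` and `pad_rows` and their parameters are hypothetical, and it assumes each row is padded at its end up to the next multiple of the group size.

```python
def padded_offsets(num_rows, row_len, group_size):
    """Map each flat input offset to its offset in a row-padded layout.

    Hypothetical helper for illustration only; assumes padding is appended
    to the end of every row.
    """
    # Round the row length up to the next multiple of group_size.
    padded_len = -(-row_len // group_size) * group_size
    pad_per_row = padded_len - row_len
    offsets = []
    for flat in range(num_rows * row_len):
        row = flat // row_len
        # 1D access plus a shift for the padding of all preceding rows.
        offsets.append(flat + row * pad_per_row)
    return padded_len, offsets


def pad_rows(values, num_rows, row_len, group_size, fill=0.0):
    """Scatter a flat row-major buffer into its padded flat layout."""
    padded_len, offsets = padded_offsets(num_rows, row_len, group_size)
    out = [fill] * (num_rows * padded_len)
    for src, dst in enumerate(offsets):
        out[dst] = values[src]
    return out
```

For example, two rows of length 3 with a group size of 4 get one padding element per row, so the second row's elements are shifted by one position in the flat output.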

After further experimentation, we can actually remove tuning entirely and just use a reasonably large `GROUP_LOAD`. This gives good performance across all shapes and removes any chance of tuning overhead. Empirically, `GROUP_LOAD=64` seems to be the sweet spot.

Differential Revision: D61816830

jwfromm authored and facebook-github-bot committed Aug 27, 2024

1 parent e782781 commit dde4c7f
