
[js/webgpu] ConvTranspose1D slower on WebGPU than Wasm #23273

Open
gianlourbano opened this issue Jan 7, 2025 · 5 comments
Labels
ep:WebGPU (ort-web webgpu provider), platform:web (issues related to ONNX Runtime web; typically submitted using template)

Comments

@gianlourbano

Describe the issue

ConvTranspose1D with input shape [8, 4098, 435], weights [4098, 1, 4096], stride 1024, and padding 0 appears to be slower on WebGPU than Wasm, with the following timings:

| EP | Timing (M1 MacBook Pro) |
| --- | --- |
| wasm | 6 s |
| webgpu (latest Chrome) | 30 s |
| webgpu (Chrome Canary) | 18 s |

Canary is faster due to this bug.
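For scale, the standard ConvTranspose1D output-length formula shows how large this op is (a back-of-the-envelope check added here for illustration, not part of the original report):

```python
# Output length from the standard ConvTranspose1D formula:
#   L_out = (L_in - 1) * stride - 2 * padding + kernel_size
l_out = (435 - 1) * 1024 - 2 * 0 + 4096
print(l_out)  # 448512, i.e. an output tensor of shape [8, 1, 448512]
```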

To reproduce

A simple torch script to generate the conv and convert it to ONNX:

```python
import torch

class ConvTest(torch.nn.Module):
    def __init__(self, weight, stride, padding=0):
        super().__init__()
        self.weight = weight
        self.stride = stride
        self.padding = padding

    def forward(self, x):
        return torch.nn.functional.conv_transpose1d(x, self.weight, stride=self.stride, padding=self.padding)

convtest = ConvTest(weight=torch.randn(4098, 1, 4096), stride=1024)

input = torch.randn(8, 4098, 435)

torch.onnx.export(
    convtest,
    (input,),
    "convtest.onnx",
    input_names=["input"],
    output_names=["output"],
    opset_version=20,
    dynamo=True,
    do_constant_folding=True,
    keep_initializers_as_inputs=True,
    # report=True,
    external_data=None,
    # verify=True
)
```
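As a sanity check, the exported model can be run on native CPU to confirm the output shape and get a reference timing (a minimal sketch, assuming the `onnxruntime` Python package is installed; this is illustrative and not part of the original report):

```python
import time
import numpy as np
import onnxruntime

# Run the exported model on native CPU to verify the output shape
# and get a rough reference timing.
sess = onnxruntime.InferenceSession("convtest.onnx", providers=["CPUExecutionProvider"])
x = np.random.randn(8, 4098, 435).astype(np.float32)

start = time.perf_counter()
out = sess.run(None, {"input": x})[0]
print(out.shape, f"{time.perf_counter() - start:.3f} s")  # expect (8, 1, 448512)
```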

To test in the browser:

```js
const session = await ort.InferenceSession.create("/convtest.onnx", {
    executionProviders: ["webgpu"],
    // logSeverityLevel: 0
});

const wgpu_profile = [];

ort.env.webgpu.profiling = {
    mode: "default",
    ondata: (data) => {
        wgpu_profile.push(data);
    }
};

const input_dims = [8, 4098, 435];
const size = 8 * 4098 * 435;

const no_chunks = 1;
const chunks = [];

for (let i = 0; i < no_chunks; i++) {
    const chunk = new Float32Array(size);
    chunks.push(chunk);
}

for (let i = 0; i < no_chunks; i++) {
    console.time("onnx step " + i);
    const input = new ort.Tensor("float32", chunks[i], input_dims);
    const output = await session.run({ input });
    console.timeEnd("onnx step " + i);
}

await session.release();

// Sort kernels by duration and print them (profiling timestamps are in nanoseconds).
wgpu_profile.sort((a, b) => (a.endTime - a.startTime) - (b.endTime - b.startTime));

wgpu_profile.forEach((kernel) => {
    console.log(`${kernel.kernelType} (${kernel.kernelName}) took ${(kernel.endTime - kernel.startTime) / 1000 / 1000} ms`);
});
```

Urgency

Urgent

ONNX Runtime Installation

Released Package

ONNX Runtime Version or Commit ID

1.21.0-dev.20241224-2d05c4bcd9

Execution Provider

'webgpu' (WebGPU), 'wasm'/'cpu' (WebAssembly CPU)

gianlourbano added the platform:web label Jan 7, 2025
gianlourbano (Author)

@qjia7 @gyagp could you please take a look? Maybe it has something to do with this PR.

The github-actions bot added the ep:WebGPU label Jan 7, 2025
qjia7 (Contributor) commented Jan 8, 2025

@gianlourbano I can reproduce it. Will take a look, thanks.

guschmue pushed a commit that referenced this issue Jan 9, 2025
### Description
BUG #23273

With this change, the ConvTranspose time in that bug drops from ~90 s to ~7 s on my Meteor Lake.

This PR does the following:
1. Use the stride to update the increment in the loop. In the bug, the stride is 1024, which greatly reduces the number of loop iterations (see the sketch after this list).
2. Support components for A to reduce the number of memory accesses.
3. When the output channel count is 1, the components of B can match those of A to further reduce the number of memory accesses.
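To see why stepping by the stride helps, here is a scalar sketch of a gather-style ConvTranspose1D inner loop before and after that change (illustrative Python, not the actual WGSL kernel; all names are invented for this sketch):

```python
def out_point_naive(x, w, o, stride, padding):
    # Visits all len(w) kernel taps; most fail the stride test.
    acc = 0.0
    for k in range(len(w)):
        num = o + padding - k
        if num % stride == 0 and 0 <= num // stride < len(x):
            acc += x[num // stride] * w[k]
    return acc

def out_point_strided(x, w, o, stride, padding):
    # Only taps with k == (o + padding) mod stride can contribute,
    # so step by the stride: ~len(w) / stride iterations instead of len(w).
    acc = 0.0
    k = (o + padding) % stride
    while k < len(w):
        i = (o + padding - k) // stride
        if 0 <= i < len(x):
            acc += x[i] * w[k]
        k += stride  # with stride 1024 this skips 1023 dead taps at a time
    return acc
```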
gianlourbano (Author) commented Jan 10, 2025

Thanks for the help, @qjia7! Do you think there's more room for improvement? The same op in torch/ONNX Python CPU takes about 400–600 ms.

guschmue pushed a commit that referenced this issue Jan 12, 2025
qjia7 (Contributor) commented Jan 13, 2025

> Do you think there's more room for improvement? The same op in torch/ONNX Python CPU takes about 400–600 ms.

Yes. Your shape is very special: the stride is 1024, which is very large, and I can do some specific optimization for such a large stride. The output channel count is only 1, which can also be further optimized. Glad to know the CPU only takes 400–600 ms, which gives the GPU a high target :)

gianlourbano (Author)

Yes, I'm aware. The convolution is part of an implementation of a custom inverse short-time Fourier transform, since conversion of that operator from torch to ONNX still does not work. Thank you for the precious help.
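For context, the overlap-add step of an iSTFT can indeed be phrased as `conv_transpose1d`. Here is a hedged sketch of just that step, with shapes chosen to mirror this issue; the author's real kernel would also fold in the inverse-DFT basis and window, hence the 4098 input channels, so this is illustrative only:

```python
import torch

# Overlap-add of frames via conv_transpose1d with a fixed identity kernel:
# sample c of frame t lands at output position t * hop + c, overlaps summed.
frame_len, hop, n_frames = 4096, 1024, 435
frames = torch.randn(1, frame_len, n_frames)    # [batch, frame_len, n_frames]
eye = torch.eye(frame_len).unsqueeze(1)         # [frame_len, 1, frame_len]
signal = torch.nn.functional.conv_transpose1d(frames, eye, stride=hop)
print(signal.shape)  # [1, 1, (n_frames - 1) * hop + frame_len] = [1, 1, 448512]
```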

guschmue pushed a commit that referenced this issue Jan 22, 2025
BUG #23273

This PR makes the following optimizations:
1. When the output channel count is one: 1) calculate the offset before the in-channel loop to reduce indices-to-offsets calculations; 2) split `inputChannelsPerGroup` into `inputChannelsPerGroupInt` and `inputChannelsRemainder` parts so that we can always access 4 data elements for `inputChannelsPerGroupInt`.
2. Use a precise initial value to reduce useless loop iterations (see the sketch after this list). Thanks to @jiangzhaoming's suggestions on this.

With this PR, ConvTranspose goes from 8.4 s to 3.7 s on Intel Meteor Lake. On an NV RTX 2000 Ada, it goes from 2.7 s to 1.6 s.
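The "precise initial value" point can be pictured on top of the strided sketch from the earlier commit: clamp the loop to taps whose input index is in range up front, instead of bounds-testing every iteration (again illustrative Python, my reading of the optimization, not the actual WGSL):

```python
def out_point_bounded(x, w, o, stride, padding):
    base = o + padding
    # Both candidates below share the residue base % stride, so the max is
    # still stride-aligned; it is the first tap whose input index is < len(x).
    k = max(base % stride, base - (len(x) - 1) * stride)
    k_end = min(len(w) - 1, base)  # taps past `base` would need a negative index
    acc = 0.0
    while k <= k_end:
        acc += x[(base - k) // stride] * w[k]  # no per-iteration bounds check
        k += stride
    return acc
```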
ashrit-ms pushed a commit that referenced this issue Jan 23, 2025