Compute shader performance report and optimization strategies #6688

rcoreilly · 2024-12-09T09:07:25Z

rcoreilly
Dec 9, 2024

First, a quick experience report on a large-scale use of wgpu (native) for compute, with a vulkan-based previous implementation for comparison. This is for a relatively complex biologically-based neural network simulation system: https://github.com/emer/axon, which is written in Go, and uses a tool that translates Go into WGSL to generate a GPU version of the Go source.

The new wgpu-native version takes 1.65x as long to run a large benchmark model (~50k neurons, ~32 million synapses) compared to the previous vulkan-based version, on my macbook pro M3 Max laptop (will test on other GPUs later, but previously the M3 outperformed most NVIDIA hardware!). That vulkan version was running over MoltenVK on the mac.

Interestingly, the v22.1.0.5 release version of wgpu-native is 2x slower vs vulkan, whereas the PR gfx-rs/wgpu-native#441 version brings that down to the 1.65x, so already something in there has improved performance significantly.

In principle, wgpu going direct on top of Metal should be able to be just as fast, if not faster. So my question is: what is the best way to identify the current performance hot spots?

All major computation is done entirely on GPU-resident memory buffers, with minimal transfer back to CPU, so any performance issues are likely to be associated with just launching and sequencing the compute dispatches. None of the WGSL code uses any workgroup or other sync mechanisms: just uses atomicAdd in a couple of kernels (out of ~30 total). I'm hoping this might provide a more tractable, narrow search space for relevant performance issues.

rcoreilly · 2024-12-14T00:04:56Z

rcoreilly
Dec 14, 2024
Author

Further exploration has narrowed down the likely locus of the performance difference even further: it is likely to be just about the performance of the actual shader code generated from WGSL vs. HLSL.

I was able to further optimize memory transfers and got the performance on small models to be 1.5x faster than the previous vulkan-based implementation. These small models have a larger proportion of total time associated with the "overhead" of CPU - GPU interactions (dispatch of pipelines and memory transfers), whereas the larger model described above was unaffected by these optimizations. Thus, it seems that the main difference must be in the actual runtime of each shader.

I wasn't able to find much about this topic -- e.g., this https://www.reddit.com/r/rust_gamedev/comments/1doaam9/tools_to_debug_and_improve_wgsl_shader_peformance/ -- I'll try looking at the XCode instrumentation at some point.

I did try to run the vulkan backend on Mac and it crashed, but I'll still plan to look at it on an NVIDIA A100 on linux at some point soon.

Meanwhile, one further issue: the Go source does not have a corresponding distinction between let and var, so everything is translated as var, which obviously could be bad for performance. This seems like something that the compiler should be optimizing on its own in the first place, but I guess that might be one place to explore -- does the naga compiler go ahead and optimize non-modified var local variables to effectively act like let variables anyway?

3 replies

cwfitzgerald Dec 14, 2024
Maintainer

does the naga compiler go ahead and optimize non-modified var local variables to effectively act like let variables anyway?

The let/var distinction shouldn't matter at all - if it does in practice, I would be very surprised.

You might try generating a unchecked shader through create_shader_module_unchecked, there were some regressions to metal performance wrt checking in paritcular in 23.0 that we're working on patching up. Oh wait, that actually still has the kludge enabled... #6662 should fix your problem

JMS55 Dec 14, 2024

This also seems like it could be having an impact #6521

rcoreilly Dec 14, 2024
Author

Thanks for the tips! unfortunately I could not get #6662 to build in wgpu-native, which I depend on. I'll await further updates when they become available there and glad to hear that progress is being made.

rcoreilly · 2025-01-10T07:54:06Z

rcoreilly
Jan 10, 2025
Author

Now that I got everything sorted out per #6875 and I'm able to run on vulkan under linux, I am happy to report that the current wgpu performance on an NVIDIA A100 GPU is 20% faster overall on the large model test case, vs. the previous direct vulkan-based implementation.

Unfortunately I cannot seem to reconstruct how I actually got the vulkan backend running on my mac before -- probably the crash was the same one I've just fixed. There are conflicting docs about whether the environment variable value for WGPU_BACKEND should be vk or vulkan, and I remember having to put the libMoltenVK.dylib file directly in the directory where I was running, but nothing I do seems to make any difference and it just reports the same metal adapter regardless. And I don't see how it would actually link in these libs in the first place!? Is there some doc that I'm missing that tells you how to make this work?

Also, is there some kind of roadmap for when wgpu-native might be updated to track the latest wgpu? It would be great to be able to try out the updated metal fixes sometime soon. Thanks!

0 replies

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Compute shader performance report and optimization strategies #6688

{{title}}

Replies: 2 comments 3 replies

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

{{title}}

Select a reply

Compute shader performance report and optimization strategies #6688

rcoreilly Dec 9, 2024

Replies: 2 comments · 3 replies

rcoreilly Dec 14, 2024 Author

cwfitzgerald Dec 14, 2024 Maintainer

JMS55 Dec 14, 2024

rcoreilly Dec 14, 2024 Author

rcoreilly Jan 10, 2025 Author

rcoreilly
Dec 9, 2024

Replies: 2 comments 3 replies

rcoreilly
Dec 14, 2024
Author

cwfitzgerald Dec 14, 2024
Maintainer

rcoreilly Dec 14, 2024
Author

rcoreilly
Jan 10, 2025
Author