-
Further exploration has narrowed the likely locus of the performance difference even further: it now appears to come down to the runtime of the actual shader code generated from WGSL vs. HLSL. By further optimizing memory transfers, I got the performance on small models to be 1.5x faster than the previous Vulkan-based implementation. These small models spend a larger proportion of total time on the "overhead" of CPU-GPU interactions (dispatch of pipelines and memory transfers), whereas the larger model described above was unaffected by these optimizations. Thus, it seems the main difference must be in the actual runtime of each shader. I wasn't able to find much about this topic (e.g., https://www.reddit.com/r/rust_gamedev/comments/1doaam9/tools_to_debug_and_improve_wgsl_shader_peformance/), so I'll try the Xcode instrumentation at some point. I did try to run the Vulkan backend on the Mac and it crashed, but I still plan to look at it on an NVIDIA A100 on Linux soon. Meanwhile, one further issue: the Go source does not have a corresponding distinction between
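For anyone else heading down the same path: per-shader GPU timings on macOS can be captured with Xcode's Instruments "Metal System Trace" template, which can also be driven from the command line via xctrace. A hedged sketch (the binary name `./axon_bench` is a placeholder for whatever benchmark executable you run, not something from the axon repo):

```shell
# Sketch: drive Instruments' "Metal System Trace" template from the CLI.
# This records per-encoder GPU timings that can then be opened in Instruments.
# "./axon_bench" is a placeholder binary name; substitute your own.
PROFILE_CMD="xcrun xctrace record --template 'Metal System Trace' --launch -- ./axon_bench"
echo "$PROFILE_CMD"
```

The resulting `.trace` bundle opens in Instruments, where each compute encoder's GPU duration can be compared across the wgpu and MoltenVK runs.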
-
Now that I've got everything sorted out per #6875 and am able to run on Vulkan under Linux, I'm happy to report that current wgpu performance on an NVIDIA A100 GPU is 20% faster overall on the large-model test case than the previous direct Vulkan-based implementation. Unfortunately I cannot reconstruct how I got the Vulkan backend running on my Mac before; probably the crash was the same one I've just fixed. There are conflicting docs about whether the WGPU_BACKEND environment variable should be set to vk or vulkan, and I remember having to put the libMoltenVK.dylib file directly in the directory where I was running, but nothing I do seems to make any difference: it just reports the same Metal adapter regardless. And I don't see how it would actually link in these libs in the first place!? Is there some doc I'm missing that explains how to make this work? Also, is there some kind of roadmap for when wgpu-native might be updated to track the latest wgpu? It would be great to be able to try out the updated Metal fixes sometime soon. Thanks!
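For reference, this is the kind of setup I've been attempting on the Mac. A hedged sketch only: the SDK paths are placeholders for wherever your Vulkan SDK actually lives, and whether wgpu-native honors WGPU_BACKEND at all depends on how it was built (the Vulkan loader discovers MoltenVK via the ICD manifest, not by linking the dylib directly):

```shell
# Attempted setup for forcing the Vulkan backend over MoltenVK on macOS.
# Paths are placeholders; adjust to your Vulkan SDK install location.
# Docs disagree on "vk" vs "vulkan" -- "vulkan" seems the safer spelling.
export WGPU_BACKEND=vulkan
# Vulkan loader discovers MoltenVK through its ICD JSON manifest:
export VK_ICD_FILENAMES="$HOME/VulkanSDK/macOS/share/vulkan/icd.d/MoltenVK_icd.json"
# Make libvulkan / libMoltenVK findable at runtime:
export DYLD_LIBRARY_PATH="$HOME/VulkanSDK/macOS/lib:$DYLD_LIBRARY_PATH"
```

If the adapter report still says Metal after this, that would suggest the wgpu-native build doesn't have the Vulkan backend compiled in on macOS at all.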
-
First, a quick experience report on a large-scale use of wgpu (native) for compute, with a previous Vulkan-based implementation for comparison. This is for a relatively complex, biologically based neural network simulation system, https://github.com/emer/axon, which is written in Go and uses a tool that translates Go into WGSL to generate a GPU version of the Go source.
The new wgpu-native version takes 1.65x as long to run a large benchmark model (~50k neurons, ~32 million synapses) compared to the previous Vulkan-based version, on my MacBook Pro M3 Max laptop (I will test on other GPUs later, but previously the M3 outperformed most NVIDIA hardware!). That Vulkan version was running over MoltenVK on the Mac.
Interestingly, the v22.1.0.5 release of wgpu-native is 2x slower than Vulkan, whereas the gfx-rs/wgpu-native#441 PR version brings that down to 1.65x, so something in there has already improved performance significantly.
In principle, wgpu running directly on top of Metal should be just as fast, if not faster. So my question is: what is the best way to identify the current performance hot spots?
All major computation is done entirely on GPU-resident memory buffers, with minimal transfer back to the CPU, so any performance issues are likely associated with just launching and sequencing the compute dispatches. None of the WGSL code uses workgroup or other sync mechanisms; it just uses atomicAdd in a couple of kernels (out of ~30 total). I'm hoping this provides a more tractable, narrow search space for relevant performance issues.
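To make the synchronization profile concrete, the only sync primitive in play is of roughly this shape. A minimal illustrative sketch, not code from axon (the names `Counters`, `accumulate`, and the 0.5 threshold are invented):

```wgsl
// Minimal sketch of the only synchronization used: atomicAdd into a
// storage buffer. No workgroup barriers or shared memory anywhere.
struct Counters {
    spikes: atomic<u32>,
}

@group(0) @binding(0) var<storage, read_write> counters: Counters;
@group(0) @binding(1) var<storage, read> inputs: array<f32>;

@compute @workgroup_size(64)
fn accumulate(@builtin(global_invocation_id) gid: vec3<u32>) {
    let i = gid.x;
    if (i >= arrayLength(&inputs)) {
        return;
    }
    if (inputs[i] > 0.5) {
        atomicAdd(&counters.spikes, 1u);
    }
}
```

The other ~28 kernels are plain data-parallel loops over GPU-resident buffers with no cross-invocation communication at all.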