Why isn't there an a_small*b_small term in the equation? Is it always so small that it can be safely omitted? 3xTF32 looks awesome.
CUTLASS 2.8 was released on 11/19, its anniversary, and tagged recently. In this release, we have several new exciting features.
As announced at GTC, we released 3xTF32 GEMM, complex GEMM, and conv2d kernels. 3xTF32 is a technique that emulates FP32 accuracy with 2x performance. The trick is simply to split one FP32 MMA into three TF32 MMAs. It is useful for HPC/DL when FP32 is too slow or TF32 is not accurate enough. Feel free to try the SDK examples, which check the accuracy and performance on your own problems.
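A minimal scalar sketch of the split, not the CUTLASS implementation: each FP32 operand is divided into a TF32 "big" part and a TF32 "small" residual, and three of the four cross products are kept while small × small is dropped. The helper names are hypothetical, TF32 conversion is emulated by truncating the low 13 mantissa bits (an assumption; hardware conversion may round instead), and accumulation happens in Python's FP64 rather than FP32.

```python
import struct


def to_fp32(x: float) -> float:
    """Round a Python float (FP64) to the nearest FP32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def to_tf32(x: float) -> float:
    """Emulate TF32 by zeroing the low 13 of FP32's 23 mantissa bits.

    Assumption: plain truncation, chosen for simplicity.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFFE000))[0]


def split(x: float):
    """Split an FP32 value into TF32 'big' and 'small' parts."""
    big = to_tf32(x)
    small = to_tf32(x - big)  # residual that 'big' could not represent
    return big, small


def dot_1xtf32(a, b):
    """Dot product with a single TF32 multiply per element pair."""
    return sum(to_tf32(x) * to_tf32(y) for x, y in zip(a, b))


def dot_3xtf32(a, b):
    """Dot product with three TF32 multiplies per element pair."""
    acc = 0.0  # FP64 accumulator here; a real kernel would accumulate in FP32
    for x, y in zip(a, b):
        x_big, x_small = split(x)
        y_big, y_small = split(y)
        # The x_small * y_small term is omitted.
        acc += x_small * y_big + x_big * y_small + x_big * y_big
    return acc


if __name__ == "__main__":
    a = [to_fp32(1.0 / (i + 3)) for i in range(256)]
    b = [to_fp32(1.0 + 1.0 / (i + 7)) for i in range(256)]
    ref = sum(x * y for x, y in zip(a, b))
    print("1xTF32 abs error:", abs(dot_1xtf32(a, b) - ref))
    print("3xTF32 abs error:", abs(dot_3xtf32(a, b) - ref))
```

Under these assumptions, each "small" part is roughly 2^-10 times its operand or less, so the omitted product is on the order of 2^-20 times the full product, which is close to FP32's own precision limit.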
3xTF32 kernels are supported in the CUTLASS profiler. The CMake command
cmake .. -DCUTLASS_NVCC_ARCHS=80 -DCUTLASS_LIBRARY_KERNELS=cutlass_tensorop_s1688gemm_1*,cutlass_tensorop_s1688gemm_2*,cutlass_tensorop_s1688gemm_6*
enables all 3xTF32 GEMM kernels. Changing s1688gemm to c1688gemm, s1688fprop, s1688dgrad, or s1688wgrad enables the 3xTF32 complex GEMM, fprop, dgrad, or wgrad kernels in the profiler, respectively.
Group GEMM is similar to batched GEMM but places no restriction on the M/N/K dimensions across batches. Imagine its use in Transformer models. The SDK example provides profiling utilities.
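To make the difference from batched GEMM concrete, here is a small reference sketch of the semantics only. It assumes nothing about how the CUTLASS kernel schedules work on the GPU. The function names are hypothetical, and the three problem shapes are made up for illustration (they could stand for inputs of different lengths).

```python
def matmul(a, b):
    """Plain reference GEMM on nested lists: (M x K) times (K x N)."""
    k = len(b)
    assert all(len(row) == k for row in a), "inner dimensions must match"
    n = len(b[0])
    return [[sum(row[p] * b[p][j] for p in range(k)) for j in range(n)]
            for row in a]


def grouped_gemm(problems):
    """Compute one independent GEMM per (A, B) pair.

    Unlike a batched GEMM, every pair may have its own M, N, and K.
    """
    return [matmul(a, b) for a, b in problems]


def ones(rows, cols):
    """Helper: a rows x cols matrix filled with 1.0."""
    return [[1.0] * cols for _ in range(rows)]


# Three problems whose (M, N, K) are unrelated to each other.
problems = [
    (ones(2, 3), ones(3, 4)),  # M=2, K=3, N=4
    (ones(5, 1), ones(1, 2)),  # M=5, K=1, N=2
    (ones(1, 7), ones(7, 1)),  # M=1, K=7, N=1
]
results = grouped_gemm(problems)
shapes = [(len(c), len(c[0])) for c in results]

if __name__ == "__main__":
    print(shapes)
```

A batched GEMM would require all three problems to share one (M, N, K); here each output simply takes the shape implied by its own inputs.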
Mainloop Scale+Bias+ReLU fusion for Fprop and Wgrad. These per-channel elementwise operations are applied before the MMA.
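A rough sketch of what "applied before the MMA" means numerically. It assumes the per-channel operations act on the activation operand and uses a 1x1 convolution so that fprop reduces to a plain matrix product. Both are simplifications for illustration, and all names are hypothetical.

```python
def scale_bias_relu(x, scale, bias):
    """Apply max(0, v * scale[c] + bias[c]) to each input channel c.

    x is laid out as [pixels][C]; scale and bias have length C.
    """
    return [[max(0.0, v * scale[c] + bias[c]) for c, v in enumerate(row)]
            for row in x]


def pointwise_conv(x, w):
    """1x1 convolution written as a GEMM: x is [pixels][C], w is [C][K]."""
    channels = len(w)
    filters = len(w[0])
    return [[sum(row[c] * w[c][k] for c in range(channels))
             for k in range(filters)]
            for row in x]


def fused_fprop(x, scale, bias, w):
    """Elementwise ops first, then the multiply-accumulate.

    A fused kernel would do the elementwise step on operand fragments
    inside the mainloop instead of materializing the intermediate tensor.
    """
    return pointwise_conv(scale_bias_relu(x, scale, bias), w)


if __name__ == "__main__":
    x = [[1.0, -2.0]]      # one pixel, two input channels
    scale = [2.0, 1.0]
    bias = [0.5, 0.0]
    w = [[1.0], [10.0]]    # two input channels, one filter
    print(fused_fprop(x, scale, bias, w))
```

The point of fusing is to avoid writing the intermediate activation to memory and reading it back; the arithmetic result is the same as running the two steps separately.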
The back-to-back conv-conv fusion example can now stage the result of the first conv in shared memory on Turing. This relaxes the tile size selection and results in better performance. This paper also provides some explanation and performance results.
Just a reminder: in the previous release, CUTLASS open-sourced a super fast strided dgrad (https://github.com/NVIDIA/cutlass/blob/master/include/cutlass/conv/kernel/default_conv2d_dgrad.h#L678-L796), per-channel bias broadcast epilogue fusion, and per-channel reduction epilogue fusion.