Tuning wgrad kernels with split-k, in practice #396
-
Hi, I have wgrad kernels with non-split-k working in TVM, and now I'm looking to leverage split-k. I'm wondering what the best strategy is for picking the best kernel among the full cartesian product of tile shapes, alignments, split-k modes, and split-k slice counts.
The first two (tile shape and alignment) are already supported in TVM, and for wgrad kernels there are about 80 variants. If I add split-k on top, the number of combinations becomes too large. What is a good, practical way to go about this? Also, which values of split-k slices (how many, and how large) should be considered? For example, how about the following: first consider only tile shape and alignment, and after picking the top performer, add split-k variations on top of it (tile and alignment fixed) for further tuning.
-
Below are some guidelines and information on finding the best tile shape, alignment, split-k-mode (serial, parallel), and split-k slice count.

1. Tile Shape: You want the largest Tile Shape for the most reuse; however, the trade-off is that a large Tile Shape may not reach full GPU utilization because of quantization effects. Thus, it is best to sweep through all possible Tile Shapes for each problem size.
2. Alignment: This one is straightforward. The largest possible alignment always wins, so for F16 input go with align8 wgrad kernels.
3. Split-k-mode: Parallel split-k-mode always surpasses serial split-k-mode. Parallel split-k-mode runs a separate reduction kernel instead of reducing the split-k chunks serially.
4. Split-k-slice: The goal here is to slice the problem in the GEMM-K dimension such that we have enough CTAs to fill the entire GPU and get maximum utilization.

Typically, we need a large split-k-slice value for wgrad since GEMM-M (K) and GEMM-N (RSC) are small, so we split GEMM-K (NPQ) to launch more CTAs. For a given Tile Shape and problem size, you can try a split-k-slice value that launches at least one full wave, i.e., 108 CTAs for GA100 or 68 CTAs for GA102 (see the sketch below). I am attaching some notes on this topic that try to analytically compute the split-k-slice value (see page 1). In practice, I have run sweeps to find the best (1) Tile Shape and (4) split-k-slice (--split-k-slice=1:128:1), fixing (2) alignment to the largest possible and (3) split-k-mode to parallel.
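To make guideline 4 concrete, here is a minimal sketch of the wave-filling arithmetic, using the SM counts quoted above (108 for GA100, 68 for GA102). The helper name and the example problem size are mine, for illustration; only the one-wave criterion and the GEMM-M/N/K mapping come from the post.

```cpp
#include <cstdio>

// Minimal sketch (not CUTLASS API): pick the smallest split-k-slice count
// that launches at least one full wave of CTAs for a given tile shape.
// For wgrad: GEMM-M = K, GEMM-N = R*S*C, GEMM-K = N*P*Q.
int splitKForOneWave(int gemm_m, int gemm_n, int tile_m, int tile_n, int num_sms) {
  // CTAs launched by a single slice: ceil-div over the M and N tile grids.
  int ctas_per_slice = ((gemm_m + tile_m - 1) / tile_m) *
                       ((gemm_n + tile_n - 1) / tile_n);
  // Smallest slice count such that ctas_per_slice * slices >= num_sms.
  // In practice you would also cap this at ceil(GEMM-K / TileK) so each
  // slice still has at least one K-tile of work.
  int slices = (num_sms + ctas_per_slice - 1) / ctas_per_slice;
  return slices < 1 ? 1 : slices;
}

int main() {
  // Example: K=64, R=S=3, C=64 -> GEMM-M=64, GEMM-N=576;
  // a 128x128 tile on GA100 (108 SMs) gives 5 CTAs/slice -> 22 slices.
  printf("split-k slices: %d\n", splitKForOneWave(64, 3 * 3 * 64, 128, 128, 108));
  return 0;
}
```

The analytic number is only a starting point; in practice you would still sweep a window around it, which is why the 1:128 sweep above remains useful.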
-
@manishucsd I have some questions on the reduction kernel block shape at https://github.com/masahi/cutlass/blob/example-wgrad-splitk/examples/26_ampere_wgrad_mainloop_fusion/ampere_wgrad_mainloop_fusion.cu#L107:
-
This parameter can impact the performance of the reduction kernel; however, the reduction time is usually much shorter than the GEMM itself.
What problem size and split_k_slice do you use?
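For readers landing here: the parameter under discussion is the threadblock shape of the parallel split-k reduction kernel. Written from memory of the linked example (so treat the exact shape and element types as assumptions, not a quote), the relevant type definitions look roughly like this:

```cpp
#include "cutlass/cutlass.h"
#include "cutlass/numeric_types.h"
#include "cutlass/matrix_shape.h"
#include "cutlass/epilogue/thread/linear_combination.h"
#include "cutlass/reduction/thread/reduction_operators.h"
#include "cutlass/reduction/kernel/reduce_split_k.h"
#include "cutlass/reduction/device/reduce_split_k.h"

// Element types assumed for illustration (F16 in/out, F32 accumulate).
using ElementOutput = cutlass::half_t;
using ElementAccumulator = float;
using ElementComputeEpilogue = float;

using EpilogueOp = cutlass::epilogue::thread::LinearCombination<
    ElementOutput,
    128 / cutlass::sizeof_bits<ElementOutput>::value,  // vector width (align8 for F16)
    ElementAccumulator,
    ElementComputeEpilogue>;

using ReductionOp = cutlass::reduction::thread::ReduceAdd<
    ElementAccumulator,
    EpilogueOp::ElementAccumulator,
    EpilogueOp::kCount>;

// The MatrixShape below is the block shape being asked about: each CTA of
// the reduction kernel processes a 4 x (32 * kCount) tile of split-k
// partials, and this is what you would vary when tuning the reduction.
using ReductionKernel = cutlass::reduction::kernel::ReduceSplitK<
    cutlass::MatrixShape<4, 32 * EpilogueOp::kCount>,
    EpilogueOp,
    ReductionOp>;

using ReductionDevice = cutlass::reduction::device::ReduceSplitK<ReductionKernel>;
```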
-
Ok, I found why. This line (cutlass/tools/library/src/handle.cu, lines 1073 to 1075 at 1e4703c) sets element C to the type of the accumulator, while what I need there is the type of the output tensor (handle.cu, line 1060 at 1e4703c). If I replace that line with …
-
You need to build both …

Here is what I got: …

What happens behind the scenes is that cutlass_profiler detects that the output type of … Here is how cutlass_profiler uses cudaEvent to measure performance: https://github.com/NVIDIA/cutlass/blob/master/tools/profiler/src/conv2d_operation_profiler.cu#L1276-L1335. You can compare the TVM one with it if you want.
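For anyone comparing against TVM's timing, the cudaEvent pattern in that link boils down to the following. This is a stripped-down sketch of mine, not the profiler's actual code, which additionally does warm-up runs and averages over many iterations:

```cpp
#include <cstdio>
#include <cuda_runtime.h>

// Stand-in for the kernel being measured.
__global__ void dummyKernel() {}

int main() {
  cudaEvent_t start, stop;
  cudaEventCreate(&start);
  cudaEventCreate(&stop);

  cudaEventRecord(start);      // enqueue start marker on the stream
  dummyKernel<<<1, 1>>>();     // the work being timed
  cudaEventRecord(stop);       // enqueue stop marker on the stream
  cudaEventSynchronize(stop);  // block until the stop event completes

  float ms = 0.0f;
  cudaEventElapsedTime(&ms, start, stop);  // elapsed GPU time in milliseconds
  printf("kernel time: %f ms\n", ms);

  cudaEventDestroy(start);
  cudaEventDestroy(stop);
  return 0;
}
```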