Skip to content

Implementation of TSM2L and TSM2R -- High-Performance Tall-and-Skinny Matrix-Matrix Multiplication Algorithms for CUDA

License

Notifications You must be signed in to change notification settings

solution3000/tsm2x-imp

 
 

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

39 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

TSM2X: High-Performance Tall-and-Skinny Matrix-Matrix Multiplication on GPUs

by Cody Rivera [[email protected]], Jieyang Chen [[email protected]], and Dingwen Tao [[email protected]]

This repository contains an implementation of two irregular-shape matrix-matrix multiplication algorithms, TSM2R and TSM2L. TSM2R is designed to efficiently multiply a large square (or near-square) matrix by a tall-and-skinny matrix, or more specifically, an (m * k) and (k * n) matrix-matrix multiplication where m and k are approximately equal, and n is much smaller than k. TSM2L is designed to efficiently multiply a tall-and-skinny matrix by a small square matrix, or more specifically, an (m * k) and (k * n) matrix-matrix multiplication where k is much smaller than m, and k and n are approximately equal.

We propose TSM2R and TSM2L in our preprint, "TSM2X: High-Performance Tall-and-Skinny Matrix-MatrixMultiplication on GPUs." [1]. Our work extends an ICS conference paper [2], which introduces TSM2R, by expanding its techniques for different matrix sizes as well as porting the algorithm to the Nvidia Tesla V100.

We have implemented the kernels as templates, with the parameters t1, t2, and t3 as template variables [1]. The program will select an optimal kernel depending on the size of the input matrices. This repository currently provides a set of optimal kernels for the Nvidia V100 GPU only.

Instructions:

This implementation is designed for Unix platforms, and can be built using make. The usage of this program is: ./multiply [-d] [-i] a.mtx b.mtx c.mtx, where a.mtx and b.mtx are input matrices and c.mtx is an output matrix. -d indicates that the matrices are double-precision, while -i indicates that TSM2L (instead of TSM2R) is to be used.

The format of the matrices is binary, with a structure as follows:

template <typename FloatType>
struct matrixFormat {
    uint32_t rows, cols;
    FloatType values[rows * cols];
};

The matrix is stored in column-major format. All multibyte values are little-endian.

You may use the provided gen.cpp program to generate input matrices. The usage is ./gen [-d] -r ROW_COUNT -c COL_COUNT file, where -d signifies double precision.

You may also use the provided print.cpp program to print matrices. The usage is ./print [-d] file.

To evaluate performance across a range of inputs, a Python3 script test.py is provided. The script can be invoked with python3 test.py. The program requires that ../multiply and ../gen exist, and writes its output to CSV files.

Notes:

[1] Cody Rivera, Jieyang Chen, Nan Xiong, Shuaiwen Leon Song, and Dingwen Tao. "TSM2X: High-Performance Tall-and-Skinny Matrix-MatrixMultiplication on GPUs." 2020. arXiv:2002.03258 [cs.DC].

[2] Jieyang Chen, Nan Xiong, Xin Liang, Dingwen Tao, Sihuan Li, Kaiming Ouyang, Kai Zhao, Nathan DeBardeleben, Qiang Guan, and Zizhong Chen. "TSM2: optimizing tall-and-skinny matrix-matrix multiplication on GPUs." In Proceedings of the ACM International Conference on Supercomputing (ICS), pp. 106-116. ACM, 2019. https://doi.org/10.1145/3330345.3330355

About

Implementation of TSM2L and TSM2R -- High-Performance Tall-and-Skinny Matrix-Matrix Multiplication Algorithms for CUDA

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages

  • Cuda 84.6%
  • Python 6.7%
  • C++ 6.5%
  • Makefile 2.2%