VENOM: A Vectorized N:M Format for Unleashing the Power of Sparse Tensor Cores
Use this link to cite
http://hdl.handle.net/2183/35468
Collections
- GI-GAC - Congresses, conferences, etc. [53]
- OpenAIRE [266]
Metadata
Title
VENOM: A Vectorized N:M Format for Unleashing the Power of Sparse Tensor Cores
Author(s)
Roberto L. Castro, Andrei Ivanov, Diego Andrade, Tal Ben-Nun, Basilio B. Fraguela, Torsten Hoefler
Date
2023-11
Bibliographic citation
Roberto L. Castro, Andrei Ivanov, Diego Andrade, Tal Ben-Nun, Basilio B. Fraguela, and Torsten Hoefler. 2023. VENOM: A Vectorized N:M Format for Unleashing the Power of Sparse Tensor Cores. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC '23). Association for Computing Machinery, New York, NY, USA, Article 72, 1–14. https://doi.org/10.1145/3581784.3607087
Is version of
https://doi.org/10.1145/3581784.3607087
Abstract
The increasing success and scaling of Deep Learning models demand higher computational efficiency and power. Sparsification can lead to both smaller models and higher compute efficiency, and accelerated hardware is becoming available. However, exploiting it efficiently requires kernel implementations, pruning algorithms, and storage formats that utilize the hardware support of specialized sparse vector units. One example is NVIDIA's Sparse Tensor Cores (SPTCs), which promise a 2× speedup. However, SPTCs only support the 2:4 format, limiting achievable sparsity ratios to 50%. We present the V:N:M format, which enables the execution of arbitrary N:M ratios on SPTCs. To efficiently exploit the resulting format, we propose Spatha, a high-performance sparse library for DL routines. We show that Spatha achieves up to 37× speedup over cuBLAS. We also demonstrate a second-order pruning technique that enables sparsification to high sparsity ratios with V:N:M and little to no loss in accuracy in modern transformers.
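To make the N:M notation in the abstract concrete, the following is a minimal sketch of generic magnitude-based N:M pruning: in every group of M consecutive weights, only the N largest-magnitude values are kept. It is an illustration only. It is not the V:N:M format, the Spatha library, or the second-order pruning technique described in the paper, and the names `prune_n_m` and `sparsity` are hypothetical helpers introduced here.

```python
def prune_n_m(row, n=2, m=4):
    """Keep the n largest-magnitude values in every group of m consecutive entries.

    Hypothetical helper for illustration; plain magnitude pruning is assumed,
    not the pruning method used in the paper.
    """
    if len(row) % m != 0:
        raise ValueError("row length must be a multiple of m")
    pruned = []
    for start in range(0, len(row), m):
        group = row[start:start + m]
        # Positions of the n entries with the largest absolute value in this group.
        order = sorted(range(m), key=lambda i: abs(group[i]), reverse=True)
        keep = set(order[:n])
        # Entries outside the kept positions are zeroed out.
        pruned.extend(v if i in keep else 0.0 for i, v in enumerate(group))
    return pruned


def sparsity(row):
    """Fraction of entries that are exactly zero."""
    return sum(1 for v in row if v == 0) / len(row)


weights = [0.9, -0.1, 0.05, -0.7, 0.2, 0.3, -0.4, 0.01]
pruned_2_4 = prune_n_m(weights, n=2, m=4)  # 2 of every 4 kept: 50% sparsity
pruned_2_8 = prune_n_m(weights, n=2, m=8)  # 2 of every 8 kept: 75% sparsity
```

With the 2:4 pattern, half of the entries are always zero, which is the 50% limit the abstract mentions. A pattern such as 2:8 reaches a higher sparsity ratio, which is the kind of ratio the abstract says V:N:M makes executable on SPTCs.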
Keywords
Sparse Tensor Cores
GPU
Pruning
Sparsification
CUDA
Description
© 2023 Authors | ACM. This is the author's version of the work. It is posted here
for your personal use. Not for redistribution. The definitive Version of Record was
published in International Conference for High Performance Computing,
Networking, Storage and Analysis, https://doi.org/10.1145/3581784.3607087
Publisher's version
Rights
© 2023 Authors | ACM.