Probing the Efficacy of Hardware-Aware Weight Pruning to Optimize the SpMM routine on Ampere GPUs
Use this link to cite
http://hdl.handle.net/2183/32881
Title
Probing the Efficacy of Hardware-Aware Weight Pruning to Optimize the SpMM routine on Ampere GPUs
Date
2022
Bibliographic citation
Roberto L. Castro, Diego Andrade, and Basilio B. Fraguela, "Probing the Efficacy of Hardware-Aware Weight Pruning to Optimize the SpMM routine on Ampere GPUs", In International Conference on Parallel Architectures and Compilation Techniques (PACT ’22), October 8–12, 2022, Chicago, IL, USA. ACM, 2022 [Online]. doi: 10.1145/3559009.3569691. Available at: https://doi.org/10.1145/3559009.3569691
Abstract
The Deep Learning (DL) community has found in pruning techniques a good way to reduce the resource and energy consumption of models. These techniques lead to smaller sparse models, but sparse computations on GPUs only outperform their dense counterparts at extremely high levels of sparsity. However, pruning up to such sparsity levels can seriously harm the accuracy of the Neural Networks (NNs). To alleviate this, novel performance-aware pruning techniques favor the generation of more regular sparse matrices that can improve the exploitation of the underlying hardware. Nevertheless, an important drawback is that these techniques heavily condition the location of the non-pruned values, which can strongly degrade the accuracy of the models. This paper focuses on improving the performance of the SpMM routine on DL workloads by combining performance-aware pruning with pruning-independent SpMM kernels to relax input-format constraints. We start with a microarchitecture-level performance study of SOTA SpMM implementations to identify their main bottlenecks and flaws. Then, the paper centers on maximizing the performance of the routine by adjusting the parameters of performance-aware pruning techniques to the hardware properties. This second study explains the intrinsic causes of the observed performance results. We show that, following this approach, a generic SpMM routine can perform up to 49% and 77% better for half and single precision, respectively, than using non-performance-aware pruning, providing speedups over cuBLAS of up to 1.87× and 4.20×, respectively. Additionally, the performance achieved on half precision is boosted with a new Ampere-ready specialized implementation for the column-vector sparse format, CLASP, which achieves a 2.42× speedup over cuBLAS. Finally, we also introduce ad-colPrune, a novel pruning technique that widens the design space of possible trade-offs between performance and accuracy. © 2022 Association for Computing Machinery.
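For context, SpMM here denotes the multiplication of a sparse weight matrix by a dense activation matrix. Below is a minimal, illustrative CUDA sketch of a naive CSR-based SpMM kernel (C = A × B with A sparse); it is not the paper's CLASP implementation nor the column-vector sparse format, and the names and launch configuration shown are assumptions chosen only for illustration.

// Naive CSR SpMM sketch (illustrative only, not the paper's CLASP kernel):
// C (M x N, dense, row-major) = A (M x K, sparse, CSR) * B (K x N, dense, row-major)
__global__ void spmm_csr_naive(int M, int N,
                               const int*   rowPtr,   // CSR row offsets, size M+1
                               const int*   colIdx,   // CSR column indices of non-zeros
                               const float* values,   // CSR non-zero values
                               const float* B,        // dense input, K x N
                               float*       C)        // dense output, M x N
{
    int row = blockIdx.y;                              // one block row per sparse row of A
    int col = blockIdx.x * blockDim.x + threadIdx.x;   // one thread per output column
    if (row >= M || col >= N) return;

    float acc = 0.0f;
    for (int i = rowPtr[row]; i < rowPtr[row + 1]; ++i)
        acc += values[i] * B[colIdx[i] * N + col];     // gather the rows of B selected by A's non-zeros
    C[row * N + col] = acc;
}

// Assumed launch configuration:
//   dim3 block(128); dim3 grid((N + 127) / 128, M);
//   spmm_csr_naive<<<grid, block>>>(M, N, rowPtr, colIdx, values, B, C);

A generic kernel like this accepts any sparsity pattern; the paper's approach instead constrains the pattern through performance-aware pruning and specializes the kernel (CLASP) so that the underlying hardware is better exploited.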
Keywords
CUDA
Deep learning
GPU
network pruning
sparsity
SpMM
Publisher's version
Rights
Attribution 3.0 Spain © 2022 Copyright held by the owner/author(s). Publication rights licensed to ACM.