Show simple item record
Probing the Efficacy of Hardware-Aware Weight Pruning to Optimize the SpMM routine on Ampere GPUs
dc.contributor.author | López Castro, Roberto | |
dc.contributor.author | Andrade, Diego | |
dc.contributor.author | Fraguela, Basilio B. | |
dc.date.accessioned | 2023-04-18T08:11:12Z | |
dc.date.available | 2023-04-18T08:11:12Z | |
dc.date.issued | 2022 | |
dc.identifier.citation | Roberto L. Castro, Diego Andrade, and Basilio B. Fraguela, "Probing the Efficacy of Hardware-Aware Weight Pruning to Optimize the SpMM routine on Ampere GPUs", In International Conference on Parallel Architectures and Compilation Techniques (PACT ’22), October 8–12, 2022, Chicago, IL, USA. ACM, 2022 [Online]. doi: 10.1145/3559009.3569691. Available at: https://doi.org/10.1145/3559009.3569691 | es_ES |
dc.identifier.uri | http://hdl.handle.net/2183/32881 | |
dc.description.abstract | [Abstract]: The Deep Learning (DL) community found in pruning techniques a good way to reduce the models' resource and energy consumption. These techniques lead to smaller sparse models, but sparse computations in GPUs only outperform their dense counterparts for extremely high levels of sparsity. However, pruning up to such sparsity levels can seriously harm the accuracy of the Neural Networks (NNs). To alleviate this, novel performance-aware pruning techniques favor the generation of more regular sparse matrices that can improve the exploitation of the underlying hardware. Nevertheless, an important drawback is that these techniques heavily condition the location of the non-pruned values, which can strongly degrade the accuracy of the models. This paper focuses on improving the performance of the SpMM routine on DL workloads by combining performance-aware pruning with pruning-independent SpMM kernels to relax input-format constraints. We start with a microarchitecture-level performance study of SOTA SpMM implementations to identify their main bottlenecks and flaws. Then, the paper centers on maximizing the performance of the routine by adjusting the parameters of performance-aware pruning techniques to the hardware properties. This second study explains the intrinsic causes of the observed performance results. We show that, following this approach, a generic SpMM routine can perform up to 49% and 77% better for half and single precision, respectively, than using non-performance-aware pruning, providing speedups over cuBLAS of up to 1.87× and 4.20×, respectively. Additionally, the performance achieved on half precision is boosted with a new Ampere-ready specialized implementation for the column-vector sparse format, CLASP, which achieves a 2.42× speedup over cuBLAS. Finally, we also introduce ad-colPrune, a novel pruning technique that widens the design space of possible trade-offs between performance and accuracy.
© 2022 Association for Computing Machinery. | es_ES |
dc.description.sponsorship | Xunta de Galicia; ED431C 2021/30 | es_ES |
dc.description.sponsorship | Xunta de Galicia; ED431G 2019/01 | es_ES |
dc.description.sponsorship | This research was supported by the Ministry of Science and Innovation of Spain (PID2019-104184RB-I00, AEI, 10.13039/501100011033), the Ministry of Education (predoctoral grant of Roberto L. Castro, FPU19/03974), and by Xunta de Galicia under the Consolidation Program of Competitive Reference Groups (ED431C 2021/30). We also acknowledge the support from CITIC, funded by Xunta de Galicia and FEDER funds of the EU (Centro de Investigación de Galicia accreditation 2019-2022, ED431G 2019/01). Finally, we acknowledge the Centro de Supercomputación de Galicia (CESGA) for the use of their computers. | es_ES |
dc.language.iso | eng | es_ES |
dc.publisher | Association for Computing Machinery | es_ES |
dc.relation | info:eu-repo/grantAgreement/AEI/Plan Estatal de Investigación Científica y Técnica y de Innovación 2017-2020/PID2019-104184RB-I00/ES/DESAFIOS ACTUALES EN HPC: ARQUITECTURAS, SOFTWARE Y APLICACIONES | es_ES |
dc.relation.uri | https://doi.org/10.1145/3559009.3569691 | es_ES |
dc.rights | Atribución 3.0 España | es_ES |
dc.rights | © 2022 Copyright held by the owner/author(s). Publication rights licensed to ACM. | es_ES |
dc.rights.uri | http://creativecommons.org/licenses/by/3.0/es/ | * |
dc.subject | CUDA | es_ES |
dc.subject | Deep learning | es_ES |
dc.subject | GPU | es_ES |
dc.subject | network pruning | es_ES |
dc.subject | sparsity | es_ES |
dc.subject | SpMM | es_ES |
dc.title | Probing the Efficacy of Hardware-Aware Weight Pruning to Optimize the SpMM routine on Ampere GPUs | es_ES |
dc.type | info:eu-repo/semantics/conferenceObject | es_ES |
dc.rights.access | info:eu-repo/semantics/openAccess | es_ES |
UDC.journalTitle | Parallel Architectures and Compilation Techniques - Conference Proceedings, PACT | es_ES |
UDC.startPage | 135 | es_ES |
UDC.endPage | 147 | es_ES |
dc.identifier.doi | 10.1145/3559009.3569691 | |
UDC.conferenceTitle | International Conference on Parallel Architectures and Compilation Techniques, PACT (31st. Chicago. 2022) | es_ES |