Show simple item record

dc.contributor.author: López Castro, Roberto
dc.contributor.author: Andrade, Diego
dc.contributor.author: Fraguela, Basilio B.
dc.date.accessioned: 2023-04-18T08:11:12Z
dc.date.available: 2023-04-18T08:11:12Z
dc.date.issued: 2022
dc.identifier.citation: Roberto L. Castro, Diego Andrade, and Basilio B. Fraguela, "Probing the Efficacy of Hardware-Aware Weight Pruning to Optimize the SpMM routine on Ampere GPUs", In International Conference on Parallel Architectures and Compilation Techniques (PACT ’22), October 8–12, 2022, Chicago, IL, USA. ACM, 2022 [Online]. doi: 10.1145/3559009.3569691. Available at: https://doi.org/10.1145/3559009.3569691 [es_ES]
dc.identifier.uri: http://hdl.handle.net/2183/32881
dc.description.abstract: [Abstract]: The Deep Learning (DL) community found in pruning techniques a good way to reduce the models' resource and energy consumption. These techniques lead to smaller sparse models, but sparse computations in GPUs only outperform their dense counterparts for extremely high levels of sparsity. However, pruning up to such sparsity levels can seriously harm the accuracy of the Neural Networks (NNs). To alleviate this, novel performance-aware pruning techniques favor the generation of more regular sparse matrices that can improve the exploitation of the underlying hardware. Nevertheless, an important drawback is that these techniques heavily condition the location of the non-pruned values, which can strongly degrade the accuracy of the models. This paper focuses on improving the performance of the SpMM routine on DL workloads by combining performance-aware pruning with pruning-independent SpMM kernels to relax input-format constraints. We start with a microarchitecture-level performance study of SOTA SpMM implementations to identify their main bottlenecks and flaws. Then, the paper centers on maximizing the performance of the routine by adjusting the parameters of performance-aware pruning techniques to the hardware properties. This second study explains the intrinsic causes of the observed performance results. We show that, following this approach, a generic SpMM routine can perform up to 49% and 77% better for half and single precision, respectively, than using non-performance-aware pruning, providing speedups over cuBLAS of up to 1.87× and 4.20×, respectively. Additionally, the performance achieved on half precision is boosted with a new Ampere-ready specialized implementation for the column-vector sparse format, CLASP, which achieves a 2.42× speedup over cuBLAS. Finally, we also introduce ad-colPrune, a novel pruning technique that widens the design space of possible trade-offs between performance and accuracy. © 2022 Association for Computing Machinery. [es_ES]
dc.description.sponsorship: Xunta de Galicia; ED431C 2021/30 [es_ES]
dc.description.sponsorship: Xunta de Galicia; ED431G 2019/01 [es_ES]
dc.description.sponsorship: This research was supported by the Ministry of Science and Innovation of Spain (PID2019-104184RB-I00, AEI, 10.13039/501100011033), the Ministry of Education (predoctoral grant of Roberto L. Castro, FPU19/03974), and by Xunta de Galicia under the Consolidation Program of Competitive Reference Groups (ED431C 2021/30). We also acknowledge the support from CITIC, funded by Xunta de Galicia and FEDER funds of the EU (Centro de Investigación de Galicia accreditation 2019-2022, ED431G 2019/01). Finally, we acknowledge the Centro de Supercomputación de Galicia (CESGA) for the use of their computers. [es_ES]
dc.language.iso: eng [es_ES]
dc.publisher: Association for Computing Machinery (ACM) [es_ES]
dc.relation: info:eu-repo/grantAgreement/AEI/Plan Estatal de Investigación Científica y Técnica y de Innovación 2017-2020/PID2019-104184RB-I00/ES/DESAFIOS ACTUALES EN HPC: ARQUITECTURAS, SOFTWARE Y APLICACIONES [es_ES]
dc.relation.uri: https://doi.org/10.1145/3559009.3569691 [es_ES]
dc.rights: Atribución 3.0 España [es_ES]
dc.rights: © 2022 Copyright held by the owner/author(s). Publication rights licensed to ACM. [es_ES]
dc.rights.uri: http://creativecommons.org/licenses/by/3.0/es/ [*]
dc.subject: CUDA [es_ES]
dc.subject: Deep learning [es_ES]
dc.subject: GPU [es_ES]
dc.subject: network pruning [es_ES]
dc.subject: sparsity [es_ES]
dc.subject: SpMM [es_ES]
dc.title: Probing the Efficacy of Hardware-Aware Weight Pruning to Optimize the SpMM routine on Ampere GPUs [es_ES]
dc.type: info:eu-repo/semantics/conferenceObject [es_ES]
dc.rights.access: info:eu-repo/semantics/openAccess [es_ES]
UDC.journalTitle: Parallel Architectures and Compilation Techniques - Conference Proceedings, PACT [es_ES]
UDC.startPage: 135 [es_ES]
UDC.endPage: 147 [es_ES]
dc.identifier.doi: 10.1145/3559009.3569691
UDC.conferenceTitle: International Conference on Parallel Architectures and Compilation Techniques, PACT (31st, Chicago, 2022) [es_ES]


Files in this item


This item appears in the following collection(s)
