Show simple item record
Probing the Efficacy of Hardware-Aware Weight Pruning to Optimize the SpMM routine on Ampere GPUs
dc.contributor.author | López Castro, Roberto | |
dc.contributor.author | Andrade, Diego | |
dc.contributor.author | Fraguela, Basilio B. | |
dc.date.accessioned | 2023-04-18T08:11:12Z | |
dc.date.available | 2023-04-18T08:11:12Z | |
dc.date.issued | 2022 | |
dc.identifier.citation | Roberto L. Castro, Diego Andrade, and Basilio B. Fraguela, "Probing the Efficacy of Hardware-Aware Weight Pruning to Optimize the SpMM routine on Ampere GPUs", In International Conference on Parallel Architectures and Compilation Techniques (PACT ’22), October 8–12, 2022, Chicago, IL, USA. ACM, 2022 [Online]. doi: 10.1145/3559009.3569691. Available at: https://doi.org/10.1145/3559009.3569691 | es_ES |
dc.identifier.uri | http://hdl.handle.net/2183/32881 | |
dc.description.abstract | [Abstract]: The Deep Learning (DL) community found in pruning techniques a good way to reduce the models' resource and energy consumption. These techniques lead to smaller sparse models, but sparse computations in GPUs only outperform their dense counterparts for extremely high levels of sparsity. However, pruning up to such sparsity levels can seriously harm the accuracy of the Neural Networks (NNs). To alleviate this, novel performance-aware pruning techniques favor the generation of more regular sparse matrices that can improve the exploitation of the underlying hardware. Nevertheless, an important drawback is that these techniques heavily condition the location of the non-pruned values, which can strongly degrade the accuracy of the models. This paper focuses on improving the performance of the SpMM routine on DL workloads by combining performance-aware pruning with pruning-independent SpMM kernels to relax input-format constraints. We start with a microarchitecture-level performance study of SOTA SpMM implementations to identify their main bottlenecks and flaws. Then, the paper centers on maximizing the performance of the routine by adjusting the parameters of performance-aware pruning techniques to the hardware properties. This second study explains the intrinsic causes of the observed performance results. We show that, following this approach, a generic SpMM routine can perform up to 49% and 77% better for half and single precision, respectively, than using non-performance-aware pruning, providing speedups over cuBLAS of up to 1.87× and 4.20×, respectively. Additionally, the performance achieved on half precision is boosted with a new Ampere-ready specialized implementation for the column-vector sparse format, CLASP, which achieves a 2.42× speedup over cuBLAS. Finally, we also introduce ad-colPrune, a novel pruning technique that widens the design space of possible trade-offs between performance and accuracy.
© 2022 Association for Computing Machinery. | es_ES |
dc.description.sponsorship | Xunta de Galicia; ED431C 2021/30 | es_ES |
dc.description.sponsorship | Xunta de Galicia; ED431G 2019/01 | es_ES |
dc.description.sponsorship | This research was supported by the Ministry of Science and Innovation of Spain (PID2019-104184RB-I00, AEI, 10.13039/501100011033), the Ministry of Education (predoctoral grant of Roberto L. Castro, FPU19/03974), and by Xunta de Galicia under the Consolidation Program of Competitive Reference Groups (ED431C 2021/30). We also acknowledge the support from CITIC, funded by Xunta de Galicia and FEDER funds of the EU (Centro de Investigación de Galicia accreditation 2019-2022, ED431G 2019/01). Finally, we acknowledge the Centro de Supercomputación de Galicia (CESGA) for the use of their computers. | es_ES |
dc.language.iso | eng | es_ES |
dc.publisher | Association for Computing Machinery | es_ES |
dc.relation | info:eu-repo/grantAgreement/AEI/Plan Estatal de Investigación Científica y Técnica y de Innovación 2017-2020/PID2019-104184RB-I00/ES/DESAFIOS ACTUALES EN HPC: ARQUITECTURAS, SOFTWARE Y APLICACIONES | es_ES |
dc.relation.uri | https://doi.org/10.1145/3559009.3569691 | es_ES |
dc.rights | Atribución 3.0 España | es_ES |
dc.rights | © 2022 Copyright held by the owner/author(s). Publication rights licensed to ACM. | es_ES |
dc.rights.uri | http://creativecommons.org/licenses/by/3.0/es/ | * |
dc.subject | CUDA | es_ES |
dc.subject | Deep learning | es_ES |
dc.subject | GPU | es_ES |
dc.subject | network pruning | es_ES |
dc.subject | sparsity | es_ES |
dc.subject | SpMM | es_ES |
dc.title | Probing the Efficacy of Hardware-Aware Weight Pruning to Optimize the SpMM routine on Ampere GPUs | es_ES |
dc.type | info:eu-repo/semantics/conferenceObject | es_ES |
dc.rights.access | info:eu-repo/semantics/openAccess | es_ES |
UDC.journalTitle | Parallel Architectures and Compilation Techniques - Conference Proceedings, PACT | es_ES |
UDC.startPage | 135 | es_ES |
UDC.endPage | 147 | es_ES |
dc.identifier.doi | 10.1145/3559009.3569691 | |
UDC.conferenceTitle | International Conference on Parallel Architectures and Compilation Techniques, PACT (31st. Chicago. 2022) | es_ES |