Authors: Castro, Roberto L.; Andrade, Diego; Fraguela, Basilio B.
Date accessioned: 2025-10-20
Date available: 2025-10-20
Date issued: 2025-07
Citation: R. L. Castro, D. Andrade, and B. B. Fraguela, "Adapt-S: Effective DNN Pruning via Unified Accuracy and Performance Tuning", 2025 IEEE International Parallel and Distributed Processing Symposium (IPDPS), 03-07 June 2025, Milan, Italy, doi: 10.1109/IPDPS64566.2025.00018
ISBN: 979-8-3315-3237-6
ISSN: 1530-2075
URI: https://hdl.handle.net/2183/46024
Description: Work presented at: 2025 IEEE International Parallel and Distributed Processing Symposium (IPDPS), 03-07 June 2025, Milan, Italy
Rights: © 2025 IEEE. This version of the paper has been accepted for publication. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works. The final published paper is available online at: https://doi.org/10.1109/IPDPS64566.2025.00018

[Abstract]: Model sparsification has emerged as a promising approach to reducing model size with minimal impact on accuracy. It works by removing some model parameters, a process also known as Deep Neural Network (DNN) pruning. The irregular nature of the resulting sparse tensors poses a great challenge to the development of efficient GPU kernels optimized for these workloads. This challenge has recently been addressed through hardware-aware semi-structured sparsification methods designed to conform to specialized sparse formats and co-designed with template-based kernel implementations. These methods are commonly based either on grouping the non-pruned values into blocks of a given size to create regularity, or on generating patterns that fit specialized hardware units.
This pruning pattern-format-kernel triplet offers a high degree of tunability, on both the pruning and kernel sides, which can be exploited to meet specific accuracy-to-performance trade-offs. On the pruning side, using larger blocks of consecutive non-pruned values favors performance over accuracy, as the policy for selecting which weights to remove becomes less flexible. On the kernel side, recent studies have shown that tuning the configuration parameters of template-based multi-level tiling kernel implementations can yield an extra performance boost. This paper presents Adapt-S, an autotuning system that generates DNN pruning recipes and optimized kernel configurations to meet an accuracy-to-performance specification, through a cost model that integrates both aspects. Adapt-S obtains additional benefits by exploiting layer sensitivity, providing per-layer pruning recipes and kernel configurations. The results show that our approach achieves superior accuracy-to-performance trade-offs and that it can be used to produce models that fit user requirements.

Language: English
Rights: © 2025, IEEE
Keywords: DNN Pruning; Tensor Core Unit; GPU programming; CUDA
Title: Adapt-S: Effective DNN Pruning via Unified Accuracy and Performance Tuning
Type: conference output
Access: open access
DOI: 10.1109/IPDPS64566.2025.00018
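The hardware-friendly semi-structured patterns mentioned in the abstract can be illustrated with a minimal magnitude-pruning sketch. This is a generic N:M pruning example (e.g. the 2:4 pattern supported by NVIDIA Sparse Tensor Cores), not the Adapt-S implementation; the function name and parameters are hypothetical:

```python
import numpy as np

def prune_n_m(weights, n=2, m=4):
    """Semi-structured N:M magnitude pruning: in every group of m
    consecutive weights along a row, keep only the n largest-magnitude
    values and zero out the rest. (Illustrative helper, not Adapt-S.)"""
    w = weights.reshape(-1, m)                        # groups of m values
    # Indices of the (m - n) smallest-magnitude entries in each group.
    drop = np.argsort(np.abs(w), axis=1)[:, : m - n]
    mask = np.ones_like(w, dtype=bool)
    np.put_along_axis(mask, drop, False, axis=1)      # zero the dropped slots
    return (w * mask).reshape(weights.shape)

rng = np.random.default_rng(0)
w = rng.normal(size=(4, 8)).astype(np.float32)
pruned = prune_n_m(w, n=2, m=4)
# Every group of 4 consecutive weights now holds exactly 2 non-zeros.
assert all(np.count_nonzero(g) == 2 for g in pruned.reshape(-1, 4))
```

Larger block sizes (the other pattern family the abstract describes) would instead keep or drop whole runs of consecutive weights, which is cheaper for the kernel but constrains the selection policy, illustrating the accuracy-to-performance trade-off that Adapt-S tunes.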