Pérez Diéguez, Adrián; Amor, Margarita; Doallo, Ramón; Nukada, Akira; Matsuoka, Satoshi
2025-01-22
2025-01-22
2018
A. P. Diéguez, M. Amor, R. Doallo, A. Nukada and S. Matsuoka, "Efficient Solving of Scan Primitive on Multi-GPU Systems," 2018 IEEE International Parallel and Distributed Processing Symposium (IPDPS), Vancouver, BC, Canada, 2018, pp. 794-803, doi: 10.1109/IPDPS.2018.00089.
978153864368-6
1530-2075
http://hdl.handle.net/2183/40835
This version of the article has been accepted for publication, after peer review. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works. The Version of Record is available online at: https://doi.org/10.1109/IPDPS.2018.00089
Presented at: 32nd IEEE International Parallel and Distributed Processing Symposium, IPDPS 2018, Vancouver, 21-25 May 2018
[Abstract]: GPUs fulfill high computation demands, but it is necessary to develop code carefully, selecting algorithms well suited to the GPU architecture and applying different optimizations. This article presents a GPU-suitable algorithm and a tuning strategy for performing the scan primitive over large problem sizes in CUDA. This tuning strategy defines different performance premises to find the GPU execution parameters that maximize performance. Taking these premises into consideration, we easily develop the kernels using CUDA skeletons to ensure efficiency and portability. Based on this, we describe an optimal proposal analyzed over different multiple GPU environments, the first multiple-GPU batch scan proposal to the best of our knowledge. The resulting implementations outperform other well-known libraries in most cases, such as CUDPP, ModernGPU, Thrust, CUB and LightScan.
eng
Copyright © 2018, IEEE
Graphics processing units; Instruction sets; Kernel; Tuning; Peer-to-peer computing; Libraries; Registers
CUDA; MultiGPU; MPI; Scan; Tuning
Efficient Solving of Scan Primitive on Multi-GPU Systems
conference output
open access
10.1109/IPDPS.2018.00089