BPLG–BMCS: GPU-sorting algorithm using a tuning skeleton library

Use this link to cite
http://hdl.handle.net/2183/20960
Collections
- Investigación (FIC)
Metadata
Title
BPLG–BMCS: GPU-sorting algorithm using a tuning skeleton library
Date
2017
Citation
Diéguez, A.P., Amor, M. & Doallo, R. J Supercomput (2017) 73: 4. https://doi.org/10.1007/s11227-015-1591-9
Abstract
In this work, we present an efficient and portable sorting operator for GPUs. Specifically, we propose an algorithmic variant of the bitonic merge sort that reduces the number of processing stages and internal steps, increases the workload per thread, and targets multi-batch execution of many small problems. This proposal is well matched to current GPU architectures, and we apply several CUDA optimizations to improve performance. For portability, we use a library based on tunable building blocks. Thanks to this parametrization, the library can easily be tuned for different CUDA GPU architectures. Our proposals obtain competitive performance on two recent NVIDIA GPU architectures, providing an improvement of up to 11,794× over CUDPP and up to 6467× over ModernGPU.
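For reference only, the sketch below shows the classic bitonic sorting network applied to a batch of independent, small, power-of-two-sized problems stored back to back in one array. It is **not** the BPLG–BMCS variant described above, which reduces the number of stages and steps and raises per-thread work. It is the textbook baseline that variant starts from. The function names `bitonic_sort_segment` and `bitonic_sort_batch` are hypothetical, and the contiguous batch layout is an assumption. On a GPU, each index `i` of the innermost loop would typically become one thread and each problem in the batch one thread block (also an assumption). Here everything runs serially on the CPU.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Sorts one segment of length n (n must be a power of two) in ascending order
// using the classic bitonic sorting network (baseline, not BPLG-BMCS).
void bitonic_sort_segment(int* a, std::size_t n) {
    // k: length of the bitonic sequences being merged in this stage.
    for (std::size_t k = 2; k <= n; k <<= 1) {
        // j: compare-exchange distance for this step of the merge.
        for (std::size_t j = k >> 1; j > 0; j >>= 1) {
            // Data-parallel part: every i is independent within one (k, j) step.
            for (std::size_t i = 0; i < n; ++i) {
                std::size_t partner = i ^ j;
                if (partner > i) {
                    // Direction alternates between blocks of size k;
                    // at k == n every block is ascending.
                    bool ascending = (i & k) == 0;
                    if ((a[i] > a[partner]) == ascending) {
                        std::swap(a[i], a[partner]);
                    }
                }
            }
        }
    }
}

// Hypothetical multi-batch driver: data holds data.size() / n independent
// problems of size n, stored contiguously; each one is sorted on its own.
void bitonic_sort_batch(std::vector<int>& data, std::size_t n) {
    assert(n > 0 && (n & (n - 1)) == 0);  // n must be a power of two
    assert(data.size() % n == 0);         // whole problems only
    for (std::size_t b = 0; b < data.size() / n; ++b) {
        bitonic_sort_segment(data.data() + b * n, n);
    }
}
```

For example, calling `bitonic_sort_batch(data, 8)` on a 16-element vector sorts elements 0 to 7 and elements 8 to 15 independently. Batching problems this way keeps every (k, j) step uniform across the whole array, and that uniformity is what makes the network attractive for many small GPU sorts.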
Keywords
GPU
CUDA
Tuning
Building blocks
Bitonic merge sort
Description
This is a post-peer-review, pre-copyedit version of an article published in Journal of Supercomputing. The final authenticated version is available online at: https://doi.org/10.1007/s11227-015-1591-9
ISSN
0920-8542
1573-0484