BPLG–BMCS: GPU-sorting algorithm using a tuning skeleton library

Pérez Diéguez, Adrián; Amor, Margarita; Doallo, Ramón

Title

Author(s)

Pérez Diéguez, Adrián

Amor, Margarita

Doallo, Ramón

Date

2017

Bibliographic citation

Diéguez, A.P., Amor, M. & Doallo, R. J Supercomput (2017) 73: 4. https://doi.org/10.1007/s11227-015-1591-9

Abstract

In this work, we present an efficient and portable sorting operator for GPUs. Specifically, we propose an algorithmic variant of the bitonic merge sort that reduces the number of processing stages and internal steps, increases the workload per thread, and targets multi-batch execution of many small problems. The proposal is well matched to current GPU architectures, and we apply several CUDA optimizations to improve performance. For portability, we use a library based on tuning building blocks. Thanks to this parametrization, the library can easily be tuned for different CUDA GPU architectures. Our proposals obtain competitive performance on two recent NVIDIA GPU architectures, providing an improvement of up to 11,794 × over CUDPP and up to 6467 × over ModernGPU.
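
For orientation, the baseline that the abstract refers to can be sketched as follows. This is a minimal CPU sketch of the textbook bitonic sorting network applied to a batch of small, equally sized arrays. It is not the stage-reducing variant proposed in the paper and does not use the paper's library; the function name `bitonic_sort_batch` and the requirement that every row share one power-of-two length are assumptions made purely for illustration.

```python
from typing import List


def bitonic_sort_batch(batch: List[List[int]]) -> List[List[int]]:
    """Sort every row of `batch` in ascending order with a bitonic network.

    Hypothetical helper for illustration only. Every row must have the
    same power-of-two length.
    """
    if not batch:
        return []
    n = len(batch[0])
    if n == 0 or n & (n - 1) or any(len(row) != n for row in batch):
        raise ValueError("rows must share one power-of-two length")

    rows = [list(row) for row in batch]  # work on copies
    k = 2
    while k <= n:              # stage: merge bitonic runs of length k
        j = k // 2
        while j >= 1:          # step: compare elements that are j apart
            for row in rows:           # on a GPU: independent problems of the batch
                for i in range(n):     # on a GPU: indices shared among threads
                    partner = i ^ j
                    if partner > i:
                        ascending = (i & k) == 0
                        # Swap when the pair violates the direction of its run.
                        if (row[i] > row[partner]) == ascending:
                            row[i], row[partner] = row[partner], row[i]
            j //= 2
        k *= 2
    return rows


if __name__ == "__main__":
    print(bitonic_sort_batch([[3, 1, 2, 0], [7, 5, 6, 4]]))
```

The two outer `while` loops correspond to the stages and internal steps mentioned in the abstract: the textbook network needs log2(n) stages and log2(n)·(log2(n)+1)/2 compare-exchange steps in total. The two inner `for` loops are the part a GPU would execute in parallel, which is where per-thread workload and multi-batch execution come into play.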

Keywords

GPU
CUDA
Tuning
Building blocks
Bitonic merge sort

Description

This is a post-peer-review, pre-copyedit version of an article published in Journal of Supercomputing. The final authenticated version is available online at: https://doi.org/10.1007/s11227-015-1591-9