Parallel feature selection for distributed-memory clusters

Use this link to cite
http://hdl.handle.net/2183/34381
Except where otherwise noted, this item's license is described as Attribution-NonCommercial-NoDerivs 3.0 Spain
Metadata
Title
Parallel feature selection for distributed-memory clusters
Date
2019
Citation
González-Domínguez, J., Bolón-Canedo, V., Freire, B., & Touriño, J. (2019). Parallel feature selection for distributed-memory clusters. Information Sciences, 496, 399–409. https://doi.org/10.1016/j.ins.2019.01.050
Is version of
https://doi.org/10.1016/j.ins.2019.01.050
Abstract
Feature selection has become an essential data mining stage in machine learning because of the growing number of high-dimensional problems. Of the many feature selection methods in the literature, mRMR (minimum-Redundancy-Maximum-Relevance) is one of the most widely used. However, although it selects relevant features well, it is impractical for datasets with thousands of features. A possible solution to this limitation is fast-mRMR, a greedy optimization of the mRMR algorithm that improves both scalability and efficiency. In this work we present fast-mRMR-MPI, a novel hybrid parallel implementation that uses MPI and OpenMP to accelerate feature selection on distributed-memory clusters. Our performance evaluation on two different systems with five representative input datasets shows that fast-mRMR-MPI is significantly faster than fast-mRMR while providing the same results. As an example, our tool needs less than one minute to select 200 features from a dataset with more than four million features and 16,000 samples on a cluster with 32 nodes (768 cores in total), while the sequential fast-mRMR required more than eight hours. Moreover, fast-mRMR-MPI distributes the data so that it can exploit the memory available on different nodes of a cluster and thus complete analyses that fail on a single node due to memory constraints. Our tool is publicly available at https://github.com/borjaf696/Fast-mRMR.
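The greedy selection that mRMR performs can be illustrated with a minimal sequential sketch. This is not the authors' implementation and contains no MPI or OpenMP code. It assumes discrete features stored as Python lists, the difference form of the criterion (relevance minus mean redundancy), and a running redundancy sum per candidate so that each step only compares candidates with the most recently selected feature. The names `mutual_information` and `mrmr_select` and the column-wise data layout are hypothetical, and the abstract does not state whether fast-mRMR uses exactly this caching scheme.

```python
from collections import Counter
from math import log2


def mutual_information(x, y):
    """Mutual information (in bits) of two equal-length discrete sequences."""
    n = len(x)
    px, py, pxy = Counter(x), Counter(y), Counter(zip(x, y))
    # p(a,b) * log2(p(a,b) / (p(a) * p(b))), written with raw counts.
    return sum(
        (c / n) * log2(c * n / (px[a] * py[b]))
        for (a, b), c in pxy.items()
    )


def mrmr_select(features, labels, k):
    """Greedy mRMR sketch: pick k feature indices, one per iteration.

    features: list of columns, each a list of discrete values (assumed layout).
    Score of a candidate = relevance - mean redundancy with selected features.
    """
    relevance = [mutual_information(col, labels) for col in features]
    # Running sum of MI between each candidate and the selected features.
    redundancy_sum = [0.0] * len(features)
    candidates = set(range(len(features)))
    selected = []

    while candidates and len(selected) < k:
        if selected:
            # Only the newest selected feature adds a new redundancy term.
            last = features[selected[-1]]
            for j in candidates:
                redundancy_sum[j] += mutual_information(features[j], last)

        def score(j):
            mean_red = redundancy_sum[j] / len(selected) if selected else 0.0
            return relevance[j] - mean_red

        # sorted() makes tie-breaking deterministic (lowest index wins).
        best = max(sorted(candidates), key=score)
        selected.append(best)
        candidates.remove(best)

    return selected


if __name__ == "__main__":
    a = [0, 0, 1, 1]
    b = [0, 1, 0, 1]
    labels = [0, 1, 2, 3]          # labels = 2*a + b
    columns = [a, list(a), b]      # column 1 duplicates column 0
    print(mrmr_select(columns, labels, 2))
```

In the toy data, all three columns are equally relevant to the labels, but the duplicate of the first column is fully redundant once the first column is chosen, so the second pick falls on the independent column. In a distributed setting, the per-candidate loops are the natural units to split across processes and threads, since each candidate's score can be computed independently.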
Keywords
Machine learning
Feature selection
High performance computing
Parallel computing
Description
Accepted final version of: https://doi.org/10.1016/j.ins.2019.01.050 This manuscript version is made available under the CC-BY-NC-ND 4.0 license https://creativecommons.org/licenses/by-nc-nd/4.0/. This version of the article, González-Domínguez, J. et al. (2019) ‘Parallel feature selection for distributed-memory clusters’, has been accepted for publication in Information Sciences, 496, pp. 399–409. The Version of Record is available online at: https://doi.org/10.1016/j.ins.2019.01.050
Rights
Attribution-NonCommercial-NoDerivs 3.0 Spain
CC-BY-NC-ND 4.0 license https://creativecommons.org/licenses/by-nc-nd/4.0/