Parallel feature selection for distributed-memory clusters

Use this link to cite
http://hdl.handle.net/2183/34381
Except where otherwise noted, this item's license is described as Attribution-NonCommercial-NoDerivs 3.0 Spain
Collections
- Research (FIC) [1634]
Metadata
Title
Parallel feature selection for distributed-memory clusters
Date
2019
Bibliographic citation
González-Domínguez, J., Bolón-Canedo, V., Freire, B., & Touriño, J. (2019). Parallel feature selection for distributed-memory clusters. Information Sciences, 496, 399–409. https://doi.org/10.1016/j.ins.2019.01.050
Is version of
https://doi.org/10.1016/j.ins.2019.01.050
Abstract
Feature selection is nowadays an extremely important data mining stage in the field of machine learning due to the appearance of problems of high dimensionality. In the literature there are numerous feature selection methods, mRMR (minimum-Redundancy-Maximum-Relevance) being one of the most widely used. However, although it achieves good results in selecting relevant features, it is impractical for datasets with thousands of features. A possible solution to this limitation is the use of the fast-mRMR method, a greedy optimization of the mRMR algorithm that improves both scalability and efficiency. In this work we present fast-mRMR-MPI, a novel hybrid parallel implementation that uses MPI and OpenMP to accelerate feature selection on distributed-memory clusters. Our performance evaluation on two different systems using five representative input datasets shows that fast-mRMR-MPI is significantly faster than fast-mRMR while providing the same results. As an example, our tool needs less than one minute to select 200 features of a dataset with more than four million features and 16,000 samples on a cluster with 32 nodes (768 cores in total), while the sequential fast-mRMR required more than eight hours. Moreover, fast-mRMR-MPI distributes data so that it is able to exploit the memory available on different nodes of a cluster and then complete analyses that fail on a single node due to memory constraints. Our tool is publicly available at https://github.com/borjaf696/Fast-mRMR.
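For readers unfamiliar with the greedy minimum-Redundancy-Maximum-Relevance idea named in the abstract, the following is a minimal sequential sketch in Python. It is not the fast-mRMR-MPI implementation: it assumes discrete features, mutual information as both the relevance and the redundancy measure, and the common "relevance minus mean redundancy" score. The names mutual_information and greedy_mrmr and the toy data are hypothetical, chosen only for illustration.

```python
from collections import Counter
from math import log2


def mutual_information(x, y):
    """Mutual information (in bits) between two equal-length discrete sequences."""
    n = len(x)
    joint = Counter(zip(x, y))
    px = Counter(x)
    py = Counter(y)
    mi = 0.0
    for (a, b), count in joint.items():
        # p(a,b) * log2(p(a,b) / (p(a) * p(b))), written with raw counts
        mi += (count / n) * log2(count * n / (px[a] * py[b]))
    return mi


def greedy_mrmr(features, labels, k):
    """Return the indices of k features; `features` is a list of columns.

    Assumed score: relevance minus mean redundancy with the features
    already selected. Ties are broken in favour of the lowest index.
    """
    relevance = [mutual_information(col, labels) for col in features]
    # Running sum of redundancy with the selected set, updated each round
    # so that earlier pairwise values are not recomputed.
    redundancy_sum = [0.0] * len(features)
    selected = []
    candidates = set(range(len(features)))

    while candidates and len(selected) < k:
        def score(j):
            mean_red = redundancy_sum[j] / len(selected) if selected else 0.0
            return relevance[j] - mean_red

        best = max(sorted(candidates), key=score)
        selected.append(best)
        candidates.remove(best)
        for j in candidates:
            redundancy_sum[j] += mutual_information(features[j], features[best])
    return selected


# Toy data: f1 duplicates f0, f2 is independent of f0, labels = f0 + f2.
f0 = [0, 0, 0, 0, 1, 1, 1, 1]
f1 = list(f0)
f2 = [0, 0, 1, 1, 0, 0, 1, 1]
labels = [a + b for a, b in zip(f0, f2)]
print(greedy_mrmr([f0, f1, f2], labels, 2))  # [0, 2]
```

In the toy data all three features are equally relevant to the labels, so the first pick falls to index 0 by the tie-break. The duplicate f1 is then penalised for its full redundancy with f0, and the independent f2 is chosen instead. Each round scores every remaining candidate, which is why the cost grows quickly with the number of features.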
Keywords
Machine learning
Feature selection
High performance computing
Parallel computing
Description
Final accepted version of: https://doi.org/10.1016/j.ins.2019.01.050. This manuscript version is made available under the CC-BY-NC-ND 4.0 license, https://creativecommons.org/licenses/by-nc-nd/4.0/. This version of the article, González-Domínguez, J. et al. (2019) ‘Parallel feature selection for distributed-memory clusters’, has been accepted for publication in Information Sciences, 496, pp. 399–409. The Version of Record is available online at: https://doi.org/10.1016/j.ins.2019.01.050
Publisher's version
Rights
Attribution-NonCommercial-NoDerivs 3.0 Spain; CC-BY-NC-ND 4.0 license, https://creativecommons.org/licenses/by-nc-nd/4.0/