Parallel feature selection for distributed-memory clusters

González-Domínguez, Jorge; Bolón-Canedo, Verónica; Freire, Borja; Touriño, Juan

dc.contributor.author	González-Domínguez, Jorge
dc.contributor.author	Bolón-Canedo, Verónica
dc.contributor.author	Freire, Borja
dc.contributor.author	Touriño, Juan
dc.date.accessioned	2023-11-29T19:04:58Z
dc.date.available	2023-11-29T19:04:58Z
dc.date.issued	2019
dc.identifier.citation	González-Domínguez, J., Bolón-Canedo, V., Freire, B., & Touriño, J. (2019). Parallel feature selection for distributed-memory clusters. Information Sciences, 496, 399–409. https://doi.org/10.1016/j.ins.2019.01.050	es_ES
dc.identifier.uri	http://hdl.handle.net/2183/34381
dc.description	Final accepted version of: https://doi.org/10.1016/j.ins.2019.01.050	es_ES
dc.description	This manuscript version is made available under the CC-BY-NC-ND 4.0 license: https://creativecommons.org/licenses/by-nc-nd/4.0/. This version of the article, González-Domínguez, J. et al. (2019) ‘Parallel feature selection for distributed-memory clusters’, has been accepted for publication in Information Sciences, 496, pp. 399–409. The Version of Record is available online at: https://doi.org/10.1016/j.ins.2019.01.050	es_ES
dc.description.abstract	[Abstract]: Feature selection is nowadays an extremely important data mining stage in the field of machine learning due to the appearance of problems of high dimensionality. In the literature there are numerous feature selection methods, mRMR (minimum-Redundancy-Maximum-Relevance) being one of the most widely used. However, although it achieves good results in selecting relevant features, it is impractical for datasets with thousands of features. A possible solution to this limitation is the use of the fast-mRMR method, a greedy optimization of the mRMR algorithm that improves both scalability and efficiency. In this work we present fast-mRMR-MPI, a novel hybrid parallel implementation that uses MPI and OpenMP to accelerate feature selection on distributed-memory clusters. Our performance evaluation on two different systems using five representative input datasets shows that fast-mRMR-MPI is significantly faster than fast-mRMR while providing the same results. As an example, our tool needs less than one minute to select 200 features of a dataset with more than four million features and 16,000 samples on a cluster with 32 nodes (768 cores in total), while the sequential fast-mRMR required more than eight hours. Moreover, fast-mRMR-MPI distributes data so that it is able to exploit the memory available on different nodes of a cluster and then complete analyses that fail on a single node due to memory constraints. Our tool is publicly available at https://github.com/borjaf696/Fast-mRMR.	es_ES
dc.description.sponsorship	This research has been partially funded by projects TIN2016-75845-P and TIN2015-65069-C2-1-R of the Ministry of Economy, Industry and Competitiveness of Spain, as well as by Xunta de Galicia projects ED431D R2016/045 and GRC2014/035, all of them partially funded by FEDER funds of the European Union. We gratefully thank CESGA for providing access to the Finis Terrae II supercomputer.	es_ES
dc.description.sponsorship	Xunta de Galicia; ED431D R2016/045	es_ES
dc.description.sponsorship	Xunta de Galicia; GRC2014/035	es_ES
dc.language.iso	eng	es_ES
dc.relation	info:eu-repo/grantAgreement/MINECO/Plan Estatal de Investigación Científica y Técnica y de Innovación 2013-2016/TIN2016-75845-P/ES/NUEVOS DESAFIOS EN COMPUTACION DE ALTAS PRESTACIONES: DESDE ARQUITECTURAS HASTA APLICACIONES (II)/	es_ES
dc.relation	info:eu-repo/grantAgreement/MINECO/Plan Estatal de Investigación Científica y Técnica y de Innovación 2013-2016/TIN2015-65069-C2-1-R/ES/ALGORITMOS ESCALABLES DE APRENDIZAJE COMPUTACIONAL: MAS ALLA DE LA CLASIFICACION Y LA REGRESION	es_ES
dc.relation.isversionof	https://doi.org/10.1016/j.ins.2019.01.050
dc.relation.uri	https://doi.org/10.1016/j.ins.2019.01.050	es_ES
dc.rights	Attribution-NonCommercial-NoDerivs 3.0 Spain	es_ES
dc.rights	CC-BY-NC-ND 4.0 license https://creativecommons.org/licenses/by-nc-nd/4.0/	es_ES
dc.rights.uri	http://creativecommons.org/licenses/by-nc-nd/3.0/es/	*
dc.subject	Machine learning	es_ES
dc.subject	Feature selection	es_ES
dc.subject	High performance computing	es_ES
dc.subject	Parallel computing	es_ES
dc.title	Parallel feature selection for distributed-memory clusters	es_ES
dc.type	info:eu-repo/semantics/article	es_ES
dc.rights.access	info:eu-repo/semantics/openAccess	es_ES
dc.identifier.doi	10.1016/j.ins.2019.01.050

Files in this item

Name: license_rdf
Size: 1.203Kb
Format: application/rdf+xml

Name: Gonzalez_Dominguez_Jorge_2019_ ...
Size: 283.8Kb
Format: PDF

This item appears in the following collection(s)

GI-GAC - Artigos [182]
