Distributed correlation-based feature selection in spark

Palma Mendoza, Raúl José; Marcos, Luis de; Rodríguez, Daniel; Alonso-Betanzos, Amparo

dc.contributor.author	Palma Mendoza, Raúl José
dc.contributor.author	Marcos, Luis de
dc.contributor.author	Rodríguez, Daniel
dc.contributor.author	Alonso-Betanzos, Amparo
dc.date.accessioned	2023-12-04T14:29:07Z
dc.date.available	2023-12-04T14:29:07Z
dc.date.issued	2019-09
dc.identifier.citation	R.-J. Palma-Mendoza, L. de-Marcos, D. Rodriguez, and A. Alonso-Betanzos, “Distributed correlation-based feature selection in spark,” Information Sciences, vol. 496, pp. 287-299, Sep. 2019, doi: 10.1016/j.ins.2018.10.052.	es_ES
dc.identifier.issn	0020-0255
dc.identifier.uri	http://hdl.handle.net/2183/34420
dc.description	© 2019. This manuscript version is made available under the CC-BY-NC-ND 4.0 license https://creativecommons.org/licenses/by-nc-nd/4.0/. This version of the article "R.-J. Palma-Mendoza, L. de-Marcos, D. Rodriguez, and A. Alonso-Betanzos, 'Distributed correlation-based feature selection in spark', Information Sciences, vol. 496, pp. 287-299, Sep. 2019" has been accepted for publication in Information Sciences. The Version of Record is available online at doi: 10.1016/j.ins.2018.10.052.	es_ES
dc.description.abstract	[Abstract]: Feature selection (FS) is a key preprocessing step in data mining. CFS (Correlation-Based Feature Selection) is an FS algorithm that has been successfully applied to classification problems in many domains. We describe Distributed CFS (DiCFS) as a completely redesigned, scalable, parallel and distributed version of the CFS algorithm, capable of dealing with the large volumes of data typical of big data applications. Two versions of the algorithm were implemented and compared using the Apache Spark cluster computing model, currently gaining popularity due to its much faster processing times than Hadoop’s MapReduce model. We tested our algorithms on four publicly available datasets, each consisting of a large number of instances and two also consisting of a large number of features. The results show that our algorithms were superior in terms of both time-efficiency and scalability. In leveraging a computer cluster, they were able to handle larger datasets than the non-distributed WEKA version while maintaining the quality of the results, i.e., exactly the same features were returned by our algorithms when compared to the original algorithm available in WEKA.	es_ES
dc.description.sponsorship	The authors thank CESGA for use of their supercomputing resources. This research has been partially supported by the Spanish Ministerio de Economía y Competitividad (research projects TIN2015-65069-C2-1-R and TIN2016-76956-C3-3-R), the Xunta de Galicia (Grants GRC2014/035 and ED431G/01) and the European Union Regional Development Funds. R. Palma-Mendoza holds a scholarship from the Spanish Fundación Carolina and the National Autonomous University of Honduras.	es_ES
dc.description.sponsorship	Xunta de Galicia; GRC2014/035	es_ES
dc.description.sponsorship	Xunta de Galicia; ED431G/01	es_ES
dc.language.iso	eng	es_ES
dc.publisher	Elsevier	es_ES
dc.relation	info:eu-repo/grantAgreement/MINECO/Plan Estatal de Investigación Científica y Técnica y de Innovación 2013-2016/TIN2015-65069-C2-1-R/ES/ALGORITMOS ESCALABLES DE APRENDIZAJE COMPUTACIONAL: MAS ALLA DE LA CLASIFICACION Y LA REGRESION	es_ES
dc.relation	info:eu-repo/grantAgreement/MINECO/Plan Estatal de Investigación Científica y Técnica y de Innovación 2013-2016/TIN2016-76956-C3-3-R/ES/INNOVACION EN LA MEJORA DE LA CALIDAD DE LOS PROCESOS IMPULSADOS POR LAS PERSONAS A TRAVES DE SIMULACION Y GAMIFICACION	es_ES
dc.relation.uri	https://doi.org/10.1016/j.ins.2018.10.052	es_ES
dc.rights	Attribution-NonCommercial-NoDerivatives 4.0 International	es_ES
dc.rights.uri	https://creativecommons.org/licenses/by-nc-nd/4.0/	*
dc.subject	Feature selection	es_ES
dc.subject	Scalability	es_ES
dc.subject	Big data	es_ES
dc.subject	Apache spark	es_ES
dc.subject	CFS	es_ES
dc.subject	Correlation	es_ES
dc.title	Distributed correlation-based feature selection in spark	es_ES
dc.type	info:eu-repo/semantics/article	es_ES
dc.rights.access	info:eu-repo/semantics/openAccess	es_ES
UDC.journalTitle	Information Sciences	es_ES
UDC.issue	496	es_ES
UDC.startPage	287	es_ES
UDC.endPage	299	es_ES

Files in this item

Name: license_rdf
Size: 1.203Kb
Format: application/rdf+xml

Name: Palma-Mendoza_Raul_2018_Distri ...
Size: 439.6Kb
Format: PDF

This item appears in the following collection(s)

GI-LIDIA - Artigos [65]