Distributed correlation-based feature selection in spark
Use this link to cite
http://hdl.handle.net/2183/34420
Except where otherwise noted, this item's license is described as Attribution-NonCommercial-NoDerivatives 4.0 International
Collections
- GI-LIDIA - Articles [65]
Metadata
Title
Distributed correlation-based feature selection in spark
Date
2019-09
Bibliographic citation
R.-J. Palma-Mendoza, L. de-Marcos, D. Rodriguez, and A. Alonso-Betanzos, "Distributed correlation-based feature selection in spark", Information Sciences, vol. 496, pp. 287-299, Sep. 2019, doi: 10.1016/j.ins.2018.10.052.
Abstract
[Abstract]: Feature selection (FS) is a key preprocessing step in data mining. CFS (Correlation-Based Feature Selection) is an FS algorithm that has been successfully applied to classification problems in many domains. We describe Distributed CFS (DiCFS) as a completely redesigned, scalable, parallel and distributed version of the CFS algorithm, capable of dealing with the large volumes of data typical of big data applications. Two versions of the algorithm were implemented and compared using the Apache Spark cluster computing model, currently gaining popularity due to its much faster processing times than Hadoop’s MapReduce model. We tested our algorithms on four publicly available datasets, each consisting of a large number of instances and two also consisting of a large number of features. The results show that our algorithms were superior in terms of both time-efficiency and scalability. In leveraging a computer cluster, they were able to handle larger datasets than the non-distributed WEKA version while maintaining the quality of the results, i.e., exactly the same features were returned by our algorithms when compared to the original algorithm available in WEKA.
Keywords
Feature selection
Scalability
Big data
Apache Spark
CFS
Correlation
Description
© 2019. This manuscript version is made available under the CC-BY-NC-ND 4.0 license https://creativecommons.org/licenses/by-nc-nd/4.0/. This version of the article "R.-J. Palma-Mendoza, L. de-Marcos, D. Rodriguez, and A. Alonso-Betanzos, 'Distributed correlation-based feature selection in spark', Information Sciences, vol. 496, pp. 287-299, Sep. 2019" has been accepted for publication in Information Sciences. The Version of Record is available online at doi: 10.1016/j.ins.2018.10.052.
Rights
Attribution-NonCommercial-NoDerivatives 4.0 International
ISSN
0020-0255