Dealing with heterogeneity in the context of distributed feature selection for classification

Morillo-Salas, José Luis; Bolón-Canedo, Verónica; Alonso-Betanzos, Amparo

Use this link to cite:

http://hdl.handle.net/2183/36033

Files

Morillo_Salas_JoseL_2020_Dealing_with_heterogeneity_in_the_context_of_distributed_feature_selection_for_classification.pdf (2.78 MB)

Identifiers

URI: http://hdl.handle.net/2183/36033

DOI: 10.1007/s10115-020-01526-4

Publication date

2021

Authors

Morillo-Salas, José Luis

Bolón-Canedo, Verónica

Alonso-Betanzos, Amparo

Bibliographic citation

Morillo-Salas, J.L., Bolón-Canedo, V. & Alonso-Betanzos, A. Dealing with heterogeneity in the context of distributed feature selection for classification. Knowl Inf Syst 63, 233–276 (2021). https://doi.org/10.1007/s10115-020-01526-4

Abstract

Advances in information technologies have greatly contributed to the advent of larger datasets. These datasets often come from distributed sites, but even so, their large size usually means they cannot be handled in a centralized manner. A possible solution to this problem is to distribute the data over several processors and combine the different results. We propose a methodology to distribute feature selection processes based on selecting relevant features and discarding irrelevant ones. This preprocessing step is essential for current high-dimensional datasets, since it allows the input dimension to be reduced. We pay particular attention to the problem of data imbalance, which occurs either because the original dataset is unbalanced or because the dataset becomes unbalanced after data partitioning. Most works approach unbalanced scenarios by oversampling, while our proposal tests both over- and undersampling strategies. Experimental results demonstrate that our distributed approach to classification obtains accuracy comparable to that of a centralized approach, while reducing computational time and efficiently dealing with data imbalance.
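The general idea in the abstract (partition the data, select features per partition, rebalance where needed, combine the results) can be illustrated with a minimal sketch. It is not the authors' method: the abstract does not specify the filter, the partitioning scheme, or the combination rule, so the mean-gap score, the round-robin horizontal partitioning, the random undersampling, and the vote threshold below are all assumptions chosen for illustration, as are the function names and the toy dataset.

```python
import random
from collections import Counter
from statistics import mean, pstdev


def partition_rows(n_rows, n_parts, seed=0):
    """Shuffle row indices and deal them round-robin into n_parts partitions."""
    idx = list(range(n_rows))
    random.Random(seed).shuffle(idx)
    return [idx[i::n_parts] for i in range(n_parts)]


def undersample(rows, y, seed=0):
    """Randomly drop rows until every class is as small as the smallest one."""
    by_class = {}
    for r in rows:
        by_class.setdefault(y[r], []).append(r)
    smallest = min(len(members) for members in by_class.values())
    rng = random.Random(seed)
    kept = []
    for members in by_class.values():
        kept.extend(rng.sample(members, smallest))
    return kept


def score_feature(values, labels):
    """Assumed stand-in filter: gap between two class means over overall spread."""
    classes = sorted(set(labels))
    if len(classes) != 2:
        return 0.0
    a = [v for v, c in zip(values, labels) if c == classes[0]]
    b = [v for v, c in zip(values, labels) if c == classes[1]]
    spread = pstdev(values)
    return abs(mean(a) - mean(b)) / spread if spread > 0 else 0.0


def select_top(X, y, rows, top_m):
    """Rank features on the given rows and return the indices of the top_m."""
    labels = [y[r] for r in rows]
    n_features = len(X[0])
    scores = [score_feature([X[r][j] for r in rows], labels)
              for j in range(n_features)]
    ranked = sorted(range(n_features), key=lambda j: scores[j], reverse=True)
    return set(ranked[:top_m])


def distributed_selection(X, y, n_parts=4, top_m=2, min_votes=3, seed=0):
    """Select features per partition, then keep those with enough votes."""
    votes = Counter()
    for p, rows in enumerate(partition_rows(len(X), n_parts, seed)):
        if len({y[r] for r in rows}) < 2:
            continue  # partitioning left a single class here; nothing to rank
        balanced = undersample(rows, y, seed + p)
        votes.update(select_top(X, y, balanced, top_m))
    return sorted(j for j, v in votes.items() if v >= min_votes)


def make_toy_data(n_rows=200):
    """Hypothetical dataset: 10% minority class; only features 0 and 1 follow it."""
    y = [1 if i % 10 == 0 else 0 for i in range(n_rows)]
    X = [[y[i], 1 - y[i], (i * 3) % 7, (i * 5) % 11, i % 4]
         for i in range(n_rows)]
    return X, y
```

On the toy data, `distributed_selection(X, y, min_votes=1)` returns `[0, 1]`, the two features that track the class label. In a real deployment each partition would be processed on a separate node and only the selected feature indices would be sent back for combination; oversampling could replace `undersample` without changing the rest of the flow.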

Description

This version of the article has been accepted for publication, after peer review and is subject to Springer Nature’s AM terms of use, but is not the Version of Record and does not reflect post-acceptance improvements, or any corrections. The Version of Record is available online at: https://doi.org/10.1007/s10115-020-01526-4.

Keywords

Feature selection; Distributed learning; Unbalanced data; Oversampling

Editor version

https://doi.org/10.1007/s10115-020-01526-4

Collections

Investigación (FIC)
