Information Fusion and Ensembles in Machine Learning

Use this link to cite: http://hdl.handle.net/2183/24568

Alternative Title(s): Fusión de Información e Ensembles na Aprendizaxe Automática (Galician)
Director(s): Alonso-Betanzos, Amparo; Bolón-Canedo, Verónica
Date: 2019

Abstract
Traditionally, machine learning methods have used a single learning model to solve
a particular problem. However, the idea of combining multiple models instead of a
single one has its rationale in the old proverb “Two heads are better than one”. The
approach constructs a set of hypotheses using several different models, which are then
combined to obtain better performance than learning a single hypothesis with a unique
method. Several studies have shown that these combinations usually obtain better
accuracy than individual methods, thanks to the diversity of the approaches and the
control of the variance: they take advantage of the strengths of the individual methods
while overcoming their weak points. These combinations of models are called
“committees” or, more recently, “ensembles”. Ensemble learning algorithms have reached
great popularity in the machine learning literature, as they achieve performances that
were not possible some years ago, and have thus become a “winning horse” in many
applications.
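As a minimal illustration of this idea, the sketch below combines three different base models by majority voting; the use of scikit-learn and the particular models are illustrative assumptions, not choices made in the thesis.

```python
# Minimal sketch of "two heads are better than one": three different
# base models combined by hard majority voting (assumes scikit-learn).
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

base = [("lr", LogisticRegression(max_iter=1000)),
        ("nb", GaussianNB()),
        ("dt", DecisionTreeClassifier(random_state=0))]
ensemble = VotingClassifier(estimators=base, voting="hard")  # majority vote

# the combined model typically matches or beats its individual members
for name, model in base + [("ensemble", ensemble)]:
    print(name, cross_val_score(model, X, y, cv=5).mean().round(3))
```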
Moreover, in recent years the size of the datasets used in machine learning has grown
considerably, so dimensionality reduction has become almost a necessity. Among such
preprocessing methods, feature selection (FS) has become an essential step for many
data mining applications: it eliminates irrelevant and redundant information, reducing
storage requirements and the computational time needed by machine learning algorithms.
Several studies have also demonstrated that feature selection can greatly improve the
performance of subsequent classification methods.
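As a minimal illustration, the following sketch applies a simple univariate filter before a classifier; the library, the mutual-information filter and the parameters are illustrative assumptions rather than the thesis's setup.

```python
# Sketch: feature selection as preprocessing before classification,
# using a univariate mutual-information filter (assumes scikit-learn).
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline

# 100 features, of which only 10 are actually informative
X, y = make_classification(n_samples=300, n_features=100,
                           n_informative=10, random_state=0)

full = cross_val_score(GaussianNB(), X, y, cv=5).mean()
reduced = cross_val_score(
    make_pipeline(SelectKBest(mutual_info_classif, k=10), GaussianNB()),
    X, y, cv=5).mean()
print(f"all 100 features: {full:.3f} | 10 selected features: {reduced:.3f}")
```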
One of the main points to be addressed in this thesis is the application of the ensemble
learning idea to the feature selection process, with the aim of introducing diversity
and increasing the regularity of the process. Regularity is the ability of the ensemble
approach to obtain acceptable results regardless of the dataset under study and its
particular properties. It should also be mentioned that using ensemble approaches has
the added benefit of releasing the user from the task of selecting the most adequate
method for each dataset, and thus of the obligation of knowing technical details about
the existing algorithms. In this way, also more user-friendly FS methods are coming into scene.
Ensembles for feature selection are a recent proposal, and not many works can
be found in the literature. Two steps need to be addressed when creating an
ensemble for FS (a sketch of both follows the list):
1. Create a set of different feature selectors, each one providing its output. To
create diversity, several strategies can be used, such as using different samples
of the training dataset, using different feature selection methods, or a
combination of both.
2. Aggregate the results obtained by the individual models. Several measures can
be used in this step, such as majority voting, weighted voting, etc. It is
important to choose an adequate aggregation method, one that preserves the
diversity of the individual base models while maintaining accuracy.
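A hedged sketch of both steps, assuming a heterogeneous set of univariate rankers and mean-rank aggregation; the three scoring functions are illustrative, not the thesis's exact choices.

```python
# Step 1: build diversity with several different rankers over the same data.
# Step 2: aggregate their outputs, here by averaging rank positions.
import numpy as np
from scipy.stats import rankdata
from sklearn.datasets import make_classification
from sklearn.feature_selection import chi2, f_classif, mutual_info_classif

X, y = make_classification(n_samples=300, n_features=30,
                           n_informative=5, random_state=0)

# Step 1: each selector scores every feature
scores = [
    f_classif(X, y)[0],
    mutual_info_classif(X, y, random_state=0),
    chi2(X - X.min(axis=0), y)[0],   # chi2 requires non-negative inputs
]

# Step 2: convert scores to ranks (1 = best) and average across selectors
ranks = np.array([rankdata(-s) for s in scores])
mean_rank = ranks.mean(axis=0)
print("top 5 features:", np.argsort(mean_rank)[:5])
```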
In this thesis, we have designed several approaches for the first step above: (i) a
homogeneous approach, that is, using the same feature selection method with different
training data and distributing the dataset over several nodes (or partitions); and
(ii) a heterogeneous approach, i.e., using different feature selection methods with
the same training data. Regarding the second step, we have also studied different
methods for combining the results obtained from the individual methods. Besides, when
the chosen individual selectors are rankers, at some point we need to establish a
threshold to retain only the relevant features and to combine the rankings obtained
by the different methods composing the ensemble. In this sense, we have analyzed two
proposals, depending on whether thresholding is performed before or after combination.
Finally, a third novelty of this work addresses the need for an adequate threshold: we
propose a methodology for establishing automatic thresholds based on measurements of
data complexity.
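The following toy sketch contrasts the two thresholding orders on hand-made rankings; the rankings and the cut-off value are illustrative assumptions.

```python
# Two orders for thresholding ranker outputs in an ensemble.
import numpy as np

# three rankers, each listing feature indices from best to worst
rankings = [np.array([0, 2, 1, 4, 3]),
            np.array([2, 0, 3, 1, 4]),
            np.array([0, 1, 2, 3, 4])]
k = 3  # retain the top-3 features

# (a) threshold BEFORE combination: cut each ranking, then vote
votes = np.zeros(5, dtype=int)
for r in rankings:
    votes[r[:k]] += 1
before = np.argsort(-votes)[:k]

# (b) threshold AFTER combination: average rank positions, then cut once
positions = np.zeros(5)
for r in rankings:
    pos = np.empty(5)
    pos[r] = np.arange(5)   # rank position of each feature in this ranking
    positions += pos
after = np.argsort(positions)[:k]

print("threshold-then-combine keeps:", sorted(before.tolist()))
print("combine-then-threshold keeps:", sorted(after.tolist()))
```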
The adequacy of the methods proposed throughout this thesis was checked so as to
extract a series of final conclusions. To this end, a variety of datasets of different
types were used: synthetic, real “classical” (more samples than features) and real DNA
microarray datasets (more features than samples). In a first step, synthetic datasets
were used to run the first tests and check the performance of the newly implemented
methods. In a second step, real datasets (both classical and microarray) were used to
check the adequacy of the new methods on real-world problems, allowing us to carry out
a performance comparison and to extract a series of final conclusions.
Finally, nowadays it is common to find missing data in real-world problems, which the
proposed feature selection ensembles, like any other machine learning method, are
likely to face. Traditionally, the common way to deal with this situation was to
delete the samples containing missing data, but this is not feasible when the
percentage of missing data is substantial, so imputation has become the usual
approach. However, imputation before FS can lead to false positives: features that
are not associated with the target become dependent on it as a result of imputation.
In this exploratory work we use causal graphs to evidence the notion of structural
bias, and we develop a modified t-statistic test to analyze the bias that can arise.
Our conclusion is that it is more advisable to devise feature selection methods that
are “robust” to the presence of missing data than to impute them. In this regard, the
development of ensemble feature selection in this scenario remains as a future line
of research.
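The sketch below is a toy illustration of structural bias using an ordinary t-test, not the thesis's modified statistic: a feature truly independent of the target appears significant after mean imputation, because its missingness depends on both the feature value and the target.

```python
# Toy demonstration: imputation before FS can create a false positive.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 2000
y = rng.integers(0, 2, n)      # binary target
x = rng.normal(size=n)         # feature truly independent of y

print("complete data  p =", stats.ttest_ind(x[y == 0], x[y == 1]).pvalue)

# values go missing mostly when x is high AND y == 1 (not at random)
missing = (y == 1) & (x > 0) & (rng.random(n) < 0.8)
x_imp = x.copy()
x_imp[missing] = x[~missing].mean()   # standard mean imputation

# the imputed class-1 sample lost its high values: spurious association
print("after imputing p =", stats.ttest_ind(x_imp[y == 0], x_imp[y == 1]).pvalue)
```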
Keywords
Machine learning
Ensemble learning
Feature selection
Missing data
Rights
Attribution-NonCommercial 4.0 Spain