Improving the prediction accuracy of statistical models: A new hierarchical clustering approach

Identifiers

Publication date

Authors

López-Oriona, Ángel
Sun, Ying
Vilar, J.A.

Advisors

Other responsibilities

Journal Title

Bibliographic citation

López-Oriona, Á., Sun, Y. & Vilar, J.A. Improving the prediction accuracy of statistical models: A new hierarchical clustering approach. Stat Comput 35, 168 (2025). https://doi.org/10.1007/s11222-025-10683-x

Type of academic work

Academic degree

Abstract

Statisticians and machine learning practitioners frequently encounter datasets originating from multiple populations but containing the same type of measurements. In such cases, predictive analytics is typically carried out by either fitting a separate model to each dataset independently or by merging the datasets and fitting a single model to the combined data. These approaches overlook the potential existence of multiple groups of datasets associated with different underlying models and therefore fail to exploit the inherent similarity between datasets to improve predictions. A third alternative is to perform pairwise comparisons between the populations before fitting the models. However, this is not always feasible, can become a very challenging task with complex models, and often does not rely on predictive accuracy. To address these issues, we propose a clustering approach designed to improve predictions in general databases. The method is based on a novel type of objective function that represents the total by-group prediction error. The clustering problem is solved using a hierarchical-type algorithm of agglomerative nature that automatically obtains the resulting clustering partition in a fully data-driven manner. An additional advantage of this procedure is that the number of clusters is treated as a variable in the minimization problem, allowing it to be determined naturally in a way that optimizes the predictive accuracy of the underlying models. Furthermore, the technique is versatile and can be used with any type of model for both regression and classification tasks. Several simulation experiments and two real-world applications involving housing prices demonstrate that the procedure outperforms benchmark approaches in terms of predictive accuracy.
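
The general idea in the abstract (merge datasets agglomeratively while the total by-group prediction error decreases, so that the number of groups is not fixed in advance) can be illustrated with a minimal sketch. This is not the authors' algorithm: the abstract does not specify the objective function or the merging rule in detail. The sketch assumes a simple least-squares linear model, measures prediction error as the summed squared error on held-out validation data, and stops greedily when no merge lowers the total. All function names (`fit_predict_error`, `group_error`, `greedy_agglomerative`) and the synthetic data are hypothetical.

```python
import numpy as np

def fit_predict_error(train_sets, val_sets):
    """Fit one least-squares line on the pooled training data and
    return the summed squared error over the given validation sets."""
    X = np.vstack([x for x, _ in train_sets])
    y = np.concatenate([t for _, t in train_sets])
    A = np.column_stack([np.ones(len(X)), X])      # add intercept column
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    err = 0.0
    for xv, yv in val_sets:
        Av = np.column_stack([np.ones(len(xv)), xv])
        err += float(np.sum((yv - Av @ beta) ** 2))
    return err

def group_error(group, train, val):
    """Prediction error of a group of datasets sharing one model (assumed objective)."""
    return fit_predict_error([train[i] for i in group], [val[i] for i in group])

def greedy_agglomerative(train, val):
    """Start with one group per dataset; repeatedly apply the merge that
    lowers the total by-group error the most; stop when none does."""
    groups = [[i] for i in range(len(train))]
    errors = [group_error(g, train, val) for g in groups]
    while len(groups) > 1:
        best = None
        for a in range(len(groups)):
            for b in range(a + 1, len(groups)):
                merged = groups[a] + groups[b]
                e = group_error(merged, train, val)
                delta = e - errors[a] - errors[b]  # change in total error
                if best is None or delta < best[0]:
                    best = (delta, a, b, merged, e)
        delta, a, b, merged, e = best
        if delta >= 0:                             # no merge improves predictions
            break
        groups = [g for k, g in enumerate(groups) if k not in (a, b)] + [merged]
        errors = [x for k, x in enumerate(errors) if k not in (a, b)] + [e]
    return groups, sum(errors)

# Synthetic illustration: four datasets drawn from two different linear models.
rng = np.random.default_rng(0)

def make(slope, intercept, n=30):
    X = rng.uniform(-3, 3, size=(n, 1))
    y = slope * X[:, 0] + intercept + rng.normal(0, 0.5, size=n)
    return X, y

specs = [(2.0, 1.0), (2.0, 1.0), (-2.0, -1.0), (-2.0, -1.0)]
train = [make(s, c) for s, c in specs]
val = [make(s, c) for s, c in specs]

groups, total = greedy_agglomerative(train, val)
print(groups, round(total, 2))
```

Because a merge is accepted only when it lowers the total validation error, the number of groups falls out of the procedure rather than being chosen beforehand. In this sketch, datasets generated by different models are never pooled, since doing so sharply increases the error; whether datasets from the same model are pooled depends on the particular sample.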

Description

Open access publishing provided by King Abdullah University of Science and Technology (KAUST).

Rights

© The Author(s) 2025

Except where otherwise noted, this item's license is described as © The Author(s) 2025