Improving the prediction accuracy of statistical models: A new hierarchical clustering approach

López-Oriona, Ángel; Sun, Ying; Vilar, José

Use this link to cite:

https://hdl.handle.net/2183/45733

Files

Vilar_Jose_2025_Improv_prediction_accuracy_stat_model.pdf (608.99 KB)

Identifiers

URI: https://hdl.handle.net/2183/45733

Publication date

2025-08-11

Authors

López-Oriona, Ángel

Sun, Ying

Vilar, José

Bibliographic citation

López-Oriona, Á., Sun, Y. & Vilar, J.A. Improving the prediction accuracy of statistical models: A new hierarchical clustering approach. Stat Comput 35, 168 (2025). https://doi.org/10.1007/s11222-025-10683-x

Abstract

Statisticians and machine learning practitioners frequently encounter datasets originating from multiple populations but containing the same type of measurements. In such cases, predictive analytics is typically carried out either by fitting a separate model to each dataset independently or by merging the datasets and fitting a single model to the combined data. These approaches overlook the potential existence of multiple groups of datasets associated with different underlying models and, therefore, fail to exploit the inherent similarity between datasets to improve predictions. A third alternative is to perform pairwise comparisons between the populations before fitting the models. However, this is not always feasible, can become very challenging with complex models, and often does not rely on predictive accuracy. To address these issues, we propose a clustering approach designed to improve predictions in general databases. The method is based on a novel type of objective function that represents the total by-group prediction error. The clustering problem is solved using an agglomerative hierarchical algorithm that obtains the clustering partition automatically, in a fully data-driven manner. An additional advantage of this procedure is that the number of clusters is treated as a variable in the minimization problem, allowing it to be determined naturally in a way that optimizes the predictive accuracy of the underlying models. Furthermore, the technique is versatile and can be used with any type of model for both regression and classification tasks. Several simulation experiments and two real-world applications involving housing prices demonstrate that the procedure outperforms benchmark approaches in terms of predictive accuracy.
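
The agglomerative idea described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' algorithm: it assumes univariate least-squares lines as the models, a single hold-out split per dataset as the estimate of prediction error, and a greedy merge rule. All function names (`fit_line`, `group_error`, `total_error`, `agglomerate`) and the simulated data are hypothetical and chosen only for illustration.

```python
import random
from itertools import combinations


def fit_line(points):
    """Ordinary least squares for y = a + b*x on a list of (x, y) pairs."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points)
    sxy = sum((x - mx) * (y - my) for x, y in points)
    b = sxy / sxx if sxx > 0 else 0.0
    return my - b * mx, b


def group_error(group, train, valid):
    """Squared hold-out error when one model is fit to the pooled group."""
    a, b = fit_line([p for i in group for p in train[i]])
    return sum((y - (a + b * x)) ** 2 for i in group for x, y in valid[i])


def total_error(partition, train, valid):
    """Objective (assumed form): sum of by-group prediction errors."""
    return sum(group_error(g, train, valid) for g in partition)


def agglomerate(train, valid):
    """Greedy merging; returns the lowest-error partition seen on the path."""
    partition = [(i,) for i in range(len(train))]
    best = (total_error(partition, train, valid), partition)
    while len(partition) > 1:
        candidates = []
        for g, h in combinations(partition, 2):
            rest = [c for c in partition if c not in (g, h)]
            merged = rest + [tuple(sorted(g + h))]
            candidates.append((total_error(merged, train, valid), merged))
        err, partition = min(candidates, key=lambda c: c[0])
        if err < best[0]:
            best = (err, partition)
    return best


def simulate(a, b, n, rng):
    """Draw n points from y = a + b*x + Gaussian noise (illustrative data)."""
    points = []
    for _ in range(n):
        x = rng.random()
        points.append((x, a + b * x + rng.gauss(0.0, 0.5)))
    return points


rng = random.Random(0)
# Four datasets: the first two share one model, the last two share another.
coefs = [(1.0, 2.0), (1.0, 2.0), (10.0, -3.0), (10.0, -3.0)]
train = [simulate(a, b, 10, rng) for a, b in coefs]
valid = [simulate(a, b, 50, rng) for a, b in coefs]

best_err, best_partition = agglomerate(train, valid)
print(best_partition, round(best_err, 2))
```

Because the sketch keeps the partition with the smallest total error along the whole merge path, the number of clusters is not fixed in advance but follows from the minimization, which mirrors the property the abstract highlights. Any model with a fit and a predict step, for regression or classification, could take the place of `fit_line` under the same scheme.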