Improving the prediction accuracy of statistical models: A new hierarchical clustering approach
| UDC.coleccion | Investigación | |
| UDC.departamento | Matemáticas | |
| UDC.grupoInv | Modelización, Optimización e Inferencia Estatística (MODES) | |
| UDC.journalTitle | Statistics and Computing | |
| UDC.startPage | 168 | |
| UDC.volume | 35 | |
| dc.contributor.author | López-Oriona, Ángel | |
| dc.contributor.author | Sun, Ying | |
| dc.contributor.author | Vilar, José | |
| dc.date.accessioned | 2025-09-09T09:02:31Z | |
| dc.date.available | 2025-09-09T09:02:31Z | |
| dc.date.issued | 2025-08-11 | |
| dc.description | Open access publishing provided by King Abdullah University of Science and Technology (KAUST). | |
| dc.description.abstract | [Abstract]: Statisticians and machine learning practitioners frequently encounter datasets originating from multiple populations but containing the same type of measurements. In such cases, predictive analytics is typically carried out by either fitting a separate model to each dataset independently or by merging the datasets and fitting a single model to the combined data. These approaches overlook the potential existence of multiple groups of datasets associated with different underlying models, and, therefore, fail to exploit the inherent similarity between datasets to improve predictions. A third alternative is to perform pairwise comparisons between the populations before fitting the models. However, this is not always feasible, can become a very challenging task with complex models, and often does not rely on predictive accuracy. To address these issues, we propose a clustering approach designed to improve predictions in general databases. The method is based on a novel type of objective function that represents the total by-group prediction error. The clustering problem is solved using a hierarchical-type algorithm of agglomerative nature that automatically obtains the resulting clustering partition in a fully data-driven manner. An additional advantage of this procedure is that the number of clusters is treated as a variable in the minimization problem, allowing it to be determined naturally in a way that optimizes the predictive accuracy of the underlying models. Furthermore, the technique is versatile and can be used with any type of model for both regression and classification tasks. Several simulation experiments and two real-world applications involving housing prices demonstrate that the procedure outperforms benchmark approaches in terms of predictive accuracy. | |
| dc.description.sponsorship | Ángel López-Oriona and Ying Sun thank King Abdullah University of Science and Technology (KAUST) for its support. The research by José A. Vilar is supported by the grants PID2020-113578RB-I00 and PID2023-147127OB-I00 "ERDF/EU", funded by MCIN/AEI/10.13039/501100011033/. It has also been supported by the Xunta de Galicia (Grupos de Referencia Competitiva ED431C-2024/14) and by CITIC as a center accredited for excellence within the Galician University System and a member of the CIGUS Network, receiving subsidies from the Department of Education, Science, Universities, and Vocational Training of the Xunta de Galicia. Additionally, it is co-financed by the EU through the FEDER Galicia 2021-27 operational program (Ref. ED431G 2023/01). | |
| dc.description.sponsorship | Xunta de Galicia; ED431C-2024/14 | |
| dc.description.sponsorship | Xunta de Galicia; ED431G 2023/01 | |
| dc.identifier.citation | López-Oriona, Á., Sun, Y. & Vilar, J.A. Improving the prediction accuracy of statistical models: A new hierarchical clustering approach. Stat Comput 35, 168 (2025). https://doi.org/10.1007/s11222-025-10683-x | |
| dc.identifier.issn | 1573-1375 | |
| dc.identifier.issn | 0960-3174 | |
| dc.identifier.uri | https://hdl.handle.net/2183/45733 | |
| dc.language.iso | eng | |
| dc.publisher | Springer Nature | |
| dc.relation.projectID | info:eu-repo/grantAgreement/AEI/Plan Estatal de Investigación Científica y Técnica y de Innovación 2017-2020/PID2020-113578RB-I00/ES/METODOS ESTADISTICOS FLEXIBLES EN CIENCIA DE DATOS PARA DATOS COMPLEJOS Y DE GRAN VOLUMEN: TEORIA Y APLICACIONES/ | |
| dc.relation.projectID | info:eu-repo/grantAgreement/AEI/Plan Estatal de Investigación Científica, Técnica y de Innovación 2021-2023/PID2023-147127OB-I00/ES/INFERENCIA ESTADISTICA UTILIZANDO METODOS FLEXIBLES PARA DATOS COMPLEJOS: TEORIA Y APPLICACIONES | |
| dc.relation.uri | https://doi.org/10.1007/s11222-025-10683-x | |
| dc.rights | © The Author(s) 2025 | |
| dc.rights.accessRights | open access | |
| dc.rights.uri | http://creativecommons.org/licenses/by-nc-nd/4.0/ | |
| dc.subject | Categorization | |
| dc.subject | Data Mining | |
| dc.subject | Functional clustering | |
| dc.subject | Machine Learning | |
| dc.subject | Predictive medicine | |
| dc.subject | Statistical Learning | |
| dc.title | Improving the prediction accuracy of statistical models: A new hierarchical clustering approach | |
| dc.type | journal article | |
| dc.type.hasVersion | VoR | |
| dspace.entity.type | Publication | |
| relation.isAuthorOfPublication | c9381eef-6e06-41b8-a15c-a194bdff8d03 | |
| relation.isAuthorOfPublication.latestForDiscovery | c9381eef-6e06-41b8-a15c-a194bdff8d03 |
Files
Original bundle
- Name: Vilar_Jose_2025_Improv_prediction_accuracy_stat_model.pdf
- Size: 608.99 KB
- Format: Adobe Portable Document Format

