Comparison of Data Set Sample Selection Algorithms for Data Science: a Systematic Review
| UDC.coleccion | Investigación | |
| UDC.departamento | Ciencias da Computación e Tecnoloxías da Información | |
| UDC.grupoInv | Redes de Neuronas Artificiais e Sistemas Adaptativos -Informática Médica e Diagnóstico Radiolóxico (RNASA - IMEDIR) | |
| UDC.institutoCentro | CITIC - Centro de Investigación de Tecnoloxías da Información e da Comunicación | |
| UDC.journalTitle | Neural Computing and Applications | |
| UDC.startPage | 408 | |
| UDC.volume | 38 | |
| dc.contributor.author | Fernández Sánchez, Alberto | |
| dc.contributor.author | Gestal, M. | |
| dc.contributor.author | Bolón-Canedo, Verónica | |
| dc.contributor.author | Dorado, Julián | |
| dc.contributor.author | Pazos, A. | |
| dc.date.accessioned | 2026-06-04T08:10:36Z | |
| dc.date.available | 2026-06-04T08:10:36Z | |
| dc.date.issued | 2026 | |
| dc.description | Financiado para publicación en acceso aberto: Universidade da Coruña/CISUG | |
| dc.description.abstract | [Abstract]: In the era of big data, selecting representative samples has become essential to mitigate overfitting, noise, and high computational cost in machine learning. This study systematically reviews the evolution of instance selection (IS) methods, highlighting the growing importance of instance hardness (IH) as a guiding criterion to improve training efficiency and model robustness. Through a comprehensive search in Scopus and Web of Science, fifty-five studies were identified and analyzed following strict inclusion and exclusion criteria. The reviewed works were classified according to their underlying rationale (error-based, geometric, heuristic, or explainability-driven), revealing that IH principles intersect these categories as a transversal perspective on data quality. Most studies focus on enhancing predictive accuracy (56%) and computational efficiency (36%), while bias reduction and privacy preservation remain secondary. Reported outcomes show significant dataset reductions (up to 97%) with minimal accuracy loss and, in some cases, notable performance gains (+32% accuracy, +67% improvement in MSE). Despite these advances, explicit references to IH are rare, though many methods implicitly rely on related metrics such as misclassification frequency or decision-boundary proximity. Overall, IS is gaining relevance across domains such as cybersecurity, biomedicine, and computer vision, yet the field still lacks standardized methodologies and benchmarking frameworks, underscoring the need for unified, IH-informed strategies for robust and generalizable instance selection. | |
| dc.description.sponsorship | This work was supported by the Ministry for Digital Transformation and Civil Service and 'NextGenerationEU'/PRTR under Grant TSI-100925-2023-1. The author also acknowledges CITEEC, a Collaborative Center of the Galician University System, co-funded by the Xunta de Galicia and the European Union (ERDF) through the FEDER Galicia 2021-2027 program (ED431G 2023/10) under the thematic objective "A Smarter Europe: Innovative Economic Transformation". | |
| dc.description.sponsorship | Xunta de Galicia; ED431G 2023/10 | |
| dc.identifier.citation | Fernández-Sánchez, A., Pose, M.G., Canedo, V.B. et al. Comparison of data set sample selection algorithms for data science: a systematic review. Neural Comput & Applic 38, 408 (2026). https://doi.org/10.1007/s00521-026-12124-w | |
| dc.identifier.doi | 10.1007/s00521-026-12124-w | |
| dc.identifier.issn | 1433-3058 | |
| dc.identifier.uri | https://hdl.handle.net/2183/48518 | |
| dc.language.iso | eng | |
| dc.publisher | Springer | |
| dc.relation.projectID | info:eu-repo/grantAgreement/MTDPF//TSI-100925-2023-1/ES/CÁTEDRA UDC-INDITEX DE IA EN ALGORITMOS VERDES | |
| dc.relation.uri | https://doi.org/10.1007/s00521-026-12124-w | |
| dc.rights | Attribution 4.0 International | en |
| dc.rights.accessRights | open access | |
| dc.rights.uri | http://creativecommons.org/licenses/by/4.0/ | |
| dc.subject | Instance selection | |
| dc.subject | Dataset filtering | |
| dc.subject | Instance hardness | |
| dc.subject | Data complexity | |
| dc.subject | Machine learning | |
| dc.title | Comparison of Data Set Sample Selection Algorithms for Data Science: a Systematic Review | |
| dc.type | journal article | |
| dc.type.hasVersion | VoR | |
| dspace.entity.type | Publication | |
| relation.isAuthorOfPublication | 65439986-7b8c-4418-b8e3-5694f520ecc7 | |
| relation.isAuthorOfPublication | c114dccd-76e4-4959-ba6b-7c7c055289b1 | |
| relation.isAuthorOfPublication | 5139dea6-2326-4384-a423-317cec26ee8a | |
| relation.isAuthorOfPublication | fa192a4c-bffd-4b23-87ae-e68c29350cdc | |
| relation.isAuthorOfPublication.latestForDiscovery | 65439986-7b8c-4418-b8e3-5694f520ecc7 |
Files
Original bundle
- Name: Gestal_Marcos_2026_Comparison_of_Data_Set_Sample_Selection_Algorithms_for_Data_Science.pdf
- Size: 1.69 MB
- Format: Adobe Portable Document Format

