Comparison of Data Set Sample Selection Algorithms for Data Science: a Systematic Review

UDC.coleccion: Research
UDC.departamento: Ciencias da Computación e Tecnoloxías da Información
UDC.grupoInv: Redes de Neuronas Artificiais e Sistemas Adaptativos - Informática Médica e Diagnóstico Radiolóxico (RNASA - IMEDIR)
UDC.institutoCentro: CITIC - Centro de Investigación de Tecnoloxías da Información e da Comunicación
UDC.journalTitle: Neural Computing and Applications
UDC.startPage: 408
UDC.volume: 38
dc.contributor.author: Fernández Sánchez, Alberto
dc.contributor.author: Gestal, M.
dc.contributor.author: Bolón-Canedo, Verónica
dc.contributor.author: Dorado, Julián
dc.contributor.author: Pazos, A.
dc.date.accessioned: 2026-06-04T08:10:36Z
dc.date.available: 2026-06-04T08:10:36Z
dc.date.issued: 2026
dc.description: Funded for open access publication: Universidade da Coruña/CISUG
dc.description.abstract: In the era of big data, selecting representative samples has become essential to mitigate overfitting, noise, and high computational cost in machine learning. This study systematically reviews the evolution of instance selection (IS) methods, highlighting the growing importance of instance hardness (IH) as a guiding criterion to improve training efficiency and model robustness. Through a comprehensive search in Scopus and Web of Science, fifty-five studies were identified and analyzed following strict inclusion and exclusion criteria. The reviewed works were classified according to their underlying rationale (error-based, geometric, heuristic, or explainability-driven), revealing that IH principles intersect these categories as a transversal perspective on data quality. Most studies focus on enhancing predictive accuracy (56%) and computational efficiency (36%), while bias reduction and privacy preservation remain secondary. Reported outcomes show significant dataset reductions (up to 97%) with minimal accuracy loss and, in some cases, notable performance gains (+32% accuracy, +67% improvement in MSE). Despite these advances, explicit references to IH are rare, though many methods implicitly rely on related metrics such as misclassification frequency or decision-boundary proximity. Overall, IS is gaining relevance across domains such as cybersecurity, biomedicine, and computer vision, yet the field still lacks standardized methodologies and benchmarking frameworks, underscoring the need for unified, IH-informed strategies for robust and generalizable instance selection.
dc.description.sponsorship: This work was supported by the Ministry for Digital Transformation and Civil Service and "NextGenerationEU"/PRTR under Grant TSI-100925-2023-1. The author also acknowledges CITEEC, a Collaborative Center of the Galician University System, co-funded by the Xunta de Galicia and the European Union (ERDF) through the FEDER Galicia 2021-2027 program (ED431G 2023/10) under the thematic objective "A Smarter Europe: Innovative Economic Transformation."
dc.description.sponsorship: Xunta de Galicia; ED431G 2023/10
dc.identifier.citation: Fernández-Sánchez, A., Pose, M.G., Canedo, V.B. et al. Comparison of data set sample selection algorithms for data science: a systematic review. Neural Comput & Applic 38, 408 (2026). https://doi.org/10.1007/s00521-026-12124-w
dc.identifier.doi: 10.1007/s00521-026-12124-w
dc.identifier.issn: 1433-3058
dc.identifier.uri: https://hdl.handle.net/2183/48518
dc.language.iso: eng
dc.publisher: Springer
dc.relation.projectID: info:eu-repo/grantAgreement/MTDPF//TSI-100925-2023-1/ES/CÁTEDRA UDC-INDITEX DE IA EN ALGORITMOS VERDES
dc.relation.uri: https://doi.org/10.1007/s00521-026-12124-w
dc.rights: Attribution 4.0 International
dc.rights.accessRights: open access
dc.rights.uri: http://creativecommons.org/licenses/by/4.0/
dc.subject: Instance selection
dc.subject: Dataset filtering
dc.subject: Instance hardness
dc.subject: Data complexity
dc.subject: Machine learning
dc.title: Comparison of Data Set Sample Selection Algorithms for Data Science: a Systematic Review
dc.type: journal article
dc.type.hasVersion: VoR
dspace.entity.type: Publication
relation.isAuthorOfPublication: 65439986-7b8c-4418-b8e3-5694f520ecc7
relation.isAuthorOfPublication: c114dccd-76e4-4959-ba6b-7c7c055289b1
relation.isAuthorOfPublication: 5139dea6-2326-4384-a423-317cec26ee8a
relation.isAuthorOfPublication: fa192a4c-bffd-4b23-87ae-e68c29350cdc
relation.isAuthorOfPublication.latestForDiscovery: 65439986-7b8c-4418-b8e3-5694f520ecc7

Files

Original bundle

Name:
Gestal_Marcos_2026_Comparison_of_Data_Set_Sample_Selection_Algorithms_for_Data_Science.pdf
Size:
1.69 MB
Format:
Adobe Portable Document Format