Comparison of Data Set Sample Selection Algorithms for Data Science: a Systematic Review

Fernández Sánchez, Alberto; Gestal, M.; Bolón-Canedo, Verónica; Dorado, Julián; Pazos, A.

UDC.coleccion	Investigación
UDC.departamento	Ciencias da Computación e Tecnoloxías da Información
UDC.grupoInv	Redes de Neuronas Artificiais e Sistemas Adaptativos -Informática Médica e Diagnóstico Radiolóxico (RNASA - IMEDIR)
UDC.institutoCentro	CITIC - Centro de Investigación de Tecnoloxías da Información e da Comunicación
UDC.journalTitle	Neural Computing and Applications
UDC.startPage	408
UDC.volume	38
dc.contributor.author	Fernández Sánchez, Alberto
dc.contributor.author	Gestal, M.
dc.contributor.author	Bolón-Canedo, Verónica
dc.contributor.author	Dorado, Julián
dc.contributor.author	Pazos, A.
dc.date.accessioned	2026-06-04T08:10:36Z
dc.date.available	2026-06-04T08:10:36Z
dc.date.issued	2026
dc.description	Funded for open access publication: Universidade da Coruña/CISUG
dc.description.abstract	[Abstract]: In the era of big data, selecting representative samples has become essential to mitigate overfitting, noise, and high computational cost in machine learning. This study systematically reviews the evolution of instance selection (IS) methods, highlighting the growing importance of instance hardness (IH) as a guiding criterion to improve training efficiency and model robustness. Through a comprehensive search in Scopus and Web of Science, fifty-five studies were identified and analyzed following strict inclusion and exclusion criteria. The reviewed works were classified according to their underlying rationale (error-based, geometric, heuristic, or explainability-driven), revealing that IH principles intersect these categories as a transversal perspective on data quality. Most studies focus on enhancing predictive accuracy (56%) and computational efficiency (36%), while bias reduction and privacy preservation remain secondary. Reported outcomes show significant dataset reductions (up to 97%) with minimal accuracy loss and, in some cases, notable performance gains (+32% accuracy, +67% improvement in MSE). Despite these advances, explicit references to IH are rare, though many methods implicitly rely on related metrics such as misclassification frequency or decision-boundary proximity. Overall, IS is gaining relevance across domains such as cybersecurity, biomedicine, and computer vision, yet the field still lacks standardized methodologies and benchmarking frameworks, underscoring the need for unified, IH-informed strategies for robust and generalizable instance selection.
dc.description.sponsorship	This work was supported by the Ministry for Digital Transformation and Civil Service and 'NextGenerationEU'/PRTR under Grant TSI-100925-2023-1. The author also acknowledges CITEEC, a Collaborative Center of the Galician University System, co-funded by the Xunta de Galicia and the European Union (ERDF) through the FEDER Galicia 2021-2027 program (ED431G 2023/10) under the thematic objective "A Smarter Europe: Innovative Economic Transformation".
dc.description.sponsorship	Xunta de Galicia; ED431G 2023/10
dc.identifier.citation	Fernández-Sánchez, A., Pose, M.G., Canedo, V.B. et al. Comparison of data set sample selection algorithms for data science: a systematic review. Neural Comput & Applic 38, 408 (2026). https://doi.org/10.1007/s00521-026-12124-w
dc.identifier.doi	10.1007/s00521-026-12124-w
dc.identifier.issn	1433-3058
dc.identifier.uri	https://hdl.handle.net/2183/48518
dc.language.iso	eng
dc.publisher	Springer
dc.relation.projectID	info:eu-repo/grantAgreement/MTDPF//TSI-100925-2023-1/ES/CÁTEDRA UDC-INDITEX DE IA EN ALGORITMOS VERDES
dc.relation.uri	https://doi.org/10.1007/s00521-026-12124-w
dc.rights	Attribution 4.0 International	en
dc.rights.accessRights	open access
dc.rights.uri	http://creativecommons.org/licenses/by/4.0/
dc.subject	Instance selection
dc.subject	Dataset filtering
dc.subject	Instance hardness
dc.subject	Data complexity
dc.subject	Machine learning
dc.title	Comparison of Data Set Sample Selection Algorithms for Data Science: a Systematic Review
dc.type	journal article
dc.type.hasVersion	VoR
dspace.entity.type	Publication
relation.isAuthorOfPublication	65439986-7b8c-4418-b8e3-5694f520ecc7
relation.isAuthorOfPublication	c114dccd-76e4-4959-ba6b-7c7c055289b1
relation.isAuthorOfPublication	5139dea6-2326-4384-a423-317cec26ee8a
relation.isAuthorOfPublication	fa192a4c-bffd-4b23-87ae-e68c29350cdc
relation.isAuthorOfPublication.latestForDiscovery	65439986-7b8c-4418-b8e3-5694f520ecc7

Files

Original bundle

Name: Gestal_Marcos_2026_Comparison_of_Data_Set_Sample_Selection_Algorithms_for_Data_Science.pdf
Size: 1.69 MB
Format: Adobe Portable Document Format

Collections

Investigación (FIC)