Comparison of Data Set Sample Selection Algorithms for Data Science: a Systematic Review

Bibliographic citation

Fernández-Sánchez, A., Pose, M.G., Canedo, V.B. et al. Comparison of data set sample selection algorithms for data science: a systematic review. Neural Comput & Applic 38, 408 (2026). https://doi.org/10.1007/s00521-026-12124-w

Abstract

In the era of big data, selecting representative samples has become essential to mitigate overfitting, noise, and high computational cost in machine learning. This study systematically reviews the evolution of instance selection (IS) methods, highlighting the growing importance of instance hardness (IH) as a guiding criterion to improve training efficiency and model robustness. Through a comprehensive search in Scopus and Web of Science, fifty-five studies were identified and analyzed following strict inclusion and exclusion criteria. The reviewed works were classified according to their underlying rationale (error-based, geometric, heuristic, or explainability-driven), revealing that IH principles intersect these categories as a transversal perspective on data quality. Most studies focus on enhancing predictive accuracy (56%) and computational efficiency (36%), while bias reduction and privacy preservation remain secondary. Reported outcomes show significant dataset reductions (up to 97%) with minimal accuracy loss and, in some cases, notable performance gains (+32% accuracy, +67% improvement in MSE). Despite these advances, explicit references to IH are rare, though many methods implicitly rely on related metrics such as misclassification frequency or decision-boundary proximity. Overall, IS is gaining relevance across domains such as cybersecurity, biomedicine, and computer vision, yet the field still lacks standardized methodologies and benchmarking frameworks, underscoring the need for unified, IH-informed strategies for robust and generalizable instance selection.

Description

Open access publication funded by: Universidade da Coruña/CISUG

Rights

Attribution 4.0 International

Except where otherwise noted, this item's license is described as Attribution 4.0 International