Title: Testing the tests: simulation of rankings to compare statistical significance tests in information retrieval evaluation
Authors: Parapar, Javier; Losada, David E.; Barreiro, Álvaro
Date issued: 2021-04
Date deposited: 2025-03-05
Citation: Javier Parapar, David E. Losada, and Álvaro Barreiro. 2021. Testing the tests: simulation of rankings to compare statistical significance tests in information retrieval evaluation. In Proceedings of the 36th Annual ACM Symposium on Applied Computing (SAC '21). Association for Computing Machinery, New York, NY, USA, 655–664. https://doi.org/10.1145/3412841.3441945
ISBN: 978-1-4503-8104-8
DOI: 10.1145/3412841.3441945
Handle: http://hdl.handle.net/2183/41305
Language: English
Keywords: Information retrieval; Statistical testing; Simulation
Type: Conference output
Access: Open access
Rights: © 2021 Authors | ACM. This is the author's version of the work. It is posted here for your personal use. Not for redistribution. The definitive Version of Record was published in Proceedings of the 36th Annual ACM Symposium on Applied Computing (SAC '21). Association for Computing Machinery, New York, NY, USA, 655–664. https://doi.org/10.1145/3412841.3441945.

Abstract: Null Hypothesis Significance Testing (NHST) has been recurrently employed as the reference framework to assess the difference in performance between Information Retrieval (IR) systems. IR practitioners customarily apply significance tests such as the t-test, the Wilcoxon Signed Rank test, the Permutation test, the Sign test, or the Bootstrap test. However, the question of which of these tests is the most reliable in IR experimentation is still controversial. Different authors have tried to shed light on this issue, but their conclusions do not agree. In this paper, we present a new methodology for assessing the behavior of significance tests in typical ranking tasks. Our method creates models from the search systems and uses those models to simulate different inputs to the significance tests. With such an approach, we can control the experimental conditions and run experiments with full knowledge about the truth or falsity of the null hypothesis. Following our methodology, we ran a series of simulations that estimate the proportion of Type I and Type II errors made by the different tests. Results indicate that the Wilcoxon test is the most reliable test and, thus, IR practitioners should adopt it as the reference tool to assess differences between IR systems.
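As an illustrative sketch only: the abstract names several paired significance tests (t-test, Wilcoxon Signed Rank, Sign, Permutation) that IR practitioners apply to per-topic scores of two systems. The snippet below shows how those tests can be run with `scipy.stats` on simulated paired scores. It is not the paper's ranking-simulation methodology; the score distributions and sample size are assumptions made up for the example.

```python
# Hedged illustration: applying paired significance tests to simulated
# per-topic scores of two IR systems. The beta/normal score model below
# is an assumption for demonstration, not the paper's simulation method.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n_topics = 50

# Assumed per-topic effectiveness scores (e.g. Average Precision) for a
# baseline system and a slightly better variant.
baseline = rng.beta(2.0, 5.0, n_topics)
variant = np.clip(baseline + rng.normal(0.02, 0.05, n_topics), 0.0, 1.0)
diff = variant - baseline

# Paired t-test on the per-topic scores.
t_p = stats.ttest_rel(variant, baseline).pvalue

# Wilcoxon Signed Rank test (the test the paper recommends).
w_p = stats.wilcoxon(variant, baseline).pvalue

# Sign test: binomial test on the number of positive differences.
nonzero = diff[diff != 0]
s_p = stats.binomtest(int((nonzero > 0).sum()), n=nonzero.size).pvalue

# Paired permutation test on the mean of the differences.
perm_p = stats.permutation_test(
    (variant, baseline),
    lambda a, b: np.mean(a - b),
    permutation_type="samples",
    n_resamples=2000,
).pvalue

print(f"t-test={t_p:.4f} wilcoxon={w_p:.4f} sign={s_p:.4f} perm={perm_p:.4f}")
```

In a real evaluation the scores would come from a test collection (one value per topic and system) rather than from random draws, and the paper's point is precisely that the p-values these tests return can disagree, which is why their Type I/II error rates are compared under controlled simulation.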