Authors: Otero, David; Parapar, Javier; Barreiro, Álvaro
Date: 2025-04-16 (2025)
Citation: Otero, D., Parapar, J., Barreiro, Á. (2025). Towards Reliable Testing for Multiple Information Retrieval System Comparisons. In: Hauff, C., et al. Advances in Information Retrieval. ECIR 2025. Lecture Notes in Computer Science, vol 15573. Springer, Cham. https://doi.org/10.1007/978-3-031-88711-6_27
ISBN: 978-3-031-88711-6
Handle: http://hdl.handle.net/2183/41777
Presented at: Advances in Information Retrieval: 47th European Conference on Information Retrieval, ECIR 2025, Lucca, Italy, April 6–10, 2025
Note: This version of the article has been accepted for publication after peer review and is subject to Springer Nature's AM terms of use, but it is not the Version of Record and does not reflect post-acceptance improvements or any corrections. The Version of Record is available online at: http://dx.doi.org/10.1007/978-3-031-88711-6_27

[Abstract]: Null Hypothesis Significance Testing is the de facto tool for assessing effectiveness differences between Information Retrieval systems. Researchers use statistical tests to check whether those differences will generalise to online settings or are merely due to the samples observed in the laboratory. Much work has been devoted to studying which test is the most reliable when comparing a pair of systems, but most real-world IR experiments involve more than two. In the multiple comparisons scenario, testing several systems simultaneously may inflate the errors committed by the tests. In this paper, we use a new approach to assess the reliability of multiple comparison procedures using simulated and real TREC data. Experiments show that Wilcoxon plus the Benjamini-Hochberg correction yields Type I error rates in line with the significance level for typical sample sizes, while being the best test in terms of statistical power.

Language: eng
Rights: © 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG
Keywords: Information Retrieval Evaluation; Null Hypothesis Significance Testing; Multiple Comparisons; Family-Wise Error Rate; False Discovery Rate
Title: Towards Reliable Testing for Multiple Information Retrieval System Comparisons
Type: conference output
Access: open access
DOI: 10.1007/978-3-031-88711-6_27
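The procedure the abstract singles out, pairwise Wilcoxon signed-rank tests with the Benjamini-Hochberg false-discovery-rate correction, can be sketched as follows. This is a minimal illustration, not the authors' experimental code: the system names and per-topic scores are synthetic, and the BH adjustment is implemented by hand so the sketch needs only NumPy and SciPy.

```python
from itertools import combinations

import numpy as np
from scipy.stats import wilcoxon


def benjamini_hochberg(pvals):
    """Benjamini-Hochberg adjusted p-values (step-up FDR procedure)."""
    p = np.asarray(pvals, dtype=float)
    m = p.size
    order = np.argsort(p)
    # Raw BH values p_(i) * m / i for the sorted p-values
    adj = p[order] * m / np.arange(1, m + 1)
    # Enforce monotonicity from the largest p-value downwards
    adj = np.minimum.accumulate(adj[::-1])[::-1]
    out = np.empty(m)
    out[order] = np.clip(adj, 0.0, 1.0)
    return out


rng = np.random.default_rng(7)
n_topics = 50
# Hypothetical per-topic effectiveness scores (e.g. AP) for four systems;
# the system names and effect sizes are made up for illustration.
base = rng.uniform(0.2, 0.6, n_topics)
systems = {
    "bm25": base,
    "rm3": base + rng.normal(0.05, 0.05, n_topics),
    "dense": base + rng.normal(0.10, 0.05, n_topics),
    "hybrid": base + rng.normal(0.00, 0.05, n_topics),
}

# One Wilcoxon signed-rank test per system pair, then correct the whole
# family of p-values with Benjamini-Hochberg to control the FDR.
pairs = list(combinations(systems, 2))
raw_p = [wilcoxon(systems[a], systems[b]).pvalue for a, b in pairs]
adj_p = benjamini_hochberg(raw_p)

for (a, b), p in zip(pairs, adj_p):
    verdict = "significant" if p < 0.05 else "not significant"
    print(f"{a} vs {b}: adjusted p = {p:.4f} ({verdict})")
```

Comparing adjusted p-values against the significance level is what keeps the family-wise error inflation described in the abstract in check when more than two systems are tested at once.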