Testing the tests: simulation of rankings to compare statistical significance tests in information retrieval evaluation

UDC.coleccion: Investigación
UDC.conferenceTitle: SAC '21: The 36th ACM/SIGAPP Symposium on Applied Computing
UDC.departamento: Ciencias da Computación e Tecnoloxías da Información
UDC.endPage: 664
UDC.grupoInv: Information Retrieval Lab (IRlab)
UDC.institutoCentro: CITIC - Centro de Investigación de Tecnoloxías da Información e da Comunicación
UDC.startPage: 655
UDC.volume: 2021
dc.contributor.author: Parapar, Javier
dc.contributor.author: Losada, David E.
dc.contributor.author: Barreiro, Álvaro
dc.date.accessioned: 2025-03-05T17:58:39Z
dc.date.available: 2025-03-05T17:58:39Z
dc.date.issued: 2021-04
dc.description: This is the author's version of the work. It is posted here for your personal use. Not for redistribution. The definitive Version of Record was published in Proceedings of the 36th Annual ACM Symposium on Applied Computing (SAC '21). Association for Computing Machinery, New York, NY, USA, 655–664. https://doi.org/10.1145/3412841.3441945
dc.description.abstract: [Abstract]: Null Hypothesis Significance Testing (NHST) has been recurrently employed as the reference framework to assess the difference in performance between Information Retrieval (IR) systems. IR practitioners customarily apply significance tests, such as the t-test, the Wilcoxon Signed Rank test, the Permutation test, the Sign test or the Bootstrap test. However, the question of which of these tests is the most reliable in IR experimentation is still controversial. Different authors have tried to shed light on this issue, but their conclusions are not in agreement. In this paper, we present a new methodology for assessing the behavior of significance tests in typical ranking tasks. Our method creates models from the search systems and uses those models to simulate different inputs to the significance tests. With such an approach, we can control the experimental conditions and run experiments with full knowledge about the truth or falseness of the null hypothesis. Following our methodology, we computed a series of simulations that estimate the proportion of Type I and Type II errors made by different tests. Results conclusively suggest that the Wilcoxon test is the most reliable test and, thus, IR practitioners should adopt it as the reference tool to assess differences between IR systems.
dc.description.sponsorship: This work was supported by projects RTI2018-093336-B-C21, RTI-2018-093336-B-C22 (Ministerio de Ciencia e Innovación & ERDF). The first and third authors thank the financial support supplied by the Consellería de Educación, Universidade e Formación Profesional (accreditation 2019-2022 ED431G/01, ED431B 2019/03) and the European Regional Development Fund, which acknowledges the CITIC Research Center in ICT of the University of A Coruña as a Research Center of the Galician University System. The second author also thanks the financial support supplied by the Consellería de Educación, Universidade e Formación Profesional (accreditation 2019-2022 ED431G-2019/04, ED431C 2018/29) and the European Regional Development Fund, which acknowledges the CiTIUS-Research Center in Intelligent Technologies of the University of Santiago de Compostela as a Research Center of the Galician University System.
dc.description.sponsorship: Xunta de Galicia; ED431G/01
dc.description.sponsorship: Xunta de Galicia; ED431B 2019/03
dc.description.sponsorship: Xunta de Galicia; ED431G-2019/04
dc.description.sponsorship: Xunta de Galicia; ED431C 2018/29
dc.identifier.citation: Javier Parapar, David E. Losada, and Álvaro Barreiro. 2021. Testing the tests: simulation of rankings to compare statistical significance tests in information retrieval evaluation. In Proceedings of the 36th Annual ACM Symposium on Applied Computing (SAC '21). Association for Computing Machinery, New York, NY, USA, 655–664. https://doi.org/10.1145/3412841.3441945
dc.identifier.doi: 10.1145/3412841.3441945
dc.identifier.isbn: 978-1-4503-8104-8
dc.identifier.uri: http://hdl.handle.net/2183/41305
dc.language.iso: eng
dc.publisher: Association for Computing Machinery
dc.relation.projectID: info:eu-repo/grantAgreement/AEI/Plan Estatal de Investigación Científica y Técnica y de Innovación 2017-2020/RTI2018-093336-B-C21/ES/TECNOLOGIAS PARA LA PREDICCION TEMPRANA DE SIGNOS RELACIONADOS CON TRASTORNOS PSICOLOGICOS
dc.relation.projectID: info:eu-repo/grantAgreement/AEI/Plan Estatal de Investigación Científica y Técnica y de Innovación 2017-2020/RTI2018-093336-B-C22/ES/TECNOLOGIAS PARA LA PREDICCION TEMPRANA DE SIGNOS RELACIONADOS CON TRASTORNOS PSICOLOGICOS (SUBPROYECTO UDC)
dc.relation.uri: https://doi.org/10.1145/3412841.3441945
dc.rights: © 2021 Authors|ACM. This author's version is posted here for your personal use. Not for redistribution.
dc.rights.accessRights: open access
dc.subject: Information retrieval
dc.subject: Statistical testing
dc.subject: Simulation
dc.title: Testing the tests: simulation of rankings to compare statistical significance tests in information retrieval evaluation
dc.type: conference output
dspace.entity.type: Publication
relation.isAuthorOfPublication: fef1a9cb-e346-4e53-9811-192e144f09d0
relation.isAuthorOfPublication: a3e43020-ee28-428d-8087-2f3c1e20aa2c
relation.isAuthorOfPublication.latestForDiscovery: fef1a9cb-e346-4e53-9811-192e144f09d0

Files

Original bundle

Name: Parapar_Javier_2021_Testing_the_Tests.pdf
Size: 448.3 KB
Format: Adobe Portable Document Format