Limitations of Automatic Relevance Assessments with Large Language Models for Fair and Reliable Retrieval Evaluation

UDC.coleccion: Investigación
UDC.conferenceTitle: SIGIR ’25
UDC.departamento: Ciencias da Computación e Tecnoloxías da Información
UDC.grupoInv: Information Retrieval Lab (IRlab)
UDC.institutoCentro: CITIC - Centro de Investigación de Tecnoloxías da Información e da Comunicación
dc.contributor.author: Otero, David
dc.contributor.author: Parapar, Javier
dc.contributor.author: Barreiro, Álvaro
dc.date.accessioned: 2025-07-22T08:19:46Z
dc.date.available: 2025-07-22T08:19:46Z
dc.date.issued: 2025-07-13
dc.description: Presented at: SIGIR '25: The 48th International ACM SIGIR Conference on Research and Development in Information Retrieval, July 13-18, 2025, Padua, Italy. © Owner/Author | ACM 2025. This is the author's version of the work. It is posted here for your personal use. Not for redistribution. The definitive Version of Record was published in SIGIR ’25, https://doi.org/10.1145/3726302.3730221
dc.description.abstract: [Abstract]: Offline evaluation of search systems depends on test collections. These benchmarks provide researchers with a corpus of documents, a set of topics, and relevance judgements indicating which documents are relevant for each topic. While test collections are an integral part of Information Retrieval (IR) research, their creation requires significant manual annotation effort. Large language models (LLMs) are gaining much attention as tools for automatic relevance assessment. Recent research has shown that LLM-based assessments yield high system-ranking correlations with human judgements. These correlations are helpful in large-scale experiments but less informative when the focus is on top-performing systems. Moreover, they ignore whether and how LLM-based judgements alter the statistically significant differences among systems found with human assessments. In this work, we examine how well LLM-generated judgements preserve ranking differences among top-performing systems and how well they reproduce the pairwise significance outcomes obtained with human judgements. Our results show that LLM-based judgements are unfair when ranking top-performing systems. Moreover, we observe an exceedingly high rate of false positives regarding statistical differences.
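As a rough illustration of the comparison described in the abstract, the sketch below (not the authors' code) computes the Kendall's tau correlation between the system rankings induced by two sets of judgements and counts system pairs that a paired t-test calls significantly different only under the LLM-based judgements. The per-topic scores, the numbers of systems and topics, and the significance threshold are all hypothetical placeholders.

```python
# Illustrative sketch: compare system rankings produced under human vs.
# LLM-based judgements and count significance-test disagreements.
# All scores below are synthetic placeholders, not data from the paper.
import numpy as np
from scipy.stats import kendalltau, ttest_rel

rng = np.random.default_rng(0)
n_systems, n_topics = 5, 50
# Hypothetical per-topic effectiveness scores (e.g., nDCG) per qrel set.
human_scores = rng.uniform(0.2, 0.8, size=(n_systems, n_topics))
llm_scores = human_scores + rng.normal(0, 0.05, size=(n_systems, n_topics))

# System-ranking correlation: order systems by mean score under each qrel set.
tau, _ = kendalltau(human_scores.mean(axis=1), llm_scores.mean(axis=1))
print(f"Kendall's tau between system rankings: {tau:.3f}")

# A "false positive" here is a pair of systems that the LLM judgements call
# significantly different while the human judgements do not.
alpha, false_positives = 0.05, 0
for i in range(n_systems):
    for j in range(i + 1, n_systems):
        p_human = ttest_rel(human_scores[i], human_scores[j]).pvalue
        p_llm = ttest_rel(llm_scores[i], llm_scores[j]).pvalue
        if p_llm < alpha and p_human >= alpha:
            false_positives += 1
print(f"Pairs significant only under LLM judgements: {false_positives}")
```

In the paper's setting, the per-topic scores would instead come from evaluating real runs against human qrels and against LLM-generated qrels, respectively.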
dc.description.sponsorship: The authors gratefully acknowledge the financial support provided by grant PID2022-137061OB-C21, funded by MICIU/AEI/10.13039/501100011033 and by "ERDF/EU". The authors also acknowledge the funding provided by the Consellería de Cultura, Educación, Formación Profesional e Universidades (accreditations ED431G 2023/01 and ED431C 2025/49) and the European Regional Development Fund, which acknowledges the CITIC Research Center as a Center of Excellence and recognizes it as a Member of the CIGUS Network for the period 2024-2027.
dc.identifier.citation: David Otero, Javier Parapar, and Álvaro Barreiro. 2025. Limitations of Automatic Relevance Assessments with Large Language Models for Fair and Reliable Retrieval Evaluation. In Proceedings of the 48th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR ’25), July 13–18, 2025, Padua, Italy. ACM, New York, NY, USA, 5 pages. https://doi.org/10.1145/3726302.3730221
dc.identifier.doi: 10.1145/3726302.3730221
dc.identifier.uri: https://hdl.handle.net/2183/45538
dc.language.iso: eng
dc.publisher: ACM
dc.relation.projectID: info:eu-repo/grantAgreement/AEI/Plan Estatal de Investigación Científica y Técnica y de Innovación 2021-2023/PID2022-137061OB-C21/ES/BUSQUEDA, SELECCION Y ORGANIZACION DE CONTENIDOS PARA NECESIDADES DE INFORMACION RELACIONADAS CON LA SALUD - CONSTRUCCION DE RECURSOS Y PERSONALIZACION
dc.relation.uri: https://doi.org/10.1145/3726302.3730221
dc.rights: © 2025 Copyright held by the owner/author(s). Publication rights licensed to ACM.
dc.rights: Attribution 4.0 International
dc.rights.accessRights: open access
dc.rights.uri: http://creativecommons.org/licenses/by/4.0/
dc.subject: Information Retrieval
dc.subject: Test Collections
dc.subject: LLMs
dc.subject: Relevance Assessments
dc.title: Limitations of Automatic Relevance Assessments with Large Language Models for Fair and Reliable Retrieval Evaluation
dc.type: conference output
dspace.entity.type: Publication
relation.isAuthorOfPublication: 00d04042-9b75-419e-9aab-33fd14b201af
relation.isAuthorOfPublication: fef1a9cb-e346-4e53-9811-192e144f09d0
relation.isAuthorOfPublication: a3e43020-ee28-428d-8087-2f3c1e20aa2c
relation.isAuthorOfPublication.latestForDiscovery: 00d04042-9b75-419e-9aab-33fd14b201af

Files

Original bundle

Name: Otero_David_2025_Limitations_of_Automatic_Relevance_Assessments_with_Large_Language_Models.pdf
Size: 694.47 KB
Format: Adobe Portable Document Format