Limitations of Automatic Relevance Assessments with Large Language Models for Fair and Reliable Retrieval Evaluation

UDC.coleccion: Investigación
UDC.conferenceTitle: SIGIR ’25
UDC.departamento: Ciencias da Computación e Tecnoloxías da Información
UDC.grupoInv: Information Retrieval Lab (IRlab)
UDC.institutoCentro: CITIC - Centro de Investigación de Tecnoloxías da Información e da Comunicación
dc.contributor.author: Otero, David
dc.contributor.author: Parapar, Javier
dc.contributor.author: Barreiro, Álvaro
dc.date.accessioned: 2025-07-22T08:19:46Z
dc.date.available: 2025-07-22T08:19:46Z
dc.date.issued: 2025-07-13
dc.description: Presented at: SIGIR '25: The 48th International ACM SIGIR Conference on Research and Development in Information Retrieval, July 13-18, 2025, Padua, Italy. © Owner/Author | ACM 2025. This is the author's version of the work. It is posted here for your personal use. Not for redistribution. The definitive Version of Record was published in SIGIR ’25, https://doi.org/10.1145/3726302.3730221
dc.description.abstract: [Abstract]: Offline evaluation of search systems depends on test collections. These benchmarks provide researchers with a corpus of documents, a set of topics, and relevance judgements indicating which documents are relevant for each topic. While test collections are an integral part of Information Retrieval (IR) research, their creation requires significant manual annotation effort. Large language models (LLMs) are gaining much attention as tools for automatic relevance assessment. Recent research has shown that LLM-based assessments yield high system-ranking correlations with human judgements. These correlations are helpful in large-scale experiments but less informative when the focus is on top-performing systems. Moreover, they ignore whether and how LLM-based judgements alter the statistically significant differences among systems found with human assessments. In this work, we examine how well LLM-generated judgements preserve ranking differences among top-performing systems and how well they reproduce the pairwise significance outcomes obtained with human judgements. Our results show that LLM-based judgements are unfair when ranking top-performing systems. Moreover, we observe an exceedingly high rate of false positives regarding statistical differences.
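As a rough illustration of the comparison described in the abstract, the sketch below (not the authors' code) computes the Kendall's tau correlation between the system rankings induced by two sets of judgements and counts system pairs that a paired t-test calls significantly different only under the LLM-based judgements. The per-topic scores, the numbers of systems and topics, and the significance threshold are all hypothetical placeholders.

```python
# Illustrative sketch: compare system rankings produced under human vs.
# LLM-based judgements and count significance-test disagreements.
# All scores below are synthetic placeholders, not data from the paper.
import numpy as np
from scipy.stats import kendalltau, ttest_rel

rng = np.random.default_rng(0)
n_systems, n_topics = 5, 50
# Hypothetical per-topic effectiveness scores (e.g., nDCG) per qrel set.
human_scores = rng.uniform(0.2, 0.8, size=(n_systems, n_topics))
llm_scores = human_scores + rng.normal(0, 0.05, size=(n_systems, n_topics))

# System-ranking correlation: order systems by mean score under each qrel set.
tau, _ = kendalltau(human_scores.mean(axis=1), llm_scores.mean(axis=1))
print(f"Kendall's tau between system rankings: {tau:.3f}")

# A "false positive" here is a pair of systems that the LLM judgements call
# significantly different while the human judgements do not.
alpha, false_positives = 0.05, 0
for i in range(n_systems):
    for j in range(i + 1, n_systems):
        p_human = ttest_rel(human_scores[i], human_scores[j]).pvalue
        p_llm = ttest_rel(llm_scores[i], llm_scores[j]).pvalue
        if p_llm < alpha and p_human >= alpha:
            false_positives += 1
print(f"Pairs significant only under LLM judgements: {false_positives}")
```

In the paper's setting, the per-topic scores would instead come from evaluating real runs against human qrels and against LLM-generated qrels, respectively.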
dc.description.sponsorship: The authors gratefully acknowledge the financial support provided by grant PID2022-137061OB-C21, funded by MICIU/AEI/10.13039/501100011033 and by "ERDF/EU". The authors also acknowledge the funding provided by the Consellería de Cultura, Educación, Formación Profesional e Universidades (accreditations ED431G 2023/01 and ED431C 2025/49) and the European Regional Development Fund, which acknowledges the CITIC Research Center as a Center of Excellence and recognizes it as a Member of the CIGUS Network for the period 2024-2027.
dc.identifier.citation: David Otero, Javier Parapar, and Álvaro Barreiro. 2025. Limitations of Automatic Relevance Assessments with Large Language Models for Fair and Reliable Retrieval Evaluation. In Proceedings of the 48th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR ’25), July 13–18, 2025, Padua, Italy. ACM, New York, NY, USA, 5 pages. https://doi.org/10.1145/3726302.3730221
dc.identifier.doi: 10.1145/3726302.3730221
dc.identifier.uri: https://hdl.handle.net/2183/45538
dc.language.iso: eng
dc.publisher: ACM
dc.relation.projectID: info:eu-repo/grantAgreement/AEI/Plan Estatal de Investigación Científica y Técnica y de Innovación 2021-2023/PID2022-137061OB-C21/ES/BUSQUEDA, SELECCION Y ORGANIZACION DE CONTENIDOS PARA NECESIDADES DE INFORMACION RELACIONADAS CON LA SALUD - CONSTRUCCION DE RECURSOS Y PERSONALIZACION
dc.relation.uri: https://doi.org/10.1145/3726302.3730221
dc.rights: © 2025 Copyright held by the owner/author(s). Publication rights licensed to ACM.
dc.rights: Attribution 4.0 International
dc.rights.accessRights: open access
dc.rights.uri: http://creativecommons.org/licenses/by/4.0/
dc.subject: Information Retrieval
dc.subject: Test Collections
dc.subject: LLMs
dc.subject: Relevance Assessments
dc.title: Limitations of Automatic Relevance Assessments with Large Language Models for Fair and Reliable Retrieval Evaluation
dc.type: conference output
dspace.entity.type: Publication
relation.isAuthorOfPublication: 00d04042-9b75-419e-9aab-33fd14b201af
relation.isAuthorOfPublication: fef1a9cb-e346-4e53-9811-192e144f09d0
relation.isAuthorOfPublication: a3e43020-ee28-428d-8087-2f3c1e20aa2c
relation.isAuthorOfPublication.latestForDiscovery: 00d04042-9b75-419e-9aab-33fd14b201af

Files

Original bundle

Name: Otero_David_2025_Limitations_of_Automatic_Relevance_Assessments_with_Large_Language_Models.pdf
Size: 694.47 KB
Format: Adobe Portable Document Format