Limitations of Automatic Relevance Assessments with Large Language Models for Fair and Reliable Retrieval Evaluation
| Field | Value | Language |
| --- | --- | --- |
| UDC.coleccion | Investigación | |
| UDC.conferenceTitle | SIGIR ’25 | |
| UDC.departamento | Ciencias da Computación e Tecnoloxías da Información | |
| UDC.grupoInv | Information Retrieval Lab (IRlab) | |
| UDC.institutoCentro | CITIC - Centro de Investigación de Tecnoloxías da Información e da Comunicación | |
| dc.contributor.author | Otero, David | |
| dc.contributor.author | Parapar, Javier | |
| dc.contributor.author | Barreiro, Álvaro | |
| dc.date.accessioned | 2025-07-22T08:19:46Z | |
| dc.date.available | 2025-07-22T08:19:46Z | |
| dc.date.issued | 2025-07-13 | |
| dc.description | Presented at: SIGIR '25: The 48th International ACM SIGIR Conference on Research and Development in Information Retrieval, July 13–18, 2025, Padua, Italy. © Owner/Author, ACM 2025. This is the author's version of the work. It is posted here for your personal use. Not for redistribution. The definitive Version of Record was published in SIGIR ’25, https://doi.org/10.1145/3726302.3730221 | |
| dc.description.abstract | [Abstract]: Offline evaluation of search systems depends on test collections. These benchmarks provide researchers with a corpus of documents, topics and relevance judgements indicating which documents are relevant for each topic. While test collections are an integral part of Information Retrieval (IR) research, their creation requires significant manual annotation effort. Large language models (LLMs) are gaining much attention as tools for automatic relevance assessment. Recent research has shown that LLM-based assessments yield high system-ranking correlations with human-made judgements. These correlations are helpful in large-scale experiments but less informative if we want to focus on top-performing systems. Moreover, these correlations ignore whether and how LLM-based judgements affect the statistically significant differences among systems with respect to human assessments. In this work, we examine how well LLM-generated judgements preserve ranking differences among top-performing systems and whether they preserve pairwise significance outcomes as human judgements do. Our results show that LLM-based judgements are unfair at ranking top-performing systems. Moreover, we observe an exceedingly high rate of false positives regarding statistical differences. | |
| dc.description.sponsorship | The authors acknowledge the financial support provided by grant PID2022-137061OB-C21, funded by MICIU/AEI/10.13039/501100011033 and by “ERDF/EU”. The authors also acknowledge the funding provided by the Consellería de Cultura, Educación, Formación Profesional e Universidades (accreditations ED431G 2023/01 and ED431C 2025/49) and the European Regional Development Fund, which acknowledges the CITIC Research Center as a Center of Excellence and recognizes it as a Member of the CIGUS Network for the period 2024–2027. | |
| dc.identifier.citation | David Otero, Javier Parapar, and Álvaro Barreiro. 2025. Limitations of Automatic Relevance Assessments with Large Language Models for Fair and Reliable Retrieval Evaluation. In Proceedings of the 48th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR ’25), July 13–18, 2025, Padua, Italy. ACM, New York, NY, USA, 5 pages. https://doi.org/10.1145/3726302.3730221 | |
| dc.identifier.doi | 10.1145/3726302.3730221 | |
| dc.identifier.uri | https://hdl.handle.net/2183/45538 | |
| dc.language.iso | eng | |
| dc.publisher | ACM | |
| dc.relation.projectID | info:eu-repo/grantAgreement/AEI/Plan Estatal de Investigación Científica y Técnica y de Innovación 2021-2023/PID2022-137061OB-C21/ES/BUSQUEDA, SELECCION Y ORGANIZACION DE CONTENIDOS PARA NECESIDADES DE INFORMACION RELACIONADAS CON LA SALUD - CONSTRUCCION DE RECURSOS Y PERSONALIZACION | |
| dc.relation.uri | https://doi.org/10.1145/3726302.3730221 | |
| dc.rights | © 2025 Copyright held by the owner/author(s). Publication rights licensed to ACM. | |
| dc.rights | Attribution 4.0 International | en |
| dc.rights.accessRights | open access | |
| dc.rights.uri | http://creativecommons.org/licenses/by/4.0/ | |
| dc.subject | Information Retrieval | |
| dc.subject | Test Collections | |
| dc.subject | LLMs | |
| dc.subject | Relevance Assessments | |
| dc.title | Limitations of Automatic Relevance Assessments with Large Language Models for Fair and Reliable Retrieval Evaluation | |
| dc.type | conference output | |
| dspace.entity.type | Publication | |
| relation.isAuthorOfPublication | 00d04042-9b75-419e-9aab-33fd14b201af | |
| relation.isAuthorOfPublication | fef1a9cb-e346-4e53-9811-192e144f09d0 | |
| relation.isAuthorOfPublication | a3e43020-ee28-428d-8087-2f3c1e20aa2c | |
| relation.isAuthorOfPublication.latestForDiscovery | 00d04042-9b75-419e-9aab-33fd14b201af |
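The system-ranking correlation analysis described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's actual method: the systems, scores, and the choice of Kendall's tau (a correlation measure commonly used for comparing system rankings in IR evaluation) are invented assumptions for this sketch.

```python
# Sketch: rank systems by mean effectiveness under human judgements and under
# LLM judgements, then measure how well the two rankings agree. Hypothetical
# scores only; see the paper for the real experimental setup.

def kendall_tau(a, b):
    """Kendall's tau-a between two equal-length lists of system scores."""
    n = len(a)
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            s = (a[i] - a[j]) * (b[i] - b[j])
            if s > 0:
                concordant += 1
            elif s < 0:
                discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)

# Invented mean effectiveness (e.g. nDCG@10) of five systems, ordered by
# their score under the human judgements.
human_scores = [0.52, 0.48, 0.45, 0.40, 0.31]
llm_scores   = [0.50, 0.51, 0.43, 0.41, 0.30]

print(f"tau over all systems:   {kendall_tau(human_scores, llm_scores):.2f}")
# The abstract's point: agreement can look high overall yet drop sharply
# when the comparison is restricted to the top-performing systems.
print(f"tau over top-3 systems: {kendall_tau(human_scores[:3], llm_scores[:3]):.2f}")
```

In this toy example the overall correlation is 0.80 but drops to 0.33 among the top three systems, mirroring the kind of discrepancy the abstract reports.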
Files
Original bundle
- Name: Otero_David_2025_Limitations_of_Automatic_Relevance_Assessments_with_Large_Language_Models.pdf
- Size: 694.47 KB
- Format: Adobe Portable Document Format

