Limitations of Automatic Relevance Assessments with Large Language Models for Fair and Reliable Retrieval Evaluation

Use this link to cite: https://hdl.handle.net/2183/45538
Bibliographic citation
David Otero, Javier Parapar, and Álvaro Barreiro. 2025. Limitations of Automatic Relevance Assessments with Large Language Models for Fair and Reliable Retrieval Evaluation. In Proceedings of the 48th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR ’25), July 13–18, 2025, Padua, Italy. ACM, New York, NY, USA, 5 pages. https://doi.org/10.1145/3726302.3730221
Abstract
Offline evaluation of search systems depends on test collections. These benchmarks provide researchers with a corpus of documents, topics, and relevance judgements indicating which documents are relevant for each topic. While test collections are an integral part of Information Retrieval (IR) research, their creation involves significant manual annotation effort. Large language models (LLMs) are gaining much attention as tools for automatic relevance assessment. Recent research has shown that LLM-based assessments yield high system-ranking correlations with human-made judgements. These correlations are helpful in large-scale experiments but less informative if we want to focus on top-performing systems. Moreover, they ignore whether and how LLM-based judgements affect the statistically significant differences among systems with respect to human assessments. In this work, we examine how well LLM-generated judgements preserve ranking differences among top-performing systems and how well they preserve pairwise significance outcomes relative to human judgements. Our results show that LLM-based judgements are unfair at ranking top-performing systems. Moreover, we observe an exceedingly high rate of false positives regarding statistical differences.
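To make the notion of system-ranking correlation concrete: studies of this kind typically score every system once with human judgements and once with LLM judgements, then compare the two orderings with a rank-correlation coefficient such as Kendall's tau. The choice of Kendall's tau and the toy effectiveness scores below are illustrative assumptions, not details taken from this abstract. A minimal pure-Python sketch:

```python
from itertools import combinations

def kendall_tau(scores_a, scores_b):
    """Kendall's tau-a between two score lists over the same systems.

    Counts, over all system pairs, how often the two judgement sets
    agree (concordant) or disagree (discordant) on which system is better.
    """
    assert len(scores_a) == len(scores_b)
    concordant = discordant = 0
    for i, j in combinations(range(len(scores_a)), 2):
        d = (scores_a[i] - scores_a[j]) * (scores_b[i] - scores_b[j])
        if d > 0:
            concordant += 1
        elif d < 0:
            discordant += 1
    n_pairs = len(scores_a) * (len(scores_a) - 1) // 2
    return (concordant - discordant) / n_pairs

# Hypothetical MAP scores for four systems under human vs. LLM judgements.
human = [0.41, 0.38, 0.35, 0.30]
llm = [0.39, 0.40, 0.33, 0.31]
print(kendall_tau(human, llm))  # 1.0 = identical ranking, -1.0 = reversed
```

A high tau over the full set of systems can still hide swaps among the top few systems, which is exactly the gap between aggregate correlation and fair top-system ranking that the paper examines.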
Description
Presented at: SIGIR '25: The 48th International ACM SIGIR Conference on Research and Development in Information Retrieval, July 13–18, 2025, Padua, Italy
© Owner/Author | ACM 2025 This is the author's version of the work. It is posted here for your personal use. Not for redistribution. The definitive Version of Record was published in SIGIR ’25, https://doi.org/10.1145/3726302.3730221
Rights
© 2025 Copyright held by the owner/author(s). Publication rights licensed to ACM.
Attribution 4.0 International