LLMs instead of Human Judges? A Large Scale Empirical Study across 20 NLP Evaluation Tasks



Bavaresco, Anna; Bernardi, Raffaella; Bertolazzi, Leonardo; Elliott, Desmond; Fernández, Raquel; Gatt, Albert; Ghaleb, Esam; ... Testoni, Alberto (2025) LLMs instead of Human Judges? A Large Scale Empirical Study across 20 NLP Evaluation Tasks. In: Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers) (pp. 238-255). Association for Computational Linguistics. Green open access.

Text: 2025.acl-short.20.pdf - Published Version (705kB)

Abstract

There is an increasing trend towards evaluating NLP models with LLMs instead of human judgments, raising questions about the validity of these evaluations, as well as their reproducibility in the case of proprietary models. We provide JUDGE-BENCH, an extensible collection of 20 NLP datasets with human annotations covering a broad range of evaluated properties and types of data, and comprehensively evaluate 11 current LLMs, covering both open-weight and proprietary models, for their ability to replicate the annotations. Our evaluations show substantial variance across models and datasets. Models are reliable evaluators on some tasks, but overall display substantial variability depending on the property being evaluated, the expertise level of the human judges, and whether the language is human or model-generated. We conclude that LLMs should be carefully validated against human judgments before being used as evaluators.

Type:	Proceedings paper
Title:	LLMs instead of Human Judges? A Large Scale Empirical Study across 20 NLP Evaluation Tasks
Event:	63rd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)
Dates:	Jul 2025
Open access status:	An open access version is available from UCL Discovery
DOI:	10.18653/v1/2025.acl-short.20
Publisher version:	https://doi.org/10.18653/v1/2025.acl-short.20
Language:	English
Additional information:	© 1963–2025 ACL; other materials are copyrighted by their respective copyright holders. Materials prior to 2016 here are licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 3.0 International License. Permission is granted to make copies for the purposes of teaching and research. Materials published in or after 2016 are licensed on a Creative Commons Attribution 4.0 International License.
UCL classification:	UCL > Provost and Vice Provost Offices > School of Life and Medical Sciences > Faculty of Brain Sciences > Div of Psychology and Lang Sciences > Linguistics
URI:	https://discovery.ucl.ac.uk/id/eprint/10216471
