Bavaresco, Anna;
Bernardi, Raffaella;
Bertolazzi, Leonardo;
Elliott, Desmond;
Fernández, Raquel;
Gatt, Albert;
Ghaleb, Esam;
... Testoni, Alberto
(2025)
LLMs instead of Human Judges? A Large Scale Empirical Study across 20 NLP Evaluation Tasks.
In:
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers).
(pp. 238-255).
Association for Computational Linguistics
Text: 2025.acl-short.20.pdf - Published Version (705kB)
Abstract
There is an increasing trend towards evaluating NLP models with LLMs instead of human judgments, raising questions about the validity of these evaluations, as well as their reproducibility in the case of proprietary models. We provide JUDGE-BENCH, an extensible collection of 20 NLP datasets with human annotations covering a broad range of evaluated properties and types of data, and comprehensively evaluate 11 current LLMs, covering both open-weight and proprietary models, for their ability to replicate the annotations. Our evaluations show substantial variance across models and datasets. Models are reliable evaluators on some tasks, but overall display substantial variability depending on the property being evaluated, the expertise level of the human judges, and whether the language is human or model-generated. We conclude that LLMs should be carefully validated against human judgments before being used as evaluators.
| Type: | Proceedings paper |
|---|---|
| Title: | LLMs instead of Human Judges? A Large Scale Empirical Study across 20 NLP Evaluation Tasks |
| Event: | 63rd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers) |
| Dates: | Jul 2025 |
| Open access status: | An open access version is available from UCL Discovery |
| DOI: | 10.18653/v1/2025.acl-short.20 |
| Publisher version: | https://doi.org/10.18653/v1/2025.acl-short.20 |
| Language: | English |
| Additional information: | © 1963–2025 ACL; other materials are copyrighted by their respective copyright holders. Materials prior to 2016 here are licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 3.0 International License. Permission is granted to make copies for the purposes of teaching and research. Materials published in or after 2016 are licensed on a Creative Commons Attribution 4.0 International License. |
| UCL classification: | UCL > Provost and Vice Provost Offices > School of Life and Medical Sciences > Faculty of Brain Sciences > Div of Psychology and Lang Sciences > Linguistics |
| URI: | https://discovery.ucl.ac.uk/id/eprint/10216471 |

