Bartolo, Max; (2025) Adversarial Robustness of Language Models with Humans and Models in the Loop. Doctoral thesis (Ph.D), UCL (University College London).
Text: Adversarial Robustness of Language Models with Humans and Models in the Loop - Max Bartolo.pdf - Accepted Version (7MB)
Abstract
Machine Learning (ML) systems often fail in unexpected and unpredictable ways. They lack robustness to minor non-semantic changes to inputs, which can limit their potential for widespread application. We provide a comprehensive exploration of the involvement of humans and models in the loop to study and improve the adversarial robustness of machine language understanding. We first investigate the use of increasingly capable models in the annotation loop to collect progressively more complex and interesting data, for both training and evaluation. We further investigate the downstream generalisation, robustness and transfer implications, demonstrating improvements across all axes of interest. Following this, we introduce Dynabench, an open-source platform to facilitate dynamic dataset creation and model benchmarking, aiming for more robust and informative dynamic benchmarks across a suite of NLP tasks. Building on this foundation, we explore Synthetic Adversarial Data Generation (SADG), making models more robust to human adversaries without requiring any additional human data collection. We also introduce Adversarial Human Evaluation (AHE), an evaluation paradigm involving humans in the loop to measure robustness to adversarial attack, with implications for performance aspects such as robustness and safety. Finally, we introduce Generative Annotation Assistants (GAAs), generator-in-the-loop models that provide real-time suggestions which annotators can approve, modify, or reject entirely. A detailed study demonstrates that GAAs bring significant benefits in both annotation efficiency and effectiveness, which in turn lead to improved downstream model performance and robustness. We offer novel insight into the potential of human-model competition and collaboration, providing a pathway to more robust and reliable language models capable of adapting to diverse adversarial scenarios, representative of the real-world environments these models are expected to operate in.
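To make the GAA interaction pattern concrete, the sketch below illustrates one way a generator-in-the-loop annotation step could be wired up: a model proposes a candidate example and a human reviewer approves, modifies, or rejects it. This is a minimal hypothetical illustration only; the `Suggestion`, `gaa_annotate`, `generate`, and `review` names are assumptions made for this sketch and do not reflect the thesis's actual implementation.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass
class Suggestion:
    """A candidate annotation proposed by the generator."""
    question: str
    answer: str


def gaa_annotate(
    passage: str,
    generate: Callable[[str], Suggestion],
    review: Callable[[Suggestion], Tuple[str, Optional[Suggestion]]],
    max_attempts: int = 3,
) -> Optional[Suggestion]:
    """Collect one annotation with a generator in the loop.

    The generator proposes a question-answer pair for the passage; the
    human reviewer then approves it, modifies it, or rejects it, in
    which case a fresh suggestion is requested.
    """
    for _ in range(max_attempts):
        proposal = generate(passage)
        decision, edited = review(proposal)
        if decision == "approve":
            return proposal  # accepted as suggested
        if decision == "modify":
            return edited    # human-corrected version
        # "reject": loop around and ask the generator again
    return None              # fall back to the annotator writing from scratch


# Toy usage with a stubbed generator and an auto-approving reviewer.
if __name__ == "__main__":
    stub_generate = lambda p: Suggestion("What fails unpredictably?", "ML systems")
    stub_review = lambda s: ("approve", None)
    print(gaa_annotate("Machine Learning systems often fail...", stub_generate, stub_review))
```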
| Type: | Thesis (Doctoral) |
|---|---|
| Qualification: | Ph.D |
| Title: | Adversarial Robustness of Language Models with Humans and Models in the Loop |
| Open access status: | An open access version is available from UCL Discovery |
| Language: | English |
| Additional information: | Copyright © The Author 2025. Original content in this thesis is licensed under the terms of the Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0) Licence (https://creativecommons.org/licenses/by-nc/4.0/). Any third-party copyright material present remains the property of its respective owner(s) and is licensed under its existing terms. Access may initially be restricted at the author’s request. |
| UCL classification: | UCL > Provost and Vice Provost Offices > UCL BEAMS > Faculty of Engineering Science > Dept of Computer Science |
| URI: | https://discovery.ucl.ac.uk/id/eprint/10208031 |