UCL Discovery

Understanding and Evaluating Generalisation for Superhuman AI Systems

Kirk, Robert; (2025) Understanding and Evaluating Generalisation for Superhuman AI Systems. Doctoral thesis (Ph.D), UCL (University College London). Green open access

Full text: Kirk_10205843_Thesis.pdf (Download, 20MB)

Abstract

As artificial intelligence systems grow increasingly sophisticated and begin to surpass human capabilities, the critical challenges of AI alignment and safety come to the forefront. Ensuring that advanced AI systems remain robustly safe and aligned with human values requires a deep understanding of their generalisation properties, i.e. how these systems behave in novel situations and tasks potentially far beyond their training data. With the goal of improving our understanding of generalisation in superhuman AI systems, this dissertation investigates this challenge in current systems that share potential similarities with those of the future: agentic systems trained with reinforcement learning (RL), and systems involving large-scale pretraining, such as large language models (LLMs). Throughout, I demonstrate how proper evaluations are crucial to understanding the generalisation abilities of AI systems.

I begin with a comprehensive survey of zero-shot generalisation in deep reinforcement learning, analysing existing environments, evaluation protocols, benchmarks, and methods. This survey not only synthesises current knowledge but also proposes best practices and future research directions in this rapidly evolving field. To facilitate research into generalisation in reinforcement learning problems, I introduce MiniHack, a versatile environment creation tool and benchmark suite. MiniHack enables researchers to design and evaluate a wide array of RL scenarios, with a particular emphasis on zero-shot generalisation tasks. Both of these works demonstrate that training RL agents from scratch is unlikely to produce generally intelligent systems, and that RL is more likely to be used at the fine-tuning stage, once more generalisable representations have been learned with other techniques. Additionally, there is still work to be done to produce RL algorithms that yield robust, generalisable, and hence safe AI agents.

This insight motivates the latter two chapters of this thesis, which focus on investigating the fine-tuning of pretrained models. First, I investigate the impact of various fine-tuning techniques on large language models. I compare Reinforcement Learning from Human Feedback (RLHF), supervised fine-tuning (SFT), and best-of-N sampling, evaluating their effects on generalisation capabilities and output diversity across multiple tasks. My findings reveal that while RLHF enhances generalisation, it comes at the cost of reduced output diversity compared to SFT. To complement this behavioural understanding of language-model fine-tuning, I then investigate the mechanistic effects of fine-tuning on pretrained models, using a synthetic data setting and a suite of interpretability tools. My analysis uncovers that fine-tuning primarily creates minimal "wrappers" around existing model capabilities, rather than deleting existing capabilities or producing entirely new ones. This implies that the generalisation properties of fine-tuned models are likely fundamentally limited by the representations learned during pretraining. Again, while pretrained models clearly have much-improved generalisation properties, there remains a gap between how current algorithms perform and the level of robustness required for safe and aligned AI systems.

By examining generalisation from diverse angles, this thesis contributes to our understanding of how AI systems adapt to new challenges and how various training techniques influence both their behaviour and internal mechanisms. These insights are crucial for the development of more robust, adaptable, and ultimately safer AI systems as we move closer to superhuman AI capabilities, helping to ensure that as AI systems become more powerful, they remain robustly aligned with human values and interests across a broad spectrum of applications and scenarios.
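As a hedged illustration of the kind of workflow MiniHack supports, the sketch below instantiates one of its registered Gym environments and takes a single step. The task name and the classic gym API follow the pattern in the public MiniHack documentation; they are assumptions for illustration, not content taken from the thesis itself.

    # Minimal MiniHack usage sketch (assumes `pip install minihack`).
    # "MiniHack-River-v0" is one of MiniHack's registered benchmark tasks;
    # any registered task name could be substituted.
    import gym
    import minihack  # importing registers the MiniHack environments with gym

    env = gym.make("MiniHack-River-v0")
    obs = env.reset()  # each reset procedurally generates a fresh level,
                       # which is what makes held-out, zero-shot evaluation natural
    obs, reward, done, info = env.step(env.action_space.sample())  # one random action
    env.render()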
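Similarly, best-of-N sampling, one of the fine-tuning-stage baselines compared in the thesis, is simple enough to sketch in a few lines. The generate and reward_model callables below are hypothetical stand-ins for a pretrained language model's sampler and a learned reward model, not APIs from the thesis or any particular library.

    # Hedged sketch of best-of-N (BoN) sampling: draw N candidate completions
    # from the pretrained policy and keep the one the reward model scores highest.
    # `generate(prompt)` and `reward_model(prompt, completion)` are hypothetical.
    from typing import Callable

    def best_of_n(
        prompt: str,
        generate: Callable[[str], str],
        reward_model: Callable[[str, str], float],
        n: int = 16,
    ) -> str:
        """Return the highest-reward completion among n independent samples."""
        candidates = [generate(prompt) for _ in range(n)]
        return max(candidates, key=lambda c: reward_model(prompt, c))

Because BoN only filters samples at inference time and never updates the policy's weights, it sits at the opposite end of the spectrum from weight-updating methods such as RLHF and SFT.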

Type: Thesis (Doctoral)
Qualification: Ph.D
Title: Understanding and Evaluating Generalisation for Superhuman AI Systems
Open access status: An open access version is available from UCL Discovery
Language: English
Additional information: Copyright © The Author 2025. Original content in this thesis is licensed under the terms of the Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0) Licence (https://creativecommons.org/licenses/by-nc/4.0/). Any third-party copyright material present remains the property of its respective owner(s) and is licensed under its existing terms. Access may initially be restricted at the author’s request.
UCL classification: UCL > Provost and Vice Provost Offices > UCL BEAMS
UCL > Provost and Vice Provost Offices > UCL BEAMS > Faculty of Engineering Science
UCL > Provost and Vice Provost Offices > UCL BEAMS > Faculty of Engineering Science > Dept of Computer Science
UCL
URI: https://discovery.ucl.ac.uk/id/eprint/10205843
