UCL Discovery

An Auditing Test to Detect Behavioral Shift in Language Models

Richter, L; He, X; Minervini, P; Kusner, MJ; (2025) An Auditing Test to Detect Behavioral Shift in Language Models. In: Proceedings of the 13th International Conference on Learning Representations (ICLR 2025). (pp. 40671-40697). Green open access

12343_An_Auditing_Test_to_Dete.pdf - Published Version


Abstract

As language models (LMs) approach human-level performance, a comprehensive understanding of their behavior becomes crucial. This includes evaluating capabilities, biases, task performance, and alignment with societal values. Extensive initial evaluations, including red teaming and diverse benchmarking, can establish a model’s behavioral profile. However, subsequent fine-tuning or deployment modifications may alter these behaviors in unintended ways. We present an efficient statistical test to tackle Behavioral Shift Auditing (BSA) in LMs, which we define as detecting distribution shifts in qualitative properties of the output distributions of LMs. Our test compares model generations from a baseline model to those of the model under scrutiny and provides theoretical guarantees for change detection while controlling false positives. The test features a configurable tolerance parameter that adjusts sensitivity to behavioral changes for different use cases. We evaluate our approach using two case studies: monitoring changes in (a) toxicity and (b) translation performance. We find that the test is able to detect meaningful changes in behavior distributions using just hundreds of examples.
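The paper's full procedure is not reproduced in this record, but the abstract describes a sequential test that compares bounded behavior scores (e.g., toxicity) of generations from a baseline model and a model under scrutiny, with a tolerance parameter and false-positive control. As a rough, hypothetical sketch of that style of anytime-valid test (not the authors' exact method), one can run a betting-style e-process on paired score differences: under the null that the mean shift is at most a tolerance `eps`, the wealth process is a nonnegative supermartingale, so Ville's inequality bounds the false-positive rate by `alpha`. The function name and parameters below are illustrative assumptions.

```python
import numpy as np

def sequential_audit(base_scores, curr_scores, eps=0.05, alpha=0.05, lam=0.2):
    """Anytime-valid one-sided test of H0: E[curr - base] <= eps.

    Behavior scores are assumed to lie in [0, 1], so each paired
    difference d_i = curr_i - base_i lies in [-1, 1].  The wealth
    process W_t = prod_i (1 + lam * (d_i - eps)) is a nonnegative
    supermartingale under H0 for 0 < lam < 1 / (1 + eps), so by
    Ville's inequality P(sup_t W_t >= 1/alpha) <= alpha, which
    controls false positives at level alpha at any stopping time.

    Returns (shift_detected, samples_used).
    """
    wealth = 1.0
    for t, (b, c) in enumerate(zip(base_scores, curr_scores), start=1):
        d = c - b  # paired behavior-score difference in [-1, 1]
        wealth *= 1.0 + lam * (d - eps)
        if wealth >= 1.0 / alpha:
            return True, t  # evidence of a shift beyond the tolerance
    return False, len(base_scores)

# Illustrative run: a clear 0.3 shift in mean score is flagged quickly,
# while identical score streams are never flagged.
rng = np.random.default_rng(0)
base = rng.uniform(0.0, 0.2, size=500)
shifted = np.clip(base + 0.3, 0.0, 1.0)

print(sequential_audit(base, shifted))  # detects the shift
print(sequential_audit(base, base))     # (False, 500): no detection
```

Consistent with the abstract's finding, a shift of this size is detected after on the order of tens to hundreds of examples; the tolerance `eps` plays the role of the configurable sensitivity parameter, letting an auditor ignore drifts smaller than a chosen magnitude.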

Type: Proceedings paper
Title: An Auditing Test to Detect Behavioral Shift in Language Models
Event: 13th International Conference on Learning Representations (ICLR 2025)
Open access status: An open access version is available from UCL Discovery
Publisher version: https://openreview.net/forum?id=h0jdAboh0o
Language: English
Additional information: © The Authors 2025. Original content in this work is licensed under the terms of the Creative Commons Attribution 4.0 International (CC BY 4.0) Licence (https://creativecommons.org/licenses/by/4.0/).
Keywords: AI alignment, model auditing, model evaluations, red teaming, sequential hypothesis testing
UCL classification: UCL
UCL > Provost and Vice Provost Offices > UCL BEAMS
UCL > Provost and Vice Provost Offices > UCL BEAMS > Faculty of Engineering Science > Dept of Computer Science
URI: https://discovery.ucl.ac.uk/id/eprint/10211860
