UCL Discovery

Mechanistically analyzing the effects of fine-tuning on procedurally defined tasks

Jain, S; Kirk, R; Lubana, ES; Dick, RP; Tanaka, H; Grefenstette, E; Rocktäschel, T; (2024) Mechanistically analyzing the effects of fine-tuning on procedurally defined tasks. In: 12th International Conference on Learning Representations (ICLR 2024). ICLR.

File: 8951_Mechanistically_analyzing.pdf - Accepted Version (5MB)

Abstract

Fine-tuning large pre-trained models has become the de facto strategy for developing both task-specific and general-purpose machine learning systems, including developing models that are safe to deploy. Despite its clear importance, there has been minimal work that explains how fine-tuning alters the underlying capabilities learned by a model during pre-training: does fine-tuning yield entirely novel capabilities, or does it just modulate existing ones? We address this question empirically in synthetic, controlled settings where we can use mechanistic interpretability tools (e.g., network pruning and probing) to understand how the model's underlying capabilities are changing. We perform an extensive analysis of the effects of fine-tuning in these settings and show that: (i) fine-tuning rarely alters the underlying model capabilities; (ii) a minimal transformation, which we call a "wrapper", is typically learned on top of the underlying model capabilities, creating the illusion that they have been modified; and (iii) further fine-tuning on a task where such "wrapped capabilities" are relevant leads to sample-efficient revival of the capability, i.e., the model begins reusing these capabilities after only a few gradient steps. This indicates that practitioners can unintentionally remove a model's safety wrapper merely by fine-tuning it on, e.g., a superficially unrelated downstream task. We additionally analyze language models trained on the TinyStories dataset to support our claims in a more realistic setting.
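The probing approach the abstract mentions can be illustrated with a minimal sketch. The code below is not the authors' implementation; it assumes hidden activations have already been extracted from a model before and after fine-tuning, and fits a linear probe on them to test whether the pre-trained capability remains linearly decodable, as the paper's "wrapper" finding would predict. All names (train_linear_probe, acts_pre, acts_post, task_labels) are hypothetical.

import torch
import torch.nn as nn

def train_linear_probe(activations, labels, num_classes, epochs=200, lr=1e-2):
    """Fit a linear classifier on frozen hidden activations.

    activations: (num_examples, hidden_dim) tensor of hidden states
    labels: (num_examples,) tensor of class indices
    High probe accuracy on a fine-tuned model would suggest the
    underlying capability is still encoded rather than erased.
    """
    probe = nn.Linear(activations.shape[1], num_classes)
    opt = torch.optim.Adam(probe.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        opt.zero_grad()
        loss_fn(probe(activations), labels).backward()
        opt.step()
    with torch.no_grad():
        acc = (probe(activations).argmax(dim=-1) == labels).float().mean().item()
    return probe, acc

# Hypothetical usage: acts_pre / acts_post are activations from the same
# layer of the pre-trained and fine-tuned models on identical inputs.
# _, acc_pre = train_linear_probe(acts_pre, task_labels, num_classes=4)
# _, acc_post = train_linear_probe(acts_post, task_labels, num_classes=4)
# Comparable accuracies would be consistent with a thin "wrapper" on top
# of intact capabilities rather than capability removal.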

Type: Proceedings paper
Title: Mechanistically analyzing the effects of fine-tuning on procedurally defined tasks
Event: ICLR 2024
Open access status: An open access version is available from UCL Discovery
Publisher version: https://openreview.net/forum?id=Yu8yWRoONO
Language: English
Additional information: This version is the author accepted manuscript. For information on re-use, please refer to the publisher’s terms and conditions.
UCL classification: UCL
UCL > Provost and Vice Provost Offices > UCL BEAMS
UCL > Provost and Vice Provost Offices > UCL BEAMS > Faculty of Engineering Science > Dept of Computer Science
URI: https://discovery.ucl.ac.uk/id/eprint/10216727
Downloads since deposit: 1
