UCL Discovery

HEARTS: A Holistic Framework for Explainable, Sustainable and Robust Text Stereotype Detection

King, Theo; Wu, Zekun; Koshiyama, Adriano; Kazim, Emre; Treleaven, Philip; (2024) HEARTS: A Holistic Framework for Explainable, Sustainable and Robust Text Stereotype Detection. In: Proceedings of the 38th Conference on Neural Information Processing Systems (NeurIPS 2024), pp. 1-38.

70_HEARTS_A_Holistic_Framework.pdf - Accepted Version (1MB)

Abstract

Stereotypes are generalised assumptions about societal groups, and even state-of-the-art LLMs using in-context learning struggle to identify them accurately. Due to the subjective nature of stereotypes, where what constitutes a stereotype can vary widely depending on cultural, social, and individual perspectives, robust explainability is crucial. Explainable models ensure that these nuanced judgments can be understood and validated by human users, promoting trust and accountability. We address these challenges by introducing HEARTS (Holistic Framework for Explainable, Sustainable, and Robust Text Stereotype Detection), a framework that enhances model performance, minimises carbon footprint, and provides transparent, interpretable explanations. We establish the Expanded Multi-Grain Stereotype Dataset (EMGSD), comprising 57,201 labelled texts across six groups, including under-represented demographics like LGBTQ+ and regional stereotypes. Ablation studies confirm that BERT models fine-tuned on EMGSD outperform those trained on individual components. We then analyse a fine-tuned, carbon-efficient ALBERT-V2 model using SHAP to generate token-level importance values, ensuring alignment with human understanding, and calculate explainability confidence scores by comparing SHAP and LIME outputs. An analysis of examples from the EMGSD test data indicates that when the ALBERT-V2 model predicts correctly, it assigns the highest importance to labelled stereotypical tokens. These correct predictions are also associated with higher explanation confidence scores compared to incorrect predictions. Finally, we apply the HEARTS framework to assess stereotypical bias in the outputs of 12 LLMs, using neutral prompts generated from the EMGSD test data to elicit 1,050 responses per model. This reveals a gradual reduction in bias over time within model families, with models from the LLaMA family appearing to exhibit the highest rates of bias.
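
The explainability confidence score described above can be illustrated with a short sketch. The following is a minimal illustration, not the authors' released code: it uses albert-base-v2 as a stand-in for the fine-tuned EMGSD classifier, and it assumes Spearman rank correlation of per-word attributions as one plausible way to compare SHAP and LIME outputs (the paper's exact metric may differ).

import numpy as np
import shap
from lime.lime_text import LimeTextExplainer
from scipy.stats import spearmanr
from transformers import pipeline

# Hypothetical checkpoint standing in for the paper's ALBERT-V2 model
# fine-tuned on EMGSD; any binary text classifier would work here.
clf = pipeline("text-classification", model="albert-base-v2", top_k=None)

def predict_proba(texts):
    """Map raw texts to an (n_samples, n_classes) probability matrix."""
    outputs = clf(list(texts))
    return np.array(
        [[d["score"] for d in sorted(out, key=lambda d: d["label"])]
         for out in outputs]
    )

def explanation_confidence(text, target_class=1, num_words=10):
    """Score SHAP/LIME agreement as a proxy for explanation confidence.

    Both explainers are run at the word level (a regex masker for SHAP,
    LIME's default word splitter), and the Spearman rank correlation of
    their per-word attributions is returned, in [-1, 1].
    """
    # Word-level SHAP values via a regex masker, so SHAP's tokens line
    # up with the words LIME perturbs.
    masker = shap.maskers.Text(r"\W+")
    shap_explainer = shap.Explainer(predict_proba, masker)
    shap_vals = shap_explainer([text])
    words = [w.strip() for w in shap_vals.data[0]]
    shap_weights = {
        w: v for w, v in zip(words, shap_vals.values[0][:, target_class])
    }

    # LIME attributions for the same class.
    lime_explainer = LimeTextExplainer(class_names=["neutral", "stereotype"])
    lime_exp = lime_explainer.explain_instance(
        text, predict_proba, labels=(target_class,), num_features=num_words
    )
    lime_weights = dict(lime_exp.as_list(label=target_class))

    # Compare attributions on the words both methods scored.
    shared = [w for w in lime_weights if w in shap_weights]
    if len(shared) < 2:
        return 0.0
    rho, _ = spearmanr(
        [shap_weights[w] for w in shared],
        [lime_weights[w] for w in shared],
    )
    return rho

print(explanation_confidence("All engineers are socially awkward."))

Running both explainers at the word level is the key design choice in this sketch: SHAP's default subword tokenisation would otherwise have to be re-aggregated before its values could be compared against LIME's word-level weights.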

Type: Proceedings paper
Title: HEARTS: A Holistic Framework for Explainable, Sustainable and Robust Text Stereotype Detection
Event: 38th Conference on Neural Information Processing Systems (NeurIPS 2024)
Open access status: An open access version is available from UCL Discovery
Publisher version: https://openreview.net/forum?id=arh91riKiQ
Language: English
Additional information: © The Author(s) 2025. Original content in this paper is licensed under the terms of the Creative Commons Attribution 4.0 International (CC BY 4.0) Licence (https://creativecommons.org/licenses/by/4.0/).
Keywords: Large Language Models, Stereotype Detection, Token-Level Explanations, Model Explainability, SHAP, LIME, Ethical AI, Responsible AI, Natural Language Processing
UCL classification: UCL
UCL > Provost and Vice Provost Offices > UCL BEAMS
UCL > Provost and Vice Provost Offices > UCL BEAMS > Faculty of Engineering Science > Dept of Computer Science
URI: https://discovery.ucl.ac.uk/id/eprint/10209111
