UCL Discovery

Understanding and Guarding against Natural Language Adversarial Examples

Mozes, Maximilian Attila Janos; (2024) Understanding and Guarding against Natural Language Adversarial Examples. Doctoral thesis (Ph.D), UCL (University College London). Green open access

File: mmozes_thesis.pdf - Text. Download (2MB).

Abstract

Despite their success, machine learning models have been shown to be susceptible to adversarial examples: carefully constructed perturbations of model inputs that are intended to lead a model into misclassifying those inputs. While this phenomenon was discovered in the context of computer vision, an increasing body of work focuses on adversarial examples in natural language processing (NLP). This PhD thesis presents an investigation into such adversarial examples in the context of text classification, characterizing them through both computational analyses and behavioral studies. As a computational analysis, we present results showing that the effectiveness of adversarial word-level perturbations is due to the replacement of input words with low-frequency synonyms. Based on these insights, we propose an effective detection method for adversarial examples (Study 1). As a behavioral analysis (Study 2), we present a data collection effort comprising human-written word-level adversarial examples, and conduct statistical comparisons between human- and machine-generated adversarial examples with respect to their preservation of sentiment, naturalness, and grammaticality. We find that human- and machine-authored adversarial examples are of similar quality across most comparisons, yet humans can generate adversarial examples with much greater efficiency. In Study 3, we investigate the patterns of human behavior when authoring adversarial examples, and identify “human strategies” for generating adversarial examples that have the potential to advance automated attacks. Study 4 reviews the NLP-related scientific safety and security literature with respect to more recent large language models (LLMs). We provide a taxonomy of existing efforts on this topic, categorized into threats arising from the generative capabilities of LLMs, prevention measures developed to safeguard models against misuse, and vulnerabilities stemming from imperfect prevention measures. We conclude the thesis by discussing this work’s contributions and impact on the research community, as well as potential future work arising from these insights.
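To make the frequency-based intuition behind Study 1 concrete, the sketch below illustrates one possible detection heuristic of that flavor: replace unusually rare input words with their most frequent synonyms and flag the input if the model's confidence in its original prediction drops sharply. This is a minimal, hypothetical Python illustration, not the thesis's actual implementation; the frequency table, synonym lookup, classifier interface, and thresholds are all assumed placeholders.

# Minimal, hypothetical sketch of a frequency-guided detection heuristic
# (illustrative only; not the implementation described in the thesis).
from typing import Callable, Dict, List

def detect_adversarial(
    tokens: List[str],
    word_freq: Dict[str, int],                          # placeholder corpus word frequencies
    synonyms: Dict[str, List[str]],                     # placeholder synonym lookup
    predict: Callable[[List[str]], Dict[str, float]],   # placeholder classifier: tokens -> label probabilities
    freq_threshold: int = 5,
    delta: float = 0.3,
) -> bool:
    """Flag an input if undoing rare-word substitutions sharply lowers the
    model's confidence in the label it originally predicted."""
    substituted = []
    for tok in tokens:
        if word_freq.get(tok, 0) < freq_threshold:
            candidates = synonyms.get(tok, [])
            if candidates:
                # replace the rare word with its most frequent synonym
                tok = max(candidates, key=lambda w: word_freq.get(w, 0))
        substituted.append(tok)

    original_probs = predict(tokens)
    predicted_label = max(original_probs, key=original_probs.get)
    confidence_drop = original_probs[predicted_label] - predict(substituted).get(predicted_label, 0.0)
    return confidence_drop > delta

In practice, the placeholders would be backed by corpus statistics, a lexical resource for synonyms, and the target classifier; the threshold values shown here are arbitrary.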

Type: Thesis (Doctoral)
Qualification: Ph.D
Title: Understanding and Guarding against Natural Language Adversarial Examples
Open access status: An open access version is available from UCL Discovery
Language: English
Additional information: Copyright © The Author 2024. Original content in this thesis is licensed under the terms of the Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0) Licence (https://creativecommons.org/licenses/by-nc/4.0/). Any third-party copyright material present remains the property of its respective owner(s) and is licensed under its existing terms. Access may initially be restricted at the author’s request.
Keywords: machine learning, natural language processing, adversarial machine learning
UCL classification: UCL
UCL > Provost and Vice Provost Offices > UCL BEAMS
UCL > Provost and Vice Provost Offices > UCL BEAMS > Faculty of Engineering Science
UCL > Provost and Vice Provost Offices > UCL BEAMS > Faculty of Engineering Science > Dept of Security and Crime Science
URI: https://discovery.ucl.ac.uk/id/eprint/10190224
