Nagano, Yuta (2024) Overcoming data bottlenecks in T cell receptor specificity prediction with effective machine learning. Doctoral thesis (Ph.D), UCL (University College London).
Abstract
T cells are an integral part of the adaptive immune system. They detect host cells that have been compromised due to infection or mutation, and eliminate them through direct attack or recruitment of other effector immune cell types. This is achieved through T cell receptors (TCRs) expressed on their surface, which allow T cells to bind to peptide-major histocompatibility complexes (pMHCs) in a target-specific manner. Uncovering the rules of TCR-pMHC specificity has the potential for profound and positive impact in biomedicine. However, this remains an unsolved challenge made particularly difficult by the immense numbers of possible TCRs (∼10^60) and pMHCs (∼10^15). While it remains implausible to empirically map out anywhere near a majority of the possible binders, an application of machine learning may help us better understand the binding rules by extrapolating from existing data. Recent advances in the natural language processing (NLP) field have demonstrated the impressive ability of transformer language models to learn from unsupervised objectives using large corpora of unlabelled text. Since TCRs, like other proteins, can naturally be represented as a sequence of amino acids, there has been growing interest in applying language modelling technologies to the TCR domain. However, how to most effectively design unsupervised training objectives to optimise language models for downstream TCR-pMHC specificity prediction remains an open question. The core theme of this thesis is the investigation of contrastive learning as a method of training transformer-based TCR representation models. In this regard, I show that combining unsupervised contrastive learning (autocontrastive learning) with the traditional masked-language modelling (MLM) objective is a highly effective way of pre-training a TCR representation model.
In addition to the above, the thesis presents related work that I have conducted during my PhD candidacy around automated TCR data standardisation as well as a statistical framework for calibrating TCR distance metrics to probabilities of TCR co-specificity.
Type: | Thesis (Doctoral) |
---|---|
Qualification: | Ph.D |
Title: | Overcoming data bottlenecks in T cell receptor specificity prediction with effective machine learning |
Open access status: | An open access version is available from UCL Discovery |
Language: | English |
Additional information: | Copyright © The Author 2024. Original content in this thesis is licensed under the terms of the Creative Commons Attribution 4.0 International (CC BY 4.0) Licence (https://creativecommons.org/licenses/by/4.0/). Any third-party copyright material present remains the property of its respective owner(s) and is licensed under its existing terms. Access may initially be restricted at the author’s request. |
Keywords: | Immunology, T cell receptor, Machine learning, Mathematical modelling |
UCL classification: | UCL; UCL > Provost and Vice Provost Offices > School of Life and Medical Sciences; UCL > Provost and Vice Provost Offices > School of Life and Medical Sciences > Faculty of Medical Sciences; UCL > Provost and Vice Provost Offices > School of Life and Medical Sciences > Faculty of Medical Sciences > Div of Medicine |
URI: | https://discovery.ucl.ac.uk/id/eprint/10200813 |



