eprintid: 10108160
rev_number: 24
eprint_status: archive
userid: 608
dir: disk0/10/10/81/60
datestamp: 2020-10-15 13:19:52
lastmod: 2021-02-24 23:43:56
status_changed: 2020-10-15 13:19:52
type: thesis
metadata_visibility: show
creators_name: Sun, Yuxin
title: Identification of antigen-specific patterns from high-dimensional sequencing data
ispublished: unpub
divisions: UCL
divisions: A01
divisions: B02
divisions: C10
divisions: D19
divisions: G97
divisions: D15
note: Copyright © The Author 2020. Original content in this thesis is licensed under the terms of the Creative Commons Attribution 4.0 International (CC BY 4.0) Licence (https://creativecommons.org/licenses/by/4.0/). Any third-party copyright material present remains the property of its respective owner(s) and is licensed under its existing terms. Access may initially be restricted at the author’s request.
abstract: T cells recognize antigens using a diverse set of antigen-specific T-cell receptors (TCRs) on their surface. This poses two challenges for studying TCRs that respond to a given antigen. First, the enormous diversity of the TCR repertoire creates an ultra-high-dimensional feature space; second, TCRs that respond to an antigen are often correlated. This thesis aims to develop efficient machine learning algorithms that address both challenges in feature selection from high-dimensional feature spaces. Our research concerns two subproblems: identifying antigen-enriched sequence motifs within the CDR3 region of TCRs, and identifying antigen-enriched entire TCR sequences. We apply a string kernel and a Fisher kernel to represent subsequences and develop fast algorithms to learn antigen-specific subsequences from graph-represented features. Both fixed-length and varying-length subsequences from mouse samples are selected with high efficiency and accuracy. Our results also suggest that short subsequences are found at specific positions, which may correspond to the actual interacting regions between the TCR and the MHC-peptide complex. We further develop fast algorithms to solve the exclusive group Lasso and provide a novel methodology to select entire TCR sequences that are relevant to specific antigens. Our solution addresses a notoriously difficult problem in feature selection: selecting highly correlated features. Experiments on synthetic data show good performance under various correlation settings. The proposed algorithms are also validated on real-world data, where they select a sparse set of entire TCRs with high accuracy.
date: 2020-08-28
date_type: published
oa_status: green
full_text_type: other
thesis_class: doctoral_open
thesis_award: Ph.D
language: eng
thesis_view: UCL_Thesis
primo: open
primo_central: open_green
verified: verified_manual
elements_id: 1808729
lyricists_name: Chain, Benjamin
lyricists_name: Sun, Yuxin
lyricists_id: BMCHA43
lyricists_id: SUNBX27
actors_name: Sun, Yuxin
actors_id: SUNBX27
actors_role: owner
full_text_status: public
pages: 195
event_title: UCL (University College London)
institution: UCL (University College London)
department: Computer Science
thesis_type: Doctoral
editors_name: Shawe-Taylor, J
editors_name: Chain, B
citation: Sun, Yuxin; (2020) Identification of antigen-specific patterns from high-dimensional sequencing data. Doctoral thesis (Ph.D), UCL (University College London). Green open access
document_url: https://discovery.ucl.ac.uk/id/eprint/10108160/1/Identification%20of%20Antigen-Specific%20Patterns%20from%20High-Dimensional%20Sequencing%20Data.pdf