eprintid: 10108160
rev_number: 24
eprint_status: archive
userid: 608
dir: disk0/10/10/81/60
datestamp: 2020-10-15 13:19:52
lastmod: 2021-02-24 23:43:56
status_changed: 2020-10-15 13:19:52
type: thesis
metadata_visibility: show
creators_name: Sun, Yuxin
title: Identification of antigen-specific patterns from high-dimensional sequencing data
ispublished: unpub
divisions: UCL
divisions: A01
divisions: B02
divisions: C10
divisions: D19
divisions: G97
divisions: D15
note: Copyright © The Author 2020. Original content in this thesis is licensed under the terms of the Creative Commons Attribution 4.0 International (CC BY 4.0) Licence (https://creativecommons.org/licenses/by/4.0/). Any third-party copyright material present remains the property of its respective owner(s) and is licensed under its existing terms. Access may initially be restricted at the author’s request.
abstract: T cells recognize antigens using a diverse set of antigen-specific T-cell receptors (TCRs) on their surface. This poses two challenges for studying TCRs that respond to a given antigen. First, the enormous diversity of the TCR repertoire creates an ultra-high-dimensional feature space; second, TCRs that respond to an antigen are often correlated. This thesis aims to develop efficient machine learning algorithms for feature selection from high-dimensional feature spaces that address both problems. Our research concerns two subproblems: identification of antigen-enriched sequence motifs within the CDR3 region of TCRs, and identification of antigen-enriched entire TCR sequences. We apply a string kernel and a Fisher kernel to represent subsequences and develop fast algorithms to learn antigen-specific subsequences from graph-represented features. Both fixed-length and varying-length subsequences from mouse samples are selected with high efficiency and accuracy. Our results also suggest that short subsequences are found at specific positions, which may correspond to the actual interacting regions between the TCR and the MHC-peptide complex. We further develop fast algorithms to solve the exclusive group Lasso and provide a novel methodology to select entire TCR sequences that are relevant to specific antigens. Our solution addresses a notoriously difficult problem in feature selection: selecting highly correlated features. Experiments on synthetic data show good performance under various correlation settings. The proposed algorithms are also validated on real-world data to select a sparse set of entire TCRs with high accuracy.
date: 2020-08-28
date_type: published
oa_status: green
full_text_type: other
thesis_class: doctoral_open
thesis_award: Ph.D
language: eng
thesis_view: UCL_Thesis
primo: open
primo_central: open_green
verified: verified_manual
elements_id: 1808729
lyricists_name: Chain, Benjamin
lyricists_name: Sun, Yuxin
lyricists_id: BMCHA43
lyricists_id: SUNBX27
actors_name: Sun, Yuxin
actors_id: SUNBX27
actors_role: owner
full_text_status: public
pages: 195
event_title: UCL (University College London)
institution: UCL (University College London)
department: Computer Science
thesis_type: Doctoral
editors_name: Shawe-Taylor, J
editors_name: Chain, B
citation: Sun, Yuxin; (2020) Identification of antigen-specific patterns from high-dimensional sequencing data. Doctoral thesis (Ph.D), UCL (University College London). Green open access
document_url: https://discovery.ucl.ac.uk/id/eprint/10108160/1/Identification%20of%20Antigen-Specific%20Patterns%20from%20High-Dimensional%20Sequencing%20Data.pdf