UCL Discovery

Polymorphisms Predicting Phylogeny in Hepatitis B Virus (HBV)

Lourenço, José; McNaughton, Anna; Pley, Caitlin; Obolski, Uri; Gupta, Sunetra; Matthews, Philippa (2022) Polymorphisms Predicting Phylogeny in Hepatitis B Virus (HBV). bioRxiv: NY, USA. Green open access

PPP manuscript full.pdf - Submitted Version



Hepatitis B viruses (HBV) are compact viruses with circular genomes approximately 3.2 kb in length. Four genes (HBx, Core, Surface and Polymerase), generating seven products, are encoded in overlapping reading frames. Ten HBV genotypes have been characterised (A-J), which may account for differences in transmission, outcomes of infection, and treatment response. However, HBV genotyping is rarely undertaken, and sequencing remains inaccessible in many settings. We used a machine learning approach based on random forest algorithms (RFA) to assess which amino acid (aa) sites in the genome are most informative for determining genotype. We downloaded 5496 genome-length HBV sequences from a public database, excluding recombinant sequences, regions with conserved indels, and genotypes I/J. Each gene was translated separately into aa, and the proteins were concatenated into a single sequence (length 1614 aa). Using RFA, we searched for aa sites predictive of genotype, and assessed co-variation among the sites with a Mutual Information (MI)-based method. We were able to discriminate confidently between genotypes A-H using 10 aa sites. Five of the 10 sites were in Polymerase (Pol): four in the spacer domain and one in reverse transcriptase. A further four sites were located in Surface protein, and one in HBx. There were no informative sites in Core. Properties of the aa were generally not conserved between genotypes at informative sites. Co-variation analysis identified 55 pairs of highly-linked sites. Three RFA-identified sites were represented across all pairs (two sites in spacer, and one in HBx). Residues that co-vary with these sites are concentrated in the small HBV surface gene. We also observed a cluster of sites adjacent to the Surface promoter region that co-vary with a spacer residue.
Overall, we have shown that RFA analysis is a powerful tool for identifying aa sites that predict HBV lineage, with an unexpectedly high number of such sites in the spacer domain, which has conventionally been viewed as unimportant for structure or function. Our results improve ease of genotype prediction from limited regions of HBV sequence, and may have implications for understanding HBV evolution and the role of the spacer domain.
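
The abstract mentions a Mutual Information (MI)-based assessment of co-variation among sites. The following is a minimal sketch of plain, uncorrected MI between alignment columns, and between a column and a genotype label. It is an illustration only and not the authors' pipeline: the record does not state which MI variant was used, and the toy alignment, the labels, and the helper names `mutual_information` and `column` are all hypothetical.

```python
from collections import Counter
from math import log2


def mutual_information(xs, ys):
    """Plain mutual information (in bits) between two equal-length symbol lists."""
    if len(xs) != len(ys):
        raise ValueError("inputs must have the same length")
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    # Sum over observed pairs of p(x, y) * log2(p(x, y) / (p(x) * p(y))).
    return sum(
        (c / n) * log2(c * n / (px[x] * py[y]))
        for (x, y), c in pxy.items()
    )


def column(alignment, site):
    """Return the residues found at one aligned position across all sequences."""
    return [seq[site] for seq in alignment]


# Hypothetical toy data: four aligned protein fragments with made-up labels.
# These are NOT HBV sequences and NOT real genotype assignments.
alignment = ["MKTA", "MKSA", "MRTG", "MRSG"]
genotypes = ["A", "A", "B", "B"]

# How informative is each site about the label?
site_scores = [
    mutual_information(column(alignment, i), genotypes)
    for i in range(len(alignment[0]))
]

# How strongly do two sites co-vary with each other?
pair_score = mutual_information(column(alignment, 1), column(alignment, 3))

print(site_scores, pair_score)
```

In this toy alignment the second and fourth positions track the label perfectly and also co-vary with each other, while the first position is invariant and the third varies independently of the label. Real analyses typically apply corrections for finite sample size and phylogenetic structure, which this sketch omits.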

Type: Working / discussion paper
Title: Polymorphisms Predicting Phylogeny in Hepatitis B Virus (HBV)
Open access status: An open access version is available from UCL Discovery
DOI: 10.1101/2022.07.05.498824
Publisher version: https://doi.org/10.1101/2022.07.05.498824
Language: English
Additional information: The copyright holder for this preprint is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC 4.0 International license.
UCL classification: UCL > Provost and Vice Provost Offices > School of Life and Medical Sciences > Faculty of Medical Sciences
UCL > Provost and Vice Provost Offices > School of Life and Medical Sciences > Faculty of Medical Sciences > Div of Infection and Immunity
UCL > Provost and Vice Provost Offices > School of Life and Medical Sciences
URI: https://discovery.ucl.ac.uk/id/eprint/10152280