UCL Discovery

Reliably Filter Drug-Induced Liver Injury Literature With Natural Language Processing and Conformal Prediction

Zhan, Xianghao; Wang, Fanjin; Gevaert, Olivier; (2022) Reliably Filter Drug-Induced Liver Injury Literature With Natural Language Processing and Conformal Prediction. IEEE Journal of Biomedical and Health Informatics, 26 (10) pp. 5033-5041. 10.1109/JBHI.2022.3193365. Green open access

Reliably_Filter_Drug-Induced_Liver_Injury_Literature_With_Natural_Language_Processing_and_Conformal_Prediction.pdf - Published Version

Abstract

Drug-induced liver injury describes the adverse effects of drugs that damage the liver, and life-threatening outcomes have been reported in severe cases. Liver toxicity is therefore an important assessment for new drug candidates. Reports of such toxicity are documented in research papers that contain preliminary in vitro and in vivo experiments. Conventionally, data extraction from publications relies on resource-demanding manual labeling, which limits the efficiency of information extraction. The development of natural language processing techniques enables the automatic processing of biomedical texts. Herein, based on around 28,000 papers (titles and abstracts) provided by the Critical Assessment of Massive Data Analysis challenge, this study benchmarked model performance on filtering liver-damage-related literature. Among five text embedding techniques, the model using term frequency-inverse document frequency (TF-IDF) and logistic regression outperformed the others, with an accuracy of 0.957 on the validation set. Furthermore, an ensemble model with similar overall performance was developed by fitting a logistic regression model on the predicted probabilities given by separate models with different vectorization techniques. The ensemble model achieved an accuracy of 0.954 and an F1 score of 0.955 on the hold-out validation data in the challenge. Moreover, important words in positive/negative predictions were identified via model interpretation. The prediction reliability was quantified with conformal prediction, which gives users control over the prediction uncertainty. Overall, the ensemble model and the TF-IDF model reached satisfactory classification results and can be used by researchers to rapidly filter literature that describes events related to liver injury induced by medications.
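
To illustrate how conformal prediction can give users control over prediction uncertainty, the following is a minimal sketch of class-conditional inductive conformal prediction for a binary classifier. It is not the authors' implementation: the nonconformity measure (one minus the probability assigned to a label), the function names, and the toy calibration values are all assumptions made for illustration. The only input taken from a model is a predicted probability of the positive (liver-injury-related) class.

```python
from typing import Dict, List, Sequence, Set


def nonconformity(prob_pos: float, label: int) -> float:
    """Assumed nonconformity measure: 1 - probability assigned to `label`."""
    p_label = prob_pos if label == 1 else 1.0 - prob_pos
    return 1.0 - p_label


def calibrate(cal_probs: Sequence[float],
              cal_labels: Sequence[int]) -> Dict[int, List[float]]:
    """Collect nonconformity scores per true class on a held-out calibration set."""
    scores: Dict[int, List[float]] = {0: [], 1: []}
    for p, y in zip(cal_probs, cal_labels):
        scores[y].append(nonconformity(p, y))
    return scores


def p_value(cal_scores: Sequence[float], test_score: float) -> float:
    """Fraction of calibration scores at least as nonconforming as the test
    score, counting the test example itself (the +1 terms)."""
    n_ge = sum(1 for s in cal_scores if s >= test_score)
    return (n_ge + 1) / (len(cal_scores) + 1)


def prediction_set(prob_pos: float,
                   cal: Dict[int, List[float]],
                   epsilon: float) -> Set[int]:
    """Keep every label whose p-value exceeds the significance level epsilon."""
    return {
        label
        for label in (0, 1)
        if p_value(cal[label], nonconformity(prob_pos, label)) > epsilon
    }


# Toy calibration data (hypothetical): predicted positive-class probabilities
# and true labels for eight held-out abstracts.
cal_probs = [0.95, 0.90, 0.80, 0.60, 0.10, 0.05, 0.20, 0.45]
cal_labels = [1, 1, 1, 1, 0, 0, 0, 0]
cal = calibrate(cal_probs, cal_labels)

confident = prediction_set(0.97, cal, epsilon=0.25)  # single label
cautious = prediction_set(0.97, cal, epsilon=0.10)   # both labels kept
unsure = prediction_set(0.50, cal, epsilon=0.25)     # no label is credible
```

A smaller epsilon asks for fewer errors and therefore yields larger prediction sets, while a larger epsilon yields more single-label or empty sets. In a filtering workflow, abstracts with a single-label set could be accepted automatically and the remainder routed to manual review.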

Type: Article
Title: Reliably Filter Drug-Induced Liver Injury Literature With Natural Language Processing and Conformal Prediction
Location: United States
Open access status: An open access version is available from UCL Discovery
DOI: 10.1109/JBHI.2022.3193365
Publisher version: https://doi.org/10.1109/JBHI.2022.3193365
Language: English
Additional information: This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
Keywords: Science & Technology, Technology, Life Sciences & Biomedicine, Computer Science, Information Systems, Computer Science, Interdisciplinary Applications, Mathematical & Computational Biology, Medical Informatics, Computer Science, Liver, Predictive models, Drugs, Data models, Training, Bioinformatics, Injuries, Drug-induced liver injury, natural language processing, ensemble learning, sentence embedding, conformal prediction
UCL classification: UCL
UCL > Provost and Vice Provost Offices > UCL BEAMS
UCL > Provost and Vice Provost Offices > UCL BEAMS > Faculty of Engineering Science
UCL > Provost and Vice Provost Offices > UCL BEAMS > Faculty of Engineering Science > Dept of Mechanical Engineering
URI: https://discovery.ucl.ac.uk/id/eprint/10158860
