UCL Discovery

Information Retrieval-Driven Software Vulnerability Prediction

Meka, Chizzy Godson; (2025) Information Retrieval-Driven Software Vulnerability Prediction. Doctoral thesis (Ph.D), UCL (University College London). Green open access

Abstract

Context: The growing complexity of software systems correlates with an increasing number of vulnerabilities, necessitating effective mitigation strategies such as vulnerability prediction. This Artificial Intelligence (AI)-driven approach aims to improve Secure Software Development Life Cycle (SSDLC) practices by proactively identifying potential security flaws. While various prediction approaches have been proposed, opportunities for further research remain, particularly in leveraging the pattern-matching capabilities of Information Retrieval (IR) to enhance prediction models.

Objective: This thesis advances secure software engineering methodologies by introducing IR-driven feature engineering methods for predicting software vulnerabilities. We develop granular method-level vulnerability prediction models that leverage novel IR-driven security-relevant metrics and evaluate their predictive performance.

Methodology: We developed two varieties of sixteen IR-driven security-relevant features, using token-based and Abstract Syntax Tree (AST)-based source code representations. We then used these features to build models with various machine learning classifiers in Python and evaluated them on Java open-source software systems, starting with a Within-Project (release-by-release dataset) setting. Finally, we conducted a stress test in a Mixed-Project (multi-software systems dataset) setting to assess the generalisability of our models across software systems.

Results: Our Within-Project token-based IR-driven approach reached a post-hyperparameter-tuning precision of 0.73, a recall of 0.60, and an F1 score of 0.66 using a Random Forest classifier. The Within-Project AST-based approach attained a slightly better F1 score, with a post-hyperparameter-tuning precision of 0.72, a recall of 0.62, and an F1 score of 0.67, also using Random Forest.

Conclusion: Our research indicates that IR-driven feature engineering techniques significantly enhance prediction performance, demonstrating the effectiveness of our approach. However, the Mixed-Project analysis indicated that data-related challenges in vulnerability prediction persist, especially regarding data heterogeneity across software systems. Thus, system-specific vulnerability prediction models that leverage a release-by-release dataset and knowledge of previous system-specific vulnerabilities represent the most promising approach for practical vulnerability prediction in real-world software systems.
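As a reading aid, the F1 scores quoted in the abstract can be checked against the stated precision and recall, since F1 is the harmonic mean of the two. The short Python sketch below does only that arithmetic; the function and variable names are illustrative and are not taken from the thesis, and the snippet says nothing about how the thesis itself computed its metrics.

```python
def f1_score(precision: float, recall: float) -> float:
    """Return the harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Post-tuning Within-Project figures as quoted in the abstract (Random Forest).
# The dictionary layout is an assumption made for this illustration.
reported = {
    "token-based": {"precision": 0.73, "recall": 0.60, "f1": 0.66},
    "AST-based": {"precision": 0.72, "recall": 0.62, "f1": 0.67},
}

# Recompute F1 from precision and recall, rounded to two decimals
# to match the precision used in the abstract.
recomputed = {
    name: round(f1_score(r["precision"], r["recall"]), 2)
    for name, r in reported.items()
}

for name, r in reported.items():
    print(f"{name}: reported F1 = {r['f1']}, recomputed F1 = {recomputed[name]}")
```

Both recomputed values agree with the reported F1 scores at two decimal places.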

Type: Thesis (Doctoral)
Qualification: Ph.D
Title: Information Retrieval-Driven Software Vulnerability Prediction
Open access status: An open access version is available from UCL Discovery
Language: English
Additional information: Copyright © The Author 2025. Original content in this thesis is licensed under the terms of the Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0) Licence (https://creativecommons.org/licenses/by-nc/4.0/). Any third-party copyright material present remains the property of its respective owner(s) and is licensed under its existing terms. Access may initially be restricted at the author’s request.
UCL classification: UCL
UCL > Provost and Vice Provost Offices > UCL BEAMS
UCL > Provost and Vice Provost Offices > UCL BEAMS > Faculty of Engineering Science > Dept of Computer Science
URI: https://discovery.ucl.ac.uk/id/eprint/10207359