UCL Discovery

PitVQA: Image-Grounded Text Embedding LLM for Visual Question Answering in Pituitary Surgery

He, Runlong; Xu, Mengya; Das, Adrito; Khan, Danyal Z; Bano, Sophia; Marcus, Hani J; Stoyanov, Danail; ... Islam, Mobarakol (2024) PitVQA: Image-Grounded Text Embedding LLM for Visual Question Answering in Pituitary Surgery. In: Dou, Q and Linguraru, MG and Feragen, A and Giannarou, S and Glocker, B and Lekadir, K and Schnabel, JA, (eds.) Medical Image Computing and Computer Assisted Intervention – MICCAI 2024. (pp. 488-498). Springer Nature

Text: 3403_paper.pdf - Accepted Version
Access restricted to UCL open access staff until 4 October 2025.

Abstract

Visual Question Answering (VQA) within the surgical domain, utilizing Large Language Models (LLMs), offers a distinct opportunity to improve intra-operative decision-making and facilitate intuitive surgeon-AI interaction. However, the development of LLMs for surgical VQA is hindered by the scarcity of diverse and extensive datasets with complex reasoning tasks. Moreover, contextual fusion of the image and text modalities remains an open research challenge due to the inherent differences between these two types of information and the complexity involved in aligning them. This paper introduces PitVQA, a novel dataset specifically designed for VQA in endonasal pituitary surgery, and PitVQA-Net, an adaptation of GPT2 with a novel image-grounded text embedding for surgical VQA. PitVQA comprises 25 procedural videos and a rich collection of question-answer pairs spanning crucial surgical aspects such as phase and step recognition, context understanding, tool detection and localization, and tool-tissue interactions. PitVQA-Net consists of a novel image-grounded text embedding, which projects image and text features into a shared embedding space, and a GPT2 backbone with an excitation block classification head that generates contextually relevant answers within the complex domain of endonasal pituitary surgery. Our image-grounded text embedding leverages joint embedding, cross-attention and contextual representation to understand the contextual relationship between questions and surgical images. We demonstrate the effectiveness of PitVQA-Net on both the PitVQA dataset and the publicly available EndoVis18-VQA dataset, achieving improvements in balanced accuracy of 8% and 9% over the most recent baselines, respectively. Our code and dataset are available at https://github.com/mobarakol/PitVQA.
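
The abstract reports its gains in balanced accuracy but does not define the metric. The following is a minimal sketch, assuming the standard definition (the unweighted mean of per-class recall); it is not taken from the authors' code, and the function name and the tool labels are hypothetical, chosen only for illustration. It shows why balanced accuracy is informative when answer classes are imbalanced: a model that always predicts the majority answer scores well on plain accuracy but poorly on balanced accuracy.

```python
from collections import defaultdict


def balanced_accuracy(y_true, y_pred):
    """Unweighted mean of per-class recall over the classes present in y_true.

    Assumption: this is the standard definition of balanced accuracy;
    the abstract itself does not spell the metric out.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("at least one sample is required")

    total = defaultdict(int)    # samples per true class
    correct = defaultdict(int)  # correctly predicted samples per true class
    for truth, pred in zip(y_true, y_pred):
        total[truth] += 1
        if truth == pred:
            correct[truth] += 1

    recalls = [correct[label] / total[label] for label in total]
    return sum(recalls) / len(recalls)


# Hypothetical, imbalanced answers to a "which tool is visible?" question.
y_true = ["suction"] * 8 + ["drill"] * 2
y_pred = ["suction"] * 10  # a model that always answers with the majority class

plain_accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(plain_accuracy)                     # 8 of 10 answers are correct
print(balanced_accuracy(y_true, y_pred))  # recall is 1.0 for one class, 0.0 for the other
```

In this toy case plain accuracy is 0.8 while balanced accuracy is 0.5, because the minority class is never predicted correctly.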

Type: Proceedings paper
Title: PitVQA: Image-Grounded Text Embedding LLM for Visual Question Answering in Pituitary Surgery
Event: 27th International Conference on Medical Image Computing and Computer Assisted Intervention (MICCAI)
Location: Marrakesh, Morocco
Dates: 6th-10th Oct 2024
ISBN-13: 978-3-031-72088-8
DOI: 10.1007/978-3-031-72089-5_46
Publisher version: https://doi.org/10.1007/978-3-031-72089-5_46
Language: English
Additional information: This version is the author accepted manuscript. For information on re-use, please refer to the publisher's terms and conditions.
UCL classification: UCL
UCL > Provost and Vice Provost Offices > School of Life and Medical Sciences
UCL > Provost and Vice Provost Offices > UCL BEAMS
UCL > Provost and Vice Provost Offices > School of Life and Medical Sciences > Faculty of Brain Sciences
UCL > Provost and Vice Provost Offices > School of Life and Medical Sciences > Faculty of Brain Sciences > UCL Queen Square Institute of Neurology
UCL > Provost and Vice Provost Offices > School of Life and Medical Sciences > Faculty of Population Health Sciences > Institute of Health Informatics
UCL > Provost and Vice Provost Offices > UCL BEAMS > Faculty of Engineering Science > Dept of Computer Science
UCL > Provost and Vice Provost Offices > School of Life and Medical Sciences > Faculty of Brain Sciences > UCL Queen Square Institute of Neurology > Department of Neuromuscular Diseases
URI: https://discovery.ucl.ac.uk/id/eprint/10203116