UCL Discovery

Surgical-VQLA: Transformer with Gated Vision-Language Embedding for Visual Question Localized-Answering in Robotic Surgery

Bai, L; Islam, M; Seenivasan, L; Ren, H; (2023) Surgical-VQLA: Transformer with Gated Vision-Language Embedding for Visual Question Localized-Answering in Robotic Surgery. In: Proceedings - IEEE International Conference on Robotics and Automation. (pp. 6859-6865). IEEE: London, UK. Green open access

Surgical_VQLA.pdf - Accepted Version (495kB)

Abstract

Despite the availability of computer-aided simulators and recorded videos of surgical procedures, junior residents still rely heavily on experts to answer their queries. However, expert surgeons are often overloaded with clinical and academic workloads and have limited time to answer. To address this, we develop a surgical question-answering system that facilitates understanding of robot-assisted surgical scenes and activities from recorded videos. Most existing visual question answering (VQA) methods require an object detector and a region-based feature extractor to extract visual features and fuse them with the embedded question text for answer generation. However, (i) surgical object detection models are scarce owing to small datasets and a lack of bounding box annotations; (ii) current fusion strategies for heterogeneous modalities such as text and image are naive; and (iii) localized answering, which is crucial in complex surgical scenarios, is missing. In this paper, we propose Visual Question Localized-Answering in Robotic Surgery (Surgical-VQLA) to localize the specific surgical area during answer prediction. To fuse the heterogeneous modalities, we design a gated vision-language embedding (GVLE) to build input patches for the Language Vision Transformer (LViT), which predicts the answer. To obtain localization, we add a detection head in parallel with the prediction head of the LViT. We also integrate a generalized intersection over union (GIoU) loss to boost localization performance while preserving the accuracy of question answering. We annotate two VQLA datasets using publicly available surgical videos from the EndoVis-17 and EndoVis-18 MICCAI challenges. Our validation results suggest that Surgical-VQLA better understands the surgical scene and localizes the specific area related to the question and answer. GVLE provides an efficient vision-language embedding technique, showing superior performance over existing benchmarks.
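The abstract describes a gated fusion of visual and text embeddings feeding a transformer, with parallel answer-classification and bounding-box heads trained with a GIoU loss. The following minimal PyTorch sketch illustrates that pipeline under stated assumptions: the class names, the sigmoid-gate formulation, the number of answer classes, and the loss weighting are illustrative guesses based only on the abstract, not the authors' released implementation.

import torch
import torch.nn as nn

class GVLESketch(nn.Module):
    """Illustrative gated vision-language embedding: each modality is
    projected to a shared width, modulated by a sigmoid gate computed
    from a pooled summary of the other modality, then concatenated
    into one patch sequence for the transformer."""
    def __init__(self, vis_dim=512, txt_dim=768, d_model=256):
        super().__init__()
        self.v_proj = nn.Linear(vis_dim, d_model)
        self.t_proj = nn.Linear(txt_dim, d_model)
        self.v_gate = nn.Linear(d_model, d_model)
        self.t_gate = nn.Linear(d_model, d_model)

    def forward(self, vis, txt):
        v = self.v_proj(vis)                        # (B, Nv, d)
        t = self.t_proj(txt)                        # (B, Nt, d)
        v_pool = v.mean(dim=1, keepdim=True)        # (B, 1, d)
        t_pool = t.mean(dim=1, keepdim=True)        # (B, 1, d)
        v = torch.sigmoid(self.v_gate(t_pool)) * v  # text-conditioned gate on vision
        t = torch.sigmoid(self.t_gate(v_pool)) * t  # vision-conditioned gate on text
        return torch.cat([v, t], dim=1)             # fused patches, (B, Nv+Nt, d)

class VQLAHeads(nn.Module):
    """Parallel heads on the transformer's pooled output: answer
    classification plus bounding-box regression for localization.
    num_answers=18 is a placeholder, not the paper's value."""
    def __init__(self, d_model=256, num_answers=18):
        super().__init__()
        self.answer_head = nn.Linear(d_model, num_answers)
        self.bbox_head = nn.Linear(d_model, 4)      # (x1, y1, x2, y2), normalized

    def forward(self, pooled):                      # pooled: (B, d_model)
        return self.answer_head(pooled), self.bbox_head(pooled).sigmoid()

def giou_loss(pred, target, eps=1e-7):
    """1 - GIoU for (x1, y1, x2, y2) boxes; unlike plain IoU, it still
    gives a gradient when boxes do not overlap, via the smallest
    enclosing box."""
    ix1 = torch.max(pred[:, 0], target[:, 0])
    iy1 = torch.max(pred[:, 1], target[:, 1])
    ix2 = torch.min(pred[:, 2], target[:, 2])
    iy2 = torch.min(pred[:, 3], target[:, 3])
    inter = (ix2 - ix1).clamp(min=0) * (iy2 - iy1).clamp(min=0)
    area_p = (pred[:, 2] - pred[:, 0]) * (pred[:, 3] - pred[:, 1])
    area_t = (target[:, 2] - target[:, 0]) * (target[:, 3] - target[:, 1])
    union = area_p + area_t - inter
    iou = inter / (union + eps)
    # smallest axis-aligned box enclosing both prediction and target
    c_area = (torch.max(pred[:, 2], target[:, 2]) - torch.min(pred[:, 0], target[:, 0])) * \
             (torch.max(pred[:, 3], target[:, 3]) - torch.min(pred[:, 1], target[:, 1]))
    giou = iou - (c_area - union) / (c_area + eps)
    return (1.0 - giou).mean()

A joint training objective would then combine cross-entropy on the answer logits with the GIoU term on the boxes, e.g. loss = ce + lambda_box * giou_loss(pred_boxes, gt_boxes); the weighting lambda_box is again an assumption.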

Type: Proceedings paper
Title: Surgical-VQLA: Transformer with Gated Vision-Language Embedding for Visual Question Localized-Answering in Robotic Surgery
Event: 2023 IEEE International Conference on Robotics and Automation (ICRA)
Dates: 29 May 2023 - 2 Jun 2023
ISBN-13: 9798350323658
Open access status: An open access version is available from UCL Discovery
DOI: 10.1109/ICRA48891.2023.10160403
Publisher version: https://doi.org/10.1109/ICRA48891.2023.10160403
Language: English
Additional information: This version is the author accepted manuscript. For information on re-use, please refer to the publisher’s terms and conditions.
Keywords: Location awareness, Visualization, Head, Surgery, Predictive models, Logic gates, Feature extraction
UCL classification: UCL
UCL > Provost and Vice Provost Offices > UCL BEAMS
UCL > Provost and Vice Provost Offices > UCL BEAMS > Faculty of Engineering Science
UCL > Provost and Vice Provost Offices > UCL BEAMS > Faculty of Engineering Science > Dept of Med Phys and Biomedical Eng
URI: https://discovery.ucl.ac.uk/id/eprint/10179478
Downloads since deposit: 66
