UCL Discovery

Multilevel Hierarchical Network with Multiscale Sampling for Video Question Answering

Peng, Min; Wang, Chongyang; Gao, Yuan; Shi, Yu; Zhou, Xiang-Dong; (2022) Multilevel Hierarchical Network with Multiscale Sampling for Video Question Answering. In: Proceedings of the Thirty-First International Joint Conference on Artificial Intelligence. (pp. 1276-1282). IJCAI: International Joint Conferences on Artificial Intelligence Organization. Green open access.

Abstract

Video question answering (VideoQA) is challenging because it combines visual understanding with natural language processing. Most existing approaches ignore visual appearance-motion information at different temporal scales, and it remains unclear how to combine the multilevel processing capacity of a deep learning model with such multiscale information. Targeting these issues, this paper proposes a novel Multilevel Hierarchical Network (MHN) with multiscale sampling for VideoQA. MHN comprises two modules, namely Recurrent Multimodal Interaction (RMI) and Parallel Visual Reasoning (PVR). With multiscale sampling, RMI iterates the interaction between the appearance-motion information at each scale and the question embeddings to build multilevel question-guided visual representations. Then, with a shared transformer encoder, PVR infers the visual cues at each level in parallel, so that different question types can draw on the visual information at the levels relevant to them. Through extensive experiments on three VideoQA datasets, we demonstrate improved performance over previous state-of-the-art methods and justify the effectiveness of each part of our method.
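
The abstract does not spell out how the multiscale sampling is performed. As a rough illustration only, the sketch below assumes that scale s splits the video into 2**s equal segments and draws a fixed number of evenly spaced frames from each. The function name `multiscale_sample`, its parameters, and the doubling scheme are all hypothetical and are not taken from the paper.

```python
def multiscale_sample(num_frames, num_scales=3, frames_per_clip=4):
    """Return frame indices grouped by scale, then by clip.

    ASSUMPTION (not from the paper): scale s splits the video into
    2**s equal segments ("clips") and takes `frames_per_clip` evenly
    spaced frame indices from each segment.
    """
    if num_frames < 1 or num_scales < 1 or frames_per_clip < 1:
        raise ValueError("all arguments must be positive integers")

    scales = []
    for s in range(num_scales):
        num_clips = 2 ** s
        clips = []
        for c in range(num_clips):
            # Segment boundaries: [start, end) in frame indices.
            start = c * num_frames // num_clips
            end = (c + 1) * num_frames // num_clips
            span = max(end - start, 1)
            # Evenly spaced indices inside the segment, clamped to the video.
            clip = [
                min(start + (k * span) // frames_per_clip, num_frames - 1)
                for k in range(frames_per_clip)
            ]
            clips.append(clip)
        scales.append(clips)
    return scales


if __name__ == "__main__":
    for level, clips in enumerate(multiscale_sample(16)):
        print(f"scale {level}: {clips}")
```

Under this assumed scheme, coarser scales cover the whole video with sparse frames while finer scales cover short segments densely, which is one plausible reading of "appearance-motion information at different temporal scales". A model in the spirit of MHN would then extract features per clip and process one scale per level.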

Type: Proceedings paper
Title: Multilevel Hierarchical Network with Multiscale Sampling for Video Question Answering
Event: IJCAI-ECAI 2022: 31st International Joint Conference on Artificial Intelligence and the 25th European Conference on Artificial Intelligence
Open access status: An open access version is available from UCL Discovery
Publisher version: https://doi.org/10.24963/ijcai.2022/178
Language: English
Additional information: This version is the author accepted manuscript. For information on re-use, please refer to the publisher’s terms and conditions.
Keywords: Computer Vision: Vision and language, Computer Vision: Scene analysis and understanding, Computer Vision: Video analysis and understanding, Machine Learning: Multi-modal learning, Natural Language Processing: Question Answering
UCL classification: UCL > Provost and Vice Provost Offices > School of Life and Medical Sciences > Faculty of Brain Sciences
UCL > Provost and Vice Provost Offices > School of Life and Medical Sciences > Faculty of Brain Sciences > Div of Psychology and Lang Sciences
UCL > Provost and Vice Provost Offices > School of Life and Medical Sciences
UCL
URI: https://discovery.ucl.ac.uk/id/eprint/10150093