Zhi, Zhuo;
Liu, Ziquan;
Wu, Qiangqiang;
Rodrigues, Miguel;
(2024)
Wasserstein Modality Alignment Makes Your Multimodal Transformer More Robust.
In:
Proceedings of ICML 2024.
(pp. 1-11).
Proceedings of Machine Learning Research (PMLR): Vienna, Austria.
Abstract
Early fusion in a one-tower model such as a multimodal transformer is an effective multimodal learning paradigm. However, in a multimodal transformer, modality fusion is performed solely through the self-attention function, which was originally designed for unimodal token sequences. To improve the self-attention mechanism for handling multimodal input, a parametric adapter model, like the Q-Former in BLIP-2, is often used to align tokens from different modalities. Unlike existing methods that use an adapter model for modality alignment, our paper proposes an implicit approach based on the Wasserstein distance that aligns tokens from different modalities in a multimodal transformer without using any additional parameters. Our empirical study shows that this implicit modality alignment improves the effectiveness of the multimodal transformer in discriminative tasks, as well as its robustness to input noise and missing modalities. We conduct experiments on four different types of downstream task datasets, including both 2-modality and 3-modality tasks. In standard testing, testing with modality noise, and testing with missing modalities, the average improvements of our method over the baseline across all datasets are 0.9%, 2.5%, and 2.1%, respectively.
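The abstract's core idea is to pull the token distributions of different modalities toward each other by minimizing a Wasserstein distance between them, with no extra learned parameters. As a minimal illustrative sketch (not the authors' implementation; the projection count, token shapes, and the sliced-Wasserstein approximation here are assumptions), the distance between two modalities' token sets can be estimated by sorting random 1-D projections:

```python
import numpy as np

def sliced_wasserstein(tokens_a, tokens_b, n_projections=128, seed=0):
    """Approximate the Wasserstein-1 distance between two token sets
    of shape (n_tokens, dim) via random 1-D projections (sliced
    Wasserstein). Illustrative sketch only, not the paper's method;
    assumes both modalities contribute the same number of tokens.
    """
    rng = np.random.default_rng(seed)
    dim = tokens_a.shape[1]
    # Draw random unit directions on the sphere.
    dirs = rng.normal(size=(n_projections, dim))
    dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
    # Project each token set onto every direction and sort along tokens;
    # for 1-D distributions, sorting gives the optimal transport coupling.
    proj_a = np.sort(tokens_a @ dirs.T, axis=0)
    proj_b = np.sort(tokens_b @ dirs.T, axis=0)
    # Mean absolute difference of sorted projections, averaged over slices.
    return np.mean(np.abs(proj_a - proj_b))
```

Used as an auxiliary training loss on the transformer's per-modality token embeddings, such a term is parameter-free, which matches the paper's claim of implicit alignment without an adapter model.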
Type: | Proceedings paper |
---|---|
Title: | Wasserstein Modality Alignment Makes Your Multimodal Transformer More Robust |
Event: | ICML 2024 TiFA Workshop |
Open access status: | An open access version is available from UCL Discovery |
Publisher version: | https://proceedings.mlr.press/ |
Language: | English |
Additional information: | This version is the author accepted manuscript. For information on re-use, please refer to the publisher's terms and conditions. |
UCL classification: | UCL > Provost and Vice Provost Offices > UCL BEAMS > Faculty of Engineering Science > Dept of Electronic and Electrical Eng |
URI: | https://discovery.ucl.ac.uk/id/eprint/10194484 |