UCL Discovery

Wasserstein Modality Alignment Makes Your Multimodal Transformer More Robust

Zhi, Zhuo; Sun, Yuxuan; Wu, Qiangqiang; Liu, Ziquan; Rodrigues, Miguel; (2025) Wasserstein Modality Alignment Makes Your Multimodal Transformer More Robust. Transactions on Machine Learning Research (In press). Green open access

PDF: 3212_Wasserstein_Modality_Alig.pdf - Accepted Version (903kB)
Abstract

Multimodal fusion with a multimodal transformer is effective under both early and late fusion paradigms. In a multimodal transformer, however, modality fusion is performed solely through the self-attention mechanism, which was originally designed for unimodal token sequences. To help self-attention handle multimodal input, a parametric adapter, such as the Q-Former in BLIP-2, is often used to align tokens from different modalities. Our empirical study reveals that relying on the self-attention layer alone for modality fusion makes the model less robust to missing modalities and input noise, because the model learns to rely excessively on a single modality. To improve the robustness of the transformer, this paper proposes an implicit approach based on the Wasserstein distance that aligns tokens from different modalities without introducing any additional trainable parameters. Our experiments show that this implicit modality alignment improves the effectiveness of the multimodal transformer on discriminative tasks, as well as its robustness to input noise and missing modalities. We evaluate on four downstream datasets, covering both two-modality and three-modality tasks, and consider both early and late fusion paradigms. The results show that the proposed method significantly improves performance and robustness over all baselines across all datasets and fusion paradigms.
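To make the idea concrete: the abstract describes a parameter-free regularizer that penalizes the Wasserstein distance between the token distributions of different modalities. The Python sketch below is an illustration only, not the authors' code (their exact formulation is in the PDF above). It uses the sliced Wasserstein distance, a cheap random-projection approximation of the Wasserstein distance, and all names, shapes, and the weight align_weight are assumptions for the example.

    import torch

    def sliced_wasserstein(x: torch.Tensor, y: torch.Tensor, n_proj: int = 64) -> torch.Tensor:
        """Sliced 2-Wasserstein distance between two token sets of shape (n, d).

        Assumes both modalities contribute the same number of tokens n; with
        unequal counts, the sorted projections would need quantile interpolation.
        """
        d = x.size(1)
        # Random unit directions for the 1D projections.
        proj = torch.randn(d, n_proj, device=x.device)
        proj = proj / proj.norm(dim=0, keepdim=True)
        # Project tokens onto each direction and sort: in 1D, the Wasserstein
        # distance between empirical distributions reduces to comparing
        # sorted samples elementwise.
        xp = (x @ proj).sort(dim=0).values
        yp = (y @ proj).sort(dim=0).values
        return (xp - yp).pow(2).mean()

    # Hypothetical usage inside a training step: align, say, vision and text
    # tokens before they enter the shared self-attention layers. The alignment
    # term adds no trainable parameters, consistent with the abstract's claim.
    vision_tokens = torch.randn(16, 256)  # (n_tokens, hidden_dim), placeholder
    text_tokens = torch.randn(16, 256)
    task_loss = torch.tensor(0.0)         # stand-in for the downstream task loss
    align_weight = 0.1                    # regularization strength, an assumption
    loss = task_loss + align_weight * sliced_wasserstein(vision_tokens, text_tokens)

Because the alignment is expressed as a distance between distributions rather than a learned mapping, it avoids the extra parameters of an adapter such as the Q-Former while still discouraging the model from collapsing onto a single modality.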

Type: Article
Title: Wasserstein Modality Alignment Makes Your Multimodal Transformer More Robust
Open access status: An open access version is available from UCL Discovery
Publisher version: https://openreview.net/forum?id=dbaGuiYsTl
Language: English
Additional information: © The Authors 2025. Original content in this paper is licensed under the terms of the Creative Commons Attribution 4.0 International (CC BY 4.0) Licence (https://creativecommons.org/licenses/by/4.0/).
UCL classification: UCL
UCL > Provost and Vice Provost Offices > UCL BEAMS
UCL > Provost and Vice Provost Offices > UCL BEAMS > Faculty of Engineering Science > Dept of Electronic and Electrical Eng
URI: https://discovery.ucl.ac.uk/id/eprint/10204207
Downloads since deposit: 63
