Zhi, Zhuo;
Sun, Yuxuan;
Wu, Qiangqiang;
Liu, Ziquan;
Rodrigues, Miguel;
(2025)
Wasserstein Modality Alignment Makes Your Multimodal Transformer More Robust.
Transactions on Machine Learning Research
(In press).
Abstract
Multimodal fusion with a multimodal transformer is an effective method under both early and late fusion paradigms. In a multimodal transformer, however, modality fusion is performed solely through the self-attention mechanism, which was originally designed for unimodal token sequences. To improve the self-attention mechanism for handling multimodal input, a parametric adapter model, such as the Q-Former in BLIP-2, is often used to align tokens from different modalities. Our empirical study reveals that using only the self-attention layer to perform modality fusion makes the model less robust to missing modalities and input noise, because the model comes to rely excessively on one particular modality. To improve the robustness of the transformer, this paper proposes an implicit approach based on the Wasserstein distance that aligns tokens from different modalities without any additional trainable parameters. Our empirical study shows that this implicit modality alignment improves the effectiveness of the multimodal transformer on discriminative tasks, as well as its robustness to input noise and missing modalities. We conduct experiments on four downstream task datasets, covering both two-modality and three-modality tasks, and consider both early and late fusion paradigms. The experimental results show that our proposed method yields significant improvements in both performance and robustness over all baselines across all datasets and fusion paradigms.
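The abstract describes aligning the token distributions of different modalities via a Wasserstein distance, used as a loss term with no extra trainable parameters. The exact formulation is given in the paper rather than here, so the following is only an illustrative sketch: it approximates a Wasserstein-type distance between two token sets with the sliced Wasserstein distance (random 1-D projections), a common parameter-free choice; the function name and the use of equal-sized token sets are assumptions for illustration.

```python
import numpy as np


def sliced_wasserstein(tokens_a, tokens_b, n_projections=128, seed=0):
    """Approximate a Wasserstein-type distance between two token sets
    (each of shape (n_tokens, d)) via random 1-D projections.

    Illustrative sketch only; not the paper's exact formulation.
    """
    rng = np.random.default_rng(seed)
    d = tokens_a.shape[1]
    # Random unit directions for the 1-D projections.
    dirs = rng.normal(size=(n_projections, d))
    dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
    # In 1-D, the squared W2 between equal-sized samples is the mean
    # squared gap between the sorted projected values.
    proj_a = np.sort(tokens_a @ dirs.T, axis=0)
    proj_b = np.sort(tokens_b @ dirs.T, axis=0)
    return float(np.mean((proj_a - proj_b) ** 2))
```

In a training loop, such a term would be added to the task loss so that, e.g., image and text token distributions are pulled together before self-attention fuses them; identical token distributions give a distance of zero.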
Type: Article
Title: Wasserstein Modality Alignment Makes Your Multimodal Transformer More Robust
Open access status: An open access version is available from UCL Discovery
Publisher version: https://openreview.net/forum?id=dbaGuiYsTl
Language: English
Additional information: © The Authors 2025. Original content in this paper is licensed under the terms of the Creative Commons Attribution 4.0 International (CC BY 4.0) Licence (https://creativecommons.org/licenses/by/4.0/).
UCL classification: UCL > Provost and Vice Provost Offices > UCL BEAMS > Faculty of Engineering Science > Dept of Electronic and Electrical Eng
URI: https://discovery.ucl.ac.uk/id/eprint/10204207



