TY  - INPR
TI  - Wasserstein Modality Alignment Makes Your Multimodal Transformer More Robust
AV  - public
Y1  - 2025/01/23/
N1  - © The Authors 2025. Original content in this paper is licensed under the terms of the Creative Commons Attribution 4.0 International (CC BY 4.0) Licence (https://creativecommons.org/licenses/by/4.0/).
N2  - Multimodal fusion with a multimodal transformer is an effective method for both early and late fusion paradigms. However, in a multimodal transformer, modality fusion is performed solely through the self-attention mechanism, which was originally designed for unimodal token sequences. To improve the self-attention mechanism for handling multimodal input, a parametric adapter model, such as the Q-Former in BLIP-2, is often used to align tokens from different modalities. Our empirical study reveals that relying solely on the self-attention layer for modality fusion makes the model less robust to missing modalities and input noise, because the model comes to over-rely on a single modality. To improve the robustness of the transformer, our paper proposes an implicit approach based on the Wasserstein distance that aligns tokens from different modalities without any additional trainable parameters. Our empirical study shows that this implicit modality alignment improves the effectiveness of the multimodal transformer in discriminative tasks, as well as its robustness to input noise and missing modalities. We conduct experiments on four downstream task datasets, including two-modality and three-modality tasks, and consider both early and late fusion paradigms. The experimental results show that our proposed method achieves significant improvements in both performance and robustness over all baselines across all datasets and fusion paradigms.
ID  - discovery10204207
UR  - https://openreview.net/forum?id=dbaGuiYsTl
JF  - Transactions on Machine Learning Research
A1  - Zhi, Zhuo
A1  - Sun, Yuxuan
A1  - Wu, Qiangqiang
A1  - Liu, Ziquan
A1  - Rodrigues, Miguel
ER  -