eprintid: 10204207
rev_number: 6
eprint_status: archive
userid: 699
dir: disk0/10/20/42/07
datestamp: 2025-02-03 12:26:26
lastmod: 2025-02-03 12:26:26
status_changed: 2025-02-03 12:26:26
type: article
metadata_visibility: show
sword_depositor: 699
creators_name: Zhi, Zhuo
creators_name: Sun, Yuxuan
creators_name: Wu, Qiangqiang
creators_name: Liu, Ziquan
creators_name: Rodrigues, Miguel
title: Wasserstein Modality Alignment Makes Your Multimodal Transformer More Robust
ispublished: inpress
divisions: UCL
divisions: B04
divisions: F46
note: © The Authors 2025. Original content in this paper is licensed under the terms of the Creative Commons Attribution 4.0 International (CC BY 4.0) Licence (https://creativecommons.org/licenses/by/4.0/).
abstract: Multimodal fusion with a multimodal transformer is an effective method for both early and late fusion paradigms. However, in a multimodal transformer, modality fusion is performed solely through the self-attention mechanism, which was originally designed for unimodal token sequences. To improve the self-attention mechanism for handling multimodal input, a parametric adapter model, such as the Q-Former in BLIP-2, is often used to align tokens from different modalities. Our empirical study reveals that using only the self-attention layer to perform modality fusion makes the model less robust to missing modalities and input noise, as the model comes to rely excessively on a single modality. To improve the robustness of the transformer, we propose an implicit approach based on the Wasserstein distance that aligns tokens from different modalities without introducing any additional trainable parameters. Our empirical study shows that this implicit modality alignment improves the effectiveness of the multimodal transformer in discriminative tasks, as well as its robustness to input noise and missing modalities. We conduct experiments on four downstream-task datasets, including two-modality and three-modality tasks, and consider different fusion paradigms, i.e., early and late fusion. The experimental results show that our proposed method yields a significant improvement in both performance and robustness over all baselines across all datasets and fusion paradigms.
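abstract_note: The abstract describes aligning token distributions across modalities via a Wasserstein-distance penalty with no extra trainable parameters. The paper's exact formulation is not given here; the following is a minimal illustrative sketch, assuming a sliced-Wasserstein approximation in PyTorch and hypothetical names (sliced_wasserstein, img_tokens, txt_tokens, lambda_align), not the authors' implementation.

```python
# Illustrative sketch only (assumed formulation, not the paper's method):
# align token embeddings from two modalities with a sliced-Wasserstein penalty
# that introduces no additional trainable parameters.
import torch

def sliced_wasserstein(tokens_a, tokens_b, n_projections=64):
    """Approximate Wasserstein distance between two token sets.

    tokens_a: (N, d) embeddings from modality A (e.g., image tokens)
    tokens_b: (M, d) embeddings from modality B (e.g., text tokens)
    """
    d = tokens_a.shape[-1]
    # Random (non-learnable) 1-D projection directions.
    directions = torch.randn(d, n_projections, device=tokens_a.device)
    directions = directions / directions.norm(dim=0, keepdim=True)

    proj_a = tokens_a @ directions  # (N, n_projections)
    proj_b = tokens_b @ directions  # (M, n_projections)

    # The 1-D Wasserstein distance compares sorted projections; use a common
    # quantile grid so token counts N and M may differ.
    q = torch.linspace(0, 1, steps=min(proj_a.shape[0], proj_b.shape[0]),
                       device=tokens_a.device)
    quant_a = torch.quantile(proj_a, q, dim=0)
    quant_b = torch.quantile(proj_b, q, dim=0)
    return (quant_a - quant_b).abs().mean()

# Hypothetical usage: add the alignment term to the task loss before fusion.
# loss = task_loss + lambda_align * sliced_wasserstein(img_tokens, txt_tokens)
```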
date: 2025-01-23
date_type: published
official_url: https://openreview.net/forum?id=dbaGuiYsTl
oa_status: green
full_text_type: other
language: eng
primo: open
primo_central: open_green
verified: verified_manual
elements_id: 2357277
lyricists_name: Zhi, Zhuo
lyricists_id: ZZZHI62
actors_name: Zhi, Zhuo
actors_id: ZZZHI62
actors_role: owner
full_text_status: public
publication: Transactions on Machine Learning Research
citation: Zhi, Zhuo; Sun, Yuxuan; Wu, Qiangqiang; Liu, Ziquan; Rodrigues, Miguel; (2025) Wasserstein Modality Alignment Makes Your Multimodal Transformer More Robust. Transactions on Machine Learning Research (In press). Green open access
document_url: https://discovery.ucl.ac.uk/id/eprint/10204207/1/3212_Wasserstein_Modality_Alig.pdf