Lo, Kin Ian; Hawashin, Hala; Abbaszadeh, Mina; Limbäck-Stokin, Tilen Gaetano; Wazni, Hadi; Sadrzadeh, Mehrnoosh; (2025) DisCoCLIP: A Distributional Compositional Tensor Network Encoder for Vision-Language Understanding. In: Proceedings of the 14th Joint Conference on Lexical and Computational Semantics (*SEM 2025). (pp. 316-327). Association for Computational Linguistics.
Abstract
Recent vision–language models excel at large-scale image–text alignment but often neglect the compositional structure of language, leading to failures on tasks that hinge on word order and predicate–argument structure. We introduce DisCoCLIP, a multimodal encoder that combines a frozen CLIP vision transformer with a novel tensor network text encoder that explicitly encodes syntactic structure. Sentences are parsed with a Combinatory Categorial Grammar (CCG) parser to yield distributional word tensors whose contractions mirror the sentence’s grammatical derivation. To keep the model efficient, high-order tensors are factorized with tensor decompositions, reducing the parameter count from tens of millions to under one million. Trained end-to-end with a self-supervised contrastive loss, DisCoCLIP markedly improves sensitivity to verb semantics and word order: it raises CLIP’s SVO-Probes verb accuracy from 77.6% to 82.4%, boosts ARO attribution and relation scores by over 9% and 4%, and achieves 93.7% on a newly introduced SVO-Swap benchmark. These results demonstrate that embedding explicit linguistic structure via tensor networks yields interpretable, parameter-efficient representations that substantially improve compositional reasoning in vision–language tasks.
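As a minimal sketch of the mechanism the abstract describes (all names, shapes, and the CP-style rank below are illustrative assumptions, not the paper's actual implementation), a transitive verb can be modelled as an order-3 tensor that is contracted with subject and object vectors, mirroring the CCG derivation NP (S\NP)/NP NP ⇒ S, and then factorized to cut parameters:

```python
# Hypothetical sketch of contraction-as-derivation and tensor factorization.
# Dimensions d and rank r are toy values chosen for illustration only.
import numpy as np

d, r = 64, 16                     # embedding dimension; assumed CP rank
rng = np.random.default_rng(0)

subj = rng.normal(size=d)         # noun vector, e.g. "dog"
obj = rng.normal(size=d)          # noun vector, e.g. "ball"

# Dense transitive verb: an order-3 tensor with d**3 parameters.
verb = rng.normal(size=(d, d, d))

# Sentence meaning = contraction of the verb tensor with its arguments,
# following the grammatical derivation (subject fills index i, object k).
sentence = np.einsum("i,ijk,k->j", subj, verb, obj)

# CP-factorized verb: verb[i,j,k] ≈ sum_r A[r,i] * B[r,j] * C[r,k],
# only 3*d*r parameters, contracted factor-wise without ever
# materializing the full d x d x d tensor.
A, B, C = (rng.normal(size=(r, d)) for _ in range(3))
sentence_cp = ((A @ subj) * (C @ obj)) @ B    # shape (d,)
```

At these toy sizes the dense verb tensor has d³ = 262,144 parameters while the three CP factors total 3·d·r = 3,072, illustrating how tensor decompositions can reduce parameter counts by orders of magnitude, in line with the abstract's reduction from tens of millions to under one million.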
| Field | Value |
|---|---|
| Type: | Proceedings paper |
| Title: | DisCoCLIP: A Distributional Compositional Tensor Network Encoder for Vision-Language Understanding |
| Event: | Proceedings of the 14th Joint Conference on Lexical and Computational Semantics (*SEM 2025) |
| Open access status: | An open access version is available from UCL Discovery |
| DOI: | 10.18653/v1/2025.starsem-1.25 |
| Publisher version: | https://doi.org/10.18653/v1/2025.starsem-1.25 |
| Language: | English |
| Additional information: | © ACL 2025. Materials published in or after 2016 are licensed on a Creative Commons Attribution 4.0 International License (https://creativecommons.org/licenses/by/4.0/). |
| UCL classification: | UCL > Provost and Vice Provost Offices > UCL BEAMS > Faculty of Engineering Science > Dept of Computer Science |
| URI: | https://discovery.ucl.ac.uk/id/eprint/10216898 |