UCL Discovery

DisCoCLIP: A Distributional Compositional Tensor Network Encoder for Vision-Language Understanding

Lo, Kin Ian; Hawashin, Hala; Abbaszadeh, Mina; Limbäck-Stokin, Tilen Gaetano; Wazni, Hadi; Sadrzadeh, Mehrnoosh; (2025) DisCoCLIP: A Distributional Compositional Tensor Network Encoder for Vision-Language Understanding. In: Proceedings of the 14th Joint Conference on Lexical and Computational Semantics (*SEM 2025). (pp. 316-327). Association for Computational Linguistics.
Green open access

Abstract

Recent vision–language models excel at large-scale image–text alignment but often neglect the compositional structure of language, leading to failures on tasks that hinge on word order and predicate–argument structure. We introduce DisCoCLIP, a multimodal encoder that combines a frozen CLIP vision transformer with a novel tensor network text encoder that explicitly encodes syntactic structure. Sentences are parsed with a Combinatory Categorial Grammar parser to yield distributional word tensors whose contractions mirror the sentence’s grammatical derivation. To keep the model efficient, high-order tensors are factorized with tensor decompositions, reducing parameter count from tens of millions to under one million. Trained end-to-end with a self-supervised contrastive loss, DisCoCLIP markedly improves sensitivity to verb semantics and word order: it raises CLIP’s SVOProbes verb accuracy from 77.6% to 82.4%, boosts ARO attribution and relation scores by over 9% and 4%, and achieves 93.7% on a newly introduced SVO-Swap benchmark. These results demonstrate that embedding explicit linguistic structure via tensor networks yields interpretable, parameter-efficient representations that substantially improve compositional reasoning in vision–language tasks.
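
Illustrative note: the sketch below is not taken from the paper. It shows, under stated assumptions, how a subject-verb-object sentence could be composed by contracting an order-3 verb tensor with subject and object vectors, and how factorizing that tensor reduces the parameter count. The CP (rank-R) factorization, the dimension and rank values, and all function names are assumptions chosen for illustration; the abstract says only that "tensor decompositions" are used.

```python
import random


def contract_full(subj, verb, obj):
    """sentence[s] = sum over i, j of subj[i] * verb[i][s][j] * obj[j]."""
    d = len(subj)
    return [
        sum(subj[i] * verb[i][s][j] * obj[j] for i in range(d) for j in range(d))
        for s in range(d)
    ]


def contract_cp(subj, A, B, C, obj):
    """Same contraction, assuming verb[i][s][j] = sum_r A[i][r] * B[s][r] * C[j][r].

    The full d x d x d verb tensor is never built.
    """
    d, rank = len(subj), len(A[0])
    left = [sum(subj[i] * A[i][r] for i in range(d)) for r in range(rank)]
    right = [sum(obj[j] * C[j][r] for j in range(d)) for r in range(rank)]
    return [sum(left[r] * B[s][r] * right[r] for r in range(rank)) for s in range(d)]


def cp_to_full(A, B, C):
    """Expand the three factor matrices into the full order-3 tensor."""
    d, rank = len(A), len(A[0])
    return [
        [[sum(A[i][r] * B[s][r] * C[j][r] for r in range(rank)) for j in range(d)]
         for s in range(d)]
        for i in range(d)
    ]


def param_counts(d, rank):
    """Parameters per verb: full tensor versus three d x rank factor matrices."""
    return d ** 3, 3 * d * rank


def demo(d=4, rank=2, seed=0):
    """Return the largest gap between the full and factorized contractions."""
    rng = random.Random(seed)
    rand_matrix = lambda: [[rng.uniform(-1, 1) for _ in range(rank)] for _ in range(d)]
    A, B, C = rand_matrix(), rand_matrix(), rand_matrix()
    subj = [rng.uniform(-1, 1) for _ in range(d)]
    obj = [rng.uniform(-1, 1) for _ in range(d)]
    full = contract_full(subj, cp_to_full(A, B, C), obj)
    fast = contract_cp(subj, A, B, C, obj)
    return max(abs(x - y) for x, y in zip(full, fast))
```

In this toy setting, `param_counts(4, 2)` returns `(64, 24)`: the full tensor grows cubically with the embedding dimension, while the factorized form grows linearly. Because subject and object contract on different indices of the verb tensor, swapping them generally changes the resulting sentence vector, which is the kind of word-order sensitivity the abstract describes.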

Type: Proceedings paper
Title: DisCoCLIP: A Distributional Compositional Tensor Network Encoder for Vision-Language Understanding
Event: Proceedings of the 14th Joint Conference on Lexical and Computational Semantics (*SEM 2025)
Open access status: An open access version is available from UCL Discovery
DOI: 10.18653/v1/2025.starsem-1.25
Publisher version: https://doi.org/10.18653/v1/2025.starsem-1.25
Language: English
Additional information: © ACL 2025. Materials published in or after 2016 are licensed on a Creative Commons Attribution 4.0 International License (https://creativecommons.org/licenses/by/4.0/).
UCL classification: UCL
UCL > Provost and Vice Provost Offices > UCL BEAMS
UCL > Provost and Vice Provost Offices > UCL BEAMS > Faculty of Engineering Science > Dept of Computer Science
URI: https://discovery.ucl.ac.uk/id/eprint/10216898