Li, Wenjie; Zhang, Yujie; Sun, Haoran; Li, Yueqi; Zhang, Fanrui; Xu, Mengzhe; Clausich, Victoria Borja; ... Wang, Lei (2025)
CX-Mind: A Pioneering Multimodal Large Language Model for Interleaved Reasoning in Chest X-ray via Curriculum-Guided Reinforcement Learning.
Information Fusion, Article 104027. 10.1016/j.inffus.2025.104027. (In press).
Text: CX-Mind.pdf - Accepted Version (31MB)
Abstract
Chest X-ray (CXR) imaging is one of the most widely used diagnostic modalities in clinical practice, encompassing a broad spectrum of diagnostic tasks. Recent advances have seen reasoning-based multimodal large language models (MLLMs) applied extensively in medical imaging to improve diagnostic efficiency and interpretability. However, existing multimodal models predominantly rely on "one-time" diagnostic approaches and lack verifiable supervision of the reasoning process, which leads to challenges in multi-task CXR diagnosis: lengthy reasoning, sparse rewards, and frequent hallucinations. To address these issues, we propose CX-Mind, the first generative model to achieve interleaved "think-answer" reasoning for CXR tasks, driven by curriculum-based reinforcement learning and verifiable process rewards (CuRL-VPR). Specifically, we constructed an instruction-tuning dataset, CX-Set, comprising 708,473 images and 2,619,148 samples, and generated 42,828 high-quality interleaved reasoning data points supervised by clinical reports. Optimization was conducted in two stages under the Group Relative Policy Optimization framework: basic reasoning is first stabilized on closed-domain tasks, then transferred to open-domain diagnostics, with rule-based conditional process rewards removing the need for a pretrained reward model. Extensive experiments demonstrate that CX-Mind significantly outperforms existing medical and general-domain MLLMs in visual understanding, text generation, and spatiotemporal alignment, achieving an average performance improvement of 25.1% over comparable CXR-specific models. On a real-world clinical dataset (Rui-CXR), CX-Mind achieves a mean recall@1 across 14 diseases that substantially surpasses the second-best result, and multi-center expert evaluations further confirm its clinical utility across multiple dimensions. CX-Mind establishes a new paradigm for building interpretable, high-performing medical MLLMs.
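To make the training recipe in the abstract more concrete, the sketch below illustrates, under stated assumptions, how a rule-based reward for interleaved "think-answer" outputs and GRPO-style group-relative advantages might be computed. It is a minimal sketch, not the CX-Mind implementation: the `<think>`/`<answer>` tag format, the 0.5/0.5 reward split, and the helper names (`rule_based_reward`, `grpo_advantages`) are illustrative assumptions.

```python
import re

import numpy as np

# Pattern for an interleaved "think-answer" trace; the exact tags are an
# assumption for illustration, not the paper's documented format.
THINK_ANSWER = re.compile(r"<think>.+?</think>\s*<answer>(.+?)</answer>", re.DOTALL)


def rule_based_reward(completion: str, reference: str) -> float:
    """Illustrative rule-based reward: format compliance plus answer match.

    A stand-in for the paper's verifiable process rewards, not the actual
    CX-Mind reward function.
    """
    match = THINK_ANSWER.search(completion)
    if match is None:
        return 0.0  # malformed output: no interleaved think/answer structure
    reward = 0.5  # assumed partial credit for a well-formed reasoning trace
    if match.group(1).strip().lower() == reference.strip().lower():
        reward += 0.5  # assumed credit for a verifiably correct answer
    return reward


def grpo_advantages(rewards: list[float]) -> np.ndarray:
    """GRPO-style group-relative advantages: normalize the rewards of a
    group of completions sampled for the same prompt by the group's mean
    and standard deviation, with no learned value network."""
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + 1e-8)


# Example: one prompt, a group of four sampled completions.
completions = [
    "<think>Opacity in the left lower lobe.</think><answer>pneumonia</answer>",
    "<think>Lungs appear clear.</think><answer>no finding</answer>",
    "pneumonia",  # answer only, missing the interleaved format
    "<think>Consolidation is seen.</think><answer>pneumonia</answer>",
]
rewards = [rule_based_reward(c, "pneumonia") for c in completions]
print(rewards)                   # [1.0, 0.5, 0.0, 1.0]
print(grpo_advantages(rewards))  # positive for correct, well-formed outputs
```

The design point this highlights is the one the abstract names: because rewards come from verifiable rules (format checks and answer matching against report-derived labels) rather than a learned scorer, no pretrained reward model is required.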
| Type: | Article |
|---|---|
| Title: | CX-Mind: A Pioneering Multimodal Large Language Model for Interleaved Reasoning in Chest X-ray via Curriculum-Guided Reinforcement Learning |
| Open access status: | An open access version is available from UCL Discovery |
| DOI: | 10.1016/j.inffus.2025.104027 |
| Publisher version: | https://doi.org/10.1016/j.inffus.2025.104027 |
| Language: | English |
| Additional information: | © The Author(s), 2025. This is an open access article distributed in accordance with the Creative Commons Attribution Non Commercial (CC BY-NC 4.0) license, which permits others to distribute, remix, adapt, build upon this work non-commercially, and license their derivative works on different terms, provided the original work is properly cited, appropriate credit is given, any changes made indicated, and the use is non-commercial. See: http://creativecommons.org/licenses/by-nc/4.0/. |
| Keywords: | Medical Reasoning, Multimodal Large Language Models, Chest X-ray, Reinforcement Learning, Curriculum Learning |
| UCL classification: | UCL; UCL > Provost and Vice Provost Offices > UCL BEAMS; UCL > Provost and Vice Provost Offices > UCL BEAMS > Faculty of Engineering Science > Dept of Mechanical Engineering |
| URI: | https://discovery.ucl.ac.uk/id/eprint/10218774 |