M3CoTBench: Benchmark Chain-of-Thought of MLLMs in Medical Image Understanding

¹ Zhejiang University, ² University of Science and Technology of China, ³ East China Normal University, ⁴ Zhejiang Provincial People’s Hospital, ⁵ National University of Singapore

* These authors contributed equally to this work.


Highlights

To address the challenges of evaluating chain-of-thought (CoT) reasoning in medical image interpretation, we introduce M3CoTBench, a benchmark designed to standardize and systematically assess clinically grounded reasoning in multimodal large language models (MLLMs).

(1) Diverse Medical VQA Dataset. We curate a 1,079-image medical visual question answering (VQA) dataset spanning 24 imaging modalities, stratified by difficulty and annotated with step-by-step reasoning aligned with real clinical diagnostic workflows.

(2) Multidimensional CoT-Centric Evaluation Metrics. We propose a comprehensive evaluation protocol that measures reasoning correctness, efficiency, impact, and consistency, enabling fine-grained and interpretable analysis of CoT behaviors across diverse MLLMs.

(3) Comprehensive Model Analysis and Case Studies. We benchmark both general-purpose and medical-domain MLLMs using quantitative metrics and in-depth qualitative case studies, revealing strengths and failure modes in clinical reasoning to guide future model design.

MLLM Evaluation Leaderboard

Comprehensive evaluation results of MLLMs. (Metrics guide: ↑ higher is better; ↓ lower is better.)

#	Model	Category	Correctness F1 ↑	Correctness P ↑	Correctness R ↑	Impact Acc_direct ↑	Impact Acc_step ↑	Impact I ↑	Efficiency E ↑	Efficiency L ↓	Consistency C_path ↑
1	LLaVA-CoT	Open-source	49.80	54.08	46.15	40.08	36.75	-3.33	0.06	1.56	77.02
2	InternVL3.5-8B	Open-source	56.48	60.61	52.88	56.81	53.61	-3.20	0.10	18.27	71.65
3	InternVL3.5-30B	Open-source	59.42	62.15	56.92	63.81	57.60	-6.21	0.03	16.68	76.30
4	Qwen3-VL-Instruct-8B	Open-source	55.17	52.74	57.84	51.30	46.62	-4.68	0.04	93.94	82.65
5	Qwen3-VL-Instruct-30B	Open-source	59.15	56.13	62.51	54.63	51.39	-3.24	0.03	35.63	83.01
6	Qwen3-VL-Thinking-8B	Open-source	59.87	59.84	59.91	48.33	52.83	+4.50	0.02	2.79	76.91
7	Qwen3-VL-Thinking-30B	Open-source	62.15	63.34	61.01	51.90	55.47	+3.57	0.02	1.15	76.02
8	GPT-4.1	Closed-source	60.76	58.32	63.42	56.77	57.97	+1.22	0.17	5.08	81.31
9	GPT-5	Closed-source	55.13	64.15	48.34	58.76	58.29	-0.47	0.06	1.10	65.39
10	Gemini 2.5 Pro	Closed-source	66.07	62.48	70.10	60.24	60.06	-0.18	0.10	1.52	82.00
11	Claude-Sonnet-4.5	Closed-source	56.50	53.62	59.71	51.25	51.07	-0.18	0.15	2.69	85.22
12	LLaVA-Med (7B)	Medical	30.51	36.33	26.30	29.38	29.29	-0.09	0.35	3.22	72.68
13	HuatuoGPT-Vision (7B)	Medical	49.45	51.17	47.85	41.89	34.94	-6.95	0.21	5.92	73.19
14	HealthGPT (3.8B)	Medical	32.56	47.27	24.83	44.11	41.98	-2.13	0.06	15.36	67.72
15	Lingshu-7B	Medical	57.57	63.96	52.34	50.00	42.08	-7.92	0.30	8.37	74.83
16	Lingshu-32B	Medical	59.16	65.68	53.82	51.77	44.95	-6.82	0.21	10.87	71.47
17	MedGemma-4B	Medical	48.13	50.29	46.14	43.33	41.29	-2.04	0.05	20.61	74.03
18	MedGemma-27B	Medical	50.98	48.33	53.81	46.06	45.88	-0.18	0.03	23.71	82.55
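
The relationships between some leaderboard columns can be checked directly from the reported numbers. The sketch below assumes two things that this page does not state explicitly: that F1 is the harmonic mean of P and R, and that the impact score I is Acc_step minus Acc_direct (read here as accuracy with step-by-step reasoning minus accuracy with direct answering). The function names `f1_from_pr` and `impact` are illustrative and are not taken from the benchmark's code. The three rows are copied from the table above.

```python
def f1_from_pr(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (both given in percent)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def impact(acc_direct: float, acc_step: float) -> float:
    """Assumed definition: step-by-step accuracy minus direct-answer accuracy."""
    return acc_step - acc_direct


# (model, F1, P, R, Acc_direct, Acc_step, I) copied from the leaderboard.
ROWS = [
    ("InternVL3.5-8B", 56.48, 60.61, 52.88, 56.81, 53.61, -3.20),
    ("Qwen3-VL-Thinking-8B", 59.87, 59.84, 59.91, 48.33, 52.83, 4.50),
    ("Gemini 2.5 Pro", 66.07, 62.48, 70.10, 60.24, 60.06, -0.18),
]


def max_deviation(rows) -> tuple[float, float]:
    """Largest absolute gap between reported and recomputed F1 and I."""
    f1_gap = max(abs(f1_from_pr(p, r) - f1) for _, f1, p, r, _, _, _ in rows)
    i_gap = max(abs(impact(d, s) - i) for _, _, _, _, d, s, i in rows)
    return f1_gap, i_gap


if __name__ == "__main__":
    f1_gap, i_gap = max_deviation(ROWS)
    print(f"max |F1 gap| = {f1_gap:.3f}, max |I gap| = {i_gap:.3f}")
```

Under these assumptions, a negative I means the model answered less accurately when asked to reason step by step than when asked to answer directly, which is the case for most rows in the table.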

Dataset Curation Pipeline

Data acquisition and annotation pipeline of M3CoTBench. a) Carefully curated medical images from various public sources. b) Multi-type and multi-difficulty QA generation via LLMs and expert calibration. c) Structured annotation of key reasoning steps aligned with clinical diagnostic workflows.

Benchmark Overview

Overview of M3CoTBench. Top: The benchmark covers 24 imaging modalities/examination types, 4 question types, and 13 clinical reasoning tasks. Middle: CoT annotation examples and 4 evaluation dimensions. Bottom: The distribution of image-QA pairs across a) modalities, b) question types, and c) tasks.

BibTeX

@misc{jiang2026m3cotbenchbenchmarkchainofthoughtmllms,
  title={M3CoTBench: Benchmark Chain-of-Thought of MLLMs in Medical Image Understanding},
  author={Juntao Jiang and Jiangning Zhang and Yali Bi and Jinsheng Bai and Weixuan Liu and Weiwei Jin and Zhucun Xue and Yong Liu and Xiaobin Hu and Shuicheng Yan},
  year={2026},
  eprint={2601.08758},
  archivePrefix={arXiv},
  primaryClass={eess.IV},
  url={https://arxiv.org/abs/2601.08758},
}
