M3CoTBench

M3CoTBench: Benchmark Chain-of-Thought of MLLMs in Medical Image Understanding

1 Zhejiang University, 2 University of Science and Technology of China, 3 East China Normal University, 4 Zhejiang Provincial People’s Hospital, 5 National University of Singapore

* These authors contributed equally to this work.

Highlights

To address the challenges of evaluating chain-of-thought (CoT) reasoning in medical image interpretation, we introduce M3CoTBench, a benchmark designed to standardize and systematically assess clinically grounded reasoning in multimodal large language models (MLLMs).

(1) Diverse Medical VQA Dataset. We curate a 1,079-image medical visual question answering (VQA) dataset spanning 24 imaging modalities, stratified by difficulty and annotated with step-by-step reasoning aligned with real clinical diagnostic workflows.

(2) Multidimensional CoT-Centric Evaluation Metrics. We propose a comprehensive evaluation protocol that measures reasoning correctness, efficiency, impact, and consistency, enabling fine-grained and interpretable analysis of CoT behaviors across diverse MLLMs (see the sketch of the impact metric after this list).

(3) Comprehensive Model Analysis and Case Studies. We benchmark both general-purpose and medical-domain MLLMs using quantitative metrics and in-depth qualitative case studies, revealing strengths and failure modes in clinical reasoning to guide future model design.
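As a concrete illustration of the impact dimension: in the leaderboard below, the reported I values are consistent with the accuracy shift Acc_step - Acc_direct, i.e., how much step-by-step reasoning helps or hurts compared with answering directly. The snippet that follows is a minimal sketch under that assumption; the function and field names are illustrative and are not the released M3CoTBench evaluation code.

```python
# Minimal sketch of the impact dimension, assuming I = Acc_step - Acc_direct
# (consistent with the leaderboard numbers below). Names are illustrative,
# not the official M3CoTBench evaluation API.
from dataclasses import dataclass


@dataclass
class ImpactResult:
    acc_direct: float  # accuracy (%) when the model answers without CoT
    acc_step: float    # accuracy (%) when the model reasons step by step
    impact: float      # signed accuracy shift introduced by CoT


def impact_score(direct_correct: list[bool], step_correct: list[bool]) -> ImpactResult:
    """Compare per-question correctness with and without chain-of-thought."""
    assert direct_correct and len(direct_correct) == len(step_correct)
    n = len(direct_correct)
    acc_direct = 100.0 * sum(direct_correct) / n
    acc_step = 100.0 * sum(step_correct) / n
    return ImpactResult(acc_direct, acc_step, impact=acc_step - acc_direct)


# Toy usage: CoT recovers one extra question out of four (+25.0 points).
print(impact_score([True, False, False, True], [True, True, False, True]))
```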

MLLM Evaluation Leaderboard

Comprehensive evaluation results of MLLMs. (Metrics guide: ↑ higher is better, ↓ lower is better.)

Column groups: Correctness (Cpath: F1, P, R); Impact (Acc_direct, Acc_step, I); Efficiency (E, L); Consistency.

| # | Model | Category | F1 | P | R | Acc_direct | Acc_step | I | E | L | Consistency |
|---|-------|----------|----|---|---|------------|----------|---|---|---|-------------|
| 1 | LLaVA-CoT | Open-source | 49.80 | 54.08 | 46.15 | 40.08 | 36.75 | -3.33 | 0.06 | 1.56 | 77.02 |
| 2 | InternVL3.5-8B | Open-source | 56.48 | 60.61 | 52.88 | 56.81 | 53.61 | -3.20 | 0.10 | 18.27 | 71.65 |
| 3 | InternVL3.5-30B | Open-source | 59.42 | 62.15 | 56.92 | 63.81 | 57.60 | -6.21 | 0.03 | 16.68 | 76.30 |
| 4 | Qwen3-VL-Instruct-8B | Open-source | 55.17 | 52.74 | 57.84 | 51.30 | 46.62 | -4.68 | 0.04 | 93.94 | 82.65 |
| 5 | Qwen3-VL-Instruct-30B | Open-source | 59.15 | 56.13 | 62.51 | 54.63 | 51.39 | -3.24 | 0.03 | 35.63 | 83.01 |
| 6 | Qwen3-VL-Thinking-8B | Open-source | 59.87 | 59.84 | 59.91 | 48.33 | 52.83 | +4.50 | 0.02 | 2.79 | 76.91 |
| 7 | Qwen3-VL-Thinking-30B | Open-source | 62.15 | 63.34 | 61.01 | 51.90 | 55.47 | +3.57 | 0.02 | 1.15 | 76.02 |
| 8 | GPT-4.1 | Closed-source | 60.76 | 58.32 | 63.42 | 56.77 | 57.97 | +1.22 | 0.17 | 5.08 | 81.31 |
| 9 | GPT-5 | Closed-source | 55.13 | 64.15 | 48.34 | 58.76 | 58.29 | -0.47 | 0.06 | 1.10 | 65.39 |
| 10 | Gemini 2.5 Pro | Closed-source | 66.07 | 62.48 | 70.10 | 60.24 | 60.06 | -0.18 | 0.10 | 1.52 | 82.00 |
| 11 | Claude-Sonnet-4.5 | Closed-source | 56.50 | 53.62 | 59.71 | 51.25 | 51.07 | -0.18 | 0.15 | 2.69 | 85.22 |
| 12 | LLaVA-Med (7B) | Medical | 30.51 | 36.33 | 26.30 | 29.38 | 29.29 | -0.09 | 0.35 | 3.22 | 72.68 |
| 13 | HuatuoGPT-Vision (7B) | Medical | 49.45 | 51.17 | 47.85 | 41.89 | 34.94 | -6.95 | 0.21 | 5.92 | 73.19 |
| 14 | HealthGPT (3.8B) | Medical | 32.56 | 47.27 | 24.83 | 44.11 | 41.98 | -2.13 | 0.06 | 15.36 | 67.72 |
| 15 | Lingshu-7B | Medical | 57.57 | 63.96 | 52.34 | 50.00 | 42.08 | -7.92 | 0.30 | 8.37 | 74.83 |
| 16 | Lingshu-32B | Medical | 59.16 | 65.68 | 53.82 | 51.77 | 44.95 | -6.82 | 0.21 | 10.87 | 71.47 |
| 17 | MedGemma-4B | Medical | 48.13 | 50.29 | 46.14 | 43.33 | 41.29 | -2.04 | 0.05 | 20.61 | 74.03 |
| 18 | MedGemma-27B | Medical | 50.98 | 48.33 | 53.81 | 46.06 | 45.88 | -0.18 | 0.03 | 23.71 | 82.55 |
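For quick programmatic slicing of the leaderboard, the snippet below hard-codes a few rows copied from the table above and ranks them by Cpath F1. The list-of-dicts layout is purely illustrative; it is not an official machine-readable release of the benchmark results.

```python
# A few rows copied from the leaderboard above, stored as plain dicts
# purely for illustration (not an official machine-readable release).
rows = [
    {"model": "Gemini 2.5 Pro",        "category": "Closed-source", "F1": 66.07, "I": -0.18, "consistency": 82.00},
    {"model": "Qwen3-VL-Thinking-30B", "category": "Open-source",   "F1": 62.15, "I": +3.57, "consistency": 76.02},
    {"model": "GPT-4.1",               "category": "Closed-source", "F1": 60.76, "I": +1.22, "consistency": 81.31},
    {"model": "Lingshu-32B",           "category": "Medical",       "F1": 59.16, "I": -6.82, "consistency": 71.47},
]

# Rank by reasoning-path correctness (Cpath F1), highest first.
for row in sorted(rows, key=lambda r: r["F1"], reverse=True):
    print(f"{row['model']:<24} {row['category']:<14} F1={row['F1']:.2f}  I={row['I']:+.2f}")
```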

Benchmark

Dataset Curation Pipeline

Data acquisition and annotation pipeline of M3CoTBench. a) Careful curation of medical images from diverse public sources. b) Multi-type and multi-difficulty QA generation via LLMs with expert calibration. c) Structured annotation of key reasoning steps aligned with clinical diagnostic workflows.

Benchmark Overview

Overview of M3CoTBench. Top: The benchmark covers 24 imaging modalities/examination types, 4 question types, and 13 clinical reasoning tasks. Middle: CoT annotation examples and 4 evaluation dimensions. Bottom: The distribution of image-QA pairs across a) modalities, b) question types, and c) tasks.

Benchmark Comparison

Criterion-by-criterion comparison with existing benchmarks. ✔: satisfied; ✖: not satisfied. Our proposed M3CoTBench offers clear advantages across the key dimensions.

BibTeX

@misc{jiang2026m3cotbenchbenchmarkchainofthoughtmllms,
      title={M3CoTBench: Benchmark Chain-of-Thought of MLLMs in Medical Image Understanding}, 
      author={Juntao Jiang and Jiangning Zhang and Yali Bi and Jinsheng Bai and Weixuan Liu and Weiwei Jin and Zhucun Xue and Yong Liu and Xiaobin Hu and Shuicheng Yan},
      year={2026},
      eprint={2601.08758},
      archivePrefix={arXiv},
      primaryClass={eess.IV},
      url={https://arxiv.org/abs/2601.08758}, 
}