Comprehensive evaluation results of MLLMs. (Metrics guide: ↑ higher is better; ↓ lower is better.)
| # | Model | Category | Correctness: F1 ↑ | Correctness: P ↑ | Correctness: R ↑ | Impact: Acc_direct ↑ | Impact: Acc_step ↑ | Impact: I ↑ | Efficiency: E ↑ | Efficiency: L ↓ | Consistency: C_path ↑ |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | LLaVA-CoT | Open-source | 49.80 | 54.08 | 46.15 | 40.08 | 36.75 | -3.33 | 0.06 | 1.56 | 77.02 |
| 2 | InternVL3.5-8B | Open-source | 56.48 | 60.61 | 52.88 | 56.81 | 53.61 | -3.20 | 0.10 | 18.27 | 71.65 |
| 3 | InternVL3.5-30B | Open-source | 59.42 | 62.15 | 56.92 | 63.81 | 57.60 | -6.21 | 0.03 | 16.68 | 76.30 |
| 4 | Qwen3-VL-Instruct-8B | Open-source | 55.17 | 52.74 | 57.84 | 51.30 | 46.62 | -4.68 | 0.04 | 93.94 | 82.65 |
| 5 | Qwen3-VL-Instruct-30B | Open-source | 59.15 | 56.13 | 62.51 | 54.63 | 51.39 | -3.24 | 0.03 | 35.63 | 83.01 |
| 6 | Qwen3-VL-Thinking-8B | Open-source | 59.87 | 59.84 | 59.91 | 48.33 | 52.83 | +4.50 | 0.02 | 2.79 | 76.91 |
| 7 | Qwen3-VL-Thinking-30B | Open-source | 62.15 | 63.34 | 61.01 | 51.90 | 55.47 | +3.57 | 0.02 | 1.15 | 76.02 |
| 8 | GPT-4.1 | Closed-source | 60.76 | 58.32 | 63.42 | 56.77 | 57.97 | +1.22 | 0.17 | 5.08 | 81.31 |
| 9 | GPT-5 | Closed-source | 55.13 | 64.15 | 48.34 | 58.76 | 58.29 | -0.47 | 0.06 | 1.10 | 65.39 |
| 10 | Gemini 2.5 Pro | Closed-source | 66.07 | 62.48 | 70.10 | 60.24 | 60.06 | -0.18 | 0.10 | 1.52 | 82.00 |
| 11 | Claude-Sonnet-4.5 | Closed-source | 56.50 | 53.62 | 59.71 | 51.25 | 51.07 | -0.18 | 0.15 | 2.69 | 85.22 |
| 12 | LLaVA-Med (7B) | Medical | 30.51 | 36.33 | 26.30 | 29.38 | 29.29 | -0.09 | 0.35 | 3.22 | 72.68 |
| 13 | HuatuoGPT-Vision (7B) | Medical | 49.45 | 51.17 | 47.85 | 41.89 | 34.94 | -6.95 | 0.21 | 5.92 | 73.19 |
| 14 | HealthGPT (3.8B) | Medical | 32.56 | 47.27 | 24.83 | 44.11 | 41.98 | -2.13 | 0.06 | 15.36 | 67.72 |
| 15 | Lingshu-7B | Medical | 57.57 | 63.96 | 52.34 | 50.00 | 42.08 | -7.92 | 0.30 | 8.37 | 74.83 |
| 16 | Lingshu-32B | Medical | 59.16 | 65.68 | 53.82 | 51.77 | 44.95 | -6.82 | 0.21 | 10.87 | 71.47 |
| 17 | MedGemma-4B | Medical | 48.13 | 50.29 | 46.14 | 43.33 | 41.29 | -2.04 | 0.05 | 20.61 | 74.03 |
| 18 | MedGemma-27B | Medical | 50.98 | 48.33 | 53.81 | 46.06 | 45.88 | -0.18 | 0.03 | 23.71 | 82.55 |
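
The excerpt does not define the columns, so the following is only a sanity-check sketch under two assumptions: that F1 is the usual harmonic mean of P and R, and that I is the difference Acc_step minus Acc_direct. Both are inferred from the reported values, not stated in the table. The function names `f1_score` and `impact` are hypothetical helpers, and the three rows are copied from the table above.

```python
# Sanity-check sketch: recompute F1 and I from other columns of the table.
# ASSUMPTIONS (inferred from the numbers, not stated in the source):
#   F1 = 2 * P * R / (P + R)      (standard harmonic mean)
#   I  = Acc_step - Acc_direct    (gain from step-wise reasoning)

# (model, P, R, reported F1, Acc_direct, Acc_step, reported I)
ROWS = [
    ("InternVL3.5-8B", 60.61, 52.88, 56.48, 56.81, 53.61, -3.20),
    ("Gemini 2.5 Pro", 62.48, 70.10, 66.07, 60.24, 60.06, -0.18),
    ("Lingshu-7B", 63.96, 52.34, 57.57, 50.00, 42.08, -7.92),
]


def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (both given in percent)."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


def impact(acc_direct: float, acc_step: float) -> float:
    """Assumed definition of I: step-wise accuracy minus direct accuracy."""
    return acc_step - acc_direct


if __name__ == "__main__":
    for model, p, r, f1, acc_d, acc_s, i in ROWS:
        print(
            f"{model}: F1 recomputed {f1_score(p, r):.2f} (reported {f1:.2f}), "
            f"I recomputed {impact(acc_d, acc_s):+.2f} (reported {i:+.2f})"
        )
```

Under these assumptions the recomputed values match the reported ones to two decimals for the rows shown. Not every row agrees exactly: for GPT-4.1 the difference 57.97 minus 56.77 is 1.20 while the table reports +1.22, which may reflect rounding of unrounded values or a slightly different definition.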