MV-MATH: Evaluating Multimodal Math Reasoning in Multi-Visual Contexts

¹MAIS, Institute of Automation, Chinese Academy of Sciences  ²School of Artificial Intelligence, University of Chinese Academy of Sciences
CVPR 2025

Performance comparison of six Multimodal Large Language Models (MLLMs) on our proposed MV-MATH dataset across 11 subjects (left) and 3 question types (right). SAR: Step Accuracy Rate; QCR: Question Completeness Rate.

Introduction

Multimodal Large Language Models (MLLMs) have shown promising capabilities in mathematical reasoning within visual contexts across various datasets. However, most existing multimodal math benchmarks are limited to single-visual contexts, which diverges from the multi-visual scenarios commonly encountered in real-world mathematical applications. To address this gap, we introduce MV-MATH: a meticulously curated dataset of 2,009 high-quality mathematical problems. Each problem integrates multiple images interleaved with text, derived from authentic K-12 scenarios, and enriched with detailed annotations. MV-MATH includes multiple-choice, free-form, and multi-step questions, covering 11 subject areas across 3 difficulty levels, and serves as a comprehensive and rigorous benchmark for assessing MLLMs’ mathematical reasoning in multi-visual contexts. Through extensive experimentation, we observe that MLLMs encounter substantial challenges in multi-visual math tasks, with a considerable performance gap relative to human capabilities on MV-MATH. Furthermore, we analyze the performance and error patterns of various models, providing insights into MLLMs' mathematical reasoning capabilities within multi-visual settings.

Leaderboard on MV-MATH

Accuracy scores of models on our MV-MATH benchmark

| Model | Overall | AG | Algebra | MG | Combinatorics | TG | Logic | SG | Arithmetic | CG | DG | Statistics |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Claude-3.5🥇 | 33.9 | 32.7 | 38.1 | 34.3 | 46.7 | 33.3 | 29.8 | 36.3 | 54.2 | 27.0 | 38.2 | 41.1 |
| GPT-4o🥈 | 32.1 | 28.7 | 36.7 | 34.4 | 39.4 | 30.6 | 29.8 | 38.2 | 41.7 | 20.8 | 44.3 | 47.0 |
| Gemini-1.5-Pro🥉 | 29.1 | 29.9 | 32.9 | 28.3 | 28.0 | 30.5 | 40.5 | 33.9 | 42.7 | 21.7 | 30.6 | 35.2 |
| Qwen-vl-max | 26.9 | 27.6 | 32.1 | 24.7 | 36.5 | 29.6 | 31.8 | 30.9 | 37.5 | 23.7 | 32.3 | 23.5 |
| GPT-4V | 24.5 | 18.7 | 31.6 | 32.4 | 25.6 | 26.3 | 36.3 | 26.8 | 43.7 | 19.3 | 33.8 | 35.2 |
| Qwen-vl-plus | 19.7 | 17.9 | 24.1 | 22.0 | 16.0 | 19.9 | 24.8 | 15.9 | 15.2 | 18.7 | 31.4 | 29.4 |
| QVQ-72B-Preview | 29.3 | - | - | - | - | - | - | - | - | - | - | - |
| LLaVA-OneVision-Chat-72B | 26.2 | 25.1 | 32.4 | 23.9 | 35.3 | 28.1 | 27.2 | 31.6 | 31.2 | 22.6 | 35.9 | 35.2 |
| LLaVA-OneVision-SFT-72B | 25.9 | 24.2 | 31.3 | 21.1 | 23.1 | 28.9 | 31.8 | 32.8 | 18.7 | 21.5 | 39.5 | 29.4 |
| LLaVA-OneVision-SI-72B | 25.0 | 24.7 | 24.3 | 27.6 | 27.0 | 25.3 | 37.9 | 24.4 | 37.1 | 20.4 | 31.2 | 23.5 |
| LLaVA-OneVision-Chat-7B | 19.1 | 19.6 | 20.4 | 21.4 | 14.6 | 18.8 | 4.5 | 20.4 | 43.7 | 16.7 | 28.9 | 29.4 |
| LLaVA-OneVision-SFT-7B | 18.8 | 18.2 | 20.3 | 22.3 | 17.3 | 20.1 | 9.0 | 15.8 | 43.1 | 15.8 | 27.3 | 23.5 |
| LLaVA-OneVision-SI-7B | 17.2 | 16.1 | 19.5 | 13.2 | 16.0 | 19.5 | 12.6 | 15.0 | 36.5 | 13.2 | 31.3 | 13.6 |
| Qwen2VL-Instruct-7B | 16.5 | 14.2 | 18.6 | 14.8 | 17.0 | 21.9 | 22.7 | 17.2 | 31.2 | 16.1 | 25.1 | 23.5 |
| Mantis-siglip-8B | 15.8 | 17.9 | 17.7 | 17.9 | 14.6 | 20.4 | 22.7 | 12.1 | 18.7 | 10.8 | 32.3 | 17.6 |
| LLaVA-NeXT-Interleave-7B | 14.7 | 14.0 | 15.5 | 15.2 | 17.0 | 18.2 | 18.1 | 16.3 | 6.2 | 14.1 | 24.4 | 23.5 |
| Deepseek-VL-7B | 14.5 | 14.8 | 20.2 | 10.8 | 17.0 | 19.8 | 9.0 | 15.1 | 18.7 | 10.9 | 26.6 | 29.4 |
| Llama-3.2-Vision-Instruct-11B | 14.4 | 15.0 | 15.4 | 16.2 | 23.1 | 15.6 | 18.1 | 11.9 | 31.2 | 14.3 | 25.1 | 17.6 |
| InternVL-Chat-8B | 14.4 | 14.1 | 20.4 | 17.5 | 19.5 | 19.6 | 27.2 | 13.0 | 31.2 | 9.9 | 20.1 | 23.5 |
| InternLM-XComposer2.5-VL-7B | 13.1 | 12.2 | 12.6 | 13.2 | 24.3 | 20.6 | 36.3 | 9.4 | 18.7 | 11.1 | 23.7 | 17.6 |
| VILA-13B | 12.0 | 11.5 | 11.0 | 11.0 | 12.1 | 14.4 | 18.1 | 13.2 | 37.5 | 10.6 | 20.8 | 5.8 |
| LLaVA-v1.5-7B | 10.3 | 9.3 | 11.7 | 11.2 | 9.7 | 12.8 | 13.6 | 10.2 | 0.0 | 7.7 | 23.7 | 11.7 |
| LLaVA-v1.5-13B | 5.0 | 4.8 | 6.8 | 4.1 | 4.8 | 8.7 | 9.0 | 3.5 | 12.5 | 5.1 | 5.0 | 11.7 |
| Math-LLaVA-13B | 3.0 | 1.6 | 6.9 | 4.7 | 4.8 | 2.9 | 0.0 | 3.2 | 18.7 | 6.6 | 2.1 | 5.8 |
Models: The first six rows (Claude-3.5 through Qwen-vl-plus) are closed-source models; the remaining rows are open-source models.
Mathematical Subjects: AG: Analytic Geometry, MG: Metric Geometry, TG: Transformation Geometry, SG: Solid Geometry, CG: Combinatorial Geometry, DG: Descriptive Geometry.
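
As a rough illustration of how such scores relate to per-question grading results, here is a minimal Python sketch that computes subject-level accuracy together with SAR and QCR for multi-step questions. The record fields (`subject`, `question_type`, `correct`, `steps_correct`) are hypothetical placeholders, and the SAR/QCR definitions are inferred from the metric names in the teaser caption rather than taken from the official evaluation code.

```python
# Minimal sketch of benchmark scoring. Field names ("subject", "question_type",
# "correct", "steps_correct") are hypothetical placeholders, and the SAR/QCR
# definitions below are inferred from the metric names, not from the official
# evaluation code.
from collections import defaultdict


def subject_accuracy(results):
    """Accuracy (%) per subject; each record carries a boolean `correct` flag."""
    totals, hits = defaultdict(int), defaultdict(int)
    for r in results:
        totals[r["subject"]] += 1
        hits[r["subject"]] += int(r["correct"])
    return {s: 100.0 * hits[s] / totals[s] for s in totals}


def multi_step_metrics(results):
    """SAR: share of correctly answered steps over all multi-step questions.
    QCR: share of multi-step questions whose steps are all correct."""
    step_hits = step_total = complete = questions = 0
    for r in results:
        if r["question_type"] != "multi-step":
            continue
        flags = r["steps_correct"]  # e.g. [True, False, True]
        questions += 1
        step_total += len(flags)
        step_hits += sum(flags)
        complete += int(all(flags))
    sar = 100.0 * step_hits / step_total if step_total else 0.0
    qcr = 100.0 * complete / questions if questions else 0.0
    return sar, qcr
```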

🚨 For more details, please refer to this link

MV-MATH Dataset

Overview

MV-MATH is a meticulously annotated dataset designed to evaluate the mathematical reasoning capabilities of MLLMs in multi-visual contexts. Each sample consists of multiple images interleaved with text. The dataset comprises 2,009 multi-image questions, with some questions containing up to 8 images, and covers three question types: multiple-choice, free-form, and multi-step.

MV-MATH is organized into 11 subjects over 3 difficulty levels, including Analytic Geometry, Algebra, Metric Geometry, Combinatorics, Transformation Geometry, Logic, Solid Geometry, Arithmetic, Combinatorial Geometry, Descriptive Geometry and Statistics, covering a range of scenarios from the K-12 mathematics curriculum.

Based on image relevance, we categorize MV-MATH into two subsets: a mutually dependent set (MD), in which the images are interrelated and understanding one image requires information from another, and an independent set (ID), in which the images are unrelated and can each be interpreted without reference to the others.



You can download the dataset from Hugging Face Datasets.
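
For programmatic exploration, a minimal loading sketch with the Hugging Face `datasets` library might look like the following. The repository ID and field names are assumptions for illustration; replace them with the identifiers shown on the dataset's Hugging Face page.

```python
# Minimal loading sketch with the Hugging Face `datasets` library. The repo ID
# below is a guess for illustration; use the ID listed on the dataset's
# Hugging Face page. Field names are assumptions about the schema.
from datasets import load_dataset

ds = load_dataset("PeijieWang/MV-MATH")  # hypothetical repo ID

for name, split in ds.items():
    print(f"{name}: {len(split)} questions")

# Peek at one record; adjust field names to the actual schema.
first = next(iter(ds.values()))[0]
print(first.get("question_type"), first.get("subject"))
```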


Key statistics of MV-MATH.


Inner: divided by question type (3 types).
Middle: divided by difficulty level (3 levels).
Outer: divided by subjects (11 subjects).


(a) Comparison with existing mathematical benchmarks; (b) distribution of question length; (c) distribution of the number of images per question.

Experiment Results

Main Results

More Results

Results by grade level.

Data Examples


BibTeX

@misc{wang2025mvmathevaluatingmultimodalmath,
      title={MV-MATH: Evaluating Multimodal Math Reasoning in Multi-Visual Contexts}, 
      author={Peijie Wang and Zhongzhi Li and Fei Yin and Dekang Ran and Chenglin Liu},
      year={2025},
      eprint={2502.20808},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2502.20808}, 
}