MV-MATH: Evaluating Multimodal Math Reasoning in Multi-Visual Contexts

¹MAIS, Institute of Automation, Chinese Academy of Sciences  ²School of Artificial Intelligence, University of Chinese Academy of Sciences
CVPR 2025

Performance comparison of six Multimodal Large Language Models (MLLMs) on our proposed MV-MATH dataset across 11 subjects (left) and 3 question types (right). SAR: Step Accuracy Rate; QCR: Question Completeness Rate.

Introduction

Multimodal Large Language Models (MLLMs) have shown promising capabilities in mathematical reasoning within visual contexts across various datasets. However, most existing multimodal math benchmarks are limited to single-visual contexts, which diverges from the multi-visual scenarios commonly encountered in real-world mathematical applications. To address this gap, we introduce MV-MATH: a meticulously curated dataset of 2,009 high-quality mathematical problems. Each problem integrates multiple images interleaved with text, derived from authentic K-12 scenarios, and enriched with detailed annotations. MV-MATH includes multiple-choice, free-form, and multi-step questions, covering 11 subject areas across 3 difficulty levels, and serves as a comprehensive and rigorous benchmark for assessing MLLMs’ mathematical reasoning in multi-visual contexts. Through extensive experimentation, we observe that MLLMs encounter substantial challenges in multi-visual math tasks, with a considerable performance gap relative to human capabilities on MV-MATH. Furthermore, we analyze the performance and error patterns of various models, providing insights into MLLMs' mathematical reasoning capabilities within multi-visual settings.

Leaderboard on MV-MATH

Accuracy scores of models on our MV-MATH benchmark

| Model | Overall | AG | Algebra | MG | Combinatorics | TG | Logic | SG | Arithmetic | CG | DG | Statistics |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Claude-3.5🥇 | 33.9 | 32.7 | 38.1 | 34.3 | 46.7 | 33.3 | 29.8 | 36.3 | 54.2 | 27.0 | 38.2 | 41.1 |
| GPT-4o🥈 | 32.1 | 28.7 | 36.7 | 34.4 | 39.4 | 30.6 | 29.8 | 38.2 | 41.7 | 20.8 | 44.3 | 47.0 |
| Gemini-1.5-Pro🥉 | 29.1 | 29.9 | 32.9 | 28.3 | 28.0 | 30.5 | 40.5 | 33.9 | 42.7 | 21.7 | 30.6 | 35.2 |
| Qwen-vl-max | 26.9 | 27.6 | 32.1 | 24.7 | 36.5 | 29.6 | 31.8 | 30.9 | 37.5 | 23.7 | 32.3 | 23.5 |
| GPT-4V | 24.5 | 18.7 | 31.6 | 32.4 | 25.6 | 26.3 | 36.3 | 26.8 | 43.7 | 19.3 | 33.8 | 35.2 |
| Qwen-vl-plus | 19.7 | 17.9 | 24.1 | 22.0 | 16.0 | 19.9 | 24.8 | 15.9 | 15.2 | 18.7 | 31.4 | 29.4 |
| QVQ-72B-Preview | 29.3 | - | - | - | - | - | - | - | - | - | - | - |
| LLaVA-OneVision-Chat-72B | 26.2 | 25.1 | 32.4 | 23.9 | 35.3 | 28.1 | 27.2 | 31.6 | 31.2 | 22.6 | 35.9 | 35.2 |
| LLaVA-OneVision-SFT-72B | 25.9 | 24.2 | 31.3 | 21.1 | 23.1 | 28.9 | 31.8 | 32.8 | 18.7 | 21.5 | 39.5 | 29.4 |
| LLaVA-OneVision-SI-72B | 25.0 | 24.7 | 24.3 | 27.6 | 27.0 | 25.3 | 37.9 | 24.4 | 37.1 | 20.4 | 31.2 | 23.5 |
| LLaVA-OneVision-Chat-7B | 19.1 | 19.6 | 20.4 | 21.4 | 14.6 | 18.8 | 4.5 | 20.4 | 43.7 | 16.7 | 28.9 | 29.4 |
| LLaVA-OneVision-SFT-7B | 18.8 | 18.2 | 20.3 | 22.3 | 17.3 | 20.1 | 9.0 | 15.8 | 43.1 | 15.8 | 27.3 | 23.5 |
| LLaVA-OneVision-SI-7B | 17.2 | 16.1 | 19.5 | 13.2 | 16.0 | 19.5 | 12.6 | 15.0 | 36.5 | 13.2 | 31.3 | 13.6 |
| Qwen2VL-Instruct-7B | 16.5 | 14.2 | 18.6 | 14.8 | 17.0 | 21.9 | 22.7 | 17.2 | 31.2 | 16.1 | 25.1 | 23.5 |
| Mantis-siglip-8B | 15.8 | 17.9 | 17.7 | 17.9 | 14.6 | 20.4 | 22.7 | 12.1 | 18.7 | 10.8 | 32.3 | 17.6 |
| LLaVA-NeXT-Interleave-7B | 14.7 | 14.0 | 15.5 | 15.2 | 17.0 | 18.2 | 18.1 | 16.3 | 6.2 | 14.1 | 24.4 | 23.5 |
| Deepseek-VL-7B | 14.5 | 14.8 | 20.2 | 10.8 | 17.0 | 19.8 | 9.0 | 15.1 | 18.7 | 10.9 | 26.6 | 29.4 |
| Llama-3.2-Vision-Instruct-11B | 14.4 | 15.0 | 15.4 | 16.2 | 23.1 | 15.6 | 18.1 | 11.9 | 31.2 | 14.3 | 25.1 | 17.6 |
| InternVL-Chat-8B | 14.4 | 14.1 | 20.4 | 17.5 | 19.5 | 19.6 | 27.2 | 13.0 | 31.2 | 9.9 | 20.1 | 23.5 |
| InternLM-XComposer2.5-VL-7B | 13.1 | 12.2 | 12.6 | 13.2 | 24.3 | 20.6 | 36.3 | 9.4 | 18.7 | 11.1 | 23.7 | 17.6 |
| VILA-13B | 12.0 | 11.5 | 11.0 | 11.0 | 12.1 | 14.4 | 18.1 | 13.2 | 37.5 | 10.6 | 20.8 | 5.8 |
| LLaVA-v1.5-7B | 10.3 | 9.3 | 11.7 | 11.2 | 9.7 | 12.8 | 13.6 | 10.2 | 0.0 | 7.7 | 23.7 | 11.7 |
| LLaVA-v1.5-13B | 5.0 | 4.8 | 6.8 | 4.1 | 4.8 | 8.7 | 9.0 | 3.5 | 12.5 | 5.1 | 5.0 | 11.7 |
| Math-LLaVA-13B | 3.0 | 1.6 | 6.9 | 4.7 | 4.8 | 2.9 | 0.0 | 3.2 | 18.7 | 6.6 | 2.1 | 5.8 |
Models: The first six rows (Claude-3.5 through Qwen-vl-plus) are closed-source models; the remaining rows are open-source models.
Mathematical Subjects: AG: Analytic Geometry, MG: Metric Geometry, TG: Transformation Geometry, SG: Solid Geometry, CG: Combinatorial Geometry, DG: Descriptive Geometry.
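
As a rough illustration of how such scores relate to per-question grading results, here is a minimal Python sketch that computes subject-level accuracy together with SAR and QCR for multi-step questions. The record fields (`subject`, `question_type`, `correct`, `steps_correct`) are hypothetical placeholders, and the SAR/QCR definitions are inferred from the metric names in the teaser caption rather than taken from the official evaluation code.

```python
# Minimal sketch of benchmark scoring. Field names ("subject", "question_type",
# "correct", "steps_correct") are hypothetical placeholders, and the SAR/QCR
# definitions below are inferred from the metric names, not from the official
# evaluation code.
from collections import defaultdict


def subject_accuracy(results):
    """Accuracy (%) per subject; each record carries a boolean `correct` flag."""
    totals, hits = defaultdict(int), defaultdict(int)
    for r in results:
        totals[r["subject"]] += 1
        hits[r["subject"]] += int(r["correct"])
    return {s: 100.0 * hits[s] / totals[s] for s in totals}


def multi_step_metrics(results):
    """SAR: share of correctly answered steps over all multi-step questions.
    QCR: share of multi-step questions whose steps are all correct."""
    step_hits = step_total = complete = questions = 0
    for r in results:
        if r["question_type"] != "multi-step":
            continue
        flags = r["steps_correct"]  # e.g. [True, False, True]
        questions += 1
        step_total += len(flags)
        step_hits += sum(flags)
        complete += int(all(flags))
    sar = 100.0 * step_hits / step_total if step_total else 0.0
    qcr = 100.0 * complete / questions if questions else 0.0
    return sar, qcr
```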

🚨 For more details, please refer to this link

MV-MATH Dataset

Overview

MV-MATH is a meticulously annotated dataset designed to evaluate the mathematical reasoning capabilities of MLLMs in multi-visual contexts. Each sample consists of multiple images interleaved with text. The dataset comprises 2,009 multi-image questions, with some questions containing up to 8 images, and covers three question types: multiple-choice, free-form, and multi-step.

MV-MATH is organized into 11 subjects over 3 difficulty levels, including Analytic Geometry, Algebra, Metric Geometry, Combinatorics, Transformation Geometry, Logic, Solid Geometry, Arithmetic, Combinatorial Geometry, Descriptive Geometry and Statistics, covering a range of scenarios from the K-12 mathematics curriculum.

Based on image relevance, we categorize MV-MATH into two subsets: a mutually dependent set (MD), in which the images are interrelated and understanding one image requires information from another, and an independent set (ID), in which the images are unrelated and can each be interpreted without reference to the others.



You can download the dataset from Hugging Face Datasets.
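
For programmatic exploration, a minimal loading sketch with the Hugging Face `datasets` library might look like the following. The repository ID and field names are assumptions for illustration; replace them with the identifiers shown on the dataset's Hugging Face page.

```python
# Minimal loading sketch with the Hugging Face `datasets` library. The repo ID
# below is a guess for illustration; use the ID listed on the dataset's
# Hugging Face page. Field names are assumptions about the schema.
from datasets import load_dataset

ds = load_dataset("PeijieWang/MV-MATH")  # hypothetical repo ID

for name, split in ds.items():
    print(f"{name}: {len(split)} questions")

# Peek at one record; adjust field names to the actual schema.
first = next(iter(ds.values()))[0]
print(first.get("question_type"), first.get("subject"))
```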


Key statistics of MV-MATH.


Inner: divided by question type (3 types).
Middle: divided by difficulty level (3 levels).
Outer: divided by subjects (11 subjects).


(a) Comparison with existing mathematical benchmarks; (b) distribution of question length; (c) distribution of the number of images per question.

Experiment Results

Main Results

More Results

Results by grade level.

Data Examples


BibTeX

@misc{wang2025mvmathevaluatingmultimodalmath,
      title={MV-MATH: Evaluating Multimodal Math Reasoning in Multi-Visual Contexts}, 
      author={Peijie Wang and Zhongzhi Li and Fei Yin and Dekang Ran and Chenglin Liu},
      year={2025},
      eprint={2502.20808},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2502.20808}, 
}