GeoEval: Benchmark for Evaluating LLMs and Multi-Modal Models on Geometry Problem-Solving

1Chinese Academy of Sciences, 2University of Strathclyde
ACL 2024
geometric reasoning

Model performance varies across geometric subjects, revealing distinct strengths. WizardMath-7B significantly outperforms the other models on flat geometry problems, such as those involving lengths and lines. Conversely, on solid geometry problems involving cuboids and spheres, GPT-4V surpasses WizardMath-7B, indicating its stronger capability on solid geometry questions.

Introduction

This project combines the construction of comprehensive geometry problem-solving datasets with an evaluation of current large language models (LLMs) on them. The aim is to advance the field of automated geometry problem solving.

The GeoEval benchmark is specifically designed to assess models' ability to solve geometry math problems. It features five characteristics: Comprehensive Variety, Varied Problems, Dual Inputs, Diverse Challenges, and Complexity Ratings.

For context, we offer a comparative analysis of GeoEval against earlier datasets.

GeoEval Benchmark Features

  • Comprehensive Variety: The benchmark covers a wide range of geometric topics, providing a comprehensive test for models.
  • Varied Problems: Problems in the benchmark are varied, testing the model's ability to handle different types of geometric problems.
  • Dual Inputs: The benchmark includes both text and diagram inputs, testing the model's ability to process and integrate information from different sources (a sketch of a possible record format follows this list).
  • Diverse Challenges: The benchmark poses diverse challenges, testing the model's ability to handle complex and varied geometric problems.
  • Complexity Ratings: The benchmark includes problems of different complexity levels, allowing for a nuanced assessment of the model's capabilities.
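
To make the Dual Inputs and Complexity Ratings features concrete, here is a minimal sketch of how a single GeoEval problem record could be represented. The field names (`problem_id`, `diagram_path`, `complexity`, and so on) and the example values are illustrative assumptions, not the benchmark's official schema.

```python
# Illustrative sketch of a possible GeoEval problem record.
# Field names and values are assumptions for illustration, not the official schema.
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class GeoEvalProblem:
    problem_id: str               # unique identifier within a subset
    subset: str                   # "GeoEval-2000", "GeoEval-backward", "GeoEval-aug", or "GeoEval-hard"
    problem_text: str             # natural-language problem statement
    diagram_path: Optional[str]   # path to the diagram image, or None for text-only problems
    choices: List[str]            # answer options, if the problem is multiple choice
    answer: str                   # gold answer
    complexity: int               # complexity rating used for nuanced assessment

# Example instance (values are made up for illustration):
example = GeoEvalProblem(
    problem_id="2000-0001",
    subset="GeoEval-2000",
    problem_text="In triangle ABC, AB = 5, BC = 12, and the angle at B is a right angle. Find AC.",
    diagram_path=None,
    choices=["11", "13", "15", "17"],
    answer="13",
    complexity=1,
)
```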

Leaderboard on GeoEval

Accuracy scores of models on our GeoEval benchmark

| Model | GeoEval-2000 (A/T %) | GeoEval-backward (A %) | GeoEval-aug (A %) | GeoEval-hard (A %) |
|---|---|---|---|---|
| CodeGen2-16B ◇ | 28.76 / 22.06 | 5.10 | 8.50 | 5.66 |
| GPT-3.5 ◇ | 24.71 / 21.27 | 22.66 | 41.25 | 22.33 |
| GPT-4 ◇ | 27.95 / 43.86 | 26.00 | 45.75 | 10.10 |
| WizardMath-70B ◇ | 55.67 / 34.20 | 28.66 | 37.75 | 6.00 |
| WizardMath-7B-V1.1 ◇ | 54.78 / 32.76 | 32.66 | 47.75 | 6.00 |
| Llava-7B-V1.5 | 12.80 / 21.01 | 11.33 | 20.25 | 20.30 |
| Qwen-VL | 25.60 / 25.97 | 5.66 | 22.25 | 21.66 |
| mPLUG-Owl2 | 37.76 / n/a | 35.33 | 38.00 | 22.66 |
| InstructBLIP † | 52.18 / n/a | 15.66 | 35.00 | 70.30 |
| GPT-4V | 37.22 / 43.86 ‡ | 26.00 | 45.75 | 10.10 |
◇: LLMs (unmarked models are multi-modal); A: overall accuracy across all problems
T: accuracy on problems containing only text, without diagrams; n/a: score unavailable because the model cannot process text-only inputs
†: the high accuracy rates reported by InstructBLIP are regarded with doubt; ‡: accuracy figures marked ‡ for GPT-4V are derived from GPT-4
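
The A and T columns are plain accuracy ratios; the short sketch below illustrates the distinction under an assumed record format (hypothetical `problem_id`, `answer`, and `diagram_path` fields, plus a `predictions` dict keyed by problem id). It is not the benchmark's official evaluation script.

```python
# Minimal sketch of the A (overall) and T (text-only) accuracy scores.
# Record and prediction formats are assumptions, not the official evaluation code.

def overall_accuracy(records, predictions):
    """A: percentage of all problems answered correctly."""
    correct = sum(1 for r in records if predictions[r["problem_id"]] == r["answer"])
    return 100.0 * correct / len(records)

def text_only_accuracy(records, predictions):
    """T: accuracy restricted to problems that have no diagram."""
    text_only = [r for r in records if r.get("diagram_path") is None]
    if not text_only:
        return None  # no text-only problems to score
    correct = sum(1 for r in text_only if predictions[r["problem_id"]] == r["answer"])
    return 100.0 * correct / len(text_only)
```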

🚨 For more details, please refer to this link

GeoEval Dataset

Overview

The GeoEval benchmark is structured into four subsets: GeoEval-2000, comprising 2,000 problems; GeoEval-backward, with 750 problems; GeoEval-aug, containing 2,000 problems; and GeoEval-hard, including 300 problems.

A total of 24,912 problems were collected from Geometry3K, PGPS9K, UniGeo, GeoQA+, GeometryQA, and geometry problems from the MATH and MathQA datasets. From these, 2,000 geometry math problems were selected to create the GeoEval-2000 subset.

GeoEval-backward problems reverse the original (forward) problems: the forward answer is given in the statement, and the model must recover a number that has been hidden from the question. From the GeoEval-2000 subset, 750 problems were selected and rewritten as backward questions.
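
As a rough illustration of the backward construction (the actual rewriting pipeline may differ), the forward answer is placed into the problem statement while one of the given numbers is hidden, and that hidden number becomes the new answer:

```python
# Hedged sketch of turning a forward problem into a backward one.
# The real GeoEval-backward construction may differ; this only illustrates the idea.

forward = {
    "question": "In a right triangle, the legs are 5 and 12. Find the hypotenuse.",
    "answer": "13",
}

# Reveal the forward answer (13) and hide one of the given numbers (12) instead.
backward = {
    "question": "In a right triangle, one leg is 5 and the hypotenuse is 13. "
                "Find the hidden number X, the length of the other leg.",
    "answer": "12",
}
```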

To evaluate model robustness and to guard against data leakage from pre-training, an in-context learning strategy is used to rephrase problems from the GeoEval-2000 subset, producing GeoEval-aug. Each problem is rephrased into five variants by GPT-3.5.
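
A minimal sketch of such a rephrasing step is shown below, using the OpenAI Python client for GPT-3.5; the prompt wording and generation settings are illustrative assumptions, not the exact ones used to build GeoEval-aug.

```python
# Hedged sketch of rephrasing one problem into five variants with GPT-3.5.
# The prompt and settings are assumptions; the paper's exact augmentation prompt may differ.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def rephrase_problem(problem_text: str, n_variants: int = 5) -> list:
    variants = []
    for _ in range(n_variants):
        response = client.chat.completions.create(
            model="gpt-3.5-turbo",
            messages=[
                {"role": "system",
                 "content": "Rephrase the geometry problem without changing its meaning, given values, or answer."},
                {"role": "user", "content": problem_text},
            ],
            temperature=0.9,  # encourage distinct rewordings across variants
        )
        variants.append(response.choices[0].message.content.strip())
    return variants
```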

The GeoEval-hard subset includes 300 solid and analytic geometry problems. These were selected from an initial pool of approximately 10,000 problems through web scraping, followed by a refined manual review process.

Examples

Examples from different perspectives in GeoEval



Experiment Results

Results on Existing Foundation Models

BibTeX

@article{zhang2024geoeval,
  title={GeoEval: Benchmark for Evaluating LLMs and Multi-Modal Models on Geometry Problem-Solving},
  author={Zhang, Jiaxin and Li, Zhongzhi and Zhang, Mingliang and Yin, Fei and Liu, Chenglin and Moshfeghi, Yashar},
  journal={arXiv preprint arXiv:2402.10104},
  year={2024}
}