GeoEval: Benchmark for Evaluating LLMs and Multi-Modal Models on Geometry Problem-Solving

1Chinese Academy of Sciences, 2University of Strathclyde
ACL 2024
geometric reasoning

Model performance varies across geometric subjects, revealing distinct strengths. WizardMath-7B significantly outperforms the other models on flat geometry problems, such as those involving lengths and lines. Conversely, on solid geometry problems involving cuboids and spheres, GPT-4V surpasses WizardMath-7B, indicating its stronger capability on solid geometry questions.

Introduction

This project combines the construction of comprehensive geometry problem-solving datasets with an evaluation of current large language models (LLMs) on them. The aim is to advance the field of automated geometry problem solving.

The GeoEval benchmark is specifically designed to assess models' ability to solve geometry math problems. It features five characteristics: Comprehensive Variety, Varied Problems, Dual Inputs, Diverse Challenges, and Complexity Ratings.

For context, we offer a comparative analysis of GeoEval against earlier datasets.

GeoEval Benchmark Features

  • Comprehensive Variety: The benchmark covers a wide range of geometric topics, providing a comprehensive test for models.
  • Varied Problems: Problems in the benchmark are varied, testing the model's ability to handle different types of geometric problems.
  • Dual Inputs: The benchmark includes both text and diagram inputs, testing the model's ability to process and integrate information from different sources (a sketch of a possible record format follows this list).
  • Diverse Challenges: The benchmark poses diverse challenges, testing the model's ability to handle complex and varied geometric problems.
  • Complexity Ratings: The benchmark includes problems of different complexity levels, allowing for a nuanced assessment of the model's capabilities.
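
To make the Dual Inputs and Complexity Ratings features concrete, here is a minimal sketch of how a single GeoEval problem record could be represented. The field names (`problem_id`, `diagram_path`, `complexity`, and so on) and the example values are illustrative assumptions, not the benchmark's official schema.

```python
# Illustrative sketch of a possible GeoEval problem record.
# Field names and values are assumptions for illustration, not the official schema.
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class GeoEvalProblem:
    problem_id: str               # unique identifier within a subset
    subset: str                   # "GeoEval-2000", "GeoEval-backward", "GeoEval-aug", or "GeoEval-hard"
    problem_text: str             # natural-language problem statement
    diagram_path: Optional[str]   # path to the diagram image, or None for text-only problems
    choices: List[str]            # answer options, if the problem is multiple choice
    answer: str                   # gold answer
    complexity: int               # complexity rating used for nuanced assessment

# Example instance (values are made up for illustration):
example = GeoEvalProblem(
    problem_id="2000-0001",
    subset="GeoEval-2000",
    problem_text="In triangle ABC, AB = 5, BC = 12, and the angle at B is a right angle. Find AC.",
    diagram_path=None,
    choices=["11", "13", "15", "17"],
    answer="13",
    complexity=1,
)
```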

Leaderboard on GeoEval

Accuracy scores of models on our GeoEval benchmark

| Model | GeoEval-2000 (A/T %) | GeoEval-backward (A %) | GeoEval-aug (A %) | GeoEval-hard (A %) |
|---|---|---|---|---|
| CodeGen2-16B ◇ | 28.76 / 22.06 | 5.10 | 8.50 | 5.66 |
| GPT-3.5 ◇ | 24.71 / 21.27 | 22.66 | 41.25 | 22.33 |
| GPT-4 ◇ | 27.95 / 43.86 | 26.00 | 45.75 | 10.10 |
| WizardMath-70B ◇ | 55.67 / 34.20 | 28.66 | 37.75 | 6.00 |
| WizardMath-7B-V1.1 ◇ | 54.78 / 32.76 | 32.66 | 47.75 | 6.00 |
| Llava-7B-V1.5 | 12.80 / 21.01 | 11.33 | 20.25 | 20.30 |
| Qwen-VL | 25.60 / 25.97 | 5.66 | 22.25 | 21.66 |
| mPLUG-Owl2 | 37.76 / n/a | 35.33 | 38.00 | 22.66 |
| InstructBLIP † | 52.18 / n/a | 15.66 | 35.00 | 70.30 |
| GPT-4V | 37.22 / 43.86 ‡ | 26.00 | 45.75 | 10.10 |
◇: LLMs (unmarked models are multi-modal); A: overall accuracy across all problems
T: accuracy on problems containing only text, without diagrams; n/a: score unavailable because the model cannot process text-only inputs
†: the high accuracy rates reported by InstructBLIP are regarded with doubt; ‡: accuracy figures marked ‡ for GPT-4V are derived from GPT-4
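
The A and T columns are plain accuracy ratios; the short sketch below illustrates the distinction under an assumed record format (hypothetical `problem_id`, `answer`, and `diagram_path` fields, plus a `predictions` dict keyed by problem id). It is not the benchmark's official evaluation script.

```python
# Minimal sketch of the A (overall) and T (text-only) accuracy scores.
# Record and prediction formats are assumptions, not the official evaluation code.

def overall_accuracy(records, predictions):
    """A: percentage of all problems answered correctly."""
    correct = sum(1 for r in records if predictions[r["problem_id"]] == r["answer"])
    return 100.0 * correct / len(records)

def text_only_accuracy(records, predictions):
    """T: accuracy restricted to problems that have no diagram."""
    text_only = [r for r in records if r.get("diagram_path") is None]
    if not text_only:
        return None  # no text-only problems to score
    correct = sum(1 for r in text_only if predictions[r["problem_id"]] == r["answer"])
    return 100.0 * correct / len(text_only)
```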

🚨 For more details, please refer to this link

GeoEval Dataset

Overview

The GeoEval benchmark is structured into four subsets: GeoEval-2000, comprising 2,000 problems; GeoEval-backward, with 750 problems; GeoEval-aug, containing 2,000 problems; and GeoEval-hard, including 300 problems.

A total of 24,912 problems were collected from Geometry3K, PGPS9K, UniGeo, GeoQA+, GeometryQA, and geometry problems from the MATH and MathQA datasets. From these, 2,000 geometry math problems were selected to create the GeoEval-2000 subset.

GeoEval-backward problems reverse the original (forward) problems: the forward answer is given in the statement, and the model must recover a number that has been hidden from the question. From the GeoEval-2000 subset, 750 problems were selected and rewritten as backward questions.
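
As a rough illustration of the backward construction (the actual rewriting pipeline may differ), the forward answer is placed into the problem statement while one of the given numbers is hidden, and that hidden number becomes the new answer:

```python
# Hedged sketch of turning a forward problem into a backward one.
# The real GeoEval-backward construction may differ; this only illustrates the idea.

forward = {
    "question": "In a right triangle, the legs are 5 and 12. Find the hypotenuse.",
    "answer": "13",
}

# Reveal the forward answer (13) and hide one of the given numbers (12) instead.
backward = {
    "question": "In a right triangle, one leg is 5 and the hypotenuse is 13. "
                "Find the hidden number X, the length of the other leg.",
    "answer": "12",
}
```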

To evaluate model robustness and to guard against data leakage from pre-training, an in-context learning strategy is used to rephrase problems from the GeoEval-2000 subset, producing GeoEval-aug. Each problem is rephrased into five variants by GPT-3.5.
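
A minimal sketch of such a rephrasing step is shown below, using the OpenAI Python client for GPT-3.5; the prompt wording and generation settings are illustrative assumptions, not the exact ones used to build GeoEval-aug.

```python
# Hedged sketch of rephrasing one problem into five variants with GPT-3.5.
# The prompt and settings are assumptions; the paper's exact augmentation prompt may differ.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def rephrase_problem(problem_text: str, n_variants: int = 5) -> list:
    variants = []
    for _ in range(n_variants):
        response = client.chat.completions.create(
            model="gpt-3.5-turbo",
            messages=[
                {"role": "system",
                 "content": "Rephrase the geometry problem without changing its meaning, given values, or answer."},
                {"role": "user", "content": problem_text},
            ],
            temperature=0.9,  # encourage distinct rewordings across variants
        )
        variants.append(response.choices[0].message.content.strip())
    return variants
```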

The GeoEval-hard subset includes 300 solid and analytic geometry problems. These were selected from an initial pool of approximately 10,000 problems through web scraping, followed by a refined manual review process.

Examples

Examples from different perspectives in GeoEval



Experiment Results

Results on Existing Foundation Models

BibTeX

@article{zhang2024geoeval,
  title={GeoEval: Benchmark for Evaluating LLMs and Multi-Modal Models on Geometry Problem-Solving},
  author={Zhang, Jiaxin and Li, Zhongzhi and Zhang, Mingliang and Yin, Fei and Liu, Chenglin and Moshfeghi, Yashar},
  journal={arXiv preprint arXiv:2402.10104},
  year={2024}
}