Accuracy scores of models on our GeoEval benchmark
Model | GeoEval-2000 (A / T %) | GeoEval-backward (A %) | GeoEval-aug (A %) | GeoEval-hard (A %) |
---|---|---|---|---|
CodeGen2-16B ◇ | 28.76 / 22.06 | 5.10 | 8.50 | 5.66 |
GPT-3.5 ◇ | 24.71 / 21.27 | 22.66 | 41.25 | 22.33 |
GPT-4 ◇ | 27.95 / 43.86 | 26.00 | 45.75 | 10.10 |
WizardMath-70B ◇ | 55.67 / 34.20 | 28.66 | 37.75 | 6.00 |
WizardMath-7B-V1.1 ◇ | 54.78 / 32.76 | 32.66 | 47.75 | 6.00 |
Llava-7B-V1.5 | 12.80 / 21.01 | 11.33 | 20.25 | 20.30 |
Qwen-VL | 25.60 / 25.97 | 5.66 | 22.25 | 21.66 |
mPLUG-Owl2 | 37.76 / n/a | 35.33 | 38.00 | 22.66 |
InstructBLIP † | 52.18 / n/a | 15.66 | 35.00 | 70.30 |
GPT-4V | 37.22 / 43.86 ‡ | 26.00 | 45.75 | 10.10 |
A: overall accuracy; T: accuracy on problems that contain only text (no diagrams); n/a: scores are unavailable because these models cannot process text-only inputs.
†: we have doubts about the unusually high accuracy reported for the InstructBLIP model; ‡: accuracy figures for GPT-4V are derived from GPT-4.
🚨 For more details, please refer to this link