About
What is the SMART-840 dataset?
Recent years have seen significant progress in the general-purpose problem-solving abilities of large vision and language models (LVLMs), such as ChatGPT, Gemini, etc.; some of these breakthroughs even seem to enable AI models to outperform humans in varied tasks that demand higher-order cognitive skills. Are the current large AI models indeed capable of generalized problem solving as humans are? A systematic analysis of AI capabilities for joint vision and text reasoning, however, is missing in the current scientific literature. In this paper, we make an effort towards filling this gap by evaluating state-of-the-art LVLMs on their mathematical and algorithmic reasoning abilities using visuo-linguistic problems from children's Olympiads. Specifically, we consider problems from the Mathematical Kangaroo (MK) Olympiad, a popular international competition for children in grades 1-12 that tests their deeper mathematical abilities using puzzles appropriately gauged to their age and skills. Using the puzzles from MK, we created a dataset, dubbed SMART-840, consisting of 840 problems from years 2020-2024. With our dataset, we analyze LVLMs' power on mathematical reasoning; their responses on our puzzles offer a direct way to compare them against those of children. Our results show that modern LVLMs do demonstrate increasingly powerful reasoning skills in solving problems for higher grades, but lack the foundations to correctly answer problems designed for younger children. Further analysis shows that there is no significant correlation between the reasoning capabilities of AI models and those of young children, and that the models' capabilities appear to be based on a different type of reasoning than the cumulative knowledge that underlies children's mathematics and logic skills.
LVLM Performance
Note: If you would like to add your method to the table, please send us a link to your paper that reports comparisons to the SMART-840 dataset.
Model | Grades 1 & 2 | Grades 3 & 4 | Grades 5 & 6 | Grades 7 & 8 | Grades 9 & 10 | Grades 11 & 12 | Mean
---|---|---|---|---|---|---|---
Human | 58.8 & 67.6 | 62.3 & 70.1 | 59.1 & 65.4 | 59.7 & 64.3 | 64.2 & 69.3 | 64.9 & 65.6 | 64.2
GPT-4o | 41.6 | 38.6 | 35.1 | 47.1 | 41.3 | 50.0 | 42.4
GPT-4v | 39.2 | 38.3 | 29.3 | 35.3 | 38.7 | 43.3 | 37.4
Gemini-Pro 1.5 | 25.8 | 27.5 | 25.3 | 30.7 | 39.3 | 41.3 | 31.7
Claude-3 Sonnet | 51.6 | 47.9 | 38.6 | 44.9 | 46.7 | 49.7 | 49.7
InternLM-XComposer2 | 22.5 | 14.2 | 18.6 | 24.2 | 18.1 | 16.9 | 19.1
InternVL-Chat | 16.7 | 25.0 | 17.3 | 14.6 | 15.3 | 16.7 | 17.6
LlaVa-NEXT (34B) | 15.0 | 9.0 | 20.1 | 14.6 | 18.7 | 16.0 | 15.6
Caption: Accuracy (%) of children in the respective grades against the accuracy of LVLMs when each model is asked to explain its response. For the Human row, each cell lists the two grades of the pair separately. We find that LVLMs perform well on higher-grade problems (11-12) and relatively poorly on earlier grades (1-8).
Children's Participation Statistics

Caption: Figure (a) plots the distributions of children participating in MK Olympiads per year over 2020–2024 for grades 1–12. Figure (b) plots the total number of participants per grade during 2020–2024. Figure (c) plots the total number of participants each year over all grades (1–12). Figure (d) shows the number of puzzles in each category and their proportion of the dataset. Figure (e) shows the statistics of image-text and text-only puzzles. Figure (f) shows the statistics of puzzle difficulty (defined by their attributed weights).
Correlation Between Children and LVLMs

Caption: Pearson’s correlation coefficient on various problem difficulty metrics between children and LVLMs. The table shows the correlation on: the difficulty index (i.e., are problems that are hard for children also hard for LVLMs?); the discriminative index (i.e., can problems that discriminate between good and bad learners also discriminate between LVLMs?); Weight-Correlation (i.e., higher-point problems may be expected to be difficult for children; are they for LVLMs too?); Entropy-Correlation, between the entropy of the distribution of answers by children and that of LVLMs (i.e., if children are confused, are LVLMs also confused?); and Time-Correlation, between the time taken by children to solve a problem (assuming difficult problems take longer) and the accuracy of LVLMs on those problems. The green/red cell color indicates a positive/negative Pearson's coefficient, where darker cells represent larger absolute values. We find that the correlations are weak or often negative. That is, the problems LVLMs find difficult do not correlate well with the problems children find difficult.
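To make the metrics in the caption concrete, here is a minimal Python sketch of Pearson's correlation coefficient and of the entropy of an answer distribution. It is an illustration only and not the authors' evaluation code: the function names `pearson_r` and `answer_entropy` are hypothetical, and the numbers in the usage lines are made up for the example rather than taken from SMART-840.

```python
import math
from collections import Counter


def pearson_r(xs, ys):
    """Pearson's correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


def answer_entropy(answers):
    """Shannon entropy (in bits) of the distribution of chosen options."""
    counts = Counter(answers)
    total = len(answers)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# Made-up per-problem values, for illustration only (not dataset results).
children_correct_rate = [0.82, 0.64, 0.41, 0.23]  # fraction of children answering correctly
lvlm_correct_rate = [0.30, 0.55, 0.35, 0.50]      # fraction of model runs answering correctly
r = pearson_r(children_correct_rate, lvlm_correct_rate)

# Entropy of the options picked on one problem; higher means more spread-out answers.
h = answer_entropy(["A", "B", "B", "C", "E", "B"])
```

A value of `r` near +1 would mean that problems children find easy are also easy for the model, a value near 0 would mean no linear relationship, and a negative value would mean the two rankings tend to run in opposite directions.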

Grades 1-2
Question: The kangaroo goes up 3 steps each time the rabbit goes down 2 steps. On which step do they meet?
A: 3 B: 4 C: 5 D: 6 E: 7

Grades 3-4
Question: Which key would it be impossible to cut into three different figures of five shaded squares?
A: A B: B C: C D: D E: E

Grades 5-6
Question: The figure shows the plan of the seven train routes of a small town. The circles indicate the stations. Martin wants to paint the lines in such a way that if two lines share a common station, then they are painted with different colors. What is the smallest number of colors that he can use?
A: 3 B: 4 C: 5 D: 6 E: 7

Grades 7-8
Question: A community with 8 huts has 4 straight roads and 4 circular roads. The drawing shows 7 of the huts. On every straight road there are 2 huts. On every circular road, there are also 2 huts. Where on the drawing should the 8th hut be added?
An isosceles triangle ABC, with AB = AC, is split into three smaller isosceles triangles, as shown, so that AD = DB, CE = CD, and BE = EC. (Note that the diagram is not drawn to scale.) What is the size, in degrees, of angle BAC?
A: 24 B: 28 C: 30 D: 35 E: 36

Grades 9-10
Question: The figure shows a semicircle with center O. Two of the angles are given. What is the size, in degrees, of the angle α?
A: 9° B: 11° C: 16° D: 17.5° E: 18°

Grades 11-12
Question: Part of the fifth-degree polynomial shown cannot be seen because of an inkblot. It is known that all five roots of the polynomial are integers. What is the highest power of x - 1 that divides the polynomial?
A: (x-1)^1 B: (x-1)^2 C: (x-1)^3 D: (x-1)^4 E: (x-1)^5
LVLM Category-wise Performances

Caption: Category-wise performance of LVLMs against humans. The SMART-840 problems are divided into 4 categories, namely geometry, algebra, number, and logic. We also show the performance on image-text and text-only problems, as well as a radar plot against human performance.
Citation and Contact
If you use this dataset, please cite the following NeurIPS 2024 paper:
@article{cherian2024evaluating,
title={Evaluating Large Vision-and-Language Models on Children's Mathematical Olympiads},
author={Cherian, Anoop and Peng, Kuan-Chuan and Lohit, Suhas and Matthiesen, Joanna and Smith, Kevin and Tenenbaum, Joshua B},
journal={arXiv preprint arXiv:2406.15736},
year={2024}
}
For questions or issues, contact:
Anoop Cherian (cherian at merl dot com)