SMART-840 Project Page

Paper

Evaluating Large Vision and Language Models on Children's Mathematical Olympiads

Slides

Slides summarizing the dataset and the paper

Poster (png, pdf)

Poster summarizing the dataset and the paper

About

What is the SMART-840 dataset?

Recent years have seen significant progress in the general-purpose problem-solving abilities of large vision and language models (LVLMs), such as ChatGPT and Gemini; some of these breakthroughs even seem to enable AI models to outperform humans in varied tasks that demand higher-order cognitive skills. Are current large AI models indeed capable of generalized problem solving as humans are? A systematic analysis of AI capabilities for joint vision and text reasoning, however, is missing from the current scientific literature. In this paper, we make an effort towards filling this gap by evaluating state-of-the-art LVLMs on their mathematical and algorithmic reasoning abilities using visuo-linguistic problems from children's Olympiads. Specifically, we consider problems from the Mathematical Kangaroo (MK) Olympiad, a popular international competition for children in grades 1-12 that tests their deeper mathematical abilities using puzzles appropriately gauged to their age and skills. Using the puzzles from MK, we created a dataset, dubbed SMART-840, consisting of 840 problems from the years 2020-2024. With our dataset, we analyze the power of LVLMs on mathematical reasoning; their responses to our puzzles offer a direct way to compare against those of children. Our results show that modern LVLMs do demonstrate increasingly powerful reasoning skills in solving problems for higher grades, but lack the foundations to correctly answer problems designed for younger children. Further analysis shows that there is no significant correlation between the reasoning capabilities of AI models and those of young children, and that the models' capabilities appear to be based on a different type of reasoning than the cumulative knowledge that underlies children's mathematics and logic skills.

LVLM Performance

Note: If you would like to add your method to the table, please send us a link to your paper that reports comparisons to the SMART-840 dataset.

Model	Grades 1 & 2	Grades 3 & 4	Grades 5 & 6	Grades 7 & 8	Grades 9 & 10	Grades 11 & 12	Mean
Human	58.8 & 67.6	62.3 & 70.1	59.1 & 65.4	59.7 & 64.3	64.2 & 69.3	64.9 & 65.6	64.2
GPT-4o	41.6	38.6	35.1	47.1	41.3	50.0	42.4
GPT-4v	39.2	38.3	29.3	35.3	38.7	43.3	37.4
Gemini-Pro 1.5	25.8	27.5	25.3	30.7	39.3	41.3	31.7
Claude-3 Sonnet	51.6	47.9	38.6	44.9	46.7	49.7	49.7
InternLM-XComposer2	22.5	14.2	18.6	24.2	18.1	16.9	19.1
InternVL-Chat	16.7	25.0	17.3	14.6	15.3	16.7	17.6
LlaVa-NEXT (34B)	15.0	9.0	20.1	14.6	18.7	16.0	15.6

Caption: Accuracy (%) of children in the respective grades compared with the accuracy of LVLMs when each model is asked to provide an explanation of its response. We find that LVLMs perform well on higher-grade problems (11-12) and relatively poorly on earlier grades (1-8).

Children's Participation Statistics

Caption: Figure (a) plots the distribution of children participating in MK Olympiads per year over 2020–2024 for grades 1–12. Figure (b) plots the total number of participants per grade during 2020–2024. Figure (c) plots the total number of participants each year over all grades (1–12). Figure (d) shows the number of puzzles in each category and their proportion of the total. Figure (e) shows the statistics of image-text and text-only puzzles. Figure (f) shows the statistics of puzzle difficulty (defined by their attributed weights).

Correlation Between Children and LVLMs

Caption: Pearson’s correlation coefficient between children and LVLMs on various problem difficulty metrics. The table shows the correlation for the difficulty index (i.e., are problems that are hard for children also hard for LVLMs?), the discriminative index (i.e., can problems that discriminate between good and bad learners also discriminate between LVLMs?), Weight-Correlation (i.e., higher-point problems may be expected to be difficult for children; are they for LVLMs too?), Entropy-Correlation between the entropy of the distribution of answers by children and that of LVLMs (i.e., if children are confused, are LVLMs also confused?), and Time-Correlation between the time taken by children to solve a problem (assuming difficult problems take longer) and the accuracy of LVLMs on those problems. The green/red cell color indicates a positive/negative Pearson's coefficient, where darker cells represent larger absolute values. We find that the correlations are weak or often negative. That is, LVLMs do not correlate well with children on which problems they find difficult.
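
To make the metrics in this caption concrete, here is a minimal Python sketch of how a difficulty-index correlation and an answer entropy could be computed. It is an illustration only, not the authors' evaluation code. It assumes that the difficulty index is the fraction of correct responses per problem (the usual classical test theory definition, which this page does not spell out), and the function names and all numbers are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson's correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


def answer_entropy(counts):
    """Shannon entropy (in bits) of an answer distribution given option counts."""
    total = sum(counts)
    if total == 0:
        raise ValueError("at least one response is required")
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


# Hypothetical per-problem accuracies (assumed difficulty index);
# these values are made up and are NOT taken from SMART-840.
child_accuracy = [0.9, 0.7, 0.5, 0.3, 0.1]
lvlm_accuracy = [0.2, 0.6, 0.4, 0.5, 0.3]

# A value near zero or below zero means that problems children find hard
# are not the ones the model finds hard.
r = pearson(child_accuracy, lvlm_accuracy)
print(f"difficulty-index correlation: {r:.2f}")

# Entropy is highest when responses are spread evenly over the options.
print(f"entropy of an even 4-way split: {answer_entropy([1, 1, 1, 1]):.1f} bits")
```

The same `pearson` helper would apply to the other rows of the table by swapping in the corresponding per-problem quantities, for example children's answer entropies against LVLM answer entropies.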

Puzzles

Example Puzzles from the SMART-840 Dataset

Grades 1-2

Question: The kangaroo goes up 3 steps each time the rabbit goes down 2 steps. On which step do they meet?
A: 3 B: 4 C: 5 D: 6 E: 7

Grades 3-4

Question: Which key would it be impossible to cut into three different figures of five shaded squares?
A: A B: B C: C D: D E: E