SMART-101 Dataset

A Simple Multimodal Algorithmic Reasoning Task!

About

What is the SMART-101 dataset?

Recent times have witnessed an increasing number of applications of deep neural networks towards solving tasks that require superior cognitive abilities, e.g., playing Go, generating art, and ChatGPT. Such dramatic progress raises the question: how generalizable are neural networks in solving problems that demand broad skills? To answer this question, we propose SMART: a Simple Multimodal Algorithmic Reasoning Task and the associated SMART-101 dataset, for evaluating the abstraction, deduction, and generalization abilities of neural networks in solving visuo-linguistic puzzles designed specifically for children in the 6-8 age group. Our dataset consists of 101 unique puzzles; each puzzle comprises a picture and a question, and its solution needs a mix of several elementary skills, including arithmetic, algebra, and spatial reasoning, among others. To scale our dataset towards training deep neural networks, we programmatically generate entirely new instances for each puzzle, while retaining its solution algorithm. To benchmark performance on SMART-101, we propose a vision and language meta-learning model using varied state-of-the-art backbones. Our experiments reveal that while powerful deep models offer reasonable performance on puzzles in a supervised setting, they are no better than random accuracy when analyzed for generalization. We also evaluate the recent ChatGPT and other large language models on a part of SMART-101 and find that while these models show convincing reasoning abilities, the answers are often incorrect.

Puzzles

Example Puzzles from SMART-101 Dataset

We show below several example puzzles from the various skill set categories in the SMART-101 dataset. To see examples from all the 101 puzzles, please see here. Most of the puzzles in SMART-101 have an image and a question. To solve the puzzle, a method must use the content of the image and connect it with the details in the question to derive an algorithm -- usually a simple math algorithm. The method must then select the solution from the five answer candidates to complete the puzzle.
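The task format described above (one image, one question, five answer candidates, one correct choice) can be sketched in Python as follows. This is only an illustration and not the dataset's actual loader: the `PuzzleInstance` class, its field names, and the `accuracy` helper are hypothetical, and the toy instance is made up rather than taken from SMART-101.

```python
from dataclasses import dataclass
from typing import Callable, Sequence, Tuple

OPTION_LABELS = ("A", "B", "C", "D", "E")


@dataclass(frozen=True)
class PuzzleInstance:
    """Hypothetical container for one puzzle instance (field names are assumptions)."""
    image_path: str            # path to the puzzle picture
    question: str              # question text
    options: Tuple[str, ...]   # the five answer candidates, in A..E order
    answer: str                # label of the correct candidate, e.g. "B"

    def __post_init__(self):
        if len(self.options) != len(OPTION_LABELS):
            raise ValueError("a puzzle needs exactly five answer candidates")
        if self.answer not in OPTION_LABELS:
            raise ValueError("answer must be one of A, B, C, D, E")


def accuracy(instances: Sequence[PuzzleInstance],
             predict: Callable[[PuzzleInstance], str]) -> float:
    """Fraction of instances for which `predict` returns the correct label."""
    if not instances:
        raise ValueError("no instances to evaluate")
    correct = sum(predict(inst) == inst.answer for inst in instances)
    return correct / len(instances)


# Toy instance, invented for illustration only.
toy = PuzzleInstance(
    image_path="toy.png",
    question="How many dots are in the picture?",
    options=("4", "5", "6", "7", "8"),
    answer="B",
)

# A trivial baseline that always picks the first candidate.
always_a = lambda inst: "A"
```

Any method, from a fixed-choice baseline like `always_a` to a vision and language model, can be scored the same way by passing it as `predict`.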

Path Tracing

Question: Which object is linked to the hat?
A: Flower   B: Disk   C: Book   D: Drink    E: Ball  

Algebra

Question: The correct additions in the squares were performed according to the pattern shown in the table. What number is covered by the question mark?
A: 18   B: 21   C: 17   D: 20   E: 14  

Counting

Question: All the flowers outside the triangle and outside the rectangle simultaneously are picked up. The number of flowers which are picked up is:
A: 10   B: 13   C: 14   D: 7   E: 11  

Spatial Reasoning

Question: A community with 8 huts has 4 straight roads and 4 circular roads. The drawing shows 7 of the huts. On every straight road there are 2 huts. On every circular road, there are also 2 huts. Where on the drawing should the 8th hut be added?
A: A   B: B   C: C   D: D   E: E  

Pattern Finding

Question: Carl had some 5-ray slices as depicted in the picture. He glued them together as depicted in the picture on the right. At minimum, how many slices did he use?
A: 8   B: 1   C: 4   D: 7   E: 6  

Path Tracing

Question: As shown in the image, Minna can only jump from one circle to a neighboring circle connected by a line. She cannot jump into any circle more than once. She starts at circle 1 and needs to make exactly 3 jumps to reach circle 3. In how many different ways can Minna do this?
A: 2   B: 0   C: 5   D: 1   E: 4  
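Once the graph has been read off the image, a puzzle of this kind reduces to counting paths of a fixed length that never revisit a circle. The sketch below shows one way to do that with a depth-first search. The actual graph is only given in the puzzle image, so `EXAMPLE_GRAPH` is a hypothetical stand-in (a square 1-2-3-4 with one diagonal), and the function name `count_paths` is an assumption as well. The start circle is treated as already visited.

```python
def count_paths(adjacency, start, goal, jumps):
    """Count paths from `start` to `goal` that use exactly `jumps` edges
    and never enter any circle more than once."""
    def dfs(node, remaining, visited):
        if remaining == 0:
            return 1 if node == goal else 0
        total = 0
        for nxt in adjacency.get(node, ()):
            if nxt not in visited:
                total += dfs(nxt, remaining - 1, visited | {nxt})
        return total

    return dfs(start, jumps, frozenset({start}))


# Hypothetical graph, NOT the one from the puzzle image:
# a square 1-2-3-4 with the extra diagonal 2-4.
EXAMPLE_GRAPH = {
    1: [2, 4],
    2: [1, 3, 4],
    3: [2, 4],
    4: [1, 2, 3],
}

ways = count_paths(EXAMPLE_GRAPH, start=1, goal=3, jumps=3)
```

On this stand-in graph the two valid routes are 1-2-4-3 and 1-4-2-3; the answer for the real puzzle depends on the graph in its image.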

Arithmetic

Question: A bird jumps on a fence from the post on one end to the other end. He needs 1 second for each jump. He makes 9 jumps ahead and then 7 jumps back. Then he again makes 9 jumps ahead and 7 jumps back, and so on. In how many seconds can the bird get from one end to the other end?
A: 74   B: 71   C: 75   D: 73   E: 72  
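A puzzle like this one has a simple solution algorithm that stays the same when its numbers change, which is what makes generating new instances possible. The sketch below simulates the bird jump by jump. The number of jumps needed to cross the fence is only visible in the puzzle image, so `total_jumps` is a hypothetical parameter here, and the function name `seconds_to_cross` is an assumption rather than part of the dataset's generator.

```python
def seconds_to_cross(total_jumps, ahead=9, back=7):
    """Simulate the bird: `ahead` jumps forward, then `back` jumps backward,
    repeated, with one second per jump. Returns the time at which the bird
    first reaches the far end, `total_jumps` posts away from the start."""
    if total_jumps <= 0:
        raise ValueError("total_jumps must be positive")
    if ahead <= back and total_jumps > ahead:
        raise ValueError("the bird never reaches the other end")

    position = 0
    seconds = 0
    while True:
        for _ in range(ahead):
            position += 1
            seconds += 1
            if position >= total_jumps:
                return seconds
        for _ in range(back):
            position -= 1
            seconds += 1


# Hypothetical fence lengths; the real value must be read from the image.
for distance in (9, 16, 17):
    print(distance, seconds_to_cross(distance))
```

Changing `total_jumps`, `ahead`, or `back` yields a new instance of the same puzzle whose answer the same routine computes.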

Logic

Question: Gina encrypts words applying the grid presented. For instance, the word UJEV is encrypted as IO IU VU EG. What word did Gina encrypt as EO IU VG IG?
A: RLYE   B: CJBL    C: GJLF   D: IXEL   E: TLRH   

Measurement

Question: Ariel had a few planks with a height of 2 units and a length of 4 units. Making use of the planks, he created the decoration depicted. How wide is the decoration?
A: 44   B: 24   C: 32   D: 28   E: 20  

Dataset Statistics


Skill Categories


Compositional Skill Categories


Vision and Language Puzzle Splits


Text on Puzzle Images

Baseline Performances

Puzzle split Performances

Second grader performance: 77.1%
Random Answer Selection: 21.6%
Supervised Learning Using ResNet50 + GPT2: 49.6%
Answer Generalization Using ResNet50 + BERT: 23.4%
Zero-shot Generalization Using ViT-16 + BERT: 21.6%
Zero-shot Generalization Using CLIP: 24.1%
Zero-shot Generalization Using ResNet50 + BERT: 18.9%
Few-shot Generalization Using ResNet50 + BERT: 25.3%

Text-only subset Performances

Second grader performance: 60.4%
ChatGPT 3.5: 36.4%
Bing GPT-4 Creative: 26.4%
Bard: 12.7%

Team

Anoop Cherian

MERL

Kuan-Chuan Peng

MERL

Suhas Lohit

MERL

Kevin Smith

MIT

Josh Tenenbaum

MIT

License

The SMART-101 dataset is released under `CC-BY-SA-4.0`.
Created by Mitsubishi Electric Research Laboratories (MERL), 2022-2023
SPDX-License-Identifier: CC-BY-SA-4.0

Citation and Contact

If you use this dataset, please cite the following CVPR 2023 paper:
@article{cherian2022deep,
  title={Are Deep Neural Networks SMARTer than Second Graders?},
  author={Cherian, Anoop and Peng, Kuan-Chuan and Lohit, Suhas and Smith, Kevin and Tenenbaum, Joshua B},
  journal={arXiv preprint arXiv:2212.09993},
  year={2022}
}

For questions or issues, contact:
Anoop Cherian (cherian at merl.com), Kuan-Chuan Peng (kpeng at merl.com), Suhas Lohit (slohit at merl.com)


Acknowledgements: We thank Joanna Matthiesen (CEO of Math Kangaroo USA) for sharing with us the human performance statistics and permission to use the puzzle images from the Math Kangaroo USA Olympiad.