For machine agents to interact successfully with humans in real-world settings, they will need to develop an understanding of human mental life. Intuitive psychology, the ability to reason about the hidden mental variables that drive observable actions, comes naturally to people: even pre-verbal infants can tell agents from objects, expecting agents to act efficiently to achieve goals given constraints. Despite recent interest in machine agents that reason about other agents, it remains unclear whether such agents learn or hold the core psychological principles that drive human reasoning. Inspired by cognitive development studies on intuitive psychology, we present AGENT (Action, Goal, Efficiency, coNstraint, uTility), a benchmark consisting of a large dataset of procedurally generated 3D animations. It is structured around four scenarios (goal preferences, action efficiency, unobserved constraints, and cost-reward trade-offs) that probe key concepts of core intuitive psychology. We validate AGENT with human ratings, propose an evaluation protocol emphasizing generalization, and compare two strong baselines built on Bayesian inverse planning and a Theory of Mind neural network. Our results suggest that to pass the designed tests of core intuitive psychology at human levels, a model must acquire or have built-in representations of how agents plan, combining utility computations and core knowledge of objects and physics.
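To make the idea of "utility computations" in inverse planning concrete, here is a minimal toy sketch in Python. It is not the paper's baseline: the softmax (Boltzmann-rational) choice rule, the function names, the `beta` parameter, and all reward and cost values are assumptions chosen purely for illustration. The sketch treats utility as reward minus cost and uses Bayes' rule with a uniform prior to infer which goal an agent prefers from a single observed choice.

```python
import math


def action_likelihood(utilities, chosen, beta=1.0):
    """Softmax probability that an approximately rational agent picks `chosen`.

    `utilities` maps each goal to its utility; `beta` (hypothetical) controls
    how deterministically the agent maximizes utility.
    """
    top = max(utilities.values())  # subtract the max for numerical stability
    weights = {g: math.exp(beta * (u - top)) for g, u in utilities.items()}
    return weights[chosen] / sum(weights.values())


def posterior_over_preferences(hypotheses, costs, chosen, beta=1.0):
    """Posterior over preference hypotheses given one observed choice.

    `hypotheses` maps a hypothesis name to a {goal: reward} dict.
    A uniform prior over hypotheses is assumed.
    """
    unnormalized = {}
    for name, rewards in hypotheses.items():
        # Utility of reaching a goal = reward for the goal - cost of getting there.
        utilities = {g: rewards[g] - costs[g] for g in rewards}
        unnormalized[name] = action_likelihood(utilities, chosen, beta)
    total = sum(unnormalized.values())
    return {name: p / total for name, p in unnormalized.items()}


# Illustrative values only: goal A is costlier to reach than goal B.
hypotheses = {
    "prefers_A": {"A": 2.0, "B": 1.0},
    "prefers_B": {"A": 1.0, "B": 2.0},
}
costs = {"A": 1.5, "B": 0.5}

# The agent is observed pursuing the costlier goal A.
posterior = posterior_over_preferences(hypotheses, costs, chosen="A")
print(posterior)
```

Under these made-up numbers, choosing the costlier goal shifts belief toward the hypothesis that the agent prefers A, which is the kind of cost-reward reasoning the benchmark's scenarios are designed to probe.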

Paper and Demo


Tianmin Shu, Abhishek Bhandwaldar, Chuang Gan, Kevin A. Smith, Shari Liu, Dan Gutfreund, Elizabeth Spelke, Joshua B. Tenenbaum, and Tomer D. Ullman. AGENT: A Benchmark for Core Psychological Reasoning. 38th International Conference on Machine Learning (ICML), 2021. [Paper] [Supp] [Code (Data Generation)]

Demo: Example Trials

You can view example trials of different scenarios through the following links:

Scenario 1: Goal Preferences

Scenario 2: Action Efficiency

Scenario 3: Unobserved Constraints

Scenario 4: Cost-Reward Trade-offs


You can now submit your results for the official benchmark evaluation on EvalAI.

Full version (RGB-D frames + segmentation maps + GT states)

Train/Val Set (366.3 GB)

Test Set (518.3 GB). Videos in the test set have a higher resolution, so they can also be used for human experiments.

Small version (with GT states only)

Train/Val Set (547 MB)

Test Set (166 MB)


The videos are synthesized using the ThreeDWorld (TDW) simulation engine.


Code for data generation


Any questions? Please contact Tianmin Shu (tshu [at] mit.edu)