TY - GEN
T1 - MovieGraph-ToM
T2 - 40th AAAI Conference on Artificial Intelligence, AAAI 2026
AU - Wei, Tingjiang
AU - Ni, Qin
AU - Gao, Rong
AU - Wang, Yingying
AU - He, Liang
N1 - Publisher Copyright:
© 2026, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.
PY - 2026
Y1 - 2026
N2 - The capacity for social reasoning, particularly Theory of Mind (ToM), is a foundational prerequisite for aligning Large Language Models (LLMs) with human values. However, current evaluations are predominantly confined to simplistic, short-text scenarios, obscuring the models’ true capabilities and potential failure modes in complex, long-range social dynamics. To address this deficit, we introduce MovieGraph-ToM, a large-scale benchmark for evaluating long-range ToM and social cognition within extended, multimodal narratives. We employ a “scaffold-and-probe” methodology: we construct a ground-truth Social-Causal Graph offline, which maps the narrative’s latent mental states and causal chains. During evaluation, the model is denied access to this graph and must reason directly from raw multimodal inputs. This decoupling forces genuine inference over superficial pattern matching. Reasoning is probed via a hierarchical questioning framework designed to differentiate spontaneous understanding from logical robustness. Our empirical results reveal systematic vulnerabilities in even state-of-the-art models. We identify a critical multiple-choice pitfall, where accuracy plummets against well-crafted distractors, and a stark “generative-discriminative divide,” where models fail to construct coherent explanations for answers they correctly identify. These findings highlight a latent risk, as models that feign comprehension could lead to unpredictable and misaligned behaviors. MovieGraph-ToM thus offers a rigorous platform for assessing and advancing the robust social intelligence required for safely aligned AI systems.
UR - https://www.scopus.com/pages/publications/105034839899
U2 - 10.1609/aaai.v40i40.40674
DO - 10.1609/aaai.v40i40.40674
M3 - Conference contribution
AN - SCOPUS:105034839899
SN - 9781577359067
T3 - Proceedings of the AAAI Conference on Artificial Intelligence
SP - 33827
EP - 33835
BT - Proceedings of the AAAI Conference on Artificial Intelligence
A2 - Koenig, Sven
A2 - Jenkins, Chad
A2 - Taylor, Matthew E.
PB - Association for the Advancement of Artificial Intelligence
Y2 - 20 January 2026 through 27 January 2026
ER -