ToMBench: Benchmarking Theory of Mind in Large Language Models
Theory of Mind (ToM) is the cognitive capability to perceive and ascribe
mental states to oneself and others. Recent research has sparked a debate over
whether large language models (LLMs) exhibit a form of ToM. However, existing
ToM evaluations are hindered by challenges such as constrained scope,
subjective judgment, and unintended contamination, yielding inadequate
assessments. To address this gap, we introduce ToMBench with three key
characteristics: a systematic evaluation framework encompassing 8 tasks and 31
abilities in social cognition, a multiple-choice question format to support
automated and unbiased evaluation, and a build-from-scratch bilingual inventory
to strictly avoid data leakage. Based on ToMBench, we conduct extensive
experiments to evaluate the ToM performance of 10 popular LLMs across tasks and
abilities. We find that even the most advanced LLMs like GPT-4 lag behind human
performance by over 10% points, indicating that LLMs have not achieved a
human-level theory of mind yet. Our aim with ToMBench is to enable an efficient
and effective evaluation of LLMs' ToM capabilities, thereby facilitating the
development of LLMs with inherent social intelligence.
Comment: Under review
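
The abstract notes that the multiple-choice format supports automated and unbiased evaluation. As a minimal sketch (not the authors' code), the snippet below shows how such a benchmark could be scored: compare each model's selected option letter against the gold answer and report per-task accuracy. The field names and example task labels are illustrative assumptions, not taken from ToMBench's actual data schema.

```python
from collections import defaultdict

def score_mcq(records):
    """Compute per-task accuracy for multiple-choice records.

    records: iterable of dicts with hypothetical keys
    'task', 'model_choice', and 'gold_choice' (e.g. 'A'-'D').
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for r in records:
        total[r["task"]] += 1
        # Exact match on the chosen option letter, ignoring case/whitespace.
        if r["model_choice"].strip().upper() == r["gold_choice"].strip().upper():
            correct[r["task"]] += 1
    return {task: correct[task] / total[task] for task in total}

# Example usage with made-up records (task names are illustrative):
records = [
    {"task": "False Belief", "model_choice": "B", "gold_choice": "B"},
    {"task": "False Belief", "model_choice": "A", "gold_choice": "C"},
    {"task": "Faux Pas",     "model_choice": "D", "gold_choice": "D"},
]
print(score_mcq(records))  # {'False Belief': 0.5, 'Faux Pas': 1.0}
```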