
    FANToM: A Benchmark for Stress-testing Machine Theory of Mind in Interactions

    Theory of mind (ToM) evaluations currently focus on testing models using passive narratives that inherently lack interactivity. We introduce FANToM, a new benchmark designed to stress-test ToM within information-asymmetric conversational contexts via question answering. Our benchmark draws upon important theoretical requisites from psychology and necessary empirical considerations when evaluating large language models (LLMs). In particular, we formulate multiple types of questions that demand the same underlying reasoning, in order to identify an illusory or false sense of ToM capability in LLMs. We show that FANToM is challenging for state-of-the-art LLMs, which perform significantly worse than humans even with chain-of-thought reasoning or fine-tuning.
    Comment: EMNLP 2023. Code and dataset can be found here: https://hyunw.kim/fanto
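    To make the "multiple question types demanding the same underlying reasoning" idea concrete, here is a minimal sketch of all-or-nothing consistency scoring, in which a model is credited for an item only if every linked question variant is answered correctly; the field names and grouping key are illustrative assumptions, not FANToM's actual schema.

        # Hypothetical sketch: credit a model with coherent ToM on an item only if it
        # answers every linked question variant about that item correctly.
        from collections import defaultdict

        def consistency_score(records):
            """records: iterable of dicts with 'item_id', 'question_type', 'correct'."""
            by_item = defaultdict(list)
            for r in records:
                by_item[r["item_id"]].append(r["correct"])
            # An item counts only when all of its question variants are correct.
            return sum(all(v) for v in by_item.values()) / max(len(by_item), 1)

        preds = [
            {"item_id": "conv1", "question_type": "belief_choice", "correct": True},
            {"item_id": "conv1", "question_type": "fact_check", "correct": False},
        ]
        print(consistency_score(preds))  # 0.0 -- one wrong variant voids the whole item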

    ToMChallenges: A Principle-Guided Dataset and Diverse Evaluation Tasks for Exploring Theory of Mind

    Theory of Mind (ToM), the capacity to comprehend the mental states of distinct individuals, is essential for numerous practical applications. With the development of large language models, there is a heated debate about whether they are able to perform ToM tasks. Previous studies have used different tasks and prompts to test ToM in large language models, and the results are inconsistent: some studies assert that these models are capable of exhibiting ToM, while others suggest the opposite. In this study, we present ToMChallenges, a dataset for comprehensively evaluating Theory of Mind based on the Sally-Anne and Smarties tests. We created 30 variations of each test (e.g., changing the person's name, location, and items). For each variation, we test the model's understanding of different aspects: reality, belief, first-order belief, and second-order belief. We adapt our data for various tasks by creating unique prompts tailored to each task category: Fill-in-the-Blank, Multiple Choice, True/False, Chain-of-Thought True/False, Question Answering, and Text Completion. If a model has robust ToM, it should achieve good performance across different prompts and different tests. We evaluated two GPT-3.5 models, text-davinci-003 and gpt-3.5-turbo-0301, on our datasets. Our results indicate that consistent performance on ToM tasks remains a challenge.
    Comment: work in progress
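    As a rough illustration of how such principle-guided variations might be templated across task formats, here is a small sketch; the story wording, names, and task set are assumptions for illustration, not the actual ToMChallenges prompts.

        # Hypothetical sketch of templating Sally-Anne-style variations across task
        # formats; wording and names are illustrative, not the ToMChallenges prompts.
        from itertools import product

        STORY = ("{a} puts the {item} in the {loc1} and leaves the room. "
                 "While {a} is away, {b} moves the {item} to the {loc2}.")

        def make_prompts(a, b, item, loc1, loc2):
            story = STORY.format(a=a, b=b, item=item, loc1=loc1, loc2=loc2)
            return {
                "fill_in_the_blank": f"{story} {a} will look for the {item} in the ___.",
                "true_false": f"{story} True or False: {a} thinks the {item} is in the {loc2}.",
                "question_answering": f"{story} Where will {a} look for the {item}?",
            }

        names = [("Sally", "Anne"), ("Maria", "John")]
        scenes = [("marble", "basket", "box"), ("chocolate", "drawer", "cupboard")]
        variations = [make_prompts(a, b, i, l1, l2)
                      for (a, b), (i, l1, l2) in product(names, scenes)]
        print(len(variations), variations[0]["question_answering"])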

    How to Evaluate your Question Answering System Every Day and Still Get Real Work Done

    In this paper, we report on Qaviar, an experimental automated evaluation system for question answering applications. The goal of our research was to find an automatically calculated measure that correlates well with human judges' assessment of answer correctness in question answering tasks. Qaviar judges a response by computing recall against the stemmed content words in the human-generated answer key, and counts the answer correct if that recall exceeds a given threshold. We determined that the answer correctness predicted by Qaviar agreed with the human judges 93% to 95% of the time. Forty-one question-answering systems were ranked by both Qaviar and human assessors, and the two rankings correlated with a Kendall's tau of 0.920, compared to a correlation of 0.956 between human assessors on the same data.
    Comment: 6 pages, 3 figures, to appear in Proceedings of the Second International Conference on Language Resources and Evaluation (LREC 2000)
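    The judging procedure described above is simple enough to sketch; the stemmer, stopword list, and 0.5 threshold below are assumptions for illustration, not Qaviar's actual settings.

        # Minimal sketch of recall-based answer judging in the spirit of Qaviar:
        # recall of stemmed content words from the answer key, thresholded.
        import re
        from nltk.stem import PorterStemmer  # pip install nltk

        STOPWORDS = {"the", "a", "an", "of", "in", "on", "is", "was", "to", "and"}
        stem = PorterStemmer().stem

        def content_stems(text):
            return {stem(w) for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS}

        def judge(response, answer_key, threshold=0.5):
            key = content_stems(answer_key)
            if not key:
                return False
            recall = len(key & content_stems(response)) / len(key)
            return recall >= threshold

        print(judge("He was born in Hodgenville, Kentucky.", "Hodgenville, Kentucky"))  # True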

    Designing Effective Questions for Classroom Response System Teaching

    Classroom response systems (CRSs) can be potent tools for teaching physics. Their efficacy, however, depends strongly on the quality of the questions used. Creating effective questions is difficult, and differs from creating exam and homework problems. Every CRS question should have an explicit pedagogic purpose consisting of a content goal, a process goal, and a metacognitive goal. Questions can be engineered to fulfil their purpose through four complementary mechanisms: directing students' attention, stimulating specific cognitive processes, communicating information to the instructor and students via CRS-tabulated answer counts, and facilitating the articulation and confrontation of ideas. We identify several tactics that help in the design of potent questions, and present four "makeovers" showing how these tactics can be used to convert traditional physics questions into more powerful CRS questions.
    Comment: 11 pages, including 6 figures and 2 tables. Submitted (and mostly approved) to the American Journal of Physics. Based on invited talk BL05 at the 2005 Winter Meeting of the American Association of Physics Teachers (Albuquerque, NM).