18 research outputs found

    Question and Answer Test-Train Overlap in Open-Domain Question Answering Datasets

    Ideally, Open-Domain Question Answering models should exhibit a number of competencies, ranging from simply memorizing questions seen at training time, to answering novel question formulations with answers seen during training, to generalizing to completely novel questions with novel answers. However, single aggregated test-set scores do not show the full picture of what capabilities models truly have. In this work, we perform a detailed study of the test sets of three popular open-domain benchmark datasets with respect to these competencies. We find that 60-70% of test-time answers are also present somewhere in the training sets. We also find that 30% of test-set questions have a near-duplicate paraphrase in their corresponding training sets. Using these findings, we evaluate a variety of popular open-domain models to obtain greater insight into the extent to which they can actually generalize, and what drives their overall performance. We find that all models perform dramatically worse on questions that cannot be memorized from training sets, with a mean absolute performance difference of 63% between repeated and non-repeated data. Finally, we show that simple nearest-neighbor models outperform a BART closed-book QA model, further highlighting the role that training-set memorization plays in these benchmarks.
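
    As a rough illustration of the overlap analysis described above, the sketch below computes answer overlap and near-duplicate question rates between a training set and a test set. It assumes each example is a dict with "question" and "answers" fields and uses simple string normalization plus token-set Jaccard similarity as a stand-in for the paper's paraphrase detection; it is not the authors' implementation.

```python
# Minimal sketch of a test-train overlap analysis. The example format and the
# Jaccard-based paraphrase proxy are illustrative assumptions.
import re


def normalize(text):
    """Lowercase, strip punctuation, and trim whitespace."""
    return re.sub(r"[^a-z0-9 ]", "", text.lower()).strip()


def answer_overlap(train_set, test_set):
    """Fraction of test examples whose answer also appears in the training set."""
    train_answers = {normalize(a) for ex in train_set for a in ex["answers"]}
    hits = sum(
        any(normalize(a) in train_answers for a in ex["answers"]) for ex in test_set
    )
    return hits / len(test_set)


def question_overlap(train_set, test_set, threshold=0.5):
    """Fraction of test questions with a near-duplicate training question,
    using token-set Jaccard similarity as a crude paraphrase proxy."""
    train_qs = [set(normalize(ex["question"]).split()) for ex in train_set]
    hits = 0
    for ex in test_set:
        q = set(normalize(ex["question"]).split())
        if any(t and len(q & t) / len(q | t) >= threshold for t in train_qs):
            hits += 1
    return hits / len(test_set)


if __name__ == "__main__":
    train = [{"question": "who wrote hamlet", "answers": ["William Shakespeare"]}]
    test = [{"question": "who was the author of hamlet", "answers": ["William Shakespeare"]}]
    print(answer_overlap(train, test), question_overlap(train, test))
```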

    A Few More Examples May Be Worth Billions of Parameters

    We investigate the dynamics of increasing the number of model parameters versus the number of labeled examples across a wide variety of tasks. Our exploration reveals that while scaling parameters consistently yields performance improvements, the contribution of additional examples depends strongly on the task's format. Specifically, in open question answering tasks, enlarging the training set does not improve performance. In contrast, classification, extractive question answering, and multiple choice tasks benefit so much from additional examples that collecting a few hundred examples is often “worth” billions of parameters. We hypothesize that unlike open question answering, which involves recalling specific information, solving strategies for tasks with a more restricted output space transfer across examples, and can therefore be learned with small amounts of labeled data.

    Effectiveness of learning mathematics derivative materials using modules equipped with cooperative models in high schools

    The aim of the research is to improve learning outcomes for mathematics material in high schools. At the high school level, there are three materials that are difficult for students to understand, one of which is derivative material. In fact, student learning outcomes in derivative material are low. Difficulties arise because teachers rarely write teaching modules. Students have difficulty understanding definitions (71.42%), concepts (71.42%), principles (57.14%), and skills (42.85%). In the needs analysis, 90% of students had difficulty with derivative material, and the teacher was of the opinion that 85% of students had low scores on derivative material. The research used the research and development (R&D) method. The stages of the research are needs analysis, design, development, implementation, and evaluation. As a result, the validation scores from material experts (91.72%), mathematics teachers (92.42%), and students (95.90%) are all categorized as very good. Students who do not use the module get an average score of 65.51, and students who are assisted by the module get an average score of 87.20. In conclusion, there is a significant difference of 21.69 points between using and not using the module. The research indicates that the developed modules significantly improve student learning outcomes.

    Speak, Memory: An Archaeology of Books Known to ChatGPT/GPT-4

    In this work, we carry out a data archaeology to infer books that are known to ChatGPT and GPT-4 using a name cloze membership inference query. We find that OpenAI models have memorized a wide collection of copyrighted materials, and that the degree of memorization is tied to the frequency with which passages of those books appear on the web. The ability of these models to memorize an unknown set of books complicates assessments of measurement validity for cultural analytics by contaminating test data; we show that models perform much better on memorized books than on non-memorized books for downstream tasks. We argue that this supports a case for open models whose training data is known. (EMNLP 2023 camera-ready; 16 pages, 4 figures)
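
    The name cloze query is essentially a fill-in-the-blank prompt over a book passage with a single proper name removed, where high exact-match accuracy is read as evidence of memorization. The sketch below illustrates that idea under stated assumptions: the prompt wording, the [MASK] convention, and the ask callable (which forwards a prompt to the model under test) are illustrative placeholders, not the paper's exact setup.

```python
# Minimal sketch of a name cloze membership-inference query. The prompt text,
# the [MASK] convention, and the `ask` callable are illustrative assumptions.

def make_name_cloze_prompt(masked_passage):
    """Build a cloze prompt for a passage in which one proper name is masked."""
    return (
        "Fill in the blank with the single proper name that was removed.\n"
        "Reply with the name only.\n\n"
        f"Passage: {masked_passage}\n"
        "Name:"
    )


def name_cloze_accuracy(examples, ask):
    """examples: list of (masked_passage, gold_name); ask: prompt -> model reply.
    Exact-match accuracy is interpreted as evidence that the source text was
    memorized by the model."""
    if not examples:
        return 0.0
    correct = 0
    for masked_passage, gold_name in examples:
        reply = ask(make_name_cloze_prompt(masked_passage)).strip()
        correct += reply.lower() == gold_name.lower()
    return correct / len(examples)


if __name__ == "__main__":
    demo = [("Call me [MASK]. Some years ago, never mind how long precisely...", "Ishmael")]
    # Stub "model" that always answers Ishmael, just to exercise the code path.
    print(name_cloze_accuracy(demo, lambda prompt: "Ishmael"))
```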

    A Survey on Measuring and Mitigating Reasoning Shortcuts in Machine Reading Comprehension

    The issue of shortcut learning is widely known in NLP and has been an important research focus in recent years. Unintended correlations in the data enable models to easily solve tasks that were meant to exhibit advanced language understanding and reasoning capabilities. In this survey paper, we focus on the field of machine reading comprehension (MRC), an important task for showcasing high-level language understanding that also suffers from a range of shortcuts. We summarize the available techniques for measuring and mitigating shortcuts and conclude with suggestions for further progress in shortcut research. Importantly, we highlight two concerns for shortcut mitigation in MRC: (1) the lack of public challenge sets, a necessary component for effective and reusable evaluation, and (2) the lack of certain mitigation techniques that are prominent in other areas. (18 pages, 2 figures, 4 tables)

    CoMPosT: Characterizing and Evaluating Caricature in LLM Simulations

    Recent work has aimed to capture nuances of human behavior by using LLMs to simulate responses from particular demographics in settings like social science experiments and public opinion surveys. However, there are currently no established ways to discuss or evaluate the quality of such LLM simulations. Moreover, there is growing concern that these LLM simulations are flattened caricatures of the personas that they aim to simulate, failing to capture the multidimensionality of people and perpetuating stereotypes. To bridge these gaps, we present CoMPosT, a framework to characterize LLM simulations using four dimensions: Context, Model, Persona, and Topic. We use this framework to measure open-ended LLM simulations' susceptibility to caricature, defined via two criteria: individuation and exaggeration. We evaluate the level of caricature in scenarios from existing work on LLM simulations. We find that for GPT-4, simulations of certain demographics (political and marginalized groups) and topics (general, uncontroversial) are highly susceptible to caricature. (To appear at EMNLP 2023, Main Conference)
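
    To make the four dimensions and two caricature criteria concrete, the sketch below shows one hypothetical way to organize CoMPosT-style evaluation records. The field names mirror the dimensions and criteria named in the abstract, but the 0-1 scoring scale and the flagging rule are assumptions for illustration, not the framework's actual metrics.

```python
# Illustrative schema for organizing CoMPosT-style evaluations. Field names
# follow the abstract's dimensions and criteria; the 0-1 scores and the
# flagging rule below are hypothetical, not the framework's actual metrics.
from dataclasses import dataclass


@dataclass
class SimulationScenario:
    context: str  # e.g. "public opinion survey"
    model: str    # e.g. "GPT-4"
    persona: str  # demographic or persona being simulated
    topic: str    # e.g. an uncontroversial everyday topic


@dataclass
class CaricatureAssessment:
    scenario: SimulationScenario
    individuation: float  # does the output reflect the persona at all? (0-1)
    exaggeration: float   # how strongly are persona-typical traits overshot? (0-1)

    def is_caricature(self, exaggeration_threshold=0.7):
        """Hypothetical rule: output is individuated but heavily exaggerated."""
        return self.individuation > 0.5 and self.exaggeration >= exaggeration_threshold


if __name__ == "__main__":
    scenario = SimulationScenario("opinion survey", "GPT-4", "member of a political group", "a general topic")
    print(CaricatureAssessment(scenario, individuation=0.8, exaggeration=0.9).is_caricature())
```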