Question and Answer Test-Train Overlap in Open-Domain Question Answering Datasets
Ideally, open-domain question answering models should exhibit a number of
competencies, ranging from simply memorizing questions seen at training time,
to answering novel question formulations with answers seen during training, to
generalizing to completely novel questions with novel answers. However, single
aggregated test set scores do not show the full picture of what capabilities
models truly have. In this work, we perform a detailed study of the test sets
of three popular open-domain benchmark datasets with respect to these
competencies. We find that 60-70% of test-time answers are also present
somewhere in the training sets. We also find that 30% of test-set questions
have a near-duplicate paraphrase in their corresponding training sets. Using
these findings, we evaluate a variety of popular open-domain models to obtain
greater insight into what extent they can actually generalize, and what drives
their overall performance. We find that all models perform dramatically worse
on questions that cannot be memorized from training sets, with a mean absolute
performance difference of 63% between repeated and non-repeated data. Finally,
we show that simple nearest-neighbor models outperform a BART closed-book QA
model, further highlighting the role that training-set memorization plays in
these benchmarks.
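The two overlap statistics above can be approximated with a short sketch: normalize strings, then count test items whose answer appears verbatim in the training answers, or whose question has a high-similarity training paraphrase. This is an illustrative approximation (exact-string answer matching and `difflib` similarity with a hypothetical 0.9 threshold), not necessarily the paper's exact procedure.

```python
from difflib import SequenceMatcher

def normalize(text):
    # Lowercase and strip punctuation/extra whitespace for lenient matching.
    return " ".join(
        "".join(c for c in text.lower() if c.isalnum() or c.isspace()).split()
    )

def answer_overlap(train_answers, test_answers):
    """Fraction of test answers that also appear somewhere in the training set."""
    train_set = {normalize(a) for a in train_answers}
    hits = sum(normalize(a) in train_set for a in test_answers)
    return hits / len(test_answers)

def near_duplicate_questions(train_questions, test_questions, threshold=0.9):
    """Test questions with a high-similarity paraphrase in the training set."""
    norm_train = [normalize(q) for q in train_questions]
    dupes = []
    for q in test_questions:
        nq = normalize(q)
        if any(SequenceMatcher(None, nq, t).ratio() >= threshold
               for t in norm_train):
            dupes.append(q)
    return dupes
```

Real evaluations would use the datasets' own answer normalization and a stronger paraphrase detector, but the bookkeeping is the same: memorizable test items are exactly those flagged by functions like these.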
A Few More Examples May Be Worth Billions of Parameters
We investigate the dynamics of increasing the number of model parameters versus the number of labeled examples across a wide variety of tasks. Our exploration reveals that while scaling parameters consistently yields performance improvements, the contribution of additional examples depends heavily on the task's format. Specifically, in open question answering tasks, enlarging the training set does not improve performance. In contrast, classification, extractive question answering, and multiple choice tasks benefit so much from additional examples that collecting a few hundred examples is often “worth” billions of parameters. We hypothesize that unlike open question answering, which involves recalling specific information, solving strategies for tasks with a more restricted output space transfer across examples, and can therefore be learned with small amounts of labeled data.
Effectiveness of learning mathematics derivative materials using modules equipped with cooperative models in high schools
The aim of the research is to improve learning outcomes in high-school mathematics. At the high school level, three topics are difficult for students to understand, one of which is derivative material; in fact, student learning outcomes on derivative material are low. Difficulties arise because teachers rarely write teaching modules. Students have difficulty understanding definitions (71.42%), concepts (71.42%), principles (57.14%), and skills (42.85%). In the needs analysis, 90% of students had difficulty with derivative material, and teachers estimated that 85% of students scored low on it. The research used the research and development (R&D) method, whose stages are needs analysis, design, development, implementation, and evaluation. Validation scores were 91.72% from material experts, 92.42% from mathematics teachers, and 95.90% from students, all three categorized as very good. Students who did not use the module scored an average of 65.51, while students assisted by the module scored an average of 87.20, a significant difference of 21.69 points. The research concludes that the developed modules significantly improve student learning outcomes.
Speak, Memory: An Archaeology of Books Known to ChatGPT/GPT-4
In this work, we carry out a data archaeology to infer books that are known
to ChatGPT and GPT-4 using a name cloze membership inference query. We find
that OpenAI models have memorized a wide collection of copyrighted materials,
and that the degree of memorization is tied to the frequency with which
passages of those books appear on the web. The ability of these models to
memorize an unknown set of books complicates assessments of measurement
validity for cultural analytics by contaminating test data; we show that models
perform much better on memorized books than on non-memorized books for
downstream tasks. We argue that this supports a case for open models whose
training data is known. (Comment: EMNLP 2023 camera-ready, 16 pages, 4 figures.)
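The name cloze probe described above can be sketched as follows: mask a single proper-name occurrence in a passage and check whether the model recovers it, with accuracy over many passages serving as the membership signal. Here `model_fn` is a stand-in for whatever LLM call is being probed, and the prompt wording is an assumption for illustration, not the paper's exact template.

```python
def name_cloze_accuracy(examples, model_fn, mask="[MASK]"):
    """examples: (passage, name) pairs where `name` occurs in `passage`.
    model_fn: callable taking a prompt string and returning a guessed name.
    Returns the fraction of passages where the model recovers the masked name."""
    correct = 0
    for passage, name in examples:
        # Mask one occurrence of the held-out name to build the cloze query.
        prompt = (
            "Fill in the blank with the single proper name that belongs there:\n"
            + passage.replace(name, mask, 1)
        )
        correct += model_fn(prompt).strip().lower() == name.lower()
    return correct / len(examples)
```

High accuracy on passages from a given book, well above what a surface-level guesser achieves, is the evidence of memorization the abstract refers to.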
A Survey on Measuring and Mitigating Reasoning Shortcuts in Machine Reading Comprehension
The issue of shortcut learning is widely known in NLP and has been an
important research focus in recent years. Unintended correlations in the data
enable models to easily solve tasks that were meant to exhibit advanced
language understanding and reasoning capabilities. In this survey paper, we
focus on the field of machine reading comprehension (MRC), an important task
for showcasing high-level language understanding that also suffers from a range
of shortcuts. We summarize the available techniques for measuring and
mitigating shortcuts and conclude with suggestions for further progress in
shortcut research. Importantly, we highlight two concerns for shortcut
mitigation in MRC: (1) the lack of public challenge sets, a necessary component
for effective and reusable evaluation, and (2) the lack of certain mitigation
techniques that are prominent in other areas. (Comment: 18 pages, 2 figures, 4 tables.)
CoMPosT: Characterizing and Evaluating Caricature in LLM Simulations
Recent work has aimed to capture nuances of human behavior by using LLMs to
simulate responses from particular demographics in settings like social science
experiments and public opinion surveys. However, there are currently no
established ways to discuss or evaluate the quality of such LLM simulations.
Moreover, there is growing concern that these LLM simulations are flattened
caricatures of the personas that they aim to simulate, failing to capture the
multidimensionality of people and perpetuating stereotypes. To bridge these
gaps, we present CoMPosT, a framework to characterize LLM simulations using
four dimensions: Context, Model, Persona, and Topic. We use this framework to
measure open-ended LLM simulations' susceptibility to caricature, defined via
two criteria: individuation and exaggeration. We evaluate the level of
caricature in scenarios from existing work on LLM simulations. We find that for
GPT-4, simulations of certain demographics (political and marginalized groups)
and topics (general, uncontroversial) are highly susceptible to caricature. (Comment: to appear at EMNLP 2023, main conference.)