Question and Answer Test-Train Overlap in Open-Domain Question Answering Datasets
Ideally, open-domain question answering models should exhibit a number of
competencies, ranging from simply memorizing questions seen at training time,
to answering novel question formulations with answers seen during training, to
generalizing to completely novel questions with novel answers. However, single
aggregated test set scores do not show the full picture of what capabilities
models truly have. In this work, we perform a detailed study of the test sets
of three popular open-domain benchmark datasets with respect to these
competencies. We find that 60-70% of test-time answers are also present
somewhere in the training sets. We also find that 30% of test-set questions
have a near-duplicate paraphrase in their corresponding training sets. Using
these findings, we evaluate a variety of popular open-domain models to obtain
greater insight into what extent they can actually generalize, and what drives
their overall performance. We find that all models perform dramatically worse
on questions that cannot be memorized from training sets, with a mean absolute
performance difference of 63% between repeated and non-repeated data. Finally,
we show that simple nearest-neighbor models outperform a BART closed-book QA
model, further highlighting the role that training-set memorization plays in
these benchmarks.
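The two overlap statistics above can be approximated with a short sketch: normalize strings, then count test items whose answer appears verbatim in the training answers, or whose question has a high-similarity training paraphrase. This is an illustrative approximation (exact-string answer matching and `difflib` similarity with a hypothetical 0.9 threshold), not necessarily the paper's exact procedure.

```python
from difflib import SequenceMatcher

def normalize(text):
    # Lowercase and strip punctuation/extra whitespace for lenient matching.
    return " ".join(
        "".join(c for c in text.lower() if c.isalnum() or c.isspace()).split()
    )

def answer_overlap(train_answers, test_answers):
    """Fraction of test answers that also appear somewhere in the training set."""
    train_set = {normalize(a) for a in train_answers}
    hits = sum(normalize(a) in train_set for a in test_answers)
    return hits / len(test_answers)

def near_duplicate_questions(train_questions, test_questions, threshold=0.9):
    """Test questions with a high-similarity paraphrase in the training set."""
    norm_train = [normalize(q) for q in train_questions]
    dupes = []
    for q in test_questions:
        nq = normalize(q)
        if any(SequenceMatcher(None, nq, t).ratio() >= threshold
               for t in norm_train):
            dupes.append(q)
    return dupes
```

Real evaluations would use the datasets' own answer normalization and a stronger paraphrase detector, but the bookkeeping is the same: memorizable test items are exactly those flagged by functions like these.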
A Few More Examples May Be Worth Billions of Parameters
We investigate the dynamics of increasing the number of model parameters versus the number of labeled examples across a wide variety of tasks. Our exploration reveals that while scaling parameters consistently yields performance improvements, the contribution of additional examples depends heavily on the task's format. Specifically, in open question answering tasks, enlarging the training set does not improve performance. In contrast, classification, extractive question answering, and multiple choice tasks benefit so much from additional examples that collecting a few hundred examples is often “worth” billions of parameters. We hypothesize that unlike open question answering, which involves recalling specific information, solving strategies for tasks with a more restricted output space transfer across examples, and can therefore be learned with small amounts of labeled data.
Effectiveness of learning mathematics derivative materials using modules equipped with cooperative models in high schools
The aim of the research is to improve learning outcomes in high-school mathematics. At the high school level, three topics are difficult for students to understand, one of which is derivative material; in fact, student learning outcomes on derivative material are low. Difficulties arise because teachers rarely write teaching modules. Students have difficulty understanding definitions (71.42%), concepts (71.42%), principles (57.14%), and skills (42.85%). In the needs analysis, 90% of students had difficulty with derivative material, and teachers estimated that 85% of students scored low on it. The research used the research and development (R&D) method, whose stages are needs analysis, design, development, implementation, and evaluation. Validation scores were 91.72% from material experts, 92.42% from mathematics teachers, and 95.90% from students, all three categorized as very good. Students who did not use the module scored an average of 65.51, while students assisted by the module scored an average of 87.20, a significant difference of 21.69 points. The research concludes that the developed modules significantly improve student learning outcomes.
Speak, Memory: An Archaeology of Books Known to ChatGPT/GPT-4
In this work, we carry out a data archaeology to infer books that are known
to ChatGPT and GPT-4 using a name cloze membership inference query. We find
that OpenAI models have memorized a wide collection of copyrighted materials,
and that the degree of memorization is tied to the frequency with which
passages of those books appear on the web. The ability of these models to
memorize an unknown set of books complicates assessments of measurement
validity for cultural analytics by contaminating test data; we show that models
perform much better on memorized books than on non-memorized books for
downstream tasks. We argue that this supports a case for open models whose
training data is known. (Comment: EMNLP 2023 camera-ready, 16 pages, 4 figures.)
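The name cloze probe described above can be sketched as follows: mask a single proper-name occurrence in a passage and check whether the model recovers it, with accuracy over many passages serving as the membership signal. Here `model_fn` is a stand-in for whatever LLM call is being probed, and the prompt wording is an assumption for illustration, not the paper's exact template.

```python
def name_cloze_accuracy(examples, model_fn, mask="[MASK]"):
    """examples: (passage, name) pairs where `name` occurs in `passage`.
    model_fn: callable taking a prompt string and returning a guessed name.
    Returns the fraction of passages where the model recovers the masked name."""
    correct = 0
    for passage, name in examples:
        # Mask one occurrence of the held-out name to build the cloze query.
        prompt = (
            "Fill in the blank with the single proper name that belongs there:\n"
            + passage.replace(name, mask, 1)
        )
        correct += model_fn(prompt).strip().lower() == name.lower()
    return correct / len(examples)
```

High accuracy on passages from a given book, well above what a surface-level guesser achieves, is the evidence of memorization the abstract refers to.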
A Survey on Measuring and Mitigating Reasoning Shortcuts in Machine Reading Comprehension
The issue of shortcut learning is widely known in NLP and has been an
important research focus in recent years. Unintended correlations in the data
enable models to easily solve tasks that were meant to exhibit advanced
language understanding and reasoning capabilities. In this survey paper, we
focus on the field of machine reading comprehension (MRC), an important task
for showcasing high-level language understanding that also suffers from a range
of shortcuts. We summarize the available techniques for measuring and
mitigating shortcuts and conclude with suggestions for further progress in
shortcut research. Importantly, we highlight two concerns for shortcut
mitigation in MRC: (1) the lack of public challenge sets, a necessary component
for effective and reusable evaluation, and (2) the lack of certain mitigation
techniques that are prominent in other areas. (Comment: 18 pages, 2 figures, 4 tables.)
CoMPosT: Characterizing and Evaluating Caricature in LLM Simulations
Recent work has aimed to capture nuances of human behavior by using LLMs to
simulate responses from particular demographics in settings like social science
experiments and public opinion surveys. However, there are currently no
established ways to discuss or evaluate the quality of such LLM simulations.
Moreover, there is growing concern that these LLM simulations are flattened
caricatures of the personas that they aim to simulate, failing to capture the
multidimensionality of people and perpetuating stereotypes. To bridge these
gaps, we present CoMPosT, a framework to characterize LLM simulations using
four dimensions: Context, Model, Persona, and Topic. We use this framework to
measure open-ended LLM simulations' susceptibility to caricature, defined via
two criteria: individuation and exaggeration. We evaluate the level of
caricature in scenarios from existing work on LLM simulations. We find that for
GPT-4, simulations of certain demographics (political and marginalized groups)
and topics (general, uncontroversial) are highly susceptible to caricature. (Comment: to appear at EMNLP 2023, main conference.)