JFLEG: A Fluency Corpus and Benchmark for Grammatical Error Correction
We present a new parallel corpus, the JHU FLuency-Extended GUG corpus (JFLEG), for
developing and evaluating grammatical error correction (GEC). Unlike other
corpora, it represents a broad range of language proficiency levels and uses
holistic fluency edits to not only correct grammatical errors but also make the
original text more native sounding. We describe the types of corrections made
and benchmark four leading GEC systems on this corpus, identifying specific
areas in which they do well and how they can improve. JFLEG fulfills the need
for a new gold standard to properly assess the current state of GEC.
Comment: To appear in EACL 2017 (short papers).
Hypothesis Only Baselines in Natural Language Inference
We propose a hypothesis only baseline for diagnosing Natural Language
Inference (NLI). Especially when an NLI dataset assumes inference is occurring
based purely on the relationship between a context and a hypothesis, it follows
that assessing entailment relations while ignoring the provided context is a
degenerate solution. Yet, through experiments on ten distinct NLI datasets, we
find that this approach, which we refer to as a hypothesis-only model, is able
to significantly outperform a majority class baseline across a number of NLI
datasets. Our analysis suggests that statistical irregularities may allow a
model to perform NLI in some datasets beyond what should be achievable without
access to the context.
Comment: Accepted at *SEM 2018 as a long paper. 12 pages.
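The hypothesis-only setup can be sketched as follows. This is a toy illustration, not the paper's method: the paper's baselines are trained sentence encoders, while here a single hand-picked cue (negation words) stands in for the kind of statistical irregularity a trained model can exploit, and the four examples are invented.

```python
# Toy sketch of a hypothesis-only NLI baseline: the "model" sees only
# the hypothesis and never the premise. A hand-written negation cue
# stands in for learned dataset artifacts (illustrative only).

NEGATION_CUES = {"not", "no", "nobody", "never"}

def hypothesis_only_predict(hypothesis: str) -> str:
    """Predict an NLI label from the hypothesis alone."""
    tokens = set(hypothesis.lower().split())
    if tokens & NEGATION_CUES:
        return "contradiction"
    return "entailment"

# Toy evaluation set: (premise, hypothesis, gold label).
# The premise is deliberately ignored by the baseline.
examples = [
    ("A man plays guitar.", "A man is not sleeping.", "contradiction"),
    ("A dog runs outside.", "An animal is outside.", "entailment"),
    ("Kids play soccer.", "Nobody is playing.", "contradiction"),
    ("She reads a book.", "A person reads.", "entailment"),
]

correct = sum(hypothesis_only_predict(h) == gold for _, h, gold in examples)
accuracy = correct / len(examples)
```

On a real NLI dataset, any accuracy of such a premise-blind model above the majority-class baseline signals annotation artifacts rather than genuine inference.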
Grammaticality, Acceptability, and Probability: A Probabilistic View of Linguistic Knowledge
The question of whether humans represent grammatical knowledge as a binary condition on membership in a set of well-formed sentences, or as a probabilistic property, has been the subject of debate among linguists, psychologists, and cognitive scientists for many decades. Acceptability judgments present a serious problem for both classical binary and probabilistic theories of grammaticality. These judgments are gradient in nature, and so cannot be directly accommodated in a binary formal grammar. However, it is also not possible to simply reduce acceptability to probability. The acceptability of a sentence is not the same as the likelihood of its occurrence, which is, in part, determined by factors like sentence length and lexical frequency. In this paper, we present the results of a set of large-scale experiments using crowd-sourced acceptability judgments that demonstrate gradience to be a pervasive feature in acceptability judgments. We then show how one can predict acceptability judgments on the basis of probability by augmenting probabilistic language models with an acceptability measure. This is a function that normalizes probability values to eliminate the confounding factors of length and lexical frequency. We describe a sequence of modeling experiments with unsupervised language models drawn from state-of-the-art machine learning methods in natural language processing. Several of these models achieve very encouraging levels of accuracy in the acceptability prediction task, as measured by the correlation between the acceptability measure scores and mean human acceptability values. We consider the relevance of these results to the debate on the nature of grammatical competence, and we argue that they support the view that linguistic knowledge can be intrinsically probabilistic.
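One acceptability measure of the kind described in this line of work is SLOR (the syntactic log-odds ratio), which divides out sentence length and unigram (lexical) frequency from a language model's log-probability. The sketch below uses made-up log-probabilities, not the outputs of any real model:

```python
# Sketch of a SLOR-style acceptability measure: normalize a language
# model's log-probability for sentence length and unigram frequency.
# All numbers below are illustrative, not real model outputs.

def slor(logp_sentence: float, unigram_logps: list[float]) -> float:
    """(log P_LM(s) - sum of unigram log-probs) / sentence length."""
    n = len(unigram_logps)
    return (logp_sentence - sum(unigram_logps)) / n

# Two sentences with identical LM log-probability: one built from
# frequent words, one from rare words. Raw probability penalizes the
# rare-word sentence; SLOR removes that lexical-frequency confound.
logp = -20.0
common_word_logps = [-2.0, -2.0, -2.0, -2.0]  # high-frequency items
rare_word_logps = [-6.0, -6.0, -6.0, -6.0]    # low-frequency items

assert slor(logp, rare_word_logps) > slor(logp, common_word_logps)
```

In the experiments the abstract describes, such normalized scores are then correlated against mean human acceptability ratings.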
BLiMP: The Benchmark of Linguistic Minimal Pairs for English
We introduce The Benchmark of Linguistic Minimal Pairs (shortened to BLiMP),
a challenge set for evaluating what language models (LMs) know about major
grammatical phenomena in English. BLiMP consists of 67 sub-datasets, each
containing 1000 minimal pairs isolating specific contrasts in syntax,
morphology, or semantics. The data is automatically generated according to
expert-crafted grammars, and aggregate human agreement with the labels is
96.4%. We use it to evaluate n-gram, LSTM, and Transformer (GPT-2 and
Transformer-XL) LMs. We find that state-of-the-art models identify
morphological contrasts reliably, but they struggle with semantic restrictions
on the distribution of quantifiers and negative polarity items and subtle
syntactic phenomena such as extraction islands.
Comment: To appear in TACL.
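The minimal-pair evaluation protocol can be sketched as follows: an LM is scored as correct on a pair when it assigns higher probability to the grammatical sentence. The score table and sentences below are invented stand-ins for a real language model:

```python
# Sketch of BLiMP-style minimal-pair evaluation: the LM is correct on a
# pair if it assigns a higher score to the grammatical sentence. A toy
# lookup table stands in for a real LM (illustrative only).

TOY_LOGPROBS = {
    "the cats meow": -5.0,
    "the cats meows": -7.5,          # agreement violation: lower score
    "no one has ever left": -6.0,
    "no one has never left": -8.0,   # NPI-style violation: lower score
}

def lm_logprob(sentence: str) -> float:
    return TOY_LOGPROBS[sentence]

# Minimal pairs as (grammatical, ungrammatical) tuples.
pairs = [
    ("the cats meow", "the cats meows"),
    ("no one has ever left", "no one has never left"),
]

accuracy = sum(lm_logprob(good) > lm_logprob(bad) for good, bad in pairs) / len(pairs)
```

BLiMP reports this forced-choice accuracy per sub-dataset, which is what lets it localize failures to specific phenomena such as NPI licensing or islands.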
Quantification at a distance and grammatical illusions in French
Recent research in psycholinguistics supports the hypothesis that retrieval from working memory is a key component of establishing syntactic dependencies in comprehension. This can result in so-called grammatical illusions. These illusions have been modeled as the result of a content-addressable retrieval process in sentence comprehension that allows grammatically inaccessible licensing elements to be reactivated, creating a spurious perception of acceptability. This article reports five studies that establish the existence of a new grammatical illusion involving quantification at a distance and the licensing of so-called de NPs in French. Our results suggest that this grammatical illusion is interestingly constrained by syntactic properties of the licensors. Specifically, quantifiers that independently participate in quantification-at-a-distance constructions were seen to create grammatical illusions to a greater extent than quantifiers that do not participate in that construction. Consistent with previous work on the nature of cues in memory retrieval, we suggest that this is the result of fairly specific abstract syntactic cues that guide retrieval of a licensing element. This article thus brings further evidence that syntax is crucially used to structure working memory over the course of a parse.
Why We Need New Evaluation Metrics for NLG
The majority of NLG evaluation relies on automatic metrics, such as BLEU. In
this paper, we motivate the need for novel, system- and data-independent
automatic evaluation methods: We investigate a wide range of metrics, including
state-of-the-art word-based and novel grammar-based ones, and demonstrate that
they only weakly reflect human judgements of system outputs as generated by
data-driven, end-to-end NLG. We also show that metric performance is data- and
system-specific. Nevertheless, our results also suggest that automatic metrics
perform reliably at system-level and can support system development by finding
cases where a system performs poorly.
Comment: Accepted to EMNLP 2017.
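The underlying methodology, correlating an automatic metric's scores with mean human judgements of the same system outputs, can be sketched as follows; all scores are made up for illustration:

```python
import math
import statistics

# Sketch of metric validation: compute the correlation between an
# automatic metric's scores and mean human ratings for the same
# outputs. The scores below are hypothetical, not from any system.

def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient between two equal-length score lists."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical metric scores and mean human ratings for five outputs.
metric_scores = [0.31, 0.62, 0.40, 0.75, 0.55]
human_ratings = [3.0, 2.5, 4.0, 3.5, 2.0]

r = pearson(metric_scores, human_ratings)  # weak correlation on this toy data
```

A correlation near zero at the segment level, as in this toy case, is the kind of result that motivates the paper's call for new, system- and data-independent metrics.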
The Relationship between Metalinguistic Knowledge/Learning Contexts and Language Proficiency
Spring 2013. Includes bibliographical references.
This study explores the effect of learning context on learners' oral proficiency, metalinguistic knowledge of Spanish (MKS), and metalinguistic knowledge of English (MKE). It also explores the relationship between MKE and MKS, and the effect of MKS on oral proficiency across the two learning contexts. The two contexts in question were a traditional semester (TS) that met five days a week, fifty minutes a day for fifteen weeks, and an intensive summer (IS) program that met five days a week, four hours a day for four weeks. A COPI (computerized oral proficiency interview) was administered to measure oral proficiency, and two different measures of metalinguistic knowledge were employed to test MKE and MKS. The MKE test was administered as a pre- and posttest, whereas the MKS test was given at the end of the semester. The study found that: a) students in the TS group have significantly higher levels of MKS; b) students in the TS group improve their MKE significantly more than the IS group; c) there is a significant relationship between MKS and oral proficiency regardless of group; d) there is a significant relationship between the MKE pretest and MKS at the end of the semester; and e) there is no significant difference in oral proficiency between the two contexts.