
    JFLEG: A Fluency Corpus and Benchmark for Grammatical Error Correction

    We present a new parallel corpus, the JHU FLuency-Extended GUG corpus (JFLEG), for developing and evaluating grammatical error correction (GEC). Unlike other corpora, it represents a broad range of language proficiency levels and uses holistic fluency edits not only to correct grammatical errors but also to make the original text more native sounding. We describe the types of corrections made and benchmark four leading GEC systems on this corpus, identifying specific areas in which they do well and how they can improve. JFLEG fulfills the need for a new gold standard to properly assess the current state of GEC. Comment: To appear in EACL 2017 (short papers).
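    Evaluation on JFLEG scores a system's output against several independent human fluency rewrites rather than a single reference (the benchmark's reported metric is GLEU). The snippet below is a minimal sketch of that multi-reference setup, using hypothetical sentences and sacreBLEU's corpus_bleu purely as an illustrative stand-in for the official GLEU scorer.

        # Minimal sketch of multi-reference evaluation in the JFLEG style (hypothetical data).
        # JFLEG's official metric is GLEU; corpus_bleu is used here only to show the
        # multi-reference setup, not as a replacement for GLEU.
        import sacrebleu

        # Each source sentence has several independent fluency-edited references.
        system_outputs = [
            "He goes to school every day .",
            "They discussed about the problem yesterday .",
        ]
        references = [
            # reference set 1 (one entry per system output)
            ["He goes to school every day .", "They discussed the problem yesterday ."],
            # reference set 2
            ["He attends school every day .", "They talked about the problem yesterday ."],
        ]

        score = sacrebleu.corpus_bleu(system_outputs, references)
        print(f"corpus score against multiple fluency references: {score.score:.1f}")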

    Ordinal GAMMs: a new window on human ratings


    Hypothesis Only Baselines in Natural Language Inference

    We propose a hypothesis-only baseline for diagnosing Natural Language Inference (NLI). When an NLI dataset assumes that inference rests purely on the relationship between a context and a hypothesis, assessing entailment while ignoring the provided context should be a degenerate solution. Yet, through experiments on ten distinct NLI datasets, we find that this approach, which we refer to as a hypothesis-only model, significantly outperforms a majority-class baseline on a number of them. Our analysis suggests that statistical irregularities may allow a model to perform NLI in some datasets beyond what should be achievable without access to the context. Comment: Accepted at *SEM 2018 as a long paper. 12 pages.
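    As a concrete illustration of the setup, the sketch below trains a simple classifier on hypotheses alone and compares it to a majority-class baseline; the data, features, and model here are hypothetical toy choices, not the ones used in the paper.

        # Minimal sketch of a hypothesis-only NLI baseline (hypothetical toy data).
        # The classifier never sees the context; a majority-class baseline is the comparison point.
        from sklearn.dummy import DummyClassifier
        from sklearn.feature_extraction.text import TfidfVectorizer
        from sklearn.linear_model import LogisticRegression
        from sklearn.pipeline import make_pipeline

        # Toy (hypothesis, label) pairs; real experiments use SNLI/MultiNLI-scale data.
        hypotheses = [
            "A man is sleeping.", "Nobody is outside.", "A dog is running.",
            "The woman is eating.", "No one is swimming.", "A child is playing.",
        ]
        labels = ["entailment", "contradiction", "entailment",
                  "neutral", "contradiction", "neutral"]

        hyp_only = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)),
                                 LogisticRegression(max_iter=1000))
        hyp_only.fit(hypotheses, labels)

        majority = DummyClassifier(strategy="most_frequent").fit(hypotheses, labels)

        # Scored on the training data only to keep the sketch self-contained.
        print("hypothesis-only accuracy:", hyp_only.score(hypotheses, labels))
        print("majority-class accuracy:", majority.score(hypotheses, labels))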

    Grammaticality, Acceptability, and Probability: A Probabilistic View of Linguistic Knowledge

    The question of whether humans represent grammatical knowledge as a binary condition on membership in a set of well-formed sentences, or as a probabilistic property, has been the subject of debate among linguists, psychologists, and cognitive scientists for many decades. Acceptability judgments present a serious problem for both classical binary and probabilistic theories of grammaticality. These judgments are gradient in nature, and so cannot be directly accommodated in a binary formal grammar. However, it is also not possible to simply reduce acceptability to probability. The acceptability of a sentence is not the same as the likelihood of its occurrence, which is, in part, determined by factors like sentence length and lexical frequency. In this paper, we present the results of a set of large-scale experiments using crowd-sourced acceptability judgments that demonstrate gradience to be a pervasive feature in acceptability judgments. We then show how one can predict acceptability judgments on the basis of probability by augmenting probabilistic language models with an acceptability measure. This is a function that normalizes probability values to eliminate the confounding factors of length and lexical frequency. We describe a sequence of modeling experiments with unsupervised language models drawn from state-of-the-art machine learning methods in natural language processing. Several of these models achieve very encouraging levels of accuracy in the acceptability prediction task, as measured by the correlation between the acceptability measure scores and mean human acceptability values. We consider the relevance of these results to the debate on the nature of grammatical competence, and we argue that they support the view that linguistic knowledge can be intrinsically probabilistic.
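    One acceptability measure used in this line of work is SLOR, which subtracts the unigram (lexical frequency) log-probability of a sentence from the language model's log-probability and divides by sentence length. The sketch below is a minimal, hypothetical illustration of that normalization; the input scores are made up and would in practice come from a trained language model and a unigram model estimated on a large corpus.

        # Minimal sketch of a length- and frequency-normalized acceptability measure
        # in the style of SLOR (syntactic log-odds ratio); the inputs below are hypothetical.

        def slor(logprob_model: float, unigram_logprobs: list[float]) -> float:
            """(log P_model(s) - log P_unigram(s)) / |s|, so a sentence is not penalized
            merely for being long or for containing rare words."""
            n = len(unigram_logprobs)
            logprob_unigram = sum(unigram_logprobs)
            return (logprob_model - logprob_unigram) / n

        # Hypothetical scores for a 6-word sentence: model log-probability and
        # per-word unigram log-probabilities.
        print(slor(-28.4, [-7.1, -3.2, -9.8, -4.0, -2.5, -6.3]))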

    BLiMP: The Benchmark of Linguistic Minimal Pairs for English

    We introduce The Benchmark of Linguistic Minimal Pairs (shortened to BLiMP), a challenge set for evaluating what language models (LMs) know about major grammatical phenomena in English. BLiMP consists of 67 sub-datasets, each containing 1,000 minimal pairs isolating specific contrasts in syntax, morphology, or semantics. The data are automatically generated according to expert-crafted grammars, and aggregate human agreement with the labels is 96.4%. We use it to evaluate n-gram, LSTM, and Transformer (GPT-2 and Transformer-XL) LMs. We find that state-of-the-art models identify morphological contrasts reliably, but they struggle with semantic restrictions on the distribution of quantifiers and negative polarity items, and with subtle syntactic phenomena such as extraction islands. Comment: To appear in TACL.
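    In this forced-choice setting, an LM is credited when it assigns higher probability to the acceptable member of a minimal pair. The sketch below shows that comparison with GPT-2 via the Hugging Face transformers library; the example pair is a hypothetical BLiMP-style pair, not taken from the benchmark, and the pretrained model must be downloaded.

        # Minimal sketch of forced-choice minimal-pair evaluation with GPT-2
        # (assumes the transformers and torch packages; the pair below is hypothetical).
        import torch
        from transformers import GPT2LMHeadModel, GPT2TokenizerFast

        tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
        model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

        def sentence_logprob(sentence: str) -> float:
            """Total log-probability the LM assigns to the sentence."""
            ids = tokenizer(sentence, return_tensors="pt").input_ids
            with torch.no_grad():
                loss = model(ids, labels=ids).loss  # mean negative log-likelihood per predicted token
            return -loss.item() * (ids.size(1) - 1)

        good = "The cats that the dog chased were hungry."
        bad = "The cats that the dog chased was hungry."

        # The model is credited when the grammatical member gets the higher probability.
        print("correct" if sentence_logprob(good) > sentence_logprob(bad) else "incorrect")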

    Why We Need New Evaluation Metrics for NLG

    The majority of NLG evaluation relies on automatic metrics such as BLEU. In this paper, we motivate the need for novel, system- and data-independent automatic evaluation methods: we investigate a wide range of metrics, including state-of-the-art word-based and novel grammar-based ones, and demonstrate that they only weakly reflect human judgements of system outputs as generated by data-driven, end-to-end NLG. We also show that metric performance is data- and system-specific. Nevertheless, our results also suggest that automatic metrics perform reliably at the system level and can support system development by finding cases where a system performs poorly. Comment: Accepted to EMNLP 2017.
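    A common way to quantify how well an automatic metric reflects human judgements is to correlate metric scores with human ratings over the same outputs. The sketch below does this with sentence-level BLEU (via sacreBLEU) and a Spearman correlation on hypothetical outputs and ratings; it illustrates the general methodology rather than the paper's exact experimental setup.

        # Minimal sketch of correlating an automatic metric with human ratings
        # (hypothetical NLG outputs and scores; assumes the sacrebleu and scipy packages).
        import sacrebleu
        from scipy.stats import spearmanr

        outputs = [
            "there is a cheap coffee shop near the river",
            "coffee shop cheap river near",
            "the venue serves coffee and is located by the river",
        ]
        references = ["there is a cheap coffee shop by the river"] * 3
        human_scores = [4.5, 1.5, 4.0]  # hypothetical mean human ratings

        metric_scores = [
            sacrebleu.sentence_bleu(out, [ref]).score
            for out, ref in zip(outputs, references)
        ]

        rho, p = spearmanr(metric_scores, human_scores)
        print(f"Spearman correlation between BLEU and human ratings: {rho:.2f} (p={p:.2f})")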

    The relationship between metalinguistic knowledge/learning contexts and language proficiency

    This study explores the effect of learning context on learners' oral proficiency, metalinguistic knowledge of Spanish (MKS), and metalinguistic knowledge of English (MKE). It also explores the relationship between MKE and MKS, and between MKS and oral proficiency, across the two learning contexts. The two contexts in question were a traditional semester (TS) that met five days a week, fifty minutes a day, for fifteen weeks, and an intensive summer (IS) program that met five days a week, four hours a day, for four weeks. A COPI (computerized oral proficiency interview) was administered to measure oral proficiency, and two different measures of metalinguistic knowledge were employed to test MKE and MKS. The MKE test was administered as a pre- and posttest, whereas the MKS test was given at the end of the semester. The study found that a) students in the TS group have significantly higher levels of MKS, b) students in the TS group improve their MKE significantly more than the IS group, c) there is a significant relationship between MKS and oral proficiency regardless of group, d) there is a significant relationship between the MKE pretest and MKS at the end of the semester, and e) there is no significant difference in oral proficiency between the two contexts.