
    GenERRate: generating errors for use in grammatical error detection

    This paper explores the issue of automatically generated ungrammatical data and its use in error detection, with a focus on the task of classifying a sentence as grammatical or ungrammatical. We present an error generation tool called GenERRate and show how GenERRate can be used to improve the performance of a classifier on learner data. We describe initial attempts to replicate Cambridge Learner Corpus errors using GenERRate.

    A comparative evaluation of deep and shallow approaches to the automatic detection of common grammatical errors

    This paper compares a deep and a shallow processing approach to the problem of classifying a sentence as grammatically well-formed or ill-formed. The deep processing approach uses the XLE LFG parser and English grammar: two versions are presented, one which uses the XLE directly to perform the classification, and another which uses a decision tree trained on features consisting of the XLE’s output statistics. The shallow processing approach predicts grammaticality based on n-gram frequency statistics: we present two versions, one which uses frequency thresholds and one which uses a decision tree trained on the frequencies of the rarest n-grams in the input sentence. We find that the use of a decision tree improves on the basic approach only for the deep parser-based approach. We also show that combining both the shallow and deep decision tree features is effective. Our evaluation is carried out using a large test set of grammatical and ungrammatical sentences. The ungrammatical test set is generated automatically by inserting grammatical errors into well-formed BNC sentences.
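
    The frequency-threshold variant of the shallow approach can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the function names, the use of bigrams, the whitespace tokenisation, and the threshold value are hypothetical and are not taken from the paper. The idea shown is only that a sentence is rejected when its rarest n-gram is seen too seldom in a reference corpus.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def build_counts(corpus, n=2):
    """Count n-grams over a reference corpus of well-formed sentences."""
    counts = Counter()
    for sentence in corpus:
        counts.update(ngrams(sentence.lower().split(), n))
    return counts


def rarest_frequency(sentence, counts, n=2):
    """Frequency of the least frequent n-gram in the sentence (0 if none)."""
    grams = ngrams(sentence.lower().split(), n)
    if not grams:
        return 0
    return min(counts[g] for g in grams)


def is_grammatical(sentence, counts, n=2, threshold=1):
    """Hypothetical rule: accept if the rarest n-gram meets the threshold."""
    return rarest_frequency(sentence, counts, n) >= threshold


# Toy reference corpus (an assumption; the paper uses far larger data).
corpus = ["the cat sat on the mat", "the dog sat on the rug"]
counts = build_counts(corpus)
print(is_grammatical("the cat sat on the rug", counts))
print(is_grammatical("cat the sat on mat", counts))
```

    The decision-tree variant described in the abstract would replace the fixed threshold with a classifier trained on the frequencies of the rarest n-grams; that step is not sketched here.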

    On the Similarities Between Native, Non-native and Translated Texts

    We present a computational analysis of three language varieties: native, advanced non-native, and translation. Our goal is to investigate the similarities and differences between non-native language productions and translations, contrasting both with native language. Using a collection of computational methods we establish three main results: (1) the three types of texts are easily distinguishable; (2) non-native language and translations are closer to each other than each of them is to native language; and (3) some of these characteristics depend on the source or native language, while others do not, reflecting, perhaps, unified principles that similarly affect translations and non-native language. Comment: ACL 2016, 12 pages.

    Treebanks gone bad: generating a treebank of ungrammatical English

    This paper describes how a treebank of ungrammatical sentences can be created from a treebank of well-formed sentences. The treebank creation procedure involves the automatic introduction of frequently occurring grammatical errors into the sentences in an existing treebank, and the minimal transformation of the analyses in the treebank so that they describe the newly created ill-formed sentences. Such a treebank can be used to test how well a parser is able to ignore grammatical errors in texts (as people can), and can be used to induce a grammar capable of analysing such sentences. This paper also demonstrates the first of these uses.
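
    The idea of automatically introducing errors into well-formed sentences can be sketched at the token level. This is only an illustration under stated assumptions: the three operations below (deleting, duplicating, and swapping tokens) and all function names are hypothetical, the abstract does not list the error types actually used, and the accompanying transformation of treebank analyses is not modelled.

```python
import random


def delete_token(tokens, i):
    """Drop the token at position i (a missing-word error)."""
    return tokens[:i] + tokens[i + 1:]


def duplicate_token(tokens, i):
    """Repeat the token at position i (an extra-word error)."""
    return tokens[:i + 1] + tokens[i:]


def swap_adjacent(tokens, i):
    """Swap tokens i and i+1 (a word-order error); requires i < len - 1."""
    swapped = list(tokens)
    swapped[i], swapped[i + 1] = swapped[i + 1], swapped[i]
    return swapped


def random_corrupt(sentence, rng):
    """Apply one randomly chosen operation at a random valid position."""
    tokens = sentence.split()
    if len(tokens) < 2:
        raise ValueError("need at least two tokens to corrupt")
    op = rng.choice([delete_token, duplicate_token, swap_adjacent])
    # swap_adjacent needs a right-hand neighbour, so restrict its range.
    limit = len(tokens) - 1 if op is swap_adjacent else len(tokens)
    return " ".join(op(tokens, rng.randrange(limit)))


rng = random.Random(0)  # fixed seed so runs are repeatable
print(random_corrupt("she reads the book", rng))
```

    Pairing each corrupted sentence with its original gives labelled grammatical and ungrammatical examples of the kind the surrounding abstracts use for training and evaluation.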

    Spoken language 'grammatical error correction'

    Spoken language ‘grammatical error correction’ (GEC) is an important mechanism to help learners of a foreign language, here English, improve their spoken grammar. GEC is challenging for non-native spoken language due to interruptions from disfluent speech events such as repetitions and false starts, and due to issues in strictly defining what is acceptable in spoken language. Furthermore, there is little labelled data to train models. One way to mitigate the impact of speech events is to use a disfluency detection (DD) model. Removing the detected disfluencies brings the speech transcript closer to written language, which has significantly more labelled training data. This paper considers two types of approaches to leveraging DD models to boost spoken GEC performance. The first is sequential: a separately trained DD model acts as a pre-processing module providing a more structured input to the GEC model. The second approach is to train DD and GEC models in an end-to-end fashion, simultaneously optimising both modules. Embeddings enable end-to-end models to have a richer information flow. Experimental results show that DD effectively regulates GEC input; end-to-end training works well when fine-tuned on limited labelled in-domain data; and improving DD by incorporating acoustic information helps improve spoken GEC.
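
    The sequential arrangement, in which disfluency removal runs as a pre-processing step before correction, can be sketched as a simple function pipeline. This is a toy illustration and not the paper's models: the rule-based filter below (a hypothetical filler list plus collapsing of immediate repetitions) merely stands in for a trained DD model, and `gec_model` is a placeholder for any callable that maps text to corrected text.

```python
# Hypothetical filler words; a trained DD model would not rely on a list.
FILLERS = {"um", "uh", "er"}


def remove_disfluencies(tokens):
    """Drop filler words and immediate repetitions of the previous token."""
    cleaned = []
    for token in tokens:
        if token in FILLERS:
            continue
        if cleaned and cleaned[-1] == token:
            continue
        cleaned.append(token)
    return cleaned


def sequential_gec(transcript, gec_model):
    """Run the DD stand-in first, then hand the cleaned text to the GEC step."""
    tokens = transcript.lower().split()
    cleaned = " ".join(remove_disfluencies(tokens))
    return gec_model(cleaned)


# Identity function used as a placeholder GEC model for the demonstration.
print(sequential_gec("I I um went to to the shop", lambda text: text))
```

    A rule like this also removes legitimate repetitions (for example "had had"), which is one reason a learned DD model is preferable. The end-to-end alternative described in the abstract trains both stages jointly and is not represented by this sketch.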

    On the automaticity of language processing

    People speak and listen to language all the time. Given this high frequency of use, it is often suggested that at least some aspects of language processing are highly overlearned and therefore occur “automatically”. Here we critically examine this suggestion. We first sketch a framework that views automaticity as a set of interrelated features of mental processes and a matter of degree rather than a single feature that is all-or-none. We then apply this framework to language processing. To do so, we carve up the processes involved in language use according to (a) whether language processing takes place in monologue or dialogue, (b) whether the individual is comprehending or producing language, (c) whether the spoken or written modality is used, and (d) the linguistic processing level at which they occur, that is, phonology, the lexicon, syntax, or conceptual processes. This exercise suggests that while conceptual processes are relatively non-automatic (as is usually assumed), there is also considerable evidence that syntactic and lexical lower-level processes are not fully automatic. We close by discussing entrenchment as a set of mechanisms underlying automatization.