374 research outputs found

    A comparative evaluation of deep and shallow approaches to the automatic detection of common grammatical errors

    Get PDF
    This paper compares a deep and a shallow processing approach to the problem of classifying a sentence as grammatically wellformed or ill-formed. The deep processing approach uses the XLE LFG parser and English grammar: two versions are presented, one which uses the XLE directly to perform the classification, and another one which uses a decision tree trained on features consisting of the XLEā€™s output statistics. The shallow processing approach predicts grammaticality based on n-gram frequency statistics: we present two versions, one which uses frequency thresholds and one which uses a decision tree trained on the frequencies of the rarest n-grams in the input sentence. We find that the use of a decision tree improves on the basic approach only for the deep parser-based approach. We also show that combining both the shallow and deep decision tree features is effective. Our evaluation is carried out using a large test set of grammatical and ungrammatical sentences. The ungrammatical test set is generated automatically by inserting grammatical errors into well-formed BNC sentences

    A comparison of parsing technologies for the biomedical domain

    Get PDF
    This paper reports on a number of experiments which are designed to investigate the extent to which current nlp resources are able to syntactically and semantically analyse biomedical text. We address two tasks: parsing a real corpus with a hand-built widecoverage grammar, producing both syntactic analyses and logical forms; and automatically computing the interpretation of compound nouns where the head is a nominalisation (e.g., hospital arrival means an arrival at hospital, while patient arrival means an arrival of a patient). For the former task we demonstrate that exible and yet constrained `preprocessing ' techniques are crucial to success: these enable us to use part-of-speech tags to overcome inadequate lexical coverage, and to `package up' complex technical expressions prior to parsing so that they are blocked from creating misleading amounts of syntactic complexity. We argue that the xml-processing paradigm is ideally suited for automatically preparing the corpus for parsing. For the latter task, we compute interpretations of the compounds by exploiting surface cues and meaning paraphrases, which in turn are extracted from the parsed corpus. This provides an empirical setting in which we can compare the utility of a comparatively deep parser vs. a shallow one, exploring the trade-o between resolving attachment ambiguities on the one hand and generating errors in the parses on the other. We demonstrate that a model of the meaning of compound nominalisations is achievable with the aid of current broad-coverage parsers

    Foreebank: Syntactic Analysis of Customer Support Forums

    Get PDF
    International audienceWe present a new treebank of English and French technical forum content which has been annotated for grammatical errors and phrase structure. This double annotation allows us to empirically measure the effect of errors on parsing performance. While it is slightly easier to parse the corrected versions of the forum sentences, the errors are not the main factor in making this kind of text hard to parse

    Assessing Grammatical Correctness in Language Learning

    Get PDF
    We present experiments on assessing the grammatical correctness of learnersā€™ answers in a language-learning System (references to the System, and the links to the released data and code are withheld for anonymity). In particular, we explore the problem of detecting alternative-correct answers: when more than one inflected form of a lemma fits syntactically and semantically in a given context. We approach the problem with the methods for grammatical error detection (GED), since we hypothesize that models for detecting grammatical mistakes can assess the correctness of potential alternative answers in a learning setting. Due to the paucity of training data, we explore the ability of pre-trained BERT to detect grammatical errors and then fine-tune it using synthetic training data. In this work, we focus on errors in inflection. Our experiments show a. that pre-trained BERT performs worse at detecting grammatical irregularities for Russian than for English; b. that fine-tuned BERT yields promising results on assessing the correctness of grammatical exercises; and c. establish a new benchmark for Russian. To further investigate its performance, we compare fine-tuned BERT with one of the state-of-the-art models for GED (Bell et al., 2019) on our dataset and RULEC-GEC (Rozovskaya and Roth, 2019). We release the manually annotated learner dataset, used for testing, for general use.Peer reviewe

    Memory-Based Shallow Parsing

    Full text link
    We present memory-based learning approaches to shallow parsing and apply these to five tasks: base noun phrase identification, arbitrary base phrase recognition, clause detection, noun phrase parsing and full parsing. We use feature selection techniques and system combination methods for improving the performance of the memory-based learner. Our approach is evaluated on standard data sets and the results are compared with that of other systems. This reveals that our approach works well for base phrase identification while its application towards recognizing embedded structures leaves some room for improvement

    GumDrop at the DISRPT2019 Shared Task: A Model Stacking Approach to Discourse Unit Segmentation and Connective Detection

    Full text link
    In this paper we present GumDrop, Georgetown University's entry at the DISRPT 2019 Shared Task on automatic discourse unit segmentation and connective detection. Our approach relies on model stacking, creating a heterogeneous ensemble of classifiers, which feed into a metalearner for each final task. The system encompasses three trainable component stacks: one for sentence splitting, one for discourse unit segmentation and one for connective detection. The flexibility of each ensemble allows the system to generalize well to datasets of different sizes and with varying levels of homogeneity.Comment: Proceedings of Discourse Relation Parsing and Treebanking (DISRPT2019

    Memory-Based Grammatical Relation Finding

    Get PDF

    Detecting grammatical errors with treebank-induced, probabilistic parsers

    Get PDF
    Today's grammar checkers often use hand-crafted rule systems that define acceptable language. The development of such rule systems is labour-intensive and has to be repeated for each language. At the same time, grammars automatically induced from syntactically annotated corpora (treebanks) are successfully employed in other applications, for example text understanding and machine translation. At first glance, treebank-induced grammars seem to be unsuitable for grammar checking as they massively over-generate and fail to reject ungrammatical input due to their high robustness. We present three new methods for judging the grammaticality of a sentence with probabilistic, treebank-induced grammars, demonstrating that such grammars can be successfully applied to automatically judge the grammaticality of an input string. Our best-performing method exploits the differences between parse results for grammars trained on grammatical and ungrammatical treebanks. The second approach builds an estimator of the probability of the most likely parse using grammatical training data that has previously been parsed and annotated with parse probabilities. If the estimated probability of an input sentence (whose grammaticality is to be judged by the system) is higher by a certain amount than the actual parse probability, the sentence is flagged as ungrammatical. The third approach extracts discriminative parse tree fragments in the form of CFG rules from parsed grammatical and ungrammatical corpora and trains a binary classifier to distinguish grammatical from ungrammatical sentences. The three approaches are evaluated on a large test set of grammatical and ungrammatical sentences. The ungrammatical test set is generated automatically by inserting common grammatical errors into the British National Corpus. The results are compared to two traditional approaches, one that uses a hand-crafted, discriminative grammar, the XLE ParGram English LFG, and one based on part-of-speech n-grams. In addition, the baseline methods and the new methods are combined in a machine learning-based framework, yielding further improvements
    • ā€¦
    corecore