
    Using parse features for preposition selection and error detection

    We evaluate the effect of adding parse features to a leading model of preposition usage. Results show a significant improvement in the preposition selection task on native speaker text and a modest increase in precision and recall in an ESL error detection task. Analysis of the parser output indicates that it is robust enough in the face of noisy non-native writing to extract useful information.
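
    The abstract does not spell out which parse features are used; as a purely illustrative sketch (using spaCy, which is not the toolkit of the paper), the snippet below shows the kind of parse-derived context a preposition-usage model can draw on, such as the preposition's syntactic head and its object.

    # Illustrative only: parse-based context for each preposition via spaCy.
    # The paper's actual feature set and parser are not reproduced here.
    import spacy

    nlp = spacy.load("en_core_web_sm")  # assumes this English model is installed

    def preposition_parse_features(text):
        doc = nlp(text)
        features = []
        for tok in doc:
            if tok.pos_ == "ADP":  # prepositions
                obj = next((c for c in tok.children if c.dep_ == "pobj"), None)
                features.append({
                    "preposition": tok.text.lower(),
                    "head_lemma": tok.head.lemma_,  # word the PP attaches to
                    "head_pos": tok.head.pos_,
                    "object_lemma": obj.lemma_ if obj else None,
                })
        return features

    print(preposition_parse_features("He depends of his parents ."))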

    An Empirical Comparison of Parsing Methods for Stanford Dependencies

    Stanford typed dependencies are a widely desired representation of natural language sentences, but parsing is one of the major computational bottlenecks in text analysis systems. In light of the evolving definition of the Stanford dependencies and developments in statistical dependency parsing algorithms, this paper revisits the question of Cer et al. (2010): what is the tradeoff between accuracy and speed in obtaining Stanford dependencies in particular? We also explore the effects of input representations on this tradeoff: part-of-speech tags, the novel use of an alternative dependency representation as input, and distributional representations of words. We find that direct dependency parsing is a more viable solution than it was found to be in the past. An accompanying software release can be found at http://www.ark.cs.cmu.edu/TBSD (13 pages, 2 figures).

    Universal Dependencies for Learner English

    We introduce the Treebank of Learner English (TLE), the first publicly available syntactic treebank for English as a Second Language (ESL). The TLE provides manually annotated POS tags and Universal Dependency (UD) trees for 5,124 sentences from the Cambridge First Certificate in English (FCE) corpus. The UD annotations are tied to a pre-existing error annotation of the FCE, whereby full syntactic analyses are provided for both the original and error-corrected versions of each sentence. Further, we delineate ESL annotation guidelines that allow for consistent syntactic treatment of ungrammatical English. Finally, we benchmark POS tagging and dependency parsing performance on the TLE dataset and measure the effect of grammatical errors on parsing accuracy. We envision the treebank to support a wide range of linguistic and computational research on second language acquisition as well as automatic processing of ungrammatical language. This work was supported by the Center for Brains, Minds and Machines (CBMM), funded by NSF STC award CCF-1231216.
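
    The parsing benchmark mentioned above is typically reported as unlabeled and labeled attachment scores (UAS/LAS). As a minimal sketch (the data structures are illustrative, not the TLE's actual format), these metrics can be computed from gold and predicted head/label pairs as follows.

    # UAS = fraction of tokens with the correct head;
    # LAS = fraction of tokens with the correct head and dependency label.
    def uas_las(gold, predicted):
        """gold/predicted: lists of (head_index, dependency_label), one per token."""
        assert len(gold) == len(predicted)
        correct_heads = correct_labeled = 0
        for (g_head, g_label), (p_head, p_label) in zip(gold, predicted):
            if g_head == p_head:
                correct_heads += 1
                if g_label == p_label:
                    correct_labeled += 1
        n = len(gold)
        return correct_heads / n, correct_labeled / n

    # Toy 4-token example: one wrong head, one wrong label.
    gold = [(2, "nsubj"), (0, "root"), (4, "det"), (2, "obj")]
    pred = [(2, "nsubj"), (0, "root"), (2, "det"), (2, "nmod")]
    uas, las = uas_las(gold, pred)
    print(f"UAS={uas:.2f}, LAS={las:.2f}")  # UAS=0.75, LAS=0.50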

    Correcting Preposition Errors in Learner English Using Error Case Frames and Feedback Messages

    This paper presents a novel framework called error case frames for correcting preposition errors. Error case frames are case frames specially designed for describing and correcting preposition errors. Their most distinctive advantage is that they can correct errors with feedback messages explaining why the preposition is erroneous. This paper proposes a method for automatically generating them by comparing learner and native corpora. Experiments show that (i) automatically generated error case frames achieve performance comparable to conventional methods; (ii) error case frames are intuitively interpretable and can be manually modified to improve them; (iii) feedback messages provided by error case frames are effective in language learning assistance. Given these advantages and the fact that it has been difficult to provide feedback messages with automatically generated rules, error case frames are likely to become one of the major approaches to preposition error correction.
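
    The abstract describes error case frames only at a high level; the sketch below is one hypothetical way to represent such a frame as a data structure pairing a lexical context with a correction and a feedback message. It is not the paper's actual formalism or its corpus-based induction method.

    # Hypothetical representation of an error case frame; illustrative only.
    from dataclasses import dataclass

    @dataclass
    class ErrorCaseFrame:
        verb: str          # governing verb of the prepositional phrase
        wrong_prep: str    # preposition the learner used
        correct_prep: str  # preposition preferred in native usage ("" = drop it)
        feedback: str      # explanation shown to the learner

    FRAMES = [
        ErrorCaseFrame("depend", "of", "on",
                       "'depend' is followed by 'on', not 'of'."),
        ErrorCaseFrame("discuss", "about", "",
                       "'discuss' takes a direct object; no preposition is needed."),
    ]

    def suggest(verb, prep):
        for frame in FRAMES:
            if frame.verb == verb and frame.wrong_prep == prep:
                return frame.correct_prep, frame.feedback
        return None

    print(suggest("depend", "of"))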

    Problems in Evaluating Grammatical Error Detection Systems

    Many evaluation issues for grammatical error detection have previously been overlooked, making it hard to draw meaningful comparisons between different approaches, even when they are evaluated on the same corpus. To begin with, the three-way contingency between a writer's sentence, the annotator's correction, and the system's output makes evaluation more complex than in some other NLP tasks, which we address by presenting an intuitive evaluation scheme. Of particular importance to error detection is the skew of the data (the low frequency of errors as compared to non-errors), which distorts some traditional measures of performance and limits their usefulness, leading us to recommend the reporting of raw measurements (true positives, false negatives, false positives, true negatives). Other issues that are particularly vexing for error detection concern defining these raw measurements: specifying the size or scope of an error, properly treating errors as graded rather than discrete phenomena, and counting non-errors. We discuss recommendations for best practices with regard to reporting the results of system evaluation in these cases, recommendations which depend upon making clear one's assumptions and applications for error detection. By highlighting the problems with current error detection evaluation, the field will be better able to move forward.
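
    To make the skew argument concrete, a small sketch (with invented counts, not from the paper) shows how accuracy can look excellent for a detector that flags nothing, while precision and recall computed from the recommended raw measurements expose the difference.

    # Invented counts for illustration: 10,000 tokens, only 200 erroneous (2%).
    def metrics(tp, fp, fn, tn):
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        accuracy = (tp + tn) / (tp + fp + fn + tn)
        return precision, recall, f1, accuracy

    detector = dict(tp=120, fp=60, fn=80, tn=9740)   # a real error detector
    do_nothing = dict(tp=0, fp=0, fn=200, tn=9800)   # never flags anything

    for name, counts in [("detector", detector), ("flag nothing", do_nothing)]:
        p, r, f1, acc = metrics(**counts)
        print(f"{name}: P={p:.2f} R={r:.2f} F1={f1:.2f} accuracy={acc:.3f}")
    # The do-nothing system still scores 98% accuracy, but its P/R/F1 are zero.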

    Towards Orthographic and Grammatical Clinical Text Correction: a First Approach

    Grammatical Error Correction (GEC) is a subfield of Natural Language Processing that aims to automatically correct texts containing errors of spelling, punctuation or grammar. So far, it has mainly focused on texts produced by second language learners, mostly in English. This Master's Thesis describes a first approach to Grammatical Error Correction for Spanish health records, a specific field that has not been explored much until now, neither for Spanish in general nor for the clinical domain in particular. For this purpose, the IMEC corpus (Informes Médicos en Español Corregidos), a manually corrected parallel collection of electronic health records, is introduced. The corpus has been automatically annotated with the ERRANT toolkit, which specializes in the automatic annotation of GEC parallel corpora and was adapted to Spanish for this task. Furthermore, experiments using neural networks and data augmentation are presented and compared with a baseline system also created specifically for this task.
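
    For reference, the released ERRANT toolkit exposes a small Python API for extracting and classifying edits from parallel original/corrected sentences. The snippet below uses the stock English pipeline (the Spanish adaptation described in the thesis is not part of the public release), and the example sentences are invented.

    # Stock ERRANT usage for English (pip install errant, plus a spaCy model).
    import errant

    annotator = errant.load("en")
    orig = annotator.parse("The patient have a fever since three days .")
    cor = annotator.parse("The patient has had a fever for three days .")

    for edit in annotator.annotate(orig, cor):
        # Each edit records the original span, the corrected span and an error type.
        print(edit.o_start, edit.o_end, repr(edit.o_str),
              "->", repr(edit.c_str), edit.type)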