Using parse features for preposition selection and error detection
We evaluate the effect of adding parse features to a leading model of preposition usage. Results show a significant improvement in the preposition selection task on
native speaker text and a modest increment in precision and recall in an ESL error detection task. Analysis of the parser output indicates that it is robust enough in the face
of noisy non-native writing to extract useful information.
An Empirical Comparison of Parsing Methods for Stanford Dependencies
Stanford typed dependencies are a widely desired representation of natural
language sentences, but parsing is one of the major computational bottlenecks
in text analysis systems. In light of the evolving definition of the Stanford
dependencies and developments in statistical dependency parsing algorithms,
this paper revisits the question of Cer et al. (2010): what is the tradeoff
between accuracy and speed in obtaining Stanford dependencies in particular? We
also explore the effects of input representations on this tradeoff:
part-of-speech tags, the novel use of an alternative dependency representation
as input, and distributional representations of words. We find that direct
dependency parsing is a more viable solution than it was found to be in the
past. An accompanying software release can be found at:
http://www.ark.cs.cmu.edu/TBSD
Comment: 13 pages, 2 figures
Universal Dependencies for Learner English
We introduce the Treebank of Learner English (TLE), the first publicly available syntactic treebank for English as a Second Language (ESL). The TLE provides manually annotated POS tags and Universal Dependency (UD) trees for 5,124 sentences from the Cambridge First Certificate in English (FCE) corpus. The UD annotations are tied to a pre-existing error annotation of the FCE, whereby full syntactic analyses are provided for both the original and error-corrected versions of each sentence. We further delineate ESL annotation guidelines that allow for consistent syntactic treatment of ungrammatical English. Finally, we benchmark POS tagging and dependency parsing performance on the TLE dataset and measure the effect of grammatical errors on parsing accuracy. We envision the treebank supporting a wide range of linguistic and computational research on second language acquisition as well as automatic processing of ungrammatical language. This work was supported by the Center for Brains, Minds and Machines (CBMM), funded by NSF STC award CCF-1231216.
Correcting Preposition Errors in Learner English Using Error Case Frames and Feedback Messages
This paper presents a novel framework called error case frames for correcting preposition errors. They are case frames specially designed for describing and correcting preposition errors. Their most distinctive advantage is that they can correct errors with feedback messages explaining why the preposition is erroneous. This paper proposes a method for automatically generating them by comparing learner and native corpora. Experiments show that (i) automatically generated error case frames achieve a performance comparable to conventional methods; (ii) error case frames are intuitively interpretable and can be manually modified to improve them; (iii) feedback messages provided by error case frames are effective in language learning assistance. Considering these advantages and the fact that it has been difficult to provide feedback messages with automatically generated rules, error case frames will likely be one of the major approaches for preposition error correction.
Problems in Evaluating Grammatical Error Detection Systems
Many evaluation issues for grammatical error detection have previously been overlooked, making it hard to draw meaningful comparisons between different approaches, even when they are evaluated on the same corpus. To begin with, the three-way contingency between a writer's sentence, the annotator's correction, and the system's output makes evaluation more complex than in some other NLP tasks, which we address by presenting an intuitive evaluation scheme. Of particular importance to error detection is the skew of the data (the low frequency of errors as compared to non-errors), which distorts some traditional measures of performance and limits their usefulness, leading us to recommend the reporting of raw measurements (true positives, false negatives, false positives, true negatives). Other issues that are particularly vexing for error detection focus on defining these raw measurements: specifying the size or scope of an error, properly treating errors as graded rather than discrete phenomena, and counting non-errors. We discuss recommendations for best practices with regard to reporting the results of system evaluation for these cases, recommendations which depend upon making clear one's assumptions and applications for error detection. By highlighting the problems with current error detection evaluation, the field will be better able to move forward.
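The effect of skew on a traditional measure can be illustrated with a minimal sketch. The counts below are invented for illustration and are not taken from the paper, and the names `Counts`, `accuracy`, `precision`, and `recall` are hypothetical helpers, not part of any evaluation toolkit described in the abstract.

```python
from dataclasses import dataclass


@dataclass
class Counts:
    """Raw measurements for one system on one corpus (hypothetical container)."""
    tp: int  # errors correctly flagged
    fp: int  # non-errors wrongly flagged
    fn: int  # errors missed
    tn: int  # non-errors correctly left alone


def accuracy(c: Counts) -> float:
    total = c.tp + c.fp + c.fn + c.tn
    return (c.tp + c.tn) / total if total else 0.0


def precision(c: Counts) -> float:
    flagged = c.tp + c.fp
    return c.tp / flagged if flagged else 0.0


def recall(c: Counts) -> float:
    errors = c.tp + c.fn
    return c.tp / errors if errors else 0.0


# Invented example: 1000 preposition tokens, of which 50 are errors.
never_flags = Counts(tp=0, fp=0, fn=50, tn=950)    # a system that flags nothing
some_system = Counts(tp=30, fp=20, fn=20, tn=930)  # a system that finds 30 errors

for name, c in [("never_flags", never_flags), ("some_system", some_system)]:
    print(name, c, f"acc={accuracy(c):.2f} P={precision(c):.2f} R={recall(c):.2f}")
```

Under these assumed counts, the system that never flags anything still has high accuracy because non-errors dominate, while its recall is zero. Reporting the four raw counts alongside any derived measure keeps that difference visible.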
Towards Orthographic and Grammatical Clinical Text Correction: a First Approach
Grammatical Error Correction (GEC) is a subfield of Natural Language Processing that aims to automatically correct texts that contain spelling, punctuation or grammar errors. So far, it has mainly focused on texts produced by second language learners, mostly in English. This Master's Thesis describes a first approach to Grammatical Error Correction for Spanish health records. This specific field has not been explored much until now, neither for Spanish in general nor for the clinical domain specifically. For this purpose, the corpus IMEC (Informes Médicos en Español Corregidos), a new manually corrected parallel collection of Electronic Health Records, is introduced. This corpus has been automatically annotated using the toolkit ERRANT, specialized in the automatic annotation of GEC parallel corpora, which was adapted to Spanish for this task. Furthermore, some experiments using neural networks and data augmentation are described and compared with a baseline system also created specifically for this task.