A comparative evaluation of deep and shallow approaches to the automatic detection of common grammatical errors
This paper compares a deep and a shallow processing approach to the problem of classifying a sentence as grammatically well-formed or ill-formed. The deep processing approach uses the XLE LFG parser and English grammar: two versions are presented, one which uses the XLE directly to perform the classification, and another which uses a decision tree trained on features consisting of the XLE’s output statistics. The shallow processing approach predicts grammaticality based on n-gram frequency statistics: we present two versions, one which uses frequency thresholds and one which uses a decision tree trained on the frequencies of the rarest n-grams in the input sentence. We find that the use of a decision tree improves on the basic approach only for the deep parser-based approach. We also show that combining both the shallow and deep decision tree features is effective. Our evaluation is carried out using a large test set of grammatical and ungrammatical sentences. The ungrammatical test set is generated automatically by inserting grammatical errors into well-formed BNC sentences.
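The frequency-threshold variant of the shallow approach described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`train_counts`, `is_grammatical`), whitespace tokenization, the choice of bigrams, and the threshold value are all assumptions made for illustration. The idea shown is only that a sentence is flagged as ill-formed when its rarest n-gram falls below a frequency threshold in a reference corpus.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def train_counts(corpus, n):
    """Count n-gram frequencies over a reference corpus of sentences.

    Tokenization by lowercasing and whitespace splitting is an assumption.
    """
    counts = Counter()
    for sentence in corpus:
        counts.update(ngrams(sentence.lower().split(), n))
    return counts


def is_grammatical(sentence, counts, n=2, threshold=1):
    """Hypothetical threshold rule: accept the sentence only if its
    rarest n-gram occurs at least `threshold` times in the corpus."""
    grams = ngrams(sentence.lower().split(), n)
    if not grams:
        # Too short to contain any n-gram; nothing to judge.
        return True
    rarest = min(counts[g] for g in grams)
    return rarest >= threshold


# Toy reference corpus (illustrative only).
corpus = ["the cat sat on the mat", "the dog sat on the rug"]
counts = train_counts(corpus, 2)
print(is_grammatical("the cat sat on the rug", counts))
print(is_grammatical("cat the sat on mat", counts))
```

The decision-tree variant mentioned in the abstract would, under the same assumptions, feed the frequencies of the rarest n-grams to a learned classifier instead of comparing them with a fixed threshold.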
Wronging a Right: Generating Better Errors to Improve Grammatical Error Detection
Grammatical error correction, like other machine learning tasks, greatly
benefits from large quantities of high quality training data, which is
typically expensive to produce. While writing a program to automatically
generate realistic grammatical errors would be difficult, one could learn the
distribution of naturally occurring errors and attempt to introduce them into
other datasets. Initial work on inducing errors in this way using statistical
machine translation has shown promise; we investigate cheaply constructing
synthetic samples, given a small corpus of human-annotated data, using an
off-the-rack attentive sequence-to-sequence model and a straightforward
post-processing procedure. Our approach yields error-filled artificial data
that helps a vanilla bi-directional LSTM to outperform the previous state of
the art at grammatical error detection, and a previously introduced model to
gain further improvements of over 5% F0.5 score. When attempting to
determine if a given sentence is synthetic, a human annotator at best achieves
39.39 F1 score, indicating that our model generates mostly human-like
instances.
Comment: Accepted as a short paper at EMNLP 201
GenERRate: generating errors for use in grammatical error detection
This paper explores the issue of automatically generated ungrammatical data and its use in error detection, with a focus on the task of classifying a sentence as grammatical or ungrammatical. We present an error generation tool called GenERRate and show how GenERRate can be used to improve the performance of a classifier on learner data. We describe initial attempts to replicate Cambridge Learner Corpus errors using GenERRate.
An Analysis of Source-Side Grammatical Errors in NMT
The quality of Neural Machine Translation (NMT) has been shown to
significantly degrade when confronted with source-side noise. We present the
first large-scale study of state-of-the-art English-to-German NMT on real
grammatical noise, by evaluating on several Grammar Correction corpora. We
present methods for evaluating NMT robustness without true references, and we
use them for extensive analysis of the effects that different grammatical
errors have on the NMT output. We also introduce a technique for visualizing
the divergence distribution caused by a source-side error, which allows for
additional insights.
Comment: Accepted and to be presented at BlackboxNLP 201