Automatic annotation of error types for grammatical error correction
Grammatical Error Correction (GEC) is the task of automatically detecting and correcting
grammatical errors in text. Although previous work has focused on developing systems that
target specific error types, the current state of the art uses machine translation to correct all error
types simultaneously. A significant disadvantage of this approach is that machine translation
does not produce annotated output and so error type information is lost. This means we can only
evaluate a system in terms of overall performance and cannot carry out a more detailed analysis
of different aspects of system performance.
In this thesis, I develop a system to automatically annotate parallel original and corrected
sentence pairs with explicit edits and error types. In particular, I first extend the Damerau-
Levenshtein alignment algorithm to make use of linguistic information when aligning parallel
sentences, and supplement this alignment with a set of merging rules to handle multi-token
edits. The output from this algorithm surpasses other edit extraction approaches in terms of
approximating human edit annotations and is the current state of the art. Having extracted the
edits, I next classify them according to a new rule-based error type framework that depends only
on automatically obtained linguistic properties of the data, such as part-of-speech tags. This
framework was inspired by existing frameworks, and human judges rated the appropriateness
of the predicted error types as ‘Good’ (85%) or ‘Acceptable’ (10%) in a random sample of 200
edits. The whole system is called the ERRor ANnotation Toolkit (ERRANT) and is the first
toolkit capable of automatically annotating parallel sentences with error types.
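
The alignment and merging steps described above can be illustrated with a minimal sketch. This is not ERRANT's implementation: the cost values, the hand-supplied part-of-speech tags, the function names (sub_cost, align, merge_edits) and the "merge every run of adjacent non-matching operations" rule are all assumptions chosen for illustration. The sketch only shows how a Damerau-Levenshtein alignment over tokens can be biased by linguistic information so that related tokens are paired with each other.

```python
from typing import List, Tuple

Token = Tuple[str, str]  # (word, POS tag); tags are supplied by hand in this sketch


def sub_cost(a: Token, b: Token) -> float:
    """Hypothetical substitution cost: cheaper when the tokens look related."""
    if a[0] == b[0]:
        return 0.0
    cost = 1.0
    if a[0].lower() == b[0].lower():
        cost -= 0.5   # case-only difference
    if a[1] == b[1]:
        cost -= 0.25  # same part of speech
    return cost


def align(orig: List[Token], cor: List[Token]) -> List[str]:
    """Return a list of operations: M(atch), S(ubstitute), D(elete), I(nsert), T(ranspose)."""
    n, m = len(orig), len(cor)
    cost = [[0.0] * (m + 1) for _ in range(n + 1)]
    op = [[""] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0], op[i][0] = float(i), "D"
    for j in range(1, m + 1):
        cost[0][j], op[0][j] = float(j), "I"
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = sub_cost(orig[i - 1], cor[j - 1])
            candidates = [
                (cost[i - 1][j - 1] + s, "M" if s == 0 else "S"),
                (cost[i - 1][j] + 1.0, "D"),
                (cost[i][j - 1] + 1.0, "I"),
            ]
            # Damerau extension: two adjacent tokens swapped.
            if (i > 1 and j > 1 and orig[i - 1][0] == cor[j - 2][0]
                    and orig[i - 2][0] == cor[j - 1][0]):
                candidates.append((cost[i - 2][j - 2] + 1.0, "T"))
            cost[i][j], op[i][j] = min(candidates, key=lambda c: c[0])
    # Trace back from the bottom-right cell to recover the operation sequence.
    ops, i, j = [], n, m
    while i > 0 or j > 0:
        o = op[i][j]
        ops.append(o)
        if o in ("M", "S"):
            i, j = i - 1, j - 1
        elif o == "D":
            i -= 1
        elif o == "I":
            j -= 1
        else:  # "T"
            i, j = i - 2, j - 2
    return ops[::-1]


def merge_edits(ops: List[str]) -> List[Tuple[int, int, int, int]]:
    """Merge each run of adjacent non-match operations into one edit span
    (orig_start, orig_end, cor_start, cor_end). A simplifying assumption."""
    edits, i, j, start = [], 0, 0, None
    for o in ops:
        if o == "M":
            if start is not None:
                edits.append((start[0], i, start[1], j))
                start = None
            i, j = i + 1, j + 1
            continue
        if start is None:
            start = (i, j)
        if o == "S":
            i, j = i + 1, j + 1
        elif o == "D":
            i += 1
        elif o == "I":
            j += 1
        else:  # "T"
            i, j = i + 2, j + 2
    if start is not None:
        edits.append((start[0], i, start[1], j))
    return edits


orig = [("He", "PRON"), ("went", "VERB"), ("home", "NOUN")]
cor = [("He", "PRON"), ("has", "AUX"), ("gone", "VERB"), ("home", "NOUN")]
ops = align(orig, cor)
edits = merge_edits(ops)
```

In this example the shared VERB tag makes pairing "went" with "gone" cheaper than pairing it with "has", so the alignment inserts "has" and substitutes "went" with "gone". Merging the two adjacent operations then yields a single multi-token edit, "went" to "has gone".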
I demonstrate the value of ERRANT by applying it to the system output produced by the participants of the CoNLL-2014 shared task, and carry out a detailed error type analysis of
system performance for the first time. I also develop a simple language-model-based approach
to GEC that does not require annotated training data, and show how it can be improved using
ERRANT error types.