How do you correct run-on sentences? It's not as easy as it seems
Run-on sentences are common grammatical mistakes but little research has
tackled this problem to date. This work introduces two machine learning models
to correct run-on sentences that outperform leading methods for related tasks,
punctuation restoration and whole-sentence grammatical error correction. Due to
the limited annotated data for this error, we experiment with artificially
generating training data from clean newswire text. Our findings suggest
artificial training data is viable for this task. We discuss implications for
correcting run-ons and other types of mistakes that have low coverage in
error-annotated corpora.
Comment: To appear in W-NUT 2018: Workshop on Noisy User-generated Text (at EMNLP 2018)
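The artificial-data idea above can be sketched as follows; the function name and the corruption choices (comma splice vs. fused sentence) are illustrative, not necessarily the paper's exact procedure:

```python
import random

def make_runons(sentences, seed=0):
    """Fuse consecutive clean sentences into artificial run-ons.

    Returns (corrupted, corrected) pairs: the corrupted side joins two
    sentences with either a comma splice or no punctuation at all; the
    corrected side keeps the original sentence boundary.
    """
    rng = random.Random(seed)
    pairs = []
    for a, b in zip(sentences[::2], sentences[1::2]):
        corrected = f"{a} {b}"
        joiner = ", " if rng.random() < 0.5 else " "
        # Drop the sentence-final period and lowercase the next sentence.
        fused = a.rstrip(".") + joiner + b[0].lower() + b[1:]
        pairs.append((fused, corrected))
    return pairs

pairs = make_runons(["The sky darkened.", "Rain began to fall."])
# pairs[0][1] == "The sky darkened. Rain began to fall."
```

Running this over clean newswire text yields arbitrarily many training pairs for an error type that is rare in annotated corpora.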
Constructing a Family Tree of Ten Indo-European Languages with Delexicalized Cross-linguistic Transfer Patterns
It is reasonable to hypothesize that the divergence patterns formulated by
historical linguists and typologists reflect constraints on human languages,
and are thus consistent with Second Language Acquisition (SLA) in a certain
way. In this paper, we validate this hypothesis on ten Indo-European languages.
We formalize the delexicalized transfer as interpretable tree-to-string and
tree-to-tree patterns which can be automatically induced from web data by
applying neural syntactic parsing and grammar induction technologies. This
allows us to quantitatively probe cross-linguistic transfer and extend
inquiries of SLA. We extend existing works which utilize mixed features and
support the agreement between delexicalized cross-linguistic transfer and the
phylogenetic structure resulting from the historical-comparative paradigm.
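A minimal sketch of delexicalization, the step that makes such patterns comparable across languages (this simplifies the paper's tree-to-string patterns to flat POS sequences; the tag set and function-word treatment are assumptions):

```python
def delexicalize(tagged_tokens, open_classes=frozenset({"NOUN", "VERB", "ADJ", "ADV"})):
    """Replace open-class (content) words with their POS tags so a
    pattern generalizes across vocabularies; closed-class function
    words are kept, since they carry the structural signal that
    transfers cross-lingually."""
    return [tag if tag in open_classes else word
            for word, tag in tagged_tokens]

pattern = delexicalize([("the", "DET"), ("cat", "NOUN"), ("sleeps", "VERB")])
# → ["the", "NOUN", "VERB"]
```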
Parallel Data Augmentation for Formality Style Transfer
The main barrier to progress in the task of Formality Style Transfer is the
inadequacy of training data. In this paper, we study how to augment parallel
data and propose novel and simple data augmentation methods for this task to
obtain useful sentence pairs with easily accessible models and systems.
Experiments demonstrate that our augmented parallel data substantially improves
formality style transfer when used to pre-train the model, leading to
state-of-the-art results on the GYAFC benchmark dataset.
Comment: Accepted by ACL 2020. arXiv admin note: text overlap with arXiv:1909.0600
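One simple way to manufacture parallel pairs from formal text alone can be sketched as below; this rule-based corruption is only an illustration of the augmentation idea, not the paper's specific methods:

```python
# Hypothetical contraction table; a real system would use a fuller list.
CONTRACTIONS = {"do not": "don't", "cannot": "can't", "I am": "I'm"}

def informalize(sentence):
    """Derive a rough informal variant of a formal sentence by applying
    contractions, dropping final punctuation, and lowercasing, yielding
    an (informal, formal) pair usable for pre-training."""
    s = sentence
    for full, contracted in CONTRACTIONS.items():
        s = s.replace(full, contracted)
    return s.rstrip(".!").lower()

pair = (informalize("I am sure we cannot attend."), "I am sure we cannot attend.")
# pair[0] == "i'm sure we can't attend"
```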
A Multilayer Convolutional Encoder-Decoder Neural Network for Grammatical Error Correction
We improve automatic correction of grammatical, orthographic, and collocation
errors in text using a multilayer convolutional encoder-decoder neural network.
The network is initialized with embeddings that make use of character N-gram
information to better suit this task. When evaluated on common benchmark test
data sets (CoNLL-2014 and JFLEG), our model substantially outperforms all prior
neural approaches on this task as well as strong statistical machine
translation-based systems with neural and task-specific features trained on the
same data. Our analysis shows the superiority of convolutional neural networks
over recurrent neural networks such as long short-term memory (LSTM) networks
in capturing the local context via attention, and thereby improving the
coverage in correcting grammatical errors. By ensembling multiple models, and
incorporating an N-gram language model and edit features via rescoring, our
novel method becomes the first neural approach to outperform the current
state-of-the-art statistical machine translation-based approach, both in terms
of grammaticality and fluency.
Comment: 8 pages, 3 figures, In Proceedings of AAAI 2018
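The character-N-gram embedding initialization works along the lines of fastText-style subword features; a sketch of the extraction step (boundary markers and n-gram range are conventional choices, not confirmed details of the paper):

```python
def char_ngrams(word, n_min=3, n_max=5):
    """Extract character n-grams with boundary markers; a word's
    embedding can then be built from its n-gram vectors, so rare and
    misspelled words still receive informative representations."""
    w = f"<{word}>"  # mark word boundaries
    return [w[i:i + n] for n in range(n_min, n_max + 1)
            for i in range(len(w) - n + 1)]

print(char_ngrams("cat"))
# → ['<ca', 'cat', 'at>', '<cat', 'cat>', '<cat>']
```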
Neural Language Correction with Character-Based Attention
Natural language correction has the potential to help language learners
improve their writing skills. While approaches with separate classifiers for
different error types have high precision, they do not flexibly handle errors
such as redundancy or non-idiomatic phrasing. On the other hand, word and
phrase-based machine translation methods are not designed to cope with
orthographic errors, and have recently been outpaced by neural models.
Motivated by these issues, we present a neural network-based approach to
language correction. The core component of our method is an encoder-decoder
recurrent neural network with an attention mechanism. By operating at the
character level, the network avoids the problem of out-of-vocabulary words. We
illustrate the flexibility of our approach on a dataset of noisy, user-generated
text collected from an English learner forum. When combined with a language
model, our method achieves a state-of-the-art F0.5 score on the CoNLL 2014
Shared Task. We further demonstrate that training the network on additional
data with synthesized errors can improve performance.
Comment: 10 pages
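Character-level error synthesis of the kind mentioned above can be sketched as follows; the operations and rate are illustrative assumptions, not the paper's exact noise model:

```python
import random

def add_char_noise(text, rate=0.05, seed=13):
    """Synthesize noisy training text by randomly deleting, duplicating,
    or swapping characters in clean input."""
    rng = random.Random(seed)
    chars = list(text)
    out, i = [], 0
    while i < len(chars):
        c = chars[i]
        if c.isalpha() and rng.random() < rate:
            op = rng.choice(["delete", "duplicate", "swap"])
            if op == "duplicate":
                out.extend([c, c])
            elif op == "swap" and i + 1 < len(chars):
                out.extend([chars[i + 1], c])
                i += 1  # the swapped neighbor is already emitted
            elif op == "delete":
                pass
            else:  # swap requested at end of string: keep the character
                out.append(c)
        else:
            out.append(c)
        i += 1
    return "".join(out)
```

Pairing the noised output with the original clean text gives extra (source, target) examples for the character-level model.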
Personalizing Grammatical Error Correction: Adaptation to Proficiency Level and L1
Grammar error correction (GEC) systems have become ubiquitous in a variety of
software applications, and have started to approach human-level performance for
some datasets. However, very little is known about how to efficiently
personalize these systems to the user's characteristics, such as their
proficiency level and first language, or to emerging domains of text. We
present the first results on adapting a general-purpose neural GEC system to
both the proficiency level and the first language of a writer, using only a few
thousand annotated sentences. Our study is the broadest of its kind, covering
five proficiency levels and twelve different languages, and comparing three
different adaptation scenarios: adapting to the proficiency level only, to the
first language only, or to both aspects simultaneously. We show that tailoring
to both scenarios achieves the largest performance improvement (3.6 F0.5)
relative to a strong baseline.
Comment: Proceedings of the 2019 EMNLP Workshop W-NUT: The 5th Workshop on Noisy User-generated Text
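The adaptation setup starts by filtering the annotated corpus down to a writer profile before fine-tuning; a minimal sketch (the metadata keys are hypothetical):

```python
def select_adaptation_data(corpus, proficiency=None, l1=None):
    """Filter an annotated corpus to the writer profile being adapted to
    (proficiency level, first language, or both); the resulting few
    thousand sentences are used to fine-tune a general-purpose GEC model.
    `corpus` items are (sentence, metadata) pairs."""
    return [sent for sent, meta in corpus
            if (proficiency is None or meta["level"] == proficiency)
            and (l1 is None or meta["l1"] == l1)]

corpus = [("s1", {"level": "B2", "l1": "es"}),
          ("s2", {"level": "C1", "l1": "es"})]
print(select_adaptation_data(corpus, proficiency="B2", l1="es"))  # → ["s1"]
```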
CUNI System for the Building Educational Applications 2019 Shared Task: Grammatical Error Correction
In this paper, we describe our systems submitted to the Building Educational
Applications (BEA) 2019 Shared Task (Bryant et al., 2019). We participated in
all three tracks. Our models are NMT systems based on the Transformer model,
which we improve by incorporating several enhancements: applying dropout to
whole source and target words, weighting target subwords, averaging model
checkpoints, and using the trained model iteratively for correcting the
intermediate translations. The system in the Restricted Track is trained on the
provided corpora with oversampled "cleaner" sentences and reaches 59.39 F0.5
score on the test set. The system in the Low-Resource Track is trained from
Wikipedia revision histories and reaches 44.13 F0.5 score. Finally, we finetune
the system from the Low-Resource Track on restricted data and achieve 64.55
F0.5 score, placing third in the Unrestricted Track.
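Of the enhancements listed, checkpoint averaging is the most self-contained; a sketch using plain dicts of parameter lists in place of real model tensors:

```python
def average_checkpoints(checkpoints):
    """Average parameter values across saved training checkpoints,
    a cheap ensemble-like smoothing step. Each checkpoint is a dict
    mapping parameter name -> list of floats."""
    n = len(checkpoints)
    return {name: [sum(vals) / n
                   for vals in zip(*(ckpt[name] for ckpt in checkpoints))]
            for name in checkpoints[0]}

avg = average_checkpoints([{"w": [1.0, 2.0]}, {"w": [3.0, 4.0]}])
# avg == {"w": [2.0, 3.0]}
```

In practice the same element-wise mean is taken over the tensors of the last few saved checkpoints.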
Improving the Efficiency of Grammatical Error Correction with Erroneous Span Detection and Correction
We propose a novel language-independent approach to improve the efficiency
for Grammatical Error Correction (GEC) by dividing the task into two subtasks:
Erroneous Span Detection (ESD) and Erroneous Span Correction (ESC). ESD
identifies grammatically incorrect text spans with an efficient sequence
tagging model. Then, ESC leverages a seq2seq model to take the sentence with
annotated erroneous spans as input and only outputs the corrected text for
these spans. Experiments show that our approach performs comparably to
conventional seq2seq approaches on both English and Chinese GEC benchmarks
at less than 50% of the inference time cost.
Comment: Accepted by EMNLP 2020
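The two-stage control flow can be sketched as below, with stub callables standing in for the trained tagging and seq2seq models:

```python
def correct(sentence, detect_spans, correct_spans):
    """ESD+ESC pipeline: a cheap tagger finds erroneous character spans,
    then a seq2seq model rewrites only those spans; everything else is
    copied verbatim, which is where the inference speedup comes from."""
    spans = detect_spans(sentence)          # [(start, end), ...]
    if not spans:
        return sentence                     # fast path: nothing to decode
    fixes = correct_spans(sentence, spans)  # one corrected string per span
    out, prev = [], 0
    for (start, end), fix in zip(spans, fixes):
        out.append(sentence[prev:start])
        out.append(fix)
        prev = end
    out.append(sentence[prev:])
    return "".join(out)

# Stub models for illustration: flag "were" and rewrite it to "was".
print(correct("It were fine", lambda s: [(3, 7)], lambda s, sp: ["was"]))
# → "It was fine"
```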
Parallel Iterative Edit Models for Local Sequence Transduction
We present a Parallel Iterative Edit (PIE) model for the problem of local
sequence transduction arising in tasks like Grammatical error correction (GEC).
Recent approaches are based on the popular encoder-decoder (ED) model for
sequence to sequence learning. The ED model auto-regressively captures full
dependency among output tokens but is slow due to sequential decoding. The PIE
model does parallel decoding, giving up the advantage of modelling full
dependency in the output, yet it achieves accuracy competitive with the ED
model for four reasons: 1.~predicting edits instead of tokens, 2.~labeling
sequences instead of generating sequences, 3.~iteratively refining predictions
to capture dependencies, and 4.~factorizing logits over edits and their token
argument to harness pre-trained language models like BERT. Experiments on tasks
spanning GEC, OCR correction and spell correction demonstrate that the PIE
model is an accurate and significantly faster alternative for local sequence
transduction.
Comment: Accepted at EMNLP-IJCNLP 2019
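Reason 1, predicting edits instead of tokens, amounts to recasting a (source, target) pair as per-token edit labels. A simplified sketch using `difflib` alignment (the label scheme is illustrative; PIE's actual label set and alignment differ in detail):

```python
import difflib

def edit_labels(src_tokens, tgt_tokens):
    """Label each source token with an edit. Most labels come out as
    COPY, and all positions can be predicted in parallel -- the core of
    the PIE idea. Insertions attach to the preceding source token;
    uneven replaces are simplified in this sketch."""
    labels = [None] * len(src_tokens)
    sm = difflib.SequenceMatcher(a=src_tokens, b=tgt_tokens)
    for op, i1, i2, j1, j2 in sm.get_opcodes():
        if op == "equal":
            labels[i1:i2] = ["COPY"] * (i2 - i1)
        elif op == "delete":
            labels[i1:i2] = ["DELETE"] * (i2 - i1)
        elif op == "replace":
            repl = tgt_tokens[j1:j2]
            for k, i in enumerate(range(i1, i2)):
                labels[i] = f"REPLACE_{repl[k]}" if k < len(repl) else "DELETE"
        elif op == "insert" and i1 > 0:
            # Opcodes tile the source in order, so labels[i1-1] is set.
            labels[i1 - 1] += f"|APPEND_{' '.join(tgt_tokens[j1:j2])}"
    return labels

print(edit_labels("I has a cats".split(), "I have a cat".split()))
# → ['COPY', 'REPLACE_have', 'COPY', 'REPLACE_cat']
```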
Learning to combine Grammatical Error Corrections
The field of Grammatical Error Correction (GEC) has produced various systems
to deal with focused phenomena or general text editing. We propose an automatic
way to combine black-box systems. Our method automatically detects the strength
of a system or the combination of several systems per error type, improving
precision and recall while optimizing F-score directly. We show consistent
improvement over the best standalone system in all the configurations tested.
This approach also outperforms average ensembling of different RNN models with
random initializations.
In addition, we analyze the use of BERT for GEC - reporting promising results
on this end. We also present a spellchecker created for this task which
outperforms standard spellcheckers tested on the task of spellchecking.
This paper describes a system submission to Building Educational Applications
2019 Shared Task: Grammatical Error Correction.
Our combination of the outputs of the top BEA 2019 shared task systems
currently holds the highest reported score in the open phase of the BEA 2019
shared task, improving F0.5 by 3.7 points over the best previously reported
result.
Comment: BEA 2019
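The per-error-type selection underlying the combination can be sketched as follows; the edit representation and the precomputed per-type winner table are illustrative assumptions:

```python
def combine_by_type(edits_by_system, best_system_for_type):
    """Combine black-box GEC systems by keeping, for each error type,
    only the edits proposed by the system that scored best on that type
    on a development set. Each edit is a (span, error_type, fix) tuple."""
    chosen = []
    for system, edits in edits_by_system.items():
        for edit in edits:
            if best_system_for_type.get(edit[1]) == system:
                chosen.append(edit)
    return chosen

edits = {"A": [((0, 1), "SPELL", "the")],
         "B": [((2, 3), "VERB", "has"), ((0, 1), "SPELL", "teh")]}
best = {"SPELL": "A", "VERB": "B"}
print(combine_by_type(edits, best))
# → [((0, 1), 'SPELL', 'the'), ((2, 3), 'VERB', 'has')]
```

A fuller implementation would also resolve overlapping spans and tune the per-type assignment to optimize F0.5 directly.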