Improving subject-verb agreement in SMT
Ensuring agreement between the subject and the main verb is crucial for the correctness of the information that a sentence conveys. While generating correct subject-verb agreement is relatively straightforward in rule-based approaches to Machine Translation (RBMT), today's leading statistical Machine Translation (SMT) systems often fail to produce correct subject-verb agreement, especially when the target language is morphologically richer than the source language. The main problem is that one surface verb form in the source language corresponds to many surface verb forms in the target language. To deal with subject-verb agreement, we built a hybrid SMT system that augments source verbs with extra linguistic information drawn from their source-language context. This information, in the form of labels attached to verbs that indicate person and number, creates a closer association between a verb in the source language and a verb in the target language. We applied our preprocessing approach with English as the source language and built an SMT system for translation into French. In a range of experiments, the results show improvements in translation quality for our augmented SMT system over a Moses baseline engine, on both automatic and manual evaluations, for the majority of cases where the subject-verb agreement was previously translated incorrectly.
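A minimal sketch of the kind of source-side augmentation the abstract describes: verbs are tagged with person/number labels so that one English surface form maps more tightly onto the correct French inflection. The label format ("|P3.SG", "|PL") and the Penn-tag-based rules are illustrative assumptions, not the paper's exact scheme.

def augment_verbs(tokens, pos_tags):
    """Append person/number labels to verbs, inferred from POS context."""
    out = []
    for tok, pos in zip(tokens, pos_tags):
        if pos == "VBZ":          # Penn tag: 3rd-person singular present
            out.append(tok + "|P3.SG")
        elif pos == "VBP":        # non-3rd-singular present
            out.append(tok + "|PL")
        else:
            out.append(tok)
    return " ".join(out)

# "The committee approves the plan" -> "The committee approves|P3.SG the plan"
print(augment_verbs(
    ["The", "committee", "approves", "the", "plan"],
    ["DT", "NN", "VBZ", "DT", "NN"],
))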
Reordering of Source Side for a Factored English to Manipuri SMT System
Similar language pairs with massive parallel corpora are readily handled by large-scale systems using either Statistical Machine Translation (SMT) or Neural Machine Translation (NMT). Translation involving low-resource language pairs with linguistic divergence has always been a challenge. We consider one such pair, English-Manipuri, which shows linguistic divergence and belongs to the low-resource category. For such language pairs, SMT tends to perform better than NMT. However, SMT's predominant phrase-based model uses groupings of surface word forms treated as phrases for translation, so without any linguistic knowledge it fails to learn a proper mapping between the source and target language symbols. Our model adopts a factored SMT model (FSMT3*) with a part-of-speech (POS) tag as a factor to incorporate linguistic information about the languages, followed by hand-coded reordering. Reordering the source sentences makes them structurally similar to the target language, allowing a better mapping between source and target symbols; it also converts long-distance reordering problems into the monotone reordering that SMT models handle better, reducing the load at decoding time. Additionally, we find that adding POS factor data improves the system's precision. Experimental results using automatic evaluation metrics show that our model improves over phrase-based and other factored models using the lexicalised Moses reordering options. Our FSMT3* model improves the automatic scores of the translation result over the factored model with lexicalised phrase reordering (FSMT2) by 11.05% BLEU (Bilingual Evaluation Understudy), 5.46% F1, 9.35% precision, and 2.56% recall.
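A toy illustration of hand-coded source reordering toward a verb-final target: Manipuri is SOV, so moving the English verb group to the clause end makes the source order monotone with the target. The single rule and the word|POS factored notation shown here are illustrative assumptions, not the paper's full rule set.

def reorder_svo_to_sov(tagged):
    """tagged: list of (word, pos) pairs for one clause."""
    verbs = [(w, p) for w, p in tagged if p.startswith("VB")]
    rest = [(w, p) for w, p in tagged if not p.startswith("VB")]
    return rest + verbs

clause = [("John", "NNP"), ("ate", "VBD"), ("rice", "NN")]
# Factored representation: word|POS, now in target-like SOV order.
print(" ".join(f"{w}|{p}" for w, p in reorder_svo_to_sov(clause)))
# -> "John|NNP rice|NN ate|VBD"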
Correcting input noise in SMT as a char-based translation problem
Misspelled words have a direct impact on the final quality obtained by Statistical Machine Translation (SMT) systems, as the input becomes noisy and unpredictable. This paper presents improvement strategies for translating real-life noisy input. The proposed strategies are based on a preprocessing step consisting of a character-based translator.
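A sketch of what recasting noise correction as a character-level translation problem looks like in preprocessing: the noisy string is split into character tokens (with a visible space symbol) so a char-based translator trained on (noisy, clean) pairs can map it to its corrected form before word-level translation. The space marker "_" is an assumption.

def to_char_tokens(text):
    return " ".join("_" if ch == " " else ch for ch in text)

def from_char_tokens(tokens):
    return "".join(" " if t == "_" else t for t in tokens.split())

noisy = "I recieved teh letter"
src = to_char_tokens(noisy)            # "I _ r e c i e v e d _ t e h ..."
# The char-based translator would map src to the corrected character
# sequence; here we only show the reversible encoding round trip.
print(from_char_tokens(src) == noisy)  # True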
An Empirical Comparison of Parsing Methods for Stanford Dependencies
Stanford typed dependencies are a widely desired representation of natural language sentences, but parsing is one of the major computational bottlenecks in text analysis systems. In light of the evolving definition of the Stanford dependencies and developments in statistical dependency parsing algorithms, this paper revisits the question of Cer et al. (2010): what is the tradeoff between accuracy and speed in obtaining Stanford dependencies in particular? We also explore the effects of input representations on this tradeoff: part-of-speech tags, the novel use of an alternative dependency representation as input, and distributional representations of words. We find that direct dependency parsing is a more viable solution than it was found to be in the past. An accompanying software release can be found at: http://www.ark.cs.cmu.edu/TBSD
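For readers unfamiliar with the "direct" route the abstract refers to, the sketch below shows a single pass from raw text to labelled dependencies with no intermediate constituency parse. spaCy is used here as a stand-in parser (assuming the en_core_web_sm model is installed); its label set is not identical to Stanford typed dependencies.

# Direct dependency parsing: text -> labelled head/dependent arcs.
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Parsing is one of the major computational bottlenecks.")
for token in doc:
    print(f"{token.dep_}({token.head.text}, {token.text})")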
Using parse features for preposition selection and error detection
We evaluate the effect of adding parse features to a leading model of preposition usage. Results show a significant improvement in the preposition selection task on native-speaker text and a modest increment in precision and recall in an ESL error detection task. Analysis of the parser output indicates that it is robust enough in the face of noisy non-native writing to extract useful information.
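A rough sketch of the kind of parse features such a model might consume: for each preposition, the lemma of its syntactic head (what the PP attaches to) and of its object. The feature names are assumptions, and spaCy stands in for the parser used in the paper.

import spacy

nlp = spacy.load("en_core_web_sm")   # assumes this model is installed

def preposition_features(text):
    feats = []
    for tok in nlp(text):
        if tok.pos_ == "ADP":        # preposition
            obj = next((c for c in tok.children if c.dep_ == "pobj"), None)
            feats.append({
                "prep": tok.lower_,
                "head_lemma": tok.head.lemma_,     # attachment site
                "obj_lemma": obj.lemma_ if obj else None,
            })
    return feats

print(preposition_features("She arrived at the station in the morning"))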
Neural Sequence-Labelling Models for Grammatical Error Correction
We propose an approach to N-best list reranking using neural sequence-labelling models. We train a compositional model for error detection that calculates the probability of each token in a sentence being correct or incorrect, utilising the full sentence as context. Using the error detection model, we then re-rank the N best hypotheses generated by statistical machine translation systems. Our approach achieves state-of-the-art results on error correction for three different datasets, and it has the additional advantage of only using a small set of easily computed features that require no linguistic input.
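A minimal sketch of the reranking step described above. score_correct() is a placeholder for the trained sequence labeller, which would return a per-token probability of correctness given the full sentence; the linear combination weight is also an assumption.

def rerank(nbest, score_correct, weight=0.5):
    """nbest: list of (hypothesis_tokens, smt_score) pairs.
    Returns hypotheses sorted best-first by a combined score."""
    def combined(hyp, smt_score):
        probs = score_correct(hyp)            # one probability per token
        detection = sum(probs) / len(probs)   # average correctness prob
        return weight * smt_score + (1 - weight) * detection
    return sorted(nbest, key=lambda h: combined(*h), reverse=True)

# Toy stand-in for the trained model: penalise the token "teh".
def toy_model(hyp):
    return [0.1 if t == "teh" else 0.9 for t in hyp]

nbest = [(["I", "saw", "teh", "cat"], 0.8),
         (["I", "saw", "the", "cat"], 0.7)]
print(rerank(nbest, toy_model)[0][0])   # ['I', 'saw', 'the', 'cat']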