19,625 research outputs found
Neural Network-based Word Alignment through Score Aggregation
We present a simple neural network for word alignment that builds source and
target word window representations to compute alignment scores for sentence
pairs. To enable unsupervised training, we use an aggregation operation that
summarizes the alignment scores for a given target word. A soft-margin
objective increases scores for true target words while decreasing scores for
target words that are not present. Compared to the popular Fast Align model,
our approach improves alignment accuracy by 7 AER on English-Czech, by 6 AER on
Romanian-English and by 1.7 AER on English-French alignment.
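The aggregation and soft-margin objective described in this abstract can be sketched as follows. The abstract does not say which aggregation operation is used, so log-sum-exp (a smooth stand-in for max) is an assumption here, as are the function names and the example scores; this is an illustration and not the authors' implementation.

```python
import math

def aggregate(scores):
    """Summarize the alignment scores of one target word over all source words.

    Assumption: log-sum-exp aggregation, computed in a numerically stable way.
    """
    m = max(scores)
    return m + math.log(sum(math.exp(s - m) for s in scores))

def soft_margin_loss(aggregated_score, present):
    """log(1 + exp(-y * s)), with y = +1 if the target word occurs in the
    target sentence and y = -1 otherwise."""
    y = 1.0 if present else -1.0
    return math.log1p(math.exp(-y * aggregated_score))

# Hypothetical alignment scores of one target word against three source words.
scores = [2.0, -1.0, 0.5]
s = aggregate(scores)
loss_if_present = soft_margin_loss(s, present=True)
loss_if_absent = soft_margin_loss(s, present=False)
```

With a high aggregated score, the loss is small when the word really is in the target sentence and large when it is not, which is the behaviour the objective is meant to reward.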
Improving Evaluation of English-Czech MT through Paraphrasing
In this paper, we present a method of improving the accuracy of machine translation
evaluation of Czech sentences. Given a reference sentence, our algorithm transforms it
by targeted paraphrasing into a new synthetic reference sentence that is closer in
wording to the machine translation output, but at the same time preserves the meaning of
the original reference sentence.
Grammatical correctness of the new reference sentence is ensured by applying Depfix to
the newly created paraphrases. Depfix is a system for post-editing English-to-Czech machine
translation outputs. We adjusted it to fix the errors in paraphrased sentences.
Because our paraphrase source is noisy, we experiment with adding word alignment. However,
the alignment reduces the number of paraphrases found, and the best results were achieved
by a simple greedy method with only one-word paraphrases, thanks to their intensive filtering.
BLEU scores computed using these new reference sentences show significantly higher correlation
with human judgment than scores computed on the original reference sentences.
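A greedy one-word paraphrasing step of the kind this abstract mentions might look like the sketch below. The paraphrase table, the function name, and the English example sentences are hypothetical (the paper works on Czech), and the Depfix grammar-correction step is left out entirely.

```python
def paraphrase_reference(reference, hypothesis, paraphrases):
    """Greedily replace reference words with one-word paraphrases that
    appear in the MT output (hypothesis), leaving all other words unchanged."""
    hyp_words = set(hypothesis)
    ref_words = set(reference)
    result = []
    for word in reference:
        replacement = word
        if word not in hyp_words:
            for candidate in paraphrases.get(word, []):
                # Take the first paraphrase that occurs in the MT output
                # and is not already used elsewhere in the reference.
                if candidate in hyp_words and candidate not in ref_words:
                    replacement = candidate
                    break
        result.append(replacement)
    return result

# Hypothetical paraphrase table and sentences, for illustration only.
paraphrases = {"car": ["automobile", "vehicle"], "quick": ["fast"]}
reference = "the quick car stopped".split()
hypothesis = "the fast vehicle halted".split()
new_reference = paraphrase_reference(reference, hypothesis, paraphrases)
# new_reference == ['the', 'fast', 'vehicle', 'stopped']
```

The word "stopped" stays as it is because the table offers no paraphrase for it, so the synthetic reference moves closer to the MT output only where a known paraphrase supports the change.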
Automatic Alignment of Tectogrammatical Trees from Czech-English Parallel Corpus
Original title: Automatické párování tektogramatických stromů z česko-anglického paralelního korpusu. Author: David Mareček. Department: Institute of Formal and Applied Linguistics, Faculty of Mathematics and Physics. Supervisor: Ing. Zdeněk Žabokrtský, Ph.D.
Abstract: The goal of this thesis is to implement and evaluate a software tool for automatic alignment of Czech and English tectogrammatical trees. The task is to find corresponding nodes between two trees that represent an English sentence and its Czech translation. A large number of aligned trees acquired from parallel corpora can be used for training transfer models for machine translation systems. It is also useful for linguists studying translation equivalents in the two languages.
The thesis also describes the word alignment annotation process. Manual word alignment was necessary for evaluating the aligner. The results of our experiments show that shifting the alignment task from the word layer to the tectogrammatical layer both (a) increases the inter-annotator agreement on the task and (b) makes it possible to construct a feature-based algorithm which uses sentence structure and which outperforms the GIZA++ aligner in terms of f-measure on aligned tectogrammatical node pairs. This is probably because the tectogrammatical representations of Czech and English sentences are much closer to each other than the surface sentences themselves.
Keywords: tectogrammatical layer, word alignment, machine translation
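The f-measure on aligned node pairs mentioned in the abstract can be computed as in the sketch below. It assumes that an alignment is represented as a set of (English node index, Czech node index) pairs and that the plain balanced F1 is meant; the function name and the example pairs are hypothetical, and the thesis may define the measure differently.

```python
def alignment_f_measure(predicted, gold):
    """Precision, recall and balanced F-measure over sets of aligned node pairs."""
    predicted, gold = set(predicted), set(gold)
    correct = len(predicted & gold)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure

# Hypothetical manual (gold) and automatic (predicted) alignments.
gold = {(0, 0), (1, 2), (2, 1), (3, 3)}
predicted = {(0, 0), (1, 2), (2, 3)}
precision, recall, f_measure = alignment_f_measure(predicted, gold)
```

Here two of the three predicted pairs are correct and two of the four gold pairs are found, giving a precision of 2/3, a recall of 1/2 and an F-measure of 4/7.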
An augmented three-pass system combination framework: DCU combination system for WMT 2010
This paper describes the augmented three-pass system combination framework of
the Dublin City University (DCU) MT group for the WMT 2010 system combination
task. The basic three-pass framework includes building individual confusion
networks (CNs), a super network, and a modified Minimum Bayes-risk (mConMBR)
decoder. The augmented parts for the WMT 2010 tasks include 1) a rescoring
component which is used to re-rank the N-best lists generated from the
individual CNs and the super network, 2) a new hypothesis alignment metric,
TERp, that is used to carry out English-targeted hypothesis alignment, and
3) a larger number of different backbone-based CNs which are employed to
increase the diversity of the mConMBR decoding phase. We took part in the
English-to-Czech and French-to-English combination tasks. Experimental results
show that our proposed combination framework achieved improvements of 2.17
absolute points (13.36 relative points) and 1.52 absolute points (5.37 relative
points) in BLEU score on the English-to-Czech and French-to-English tasks
respectively over the best single system. We also achieved better performance
in human evaluation.
MATREX: the DCU MT System for WMT 2008
In this paper, we give a description of the machine translation system developed at DCU that was used for our participation in the evaluation campaign of the Third Workshop on Statistical Machine Translation at ACL 2008.
We describe the modular design of our data-driven MT system with particular focus on the components used in this participation. We also describe some of the significant modules which were unused in this task. We participated in the EuroParl task for the following translation directions: Spanish–English and French–English, in which we employed our hybrid EBMT-SMT architecture to translate. We also participated in the Czech–English News and News Commentary tasks, which represented a previously untested language pair for our system. We report results on the provided development and test sets.
Conditional Random Field Autoencoders for Unsupervised Structured Prediction
We introduce a framework for unsupervised learning of structured predictors
with overlapping, global features. Each input's latent representation is
predicted conditional on the observable data using a feature-rich conditional
random field. Then a reconstruction of the input is (re)generated, conditional
on the latent structure, using models for which maximum likelihood estimation
has a closed-form solution. Our autoencoder formulation enables efficient learning
without making unrealistic independence assumptions or restricting the kinds of
features that can be used. We illustrate insightful connections to traditional
autoencoders, posterior regularization and multi-view learning. We show
competitive results with instantiations of the model for two canonical NLP
tasks: part-of-speech induction and bitext word alignment, and show that
training our model can be substantially more efficient than comparable
feature-rich baselines.
Example-based machine translation of the Basque language
Basque is both a minority and a highly inflected language with free order of sentence constituents. Machine Translation of Basque is thus both a real need and a test bed for MT techniques. In this paper, we present a modular data-driven MT system which includes different chunkers as well as chunk aligners which can deal with the free order of sentence constituents of Basque. We conducted Basque to English translation experiments, evaluated on a large corpus (270,000 sentence pairs). The experimental results show that our system significantly outperforms state-of-the-art approaches according to several common automatic evaluation metrics.