Natural language documents exhibit coherence and cohesion by means of interrelated
structures both within and across sentences. Sentences do not stand in isolation from
each other and only a coherent structure makes them understandable and sound natural
to humans. In Statistical Machine Translation (SMT), little research exists on
translating a document from a source language into a coherent document in the target
language. The dominant paradigm is still one that considers sentences independently
from each other. There is a need both for a deeper understanding of how to handle
specific discourse phenomena and for automatic evaluation of how well these phenomena
are handled in SMT.
In this thesis we explore an approach to treating sentences as dependent on each
other, focusing on the problem of pronoun translation as an instance of a discourse-related
non-local phenomenon. We direct our attention to pronoun translation in the
form of cross-lingual pronoun prediction (CLPP) and develop a model to tackle this
problem. We obtain state-of-the-art results exhibiting the benefit of having access to
the antecedent of a pronoun for predicting the right translation of that pronoun. Experiments
also show that features from the target side are more informative than features
from the source side, confirming linguistic knowledge that referential pronouns need to
agree in gender and number with their target-side antecedent. We show our approach
to be applicable across the two language pairs English-French and English-German.
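The agreement constraint underlying this result can be illustrated with a deliberately simplified sketch. The function and feature names below are hypothetical, and the hand-written rules merely stand in for the learned CLPP classifiers described in the thesis:

```python
# Illustrative sketch of cross-lingual pronoun prediction (CLPP) for
# English-French. A real CLPP system learns this mapping from rich
# source- and target-side context; here, simple agreement rules show
# why the gender and number of the *target-side* antecedent are such
# informative features.

def predict_french_pronoun(source_pronoun, antecedent_gender, antecedent_number):
    """Predict the French translation of an English subject pronoun from
    properties of its target-side antecedent (all inputs hypothetical)."""
    if source_pronoun == "it":
        # French referential pronouns agree in gender with their antecedent.
        return "il" if antecedent_gender == "masc" else "elle"
    if source_pronoun == "they":
        return "ils" if antecedent_gender == "masc" else "elles"
    # Fall back: leave pronouns outside the prediction classes untouched.
    return source_pronoun

# Example: English "it" whose antecedent translates as "la commission" (fem. sg.)
print(predict_french_pronoun("it", "fem", "sg"))  # -> elle
```

The sketch also makes the limitation visible: without access to the antecedent, "it" is genuinely ambiguous between "il" and "elle".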
The experimental setting for CLPP is artificially restricted, both to enable automatic
evaluation and to provide a controlled environment. This limitation prevents us from
testing the full potential of CLPP systems in a more realistic
setting that is closer to a full SMT scenario. We provide an annotation scheme, a tool
and a corpus that enable evaluation of pronoun prediction in a more realistic setting.
The annotated corpus consists of parallel documents translated by a state-of-the-art
neural machine translation (NMT) system, where the appropriate target-side pronouns
have been chosen by annotators. With this corpus, we expose a weakness of current
CLPP systems: they are outperformed by a state-of-the-art NMT system in
this more realistic context. This corpus provides a basis for future CLPP shared tasks
and allows the research community to further understand and test their methods.
The lack of appropriate evaluation metrics that explicitly capture non-local phenomena
is one of the main reasons why handling non-local phenomena has not yet
been widely adopted in SMT. To overcome this obstacle and evaluate the coherence of
translated documents, we define a bilingual model of entity-based coherence, inspired
by work on monolingual coherence modelling, and frame it as a learning-to-rank problem.
We first evaluate this model on a corpus where we artificially introduce coherence
errors based on typical errors CLPP systems make. This allows us to assess the quality
of the model in a controlled environment with automatically provided gold coherence
rankings. Results show that this model can distinguish with high accuracy between a
human-authored translation and one with coherence errors, that it can also distinguish
between document pairs from two corpora with different degrees of coherence errors,
and that the learnt model can be successfully applied when the distribution of errors
in the test set differs from that in the training data, demonstrating its generalization
potential.
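The learning-to-rank framing can be sketched as follows. The feature vectors and the perceptron-style update below are purely illustrative assumptions; the thesis model derives its features from bilingual entity-based coherence, which is abstracted away here:

```python
# Minimal sketch of coherence evaluation framed as pairwise learning-to-rank:
# a linear model is trained so that, in each training pair, the coherent
# document scores higher than its incoherent counterpart. Feature extraction
# (e.g. from entity transitions) is assumed to have happened already.

def score(weights, features):
    """Linear coherence score of one document's feature vector."""
    return sum(w * f for w, f in zip(weights, features))

def train_pairwise(pairs, dims, epochs=10, lr=0.1):
    """pairs: list of (coherent_features, incoherent_features) tuples."""
    w = [0.0] * dims
    for _ in range(epochs):
        for good, bad in pairs:
            # Perceptron-style update whenever the incoherent document
            # is not ranked strictly below the coherent one.
            if score(w, good) <= score(w, bad):
                w = [wi + lr * (g - b) for wi, g, b in zip(w, good, bad)]
    return w

# Toy feature vectors (hypothetical), e.g. proportions of entity-transition
# types in a document: coherent documents have more "continue" transitions.
pairs = [([0.8, 0.1], [0.2, 0.7]), ([0.9, 0.2], [0.3, 0.6])]
w = train_pairwise(pairs, dims=2)
assert score(w, [0.8, 0.1]) > score(w, [0.2, 0.7])
```

Ranking pairs rather than regressing absolute scores avoids the need for calibrated gold coherence values: only the relative ordering within each pair must be learned.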
To test our bilingual model of coherence as a discourse-aware SMT evaluation
metric, we apply it to more realistic data. We use it to evaluate a state-of-the-art NMT
system against post-editing systems with pronouns corrected by our CLPP systems.
To verify our metric, we reuse our annotated parallel corpus and consider the pronoun
annotations as a proxy for human document-level coherence judgements. Experiments
show far lower accuracy in ranking translations according to their entity-based
coherence than on the artificial corpus, suggesting that the metric has difficulties generalizing
to a more realistic setting. Analysis reveals that the system translations in our
test corpus do not differ in their pronoun translations in almost half of the document
pairs. To circumvent this data sparsity issue, and to remove the need for parameter
learning, we define a score-based SMT evaluation metric which directly uses features
from our bilingual coherence model.