ParCor 1.0: A Parallel Pronoun-Coreference Corpus to Support Statistical MT
We present ParCor, a parallel corpus of texts in which pronoun coreference (reduced coreference in which pronouns are used as referring expressions) has been annotated. The corpus is intended to be used both as a resource from which to learn systematic differences in pronoun use between languages and ultimately for developing and testing informed Statistical Machine Translation systems aimed at addressing the problem of pronoun coreference in translation. At present, the corpus consists of a collection of parallel English-German documents from two different text genres: TED Talks (transcribed planned speech), and EU Bookshop publications (written text). All documents in the corpus have been manually annotated with respect to the type and location of each pronoun and, where relevant, its antecedent. We provide details of the texts that we selected, the guidelines and tools used to support annotation and some corpus statistics. The texts in the corpus have already been translated into many languages, and we plan to expand the corpus into these other languages, as well as other genres, in the future.
Incorporating pronoun function into statistical machine translation
Pronouns are used frequently in language, and perform a range of functions.
Some pronouns are used to express coreference, and others are not. Languages
and genres differ in how and when they use pronouns and this poses a problem
for Statistical Machine Translation (SMT) systems (Le Nagard and Koehn,
2010; Hardmeier and Federico, 2010; Novák, 2011; Guillou, 2012; Weiner, 2014;
Hardmeier, 2014). Attention to date has focussed on coreferential (anaphoric)
pronouns with NP antecedents, which when translated from English into a language
with grammatical gender, must agree with the translation of the head of
the antecedent. Despite growing attention to this problem, little progress has
been made, and little attention has been given to other pronouns.
The central claim of this thesis is that pronouns performing different functions
in text should be handled differently by SMT systems and when evaluating
pronoun translation. This motivates the introduction of a new framework to
categorise pronouns according to their function: Anaphoric/cataphoric reference,
event reference, extra-textual reference, pleonastic, addressee reference, speaker
reference, generic reference, or other function. Labelling pronouns according to
their function also helps to resolve instances of functional ambiguity arising from
the same pronoun in the source language having multiple functions, each with different
translation requirements in the target language. The categorisation framework
is used in corpus annotation, corpus analysis, SMT system development and
evaluation.
I have directed the annotation and conducted analyses of a parallel corpus of
English-German texts called ParCor (Guillou et al., 2014), in which pronouns
are manually annotated according to their function. This provides a first step
toward understanding the problems that SMT systems face when translating pronouns.
In the thesis, I show how analysis of manual translation can prove useful in
identifying and understanding systematic differences in pronoun use between two
languages and can help inform the design of SMT systems. In particular, the analysis
revealed that the German translations in ParCor contain more anaphoric and
pleonastic pronouns than their English originals, reflecting differences in pronoun
use. This raises a particular problem for the evaluation of pronoun translation.
Automatic evaluation methods that rely on reference translations to assess pronoun
translation will not be able to provide an adequate evaluation when the
reference translation departs from the original source-language text. I also show
how analysis of the output of state-of-the-art SMT systems can reveal how well
current systems perform in translating different types of pronouns and indicate
where future efforts would be best directed. The analysis revealed that biases
in the training data, for example arising from the use of "it" and "es" as both
anaphoric and pleonastic pronouns in both English and German, are a problem
that SMT systems must overcome. SMT systems also need to disambiguate the
function of those pronouns with ambiguous surface forms so that each pronoun
may be translated in an appropriate way.
To demonstrate the value of this work, I have developed an automated post-editing
system in which automated tools are used to construct ParCor-style annotations
over the source-language pronouns. The annotations are then used to resolve
functional ambiguity for the pronoun "it", with separate rules applied to the
output of a baseline SMT system for anaphoric vs. non-anaphoric instances. The
system was submitted to the DiscoMT 2015 shared task on pronoun translation
for English-French. As with all other participating systems, the automatic post-editing
system failed to beat a simple phrase-based baseline. A detailed analysis,
including an oracle experiment in which manual annotation replaces the automated
tools, was conducted to discover the causes of poor system performance.
The analysis revealed that the design of the rules and their strict application to
the SMT output are the biggest factors in the failure of the system.
The lack of automatic evaluation metrics for pronoun translation is a limiting
factor in SMT system development. To alleviate this problem, Christian Hardmeier
and I have developed a testing regimen called PROTEST comprising (1)
a hand-selected set of pronoun tokens categorised according to the different problems
that SMT systems face and (2) an automated evaluation script. Pronoun
translations can then be automatically compared against a reference translation,
with mismatches referred for manual evaluation. The automatic evaluation was
applied to the output of systems submitted to the DiscoMT 2015 shared task
on pronoun translation. This again highlighted the weakness of the post-editing
system, which performs poorly due to its focus on producing gendered pronoun
translations, and its inability to distinguish between pleonastic and event reference
pronouns.
ParCorFull: a Parallel Corpus Annotated with Full Coreference
ParCorFull is a parallel corpus annotated with full coreference chains that has been created to address an important problem that machine translation and other multilingual natural language processing (NLP) technologies face: translation of coreference across languages. Our corpus contains parallel texts for the language pair English-German, two major European languages. Despite being typologically very close, these languages still have systemic differences in the realisation of coreference, and thus pose problems for multilingual coreference resolution and machine translation. Our parallel corpus covers the genres of planned speech (public lectures) and newswire. It is richly annotated for coreference in both languages, including annotation of both nominal coreference and reference to antecedents expressed as clauses, sentences and verb phrases. This resource supports research in the areas of natural language processing, contrastive linguistics and translation studies on the mechanisms involved in coreference translation in order to develop a better understanding of the phenomenon.
Cross-lingual Coreference Resolution of Pronouns
This work is, to our knowledge, a first attempt at a machine learning approach to cross-lingual coreference resolution, i.e. coreference resolution (CR) performed on a bitext. Focusing on CR of English pronouns, we leverage language differences and enrich the feature set of a standard monolingual CR system for English with features extracted from the Czech side of the bitext. Our work also includes a supervised pronoun aligner that outperforms a GIZA++ baseline in terms of both intrinsic evaluation and evaluation on CR. The final cross-lingual CR system has successfully outperformed both a monolingual CR system and a cross-lingual projection system.