Adapting Semantic Role Labeling to New Genres and Languages
Semantic role labeling (SRL) is the identification of semantic predicates and their participants within a sentence, which is vital for deeper natural language understanding. State-of-the-art SRL models require annotated text for training, but those annotations don't exist for many languages and domains. The ability to annotate new corpora is hampered by limited time and budget. We explore two different ways of reducing the annotation required to produce SRL systems for new domains or languages: active learning and annotation projection.
Active learning reduces annotation requirements by selecting just the most informative training instances through an iterative process of training and annotation. In this work, we investigate the use of Bayesian Active Learning by Disagreement, ways of tuning it for SRL, and its performance across multiple corpora. We study the choices made by different selection methods over the course of iterations, examining vocabulary coverage, diversity, predicates selected, and the shifts in confidence. We also explore the impact of various strategies for selecting the initial training data. We investigate a number of potentially influential factors within batches of queries, such as diversity and disagreement scores. In order to reduce the overhead of training time, we additionally compare the effect of increasing the number of queries selected on each iteration.
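As an illustration of the selection step described above, the following is a minimal sketch of a Bayesian Active Learning by Disagreement score, computed as the entropy of the averaged prediction minus the average entropy of the individual predictions. It assumes each unlabeled instance comes with several sampled label distributions (for example from stochastic forward passes); the input format, the function names, and the toy pool are assumptions made for this example and are not taken from the thesis, whose SRL-specific tuning is not shown here.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def bald_score(sampled_probs):
    """Disagreement score for one instance.

    sampled_probs: list of label distributions, one per sampled model
    (hypothetical input format chosen for this sketch).
    """
    num_samples = len(sampled_probs)
    num_labels = len(sampled_probs[0])
    # Average the sampled distributions label by label.
    mean_probs = [
        sum(sample[k] for sample in sampled_probs) / num_samples
        for k in range(num_labels)
    ]
    expected_entropy = sum(entropy(s) for s in sampled_probs) / num_samples
    # High when the samples are individually confident but disagree.
    return entropy(mean_probs) - expected_entropy


def select_batch(pool, batch_size):
    """Return the ids of the batch_size highest-scoring pool instances."""
    ranked = sorted(pool, key=lambda item: bald_score(pool[item]), reverse=True)
    return ranked[:batch_size]


# Toy pool: two instances, two sampled distributions each.
pool = {
    "agree": [[0.9, 0.1], [0.9, 0.1]],
    "disagree": [[0.9, 0.1], [0.1, 0.9]],
}
selected = select_batch(pool, 1)
```

In this toy pool the instance whose samples contradict each other is ranked first, while the instance on which all samples agree scores zero, which is the behavior the disagreement criterion is meant to capture.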
Abstract Meaning Representations (AMRs) are increasingly popular semantic representations of whole sentences. Based on our successful results using active learning to assess the informativeness of annotation instances for SRL, we look into whether the commonalities between these representations can be leveraged to supply targeted annotation for AMR parsing.
Finally, we explore annotation projection of SRL. This approach attempts to create semantic annotations in a target language given parallel translations that have been given SRL annotations through manual or automatic means. We assess the recently developed Russian PropBank and the feasibility of generating the same semantic annotations by projecting from the English PropBank annotation. We use both our own system with English-Russian automatic word alignments and the recent Universal PropBanks 2.0. We examine the types of errors that arise from inconsistencies or gaps in annotations as well as systemic issues arising from the strong English bias of the projections. This analysis leads us to the development of several filtering techniques that improve the precision of the projections.
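The basic idea of annotation projection can be sketched as follows: role labels on source-language tokens are copied to the target-language tokens they are aligned with. This is only a simplified illustration. The data layout, the function name, and the rule of leaving a target token unlabeled when it receives conflicting labels are assumptions made for this example; they are not the system or the filtering techniques developed in the thesis.

```python
def project_labels(source_labels, alignments, target_length):
    """Copy role labels from source tokens to aligned target tokens.

    source_labels: dict mapping source token index -> label (e.g. "ARG0").
    alignments: iterable of (source_index, target_index) pairs.
    Target tokens that receive conflicting labels are left unlabeled
    (an illustrative precision filter, assumed for this sketch).
    """
    candidates = {}
    for src, tgt in alignments:
        label = source_labels.get(src)
        if label is not None:
            candidates.setdefault(tgt, set()).add(label)

    projected = [None] * target_length
    for tgt, labels in candidates.items():
        if len(labels) == 1 and 0 <= tgt < target_length:
            projected[tgt] = next(iter(labels))
    return projected


# English "The cat ate fish" aligned to a three-token translation.
source_labels = {1: "ARG0", 2: "PRED", 3: "ARG1"}
alignments = [(0, 0), (1, 0), (2, 1), (3, 2)]
projected = project_labels(source_labels, alignments, 3)
```

Unlabeled source tokens such as the determiner contribute nothing, so a many-to-one alignment only causes a conflict when two differently labeled source tokens map to the same target token.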
Quantifying Cross-lingual Semantic Similarity for Natural Language Processing Applications
Translation and cross-lingual access to information are key technologies in a global economy. Even though the quality of machine translation (MT) output is still far from the level of human translations, many real-world applications have emerged, for which MT can be employed. Machine translation supports human translators in computer-assisted translation (CAT), providing the opportunity to improve translation systems based on human interaction and feedback. Besides, many tasks that involve natural language processing operate in a cross-lingual setting, where there is no need for perfectly fluent translations and the transfer of meaning can be modeled by employing MT technology. This thesis describes cumulative work in the field of cross-lingual natural language processing in a user-oriented setting. A common denominator of the presented approaches is their anchoring in an alignment between texts in two different languages to quantify the similarity of their content.
Translation-based Ranking in Cross-Language Information Retrieval
Today's volume of user-generated, multilingual textual data creates a need for information processing
systems in which cross-linguality, i.e., the ability to work on more than one
language, is fully integrated into the underlying models. In the particular
context of Information Retrieval (IR), this amounts to ranking and retrieving relevant
documents from a large repository in language A, given a user's information
need expressed in a query in language B. This kind of application is commonly
termed a Cross-Language Information Retrieval (CLIR) system. Such
CLIR systems typically involve a translation component of varying complexity,
which is responsible for translating the user input into the document
language. Using query translations from modern, phrase-based Statistical
Machine Translation (SMT) systems, and subsequently retrieving monolingually
is thus a straightforward choice. However, little work has been devoted to
integrating such SMT models into CLIR, or even to jointly modeling translation and
retrieval.
In this thesis, I focus on the shared aspect of ranking in translation-based
CLIR: both translation and retrieval models induce rankings over a set of
candidate structures through assignment of scores. The subject of this thesis
is to exploit this commonality in three different ranking tasks: (1) "Mate-ranking" refers to the
task of mining comparable data for SMT domain adaptation through translation-based
CLIR. "Cross-lingual mates" are direct or close translations of the query.
I will show that such a CLIR system is able to find
in-domain comparable data from noisy user-generated corpora and improves
in-domain translation performance of an SMT system. Conversely, the CLIR system
itself relies on a translation model that is tailored for retrieval. This
leads to the second direction of research, in which I develop two ways to
optimize an SMT model for retrieval, namely (2) by SMT parameter optimization
towards a retrieval objective ("translation ranking"), and (3) by presenting
a joint model of translation and retrieval for "document ranking". The latter
abandons the common architecture of modeling both components separately. The
former task refers to optimizing for preference of
translation candidates that work well for retrieval. In the core task of "document ranking" for CLIR, I present a model that directly ranks documents using an SMT decoder. I present substantial improvements
over state-of-the-art translation-based CLIR baseline systems, indicating that
a joint model of translation and retrieval is a promising direction of
research in the field of CLIR.
Incorporating pronoun function into statistical machine translation
Pronouns are used frequently in language, and perform a range of functions.
Some pronouns are used to express coreference, and others are not. Languages
and genres differ in how and when they use pronouns and this poses a problem
for Statistical Machine Translation (SMT) systems (Le Nagard and Koehn,
2010; Hardmeier and Federico, 2010; Novák, 2011; Guillou, 2012; Weiner, 2014;
Hardmeier, 2014). Attention to date has focussed on coreferential (anaphoric)
pronouns with NP antecedents, which when translated from English into a language
with grammatical gender, must agree with the translation of the head of
the antecedent. Despite growing attention to this problem, little progress has
been made, and little attention has been given to other pronouns.
The central claim of this thesis is that pronouns performing different functions
in text should be handled differently by SMT systems and when evaluating
pronoun translation. This motivates the introduction of a new framework to
categorise pronouns according to their function: Anaphoric/cataphoric reference,
event reference, extra-textual reference, pleonastic, addressee reference, speaker
reference, generic reference, or other function. Labelling pronouns according to
their function also helps to resolve instances of functional ambiguity arising from
the same pronoun in the source language having multiple functions, each with different
translation requirements in the target language. The categorisation framework
is used in corpus annotation, corpus analysis, SMT system development and
evaluation.
I have directed the annotation and conducted analyses of a parallel corpus of
English-German texts called ParCor (Guillou et al., 2014), in which pronouns
are manually annotated according to their function. This provides a first step
toward understanding the problems that SMT systems face when translating pronouns.
In the thesis, I show how analysis of manual translation can prove useful in
identifying and understanding systematic differences in pronoun use between two
languages and can help inform the design of SMT systems. In particular, the analysis
revealed that the German translations in ParCor contain more anaphoric and
pleonastic pronouns than their English originals, reflecting differences in pronoun
use. This raises a particular problem for the evaluation of pronoun translation.
Automatic evaluation methods that rely on reference translations to assess pronoun
translation will not be able to provide an adequate evaluation when the
reference translation departs from the original source-language text. I also show
how analysis of the output of state-of-the-art SMT systems can reveal how well
current systems perform in translating different types of pronouns and indicate
where future efforts would be best directed. The analysis revealed that biases
in the training data, for example arising from the use of “it” and “es” as both
anaphoric and pleonastic pronouns in both English and German, are a problem
that SMT systems must overcome. SMT systems also need to disambiguate the
function of those pronouns with ambiguous surface forms so that each pronoun
may be translated in an appropriate way.
To demonstrate the value of this work, I have developed an automated post-editing
system in which automated tools are used to construct ParCor-style annotations
over the source-language pronouns. The annotations are then used to resolve
functional ambiguity for the pronoun “it” with separate rules applied to the
output of a baseline SMT system for anaphoric vs. non-anaphoric instances. The
system was submitted to the DiscoMT 2015 shared task on pronoun translation
for English-French. As with all other participating systems, the automatic post-editing
system failed to beat a simple phrase-based baseline. A detailed analysis,
including an oracle experiment in which manual annotation replaces the automated
tools, was conducted to discover the causes of poor system performance.
The analysis revealed that the design of the rules and their strict application to
the SMT output are the biggest factors in the failure of the system.
The lack of automatic evaluation metrics for pronoun translation is a limiting
factor in SMT system development. To alleviate this problem, Christian Hardmeier
and I have developed a testing regimen called PROTEST comprising (1)
a hand-selected set of pronoun tokens categorised according to the different problems
that SMT systems face and (2) an automated evaluation script. Pronoun
translations can then be automatically compared against a reference translation,
with mismatches referred for manual evaluation. The automatic evaluation was
applied to the output of systems submitted to the DiscoMT 2015 shared task
on pronoun translation. This again highlighted the weakness of the post-editing
system, which performs poorly due to its focus on producing gendered pronoun
translations, and its inability to distinguish between pleonastic and event reference
pronouns.
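The two-stage evaluation described for PROTEST (automatic comparison against a reference, with mismatches referred for manual evaluation) might be sketched as below. The record fields, the case-insensitive exact-match rule, and the function name are assumptions made for illustration only; the actual PROTEST script and its matching criteria are not reproduced here.

```python
def triage_pronouns(examples):
    """Split pronoun translations into automatic matches and review cases.

    examples: list of dicts with hypothetical keys "id", "reference"
    and "hypothesis" holding the reference and system pronoun.
    """
    matched, for_review = [], []
    for example in examples:
        reference = example["reference"].strip().lower()
        hypothesis = example["hypothesis"].strip().lower()
        if hypothesis == reference:
            matched.append(example["id"])
        else:
            # A mismatch is not necessarily an error, so it is only
            # flagged for a human judgment rather than counted as wrong.
            for_review.append(example["id"])
    return matched, for_review


examples = [
    {"id": "ex1", "reference": "elle", "hypothesis": "Elle"},
    {"id": "ex2", "reference": "il", "hypothesis": "cela"},
]
matched, for_review = triage_pronouns(examples)
```

Treating mismatches as candidates for manual review, not as errors, reflects the point made earlier in the abstract that a reference translation may legitimately depart from the source text.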
Representation and parsing of multiword expressions
This book consists of contributions related to the definition, representation and parsing of MWEs. These reflect current trends in the representation and processing of MWEs. They cover various categories of MWEs such as verbal, adverbial and nominal MWEs, various linguistic frameworks (e.g. tree-based and unification-based grammars), various languages (including English, French, Modern Greek, Hebrew and Norwegian), and various applications (namely MWE detection, parsing and automatic translation) using both symbolic and statistical approaches.
Current trends
Deep parsing is the fundamental process aiming at the representation of the syntactic
structure of phrases and sentences. In the traditional methodology this process is
based on lexicons and grammars representing roughly properties of words and interactions
of words and structures in sentences. Several linguistic frameworks, such as Head-driven
Phrase Structure Grammar (HPSG), Lexical Functional Grammar (LFG), Tree Adjoining
Grammar (TAG), Combinatory Categorial Grammar (CCG), etc., offer different
structures and combining operations for building grammar rules. These already contain
mechanisms for expressing properties of Multiword Expressions (MWE), which, however,
need improvement in how they account for idiosyncrasies of MWEs on the one
hand and their similarities to regular structures on the other hand. This collaborative
book constitutes a survey on various attempts at representing and parsing MWEs in the
context of linguistic theories and applications.