The Circle of Meaning: From Translation to Paraphrasing and Back
The preservation of meaning between inputs and outputs is perhaps
the most ambitious and, often, the most elusive goal of systems
that attempt to process natural language. Nowhere is this goal of
more obvious importance than for the tasks of machine translation
and paraphrase generation. Preserving meaning between the input and
the output is paramount for both, the monolingual vs bilingual distinction
notwithstanding. In this thesis, I present a novel, symbiotic relationship
between these two tasks that I term the "circle of meaning".
Today's statistical machine translation (SMT) systems require high
quality human translations for parameter tuning, in addition to
large bi-texts for learning the translation units. This parameter
tuning usually involves generating translations at different points
in the parameter space and obtaining feedback against human-authored
reference translations as to how good the translations are. This feedback
then dictates what point in the parameter space should be explored
next. To measure this feedback, it is generally considered wise to have
multiple (usually 4) reference translations to avoid unfair penalization of translation
hypotheses, which could easily happen given the large number of ways in which
a sentence can be translated from one language to another. However, this reliance on multiple reference translations
creates a problem, since they are labor-intensive and expensive to obtain.
Therefore, most current MT datasets only contain a single reference.
This leads to the problem of reference sparsity (the primary open problem
that I address in this dissertation), one that has a serious effect on the
SMT parameter tuning process.
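The effect of reference sparsity on tuning feedback can be illustrated with a minimal sketch. It assumes a deliberately simplified metric, clipped unigram precision, rather than whatever metric the thesis actually tunes against; the function name and the toy sentences are invented for illustration only.

```python
from collections import Counter

def clipped_unigram_precision(hypothesis, references):
    """Fraction of hypothesis words matched in the references.

    Each word's count is clipped to the maximum number of times it
    appears in any single reference. This is a simplified stand-in for
    a real tuning metric, not the metric used in the thesis.
    """
    hyp_counts = Counter(hypothesis.split())
    total = sum(hyp_counts.values())
    if total == 0:
        return 0.0
    # For every word, keep the largest count seen in any one reference.
    max_ref = Counter()
    for ref in references:
        for word, count in Counter(ref.split()).items():
            max_ref[word] = max(max_ref[word], count)
    matched = sum(min(count, max_ref[word]) for word, count in hyp_counts.items())
    return matched / total

# Toy data (invented): a valid hypothesis and two equally valid references.
hyp = "the cat sat on the mat"
ref_a = "the cat is on the mat"
ref_b = "a cat sat on a mat"

single = clipped_unigram_precision(hyp, [ref_a])         # "sat" goes unmatched
multi = clipped_unigram_precision(hyp, [ref_a, ref_b])   # "sat" is licensed by ref_b
```

With only `ref_a`, the hypothesis is penalized for the word "sat" even though it is a legitimate translation choice; adding `ref_b` removes that penalty. This is the kind of unfair penalization that multiple references are meant to avoid.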
Bannard and Callison-Burch (2005) were the first to provide a practical
connection between phrase-based statistical machine translation and paraphrase
generation. However, their technique is restricted to generating phrasal
paraphrases. I build upon their approach and augment a phrasal paraphrase
extractor into a sentential paraphraser with extremely broad coverage.
The novelty in this augmentation lies in the further strengthening of
the connection between statistical machine translation and paraphrase
generation; whereas Bannard and Callison-Burch only relied on SMT machinery
to extract phrasal paraphrase rules and stopped there, I take it a few
steps further and build a full English-to-English SMT system. This system
can, as expected, "translate" any English input sentence into a new English
sentence with the same degree of meaning preservation that exists in a bilingual
SMT system. In fact, being a state-of-the-art SMT system, it is able to generate
n-best "translations" for any given input sentence. This sentential
paraphraser, built almost entirely from existing SMT machinery, represents
the first 180 degrees of the circle of meaning.
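The phrasal paraphrase extraction of Bannard and Callison-Burch (2005) pivots through a foreign language: an English phrase is paraphrased by every other English phrase that shares a foreign translation with it, scored by summing p(e2 | f) * p(f | e1) over the shared foreign phrases f. The sketch below shows only that pivoting step. The table layout (nested dictionaries), the function name, and the probabilities are assumptions made for illustration and are not taken from the thesis.

```python
from collections import defaultdict

def pivot_paraphrases(p_f_given_e, p_e_given_f, phrase):
    """Score paraphrases of `phrase` by pivoting through foreign phrases.

    p_f_given_e: {english_phrase: {foreign_phrase: probability}}
    p_e_given_f: {foreign_phrase: {english_phrase: probability}}
    Returns (paraphrase, score) pairs sorted by descending score.
    """
    scores = defaultdict(float)
    for foreign, p_f in p_f_given_e.get(phrase, {}).items():
        for candidate, p_e in p_e_given_f.get(foreign, {}).items():
            if candidate != phrase:  # a phrase is not its own paraphrase
                scores[candidate] += p_f * p_e
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Toy phrase tables with invented probabilities (English/German).
p_f_given_e = {
    "under control": {"unter kontrolle": 0.8, "im griff": 0.2},
}
p_e_given_f = {
    "unter kontrolle": {"under control": 0.7, "in check": 0.3},
    "im griff": {"under control": 0.5, "in hand": 0.5},
}

ranked = pivot_paraphrases(p_f_given_e, p_e_given_f, "under control")
```

Here `ranked` lists "in check" ahead of "in hand", because the pivot phrase that licenses "in check" is the more probable translation of the input. A sentential paraphraser of the kind described above would use such rules inside a full decoder rather than applying them in isolation.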
To complete the circle, I describe a novel connection in the other direction.
I claim that the sentential paraphraser, once built in this fashion, can
provide a solution to the reference sparsity problem and, hence, be used
to improve the performance of a bilingual SMT system. I discuss two different
instantiations of the sentential paraphraser and show several results that
provide empirical validation for this connection.
Automated mood boards - Ontology-based semantic image retrieval
The main goal of this research is to support concept designers’ search for inspirational and meaningful images in developing mood boards. Finding the right images has
become a well-known challenge as the number of images stored and shared on the Internet and elsewhere keeps increasing steadily and rapidly. The development of
image retrieval technologies, which collect, store and pre-process image information to return relevant images instantly in response to users’ needs, has achieved great
progress in the last decade.
However, the keyword-based content description and query processing techniques for Image Retrieval (IR) currently used have their limitations. Most of these techniques
are adapted from Information Retrieval research, and therefore provide limited capabilities to grasp and exploit conceptualisations due to their inability to handle
ambiguity, synonymy, and semantic constraints. Conceptual search (i.e. searching by meaning rather than literal strings) aims to solve the limitations of the keyword-based
models.
Starting from this point, this thesis investigates the existing IR models, which are oriented to the exploitation of domain knowledge in support of semantic search
capabilities, with a focus on the use of lexical ontologies to improve the semantic perspective. It introduces a technique for extracting semantic DNA (SDNA) from
textual image annotations and constructing semantic image signatures. The semantic signatures are called semantic chromosomes; they contain semantic information
related to the images.
Central to the method of constructing semantic signatures is the concept disambiguation technique developed, which identifies the most relevant SDNA by measuring the semantic importance of each word/phrase in the image annotation. In
addition, a conceptual model of an ontology-based system for generating visual mood boards is proposed. The proposed model, which is adapted from the Vector Space Model, uses semantic chromosomes for semantic indexing and for assessing the semantic similarity of images within a collection.
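In a Vector Space Model, similarity between two items is commonly measured as the cosine of the angle between their weight vectors. The sketch below applies that standard measure to two image signatures represented as concept-to-weight dictionaries. How the thesis actually builds and weights its semantic chromosomes is not stated here, so the representation, the function name, and the example weights are all assumptions.

```python
import math

def cosine_similarity(signature_a, signature_b):
    """Cosine similarity between two sparse {concept: weight} signatures.

    Returns 0.0 when either signature is empty or has zero magnitude.
    """
    dot = sum(weight * signature_b.get(concept, 0.0)
              for concept, weight in signature_a.items())
    norm_a = math.sqrt(sum(w * w for w in signature_a.values()))
    norm_b = math.sqrt(sum(w * w for w in signature_b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

# Hypothetical signatures: concepts and weights are invented for illustration.
beach_photo = {"sea": 0.9, "sand": 0.6, "sky": 0.3}
harbour_photo = {"sea": 0.8, "boat": 0.7, "sky": 0.2}
office_photo = {"desk": 0.9, "computer": 0.8}

sea_related = cosine_similarity(beach_photo, harbour_photo)
unrelated = cosine_similarity(beach_photo, office_photo)
```

Signatures that share weighted concepts score above zero, while signatures with no concepts in common score exactly zero, which is what lets such a model rank images by conceptual rather than literal overlap.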
Turker-Assisted Paraphrasing for English-Arabic Machine Translation
This paper describes a semi-automatic paraphrasing task for English-Arabic machine translation conducted using Amazon Mechanical Turk. The method for automatically extracting paraphrases is described, as are several human judgment tasks completed by Turkers. An ideal task type, revised specifically to address feedback from Turkers, is shown to be sophisticated enough to identify and filter problem Turkers while remaining simple enough for non-experts to complete. The results of this task are discussed along with the viability of using this data to combat data sparsity in MT.