Semantic Parsing in Limited Resource Conditions
This thesis explores challenges in semantic parsing, specifically focusing on
scenarios with limited data and computational resources. It offers solutions
using techniques like automatic data curation, knowledge transfer, active
learning, and continual learning.
For tasks with no parallel training data, the thesis proposes generating
synthetic training examples from structured database schemas. When there is
abundant data in a source domain but limited parallel data in a target domain,
knowledge from the source is leveraged to improve parsing in the target domain.
For multilingual situations with limited data in the target languages, the
thesis introduces a method to adapt parsers using a limited human translation
budget. Active learning is applied to select source-language samples for manual
translation, maximizing parser performance in the target language. In addition,
an alternative method is also proposed to utilize machine translation services,
supplemented by human-translated data, to train a more effective parser.
When computational resources are limited, a continual learning approach is
introduced to minimize training time and computational memory. This maintains
the parser's efficiency in previously learned tasks while adapting it to new
tasks, mitigating the problem of catastrophic forgetting.
Overall, the thesis provides a comprehensive set of methods to improve
semantic parsing in resource-constrained conditions.
Comment: PhD thesis, year of award 2023, 172 pages
The Circle of Meaning: From Translation to Paraphrasing and Back
The preservation of meaning between inputs and outputs is perhaps
the most ambitious and, often, the most elusive goal of systems
that attempt to process natural language. Nowhere is this goal of
more obvious importance than for the tasks of machine translation
and paraphrase generation. Preserving meaning between the input and
the output is paramount for both, the monolingual vs bilingual distinction
notwithstanding. In this thesis, I present a novel, symbiotic relationship
between these two tasks that I term the "circle of meaning".
Today's statistical machine translation (SMT) systems require high
quality human translations for parameter tuning, in addition to
large bi-texts for learning the translation units. This parameter
tuning usually involves generating translations at different points
in the parameter space and obtaining feedback against human-authored
reference translations as to how good the translations are. This feedback
then dictates what point in the parameter space should be explored
next. To measure this feedback, it is generally considered wise to have
multiple (usually 4) reference translations to avoid unfair penalization of translation
hypotheses which could easily happen given the large number of ways in which
a sentence can be translated from one language to another. However, this reliance on multiple reference translations
creates a problem since they are labor intensive and expensive to obtain.
Therefore, most current MT datasets only contain a single reference.
This leads to the problem of reference sparsity, the primary open problem
that I address in this dissertation and one that has a serious effect on the
SMT parameter tuning process.
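To see why a single reference can unfairly penalize a reasonable hypothesis, the following minimal sketch computes clipped n-gram precision, the building block of BLEU-style metrics, against one and then two references. The example sentences and the function names `ngrams` and `clipped_precision` are hypothetical illustrations rather than anything taken from the thesis, and the sketch leaves out the brevity penalty and the geometric mean over n-gram orders that full BLEU uses.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def clipped_precision(hypothesis, references, n=1):
    """Clipped n-gram precision of a hypothesis against one or more references.

    Each hypothesis n-gram is credited at most as many times as it occurs
    in the single reference where it is most frequent.
    """
    hyp_counts = Counter(ngrams(hypothesis.split(), n))
    if not hyp_counts:
        return 0.0

    # For every n-gram, keep the maximum count seen in any one reference.
    max_ref_counts = Counter()
    for ref in references:
        for gram, count in Counter(ngrams(ref.split(), n)).items():
            max_ref_counts[gram] = max(max_ref_counts[gram], count)

    matched = sum(min(count, max_ref_counts[gram])
                  for gram, count in hyp_counts.items())
    return matched / sum(hyp_counts.values())


# Hypothetical toy data: the hypothesis is a fine translation, but the
# single reference happens to phrase the sentence differently.
hypothesis = "the cat is on the mat"
single_ref = ["there is a cat on the mat"]
multi_ref = single_ref + ["the cat is sitting on the mat"]

print(clipped_precision(hypothesis, single_ref))
print(clipped_precision(hypothesis, multi_ref))
```

With only the first reference, one occurrence of "the" goes unmatched and the score drops; adding a second reference removes that penalty, which is the effect that extra references are meant to provide during tuning.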
Bannard and Callison-Burch (2005) were the first to provide a practical
connection between phrase-based statistical machine translation and paraphrase
generation. However, their technique is restricted to generating phrasal
paraphrases. I build upon their approach and augment a phrasal paraphrase
extractor into a sentential paraphraser with extremely broad coverage.
The novelty in this augmentation lies in the further strengthening of
the connection between statistical machine translation and paraphrase
generation; whereas Bannard and Callison-Burch only relied on SMT machinery
to extract phrasal paraphrase rules and stopped there, I take it a few
steps further and build a full English-to-English SMT system. This system
can, as expected, "translate" any English input sentence into a new English
sentence with the same degree of meaning preservation that exists in a bilingual
SMT system. In fact, being a state-of-the-art SMT system, it is able to generate
n-best "translations" for any given input sentence. This sentential
paraphraser, built almost entirely from existing SMT machinery, represents
the first 180 degrees of the circle of meaning.
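The phrasal paraphrase extraction of Bannard and Callison-Burch (2005) is usually described as pivoting through a foreign language: the paraphrase probability p(e2 | e1) is obtained by summing p(f | e1) · p(e2 | f) over foreign phrases f aligned to e1. The sketch below illustrates that computation on toy tables. The phrase pairs, the probabilities, and the names `pivot_paraphrases`, `e2f`, and `f2e` are all hypothetical; a real system would estimate these tables from a word-aligned bilingual corpus.

```python
from collections import defaultdict


def pivot_paraphrases(phrase, e2f, f2e):
    """Score paraphrase candidates for `phrase` by pivoting through foreign phrases.

    e2f maps an English phrase to {foreign phrase: p(f | e)}.
    f2e maps a foreign phrase to {English phrase: p(e | f)}.
    Returns {candidate: sum over f of p(f | phrase) * p(candidate | f)}.
    """
    scores = defaultdict(float)
    for foreign, p_f_given_e in e2f.get(phrase, {}).items():
        for candidate, p_e_given_f in f2e.get(foreign, {}).items():
            if candidate != phrase:  # skip the identity "paraphrase"
                scores[candidate] += p_f_given_e * p_e_given_f
    return dict(scores)


# Hypothetical toy translation tables (not real estimates).
e2f = {"under control": {"unter kontrolle": 0.8, "im griff": 0.2}}
f2e = {
    "unter kontrolle": {"under control": 0.7, "in check": 0.3},
    "im griff": {"under control": 0.5, "in check": 0.25, "in hand": 0.25},
}

scores = pivot_paraphrases("under control", e2f, f2e)
for candidate, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{candidate}: {score:.2f}")
```

Candidates reachable through several pivot phrases accumulate probability mass, which is why "in check" outranks "in hand" in this toy example.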
To complete the circle, I describe a novel connection in the other direction.
I claim that the sentential paraphraser, once built in this fashion, can
provide a solution to the reference sparsity problem and, hence, be used
to improve the performance of a bilingual SMT system. I discuss two different
instantiations of the sentential paraphraser and show several results that
provide empirical validation for this connection.
Paraphrasing and Translation
Paraphrasing and translation have previously been treated as unconnected natural
language processing tasks. Whereas translation represents the preservation of meaning
when an idea is rendered in the words of a different language, paraphrasing represents
the preservation of meaning when an idea is expressed using different words in the
same language. We show that the two are intimately related. The major contributions
of this thesis are as follows:
• We define a novel technique for automatically generating paraphrases using
bilingual parallel corpora, which are more commonly used as training data for
statistical models of translation.
• We show that paraphrases can be used to improve the quality of statistical machine
translation by addressing the problem of coverage and introducing a degree
of generalization into the models.
• We explore the topic of automatic evaluation of translation quality, and show that
the current standard evaluation methodology cannot be guaranteed to correlate
with human judgments of translation quality.
Whereas previous data-driven approaches to paraphrasing were dependent upon
either data sources which were uncommon, such as multiple translations of the same
source text, or language-specific resources such as parsers, our approach is able to
harness more widely available parallel corpora and can be applied to any language which has
a parallel corpus. The technique was evaluated by replacing phrases with their paraphrases,
and asking judges whether the meaning of the original phrase was retained
and whether the resulting sentence remained grammatical. Paraphrases extracted from
a parallel corpus with manual alignments are judged to be accurate (both meaningful
and grammatical) 75% of the time, retaining the meaning of the original phrase 85%
of the time. Using automatic alignments, meaning can be retained at a rate of 70%.
Being a language-independent and probabilistic approach allows our method to be
easily integrated into statistical machine translation. A paraphrase model derived from
parallel corpora other than the one used to train the translation model can be used to
increase the coverage of statistical machine translation by adding translations of previously unseen words and phrases. If the translation of a word was not learned, but
a translation of a synonymous word has been learned, then the word is paraphrased and its paraphrase is translated. Phrases can be treated similarly. Results show that
augmenting a state-of-the-art SMT system with paraphrases in this way leads to significantly improved coverage and translation quality. For a training corpus with 10,000
sentence pairs, we increase the coverage of unique test set unigrams from 48% to 90%,
with more than half of the newly covered items accurately translated, as opposed to
none in current approaches.
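The coverage idea described above (paraphrase an unseen word, then translate the paraphrase) can be sketched as a simple lookup with a fallback. This is only a toy word-level illustration under stated assumptions: the dictionaries, their entries, and the function name `translate_with_paraphrase_fallback` are hypothetical, and the thesis integrates paraphrases into a phrase-based statistical system rather than a word-for-word lookup like this one.

```python
def translate_with_paraphrase_fallback(tokens, translation_table, paraphrase_table):
    """Translate token by token, paraphrasing source words that have no translation.

    translation_table maps a source word to a target word.
    paraphrase_table maps a source word to {paraphrase: probability}.
    Unseen words with no translatable paraphrase are passed through unchanged.
    """
    output = []
    for token in tokens:
        if token in translation_table:
            output.append(translation_table[token])
            continue

        # Try paraphrases from most to least probable.
        candidates = sorted(paraphrase_table.get(token, {}).items(),
                            key=lambda kv: kv[1], reverse=True)
        for paraphrase, _prob in candidates:
            if paraphrase in translation_table:
                output.append(translation_table[paraphrase])
                break
        else:
            output.append(token)  # still uncovered: leave untranslated
    return output


def coverage(tokens, translation_table):
    """Fraction of unique tokens that have a direct translation."""
    unique = set(tokens)
    return sum(t in translation_table for t in unique) / len(unique)


# Hypothetical toy tables (English to Spanish), not learned from data.
translation_table = {"the": "la", "house": "casa", "big": "grande"}
paraphrase_table = {
    "home": {"house": 0.6, "residence": 0.4},
    "large": {"big": 0.7},
}

source = "the large home".split()
print(coverage(source, translation_table))
print(translate_with_paraphrase_fallback(source, translation_table, paraphrase_table))
```

In this toy setting only "the" is covered directly, yet all three words receive a translation once paraphrases are consulted, mirroring the kind of coverage gain the abstract reports for small training corpora.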
Investigating translation strategies in Indonesian best seller novel
Translation strategies have been the subject of extensive investigation. Most people believe that translators use specific strategies and that basic translation strategies are sometimes insufficient. As a result, numerous scholars have investigated and analyzed translation techniques from various perspectives. This study identified the translation strategies used in the novel Negeri 5 Menara and its English version, The Land of Five Towers, using Baker's (2011) framework and a descriptive qualitative technique. There were 130 data points in all. According to the findings, translation by a more general word accounted for 11% of all translation strategies, a more neutral or less expressive word for 14%, cultural substitution for 8%, loan words for 5%, omission for 4%, paraphrase using related words for 57%, and paraphrase using unrelated words for 2%; there was no data on illustration. There were 21 uncategorized data points for every given strategy. It was predicted that in the future, a translator, who is also a pre-service teacher, should widen his or her translation methodologies in order to combat non-equivalence in translation.
Adapting Automatic Summarization to New Sources of Information
English-language news articles are no longer necessarily the best source of information. The Web allows information to spread more quickly and travel farther: first-person accounts of breaking news events pop up on social media, and foreign-language news articles are accessible to, if not immediately understandable by, English-speaking users. This thesis focuses on developing automatic summarization techniques for these new sources of information.
We focus on summarizing two specific new sources of information: personal narratives, first-person accounts of exciting or unusual events that are readily found in blog entries and other social media posts, and non-English documents, which must first be translated into English, often introducing translation errors that complicate the summarization process. Personal narratives are a very new area of interest in natural language processing research, and they present two key challenges for summarization. First, unlike many news articles, whose lead sentences serve as summaries of the most important ideas in the articles, personal narratives provide no such shortcuts for determining where important information occurs within them; second, personal narratives are written informally and colloquially, and unlike news articles, they are rarely edited, so they require heavier editing and rewriting during the summarization process. Non-English documents, whether news or narrative, present yet another source of difficulty on top of any challenges inherent to their genre: they must be translated into English, potentially introducing translation errors and disfluencies that must be identified and corrected during summarization.
The bulk of this thesis is dedicated to addressing the challenges of summarizing personal narratives found on the Web. We develop a two-stage summarization system for personal narrative that first extracts sentences containing important content and then rewrites those sentences into summary-appropriate forms. Our content extraction system is inspired by contextualist narrative theory, using changes in writing style throughout a narrative to detect sentences containing important information; it outperforms both graph-based and neural network approaches to sentence extraction for this genre. Our paraphrasing system rewrites the extracted sentences into shorter, standalone summary sentences, learning to mimic the paraphrasing choices of human summarizers more closely than can traditional lexicon- or translation-based paraphrasing approaches.
We conclude with a chapter dedicated to summarizing non-English documents written in low-resource languages – documents that would otherwise be unreadable for English-speaking users. We develop a cross-lingual summarization system that performs even heavier editing and rewriting than does our personal narrative paraphrasing system; we create and train on large amounts of synthetic errorful translations of foreign-language documents. Our approach produces fluent English summaries from disfluent translations of non-English documents, and it generalizes across languages.
Understanding and Enhancing the Use of Context for Machine Translation
To understand and infer meaning in language, neural models have to learn
complicated nuances. Discovering distinctive linguistic phenomena from data is
not an easy task. For instance, lexical ambiguity is a fundamental feature of
language which is challenging to learn. Even more prominently, inferring the
meaning of rare and unseen lexical units is difficult with neural networks.
Meaning is often determined from context. With context, languages allow meaning
to be conveyed even when the specific words used are not known by the reader.
To model this learning process, a system has to learn from a few instances in
context and be able to generalize well to unseen cases. The learning process is
hindered when training data is scarce for a task. Even with sufficient data,
learning patterns for the long tail of the lexical distribution is challenging.
In this thesis, we focus on understanding certain potentials of contexts in
neural models and design augmentation models to benefit from them. We focus on
machine translation as an important instance of the more general language
understanding problem. To translate from a source language to a target
language, a neural model has to understand the meaning of constituents in the
provided context and generate constituents with the same meanings in the target
language. This task accentuates the value of capturing nuances of language and
the necessity of generalization from few observations. The main problem we
study in this thesis is what neural machine translation models learn from data
and how we can devise more focused contexts to enhance this learning. Looking
more in-depth into the role of context and the impact of data on learning
models is essential to advance the NLP field. Moreover, it helps highlight the
vulnerabilities of current neural networks and provides insights into designing
more robust models.
Comment: PhD dissertation defended on November 10th, 202
Translating Culture-Bound Elements in the Film Zootropolis
The aim of this master's thesis is to examine how culture-bound elements have been translated in Disney's film Zootropolis. The main sources are Frederic Chaume (2012) on dubbing, Lawrence Venuti (1995) on domestication, Ritva Leppihalme's work (1997) on translating allusions, and Jan Pedersen's (2011) theory of extralinguistic cultural references.
The material of the thesis consists of the original English-language version of Zootropolis and its Finnish translation. The culture-bound elements appearing in the film were listed and divided into six categories: names, nicknames and insulting names; forms of address; institutions, professions and society; idioms and colloquialisms; general cultural knowledge; and references to pop culture.
Based on the results, it can be said that the translator did not follow a single global translation strategy; instead, each culture-bound element was translated case by case, sometimes with a domesticating and sometimes with a foreignizing strategy. The most frequently used strategies were situational substitution and direct translation, which were distributed fairly evenly across the categories. The largest differences were in the category institutions, professions and society, where direct translation was clearly the most common strategy and substitution was used very little, and in the category idioms and colloquialisms, where situational substitution was clearly the most common and direct translation was used very rarely.
Explicit Sentence Compression for Neural Machine Translation
State-of-the-art Transformer-based neural machine translation (NMT) systems
still follow a standard encoder-decoder framework, in which source sentence
representation can be learned well by an encoder with a self-attention mechanism.
Though a Transformer-based encoder may effectively capture general information in
its resulting source sentence representation, the backbone information, which
stands for the gist of a sentence, is not specifically focused on. In this
paper, we propose an explicit sentence compression method to enhance the source
sentence representation for NMT. In practice, an explicit sentence compression
goal is used to learn the backbone information in a sentence. We propose three
ways, including backbone source-side fusion, target-side fusion, and both-side
fusion, to integrate the compressed sentence into NMT. Our empirical tests on
the WMT English-to-French and English-to-German translation tasks show that the
proposed sentence compression method significantly improves the translation
performance over strong baselines.
Comment: Work in progress, part of this work is accepted in AAAI-202