
    Semantic Parsing in Limited Resource Conditions

    This thesis explores challenges in semantic parsing, specifically focusing on scenarios with limited data and computational resources. It offers solutions using techniques like automatic data curation, knowledge transfer, active learning, and continual learning. For tasks with no parallel training data, the thesis proposes generating synthetic training examples from structured database schemas. When there is abundant data in a source domain but limited parallel data in a target domain, knowledge from the source is leveraged to improve parsing in the target domain. For multilingual situations with limited data in the target languages, the thesis introduces a method to adapt parsers using a limited human translation budget. Active learning is applied to select source-language samples for manual translation, maximizing parser performance in the target language. An alternative method is also proposed that uses machine translation services, supplemented by human-translated data, to train a more effective parser. When computational resources are limited, a continual learning approach is introduced to minimize training time and computational memory. This maintains the parser's efficiency in previously learned tasks while adapting it to new tasks, mitigating the problem of catastrophic forgetting. Overall, the thesis provides a comprehensive set of methods to improve semantic parsing in resource-constrained conditions. Comment: PhD thesis, year of award 2023, 172 pages

    The Circle of Meaning: From Translation to Paraphrasing and Back

    The preservation of meaning between inputs and outputs is perhaps the most ambitious and, often, the most elusive goal of systems that attempt to process natural language. Nowhere is this goal of more obvious importance than for the tasks of machine translation and paraphrase generation. Preserving meaning between the input and the output is paramount for both, the monolingual vs bilingual distinction notwithstanding. In this thesis, I present a novel, symbiotic relationship between these two tasks that I term the "circle of meaning". Today's statistical machine translation (SMT) systems require high quality human translations for parameter tuning, in addition to large bi-texts for learning the translation units. This parameter tuning usually involves generating translations at different points in the parameter space and obtaining feedback against human-authored reference translations as to how good the translations are. This feedback then dictates what point in the parameter space should be explored next. To measure this feedback, it is generally considered wise to have multiple (usually 4) reference translations to avoid unfair penalization of translation hypotheses, which could easily happen given the large number of ways in which a sentence can be translated from one language to another. However, this reliance on multiple reference translations creates a problem since they are labor intensive and expensive to obtain. Therefore, most current MT datasets only contain a single reference. This leads to the problem of reference sparsity, the primary open problem that I address in this dissertation and one that has a serious effect on the SMT parameter tuning process. Bannard and Callison-Burch (2005) were the first to provide a practical connection between phrase-based statistical machine translation and paraphrase generation. However, their technique is restricted to generating phrasal paraphrases. I build upon their approach and augment a phrasal paraphrase extractor into a sentential paraphraser with extremely broad coverage. The novelty in this augmentation lies in the further strengthening of the connection between statistical machine translation and paraphrase generation; whereas Bannard and Callison-Burch only relied on SMT machinery to extract phrasal paraphrase rules and stopped there, I take it a few steps further and build a full English-to-English SMT system. This system can, as expected, "translate" any English input sentence into a new English sentence with the same degree of meaning preservation that exists in a bilingual SMT system. In fact, being a state-of-the-art SMT system, it is able to generate n-best "translations" for any given input sentence. This sentential paraphraser, built almost entirely from existing SMT machinery, represents the first 180 degrees of the circle of meaning. To complete the circle, I describe a novel connection in the other direction. I claim that the sentential paraphraser, once built in this fashion, can provide a solution to the reference sparsity problem and, hence, be used to improve the performance of a bilingual SMT system. I discuss two different instantiations of the sentential paraphraser and show several results that provide empirical validation for this connection.

    Paraphrasing and Translation

    Paraphrasing and translation have previously been treated as unconnected natural language processing tasks. Whereas translation represents the preservation of meaning when an idea is rendered in the words of a different language, paraphrasing represents the preservation of meaning when an idea is expressed using different words in the same language. We show that the two are intimately related. The major contributions of this thesis are as follows:
    • We define a novel technique for automatically generating paraphrases using bilingual parallel corpora, which are more commonly used as training data for statistical models of translation.
    • We show that paraphrases can be used to improve the quality of statistical machine translation by addressing the problem of coverage and introducing a degree of generalization into the models.
    • We explore the topic of automatic evaluation of translation quality, and show that the current standard evaluation methodology cannot be guaranteed to correlate with human judgments of translation quality.
    Whereas previous data-driven approaches to paraphrasing were dependent upon either data sources which were uncommon, such as multiple translations of the same source text, or language-specific resources such as parsers, our approach is able to harness more widely available parallel corpora and can be applied to any language which has a parallel corpus. The technique was evaluated by replacing phrases with their paraphrases, and asking judges whether the meaning of the original phrase was retained and whether the resulting sentence remained grammatical. Paraphrases extracted from a parallel corpus with manual alignments are judged to be accurate (both meaningful and grammatical) 75% of the time, retaining the meaning of the original phrase 85% of the time. Using automatic alignments, meaning can be retained at a rate of 70%. Because it is language-independent and probabilistic, our method can be easily integrated into statistical machine translation. A paraphrase model derived from parallel corpora other than the one used to train the translation model can be used to increase the coverage of statistical machine translation by adding translations of previously unseen words and phrases. If the translation of a word was not learned, but a translation of a synonymous word has been learned, then the word is paraphrased and its paraphrase is translated. Phrases can be treated similarly. Results show that augmenting a state-of-the-art SMT system with paraphrases in this way leads to significantly improved coverage and translation quality. For a training corpus with 10,000 sentence pairs, we increase the coverage of unique test set unigrams from 48% to 90%, with more than half of the newly covered items accurately translated, as opposed to none in current approaches.
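
    The paraphrase-then-translate idea in this abstract can be sketched in a few lines. The sketch below is illustrative only and is not taken from the thesis: it assumes the commonly cited pivot formulation, in which a paraphrase e2 of a phrase e1 is scored as the sum over foreign phrases f of p(f | e1) * p(e2 | f). All probability tables, phrases, and function names are hypothetical toy values.

```python
from collections import defaultdict


def pivot_paraphrases(p_f_given_e, p_e_given_f, phrase):
    """Score paraphrases of `phrase` by pivoting through foreign phrases.

    Assumed formulation: p(e2 | e1) = sum_f p(f | e1) * p(e2 | f).
    """
    scores = defaultdict(float)
    for foreign, p_f in p_f_given_e.get(phrase, {}).items():
        for candidate, p_e2 in p_e_given_f.get(foreign, {}).items():
            if candidate != phrase:  # a phrase is not its own paraphrase
                scores[candidate] += p_f * p_e2
    return dict(scores)


def translate_with_fallback(phrase, phrase_table, p_f_given_e, p_e_given_f):
    """Translate `phrase`; if it is unseen, translate its best known paraphrase."""
    if phrase in phrase_table:
        return phrase_table[phrase]
    candidates = pivot_paraphrases(p_f_given_e, p_e_given_f, phrase)
    # Try paraphrases from most to least probable.
    for paraphrase in sorted(candidates, key=candidates.get, reverse=True):
        if paraphrase in phrase_table:
            return phrase_table[paraphrase]
    return None  # no translation and no translatable paraphrase


# Hypothetical toy tables from an English-German parallel corpus.
P_F_GIVEN_E = {"under control": {"unter kontrolle": 0.75, "im griff": 0.25}}
P_E_GIVEN_F = {
    "unter kontrolle": {"under control": 0.5, "in check": 0.5},
    "im griff": {"under control": 0.5, "in hand": 0.5},
}
# Hypothetical English-Spanish phrase table that never saw "under control".
EN_ES_TABLE = {"in check": "bajo control"}

if __name__ == "__main__":
    print(pivot_paraphrases(P_F_GIVEN_E, P_E_GIVEN_F, "under control"))
    print(translate_with_fallback("under control", EN_ES_TABLE,
                                  P_F_GIVEN_E, P_E_GIVEN_F))
```

    Here the paraphrase tables come from a different parallel corpus than the translation table, mirroring the setup the abstract describes; a real system would work with full phrase tables and feature weights rather than a single lookup.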

    Investigating translation strategies in Indonesian best seller novel

    Translation strategies have been the subject of extensive investigation. Most people believe that translators use specific strategies and that basic translation strategies are sometimes insufficient. As a result, numerous scholars have investigated and analyzed various translation techniques from various perspectives. This study used a descriptive qualitative technique and Baker's (2011) framework to determine the translation strategies in the novel Negeri 5 Menara and its English version, The Land of Five Towers. There were 130 data points in all. According to the findings, translation by a more general word accounted for 11% of all translation strategies, a more neutral or less expressive word for 14%, cultural substitution for 8%, loan words for 5%, omission for 4%, paraphrase with related words for 57%, and paraphrase with unrelated words for 2%; there was no data on illustration. In addition, 21 data points could not be categorized under any of the given strategies. The study suggests that translators who are also pre-service teachers should broaden their repertoire of translation strategies in order to deal with non-equivalence in translation.

    Understanding and Enhancing the Use of Context for Machine Translation

    To understand and infer meaning in language, neural models have to learn complicated nuances. Discovering distinctive linguistic phenomena from data is not an easy task. For instance, lexical ambiguity is a fundamental feature of language which is challenging to learn. Even more prominently, inferring the meaning of rare and unseen lexical units is difficult with neural networks. Meaning is often determined from context. With context, languages allow meaning to be conveyed even when the specific words used are not known by the reader. To model this learning process, a system has to learn from a few instances in context and be able to generalize well to unseen cases. The learning process is hindered when training data is scarce for a task. Even with sufficient data, learning patterns for the long tail of the lexical distribution is challenging. In this thesis, we focus on understanding certain potentials of contexts in neural models and design augmentation models to benefit from them. We focus on machine translation as an important instance of the more general language understanding problem. To translate from a source language to a target language, a neural model has to understand the meaning of constituents in the provided context and generate constituents with the same meanings in the target language. This task accentuates the value of capturing nuances of language and the necessity of generalization from few observations. The main problem we study in this thesis is what neural machine translation models learn from data and how we can devise more focused contexts to enhance this learning. Looking more in-depth into the role of context and the impact of data on learning models is essential to advance the NLP field. Moreover, it helps highlight the vulnerabilities of current neural networks and provides insights into designing more robust models. Comment: PhD dissertation defended on November 10th, 202

    Kulttuurisidonnaisten elementtien kääntäminen elokuvassa Zootropolis (Translating culture-bound elements in the film Zootropolis)

    The aim of this master's thesis is to examine how culture-bound elements have been translated in Disney's film Zootropolis. The main sources are Frederic Chaume (2012) on dubbing, Lawrence Venuti (1995) on domestication, Ritva Leppihalme's work (1997) on translating allusions, and Jan Pedersen's (2011) theory of extralinguistic cultural references. The material consists of the original English-language version of Zootropolis and its Finnish translation. The culture-bound elements appearing in the film were listed and divided into six categories: names, nicknames and insulting names; forms of address; institutions, professions and society; idioms and colloquialisms; general cultural knowledge; and references to pop culture. The results show that the translator did not follow a single global translation strategy; instead, each culture-bound element was translated case by case, sometimes with a domesticating and sometimes with a foreignizing strategy. The most frequently used strategies were situational substitution and direct translation, which were distributed fairly evenly across the categories. The largest differences were in the category of institutions, professions and society, where direct translation was clearly the most common strategy and substitution was used very little, and in the category of idioms and colloquialisms, where situational substitution was clearly the most common strategy and direct translation was used very rarely.

    Explicit Sentence Compression for Neural Machine Translation

    State-of-the-art Transformer-based neural machine translation (NMT) systems still follow a standard encoder-decoder framework, in which the source sentence representation is produced by an encoder with a self-attention mechanism. Though a Transformer-based encoder may effectively capture general information in its resulting source sentence representation, the backbone information, which stands for the gist of a sentence, is not specifically focused on. In this paper, we propose an explicit sentence compression method to enhance the source sentence representation for NMT. In practice, an explicit sentence compression goal is used to learn the backbone information in a sentence. We propose three ways to integrate the compressed sentence into NMT: backbone source-side fusion, target-side fusion, and both-side fusion. Our empirical tests on the WMT English-to-French and English-to-German translation tasks show that the proposed sentence compression method significantly improves translation performance over strong baselines. Comment: Work in progress, part of this work is accepted in AAAI-202