Fine-Grained Linguistic Soft Constraints on Statistical Natural Language Processing Models
This dissertation focuses on the effective combination of data-driven natural language processing (NLP) approaches with linguistic knowledge sources based on manual text annotation or on word grouping according to semantic commonalities. I apply fine-grained linguistic soft constraints, of a syntactic or semantic nature, to statistical NLP models and evaluate them in end-to-end, state-of-the-art statistical machine translation (SMT) systems. The introduction of semantic soft constraints involves intrinsic evaluation on word-pair similarity ranking tasks, extension from words to phrases, application in a novel distributional paraphrase generation technique, and the introduction of a generalized framework of which these semantic and syntactic soft constraints can be viewed as instances and in which they can potentially be combined.
In many cases, fine granularity is key to the successful combination of these soft constraints. I show how to softly constrain SMT models by adding fine-grained weighted features, each preferring the translation of only a specific syntactic constituent; previous attempts using coarse-grained features yielded negative results. I also show how to softly constrain corpus-based semantic models of words (“distributional profiles”) to effectively create word-sense-aware models, by using the semantic word groupings found in a manually compiled thesaurus; previous attempts, using hard constraints and resulting in aggregated, coarse-grained models, yielded lower gains.
A novel paraphrase generation technique incorporating these soft semantic constraints is then also evaluated in an SMT system. This paraphrasing technique is based on the Distributional Hypothesis; its main advantage over current “pivoting” techniques for paraphrasing is its independence from parallel texts, which are a limited resource. The evaluation is done by augmenting translation models with paraphrase-based translation rules, where fine-grained scoring of the paraphrase-based rules yields significantly higher gains.
The model augmentation includes a novel semantic reinforcement component: in many cases there are alternative paths for generating a paraphrase-based translation rule, and each of these paths reinforces a dedicated score for the “goodness” of the new rule. This augmented score is then used as a soft constraint, in a weighted log-linear feature, letting the translation model learn how much to “trust” the paraphrase-based translation rules.
The work reported here is the first to use distributional semantic similarity measures to improve the performance of an end-to-end phrase-based SMT system. The unified framework for statistical NLP models with soft linguistic constraints enables, in principle, the combination of both semantic and syntactic constraints, and potentially other constraints too, in a single SMT model.
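To make the log-linear soft-constraint idea concrete, the sketch below shows how a reinforced "trust" score for a paraphrase-based rule could enter a weighted log-linear model as one more feature. The feature names, weights, and path probabilities are illustrative assumptions, not values from the dissertation.

```python
import math

# Hypothetical feature weights (in practice these would be tuned, e.g. by MERT).
# The paraphrase_trust weight lets the model learn how much to rely on
# paraphrase-based translation rules.
WEIGHTS = {
    "tm_log_prob": 1.0,        # translation model log-probability
    "lm_log_prob": 0.8,        # language model log-probability
    "paraphrase_trust": 0.3,   # soft constraint on paraphrase-based rules
}

def paraphrase_trust(path_probs):
    """Each alternative derivation path of a paraphrase-based rule reinforces
    its "goodness" score; here we simply accumulate path probabilities and
    move to log space."""
    return math.log(sum(path_probs)) if path_probs else float("-inf")

def loglinear_score(features):
    """Standard log-linear combination: the weighted sum of feature values."""
    return sum(WEIGHTS[name] * value for name, value in features.items())

# A hypothesis that uses a paraphrase-based rule reachable via two paths.
features = {
    "tm_log_prob": -4.2,
    "lm_log_prob": -7.5,
    "paraphrase_trust": paraphrase_trust([0.02, 0.01]),
}
print(loglinear_score(features))
```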
Semi-automatic Filtering of Translation Errors in Triangle Corpus
The meaning that justice has after a conflict may vary with a society's political development and its cultural and shared values. Rawls, in his Theory of Justice, gives his idea of what justice is and presents two principles of justice that he argues are required to live in a good society: a first principle that secures equal rights and liberties for all individuals, and a second, egalitarian principle that restrains the consequences of economic inequalities within societies. He also introduces the concept of “overlapping consensus”, which I will use in connection with the idea of reconciliation at the end of this paper. In the cases presented in this paper (i.e. Argentina and South Africa), essential human rights were violated, and wrongdoers thereby made society unjust. The aim of this thesis is to elucidate the conditions that are necessary to re-establish justice when a society goes through a conflict. I will introduce some ideas concerning that issue: retribution, reparation, and reconciliation, seen as the different paths several countries have taken when trying to tackle the matter of achieving justice. In my view, this question can be answered by appealing first to an intuitive conception of moral justice that may exist at both an individual and a collective level. The ethical dilemmas at both levels concern the harm done, punishment, and how to balance the two, limiting punishment, for instance, in order to achieve a just and better society. I will also present how shared values can result from a process of reconciliation, which is considered the ideal alternative for achieving justice. When the equilibrium between members of a community is broken, some people claim that punishment can restore the equilibrium that existed before; nevertheless, peace, reconciliation, and justice cannot be built on a basis of silence. One way to keep memory alive is to let survivors narrate what they have lived through; telling stories also creates a new space in which they can share their experiences with others, revealing their fears and emotions. In this connection, I will present the NUNCA MAS (Never Again) report, which is fundamental because it gives testimonies, facts, and proposals that can help to reach a consensus and, therefore, future reconciliation. Why is it important to achieve justice? Because the members of a community will then be able to interact in the present with common shared values and thus deal with the past. Failing to consider reconciliation as an important step towards justice, and thinking only in terms of punishment rather than giving way to peace and justice, could promote further conflict. Considering both punishment and reconciliation might be one possible blueprint on the long and difficult path of searching for a just society.
Pivot-based Statistical Machine Translation for Morphologically Rich Languages
This thesis describes research on pivot-based statistical machine translation (SMT) for morphologically rich languages (MRL). We provide a framework for translating to and from morphologically rich languages, especially in the context of having little or no parallel corpora between the source and the target languages. We address three main challenges. The first is the sparsity of data as a result of morphological richness. The second is maximizing the precision and recall of the pivoting process itself. The last is making use of any parallel data between the source and the target languages. To address the challenge of data sparsity, we explored a space of tokenization schemes and normalization options. We also examined a set of six detokenization techniques to evaluate detokenized and orthographically corrected (enriched) output. We provide a recipe of the best settings for translating into one of the most challenging languages, namely Arabic. Our best model improves translation quality over the baseline by 1.3 BLEU points. We also investigated the idea of separating translation from morphology generation. We compared three methods of modeling morphological features: features can be modeled as part of the core translation; alternatively, they can be generated using target monolingual context; or they can be predicted using both source and target information. In our experimental results, we outperform the vanilla factored translation model. Deciding which features to translate, generate, or predict requires a detailed error analysis of the system output; to support this, we present AMEANA, an open-source tool for error analysis of natural language processing tasks, targeting morphologically rich languages. The second challenge we are concerned with is the pivoting process itself. We discuss several techniques to improve the precision and recall of the pivot matching. One technique for improving recall works at the level of word alignment, as an optimization process for pivoting driven by generating phrase pairs between the source and target languages. Although improving the recall of the pivot matching improves overall translation quality, we also need to increase the precision of the pivot phrase pairs. To achieve this, we introduce quality constraint scores that determine the quality of the pivot phrase pairs between the source and target languages. We show positive results for different language pairs, which demonstrates the consistency of our approaches; in one of our best models we reach an improvement of 1.2 BLEU points. The third challenge we are concerned with is how to make use of any parallel data between the source and the target languages. We build on the approach of improving the precision of the pivoting process and on methods for combining the pivot system with the direct system built from the parallel data. In one of these approaches, we introduce morphology constraint scores, added to the log-linear feature space, to determine the quality of the pivot phrase pairs. We compare two methods of generating the morphology constraints: one is based on hand-crafted rules relying on our knowledge of the source and target languages, while in the other the morphology constraints are induced from available parallel data between the source and target languages, which we also use to build a direct translation model.
We then combine the pivot and direct models to achieve better coverage and overall translation quality. Using induced morphology constraints outperformed the hand-crafted rules and improved over our best model from all previous approaches by 0.6 BLEU points (7.2 and 6.7 BLEU points over the direct and pivot baselines, respectively). Finally, we introduce smart techniques for combining pivot and direct models, and show that smart selective combination can lead to a large reduction in the size of the pivot model without affecting performance, and in some cases even improving it.
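As background for the pivoting process this abstract describes, the sketch below shows the standard phrase-table triangulation through a pivot language, where source-target phrase pairs are induced by joining source-pivot and pivot-target entries and marginalising over the pivot phrase. The toy phrases and probabilities are invented for illustration; the thesis's quality and morphology constraint scores would be added on top of such pairs as extra log-linear features.

```python
from collections import defaultdict

def triangulate(src_piv, piv_tgt):
    """Induce source-target phrase pairs through a pivot language:
        p(t | s) ~= sum_z p(t | z) * p(z | s)
    Both input tables map (phrase, phrase) tuples to probabilities."""
    by_pivot = defaultdict(list)            # index pivot-target entries by pivot phrase
    for (z, t), p_tz in piv_tgt.items():
        by_pivot[z].append((t, p_tz))
    src_tgt = defaultdict(float)
    for (s, z), p_zs in src_piv.items():
        for t, p_tz in by_pivot[z]:
            src_tgt[(s, t)] += p_zs * p_tz  # marginalise over the pivot phrase
    return dict(src_tgt)

# Toy example: translating through an English pivot.
src_piv = {("casa blanca", "white house"): 0.7, ("casa blanca", "white home"): 0.1}
piv_tgt = {("white house", "maison blanche"): 0.8, ("white home", "maison blanche"): 0.4}
print(triangulate(src_piv, piv_tgt))        # -> {('casa blanca', 'maison blanche'): ~0.6}
```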
The Circle of Meaning: From Translation to Paraphrasing and Back
The preservation of meaning between inputs and outputs is perhaps the most ambitious and, often, the most elusive goal of systems that attempt to process natural language. Nowhere is this goal of more obvious importance than for the tasks of machine translation and paraphrase generation. Preserving meaning between the input and the output is paramount for both, the monolingual vs bilingual distinction notwithstanding. In this thesis, I present a novel, symbiotic relationship between these two tasks that I term the "circle of meaning".
Today's statistical machine translation (SMT) systems require high-quality human translations for parameter tuning, in addition to large bi-texts for learning the translation units. This parameter tuning usually involves generating translations at different points in the parameter space and obtaining feedback against human-authored reference translations as to how good the translations are. This feedback then dictates what point in the parameter space should be explored next. To measure this feedback, it is generally considered wise to have multiple (usually 4) reference translations to avoid unfair penalization of translation hypotheses, which could easily happen given the large number of ways in which a sentence can be translated from one language to another. However, this reliance on multiple reference translations creates a problem, since such references are labor-intensive and expensive to obtain. Therefore, most current MT datasets contain only a single reference. This leads to the problem of reference sparsity, the primary open problem that I address in this dissertation, and one that has a serious effect on the SMT parameter tuning process.
Bannard and Callison-Burch (2005) were the first to provide a practical connection between phrase-based statistical machine translation and paraphrase generation. However, their technique is restricted to generating phrasal paraphrases. I build upon their approach and augment a phrasal paraphrase extractor into a sentential paraphraser with extremely broad coverage. The novelty in this augmentation lies in the further strengthening of the connection between statistical machine translation and paraphrase generation; whereas Bannard and Callison-Burch only relied on SMT machinery to extract phrasal paraphrase rules and stopped there, I take it a few steps further and build a full English-to-English SMT system. This system can, as expected, "translate" any English input sentence into a new English sentence with the same degree of meaning preservation that exists in a bilingual SMT system. In fact, being a state-of-the-art SMT system, it is able to generate n-best "translations" for any given input sentence. This sentential paraphraser, built almost entirely from existing SMT machinery, represents the first 180 degrees of the circle of meaning.
To complete the circle, I describe a novel connection in the other direction. I claim that the sentential paraphraser, once built in this fashion, can provide a solution to the reference sparsity problem and, hence, be used to improve the performance of a bilingual SMT system. I discuss two different instantiations of the sentential paraphraser and show several results that provide empirical validation for this connection.
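One way to picture how a sentential paraphraser could relieve reference sparsity is sketched below: the single human reference is expanded with its n-best paraphrases, and a hypothesis is scored against the closest reference, mimicking a multi-reference tuning setup. The paraphraser stub and the toy unigram-precision metric are placeholders for the dissertation's actual paraphraser and tuning metric.

```python
def expand_references(reference, paraphrase_nbest, n=3):
    """Turn one human reference into several by adding its top-n sentential
    paraphrases, mimicking a multi-reference tuning setup."""
    return [reference] + paraphrase_nbest(reference, n)

def closest_reference_score(hypothesis, references):
    """Crude stand-in for a multi-reference metric: unigram precision against
    the closest reference (real tuning would use BLEU or a similar metric)."""
    hyp = hypothesis.split()
    best = 0.0
    for ref in references:
        ref_tokens = set(ref.split())
        overlap = sum(1 for w in hyp if w in ref_tokens)
        best = max(best, overlap / max(len(hyp), 1))
    return best

# Hypothetical usage with a stub paraphraser standing in for the real system.
stub_paraphraser = lambda s, n: [s.replace("purchased", "bought")][:n]
refs = expand_references("she purchased a new car", stub_paraphraser)
print(closest_reference_score("she bought a new car", refs))   # 1.0 thanks to the added paraphrase
```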
A Correlational Encoder Decoder Architecture for Pivot Based Sequence Generation
Interlingua-based machine translation (MT) aims to encode multiple languages into a common linguistic representation and then decode sentences in multiple target languages from this representation. In this work we explore this idea in the context of neural encoder-decoder architectures, albeit on a smaller scale and without MT as the end goal. Specifically, we consider the case of three languages or modalities X, Z and Y wherein we are interested in generating sequences in Y starting from information available in X. However, there is no parallel training data available between X and Y, but training data is available between X & Z and Z & Y (as is often the case in many real-world applications). Z thus acts as a pivot/bridge. An obvious solution, which is perhaps less elegant but works very well in practice, is to train a two-stage model which first converts from X to Z and then from Z to Y. Instead, we explore an interlingua-inspired solution which jointly learns to (i) encode X and Z to a common representation and (ii) decode Y from this common representation. We evaluate our model on two tasks: (i) bridge transliteration and (ii) bridge captioning. We report promising results in both applications and believe that this is a step in the right direction towards truly interlingua-inspired encoder-decoder architectures.
Machine Translation of Arabic Dialects
This thesis discusses different approaches to machine translation (MT) from Dialectal Arabic (DA) to English. These approaches handle Arabic dialects at varying stages of resource availability, in terms of the types of resources and the amounts of training data available. The overall theme of this work revolves around building dialectal resources and MT systems, or enriching existing ones, using the currently available resources (dialectal or standard), in order to scale quickly and cheaply to more dialects without the need to spend years and millions of dollars creating such resources for every dialect.
Unlike for Modern Standard Arabic (MSA), DA-English parallel corpora are scarce and available for only a few dialects. Dialects differ from each other and from MSA in orthography, morphology, phonology, and, to a lesser degree, syntax. This means that combining all available parallel data, from dialects and MSA, to train DA-to-English statistical machine translation (SMT) systems might not provide the desired results. Similarly, translating dialectal sentences with an SMT system trained on that dialect only is also challenging, due to various factors that affect the word choices of a sentence relative to those of the SMT training data. Such factors include the level of dialectness (e.g., code switching to MSA versus dialectal training data), topic (sports versus politics), genre (tweets versus newspaper), script (Arabizi versus Arabic), and the time span of the test data versus the training data. The work we present utilizes any available Arabic resource, such as a preprocessing tool or a parallel corpus, whether MSA or DA, to improve DA-to-English translation and to expand to more dialects and sub-dialects.
The majority of Arabic dialects have no parallel data to English or to any other foreign language. They also have no preprocessing tools such as normalizers, morphological analyzers, or tokenizers. For such dialects, we present an MSA-pivoting approach where DA sentences are translated to MSA first, then the MSA output is translated to English using the wealth of MSA-English parallel data. Since there is virtually no DA-MSA parallel data to train an SMT system, we build a rule-based DA-to-MSA MT system, ELISSA, that uses morpho-syntactic translation rules along with dialect identification and language modeling components. We also present a rule-based approach to quickly and cheaply build a dialectal morphological analyzer, ADAM, which provides ELISSA with dialectal word analyses.
Other Arabic dialects have relatively small DA-English parallel corpora, amounting to a few million words on the DA side. Some of these dialects have dialect-dependent preprocessing tools that can be used to prepare the DA data for SMT systems. We present techniques to generate synthetic parallel data from the available DA-English and MSA-English data. We use this synthetic data to build statistical and hybrid versions of ELISSA, as well as to improve our rule-based ELISSA-based MSA-pivoting approach. We evaluate our best MSA-pivoting MT pipeline against three direct SMT baselines trained on three parallel corpora: DA-English data only, MSA-English data only, and the combination of DA-English and MSA-English data. Furthermore, we leverage these four MT systems (the three baselines along with our MSA-pivoting system) in two system combination approaches that benefit from their strengths while avoiding their weaknesses.
Finally, we propose an approach to model dialects from monolingual data and limited DA-English parallel data, without the need for any language-dependent preprocessing tools. We learn DA preprocessing rules using word embeddings and expectation maximization. We test this approach by building a morphological segmentation system, and we evaluate its performance on MT against the state-of-the-art dialectal tokenization tool.
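The MSA-pivoting pipeline described above amounts to chaining two systems, which the sketch below illustrates; the rewrite rule and the stand-in translator are hypothetical toys, not ELISSA's actual rules or the thesis's SMT system.

```python
def msa_pivot_translate(da_sentence, da_to_msa, msa_to_en):
    """Two-stage MSA-pivoting: map the dialectal (DA) sentence into MSA first,
    then translate the MSA output with an MSA-English system."""
    msa = da_to_msa(da_sentence)
    return msa_to_en(msa)

# Hypothetical stubs standing in for a rule-based DA-to-MSA step (like ELISSA)
# and an SMT system trained on MSA-English parallel data.
da_to_msa = lambda s: s.replace("mish", "laysa")          # toy morpho-lexical rewrite
msa_to_en = lambda s: f"<English translation of: {s}>"    # placeholder translator
print(msa_pivot_translate("huwa mish hina", da_to_msa, msa_to_en))
```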
Proceedings of the 17th Annual Conference of the European Association for Machine Translation
Proceedings of the 17th Annual Conference of the European Association for Machine Translation (EAMT).
A survey of cross-lingual word embedding models
Cross-lingual representations of words enable us to reason about word meaning in multilingual contexts and are a key facilitator of cross-lingual transfer when developing natural language processing models for low-resource languages. In this survey, we provide a comprehensive typology of cross-lingual word embedding models. We compare their data requirements and objective functions. The recurring theme of the survey is that many of the models presented in the literature optimize for the same objectives, and that seemingly different models are often equivalent, modulo optimization strategies, hyper-parameters, and such. We also discuss the different ways cross-lingual word embeddings are evaluated, as well as future challenges and research horizons.
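As a concrete instance of the projection-based family of models such surveys cover, the sketch below learns an orthogonal mapping between two monolingual embedding spaces from a seed dictionary (the Procrustes solution). The random toy embeddings stand in for real pretrained vectors and a real bilingual seed lexicon.

```python
import numpy as np

def procrustes_mapping(src_vecs, tgt_vecs):
    """Learn an orthogonal matrix W minimising ||src @ W.T - tgt||_F, where the
    rows of src_vecs and tgt_vecs are embeddings of seed translation pairs."""
    u, _, vt = np.linalg.svd(tgt_vecs.T @ src_vecs)
    return u @ vt

# Toy data: 50 seed pairs with 4-dimensional embeddings related by a hidden orthogonal map.
rng = np.random.default_rng(0)
src = rng.normal(size=(50, 4))
hidden = np.linalg.qr(rng.normal(size=(4, 4)))[0]   # an arbitrary orthogonal map
tgt = src @ hidden.T
W = procrustes_mapping(src, tgt)
print(np.abs(src @ W.T - tgt).max())                # close to zero: mapping recovered
```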