
    Constrained word alignment models for statistical machine translation

    Word alignment is a fundamental and crucial component in Statistical Machine Translation (SMT) systems. Despite the enormous progress made in the past two decades, this task remains an active research topic simply because the quality of word alignment is still far from optimal. Most state-of-the-art word alignment models are grounded on statistical learning theory treating word alignment as a general sequence alignment problem, where many linguistically motivated insights are not incorporated. In this thesis, we propose new word alignment models with linguistically motivated constraints in a bid to improve the quality of word alignment for Phrase-Based SMT systems (PB-SMT). We start the exploration with an investigation into segmentation constraints for word alignment by proposing a novel algorithm, namely word packing, which is motivated by the fact that one concept expressed by one word in one language can frequently surface as a compound or collocation in another language. Our algorithm takes advantage of the interaction between segmentation and alignment, starting with some segmentation for both the source and target language and updating the segmentation with respect to the word alignment results using state-of-the-art word alignment models; thereafter a refined word alignment can be obtained based on the updated segmentation. In this process, the updated segmentation acts as a hard constraint on the word alignment models and reduces the complexity of the alignment models by generating more 1-to-1 correspondences through word packing. Experimental results show that this algorithm can lead to statistically significant improvements over the state-of-the-art word alignment models. Given that word packing imposes "hard" segmentation constraints on the word aligner, which is prone to introducing noise, we propose two new word alignment models using syntactic dependencies as soft constraints. The first model is a syntactically enhanced discriminative word alignment model, where we use a set of feature functions to express the syntactic dependency information encoded in both source and target languages. On the one hand, this model enjoys great flexibility in its capacity to incorporate multiple features; on the other hand, this model is designed to facilitate model tuning for different objective functions. Experimental results show that using syntactic constraints can improve the performance of the discriminative word alignment model, which also leads to better PB-SMT performance compared to using state-of-the-art word alignment models. The second model is a syntactically constrained generative word alignment model, where we add a syntactic coherence model over the target phrases in the context of HMM word-to-phrase alignment. The advantages of our model are that (i) the addition of the syntactic coherence model preserves the efficient parameter estimation procedures; and (ii) the flexibility of the model can be increased so that it can be tuned according to different objective functions. Experimental results show that tuning this model properly leads to a significant gain in MT performance over the state-of-the-art.
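
The iterative interaction between segmentation and alignment can be pictured with a small sketch. The following Python is only an illustrative approximation of one word-packing iteration, not the thesis's implementation; the `align` callable (standing in for a statistical aligner such as GIZA++), the bitext format, and the `min_count` threshold are all assumptions.

```python
from collections import Counter

def pack_words(bitext, align, min_count=3):
    """One illustrative word-packing iteration: align, find reliable 1-to-n links,
    merge each linked target word group into a single packed token, then re-align."""
    links = align(bitext)  # one set of (src_idx, tgt_idx) pairs per sentence pair (assumed interface)

    # Count how often a source word aligns to the same contiguous group of target words.
    candidates = Counter()
    for (_src, tgt), pairs in zip(bitext, links):
        by_src = {}
        for i, j in pairs:
            by_src.setdefault(i, []).append(j)
        for js in by_src.values():
            js = sorted(js)
            if len(js) > 1 and js == list(range(js[0], js[-1] + 1)):  # contiguous 1-to-n link
                candidates[tuple(tgt[j] for j in js)] += 1

    # Keep frequent groups and update the target-side segmentation by packing them.
    packed = {g for g, c in candidates.items() if c >= min_count}
    repacked = []
    for src, tgt in bitext:
        out, k = [], 0
        while k < len(tgt):
            group = next((g for g in packed if tuple(tgt[k:k + len(g)]) == g), None)
            if group:
                out.append("_".join(group))  # the packed group now behaves as one word
                k += len(group)
            else:
                out.append(tgt[k])
                k += 1
        repacked.append((src, out))

    # Re-aligning on the updated segmentation yields more 1-to-1 correspondences.
    return repacked, align(repacked)
```

Packing frequent 1-to-n groups into single tokens is what turns them into 1-to-1 links and thereby reduces the complexity faced by the alignment model.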

    Dependency-based Bilingual Word Embeddings and Neural Machine Translation

    Bilingual word embeddings, which represent lexicons from various languages in a common embedding space, are critical for facilitating semantic and knowledge transfers in a wide range of cross-lingual NLP applications. The significance of learning bilingual word embedding representations in many Natural Language Processing (NLP) tasks motivates us to investigate the effect of many factors, including syntactic information, on the learning process for different languages with varying levels of structural complexity. By analysing the components that influence the learning process of bilingual word embeddings (BWEs), this thesis examines some factors for learning bilingual word embeddings effectively. Our findings in this thesis demonstrate that increasing the embedding size for language pairs has a positive impact on the learning process for BWEs, while the effect of sentence length depends on the language: short sentences perform better than long ones in the En-ES experiment, whereas increasing the sentence length improves model accuracy in the En-Ar and En-De experiments. Arabic segmentation, according to the En-Ar experiments, is essential to the learning process for BWEs and can boost model accuracy by up to 10%. Incorporating dependency features into the learning process enhances the trained models' performance and results in improved BWEs for all language pairs. Finally, we investigated how the dependency-based pretrained BWEs affected the neural machine translation (NMT) model. The findings indicate that across various MT evaluation metrics, the trained dependency-based NMT models outperform the baseline NMT model.
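
As a hedged illustration of what "incorporating dependency features" can mean in practice, the sketch below extracts (word, syntactic-context) training pairs in the style of dependency-based word embeddings; the input tuple format and the helper name `extract_contexts` are assumptions for illustration, not the thesis's actual pipeline.

```python
def extract_contexts(parsed_sentence):
    """Turn a dependency-parsed sentence into (word, context) training pairs.
    Each token contributes its head and its dependents, labelled with the relation,
    so syntactic rather than purely linear neighbours shape the embedding."""
    # parsed_sentence: list of (idx, form, head_idx, deprel), 1-based heads, 0 = root
    tokens = {idx: form for idx, form, _, _ in parsed_sentence}
    pairs = []
    for idx, form, head, rel in parsed_sentence:
        if head != 0:
            pairs.append((form, f"{tokens[head]}/{rel}"))      # child sees its head
            pairs.append((tokens[head], f"{form}/{rel}-inv"))   # head sees its child
    return pairs

sent = [(1, "the", 2, "det"), (2, "cat", 3, "nsubj"), (3, "sleeps", 0, "root")]
print(extract_contexts(sent))
# [('the', 'cat/det'), ('cat', 'the/det-inv'), ('cat', 'sleeps/nsubj'), ('sleeps', 'cat/nsubj-inv')]
```

The resulting (word, context) pairs can then be fed to any skip-gram-style trainer in place of linear window contexts.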

    ProphetMT: a tree-based SMT-driven controlled language authoring/post-editing tool

    This paper presents ProphetMT, a tree-based SMT-driven Controlled Language (CL) authoring and post-editing tool. ProphetMT employs the source-side rules in a translation model and provides them as auto-suggestions to users. Accordingly, one might say that users are writing in a ‘Controlled Language’ that is ‘understood’ by the computer. ProphetMT also allows users to easily attach structural information as they compose content. When a specific rule is selected, a partial translation is promptly generated on-the-fly with the help of the structural information. Our experiments conducted on English-to-Chinese show that our proposed ProphetMT system can not only better regularise an author’s writing behaviour, but also significantly improve translation fluency, which is vital for reducing post-editing time. Additionally, when the writing and translation process is over, ProphetMT can provide an effective colour scheme to further improve the productivity of post-editors by explicitly featuring the relations between the source and target rules.
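
A toy sketch of the auto-suggestion and on-the-fly partial translation idea described above, under the assumption of a tiny hand-written rule table; the rule format, the helper names, and the example rules are all hypothetical and not taken from ProphetMT itself.

```python
# Illustrative rule table: source-side rule -> (full left-hand side, target side).
RULES = {
    "please restart the": ("please restart the X", "请重新启动 X"),
    "click the": ("click the X button", "点击 X 按钮"),
}

def suggest(typed_prefix):
    """Return source-side rules whose left-hand side starts with the typed text."""
    return [lhs for key, (lhs, _) in RULES.items() if key.startswith(typed_prefix)]

def partial_translation(selected_lhs):
    """Once a rule is selected, its target side is produced on the fly."""
    for lhs, tgt in RULES.values():
        if lhs == selected_lhs:
            return tgt
    return None

print(suggest("click"))                            # ['click the X button']
print(partial_translation("click the X button"))   # '点击 X 按钮'
```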

    Multiword expression processing: A survey

    Multiword expressions (MWEs) are a class of linguistic forms spanning conventional word boundaries that are both idiosyncratic and pervasive across different languages. The structure of linguistic processing that depends on the clear distinction between words and phrases has to be re-thought to accommodate MWEs. The issue of MWE handling is crucial for NLP applications, where it raises a number of challenges. The emergence of solutions in the absence of guiding principles motivates this survey, whose aim is not only to provide a focused review of MWE processing, but also to clarify the nature of interactions between MWE processing and downstream applications. We propose a conceptual framework within which challenges and research contributions can be positioned. It offers a shared understanding of what is meant by "MWE processing," distinguishing the subtasks of MWE discovery and identification. It also elucidates the interactions between MWE processing and two use cases: parsing and machine translation. Many of the approaches in the literature can be differentiated according to how MWE processing is timed with respect to underlying use cases. We discuss how such orchestration choices affect the scope of MWE-aware systems. For each of the two MWE processing subtasks and for each of the two use cases, we conclude on open issues and research perspectives.
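
To make the discovery/identification distinction concrete, here is a minimal sketch using a deliberately simple association measure (PMI over bigrams) for discovery and exact lexicon matching for identification; both choices are illustrative simplifications rather than the methods surveyed.

```python
import math
from collections import Counter

def pmi_discovery(sentences, min_count=2):
    """MWE *discovery*: rank bigrams over a corpus by pointwise mutual information
    to propose candidate expressions."""
    unigrams, bigrams, total = Counter(), Counter(), 0
    for sent in sentences:
        unigrams.update(sent)
        bigrams.update(zip(sent, sent[1:]))
        total += len(sent)
    scores = {}
    for (w1, w2), c in bigrams.items():
        if c >= min_count:
            scores[(w1, w2)] = math.log((c / total) / ((unigrams[w1] / total) * (unigrams[w2] / total)))
    return sorted(scores, key=scores.get, reverse=True)

def identify(sentence, lexicon):
    """MWE *identification*: mark occurrences of known expressions in one sentence."""
    spans = []
    for i in range(len(sentence)):
        for mwe in lexicon:
            if tuple(sentence[i:i + len(mwe)]) == mwe:
                spans.append((i, i + len(mwe), " ".join(mwe)))
    return spans

corpus = [["he", "kicked", "the", "bucket"], ["she", "kicked", "the", "ball"],
          ["they", "kicked", "the", "bucket"]]
print(pmi_discovery(corpus)[:3])
print(identify(corpus[0], {("kicked", "the", "bucket")}))   # [(1, 4, 'kicked the bucket')]
```

Discovery produces a type-level candidate lexicon from a corpus; identification marks token-level occurrences in running text, which is the step downstream parsers and MT systems actually consume.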

    Incorporating translation quality-oriented features into log-linear models of machine translation

    The current state-of-the-art approach to Machine Translation (MT) has limitations which could be alleviated by the use of syntax-based models. Although the benefits of syntax use in MT are becoming clear with the ongoing improvements in string-to-tree and tree-to-string systems, tree-to-tree systems such as Data Oriented Translation (DOT) have, until recently, suffered from a lack of training resources, and as a consequence are currently immature, lacking key features compared to Phrase-Based Statistical MT (PB-SMT) systems. In this thesis we propose avenues to bridge the gap between our syntax-based DOT model and state-of-the-art PB-SMT systems. Noting that both types of systems score translations using probabilities not necessarily related to the quality of the translations they produce, we introduce a training mechanism which takes translation quality into account by averaging the edit distance between a translation unit and the translation units used in oracle translations. This training mechanism could in principle be adapted to a very broad class of MT systems. In particular, we show how, when translating Spanish sentences into English, it leads to improvements in the translation quality of both PB-SMT and DOT. In addition, we show how our method leads to a PB-SMT system which uses significantly fewer resources and translates significantly faster than the original, while maintaining the improvements in translation quality. We then address the issue of the limited feature set in DOT by defining a new DOT model which is able to exploit features of the complete source sentence. We introduce a feature into this new model which conditions each target word on the source context it is associated with, and we also make the first attempt at incorporating a language model (LM) into a DOT system. We investigate different estimation methods for our lexical feature (namely Maximum Entropy and improved Kneser-Ney), reporting on their empirical performance. After describing methods which enable us to improve the efficiency of our system and which allow us to scale to larger training data sizes, we evaluate the performance of our new model on English-to-Spanish translation, obtaining significant translation quality improvements compared to the original DOT system.
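
The quality-oriented training signal described above can be sketched as follows, assuming token-level Levenshtein distance and a hypothetical `quality_feature` helper; this is a minimal illustration of the averaging idea, not the thesis's actual training mechanism.

```python
def edit_distance(a, b):
    """Standard Levenshtein distance between two token sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (x != y)))
        prev = cur
    return prev[-1]

def quality_feature(unit, oracle_units):
    """Average edit distance from one translation unit to the units used in oracle
    (best reachable) translations; lower means the unit looks more like what good
    translations actually use."""
    dists = [edit_distance(unit, o) for o in oracle_units]
    return sum(dists) / len(dists)

oracle = [["the", "red", "car"], ["a", "red", "car"]]
print(quality_feature(["the", "red", "automobile"], oracle))   # 1.5
```

Such a score can then be exposed as one more feature in the log-linear model and tuned alongside the usual translation and language model features.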

    Undirected dependency parsing

    Dependency parsers, which are widely used in natural language processing tasks, employ a representation of syntax in which the structure of sentences is expressed in the form of directed links (dependencies) between their words. In this article, we introduce a new approach to transition-based dependency parsing in which the parsing algorithm does not directly construct dependencies, but rather undirected links, which are then assigned a direction in a postprocessing step. We show that this alleviates error propagation, because undirected parsers do not need to observe the single-head constraint, resulting in better accuracy. Undirected parsers can be obtained by transforming existing directed transition-based parsers as long as they satisfy certain conditions. We apply this approach to obtain undirected variants of three different parsers (the Planar, 2-Planar, and Covington algorithms) and perform experiments on several data sets from the CoNLL-X shared tasks and on the Wall Street Journal portion of the Penn Treebank, showing that our approach is successful in reducing error propagation, produces improvements in parsing accuracy in most cases, and achieves results competitive with state-of-the-art transition-based parsers.
    Xunta de Galicia | Ref. CN2012/008; Xunta de Galicia | Ref. CN2012/317; Xunta de Galicia | Ref. CN2012/319; Ministerio de Ciencia e Innovación | Ref. TIN2010-18552-C03-01; Ministerio de Ciencia e Innovación | Ref. TIN2010-18552-C03-0
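
A minimal sketch of the postprocessing idea: once the parser has produced an undirected tree over the words, the links are oriented away from an artificial root so that each word receives exactly one head. The traversal below is a simplified stand-in for the article's actual direction-assignment strategy, used here only to show why the single-head constraint can be deferred.

```python
from collections import defaultdict

def orient_edges(undirected_edges, root=0):
    """Given an undirected tree over word positions (0 is an artificial root),
    orient every link away from the root so each word gets exactly one head,
    restoring the single-head constraint after undirected parsing."""
    adj = defaultdict(list)
    for u, v in undirected_edges:
        adj[u].append(v)
        adj[v].append(u)
    heads, stack, seen = {}, [root], {root}
    while stack:
        h = stack.pop()
        for d in adj[h]:
            if d not in seen:
                heads[d] = h          # direct the link as h -> d
                seen.add(d)
                stack.append(d)
    return heads

# words: <root> She reads books ; undirected links produced by the parser
print(orient_edges([(2, 1), (2, 3), (0, 2)]))   # {2: 0, 1: 2, 3: 2}
```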

    Exploiting Cross-Lingual Representations For Natural Language Processing

    Traditional approaches to supervised learning require a generous amount of labeled data for good generalization. While such annotation-heavy approaches have proven useful for some Natural Language Processing (NLP) tasks in high-resource languages (like English), they are unlikely to scale to languages where collecting labeled data is difficult and time-consuming. Translating supervision available in English is also not a viable solution, because developing a good machine translation system requires expensive-to-annotate resources which are not available for most languages. In this thesis, I argue that cross-lingual representations are an effective means of extending NLP tools to languages beyond English without resorting to generous amounts of annotated data or expensive machine translation. These representations can be learned in an inexpensive manner, often from signals completely unrelated to the task of interest. I begin with a review of different ways of inducing such representations using a variety of cross-lingual signals and study algorithmic approaches to using them in a diverse set of downstream tasks. Examples of such tasks covered in this thesis include learning representations to transfer a trained model across languages for document classification, assist in monolingual lexical semantics like word sense induction, identify asymmetric lexical relationships like hypernymy between words in different languages, or combine supervision across languages through a shared feature space for cross-lingual entity linking. In all these applications, the representations make information expressed in other languages available in English, while requiring minimal additional supervision in the language of interest.
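
The core transfer recipe, train once on English and apply to another language through a shared embedding space, can be sketched as below. The four-dimensional toy embeddings, the nearest-centroid classifier, and the vocabulary are invented for illustration; they are not the representations or models used in the thesis.

```python
import numpy as np

def doc_vector(tokens, embeddings, dim=4):
    """Average word vectors from a *shared* cross-lingual embedding space."""
    vecs = [embeddings[t] for t in tokens if t in embeddings]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

# Toy shared space: English and Spanish words for the same concept sit close together.
emb = {
    "economy":  np.array([1.0, 0.1, 0.0, 0.0]), "economía": np.array([0.9, 0.1, 0.0, 0.1]),
    "football": np.array([0.0, 0.0, 1.0, 0.1]), "fútbol":   np.array([0.1, 0.0, 0.9, 0.1]),
}

# "Train" a nearest-centroid classifier on English documents only...
train = [(["economy"], "business"), (["football"], "sport")]
centroids = {label: doc_vector(tokens, emb) for tokens, label in train}

def classify(tokens):
    v = doc_vector(tokens, emb)
    return min(centroids, key=lambda label: np.linalg.norm(v - centroids[label]))

# ...and apply it unchanged to a Spanish document: no target-language labels needed.
print(classify(["fútbol"]))   # 'sport'
```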

    Addressing the data bottleneck in implicit discourse relation classification

    When humans comprehend language, their interpretation consists of more than just the sum of the content of the sentences. Additional logic and semantic links (known as coherence relations or discourse relations) are inferred between sentences/clauses in the text. The identification of discourse relations is beneficial for various NLP applications such as question answering, summarization, machine translation, information extraction, etc. Discourse relations are categorized into implicit and explicit discourse relations depending on whether there is an explicit discourse marker between the arguments. In this thesis, we mainly focus on implicit discourse relation classification, given that with the explicit markers acting as informative cues, the explicit relations are relatively easier for machines to identify. The recent neural network-based approaches in particular suffer from insufficient training (and test) data. In Chapter 3 of this thesis, we start out by showing to what extent the limited data size is a problem in implicit discourse relation classification and propose data augmentation methods with the help of cross-lingual data. We then propose several approaches for better exploiting and encoding various types of existing data in the discourse relation classification task. Most of the existing machine learning methods train on sections 2-21 of the PDTB and test on section 23, which only includes a total of fewer than 800 implicit discourse relation instances. With the help of cross-validation, we argue that the standard test section of the PDTB is too small to draw conclusions upon; with more test samples in the cross-validation, we would come to very different conclusions about whether a feature is generally useful. Second, we propose a simple approach to automatically extract samples of implicit discourse relations from multilingual parallel corpora via back-translation. After back-translating from target languages, it is easy for the discourse parser to identify those examples that are originally implicit but explicit in the back-translations. With those additional data in the training set, the experiments show significant improvements on different settings. Finally, having better encoding ability is also of crucial importance in terms of improving classification performance. We propose different methods, including a sequence-to-sequence neural network and a memory component, to help obtain a better representation of the arguments. We also show that having the correct next sentence is beneficial for the task within and across domains, with the help of the BERT (Devlin et al., 2019) model. When it comes to a new domain, it is beneficial to integrate external domain-specific knowledge. In Chapter 8, we show that with the entity enhancement, the performance on BioDRB is improved significantly compared with other BERT-based methods. In sum, the studies reported in this dissertation contribute to addressing the data bottleneck problem in implicit discourse relation classification and propose corresponding approaches that achieve 54.82% and 69.57% on PDTB and BioDRB, respectively.
    When humans understand language, their interpretation consists of more than just the sum of the content of the sentences. Additional logical and semantic links (so-called coherence relations or discourse relations) are inferred between sentences in the text. The identification of discourse relations is beneficial for various NLP applications such as question answering, summarization, machine translation, information extraction, etc. Discourse relations are divided into implicit and explicit discourse relations, depending on whether there is an explicit discourse marker between the arguments. In this thesis, we mainly focus on the classification of implicit discourse relations, since the explicit markers serve as helpful cues and the explicit relations are relatively easy for machines to identify. Various approaches have been proposed that have achieved impressive results in implicit discourse relation classification; most of them, however, suffer from the fact that the data are insufficient for neural network-based methods. In this thesis, we first address the problem of limited data for this task and then propose data augmentation methods with the help of cross-lingual data. Lastly, we propose several methods to better encode the arguments from different aspects. Most of the existing machine learning methods are trained on sections 2-21 of the PDTB and tested on section 23, which contains a total of fewer than 800 implicit discourse relation instances. With the help of cross-validation, we argue that the standard test section of the PDTB is too small to draw conclusions from. With more test samples in cross-validation, we would come to different conclusions about whether a feature is generally beneficial for this task or not, especially when we use a relatively large label set. If we rely only on our small standard test set, we run the risk of drawing wrong conclusions about which features are helpful. Second, we propose a simple approach for the automatic extraction of samples of implicit discourse relations from multilingual parallel corpora via back-translation. It is motivated by the explicitation process that occurs when humans translate a text. After back-translation from the target languages, it is easy for the discourse parser to identify those examples that are originally implicit but explicit in the back-translations. With these additional data included in the training set, the experiments show significant improvements in different settings. We initially use only French-English pairs, have no control over the quality, and focus mostly on intra-sentential relations; to tackle these issues, we later extend the idea with more preprocessing steps and more language pairs. With majority votes from different language pairs, the mapped implicit labels are more reliable. Finally, better encoding ability is also of crucial importance for improving classification performance. We propose a new model consisting of a classifier and a sequence-to-sequence model. Besides correctly predicting the label, they are also trained to produce a representation of the discourse relation arguments by trying to predict the arguments including a suitable implicit connective. The novel secondary task forces the internal representation to encode the semantics of the relation arguments more completely and to perform a more fine-grained classification. To further capture the general knowledge in contexts, we also employ a memory network to obtain an explicit context representation of training examples as contexts. For each test instance, we generate a knowledge vector by weighted reading of the memory. We evaluate the proposed model under different conditions, and the results show that the model with the memory network can facilitate the prediction of discourse relations by selecting examples that exhibit similar semantic representations and discourse relations. Even though better understanding, encoding, and semantic interpretation are essential and useful for the task of implicit discourse relation classification, they only do part of the work. A good implicit discourse relation classifier should also be aware of upcoming events, causes, consequences, etc., in order to encode discourse expectation into the sentence representations. With the help of the recently proposed BERT model, we try to find out whether or not having the correct next sentence is beneficial for the task. The experimental results show that removing the next-sentence-prediction task strongly hurts performance both within the domain and across domains. The limited ability of BioBERT to learn domain-specific knowledge, i.e., entity information, entity relations, etc., motivates us to integrate external knowledge into the pretrained language models. We propose an unsupervised method using information retrieval systems and knowledge graph techniques, with the assumption that if two instances share similar entities in both relational arguments, the probability is high that they have the same or a similar discourse relation. The approach achieves comparable results on BioDRB compared with baseline models. Subsequently, we use the extracted relevant entities to enhance the pretrained K-BERT model so as to better encode the meaning of the arguments, outperforming the original BERT and BioBERT by 6.5% and 2% in accuracy, respectively. In sum, this dissertation contributes to addressing the data bottleneck problem in implicit discourse relation classification and proposes corresponding approaches in several respects, including: demonstrating the limited-data problem and the risks of drawing conclusions from it; acquiring automatically annotated data through the explicitation process that occurs during human translation between English and other languages; better representation of discourse relation arguments; and entity enhancement with an unsupervised method and a pretrained language model.
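
A hedged sketch of the back-translation harvesting step: if the second argument of an originally implicit pair comes back from translation with an explicit connective at its start, that connective votes for a silver relation label. The tiny connective-to-relation mapping and the helper name are illustrative assumptions; the real pipeline additionally relies on a discourse parser and, later, majority voting across language pairs.

```python
# Illustrative subset of a connective-to-relation mapping (hypothetical, not the thesis's table).
CONNECTIVE_TO_RELATION = {
    "because": "Contingency.Cause",
    "however": "Comparison.Contrast",
    "then": "Temporal.Asynchronous",
}

def harvest_label(original_arg2, back_translated_arg2):
    """Return a silver relation label if back-translation made an implicit relation explicit."""
    first = back_translated_arg2.lower().split()[0].strip(",")
    if first in CONNECTIVE_TO_RELATION and not original_arg2.lower().startswith(first):
        return CONNECTIVE_TO_RELATION[first]
    return None

orig = "the company cancelled the launch."
back = "However, the company cancelled the launch."
print(harvest_label(orig, back))   # Comparison.Contrast
```

Instances labelled this way can then be added to the training set as additional (noisier) supervision alongside the PDTB data.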