Discontinuous Data-Oriented Parsing: A mildly context-sensitive all-fragments grammar
Recent advances in parsing technology have made treebank parsing with discontinuous constituents possible, with parser output of competitive quality (Kallmeyer and Maier, 2010). We apply Data-Oriented Parsing (DOP) to a grammar formalism that allows for discontinuous trees (LCFRS). Decisions during parsing are conditioned on all possible fragments, resulting in improved performance. Although both DOP and discontinuity present formidable challenges in terms of computational complexity, the model is reasonably efficient and surpasses the state of the art in discontinuous parsing.
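For readers unfamiliar with discontinuous constituents, the following minimal Python sketch (our illustration, not the authors' implementation) shows one common way LCFRS-style parsers represent them: a constituent is a label plus a set of possibly non-contiguous token spans.

```python
# Minimal sketch (not the authors' implementation): a discontinuous
# constituent as a label plus a set of non-contiguous token spans.

# Sentence: "Wake your friend up" -- the particle verb "wake ... up"
# forms one constituent covering two separate spans.
tokens = ["Wake", "your", "friend", "up"]

# A constituent is (label, yield), where the yield is a sorted tuple of
# (start, end) spans over token indices; contiguous phrases have one span.
vp = ("VP", ((0, 1), (3, 4)))     # "Wake" ... "up"
np = ("NP", ((1, 3),))            # "your friend"

def words(constituent, tokens):
    """Recover the (possibly discontinuous) surface yield."""
    label, spans = constituent
    return [tokens[i] for start, end in spans for i in range(start, end)]

print(words(vp, tokens))  # ['Wake', 'up']
```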
A Formal View on Training of Weighted Tree Automata by Likelihood-Driven State Splitting and Merging
The use of computers and algorithms to deal with human language, in both spoken and written form, is summarized by the term natural language processing (nlp). Modeling language in a way that is suitable for computers plays an important role in nlp. One idea is to use formalisms from theoretical computer science for that purpose. For example, one can try to find an automaton to capture the valid written sentences of a language. Finding such an automaton by way of examples is called training.
In this work, we also consider the structure of sentences by making use of trees. We use weighted tree automata (wta) in order to deal with such tree structures. Those devices assign weights to trees in order to, for example, distinguish between good and bad structures. The well-known expectation-maximization algorithm can be used to train the weights of a wta while the state behavior stays fixed. As a way to adapt the state behavior of a wta, state splitting, i.e. dividing a state into several new states, and state merging, i.e. replacing several states by a single new state, can be used. State splitting, state merging, and the expectation-maximization algorithm were previously combined into the state splitting and merging algorithm, which was successfully applied in practice. In our work, we formalized this approach in order to prove properties of the algorithm. We also examined a new approach, the count-based state merging algorithm, which relies exclusively on state merging.
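As an illustration of the state-splitting step, the sketch below splits every state of a toy probabilistic wta in two and redistributes the rule weights with a little noise, in the spirit of split-merge training; the dictionary representation and the exact weight distribution are our assumptions, not the thesis' construction.

```python
# Hedged sketch of state splitting for a toy probabilistic WTA
# (illustrative, not the thesis' exact construction).
import itertools, random

# Transitions map (parent_state, symbol, child_states) -> weight,
# where weights with the same parent_state sum to 1.
transitions = {
    ("S", "f", ("A", "A")): 1.0,
    ("A", "a", ()): 0.6,
    ("A", "g", ("A",)): 0.4,
}

def split_states(transitions, rng=random.Random(0), jitter=0.01):
    """Split every state q into q|0 and q|1, expand each transition to
    all combinations of split parents/children, and renormalize per
    parent with a little noise so EM can break the symmetry."""
    split = lambda q: (q + "|0", q + "|1")
    new = {}
    for (parent, sym, children), w in transitions.items():
        for p in split(parent):
            for kids in itertools.product(*(split(c) for c in children)):
                # share the weight among the child combinations, plus jitter
                share = w / (2 ** len(children))
                new[(p, sym, kids)] = share * (1 + rng.uniform(-jitter, jitter))
    # renormalize so each parent's outgoing weights sum to 1 again
    totals = {}
    for (p, _, _), w in new.items():
        totals[p] = totals.get(p, 0.0) + w
    return {k: w / totals[k[0]] for k, w in new.items()}

split_transitions = split_states(transitions)
```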
When dealing with trees, another important tool is binarization. A binarization is a strategy to encode arbitrary trees as binary trees. For each of three different binarizations, we showed that wta together with the binarization are as powerful as weighted unranked tree automata (wuta). We also showed that this is still true if only probabilistic wta and probabilistic wuta are considered.
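To make the binarization idea concrete, here is a small, hedged sketch of right-branching binarization, one strategy of the kind the thesis studies; the tree encoding and the bar-labeling of auxiliary nodes are illustrative choices, not the thesis' definitions.

```python
# Right-branching binarization of an unranked tree into a binary tree.
# Trees are (label, [children]); leaves have an empty child list.
def binarize_right(tree, bar="|"):
    label, children = tree
    kids = [binarize_right(c) for c in children]
    if len(kids) <= 2:
        return (label, kids)
    # fold the tail of the child list into a right-branching spine of
    # auxiliary bar-labeled nodes
    spine = kids[-1]
    for left in reversed(kids[1:-1]):
        spine = (label + bar, [left, spine])
    return (label, [kids[0], spine])

t = ("S", [("A", []), ("B", []), ("C", []), ("D", [])])
print(binarize_right(t))
# ('S', [('A', []), ('S|', [('B', []), ('S|', [('C', []), ('D', [])])])])
```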
How to Read This Thesis
1. Introduction
1.1. The Contributions and the Structure of This Work
2. Preliminaries
2.1. Sets, Relations, Functions, Families, and Extrema
2.2. Algebraic Structures
2.3. Formal Languages
3. Language Formalisms
3.1. Context-Free Grammars (CFGs)
3.2. Context-Free Grammars with Latent Annotations (CFG-LAs)
3.3. Weighted Tree Automata (WTAs)
3.4. Equivalences of WCFG-LAs and WTAs
4. Training of WTAs
4.1. Probability Distributions
4.2. Maximum Likelihood Estimation
4.3. Probabilities and WTAs
4.4. The EM Algorithm for WTAs
4.5. Inside and Outside Weights
4.6. Adaptation of the Estimation of Corazza and Satta [CS07] to WTAs
5. State Splitting and Merging
5.1. State Splitting and Merging for Weighted Tree Automata
5.1.1. Splitting Weights and Probabilities
5.1.2. Merging Probabilities
5.2. The State Splitting and Merging Algorithm
5.2.1. Finding a Good π-Distributor
5.2.2. Notes About the Berkeley Parser
5.3. Conclusion and Further Research
6. Count-Based State Merging
6.1. Preliminaries
6.2. The Likelihood of the Maximum Likelihood Estimate and Its Behavior While Merging
6.3. The Count-Based State Merging Algorithm
6.3.1. Further Adjustments for Practical Implementations
6.4. Implementation of Count-Based State Merging
6.5. Experiments with Artificial Automata and Corpora
6.5.1. The Artificial Automata
6.5.2. Results
6.6. Experiments with the Penn Treebank
6.7. Comparison to the Approach of Carrasco, Oncina, and Calera-Rubio [COC01]
6.8. Conclusion and Further Research
7. Binarization
7.1. Preliminaries
7.2. Relating WSTAs and WUTAs via Binarizations
7.2.1. Left-Branching Binarization
7.2.2. Right-Branching Binarization
7.2.3. Mixed Binarization
7.3. The Probabilistic Case
7.3.1. Additional Preliminaries About WSAs
7.3.2. Constructing an Out-Probabilistic WSA from a Converging WSA
7.3.3. Binarization and Probabilistic Tree Automata
7.4. Connection to the Training Methods in Previous Chapters
7.5. Conclusion and Further Research
A. Proofs for Preliminaries
B. Proofs for Training of WTAs
C. Proofs for State Splitting and Merging
D. Proofs for Count-Based State Merging
Bibliography
List of Algorithms
List of Figures
List of Tables
Index
Table of Variable Names
Algebraic decoder specification: coupling formal-language theory and statistical machine translation
The specification of a decoder, i.e., a program that translates sentences from one natural language into another, is an intricate process, driven by the application and lacking a canonical methodology. The practical nature of decoder development inhibits the transfer of knowledge between theory and application, which is unfortunate because many contemporary decoders are in fact related to formal-language theory. This thesis proposes an algebraic framework in which a decoder is specified by an expression built from a fixed set of operations. The framework accommodates contemporary syntax-based decoders, it spans two levels of abstraction, and, primarily, it encourages mutual stimulation between the theory of weighted tree automata and the application.
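To give a flavor of the algebraic view, the toy sketch below specifies a "decoder" as a composition of a fixed set of operations; the operations themselves are invented stand-ins for illustration, not the operation set developed in the thesis.

```python
# Hedged sketch: a decoder written as an algebraic expression, i.e. a
# composition of primitive operations (the operations here are made up).
from functools import reduce

def compose(*ops):
    """Build a decoder as the left-to-right composition of operations."""
    return lambda x: reduce(lambda acc, op: op(acc), ops, x)

# Toy operations on a "weighted set of hypotheses": {sentence: weight}.
def translate(hyps):           # expand each hypothesis with a rule option
    return {h + " -> tgt": w * 0.5 for h, w in hyps.items()}

def rescore_lm(hyps):          # multiply in a (fake) language-model score
    return {h: w * 0.9 for h, w in hyps.items()}

def best(hyps):                # project out the highest-weighted output
    return max(hyps.items(), key=lambda kv: kv[1])

decoder = compose(translate, rescore_lm, best)
print(decoder({"src sentence": 1.0}))
```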
Learning with Joint Inference and Latent Linguistic Structure in Graphical Models
Constructing end-to-end NLP systems requires the processing of many types of linguistic information prior to solving the desired end task. A common approach to this problem is to construct a pipeline, one component for each task, with each system's output becoming input for the next. This approach poses two problems. First, errors propagate, and, much like the childhood game of telephone, combining systems in this manner can lead to unintelligible outcomes. Second, each component task requires annotated training data to act as supervision for training the model. These annotations are often expensive and time-consuming to produce, may differ from each other in genre and style, and may not match the intended application.
In this dissertation we present a general framework for constructing and reasoning on joint graphical model formulations of NLP problems. Individual models are composed using weighted Boolean logic constraints, and inference is performed using belief propagation. The systems we develop are composed of two parts: one a representation of syntax, the other a desired end task (semantic role labeling, named entity recognition, or relation extraction). By modeling these problems jointly, both models are trained in a single, integrated process, with uncertainty propagated between them. This mitigates the accumulation of errors typical of pipelined approaches.
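The following toy example (with made-up variable names, potentials, and weights) illustrates how a weighted Boolean-logic constraint acts as a factor coupling two component models; on this tree-shaped factor graph, sum-product belief propagation returns exactly the marginals computed here by enumeration.

```python
# Hedged toy illustration of a weighted Boolean constraint as a factor:
# it softly enforces "if SYNTAX says 'verb' then SRL should say
# 'predicate'". All numbers are invented for the example.
import math
from itertools import product

w = 2.0  # constraint weight: larger = harder constraint

def constraint_factor(is_verb, is_predicate):
    """exp(w) when the implication holds, 1 otherwise (a soft rule)."""
    holds = (not is_verb) or is_predicate
    return math.exp(w) if holds else 1.0

# Unary potentials from the two component models (illustrative numbers).
syntax = {False: 0.4, True: 0.6}
srl    = {False: 0.7, True: 0.3}

# Exact marginals by enumeration; on this tree-shaped factor graph,
# belief propagation would return the same values.
Z = 0.0
marg_pred = 0.0
for v, p in product([False, True], repeat=2):
    score = syntax[v] * srl[p] * constraint_factor(v, p)
    Z += score
    if p:
        marg_pred += score
print("P(predicate) =", marg_pred / Z)
```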
Additionally, we propose a novel marginalization-based training method in which the error signal from end task annotations is used to guide the induction of a constrained latent syntactic representation. This allows training in the absence of syntactic training data, where the latent syntactic structure is instead optimized to best support the end task predictions. We find that across many NLP tasks this training method offers performance comparable to fully supervised training of each individual component, and in some instances improves upon it by learning latent structures which are more appropriate for the task.
Unification-based constraints for statistical machine translation
Morphology and syntax have both received attention in statistical machine translation research, but they are usually treated independently and the historical emphasis on translation into English has meant that many morphosyntactic issues remain under-researched. Languages with richer morphologies pose additional problems and conventional approaches tend to perform poorly when either source or target language has rich morphology.
In both computational and theoretical linguistics, feature structures together with the associated operation of unification have proven a powerful tool for modelling many morphosyntactic aspects of natural language. In this thesis, we propose a framework that extends a state-of-the-art syntax-based model with a feature structure lexicon and unification-based constraints on the target side of the synchronous grammar. Whilst our framework is language-independent, we focus on problems in the translation of English to German, a language pair that has a high degree of syntactic reordering and rich target-side morphology.
We first apply our approach to modelling agreement and case government phenomena. We use the lexicon to link surface form words with grammatical feature values, such as case, gender, and number, and we use constraints to enforce feature value identity for the words in agreement and government relations. We demonstrate improvements in translation quality of up to 0.5 BLEU over a strong baseline model.
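A minimal sketch of the unification operation behind such constraints, under an assumed flat feature-structure representation (not the thesis' actual implementation), is shown below.

```python
# Minimal feature-structure unification for agreement checking
# (flat feature structures assumed for illustration).
def unify(fs1, fs2):
    """Unify two flat feature structures; None signals failure."""
    result = dict(fs1)
    for feat, val in fs2.items():
        if feat in result and result[feat] != val:
            return None  # conflicting values: constraint violated
        result[feat] = val
    return result

# German determiner/noun agreement: "der" (nominative masculine reading)
# and "Frau" clash in gender, so unification fails.
der  = {"case": "nom", "gender": "masc", "number": "sg"}
frau = {"gender": "fem", "number": "sg"}

print(unify(der, frau))                              # None: violation
print(unify({"case": "nom", "number": "sg"}, frau))  # unifies
```

A failed unification, as in the first call above, is exactly the kind of agreement violation such constraints are meant to rule out during decoding.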
We then examine verbal complex production, another aspect of translation that requires the coordination of linguistic features over multiple words, often with long-range discontinuities. We develop a feature structure representation of verbal complex types, using constraint failure as an indicator of translation error, and use this to automatically identify and quantify errors that occur in our baseline system. A manual analysis and classification of errors informs an extended version of the model that incorporates information derived from a parse of the source. We identify clause spans and use model features to encourage the generation of complete verbal complex types. We are able to improve accuracy as measured using precision and recall against values extracted from the reference test sets.
Our framework allows for the incorporation of rich linguistic information, and we present sketches of further applications that could be explored in future work.
Syntactic and semantic features for statistical and neural machine translation
Machine Translation (MT) for language pairs with long-distance dependencies and word reordering, such as German–English, is prone to producing output that is lexically or syntactically incoherent. Statistical MT (SMT) models used explicit or latent syntax to improve reordering, but failed to capture other long-distance dependencies.
This thesis explores how explicit sentence-level syntactic information can improve translation for such complex linguistic phenomena. In particular, we work at the level of the syntactic-semantic interface with representations conveying the predicate-argument structures. These are essential to preserving semantics in translation and SMT systems have long struggled to model them.
String-to-tree SMT systems use explicit target syntax to handle long-distance reordering, but make strong independence assumptions which lead to inconsistent lexical choices. To address this, we propose a Selectional Preferences feature which models the semantic affinities between target predicates and their argument fillers using the target dependency relations available in the decoder. We found that our feature is not effective in a string-to-tree system for German→English, and that the conditioning context is often wrong because of mistranslated verbs.
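As a hedged illustration of what a selectional-preferences score can look like, the sketch below estimates predicate-argument affinity from toy co-occurrence counts; the thesis' actual feature and its estimation differ in detail.

```python
# Hedged sketch of a selectional-preferences score: how strongly a
# target predicate prefers an argument filler under a dependency
# relation, estimated from toy co-occurrence counts.
import math
from collections import Counter

# (predicate, relation, argument) counts, e.g. from a parsed corpus.
counts = Counter({
    ("drink", "dobj", "water"): 50,
    ("drink", "dobj", "idea"): 1,
    ("eat", "dobj", "bread"): 40,
})

def sel_pref(pred, rel, arg):
    """log P(arg | pred, rel), with add-one smoothing over seen args."""
    args = {a for (p, r, a) in counts if (p, r) == (pred, rel)}
    total = sum(c for (p, r, _), c in counts.items() if (p, r) == (pred, rel))
    return math.log((counts[(pred, rel, arg)] + 1) / (total + len(args) + 1))

print(sel_pref("drink", "dobj", "water"))  # high affinity
print(sel_pref("drink", "dobj", "idea"))   # low affinity
```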
To improve verb translation, we proposed a Neural Verb Lexicon Model (NVLM) incorporating sentence-level syntactic context from the source which carries relevant semantic information for verb disambiguation. When used as an extra feature for re-ranking the output of a German→English string-to-tree system, the NVLM improved verb translation precision by up to 2.7% and recall by up to 7.4%.
While the NVLM improved some aspects of translation, other syntactic and lexical inconsistencies are not addressed by a linear combination of independent models. In contrast to SMT, neural machine translation (NMT) avoids strong independence assumptions, thus generating more fluent translations and capturing some long-distance dependencies. Still, incorporating additional linguistic information can improve translation quality.
We proposed a method for tightly coupling target words and syntax in the NMT decoder. To represent syntax explicitly, we used CCG supertags, which encode subcategorization information, capturing long-distance dependencies and attachments. Our method improved translation quality on several difficult linguistic constructs, including prepositional phrases, which are the most frequent type of predicate arguments. These improvements over a strong baseline NMT system were consistent across two language pairs: 0.9 BLEU for German→English and 1.2 BLEU for Romanian→English.
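One simple way to couple supertags and words, shown below purely as an illustration (the thesis explores tighter couplings inside the decoder), is to interleave each target word with its CCG supertag in the output sequence.

```python
# Hedged sketch: interleave CCG supertags with target words to form the
# decoder-side sequence (the simplest coupling variant, for illustration).
words     = ["Mary", "likes", "music"]
supertags = ["NP", "(S\\NP)/NP", "NP"]

def interleave(words, tags):
    """Produce the decoder-side sequence tag1 w1 tag2 w2 ..."""
    return [tok for pair in zip(tags, words) for tok in pair]

target_sequence = interleave(words, supertags)
print(target_sequence)
# ['NP', 'Mary', '(S\\NP)/NP', 'likes', 'NP', 'music']
```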
Proceedings
Proceedings of the Ninth International Workshop on Treebanks and Linguistic Theories.
Editors: Markus Dickinson, Kaili Müürisep and Marco Passarotti.
NEALT Proceedings Series, Vol. 9 (2010), 268 pages.
© 2010 The editors and contributors.
Published by the Northern European Association for Language Technology (NEALT), http://omilia.uio.no/nealt.
Electronically published at Tartu University Library (Estonia): http://hdl.handle.net/10062/15891