Discontinuous Data-Oriented Parsing: A mildly context-sensitive all-fragments grammar
Recent advances in parsing technology have made treebank parsing with discontinuous constituents possible, with parser output of competitive quality (Kallmeyer and Maier, 2010). We apply Data-Oriented Parsing (DOP) to a grammar formalism that allows for discontinuous trees (LCFRS). Decisions during parsing are conditioned on all possible fragments, resulting in improved performance. Although both DOP and discontinuity present formidable challenges in terms of computational complexity, the model is reasonably efficient and surpasses the state of the art in discontinuous parsing.
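For readers unfamiliar with discontinuous constituents, the following minimal Python sketch (our illustration, not the authors' implementation) shows one common way LCFRS-style parsers represent them: a constituent is a label plus a set of possibly non-contiguous token spans.

```python
# Minimal sketch (not the authors' implementation): a discontinuous
# constituent as a label plus a set of non-contiguous token spans.

# Sentence: "Wake your friend up" -- the particle verb "wake ... up"
# forms one constituent covering two separate spans.
tokens = ["Wake", "your", "friend", "up"]

# A constituent is (label, yield), where the yield is a sorted tuple of
# (start, end) spans over token indices; contiguous phrases have one span.
vp = ("VP", ((0, 1), (3, 4)))     # "Wake" ... "up"
np = ("NP", ((1, 3),))            # "your friend"

def words(constituent, tokens):
    """Recover the (possibly discontinuous) surface yield."""
    label, spans = constituent
    return [tokens[i] for start, end in spans for i in range(start, end)]

print(words(vp, tokens))  # ['Wake', 'up']
```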
A Formal View on Training of Weighted Tree Automata by Likelihood-Driven State Splitting and Merging
The use of computers and algorithms to deal with human language, in both spoken and written form, is summarized by the term natural language processing (nlp). Modeling language in a way that is suitable for computers plays an important role in nlp. One idea is to use formalisms from theoretical computer science for that purpose. For example, one can try to find an automaton to capture the valid written sentences of a language. Finding such an automaton by way of examples is called training.
In this work, we also consider the structure of sentences by making use of trees. We use weighted tree automata (wta) in order to deal with such tree structures. Those devices assign weights to trees in order to, for example, distinguish between good and bad structures. The well-known expectation-maximization algorithm can be used to train the weights of a wta while the state behavior stays fixed. As a way to adapt the state behavior of a wta, state splitting, i.e. dividing a state into several new states, and state merging, i.e. replacing several states by a single new state, can be used. State splitting, state merging, and the expectation-maximization algorithm were previously combined into the state splitting and merging algorithm, which was successfully applied in practice. In our work, we formalized this approach in order to prove properties of the algorithm. We also examined a new approach, the count-based state merging algorithm, which relies exclusively on state merging.
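As an illustration of the state-splitting step, the sketch below splits every state of a toy probabilistic wta in two and redistributes the rule weights with a little noise, in the spirit of split-merge training; the dictionary representation and the exact weight distribution are our assumptions, not the thesis' construction.

```python
# Hedged sketch of state splitting for a toy probabilistic WTA
# (illustrative, not the thesis' exact construction).
import itertools, random

# Transitions map (parent_state, symbol, child_states) -> weight,
# where weights with the same parent_state sum to 1.
transitions = {
    ("S", "f", ("A", "A")): 1.0,
    ("A", "a", ()): 0.6,
    ("A", "g", ("A",)): 0.4,
}

def split_states(transitions, rng=random.Random(0), jitter=0.01):
    """Split every state q into q|0 and q|1, expand each transition to
    all combinations of split parents/children, and renormalize per
    parent with a little noise so EM can break the symmetry."""
    split = lambda q: (q + "|0", q + "|1")
    new = {}
    for (parent, sym, children), w in transitions.items():
        for p in split(parent):
            for kids in itertools.product(*(split(c) for c in children)):
                # share the weight among the child combinations, plus jitter
                share = w / (2 ** len(children))
                new[(p, sym, kids)] = share * (1 + rng.uniform(-jitter, jitter))
    # renormalize so each parent's outgoing weights sum to 1 again
    totals = {}
    for (p, _, _), w in new.items():
        totals[p] = totals.get(p, 0.0) + w
    return {k: w / totals[k[0]] for k, w in new.items()}

split_transitions = split_states(transitions)
```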
When dealing with trees, another important tool is binarization. A binarization is a strategy to encode arbitrary trees as binary trees. For each of three different binarizations, we showed that wta together with the binarization are as powerful as weighted unranked tree automata (wuta). We also showed that this is still true if only probabilistic wta and probabilistic wuta are considered.
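To make the binarization idea concrete, here is a small, hedged sketch of right-branching binarization, one strategy of the kind the thesis studies; the tree encoding and the bar-labeling of auxiliary nodes are illustrative choices, not the thesis' definitions.

```python
# Right-branching binarization of an unranked tree into a binary tree.
# Trees are (label, [children]); leaves have an empty child list.
def binarize_right(tree, bar="|"):
    label, children = tree
    kids = [binarize_right(c) for c in children]
    if len(kids) <= 2:
        return (label, kids)
    # fold the tail of the child list into a right-branching spine of
    # auxiliary bar-labeled nodes
    spine = kids[-1]
    for left in reversed(kids[1:-1]):
        spine = (label + bar, [left, spine])
    return (label, [kids[0], spine])

t = ("S", [("A", []), ("B", []), ("C", []), ("D", [])])
print(binarize_right(t))
# ('S', [('A', []), ('S|', [('B', []), ('S|', [('C', []), ('D', [])])])])
```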
How to Read This Thesis
1. Introduction
1.1. The Contributions and the Structure of This Work
2. Preliminaries
2.1. Sets, Relations, Functions, Families, and Extrema
2.2. Algebraic Structures
2.3. Formal Languages
3. Language Formalisms
3.1. Context-Free Grammars (CFGs)
3.2. Context-Free Grammars with Latent Annotations (CFG-LAs)
3.3. Weighted Tree Automata (WTAs)
3.4. Equivalences of WCFG-LAs and WTAs
4. Training of WTAs
4.1. Probability Distributions
4.2. Maximum Likelihood Estimation
4.3. Probabilities and WTAs
4.4. The EM Algorithm for WTAs
4.5. Inside and Outside Weights
4.6. Adaptation of the Estimation of Corazza and Satta [CS07] to WTAs
5. State Splitting and Merging
5.1. State Splitting and Merging for Weighted Tree Automata
5.1.1. Splitting Weights and Probabilities
5.1.2. Merging Probabilities
5.2. The State Splitting and Merging Algorithm
5.2.1. Finding a Good π-Distributor
5.2.2. Notes About the Berkeley Parser
5.3. Conclusion and Further Research
6. Count-Based State Merging
6.1. Preliminaries
6.2. The Likelihood of the Maximum Likelihood Estimate and Its Behavior While Merging
6.3. The Count-Based State Merging Algorithm
6.3.1. Further Adjustments for Practical Implementations
6.4. Implementation of Count-Based State Merging
6.5. Experiments with Artificial Automata and Corpora
6.5.1. The Artificial Automata
6.5.2. Results
6.6. Experiments with the Penn Treebank
6.7. Comparison to the Approach of Carrasco, Oncina, and Calera-Rubio [COC01]
6.8. Conclusion and Further Research
7. Binarization
7.1. Preliminaries
7.2. Relating WSTAs and WUTAs via Binarizations
7.2.1. Left-Branching Binarization
7.2.2. Right-Branching Binarization
7.2.3. Mixed Binarization
7.3. The Probabilistic Case
7.3.1. Additional Preliminaries About WSAs
7.3.2. Constructing an Out-Probabilistic WSA from a Converging WSA
7.3.3. Binarization and Probabilistic Tree Automata
7.4. Connection to the Training Methods in Previous Chapters
7.5. Conclusion and Further Research
A. Proofs for Preliminaries
B. Proofs for Training of WTAs
C. Proofs for State Splitting and Merging
D. Proofs for Count-Based State Merging
Bibliography
List of Algorithms
List of Figures
List of Tables
Index
Table of Variable Names
Algebraic decoder specification: coupling formal-language theory and statistical machine translation
The specification of a decoder, i.e., a program that translates sentences from one natural language into another, is an intricate process, driven by the application and lacking a canonical methodology. The practical nature of decoder development inhibits the transfer of knowledge between theory and application, which is unfortunate because many contemporary decoders are in fact related to formal-language theory. This thesis proposes an algebraic framework in which a decoder is specified by an expression built from a fixed set of operations. The framework accommodates contemporary syntax-based decoders, it spans two levels of abstraction, and, primarily, it encourages mutual stimulation between the theory of weighted tree automata and the application.
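To give a flavor of the algebraic view, the toy sketch below specifies a "decoder" as a composition of a fixed set of operations; the operations themselves are invented stand-ins for illustration, not the operation set developed in the thesis.

```python
# Hedged sketch: a decoder written as an algebraic expression, i.e. a
# composition of primitive operations (the operations here are made up).
from functools import reduce

def compose(*ops):
    """Build a decoder as the left-to-right composition of operations."""
    return lambda x: reduce(lambda acc, op: op(acc), ops, x)

# Toy operations on a "weighted set of hypotheses": {sentence: weight}.
def translate(hyps):           # expand each hypothesis with a rule option
    return {h + " -> tgt": w * 0.5 for h, w in hyps.items()}

def rescore_lm(hyps):          # multiply in a (fake) language-model score
    return {h: w * 0.9 for h, w in hyps.items()}

def best(hyps):                # project out the highest-weighted output
    return max(hyps.items(), key=lambda kv: kv[1])

decoder = compose(translate, rescore_lm, best)
print(decoder({"src sentence": 1.0}))
```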
Learning with Joint Inference and Latent Linguistic Structure in Graphical Models
Constructing end-to-end NLP systems requires the processing of many types of linguistic information prior to solving the desired end task. A common approach to this problem is to construct a pipeline, one component for each task, with each system's output becoming input for the next. This approach poses two problems. First, errors propagate, and, much like the childhood game of telephone, combining systems in this manner can lead to unintelligible outcomes. Second, each component task requires annotated training data to act as supervision for training the model. These annotations are often expensive and time-consuming to produce, may differ from each other in genre and style, and may not match the intended application.
In this dissertation we present a general framework for constructing and reasoning on joint graphical model formulations of NLP problems. Individual models are composed using weighted Boolean logic constraints, and inference is performed using belief propagation. The systems we develop are composed of two parts: one a representation of syntax, the other a desired end task (semantic role labeling, named entity recognition, or relation extraction). By modeling these problems jointly, both models are trained in a single, integrated process, with uncertainty propagated between them. This mitigates the accumulation of errors typical of pipelined approaches.
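The following toy example (with made-up variable names, potentials, and weights) illustrates how a weighted Boolean-logic constraint acts as a factor coupling two component models; on this tree-shaped factor graph, sum-product belief propagation returns exactly the marginals computed here by enumeration.

```python
# Hedged toy illustration of a weighted Boolean constraint as a factor:
# it softly enforces "if SYNTAX says 'verb' then SRL should say
# 'predicate'". All numbers are invented for the example.
import math
from itertools import product

w = 2.0  # constraint weight: larger = harder constraint

def constraint_factor(is_verb, is_predicate):
    """exp(w) when the implication holds, 1 otherwise (a soft rule)."""
    holds = (not is_verb) or is_predicate
    return math.exp(w) if holds else 1.0

# Unary potentials from the two component models (illustrative numbers).
syntax = {False: 0.4, True: 0.6}
srl    = {False: 0.7, True: 0.3}

# Exact marginals by enumeration; on this tree-shaped factor graph,
# belief propagation would return the same values.
Z = 0.0
marg_pred = 0.0
for v, p in product([False, True], repeat=2):
    score = syntax[v] * srl[p] * constraint_factor(v, p)
    Z += score
    if p:
        marg_pred += score
print("P(predicate) =", marg_pred / Z)
```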
Additionally, we propose a novel marginalization-based training method in which the error signal from end task annotations is used to guide the induction of a constrained latent syntactic representation. This allows training in the absence of syntactic training data, where the latent syntactic structure is instead optimized to best support the end task predictions. We find that across many NLP tasks this training method offers performance comparable to fully supervised training of each individual component, and in some instances improves upon it by learning latent structures which are more appropriate for the task.
Unification-based constraints for statistical machine translation
Morphology and syntax have both received attention in statistical machine translation research, but they are usually treated independently and the historical emphasis on translation into English has meant that many morphosyntactic issues remain under-researched. Languages with richer morphologies pose additional problems and conventional approaches tend to perform poorly when either source or target language has rich morphology.
In both computational and theoretical linguistics, feature structures together with the associated operation of unification have proven a powerful tool for modelling many morphosyntactic aspects of natural language. In this thesis, we propose a framework that extends a state-of-the-art syntax-based model with a feature structure lexicon and unification-based constraints on the target side of the synchronous grammar. Whilst our framework is language-independent, we focus on problems in the translation of English to German, a language pair that has a high degree of syntactic reordering and rich target-side morphology.
We first apply our approach to modelling agreement and case government phenomena. We use the lexicon to link surface form words with grammatical feature values, such as case, gender, and number, and we use constraints to enforce feature value identity for the words in agreement and government relations. We demonstrate improvements in translation quality of up to 0.5 BLEU over a strong baseline model.
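A minimal sketch of the unification operation behind such constraints, under an assumed flat feature-structure representation (not the thesis' actual implementation), is shown below.

```python
# Minimal feature-structure unification for agreement checking
# (flat feature structures assumed for illustration).
def unify(fs1, fs2):
    """Unify two flat feature structures; None signals failure."""
    result = dict(fs1)
    for feat, val in fs2.items():
        if feat in result and result[feat] != val:
            return None  # conflicting values: constraint violated
        result[feat] = val
    return result

# German determiner/noun agreement: "der" (nominative masculine reading)
# and "Frau" clash in gender, so unification fails.
der  = {"case": "nom", "gender": "masc", "number": "sg"}
frau = {"gender": "fem", "number": "sg"}

print(unify(der, frau))                              # None: violation
print(unify({"case": "nom", "number": "sg"}, frau))  # unifies
```

A failed unification, as in the first call above, is exactly the kind of agreement violation such constraints are meant to rule out during decoding.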
We then examine verbal complex production, another aspect of translation that requires the coordination of linguistic features over multiple words, often with long-range discontinuities. We develop a feature structure representation of verbal complex types, using constraint failure as an indicator of translation error, and use this to automatically identify and quantify errors that occur in our baseline system. A manual analysis and classification of errors informs an extended version of the model that incorporates information derived from a parse of the source. We identify clause spans and use model features to encourage the generation of complete verbal complex types. We are able to improve accuracy as measured using precision and recall against values extracted from the reference test sets.
Our framework allows for the incorporation of rich linguistic information, and we present sketches of further applications that could be explored in future work.
Syntactic and semantic features for statistical and neural machine translation
Machine Translation (MT) for language pairs with long-distance dependencies and word reordering, such as German–English, is prone to producing output that is lexically or syntactically incoherent. Statistical MT (SMT) models used explicit or latent syntax to improve reordering, but failed to capture other long-distance dependencies.
This thesis explores how explicit sentence-level syntactic information can improve translation for such complex linguistic phenomena. In particular, we work at the level of the syntactic-semantic interface with representations conveying the predicate-argument structures. These are essential to preserving semantics in translation and SMT systems have long struggled to model them.
String-to-tree SMT systems use explicit target syntax to handle long-distance reordering, but make strong independence assumptions which lead to inconsistent lexical choices. To address this, we propose a Selectional Preferences feature which models the semantic affinities between target predicates and their argument fillers using the target dependency relations available in the decoder. We found that our feature is not effective in a string-to-tree system for German→English, and that the conditioning context is often wrong because of mistranslated verbs.
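As a hedged illustration of what a selectional-preferences score can look like, the sketch below estimates predicate-argument affinity from toy co-occurrence counts; the thesis' actual feature and its estimation differ in detail.

```python
# Hedged sketch of a selectional-preferences score: how strongly a
# target predicate prefers an argument filler under a dependency
# relation, estimated from toy co-occurrence counts.
import math
from collections import Counter

# (predicate, relation, argument) counts, e.g. from a parsed corpus.
counts = Counter({
    ("drink", "dobj", "water"): 50,
    ("drink", "dobj", "idea"): 1,
    ("eat", "dobj", "bread"): 40,
})

def sel_pref(pred, rel, arg):
    """log P(arg | pred, rel), with add-one smoothing over seen args."""
    args = {a for (p, r, a) in counts if (p, r) == (pred, rel)}
    total = sum(c for (p, r, _), c in counts.items() if (p, r) == (pred, rel))
    return math.log((counts[(pred, rel, arg)] + 1) / (total + len(args) + 1))

print(sel_pref("drink", "dobj", "water"))  # high affinity
print(sel_pref("drink", "dobj", "idea"))   # low affinity
```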
To improve verb translation, we proposed a Neural Verb Lexicon Model (NVLM) incorporating sentence-level syntactic context from the source which carries relevant semantic information for verb disambiguation. When used as an extra feature for re-ranking the output of a German→English string-to-tree system, the NVLM improved verb translation precision by up to 2.7% and recall by up to 7.4%.
While the NVLM improved some aspects of translation, other syntactic and lexical inconsistencies are not addressed by a linear combination of independent models. In contrast to SMT, neural machine translation (NMT) avoids strong independence assumptions, thus generating more fluent translations and capturing some long-distance dependencies. Still, incorporating additional linguistic information can improve translation quality.
We proposed a method for tightly coupling target words and syntax in the NMT decoder. To represent syntax explicitly, we used CCG supertags, which encode subcategorization information, capturing long-distance dependencies and attachments. Our method improved translation quality on several difficult linguistic constructs, including prepositional phrases, which are the most frequent type of predicate arguments. These improvements over a strong baseline NMT system were consistent across two language pairs: 0.9 BLEU for German→English and 1.2 BLEU for Romanian→English.
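One simple way to couple supertags and words, shown below purely as an illustration (the thesis explores tighter couplings inside the decoder), is to interleave each target word with its CCG supertag in the output sequence.

```python
# Hedged sketch: interleave CCG supertags with target words to form the
# decoder-side sequence (the simplest coupling variant, for illustration).
words     = ["Mary", "likes", "music"]
supertags = ["NP", "(S\\NP)/NP", "NP"]

def interleave(words, tags):
    """Produce the decoder-side sequence tag1 w1 tag2 w2 ..."""
    return [tok for pair in zip(tags, words) for tok in pair]

target_sequence = interleave(words, supertags)
print(target_sequence)
# ['NP', 'Mary', '(S\\NP)/NP', 'likes', 'NP', 'music']
```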
Proceedings
Proceedings of the Ninth International Workshop on Treebanks and Linguistic Theories.
Editors: Markus Dickinson, Kaili Müürisep and Marco Passarotti.
NEALT Proceedings Series, Vol. 9 (2010), 268 pages.
© 2010 The editors and contributors.
Published by the Northern European Association for Language Technology (NEALT), http://omilia.uio.no/nealt.
Electronically published at Tartu University Library (Estonia): http://hdl.handle.net/10062/15891