A Formal View on Training of Weighted Tree Automata by Likelihood-Driven State Splitting and Merging
The use of computers and algorithms to deal with human language, in both spoken and written form, is summarized by the term natural language processing (NLP). Modeling language in a way that is suitable for computers plays an important role in NLP. One idea is to use formalisms from theoretical computer science for that purpose. For example, one can try to find an automaton that captures the valid written sentences of a language. Finding such an automaton by way of examples is called training.
In this work, we also consider the structure of sentences by making use of trees. We use weighted tree automata (WTAs) to deal with such tree structures. These devices assign weights to trees in order to, for example, distinguish between good and bad structures. The well-known expectation-maximization algorithm can be used to train the weights of a WTA while the state behavior stays fixed. To adapt the state behavior of a WTA, one can use state splitting, i.e. dividing a state into several new states, and state merging, i.e. replacing several states by a single new state. State splitting, state merging, and the expectation-maximization algorithm were already combined into the state splitting and merging algorithm, which was successfully applied in practice. In our work, we formalized this approach in order to show properties of the algorithm. We also examined a new approach – the count-based state merging algorithm – which relies exclusively on state merging.
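To make the notion of state merging concrete, the following is a minimal sketch and not the thesis's algorithm. It assumes a hypothetical representation in which each transition is a triple (symbol, child states, target state) with an observed count, and it assumes that weights are re-estimated by relative frequency per target state. The function names `merge_states` and `normalize` and the toy counts are invented for illustration.

```python
from collections import defaultdict


def merge_states(counts, group, new_state):
    """Replace every state in `group` by `new_state` and add up the counts
    of transitions that become identical after the renaming."""
    def rename(q):
        return new_state if q in group else q

    merged = defaultdict(float)
    for (symbol, children, target), c in counts.items():
        key = (symbol, tuple(rename(q) for q in children), rename(target))
        merged[key] += c
    return dict(merged)


def normalize(counts):
    """Relative-frequency weights (an assumption of this sketch): each
    transition count divided by the total count of its target state."""
    totals = defaultdict(float)
    for (_, _, target), c in counts.items():
        totals[target] += c
    return {rule: c / totals[rule[2]] for rule, c in counts.items()}


# Toy transition counts with two states, NP1 and NP2, that will be merged.
counts = {
    ("S", ("NP1", "VP"), "S"): 3.0,
    ("S", ("NP2", "VP"), "S"): 1.0,
    ("she", (), "NP1"): 3.0,
    ("he", (), "NP2"): 1.0,
    ("sleeps", (), "VP"): 4.0,
}

merged = merge_states(counts, {"NP1", "NP2"}, "NP")
weights = normalize(merged)
```

After the merge, the two `S` transitions collapse into one, and the merged state `NP` distributes its weight over `she` and `he` in proportion to their counts.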
When dealing with trees, another important tool is binarization. A binarization is a strategy to code arbitrary trees by binary trees. For each of three different binarizations we showed that WTAs together with the binarization are as powerful as weighted unranked tree automata (WUTAs). We also showed that this is still true if only probabilistic WTAs and probabilistic WUTAs are considered.
How to Read This Thesis
1. Introduction
1.1. The Contributions and the Structure of This Work
2. Preliminaries
2.1. Sets, Relations, Functions, Families, and Extrema
2.2. Algebraic Structures
2.3. Formal Languages
3. Language Formalisms
3.1. Context-Free Grammars (CFGs)
3.2. Context-Free Grammars with Latent Annotations (CFG-LAs)
3.3. Weighted Tree Automata (WTAs)
3.4. Equivalences of WCFG-LAs and WTAs
4. Training of WTAs
4.1. Probability Distributions
4.2. Maximum Likelihood Estimation
4.3. Probabilities and WTAs
4.4. The EM Algorithm for WTAs
4.5. Inside and Outside Weights
4.6. Adaption of the Estimation of Corazza and Satta [CS07] to WTAs
5. State Splitting and Merging
5.1. State Splitting and Merging for Weighted Tree Automata
5.1.1. Splitting Weights and Probabilities
5.1.2. Merging Probabilities
5.2. The State Splitting and Merging Algorithm
5.2.1. Finding a Good π-Distributor
5.2.2. Notes About the Berkeley Parser
5.3. Conclusion and Further Research
6. Count-Based State Merging
6.1. Preliminaries
6.2. The Likelihood of the Maximum Likelihood Estimate and Its Behavior While Merging
6.3. The Count-Based State Merging Algorithm
6.3.1. Further Adjustments for Practical Implementations
6.4. Implementation of Count-Based State Merging
6.5. Experiments with Artificial Automata and Corpora
6.5.1. The Artificial Automata
6.5.2. Results
6.6. Experiments with the Penn Treebank
6.7. Comparison to the Approach of Carrasco, Oncina, and Calera-Rubio [COC01]
6.8. Conclusion and Further Research
7. Binarization
7.1. Preliminaries
7.2. Relating WSTAs and WUTAs via Binarizations
7.2.1. Left-Branching Binarization
7.2.2. Right-Branching Binarization
7.2.3. Mixed Binarization
7.3. The Probabilistic Case
7.3.1. Additional Preliminaries About WSAs
7.3.2. Constructing an Out-Probabilistic WSA from a Converging WSA
7.3.3. Binarization and Probabilistic Tree Automata
7.4. Connection to the Training Methods in Previous Chapters
7.5. Conclusion and Further Research
A. Proofs for Preliminaries
B. Proofs for Training of WTAs
C. Proofs for State Splitting and Merging
D. Proofs for Count-Based State Merging
Bibliography
List of Algorithms
List of Figures
List of Tables
Index
Table of Variable Names
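The abstract of this thesis describes a binarization as a strategy to code arbitrary trees by binary trees. The sketch below illustrates one right-branching variant. It is not necessarily the definition used in the thesis: the tuple representation `(label, children)`, the auxiliary label prefix `@`, and the function names are assumptions made for illustration, and the decoder assumes that no original label starts with `@`.

```python
def binarize(tree):
    """Encode a tree of arbitrary arity as a tree with at most two children
    per node by folding the rightmost children under auxiliary nodes."""
    label, kids = tree
    kids = [binarize(k) for k in kids]
    while len(kids) > 2:
        # Group the two rightmost children under an auxiliary node.
        kids = kids[:-2] + [("@" + label, kids[-2:])]
    return (label, kids)


def unbinarize(tree):
    """Invert `binarize` by splicing the children of auxiliary nodes
    (labels starting with '@') back into their parent."""
    label, kids = tree
    flat = []
    for k in kids:
        k = unbinarize(k)
        if k[0].startswith("@"):
            flat.extend(k[1])
        else:
            flat.append(k)
    return (label, flat)


def max_arity(tree):
    """Largest number of children of any node in the tree."""
    _, kids = tree
    return max([len(kids)] + [max_arity(k) for k in kids])


tree = ("NP", [("the", []), ("big", []), ("red", []), ("dog", [])])
binary = binarize(tree)
restored = unbinarize(binary)
```

Decoding the encoded tree gives back the original, which is the property that makes such an encoding usable as a binarization.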
Rely-guarantee protocols for safe interference over shared memory
Mutable state can be useful in certain algorithms, to structure programs, or for efficiency purposes. However, when shared mutable state is used in non-local or nonobvious ways, the interactions that can occur via aliases to that shared memory can be a source of program errors. Undisciplined uses of shared state may unsafely interfere with local reasoning as other aliases may interleave their changes to the shared state in unexpected ways. We propose a novel technique, rely-guarantee protocols, that structures the interactions between aliases and ensures that only safe interference is possible.
We present a linear type system outfitted with our novel sharing mechanism that enables controlled interference over shared mutable resources. Each alias is assigned separate, local roles encoded in a protocol abstraction that constrains how an alias can legally use that shared state. By following the spirit of rely-guarantee reasoning, our rely-guarantee protocols ensure that only safe interference can occur but still allow many interesting uses of shared state, such as going beyond invariant and monotonic usages.
This thesis describes the three core mechanisms that enable our type-based technique to work: 1) we show how a protocol models an alias’s perspective on how the shared state evolves and constrains that alias’s interactions with the shared state; 2) we show how protocols can be used while enforcing the agreed interference contract; and finally, 3) we show how to check that all local protocols to some shared state can be safely composed to ensure globally safe interference over that shared memory. The interference caused by shared state is rooted in how the uses of different aliases to that state may be interleaved (perhaps even in non-deterministic ways) at run-time. Therefore, our technique is mostly agnostic as to whether this interference was the result of alias interleaving caused by sequential or concurrent semantics. We show implementations of our technique in both settings, and highlight their differences. Because sharing is “first-class” (and not tied to a module), we show a polymorphic procedure that enables abstract compositions of protocols. Thus, protocols can be specialized or extended without requiring specific knowledge of the interference produced by other protocols to that state. We show that protocol composition can ensure safety even when considering abstracted protocols. We show that this core composition mechanism is sound, decidable (without the need for manual intervention), and provide an algorithm implementation
Data-Oriented Parsing with discontinuous constituents and function tags
Statistical parsers are effective but are typically limited to producing projective dependencies or constituents. On the other hand, linguistically rich parsers recognize non-local relations and analyze both form and function phenomena but rely on extensive manual grammar development. We combine advantages of the two by building a statistical parser that produces richer analyses.
We investigate new techniques to implement treebank-based parsers that allow for discontinuous constituents. We present two systems. One system is based on a string-rewriting Linear Context-Free Rewriting System (LCFRS), while using a Probabilistic Discontinuous Tree Substitution Grammar (PDTSG) to improve disambiguation performance. Another system encodes the discontinuities in the labels of phrase structure trees, allowing for efficient context-free grammar parsing.
The two systems demonstrate that tree fragments as used in tree-substitution grammar improve disambiguation performance while capturing non-local relations on an as-needed basis. Additionally, we present results of models that produce function tags, resulting in a more linguistically adequate model of the data. We report substantial accuracy improvements in discontinuous parsing for German, English, and Dutch, including results on spoken Dutch
- …