XML Compression via DAGs
Unranked trees can be represented using their minimal dag (directed acyclic
graph). For XML this achieves high compression ratios due to the repetitive
markup. Unranked trees are often represented through first child/next sibling
(fcns) encoded binary trees. We study the difference in size (= number of
edges) of the minimal dag versus the minimal dag of the fcns encoded binary tree. One
main finding is that the size of the dag of the binary tree can never be
smaller than the square root of the size of the minimal dag, and that there are
examples that match this bound. We introduce a new combined structure, the
hybrid dag, which is guaranteed to be smaller than (or equal in size to) both
dags. Interestingly, we find through experiments that last child/previous
sibling encodings are much better for XML compression via dags than fcns
encodings. We determine the average sizes of unranked and binary dags over a
given set of labels (under uniform distribution) in terms of their exact
generating functions, and in terms of their asymptotic behavior.
Comment: A short version of this paper appeared in the Proceedings of ICDT 201
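The two quantities compared in this abstract can be computed directly. Below is a minimal sketch (tree representation and function names are our own, not the paper's) that builds the minimal dag by hash-consing shared subtrees and applies the fcns encoding:

```python
def minimal_dag_size(tree):
    """Edges of the minimal dag: identical subtrees are shared via hash-consing."""
    table = {}   # canonical subtree -> dag node id
    edges = 0
    def visit(t):
        nonlocal edges
        label, children = t
        key = (label, tuple(visit(c) for c in children))
        if key not in table:
            table[key] = len(table)
            edges += len(children)   # outgoing edges of the fresh dag node
        return table[key]
    visit(tree)
    return edges

def fcns(tree):
    """First child/next sibling encoding of an unranked tree as a binary tree."""
    def enc(siblings):
        if not siblings:
            return ("#", [])                      # placeholder leaf
        label, children = siblings[0]
        return (label, [enc(children), enc(siblings[1:])])
    return enc([tree])

# a(b, b, b): the three identical leaves are shared in the unranked dag,
# but the fcns sibling chain b(#, b(#, b(#, #))) shares much less.
t = ("a", [("b", []), ("b", []), ("b", [])])
print(minimal_dag_size(t), minimal_dag_size(fcns(t)))   # 3 8
```

The example shows the effect the paper studies: the dag of the fcns-encoded tree can be considerably larger than the unranked dag.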
A Formal View on Training of Weighted Tree Automata by Likelihood-Driven State Splitting and Merging
The use of computers and algorithms to deal with human language, in both spoken and written form, is summarized by the term natural language processing (nlp). Modeling language in a way that is suitable for computers plays an important role in nlp. One idea is to use formalisms from theoretical computer science for that purpose. For example, one can try to find an automaton to capture the valid written sentences of a language. Finding such an automaton by way of examples is called training.
In this work, we also consider the structure of sentences by making use of trees. We use weighted tree automata (wta) in order to deal with such tree structures. Those devices assign weights to trees in order to, for example, distinguish between good and bad structures. The well-known expectation-maximization algorithm can be used to train the weights of a wta while the state behavior stays fixed. As a way to adapt the state behavior of a wta, state splitting, i.e. dividing a state into several new states, and state merging, i.e. replacing several states by a single new state, can be used. State splitting, state merging, and the expectation-maximization algorithm were already combined into the state splitting and merging algorithm, which was successfully applied in practice. In our work, we formalized this approach in order to show properties of the algorithm. We also examined a new approach, the count-based state merging algorithm, which relies exclusively on state merging.
When dealing with trees, another important tool is binarization. A binarization is a strategy to encode arbitrary trees as binary trees. For each of three different binarizations we showed that wta together with the binarization are as powerful as weighted unranked tree automata (wuta). We also showed that this still holds if only probabilistic wta and probabilistic wuta are considered.
How to Read This Thesis
1. Introduction
1.1. The Contributions and the Structure of This Work
2. Preliminaries
2.1. Sets, Relations, Functions, Families, and Extrema
2.2. Algebraic Structures
2.3. Formal Languages
3. Language Formalisms
3.1. Context-Free Grammars (CFGs)
3.2. Context-Free Grammars with Latent Annotations (CFG-LAs)
3.3. Weighted Tree Automata (WTAs)
3.4. Equivalences of WCFG-LAs and WTAs
4. Training of WTAs
4.1. Probability Distributions
4.2. Maximum Likelihood Estimation
4.3. Probabilities and WTAs
4.4. The EM Algorithm for WTAs
4.5. Inside and Outside Weights
4.6. Adaptation of the Estimation of Corazza and Satta [CS07] to WTAs
5. State Splitting and Merging
5.1. State Splitting and Merging for Weighted Tree Automata
5.1.1. Splitting Weights and Probabilities
5.1.2. Merging Probabilities
5.2. The State Splitting and Merging Algorithm
5.2.1. Finding a Good π-Distributor
5.2.2. Notes About the Berkeley Parser
5.3. Conclusion and Further Research
6. Count-Based State Merging
6.1. Preliminaries
6.2. The Likelihood of the Maximum Likelihood Estimate and Its Behavior While Merging
6.3. The Count-Based State Merging Algorithm
6.3.1. Further Adjustments for Practical Implementations
6.4. Implementation of Count-Based State Merging
6.5. Experiments with Artificial Automata and Corpora
6.5.1. The Artificial Automata
6.5.2. Results
6.6. Experiments with the Penn Treebank
6.7. Comparison to the Approach of Carrasco, Oncina, and Calera-Rubio [COC01]
6.8. Conclusion and Further Research
7. Binarization
7.1. Preliminaries
7.2. Relating WSTAs and WUTAs via Binarizations
7.2.1. Left-Branching Binarization
7.2.2. Right-Branching Binarization
7.2.3. Mixed Binarization
7.3. The Probabilistic Case
7.3.1. Additional Preliminaries About WSAs
7.3.2. Constructing an Out-Probabilistic WSA from a Converging WSA
7.3.3. Binarization and Probabilistic Tree Automata
7.4. Connection to the Training Methods in Previous Chapters
7.5. Conclusion and Further Research
A. Proofs for Preliminaries
B. Proofs for Training of WTAs
C. Proofs for State Splitting and Merging
D. Proofs for Count-Based State Merging
Bibliography
List of Algorithms
List of Figures
List of Tables
Index
Table of Variable Names
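The count-based view of merging from Chapter 6 can be sketched concretely. Assuming rules are stored with their corpus counts (the representation below is our illustration, not the thesis's formalization), maximum likelihood estimation is relative-frequency estimation per state, and merging two states amounts to summing their rule counts before re-estimating:

```python
from collections import Counter

def mle(counts):
    """Relative frequencies: p(rule) = c(rule) / total count of its left-hand state."""
    totals = Counter()
    for (state, label, kids), c in counts.items():
        totals[state] += c
    return {rule: c / totals[rule[0]] for rule, c in counts.items()}

def merge_states(counts, a, b, new):
    """Replace states a and b by `new` everywhere and sum the rule counts."""
    sub = lambda s: new if s in (a, b) else s
    merged = Counter()
    for (state, label, kids), c in counts.items():
        merged[(sub(state), label, tuple(sub(k) for k in kids))] += c
    return dict(merged)

# Rules (state, terminal, child states) with counts; merging q1 and q2 pools their counts.
counts = {("q1", "a", ()): 2, ("q1", "b", ()): 2, ("q2", "a", ()): 1}
p = mle(merge_states(counts, "q1", "q2", "q"))
print(p)   # {('q', 'a', ()): 0.6, ('q', 'b', ()): 0.4}
```

Because counts are additive under merging, the maximum likelihood estimate of the merged automaton is obtained without revisiting the corpus, which is the efficiency the count-based algorithm exploits.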
Linear Bounded Composition of Tree-Walking Tree Transducers: Linear Size Increase and Complexity
Compositions of tree-walking tree transducers form a hierarchy with respect
to the number of transducers in the composition. As the main technical result, it is
proved that any such composition can be realized as a linear bounded
composition, which means that the sizes of the intermediate results can be
chosen to be at most linear in the size of the output tree. This has
consequences for the expressiveness and complexity of the translations in the
hierarchy. First, if the computed translation is a function of linear size
increase, i.e., the size of the output tree is at most linear in the size of
the input tree, then it can be realized by just one, deterministic,
tree-walking tree transducer. For compositions of deterministic transducers it
is decidable whether or not the translation is of linear size increase. Second,
every composition of deterministic transducers can be computed in deterministic
linear time on a RAM and in deterministic linear space on a Turing machine,
measured in the sum of the sizes of the input and output tree. Similarly, every
composition of nondeterministic transducers can be computed in simultaneous
polynomial time and linear space on a nondeterministic Turing machine. Their
output tree languages are deterministic context-sensitive, i.e., can be
recognized in deterministic linear space on a Turing machine. The membership
problem for compositions of nondeterministic translations is in nondeterministic
polynomial time and deterministic linear space. The membership problem for the
composition of a nondeterministic and a deterministic tree-walking tree
translation (for a nondeterministic IO macro tree translation) is log-space
reducible to a context-free language, whereas the membership problem for the
composition of a deterministic and a nondeterministic tree-walking tree
translation (for a nondeterministic OI macro tree translation) is possibly
NP-complete.
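As a toy illustration of why bounding intermediate sizes matters (plain top-down tree functions stand in for tree-walking tree transducers here; this is our sketch, not the paper's construction): composing a transducer that duplicates its input with one that projects the first component computes the identity, a translation of linear size increase, yet the naive intermediate tree is about twice the size of the output:

```python
def size(t):
    """Number of nodes of a tree given as (label, [children])."""
    return 1 + sum(size(c) for c in t[1])

def duplicate(t):        # first transformation: t -> pair(t, t)
    return ("pair", [t, t])

def project_first(t):    # second transformation: pair(x, y) -> x
    return t[1][0] if t[0] == "pair" else t

tree = ("f", [("a", []), ("b", [])])
mid = duplicate(tree)
out = project_first(mid)
print(size(tree), size(mid), size(out))   # 3 7 3
```

The paper's result says such detours can always be avoided for tree-walking tree transducers: intermediate results can be kept at most linear in the size of the output tree.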
Automata for Unordered Trees
We present a framework for defining automata for unordered data trees that is parametrized by the way in which multisets of children nodes are described. Presburger tree automata and alternating Presburger tree automata are particular instances. We establish the usual equivalence in expressiveness of tree automata and MSO for the automata defined in our framework. We then investigate subclasses of automata for unordered trees for which testing language equivalence is in P-time. For this we start from automata in our framework that describe multisets of children by finite automata, and propose two approaches of how to do this deterministically. We show that a restriction to confluent horizontal evaluation leads to polynomial-time emptiness and universality, but still suffers from coNP-completeness of the emptiness of binary intersections. Finally, efficient algorithms can be obtained by imposing an order of horizontal evaluation globally for all automata in the class. Depending on the choice of the order, we obtain different classes of automata, each of which has the same expressiveness as Counting MSO.
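A minimal sketch of the central idea (representation and constraint are our own illustration): a bottom-up automaton on unordered trees whose transition looks only at the multiset of child states, here through a simple counting constraint in the spirit of Presburger tree automata:

```python
from collections import Counter

def run(tree, delta):
    """tree = (label, [children]); delta maps (label, multiset of child states) to a state."""
    label, children = tree
    return delta(label, Counter(run(c, delta) for c in children))

# Illustrative constraint: a node is in state 'even' iff it has an even number
# of children in state 'a'; an 'a'-labelled leaf is in state 'a'.
def delta(label, kids):
    if label == "a" and not kids:
        return "a"
    return "even" if kids["a"] % 2 == 0 else "odd"

print(run(("f", [("a", []), ("a", [])]), delta))   # even
print(run(("f", [("b", []), ("a", [])]), delta))   # odd
```

Since the children are collapsed into a Counter before `delta` is applied, the run is invariant under reordering of siblings, which is exactly what makes the automaton suitable for unordered trees.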
Reasoning about XML with temporal logics and automata
We show that problems arising in static analysis of XML specifications and transformations can be dealt with using techniques similar to those developed for static analysis of programs. Many properties of interest in the XML context are related to navigation, and can be formulated in temporal logics for trees. We choose a logic that admits a simple single-exponential translation into unranked tree automata, in the spirit of the classical LTL-to-Büchi automata translation. Automata arising from this translation have a number of additional properties; in particular, they are convenient for reasoning about unary node-selecting queries, which are important in the XML context. We give two applications of such reasoning: one deals with a classical XML problem of reasoning about navigation in the presence of schemas, and the other relates to verifying security properties of XML views
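The navigational flavor of such properties can be sketched as follows (our illustration, not the paper's translation): the formula EF b, "some descendant-or-self is labelled b", corresponds to a two-state unranked tree automaton evaluated bottom-up over the document tree:

```python
def ef(tree, target):
    """State 1 iff EF(label = target) holds at the root:
    the label matches, or some child is already in state 1."""
    label, children = tree
    return 1 if label == target or any(ef(c, target) for c in children) else 0

# A small document tree as (label, [children]).
doc = ("root", [("a", [("b", [])]), ("c", [])])
print(ef(doc, "b"), ef(doc, "d"))   # 1 0
```

For richer formulas the translation multiplies such state components, which is where the single-exponential bound mentioned in the abstract comes from.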