98 research outputs found

    XML Compression via DAGs

    Unranked trees can be represented using their minimal dag (directed acyclic graph). For XML this achieves high compression ratios due to its repetitive markup. Unranked trees are often represented through first child/next sibling (fcns) encoded binary trees. We study the difference in size (= number of edges) of the minimal dag versus the minimal dag of the fcns encoded binary tree. One main finding is that the size of the dag of the binary tree can never be smaller than the square root of the size of the minimal dag, and that there are examples that match this bound. We introduce a new combined structure, the hybrid dag, which is guaranteed to be smaller than (or equal in size to) both dags. Interestingly, we find through experiments that last child/previous sibling encodings are much better for XML compression via dags than fcns encodings. We determine the average sizes of unranked and binary dags over a given set of labels (under uniform distribution) in terms of their exact generating functions, and in terms of their asymptotic behavior. Comment: A short version of this paper appeared in the Proceedings of ICDT 201
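
The two size measures compared in the abstract above can be illustrated with a minimal sketch. It is not the paper's implementation. It assumes that trees are nested tuples `(label, children)`, that the size of a dag is the number of edges leaving its distinct subtrees, and that an absent first child or next sibling in the fcns encoding contributes no edge. The function names `fcns`, `dag_size_unranked`, and `dag_size_binary` are hypothetical.

```python
# Unranked tree: (label, (child, child, ...)), built from nested tuples.
# Binary fcns tree: (label, first_child, next_sibling), with None when absent.


def fcns(forest):
    """Encode a sequence of sibling trees as a first child/next sibling binary tree."""
    if not forest:
        return None
    (label, children), rest = forest[0], forest[1:]
    return (label, fcns(children), fcns(rest))


def dag_size_unranked(tree):
    """Count the edges of the minimal dag: one edge per child of each distinct subtree."""
    seen = set()

    def visit(t):
        if t in seen:
            return
        seen.add(t)
        for child in t[1]:
            visit(child)

    visit(tree)
    return sum(len(t[1]) for t in seen)


def dag_size_binary(btree):
    """Count the edges of the minimal dag of a binary tree (absent children add no edge)."""
    seen = set()

    def visit(t):
        if t is None or t in seen:
            return
        seen.add(t)
        visit(t[1])
        visit(t[2])

    visit(btree)
    return sum((t[1] is not None) + (t[2] is not None) for t in seen)


# Example: r(f(a, a), f(a, a)) has 6 edges as a plain tree.
a = ("a", ())
f = ("f", (a, a))
r = ("r", (f, f))
print(dag_size_unranked(r))         # 4: the subtrees f(a, a) and a are shared
print(dag_size_binary(fcns((r,))))  # 5: sibling position breaks some sharing
```

In this small example the fcns encoding shares less, because the two `f` nodes have different next siblings and so become different binary subtrees. Structural equality of tuples stands in for a proper hash-consing table, which is adequate only for small inputs.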

    A Formal View on Training of Weighted Tree Automata by Likelihood-Driven State Splitting and Merging

    The use of computers and algorithms to deal with human language, in both spoken and written form, is summarized by the term natural language processing (nlp). Modeling language in a way that is suitable for computers plays an important role in nlp. One idea is to use formalisms from theoretical computer science for that purpose. For example, one can try to find an automaton to capture the valid written sentences of a language. Finding such an automaton by way of examples is called training. In this work, we also consider the structure of sentences by making use of trees. We use weighted tree automata (wta) in order to deal with such tree structures. Those devices assign weights to trees in order to, for example, distinguish between good and bad structures. The well-known expectation-maximization algorithm can be used to train the weights for a wta while the state behavior stays fixed. As a way to adapt the state behavior of a wta, state splitting, i.e. dividing a state into several new states, and state merging, i.e. replacing several states by a single new state, can be used. State splitting, state merging, and the expectation-maximization algorithm have already been combined into the state splitting and merging algorithm, which was successfully applied in practice. In our work, we formalized this approach in order to show properties of the algorithm. We also examined a new approach, the count-based state merging algorithm, which exclusively relies on state merging. When dealing with trees, another important tool is binarization. A binarization is a strategy to code arbitrary trees by binary trees. For each of three different binarizations we showed that wta together with the binarization are as powerful as weighted unranked tree automata (wuta). We also showed that this is still true if only probabilistic wta and probabilistic wuta are considered.

    Contents:
    How to Read This Thesis
    1. Introduction: 1.1. The Contributions and the Structure of This Work
    2. Preliminaries: 2.1. Sets, Relations, Functions, Families, and Extrema; 2.2. Algebraic Structures; 2.3. Formal Languages
    3. Language Formalisms: 3.1. Context-Free Grammars (CFGs); 3.2. Context-Free Grammars with Latent Annotations (CFG-LAs); 3.3. Weighted Tree Automata (WTAs); 3.4. Equivalences of WCFG-LAs and WTAs
    4. Training of WTAs: 4.1. Probability Distributions; 4.2. Maximum Likelihood Estimation; 4.3. Probabilities and WTAs; 4.4. The EM Algorithm for WTAs; 4.5. Inside and Outside Weights; 4.6. Adaption of the Estimation of Corazza and Satta [CS07] to WTAs
    5. State Splitting and Merging: 5.1. State Splitting and Merging for Weighted Tree Automata (5.1.1. Splitting Weights and Probabilities; 5.1.2. Merging Probabilities); 5.2. The State Splitting and Merging Algorithm (5.2.1. Finding a Good π-Distributor; 5.2.2. Notes About the Berkeley Parser); 5.3. Conclusion and Further Research
    6. Count-Based State Merging: 6.1. Preliminaries; 6.2. The Likelihood of the Maximum Likelihood Estimate and Its Behavior While Merging; 6.3. The Count-Based State Merging Algorithm (6.3.1. Further Adjustments for Practical Implementations); 6.4. Implementation of Count-Based State Merging; 6.5. Experiments with Artificial Automata and Corpora (6.5.1. The Artificial Automata; 6.5.2. Results); 6.6. Experiments with the Penn Treebank; 6.7. Comparison to the Approach of Carrasco, Oncina, and Calera-Rubio [COC01]; 6.8. Conclusion and Further Research
    7. Binarization: 7.1. Preliminaries; 7.2. Relating WSTAs and WUTAs via Binarizations (7.2.1. Left-Branching Binarization; 7.2.2. Right-Branching Binarization; 7.2.3. Mixed Binarization); 7.3. The Probabilistic Case (7.3.1. Additional Preliminaries About WSAs; 7.3.2. Constructing an Out-Probabilistic WSA from a Converging WSA; 7.3.3. Binarization and Probabilistic Tree Automata); 7.4. Connection to the Training Methods in Previous Chapters; 7.5. Conclusion and Further Research
    A. Proofs for Preliminaries
    B. Proofs for Training of WTAs
    C. Proofs for State Splitting and Merging
    D. Proofs for Count-Based State Merging
    Bibliography; List of Algorithms; List of Figures; List of Tables; Index; Table of Variable Name

    Linear Bounded Composition of Tree-Walking Tree Transducers: Linear Size Increase and Complexity

    Compositions of tree-walking tree transducers form a hierarchy with respect to the number of transducers in the composition. As the main technical result it is proved that any such composition can be realized as a linear bounded composition, which means that the sizes of the intermediate results can be chosen to be at most linear in the size of the output tree. This has consequences for the expressiveness and complexity of the translations in the hierarchy. First, if the computed translation is a function of linear size increase, i.e., the size of the output tree is at most linear in the size of the input tree, then it can be realized by just one deterministic tree-walking tree transducer. For compositions of deterministic transducers it is decidable whether or not the translation is of linear size increase. Second, every composition of deterministic transducers can be computed in deterministic linear time on a RAM and in deterministic linear space on a Turing machine, measured in the sum of the sizes of the input and output tree. Similarly, every composition of nondeterministic transducers can be computed in simultaneous polynomial time and linear space on a nondeterministic Turing machine. Their output tree languages are deterministic context-sensitive, i.e., can be recognized in deterministic linear space on a Turing machine. The membership problem for compositions of nondeterministic translations is in nondeterministic polynomial time and deterministic linear space. The membership problem for the composition of a nondeterministic and a deterministic tree-walking tree translation (for a nondeterministic IO macro tree translation) is log-space reducible to a context-free language, whereas the membership problem for the composition of a deterministic and a nondeterministic tree-walking tree translation (for a nondeterministic OI macro tree translation) is possibly NP-complete.

    XML Schema subtyping.

    Automata for Unordered Trees

    We present a framework for defining automata for unordered data trees that is parametrized by the way in which multisets of children nodes are described. Presburger tree automata and alternating Presburger tree automata are particular instances. We establish the usual equivalence in expressiveness of tree automata and MSO for the automata defined in our framework. We then investigate subclasses of automata for unordered trees for which testing language equivalence is in P-time. For this we start from automata in our framework that describe multisets of children by finite automata, and propose two approaches of how to do this deterministically. We show that a restriction to confluent horizontal evaluation leads to polynomial-time emptiness and universality, but still suffers from coNP-completeness of the emptiness of binary intersections. Finally, efficient algorithms can be obtained by imposing an order of horizontal evaluation globally for all automata in the class. Depending on the choice of the order, we obtain different classes of automata, each of which has the same expressiveness as Counting MSO.

    Reasoning about XML with temporal logics and automata

    We show that problems arising in static analysis of XML specifications and transformations can be dealt with using techniques similar to those developed for static analysis of programs. Many properties of interest in the XML context are related to navigation, and can be formulated in temporal logics for trees. We choose a logic that admits a simple single-exponential translation into unranked tree automata, in the spirit of the classical LTL-to-Büchi automata translation. Automata arising from this translation have a number of additional properties; in particular, they are convenient for reasoning about unary node-selecting queries, which are important in the XML context. We give two applications of such reasoning: one deals with a classical XML problem of reasoning about navigation in the presence of schemas, and the other relates to verifying security properties of XML views.