Comparing similar ordered trees in linear-time
We describe a linear-time algorithm for comparing two similar ordered rooted trees with node labels. The method for comparing trees is the usual tree edit distance. We show that an optimal mapping that uses at most k insertions or deletions can then be constructed in O(nk^3) time, where n is the size of the trees. The approach is inspired by the Zhang–Shasha algorithm for tree edit distance in combination with an adequate pruning of the search space based on the tree edit graph.
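To make the notion of tree edit distance concrete, here is a minimal sketch of the textbook memoized recursion on ordered forests with unit costs. It is not the O(nk^3) algorithm of the abstract and does no pruning; the tuple representation `(label, children)` and the function names are assumptions chosen for illustration.

```python
from functools import lru_cache

# Assumed representation: a tree is (label, children), children a tuple of trees.
# A forest is a tuple of trees.

def size(forest):
    """Number of nodes in a forest."""
    return sum(1 + size(children) for _, children in forest)

@lru_cache(maxsize=None)
def forest_distance(f, g):
    """Unit-cost edit distance between two ordered forests."""
    if not f and not g:
        return 0
    if not f:
        return size(g)          # insert every node of g
    if not g:
        return size(f)          # delete every node of f
    (a, a_kids), (b, b_kids) = f[-1], g[-1]
    # Delete the rightmost root of f: its children take its place.
    delete = forest_distance(f[:-1] + a_kids, g) + 1
    # Insert the rightmost root of g.
    insert = forest_distance(f, g[:-1] + b_kids) + 1
    # Map the two rightmost roots onto each other (relabel if they differ).
    match = (forest_distance(a_kids, b_kids)
             + forest_distance(f[:-1], g[:-1])
             + (0 if a == b else 1))
    return min(delete, insert, match)

def tree_edit_distance(t1, t2):
    return forest_distance((t1,), (t2,))
```

For example, `tree_edit_distance(("a", (("b", ()), ("c", ()))), ("a", (("b", ()),)))` counts the single deletion of the leaf `c`. The recursion explores far more subproblems than a pruned search restricted to at most k insertions or deletions would.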
Rule-based Machine Learning Methods for Functional Prediction
We describe a machine learning method for predicting the value of a
real-valued function, given the values of multiple input variables. The method
induces solutions from samples in the form of ordered disjunctive normal form
(DNF) decision rules. A central objective of the method and representation is
the induction of compact, easily interpretable solutions. This rule-based
decision model can be extended to search efficiently for similar cases prior to
approximating function values. Experimental results on real-world data
demonstrate that the new techniques are competitive with existing machine
learning and statistical methods and can sometimes yield superior regression
performance. Comment: See http://www.jair.org/ for any accompanying file
Arithmetic for Rooted Trees
We propose a new arithmetic for non-empty rooted unordered trees simply
called trees. After discussing tree representation and enumeration, we define
the operations of tree addition, multiplication and stretch, prove their
properties, and show that all trees can be generated from a starting tree of
one vertex. We then show how a given tree can be obtained as the sum or product
of two trees, thus defining prime trees with respect to addition and
multiplication. In both cases we show how primality can be decided in time
polynomial in the number of vertices and we prove that factorization is unique.
We then define negative trees and suggest dealing with tree equations, giving
some preliminary results. Finally we comment on how our arithmetic might be
useful, and discuss preceding studies that have some relation to ours. To the
best of our knowledge our approach and results are completely new aside from an
earlier version of this work submitted as an arXiv manuscript. Comment: 18 pages, 8 figures
XML Compression via DAGs
Unranked trees can be represented using their minimal dag (directed acyclic
graph). For XML this achieves high compression ratios due to its repetitive
markup. Unranked trees are often represented through first child/next sibling
(fcns) encoded binary trees. We study the difference in size (= number of
edges) of minimal dag versus minimal dag of the fcns encoded binary tree. One
main finding is that the size of the dag of the binary tree can never be
smaller than the square root of the size of the minimal dag, and that there are
examples that match this bound. We introduce a new combined structure, the
hybrid dag, which is guaranteed to be smaller than (or equal in size to) both
dags. Interestingly, we find through experiments that last child/previous
sibling encodings are much better for XML compression via dags than fcns
encodings. We determine the average sizes of unranked and binary dags over a
given set of labels (under uniform distribution) in terms of their exact
generating functions, and in terms of their asymptotical behavior. Comment: A short version of this paper appeared in the Proceedings of ICDT
201
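The size comparison described in this abstract can be tried on small inputs with the following sketch. It builds the minimal dag by sharing identical subtrees (hash-consing) and counts edges, once for an unranked tree and once for its fcns encoding. The tuple representation, the use of `None` for a missing binary child, and the choice not to count missing children as edges are assumptions made for illustration, not details taken from the paper.

```python
def minimal_dag_edges(tree):
    """Number of edges in the minimal dag of a tree given as (label, children).

    A child slot may be None (used by the binary encoding); None is not a node
    and contributes no edge.
    """
    ids = {}      # (label, child ids) -> id of the shared subtree
    edges = 0

    def visit(node):
        nonlocal edges
        if node is None:
            return None
        label, children = node
        key = (label, tuple(visit(c) for c in children))
        if key not in ids:                  # first time this subtree is seen
            ids[key] = len(ids)
            edges += sum(c is not None for c in children)
        return ids[key]

    visit(tree)
    return edges

def fcns(forest):
    """First child/next sibling encoding of a sibling sequence as a binary tree."""
    if not forest:
        return None
    (label, children), rest = forest[0], forest[1:]
    return (label, (fcns(children), fcns(rest)))

# A root with three identical children f(a, b).
a, b = ("a", ()), ("b", ())
f = ("f", (a, b))
root = ("r", (f, f, f))

unranked_size = minimal_dag_edges(root)        # dag of the unranked tree
binary_size = minimal_dag_edges(fcns((root,))) # dag of the fcns-encoded tree
```

On this input the unranked dag shares the three copies of `f(a, b)`, whereas in the fcns encoding each copy carries a different sibling chain and only part of it can be shared.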
Computing Runs on a General Alphabet
We describe a RAM algorithm computing all runs (maximal repetitions) of a
given string of length n over a general ordered alphabet in
O(n log^{2/3} n) time and linear space. Our algorithm outperforms all
known solutions working in Θ(n log σ) time provided σ = n^{Ω(1)}, where σ is the alphabet size. We conjecture that there
exists a linear time RAM algorithm finding all runs. Comment: 4 pages, 2 figures
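As a reminder of what a run is, the sketch below enumerates all runs of a string by brute force in quadratic time. It is a naive baseline for illustration only, not the algorithm of the abstract; the function name and the `(start, end, period)` output format with inclusive `end` are assumptions.

```python
def runs(w):
    """Return all runs of w as (start, end, period) with end inclusive.

    A run is a maximal substring whose length is at least twice its
    smallest period.
    """
    n = len(w)
    seen = set()      # (start, end) pairs already reported with a smaller period
    result = []
    for p in range(1, n // 2 + 1):
        k = 0
        while k + p < n:
            if w[k] != w[k + p]:
                k += 1
                continue
            start = k
            while k + p < n and w[k] == w[k + p]:
                k += 1
            end = k + p - 1
            # w[start..end] is maximal with period p; it is a repetition if it
            # spans at least 2p characters, i.e. k - start >= p.
            if k - start >= p and (start, end) not in seen:
                seen.add((start, end))
                result.append((start, end, p))
    return result
```

Periods are tried in increasing order, so a substring that is also periodic with a multiple of its smallest period is reported only once. For instance, `runs("abab")` reports the whole string with period 2.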
Partitioned conditional generalized linear models for categorical data
In categorical data analysis, several regression models have been proposed
for hierarchically-structured response variables, e.g. the nested logit model.
But they have been formally defined for only two or three levels in the
hierarchy. Here, we introduce the class of partitioned conditional generalized
linear models (PCGLMs) defined for any number of levels. The hierarchical
structure of these models is fully specified by a partition tree of categories.
Using the genericity of the (r,F,Z) specification, the PCGLM can handle
nominal, ordinal but also partially-ordered response variables. Comment: 25 pages, 13 figures
Efficient chaining of seeds in ordered trees
We consider here the problem of chaining seeds in ordered trees. Seeds are
mappings between two trees Q and T and a chain is a subset of non-overlapping
seeds that is consistent with respect to postfix order and ancestrality. This
problem is a natural extension of a similar problem for sequences, and has
applications in computational biology, such as mining a database of RNA
secondary structures. For the chaining problem with a set of m constant size
seeds, we describe an algorithm with complexity O(m^2 log(m)) in time and O(m^2)
in space.
- …