
    Compression of Unordered XML Trees

    Many XML documents are data-centric and do not make use of the inherent document order. Can we provide stronger compression for such documents by giving up order? We first consider compression via minimal dags (directed acyclic graphs) and study the worst-case ratio of the size of the ordered dag divided by the size of the unordered dag, taken over all trees of size n. We prove that this worst-case ratio is n / log n for the edge size and n log log n / log n for the node size. In experiments we compare several known compressors on the original document tree versus a canonical version obtained by length-lexicographic sorting of subtrees. For some documents the difference is surprisingly large: reverse binary dags can be smaller by a factor of 3.7, and other compressors can be smaller by factors of up to 190.
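    The dag sizes discussed above can be pictured with a short sketch (an illustration of the general idea, not the paper's implementation): the minimal dag keeps one node per distinct subtree, and the unordered variant canonically sorts child codes before hashing, so subtrees that differ only in child order share a dag node.

```python
def dag_size(tree, unordered=False):
    """Return the number of distinct subtrees, i.e. nodes of the minimal dag.

    A tree is represented as (label, [children]). In the unordered variant,
    child codes are sorted before hashing, so two subtrees that differ only
    in the order of their children map to the same dag node.
    """
    seen = {}  # canonical code -> dag node id

    def canon(node):
        label, children = node
        keys = tuple(canon(c) for c in children)
        if unordered:
            keys = tuple(sorted(keys))
        key = (label, keys)
        if key not in seen:
            seen[key] = len(seen)
        return seen[key]

    canon(tree)
    return len(seen)

# r has two children a(x, y) and a(y, x), equal only up to child order:
t = ("r", [("a", [("x", []), ("y", [])]),
           ("a", [("y", []), ("x", [])])])
ordered = dag_size(t)                   # a(x,y) and a(y,x) stay distinct
unordered = dag_size(t, unordered=True) # they merge into one dag node
```

    Giving up order can only merge nodes, never split them, so the unordered dag is never larger than the ordered one; the paper quantifies how much smaller it can get in the worst case.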

    The Weight Function in the Subtree Kernel is Decisive

    Tree data are ubiquitous because they model a large variety of situations, e.g., the architecture of plants, the secondary structure of RNA, or the hierarchy of XML files. Nevertheless, the analysis of these non-Euclidean data is difficult per se. In this paper, we focus on the subtree kernel, a convolution kernel for tree data introduced by Vishwanathan and Smola in the early 2000s. More precisely, we investigate the influence of the weight function from a theoretical perspective and in real data applications. Using a two-class stochastic model, we establish that the performance of the subtree kernel improves when the weight of leaves vanishes, which motivates the definition of a new weight function, learned from the data rather than fixed by the user as is usually done. To this end, we define a unified framework for computing the subtree kernel from ordered or unordered trees that is particularly suitable for tuning parameters. Through eight real data classification problems we show the great efficiency of our approach, in particular for small datasets, which further demonstrates the importance of the weight function. Finally, a visualization tool for the significant features is derived. Comment: 36 pages
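    As a rough illustration of why the weight function is decisive (a minimal sketch under assumed conventions, not the authors' actual framework), the subtree kernel can be seen as counting matching complete subtrees between two trees, each match scaled by a weight; setting the weight of leaves to zero, as the abstract suggests, changes the kernel value substantially.

```python
from collections import Counter

def subtree_codes(tree, unordered=False):
    """List canonical codes of all complete subtrees of `tree`.
    A tree is (label, [children]); sorting child codes handles unordered trees."""
    out = []

    def canon(node):
        label, children = node
        keys = tuple(canon(c) for c in children)
        if unordered:
            keys = tuple(sorted(keys))
        code = (label, keys)
        out.append(code)
        return code

    canon(tree)
    return out

def subtree_kernel(t1, t2, weight, unordered=False):
    """K(t1, t2) = sum over shared subtree codes s of weight(s) * n1(s) * n2(s),
    where n_i(s) counts occurrences of s in tree i. The choice of `weight`
    (e.g. vanishing on leaves) is exactly the knob the paper studies."""
    c1 = Counter(subtree_codes(t1, unordered))
    c2 = Counter(subtree_codes(t2, unordered))
    return sum(weight(s) * c1[s] * c2[s] for s in c1.keys() & c2.keys())

t1 = ("a", [("b", []), ("c", [])])
t2 = ("a", [("b", []), ("c", [])])
k_uniform = subtree_kernel(t1, t2, lambda s: 1.0)               # all subtrees count
k_no_leaf = subtree_kernel(t1, t2, lambda s: 1.0 if s[1] else 0.0)  # leaves weigh 0
```

    With uniform weights the two leaf matches dominate the comparison; zeroing leaf weights leaves only the structurally informative internal match, which is the effect the paper's stochastic model formalizes.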

    Lossy compression of plant architectures

    Plants usually show intricate structures whose representation and management are an important source of complexity in models. Yet plant structures are also repetitive: although not identical, the organs, axes, and branches at different positions are often highly similar. From a formal perspective, this repetitive character of plant structures was first exploited in fractal-based plant models (Barnsley, 2000; Ferraro et al., 2005; Prusinkiewicz and Hanan, 1989; Smith, 1984). In particular, L-systems have been used extensively in the last two decades to amplify parsimonious rule-based models into complex branching structures by specifying how fundamental units are repeatedly duplicated and modified in space and over time (Prusinkiewicz et al., 2001). However, the inverse problem of finding a compact representation of a branching structure has remained largely open, and it is now becoming a key issue in modeling applications: solving it is needed both to gain insight into the complex organization of plants and to decrease the time and space complexity of simulation algorithms. The idea is that a compressed version of a plant structure may be much more efficient to manipulate than the original extensive branching structure. For instance, Soler et al. (2003) have shown that the complexity of radiation simulation can be drastically reduced if self-similar representations of plants are used. Unfortunately, strict self-similarity has a limited range of applications, because neither real plants nor more sophisticated plant models are exactly self-similar. Consequently, we propose in this paper an algorithm that exploits approximate self-similarity to compress plant structures to various degrees, representing a tradeoff between compression rate and accuracy. This new compression method aims to make it possible to efficiently model, simulate, and analyze plants using these compressed representations.
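    One hedged way to picture the compression-rate/accuracy tradeoff (illustrative only; the paper's actual algorithm merges approximately self-similar structures, not merely depth-truncated ones): treat subtrees as equal when they agree up to a bounded depth, so a smaller depth limit merges more subtrees, giving a smaller but lossier representation.

```python
def lossy_dag_size(tree, depth_limit):
    """Count dag nodes when two subtrees are merged if they agree up to
    `depth_limit` levels. A tree is (label, [children]). Smaller limits
    mean more merging: higher compression, lower accuracy. This is an
    illustrative proxy, not the paper's approximate-self-similarity algorithm.
    """
    seen = set()

    def canon(node, d):
        label, children = node
        if d == 0:
            # Collapse everything below this depth into one representative.
            key = (label, "...")
        else:
            key = (label, tuple(canon(c, d - 1) for c in children))
        seen.add(key)
        return key

    canon(tree, depth_limit)
    return len(seen)

# A chain a-a-a-a: exact compression keeps all 4 distinct suffixes,
# while a depth limit of 1 merges everything below the root's children.
chain = ("a", [("a", [("a", [("a", [])])])])
exact = lossy_dag_size(chain, 10)
lossy = lossy_dag_size(chain, 1)
```

    Sweeping the depth limit traces out the tradeoff curve the abstract describes: at one extreme the structure is reproduced exactly, at the other it collapses toward a strictly self-similar skeleton.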

    Archiving scientific data

    We present an archiving technique for hierarchical data with key structure. Our approach is based on the notion of timestamps, whereby an element appearing in multiple versions of the database is stored only once along with a compact description of the versions in which it appears. The basic idea of timestamping was discovered by Driscoll et al. in the context of persistent data structures, where one wishes to track the sequence of changes made to a data structure. We extend this idea to develop an archiving tool for XML data that is capable of providing meaningful change descriptions and can also efficiently support a variety of basic functions concerning the evolution of data, such as retrieval of any specific version from the archive and querying the temporal history of any element. This is in contrast to diff-based approaches, where such operations may require undoing a large number of changes or significant reasoning with the deltas. Surprisingly, our archiving technique does not incur any significant space overhead when contrasted with other approaches. Our experimental results support this and also show that the compacted archive file interacts well with other compression techniques. Finally, another useful property of our approach is that the resulting archive is also in XML and hence can directly leverage existing XML tools.
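    A minimal sketch of the timestamping idea described above (names are illustrative and the key structure is flattened; this is not the paper's actual tool): each keyed element is stored once, annotated with the set of versions in which it appears, so retrieving a version or an element's temporal history is a direct lookup rather than a replay of diffs.

```python
class Archive:
    """Store each keyed element once, tagged with the versions containing it."""

    def __init__(self):
        self.elements = {}  # element key -> set of version numbers
        self.version = 0

    def add_version(self, snapshot):
        """Merge a new database version, given as an iterable of element keys."""
        self.version += 1
        for key in snapshot:
            self.elements.setdefault(key, set()).add(self.version)

    def retrieve(self, version):
        """Reconstruct the set of element keys present in a given version."""
        return {k for k, vs in self.elements.items() if version in vs}

    def history(self, key):
        """Temporal history of one element: the versions it appears in."""
        return sorted(self.elements.get(key, ()))

archive = Archive()
archive.add_version(["x", "y"])  # version 1
archive.add_version(["y", "z"])  # version 2: x removed, z added
v1 = archive.retrieve(1)         # elements of version 1
y_versions = archive.history("y")
```

    In contrast to storing deltas, no undo chain is needed: "y" is stored once even though it appears in both versions, which is the space saving the timestamp representation provides.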