XML Compression via DAGs
Unranked trees can be represented using their minimal dag (directed acyclic
graph). For XML documents this achieves high compression ratios due to their repetitive markup. Unranked trees are often represented through first child/next sibling (fcns) encoded binary trees. We study the difference in size (= number of edges) of the minimal dag versus the minimal dag of the fcns encoded binary tree. One
main finding is that the size of the dag of the binary tree can never be
smaller than the square root of the size of the minimal dag, and that there are
examples that match this bound. We introduce a new combined structure, the
hybrid dag, which is guaranteed to be smaller than (or equal in size to) both
dags. Interestingly, we find through experiments that last child/previous
sibling encodings are much better for XML compression via dags than fcns
encodings. We determine the average sizes of unranked and binary dags over a
given set of labels (under uniform distribution) in terms of their exact
generating functions, and in terms of their asymptotic behavior.
Comment: A short version of this paper appeared in the Proceedings of ICDT 201
Parallel Maximum Clique Algorithms with Applications to Network Analysis and Storage
We propose a fast, parallel maximum clique algorithm for large sparse graphs
that is designed to exploit characteristics of social and information networks.
The method exhibits a roughly linear runtime scaling over real-world networks
ranging from 1000 to 100 million nodes. In a test on a social network with 1.8
billion edges, the algorithm finds the largest clique in about 20 minutes. Our
method employs a branch and bound strategy with novel and aggressive pruning
techniques. For instance, we use the core number of a vertex in combination
with a good heuristic clique finder to efficiently remove the vast majority of
the search space. In addition, we parallelize the exploration of the search
tree. During the search, processes immediately communicate changes to upper and
lower bounds on the size of the maximum clique, which occasionally results in a
super-linear speedup because vertices with large search spaces can be pruned by
other processes. We apply the algorithm to two problems: to compute temporal
strong components and to compress graphs.
Comment: 11 page
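The pruning strategy described above can be sketched in a few lines. The following is a serial toy version, not the authors' parallel implementation, and all names are my own: core numbers are computed by peeling minimum-degree vertices, and a vertex whose core number is below the current best clique size is pruned from the branch-and-bound search.

```python
# Serial toy sketch of branch-and-bound maximum clique with core-number
# pruning (illustration only; the paper's algorithm is parallel and uses
# further pruning rules). adj maps each vertex to the set of its neighbors.

def core_numbers(adj):
    """Peel minimum-degree vertices; core[v] is the largest k such that
    v lies in a subgraph of minimum degree k."""
    deg = {v: len(adj[v]) for v in adj}
    core, removed, k = {}, set(), 0
    while len(removed) < len(adj):
        v = min((u for u in adj if u not in removed), key=deg.get)
        k = max(k, deg[v])
        core[v] = k
        removed.add(v)
        for w in adj[v]:
            if w not in removed:
                deg[w] -= 1
    return core

def max_clique(adj):
    core = core_numbers(adj)
    best = []
    def expand(clique, cand):
        nonlocal best
        if len(clique) > len(best):
            best = clique[:]
        cand = sorted(cand, key=lambda v: -core[v])
        while cand:
            if len(clique) + len(cand) <= len(best):
                return                    # bound: cannot beat best clique
            v = cand.pop(0)
            if core[v] >= len(best):      # core-number pruning
                expand(clique + [v], [u for u in cand if u in adj[v]])
    expand([], list(adj))
    return best

# triangle {0, 1, 2} with a pendant vertex 3
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
print(sorted(max_clique(adj)))  # [0, 1, 2]
```

In the abstract's method a heuristic clique finder first seeds the lower bound, so the core-number test removes most of the graph before the exact search even starts; the sketch above only grows the bound as the search proceeds.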
On the Complexity of BWT-Runs Minimization via Alphabet Reordering
The Burrows-Wheeler Transform (BWT) has been an essential tool in text
compression and indexing. First introduced in 1994, it went on to provide the
backbone for the first encoding of the classic suffix tree data structure in
space close to the entropy-based lower bound. Recently, there has been the
development of compact suffix trees in space proportional to "", the number
of runs in the BWT, as well as the appearance of in the time complexity of
new algorithms. Unlike other popular measures of compression, the parameter
is sensitive to the lexicographic ordering given to the text's alphabet.
Despite several past attempts to exploit this, a provably efficient algorithm
for finding, or approximating, an alphabet ordering which minimizes has
been open for years.
We present the first set of results on the computational complexity of
minimizing BWT-runs via alphabet reordering. We prove that the decision version
of this problem is NP-complete and, unless the Exponential Time Hypothesis
fails, cannot be solved in time subexponential in the alphabet size and the
length of the text. We also show that the
optimization problem is APX-hard. In doing so, we relate two previously
disparate topics: the optimal traveling salesperson path and the number of runs
in the BWT of a text, providing a surprising connection between problems on
graphs and text compression. Also, by relating recent results in the field of
dictionary compression, we illustrate that an arbitrary alphabet ordering
provides an approximation to within polylogarithmic factors.
We provide an optimal linear-time algorithm for the problem of finding a
run-minimizing ordering on a subset of symbols (occurring only once) under
ordering constraints, and prove that a generalization of this problem to
Wheeler graphs, a class of graphs with BWT-like properties, is NP-complete.
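The sensitivity of r to the alphabet ordering is easy to demonstrate on a toy input. The sketch below is mine for illustration, not code from the paper: it builds the BWT by naive rotation sorting and counts equal-letter runs of "banana" under two orderings, keeping '$' as the smallest sentinel in both.

```python
# Toy demonstration that r, the number of equal-letter runs in the BWT,
# depends on the lexicographic order given to the alphabet.
# Naive O(n^2 log n) rotation-sorting BWT; illustration only.

def bwt_runs(text, order):
    """Runs in the BWT of text + '$', where the string `order` lists the
    alphabet (sentinel included) from smallest to largest."""
    rank = {c: i for i, c in enumerate(order)}
    s = text + '$'
    rotations = sorted(range(len(s)),
                       key=lambda i: [rank[c] for c in s[i:] + s[:i]])
    last = [s[(i - 1) % len(s)] for i in rotations]
    return sum(1 for i, c in enumerate(last) if i == 0 or c != last[i - 1])

# the same text under two orderings of {a, b, n}: r drops from 5 to 4
print(bwt_runs("banana", "$abn"), bwt_runs("banana", "$nba"))  # 5 4
```

The hardness results above say that finding the ordering minimizing this count is NP-complete in general, even though each individual evaluation is cheap.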
Evolution of Network Architecture in a Granular Material Under Compression
As a granular material is compressed, the particles and forces within the system arrange to form complex and heterogeneous collective structures. Force chains are a prime example of such structures, and are thought to constrain bulk properties such as mechanical stability and acoustic transmission. However, capturing and characterizing the evolving nature of the intrinsic inhomogeneity and mesoscale architecture of granular systems can be challenging. A growing body of work has shown that graph theoretic approaches may provide a useful foundation for tackling these problems. Here, we extend the current approaches by utilizing multilayer networks as a framework for directly quantifying the progression of mesoscale architecture in a compressed granular system. We examine a quasi-two-dimensional aggregate of photoelastic disks, subject to biaxial compression applied in a series of small, quasistatic steps. Treating particles as network nodes and interparticle forces as network edges, we construct a multilayer network for the system by linking together the series of static force networks that exist at each strain step. We then extract the inherent mesoscale structure from the system by using a generalization of community detection methods to multilayer networks, and we define quantitative measures to characterize the changes in this structure throughout the compression process. We separately consider the networks of normal and tangential forces, and find that they display different progressions throughout compression. To test the sensitivity of the network model to particle properties, we examine whether the method can distinguish a subsystem of low-friction particles within a bath of higher-friction particles. We find that this can be achieved by considering the network of tangential forces, and that the community structure is better able to separate the subsystem than a purely local measure of interparticle forces alone.
The results discussed throughout this study suggest that these network science techniques may provide a direct way to compare and classify data from systems under different external conditions or with different physical makeup.
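The multilayer construction described in this abstract can be written down compactly. Below is a minimal sketch with synthetic force data; the function name, edge-list format, and interlayer weight omega are my choices for illustration, not the authors' code. Each strain step becomes one layer of force-weighted edges, and identity interlayer edges couple each particle to itself across consecutive layers.

```python
# Minimal sketch of a multilayer force network (synthetic data, illustration
# only): each quasistatic strain step is a layer whose weighted intralayer
# edges are interparticle forces; interlayer edges of weight omega link each
# particle to its own copy in the adjacent layer.

def build_multilayer(force_steps, omega=1.0):
    """force_steps: one dict {(i, j): force} per strain step.
    Returns (intralayer, interlayer) edge lists over nodes (particle, layer)."""
    intra = [((i, t), (j, t), f)
             for t, layer in enumerate(force_steps)
             for (i, j), f in layer.items()]
    particles = {p for layer in force_steps for pair in layer for p in pair}
    inter = [((p, t), (p, t + 1), omega)
             for t in range(len(force_steps) - 1)
             for p in sorted(particles)]
    return intra, inter

# two strain steps for three particles: contact (1, 2) is lost in step 2
steps = [{(0, 1): 1.0, (1, 2): 0.5}, {(0, 1): 1.2}]
intra, inter = build_multilayer(steps)
```

Community detection on the combined edge set (the abstract's generalized multilayer method) then yields groups of particles that stay together across strain steps, which is what makes the evolving mesoscale structure directly quantifiable.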
Information Compression, Intelligence, Computing, and Mathematics
This paper presents evidence for the idea that much of artificial
intelligence, human perception and cognition, mainstream computing, and
mathematics, may be understood as compression of information via the matching
and unification of patterns. This is the basis for the "SP theory of
intelligence", outlined in the paper and fully described elsewhere. Relevant
evidence may be seen: in empirical support for the SP theory; in some
advantages of information compression (IC) in terms of biology and engineering;
in our use of shorthands and ordinary words in language; in how we merge
successive views of any one thing; in visual recognition; in binocular vision;
in visual adaptation; in how we learn lexical and grammatical structures in
language; and in perceptual constancies. IC via the matching and unification of
patterns may be seen in both computing and mathematics: in IC via equations; in
the matching and unification of names; in the reduction or removal of
redundancy from unary numbers; in the workings of Post's Canonical System and
the transition function in the Universal Turing Machine; in the way computers
retrieve information from memory; in systems like Prolog; and in the
query-by-example technique for information retrieval. The chunking-with-codes
technique for IC may be seen in the use of named functions to avoid repetition
of computer code. The schema-plus-correction technique may be seen in functions
with parameters and in the use of classes in object-oriented programming. And
the run-length coding technique may be seen in multiplication, in division, and
in several other devices in mathematics and computing. The SP theory resolves
the apparent paradox of "decompression by compression". And computing and
cognition as IC is compatible with the uses of redundancy in such things as
backup copies to safeguard data and understanding speech in a noisy
environment.