
    XML Compression via DAGs

    Unranked trees can be represented using their minimal dag (directed acyclic graph). For XML this achieves high compression ratios because XML markup is highly repetitive. Unranked trees are often represented through first child/next sibling (fcns) encoded binary trees. We study the difference in size (= number of edges) between the minimal dag and the minimal dag of the fcns encoded binary tree. One main finding is that the size of the dag of the binary tree can never be smaller than the square root of the size of the minimal dag, and that there are examples that match this bound. We introduce a new combined structure, the hybrid dag, which is guaranteed to be smaller than (or equal in size to) both dags. Interestingly, we find through experiments that last child/previous sibling encodings are much better for XML compression via dags than fcns encodings. We determine the average sizes of unranked and binary dags over a given set of labels (under uniform distribution) in terms of their exact generating functions, and in terms of their asymptotic behavior. Comment: A short version of this paper appeared in the Proceedings of ICDT 201
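
To make the two size measures in this abstract concrete, here is a minimal Python sketch. It is not the paper's implementation: the tuple-based tree representation, the function names, and the choice not to count null pointers as edges are all assumptions made for illustration. Shared subtrees are detected by hash-consing, and size is the number of edges of the resulting dag.

```python
def dag_edges(tree):
    """Edges of the minimal dag of an unranked tree given as (label, [children])."""
    ids = {}    # canonical (label, child ids) -> id of the shared dag node
    edges = 0

    def visit(node):
        nonlocal edges
        label, children = node
        key = (label, tuple(visit(c) for c in children))
        if key not in ids:              # first time this subtree shape is seen
            ids[key] = len(ids)
            edges += len(children)
        return ids[key]

    visit(tree)
    return edges


def fcns(node, siblings=()):
    """First child/next sibling encoding: returns (label, left, right)."""
    label, children = node
    left = fcns(children[0], tuple(children[1:])) if children else None
    right = fcns(siblings[0], tuple(siblings[1:])) if siblings else None
    return (label, left, right)


def binary_dag_edges(btree):
    """Edges of the minimal dag of a binary tree; None children are not counted (assumption)."""
    ids = {}
    edges = 0

    def visit(node):
        nonlocal edges
        if node is None:
            return None
        label, left, right = node
        key = (label, visit(left), visit(right))
        if key not in ids:
            ids[key] = len(ids)
            edges += (left is not None) + (right is not None)
        return ids[key]

    visit(btree)
    return edges


# Hypothetical example: a root with two identical subtrees b(c).
doc = ("r", [("b", [("c", [])]), ("b", [("c", [])])])
unranked_size = dag_edges(doc)
binary_size = binary_dag_edges(fcns(doc))
```

In this small example the unranked dag shares the repeated `b(c)` subtree, while the fcns encoding makes the two `b` nodes differ (one has a next sibling, the other does not), so the binary dag shares less.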

    Parallel Maximum Clique Algorithms with Applications to Network Analysis and Storage

    We propose a fast, parallel maximum clique algorithm for large sparse graphs that is designed to exploit characteristics of social and information networks. The method exhibits a roughly linear runtime scaling over real-world networks ranging from 1000 to 100 million nodes. In a test on a social network with 1.8 billion edges, the algorithm finds the largest clique in about 20 minutes. Our method employs a branch and bound strategy with novel and aggressive pruning techniques. For instance, we use the core number of a vertex in combination with a good heuristic clique finder to efficiently remove the vast majority of the search space. In addition, we parallelize the exploration of the search tree. During the search, processes immediately communicate changes to upper and lower bounds on the size of the maximum clique, which occasionally results in a super-linear speedup because vertices with large search spaces can be pruned by other processes. We apply the algorithm to two problems: to compute temporal strong components and to compress graphs. Comment: 11 page
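
The pruning idea described in this abstract (core numbers combined with a heuristic clique as a lower bound) can be sketched in a few lines of Python. This is a sequential illustration under stated assumptions, not the authors' parallel algorithm: the function names, the simple quadratic peeling loop, and the greedy heuristic are all choices made here for brevity.

```python
def build_adj(edges):
    """Adjacency sets for an undirected graph without self-loops."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def core_numbers(adj):
    """Core number of each vertex, by repeatedly peeling a minimum-degree vertex."""
    deg = {v: len(ns) for v, ns in adj.items()}
    remaining, core, k = set(adj), {}, 0
    while remaining:
        v = min(remaining, key=deg.__getitem__)
        k = max(k, deg[v])
        core[v] = k
        remaining.remove(v)
        for u in adj[v]:
            if u in remaining:
                deg[u] -= 1
    return core


def greedy_clique(adj, core):
    """Heuristic lower bound: grow a clique, preferring high core numbers."""
    clique, cand = [], set(adj)
    while cand:
        v = max(cand, key=core.__getitem__)
        clique.append(v)
        cand &= adj[v]          # keep only common neighbours (drops v itself)
    return clique


def max_clique(adj):
    core = core_numbers(adj)
    best = greedy_clique(adj, core)

    def expand(clique, cand):
        nonlocal best
        if len(clique) > len(best):
            best = clique
        # Bound: stop once even taking every candidate cannot beat best.
        while cand and len(clique) + len(cand) > len(best):
            v = cand.pop()
            if core[v] < len(best):   # v cannot lie in a clique larger than best
                continue
            expand(clique + [v], cand & adj[v])

    # A clique of size len(best) + 1 needs every vertex to have core number >= len(best).
    expand([], {v for v in adj if core[v] >= len(best)})
    return best


# Hypothetical example: a 4-clique with a two-edge tail attached.
g = build_adj([(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3), (3, 4), (4, 5)])
largest = max_clique(g)
```

On graphs like this one, where the heuristic already finds a maximum clique, the core-number filter leaves no candidates and the exact search ends immediately.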

    On the Complexity of BWT-Runs Minimization via Alphabet Reordering

    The Burrows-Wheeler Transform (BWT) has been an essential tool in text compression and indexing. First introduced in 1994, it went on to provide the backbone for the first encoding of the classic suffix tree data structure in space close to the entropy-based lower bound. Recently, there has been the development of compact suffix trees in space proportional to $r$, the number of runs in the BWT, as well as the appearance of $r$ in the time complexity of new algorithms. Unlike other popular measures of compression, the parameter $r$ is sensitive to the lexicographic ordering given to the text's alphabet. Despite several past attempts to exploit this, a provably efficient algorithm for finding, or approximating, an alphabet ordering which minimizes $r$ has been open for years. We present the first set of results on the computational complexity of minimizing BWT-runs via alphabet reordering. We prove that the decision version of this problem is NP-complete and cannot be solved in time $2^{o(\sigma + \sqrt{n})}$ unless the Exponential Time Hypothesis fails, where $\sigma$ is the size of the alphabet and $n$ is the length of the text. We also show that the optimization problem is APX-hard. In doing so, we relate two previously disparate topics: the optimal traveling salesperson path and the number of runs in the BWT of a text, providing a surprising connection between problems on graphs and text compression. Also, by relating recent results in the field of dictionary compression, we illustrate that an arbitrary alphabet ordering provides an $O(\log^2 n)$-approximation. We provide an optimal linear-time algorithm for the problem of finding a run-minimizing ordering on a subset of symbols (occurring only once) under ordering constraints, and prove a generalization of this problem to a class of graphs with BWT-like properties called Wheeler graphs is NP-complete.
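
The dependence of the run count on the alphabet ordering can be checked directly on small inputs. The Python sketch below is an illustration only, not an algorithm from the paper: it builds the BWT by sorting rotations, assumes a sentinel that is smaller than every symbol under any ordering, and searches all orderings by brute force, which is feasible only for tiny alphabets (consistent with the hardness results stated above). The function names are hypothetical.

```python
from itertools import permutations


def bwt_runs(text, order):
    """Number of runs in the BWT of text + sentinel under the given alphabet order."""
    rank = {c: i + 1 for i, c in enumerate(order)}   # 0 is reserved for the sentinel
    codes = [rank[c] for c in text] + [0]
    n = len(codes)
    # Sort rotation start positions by the rotation they begin.
    starts = sorted(range(n), key=lambda i: codes[i:] + codes[:i])
    bwt = [codes[i - 1] for i in starts]             # symbol preceding each rotation
    return 1 + sum(bwt[j] != bwt[j - 1] for j in range(1, n))


def best_order(text):
    """Brute-force search over all alphabet orderings (exponential in the alphabet size)."""
    return min(permutations(sorted(set(text))), key=lambda p: bwt_runs(text, p))


runs_default = bwt_runs("banana", "abn")
runs_best = bwt_runs("banana", best_order("banana"))
```

Comparing `runs_default` with `runs_best` on different inputs shows how much, or how little, reordering helps for a given text.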

    Evolution of Network Architecture in a Granular Material Under Compression

    As a granular material is compressed, the particles and forces within the system arrange to form complex and heterogeneous collective structures. Force chains are a prime example of such structures, and are thought to constrain bulk properties such as mechanical stability and acoustic transmission. However, capturing and characterizing the evolving nature of the intrinsic inhomogeneity and mesoscale architecture of granular systems can be challenging. A growing body of work has shown that graph theoretic approaches may provide a useful foundation for tackling these problems. Here, we extend the current approaches by utilizing multilayer networks as a framework for directly quantifying the progression of mesoscale architecture in a compressed granular system. We examine a quasi-two-dimensional aggregate of photoelastic disks, subject to biaxial compressions through a series of small, quasistatic steps. Treating particles as network nodes and interparticle forces as network edges, we construct a multilayer network for the system by linking together the series of static force networks that exist at each strain step. We then extract the inherent mesoscale structure from the system by using a generalization of community detection methods to multilayer networks, and we define quantitative measures to characterize the changes in this structure throughout the compression process. We separately consider the network of normal and tangential forces, and find that they display a different progression throughout compression. To test the sensitivity of the network model to particle properties, we examine whether the method can distinguish a subsystem of low-friction particles within a bath of higher-friction particles. We find that this can be achieved by considering the network of tangential forces, and that the community structure is better able to separate the subsystem than a purely local measure of interparticle forces alone. 
    The results discussed throughout this study suggest that these network science techniques may provide a direct way to compare and classify data from systems under different external conditions or with different physical makeup.
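
The construction described in this abstract (particles as nodes, interparticle forces as weighted edges, one layer per strain step, layers linked together) can be pictured as a single block matrix. The Python sketch below is an assumption-laden illustration rather than the authors' code: in particular, linking each particle to its own copy in the next strain step with a uniform weight `omega` is a common convention for multilayer community detection and is assumed here, as are the function name and the input format.

```python
def supra_adjacency(layers, n, omega=1.0):
    """Block matrix for a multilayer force network.

    layers: list of {(i, j): force} dicts, one per strain step (hypothetical format)
    n:      number of particles, indexed 0..n-1 in every layer
    omega:  assumed coupling between a particle and its copy in the next layer
    """
    steps = len(layers)
    size = n * steps
    A = [[0.0] * size for _ in range(size)]
    for t, forces in enumerate(layers):
        off = t * n
        for (i, j), f in forces.items():       # intralayer edges: contact forces
            A[off + i][off + j] = f
            A[off + j][off + i] = f
        if t + 1 < steps:                      # interlayer edges: same particle, next step
            for i in range(n):
                A[off + i][off + n + i] = omega
                A[off + n + i][off + i] = omega
    return A


# Hypothetical data: three particles observed at two strain steps.
A = supra_adjacency([{(0, 1): 2.0}, {(1, 2): 3.0}], n=3)
```

A community detection method generalized to multilayer networks would then operate on a matrix of this shape, with the normal-force and tangential-force networks built separately.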

    Information Compression, Intelligence, Computing, and Mathematics

    This paper presents evidence for the idea that much of artificial intelligence, human perception and cognition, mainstream computing, and mathematics, may be understood as compression of information via the matching and unification of patterns. This is the basis for the "SP theory of intelligence", outlined in the paper and fully described elsewhere. Relevant evidence may be seen: in empirical support for the SP theory; in some advantages of information compression (IC) in terms of biology and engineering; in our use of shorthands and ordinary words in language; in how we merge successive views of any one thing; in visual recognition; in binocular vision; in visual adaptation; in how we learn lexical and grammatical structures in language; and in perceptual constancies. IC via the matching and unification of patterns may be seen in both computing and mathematics: in IC via equations; in the matching and unification of names; in the reduction or removal of redundancy from unary numbers; in the workings of Post's Canonical System and the transition function in the Universal Turing Machine; in the way computers retrieve information from memory; in systems like Prolog; and in the query-by-example technique for information retrieval. The chunking-with-codes technique for IC may be seen in the use of named functions to avoid repetition of computer code. The schema-plus-correction technique may be seen in functions with parameters and in the use of classes in object-oriented programming. And the run-length coding technique may be seen in multiplication, in division, and in several other devices in mathematics and computing. The SP theory resolves the apparent paradox of "decompression by compression". And computing and cognition as IC is compatible with the uses of redundancy in such things as backup copies to safeguard data and understanding speech in a noisy environment.
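
As a small illustration of the claim that run-length coding may be seen in multiplication, the Python sketch below encodes a repeated addition as a (value, count) pair. It is a reading of the abstract's remark under assumptions made here, not code from the SP theory; the function names are hypothetical.

```python
from itertools import groupby


def run_length_encode(seq):
    """Collapse each run of equal items into a (value, count) pair."""
    return [(value, len(list(group))) for value, group in groupby(seq)]


def run_length_decode(pairs):
    """Expand (value, count) pairs back into the original sequence."""
    return [value for value, count in pairs for _ in range(count)]


# The repeated addition 3 + 3 + 3 + 3 compresses to a single pair,
# which is the information carried by the product 3 * 4.
terms = [3, 3, 3, 3]
encoded = run_length_encode(terms)
```

Decoding the pair and summing the terms recovers the same value as the product, which is the sense in which the shorter notation loses nothing.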