
    Layered Label Propagation: A MultiResolution Coordinate-Free Ordering for Compressing Social Networks

    We continue the line of research on graph compression started with WebGraph, but we move our focus to the compression of social networks in a proper sense (e.g., LiveJournal): the approaches that have long been used to compress web graphs rely on a specific ordering of the nodes (lexicographical URL ordering) whose extension to general social networks is not trivial. In this paper, we propose a solution that mixes clusterings and orders, and devise a new algorithm, called Layered Label Propagation, that builds on previous work on scalable clustering and can be used to reorder very large graphs (billions of nodes). Our implementation uses overdecomposition to perform aggressively on multi-core architectures, making it possible to reorder graphs of more than 600 million nodes in a few hours. Experiments performed on a wide array of web graphs and social networks show that combining the order produced by the proposed algorithm with the WebGraph compression framework provides a major increase in compression with respect to all currently known techniques, both on web graphs and on social networks. These improvements make it possible to analyse significantly larger graphs in main memory.

    Repetition- and Linearity-Aware Rank/Select Dictionaries

    We revisit the fundamental problem of compressing an integer dictionary that supports efficient rank and select operations by exploiting two kinds of regularities arising in real data: repetitiveness and approximate linearity. Our first contribution is a Lempel-Ziv parsing properly enriched to also capture approximate linearity in the data and still be compressed to the kth order entropy. Our second contribution is a variant of the block tree structure whose space complexity takes advantage of both repetitiveness and approximate linearity, and which is also highly competitive in time. Our third and final contribution is an implementation and experimental evaluation of this last data structure, which achieves new space-time trade-offs compared to known data structures that exploit only one of the two regularities.

    Advanced rank/select data structures: succinctness, bounds and applications.

    The thesis explores new theoretical results and applications of rank and select data structures. Given a string, select(c, i) gives the position of the ith occurrence of character c in the string, while rank(c, p) counts the number of instances of character c to the left of position p. Succinct rank/select data structures are space-efficient versions of standard ones, designed to keep data compressed and at the same time answer queries rapidly. They are the basis of more involved compressed and succinct data structures, which in turn are motivated by today's need to analyze and operate on massive data sets quickly, where space efficiency is crucial. The thesis builds on the state of the art left by years of study and produces results on multiple fronts. Analyzing binary succinct data structures and their link with predecessor data structures, we integrate data structures for the latter problem into the former. The result is a data structure which outperforms the one of Patrascu 08 in a range of cases which were not studied before, namely when the lower bound for predecessor does not apply and constant-time rank is not feasible. Further, we propose the first lower bound for succinct data structures on generic strings, achieving a linear trade-off between time for rank/select execution and additional space (with respect to the plain data) needed by the data structure. The proposal addresses systematic data structures, namely those that only access the underlying string through ADT calls and do not encode it directly. Also, we propose a matching upper bound that proves the tightness of our lower bound. Finally, we apply rank/select data structures to the substring counting problem, where we seek to preprocess a text and generate a summary data structure which is stored in lieu of the text and answers substring counting queries with additive error. The results include a provably optimal data structure with generic additive error and a data structure that errs only on infrequent patterns with significant practical space gains.
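
    The rank and select operations defined in this abstract can be illustrated with a minimal, non-succinct baseline. The sketch below is not one of the thesis's data structures: the class name NaiveRankSelect is hypothetical, positions are assumed to be 0-based, i is assumed to be 1-based, and "to the left of position p" is read as strictly before p.

```python
from bisect import bisect_left


class NaiveRankSelect:
    """Illustrative baseline: one sorted list of positions per character.

    Uses space linear in the text length, so it is not succinct; it only
    demonstrates the query semantics.
    """

    def __init__(self, text):
        self.positions = {}
        for pos, ch in enumerate(text):
            self.positions.setdefault(ch, []).append(pos)

    def rank(self, c, p):
        # Number of occurrences of c at positions strictly smaller than p.
        return bisect_left(self.positions.get(c, []), p)

    def select(self, c, i):
        # Position of the i-th occurrence of c (i counted from 1).
        occurrences = self.positions.get(c, [])
        if not 1 <= i <= len(occurrences):
            raise IndexError("character does not occur that many times")
        return occurrences[i - 1]
```

    Under these conventions the two operations are inverse in the sense that rank(c, select(c, i)) equals i - 1 for every valid i.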

    Compressed weighted de Bruijn graphs

    We propose a new compressed representation for weighted de Bruijn graphs, which is based on the idea of delta-encoding the variations of k-mer abundances on a spanning branching of the graph. Our new data structure is likely to be of practical value: to give an idea, when combined with the compressed BOSS de Bruijn graph representation, it encodes the weighted de Bruijn graph of a 16x-covered DNA read-set (60M distinct k-mers, k = 28) within 4.15 bits per distinct k-mer and can answer abundance queries in about 60 microseconds on a standard machine. In contrast, state-of-the-art tools declare a space usage of at least 30 bits per distinct k-mer for the same task, which is confirmed by our experiments. As a by-product of our new data structure, we exhibit efficient compressed data structures for answering partial sums on edge-weighted trees, which might be of independent interest.
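
    The idea of delta-encoding abundances along a spanning branching can be sketched on a plain parent-pointer forest. This is only an uncompressed illustration under stated assumptions, not the data structure of the paper: the function names, the dictionary layout, and the example k-mers in the usage note are all hypothetical.

```python
def delta_encode(parent, abundance):
    """Store each node's abundance as a difference from its parent's.

    parent[v] is the parent of v in the branching, or None for a root.
    A root keeps its abundance as is (difference from an implicit 0).
    """
    delta = {}
    for v, p in parent.items():
        base = abundance[p] if p is not None else 0
        delta[v] = abundance[v] - base
    return delta


def query_abundance(parent, delta, v):
    """Recover the abundance of v as the sum of deltas on the path to its root."""
    total = 0
    while v is not None:
        total += delta[v]
        v = parent[v]
    return total
```

    For a made-up branching with root "ACG" (abundance 16), children "CGT" (16) and "CGA" (2), and "GTA" (15) below "CGT", the stored deltas are 16, 0, -14 and -1. Neighbouring k-mers tend to have similar abundances, so most deltas are small, and a query becomes a partial sum along a root path.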

    Trie-Compressed Adaptive Set Intersection

    We introduce space- and time-efficient algorithms and data structures for the offline set intersection problem. We show that a sorted integer set S ⊆ [0..u) of n elements can be represented using compressed space while supporting k-way intersections in adaptive O(kδ lg(u/δ)) time, δ being the alternation measure introduced by Barbay and Kenyon. Our experimental results suggest that our approaches are competitive in practice, outperforming the most efficient alternatives (Partitioned Elias-Fano indexes, Roaring Bitmaps, and Recursive Universe Partitioning (RUP)) in several scenarios, offering in general relevant space-time trade-offs.
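
    To give a feel for what "adaptive" k-way intersection means, the sketch below intersects plain sorted lists with doubling (galloping) search, so that long runs of non-matching elements are skipped in logarithmic rather than linear time. It is a classic textbook-style technique swapped in for illustration, not the trie-based compressed structure of the paper; the function names are hypothetical and each input list is assumed to be strictly increasing.

```python
from bisect import bisect_left


def gallop(arr, target, lo):
    """Smallest index i >= lo with arr[i] >= target (len(arr) if none).

    Probes at exponentially growing distances, then binary-searches the
    last gap, so the cost depends on how far the answer is from lo.
    """
    step = 1
    hi = lo
    while hi < len(arr) and arr[hi] < target:
        lo = hi + 1
        hi += step
        step *= 2
    return bisect_left(arr, target, lo, min(hi, len(arr)))


def intersect(lists):
    """Intersection of k strictly increasing integer lists."""
    if not lists or any(len(lst) == 0 for lst in lists):
        return []
    cursors = [0] * len(lists)
    result = []
    candidate = max(lst[0] for lst in lists)
    while True:
        matched = True
        for i, lst in enumerate(lists):
            pos = gallop(lst, candidate, cursors[i])
            cursors[i] = pos
            if pos == len(lst):
                return result  # one list is exhausted: nothing more can match
            if lst[pos] != candidate:
                candidate = lst[pos]  # jump to the next possible value
                matched = False
                break
        if matched:
            result.append(candidate)
            cursors[0] += 1  # move past the match in the first list
            if cursors[0] == len(lists[0]):
                return result
            candidate = lists[0][cursors[0]]
```

    For example, intersect([[1, 3, 5, 7, 9], [3, 4, 5, 9, 10], [0, 3, 9, 11]]) returns [3, 9].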

    Storage and Retrieval of Individual Genomes and other Repetitive Sequence Collections

    Computing Reviews (1998) Categories and Subject Descriptors: E.4 Coding and Information Theory (data compaction and compression); F.2.2 Analysis of Algorithms and Problem Complexity: Nonnumerical Algorithms and Problems (pattern matching, sorting and searching).
    In the near future, biomolecular engineering techniques will reach a state where the sequencing of individual genomes becomes feasible. This progress will create huge expectations for the data analysis domain to reveal new knowledge on the “secrets of life”. Quite rudimentary reasons may inhibit such breakthroughs; it may not be feasible to store all the data in a form that would enable anything but the most basic data analysis routines to be executed. This paper is devoted to studying ways to store massive sets of complete individual genomes in a space-efficient manner so that retrieval of the content, as well as queries on the content of the sequences, can be provided time-efficiently. We show that although the state-of-the-art full-text self-indexes do not yet provide satisfactory space bounds for this specific task, after carefully engineering those structures it is possible to achieve very attractive results; the new structures are fully able to exploit the fact that the individual genomes are highly similar. We confirm the theoretical findings by experiments on large DNA sequences, and also on version control data, which forms another application domain for our methods.