306 research outputs found

    External memory BWT and LCP computation for sequence collections with applications

    Sequencing technologies produce larger and larger collections of biosequences that have to be stored in compressed indices supporting fast search operations. Many compressed indices are based on the Burrows-Wheeler Transform (BWT) and the longest common prefix (LCP) array. Because of the sheer size of the input, it is important to build these data structures in external memory, making the best possible use of the available RAM.

    Results: We propose a space-efficient algorithm to compute the BWT and LCP array for a collection of sequences in the external or semi-external memory setting. Our algorithm splits the input collection into subcollections small enough that their BWTs can be computed in RAM using an optimal linear-time algorithm. Next, it merges the partial BWTs in external or semi-external memory, and in the process it also computes the LCP values. Our algorithm can be modified to output two additional arrays that, combined with the BWT and LCP array, provide simple, scan-based, external memory algorithms for three well known problems in bioinformatics: the computation of maximal repeats, the all-pairs suffix-prefix overlaps, and the construction of succinct de Bruijn graphs.

    Conclusions: We prove that our algorithm performs O(n maxlcp) sequential I/Os, where n is the total length of the collection and maxlcp is the maximum LCP value. The experimental results show that our algorithm is only slightly slower than the state of the art for short sequences, but it is up to 40 times faster for longer sequences or when the available RAM is at least equal to the size of the input.

    Funding: Conselho Nacional de Desenvolvimento Científico e Tecnológico (CNPq); Coordenação de Aperfeiçoamento de Pessoal de Nível Superior (CAPES); University of Eastern Piedmont project "Behavioural Types for Dependability Analysis with Bayesian Networks"; São Paulo Research Foundation (FAPESP) grants 2017/09105-0 and 2018/21509-2; PRIN grant 201534HNXC; INdAM-GNCS Project 2019 "Innovative methods for the solution of medical and biological big data".
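    To make the two structures concrete, below is a minimal in-memory sketch that builds the BWT and LCP array of a small string collection by explicit suffix sorting. It only illustrates what the algorithm computes, not the paper's external-memory construction; the '$' end-marker, tie-breaking convention, and function names are choices made for the example.

```python
# Minimal in-memory illustration of the BWT and LCP array of a string
# collection (NOT the paper's external-memory algorithm; it is quadratic
# and only shows the structures being computed).

def bwt_and_lcp(collection):
    # Append a '$' end-marker to each sequence; ties between equal suffixes
    # of different sequences are broken by sequence id, a common convention.
    terminated = [seq + "$" for seq in collection]

    suffixes = []
    for sid, s in enumerate(terminated):
        for i in range(len(s)):
            suffixes.append((s[i:], sid, i))
    suffixes.sort(key=lambda t: (t[0], t[1]))

    # BWT: the character cyclically preceding each sorted suffix.
    bwt = "".join(terminated[sid][i - 1] for _, sid, i in suffixes)

    # LCP[k]: longest common prefix of the k-th and (k-1)-th sorted suffixes.
    def lcp_len(a, b):
        n = 0
        while n < len(a) and n < len(b) and a[n] == b[n]:
            n += 1
        return n

    lcp = [0] + [lcp_len(suffixes[k - 1][0], suffixes[k][0])
                 for k in range(1, len(suffixes))]
    return bwt, lcp

if __name__ == "__main__":
    print(bwt_and_lcp(["GATTACA", "GATTA"]))
```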

    External memory BWT and LCP computation for sequence collections with applications

    We propose an external memory algorithm for the computation of the BWT and LCP array for a collection of sequences. Our algorithm takes the amount of available memory as an input parameter, and tries to make the best use of it by splitting the input collection into subcollections sufficiently small that it can compute their BWT in RAM using an optimal linear time algorithm. Next, it merges the partial BWTs in external memory and in the process it also computes the LCP values. We show that our algorithm performs O(n maxlcp) sequential I/Os, where n is the total length of the collection and maxlcp is the maximum LCP value. The experimental results show that our algorithm outperforms the current best algorithm for collections of sequences with different lengths and when the average LCP of the collection is relatively small compared to the length of the sequences. In the second part of the paper, we show that our algorithm can be modified to output two additional arrays that, combined with the BWT and LCP arrays, provide simple, scan-based, external memory algorithms for three well known problems in bioinformatics: the computation of the all-pairs suffix-prefix overlaps, the computation of maximal repeats, and the construction of succinct de Bruijn graphs.
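    As a reference point for one of the three applications mentioned above, the sketch below solves the all-pairs suffix-prefix overlap problem by brute force, comparing every suffix of one sequence against every prefix of another. It is quadratic and only meant to make the problem statement explicit; the sample sequences and function name are invented for the example.

```python
# Naive baseline for the all-pairs suffix-prefix overlap problem (the paper
# solves it with scan-based external memory algorithms over BWT/LCP-derived
# arrays; this brute-force version only defines what is being computed).

def all_pairs_suffix_prefix(seqs):
    overlaps = {}
    for i, a in enumerate(seqs):
        for j, b in enumerate(seqs):
            if i == j:
                continue
            best = 0
            # Longest suffix of a that equals a prefix of b.
            for k in range(1, min(len(a), len(b)) + 1):
                if a[-k:] == b[:k]:
                    best = k
            overlaps[(i, j)] = best
    return overlaps

if __name__ == "__main__":
    # Overlap of sequence 0 onto sequence 1 is "GTT" (length 3), etc.
    print(all_pairs_suffix_prefix(["ACGTT", "GTTAC", "TACGG"]))
```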

    Compression with the tudocomp Framework

    We present a framework facilitating the implementation and comparison of text compression algorithms. We evaluate its features through a case study on two novel compression algorithms based on Lempel-Ziv compression schemes that perform well on highly repetitive texts.
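    As a rough idea of the kind of Lempel-Ziv scheme such compressors build on, the sketch below performs a toy greedy LZ77-style factorization. It is not the tudocomp API nor either of the two new algorithms; the function name and the restriction to non-overlapping matches are simplifications made for the example.

```python
# Toy greedy LZ77-style factorization into (offset, length, next_char) triples.
# Matches are restricted to the already-processed prefix (no self-overlap),
# which keeps the sketch simple at the cost of slightly worse compression.

def lz77_factorize(text):
    factors = []
    i = 0
    while i < len(text):
        best_len, best_off = 0, 0
        for j in range(i):
            length = 0
            while (i + length < len(text)
                   and j + length < i
                   and text[j + length] == text[i + length]):
                length += 1
            if length > best_len:
                best_len, best_off = length, i - j
        nxt = text[i + best_len] if i + best_len < len(text) else ""
        factors.append((best_off, best_len, nxt))
        i += best_len + 1
    return factors

if __name__ == "__main__":
    # A highly repetitive input collapses into a handful of factors.
    print(lz77_factorize("abababababab"))
```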

    Algorithms and data structures for grammar-compressed strings


    Grammar Boosting: A New Technique for Proving Lower Bounds for Computation over Compressed Data

    Grammar compression is a general compression framework in which a string T of length N is represented as a context-free grammar of size n whose language contains only T. In this paper, we focus on studying the limitations of algorithms and data structures operating on strings in grammar-compressed form. Previous work focused on proving lower bounds for grammars constructed using algorithms that achieve the approximation ratio ρ = O(polylog N). Unfortunately, for the majority of grammar compressors, ρ is either unknown or satisfies ρ = ω(polylog N). In their seminal paper, Charikar et al. [IEEE Trans. Inf. Theory 2005] studied seven popular grammar compression algorithms: RePair, Greedy, LongestMatch, Sequential, Bisection, LZ78, and α-Balanced. Only one of them (α-Balanced) is known to achieve ρ = O(polylog N). We develop the first technique for proving lower bounds for data structures and algorithms on grammars that is fully general and does not depend on the approximation ratio ρ of the used grammar compressor. Using this technique, we first prove that Ω(log N / log log N) time is required for random access on RePair, Greedy, LongestMatch, Sequential, and Bisection, while Ω(log log N) time is required for random access to LZ78. All these lower bounds hold within space O(n polylog N) and match the existing upper bounds. We also generalize this technique to prove several conditional lower bounds for compressed computation. For example, we prove that unless the Combinatorial k-Clique Conjecture fails, there is no combinatorial algorithm for CFG parsing on Bisection (for which it holds ρ = Θ̃(N^{1/2})) that runs in O(n^c · N^{3−ε}) time for all constants c > 0 and ε > 0. Previously, this was known only for c < 2ε.
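    For intuition about the random-access problem these lower bounds concern, the sketch below encodes a single string as a tiny straight-line grammar and retrieves its i-th character by descending the grammar using precomputed expansion lengths. The grammar, rule names, and functions are invented for the example and do not correspond to any of the compressors studied in the paper.

```python
# A tiny straight-line grammar whose language is the single string "ababab",
# plus random access to its i-th character by walking the parse tree using
# precomputed expansion lengths. Purely illustrative; names are made up.

RULES = {
    "A": "a",          # terminal rule
    "B": "b",
    "X": ("A", "B"),   # X -> AB  expands to "ab"
    "Y": ("X", "X"),   # Y -> XX  expands to "abab"
    "S": ("Y", "X"),   # S -> YX  expands to "ababab"
}

def expansion_length(sym, memo={}):
    # Length of the string a symbol expands to, memoized across calls.
    if sym not in memo:
        rule = RULES[sym]
        memo[sym] = 1 if isinstance(rule, str) else sum(expansion_length(s) for s in rule)
    return memo[sym]

def access(sym, i):
    """Return the i-th character (0-based) of the expansion of `sym`."""
    rule = RULES[sym]
    if isinstance(rule, str):          # terminal: expansion is one character
        return rule
    left, right = rule
    left_len = expansion_length(left)
    return access(left, i) if i < left_len else access(right, i - left_len)

if __name__ == "__main__":
    n = expansion_length("S")
    print("".join(access("S", i) for i in range(n)))   # -> "ababab"
```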

    Graph Processing in Main-Memory Column Stores

    More and more, both novel and traditional business applications leverage the advantages of a graph data model, such as its schema flexibility and its explicit representation of relationships between entities. As a consequence, companies are confronted with the challenge of storing, manipulating, and querying terabytes of graph data for enterprise-critical applications. Although these business applications operate on graph-structured data, they still require direct access to the relational data and typically rely on an RDBMS to keep a single source of truth and access.

    Existing solutions that perform graph operations on business-critical data either use a combination of SQL and application logic or employ a dedicated graph data management system. For the first approach, relying solely on SQL results in poor execution performance, caused by the functional mismatch between typical graph operations and the relational algebra. To make matters worse, graph algorithms vary widely in structure and functionality because of their often domain-specific implementations, and can therefore hardly be integrated into a database management system without custom coding. Since the majority of these enterprise-critical applications run exclusively on relational DBMSs, employing a specialized system for storing and processing graph data is typically not sensible: besides the maintenance overhead of keeping the systems in sync, combining graph and relational operations is hard to realize because it requires data transfer across system boundaries.

    Traversal operations are a basic ingredient of graph queries and algorithms and a fundamental component of any database management system that aims at storing, manipulating, and querying graph data. Well-established graph traversal algorithms are standalone implementations relying on optimized data structures. Integrating graph traversals as an operator into a database management system requires tight integration with the existing database environment and the development of new components, such as a graph topology-aware optimizer with accompanying graph statistics, graph-specific secondary index structures to speed up traversals, and an accompanying graph query language.

    In this thesis, we introduce and describe GRAPHITE, a hybrid graph-relational data management system. GRAPHITE is a performance-oriented graph data management system built into an RDBMS, allowing graph data to be processed seamlessly together with relational data in the same system. We propose a columnar storage representation for graph data to leverage the existing, mature data management and query processing infrastructure of relational database management systems. At the core of GRAPHITE we propose an execution engine based solely on set operations and graph traversals. Our design is driven by the observation that different graph topologies impose different algorithmic requirements on the design of a graph traversal operator. We derive two graph traversal implementations targeting the most common graph topologies and demonstrate how graph-specific statistics can be leveraged to select the optimal physical traversal operator. To accelerate graph traversals, we devise a set of graph-specific, updateable secondary index structures that improve the performance of vertex neighborhood expansion. Finally, we introduce a domain-specific language with an intuitive programming model to extend graph traversals with custom application logic at runtime. We use the LLVM compiler framework to generate efficient code that tightly integrates the user-specified application logic with our highly optimized built-in graph traversal operators.

    Our experimental evaluation shows that GRAPHITE can outperform native graph management systems by several orders of magnitude while providing all the features of an RDBMS, such as transaction support, backup and recovery, and security and user management. It is thus a promising alternative to specialized graph management systems, which lack many of these features and require expensive data replication and maintenance processes.
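    As a small illustration of traversing graph topology stored column-wise, the sketch below groups a target column by a source column (a rudimentary secondary index) and runs a level-synchronous breadth-first traversal over it. It only conveys the general idea; the data, names, and structure are invented and do not reflect GRAPHITE's actual operators or storage layout.

```python
# Level-synchronous BFS over a columnar edge table (one source column, one
# target column). Invented example data; not GRAPHITE's implementation.

from collections import defaultdict

# Columnar edge table: edge k goes from SRC[k] to DST[k].
SRC = [0, 0, 1, 2, 2, 3]
DST = [1, 2, 3, 3, 4, 4]

def build_adjacency(src, dst):
    """Group the target column by the source column (a CSR-like index)."""
    adj = defaultdict(list)
    for s, d in zip(src, dst):
        adj[s].append(d)
    return adj

def bfs_levels(adj, start):
    """Return each reached vertex's distance (in edges) from `start`."""
    level = {start: 0}
    frontier = [start]
    while frontier:
        next_frontier = []
        for v in frontier:                 # expand the current frontier
            for w in adj.get(v, []):
                if w not in level:
                    level[w] = level[v] + 1
                    next_frontier.append(w)
        frontier = next_frontier
    return level

if __name__ == "__main__":
    print(bfs_levels(build_adjacency(SRC, DST), start=0))
```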

    27th Annual European Symposium on Algorithms: ESA 2019, September 9-11, 2019, Munich/Garching, Germany

