88 research outputs found

    A simple algorithm for computing the Lempel-Ziv factorization

    Get PDF
    We give a space-efficient simple algorithm for computing the Lempel?Ziv factorization ofa string. For a string of length n over an integer alphabet, it runs in O(n) time independentlyof alphabet size and uses o(n) additional space

    XBWT Tricks

    Get PDF
    The eXtended Burrows-Wheeler Transform (XBWT) is a data transformation introduced in [Ferragina et al., FOCS 2005] to com- pactly represent a labeled tree and simultaneously support navigation and path-search operations over its label structure. A natural application of the XBWT is to store a dictionary of strings. A recent extensive experimental study [Martı́nez-Prieto et al., Informa- tion Systems, 2016] shows that, among the available string dictionary implementations, the XBWT is attractive because of its good tradeoff between small space usage, speed, and support for substring searches. In this paper we further investigate the use of the XBWT for storing a string dictionary. Our first contribution is to show how to add suffix links (aka failure links) to a XBWT string dictionary. For a XBWT dictionary with n internal nodes our suffix links can be traversed in constant time and only take 2n + o(n) bits of space. Our second contribution are practical construction algorithms for the XBWT, including the additional data structure supporting the traver- sal of suffix links. Our algorithms build on the many well engineered algorithms for Suffix Array and BWT construction and offer different tradeoffs between running time and working space

    Computing regularities in strings

    Get PDF
    Regularities in strings model many phenomena and thus form the subject of extensive mathematical studies . Perhaps the most conspicuous regularities in strings are those that manifest themselves in the form of repeated subpatterns. In this paper, we study several forms of regularities of strings, that is, repeats, multirepeats, repetitions and runs. We present their similarities and differences by discussing their forms and properties and we explore the existing computation algorithms. We also discuss several data structures useful for computing regularities

    External memory BWT and LCP computation for sequence collections with applications

    Get PDF
    We propose an external memory algorithm for the computation of the BWT and LCP array for a collection of sequences. Our algorithm takes the amount of available memory as an input parameter, and tries to make the best use of it by splitting the input collection into subcollections sufficiently small that it can compute their BWT in RAM using an optimal linear time algorithm. Next, it merges the partial BWTs in external memory and in the process it also computes the LCP values. We show that our algorithm performs O(n maxlcp) sequential I/Os, where n is the total length of the collection and maxlcp is the maximum LCP value. The experimental results show that our algorithm outperforms the current best algorithm for collections of sequences with different lengths and when the average LCP of the collection is relatively small compared to the length of the sequences. In the second part of the paper, we show that our algorithm can be modified to output two additional arrays that, combined with the BWT and LCP arrays, provide simple, scan based, external memory algorithms for three well known problems in bioinformatics: the computation of the all pairs suffix-prefix overlaps, the computation of maximal repeats, and the construction of succinct de Bruijn graphs

    Faster External Memory LCP Array Construction

    Get PDF
    The suffix array, perhaps the most important data structure in modern string processing, needs to be augmented with the longest-common-prefix (LCP) array in many applications. Their construction is often a major bottleneck especially when the data is too big for internal memory. We describe two new algorithms for computing the LCP array from the suffix array in external memory. Experiments demonstrate that the new algorithms are about a factor of two faster than the fastest previous algorithm

    Lightweight BWT and LCP merging via the gap algorithm

    Get PDF
    Recently, Holt and McMillan [Bioinformatics 2014, ACM-BCB 2014] have proposed a simple and elegant algorithm to merge the Burrows-Wheeler transforms of a collection of strings. In this paper we show that their algorithm can be improved so that, in addition to the BWTs, it also merges the Longest Common Prefix (LCP) arrays. Because of its small memory footprint this new algorithm can be used for the final merge of BWT and LCP arrays computed by a faster but memory intensive construction algorithm

    mkESA: enhanced suffix array construction tool

    Get PDF
    Summary: We introduce the tool mkESA, an open source program for constructing enhanced suffix arrays (ESAs), striving for low memory consumption, yet high practical speed. mkESA is a user-friendly program written in portable C99, based on a parallelized version of the Deep-Shallow suffix array construction algorithm, which is known for its high speed and small memory usage. The tool handles large FASTA files with multiple sequences, and computes suffix arrays and various additional tables, such as the LCP table (longest common prefix) or the inverse suffix array, from given sequence data
    corecore