68 research outputs found

    Low Space External Memory Construction of the Succinct Permuted Longest Common Prefix Array

    Full text link
    The longest common prefix (LCP) array is a versatile auxiliary data structure in indexed string matching. It can be used to speed up searching using the suffix array (SA) and provides an implicit representation of the topology of an underlying suffix tree. The LCP array of a string of length nn can be represented as an array of length nn words, or, in the presence of the SA, as a bit vector of 2n2n bits plus asymptotically negligible support data structures. External memory construction algorithms for the LCP array have been proposed, but those proposed so far have a space requirement of O(n)O(n) words (i.e. O(nlogn)O(n \log n) bits) in external memory. This space requirement is in some practical cases prohibitively expensive. We present an external memory algorithm for constructing the 2n2n bit version of the LCP array which uses O(nlogσ)O(n \log \sigma) bits of additional space in external memory when given a (compressed) BWT with alphabet size σ\sigma and a sampled inverse suffix array at sampling rate O(logn)O(\log n). This is often a significant space gain in practice where σ\sigma is usually much smaller than nn or even constant. We also consider the case of computing succinct LCP arrays for circular strings

    Engineering External Memory LCP Array Construction: Parallel, In-Place and Large Alphabet

    Get PDF
    Peer reviewe

    Engineering External Memory LCP Array Construction: Parallel, In-Place and Large Alphabet

    Get PDF
    The suffix array augmented with the LCP array is perhaps the most important data structure in modern string processing. There has been a lot of recent research activity on constructing these arrays in external memory. In this paper, we engineer the two fastest LCP array construction algorithms (ESA 2016) and improve them in three ways. First, we speed up the algorithms by up to a factor of two through parallelism. Just 8 threads is sufficient for making the algorithms essentially I/O bound. Second, we reduce the disk space usage of the algorithms making them in-place: The input (text and suffix array) is treated as read-only and the working disk space never exceeds the size of the final output (the LCP array). Third, we add support for large alphabets. All previous implementations assume the byte alphabet

    Lightweight Lempel-Ziv Parsing

    Full text link
    We introduce a new approach to LZ77 factorization that uses O(n/d) words of working space and O(dn) time for any d >= 1 (for polylogarithmic alphabet sizes). We also describe carefully engineered implementations of alternative approaches to lightweight LZ77 factorization. Extensive experiments show that the new algorithm is superior in most cases, particularly at the lowest memory levels and for highly repetitive data. As a part of the algorithm, we describe new methods for computing matching statistics which may be of independent interest.Comment: 12 page

    Faster External Memory LCP Array Construction

    Get PDF
    The suffix array, perhaps the most important data structure in modern string processing, needs to be augmented with the longest-common-prefix (LCP) array in many applications. Their construction is often a major bottleneck especially when the data is too big for internal memory. We describe two new algorithms for computing the LCP array from the suffix array in external memory. Experiments demonstrate that the new algorithms are about a factor of two faster than the fastest previous algorithm

    Compressed Full-Text Indexes for Highly Repetitive Collections

    Get PDF
    This thesis studies problems related to compressed full-text indexes. A full-text index is a data structure for indexing textual (sequence) data, so that the occurrences of any query string in the data can be found efficiently. While most full-text indexes require much more space than the sequences they index, recent compressed indexes have overcome this limitation. These compressed indexes combine a compressed representation of the index with some extra information that allows decompressing any part of the data efficiently. This way, they provide similar functionality as the uncompressed indexes, while using only slightly more space than the compressed data. The efficiency of data compression is usually measured in terms of entropy. While entropy-based estimates predict the compressed size of most texts accurately, they fail with highly repetitive collections of texts. Examples of such collections include different versions of a document and the genomes of a number of individuals from the same population. While the entropy of a highly repetitive collection is usually similar to that of a text of the same kind, the collection can often be compressed much better than the entropy-based estimate. Most compressed full-text indexes are based on the Burrows-Wheeler transform (BWT). Originally intended for data compression, the BWT has deep connections with full-text indexes such as the suffix tree and the suffix array. With some additional information, these indexes can be simulated with the Burrows-Wheeler transform. The first contribution of this thesis is the first BWT-based index that can compress highly repetitive collections efficiently. Compressed indexes allow us to handle much larger data sets than the corresponding uncompressed indexes. To take full advantage of this, we need algorithms for constructing the compressed index directly, instead of first constructing an uncompressed index and then compressing it. The second contribution of this thesis is an algorithm for merging the BWT-based indexes of two text collections. By using this algorithm, we can derive better space-efficient construction algorithms for BWT-based indexes. The basic BWT-based indexes provide similar functionality as the suffix array. With some additional structures, the functionality can be extended to that of the suffix tree. One of the structures is an array storing the lengths of the longest common prefixes of lexicographically adjacent suffixes of the text. The third contribution of this thesis is a space-efficient algorithm for constructing this array, and a new compressed representation of the array. In the case of individual genomes, the highly repetitive collection can be considered a sample from a larger collection. This collection consists of a reference sequence and a set of possible differences from the reference, so that each sequence contains a subset of the differences. The fourth contribution of this thesis is a BWT-based index that extrapolates the larger collection from the sample and indexes it.Tässä väitöskirjassa käsitellään tiivistettyjä kokotekstihakemistoja tekstimuotoisille aineistoille. Kokotekstihakemistot ovat tietorakenteita, jotka mahdollistavat mielivaltaisten hahmojen esiintymien löytämisen tekstistä tehokkaasti. Perinteiset kokotekstihakemistot, kuten loppuosapuut ja -taulukot, vievät moninkertaisesti tilaa itse aineistoon nähden. Viime aikoina on kuitenkin kehitetty tiivistettyjä hakemistorakenteita, jotka tarjoavat vastaavan toiminnallisuuden alkuperäistä tekstiä pienemmässä tilassa. Tämä on mahdollistanut aikaisempaa suurempien aineistojen käsittelyn. Tekstin tiivistyvyyttä mitataan yleensä suhteessa sen entropiaan. Vaikka entropiaan perustuvat arviot ovat useimmilla aineistoilla varsin tarkkoja, aliarvioivat ne vahvasti toisteisien aineistojen tiivistyvyyttä. Esimerkkejä tällaisista aineistoista ovat kokoelmat saman populaation yksilöiden genomeita tai saman dokumentin eri versioita. Siinä missä tällaisen kokoelman entropia suhteessa aineiston kokoon on vastaava kuin yksittäisellä samaa tyyppiä olevalla tekstillä, tiivistyy kokoelma yleensä huomattavasti paremmin kuin entropian perusteella voisi odottaa. Useimmat tiivistetyt kokotekstihakemistot perustuvat Burrows-Wheeler-muunnokseen (BWT), joka kehitettiin alun perin tekstimuotoisten aineistojen tiivistämiseen. Pian kuitenkin havaittiin, että koska BWT muistuttaa rakenteeltaan loppuosapuuta ja -taulukkoa, voidaan sitä käyttää niissä tehtävien hakujen simulointiin. Tässä väitöskirjassa esitetään ensimmäinen BWT-pohjainen kokotekstihakemisto, joka pystyy tiivistämään vahvasti toisteiset aineistot tehokkaasti. Tiivistettyjen tietorakenteiden käyttö mahdollistaa suurempien aineistoiden käsittelemisen kuin tavallisia tietorakenteita käytettäessä. Tämä etu kuitenkin menetetään, jos tiivistetty tietorakenne muodostetaan luomalla ensin vastaava tavallinen tietorakenne ja tiivistämällä se. Tässä väitöskirjassa esitetään aikaisempaa vähemmän muistia käyttäviä algoritmeja BWT-pohjaisten kokotekstihakemistojen muodostamiseen. Kokoelma yksilöiden genomeita voidaan käsittää otokseksi suuremmasta kokoelmasta, joka koostuu populaation kaikkien yksilöiden sekä niiden hypoteettisten jälkeläisten genomeista. Tällainen kokoelma voidaan esittää äärellisenä automaattina, joka muodostuu referenssigenomista ja yksilöiden genomeissa esiintyvistä poikkeamista referenssistä. Tässä väitöskirjassa esitetään BWT-pohjaisten kokotekstihakemistojen yleistys, joka mahdollistaa tällaisten automaattien indeksoinnin

    Fully-Functional Suffix Trees and Optimal Text Searching in BWT-runs Bounded Space

    Get PDF
    Indexing highly repetitive texts - such as genomic databases, software repositories and versioned text collections - has become an important problem since the turn of the millennium. A relevant compressibility measure for repetitive texts is r, the number of runs in their Burrows-Wheeler Transforms (BWTs). One of the earliest indexes for repetitive collections, the Run-Length FM-index, used O(r) space and was able to efficiently count the number of occurrences of a pattern of length m in the text (in loglogarithmic time per pattern symbol, with current techniques). However, it was unable to locate the positions of those occurrences efficiently within a space bounded in terms of r. In this paper we close this long-standing problem, showing how to extend the Run-Length FM-index so that it can locate the occ occurrences efficiently within O(r) space (in loglogarithmic time each), and reaching optimal time, O(m + occ), within O(r log log w ({\sigma} + n/r)) space, for a text of length n over an alphabet of size {\sigma} on a RAM machine with words of w = {\Omega}(log n) bits. Within that space, our index can also count in optimal time, O(m). Multiplying the space by O(w/ log {\sigma}), we support count and locate in O(dm log({\sigma})/we) and O(dm log({\sigma})/we + occ) time, which is optimal in the packed setting and had not been obtained before in compressed space. We also describe a structure using O(r log(n/r)) space that replaces the text and extracts any text substring of length ` in almost-optimal time O(log(n/r) + ` log({\sigma})/w). Within that space, we similarly provide direct access to suffix array, inverse suffix array, and longest common prefix array cells, and extend these capabilities to full suffix tree functionality, typically in O(log(n/r)) time per operation.Comment: submitted version; optimal count and locate in smaller space: O(r log log_w(n/r + sigma)

    Parallel text index construction

    Get PDF
    In dieser Dissertation betrachten wir die parallele Konstruktion von Text-Indizes. Text-Indizes stellen Zusatzinformationen über Texte bereit, die Anfragen hinsichtlich dieser Texte beschleunigen können. Ein Beispiel hierfür sind Volltext-Indizes, welche für eine effiziente Phrasensuche genutzt werden, also etwa für die Frage, ob eine Phrase in einem Text vorkommt oder nicht. Diese Dissertation befasst sich hauptsächlich, aber nicht ausschließlich mit der parallelen Konstruktion von Text-Indizes im geteilten und verteilten Speicher. Im ersten Teil der Dissertation betrachten wir Wavelet-Trees. Dabei handelt es sich um kompakte Indizes, welche Rank- und Select-Anfragen von binären Alphabeten auf Alphabete beliebiger Größe verallgemeinern. Im zweiten Teil der Dissertation betrachten wir das Suffix-Array, den am besten erforschten Text-Index überhaupt. Das Suffix-Array enthält die Startpositionen aller lexikografisch sortierten Suffixe eines Textes, d.h., wir möchten alle Suffixe eines Textes sortieren. Oft wird das Suffix-Array um das Longest-Common-Prefix-Array (LCP-Array) erweitert. Das LCP-Array enthält die Länge der längsten gemeinsamen Präfixe zweier lexikografisch konsekutiven Suffixe. Abschließend nutzen wir verteilte Suffix- und LCP-Arrays, um den Distributed-Patricia-Trie zu konstruieren. Dieser erlaubt es uns, verschiedene Phrase-Anfragen effizienter zu beantworten, als wenn wir nur das Suffix-Array nutzen.The focus of this dissertation is the parallel construction of text indices. Text indices provide additional information about a text that allow to answer queries faster. Full-text indices for example are used to efficiently answer phrase queries, i.e., if and where a phrase occurs in a text. The research in this dissertation is focused on but not limited to parallel construction algorithms for text indices in both shared and distributed memory. In the first part, we look at wavelet trees: a compact index that generalizes rank and select queries from binary alphabets to alphabets of arbitrary size. In the second part of this dissertation, we consider the suffix array---one of the most researched text indices.The suffix array of a text contains the starting positions of the text's lexicographically sorted suffixes, i.e., we want to sort all its suffixes. Finally, we use the distributed suffix arrays (and LCP arrays) to compute distributed Patricia tries. This allows us to answer different phrase queries more efficiently than using only the suffix array

    Scalable succinct indexing for large text collections

    Get PDF
    Self-indexes save space by emulating operations of traditional data structures using basic operations on bitvectors. Succinct text indexes provide full-text search functionality which is traditionally provided by suffix trees and suffix arrays for a given text, while using space equivalent to the compressed representation of the text. Succinct text indexes can therefore provide full-text search functionality over inputs much larger than what is viable using traditional uncompressed suffix-based data structures. Fields such as Information Retrieval involve the processing of massive text collections. However, the in-memory space requirements of succinct text indexes during construction have hampered their adoption for large text collections. One promising approach to support larger data sets is to avoid constructing the full suffix array by using alternative indexing representations. This thesis focuses on several aspects related to the scalability of text indexes to larger data sets. We identify practical improvements in the core building blocks of all succinct text indexing algorithms, and subsequently improve the index performance on large data sets. We evaluate our findings using several standard text collections and demonstrate: (1) the practical applications of our improved indexing techniques; and (2) that succinct text indexes are a practical alternative to inverted indexes for a variety of top-k ranked document retrieval problems
    corecore