
    Efficient Construction of Fundamental Index Structures for Massive Texts

    This thesis studies efficient algorithms for constructing the most fundamental data structures used as building blocks in (compressed) full-text indexes. Full-text indexes are data structures that allow efficient searching for occurrences of a query string in a (much larger) text. We are mostly interested in large-scale indexing, that is, in input instances that cannot be processed entirely in internal memory, so that a much slower external memory must be used. Specifically, we focus on three data structures: the suffix array, the LCP array, and the Lempel-Ziv (LZ77) parsing. These are routinely found as components, or used as auxiliary data structures, in the construction of many modern full-text indexes.

    The suffix array is a list of all suffixes of a text in lexicographical order. Despite its simplicity, the suffix array is a powerful tool used extensively not only in indexing but also in data compression, string combinatorics, and computational biology. The first contribution of this thesis is an improved algorithm for external-memory suffix array construction, based on constructing suffix arrays for blocks of text and merging them into the full suffix array.

    In many applications, the suffix array needs to be augmented with the length of the longest common prefix between each pair of lexicographically adjacent suffixes. The array containing this information is called the longest-common-prefix (LCP) array. The second contribution of this thesis is the first external-memory algorithm for computing the LCP array that is not an extension of a suffix-sorting algorithm.

    When the input text is highly repetitive, general-purpose text indexes are usually outperformed (particularly in space usage) by specialized indexes. One of the most popular families of such indexes is based on the Lempel-Ziv (LZ77) parsing, an encoding of the text that replaces long repeated substrings with references to other occurrences. In addition to indexing, LZ77 is a heavily used tool in data compression. The third contribution of this thesis is a series of new algorithms for computing the LZ77 parsing, both in RAM and in external memory.

    The algorithms introduced in this thesis significantly improve upon the prior art. For example: (i) our new approach for constructing the LCP array in external memory is faster than the previously best algorithm by a factor of 2-4 and simultaneously reduces disk space usage by a factor of four; (ii) a parallel version of our improved suffix array construction algorithm can handle inputs much larger than any considered in the literature so far. In our experiments, computing the suffix array of a 1 TiB file with the new algorithm took a little over a week and required only 7.2 TiB of disk space (including input and output), whereas on the same machine the previously best algorithm would require 3.5 times as much disk space and take about four times longer.
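
    The following is a minimal in-memory sketch of the three structures the thesis builds in external memory: the suffix array, the LCP array (here via Kasai's algorithm), and a greedy LZ77 parsing. It illustrates only the definitions; the thesis's contribution is constructing these objects at terabyte scale, which these naive RAM algorithms cannot do.

    # Naive illustration of the suffix array, LCP array, and LZ77 parsing.
    # Not the thesis's external-memory algorithms: everything here assumes
    # the text fits comfortably in RAM.

    def suffix_array(text):
        """Starting positions of all suffixes, in lexicographical order."""
        return sorted(range(len(text)), key=lambda i: text[i:])

    def lcp_array(text, sa):
        """lcp[j] = longest common prefix of the suffixes starting at
        sa[j-1] and sa[j]; computed with Kasai's O(n) algorithm."""
        n = len(text)
        rank = [0] * n
        for j, i in enumerate(sa):
            rank[i] = j
        lcp, h = [0] * n, 0
        for i in range(n):
            if rank[i] > 0:
                k = sa[rank[i] - 1]
                while i + h < n and k + h < n and text[i + h] == text[k + h]:
                    h += 1
                lcp[rank[i]] = h
                h = max(h - 1, 0)
            else:
                h = 0
        return lcp

    def lz77_parse(text):
        """Greedy LZ77: each phrase is a literal character or a pair
        (source position, length) pointing to an earlier occurrence."""
        phrases, i, n = [], 0, len(text)
        while i < n:
            best_len, best_src = 0, 0
            for src in range(i):                  # naive quadratic search
                l = 0
                while i + l < n and text[src + l] == text[i + l]:
                    l += 1
                if l > best_len:
                    best_len, best_src = l, src
            if best_len == 0:
                phrases.append(text[i]); i += 1
            else:
                phrases.append((best_src, best_len)); i += best_len
        return phrases

    t = "banana"
    sa = suffix_array(t)                # [5, 3, 1, 0, 4, 2]
    print(sa, lcp_array(t, sa))         # LCP: [0, 1, 3, 0, 0, 2]
    print(lz77_parse(t))                # ['b', 'a', 'n', (1, 3)]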

    Handling Massive N-Gram Datasets Efficiently

    This paper deals with two fundamental problems in handling large n-gram language models: indexing, that is, compressing the n-gram strings and their associated satellite data without compromising retrieval speed; and estimation, that is, computing the probability distribution of the strings from a large textual source. For the indexing problem, we describe compressed, exact, and lossless data structures that achieve, at the same time, high space reductions and no time degradation with respect to state-of-the-art solutions and related software packages. In particular, we present a compressed trie data structure in which each word following a context of fixed length k, i.e., its preceding k words, is encoded as an integer whose value is proportional to the number of words that follow that context. Since the number of words following a given context is typically very small in natural languages, we lower the space of the representation to compression levels never achieved before. Despite the significant savings in space, our technique introduces a negligible penalty at query time. For the estimation problem, we present a novel algorithm for estimating modified Kneser-Ney language models, which have emerged as the de facto choice for language modeling in both academia and industry thanks to their relatively low perplexity. Estimating such models from large textual sources poses the challenge of devising algorithms that make parsimonious use of the disk. The state-of-the-art algorithm uses three sorting steps in external memory; we show an improved construction that requires only one sorting step by exploiting properties of the extracted n-gram strings. In an extensive experimental analysis performed on billions of n-grams, we show an average improvement of 4.5x in the total running time over the state-of-the-art approach. Published in ACM Transactions on Information Systems (TOIS), February 2019, Article No. 2.
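
    As a hedged illustration of the context-based remapping idea (not the paper's actual code or data layout), the sketch below stores the word that follows a context not as its global vocabulary ID but as its rank among the distinct words observed after that context. These ranks are bounded by the number of successors per context, which in natural language is far smaller than the vocabulary size, so the integers to compress are much smaller.

    from collections import defaultdict

    def build_context_maps(ngrams):
        """For each length-k context, the sorted list of observed successors."""
        succ = defaultdict(set)
        for *context, word in ngrams:
            succ[tuple(context)].add(word)
        return {ctx: sorted(words) for ctx, words in succ.items()}

    def encode(ngrams, ctx_map):
        """Replace each final word by its rank within its context's list."""
        return [(tuple(ctx), ctx_map[tuple(ctx)].index(word))
                for *ctx, word in ngrams]

    ngrams = [("the", "cat", "sat"), ("the", "cat", "ran"), ("a", "dog", "sat")]
    ctx_map = build_context_maps(ngrams)
    print(encode(ngrams, ctx_map))
    # [(('the', 'cat'), 1), (('the', 'cat'), 0), (('a', 'dog'), 0)]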

    From Theory to Practice: Plug and Play with Succinct Data Structures

    Engineering efficient implementations of compact and succinct structures is a time-consuming and challenging task, since there is no standard library of easy-to-use, highly optimized, and composable components. One consequence is that measuring the practical impact of new theoretical proposals is difficult, since older baseline implementations may not rely on the same basic components, and reimplementing from scratch can be very time-consuming. In this paper we present a framework for experimentation with succinct data structures, providing a large set of configurable components, together with tests, benchmarks, and tools to analyze resource requirements. We demonstrate the functionality of the framework by recomposing succinct solutions for document retrieval.
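
    As a generic illustration of the kind of basic component such a framework provides (this is not the framework's API), the sketch below implements a bit vector with rank support via precomputed block counts, the archetypal succinct building block on top of which structures like wavelet trees and document-retrieval indexes are composed.

    class RankBitVector:
        """Bit vector with rank support via precomputed block counts."""
        BLOCK = 64

        def __init__(self, bits):
            self.bits = list(bits)
            # blocks[j] = number of 1s strictly before position j * BLOCK
            self.blocks = [0]
            for j in range(0, len(self.bits), self.BLOCK):
                self.blocks.append(self.blocks[-1] + sum(self.bits[j:j + self.BLOCK]))

        def rank1(self, i):
            """Number of 1s in bits[0:i]."""
            b = i // self.BLOCK
            return self.blocks[b] + sum(self.bits[b * self.BLOCK:i])

    bv = RankBitVector([1, 0, 1, 1, 0, 1])
    assert bv.rank1(4) == 3 and bv.rank1(6) == 4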

    Bidirectional Text Compression in External Memory

    Bidirectional compression algorithms work by replacing repeated substrings with references that, unlike in the well-known LZ77 scheme, can point in either direction. We present such an algorithm that is particularly suited to an external-memory implementation. We evaluate it experimentally on data sets of up to 128 GiB (using only 16 GiB of RAM) and show that it is significantly faster than all known LZ77 compressors, while producing a roughly similar number of factors. We also introduce an external-memory decompressor for texts compressed with any uni- or bidirectional compression scheme.
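
    A hedged sketch of what decoding such a bidirectional parsing involves (in RAM; the paper's point is doing this in external memory): each factor is either a literal character or a pair (src, len) whose source may lie before or after the factor itself, and characters are recovered by following reference chains, which terminate for any valid, acyclic parsing.

    from bisect import bisect_right
    from functools import lru_cache

    def decode(factors, n):
        """Decode a bidirectional parsing of a length-n text. A factor is a
        literal character or (src, len), where src may point forwards."""
        starts, pos = [], 0                   # starts[f] = where factor f begins
        for f in factors:
            starts.append(pos)
            pos += 1 if isinstance(f, str) else f[1]
        assert pos == n

        @lru_cache(maxsize=None)
        def char_at(i):
            f = bisect_right(starts, i) - 1   # factor covering position i
            if isinstance(factors[f], str):
                return factors[f]
            src, _ = factors[f]
            return char_at(src + (i - starts[f]))   # follow the reference

        return "".join(char_at(i) for i in range(n))

    # The first factor copies from a source to its *right*, which plain
    # LZ77 cannot express:
    print(decode([(2, 2), "a", "b"], 4))      # -> "abab"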

    Indexing arbitrary-length k-mers in sequencing reads

    We propose a lightweight data structure for indexing and querying collections of NGS reads in main memory. The data structure supports the interface proposed in the pioneering work of Philippe et al. for counting and locating k-mers in sequencing reads. Our solution, PgSA (pseudogenome suffix array), based on finding overlapping reads, is competitive with existing algorithms in space usage, query time, or both. The main applications of our index include variant calling, error correction, and the analysis of reads from RNA-seq experiments.
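
    A hedged sketch of the underlying query principle (PgSA's actual pseudogenome is built from overlapping reads and uses far more engineered structures): concatenate the reads with a separator, build a single suffix array over the concatenation, and answer counting and locating queries for a k-mer of any length k by binary search, mapping each match back to a (read, offset) pair.

    from bisect import bisect_left, bisect_right

    def build(reads, sep="$"):
        text = sep.join(reads) + sep
        sa = sorted(range(len(text)), key=lambda i: text[i:])   # naive build
        starts, pos = [], 0                    # starts[r] = read r's position
        for r in reads:
            starts.append(pos)
            pos += len(r) + 1
        return text, sa, starts

    def locate(text, sa, starts, kmer):
        """All (read index, offset) occurrences of kmer; len(...) counts them.
        Uses bisect's key= parameter (Python 3.10+)."""
        key = lambda i: text[i:i + len(kmer)]  # sorted suffixes => sorted prefixes
        lo = bisect_left(sa, kmer, key=key)
        hi = bisect_right(sa, kmer, key=key)
        hits = []
        for i in sa[lo:hi]:                    # separator prevents cross-read hits
            r = bisect_right(starts, i) - 1
            hits.append((r, i - starts[r]))
        return sorted(hits)

    text, sa, starts = build(["ACGTAC", "GTACGG"])
    print(locate(text, sa, starts, "GTAC"))    # [(0, 2), (1, 0)]
    print(len(locate(text, sa, starts, "AC"))) # 3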