413 research outputs found

    Rank, select and access in grammar-compressed strings

    Full text link
    Given a string SS of length NN on a fixed alphabet of σ\sigma symbols, a grammar compressor produces a context-free grammar GG of size nn that generates SS and only SS. In this paper we describe data structures to support the following operations on a grammar-compressed string: \mbox{rank}_c(S,i) (return the number of occurrences of symbol cc before position ii in SS); \mbox{select}_c(S,i) (return the position of the iith occurrence of cc in SS); and \mbox{access}(S,i,j) (return substring S[i,j]S[i,j]). For rank and select we describe data structures of size O(nσlogN)O(n\sigma\log N) bits that support the two operations in O(logN)O(\log N) time. We propose another structure that uses O(nσlog(N/n)(logN)1+ϵ)O(n\sigma\log (N/n)(\log N)^{1+\epsilon}) bits and that supports the two queries in O(logN/loglogN)O(\log N/\log\log N), where ϵ>0\epsilon>0 is an arbitrary constant. To our knowledge, we are the first to study the asymptotic complexity of rank and select in the grammar-compressed setting, and we provide a hardness result showing that significantly improving the bounds we achieve would imply a major breakthrough on a hard graph-theoretical problem. Our main result for access is a method that requires O(nlogN)O(n\log N) bits of space and O(logN+m/logσN)O(\log N+m/\log_\sigma N) time to extract m=ji+1m=j-i+1 consecutive symbols from SS. Alternatively, we can achieve O(logN/loglogN+m/logσN)O(\log N/\log\log N+m/\log_\sigma N) query time using O(nlog(N/n)(logN)1+ϵ)O(n\log (N/n)(\log N)^{1+\epsilon}) bits of space. This matches a lower bound stated by Verbin and Yu for strings where NN is polynomially related to nn.Comment: 16 page

    Compressing dictionaries of strings

    Get PDF
    The aim of this work is to develop a data structure capable of storing a set of strings in a compressed way providing the facility to access and search by prefix any string in the set. The notion of string will be formally exposed in this work, but it is enough to think a string as a stream of characters or a variable length dat}. We will prove that the data structure devised in our work will be able to search prefixes of the stored strings in a very efficient way, hence giving a performant solution to one of the most discussed problem of our age. In the discussion of our data structure, particular emphasis will be given to both space and time efficiency and a tradeoff between these two will be constantly searched. To understand how much string based data structures are important, think about modern search engines and social networks; they must store and process continuously immense streams of data which are mainly strings, while the output of such processed data must be available in few milliseconds not to try the patience of the user. Space efficiency is one of the main concern in this kind of problem. In order to satisfy real-time latency bounds, the largest possible amount of data must be stored in the highest levels of the memory hierarchy. Moreover, data compression allows to save money because it reduces the amount of physical memory needed to store abstract data and this particularly important since storage is the main source of expenditure in modern systems

    Compressed Full-Text Indexes for Highly Repetitive Collections

    Get PDF
    This thesis studies problems related to compressed full-text indexes. A full-text index is a data structure for indexing textual (sequence) data, so that the occurrences of any query string in the data can be found efficiently. While most full-text indexes require much more space than the sequences they index, recent compressed indexes have overcome this limitation. These compressed indexes combine a compressed representation of the index with some extra information that allows decompressing any part of the data efficiently. This way, they provide similar functionality as the uncompressed indexes, while using only slightly more space than the compressed data. The efficiency of data compression is usually measured in terms of entropy. While entropy-based estimates predict the compressed size of most texts accurately, they fail with highly repetitive collections of texts. Examples of such collections include different versions of a document and the genomes of a number of individuals from the same population. While the entropy of a highly repetitive collection is usually similar to that of a text of the same kind, the collection can often be compressed much better than the entropy-based estimate. Most compressed full-text indexes are based on the Burrows-Wheeler transform (BWT). Originally intended for data compression, the BWT has deep connections with full-text indexes such as the suffix tree and the suffix array. With some additional information, these indexes can be simulated with the Burrows-Wheeler transform. The first contribution of this thesis is the first BWT-based index that can compress highly repetitive collections efficiently. Compressed indexes allow us to handle much larger data sets than the corresponding uncompressed indexes. To take full advantage of this, we need algorithms for constructing the compressed index directly, instead of first constructing an uncompressed index and then compressing it. The second contribution of this thesis is an algorithm for merging the BWT-based indexes of two text collections. By using this algorithm, we can derive better space-efficient construction algorithms for BWT-based indexes. The basic BWT-based indexes provide similar functionality as the suffix array. With some additional structures, the functionality can be extended to that of the suffix tree. One of the structures is an array storing the lengths of the longest common prefixes of lexicographically adjacent suffixes of the text. The third contribution of this thesis is a space-efficient algorithm for constructing this array, and a new compressed representation of the array. In the case of individual genomes, the highly repetitive collection can be considered a sample from a larger collection. This collection consists of a reference sequence and a set of possible differences from the reference, so that each sequence contains a subset of the differences. The fourth contribution of this thesis is a BWT-based index that extrapolates the larger collection from the sample and indexes it.Tässä väitöskirjassa käsitellään tiivistettyjä kokotekstihakemistoja tekstimuotoisille aineistoille. Kokotekstihakemistot ovat tietorakenteita, jotka mahdollistavat mielivaltaisten hahmojen esiintymien löytämisen tekstistä tehokkaasti. Perinteiset kokotekstihakemistot, kuten loppuosapuut ja -taulukot, vievät moninkertaisesti tilaa itse aineistoon nähden. Viime aikoina on kuitenkin kehitetty tiivistettyjä hakemistorakenteita, jotka tarjoavat vastaavan toiminnallisuuden alkuperäistä tekstiä pienemmässä tilassa. Tämä on mahdollistanut aikaisempaa suurempien aineistojen käsittelyn. Tekstin tiivistyvyyttä mitataan yleensä suhteessa sen entropiaan. Vaikka entropiaan perustuvat arviot ovat useimmilla aineistoilla varsin tarkkoja, aliarvioivat ne vahvasti toisteisien aineistojen tiivistyvyyttä. Esimerkkejä tällaisista aineistoista ovat kokoelmat saman populaation yksilöiden genomeita tai saman dokumentin eri versioita. Siinä missä tällaisen kokoelman entropia suhteessa aineiston kokoon on vastaava kuin yksittäisellä samaa tyyppiä olevalla tekstillä, tiivistyy kokoelma yleensä huomattavasti paremmin kuin entropian perusteella voisi odottaa. Useimmat tiivistetyt kokotekstihakemistot perustuvat Burrows-Wheeler-muunnokseen (BWT), joka kehitettiin alun perin tekstimuotoisten aineistojen tiivistämiseen. Pian kuitenkin havaittiin, että koska BWT muistuttaa rakenteeltaan loppuosapuuta ja -taulukkoa, voidaan sitä käyttää niissä tehtävien hakujen simulointiin. Tässä väitöskirjassa esitetään ensimmäinen BWT-pohjainen kokotekstihakemisto, joka pystyy tiivistämään vahvasti toisteiset aineistot tehokkaasti. Tiivistettyjen tietorakenteiden käyttö mahdollistaa suurempien aineistoiden käsittelemisen kuin tavallisia tietorakenteita käytettäessä. Tämä etu kuitenkin menetetään, jos tiivistetty tietorakenne muodostetaan luomalla ensin vastaava tavallinen tietorakenne ja tiivistämällä se. Tässä väitöskirjassa esitetään aikaisempaa vähemmän muistia käyttäviä algoritmeja BWT-pohjaisten kokotekstihakemistojen muodostamiseen. Kokoelma yksilöiden genomeita voidaan käsittää otokseksi suuremmasta kokoelmasta, joka koostuu populaation kaikkien yksilöiden sekä niiden hypoteettisten jälkeläisten genomeista. Tällainen kokoelma voidaan esittää äärellisenä automaattina, joka muodostuu referenssigenomista ja yksilöiden genomeissa esiintyvistä poikkeamista referenssistä. Tässä väitöskirjassa esitetään BWT-pohjaisten kokotekstihakemistojen yleistys, joka mahdollistaa tällaisten automaattien indeksoinnin

    Pattern matching of compressed terms and contexts and polynomial rewriting

    Get PDF
    A generalization of the compressed string pattern match that applies to terms with variables is investigated: Given terms s and t compressed by singleton tree grammars, the task is to find an instance of s that occurs as a subterm in t. We show that this problem is in NP and that the task can be performed in time O(ncjVar(s)j), including the construction of the compressed substitution, and a representation of all occurrences. We show that the special case where s is uncompressed can be performed in polynomial time. As a nice application we show that for an equational deduction of t to t0 by an equality axiom l = r (a rewrite) a single step can be performed in polynomial time in the size of compression of t and l; r if the number of variables is fixed in l. We also show that n rewriting steps can be performed in polynomial time, if the equational axioms are compressed and assumed to be constant for the rewriting sequence. Another potential application are querying mechanisms on compressed XML-data bases

    Algorithms and data structures for grammar-compressed strings

    Get PDF

    Recompression: a simple and powerful technique for word equations

    Get PDF
    In this paper we present an application of a simple technique of local recompression, previously developed by the author in the context of compressed membership problems and compressed pattern matching, to word equations. The technique is based on local modification of variables (replacing X by aX or Xa) and iterative replacement of pairs of letters appearing in the equation by a `fresh' letter, which can be seen as a bottom-up compression of the solution of the given word equation, to be more specific, building an SLP (Straight-Line Programme) for the solution of the word equation. Using this technique we give a new, independent and self-contained proofs of most of the known results for word equations. To be more specific, the presented (nondeterministic) algorithm runs in O(n log n) space and in time polynomial in log N, where N is the size of the length-minimal solution of the word equation. The presented algorithm can be easily generalised to a generator of all solutions of the given word equation (without increasing the space usage). Furthermore, a further analysis of the algorithm yields a doubly exponential upper bound on the size of the length-minimal solution. The presented algorithm does not use exponential bound on the exponent of periodicity. Conversely, the analysis of the algorithm yields an independent proof of the exponential bound on exponent of periodicity. We believe that the presented algorithm, its idea and analysis are far simpler than all previously applied. Furthermore, thanks to it we can obtain a unified and simple approach to most of known results for word equations. As a small additional result we show that for O(1) variables (with arbitrary many appearances in the equation) word equations can be solved in linear space, i.e. they are context-sensitive.Comment: Submitted to a journal. Since previous version the proofs were simplified, overall presentation improve

    Indexing Highly Repetitive String Collections

    Full text link
    Two decades ago, a breakthrough in indexing string collections made it possible to represent them within their compressed space while at the same time offering indexed search functionalities. As this new technology permeated through applications like bioinformatics, the string collections experienced a growth that outperforms Moore's Law and challenges our ability of handling them even in compressed form. It turns out, fortunately, that many of these rapidly growing string collections are highly repetitive, so that their information content is orders of magnitude lower than their plain size. The statistical compression methods used for classical collections, however, are blind to this repetitiveness, and therefore a new set of techniques has been developed in order to properly exploit it. The resulting indexes form a new generation of data structures able to handle the huge repetitive string collections that we are facing. In this survey we cover the algorithmic developments that have led to these data structures. We describe the distinct compression paradigms that have been used to exploit repetitiveness, the fundamental algorithmic ideas that form the base of all the existing indexes, and the various structures that have been proposed, comparing them both in theoretical and practical aspects. We conclude with the current challenges in this fascinating field

    Complexity, parallel computation and statistical physics

    Full text link
    The intuition that a long history is required for the emergence of complexity in natural systems is formalized using the notion of depth. The depth of a system is defined in terms of the number of parallel computational steps needed to simulate it. Depth provides an objective, irreducible measure of history applicable to systems of the kind studied in statistical physics. It is argued that physical complexity cannot occur in the absence of substantial depth and that depth is a useful proxy for physical complexity. The ideas are illustrated for a variety of systems in statistical physics.Comment: 21 pages, 7 figure
    corecore