10 research outputs found

    Compressed Data Structures for Dynamic Sequences

    We consider the problem of storing a dynamic string $S$ over an alphabet $\Sigma=\{1,\ldots,\sigma\}$ in compressed form. Our representation supports insertions and deletions of symbols and answers three fundamental queries: $\mathrm{access}(i,S)$ returns the $i$-th symbol in $S$, $\mathrm{rank}_a(i,S)$ counts how many times a symbol $a$ occurs among the first $i$ positions in $S$, and $\mathrm{select}_a(i,S)$ finds the position where a symbol $a$ occurs for the $i$-th time. We present the first fully-dynamic data structure for arbitrarily large alphabets that achieves optimal query times for all three operations and supports updates with worst-case time guarantees. Ours is also the first fully-dynamic data structure that needs only $nH_k+o(n\log\sigma)$ bits, where $H_k$ is the $k$-th order entropy and $n$ is the string length. Moreover, our representation supports extraction of a substring $S[i..i+\ell]$ in optimal $O(\log n/\log\log n + \ell/\log_{\sigma}n)$ time.
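
    A minimal uncompressed sketch of the ADT described above may help fix the semantics (1-based positions, a plain Python list as backing store; all names are illustrative and this is not the paper's construction, which answers the same queries in compressed space within the stated time bounds):

    # Naive reference for the dynamic-sequence ADT: insert/delete plus access/rank/select.
    class DynamicSequence:
        def __init__(self, symbols=()):
            self._s = list(symbols)

        def insert(self, i, a):
            """Insert symbol a so that it becomes S[i]."""
            self._s.insert(i - 1, a)

        def delete(self, i):
            """Delete S[i]."""
            del self._s[i - 1]

        def access(self, i):
            """Return the i-th symbol of S."""
            return self._s[i - 1]

        def rank(self, a, i):
            """Count occurrences of a among the first i positions of S."""
            return self._s[:i].count(a)

        def select(self, a, i):
            """Return the position of the i-th occurrence of a, or None."""
            seen = 0
            for pos, sym in enumerate(self._s, start=1):
                if sym == a:
                    seen += 1
                    if seen == i:
                        return pos
            return None

    S = DynamicSequence("abracadabra")
    assert S.access(5) == "c"
    assert S.rank("a", 8) == 4
    assert S.select("b", 2) == 9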

    Compressed String Dictionary Search with Edit Distance One

    In this paper we present different solutions for the problem of indexing a dictionary of strings in compressed space. Given a pattern $P$, the index has to report all the strings in the dictionary having edit distance at most one with $P$. Our first solution is able to solve queries in (almost optimal) $O(|P| + \mathrm{occ})$ time, where $\mathrm{occ}$ is the number of strings in the dictionary having edit distance at most one with $P$. The space complexity of this solution is bounded in terms of the $k$-th order entropy of the indexed dictionary. A second solution further improves this space complexity at the cost of increasing the query time. Finally, we propose randomized solutions (Monte Carlo and Las Vegas) which achieve simultaneously the time complexity of the first solution and the space complexity of the second one.
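
    A brute-force sketch of the query this index answers (the scan-all-strings approach and the helper names are illustrative only; the point of the paper is to avoid exactly this kind of full scan while keeping the dictionary compressed):

    def edit_distance_at_most_one(p, s):
        # True iff p and s differ by at most one substitution, insertion or deletion.
        if p == s:
            return True
        if abs(len(p) - len(s)) > 1:
            return False
        if len(p) == len(s):
            return sum(a != b for a, b in zip(p, s)) == 1
        shorter, longer = (p, s) if len(p) < len(s) else (s, p)
        return any(shorter == longer[:i] + longer[i + 1:] for i in range(len(longer)))

    def query(dictionary, p):
        # Report all dictionary strings with edit distance at most one from p.
        return [s for s in dictionary if edit_distance_at_most_one(p, s)]

    print(query(["cat", "cart", "dog", "at", "cut"], "cat"))   # ['cat', 'cart', 'at', 'cut']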

    Efficient Fully-Compressed Sequence Representations

    We present a data structure that stores a sequence $s[1..n]$ over alphabet $[1..\sigma]$ in $nH_0(s) + o(n)(H_0(s)+1)$ bits, where $H_0(s)$ is the zero-order entropy of $s$. This structure supports the queries access, rank and select, which are fundamental building blocks for many other compressed data structures, in worst-case time $O(\lg\lg\sigma)$ and average time $O(\lg H_0(s))$. The worst-case complexity matches the best previous results, yet these had been achieved with data structures using $nH_0(s)+o(n\lg\sigma)$ bits. On highly compressible sequences the $o(n\lg\sigma)$ bits of the redundancy may be significant compared to the $nH_0(s)$ bits that encode the data. Our representation, instead, compresses the redundancy as well. Moreover, our average-case complexity is unprecedented. Our technique is based on partitioning the alphabet into characters of similar frequency. The subsequence corresponding to each group can then be encoded using fast uncompressed representations without harming the overall compression ratios, even in the redundancy. The result also improves upon the best current compressed representations of several other data structures. For example, we achieve (i) compressed redundancy, retaining the best time complexities, for the smallest existing full-text self-indexes; (ii) compressed permutations $\pi$ with times for $\pi()$ and $\pi^{-1}()$ improved to loglogarithmic; and (iii) the first compressed representation of dynamic collections of disjoint sets. We also point out various applications to inverted indexes, suffix arrays, binary relations, and data compressors.
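
    The two ingredients named in the abstract, the zero-order entropy $H_0(s)$ and a partition of the alphabet by symbol frequency, can be sketched as follows (the particular grouping rule, by the integer part of lg(n / frequency), is only an illustrative choice and not necessarily the one used in the paper):

    import math
    from collections import Counter

    def h0(s):
        # Zero-order empirical entropy: sum over symbols of (n_c / n) * lg(n / n_c).
        n = len(s)
        return sum((c / n) * math.log2(n / c) for c in Counter(s).values())

    def partition_by_frequency(s):
        # Group alphabet symbols whose frequencies are within a factor of two of each other.
        n = len(s)
        groups = {}
        for sym, c in Counter(s).items():
            klass = int(math.log2(n / c))   # similar n/frequency ratio -> same class
            groups.setdefault(klass, []).append(sym)
        return groups

    s = "abracadabra"
    print(round(h0(s), 3))            # 2.04 bits per symbol
    print(partition_by_frequency(s))  # {1: ['a'], 2: ['b', 'r'], 3: ['c', 'd']}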

    Space-efficient data structures for string searching and retrieval

    Let $D = \{d_1, d_2, \ldots\}$ be a collection of string documents of $n$ characters in total, drawn from an alphabet set $\Sigma = [\sigma] = \{1,2,\ldots,\sigma\}$. The top-k document retrieval problem is to maintain $D$ as a data structure, such that whenever a query $Q=(P, k)$ arrives, we can report (the identifiers of) those $k$ documents that are most relevant to the pattern $P$ (of $p$ characters). The relevance of a document $d_r$ with respect to a pattern $P$ is captured by $\mathrm{score}(P, d_r)$, which can be any function of the set of locations where $P$ occurs in $d_r$. Finding the documents most relevant to the user query is the central task of any web-search engine. In the case of web data, the documents can be demarcated along word boundaries, and all search engines use the inverted index as their backbone data structure. For each word occurring in the document collection, the inverted index stores the list of documents where it appears; it is often augmented with relevance scores and/or positional information. However, when the data consists of strings (e.g., in bioinformatics or Asian language texts), there are no word demarcation boundaries and the queries are arbitrary substrings instead of proper valid words. In this case, string data structures have to be used, and the central approach is to use a suffix tree (or string B-tree) with appropriate augmenting data structures. The work by Hon, Shah and Vitter [FOCS 2009], and Navarro and Nekrich [SODA 2012] resulted in a linear-space data structure with optimal $O(p+k)$ query time for this problem, based on a geometric interpretation of the query. We extend this central problem in two important directions for massive data sets. First, we consider an external-memory, disk-based index, for which we give near-optimal results. Next, we consider compression aspects of the data structure, reducing the storage space; this is the central goal of the active research field of succinct data structures. We present several results which improve upon previous work and are currently the best known space-time trade-offs in this area.
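
    A naive sketch of the query interface (the score used here, the plain occurrence count, is just one admissible choice since the problem statement only requires score(P, d_r) to be a function of the occurrence positions; the cited structures answer the same query in optimal O(p + k) time):

    import heapq

    def occurrences(pattern, doc):
        # All (possibly overlapping) starting positions of pattern in doc.
        return [i for i in range(len(doc) - len(pattern) + 1)
                if doc[i:i + len(pattern)] == pattern]

    def top_k(docs, pattern, k):
        # Identifiers of the k documents most relevant to pattern,
        # with score(P, d_r) taken to be the number of occurrences.
        scored = ((len(occurrences(pattern, d)), r) for r, d in enumerate(docs))
        return [r for score, r in heapq.nlargest(k, scored) if score > 0]

    docs = ["ananas", "bandana", "nab", "ananana"]
    print(top_k(docs, "ana", 2))   # [3, 0]: 'ananana' has 3 occurrences, 'ananas' has 2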

    Advanced rank/select data structures: succinctness, bounds and applications.

    The thesis explores new theoretical results and applications of rank and select data structures. Given a string, select(c, i) gives the position of the i-th occurrence of character c in the string, while rank(c, p) counts the number of instances of character c to the left of position p. Succinct rank/select data structures are space-efficient versions of the standard ones, designed to keep data compressed while still answering queries rapidly. They are at the basis of more involved compressed and succinct data structures, which in turn are motivated by today's need to analyze and operate on massive data sets quickly, where space efficiency is crucial. The thesis builds on the state of the art left by years of study and produces results on multiple fronts. Analyzing binary succinct data structures and their link with predecessor data structures, we integrate data structures for the latter problem into the former. The result is a data structure which outperforms that of Patrascu [2008] in a range of cases not studied before, namely when the lower bound for predecessor search does not apply and constant-time rank is not feasible. Further, we propose the first lower bound for succinct data structures on generic strings, establishing a linear trade-off between the time for rank/select execution and the additional space (w.r.t. the plain data) needed by the data structure. The proposal addresses systematic data structures, namely those that only access the underlying string through ADT calls and do not encode it directly. We also propose a matching upper bound that proves the tightness of our lower bound. Finally, we apply rank/select data structures to the substring counting problem, where we seek to preprocess a text and generate a summary data structure which is stored in lieu of the text and answers substring counting queries with additive error. The results include a provably optimal data structure with generic additive error and a data structure that errs only on infrequent patterns, with significant practical space gains.
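
    The substring counting interface mentioned at the end can be illustrated with a toy summary (the thresholding idea below, which keeps exact counts only for sufficiently frequent substrings and answers 0 for the rest, is just one simple way to obtain a bounded additive error; it is not the construction of the thesis):

    from collections import Counter

    def build_summary(text, max_error):
        # Keep exact counts only for substrings occurring more than max_error times;
        # answering 0 for everything else errs by at most max_error, and only on
        # infrequent patterns.
        counts = Counter(text[i:j] for i in range(len(text))
                         for j in range(i + 1, len(text) + 1))
        return {p: c for p, c in counts.items() if c > max_error}

    def count(summary, pattern):
        return summary.get(pattern, 0)

    summary = build_summary("abracadabra", max_error=1)
    print(count(summary, "abra"))   # 2 (frequent, stored exactly)
    print(count(summary, "cad"))    # 0 (occurs once; error of at most 1)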

    Space-Efficient Data Structures in the Word-RAM and Bitprobe Models

    This thesis studies data structures in the word-RAM and bitprobe models, with an emphasis on space efficiency. In the word-RAM model of computation the space cost of a data structure is measured in terms of the number of w-bit words stored in memory, and the cost of answering a query is measured in terms of the number of read, write, and arithmetic operations that must be performed. In the bitprobe model the space cost is again measured in terms of the number of bits stored in memory, but the query cost is measured solely in terms of the number of bit accesses, or probes, performed. First, we examine the problem of succinctly representing a partially ordered set, or poset, in the word-RAM model with word size Theta(lg n) bits. A succinct representation of a combinatorial object is one that occupies space matching the information-theoretic lower bound to within lower order terms. We show how to represent a poset on n vertices using a data structure that occupies n^2/4 + o(n^2) bits and can answer precedence (i.e., less-than) queries in constant time. Since the transitive closure of a directed acyclic graph is a poset, this implies that we can support reachability queries on an arbitrary directed graph in the same space bound. As far as we are aware, this is the first representation of an arbitrary directed graph that supports reachability queries in constant time and uses fewer than (n choose 2) bits. We also consider several additional query operations. Second, we examine the problem of supporting range queries on strings of n characters (or, equivalently, arrays of n elements) in the word-RAM model with word size Theta(lg n) bits. We focus on the specific problem of answering range majority queries: i.e., given a range, report the character that is the majority among those in the range, if one exists. We show that these queries can be supported in constant time using a linear-space (in words) data structure. We generalize this result in several directions, considering various frequency thresholds, geometric variants of the problem, and dynamism. These results are in stark contrast to recent work on the similar range mode problem, in which the query asks for the mode (i.e., most frequent) character in a given range: the current best linear-space data structures for the range mode problem take soft-O(n^(1/2)) time per query. Third, we examine the deterministic membership (or dictionary) problem in the bitprobe model. This problem asks us to store a set of n elements drawn from a universe [1,u] such that membership queries can always be answered in t bit probes. We present several new fully explicit results for this problem, in particular for the case when n = 2, answering an open problem posed by Radhakrishnan, Shah, and Shannigrahi [ESA 2010]. We also present a general strategy for the membership problem that can be used to solve many related fundamental problems, such as rank, counting, and emptiness queries. Finally, we conclude with a list of open problems and avenues for future work.
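
    The range majority query from the second part has a simple per-query reference (the Boyer-Moore voting pass below recomputes the answer from scratch for every query, unlike the constant-time, linear-space structure described above; the 0-based half-open range convention is an arbitrary choice for the sketch):

    def range_majority(s, lo, hi):
        # Majority character of s[lo:hi] (strictly more than half the range), or None.
        # One voting pass finds the only possible candidate; a second pass verifies it.
        candidate, votes = None, 0
        for c in s[lo:hi]:
            if votes == 0:
                candidate, votes = c, 1
            elif c == candidate:
                votes += 1
            else:
                votes -= 1
        if candidate is not None and s[lo:hi].count(candidate) * 2 > hi - lo:
            return candidate
        return None

    print(range_majority("abababbba", 2, 9))   # 'b' (4 of the 7 characters)
    print(range_majority("abcabc", 0, 6))      # None (no character exceeds half)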