804 research outputs found
Compressed Representations of Permutations, and Applications
We explore various techniques to compress a permutation $\pi$ over n
integers, taking advantage of ordered subsequences in $\pi$, while supporting
its application $\pi(i)$ and the application of its inverse $\pi^{-1}(i)$ in
small time. Our compression schemes yield several interesting byproducts, in
many cases matching, improving or extending the best existing results on
applications such as the encoding of a permutation in order to support iterated
applications of it, of integer functions, and of inverted lists and
suffix arrays.
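To make the idea of "ordered subsequences" concrete, here is a minimal sketch that assumes the simplest such measure: maximal ascending runs of consecutive positions. The function names `ascending_runs` and `inverse` are hypothetical, and the code only measures presortedness and computes a plain, uncompressed inverse; it is not the compression scheme of the paper.

```python
def ascending_runs(perm):
    """Split perm into maximal ascending runs of consecutive positions."""
    runs = []
    start = 0
    for i in range(1, len(perm) + 1):
        # A run ends at the end of the array or where the values drop.
        if i == len(perm) or perm[i] < perm[i - 1]:
            runs.append(perm[start:i])
            start = i
    return runs


def inverse(perm):
    """Plain (uncompressed) inverse of a permutation of 0..n-1."""
    inv = [0] * len(perm)
    for i, v in enumerate(perm):
        inv[v] = i
    return inv


perm = [0, 1, 4, 2, 3, 5]
runs = ascending_runs(perm)   # two runs: [0, 1, 4] and [2, 3, 5]
inv = inverse(perm)
```

A permutation with few runs is close to sorted, which is the kind of regularity a compressed representation can try to exploit.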
Succinct Representations of Dynamic Strings
The rank and select operations over a string of length n from an alphabet of
size $\sigma$ have been used widely in the design of succinct data structures.
In many applications, the string itself need be maintained dynamically,
allowing characters of the string to be inserted and deleted. Under the word
RAM model with word size , we design a succinct representation
of dynamic strings using bits to support rank,
select, insert and delete in time. When the alphabet size is small, i.e. when $\sigma = O(\mathrm{polylog}(n))$,
including the case in which the string is a bit vector, these operations
are supported in time. Our data structures are more
efficient than previous results on the same problem, and we have applied them
to improve results on the design and construction of space-efficient text
indexes
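For readers unfamiliar with the operations named above, the following is a naive sketch of their semantics on an ordinary Python list. It assumes one common convention (conventions vary between papers): `rank(s, c, i)` counts occurrences of `c` in the first `i` positions, and `select(s, c, j)` returns the 0-based position of the `j`-th occurrence. It takes linear time and uses no compression, so it illustrates only what the operations return, not the succinct structure of the paper.

```python
def rank(s, c, i):
    """Number of occurrences of c among the first i symbols of s."""
    return sum(1 for x in s[:i] if x == c)


def select(s, c, j):
    """0-based position of the j-th occurrence of c (j >= 1), or -1 if absent."""
    count = 0
    for pos, x in enumerate(s):
        if x == c:
            count += 1
            if count == j:
                return pos
    return -1


s = list("abracadabra")
r = rank(s, "a", 4)      # occurrences of 'a' in "abra"
p = select(s, "a", 3)    # position of the third 'a'

# Dynamic updates are plain list operations in this naive version.
s.insert(0, "a")         # insert a character at position 0
del s[0]                 # delete it again
```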
Dynamic Data Structures for Document Collections and Graphs
In the dynamic indexing problem, we must maintain a changing collection of
text documents so that we can efficiently support insertions, deletions, and
pattern matching queries. We are especially interested in developing efficient
data structures that store and query the documents in compressed form. All
previous compressed solutions to this problem rely on answering rank and select
queries on a dynamic sequence of symbols. Because of the lower bound in
[Fredman and Saks, 1989], answering rank queries presents a bottleneck in
compressed dynamic indexing. In this paper we show how this lower bound can be
circumvented using our new framework. We demonstrate that the gap between
static and dynamic variants of the indexing problem can be almost closed. Our
method is based on a novel framework for adding dynamism to static compressed
data structures. Our framework also applies more generally to dynamizing other
problems. We show, for example, how our framework can be applied to develop
compressed representations of dynamic graphs and binary relations
Compact Binary Relation Representations with Rich Functionality
Binary relations are an important abstraction arising in many data
representation problems. The data structures proposed so far to represent them
support just a few basic operations required to fit one particular application.
We identify many of those operations arising in applications and generalize
them into a wide set of desirable queries for a binary relation representation.
We also identify reductions among those operations. We then introduce several
novel binary relation representations, some simple and some quite
sophisticated, that not only are space-efficient but also efficiently support a
large subset of the desired queries.
Comment: 32 pages
Efficient Fully-Compressed Sequence Representations
We present a data structure that stores a sequence $s$ of length $n$ over an
alphabet of size $\sigma$ in $nH_0(s) + o(n)(H_0(s)+1)$ bits, where $H_0(s)$ is the
zero-order entropy of $s$. This structure supports the queries access, rank
and select, which are fundamental building blocks for many other compressed
data structures, in worst-case time $O(\lg\lg\sigma)$ and average time
$O(\lg H_0(s))$. The worst-case complexity matches the best previous results,
yet these had been achieved with data structures using $nH_0(s)+o(n\lg\sigma)$
bits. On highly compressible sequences the $o(n\lg\sigma)$ bits of the
redundancy may be significant compared to the $nH_0(s)$ bits that encode
the data. Our representation, instead, compresses the redundancy as well.
Moreover, our average-case complexity is unprecedented. Our technique is based
on partitioning the alphabet into characters of similar frequency. The
subsequence corresponding to each group can then be encoded using fast
uncompressed representations without harming the overall compression ratios,
even in the redundancy. The result also improves upon the best current
compressed representations of several other data structures. For example, we
achieve compressed redundancy, retaining the best time complexities, for
the smallest existing full-text self-indexes; compressed permutations $\pi$
with times for $\pi()$ and $\pi^{-1}()$ improved to loglogarithmic; and
the first compressed representation of dynamic collections of disjoint
sets. We also point out various applications to inverted indexes, suffix
arrays, binary relations, and data compressors.
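The space bounds above are stated in terms of the zero-order entropy, which by the standard definition is $H_0(s) = \sum_c (n_c/n) \lg (n/n_c)$, where $n_c$ is the number of occurrences of character $c$ in $s$. The sketch below simply evaluates that definition; the function name `zero_order_entropy` is hypothetical, and the code says nothing about how the data structure itself is built.

```python
import math
from collections import Counter


def zero_order_entropy(s):
    """Zero-order empirical entropy of s, in bits per symbol."""
    n = len(s)
    if n == 0:
        return 0.0
    # Sum over distinct characters c of (n_c / n) * lg(n / n_c).
    return sum((cnt / n) * math.log2(n / cnt) for cnt in Counter(s).values())


h_uniform = zero_order_entropy("aabb")   # two equally frequent characters
h_constant = zero_order_entropy("aaaa")  # a single repeated character
```

On a highly compressible sequence such as the second example, $nH_0(s)$ is tiny, which is why a redundancy term that does not shrink with the entropy can dominate the total space.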
Succinct Online Dictionary Matching with Improved Worst-Case Guarantees
In the online dictionary matching problem the goal is to preprocess a set of patterns D={P_1,...,P_d} over alphabet Sigma, so that given an online text (one character at a time) we report all of the occurrences of patterns that are a suffix of the current text before the following character arrives. We introduce a succinct Aho-Corasick-like data structure for the online dictionary matching problem. Our solution uses a new succinct representation for multi-labeled trees, in which each node has a set of labels from a universe of size lambda. We consider lowest labeled ancestor (LLA) queries on multi-labeled trees, where given a node and a label we return the lowest proper ancestor of the node that has the queried label.
In this paper we introduce a succinct representation of multi-labeled trees for lambda=omega(1) that support LLA queries in O(log(log(lambda))) time. Using this representation of multi-labeled trees, we introduce a succinct data structure for the online dictionary matching problem when sigma=omega(1). In this solution the worst case cost per character is O(log(log(sigma)) + occ) time, where occ is the size of the current output.
Moreover, the amortized cost per character is O(1+occ) time
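As background for the problem statement, here is a minimal sketch of the classical, non-succinct Aho-Corasick automaton used in online fashion: each call to `feed` consumes one text character and returns the indices of all patterns that are a suffix of the text read so far. The class name and its attributes are hypothetical, the automaton is pointer-based with dictionaries, and it does not provide the worst-case guarantees or the succinct space of the paper.

```python
from collections import deque


class AhoCorasick:
    def __init__(self, patterns):
        self.goto = [{}]   # goto[state][char] -> child state in the trie
        self.fail = [0]    # failure link of each state
        self.out = [[]]    # indices of patterns that end at each state
        for idx, pattern in enumerate(patterns):
            state = 0
            for ch in pattern:
                if ch not in self.goto[state]:
                    self.goto.append({})
                    self.fail.append(0)
                    self.out.append([])
                    self.goto[state][ch] = len(self.goto) - 1
                state = self.goto[state][ch]
            self.out[state].append(idx)
        # Breadth-first traversal: failure links of shallower states are
        # final before deeper states use them.
        queue = deque(self.goto[0].values())
        while queue:
            u = queue.popleft()
            for ch, v in self.goto[u].items():
                f = self.fail[u]
                while f and ch not in self.goto[f]:
                    f = self.fail[f]
                self.fail[v] = self.goto[f].get(ch, 0)
                # Inherit the patterns recognised at the failure target.
                self.out[v] = self.out[v] + self.out[self.fail[v]]
                queue.append(v)
        self.state = 0

    def feed(self, ch):
        """Consume one character; return indices of patterns ending here."""
        while self.state and ch not in self.goto[self.state]:
            self.state = self.fail[self.state]
        self.state = self.goto[self.state].get(ch, 0)
        return list(self.out[self.state])


matcher = AhoCorasick(["he", "she", "his", "hers"])
reports = [sorted(matcher.feed(ch)) for ch in "ushers"]
```

In this plain version a single character may follow many failure links before a transition is found, which is the kind of worst-case cost per character that the bounds in the abstract are concerned with.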
Applying Wikipedia to Interactive Information Retrieval
There are many opportunities to improve the interactivity of information retrieval systems beyond the ubiquitous search box. One idea is to use knowledge bases—e.g. controlled vocabularies, classification schemes, thesauri and ontologies—to organize, describe and navigate the information space. These resources are popular in libraries and specialist collections, but have proven too expensive and narrow to be applied to everyday webscale search. Wikipedia has the potential to bring structured knowledge into more widespread use. This online, collaboratively generated encyclopaedia is one of the largest and most consulted reference works in existence. It is broader, deeper and more agile than the knowledge bases put forward to assist retrieval in the past. Rendering this resource machine-readable is a challenging task that has captured the interest of many researchers. Many see it as a key step required to break the knowledge acquisition bottleneck that crippled previous efforts. This thesis claims that the roadblock can be sidestepped: Wikipedia can be applied effectively to open-domain information retrieval with minimal natural language processing or information extraction. The key is to focus on gathering and applying human-readable rather than machine-readable knowledge. To demonstrate this claim, the thesis tackles three separate problems: extracting knowledge from Wikipedia; connecting it to textual documents; and applying it to the retrieval process. First, we demonstrate that a large thesaurus-like structure can be obtained directly from Wikipedia, and that accurate measures of semantic relatedness can be efficiently mined from it. Second, we show that Wikipedia provides the necessary features and training data for existing data mining techniques to accurately detect and disambiguate topics when they are mentioned in plain text. 
Third, we provide two systems and user studies that demonstrate the utility of the Wikipedia-derived knowledge base for interactive information retrieval
Raising Permutations to Powers in Place
Given a permutation of n elements, stored as an array, we address the problem of replacing the permutation by its kth power. We aim to perform this operation quickly using o(n) bits of extra storage. To this end, we first present an algorithm for inverting permutations that uses O(lg^2 n) additional bits and runs in O(n lg n) worst case time. This result is then generalized to the situation in which the permutation is to be replaced by its kth power. An algorithm whose worst case running time is O(n lg n) and uses O(lg^2 n + min{k lg n, n^{3/4 + epsilon}}) additional bits is presented
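For contrast with the in-place setting described above, the following baseline sketch computes the kth power through an explicit cycle decomposition. It uses a second array and a visited array, i.e. far more than o(n) bits of extra storage, so it is not the algorithm of the paper; the function name `permutation_power` is hypothetical.

```python
def permutation_power(perm, k):
    """Return the k-th power of perm (a permutation of 0..n-1) as a new list.

    Baseline only: uses O(n) words of extra space, unlike the in-place goal.
    """
    n = len(perm)
    result = [0] * n
    seen = [False] * n
    for start in range(n):
        if seen[start]:
            continue
        # Collect the cycle that contains start.
        cycle = []
        x = start
        while not seen[x]:
            seen[x] = True
            cycle.append(x)
            x = perm[x]
        # Applying perm k times moves each element k steps along its cycle.
        length = len(cycle)
        for j, elem in enumerate(cycle):
            result[elem] = cycle[(j + k) % length]
    return result


perm = [1, 2, 0, 4, 3]
squared = permutation_power(perm, 2)
inverted = permutation_power(perm, -1)   # k = -1 yields the inverse
```

Because each element only moves within its own cycle, the running time is linear in n regardless of k; the difficulty addressed in the abstract is achieving this kind of result without the auxiliary arrays.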