Cache-oblivious index for approximate string matching
This paper revisits the problem of indexing a text for approximate string matching. Specifically, given a text T of length n and a positive integer k, we want to construct an index of T such that for any input pattern P, we can find all its k-error matches in T efficiently. This problem is well-studied in the internal-memory setting. Here, we extend some of these recent results to external-memory solutions, which are also cache-oblivious. Our first index occupies $O((n \log^k n)/B)$ disk pages and finds all k-error matches with $O((|P| + \mathit{occ})/B + \log^k n \log\log_B n)$ I/Os, where B denotes the number of words in a disk page. To the best of our knowledge, this index is the first external-memory data structure that does not require $\Omega(|P| + \mathit{occ} + \mathrm{poly}(\log n))$ I/Os. The second index reduces the space to $O((n \log n)/B)$ disk pages, and the I/O complexity is $O((|P| + \mathit{occ})/B + \log^{k(k+1)} n \log\log n)$. © 2011 Elsevier B.V. All rights reserved.
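To make the notion of a k-error match concrete, the following is a minimal in-memory sketch, not the cache-oblivious index described in the abstract. It assumes that a k-error match is a substring of T within edit distance k of P, and it uses the classical dynamic-programming scan (often attributed to Sellers), which takes O(n|P|) time instead of using an index. The function name `k_error_match_ends` is hypothetical.

```python
def k_error_match_ends(text, pattern, k):
    """Return every end offset j (1 <= j <= len(text)) such that some
    substring text[i:j] is within edit distance k of pattern.

    Illustrative baseline only: it scans the whole text for each query.
    """
    m = len(pattern)
    # prev[i] = minimum edit distance between pattern[:i] and any
    # substring of the text that ends at the previous text position.
    prev = list(range(m + 1))
    ends = []
    for j, ch in enumerate(text, start=1):
        cur = [0] * (m + 1)  # cur[0] = 0: a match may start anywhere
        for i in range(1, m + 1):
            cost = 0 if pattern[i - 1] == ch else 1
            cur[i] = min(
                prev[i - 1] + cost,  # match or substitution
                prev[i] + 1,         # extra text character
                cur[i - 1] + 1,      # missing pattern character
            )
        if cur[m] <= k:
            ends.append(j)
        prev = cur
    return ends


if __name__ == "__main__":
    print(k_error_match_ends("abracadabra", "abra", 0))
    print(k_error_match_ends("xabdx", "abc", 1))
```

With k = 0 the scan degenerates to exact matching, which is a convenient sanity check. An index in the sense of the abstract would answer the same query without reading all of T.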
08081 Abstracts Collection -- Data Structures
From February 17th to 22nd, 2008, the Dagstuhl Seminar 08081 "Data Structures" was held in the International Conference and Research Center (IBFI),
Schloss Dagstuhl. It brought together 49 researchers from four continents to discuss recent developments concerning data structures in terms of research but also in terms of new technologies that impact how data can be stored, updated,
and retrieved.
During the seminar a fair number of participants presented their current
research. There was discussion of ongoing work, and in addition an open problem
session was held. This paper first describes the seminar topics and goals in general, then gives the minutes of the open problem session, and concludes with
abstracts of the presentations given during the seminar.
Where appropriate and available, links to extended abstracts or full papers are provided.
On the Power of False Negative Awareness in Indicator-based Caching Systems
Distributed caching systems such as content distribution networks often
advertise their content via lightweight approximate indicators (e.g., Bloom
filters) to efficiently inform clients where each datum is likely cached. While
false-positive indications are necessary and well understood, most existing
works assume no false-negative indications. Our work illustrates practical
scenarios where false-negatives are unavoidable and ignoring them has a
significant impact on system performance. Specifically, we focus on
false-negatives induced by indicator staleness, which arises whenever the
system advertises the indicator only periodically, rather than immediately
reporting every change in the cache. Such scenarios naturally occur, e.g., in
bandwidth-constrained environments or when latency impedes the ability of each
client to obtain an updated indicator. Our work introduces novel false-negative
aware access policies that continuously estimate the false-negative ratio and
sometimes access caches despite negative indications. We present optimal
policies for homogeneous settings and provide approximation guarantees for our
algorithms in heterogeneous environments. We further perform an extensive
simulation study with multiple real system traces. We show that our
false-negative aware algorithms incur a significantly lower access cost than
existing approaches or match the cost of these approaches while requiring an
order of magnitude fewer resources (e.g., caching capacity or bandwidth).
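The staleness effect described above can be illustrated with a small sketch. It is not the paper's access policy. It only shows, under stated assumptions, how a periodically advertised Bloom filter yields negative indications for items that are in fact cached. The classes `BloomFilter` and `AdvertisingCache`, the function `false_negative_ratio`, and the definition of the ratio (the fraction of negative indications that are wrong) are all hypothetical choices made for this illustration.

```python
import hashlib


class BloomFilter:
    """Tiny Bloom filter; one byte per bit for readability."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)

    def _positions(self, key):
        # Derive num_hashes positions by salting SHA-256 with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = 1

    def __contains__(self, key):
        return all(self.bits[pos] for pos in self._positions(key))


class AdvertisingCache:
    """Hypothetical cache whose indicator is refreshed only on advertise()."""

    def __init__(self):
        self.items = set()
        self.published = BloomFilter()  # what clients currently see

    def insert(self, key):
        self.items.add(key)  # the published indicator is NOT updated

    def advertise(self):
        fresh = BloomFilter()
        for key in self.items:
            fresh.add(key)
        self.published = fresh


def false_negative_ratio(cache, requests):
    """Fraction of negative indications that are wrong (assumed definition)."""
    negatives = [r for r in requests if r not in cache.published]
    if not negatives:
        return 0.0
    return sum(r in cache.items for r in negatives) / len(negatives)


if __name__ == "__main__":
    cache = AdvertisingCache()
    cache.insert("a")
    cache.insert("b")
    print(false_negative_ratio(cache, ["a", "b", "z"]))  # stale indicator
    cache.advertise()
    print(false_negative_ratio(cache, ["a", "b", "z"]))  # fresh indicator
```

A client that tracks such a ratio could choose to query a cache despite a negative indication when the estimate is high. How to make that choice optimally is the subject of the paper and is not attempted here.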
Approximate Range Counting Revisited
We study range-searching for colored objects, where one has to count (approximately) the number of colors present in a query range. The problems studied mostly involve orthogonal range-searching in two and three dimensions, and the dual setting of rectangle stabbing by points. We present optimal and near-optimal solutions for these problems. Most of the results are obtained via reductions to the approximate uncolored version, and improved data structures for that version. An additional contribution of this work is the introduction of nested shallow cuttings.
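As a point of reference for the query being answered, here is a brute-force sketch of colored orthogonal range counting in two dimensions, together with a check of whether an estimate is an acceptable approximation. It is a linear-scan baseline, not one of the data structures from the paper. The function names and the (1 + eps) relative-error criterion are assumptions made for illustration.

```python
def count_colors_in_range(points, x_lo, x_hi, y_lo, y_hi):
    """Exact number of distinct colors among points inside the closed
    rectangle [x_lo, x_hi] x [y_lo, y_hi].

    points is an iterable of (x, y, color) tuples. Linear scan per query.
    """
    return len({
        color
        for (x, y, color) in points
        if x_lo <= x <= x_hi and y_lo <= y <= y_hi
    })


def is_acceptable_estimate(estimate, exact, eps):
    """Assumed criterion: exact <= estimate <= (1 + eps) * exact."""
    return exact <= estimate <= (1 + eps) * exact


if __name__ == "__main__":
    pts = [(1, 1, "red"), (2, 3, "red"), (3, 2, "blue"), (9, 9, "green")]
    exact = count_colors_in_range(pts, 0, 5, 0, 5)
    print(exact, is_acceptable_estimate(2, exact, 0.5))
```

Note that two red points inside the rectangle contribute a single color, which is what distinguishes the colored problem from ordinary range counting.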
Theoretically Efficient Parallel Graph Algorithms Can Be Fast and Scalable
There has been significant recent interest in parallel graph processing due
to the need to quickly analyze the large graphs available today. Many graph
codes have been designed for distributed memory or external memory. However,
today even the largest publicly-available real-world graph (the Hyperlink Web
graph with over 3.5 billion vertices and 128 billion edges) can fit in the
memory of a single commodity multicore server. Nevertheless, most experimental
work in the literature reports results on much smaller graphs, and the ones for
the Hyperlink graph use distributed or external memory. Therefore, it is
natural to ask whether we can efficiently solve a broad class of graph problems
on this graph in memory.
This paper shows that theoretically-efficient parallel graph algorithms can
scale to the largest publicly-available graphs using a single machine with a
terabyte of RAM, processing them in minutes. We give implementations of
theoretically-efficient parallel algorithms for 20 important graph problems. We
also present the optimizations and techniques that we used in our
implementations, which were crucial in enabling us to process these large
graphs quickly. We show that the running times of our implementations
outperform existing state-of-the-art implementations on the largest real-world
graphs. For many of the problems that we consider, this is the first time they
have been solved on graphs at this scale. We have made the implementations
developed in this work publicly-available as the Graph-Based Benchmark Suite
(GBBS).
Comment: This is the full version of the paper appearing in the ACM Symposium
on Parallelism in Algorithms and Architectures (SPAA), 201
Efficient Data Structures for Text Processing Applications
This thesis is devoted to designing and analyzing efficient text indexing data structures and associated algorithms for processing text data. The general problem is to preprocess a given text or a collection of texts into a space-efficient index to quickly answer various queries on this data. Basic queries such as counting/reporting a given pattern's occurrences as substrings of the original text are useful in modeling critical bioinformatics applications. This line of research has witnessed many breakthroughs, such as suffix trees, suffix arrays, the FM-index, etc. In this work, we revisit the following problems:
1. The Heaviest Induced Ancestors problem
2. Range Longest Common Prefix problem
3. Range Shortest Unique Substrings problem
4. Non-Overlapping Indexing problem
For the first problem, we present two new space-time trade-offs that improve the space, query time, or both of the existing solutions by roughly a logarithmic factor. For the second problem, our solution takes linear space, which improves the previous result by a logarithmic factor. The techniques developed are then extended to obtain an efficient solution for our third problem, which is newly formulated. Finally, we present a new framework that yields efficient solutions for the last problem in both cache-aware and cache-oblivious models.
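To illustrate the last of these problems, the sketch below reports a largest set of pairwise non-overlapping occurrences of a pattern by scanning the text directly. It assumes the usual reading of non-overlapping indexing (report a maximum set of occurrences of P in T such that no two overlap), and it relies on the fact that greedily taking the leftmost available occurrence is optimal because all occurrences have the same length. It is a scan-based baseline rather than the thesis's index, and the function name is hypothetical.

```python
def non_overlapping_occurrences(text, pattern):
    """Return start positions of a maximum set of pairwise
    non-overlapping occurrences of pattern in text (greedy, leftmost first)."""
    if not pattern:
        raise ValueError("pattern must be non-empty")
    positions = []
    start = text.find(pattern)
    while start != -1:
        positions.append(start)
        # Resume the search just past the occurrence that was taken.
        start = text.find(pattern, start + len(pattern))
    return positions


if __name__ == "__main__":
    print(non_overlapping_occurrences("aaaaa", "aa"))
    print(non_overlapping_occurrences("abababa", "aba"))
```

In "abababa" the pattern "aba" occurs at positions 0, 2, and 4, but only two of those occurrences can be chosen without overlap.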