Optimal Substring-Equality Queries with Applications to Sparse Text Indexing
We consider the problem of encoding a string of length $n$ from an integer
alphabet of size $\sigma$ so that access and substring equality queries (that
is, determining the equality of any two substrings) can be answered
efficiently. Any uniquely-decodable encoding supporting access must take
$n\log\sigma + \Theta(\log(n\log\sigma))$ bits. We describe a new data
structure matching this lower bound when $\sigma \leq n^{O(1)}$ while supporting
both queries in optimal $O(1)$ time. Furthermore, we show that the string can
be overwritten in-place with this structure. The redundancy of $\Theta(\log n)$
bits and the constant query time break exponentially a lower bound that is
known to hold in the read-only model. Using our new string representation, we
obtain the first in-place subquadratic (indeed, even sublinear in some cases)
algorithms for several string-processing problems in the restore model: the
input string is rewritable and must be restored before the computation
terminates. In particular, we describe the first in-place subquadratic Monte
Carlo solutions to the sparse suffix sorting, sparse LCP array construction,
and suffix selection problems. With the sole exception of suffix selection, our
algorithms are also the first running in sublinear time for small enough sets
of input suffixes. Combining these solutions, we obtain the first
sublinear-time Monte Carlo algorithm for building the sparse suffix tree in
compact space. We also show how to derandomize our algorithms using small
space. This leads to the first Las Vegas in-place algorithm computing the full
LCP array in $O(n\log n)$ time and to the first Las Vegas in-place algorithms
solving the sparse suffix sorting and sparse LCP array construction problems in
$O(n^{1.5}\sqrt{\log\sigma})$ time. Running times of these Las Vegas
algorithms hold in the worst case with high probability.
Comment: Refactored according to TALG's reviews. New w.h.p. bounds and Las
Vegas algorithm
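To make the notion of a Monte Carlo substring-equality query concrete, the following is a minimal sketch using Karp-Rabin prefix fingerprints. This is a standard textbook technique swapped in for illustration; the abstract does not say which fingerprinting scheme the authors use, and this sketch stores O(n) extra words rather than overwriting the string in place. The class name `SubstringEquality`, the modulus, and the `seed` parameter are all assumptions.

```python
import random

# Assumption: a 61-bit Mersenne prime as the fingerprint modulus.
MOD = (1 << 61) - 1


class SubstringEquality:
    """Monte Carlo substring-equality queries via Karp-Rabin fingerprints.

    Illustrative only: uses O(n) words of extra space, unlike the
    in-place structure described in the abstract.
    """

    def __init__(self, text, seed=None):
        rng = random.Random(seed)
        # A random base makes collisions unlikely for any fixed input.
        self.base = rng.randrange(256, MOD - 1)
        n = len(text)
        self.prefix = [0] * (n + 1)  # prefix[i] = fingerprint of text[:i]
        self.power = [1] * (n + 1)   # power[i] = base**i mod MOD
        for i, ch in enumerate(text):
            self.prefix[i + 1] = (self.prefix[i] * self.base + ord(ch) + 1) % MOD
            self.power[i + 1] = (self.power[i] * self.base) % MOD

    def fingerprint(self, start, length):
        """Fingerprint of text[start:start + length], from two prefix values."""
        end = start + length
        return (self.prefix[end] - self.prefix[start] * self.power[length]) % MOD

    def equal(self, i, j, length):
        """True if the two substrings are equal.

        Equal substrings always compare equal; unequal substrings may
        collide with small probability (the Monte Carlo error).
        """
        return self.fingerprint(i, length) == self.fingerprint(j, length)


index = SubstringEquality("abracadabra", seed=1)
print(index.equal(0, 7, 4))  # compares "abra" at 0 with "abra" at 7
print(index.equal(0, 1, 3))  # compares "abr" with "bra"
```

Each query reads a constant number of precomputed values, which is what makes constant-time comparison possible once the fingerprints exist; the result is one-sided in that equal substrings are never reported as different.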
Communication Efficient Checking of Big Data Operations
We propose fast probabilistic algorithms with low (i.e., sublinear in the
input size) communication volume to check the correctness of operations in Big
Data processing frameworks and distributed databases. Our checkers cover many
of the commonly used operations, including sum, average, median, and minimum
aggregation, as well as sorting, union, merge, and zip. An experimental
evaluation of our implementation in Thrill (Bingmann et al., 2016) confirms the
low overhead and high failure detection rate predicted by theoretical analysis.
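As a rough illustration of what a low-communication checker for sorting could look like, the sketch below simulates several workers in one process. Each worker reduces its chunk to a single salted hash sum, so only one number per chunk would need to be exchanged to test that the output is a permutation of the input. This is a generic hash-sum permutation check, not the checker from the paper or anything from the Thrill API; the function names, the modulus, and the choice of BLAKE2b are assumptions.

```python
import hashlib
import os

# Assumption: fingerprints are summed modulo a 61-bit Mersenne prime.
MOD = (1 << 61) - 1


def element_hash(x, salt):
    """Salted 64-bit hash of one element, reduced modulo MOD."""
    digest = hashlib.blake2b(repr(x).encode(), key=salt, digest_size=8).digest()
    return int.from_bytes(digest, "big") % MOD


def local_fingerprint(chunk, salt):
    """One number per worker: the sum of element hashes in its chunk."""
    return sum(element_hash(x, salt) for x in chunk) % MOD


def check_sort(input_chunks, output_chunks, salt):
    """Probabilistically check that the output is the sorted input.

    Only per-chunk fingerprints and chunk boundary elements are combined
    across workers, so the exchanged data is independent of chunk size.
    """
    in_fp = sum(local_fingerprint(c, salt) for c in input_chunks) % MOD
    out_fp = sum(local_fingerprint(c, salt) for c in output_chunks) % MOD
    if in_fp != out_fp:
        return False  # output is certainly not a permutation of the input
    previous_last = None
    for chunk in output_chunks:
        # Local check: each chunk is sorted internally.
        if any(a > b for a, b in zip(chunk, chunk[1:])):
            return False
        # Boundary check: chunks are ordered relative to each other.
        if chunk:
            if previous_last is not None and previous_last > chunk[0]:
                return False
            previous_last = chunk[-1]
    return True


salt = os.urandom(16)  # fresh randomness, chosen after the operation ran
print(check_sort([[3, 1], [2, 5, 4]], [[1, 2], [3, 4, 5]], salt))
```

A mismatch in the fingerprints proves the output is wrong, while a match only indicates correctness with high probability, since two different multisets can collide under the hash sum.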
A New Approach to Speeding Up Topic Modeling
Latent Dirichlet allocation (LDA) is a widely-used probabilistic topic
modeling paradigm that has recently found many applications in computer vision and
computational biology. In this paper, we propose a fast and accurate batch
algorithm, active belief propagation (ABP), for training LDA. Usually batch LDA
algorithms require repeated scanning of the entire corpus and searching the
complete topic space. To process massive corpora having a large number of
topics, the training iteration of batch LDA algorithms is often inefficient and
time-consuming. To accelerate training, ABP actively scans a subset
of the corpus and searches a subset of the topic space for topic modeling, thereby
saving substantial training time in each iteration. To ensure accuracy, ABP selects
only those documents and topics that contribute to the largest residuals within
the residual belief propagation (RBP) framework. On four real-world corpora,
ABP performs around 10 to 100 times faster than state-of-the-art batch LDA
algorithms with a comparable topic modeling accuracy.
Comment: 14 pages, 12 figures
Data-Oblivious Graph Algorithms in Outsourced External Memory
Motivated by privacy preservation for outsourced data, data-oblivious
external memory is a computational framework where a client performs
computations on data stored at a semi-trusted server in a way that does not
reveal her data to the server. This approach facilitates collaboration and
reliability over traditional frameworks, and it provides privacy protection,
even though the server has full access to the data and he can monitor how it is
accessed by the client. The challenge is that even if data is encrypted, the
server can learn information based on the client data access pattern; hence,
access patterns must also be obfuscated. We investigate privacy-preserving
algorithms for outsourced external memory that are based on the use of
data-oblivious algorithms, that is, algorithms where each possible sequence of
data accesses is independent of the data values. We give new efficient
data-oblivious algorithms in the outsourced external memory model for a number
of fundamental graph problems. Our results include new data-oblivious
external-memory methods for constructing minimum spanning trees, performing
various traversals on rooted trees, answering least common ancestor queries on
trees, computing biconnected components, and forming open ear decompositions.
None of our algorithms make use of constant-time random oracles.
Comment: 20 pages
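The data-oblivious property described above, where the sequence of accesses does not depend on the data values, can be illustrated with a small in-memory example. The sketch below uses odd-even transposition sort, a classic sorting network, and records which positions are touched. It is not one of the paper's graph algorithms and ignores the external-memory setting; the function name and the `trace` parameter are assumptions made for illustration.

```python
def oblivious_sort(values, trace=None):
    """Odd-even transposition sort with a data-independent access pattern.

    The pairs of positions compared depend only on len(values), never on
    the contents, so an observer of the accesses learns nothing about
    the data. If `trace` is a list, each accessed pair is appended to it.
    """
    a = list(values)
    n = len(a)
    for round_index in range(n):
        # Even rounds touch pairs (0,1), (2,3), ...; odd rounds (1,2), (3,4), ...
        for i in range(round_index % 2, n - 1, 2):
            if trace is not None:
                trace.append((i, i + 1))
            # Compare-exchange: both cells are always read and written,
            # whether or not the values actually swap.
            low, high = min(a[i], a[i + 1]), max(a[i], a[i + 1])
            a[i], a[i + 1] = low, high
    return a


trace_a, trace_b = [], []
print(oblivious_sort([5, 2, 9, 1], trace_a))
print(oblivious_sort([1, 2, 5, 9], trace_b))
print(trace_a == trace_b)  # same access sequence for different inputs
```

A sort such as quicksort would fail this test, because its comparisons and swaps follow the data; hiding the access pattern here costs a quadratic number of compare-exchange steps, which is why efficient oblivious algorithms are a research topic in their own right.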