Dynamic Set Intersection
Consider the problem of maintaining a family of dynamic sets subject to
insertions, deletions, and set-intersection reporting queries: given two sets in the family, report every member of their intersection in any order. We show that in the word
RAM model, where $w$ is the word size, given a cap on the maximum size of
any set, we can support set intersection queries in
expected time, and updates in expected time. Using this algorithm
we can list all triangles of a graph in
expected time, where and
is the arboricity of the graph. This improves a 30-year-old triangle enumeration
algorithm of Chiba and Nishizeki running in time.
We provide an incremental data structure on the family that supports intersection
witness queries, where we only need to find one element of the intersection.
Both queries and insertions take $O\left(\sqrt{\frac{N}{w/\log^2 w}}\right)$ expected
time, where . Finally, we provide time/space tradeoffs for
the fully dynamic set intersection reporting problem. Using words of space,
each update costs expected time, each reporting query
costs expected time where
is the size of the output, and each witness query costs expected time.
Comment: Accepted to WADS 201
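The link between set intersection and triangle listing mentioned in this abstract can be illustrated with a small baseline. The sketch below is not the word-RAM data structure from the paper: it uses ordinary Python hash sets, and the function name `list_triangles` and the degree-based ordering are assumptions chosen for illustration. Each edge is oriented from its lower-ranked to its higher-ranked endpoint, and the triangles on an edge are exactly the intersection of the two endpoints' out-neighbor sets.

```python
from collections import defaultdict


def list_triangles(edges):
    """List every triangle of an undirected graph once, via set intersection."""
    adj = defaultdict(set)
    for u, v in edges:
        if u != v:  # ignore self-loops
            adj[u].add(v)
            adj[v].add(u)

    # Rank vertices by (degree, id); ties are broken by the vertex id.
    rank = {v: (len(adj[v]), v) for v in adj}

    # Keep only higher-ranked neighbors, so each triangle is found exactly once.
    out = {v: {u for u in adj[v] if rank[u] > rank[v]} for v in adj}

    triangles = []
    for u in out:
        for v in out[u]:
            # The third corner must be an out-neighbor of both u and v.
            for w in out[u] & out[v]:
                triangles.append(tuple(sorted((u, v, w))))
    return sorted(triangles)


if __name__ == "__main__":
    sample = [(0, 1), (1, 2), (0, 2), (2, 3), (3, 0), (3, 4)]
    print(list_triangles(sample))
```

The cost of this baseline is dominated by the `out[u] & out[v]` intersections, which is the step a faster set-intersection structure would replace.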
Handling Network Partitions and Mergers in Structured Overlay Networks
Structured overlay networks form a major class of peer-to-peer systems, which are touted for their abilities to
scale, tolerate failures, and self-manage. Any long-lived
Internet-scale distributed system is destined to face network partitions. Although the problem of network partitions
and mergers is highly related to fault-tolerance and
self-management in large-scale systems, it has hardly been
studied in the context of structured peer-to-peer systems.
These systems have mainly been studied under churn (frequent
joins/failures), which as a side effect solves the problem
of network partitions, as it is similar to massive node
failures. Yet, the crucial aspect of network mergers has been
ignored. In fact, it has been claimed that ring-based structured
overlay networks, which constitute the majority of the
structured overlays, are intrinsically ill-suited for merging
rings. In this paper, we present an algorithm for merging
multiple similar ring-based overlays when the underlying
network merges. We examine the solution in dynamic conditions,
showing how our solution is resilient to churn during
the merger, something widely believed to be difficult or
impossible. We evaluate the algorithm for various scenarios
and show that even when falsely detecting a merger, the
algorithm quickly terminates and does not clutter the network
with many messages. The algorithm is flexible as the
tradeoff between message complexity and time complexity
can be adjusted by a parameter.
S-Store: Streaming Meets Transaction Processing
Stream processing addresses the needs of real-time applications. Transaction
processing addresses the coordination and safety of short atomic computations.
Heretofore, these two modes of operation existed in separate, stove-piped
systems. In this work, we attempt to fuse the two computational paradigms in a
single system called S-Store. In this way, S-Store can simultaneously
accommodate OLTP and streaming applications. We present a simple transaction
model for streams that integrates seamlessly with a traditional OLTP system. We
chose to build S-Store as an extension of H-Store, an open-source, in-memory,
distributed OLTP database system. By implementing S-Store in this way, we can
make use of the transaction processing facilities that H-Store already
supports, and we can concentrate on the additional implementation features that
are needed to support streaming. Similar implementations could be done using
other main-memory OLTP platforms. We show that we can actually achieve higher
throughput for streaming workloads in S-Store than in an equivalent deployment in
H-Store alone. We also show how this can be achieved within H-Store with the
addition of a modest amount of new functionality. Furthermore, we compare
S-Store to two state-of-the-art streaming systems, Spark Streaming and Storm,
and show how S-Store matches and sometimes exceeds their performance while
providing stronger transactional guarantees.
Parallel Graph Connectivity in Log Diameter Rounds
We study the graph connectivity problem in the MPC model. On an undirected graph with
nodes and edges, round connectivity algorithms have been
known for over 35 years. However, no algorithms with better complexity bounds
were known. In this work, we give fully scalable, faster algorithms for the
connectivity problem, by parameterizing the time complexity as a function of
the diameter of the graph. Our main result is a
time connectivity algorithm for diameter- graphs, using total
memory. If our algorithm can use more memory, it can terminate in fewer rounds,
and there is no lower bound on the memory per processor.
We extend our results to related graph problems such as spanning forest,
finding a DFS sequence, exact/approximate minimum spanning forest, and
bottleneck spanning forest. We also show that achieving similar bounds for
reachability in directed graphs would imply faster boolean matrix
multiplication algorithms.
We introduce several new algorithmic ideas. We describe a general technique
called double exponential speed problem size reduction which roughly means that
if we can use total memory to reduce a problem from size to , for
in one phase, then we can solve the problem in
phases. In order to achieve this fast reduction for graph
connectivity, we use a multistep algorithm. One key step is a carefully
constructed truncated broadcasting scheme where each node broadcasts neighbor
sets to its neighbors in a way that limits the size of the resulting neighbor
sets. Another key step is random leader contraction, where we choose a smaller
set of leaders than many previous works do.
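The random leader contraction step named in this abstract can be sketched sequentially. The code below is a textbook-style simulation, not the paper's MPC algorithm: the function name `connected_components`, the leader probability of 1/2, and the `seed` parameter are assumptions made for illustration (the abstract states that the paper chooses a smaller leader set). In each round every super-node flips a coin, and each non-leader adjacent to a leader is contracted into one such leader.

```python
import random


def connected_components(n, edges, seed=0):
    """Label vertices 0..n-1 so that two vertices share a label iff connected."""
    rng = random.Random(seed)
    label = list(range(n))  # label[v] = super-node currently containing v
    live = {(u, v) for u, v in edges if u != v}  # edges between super-nodes

    while live:
        nodes = {x for edge in live for x in edge}
        # Assumption: each super-node becomes a leader with probability 1/2.
        leader = {x: rng.random() < 0.5 for x in nodes}

        target = {}
        for a, b in live:
            for x, y in ((a, b), (b, a)):
                if not leader[x] and leader[y]:
                    target.setdefault(x, y)  # x joins one adjacent leader

        # Leaders never move, so a single relabeling hop is enough.
        label = [target.get(l, l) for l in label]
        live = {(target.get(a, a), target.get(b, b)) for a, b in live}
        live = {(a, b) for a, b in live if a != b}  # drop contracted edges
    return label


if __name__ == "__main__":
    print(connected_components(6, [(0, 1), (1, 2), (3, 4)]))
```

The loop ends once no edge joins two different super-nodes, at which point every original edge has both endpoints under the same label.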
Dynamic Relative Compression, Dynamic Partial Sums, and Substring Concatenation
Given a static reference string R and a source string S, a relative compression of S with respect to R is an encoding of S as a sequence of references to substrings of R. Relative compression schemes are a classic model of compression and have recently proved very successful for compressing highly repetitive massive data sets such as genomes and web data. We initiate the study of relative compression in a dynamic setting where the compressed source string S is subject to edit operations. The goal is to maintain the compressed representation compactly, while supporting edits and allowing efficient random access to the (uncompressed) source string. We present new data structures that achieve optimal time for updates and queries while using space linear in the size of the optimal relative compression, for nearly all combinations of parameters. We also present solutions for restricted and extended sets of updates. To achieve these results, we revisit the dynamic partial sums problem and the substring concatenation problem. We present new optimal or near-optimal bounds for these problems. Plugging in our new results, we also immediately obtain new bounds for the string indexing for patterns with wildcards problem and the dynamic text and static pattern matching problem.
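The definition of relative compression in this abstract can be made concrete with a small static example. The sketch below is not one of the paper's data structures: the function names `relative_compress` and `access`, the `(start, length)` phrase format, and the naive longest-match search are all assumptions made for illustration. It encodes S greedily as references into R and then answers random-access queries through prefix sums of the phrase lengths, which hints at why partial sums arise once the phrases start changing.

```python
from bisect import bisect_right
from itertools import accumulate


def relative_compress(source, reference):
    """Greedily encode `source` as (start, length) references into `reference`."""
    phrases = []
    i = 0
    while i < len(source):
        best_start, best_len = -1, 0
        # Naive longest-match search; a suffix structure would make this fast.
        for start in range(len(reference)):
            length = 0
            while (start + length < len(reference)
                   and i + length < len(source)
                   and reference[start + length] == source[i + length]):
                length += 1
            if length > best_len:
                best_start, best_len = start, length
        if best_len == 0:
            raise ValueError(f"{source[i]!r} does not occur in the reference")
        phrases.append((best_start, best_len))
        i += best_len
    return phrases


def access(phrases, reference, position):
    """Return source[position] without decompressing the whole source."""
    # ends[k] is the total source length covered by phrases 0..k.
    ends = list(accumulate(length for _, length in phrases))
    k = bisect_right(ends, position)  # phrase containing `position`
    phrase_start = ends[k] - phrases[k][1]
    return reference[phrases[k][0] + (position - phrase_start)]


if __name__ == "__main__":
    R, S = "ACGTACGT", "GTACAC"
    parsed = relative_compress(S, R)
    print(parsed)
    print("".join(access(parsed, R, p) for p in range(len(S))))
```

Here `access` recomputes the prefix sums on every call for brevity; a dynamic setting would instead keep them in a structure that supports updates.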
GPU accelerating distributed succinct de Bruijn graph construction
The research and methods in the field of computational biology have grown in the last decades, thanks to the availability of biological data. One of the applications in computational biology is genome sequencing or sequence alignment, a method to arrange sequences of, for example, DNA or RNA, to determine regions of similarity between these sequences. Sequence alignment applications include public health purposes, such as monitoring antimicrobial resistance.
Demand for fast sequence alignment has led to the usage of data structures, such as the de Bruijn graph, to store a large amount of information efficiently. De Bruijn graphs are currently one of the top data structures used in indexing genome sequences, and different methods to represent them have been explored. One of these methods is the BOSS data structure, a special case of Wheeler graph index, which uses succinct data structures to represent a de Bruijn graph.
As genomes can take a large amount of space, the construction of succinct de Bruijn graphs is slow. This has led to experimental research on using large-scale cluster engines such as Apache Spark and Graphics Processing Units (GPUs) in genome data processing.
This thesis explores the use of Apache Spark and Spark RAPIDS, a GPU computing library for Apache Spark, in the construction of a succinct de Bruijn graph index from genome sequences. The experimental results indicate that Spark RAPIDS can speed up specific operations by up to 8 times, but that for some other operations it has severe limitations that restrict its processing power for succinct de Bruijn graph index construction.
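For readers unfamiliar with the underlying graph, the following sketch builds a plain (non-succinct) de Bruijn graph in memory. It is not the thesis's Spark or BOSS pipeline: the function name `de_bruijn_graph` and the node-centric convention, where nodes are (k-1)-mers and each k-mer contributes one edge, are assumptions chosen for illustration.

```python
from collections import defaultdict


def de_bruijn_graph(sequences, k):
    """Map each (k-1)-mer node to the sorted list of nodes it has an edge to."""
    successors = defaultdict(set)
    for seq in sequences:
        # Slide a window of length k over the sequence.
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            # The k-mer is an edge from its prefix to its suffix.
            successors[kmer[:-1]].add(kmer[1:])
    return {node: sorted(nxt) for node, nxt in successors.items()}


if __name__ == "__main__":
    for node, nxt in de_bruijn_graph(["ACGTACG"], 3).items():
        print(node, "->", nxt)
```

A succinct index such as BOSS stores the same edge information in compressed form, which is the expensive construction step the thesis targets.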