Dynamic Relative Compression, Dynamic Partial Sums, and Substring Concatenation

Author: Bille Philip
Cording Patrick Hagge
Gørtz Inge Li
Skjoldjensen Frederik Rye
Vildhøj Hjalte Wedel
Vind Søren
Publication date: 01/01/2016

Given a static reference string R and a source string S, a relative compression of S with respect to R is an encoding of S as a sequence of references to substrings of R. Relative compression schemes are a classic model of compression and have recently proved very successful for compressing highly-repetitive massive data sets such as genomes and web-data. We initiate the study of relative compression in a dynamic setting where the compressed source string S is subject to edit operations. The goal is to maintain the compressed representation compactly, while supporting edits and allowing efficient random access to the (uncompressed) source string. We present new data structures that achieve optimal time for updates and queries while using space linear in the size of the optimal relative compression, for nearly all combinations of parameters. We also present solutions for restricted and extended sets of updates. To achieve these results, we revisit the dynamic partial sums problem and the substring concatenation problem. We present new optimal or near optimal bounds for these problems. Plugging in our new results we also immediately obtain new bounds for the string indexing for patterns with wildcards problem and the dynamic text and static pattern matching problem.
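As a rough illustration of the encoding described in this abstract, the Python sketch below greedily factors S into (start, length) references into R and answers random-access queries with prefix sums over the phrase lengths. It is a static toy version, not the dynamic data structure from the paper. The names `relative_compress`, `decode`, and `access` are hypothetical, the greedy parse is deliberately naive, and it assumes every character of S occurs somewhere in R.

```python
from bisect import bisect_right
from itertools import accumulate


def relative_compress(R, S):
    """Greedily factor S into (start, length) references to substrings of R."""
    phrases = []
    i = 0
    while i < len(S):
        start = R.find(S[i])
        if start < 0:
            # Assumption of this sketch: R covers the alphabet of S.
            raise ValueError(f"character {S[i]!r} does not occur in R")
        length = 1
        # Extend the phrase while the longer prefix of S[i:] still occurs in R.
        while i + length < len(S):
            pos = R.find(S[i:i + length + 1])
            if pos < 0:
                break
            start, length = pos, length + 1
        phrases.append((start, length))
        i += length
    return phrases


def decode(R, phrases):
    """Rebuild S by concatenating the referenced substrings of R."""
    return "".join(R[start:start + length] for start, length in phrases)


def access(R, phrases, ends, i):
    """Return S[i] without decompressing; ends[p] is the end offset of phrase p in S."""
    p = bisect_right(ends, i)            # index of the phrase covering position i
    start, length = phrases[p]
    offset = i - (ends[p] - length)      # position of i inside that phrase
    return R[start + offset]


R = "ACGTACGGA"
S = "ACGGTACG"
phrases = relative_compress(R, S)
ends = list(accumulate(length for _, length in phrases))
```

With these inputs the parse consists of two phrases, and `access(R, phrases, ends, i)` agrees with `S[i]` for every position. Supporting edits to S means the phrase lengths change, which is where a dynamic partial sums structure would replace the static `ends` list.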

arXiv.org e-Print Archive

Online Research Database In Technology

Dynamic Set Intersection

Author: A Björklund
A Brodnik
A Itai
G Myers
H Cohen
I Baran
ML Fredman
N Chiba
P Bille
R Baeza-Yates
S Albers
TM Chan
WJ Masek
Publication date: 04/05/2015

Consider the problem of maintaining a family $F$ of dynamic sets subject to insertions, deletions, and set-intersection reporting queries: given $S, S' \in F$, report every member of $S \cap S'$ in any order. We show that in the word RAM model, where $w$ is the word size, given a cap $d$ on the maximum size of any set, we can support set intersection queries in $O(\frac{d}{w/\log^2 w})$ expected time, and updates in $O(\log w)$ expected time. Using this algorithm we can list all $t$ triangles of a graph $G=(V,E)$ in $O(m+\frac{m\alpha}{w/\log^2 w}+t)$ expected time, where $m=|E|$ and $\alpha$ is the arboricity of $G$. This improves a 30-year-old triangle enumeration algorithm of Chiba and Nishizeki running in $O(m\alpha)$ time. We provide an incremental data structure on $F$ that supports intersection witness queries, where we only need to find one $e \in S \cap S'$. Both queries and insertions take $O\left(\sqrt{\frac{N}{w/\log^2 w}}\right)$ expected time, where $N=\sum_{S\in F}|S|$. Finally, we provide time/space tradeoffs for the fully dynamic set intersection reporting problem. Using $M$ words of space, each update costs $O(\sqrt{M\log N})$ expected time, each reporting query costs $O(\frac{N\sqrt{\log N}}{\sqrt M}\sqrt{op+1})$ expected time, where $op$ is the size of the output, and each witness query costs $O(\frac{N\sqrt{\log N}}{\sqrt M}+\log N)$ expected time.

Comment: Accepted to WADS 201
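To make the link between set intersection and triangle listing concrete, here is a minimal Python sketch that orients each edge from its lower-ranked to its higher-ranked endpoint and intersects out-neighbour sets. It uses ordinary Python `set` intersection, not the word-packed structure from the abstract, so it does not achieve the stated bounds. The function name `list_triangles` and the (degree, id) ranking are choices made for this sketch, and vertex ids are assumed to be mutually comparable (for example, integers).

```python
def list_triangles(edges):
    """List each triangle once by intersecting oriented neighbour sets."""
    adj = {}
    for u, v in edges:
        if u == v:
            continue  # ignore self-loops
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)

    # Rank vertices by (degree, id) and keep only edges pointing "upwards".
    rank = {v: (len(adj[v]), v) for v in adj}
    out = {v: {u for u in adj[v] if rank[u] > rank[v]} for v in adj}

    triangles = []
    for u in out:
        for v in out[u]:
            # Every common out-neighbour w closes a triangle u, v, w.
            for w in out[u] & out[v]:
                triangles.append(tuple(sorted((u, v, w))))
    return triangles


# Complete graph on four vertices: every triple of vertices is a triangle.
k4 = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]
k4_triangles = sorted(list_triangles(k4))
```

Each triangle is reported exactly once, from its lowest-ranked vertex. The intersection `out[u] & out[v]` is the step a faster set-intersection structure would accelerate.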

arXiv.org e-Print Archive

Crossref

Matching and Compression of Strings with Automata and Word Packing

Author: Skjoldjensen Frederik Rye
Publication venue: DTU Compute
Publication date: 01/01/2017

Online Research Database In Technology

Handling Network Partitions and Mergers in Structured Overlay Networks

Author: Ghodsi Ali
Haridi Seif
Shafaat Tallat M.
Publication date: 01/01/2007

Structured overlay networks form a major class of peer-to-peer systems, which are touted for their abilities to scale, tolerate failures, and self-manage. Any long-lived Internet-scale distributed system is destined to face network partitions. Although the problem of network partitions and mergers is highly related to fault-tolerance and self-management in large-scale systems, it has hardly been studied in the context of structured peer-to-peer systems. These systems have mainly been studied under churn (frequent joins/failures), which as a side effect solves the problem of network partitions, as it is similar to massive node failures. Yet, the crucial aspect of network mergers has been ignored. In fact, it has been claimed that ring-based structured overlay networks, which constitute the majority of the structured overlays, are intrinsically ill-suited for merging rings. In this paper, we present an algorithm for merging multiple similar ring-based overlays when the underlying network merges. We examine the solution in dynamic conditions, showing how our solution is resilient to churn during the merger, something widely believed to be difficult or impossible. We evaluate the algorithm for various scenarios and show that even when falsely detecting a merger, the algorithm quickly terminates and does not clutter the network with many messages. The algorithm is flexible as the tradeoff between message complexity and time complexity can be adjusted by a parameter.

CiteSeerX

Crossref

RISE – Research Institutes of Sweden

Digitala Vetenskapliga Arkivet - Academic Archive On-line

Swedish Institute of Computer Science Publications Database

Software institutes' Online Digital Archive

S-Store: Streaming Meets Transaction Processing

Author: Aslantas Cansu
Cetintemel Ugur
Du Jiang
Kraska Tim
Madden Samuel
Maier David
Meehan John
Pavlo Andrew
Stonebraker Michael
Tatbul Nesime
Tufte Kristin
Wang Hao
Zdonik Stan
Publication date: 01/01/2015

Stream processing addresses the needs of real-time applications. Transaction processing addresses the coordination and safety of short atomic computations. Heretofore, these two modes of operation existed in separate, stove-piped systems. In this work, we attempt to fuse the two computational paradigms in a single system called S-Store. In this way, S-Store can simultaneously accommodate OLTP and streaming applications. We present a simple transaction model for streams that integrates seamlessly with a traditional OLTP system. We chose to build S-Store as an extension of H-Store, an open-source, in-memory, distributed OLTP database system. By implementing S-Store in this way, we can make use of the transaction processing facilities that H-Store already supports, and we can concentrate on the additional implementation features that are needed to support streaming. Similar implementations could be done using other main-memory OLTP platforms. We show that we can actually achieve higher throughput for streaming workloads in S-Store than an equivalent deployment in H-Store alone. We also show how this can be achieved within H-Store with the addition of a modest amount of new functionality. Furthermore, we compare S-Store to two state-of-the-art streaming systems, Spark Streaming and Storm, and show how S-Store matches and sometimes exceeds their performance while providing stronger transactional guarantees.

arXiv.org e-Print Archive

CiteSeerX

DSpace@MIT

Crossref

PDXScholar (Portland State University)

Parallel Graph Connectivity in Log Diameter Rounds

Author: Andoni Alexandr
Song Zhao
Stein Clifford
Wang Zhengyu
Zhong Peilin
Publication date: 08/05/2018

We study the graph connectivity problem in the MPC model. On an undirected graph with $n$ nodes and $m$ edges, $O(\log n)$ round connectivity algorithms have been known for over 35 years. However, no algorithms with better complexity bounds were known. In this work, we give fully scalable, faster algorithms for the connectivity problem, by parameterizing the time complexity as a function of the diameter of the graph. Our main result is an $O(\log D \log\log_{m/n} n)$ time connectivity algorithm for diameter-$D$ graphs, using $\Theta(m)$ total memory. If our algorithm can use more memory, it can terminate in fewer rounds, and there is no lower bound on the memory per processor. We extend our results to related graph problems such as spanning forest, finding a DFS sequence, exact/approximate minimum spanning forest, and bottleneck spanning forest. We also show that achieving similar bounds for reachability in directed graphs would imply faster boolean matrix multiplication algorithms. We introduce several new algorithmic ideas. We describe a general technique called double exponential speed problem size reduction, which roughly means that if we can use total memory $N$ to reduce a problem from size $n$ to $n/k$, for $k=(N/n)^{\Theta(1)}$ in one phase, then we can solve the problem in $O(\log\log_{N/n} n)$ phases. In order to achieve this fast reduction for graph connectivity, we use a multistep algorithm. One key step is a carefully constructed truncated broadcasting scheme where each node broadcasts neighbor sets to its neighbors in a way that limits the size of the resulting neighbor sets. Another key step is random leader contraction, where we choose a smaller set of leaders than many previous works do.
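The idea of random leader contraction can be sketched sequentially in Python. The version below is the classic coin-flip scheme, in which every super-node becomes a leader with probability 1/2, and not the paper's variant with a smaller leader set or its truncated broadcasting step. It also runs on a single machine rather than in the MPC model. The function name `connected_components`, the `seed` parameter, and the leader probability are assumptions made for illustration.

```python
import random


def connected_components(n, edges, seed=0):
    """Label vertices 0..n-1 so that two vertices share a label iff connected."""
    rng = random.Random(seed)
    label = list(range(n))                       # current super-node of each vertex
    live = {(u, v) for u, v in edges if u != v}  # edges between distinct super-nodes

    while live:
        nodes = {x for edge in live for x in edge}
        # Classic variant: each super-node is a leader with probability 1/2.
        leader = {x: rng.random() < 0.5 for x in nodes}

        # Every non-leader adjacent to a leader contracts into one such leader.
        parent = {}
        for u, v in live:
            for a, b in ((u, v), (v, u)):
                if not leader[a] and leader[b]:
                    parent.setdefault(a, b)

        # Leaders never move, so a single relabelling step suffices per round.
        label = [parent.get(x, x) for x in label]
        live = {(parent.get(u, u), parent.get(v, v)) for u, v in live}
        live = {(u, v) for u, v in live if u != v}

    return label


labels = connected_components(7, [(0, 1), (1, 2), (3, 4), (5, 5)])
```

Each round contracts a constant fraction of the remaining super-nodes in expectation, which is what gives the familiar logarithmic round count that the abstract sets out to improve on.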

arXiv.org e-Print Archive

Crossref

Dynamic Relative Compression, Dynamic Partial Sums, and Substring Concatenation

Author: Bille Philip
Cording Patrick Hagge
Skjoldjensen Frederik Rye
Publication venue: LIPIcs - Leibniz International Proceedings in Informatics. 27th International Symposium on Algorithms and Computation (ISAAC 2016)
Publication date: 01/01/2016

Given a static reference string R and a source string S, a relative compression of S with respect to R is an encoding of S as a sequence of references to substrings of R. Relative compression schemes are a classic model of compression and have recently proved very successful for compressing highly-repetitive massive data sets such as genomes and web-data. We initiate the study of relative compression in a dynamic setting where the compressed source string S is subject to edit operations. The goal is to maintain the compressed representation compactly, while supporting edits and allowing efficient random access to the (uncompressed) source string. We present new data structures that achieve optimal time for updates and queries while using space linear in the size of the optimal relative compression, for nearly all combinations of parameters. We also present solutions for restricted and extended sets of updates. To achieve these results, we revisit the dynamic partial sums problem and the substring concatenation problem. We present new optimal or near optimal bounds for these problems. Plugging in our new results we also immediately obtain new bounds for the string indexing for patterns with wildcards problem and the dynamic text and static pattern matching problem.
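The dynamic partial sums problem named in this abstract asks for prefix sums, and searches over them, on a sequence whose entries change. A minimal Python sketch using a textbook Fenwick (binary indexed) tree follows. This is a standard structure with logarithmic-time operations, not the new bounds claimed by the paper. The class name `Fenwick` and its method names are hypothetical, and `search` assumes all stored values are positive, as phrase lengths would be.

```python
class Fenwick:
    """Fenwick tree over a fixed-length sequence of positive integers."""

    def __init__(self, values):
        self.n = len(values)
        self.tree = [0] * (self.n + 1)
        for i, v in enumerate(values):
            self.update(i, v)

    def update(self, i, delta):
        """Add delta to the value at 0-based index i."""
        i += 1
        while i <= self.n:
            self.tree[i] += delta
            i += i & -i

    def prefix_sum(self, i):
        """Return the sum of the first i values."""
        total = 0
        while i > 0:
            total += self.tree[i]
            i -= i & -i
        return total

    def search(self, t):
        """Return the 0-based index of the value whose range covers offset t."""
        pos = 0
        step = 1 << self.n.bit_length()
        while step:
            nxt = pos + step
            if nxt <= self.n and self.tree[nxt] <= t:
                pos = nxt
                t -= self.tree[nxt]
            step >>= 1
        return pos


lengths = Fenwick([3, 2, 4])   # e.g. phrase lengths of a compressed string
before = lengths.search(3)     # offset 3 falls in the second phrase
lengths.update(0, 2)           # first phrase grows from 3 to 5 characters
after = lengths.search(3)      # offset 3 now falls in the first phrase
```

In a relative compression, `search` locates the phrase containing a given position of S, and `update` adjusts a phrase length after an edit.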

Dagstuhl Research Online Publication Server

GPU accelerating distributed succinct de Bruijn graph construction

Author: Laanti Topi
Publication venue: Helsingfors universitet
Publication date: 01/01/2022

The research and methods in the field of computational biology have grown in the last decades, thanks to the availability of biological data. One of the applications in computational biology is genome sequencing or sequence alignment, a method to arrange sequences of, for example, DNA or RNA, to determine regions of similarity between these sequences. Sequence alignment applications include public health purposes, such as monitoring antimicrobial resistance. Demand for fast sequence alignment has led to the use of data structures, such as the de Bruijn graph, to store a large amount of information efficiently. De Bruijn graphs are currently one of the top data structures used in indexing genome sequences, and different methods to represent them have been explored. One of these methods is the BOSS data structure, a special case of the Wheeler graph index, which uses succinct data structures to represent a de Bruijn graph. As genomes can take a large amount of space, the construction of succinct de Bruijn graphs is slow. This has led to experimental research on using large-scale cluster engines such as Apache Spark and Graphics Processing Units (GPUs) in genome data processing. This thesis explores the use of Apache Spark and Spark RAPIDS, a GPU computing library for Apache Spark, in the construction of a succinct de Bruijn graph index from genome sequences. The experimental results indicate that Spark RAPIDS can provide speedups of up to 8 times for specific operations, but for some other operations it has severe limitations that restrict its processing power for succinct de Bruijn graph index construction.
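For readers unfamiliar with the underlying object, the following Python sketch builds a plain (non-succinct) de Bruijn graph from a set of reads: every k-mer becomes an edge from its (k-1)-length prefix to its (k-1)-length suffix. It is a single-machine dictionary representation, not the BOSS structure and not the Spark or Spark RAPIDS pipeline studied in the thesis. The function name `de_bruijn` and the example reads are made up for illustration.

```python
from collections import defaultdict


def de_bruijn(reads, k):
    """Map each (k-1)-mer to the set of (k-1)-mers that follow it in some k-mer."""
    graph = defaultdict(set)
    for read in reads:
        # Slide a window of width k over the read.
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])   # edge: prefix -> suffix
    return dict(graph)


graph = de_bruijn(["ACGT", "CGTA"], k=3)
```

Here the two reads share the 3-mer `CGT`, so the graph is the single path AC, CG, GT, TA. A succinct index such as BOSS stores the same edge set in far less space, which is the part whose construction the thesis tries to speed up.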

Helsingin yliopiston digitaalinen arkisto