4,434 research outputs found
On Defining SPARQL with Boolean Tensor Algebra
The Resource Description Framework (RDF) represents information as
subject-predicate-object triples. These triples are commonly interpreted as a
directed labelled graph. We propose an alternative approach, interpreting the
data as a 3-way Boolean tensor. We show how SPARQL queries - the standard
queries for RDF - can be expressed as elementary operations in Boolean algebra,
giving us a complete re-interpretation of RDF and SPARQL. We show how the
Boolean tensor interpretation allows for new optimizations and analyses of the
complexity of SPARQL queries. For example, estimating the size of the results
for different join queries becomes much simpler.
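As a rough illustration of the tensor view described above, the sketch below stores a few triples in a 3-way Boolean NumPy array and evaluates a two-pattern join as a Boolean matrix product. The toy triples, the variable names, and the degree-based size computation are assumptions made for this example, not the paper's own notation or algorithm.

```python
import numpy as np

# Hypothetical toy data: the names and triples are illustrative only.
triples = [
    ("alice", "knows", "bob"),
    ("bob", "knows", "carol"),
    ("alice", "likes", "carol"),
]

entities = sorted({t[0] for t in triples} | {t[2] for t in triples})
predicates = sorted({t[1] for t in triples})
e_idx = {name: i for i, name in enumerate(entities)}
p_idx = {name: i for i, name in enumerate(predicates)}

# T[s, p, o] is True exactly when the triple (s, p, o) is in the data.
T = np.zeros((len(entities), len(predicates), len(entities)), dtype=bool)
for s, p, o in triples:
    T[e_idx[s], p_idx[p], e_idx[o]] = True

# Triple pattern (?x knows ?y): fixing the predicate leaves a Boolean matrix.
knows = T[:, p_idx["knows"], :]

# Join (?x knows ?y) . (?y knows ?z), projected onto (?x, ?z):
# a Boolean matrix product, i.e. OR over y of knows[x, y] AND knows[y, z].
counts = knows.astype(np.uint8) @ knows.astype(np.uint8)
two_hop = counts > 0
result = [(entities[x], entities[z]) for x, z in zip(*np.nonzero(two_hop))]

# Size of the unprojected join: for each y, (triples ending in y) times
# (triples starting at y), summed over y.
join_size = int(knows.sum(axis=0) @ knows.sum(axis=1))

print(result, join_size)
```

In this representation a triple pattern is a slice of the tensor, and the size of the join result follows from simple row and column counts.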
Multiple-Tree Push-based Overlay Streaming
Multiple-tree overlay streaming has attracted a great deal of attention
from researchers in recent years. Multiple-tree streaming is a promising
alternative to single-tree streaming in terms of node dynamics and load
balancing, among others, which in turn addresses the video quality perceived by
the streaming user under node dynamics or when heterogeneous nodes join the
network. This article presents a comprehensive survey of the different
approaches and techniques used in this research area. We identify
node-disjointness as the property most approaches aim to achieve. We also
present an alternative technique that does not try to achieve this property but
instead performs local optimizations aimed at global optimization. Thus, we
identify this property as not being absolutely necessary for creating robust and
heterogeneous multi-tree overlays. We identify two main design goals, robustness
and support for heterogeneity, and classify existing approaches into these
categories according to their main focus.
State-Slice: A New Stream Query Optimization Paradigm for Multi-query and Distributed Processing
Modern stream applications necessitate the handling of large numbers of continuous queries specified over high-volume data streams. This dissertation proposes novel solutions to continuous query optimization in three core areas of stream query processing, namely state-slice based multiple continuous query sharing, ring-based multi-way join query distribution, and scalable distributed multi-query optimization. The first part of the dissertation proposes efficient optimization strategies that utilize the novel state-slicing concept to achieve maximum memory and computation sharing for stream join queries with window constraints. Extensive analytical and experimental evaluations demonstrate that the proposed strategies are capable of minimizing the memory or CPU consumption for multiple join queries. The second part of this dissertation proposes a novel scheme for the distributed execution of generic multi-way joins with window constraints. The proposed scheme partitions the states into disjoint slices in the time domain, and then distributes the fine-grained states in the cluster, forming a virtual computation ring. New challenges in supporting this distributed state-slicing processing are answered by numerous new techniques. The extensive experimental evaluations show that the proposed strategies achieve significant performance improvements in terms of response time and memory usage for a wide range of configurations and workloads on a real system. Ring-based distributed stream query processing and multi-query sharing are both based on the state-slice concept. The third part of this dissertation combines the first two parts and proposes a novel distributed multi-query optimization technique.
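The state-slicing idea can be pictured with a small, simplified sketch: the window state of one stream is cut into disjoint time slices, a probing tuple is matched against each slice once, and queries with different window sizes read the union of the slices they cover. The function names, the tuple layout, and the one-directional probe are assumptions made for illustration; the dissertation's actual operators are not reproduced here.

```python
def probe_sliced_state(state_a, probe, boundaries):
    """Match one probing tuple against stream-A state cut into time slices.

    state_a    -- list of (timestamp, key) tuples from stream A (hypothetical layout)
    probe      -- (timestamp, key) tuple arriving on stream B
    boundaries -- increasing window sizes; slice i covers ages [boundaries[i-1], boundaries[i])
    """
    t_b, key_b = probe
    per_slice = [[] for _ in boundaries]
    for t_a, key_a in state_a:
        age = t_b - t_a
        if key_a != key_b or age < 0:
            continue
        start = 0
        for i, end in enumerate(boundaries):
            if start <= age < end:
                per_slice[i].append((t_a, key_a))
                break
            start = end
    return per_slice


def results_per_query(per_slice):
    """Query i (window = boundaries[i]) reads the union of slices 0..i."""
    results, accumulated = [], []
    for matches in per_slice:
        accumulated = accumulated + matches
        results.append(list(accumulated))
    return results


# Two queries share one state: Q1 uses a window of 3 time units, Q2 a window of 10.
state_a = [(1, "x"), (4, "x"), (8, "y"), (9, "x")]
slices = probe_sliced_state(state_a, probe=(10, "x"), boundaries=[3, 10])
q1, q2 = results_per_query(slices)
```

Each stored tuple sits in exactly one slice, so the work done for the smaller window is reused by the larger one rather than repeated.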
Conclave: secure multi-party computation on big data (extended TR)
Secure Multi-Party Computation (MPC) allows mutually distrusting parties to
run joint computations without revealing private data. Current MPC algorithms
scale poorly with data size, which makes MPC on "big data" prohibitively slow
and inhibits its practical use.
Many relational analytics queries can maintain MPC's end-to-end security
guarantee without using cryptographic MPC techniques for all operations.
Conclave is a query compiler that accelerates such queries by transforming them
into a combination of data-parallel, local cleartext processing and small MPC
steps. When parties trust others with specific subsets of the data, Conclave
applies new hybrid MPC-cleartext protocols to run additional steps outside of
MPC and improve scalability further.
Our Conclave prototype generates code for cleartext processing in Python and
Spark, and for secure MPC using the Sharemind and Obliv-C frameworks. Conclave
scales to data sets between three and six orders of magnitude larger than
state-of-the-art MPC frameworks support on their own. Thanks to its hybrid
protocols, Conclave also substantially outperforms SMCQL, the most similar
existing system. Comment: Extended technical report for EuroSys 2019 paper
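The split between local cleartext work and a small secure step can be sketched as follows. Each party pre-aggregates its own rows in the clear, and only the per-key totals enter a toy additive secret-sharing sum that is simulated in a single process. This is an assumption-laden illustration, not Conclave's protocol or the Sharemind or Obliv-C APIs; the modulus, the function names, and the assumption that the set of group keys is public are all hypothetical.

```python
import secrets

MODULUS = 2**61 - 1  # assumed modulus for the toy sharing scheme


def local_aggregate(rows):
    """Cleartext step: each party sums its own (key, value) rows per key."""
    totals = {}
    for key, value in rows:
        totals[key] = totals.get(key, 0) + value
    return totals


def share(value, n_parties):
    """Split a value into n additive shares that sum to it modulo MODULUS."""
    shares = [secrets.randbelow(MODULUS) for _ in range(n_parties - 1)]
    shares.append((value - sum(shares)) % MODULUS)
    return shares


def secure_sum(values):
    """Simulated secure step: only sums of shares are ever combined."""
    n = len(values)
    all_shares = [share(v, n) for v in values]  # party i creates n shares
    # Party j adds the shares it received and publishes only that partial sum.
    partials = [sum(all_shares[i][j] for i in range(n)) % MODULUS for j in range(n)]
    return sum(partials) % MODULUS


# Hypothetical inputs: three parties, each holding private (key, value) rows.
parties = [
    [("a", 1), ("b", 2), ("a", 3)],
    [("a", 10)],
    [("b", 5), ("b", 7)],
]
local = [local_aggregate(rows) for rows in parties]
keys = sorted({k for totals in local for k in totals})  # assumed public here
joint = {k: secure_sum([totals.get(k, 0) for totals in local]) for k in keys}
```

The secure step processes one number per party and key instead of every input row, which is the kind of reduction the abstract attributes to moving work outside of MPC.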
SkinnerDB: Regret-Bounded Query Evaluation via Reinforcement Learning
SkinnerDB is designed from the ground up for reliable join ordering. It
maintains no data statistics and uses no cost or cardinality models. Instead,
it uses reinforcement learning to learn optimal join orders on the fly, during
the execution of the current query. To that purpose, we divide the execution of
a query into many small time slices. Different join orders are tried in
different time slices. We merge result tuples generated according to different
join orders until a complete result is obtained. By measuring execution
progress per time slice, we identify promising join orders as execution
proceeds.
Along with SkinnerDB, we introduce a new quality criterion for query
execution strategies. We compare expected execution cost against execution cost
for an optimal join order. SkinnerDB features multiple execution strategies
that are optimized for that criterion. Some of them can be executed on top of
existing database systems. For maximal performance, we introduce a customized
execution engine, facilitating fast join order switching via specialized
multi-way join algorithms and tuple representations.
We experimentally compare SkinnerDB's performance against various baselines,
including MonetDB, Postgres, and adaptive processing methods. We consider
various benchmarks, including the join order benchmark and TPC-H variants with
user-defined functions. Overall, the overheads of reliable join ordering are
negligible compared to the performance impact of the occasional, catastrophic
join order choice.
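The time-slicing loop can be sketched with a standard bandit rule. The abstract does not name the learning algorithm, so UCB1 is used here as a stand-in, and the per-slice progress values are invented constants. Unlike the system described above, this sketch does not merge results across join orders: each order keeps its own progress counter, and the query counts as finished when one order reaches full progress.

```python
import math


def run_time_sliced(progress_per_slice, max_slices=10_000):
    """Pick a join order per time slice with UCB1; reward = progress made.

    progress_per_slice -- dict mapping a join order to the (hypothetical)
                          fraction of the query it completes in one slice
    """
    orders = list(progress_per_slice)
    pulls = {o: 0 for o in orders}
    reward_sum = {o: 0.0 for o in orders}
    done = {o: 0.0 for o in orders}

    for t in range(1, max_slices + 1):
        untried = [o for o in orders if pulls[o] == 0]
        if untried:
            choice = untried[0]  # try every order once first
        else:
            choice = max(
                orders,
                key=lambda o: reward_sum[o] / pulls[o]
                + math.sqrt(2 * math.log(t) / pulls[o]),
            )
        reward = progress_per_slice[choice]  # measured progress in this slice
        pulls[choice] += 1
        reward_sum[choice] += reward
        done[choice] += reward
        if done[choice] >= 1.0:
            return choice, t, pulls
    raise RuntimeError("query did not finish within max_slices")


# Hypothetical progress rates for three join orders over tables R, S, T.
rates = {
    ("R", "S", "T"): 0.05,
    ("S", "R", "T"): 0.0001,
    ("T", "S", "R"): 0.001,
}
winner, slices_used, pulls = run_time_sliced(rates)
```

Even though two of the three orders are very slow in this toy setup, the run ends after a small multiple of the slices the best order needs alone, which is the intuition behind bounding regret against an optimal join order.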
Building Efficient Query Engines in a High-Level Language
Abstraction without regret refers to the vision of using high-level
programming languages for systems development without experiencing a negative
impact on performance. A database system designed according to this vision
offers both increased productivity and high performance, instead of sacrificing
the former for the latter as is the case with existing, monolithic
implementations that are hard to maintain and extend. In this article, we
realize this vision in the domain of analytical query processing. We present
LegoBase, a query engine written in the high-level language Scala. The key
technique to regain efficiency is to apply generative programming: LegoBase
performs source-to-source compilation and optimizes the entire query engine by
converting the high-level Scala code to specialized, low-level C code. We show
how generative programming makes it easy to implement a wide spectrum of
optimizations, such as introducing data partitioning or switching from a row to
a column data layout, which are difficult to achieve with existing low-level
query compilers that handle only queries. We demonstrate that sufficiently
powerful abstractions are essential for dealing with the complexity of the
optimization effort, shielding developers from compiler internals and
decoupling individual optimizations from each other. We evaluate our approach
with the TPC-H benchmark and show that: (a) With all optimizations enabled,
LegoBase significantly outperforms a commercial database and an existing query
compiler. (b) Programmers need to provide just a few hundred lines of
high-level code for implementing the optimizations, instead of complicated
low-level code that is required by existing query compilation approaches. (c)
The compilation overhead is low compared to the overall execution time, thus
making our approach usable in practice for compiling query engines.
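The generative-programming idea can be illustrated in miniature. Instead of interpreting a query description row by row, the program emits source code specialized to that query and compiles it once. LegoBase does this from Scala to C; the sketch below stays within Python purely for illustration, and the function name, the query shape (a filtered sum), and the row layout are all assumptions.

```python
def compile_filter_sum(column, op, constant, target):
    """Generate and compile a function specialized to one filter-and-sum query."""
    if op not in {"<", "<=", ">", ">=", "=="}:
        raise ValueError(f"unsupported operator: {op}")
    # The column names, operator, and constant are baked into the generated
    # source, so the compiled loop contains no per-row dispatch on the query.
    source = (
        "def specialized(rows):\n"
        "    total = 0\n"
        "    for row in rows:\n"
        f"        if row[{column!r}] {op} {constant!r}:\n"
        f"            total += row[{target!r}]\n"
        "    return total\n"
    )
    namespace = {}
    exec(source, namespace)  # compile the generated source
    return namespace["specialized"], source


# Hypothetical rows and query: sum of price where qty < 24.
rows = [
    {"qty": 5, "price": 10},
    {"qty": 30, "price": 7},
    {"qty": 12, "price": 3},
]
query, generated_source = compile_filter_sum("qty", "<", 24, "price")
total = query(rows)
```

Printing `generated_source` shows the specialized loop, which is the Python analogue of the low-level code a query compiler would emit.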
Parallelizing Set Similarity Joins
One of today's major challenges in data science is to compare and relate data of similar nature. The join operation known from relational databases can help solve this problem. Given a collection of records, the join operation finds all pairs of records that fulfill a user-chosen predicate. Real-world problems may require complex predicates, such as similarity. Set similarity functions are a common way to measure similarity. To use set similarity functions as predicates, we assume that records are represented by sets of tokens. In this thesis, we focus on the set similarity join (SSJ) operation.
The amount of data to be processed today is typically large and grows continually. On the other hand, the SSJ is a compute-intensive operation. To cope with the increasing size of input data, additional means are needed to develop scalable implementations for SSJ. In this thesis, we focus on parallelization. We make the following three major contributions to SSJ.
First, we elaborate on the state of the art in parallelizing SSJ. We compare ten MapReduce-based approaches from the literature analytically and experimentally. Surprisingly, their main limitation is low scalability caused by excessive and/or skewed data replication. None of the approaches could compute the join on large datasets.
Second, we leverage the abundant CPU parallelism of modern commodity hardware, which has not yet been considered to scale SSJ. We propose a novel data-parallel multi-threaded SSJ. Our approach provides significant speedups compared to single-threaded executions.
Third, we propose a novel highly scalable distributed SSJ approach. With a cost-based heuristic and a data-independent scaling mechanism we avoid data replication and recomputation. A heuristic assigns similar shares of compute costs to each node. Our approach significantly scales up the join execution and processes much larger datasets than all parallel approaches designed and implemented so far
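A set similarity join and a data-parallel way of splitting it across threads can be sketched as follows. This is a naive all-pairs Jaccard join with a simple range partitioning, chosen for brevity; it is not the thesis's algorithm, and the function names and chunking scheme are assumptions. In standard CPython the global interpreter lock limits the speedup threads can give for this kind of work, so the sketch shows the partitioning structure rather than the performance.

```python
from concurrent.futures import ThreadPoolExecutor


def jaccard(a, b):
    """Jaccard similarity of two token sets."""
    union = len(a | b)
    return len(a & b) / union if union else 1.0


def join_chunk(records, start, stop, threshold):
    """Compare records[start:stop] against every later record."""
    pairs = []
    for i in range(start, stop):
        for j in range(i + 1, len(records)):
            if jaccard(records[i], records[j]) >= threshold:
                pairs.append((i, j))
    return pairs


def parallel_ssj(records, threshold, workers=4):
    """Self-join: index pairs (i, j), i < j, with Jaccard >= threshold."""
    n = len(records)
    step = max(1, -(-n // workers))  # ceiling division
    ranges = [(s, min(s + step, n)) for s in range(0, n, step)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        parts = pool.map(
            lambda r: join_chunk(records, r[0], r[1], threshold), ranges
        )
        return sorted(pair for part in parts for pair in part)


# Hypothetical records, each a set of single-character tokens.
records = [frozenset("abcd"), frozenset("abce"), frozenset("xyz"), frozenset("abcd")]
similar_pairs = parallel_ssj(records, threshold=0.5, workers=2)
```

Each worker reads the shared record list and writes only its own output, so no record is replicated. The chunks are uneven in cost, however, because earlier records are compared against more partners, and balancing such costs is what the cost-based heuristic in the abstract addresses.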