132 research outputs found
A New Framework for Join Product Skew
Different types of data skew can result in load imbalance in the context of
parallel joins under the shared nothing architecture. We study one important
type of skew, join product skew (JPS). A static approach based on frequency
classes is proposed which takes for granted the data distribution of join
attribute values. It comes from the observation that the join selectivity can
be expressed as a sum of products of frequencies of the join attribute values.
As a consequence, an appropriate assignment of join sub-tasks, that takes into
consideration the magnitude of the frequency products can alleviate the join
product skew. Motivated by the aforementioned remark, we propose an algorithm,
called Handling Join Product Skew (HJPS), to handle join product skew
A scalable analysis framework for large-scale RDF data
With the growth of the Semantic Web, the availability of RDF datasets from multiple domains
as Linked Data has taken the corpora of this web to a terabyte-scale, and challenges
modern knowledge storage and discovery techniques. Research and engineering on RDF
data management systems is a very active area with many standalone systems being introduced.
However, as the size of RDF data increases, such single-machine approaches meet
performance bottlenecks, in terms of both data loading and querying, due to the limited
parallelism inherent to symmetric multi-threaded systems and the limited available system
I/O and system memory. Although several approaches for distributed RDF data processing
have been proposed, along with clustered versions of more traditional approaches, their
techniques are limited by the trade-off they exploit between loading complexity and query
efficiency in the presence of big RDF data. This thesis then, introduces a scalable analysis
framework for processing large-scale RDF data, which focuses on various techniques to
reduce inter-machine communication, computation and load-imbalancing so as to achieve
fast data loading and querying on distributed infrastructures.
The first part of this thesis focuses on the study of RDF store implementation and parallel
hashing on big data processing. (1) A system-level investigation of RDF store implementation
has been conducted on the basis of a comparative analysis of runtime characteristics
of a representative set of RDF stores. The detailed time cost and system consumption is
measured for data loading and querying so as to provide insight into different triple store
implementation as well as an understanding of performance differences between different
platforms. (2) A high-level structured parallel hashing approach over distributed memory is
proposed and theoretically analyzed. The detailed performance of hashing implementations
using different lock-free strategies has been characterized through extensive experiments,
thereby allowing system developers to make a more informed choice for the implementation
of their high-performance analytical data processing systems.
The second part of this thesis proposes three main techniques for fast processing of large
RDF data within the proposed framework. (1) A very efficient parallel dictionary encoding
algorithm, to avoid unnecessary disk-space consumption and reduce computational complexity of query execution. The presented implementation has achieved notable speedups
compared to the state-of-art method and also has achieved excellent scalability. (2) Several
novel parallel join algorithms, to efficiently handle skew over large data during query processing.
The approaches have achieved good load balancing and have been demonstrated
to be faster than the state-of-art techniques in both theoretical and experimental comparisons.
(3) A two-tier dynamic indexing approach for processing SPARQL queries has been
devised which keeps loading times low and decreases or in some instances removes intermachine
data movement for subsequent queries that contain the same graph patterns. The
results demonstrate that this design can load data at least an order of magnitude faster than
a clustered store operating in RAM while remaining within an interactive range for query
processing and even outperforms current systems for various queries
Robust and Skew-resistant Parallel Joins in Shared-Nothing Systems
The performance of joins in parallel database management systems is critical for data intensive operations such as querying. Since data skew is common in many applications, poorly engineered join operations result in load imbalance and performance bottlenecks. State-of-the-art methods designed to handle this problem offer significant improvements over naive implementations. However, performance could be further improved by removing the dependency on global skew knowledge and broadcasting. In this paper, we propose PRPQ (partial redistribution & partial query), an efficient and robust join algorithm for processing large-scale joins over distributed systems. We present the detailed implementation and a quantitative evaluation of our method. The experimental results demonstrate that the proposed PRPQ algorithm is indeed robust and scalable under a wide range of skew conditions. Specifically, compared to the state-of-art PRPD method, we achieve 16% - 167% performance improvement and 24% - 54% less network communication under different join workloads
A Self-Optimizing Cloud Computing System for Distributed Storage and Processing of Semantic Web Data
Clouds are dynamic networks of common, off-the-shell computers to build computation farms. The rapid growth of databases in the context of the semantic web requires efficient ways to store and process this data. Using cloud technology for storing and processing Semantic Web data is an obvious way to overcome difficulties in storing and processing the enormously large present and future datasets of the Semantic Web. This paper presents a new approach for storing Semantic Web data, such that operations for the evaluation of Semantic Web queries are more likely to be processed only on local data, instead of using costly distributed operations. An experimental evaluation demonstrates the performance improvements in comparison to a naive distribution of Semantic Web data
Privacy-Preserving Secret Shared Computations using MapReduce
Data outsourcing allows data owners to keep their data at \emph{untrusted}
clouds that do not ensure the privacy of data and/or computations. One useful
framework for fault-tolerant data processing in a distributed fashion is
MapReduce, which was developed for \emph{trusted} private clouds. This paper
presents algorithms for data outsourcing based on Shamir's secret-sharing
scheme and for executing privacy-preserving SQL queries such as count,
selection including range selection, projection, and join while using MapReduce
as an underlying programming model. Our proposed algorithms prevent an
adversary from knowing the database or the query while also preventing
output-size and access-pattern attacks. Interestingly, our algorithms do not
involve the database owner, which only creates and distributes secret-shares
once, in answering any query, and hence, the database owner also cannot learn
the query. Logically and experimentally, we evaluate the efficiency of the
algorithms on the following parameters: (\textit{i}) the number of
communication rounds (between a user and a server), (\textit{ii}) the total
amount of bit flow (between a user and a server), and (\textit{iii}) the
computational load at the user and the server.\BComment: IEEE Transactions on Dependable and Secure Computing, Accepted 01
Aug. 201
Design and Evaluation of Small-Large Outer Joins in Cloud Computing Environments
Large-scale analytics is a key application area for data processing and parallel computing research. One of the most common (and challenging) operations in this domain is the join. Though inner join approaches have been extensively evaluated in parallel and distributed systems, there is little published work providing analysis of outer joins, especially in the extremely popular cloud computing environments. A common type of outer join is the small-large outer join, where one relation is relatively small and the other is large. Conventional implementations on this condition, such as one based on hash redistribution, often incur significant network communication, while the duplication-based approaches are complex and inefficient. In this work, we present a new method called DDR (duplication and direct redistribution), which aims to enable efficient small-large outer joins in cloud computing environments while being easy to implement using existing predicates in data processing frameworks. We present the detailed implementation of our approach and evaluate its performance through extensive experiments over the widely used MapReduce and Spark platforms. We show that the proposed method is scalable and can achieve significant performance improvements over the conventional approaches. Compared to the state-of-art method, the DDR algorithm is shown to be easier to implement and can achieve very similar or better performance under different outer join workloads, and thus, can be considered as a new option for current data analysis applications. Moreover, our detailed experimental results also have provided insights of current small-large outer join implementations, thereby allowing system developers to make a more informed choice for their data analysis applications
Materialisierte views in verteilten key-value stores
Distributed key-value stores have become the solution of choice for warehousing large volumes of data. However, their architecture is not suitable for real-time analytics. To achieve the required velocity, materialized views can be used to provide summarized data for fast access. The main challenge then, is the incremental, consistent maintenance of views at large scale. Thus, we introduce our View Maintenance System (VMS) to maintain SQL queries in a data-intensive real-time scenario.Verteilte key-value stores sind ein Typ moderner Datenbanken um große Mengen an Daten zu verarbeiten. Trotzdem erlaubt ihre Architektur keine analytischen Abfragen in Echtzeit. Materialisierte Views können diesen Nachteil ausgleichen, indem sie schnellen Zuriff auf Ergebnisse ermöglichen. Die Herausforderung ist dann, das inkrementelle und konsistente Aktualisieren der Views. Daher präsentieren wir unser View Maintenance System (VMS), das datenintensive SQL Abfragen in Echtzeit berechnet
- …