A scalable analysis framework for large-scale RDF data
With the growth of the Semantic Web, the availability of RDF datasets from multiple domains
as Linked Data has taken the corpora of this web to a terabyte-scale, and challenges
modern knowledge storage and discovery techniques. Research and engineering on RDF
data management systems is a very active area with many standalone systems being introduced.
However, as the size of RDF data increases, such single-machine approaches meet
performance bottlenecks, in terms of both data loading and querying, due to the limited
parallelism inherent to symmetric multi-threaded systems and the limited available system
I/O and system memory. Although several approaches for distributed RDF data processing
have been proposed, along with clustered versions of more traditional approaches, their
techniques are limited by the trade-off they exploit between loading complexity and query
efficiency in the presence of big RDF data. This thesis therefore introduces a scalable analysis
framework for processing large-scale RDF data, which focuses on various techniques to
reduce inter-machine communication, computation and load imbalance so as to achieve
fast data loading and querying on distributed infrastructures.
The first part of this thesis focuses on the study of RDF store implementation and parallel
hashing on big data processing. (1) A system-level investigation of RDF store implementation
has been conducted on the basis of a comparative analysis of runtime characteristics
of a representative set of RDF stores. The detailed time cost and system consumption are
measured for data loading and querying so as to provide insight into different triple store
implementations as well as an understanding of performance differences between different
platforms. (2) A high-level structured parallel hashing approach over distributed memory is
proposed and theoretically analyzed. The detailed performance of hashing implementations
using different lock-free strategies has been characterized through extensive experiments,
thereby allowing system developers to make a more informed choice for the implementation
of their high-performance analytical data processing systems.
The second part of this thesis proposes three main techniques for fast processing of large
RDF data within the proposed framework. (1) A very efficient parallel dictionary encoding
algorithm is presented to avoid unnecessary disk-space consumption and reduce the
computational complexity of query execution. The presented implementation has achieved notable speedups
compared to the state-of-the-art method and has also achieved excellent scalability. (2) Several
novel parallel join algorithms are proposed to efficiently handle skew over large data during query processing.
The approaches have achieved good load balancing and have been demonstrated
to be faster than the state-of-the-art techniques in both theoretical and experimental comparisons.
(3) A two-tier dynamic indexing approach for processing SPARQL queries has been
devised which keeps loading times low and decreases or in some instances removes inter-machine
data movement for subsequent queries that contain the same graph patterns. The
results demonstrate that this design can load data at least an order of magnitude faster than
a clustered store operating in RAM while remaining within an interactive range for query
processing and even outperforms current systems for various queries.
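To illustrate what dictionary encoding means for RDF data, the following is a minimal serial sketch, not the parallel algorithm proposed in the thesis. It replaces every term of a triple with a compact integer identifier. The function name `encode_triples` and the choice of consecutive integers as identifiers are assumptions made for illustration.

```python
def encode_triples(triples):
    """Map each RDF term to an integer id and rewrite the triples with ids.

    Serial illustration only: ids are assigned in order of first appearance.
    """
    dictionary = {}  # term -> integer id
    encoded = []
    for triple in triples:
        ids = []
        for term in triple:
            if term not in dictionary:
                # The next free id equals the current dictionary size.
                dictionary[term] = len(dictionary)
            ids.append(dictionary[term])
        encoded.append(tuple(ids))
    return dictionary, encoded


if __name__ == "__main__":
    sample = [
        ("ex:alice", "foaf:knows", "ex:bob"),
        ("ex:bob", "foaf:knows", "ex:alice"),
    ]
    term_ids, id_triples = encode_triples(sample)
    print(term_ids)
    print(id_triples)
```

Storing fixed-width integers instead of long IRIs and literals is what saves disk space and makes comparisons during query execution cheaper.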
Design and Evaluation of Small-Large Outer Joins in Cloud Computing Environments
Large-scale analytics is a key application area for data processing and parallel computing research. One of the most common (and challenging) operations in this domain is the join. Though inner join approaches have been extensively evaluated in parallel and distributed systems, there is little published work analyzing outer joins, especially in the extremely popular cloud computing environments. A common type of outer join is the small-large outer join, where one relation is relatively small and the other is large. Conventional implementations for this case, such as those based on hash redistribution, often incur significant network communication, while duplication-based approaches are complex and inefficient. In this work, we present a new method called DDR (duplication and direct redistribution), which aims to enable efficient small-large outer joins in cloud computing environments while being easy to implement using existing predicates in data processing frameworks. We present the detailed implementation of our approach and evaluate its performance through extensive experiments over the widely used MapReduce and Spark platforms. We show that the proposed method is scalable and can achieve significant performance improvements over the conventional approaches. Compared to the state-of-the-art method, the DDR algorithm is shown to be easier to implement and can achieve very similar or better performance under different outer join workloads, and thus can be considered a new option for current data analysis applications. Moreover, our detailed experimental results also provide insights into current small-large outer join implementations, thereby allowing system developers to make a more informed choice for their data analysis applications.
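The sketch below illustrates why a duplication-based small-large outer join needs extra bookkeeping. It is a generic single-process simulation, not the DDR algorithm itself. The small relation is copied to every partition of the large relation, and a small row is dangling (null-padded) only if it found no match in every partition. The function name and the list-of-lists model of partitions are assumptions for illustration.

```python
from collections import Counter, defaultdict


def small_large_left_outer_join(small, large_partitions):
    """Compute `small LEFT OUTER JOIN large` with `small` duplicated everywhere.

    small: list of (key, value) rows, conceptually sent to each worker.
    large_partitions: list of lists of (key, value) rows, one list per worker.
    """
    results = []
    unmatched_votes = Counter()  # small row index -> partitions without a match
    for partition in large_partitions:
        # Each worker builds a local hash index over its slice of `large`.
        index = defaultdict(list)
        for key, value in partition:
            index[key].append(value)
        for row_id, (key, small_value) in enumerate(small):
            if key in index:
                for large_value in index[key]:
                    results.append((key, small_value, large_value))
            else:
                unmatched_votes[row_id] += 1
    # A small row is dangling only if no partition at all matched it.
    for row_id, (key, small_value) in enumerate(small):
        if unmatched_votes[row_id] == len(large_partitions):
            results.append((key, small_value, None))
    return results
```

The final pass over the unmatched counts stands in for the extra communication step that a distributed implementation would need to agree on which rows are dangling.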
Load-balancing distributed outer joins through operator decomposition
High-performance data analytics largely relies on being able to efficiently execute various distributed data operators such as distributed joins. So far, a large number of join methods have been proposed and evaluated in parallel and distributed environments.
However, most of them focus on inner joins, and there is little published work providing the detailed implementations and analysis
of outer joins. In this work, we present POPI (Partial Outer join & Partial Inner join), a novel method to load-balance large parallel
outer joins by decomposing them into two operations: a large outer join over data that does not present significant skew in the input
and an inner join over data presenting significant skew. We present the detailed implementation of our approach and show that
POPI is implementable over a variety of architectures and underlying join implementations. Moreover, our experimental evaluation
over a distributed memory platform also demonstrates that the proposed method is able to improve outer join performance under
varying data skew and presents excellent load-balancing properties, compared to current approaches.
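The decomposition idea can be sketched as follows. This is a single-process illustration under stated assumptions and not the POPI implementation. Here a key is treated as skewed when it appears at least `hot_threshold` times in the left relation and has at least one match in the right relation, which keeps the combined result identical to a plain left outer join. Both the threshold parameter and this rule are hypothetical; the abstract does not describe the actual partitioning.

```python
from collections import Counter, defaultdict


def decomposed_left_outer_join(left, right, hot_threshold):
    """Left outer join split into an inner join (skewed keys) and an
    outer join (all remaining keys); the two outputs are concatenated."""
    index = defaultdict(list)
    for key, value in right:
        index[key].append(value)

    counts = Counter(key for key, _ in left)
    # Assumed rule: frequent in `left` AND matched in `right`.
    skewed = {k for k, c in counts.items() if c >= hot_threshold and k in index}

    # Inner join over the skewed keys: every such row has a match.
    inner_part = [
        (k, lv, rv) for k, lv in left if k in skewed for rv in index[k]
    ]

    # Outer join over the rest: unmatched rows are padded with None.
    outer_part = []
    for k, lv in left:
        if k in skewed:
            continue
        matches = index.get(k)
        if matches:
            outer_part.extend((k, lv, rv) for rv in matches)
        else:
            outer_part.append((k, lv, None))
    return inner_part + outer_part
```

In a distributed setting the two parts could use different strategies, for example broadcasting the few right-hand rows of the skewed keys while hash-partitioning the rest.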
Scalable Querying of Nested Data
While large-scale distributed data processing platforms have become an attractive target for query processing, these systems are problematic for applications that deal with nested collections. Programmers are forced either to perform non-trivial translations of collection programs or to employ automated flattening procedures, both of which lead to performance problems. These challenges only worsen for nested collections with skewed cardinalities, where both handcrafted rewriting and automated flattening are unable to enforce load balancing across partitions.
In this work, we propose a framework that translates a program manipulating nested collections into a set of semantically equivalent shredded queries that can be efficiently evaluated. The framework employs a combination of query compilation techniques, an efficient data representation for nested collections, and automated skew-handling. We provide an extensive experimental evaluation, demonstrating significant improvements provided by the framework in diverse scenarios for nested collection programs.
Cloud-Scale Entity Resolution: Current State and Open Challenges
Entity resolution (ER) is the process of identifying records in information systems that refer to the same real-world entity. Because data volumes have grown so large over the past two decades, parallel techniques are called upon to satisfy the ER requirements of high performance and scalability. The development of parallel ER has reached a relatively prosperous stage, and has found its way into several applications. In this work, we first comprehensively survey the state of the art of parallel ER approaches. From this overview, we then extract the classification criteria of parallel ER, and classify and compare these approaches based on these criteria. Finally, we identify open research questions and challenges, and discuss potential solutions and directions for further research in this field.
Scaling and Load-Balancing Equi-Joins
The task of joining two tables is fundamental for querying databases. In this
paper, we focus on the equi-join problem, where a pair of records from the two
joined tables are part of the join results if equality holds between their
values in the join column(s). While this is a tractable problem when the number
of records in the joined tables is relatively small, it becomes very
challenging as the table sizes increase, especially if hot keys (join column
values with a large number of records) exist in both joined tables.
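As a minimal illustration of the definition above (not of AM-Join itself), the following sketch implements a plain in-memory hash equi-join and reports keys that are hot in both tables, together with the number of output rows each one produces. The function names and the `threshold` parameter are assumptions for illustration.

```python
from collections import Counter, defaultdict


def equi_join(left, right):
    """Return (key, left_value, right_value) for every pair with equal keys."""
    index = defaultdict(list)
    for key, value in right:
        index[key].append(value)
    return [(key, lv, rv) for key, lv in left for rv in index.get(key, [])]


def shared_hot_keys(left, right, threshold):
    """Keys with at least `threshold` rows in BOTH tables, mapped to the
    number of join results they generate (product of the two counts)."""
    left_counts = Counter(key for key, _ in left)
    right_counts = Counter(key for key, _ in right)
    return {
        key: left_counts[key] * right_counts[key]
        for key in left_counts
        if left_counts[key] >= threshold and right_counts[key] >= threshold
    }
```

Because the output for a key grows with the product of its counts in the two tables, a key that is hot on both sides can dominate the work of whichever worker receives it.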
This paper, an extended version of [metwally-SIGMOD-2022], proposes
Adaptive-Multistage-Join (AM-Join) for scalable and fast equi-joins in
distributed shared-nothing architectures. AM-Join utilizes (a) Tree-Join, a
proposed novel algorithm that scales well when the joined tables share hot
keys, and (b) Broadcast-Join, the fastest known algorithm when joining keys that are hot
in only one table.
Unlike the state-of-the-art algorithms, AM-Join (a) holistically solves the
join-skew problem by achieving load balancing throughout the join execution,
and (b) supports all outer-join variants without record deduplication or custom
table partitioning. For the fastest AM-Join outer-join performance, we propose
the Index-Broadcast-Join (IB-Join) family of algorithms for Small-Large joins,
where one table fits in memory and the other can be up to orders of magnitude
larger. The outer-join variants of IB-Join improve on the state-of-the-art
Small-Large outer-join algorithms.
The proposed algorithms can be adopted in any shared-nothing architecture. We
implemented a MapReduce version using Spark. Our evaluation shows the proposed
algorithms execute significantly faster and scale to more skewed and
orders-of-magnitude bigger tables when compared to the state-of-the-art
algorithms.
Scalable discovery of hybrid process models in a cloud computing environment
Process descriptions are used to create products and deliver services. To achieve better processes and services, the first step
is to learn a process model. Process discovery is a technique that can automatically extract process models from event logs.
Although various discovery techniques have been proposed, they focus on either constructing formal models, which are very powerful
but complex, or creating informal models, which are intuitive but lack semantics. In this work, we introduce a novel method that returns
hybrid process models to bridge this gap. Moreover, to cope with today’s big event logs, we propose an efficient method called f-HMD,
which aims at scalable hybrid model discovery in a cloud computing environment. We present the detailed implementation of our approach
over the Spark framework, and our experimental results demonstrate that the proposed method is efficient and scalable.
10381 Summary and Abstracts Collection -- Robust Query Processing
Dagstuhl seminar 10381 on robust query processing (held 19.09.10 -
24.09.10) brought together a diverse set of researchers and practitioners
with a broad range of expertise for the purpose of fostering discussion
and collaboration regarding causes, opportunities, and solutions for
achieving robust query processing.
The seminar strove to build a unified view across
the loosely-coupled system components responsible for
the various stages of database query processing.
Participants were chosen for their experience with database
query processing and, where possible, their prior work in academic
research or in product development towards robustness in database query
processing.
In order to pave the way to motivate, measure, and protect future advances
in robust query processing, seminar 10381 focused on developing tests
for measuring the robustness of query processing.
In these proceedings, we first review the seminar topics, goals,
and results, then present abstracts or notes of some of the seminar break-out
sessions.
We also include, as an appendix,
the robust query processing reading list that
was collected and distributed to participants before the seminar began,
as well as summaries of a few of those papers that were
contributed by some participants.
Improved RSA Algorithm for Data Security against DDoS Attack in a Cloud-based Intrusion Detection System
Today, more and more industries are using cloud computing for some integration operations, but ensuring the security of user data and system resources remains a challenge. This article proposes a method to identify and mitigate unwanted packets and traffic, especially duplicate packets, in cloud computing environments. The method includes creating an Intrusion Search and Detection (IF-AD) system to securely maintain user information and allocate secondary memory. To detect unwanted traffic, the method compares the size of the downloaded file with that of the original file and flags any difference as a potential DDoS attack. An RSA encryption mechanism is used for subsequent file transfers for added security. The proposed approach aims to enhance the security posture of cloud-based systems by detecting and preventing unauthorized access and file modification.