24,002 research outputs found
Parallelizing Windowed Stream Joins in a Shared-Nothing Cluster
The availability of large number of processing nodes in a parallel and
distributed computing environment enables sophisticated real time processing
over high speed data streams, as required by many emerging applications.
Sliding window stream joins are among the most important operators in a stream
processing system. In this paper, we consider the issue of parallelizing a
sliding window stream join operator over a shared nothing cluster. We propose a
framework, based on fixed or predefined communication pattern, to distribute
the join processing loads over the shared-nothing cluster. We consider various
overheads while scaling over a large number of nodes, and propose solution
methodologies to cope with the issues. We implement the algorithm over a
cluster using a message passing system, and present the experimental results
showing the effectiveness of the join processing algorithm.Comment: 11 page
On the Evaluation of RDF Distribution Algorithms Implemented over Apache Spark
Querying very large RDF data sets in an efficient manner requires a
sophisticated distribution strategy. Several innovative solutions have recently
been proposed for optimizing data distribution with predefined query workloads.
This paper presents an in-depth analysis and experimental comparison of five
representative and complementary distribution approaches. For achieving fair
experimental results, we are using Apache Spark as a common parallel computing
framework by rewriting the concerned algorithms using the Spark API. Spark
provides guarantees in terms of fault tolerance, high availability and
scalability which are essential in such systems. Our different implementations
aim to highlight the fundamental implementation-independent characteristics of
each approach in terms of data preparation, load balancing, data replication
and to some extent to query answering cost and performance. The presented
measures are obtained by testing each system on one synthetic and one
real-world data set over query workloads with differing characteristics and
different partitioning constraints.Comment: 16 pages, 3 figure
- …