GraphX: Unifying Data-Parallel and Graph-Parallel Analytics
From social networks to language modeling, the growing scale and importance
of graph data has driven the development of numerous new graph-parallel systems
(e.g., Pregel, GraphLab). By restricting the computation that can be expressed
and introducing new techniques to partition and distribute the graph, these
systems can efficiently execute iterative graph algorithms orders of magnitude
faster than more general data-parallel systems. However, the same restrictions
that enable the performance gains also make it difficult to express many of the
important stages in a typical graph-analytics pipeline: constructing the graph,
modifying its structure, or expressing computation that spans multiple graphs.
As a consequence, existing graph analytics pipelines compose graph-parallel and
data-parallel systems using external storage systems, leading to extensive data
movement and a complicated programming model.
To address these challenges we introduce GraphX, a distributed graph
computation framework that unifies graph-parallel and data-parallel
computation. GraphX provides a small, core set of graph-parallel operators
expressive enough to implement the Pregel and PowerGraph abstractions, yet
simple enough to be cast in relational algebra. GraphX uses a collection of
query optimization techniques such as automatic join rewrites to efficiently
implement these graph-parallel operators. We evaluate GraphX on real-world
graphs and workloads and demonstrate that GraphX achieves performance
comparable to that of specialized graph computation systems, while
outperforming them in end-to-end graph pipelines. Moreover, GraphX achieves a
balance between expressiveness, performance, and ease of use.
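The claim that graph-parallel operators can be cast in relational algebra can be illustrated with a minimal sketch. It is not GraphX's API: the function names `sssp_superstep` and `sssp`, the dict-based vertex table, and the choice of single-source shortest paths are all assumptions made for illustration. Each superstep joins the edge table with the vertex table, groups the resulting messages by destination, and joins the aggregates back into the vertex table.

```python
INF = float("inf")

def sssp_superstep(vertices, edges):
    """One Pregel-style superstep written as join + group-by.

    vertices: dict mapping vertex id -> current best distance (INF if unreached)
    edges:    list of (src, dst, weight) rows
    """
    # Join edges with the vertex table on src; emit (dst, candidate) messages.
    messages = [(dst, vertices[src] + w)
                for src, dst, w in edges
                if vertices[src] != INF]
    # Group messages by destination and aggregate with min.
    best = {}
    for dst, dist in messages:
        best[dst] = min(best.get(dst, INF), dist)
    # Join the aggregates back to the vertex table to produce the new state.
    return {v: min(d, best.get(v, INF)) for v, d in vertices.items()}

def sssp(vertices, edges):
    """Repeat supersteps until no vertex changes (assumes non-negative weights)."""
    while True:
        updated = sssp_superstep(vertices, edges)
        if updated == vertices:
            return updated
        vertices = updated

example_edges = [("a", "b", 1.0), ("b", "c", 2.0), ("a", "c", 5.0)]
example_result = sssp({"a": 0.0, "b": INF, "c": INF}, example_edges)
```

Because every step is a join or an aggregation over tables, a relational optimizer could in principle reorder or rewrite them, which is the kind of opportunity the abstract refers to.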
Scaling and Load-Balancing Equi-Joins
The task of joining two tables is fundamental for querying databases. In this
paper, we focus on the equi-join problem, where a pair of records from the two
joined tables are part of the join results if equality holds between their
values in the join column(s). While this is a tractable problem when the number
of records in the joined tables is relatively small, it becomes very
challenging as the table sizes increase, especially if hot keys (join column
values with a large number of records) exist in both joined tables.
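To make the equi-join definition and the hot-key problem concrete, here is a small single-machine sketch. The function name `hash_equi_join` and the `(key, value)` row layout are assumptions for illustration, not part of the paper. A key that appears m times in one table and n times in the other contributes m × n result rows, which is why keys that are hot in both tables dominate the cost.

```python
from collections import defaultdict

def hash_equi_join(left, right):
    """Inner equi-join of two lists of (key, value) rows using a hash index."""
    # Build phase: index the right table by join key.
    index = defaultdict(list)
    for key, value in right:
        index[key].append(value)
    # Probe phase: each left row pairs with every right row sharing its key.
    return [(key, lv, rv)
            for key, lv in left
            for rv in index.get(key, [])]

# "hot" occurs 3 times on the left and 4 times on the right.
left_rows = [("hot", i) for i in range(3)] + [("cold", 99)]
right_rows = [("hot", j) for j in range(4)] + [("cold", 7)]
joined = hash_equi_join(left_rows, right_rows)
hot_count = sum(1 for key, _, _ in joined if key == "hot")
```

In a distributed setting, a naive key-based partitioning would send all of those `hot` rows, and all of their output, to a single worker.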
This paper, an extended version of [metwally-SIGMOD-2022], proposes
Adaptive-Multistage-Join (AM-Join) for scalable and fast equi-joins in
distributed shared-nothing architectures. AM-Join utilizes (a) Tree-Join, a
proposed novel algorithm that scales well when the joined tables share hot
keys, and (b) Broadcast-Join, the fastest known algorithm for joining keys that
are hot in only one table.
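The generic broadcast-join idea can be sketched as follows. This is a simplified illustration of the well-known technique, not the paper's AM-Join or Tree-Join; the function name `broadcast_join` and the representation of partitions as plain lists are assumptions. The small side is copied to every partition of the large side, so rows with a hot key in the large table are joined wherever they already reside and never need to be shuffled to one worker.

```python
def broadcast_join(large_partitions, small):
    """Join each partition of a large table against a replicated small table.

    large_partitions: list of partitions, each a list of (key, value) rows
    small:            list of (key, value) rows, assumed to fit in memory
    """
    # Build the lookup once; conceptually it is shipped to every worker.
    lookup = {}
    for key, value in small:
        lookup.setdefault(key, []).append(value)
    # Each partition is joined locally, so the large table is never reshuffled.
    return [
        [(key, lv, sv) for key, lv in partition for sv in lookup.get(key, [])]
        for partition in large_partitions
    ]

# The hot key is spread over both partitions and stays spread after the join.
partitions = [[("hot", 1), ("hot", 2)], [("hot", 3), ("x", 4)]]
per_partition = broadcast_join(partitions, [("hot", "s")])
```

The trade-off is that the small table must be replicated to every worker, which is only practical when the rows being broadcast are few.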
Unlike the state-of-the-art algorithms, AM-Join (a) holistically solves the
join-skew problem by achieving load balancing throughout the join execution,
and (b) supports all outer-join variants without record deduplication or custom
table partitioning. For the fastest AM-Join outer-join performance, we propose
the Index-Broadcast-Join (IB-Join) family of algorithms for Small-Large joins,
where one table fits in memory and the other can be up to orders of magnitude
larger. The outer-join variants of IB-Join improve on the state-of-the-art
Small-Large outer-join algorithms.
The proposed algorithms can be adopted in any shared-nothing architecture. We
implemented a MapReduce version using Spark. Our evaluation shows the proposed
algorithms execute significantly faster and scale to more skewed and
orders-of-magnitude bigger tables when compared to the state-of-the-art
algorithms.
Pregelix: Big(ger) Graph Analytics on A Dataflow Engine
There is a growing need for distributed graph processing systems that are
capable of gracefully scaling to very large graph datasets. Unfortunately, this
challenge has not been easily met due to the intense memory pressure imposed by
process-centric, message-passing designs that many graph processing systems
follow. Pregelix is a new open source distributed graph processing system that
is based on an iterative dataflow design that is better tuned to handle both
in-memory and out-of-core workloads. As such, Pregelix offers improved
performance characteristics and scaling properties over current open source
systems (e.g., we have seen up to 15x speedup compared to Apache Giraph and up
to 35x speedup compared to distributed GraphLab), and makes more effective use
of available machine resources to support Big(ger) Graph Analytics.