Scalable Analytics over Distributed Time-series Graphs using GoFFish
Graphs are a key form of Big Data, and performing scalable analytics over
them is invaluable to many domains. As our ability to collect data grows, there
is an emerging class of inter-connected data which accumulates or varies over
time, and on which novel analytics - both over the network structure and across
the time-variant attribute values - is necessary. We introduce the notion of
time-series graph analytics and propose Gopher, a scalable programming
abstraction to develop algorithms and analytics on such datasets. Our
abstraction leverages a sub-graph centric programming model and extends it to
the temporal dimension using an iterative BSP (Bulk Synchronous Parallel)
approach. Gopher is co-designed with GoFS, a distributed storage specialized
for time-series graphs, as part of the GoFFish distributed analytics platform.
We examine storage optimizations for GoFS, design patterns in Gopher to
leverage the distributed data layout, and evaluate the GoFFish platform using
time-series graph data and applications on a commodity cluster.
Polystore++: Accelerated Polystore System for Heterogeneous Workloads
Modern real-time business analytics consist of heterogeneous workloads (e.g.,
database queries, graph processing, and machine learning). These analytic
applications need programming environments that can capture all aspects of the
constituent workloads (including data models they work on and movement of data
across processing engines). Polystore systems suit such applications; however,
these systems currently execute on CPUs and the slowdown of Moore's Law means
they cannot meet the performance and efficiency requirements of modern
workloads. We envision Polystore++, an architecture to accelerate existing
polystore systems using hardware accelerators (e.g., FPGAs, CGRAs, and GPUs).
Polystore++ systems can achieve high performance at low power by identifying
and offloading components of a polystore system that are amenable to
acceleration using specialized hardware. Building a Polystore++ system is
challenging and introduces new research problems motivated by the use of
hardware accelerators (e.g., optimizing and mapping query plans across
heterogeneous computing units and exploiting hardware pipelining and
parallelism to improve performance). In this paper, we discuss these challenges
in detail and list possible approaches to address these problems.
Comment: 11 pages, Accepted in ICDCS 201
A Microbenchmark Characterization of the Emu Chick
The Emu Chick is a prototype system designed around the concept of migratory
memory-side processing. Rather than transferring large amounts of data across
power-hungry, high-latency interconnects, the Emu Chick moves lightweight
thread contexts to near-memory cores before the beginning of each memory read.
The current prototype hardware uses FPGAs to implement cache-less "Gossamer"
cores for doing computational work and a stationary core to run basic operating
system functions and migrate threads between nodes. In this multi-node
characterization of the Emu Chick, we extend an earlier single-node
investigation (Hein et al., AsHES 2018) of the memory bandwidth
characteristics of the system through benchmarks like STREAM, pointer chasing,
and sparse matrix-vector multiplication. We compare the Emu Chick hardware to
architectural simulation and an Intel Xeon-based platform. Our results
demonstrate that for many basic operations the Emu Chick can use available
memory bandwidth more efficiently than a more traditional, cache-based
architecture, although bandwidth usage suffers for computationally intensive
workloads like SpMV. Moreover, the Emu Chick provides stable, predictable
performance with up to 65% of the peak bandwidth utilization on a random-access
pointer chasing benchmark with weak locality.
MESH: A Flexible Distributed Hypergraph Processing System
With the rapid growth of large online social networks, the ability to analyze
large-scale social structure and behavior has become critically important, and
this has led to the development of several scalable graph processing systems.
In reality, however, social interaction takes place not only between pairs of
individuals as in the graph model, but rather in the context of multi-user
groups. Research has shown that such group dynamics can be better modeled
through a more general hypergraph model, resulting in the need to build
scalable hypergraph processing systems. In this paper, we present MESH, a
flexible distributed framework for scalable hypergraph processing. MESH
provides an easy-to-use and expressive application programming interface that
naturally extends the "think like a vertex" model common to many popular graph
processing systems. Our framework provides a flexible implementation based on
an underlying graph processing system, and enables different design choices for
the key implementation issues of partitioning a hypergraph representation. We
implement MESH on top of the popular GraphX graph processing framework in
Apache Spark. Using a variety of real datasets and experiments conducted on a
local 8-node cluster as well as a 65-node Amazon AWS testbed, we demonstrate
that MESH provides flexibility based on data and application characteristics,
as well as scalability with cluster size. We further show that it is
competitive in performance to HyperX, another hypergraph processing system
based on Spark, while providing a much simpler implementation (requiring about
5X fewer lines of code), thus showing that simplicity and flexibility need not
come at the cost of performance.
Comment: 14 pages, 15 figures, 2019 IEEE International Conference on Cloud
Engineering (IC2E)
Scaling-Up Reasoning and Advanced Analytics on BigData
BigDatalog is an extension of Datalog that achieves performance and
scalability on both Apache Spark and multicore systems to the point that its
graph analytics outperform those written in GraphX. Looking back, we see how
this realizes the ambitious goal pursued by deductive database researchers
beginning forty years ago: this is the goal of combining the rigor and power of
logic in expressing queries and reasoning with the performance and scalability
by which relational databases managed Big Data. This goal led to Datalog which
is based on Horn Clauses like Prolog but employs implementation techniques,
such as Semi-naive Fixpoint and Magic Sets, that extend the bottom-up
computation model of relational systems, and thus obtain the performance and
scalability that relational systems had achieved, as far back as the 80s, using
data-parallelization on shared-nothing architectures. But this goal proved
difficult to achieve because of major issues at (i) the language level and (ii)
the system level. The paper describes how (i) was addressed by simple rules
under which the fixpoint semantics extends to programs using count, sum and
extrema in recursion, and (ii) was tamed by parallel compilation techniques
that achieve scalability on multicore systems and Apache Spark. This paper is
under consideration for acceptance in Theory and Practice of Logic Programming
(TPLP).
Comment: Under consideration in Theory and Practice of Logic Programming
(TPLP)
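The semi-naive fixpoint technique named in the abstract above can be illustrated with a minimal sketch. This is not BigDatalog code; the function name `transitive_closure` and the edge-list input format are assumptions made for illustration. It evaluates the classic two-rule transitive-closure program and, in each round, joins only the facts that were new in the previous round.

```python
def transitive_closure(edges):
    """Semi-naive bottom-up evaluation of the Datalog program:
         tc(X, Y) <- edge(X, Y).
         tc(X, Y) <- tc(X, Z), edge(Z, Y).
    `edges` is an iterable of (source, destination) pairs (assumed format).
    """
    edges = set(edges)

    # Index edge by its first column so the join on Z is a dictionary lookup.
    successors = {}
    for src, dst in edges:
        successors.setdefault(src, set()).add(dst)

    total = set(edges)  # every tc fact derived so far
    delta = set(edges)  # tc facts that were new in the previous round
    while delta:
        # Join only the new facts with edge, not the whole tc relation.
        derived = {(x, y) for (x, z) in delta for y in successors.get(z, ())}
        delta = derived - total
        total |= delta
    return total


if __name__ == "__main__":
    print(sorted(transitive_closure([(1, 2), (2, 3), (3, 4)])))
```

A naive evaluation would re-join the entire `total` relation in every round; restricting the join to `delta` avoids re-deriving known facts while reaching the same fixpoint.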
Accelerating PageRank using Partition-Centric Processing
PageRank is a fundamental link analysis algorithm that also functions as a
key representative of the performance of Sparse Matrix-Vector (SpMV)
multiplication. The traditional PageRank implementation generates fine-grained
random memory accesses, resulting in a large amount of wasteful DRAM
traffic and poor bandwidth utilization. In this paper, we present a novel
Partition-Centric Processing Methodology (PCPM) to compute PageRank that
drastically reduces the amount of DRAM communication while achieving high
sustained memory bandwidth. PCPM uses a Partition-centric abstraction coupled
with the Gather-Apply-Scatter (GAS) programming model. By carefully examining
how a PCPM-based implementation impacts communication characteristics of the
algorithm, we propose several system optimizations that improve the execution
time substantially. More specifically, we develop (1) a new data layout that
significantly reduces communication and random DRAM accesses, and (2) branch
avoidance mechanisms to get rid of unpredictable data-dependent branches.
We perform detailed analytical and experimental evaluation of our approach
using 6 large graphs and demonstrate an average 2.7x speedup in execution time
and 1.7x reduction in communication volume, compared to the state-of-the-art.
We also show that, unlike other GAS-based implementations, PCPM is able to
further reduce main memory traffic by taking advantage of intelligent node
labeling that enhances locality. Although we use PageRank as the target
application in this paper, our approach can be applied to generic SpMV
computation.
Comment: Added acknowledgments. In proceedings of USENIX ATC 201
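For orientation, the traditional PageRank baseline that the abstract above contrasts against can be sketched as a plain power iteration. This is not PCPM; the function name `pagerank`, the adjacency-dictionary input, the damping factor of 0.85, and the even redistribution of rank from dangling nodes are all assumptions made for illustration. The inner loop writes to `next_rank[dst]` for arbitrary destinations, which is the kind of fine-grained random access the abstract describes.

```python
def pagerank(out_edges, damping=0.85, iterations=50):
    """Plain scatter-style PageRank power iteration (illustrative baseline).

    `out_edges` maps every node to a list of its out-neighbours; every
    destination is assumed to appear as a key as well.
    """
    nodes = list(out_edges)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}

    for _ in range(iterations):
        # Every node starts each round with the teleport share.
        next_rank = {v: (1.0 - damping) / n for v in nodes}
        for src in nodes:
            targets = out_edges[src]
            if targets:
                share = damping * rank[src] / len(targets)
                for dst in targets:
                    # Scattered update: dst can be any node in the graph.
                    next_rank[dst] += share
            else:
                # Dangling node: spread its rank evenly (an assumed policy).
                share = damping * rank[src] / n
                for v in nodes:
                    next_rank[v] += share
        rank = next_rank
    return rank


if __name__ == "__main__":
    print(pagerank({"a": ["b"], "b": ["c"], "c": ["a"]}))
```

Each iteration is equivalent to one sparse matrix-vector product, which is why the abstract treats PageRank as a representative SpMV workload.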
A Collaborative Untethered Virtual Reality Environment for Interactive Social Network Visualization
The increasing prevalence of Virtual Reality technologies as a platform for
gaming and video playback warrants research into how to best apply the current
state of the art to challenges in data visualization. Many current VR systems
are noncollaborative, while data analysis and visualization is often a
multi-person process. Our goal in this paper is to address the technical and
user experience challenges that arise when creating VR environments for
collaborative data visualization. We focus on the integration of multiple
tracking systems and the new interaction paradigms that this integration can
enable, along with visual design considerations that apply specifically to
collaborative network visualization in virtual reality. We demonstrate a system
for collaborative interaction with large 3D layouts of Twitter friend/follow
networks. The system is built by combining a 'Holojam' architecture (multiple
GearVR Headsets within an OptiTrack motion capture stage) and Perception Neuron
motion suits, to offer an untethered, full-room multi-person visualization
experience.
Empowering In-Memory Relational Database Engines with Native Graph Processing
The plethora of graphs and relational data gives rise to many interesting
graph-relational queries in various domains, e.g., finding related proteins
satisfying relational predicates in a biological network. The maturity of
RDBMSs motivated academia and industry to invest efforts in leveraging RDBMSs
for graph processing, where efficiency is proven for vital graph queries.
However, none of these efforts process graphs natively inside the RDBMS, which
is particularly challenging due to the impedance mismatch between the
relational and the graph models. In this paper, we propose to treat graphs as
first-class citizens inside the relational engine so that operations on graphs
are executed natively inside the RDBMS. We realize our approach inside VoltDB,
an open-source in-memory relational database, and name this realization
GRFusion. The SQL and the query engine of GRFusion are empowered to
declaratively define graphs and execute cross-data-model query plans formed by
graph and relational operators, resulting in query-time speedups of up to four
orders of magnitude w.r.t. state-of-the-art approaches.
A New Frontier for Pull-Based Graph Processing
The trade-off between pull-based and push-based graph processing engines is
well-understood. On one hand, pull-based engines can achieve higher throughput
because their workloads are read-dominant, rather than write-dominant, and can
proceed without synchronization between threads. On the other hand, push-based
engines are much better able to take advantage of the frontier optimization,
which leverages the fact that often only a small subset of the graph needs to
be accessed to complete an iteration of a graph processing application. Hybrid
engines attempt to overcome this trade-off by dynamically switching between
push and pull, but there are two key disadvantages with this approach. First,
applications must be implemented twice (once for push and once for pull), and
second, processing throughput is reduced for iterations that run with push.
We propose a radically different solution: rebuild the frontier optimization
entirely such that it is well-suited for a pull-based engine. In doing so, we
remove the only advantage that a push-based engine had over a pull-based
engine, making it possible to eliminate the push-based engine entirely. We
introduce Wedge, a pull-only graph processing framework that transforms the
traditional source-oriented vertex-based frontier into a pull-friendly format
called the Wedge Frontier. The transformation itself is expensive even when
parallelized, so we introduce two key optimizations to make it practical.
First, we perform the transformation only when the resulting Wedge Frontier is
sufficiently sparse. Second, we coarsen the granularity of the representation
of elements in the Wedge Frontier. These optimizations respectively improve
Wedge's performance by up to 5x and 2x, enabling it to outperform Grazelle,
Ligra, and GraphMat respectively by up to 2.8x, 4.9x, and 185.5x.
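The push/pull distinction described in the abstract above can be made concrete with a small breadth-first search sketch. This is not Wedge and does not implement the Wedge Frontier; the names `push_step`, `pull_step`, and `bfs_levels` and the adjacency-dictionary graph format are assumptions made for illustration. The push step iterates only over frontier vertices and writes to their out-neighbours, while the pull step scans every unvisited vertex and only reads from its in-neighbours.

```python
def push_step(out_nbrs, frontier, visited):
    """Push: each frontier vertex writes to its out-neighbours."""
    nxt = set()
    for u in frontier:            # work is proportional to the frontier
        for v in out_nbrs[u]:
            if v not in visited:
                nxt.add(v)
    return nxt


def pull_step(in_nbrs, frontier, visited):
    """Pull: each unvisited vertex reads its in-neighbours."""
    nxt = set()
    for v in in_nbrs:             # scans all vertices, frontier or not
        if v in visited:
            continue
        if any(u in frontier for u in in_nbrs[v]):
            nxt.add(v)            # v only updates its own state
    return nxt


def bfs_levels(out_nbrs, source, use_pull):
    """BFS levels from `source`, using either step function throughout."""
    # Build the reverse adjacency lists needed by the pull step.
    in_nbrs = {v: [] for v in out_nbrs}
    for u, targets in out_nbrs.items():
        for v in targets:
            in_nbrs[v].append(u)

    level = {source: 0}
    frontier = {source}
    depth = 0
    while frontier:
        depth += 1
        if use_pull:
            frontier = pull_step(in_nbrs, frontier, level)
        else:
            frontier = push_step(out_nbrs, frontier, level)
        for v in frontier:
            level[v] = depth
    return level


if __name__ == "__main__":
    graph = {0: [1, 2], 1: [3], 2: [3], 3: [], 4: [0]}
    print(bfs_levels(graph, 0, use_pull=False))
    print(bfs_levels(graph, 0, use_pull=True))
```

Both variants compute the same levels. The difference is that the push step touches only the frontier's edges, whereas this simple pull step visits every unvisited vertex even when the frontier is tiny, which is the gap a pull-friendly frontier representation aims to close.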
Asynchronous Complex Analytics in a Distributed Dataflow Architecture
Scalable distributed dataflow systems have recently experienced widespread
adoption, with commodity dataflow engines such as Hadoop and Spark, and even
commodity SQL engines routinely supporting increasingly sophisticated analytics
tasks (e.g., support vector machines, logistic regression, collaborative
filtering). However, these systems' synchronous (often Bulk Synchronous
Parallel) dataflow execution model is at odds with an increasingly important
trend in the machine learning community: the use of asynchrony via shared,
mutable state (i.e., data races) in convex programming tasks, which has---in a
single-node context---delivered noteworthy empirical performance gains and
inspired new research into asynchronous algorithms. In this work, we attempt to
bridge this gap by evaluating the use of lightweight, asynchronous state
transfer within a commodity dataflow engine. Specifically, we investigate the
use of asynchronous sideways information passing (ASIP) that presents
single-stage parallel iterators with a Volcano-like intra-operator iterator
that can be used for asynchronous information passing. We port two synchronous
convex programming algorithms, stochastic gradient descent and the alternating
direction method of multipliers (ADMM), to use ASIPs. We evaluate an
implementation of ASIPs within Apache Spark that exhibits considerable
speedups as well as a rich set of performance trade-offs in the use of these
asynchronous algorithms.