GraphMat: High performance graph analytics made productive
Given the growing importance of large-scale graph analytics, there is a need
to improve the performance of graph analysis frameworks without compromising on
productivity. GraphMat is our solution to bridge this gap between a
user-friendly graph analytics framework and native, hand-optimized code.
GraphMat functions by taking vertex programs and mapping them to high
performance sparse matrix operations in the backend. We get the productivity
benefits of a vertex programming framework without sacrificing performance.
GraphMat is written in C++, and we have been able to implement a diverse set of
graph algorithms in this framework with effort comparable to other vertex
programming frameworks. GraphMat runs 1.2-7X faster than high performance
frameworks such as GraphLab, CombBLAS and Galois. It achieves better multicore
scalability (13-15X on 24 cores) than other frameworks and is within 1.2X of
native, hand-optimized code on a variety of graph algorithms. Since GraphMat's
performance depends mainly on a few scalable and well-understood sparse matrix
operations, GraphMat can naturally benefit from the trend of increasing
parallelism in future hardware.
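The vertex-program-to-sparse-matrix mapping can be illustrated in miniature. The following is a hedged Python sketch, not GraphMat's C++ API (the function name and data layout are invented for exposition): one PageRank iteration written as the sparse matrix-vector product y = A @ rank, where A[dst][src] = 1/out_deg[src] is the column-normalized adjacency matrix.

```python
# Illustrative sketch only; GraphMat performs this mapping in
# optimized C++. Assumes every vertex has at least one out-edge.

def pagerank_spmv(edges, n, damping=0.85, iters=20):
    """edges: list of (src, dst) pairs; returns the PageRank vector."""
    out_deg = [0] * n
    for s, _ in edges:
        out_deg[s] += 1
    rank = [1.0 / n] * n
    for _ in range(iters):
        y = [0.0] * n
        for s, d in edges:               # sparse matrix-vector product
            y[d] += rank[s] / out_deg[s]
        rank = [(1 - damping) / n + damping * yi for yi in y]
    return rank

# Tiny 3-vertex example: vertex 2 receives the most in-links.
ranks = pagerank_spmv([(0, 1), (1, 2), (2, 0), (0, 2)], n=3)
```

Because the per-iteration work is a single SpMV, any improvement to the sparse kernel (vectorization, more cores) speeds up every vertex program expressed this way.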
ExaGeoStat: A High Performance Unified Software for Geostatistics on Manycore Systems
We present ExaGeoStat, a high performance framework for geospatial statistics
in climate and environment modeling. In contrast to simulation based on partial
differential equations derived from first-principles modeling, ExaGeoStat
employs a statistical model based on the evaluation of the Gaussian
log-likelihood function, which operates on a large dense covariance matrix.
Generated by the parametrizable Matern covariance function, the resulting
matrix is symmetric and positive definite. The computational tasks involved
during the evaluation of the Gaussian log-likelihood function become daunting
as the number n of geographical locations grows, since O(n^2) storage and O(n^3)
operations are required. While many approximation methods have been devised
from the side of statistical modeling to ameliorate these polynomial
complexities, we are interested here in the complementary approach of
evaluating the exact algebraic result by exploiting advances in solution
algorithms and many-core computer architectures. Using state-of-the-art high
performance dense linear algebra libraries associated with various leading edge
parallel architectures (Intel KNLs, NVIDIA GPUs, and distributed-memory
systems), ExaGeoStat raises the game for statistical applications from climate
and environmental science. ExaGeoStat provides a reference evaluation of
statistical parameters, with which to assess the validity of the various
approaches based on approximation. The framework takes a first step in the
merger of large-scale data analytics and extreme computing for geospatial
statistical applications, to be followed by additional complexity reducing
improvements from the solver side that can be implemented under the same
interface. Thus, a single uncompromised statistical model can ultimately be
executed in a wide variety of emerging exascale environments.
Comment: 14 pages, 7 figures
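The O(n^2)/O(n^3) structure of the exact evaluation can be made concrete with a small, self-contained Python sketch. It uses the Matern covariance with smoothness 1/2, which reduces to the exponential kernel and so needs no Bessel functions; ExaGeoStat itself performs these same steps with tuned dense linear algebra libraries, and the locations and observations below are made up for illustration.

```python
import math

def matern_half_cov(locs, sigma2=1.0, ell=0.5):
    # O(n^2) storage: dense pairwise covariance over 1-D locations.
    return [[sigma2 * math.exp(-abs(x - y) / ell) for y in locs]
            for x in locs]

def cholesky(C):
    # O(n^3) work: dense Cholesky factorization C = L L^T.
    n = len(C)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = (math.sqrt(C[i][i] - s) if i == j
                       else (C[i][j] - s) / L[j][j])
    return L

def gaussian_loglik(z, C):
    n = len(z)
    L = cholesky(C)
    # Forward-solve L a = z, so that a . a = z^T C^{-1} z.
    a = [0.0] * n
    for i in range(n):
        a[i] = (z[i] - sum(L[i][k] * a[k] for k in range(i))) / L[i][i]
    logdet = 2.0 * sum(math.log(L[i][i]) for i in range(n))
    quad = sum(x * x for x in a)
    return -0.5 * (n * math.log(2 * math.pi) + logdet + quad)

C = matern_half_cov([0.0, 0.3, 0.7, 1.0])
ll = gaussian_loglik([0.2, -0.1, 0.4, 0.0], C)
```

The Cholesky factorization dominates the cost, which is why the framework's performance tracks the dense linear algebra library underneath it.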
GRE: A Graph Runtime Engine for Large-Scale Distributed Graph-Parallel Applications
Large-scale distributed graph-parallel computing is challenging. On one hand,
due to the irregular computation pattern and lack of locality, it is hard to
express parallelism efficiently. On the other hand, due to the scale-free
nature, real-world graphs are hard to partition in balance with low cut. To
address these challenges, several graph-parallel frameworks including Pregel
and GraphLab (PowerGraph) have been developed recently. In this paper, we
present an alternative framework, Graph Runtime Engine (GRE). While retaining
the vertex-centric programming model, GRE proposes two new abstractions: 1) a
Scatter-Combine computation model based on active messages to exploit massive
fine-grained edge-level parallelism, and 2) an Agent-Graph data model based on
vertex factorization to partition and represent directed graphs. GRE is
implemented on a commercial off-the-shelf multi-core cluster. We experimentally
evaluate GRE with three benchmark programs (PageRank, Single Source Shortest
Path and Connected Components) on real-world and synthetic graphs with millions
to over a billion vertices. Compared to PowerGraph, GRE shows 2.5~17 times
better performance on 8~16 machines (192 cores). Specifically, GRE's PageRank
is the fastest compared to its counterparts in other frameworks (PowerGraph,
Spark, Twister) reported in the public literature. Moreover, GRE significantly
optimizes memory usage so that it can process a large graph of 1 billion
vertices and 17 billion edges on our cluster with 768GB of memory in total,
while PowerGraph can only process less than half of this graph scale.
Comment: 12 pages, also submitted to PVLD
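The Scatter-Combine model can be sketched in a few lines. The Python below is an assumed, simplified rendering, not GRE's implementation: active vertices scatter messages along their out-edges, and messages bound for the same vertex are combined on the fly with an associative operator, here min for Single Source Shortest Path.

```python
# Illustrative sketch of one Scatter-Combine superstep per loop
# iteration; combining with min as messages arrive avoids building
# per-vertex message queues.

def scatter_combine_sssp(edges, n, source):
    """edges: list of (src, dst, weight); returns distances from source."""
    INF = float("inf")
    dist = [INF] * n
    dist[source] = 0.0
    active = {source}
    while active:
        combined = {}                       # dst -> combined (min) message
        for s, d, w in edges:
            if s in active:                 # scatter: fine-grained, per edge
                msg = dist[s] + w
                if msg < combined.get(d, INF):
                    combined[d] = msg       # combine: keep the minimum
        active = set()
        for d, msg in combined.items():     # apply: activate improved vertices
            if msg < dist[d]:
                dist[d] = msg
                active.add(d)
    return dist

dist = scatter_combine_sssp(
    [(0, 1, 1.0), (0, 2, 4.0), (1, 2, 1.0), (2, 3, 1.0)], n=4, source=0)
```

Because each edge's message is folded into a single combined value, the parallelism is at edge granularity rather than vertex granularity, which is the point of the abstraction.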
Regional Consistency: Programmability and Performance for Non-Cache-Coherent Systems
Parallel programmers face the often irreconcilable goals of programmability
and performance. HPC systems use distributed memory for scalability, thereby
sacrificing the programmability advantages of shared memory programming models.
Furthermore, the rapid adoption of heterogeneous architectures, often with
non-cache-coherent memory systems, has further increased the challenge of
supporting shared memory programming models. Our primary objective is to define
a memory consistency model that presents the familiar thread-based shared
memory programming model, but allows good application performance on
non-cache-coherent systems, including distributed memory clusters and
accelerator-based systems. We propose regional consistency (RegC), a new
consistency model that achieves this objective. Results on up to 256 processors
for representative benchmarks demonstrate the potential of RegC in the context
of our prototype distributed shared memory system.
Comment: 8 pages, 7 figures, 1 table; as submitted to CCGRID 201
Asynchronous Complex Analytics in a Distributed Dataflow Architecture
Scalable distributed dataflow systems have recently experienced widespread
adoption, with commodity dataflow engines such as Hadoop and Spark, and even
commodity SQL engines routinely supporting increasingly sophisticated analytics
tasks (e.g., support vector machines, logistic regression, collaborative
filtering). However, these systems' synchronous (often Bulk Synchronous
Parallel) dataflow execution model is at odds with an increasingly important
trend in the machine learning community: the use of asynchrony via shared,
mutable state (i.e., data races) in convex programming tasks, which has---in a
single-node context---delivered noteworthy empirical performance gains and
inspired new research into asynchronous algorithms. In this work, we attempt to
bridge this gap by evaluating the use of lightweight, asynchronous state
transfer within a commodity dataflow engine. Specifically, we investigate the
use of asynchronous sideways information passing (ASIP) that presents
single-stage parallel iterators with a Volcano-like intra-operator iterator
that can be used for asynchronous information passing. We port two synchronous
convex programming algorithms, stochastic gradient descent and the alternating
direction method of multipliers (ADMM), to use ASIPs. We evaluate an
implementation of ASIPs within Apache Spark that exhibits considerable
speedups as well as a rich set of performance trade-offs in the use of these
asynchronous algorithms.
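The single-node phenomenon motivating this work, asynchrony via shared mutable state, can be sketched independently of Spark and ASIPs. The following Hogwild-style Python example is invented for illustration (it is not the paper's code): several workers run stochastic gradient descent on a least-squares objective through an unsynchronized shared parameter, and the data races are benign.

```python
import random
import threading

# Workers minimize sum over (x, y) of (w*x - y)^2 on data with y = 2x,
# updating a shared parameter without any locks. The race is
# intentional: lost updates only slow convergence toward w = 2.

w = [0.0]                                   # shared, mutable state
data = [(x / 10.0, 2.0 * x / 10.0) for x in range(1, 11)]

def worker(seed, steps=2000, lr=0.02):
    rng = random.Random(seed)
    for _ in range(steps):
        x, y = rng.choice(data)
        grad = 2.0 * x * (w[0] * x - y)     # d/dw of (w*x - y)^2
        w[0] -= lr * grad                   # unsynchronized update

threads = [threading.Thread(target=worker, args=(s,)) for s in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
# w[0] converges to the exact solution w = 2 despite the races.
```

Replicating this convergence behavior inside a synchronous dataflow engine is exactly what makes the ASIP design interesting: the engine's execution model normally forbids this kind of cross-worker state sharing mid-stage.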
An Empirical Comparison of Big Graph Frameworks in the Context of Network Analysis
Complex networks are relational data sets commonly represented as graphs. The
analysis of their intricate structure is relevant to many areas of science and
commerce, and data sets may reach sizes that require distributed storage and
processing. We describe and compare programming models for distributed
computing with a focus on graph algorithms for large-scale complex network
analysis. Four frameworks - GraphLab, Apache Giraph, Giraph++ and Apache Flink
- are used to implement algorithms for the representative problems Connected
Components, Community Detection, PageRank and Clustering Coefficients. The
implementations are executed on a computer cluster to evaluate the frameworks'
suitability in practice and to compare their performance to that of the
single-machine, shared-memory parallel network analysis package NetworKit. Out
of the distributed frameworks, GraphLab and Apache Giraph generally show the
best performance. In our experiments a cluster of eight computers running
Apache Giraph enables the analysis of a network with about 2 billion edges,
which is too large for a single machine of the same type. However, for networks
that fit into memory of one machine, the performance of the shared-memory
parallel implementation is far better than the distributed ones. The study
provides experimental evidence for selecting the appropriate framework
depending on the task and data volume.
Design space exploration in the microthreaded many-core architecture
Design space exploration is commonly performed in embedded systems, where the
architecture is a complicated piece of engineering. With the current trend of
many-core systems, design space exploration in general-purpose computers can no
longer be avoided. The Microgrid is a complicated architecture, and therefore we
need to perform design space exploration. Generally, simulators are used for
the design space exploration of an architecture. Different simulators with
different levels of complexity, simulation time and accuracy are used.
Simulators with little complexity, low simulation time and reasonable accuracy
are desirable for the design space exploration of an architecture. These
simulators are referred to as high-level simulators and are commonly used in the
design of embedded systems. However, the use of high-level simulation for
design space exploration in general-purpose computers is a relatively new area
of research.
Comment: 12 pages, 1 figure
STEP : A Distributed Multi-threading Framework Towards Efficient Data Analytics
Various general-purpose distributed systems have been proposed to cope with
high-diversity applications in the pipeline of Big Data analytics. Most of them
provide simple yet effective primitives to simplify distributed programming.
While the rigid primitives offer great ease of use to savvy programmers, they
probably compromise efficiency in performance and flexibility in data
representation and programming specifications, which are critical properties in
real systems. In this paper, we discuss the limitations of coarse-grained
primitives and aim to provide an alternative for users to have flexible control
over distributed programs and operate globally shared data more efficiently. We
develop STEP, a novel distributed framework based on in-memory key-value store.
The key idea of STEP is to adapt multi-threading in a single machine to a
distributed environment. STEP enables users to take fine-grained control over
distributed threads and apply task-specific optimizations in a flexible manner.
The underlying key-value store serves as distributed shared memory to keep
globally shared data. To ensure ease-of-use, STEP offers plentiful effective
interfaces in terms of distributed shared data manipulation, cluster
management, distributed thread management and synchronization. We conduct
extensive experimental studies to evaluate the performance of STEP using real
data sets. The results show that STEP outperforms the state-of-the-art
general-purpose distributed systems as well as a specialized ML platform in
many real applications.
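A toy rendering of STEP's central idea, threads cooperating through a key-value store that plays the role of distributed shared memory, might look as follows. The KVStore class and its get/incr interface are invented for illustration; STEP's actual store is distributed and in-memory, and its threads span machines.

```python
import threading

# An in-process stand-in for distributed shared memory: globally
# shared data lives in a key-value store, and the store exposes an
# atomic read-modify-write primitive so threads need no external
# synchronization for this access pattern.

class KVStore:
    def __init__(self):
        self._data = {}
        self._lock = threading.Lock()

    def get(self, key, default=0):
        with self._lock:
            return self._data.get(key, default)

    def incr(self, key, delta=1):
        with self._lock:                    # atomic increment
            self._data[key] = self._data.get(key, 0) + delta

store = KVStore()

def worker(n):
    for _ in range(n):
        store.incr("counter")

threads = [threading.Thread(target=worker, args=(1000,)) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
# store.get("counter") is exactly 8 * 1000 thanks to the atomic incr.
```

The fine-grained control the paper argues for corresponds to choosing, per task, which operations go through such atomic store primitives and which can tolerate weaker coordination.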
BigSR: an empirical study of real-time expressive RDF stream reasoning on modern Big Data platforms
The trade-off between language expressiveness and system scalability (E&S) is
a well-known problem in RDF stream reasoning. Higher expressiveness supports
more complex reasoning logic, however, it may also hinder system scalability.
Current research mainly focuses on logical frameworks suitable for stream
reasoning as well as the implementation and the evaluation of prototype
systems. These systems are normally developed in a centralized setting which
suffer from inherent limited scalability, while an in-depth study of applying
distributed solutions to cover E&S is still missing. In this paper, we aim to
explore the feasibility of applying modern distributed computing frameworks to
meet E&S all together. To do so, we first propose BigSR, a technical
demonstrator that supports a positive fragment of the LARS framework. For the
sake of generality and to cover a wide variety of use cases, BigSR relies on
the two main execution models adopted by major distributed execution
frameworks: Bulk Synchronous Processing (BSP) and Record-at-A-Time (RAT).
Accordingly, we implement BigSR on top of Apache Spark Streaming (BSP model)
and Apache Flink (RAT model). In order to conclude on the impacts of BSP and
RAT on E&S, we analyze the ability of the two models to support distributed
stream reasoning and identify several types of use cases characterized by their
levels of support. This classification allows for quantifying the E&S trade-off
by assessing the scalability of each type of use case with respect to its level
of
expressiveness. Then, we conduct a series of experiments with 15 queries from 4
different datasets. Our experiments show that BigSR over both BSP and RAT
generally scales up to high throughput beyond a million triples per second
(with or without recursion), and RAT attains sub-millisecond delays for
stateless query operators.
Comment: 16 pages, 8 figures
DRONE: a Distributed Subgraph-Centric Framework for Processing Large Scale Power-law Graphs
Nowadays, in the big data era, social networks, graph databases, knowledge
graphs, electronic commerce, etc. demand efficient and scalable capabilities
for processing an ever-increasing volume of graph-structured data. To meet this
challenge, two mainstream distributed programming models, vertex-centric (VC)
and subgraph-centric (SC), were proposed. Compared to the VC model, the SC model
converges faster with less communication overhead on well-partitioned graphs,
and is easier to program due to the "think like a graph" philosophy. The
edge-cut method is considered a natural choice for graph partitioning in the
subgraph-centric model, and has been adopted by Giraph++, Blogel and GRAPE.
However, edge-cut partitioning causes a significant performance bottleneck when
processing large-scale power-law graphs, so the SC model is less competitive in
practice. In
this paper, we present an innovative distributed graph computing framework,
DRONE (Distributed gRaph cOmputiNg Engine). It combines the subgraph-centric
model and the vertex-cut graph partitioning strategy. Experiments show that
DRONE outperforms state-of-the-art distributed graph computing engines on
real-world graphs and synthetic power-law graphs. DRONE is capable of scaling
up to process one-trillion-edge synthetic power-law graphs, which is orders of
magnitude larger than previously reported by existing SC-based frameworks.
Comment: 13 pages, 9 figures
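Vertex-cut partitioning, which DRONE pairs with the subgraph-centric model, can be sketched with a greedy heuristic. This is an assumed simplification, not DRONE's actual partitioner: edges (rather than vertices) are assigned to partitions, and a vertex is replicated on every partition that holds one of its edges, so a power-law hub's edges can be spread out instead of piling onto a single partition as with edge-cut.

```python
# Greedy vertex-cut sketch: prefer partitions that already hold the
# edge's endpoints (to limit replication), but fall back to the least
# loaded partition when that choice would unbalance the edge counts.

def vertex_cut(edges, k, slack=1):
    parts = [[] for _ in range(k)]
    replicas = {}                           # vertex -> partitions holding it
    for s, d in edges:
        ps = replicas.get(s, set())
        pd = replicas.get(d, set())
        cand = (ps & pd) or (ps | pd) or set(range(k))
        p = min(sorted(cand), key=lambda i: len(parts[i]))
        if len(parts[p]) > min(len(q) for q in parts) + slack:
            p = min(range(k), key=lambda i: len(parts[i]))  # rebalance
        parts[p].append((s, d))
        replicas.setdefault(s, set()).add(p)
        replicas.setdefault(d, set()).add(p)
    rep = sum(len(v) for v in replicas.values()) / len(replicas)
    return parts, rep

star = [(0, i) for i in range(1, 9)]        # hub vertex 0, a power-law extreme
parts, rep_factor = vertex_cut(star, 4)
```

On this 8-edge star the edges spread evenly, two per partition, at the cost of replicating the hub on all four partitions, for a replication factor of 4/3 over the nine vertices; an edge-cut placement would instead have to put all eight hub edges behind one vertex.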