GraphMat: High performance graph analytics made productive
Given the growing importance of large-scale graph analytics, there is a need
to improve the performance of graph analysis frameworks without compromising on
productivity. GraphMat is our solution to bridge this gap between a
user-friendly graph analytics framework and native, hand-optimized code.
GraphMat functions by taking vertex programs and mapping them to high
performance sparse matrix operations in the backend. We get the productivity
benefits of a vertex programming framework without sacrificing performance.
GraphMat is written in C++, and we have been able to implement a diverse set of
graph algorithms in this framework with effort comparable to other vertex
programming frameworks. GraphMat runs 1.2-7X faster than high performance
frameworks such as GraphLab, CombBLAS and Galois. It achieves better multicore
scalability (13-15X on 24 cores) than other frameworks and is within 1.2X of
native, hand-optimized code on a variety of graph algorithms. Since GraphMat's
performance depends mainly on a few scalable and well-understood sparse matrix
operations, GraphMat can naturally benefit from the trend of increasing
parallelism in future hardware.
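The vertex-program-to-sparse-matrix mapping can be illustrated in miniature. The following is a hedged Python sketch, not GraphMat's C++ API (the function name and data layout are invented for exposition): one PageRank iteration written as the sparse matrix-vector product y = A @ rank, where A[dst][src] = 1/out_deg[src] is the column-normalized adjacency matrix.

```python
# Illustrative sketch only; GraphMat performs this mapping in
# optimized C++. Assumes every vertex has at least one out-edge.

def pagerank_spmv(edges, n, damping=0.85, iters=20):
    """edges: list of (src, dst) pairs; returns the PageRank vector."""
    out_deg = [0] * n
    for s, _ in edges:
        out_deg[s] += 1
    rank = [1.0 / n] * n
    for _ in range(iters):
        y = [0.0] * n
        for s, d in edges:               # sparse matrix-vector product
            y[d] += rank[s] / out_deg[s]
        rank = [(1 - damping) / n + damping * yi for yi in y]
    return rank

# Tiny 3-vertex example: vertex 2 receives the most in-links.
ranks = pagerank_spmv([(0, 1), (1, 2), (2, 0), (0, 2)], n=3)
```

Because the per-iteration work is a single SpMV, any improvement to the sparse kernel (vectorization, more cores) speeds up every vertex program expressed this way.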
ExaGeoStat: A High Performance Unified Software for Geostatistics on Manycore Systems
We present ExaGeoStat, a high performance framework for geospatial statistics
in climate and environment modeling. In contrast to simulation based on partial
differential equations derived from first-principles modeling, ExaGeoStat
employs a statistical model based on the evaluation of the Gaussian
log-likelihood function, which operates on a large dense covariance matrix.
Generated by the parametrizable Matern covariance function, the resulting
matrix is symmetric and positive definite. The computational tasks involved
during the evaluation of the Gaussian log-likelihood function become daunting
as the number n of geographical locations grows, since O(n^2) storage and O(n^3)
operations are required. While many approximation methods have been devised
from the side of statistical modeling to ameliorate these polynomial
complexities, we are interested here in the complementary approach of
evaluating the exact algebraic result by exploiting advances in solution
algorithms and many-core computer architectures. Using state-of-the-art high
performance dense linear algebra libraries associated with various leading edge
parallel architectures (Intel KNLs, NVIDIA GPUs, and distributed-memory
systems), ExaGeoStat raises the game for statistical applications from climate
and environmental science. ExaGeoStat provides a reference evaluation of
statistical parameters, with which to assess the validity of the various
approaches based on approximation. The framework takes a first step in the
merger of large-scale data analytics and extreme computing for geospatial
statistical applications, to be followed by additional complexity reducing
improvements from the solver side that can be implemented under the same
interface. Thus, a single uncompromised statistical model can ultimately be
executed in a wide variety of emerging exascale environments.
Comment: 14 pages, 7 figures
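The O(n^2)/O(n^3) structure of the exact evaluation can be made concrete with a small, self-contained Python sketch. It uses the Matern covariance with smoothness 1/2, which reduces to the exponential kernel and so needs no Bessel functions; ExaGeoStat itself performs these same steps with tuned dense linear algebra libraries, and the locations and observations below are made up for illustration.

```python
import math

def matern_half_cov(locs, sigma2=1.0, ell=0.5):
    # O(n^2) storage: dense pairwise covariance over 1-D locations.
    return [[sigma2 * math.exp(-abs(x - y) / ell) for y in locs]
            for x in locs]

def cholesky(C):
    # O(n^3) work: dense Cholesky factorization C = L L^T.
    n = len(C)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = (math.sqrt(C[i][i] - s) if i == j
                       else (C[i][j] - s) / L[j][j])
    return L

def gaussian_loglik(z, C):
    n = len(z)
    L = cholesky(C)
    # Forward-solve L a = z, so that a . a = z^T C^{-1} z.
    a = [0.0] * n
    for i in range(n):
        a[i] = (z[i] - sum(L[i][k] * a[k] for k in range(i))) / L[i][i]
    logdet = 2.0 * sum(math.log(L[i][i]) for i in range(n))
    quad = sum(x * x for x in a)
    return -0.5 * (n * math.log(2 * math.pi) + logdet + quad)

C = matern_half_cov([0.0, 0.3, 0.7, 1.0])
ll = gaussian_loglik([0.2, -0.1, 0.4, 0.0], C)
```

The Cholesky factorization dominates the cost, which is why the framework's performance tracks the dense linear algebra library underneath it.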
GRE: A Graph Runtime Engine for Large-Scale Distributed Graph-Parallel Applications
Large-scale distributed graph-parallel computing is challenging. On one hand,
due to the irregular computation pattern and lack of locality, it is hard to
express parallelism efficiently. On the other hand, due to the scale-free
nature, real-world graphs are hard to partition in balance with low cut. To
address these challenges, several graph-parallel frameworks including Pregel
and GraphLab (PowerGraph) have been developed recently. In this paper, we
present an alternative framework, Graph Runtime Engine (GRE). While retaining
the vertex-centric programming model, GRE proposes two new abstractions: 1) a
Scatter-Combine computation model based on active messages to exploit massive
fine-grained edge-level parallelism, and 2) an Agent-Graph data model based on
vertex factorization to partition and represent directed graphs. GRE is
implemented on a commercial off-the-shelf multi-core cluster. We experimentally
evaluate GRE with three benchmark programs (PageRank, Single Source Shortest
Path and Connected Components) on real-world and synthetic graphs with millions
to over a billion vertices. Compared to PowerGraph, GRE shows 2.5~17 times
better performance on 8~16 machines (192 cores). Specifically, GRE's PageRank
is the fastest compared to its counterparts in other frameworks (PowerGraph,
Spark, Twister) reported in the public literature. Moreover, GRE significantly
optimizes memory usage so that it can process a large graph of 1 billion
vertices and 17 billion edges on our cluster with 768GB of memory in total,
while PowerGraph can only process less than half of this graph scale.
Comment: 12 pages, also submitted to PVLD
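The Scatter-Combine model can be sketched in a few lines. The Python below is an assumed, simplified rendering, not GRE's implementation: active vertices scatter messages along their out-edges, and messages bound for the same vertex are combined on the fly with an associative operator, here min for Single Source Shortest Path.

```python
# Illustrative sketch of one Scatter-Combine superstep per loop
# iteration; combining with min as messages arrive avoids building
# per-vertex message queues.

def scatter_combine_sssp(edges, n, source):
    """edges: list of (src, dst, weight); returns distances from source."""
    INF = float("inf")
    dist = [INF] * n
    dist[source] = 0.0
    active = {source}
    while active:
        combined = {}                       # dst -> combined (min) message
        for s, d, w in edges:
            if s in active:                 # scatter: fine-grained, per edge
                msg = dist[s] + w
                if msg < combined.get(d, INF):
                    combined[d] = msg       # combine: keep the minimum
        active = set()
        for d, msg in combined.items():     # apply: activate improved vertices
            if msg < dist[d]:
                dist[d] = msg
                active.add(d)
    return dist

dist = scatter_combine_sssp(
    [(0, 1, 1.0), (0, 2, 4.0), (1, 2, 1.0), (2, 3, 1.0)], n=4, source=0)
```

Because each edge's message is folded into a single combined value, the parallelism is at edge granularity rather than vertex granularity, which is the point of the abstraction.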
Regional Consistency: Programmability and Performance for Non-Cache-Coherent Systems
Parallel programmers face the often irreconcilable goals of programmability
and performance. HPC systems use distributed memory for scalability, thereby
sacrificing the programmability advantages of shared memory programming models.
Furthermore, the rapid adoption of heterogeneous architectures, often with
non-cache-coherent memory systems, has further increased the challenge of
supporting shared memory programming models. Our primary objective is to define
a memory consistency model that presents the familiar thread-based shared
memory programming model, but allows good application performance on
non-cache-coherent systems, including distributed memory clusters and
accelerator-based systems. We propose regional consistency (RegC), a new
consistency model that achieves this objective. Results on up to 256 processors
for representative benchmarks demonstrate the potential of RegC in the context
of our prototype distributed shared memory system.
Comment: 8 pages, 7 figures, 1 table; as submitted to CCGRID 201
Asynchronous Complex Analytics in a Distributed Dataflow Architecture
Scalable distributed dataflow systems have recently experienced widespread
adoption, with commodity dataflow engines such as Hadoop and Spark, and even
commodity SQL engines routinely supporting increasingly sophisticated analytics
tasks (e.g., support vector machines, logistic regression, collaborative
filtering). However, these systems' synchronous (often Bulk Synchronous
Parallel) dataflow execution model is at odds with an increasingly important
trend in the machine learning community: the use of asynchrony via shared,
mutable state (i.e., data races) in convex programming tasks, which has---in a
single-node context---delivered noteworthy empirical performance gains and
inspired new research into asynchronous algorithms. In this work, we attempt to
bridge this gap by evaluating the use of lightweight, asynchronous state
transfer within a commodity dataflow engine. Specifically, we investigate the
use of asynchronous sideways information passing (ASIP) that presents
single-stage parallel iterators with a Volcano-like intra-operator iterator
that can be used for asynchronous information passing. We port two synchronous
convex programming algorithms, stochastic gradient descent and the alternating
direction method of multipliers (ADMM), to use ASIPs. We evaluate an
implementation of ASIPs within Apache Spark that exhibits considerable
speedups as well as a rich set of performance trade-offs in the use of these
asynchronous algorithms.
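The single-node phenomenon motivating this work, asynchrony via shared mutable state, can be sketched independently of Spark and ASIPs. The following Hogwild-style Python example is invented for illustration (it is not the paper's code): several workers run stochastic gradient descent on a least-squares objective through an unsynchronized shared parameter, and the data races are benign.

```python
import random
import threading

# Workers minimize sum over (x, y) of (w*x - y)^2 on data with y = 2x,
# updating a shared parameter without any locks. The race is
# intentional: lost updates only slow convergence toward w = 2.

w = [0.0]                                   # shared, mutable state
data = [(x / 10.0, 2.0 * x / 10.0) for x in range(1, 11)]

def worker(seed, steps=2000, lr=0.02):
    rng = random.Random(seed)
    for _ in range(steps):
        x, y = rng.choice(data)
        grad = 2.0 * x * (w[0] * x - y)     # d/dw of (w*x - y)^2
        w[0] -= lr * grad                   # unsynchronized update

threads = [threading.Thread(target=worker, args=(s,)) for s in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
# w[0] converges to the exact solution w = 2 despite the races.
```

Replicating this convergence behavior inside a synchronous dataflow engine is exactly what makes the ASIP design interesting: the engine's execution model normally forbids this kind of cross-worker state sharing mid-stage.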
An Empirical Comparison of Big Graph Frameworks in the Context of Network Analysis
Complex networks are relational data sets commonly represented as graphs. The
analysis of their intricate structure is relevant to many areas of science and
commerce, and data sets may reach sizes that require distributed storage and
processing. We describe and compare programming models for distributed
computing with a focus on graph algorithms for large-scale complex network
analysis. Four frameworks - GraphLab, Apache Giraph, Giraph++ and Apache Flink
- are used to implement algorithms for the representative problems Connected
Components, Community Detection, PageRank and Clustering Coefficients. The
implementations are executed on a computer cluster to evaluate the frameworks'
suitability in practice and to compare their performance to that of the
single-machine, shared-memory parallel network analysis package NetworKit. Out
of the distributed frameworks, GraphLab and Apache Giraph generally show the
best performance. In our experiments a cluster of eight computers running
Apache Giraph enables the analysis of a network with about 2 billion edges,
which is too large for a single machine of the same type. However, for networks
that fit into memory of one machine, the performance of the shared-memory
parallel implementation is far better than the distributed ones. The study
provides experimental evidence for selecting the appropriate framework
depending on the task and data volume.
Design space exploration in the microthreaded many-core architecture
Design space exploration is commonly performed in embedded systems, where the
architecture is a complicated piece of engineering. With the current trend of
many-core systems, design space exploration in general-purpose computers can no
longer be avoided. The Microgrid is a complicated architecture, and therefore we
need to perform design space exploration. Generally, simulators are used for
the design space exploration of an architecture. Different simulators with
different levels of complexity, simulation time and accuracy are used.
Simulators with little complexity, low simulation time and reasonable accuracy
are desirable for the design space exploration of an architecture. These
simulators are referred to as high-level simulators and are commonly used in the
design of embedded systems. However, the use of high-level simulation for
design space exploration in general-purpose computers is a relatively new area
of research.
Comment: 12 pages, 1 figure
STEP : A Distributed Multi-threading Framework Towards Efficient Data Analytics
Various general-purpose distributed systems have been proposed to cope with
high-diversity applications in the pipeline of Big Data analytics. Most of them
provide simple yet effective primitives to simplify distributed programming.
While the rigid primitives offer great ease of use to savvy programmers, they
probably compromise efficiency in performance and flexibility in data
representation and programming specifications, which are critical properties in
real systems. In this paper, we discuss the limitations of coarse-grained
primitives and aim to provide an alternative for users to have flexible control
over distributed programs and operate globally shared data more efficiently. We
develop STEP, a novel distributed framework based on in-memory key-value store.
The key idea of STEP is to adapt multi-threading in a single machine to a
distributed environment. STEP enables users to take fine-grained control over
distributed threads and apply task-specific optimizations in a flexible manner.
The underlying key-value store serves as distributed shared memory to keep
globally shared data. To ensure ease-of-use, STEP offers plentiful effective
interfaces in terms of distributed shared data manipulation, cluster
management, distributed thread management and synchronization. We conduct
extensive experimental studies to evaluate the performance of STEP using real
data sets. The results show that STEP outperforms the state-of-the-art
general-purpose distributed systems as well as a specialized ML platform in
many real applications.
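A toy rendering of STEP's central idea, threads cooperating through a key-value store that plays the role of distributed shared memory, might look as follows. The KVStore class and its get/incr interface are invented for illustration; STEP's actual store is distributed and in-memory, and its threads span machines.

```python
import threading

# An in-process stand-in for distributed shared memory: globally
# shared data lives in a key-value store, and the store exposes an
# atomic read-modify-write primitive so threads need no external
# synchronization for this access pattern.

class KVStore:
    def __init__(self):
        self._data = {}
        self._lock = threading.Lock()

    def get(self, key, default=0):
        with self._lock:
            return self._data.get(key, default)

    def incr(self, key, delta=1):
        with self._lock:                    # atomic increment
            self._data[key] = self._data.get(key, 0) + delta

store = KVStore()

def worker(n):
    for _ in range(n):
        store.incr("counter")

threads = [threading.Thread(target=worker, args=(1000,)) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
# store.get("counter") is exactly 8 * 1000 thanks to the atomic incr.
```

The fine-grained control the paper argues for corresponds to choosing, per task, which operations go through such atomic store primitives and which can tolerate weaker coordination.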
BigSR: an empirical study of real-time expressive RDF stream reasoning on modern Big Data platforms
The trade-off between language expressiveness and system scalability (E&S) is
a well-known problem in RDF stream reasoning. Higher expressiveness supports
more complex reasoning logic, however, it may also hinder system scalability.
Current research mainly focuses on logical frameworks suitable for stream
reasoning as well as the implementation and the evaluation of prototype
systems. These systems are normally developed in a centralized setting which
suffer from inherent limited scalability, while an in-depth study of applying
distributed solutions to cover E&S is still missing. In this paper, we aim to
explore the feasibility of applying modern distributed computing frameworks to
meet E&S all together. To do so, we first propose BigSR, a technical
demonstrator that supports a positive fragment of the LARS framework. For the
sake of generality and to cover a wide variety of use cases, BigSR relies on
the two main execution models adopted by major distributed execution
frameworks: Bulk Synchronous Processing (BSP) and Record-at-A-Time (RAT).
Accordingly, we implement BigSR on top of Apache Spark Streaming (BSP model)
and Apache Flink (RAT model). In order to conclude on the impacts of BSP and
RAT on E&S, we analyze the ability of the two models to support distributed
stream reasoning and identify several types of use cases characterized by their
levels of support. This classification allows for quantifying the E&S trade-off
by assessing the scalability of each type of use case with respect to its level
of
expressiveness. Then, we conduct a series of experiments with 15 queries from 4
different datasets. Our experiments show that BigSR over both BSP and RAT
generally scales up to high throughput beyond a million triples per second
(with or without recursion), and RAT attains sub-millisecond delays for
stateless query operators.
Comment: 16 pages, 8 figures
DRONE: a Distributed Subgraph-Centric Framework for Processing Large Scale Power-law Graphs
Nowadays, in the big data era, social networks, graph databases, knowledge
graphs, electronic commerce, etc. demand efficient and scalable capabilities
for processing an ever-increasing volume of graph-structured data. To meet this
challenge, two mainstream distributed programming models, vertex-centric (VC)
and subgraph-centric (SC), were proposed. Compared to the VC model, the SC model
converges faster with less communication overhead on well-partitioned graphs,
and is easier to program due to the "think like a graph" philosophy. The
edge-cut method is considered a natural choice for graph partitioning in the
subgraph-centric model, and has been adopted by Giraph++, Blogel and GRAPE.
However, edge-cut partitioning causes a significant performance bottleneck when
processing large-scale power-law graphs, so the SC model is less competitive in
practice. In
this paper, we present an innovative distributed graph computing framework,
DRONE (Distributed gRaph cOmputiNg Engine). It combines the subgraph-centric
model and the vertex-cut graph partitioning strategy. Experiments show that
DRONE outperforms state-of-the-art distributed graph computing engines on
real-world graphs and synthetic power-law graphs. DRONE is capable of scaling
up to process one-trillion-edge synthetic power-law graphs, which is orders of
magnitude larger than previously reported by existing SC-based frameworks.
Comment: 13 pages, 9 figures
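Vertex-cut partitioning, which DRONE pairs with the subgraph-centric model, can be sketched with a greedy heuristic. This is an assumed simplification, not DRONE's actual partitioner: edges (rather than vertices) are assigned to partitions, and a vertex is replicated on every partition that holds one of its edges, so a power-law hub's edges can be spread out instead of piling onto a single partition as with edge-cut.

```python
# Greedy vertex-cut sketch: prefer partitions that already hold the
# edge's endpoints (to limit replication), but fall back to the least
# loaded partition when that choice would unbalance the edge counts.

def vertex_cut(edges, k, slack=1):
    parts = [[] for _ in range(k)]
    replicas = {}                           # vertex -> partitions holding it
    for s, d in edges:
        ps = replicas.get(s, set())
        pd = replicas.get(d, set())
        cand = (ps & pd) or (ps | pd) or set(range(k))
        p = min(sorted(cand), key=lambda i: len(parts[i]))
        if len(parts[p]) > min(len(q) for q in parts) + slack:
            p = min(range(k), key=lambda i: len(parts[i]))  # rebalance
        parts[p].append((s, d))
        replicas.setdefault(s, set()).add(p)
        replicas.setdefault(d, set()).add(p)
    rep = sum(len(v) for v in replicas.values()) / len(replicas)
    return parts, rep

star = [(0, i) for i in range(1, 9)]        # hub vertex 0, a power-law extreme
parts, rep_factor = vertex_cut(star, 4)
```

On this 8-edge star the edges spread evenly, two per partition, at the cost of replicating the hub on all four partitions, for a replication factor of 4/3 over the nine vertices; an edge-cut placement would instead have to put all eight hub edges behind one vertex.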