42 research outputs found
Sparse Matrix Multiplication on a Field-Programmable Gate Array
To extract data from highly sophisticated sensor networks, algorithms derived from graph theory are often applied to raw sensor data. Embedded digital systems are used to apply these algorithms. A common computation performed in these algorithms is finding the product of two sparsely populated matrices. When processing a sparse matrix, certain optimizations can be made by taking advantage of the large percentage of zero entries. This project proposes an optimized algorithm for performing sparse matrix multiplications in an embedded system and investigates how a parallel architecture constructed of multiple processors on a single Field-Programmable Gate Array (FPGA) can be used to speed up computations
Theoretically Efficient Parallel Graph Algorithms Can Be Fast and Scalable
There has been significant recent interest in parallel graph processing due
to the need to quickly analyze the large graphs available today. Many graph
codes have been designed for distributed memory or external memory. However,
today even the largest publicly-available real-world graph (the Hyperlink Web
graph with over 3.5 billion vertices and 128 billion edges) can fit in the
memory of a single commodity multicore server. Nevertheless, most experimental
work in the literature report results on much smaller graphs, and the ones for
the Hyperlink graph use distributed or external memory. Therefore, it is
natural to ask whether we can efficiently solve a broad class of graph problems
on this graph in memory.
This paper shows that theoretically-efficient parallel graph algorithms can
scale to the largest publicly-available graphs using a single machine with a
terabyte of RAM, processing them in minutes. We give implementations of
theoretically-efficient parallel algorithms for 20 important graph problems. We
also present the optimizations and techniques that we used in our
implementations, which were crucial in enabling us to process these large
graphs quickly. We show that the running times of our implementations
outperform existing state-of-the-art implementations on the largest real-world
graphs. For many of the problems that we consider, this is the first time they
have been solved on graphs at this scale. We have made the implementations
developed in this work publicly-available as the Graph-Based Benchmark Suite
(GBBS).Comment: This is the full version of the paper appearing in the ACM Symposium
on Parallelism in Algorithms and Architectures (SPAA), 201
Graph analytics on modern massively parallel systems
Graphs provide a very flexible abstraction for understanding and modeling complex systems in many fields such as physics, biology, neuroscience, engineering, and social science. Only in the last two decades, with the advent of Big Data era, supercomputers equipped by accelerators –i.e., Graphics Processing Unit (GPUs)–, advanced networking, and highly parallel file systems have been used to analyze graph properties such as reachability, diameter, connected components, centrality, and clustering coefficient. Today graphs of interest may be composed by millions, sometimes billions, of nodes and edges and exhibit a highly irregular structure. As a consequence, the design of efficient and scalable graph algorithms is an extraordinary challenge due to irregular communication and memory access patterns, high synchronization
costs, and lack of data locality. In the present dissertation, we start off with a brief and gentle introduction for the reader to graph analytics and massively parallel systems. In particular, we present the intersection between graph analytics and parallel architectures in the current state-of-the-art and discuss the challenges encountered when solving such problems on large-scale graphs on these architectures (Chapter 1). In Chapter 2, some preliminary definitions and graph-theoretical notions are provided together with a description of the synthetic graphs used in the literature to model real-world networks. In Chapters 3-5, we present and tackle three different relevant problems in graph analysis: reachability (Chapter 3), Betweenness Centrality (Chapter 4), and clustering coefficient (Chapter 5). In detail, Chapter 3 tackles reachability problems by providing two scalable algorithms and implementations which efficiently solve st-connectivity problems on very large-scale graphs Chapter 4 considers the problem of identifying most relevant nodes in a network which plays a crucial role in several applications, including transportation and communication networks, social network analysis, and biological networks. In particular, we focus on a well-known centrality metrics,
namely Betweenness Centrality (BC), and present two different distributed algorithms for the BC computation on unweighted and weighted graphs. For unweighted graphs, we present a new communication-efficient algorithm based on the combination of bi-dimensional (2D) decomposition and multi-level parallelism. Furthermore, new algorithms which exploit the underlying graph topology to reduce the time and space usage of betweenness centrality computations are described as well. Concerning weighted graphs, we provide a scalable algorithm based on an algebraic formulation of the problem. Finally, thorough comprehensive experimental results on synthetic and real-
world large-scale graphs, we show that the proposed techniques are effective in practice and achieve significant speedups against state-of-the-art solutions. Chapter 5 considers clustering coefficients problem. Similarly to Betweenness Centrality, it is a fundamental tool in network analysis, as it specifically measures how nodes tend to cluster together in a network. In the chapter, we first extend caching techniques to Remote Memory Access (RMA) operations on distributed-memory system. The caching layer is mainly designed to avoid inter-node communications in order to achieve similar benefits for irregular applications as communication-avoiding algorithms. We also show how cached RMA is able to improve the performance of a new distributed asynchronous algorithm for the computation of local clustering coefficients. Finally, Chapter 6 contains a brief summary of the key contributions described in the dissertation and presents potential future directions of the
work
Book of Abstracts of the Sixth SIAM Workshop on Combinatorial Scientific Computing
Book of Abstracts of CSC14 edited by Bora UçarInternational audienceThe Sixth SIAM Workshop on Combinatorial Scientific Computing, CSC14, was organized at the Ecole Normale Supérieure de Lyon, France on 21st to 23rd July, 2014. This two and a half day event marked the sixth in a series that started ten years ago in San Francisco, USA. The CSC14 Workshop's focus was on combinatorial mathematics and algorithms in high performance computing, broadly interpreted. The workshop featured three invited talks, 27 contributed talks and eight poster presentations. All three invited talks were focused on two interesting fields of research specifically: randomized algorithms for numerical linear algebra and network analysis. The contributed talks and the posters targeted modeling, analysis, bisection, clustering, and partitioning of graphs, applied in the context of networks, sparse matrix factorizations, iterative solvers, fast multi-pole methods, automatic differentiation, high-performance computing, and linear programming. The workshop was held at the premises of the LIP laboratory of ENS Lyon and was generously supported by the LABEX MILYON (ANR-10-LABX-0070, Université de Lyon, within the program ''Investissements d'Avenir'' ANR-11-IDEX-0007 operated by the French National Research Agency), and by SIAM
Efficient enumeration of small graphlets and orbits
As the world is flooded with data, the demand for mining data for useful purposes is increasing. An effective techniques is to model the data as networks (graphs) and then apply graph mining techniques for analysis. As on date, the algorithms available to count graphlets and orbits for various types of graphs and their generalizations are limited. The thesis aims to fill the gap by presenting a simple and efficient algorithm for 3-node graphlet and orbit counting that is generic enough to work for both undirected and directed graphs. Our algorithm is compared with the state-of-art algorithms and we show that in most cases our algorithm performs better. We demonstrate our algorithm in three case studies related to (i) enzyme and metabolite correlation network in corn, (ii) watershed governance networks, and (iii) patterns exhibited by co-expression networks of healthy and cancerous stomach cells
GraphMineSuite: Enabling High-Performance and Programmable Graph Mining Algorithms with Set Algebra
We propose GraphMineSuite (GMS): the first benchmarking suite for graph
mining that facilitates evaluating and constructing high-performance graph
mining algorithms. First, GMS comes with a benchmark specification based on
extensive literature review, prescribing representative problems, algorithms,
and datasets. Second, GMS offers a carefully designed software platform for
seamless testing of different fine-grained elements of graph mining algorithms,
such as graph representations or algorithm subroutines. The platform includes
parallel implementations of more than 40 considered baselines, and it
facilitates developing complex and fast mining algorithms. High modularity is
possible by harnessing set algebra operations such as set intersection and
difference, which enables breaking complex graph mining algorithms into simple
building blocks that can be separately experimented with. GMS is supported with
a broad concurrency analysis for portability in performance insights, and a
novel performance metric to assess the throughput of graph mining algorithms,
enabling more insightful evaluation. As use cases, we harness GMS to rapidly
redesign and accelerate state-of-the-art baselines of core graph mining
problems: degeneracy reordering (by up to >2x), maximal clique listing (by up
to >9x), k-clique listing (by 1.1x), and subgraph isomorphism (by up to 2.5x),
also obtaining better theoretical performance bounds
Gunrock: GPU Graph Analytics
For large-scale graph analytics on the GPU, the irregularity of data access
and control flow, and the complexity of programming GPUs, have presented two
significant challenges to developing a programmable high-performance graph
library. "Gunrock", our graph-processing system designed specifically for the
GPU, uses a high-level, bulk-synchronous, data-centric abstraction focused on
operations on a vertex or edge frontier. Gunrock achieves a balance between
performance and expressiveness by coupling high performance GPU computing
primitives and optimization strategies with a high-level programming model that
allows programmers to quickly develop new graph primitives with small code size
and minimal GPU programming knowledge. We characterize the performance of
various optimization strategies and evaluate Gunrock's overall performance on
different GPU architectures on a wide range of graph primitives that span from
traversal-based algorithms and ranking algorithms, to triangle counting and
bipartite-graph-based algorithms. The results show that on a single GPU,
Gunrock has on average at least an order of magnitude speedup over Boost and
PowerGraph, comparable performance to the fastest GPU hardwired primitives and
CPU shared-memory graph libraries such as Ligra and Galois, and better
performance than any other GPU high-level graph library.Comment: 52 pages, invited paper to ACM Transactions on Parallel Computing
(TOPC), an extended version of PPoPP'16 paper "Gunrock: A High-Performance
Graph Processing Library on the GPU
Graph Algorithms on GPUs
This chapter introduces the topic of graph algorithms on GPUs. It starts by presenting and comparing the main important data structures and techniques applied for representing and analysing graphs on GPUs at the state of the art.It then presents the theory and an updated review of the most efficient implementations of graph algorithms for GPUs. In particular, the chapter focuses on graph traversal algorithms (breadth-first search), single-source shortest path(Djikstra, Bellman-Ford, delta stepping, hybrids), and all-pair shortest path (Floyd-Warshall). By the end of the chapter, load balancing and memory access techniques are discussed through an overview of their main issues and management techniques