47 research outputs found
GMT: Enabling easy development and efficient execution of irregular applications on commodity clusters
In this poster we introduce GMT (Global Memory and Threading library), a custom runtime library that enables efficient execution of irregular applications on commodity clusters. GMT only requires a cluster of x86 nodes supporting MPI. GMT integrates the Partitioned Global Address Space (PGAS) locality-aware global data model with a fork/join control model common in single-node multithreaded environments. GMT supports lightweight software multithreading to tolerate the latencies of accessing data on remote nodes, and is built around data aggregation to maximize network bandwidth utilization.
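The PGAS-plus-fork/join model described above can be sketched in a few lines. This is an illustrative Python toy, not the GMT API: the names `GlobalArray`, `owner`, and `parallel_for` are invented for exposition, and real GMT runs across MPI ranks rather than threads in one process.

```python
# Minimal sketch of a PGAS data model combined with fork/join control.
# All names are illustrative; this is NOT the actual GMT interface.
from concurrent.futures import ThreadPoolExecutor

NODES = 4  # pretend cluster nodes, each owning one partition

class GlobalArray:
    """A globally addressable array partitioned across nodes."""
    def __init__(self, size):
        self.size = size
        self.chunk = (size + NODES - 1) // NODES
        # partition i holds indices [i*chunk, (i+1)*chunk)
        self.partitions = [[0] * max(0, min(self.chunk, size - i * self.chunk))
                           for i in range(NODES)]

    def owner(self, idx):
        # locality-aware: the owning node is derivable from the index
        return idx // self.chunk

    def get(self, idx):
        p = self.owner(idx)
        return self.partitions[p][idx - p * self.chunk]

    def put(self, idx, val):
        p = self.owner(idx)
        self.partitions[p][idx - p * self.chunk] = val

def parallel_for(n, body, workers=NODES):
    """Fork/join: spawn tasks over the iteration space, join on exit."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        list(pool.map(body, range(n)))  # implicit join when the pool drains

ga = GlobalArray(10)
parallel_for(10, lambda i: ga.put(i, i * i))
print([ga.get(i) for i in range(10)])  # squares 0..81
```

In the real system, a `get` on a remotely owned index would become an aggregated network request rather than a local list access; the sketch only shows how global indices map onto node-local partitions.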
High level synthesis of RDF queries for graph analytics
In this paper we present a set of techniques that enable the synthesis of efficient custom accelerators for memory-intensive, irregular applications. To address the challenges of irregular applications (large memory footprint, unpredictable fine-grained data accesses, and high synchronization intensity), and to exploit their opportunities (thread-level parallelism, memory-level parallelism), we propose a novel accelerator design that employs an adaptive Distributed Controller (DC) architecture and a Memory Interface Controller (MIC) that supports concurrent and atomic memory operations on a multi-ported/multi-banked shared memory. Among the multitude of algorithms that may benefit from our solution, we focus on the acceleration of graph analytics applications and, in particular, on the synthesis of SPARQL queries on Resource Description Framework (RDF) databases. We achieve this objective by incorporating the synthesis techniques into Bambu, an open-source high-level synthesis tool, and interfacing it with GEMS, the Graph database Engine for Multithreaded Systems. The GEMS front-end generates optimized C implementations of the input queries, modeled as graph pattern matching algorithms, which are then automatically synthesized by Bambu. We validate our approach by synthesizing several SPARQL queries from the Lehigh University Benchmark (LUBM).
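The idea of compiling a SPARQL query down to a graph pattern matching kernel can be illustrated with a toy example. This is an exploratory sketch, not GEMS or Bambu output: the triple data and the `match`/`query` helpers are invented, and a real front-end would emit optimized C rather than Python.

```python
# Illustrative sketch: a SPARQL-style query executed as graph pattern
# matching over an RDF triple store (NOT actual GEMS/Bambu code).
triples = [
    ("alice", "rdf:type",    "ub:GraduateStudent"),
    ("alice", "ub:memberOf", "dept0"),
    ("bob",   "rdf:type",    "ub:GraduateStudent"),
    ("bob",   "ub:memberOf", "dept1"),
]

def match(pattern, binding):
    """Yield extended variable bindings for one triple pattern.
    Components starting with '?' are variables."""
    for t in triples:
        b = dict(binding)
        ok = True
        for p, v in zip(pattern, t):
            if p.startswith("?"):
                if p in b and b[p] != v:
                    ok = False
                    break
                b[p] = v
            elif p != v:
                ok = False
                break
        if ok:
            yield b

def query(patterns):
    """Nested-loop join of triple patterns: the shape of the kernel a
    high-level synthesis flow would turn into hardware."""
    bindings = [{}]
    for pat in patterns:
        bindings = [b2 for b in bindings for b2 in match(pat, b)]
    return bindings

# LUBM-flavored query: graduate students who are members of dept0
res = query([("?x", "rdf:type",    "ub:GraduateStudent"),
             ("?x", "ub:memberOf", "dept0")])
print([b["?x"] for b in res])  # ['alice']
```

Each triple pattern adds one loop level; the unpredictable, fine-grained accesses into the triple store are exactly the irregular memory behavior the DC/MIC design targets.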
Fused Breadth-First Probabilistic Traversals on Distributed GPU Systems
Breadth-first probabilistic traversals (BPTs) are used in many network science and graph machine learning applications. In this paper, we are motivated by the application of BPTs in stochastic diffusion-based graph problems such as influence maximization. These applications rely heavily on BPTs to implement a Monte Carlo sampling step for their approximations. Given the large sampling complexity, the stochasticity of the diffusion process, and the inherent irregularity of real-world graph topologies, efficiently parallelizing these BPTs remains significantly challenging.
In this paper, we present a new algorithm to fuse a massive number of concurrently executing BPTs with random starts on the input graph. Our algorithm fuses BPTs by combining separate traversals into a unified frontier on distributed multi-GPU systems. To show the general applicability of the fused-BPT technique, we have incorporated it into two state-of-the-art parallel influence maximization implementations (gIM and Ripples). Our experiments on up to 4K nodes of the OLCF Frontier supercomputer show strong scaling behavior, and that fused BPTs can improve the performance of these implementations by up to 34x (for gIM) and ~360x (for Ripples).
Comment: 12 pages, 11 figures
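The core fusion idea, combining many probabilistic traversals into one unified frontier, can be sketched on a single machine. This is a simplified illustration under assumed semantics (independent edge activation with probability p per sample), not the paper's distributed multi-GPU implementation; the function name `fused_bpt` is invented.

```python
# Simplified sketch of fused BPTs: many probabilistic BFS samples advance
# together, each vertex carrying a bitmask of the samples that reached it,
# so all samples share one "unified frontier" per level.
import random

def fused_bpt(adj, roots, p, seed=0):
    """adj: {u: [v, ...]} adjacency; roots: start vertex of each sample;
    p: edge activation probability. Returns per-vertex sample bitmasks."""
    rng = random.Random(seed)
    visited = {u: 0 for u in adj}
    frontier = {}
    for s, r in enumerate(roots):
        visited[r] |= 1 << s
        frontier[r] = frontier.get(r, 0) | (1 << s)
    while frontier:
        nxt = {}
        for u, mask in frontier.items():
            for v in adj[u]:
                live = 0  # samples whose coin flip activates edge (u, v)
                m, s = mask, 0
                while m:
                    if (m & 1) and rng.random() < p:
                        live |= 1 << s
                    m >>= 1
                    s += 1
                new = live & ~visited[v]  # samples reaching v for the first time
                if new:
                    visited[v] |= new
                    nxt[v] = nxt.get(v, 0) | new
        frontier = nxt
    return visited

adj = {0: [1], 1: [2], 2: []}
vis = fused_bpt(adj, roots=[0, 0], p=1.0)
# with p=1.0 both samples deterministically reach every vertex
print(vis)  # {0: 3, 1: 3, 2: 3}
```

The payoff of fusion is that each edge of the frontier is scanned once per level for all live samples instead of once per sample, which is what makes the Monte Carlo step amenable to GPU-wide parallelism.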
The Future is Big Graphs! A Community View on Graph Processing Systems
Graphs are by nature unifying abstractions that can leverage interconnectedness to represent, explore, predict, and explain real- and digital-world phenomena. Although real users and consumers of graph instances and graph workloads understand these abstractions, future problems will require new abstractions and systems. What needs to happen in the next decade for big graph processing to continue to succeed?
Comment: 12 pages, 3 figures, collaboration between the large-scale systems and data management communities, work started at the Dagstuhl Seminar 19491 on Big Graph Processing Systems, to be published in the Communications of the ACM
Hardware Acceleration of Complex Machine Learning Models through Modern High-Level Synthesis
Machine learning algorithms continue to receive significant attention from industry and research. As the models increase in complexity and accuracy, their computational and memory demands also grow, pushing for more powerful, heterogeneous architectures; custom FPGA/ASIC accelerators are often the best solution to efficiently process large amounts of data close to the sensors in large-scale scientific experiments. Previous works exploited high-level synthesis to help design dedicated compute units for machine learning inference, proposing frameworks that translate high-level models into annotated C/C++. Our proposal, instead, integrates HLS in a compiler-based tool flow with multiple levels of abstraction, enabling analysis, optimization, and design space exploration along the whole process. Such an approach will also make it possible to explore models beyond multi-layer perceptrons and convolutional neural networks (which are often the main target of "classic" HLS frameworks), for example to address the different challenges posed by sparse and graph-based neural networks.
Exploring efficient hardware support for applications with irregular memory patterns on multinode manycore architectures
With computing systems becoming ubiquitous, numerous data sets of extremely large size are becoming available for analysis. Often the data collected have complex, graph-based structures, which makes them difficult to process with traditional tools. Moreover, the irregularities in the data sets, and in the analysis algorithms, hamper performance scaling on large distributed high-performance systems, which are optimized for locality exploitation and regular data structures. In this paper we present an approach to system design that enables efficient execution of applications with irregular memory patterns on a distributed many-core architecture based on off-the-shelf cores. We introduce a set of hardware and software components that provide a distributed global address space and fine-grained synchronization, and that transparently hide the latencies of remote accesses with multithreading. An FPGA prototype has been implemented to explore the design with a set of typical irregular kernels. We finally present an analytical model that highlights the benefits of the approach and helps identify the bottlenecks in the prototype. The experimental evaluation on graph-based applications demonstrates the scalability of the architecture for different configurations of the whole system.
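The latency-hiding mechanism described above, switching among lightweight threads while remote accesses are in flight, can be mimicked with coroutines. This is a toy model with invented names (`worker`, `run`, `REMOTE_LATENCY`), not the paper's hardware design; it only shows why multithreading keeps a core busy instead of stalling on remote memory.

```python
# Toy sketch of latency tolerance via software multithreading: each
# lightweight thread yields while its remote load is "in flight", and a
# round-robin scheduler switches to ready work instead of stalling.
from collections import deque

REMOTE_LATENCY = 3  # cycles a remote load takes to complete (assumed)

def worker(tid, addr, memory, results):
    # issue a remote load, then yield once per latency cycle
    for _ in range(REMOTE_LATENCY):
        yield  # switch out instead of stalling the core
    results[tid] = memory[addr] * 2  # compute on the returned value

def run(threads):
    """Round-robin scheduler: one generator step per 'cycle'."""
    ready = deque(threads)
    cycles = 0
    while ready:
        t = ready.popleft()
        try:
            next(t)          # advance one step
            ready.append(t)  # still waiting: re-queue, run someone else
        except StopIteration:
            pass             # thread finished
        cycles += 1
    return cycles

memory = {i: i + 10 for i in range(4)}
results = {}
cycles = run([worker(t, t, memory, results) for t in range(4)])
print(results, cycles)
```

With four threads, every cycle advances some thread, so the remote latencies of the four loads overlap instead of adding up; with a single thread the core would sit idle for each load in turn.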