85,899 research outputs found
Gunrock: A High-Performance Graph Processing Library on the GPU
For large-scale graph analytics on the GPU, the irregularity of data access
and control flow, and the complexity of programming GPUs have been two
significant challenges for developing a programmable high-performance graph
library. "Gunrock", our graph-processing system designed specifically for the
GPU, uses a high-level, bulk-synchronous, data-centric abstraction focused on
operations on a vertex or edge frontier. Gunrock achieves a balance between
performance and expressiveness by coupling high performance GPU computing
primitives and optimization strategies with a high-level programming model that
allows programmers to quickly develop new graph primitives with small code size
and minimal GPU programming knowledge. We evaluate Gunrock on five key graph
primitives and show that Gunrock has on average at least an order of magnitude
speedup over Boost and PowerGraph, comparable performance to the fastest GPU
hardwired primitives, and better performance than any other GPU high-level
graph library.Comment: 14 pages, accepted by PPoPP'16 (removed the text repetition in the
previous version v5
Gunrock: GPU Graph Analytics
For large-scale graph analytics on the GPU, the irregularity of data access
and control flow, and the complexity of programming GPUs, have presented two
significant challenges to developing a programmable high-performance graph
library. "Gunrock", our graph-processing system designed specifically for the
GPU, uses a high-level, bulk-synchronous, data-centric abstraction focused on
operations on a vertex or edge frontier. Gunrock achieves a balance between
performance and expressiveness by coupling high performance GPU computing
primitives and optimization strategies with a high-level programming model that
allows programmers to quickly develop new graph primitives with small code size
and minimal GPU programming knowledge. We characterize the performance of
various optimization strategies and evaluate Gunrock's overall performance on
different GPU architectures on a wide range of graph primitives that span from
traversal-based algorithms and ranking algorithms, to triangle counting and
bipartite-graph-based algorithms. The results show that on a single GPU,
Gunrock has on average at least an order of magnitude speedup over Boost and
PowerGraph, comparable performance to the fastest GPU hardwired primitives and
CPU shared-memory graph libraries such as Ligra and Galois, and better
performance than any other GPU high-level graph library.Comment: 52 pages, invited paper to ACM Transactions on Parallel Computing
(TOPC), an extended version of PPoPP'16 paper "Gunrock: A High-Performance
Graph Processing Library on the GPU
OSCP: Optimization Service Connectivity Protocol
Optimization software e.g. solvers and modelling systems, expose software and vendor specific interfacing mechanisms to client applications e.g. decision support systems, thus introducing close coupling. The ‘Optimization Service Connectivity Protocol’ (OSCP) is an abstraction of the interfaces to optimization software, which is aimed primarily at simplifying the process of integrating optimization systems into software solutions by providing an abstracted, uniform and easy to use interface to such systems, regardless of system or vendor specific requirements. This paper presents a high-level overview of OSCP including descriptions of its main interfaces, and illustrates its use via examples
Temporal Vectorization: A Compiler Approach to Automatic Multi-Pumping
The multi-pumping resource sharing technique can overcome the limitations
commonly found in single-clocked FPGA designs by allowing hardware components
to operate at a higher clock frequency than the surrounding system. However,
this optimization cannot be expressed in high levels of abstraction, such as
HLS, requiring the use of hand-optimized RTL. In this paper we show how to
leverage multiple clock domains for computational subdomains on reconfigurable
devices through data movement analysis on high-level programs. We offer a novel
view on multi-pumping as a compiler optimization - a superclass of traditional
vectorization. As multiple data elements are fed and consumed, the computations
are packed temporally rather than spatially. The optimization is applied
automatically using an intermediate representation that maps high-level code to
HLS. Internally, the optimization injects modules into the generated designs,
incorporating RTL for fine-grained control over the clock domains. We obtain a
reduction of resource consumption by up to 50% on critical components and 23%
on average. For scalable designs, this can enable further parallelism,
increasing overall performance
Block level voltage
Over the past years, state-of-art power optimization methods move towards higher abstraction levels that result in more efficient power savings. Among existing power optimization approaches, dynamic power management (DPM) is considered to be one of the most effective strategies. Depending on abstraction levels, DPM can be implemented in different formats but here we focus on scheduling that is more suitable for real-time system design use. This differs from the concurrent scheduling approaches that start from either the HLS (High-Level Synthesis) or RTS (Real-Time System) point of view, we propose a synergy solution of both approaches, namely block-level voltage/frequency scheduling (BLVFS). The presented block-level voltage/ frequency scheduling approach shows a generic solution for low power SoC (System on Chip) system design while the approaches which belong to the HLS and RTS categories have a strong dependency on the system functionalities. Consider a SoC as a combination of heterogeneous functional blocks, our approach provides efficient power savings by dynamically scheduling the scaling of voltage and frequency at the same time. Simulation results indicate that by using heuristic based strategies significant power savings can be achieved
Building Efficient Query Engines in a High-Level Language
Abstraction without regret refers to the vision of using high-level
programming languages for systems development without experiencing a negative
impact on performance. A database system designed according to this vision
offers both increased productivity and high performance, instead of sacrificing
the former for the latter as is the case with existing, monolithic
implementations that are hard to maintain and extend. In this article, we
realize this vision in the domain of analytical query processing. We present
LegoBase, a query engine written in the high-level language Scala. The key
technique to regain efficiency is to apply generative programming: LegoBase
performs source-to-source compilation and optimizes the entire query engine by
converting the high-level Scala code to specialized, low-level C code. We show
how generative programming allows to easily implement a wide spectrum of
optimizations, such as introducing data partitioning or switching from a row to
a column data layout, which are difficult to achieve with existing low-level
query compilers that handle only queries. We demonstrate that sufficiently
powerful abstractions are essential for dealing with the complexity of the
optimization effort, shielding developers from compiler internals and
decoupling individual optimizations from each other. We evaluate our approach
with the TPC-H benchmark and show that: (a) With all optimizations enabled,
LegoBase significantly outperforms a commercial database and an existing query
compiler. (b) Programmers need to provide just a few hundred lines of
high-level code for implementing the optimizations, instead of complicated
low-level code that is required by existing query compilation approaches. (c)
The compilation overhead is low compared to the overall execution time, thus
making our approach usable in practice for compiling query engines
- …