3,764 research outputs found
Gunrock: A High-Performance Graph Processing Library on the GPU
For large-scale graph analytics on the GPU, the irregularity of data access
and control flow, and the complexity of programming GPUs have been two
significant challenges for developing a programmable high-performance graph
library. "Gunrock", our graph-processing system designed specifically for the
GPU, uses a high-level, bulk-synchronous, data-centric abstraction focused on
operations on a vertex or edge frontier. Gunrock achieves a balance between
performance and expressiveness by coupling high performance GPU computing
primitives and optimization strategies with a high-level programming model that
allows programmers to quickly develop new graph primitives with small code size
and minimal GPU programming knowledge. We evaluate Gunrock on five key graph
primitives and show that Gunrock has on average at least an order of magnitude
speedup over Boost and PowerGraph, comparable performance to the fastest GPU
hardwired primitives, and better performance than any other GPU high-level
graph library.Comment: 14 pages, accepted by PPoPP'16 (removed the text repetition in the
previous version v5
Gunrock: GPU Graph Analytics
For large-scale graph analytics on the GPU, the irregularity of data access
and control flow, and the complexity of programming GPUs, have presented two
significant challenges to developing a programmable high-performance graph
library. "Gunrock", our graph-processing system designed specifically for the
GPU, uses a high-level, bulk-synchronous, data-centric abstraction focused on
operations on a vertex or edge frontier. Gunrock achieves a balance between
performance and expressiveness by coupling high performance GPU computing
primitives and optimization strategies with a high-level programming model that
allows programmers to quickly develop new graph primitives with small code size
and minimal GPU programming knowledge. We characterize the performance of
various optimization strategies and evaluate Gunrock's overall performance on
different GPU architectures on a wide range of graph primitives that span from
traversal-based algorithms and ranking algorithms, to triangle counting and
bipartite-graph-based algorithms. The results show that on a single GPU,
Gunrock has on average at least an order of magnitude speedup over Boost and
PowerGraph, comparable performance to the fastest GPU hardwired primitives and
CPU shared-memory graph libraries such as Ligra and Galois, and better
performance than any other GPU high-level graph library.Comment: 52 pages, invited paper to ACM Transactions on Parallel Computing
(TOPC), an extended version of PPoPP'16 paper "Gunrock: A High-Performance
Graph Processing Library on the GPU
A fine-grain time-sharing Time Warp system
Although Parallel Discrete Event Simulation (PDES) platforms relying on the Time Warp (optimistic) synchronization
protocol already allow for exploiting parallelism, several techniques have been proposed to
further favor performance. Among them we can mention optimized approaches for state restore, as well as
techniques for load balancing or (dynamically) controlling the speculation degree, the latter being specifically
targeted at reducing the incidence of causality errors leading to waste of computation. However, in
state of the art Time Warp systems, eventsâ processing is not preemptable, which may prevent the possibility
to promptly react to the injection of higher priority (say lower timestamp) events. Delaying the processing
of these events may, in turn, give rise to higher incidence of incorrect speculation. In this article we present
the design and realization of a fine-grain time-sharing Time Warp system, to be run on multi-core Linux
machines, which makes systematic use of event preemption in order to dynamically reassign the CPU to
higher priority events/tasks. Our proposal is based on a truly dual mode execution, application vs platform,
which includes a timer-interrupt based support for bringing control back to platform mode for possible CPU
reassignment according to very fine grain periods. The latter facility is offered by an ad-hoc timer-interrupt
management module for Linux, which we release, together with the overall time-sharing support, within the
open source ROOT-Sim platform. An experimental assessment based on the classical PHOLD benchmark and
two real world models is presented, which shows how our proposal effectively leads to the reduction of the
incidence of causality errors, as compared to traditional Time Warp, especially when running with higher
degrees of parallelism
Analysis, Tracing, Characterization and Performance Modeling of Select ASCI Applications for BlueGene/L Using Parallel Discrete Event Simulation
Caltech's Jet Propulsion Laboratory (JPL) and Center for Advanced Computer Architecture (CACR) are conducting application and simulation analyses of Blue Gene/L[1] in order to establish a range of effectiveness of the architecture in performing important classes of computations and to determine the design sensitivity of the global interconnect network in support of real world ASCI application execution
The Simulation Model Partitioning Problem: an Adaptive Solution Based on Self-Clustering (Extended Version)
This paper is about partitioning in parallel and distributed simulation. That
means decomposing the simulation model into a numberof components and to
properly allocate them on the execution units. An adaptive solution based on
self-clustering, that considers both communication reduction and computational
load-balancing, is proposed. The implementation of the proposed mechanism is
tested using a simulation model that is challenging both in terms of structure
and dynamicity. Various configurations of the simulation model and the
execution environment have been considered. The obtained performance results
are analyzed using a reference cost model. The results demonstrate that the
proposed approach is promising and that it can reduce the simulation execution
time in both parallel and distributed architectures
Aerodynamics, Stability and Control of the 1903 Wright Flyer
The Los Angeles Chapter of the American Institute of Aero and
Astronautics is building two replicas of the 1903 Wright Flyer airplane;
one to wind-tunnel test and display, and a modified one to fly. As part
of this project the aerodynamic characteristics of the Flyer are being
analyzed by modern wind-tunnel and analytical techniques. Tnis paper
describes the Wright Flyer Project, and compares key results from
small-scale wind-tunnel tests and from vortex-lattice computations for
this multi-biplane canard configuration. Analyses of the stability and
control properties are summarized and their implications for closed-loop
control by a pilot are derived using quasilinear pilot-vehicle analysis
and illustrated by simulation time histories.
It is concluded that, although the Wrights were very knowledgeable
and ingenious with respect to aircraft controls and their interactions
(e.g., the good effects of their wing-warp-to-rudder linkage are
validated), they were largely ignorant of dynamic stability
considerations. The paper shows that the 1903 Flyer was readily
controllable about all axes but was intrinsically unstable in pitch and
roll, and it could barely be stabilized by a skilled pilot
Forced Oscillation Source Location via Multivariate Time Series Classification
Precisely locating low-frequency oscillation sources is the prerequisite of
suppressing sustained oscillation, which is an essential guarantee for the
secure and stable operation of power grids. Using synchrophasor measurements, a
machine learning method is proposed to locate the source of forced oscillation
in power systems. Rotor angle and active power of each power plant are utilized
to construct multivariate time series (MTS). Applying Mahalanobis distance
metric and dynamic time warping, the distance between MTS with different phases
or lengths can be appropriately measured. The obtained distance metric,
representing characteristics during the transient phase of forced oscillation
under different disturbance sources, is used for offline classifier training
and online matching to locate the disturbance source. Simulation results using
the four-machine two-area system and IEEE 39-bus system indicate that the
proposed location method can identify the power system forced oscillation
source online with high accuracy.Comment: 5 pages, 3 figures. Accepted by 2018 IEEE/PES Transmission and
Distribution Conferenc
Spatio-temporal SIMT and scalarization for improving GPU efficiency
Temporal SIMT (TSIMT) has been suggested as an alternative to conventional (spatial) SIMT for improving GPU performance on branch-intensive code. Although TSIMT has been briefly mentioned before, it was not evaluated. We present a complete design and evaluation of TSIMT GPUs, along with the inclusion of scalarization and a combination of temporal and spatial SIMT, named Spatiotemporal SIMT (STSIMT). Simulations show that TSIMT alone results in a performance reduction, but a combination of scalarization and STSIMT yields a mean performance enhancement of 19.6% and improves the energy-delay product by 26.2% compared to SIMT.EC/FP7/288653/EU/Low-Power Parallel Computing on GPUs/LPGP
- âŠ