Load-Varying LINPACK: A Benchmark for Evaluating Energy Efficiency in High-End Computing
For decades, performance has driven the high-end computing (HEC) community. However, as highlighted in recent exascale studies that chart a path from petascale to exascale computing, power consumption is fast becoming the major design constraint in HEC. Consequently, the HEC community needs to address this issue in future petascale and exascale computing systems.
Current scientific benchmarks, such as LINPACK and SPEChpc, only evaluate HEC systems when running at full throttle, i.e., 100% workload, resulting in a focus on performance and ignoring the issues of power and energy consumption. In contrast, efforts like SPECpower evaluate the energy efficiency of a compute server at varying workloads. This is analogous to evaluating the energy efficiency (i.e., fuel efficiency) of an automobile at varying speeds (e.g., miles per gallon highway versus city). SPECpower, however, only evaluates the energy efficiency of a single compute server rather than an HEC system; furthermore, it is based on SPEC's Java Business Benchmarks (SPECjbb) rather than a scientific benchmark. Given the absence of a load-varying scientific benchmark to evaluate the energy efficiency of HEC systems at different workloads, we propose the load-varying LINPACK (LV-LINPACK) benchmark. In this paper, we identify application parameters that affect performance and provide a methodology to vary the workload of LINPACK, thus enabling a more rigorous study of energy efficiency in supercomputers or, more generally, HEC systems.
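For a sense of how a load-varying LINPACK run might be orchestrated, the sketch below sweeps the HPL problem size N, one of the input parameters that most strongly affects workload, across several load levels. The HPL.dat written here is abridged, and N_MAX, the process grid, and the launch command are illustrative assumptions rather than the paper's exact methodology.

```python
# Sketch: vary the LINPACK (HPL) workload by sweeping the problem
# size N. The HPL.dat below is abridged (a complete file needs the
# remaining algorithm-tuning lines); N_MAX, the process grid, and
# the launch command are illustrative assumptions.

import subprocess

def write_hpl_dat(n, nb=192, p=2, q=2, path="HPL.dat"):
    """Write the leading fields of an HPL.dat for one problem size."""
    lines = [
        "HPLinpack benchmark input file",
        "(abridged -- append the standard tuning lines)",
        "HPL.out      output file name",
        "6            device out",
        "1            # of problems sizes (N)",
        f"{n}         Ns",
        "1            # of NBs",
        f"{nb}        NBs",
        "0            PMAP process mapping",
        "1            # of process grids (P x Q)",
        f"{p}         Ps",
        f"{q}         Qs",
    ]
    with open(path, "w") as f:
        f.write("\n".join(lines) + "\n")

# Sweep from a light workload to full throttle (fractions of the
# largest N that fits in memory), running HPL at each load level.
N_MAX = 40000  # hypothetical memory-limited problem size
for frac in (0.25, 0.5, 0.75, 1.0):
    write_hpl_dat(int(N_MAX * frac))
    subprocess.run(["mpirun", "-np", "4", "./xhpl"], check=True)
```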
Towards Energy-Proportional Computing for Enterprise-Class Server Workloads
Massive data centers housing thousands of computing nodes have become commonplace in enterprise computing, and the power consumption of such data centers is growing at an unprecedented rate. Adding to the problem is the inability of the servers to exhibit energy proportionality, i.e., provide energy-efficient execution under all levels of utilization, which diminishes the overall energy efficiency of the data center. It is imperative that we realize effective strategies to control the power consumption of the server and improve the energy efficiency of data centers. With the advent of Intel Sandy Bridge processors, we have the ability to specify a limit on power consumption during runtime, which creates opportunities to design new power-management techniques for enterprise workloads and make the systems that they run on more energy-proportional.
In this paper, we investigate whether it is possible to achieve energy proportionality for an enterprise-class server workload, namely the SPECpower_ssj2008 benchmark, by using Intel's Running Average Power Limit (RAPL) interfaces. First, we analyze the power consumption and characterize the instantaneous power profile of the SPECpower benchmark at a subsystem level using the on-chip energy meters exposed via the RAPL interfaces. We then analyze the impact of RAPL power limiting on the performance, per-transaction response time, power consumption, and energy efficiency of the benchmark under different load levels. Our observations and results shed light on the efficacy of the RAPL interfaces and provide guidance for designing power-management techniques for enterprise-class workloads.
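As a flavor of what RAPL power limiting looks like in practice, here is a minimal sketch that reads package energy and sets a power cap through Linux's powercap sysfs view of the RAPL interfaces. The paper itself works with the RAPL energy meters and limits directly; the 50 W cap below is an arbitrary illustrative value.

```python
# Sketch: read package energy and apply a power cap through Linux's
# powercap sysfs view of the Intel RAPL interfaces. Requires root
# and a RAPL-capable CPU; counter wraparound (see the domain's
# max_energy_range_uj) is ignored for brevity.

import time

RAPL = "/sys/class/powercap/intel-rapl:0"   # package-0 domain

def read_energy_uj():
    """Cumulative package energy in microjoules."""
    with open(f"{RAPL}/energy_uj") as f:
        return int(f.read())

def set_power_limit_uw(limit_uw):
    """Set the long-term package power limit (constraint 0)."""
    with open(f"{RAPL}/constraint_0_power_limit_uw", "w") as f:
        f.write(str(limit_uw))

# Average package power over a one-second window.
e0, t0 = read_energy_uj(), time.time()
time.sleep(1.0)
e1, t1 = read_energy_uj(), time.time()
print(f"package power: {(e1 - e0) / 1e6 / (t1 - t0):.1f} W")

set_power_limit_uw(50_000_000)  # cap the package at 50 W
```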
CU2CL: A CUDA-to-OpenCL Translator for Multi- and Many-core Architectures
The use of graphics processing units (GPUs) in high-performance parallel computing continues to become more prevalent, often as part of a heterogeneous system. For years, CUDA has been the de facto programming environment for nearly all general-purpose GPU (GPGPU) applications. In spite of this, the framework is available only on NVIDIA GPUs, traditionally requiring reimplementation in other frameworks in order to utilize additional multi- or many-core devices. On the other hand, OpenCL provides an open and vendor-neutral programming environment and runtime system. With implementations available for CPUs, GPUs, and other types of accelerators, OpenCL therefore holds the promise of a "write once, run anywhere" ecosystem for heterogeneous computing. Given the many similarities between CUDA and OpenCL, manually porting a CUDA application to OpenCL is typically straightforward, albeit tedious and error-prone. In response to this issue, we created CU2CL, an automated CUDA-to-OpenCL source-to-source translator that possesses a novel design and clever reuse of the Clang compiler framework. Currently, the CU2CL translator covers the primary constructs found in the CUDA runtime API, and we have successfully translated many applications from the CUDA SDK and the Rodinia benchmark suite. The performance of our automatically translated applications via CU2CL is on par with that of their manually ported counterparts.
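To give a sense of how closely the two APIs align, the table below pairs a few common CUDA constructs with their OpenCL counterparts. This is only a toy illustration of the correspondences involved; CU2CL itself performs the rewriting through Clang's AST machinery rather than by string substitution.

```python
# Toy illustration of CUDA-to-OpenCL API correspondences that a
# translator like CU2CL must handle. CU2CL itself rewrites source
# through the Clang framework, not via a lookup table like this.

CUDA_TO_OPENCL = {
    "cudaMalloc":            "clCreateBuffer",
    "cudaMemcpy":            "clEnqueueReadBuffer/clEnqueueWriteBuffer",
    "cudaFree":              "clReleaseMemObject",
    "cudaThreadSynchronize": "clFinish",
    "__global__":            "__kernel",
    "threadIdx.x":           "get_local_id(0)",
    "blockIdx.x":            "get_group_id(0)",
    "blockDim.x":            "get_local_size(0)",
}

for cuda, opencl in CUDA_TO_OPENCL.items():
    print(f"{cuda:24} -> {opencl}")
```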
The Green500 List: Escapades to Exascale
Energy efficiency is now a top priority. The first four years of the Green500 have seen the importance of energy efficiency in supercomputing grow from an afterthought to the forefront of innovation as we near a point where systems will be forced to stop drawing more power. Even so, the landscape of efficiency in supercomputing continues to shift, with new trends emerging and unexpected shifts in previous predictions.

This paper offers an in-depth analysis of the new and shifting trends in the Green500. In addition, the analysis offers early indications of the track we are taking toward exascale and what an exascale machine in 2018 is likely to look like. Lastly, we discuss the new efforts and collaborations toward designing and establishing better metrics, methodologies, and workloads for the measurement and analysis of energy-efficient supercomputing.
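Since the Green500 ranks systems by performance per watt, a quick back-of-the-envelope calculation shows what the exascale trajectory implies. The 20 MW power envelope below is the commonly cited exascale target and is used purely for illustration.

```python
# Worked example of the Green500 efficiency metric (FLOPS per watt)
# and what it implies for exascale. The 20 MW envelope is the
# commonly cited exascale power target, used here only to
# illustrate the arithmetic.

def gflops_per_watt(rmax_gflops, power_watts):
    return rmax_gflops / power_watts

# An exaflop machine (1e18 FLOPS = 1e9 GFLOPS) inside a 20 MW
# envelope would need roughly 50 GFLOPS/W:
print(gflops_per_watt(1e9, 20e6))  # -> 50.0
```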
CampProf: A Visual Performance Analysis Tool for Memory Bound GPU Kernels
Current GPU tools and performance models provide some common architectural insights that guide programmers toward writing optimal code. We challenge these performance models by modeling and analyzing a lesser-known but very severe performance pitfall, called 'Partition Camping', in NVIDIA GPUs. Partition Camping is caused by memory accesses that are skewed towards a subset of the available memory partitions, which may degrade the performance of memory-bound CUDA kernels by up to seven times. No existing tool can detect the partition camping effect in CUDA kernels.
We complement the existing tools by developing 'CampProf', a spreadsheet-based visual analysis tool that detects the degree to which any memory-bound kernel suffers from partition camping. In addition, CampProf predicts the kernel's performance at all execution configurations if its performance parameters are known at any one of them. To demonstrate the utility of CampProf, we analyze three different applications using our tool and show how it can be used to discover partition camping. We also show how CampProf can be used to monitor the performance improvements in the kernels as the partition camping effect is removed.
The performance model that drives CampProf was developed by applying multiple linear regression techniques over a set of specific micro-benchmarks that simulate the partition camping behavior. Our results show that the geometric mean of the errors in our prediction model is within 12% of the actual execution times. In summary, CampProf is a new, accurate, and easy-to-use tool that can be used in conjunction with existing tools to analyze and improve the overall performance of memory-bound CUDA kernels.
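To make the modeling step concrete, here is a minimal sketch of a CampProf-style multiple linear regression fit with ordinary least squares. The feature columns (active warps per SM, memory transactions per warp) and the sample timings are hypothetical placeholders, not the micro-benchmark data used in the paper.

```python
# Sketch of the multiple-linear-regression step behind a CampProf-
# style model: fit kernel execution time against execution-
# configuration features measured on micro-benchmarks. Feature
# names and sample data are hypothetical.

import numpy as np

# Columns: active warps per SM, memory transactions per warp.
X = np.array([[8, 4], [16, 4], [24, 8], [32, 8], [48, 16]], float)
y = np.array([1.9, 3.1, 6.2, 7.8, 15.5])   # kernel time (ms)

# Add an intercept column and solve the least-squares problem.
A = np.hstack([X, np.ones((len(X), 1))])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)

# Predict the kernel's time at an unmeasured configuration.
predict = lambda warps, trans: coef @ [warps, trans, 1.0]
print(f"predicted time at (40, 12): {predict(40, 12):.2f} ms")
```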
Architecture-Aware Optimization on a 1600-core Graphics Processor
The graphics processing unit (GPU) continues to make significant strides as an accelerator in commodity cluster computing for high-performance computing (HPC). For example, three of the top five fastest supercomputers in the world, as ranked by the TOP500, employ GPUs as accelerators. Despite this increasing interest in GPUs, however, optimizing the performance of a GPU-accelerated compute node requires deep technical knowledge of the underlying architecture. Although significant literature exists on how to optimize GPU performance on the more mature NVIDIA CUDA architecture, the converse is true for OpenCL on the AMD GPU.

Consequently, we present and evaluate architecture-aware optimizations for the AMD GPU. The most prominent optimizations include (i) explicit use of registers, (ii) use of vector types, (iii) removal of branches, and (iv) use of image memory for global data. We demonstrate the efficacy of our AMD GPU optimizations by applying each optimization in isolation as well as in concert to a large-scale molecular modeling application called GEM. Via these AMD-specific GPU optimizations, the AMD Radeon HD 5870 GPU delivers 65% better performance than with the well-known NVIDIA-specific optimizations.
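As a taste of optimization (ii), the sketch below contrasts a scalar kernel with a float4-vectorized one, the kind of rewrite that maps well onto the VLIW lanes of the Radeon HD 5870. The kernels, sizes, and host code (via pyopencl) are illustrative assumptions rather than code from the paper, and running it requires an installed OpenCL runtime.

```python
# Sketch of the vector-type optimization in OpenCL: the same scale
# operation written with scalar float and with float4. Kernel names
# and sizes are illustrative; requires pyopencl and an OpenCL device.

import numpy as np
import pyopencl as cl

src = """
__kernel void scale_scalar(__global const float *x,
                           __global float *y, float a) {
    int i = get_global_id(0);
    y[i] = a * x[i];
}
__kernel void scale_vec4(__global const float4 *x,
                         __global float4 *y, float a) {
    int i = get_global_id(0);   // each work-item handles 4 floats
    y[i] = a * x[i];
}
"""

ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx)
prog = cl.Program(ctx, src).build()

n = 1 << 20
x = np.random.rand(n).astype(np.float32)
mf = cl.mem_flags
dx = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=x)
dy = cl.Buffer(ctx, mf.WRITE_ONLY, x.nbytes)

# The float4 version launches a quarter as many work-items.
prog.scale_vec4(queue, (n // 4,), None, dx, dy, np.float32(2.0))
out = np.empty_like(x)
cl.enqueue_copy(queue, out, dy)
print(np.allclose(out, 2.0 * x))
```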