19,526 research outputs found
ScALPEL: A Scalable Adaptive Lightweight Performance Evaluation Library for application performance monitoring
As supercomputers continue to grow in scale and capabilities, it is becoming
increasingly difficult to isolate processor and system level causes of
performance degradation. Over the last several years, a significant number of
performance analysis and monitoring tools have been built/proposed. However,
these tools suffer from several important shortcomings, particularly in
distributed environments. In this paper we present ScALPEL, a Scalable Adaptive
Lightweight Performance Evaluation Library for application performance
monitoring at the functional level. Our approach provides several distinct
advantages. First, ScALPEL is portable across a wide variety of architectures,
and its ability to selectively monitor functions presents low run-time
overhead, enabling its use for large-scale production applications. Second, it
is run-time configurable, enabling both dynamic selection of functions to
profile as well as events of interest on a per function basis. Third, our
approach is transparent in that it requires no source code modifications.
Finally, ScALPEL is implemented as a pluggable unit by reusing existing
performance monitoring frameworks such as Perfmon and PAPI and extending them
to support both sequential and MPI applications.Comment: 10 pages, 4 figures, 2 table
MERIC and RADAR generator: tools for energy evaluation and runtime tuning of HPC applications
This paper introduces two tools for manual energy evaluation and runtime tuning developed at IT4Innovations in the READEX project. The MERIC library can be used for manual instrumentation and analysis of any application from the energy and time consumption point of view. Besides tracing, MERIC can also change environment and hardware parameters during the application runtime, which leads to energy savings.
MERIC stores large amounts of data, which are difficult to read by a human. The RADAR generator analyses the MERIC output files to find the best settings of evaluated parameters for each instrumented region. It generates a Open image in new window report and a MERIC configuration file for application production runs
A Fast Causal Profiler for Task Parallel Programs
This paper proposes TASKPROF, a profiler that identifies parallelism
bottlenecks in task parallel programs. It leverages the structure of a task
parallel execution to perform fine-grained attribution of work to various parts
of the program. TASKPROF's use of hardware performance counters to perform
fine-grained measurements minimizes perturbation. TASKPROF's profile execution
runs in parallel using multi-cores. TASKPROF's causal profile enables users to
estimate improvements in parallelism when a region of code is optimized even
when concrete optimizations are not yet known. We have used TASKPROF to isolate
parallelism bottlenecks in twenty three applications that use the Intel
Threading Building Blocks library. We have designed parallelization techniques
in five applications to in- crease parallelism by an order of magnitude using
TASKPROF. Our user study indicates that developers are able to isolate
performance bottlenecks with ease using TASKPROF.Comment: 11 page
LIKWID: Lightweight Performance Tools
Exploiting the performance of today's microprocessors requires intimate
knowledge of the microarchitecture as well as an awareness of the ever-growing
complexity in thread and cache topology. LIKWID is a set of command line
utilities that addresses four key problems: Probing the thread and cache
topology of a shared-memory node, enforcing thread-core affinity on a program,
measuring performance counter metrics, and microbenchmarking for reliable upper
performance bounds. Moreover, it includes a mpirun wrapper allowing for
portable thread-core affinity in MPI and hybrid MPI/threaded applications. To
demonstrate the capabilities of the tool set we show the influence of thread
affinity on performance using the well-known OpenMP STREAM triad benchmark, use
hardware counter tools to study the performance of a stencil code, and finally
show how to detect bandwidth problems on ccNUMA-based compute nodes.Comment: 12 page
Performance and Power Analysis of HPC Workloads on Heterogenous Multi-Node Clusters
Performance analysis tools allow application developers to identify and characterize the inefficiencies that cause performance degradation in their codes, allowing for application optimizations. Due to the increasing interest in the High Performance Computing (HPC) community towards energy-efficiency issues, it is of paramount importance to be able to correlate performance and power figures within the same profiling and analysis tools. For this reason, we present a performance and energy-efficiency study aimed at demonstrating how a single tool can be used to collect most of the relevant metrics. In particular, we show how the same analysis techniques can be applicable on different architectures, analyzing the same HPC application on a high-end and a low-power cluster. The former cluster embeds Intel Haswell CPUs and NVIDIA K80 GPUs, while the latter is made up of NVIDIA Jetson TX1 boards, each hosting an Arm Cortex-A57 CPU and an NVIDIA Tegra X1 Maxwell GPU.The research leading to these results has received funding from the European Community’s Seventh Framework Programme [FP7/2007-2013] and Horizon 2020 under the Mont-Blanc projects [17], grant agreements n. 288777, 610402 and 671697. E.C. was partially founded by “Contributo 5 per mille assegnato all’Università degli Studi di Ferrara-dichiarazione dei redditi dell’anno 2014”. We thank the University of Ferrara and INFN Ferrara for the access to the COKA Cluster. We warmly thank the BSC tools group, supporting us for the smooth integration and test of our setup within Extrae and Paraver.Peer ReviewedPostprint (published version
Mira: A Framework for Static Performance Analysis
The performance model of an application can pro- vide understanding about its
runtime behavior on particular hardware. Such information can be analyzed by
developers for performance tuning. However, model building and analyzing is
frequently ignored during software development until perfor- mance problems
arise because they require significant expertise and can involve many
time-consuming application runs. In this paper, we propose a fast, accurate,
flexible and user-friendly tool, Mira, for generating performance models by
applying static program analysis, targeting scientific applications running on
supercomputers. We parse both the source code and binary to estimate
performance attributes with better accuracy than considering just source or
just binary code. Because our analysis is static, the target program does not
need to be executed on the target architecture, which enables users to perform
analysis on available machines instead of conducting expensive exper- iments on
potentially expensive resources. Moreover, statically generated models enable
performance prediction on non-existent or unavailable architectures. In
addition to flexibility, because model generation time is significantly reduced
compared to dynamic analysis approaches, our method is suitable for rapid
application performance analysis and improvement. We present several scientific
application validation results to demonstrate the current capabilities of our
approach on small benchmarks and a mini application
- …