234,326 research outputs found
Tuning Parallel Applications in Parallel
Auto-tuning has recently received significant attention from the High Performance Computing
community. Most auto-tuning approaches are specialized to work either on specific domains such as
dense linear algebra and stencil computations, or only at certain stages of program execution such
as compile time and runtime. Real scientific applications, however, demand a cohesive environment that can efficiently provide auto-tuning solutions at all stages of application development and deployment. Towards that end, we describe a unified end-to-end approach to auto-tuning scientific applications. Our system, Active Harmony, takes a search-based collaborative approach to auto-tuning. Application programmers, library writers and compilers collaborate to describe and export a set of performance related tunable parameters to the Active Harmony system. These parameters define a tuning search-space. The auto-tuner monitors the program performance and suggests adaptation decisions. The decisions are made by a central controller using a parallel search algorithm. The algorithm leverages parallel architectures to search across a set of optimization parameter values. Different nodes of a parallel system evaluate different
configurations at each timestep.
Active Harmony supports runtime adaptive code-generation and tuning for parameters that require new code (e.g. unroll factors). Effectively, we merge traditional feedback directed optimization and
just-in-time compilation. This feature also enables application developers to write applications once and have the auto-tuner adjust the application behavior automatically when run on new systems. We evaluated our system on multiple large-scale parallel applications and showed that our system can improve the execution time by up to 46% compared to the original version of the program.
Finally, we believe that the success of any auto-tuning research depends on how effectively
application developers, domain-experts and auto-tuners communicate and work together. To that end, we have developed and released a simple and extensible language that standardizes the parameter space representation. Using this language, developers and researchers can collaborate to export tunable parameters to the tuning frameworks. Relationships (e.g. ordering, dependencies,
constraints, ranking) between tunable parameters and search-hints can also be expressed
Data Partitioning and Load Balancing in Parallel Disk Systems
Parallel disk systems provide opportunities for exploiting I/O parallelism in two possible ways, namely via inter-request and intra-request parallelism. In this paper we discuss the main issues in performance tuning of such systems, namely striping and load balancing, and show their relationship to response time and throughput. We outline the main components of an intelligent, self-reliant file system that aims to optimize striping by taking into account the requirements of the applications, and performs load balancing by judicious file allocation and dynamic redistributions of the data when access patterns change. Our system uses simple but effective heuristics that incur only little overhead. We present performance experiments based on synthetic workloads and real-life traces. Keywords: parallel disk systems, performance tuning, file striping, data allocation, load balancing, disk cooling. 1 Introduction: Tuning Issues in Parallel Disk Systems Parallel disk systems are of great imp..
A methodology for transparent knowledge specification in a dynamic tuning environment
The increasing use of parallel/distributed applications demands a continuous support to take significant advantages from parallel power. This includes the evolution of performance analysis and tuning tools which automatically allows for obtaining a better behavior of the applications. Different approaches and tools have been proposed and they are continuously evolving to cover the requirements and expectations of users. One such tool is MATE (Monitoring Analysis and Tuning Environment), which provides automatic and dynamic tuning for parallel/distributed applications. The knowledge used by MATE to analyze and take decisions is based on performance models which include a set of performance parameters and a set of mathematical expressions modeling the solution of the performance problem. These elements are used by the tuning environment to conduct the monitoring and analysis steps, respectively. The tuning phase depends on the results of the performance analysis. This paper presents a methodology to specify performance models. Each performance model specification can be automatically and transparently translated into a piece of software code encapsulating the knowledge to be straightforwardly included in MATE. Applying this methodology, the user does not have to be involved in the implementation details of MATE, which makes the usage of the tool more transparent.Fil: Caymes Scutari, Paola Guadalupe. Universidad Tecnológica Nacional. Facultad Regional de Mendoza; Argentina. Consejo Nacional de Investigaciones Científicas y Técnicas. Centro Científico Tecnológico Conicet - Mendoza; ArgentinaFil: Morajko, A.. Universitat Autònoma de Barcelona; EspañaFil: Margalef, T.. Universitat Autònoma de Barcelona; EspañaFil: Luque, E.. Universitat Autònoma de Barcelona; Españ
Instrumentation, performance visualization, and debugging tools for multiprocessors
The need for computing power has forced a migration from serial computation on a single processor to parallel processing on multiprocessor architectures. However, without effective means to monitor (and visualize) program execution, debugging, and tuning parallel programs becomes intractably difficult as program complexity increases with the number of processors. Research on performance evaluation tools for multiprocessors is being carried out at ARC. Besides investigating new techniques for instrumenting, monitoring, and presenting the state of parallel program execution in a coherent and user-friendly manner, prototypes of software tools are being incorporated into the run-time environments of various hardware testbeds to evaluate their impact on user productivity. Our current tool set, the Ames Instrumentation Systems (AIMS), incorporates features from various software systems developed in academia and industry. The execution of FORTRAN programs on the Intel iPSC/860 can be automatically instrumented and monitored. Performance data collected in this manner can be displayed graphically on workstations supporting X-Windows. We have successfully compared various parallel algorithms for computational fluid dynamics (CFD) applications in collaboration with scientists from the Numerical Aerodynamic Simulation Systems Division. By performing these comparisons, we show that performance monitors and debuggers such as AIMS are practical and can illuminate the complex dynamics that occur within parallel programs
Automatic Detection of Performance Anomalies in Task-Parallel Programs
To efficiently exploit the resources of new many-core architectures,
integrating dozens or even hundreds of cores per chip, parallel programming
models have evolved to expose massive amounts of parallelism, often in the form
of fine-grained tasks. Task-parallel languages, such as OpenStream, X10,
Habanero Java and C or StarSs, simplify the development of applications for new
architectures, but tuning task-parallel applications remains a major challenge.
Performance bottlenecks can occur at any level of the implementation, from the
algorithmic level (e.g., lack of parallelism or over-synchronization), to
interactions with the operating and runtime systems (e.g., data placement on
NUMA architectures), to inefficient use of the hardware (e.g., frequent cache
misses or misaligned memory accesses); detecting such issues and determining
the exact cause is a difficult task.
In previous work, we developed Aftermath, an interactive tool for trace-based
performance analysis and debugging of task-parallel programs and run-time
systems. In contrast to other trace-based analysis tools, such as Paraver or
Vampir, Aftermath offers native support for tasks, i.e., visualization,
statistics and analysis tools adapted for performance debugging at task
granularity. However, the tool currently does not provide support for the
automatic detection of performance bottlenecks and it is up to the user to
investigate the relevant aspects of program execution by focusing the
inspection on specific slices of a trace file. In this paper, we present
ongoing work on two extensions that guide the user through this process.Comment: Presented at 1st Workshop on Resource Awareness and Adaptivity in
Multi-Core Computing (Racing 2014) (arXiv:1405.2281
Performance Characterization of Multi-threaded Graph Processing Applications on Intel Many-Integrated-Core Architecture
Intel Xeon Phi many-integrated-core (MIC) architectures usher in a new era of
terascale integration. Among emerging killer applications, parallel graph
processing has been a critical technique to analyze connected data. In this
paper, we empirically evaluate various computing platforms including an Intel
Xeon E5 CPU, a Nvidia Geforce GTX1070 GPU and an Xeon Phi 7210 processor
codenamed Knights Landing (KNL) in the domain of parallel graph processing. We
show that the KNL gains encouraging performance when processing graphs, so that
it can become a promising solution to accelerating multi-threaded graph
applications. We further characterize the impact of KNL architectural
enhancements on the performance of a state-of-the art graph framework.We have
four key observations: 1 Different graph applications require distinctive
numbers of threads to reach the peak performance. For the same application,
various datasets need even different numbers of threads to achieve the best
performance. 2 Only a few graph applications benefit from the high bandwidth
MCDRAM, while others favor the low latency DDR4 DRAM. 3 Vector processing units
executing AVX512 SIMD instructions on KNLs are underutilized when running the
state-of-the-art graph framework. 4 The sub-NUMA cache clustering mode offering
the lowest local memory access latency hurts the performance of graph
benchmarks that are lack of NUMA awareness. At last, We suggest future works
including system auto-tuning tools and graph framework optimizations to fully
exploit the potential of KNL for parallel graph processing.Comment: published as L. Jiang, L. Chen and J. Qiu, "Performance
Characterization of Multi-threaded Graph Processing Applications on
Many-Integrated-Core Architecture," 2018 IEEE International Symposium on
Performance Analysis of Systems and Software (ISPASS), Belfast, United
Kingdom, 2018, pp. 199-20
- …