469 research outputs found
Experiences with OpenMP in tmLQCD
An overview is given of the lessons learned from the introduction of
multi-threading using OpenMP in tmLQCD. In particular, programming style,
performance measurements, cache misses, scaling, thread distribution for hybrid
codes, race conditions, the overlapping of communication and computation and
the measurement and reduction of certain overheads are discussed. Performance
measurements and sampling profiles are given for different implementations of
the hopping matrix computational kernel.Comment: presented at the 31st International Symposium on Lattice Field Theory
(Lattice 2013), 29 July - 3 August 2013, Mainz, German
Bio-inspired call-stack reconstruction for performance analysis
The correlation of performance bottlenecks and their associated source code has become a cornerstone of performance analysis. It allows understanding why the efficiency of an application falls behind the computer's peak performance and enabling optimizations on the code ultimately. To this end, performance analysis tools collect the processor call-stack and then combine this information with measurements to allow the analyst comprehend the application behavior. Some tools modify the call-stack during run-time to diminish the collection expense but at the cost of resulting in non-portable solutions. In this paper, we present a novel portable approach to associate performance issues with their source code counterpart. To address it, we capture a reduced segment of the call-stack (up to three levels) and then process the segments using an algorithm inspired by multi-sequence alignment techniques. The results of our approach are easily mapped to detailed performance views, enabling the analyst to unveil the application behavior and its corresponding region of code. To demonstrate the usefulness of our approach, we have applied the algorithm to several first-time seen in-production applications to describe them finely, and optimize them by using tiny modifications based on the analyses.We thankfully acknowledge Mathis Bode for giving us access to the Arts CF binaries, and Miguel Castrillo and Kim Serradell for their valuable insight regarding Nemo. We would like to thank Forschungszentrum JĂĽlich for the computation time on their Blue Gene/Q system. This research has been partially funded by the CICYT under contracts No. TIN2012-34557 and TIN2015-65316-P.Peer ReviewedPostprint (author's final draft
Recommended from our members
Automation of Determination of Optimal Intra-Compute Node Parallelism
Maximizing the productivity of modern multicore and manycore chips requires optimizing parallelism at the compute node level. This is, however, a complex multi-step process. It is an iterative method requiring determining optimal degrees of parallel scalability and optimizing memory access behavior. Further, there are multiple cases to be considered, programs which use only MPI or OpenMP and hybrid (MPI +OpenMP) programs. This paper presents a set of three coordinated workflows for determining the optimal parallelism at the program level for MPI programs and at the loop level for hybrid (MPI+OpenMP) cases. The paper also details mostly automated implementations of these workflows using the PerfExpert infrastructure. Finally the paper presents case studies demonstrating both the applicability and the effectiveness of optimizing parallelism at the compute node level. The results shown in the paper will provide valuable information to further advance in the full automation of the workflows. The software implementing the parallelism scalability optimization is open source and available for download.Texas Advanced Computing Center (TACC)Computer Science
ScalAna: Automating Scaling Loss Detection with Graph Analysis
Scaling a parallel program to modern supercomputers is challenging due to
inter-process communication, Amdahl's law, and resource contention. Performance
analysis tools for finding such scaling bottlenecks either base on profiling or
tracing. Profiling incurs low overheads but does not capture detailed
dependencies needed for root-cause analysis. Tracing collects all information
at prohibitive overheads. In this work, we design ScalAna that uses static
analysis techniques to achieve the best of both worlds - it enables the
analyzability of traces at a cost similar to profiling. ScalAna first leverages
static compiler techniques to build a Program Structure Graph, which records
the main computation and communication patterns as well as the program's
control structures. At runtime, we adopt lightweight techniques to collect
performance data according to the graph structure and generate a Program
Performance Graph. With this graph, we propose a novel approach, called
backtracking root cause detection, which can automatically and efficiently
detect the root cause of scaling loss. We evaluate ScalAna with real
applications. Results show that our approach can effectively locate the root
cause of scaling loss for real applications and incurs 1.73% overhead on
average for up to 2,048 processes. We achieve up to 11.11% performance
improvement by fixing the root causes detected by ScalAna on 2,048 processes.Comment: conferenc
Implementation and scaling of the fully coupled Terrestrial Systems Modeling Platform (TerrSysMP) in a massively parallel supercomputing environment – a case study on JUQUEEN (IBM Blue Gene/Q)
Continental-scale hyper-resolution simulations constitute a grand challenge in characterizing non-linear feedbacks of states and fluxes of the coupled water, energy, and biogeochemical cycles of terrestrial systems. Tackling this challenge requires advanced coupling and supercomputing technologies for earth system models that are discussed in this study, utilizing the example of the implementation of the newly developed Terrestrial Systems Modeling Platform (TerrSysMP) on JUQUEEN (IBM Blue Gene/Q) of the JĂĽlich Supercomputing Centre, Germany. The applied coupling strategies rely on the Multiple Program Multiple Data (MPMD) paradigm and require memory and load balancing considerations in the exchange of the coupling fields between different component models and allocation of computational resources, respectively. These considerations can be reached with advanced profiling and tracing tools leading to the efficient use of massively parallel computing environments, which is then mainly determined by the parallel performance of individual component models. However, the problem of model I/O and initialization in the peta-scale range requires major attention, because this constitutes a true big data challenge in the perspective of future exa-scale capabilities, which is unsolved
Using hardware performance counters for fault localization
In this work, we leverage hardware performance counters-collected data as abstraction mechanisms for program executions and use these abstractions to identify likely causes of failures. Our approach can be summarized as follows: Hardware counters-based data is collected from both successful and failed executions, the data collected from the successful executions is used to create normal behavior models of programs, and deviations from these models observed in failed executions are scored and reported as likely causes of failures. The results of our experiments conducted on three open source projects suggest that the proposed approach can effectively prioritize the space of likely causes of failures, which can in turn improve the turn around time for defect fixes
- …