
    A Performance Comparison Using HPC Benchmarks: Windows HPC Server 2008 and Red Hat Enterprise Linux 5

    This document was developed with support from the National Science Foundation (NSF) under Grant No. 0910812 to Indiana University for "FutureGrid: An Experimental, High-Performance Grid Test-bed." Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the NSF. A collection of performance benchmarks has been run on an IBM System X iDataPlex cluster using two different operating systems. Windows HPC Server 2008 (WinHPC) and Red Hat Enterprise Linux v5.4 (RHEL5) are compared using the SPEC MPI2007 v1.1, High Performance Computing Challenge (HPCC), and National Science Foundation (NSF) acceptance test benchmark suites. Overall, we find the performance of WinHPC and RHEL5 to be equivalent, but significant performance differences exist when analyzing specific applications. We focus on presenting the results from the application benchmarks and include the results of the HPCC microbenchmark for completeness.

    Report about the collaboration between UITS/Research Technologies at Indiana University and the Center for Information Services and High Performance Computing at Technische Universität Dresden, Germany (2011-2012)

    This report lists the activities and outcomes for July 2011-June 2012 of the collaboration between Research Technologies, a division of University Information Technology Services at Indiana University (IU), and the Center for Information Services and High Performance Computing (ZIH) at Technische Universität Dresden. This material is based upon work supported in part by the National Science Foundation under Grant No. 0910812 to Indiana University for "FutureGrid: An Experimental, High-Performance Grid Test-bed." Partners in the FutureGrid project include San Diego Supercomputer Center at UC San Diego, University of Chicago, University of Florida, University of Southern California, University of Tennessee at Knoxville, University of Texas at Austin, Purdue University, University of Virginia, and T-U Dresden. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the NSF.

    Main memory in HPC: do we need more, or could we live with less?

    An important aspect of High-Performance Computing (HPC) system design is the choice of main memory capacity. This choice becomes increasingly important now that 3D-stacked memories are entering the market. Compared with conventional Dual In-line Memory Modules (DIMMs), 3D memory chiplets provide better performance and energy efficiency but lower memory capacities. Therefore, the adoption of 3D-stacked memories in the HPC domain depends on whether we can find use cases that require much less memory than is available now. This study analyzes the memory capacity requirements of important HPC benchmarks and applications. We find that the High-Performance Conjugate Gradients (HPCG) benchmark could be an important success story for 3D-stacked memories in HPC, but High-Performance Linpack (HPL) is likely to be constrained by 3D memory capacity. The study also emphasizes that the analysis of memory footprints of production HPC applications is complex and that it requires an understanding of application scalability and target category, i.e., whether the users target capability or capacity computing. The results show that most of the HPC applications under study have per-core memory footprints in the range of hundreds of megabytes, but we also detect applications and use cases that require gigabytes per core. Overall, the study identifies the HPC applications and use cases with memory footprints that could be provided by 3D-stacked memory chiplets, making a first step toward adoption of this novel technology in the HPC domain. This work was supported by the Collaboration Agreement between Samsung Electronics Co., Ltd. and BSC, by the Spanish Government through the Severo Ochoa programme (SEV-2015-0493), by the Spanish Ministry of Science and Technology through the TIN2015-65316-P project, and by the Generalitat de Catalunya (contracts 2014-SGR-1051 and 2014-SGR-1272).
This work has also received funding from the European Union's Horizon 2020 research and innovation programme under the ExaNoDe project (grant agreement No 671578). Darko Zivanovic holds the Severo Ochoa grant (SVP-2014-068501) of the Ministry of Economy and Competitiveness of Spain. The authors thank Harald Servat from BSC and Vladimir Marjanović from High Performance Computing Center Stuttgart for their technical support.

    Capturing the impact of external interference on HPC application performance

    HPC applications are large software packages with high computation and storage requirements. To meet these requirements, the architectures of supercomputers are continuously evolving and their capabilities are continuously increasing. Present-day supercomputers have achieved petaflops of computational power by utilizing thousands to millions of compute cores, connected through specialized communication networks, and are equipped with petabytes of storage using a centralized I/O subsystem. While fulfilling the high resource demands of HPC applications, such a design also entails its own challenges. Applications running on these systems own the computation resources exclusively, but share the communication interconnect and the I/O subsystem with other concurrently running applications. Simultaneous access to these shared resources causes contention and inter-application interference, leading to degraded application performance. Inter-application interference is one of the sources of run-to-run variation. While other sources of variation, such as operating system jitter, have been investigated before, this doctoral thesis specifically focuses on inter-application interference and studies it from the perspective of an application. Variation in execution time not only causes uncertainty and affects user expectations (especially during performance analysis), but also causes suboptimal usage of HPC resources. Therefore, this thesis aims to evaluate inter-application interference, establish trends among applications under contention, and approximate the impact of external influences on the runtime of an application. To this end, this thesis first presents a method to correlate the performance of applications running side-by-side. The method divides the runtime of a system into globally synchronized, fine-grained time slices for which application performance data is recorded separately. 
The evaluation of the method demonstrates that correlating application performance data can identify inter-application interference. The thesis further uses the method to study I/O interference and shows that file access patterns are a significant factor in determining the interference potential of an application. This thesis also presents a technique to estimate the impact of external influences on an application run. The technique introduces the concept of intrinsic performance characteristics to cluster similar application execution segments. Anomalies in the cluster are the result of external interference. An evaluation with several benchmarks shows high accuracy in estimating the impact of interference from a single application run. The contributions of this thesis will help establish interference trends and devise interference mitigation techniques. Similarly, estimating the impact of external interference will restore user expectations and help performance analysts separate application performance from external influence.
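
The time-slice idea in the abstract above can be illustrated with a minimal sketch. It is not the thesis's implementation: the function names, the `(timestamp, value)` sample format, and the choice of Pearson correlation are assumptions made here for illustration only.

```python
import math


def slice_totals(samples, slice_len, n_slices):
    """Sum (timestamp, value) samples into fixed-length time slices.

    All applications use the same slice boundaries, which stands in for
    the globally synchronized slices mentioned in the abstract.
    """
    totals = [0.0] * n_slices
    for timestamp, value in samples:
        index = int(timestamp // slice_len)
        if 0 <= index < n_slices:
            totals[index] += value
    return totals


def pearson(xs, ys):
    """Pearson correlation of two equally long sequences (assumed metric)."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant series carries no correlation signal
    return cov / math.sqrt(var_x * var_y)


def slice_correlation(samples_a, samples_b, slice_len, n_slices):
    """Correlate the per-slice activity of two applications."""
    a = slice_totals(samples_a, slice_len, n_slices)
    b = slice_totals(samples_b, slice_len, n_slices)
    return pearson(a, b)
```

Under these assumptions, a strongly negative value (one application's throughput falls in the slices where the other is active) would be a hint of possible interference, not proof of it.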

    Characterizing and Optimizing the Performance of the MAESTRO 49-core Processor

    As space-based imagery-intelligence systems become increasingly complex, processing units are needed that can process the extra data these systems seek to collect. However, the space environment presents a number of threats, such as ambient or malicious radiation, that can damage and otherwise interfere with electronic systems. There is a need, then, for processors that can tolerate radiation-induced faults, and that also have sufficient computational power to handle the large flow of data they encounter. This research investigates one potential solution: a multi-core processor that is radiation-hardened and designed to provide highly parallelized MIMD execution of applicable workloads. A variety of benchmarking programs are used to explore the capabilities of this processor. Additionally, the source code is modified in an attempt to enhance the processor speed and efficiency; the consequent improvements in performance are documented.

    Predicting Execution Readiness of MPI Binaries with FEAM, a Framework for Efficient Application Migration

    As computational science becomes increasingly relevant for performing research, shared computing resources made accessible by cyberinfrastructures emerge as especially valuable for the majority of scientists who have not traditionally been the dominant users of such resources. However, in order to provide these newer computational scientists the opportunities to do great research, the ease-of-use of shared computing resources needs to be increased. In this paper, we present techniques that aim to make the migration to (and between) shared computing resources more efficient. Specifically, we focus on determining whether a computing site is a good fit for running an MPI binary. We present our methods and a Linux-based implementation called FEAM (a Framework for Efficient Application Migration). FEAM predicts execution readiness, resolves missing shared libraries, and composes site-specific configurations. We show that FEAM is more than 90% accurate at predicting execution readiness of MPI application binaries from the NAS Parallel and SPEC MPI2007 benchmark suites. In our evaluation, only half of the migrated binaries execute successfully at sites only configured with a matching MPI implementation. We show that by automatically resolving shared library requirements, FEAM is able to increase the number of successful executions by a third.

    SPEChpc 2021 Benchmarks on Ice Lake and Sapphire Rapids Infiniband Clusters: A Performance and Energy Case Study

    In this work, fundamental performance, power, and energy characteristics of the full SPEChpc 2021 benchmark suite are assessed on two different clusters based on Intel Ice Lake and Sapphire Rapids CPUs using the MPI-only codes' variants. We use memory bandwidth, data volume, and scalability metrics in order to categorize the benchmarks and pinpoint relevant performance and scalability bottlenecks on the node and cluster levels. Common patterns such as memory bandwidth limitation, dominating communication and synchronization overhead, MPI serialization, superlinear scaling, and alignment issues could be identified, in isolation or in combination, showing that SPEChpc 2021 is representative of many HPC workloads. Power dissipation and energy measurements indicate that the modern Intel server CPUs have such a high idle power level that race-to-idle is the paramount strategy for energy to solution and energy-delay product minimization. On the chip level, only memory-bound code shows a clear advantage of Sapphire Rapids compared to Ice Lake in terms of energy to solution. Comment: 9 pages, 6 figures; corrected links to system doc

    Predicting Execution Readiness of MPI Binaries with FEAM, a Framework for Efficient Application Migration

    Today's scientific computing infrastructures provide scientists with easy access to a wide variety of computing resources. However, migrating applications to new computing sites can be tedious and time consuming. When optimal performance is not a concern, scientists can benefit by moving binaries instead of source code. Our work aims to make migration of MPI application binaries more efficient by automation. We present general methods that assess if binaries are a good match for execution at computing sites. We also present methods for increasing execution readiness by resolving missing shared libraries. Our work aims to free scientists from extensive manual preparation at new sites. To evaluate the effectiveness of our methods, we present an automated Linux-based implementation called FEAM, a Framework for Efficient Application Migration. We show that FEAM is more than 90% accurate at predicting execution readiness of MPI application binaries from the NAS Parallel and SPEC MPI2007 benchmark suites. In our evaluation, only half of the migrated binaries execute successfully at sites configured with a matching MPI implementation. We show that by automatically resolving shared library requirements, FEAM is able to increase the number of successful executions by a third.

    Runtime MPI Correctness Checking with a Scalable Tools Infrastructure

    Increasing computational demand of simulations motivates the use of parallel computing systems. At the same time, this parallelism poses challenges to application developers. The Message Passing Interface (MPI) is a de-facto standard for distributed memory programming in high performance computing. However, its use also enables complex parallel programming errors such as races, communication errors, and deadlocks. Automatic tools can assist application developers in the detection and removal of such errors. This thesis considers tools that detect such errors during an application run and advances them towards a combination of both precise checks (neither false positives nor false negatives) and scalability. This includes novel hierarchical checks that provide scalability, as well as a formal basis for a distributed deadlock detection approach. At the same time, the development of parallel runtime tools is challenging and time consuming, especially if scalability and portability are key design goals. Current tool development projects often create similar tool components, while component reuse remains low. To provide a perspective towards more efficient tool development, which simplifies scalable implementations, component reuse, and tool integration, this thesis proposes an abstraction for a parallel tools infrastructure along with a prototype implementation. This abstraction overcomes the use of multiple interfaces for different types of tool functionality, which limit flexible component reuse. Thus, this thesis advances runtime error detection tools and uses their redesign and their increased scalability requirements to apply and evaluate a novel tool infrastructure abstraction. The new abstraction ultimately allows developers to focus on their tool functionality, rather than on developing or integrating common tool components. The use of such an abstraction in wide ranges of parallel runtime tool development projects could greatly increase component reuse,
thus decreasing tool development time and cost. An application study with up to 16,384 application processes demonstrates the applicability of both the proposed runtime correctness concepts and of the proposed tools infrastructure.

    Ant: A Framework for Increasing the Efficiency of Sequential Debugging Techniques with Parallel Programs

    Bugs in sequential programs cost the software industry billions of dollars in lost productivity each year. Even if simple parallel programming models are created, they will not reduce the level of sequential bugs in programs below that of sequential programs. It can be argued that the complexity of current parallel programming models may increase the number of sequential bugs in parallel programs because they distract the programmer from the core logic of the program. Tools exist that identify statements related to sequential bugs and allow those bugs to be more quickly located and fixed. Their use in parallel programs will continue to be useful. Many of these debugging tools require runtime monitoring of program points of interest in a program and the overhead of this monitoring is usually very high. We propose Ant, a framework that increases the efficiency of sequential debugging techniques when used with parallel programs. The Ant framework takes two different strategies depending on whether the program to be debugged is a distributed memory program or shared memory program. For MPI programs, the Ant compiler analyzes the program and identifies two different types of code regions: those that all processes execute and regions that only part of the processes execute. For shared memory Pthreads programs, Ant uses a combination of static and dynamic analyses to determine similar parts of the program executing in parallel and the number of threads executing those parts of the program. The programs are instrumented with calls to Ant runtime libraries and debugging libraries based on the Ant compiler's static analysis results.
Relative to a naive port of a debugging tool (C-DIDUCE, in our case), Ant's technique, by exploiting the application's parallelism, reduces the monitoring overhead by up to 15.85 times (and on average 9.23 times) for MPI programs executing with 32 processes and up to 18.14 times (and on average 8.73 times) for Pthreads programs executing with 8 threads, while maintaining high accuracy.