142,748 research outputs found

    WMTrace : a lightweight memory allocation tracker and analysis framework

    Get PDF
    The diverging gap between processor and memory performance has been a well discussed aspect of computer architecture literature for some years. The use of multi-core processor designs has, however, brought new problems to the design of memory architectures - increased core density without matched improvement in memory capacity is reduc- ing the available memory per parallel process. Multiple cores accessing memory simultaneously degrades performance as a result of resource con- tention for memory channels and physical DIMMs. These issues combine to ensure that memory remains an on-going challenge in the design of parallel algorithms which scale. In this paper we present WMTrace, a lightweight tool to trace and analyse memory allocation events in parallel applications. This tool is able to dynamically link to pre-existing application binaries requiring no source code modification or recompilation. A post-execution analysis stage enables in-depth analysis of traces to be performed allowing memory allocations to be analysed by time, size or function. The second half of this paper features a case study in which we apply WMTrace to five parallel scientific applications and benchmarks, demonstrating its effectiveness at recording high-water mark memory consumption as well as memory use per-function over time. An in-depth analysis is provided for an unstructured mesh benchmark which reveals significant memory allocation imbalance across its participating processes

    High Performance Computing Framework for Tera-Scale Database Search of Mass Spectrometry Data

    Get PDF
    Database peptide search algorithms deduce peptides from mass spectrometry data. There has been substantial effort in improving their computational efficiency to achieve larger and more complex systems biology studies. However, modern serial and high-performance computing (HPC) algorithms exhibit suboptimal performance mainly due to their ineffective parallel designs (low resource utilization) and high overhead costs. We present an HPC framework, called HiCOPS, for efficient acceleration of the database peptide search algorithms on distributed-memory supercomputers. HiCOPS provides, on average, more than tenfold improvement in speed and superior parallel performance over several existing HPC database search software. We also formulate a mathematical model for performance analysis and optimization, and report near-optimal results for several key metrics including strong-scale efficiency, hardware utilization, load-balance, inter-process communication and I/O overheads. The core parallel design, techniques and optimizations presented in HiCOPS are search-algorithm-independent and can be extended to efficiently accelerate the existing and future algorithms and software

    Performance analysis of a scalable hardware FPGA Skein implementation

    Get PDF
    Hashing functions are a key cryptographic primitive used in many everyday applications, such as authentication, ensuring data integrity, as well as digital signatures. The current hashing standard is defined by the National Institute of Standards and Technology (NIST) as the Secure Hash Standard (SHS), and includes SHA-1, SHA-224, SHA-256, SHA-384 and SHA-512 . SHS\u27s level of security is waning as technology and analysis techniques continue to develop over time. As a result, after the 2005 Cryptographic Hash Workshop, NIST called for the creation of a new cryptographic hash algorithm to replace SHS. The new candidate algorithms were submitted on October 31st, 2008, and of them fourteen have advanced to round two of the competition. The competition is expected to produce a final replacement for the SHS standard by 2012. Multi-core processors, and parallel programming are the dominant force in computing, and some of the new hashing algorithms are attempting to take advantage of these resources by offering parallel tree-hashing variants to the algorithms. Tree-hashing allows multiple parts of the data on the same level of a tree to be operated on simultaneously, resulting in the potential to reduce the execution time complexity for hashing from O(n) to O(log n). Designs for tree-hashing require that the scalability and parallelism of the algorithms be researched on all platforms, including multi-core processors (CPUs), graphics processors (GPUs), as well as custom hardware (ASICs and FPGAs). Skein, the hashing function that this work has focused on, offers a tree-hashing mode with different options for the maximum tree height, and leaf node size, as well as the node fan-out. This research focuses on creating and analyzing the performance of scalable hardware designs for Skein\u27s tree hashing mode. Different ideas and approaches on how to modify sequential hashing cores, and create scalable control logic in order to provide for high-speed and low-area parallel hashing hardware are presented and analyzed. Equations were created to help understand the expected performance and potential bottlenecks of Skein in FPGAs. The equations are intended to assist the decision making process during the design phase, as well as potentially provide insight into design considerations for other tree hashing schemes in FPGAs. The results are also compared to current sequential designs of Skein, providing a complete analysis of the performance of Skein in an FPGA

    Making and breaking power laws in evolutionary algorithm population dynamics

    Get PDF
    Deepening our understanding of the characteristics and behaviors of population-based search algorithms remains an important ongoing challenge in Evolutionary Computation. To date however, most studies of Evolutionary Algorithms have only been able to take place within tightly restricted experimental conditions. For instance, many analytical methods can only be applied to canonical algorithmic forms or can only evaluate evolution over simple test functions. Analysis of EA behavior under more complex conditions is needed to broaden our understanding of this population-based search process. This paper presents an approach to analyzing EA behavior that can be applied to a diverse range of algorithm designs and environmental conditions. The approach is based on evaluating an individual’s impact on population dynamics using metrics derived from genealogical graphs.\ud From experiments conducted over a broad range of conditions, some important conclusions are drawn in this study. First, it is determined that very few individuals in an EA population have a significant influence on future population dynamics with the impact size fitting a power law distribution. The power law distribution indicates there is a non-negligible probability that single individuals will dominate the entire population, irrespective of population size. Two EA design features are however found to cause strong changes to this aspect of EA behavior: i) the population topology and ii) the introduction of completely new individuals. If the EA population topology has a long path length or if new (i.e. historically uncoupled) individuals are continually inserted into the population, then power law deviations are observed for large impact sizes. It is concluded that such EA designs can not be dominated by a small number of individuals and hence should theoretically be capable of exhibiting higher degrees of parallel search behavior

    Ergonomic Chair Design by Fusing Qualitative and Quantitative Criteria using Interactive Genetic Algorithms

    Get PDF
    This paper emphasizes the necessity of formally bringing qualitative and quantitative criteria of ergonomic design together, and provides a novel complementary design framework with this aim. Within this framework, different design criteria are viewed as optimization objectives; and design solutions are iteratively improved through the cooperative efforts of computer and user. The framework is rooted in multi-objective optimization, genetic algorithms and interactive user evaluation. Three different algorithms based on the framework are developed, and tested with an ergonomic chair design problem. The parallel and multi-objective approaches show promising results in fitness convergence, design diversity and user satisfaction metrics

    Optimized Broadcast for Deep Learning Workloads on Dense-GPU InfiniBand Clusters: MPI or NCCL?

    Full text link
    Dense Multi-GPU systems have recently gained a lot of attention in the HPC arena. Traditionally, MPI runtimes have been primarily designed for clusters with a large number of nodes. However, with the advent of MPI+CUDA applications and CUDA-Aware MPI runtimes like MVAPICH2 and OpenMPI, it has become important to address efficient communication schemes for such dense Multi-GPU nodes. This coupled with new application workloads brought forward by Deep Learning frameworks like Caffe and Microsoft CNTK pose additional design constraints due to very large message communication of GPU buffers during the training phase. In this context, special-purpose libraries like NVIDIA NCCL have been proposed for GPU-based collective communication on dense GPU systems. In this paper, we propose a pipelined chain (ring) design for the MPI_Bcast collective operation along with an enhanced collective tuning framework in MVAPICH2-GDR that enables efficient intra-/inter-node multi-GPU communication. We present an in-depth performance landscape for the proposed MPI_Bcast schemes along with a comparative analysis of NVIDIA NCCL Broadcast and NCCL-based MPI_Bcast. The proposed designs for MVAPICH2-GDR enable up to 14X and 16.6X improvement, compared to NCCL-based solutions, for intra- and inter-node broadcast latency, respectively. In addition, the proposed designs provide up to 7% improvement over NCCL-based solutions for data parallel training of the VGG network on 128 GPUs using Microsoft CNTK.Comment: 8 pages, 3 figure
    • 

    corecore