20 research outputs found

    Automatic mapping of parallel applications on multicore architectures using the Servet benchmark suite

    Get PDF
    This is a post-peer-review, pre-copyedit version of an article published in Computers & Electrical Engineering. The final authenticated version is available online at: https://doi.org/10.1016/j.compeleceng.2011.12.007[Abstract] Servet is a suite of benchmarks focused on detecting a set of parameters with high influence on the overall performance of multicore systems. These parameters can be used for autotuning codes to increase their performance on multicore clusters. Although Servet has been proved to detect accurately cache hierarchies, bandwidths and bottlenecks in memory accesses, as well as the communication overhead among cores, up to now the impact of the use of this information on application performance optimization has not been assessed. This paper presents a novel algorithm that automatically uses Servet for mapping parallel applications on multicore systems and analyzes its impact on three testbeds using three different parallel programming models: message-passing, shared memory and partitioned global address space (PGAS). Our results show that a suitable mapping policy based on the data provided by this tool can significantly improve the performance of parallel applications without source code modification.Ministerio de Ciencia e Innovación; TIN2010-16735Ministerio de Educación; FPU; AP2008-01578Xunta de Galicia; INCITE08PXIB105161P

    The Servet 3.0 benchmark suite: characterization of network performance degradation

    Get PDF
    This is a post-peer-review, pre-copyedit version of an article published in Computers & Electrical Engineering. The final authenticated version is available online at: https://doi.org/10.1016/j.compeleceng.2013.08.012.[Abstract] Servet is a suite of benchmarks focused on extracting a set of parameters with high influence on the overall performance of multicore clusters. These parameters can be used to optimize the performance of parallel applications by adapting part of their behavior to the characteristics of the machine. Up to now the tool considered network bandwidth as constant and independent of the communication pattern. Nevertheless, the inter-node communication bandwidth decreases on modern large supercomputers depending on the number of cores per node that simultaneously access the network and on the distance between the communicating nodes. This paper describes two new benchmarks that improve Servet by characterizing the network performance degradation depending on these factors. This work also shows the experimental results of these benchmarks on a Cray XE6 supercomputer and some examples of how real parallel codes can be optimized by using the information about network degradation.Ministerio de Ciencia e Innovación; TIN2010-16735Ministerio de Educación; AP2008-01578Ministerio de Educación; AP2010-4348European Commision; HPC-Europa2 Programme; 22839

    Servet: A Benchmark Suite for Autotuning on Multicore Clusters

    Get PDF
    This is a post-peer-review, pre-copyedit version of an article published in 2010 IEEE International Symposium on Parallel & Distributed Processing (IPDPS). Proceedings. The final authenticated version is available online at: http://dx.doi.org/10.1109/IPDPS.2010.5470358.[Abstract] MapReduce is a powerful tool for processing large data sets used by many applications running in distributed environments. However, despite the increasing number of computationally intensive problems that require low-latency communications, the adoption of MapReduce in High Performance Computing (HPC) is still emerging. Here languages based on the Partitioned Global Address Space (PGAS) programming model have shown to be a good choice for implementing parallel applications, in order to take advantage of the increasing number of cores per node and the programmability benefits achieved by their global memory view, such as the transparent access to remote data. This paper presents the first PGAS-based MapReduce implementation that uses the Unified Parallel C (UPC) language, which (1) obtains programmability benefits in parallel programming, (2) offers advanced configuration options to define a customized load distribution for different codes, and (3) overcomes performance penalties and bottlenecks that have traditionally prevented the deployment of MapReduce applications in HPC. The performance evaluation of representative applications on shared and distributed memory environments assesses the scalability of the presented MapReduce framework, confirming its suitability.Xunta de Galicia; INCITE08PXIB105161PRMinisterio de Ciencia e Innovación; TIN2007-67537-C03-02Ministerio de Educación; FPU; AP2008-0157

    UPCBLAS: a library for parallel matrix computations in Unified Parallel C

    Get PDF
    This is the peer reviewed version of the following article: González‐Domínguez, J. , Martín, M. J., Taboada, G. L., Touriño, J. , Doallo, R. , Mallón, D. A. and Wibecan, B. (2012), UPCBLAS: a library for parallel matrix computations in Unified Parallel C. Concurrency Computat.: Pract. Exper., 24: 1645-1667. doi:10.1002/cpe.1914, which has been published in final form at https://doi.org/10.1002/cpe.1914. This article may be used for non-commercial purposes in accordance with Wiley Terms and Conditions for Use of Self-Archived Versions.[Abstract] The popularity of Partitioned Global Address Space (PGAS) languages has increased during the last years thanks to their high programmability and performance through an efficient exploitation of data locality, especially on hierarchical architectures such as multicore clusters. This paper describes UPCBLAS, a parallel numerical library for dense matrix computations using the PGAS Unified Parallel C language. The routines developed in UPCBLAS are built on top of sequential basic linear algebra subprograms functions and exploit the particularities of the PGAS paradigm, taking into account data locality in order to achieve a good performance. Furthermore, the routines implement other optimization techniques, several of them by automatically taking into account the hardware characteristics of the underlying systems on which they are executed. The library has been experimentally evaluated on a multicore supercomputer and compared with a message‐passing‐based parallel numerical library, demonstrating good scalability and efficiency.Ministerio de Ciencia e Innovación; TIN2010-16735Ministerio de Educación; AP2008-0157

    UPCBLAS : a numerical library for unified parallel C with architecture-aware optimizations

    Get PDF
    [Abstract] The popularity of Partitioned Global Address Space (PGAS) languages has increased during the last years thanks to their high programmability and performance through an efficient exploitation of data locality, especially on hierarchical architectures like multicore clusters. This PhD Thesis describes UPCBLAS, a parallel library for numerical computation using the PGAS Unified Parallel C (UPC) language. The routines are built on top of sequential BLAS and SparseBLAS functions and exploit the particularities of the PGAS paradigm, taking into account data locality in order to achieve a good performance. However, the growing complexity in computer system hierarchies due to the increase in the number of cores per processor, levels of cache (some of them shared) and the number of processors per node, as well as the high-speed interconnects, demands the use of new optimization techniques and libraries that take advantage of their features. For this reason, this Thesis also presents Servet, a suite of benchmarks focused on detecting a set of parameters with high in uence on the overall performance of multicore systems. UPCBLAS routines use the hardware parameters provided by Servet to implement optimization techniques that improve their performance. The performance of the library has been experimentally evaluated on several multicore supercomputers and compared to message-passing-based parallel numerical libraries, demonstrating good scalability and efficiency. UPCBLAS has also been used to develop more complex numerical codes in order to demonstrate that it is a good alternative to MPI-based libraries for increasing the productivity of numerical application developers

    Efficient Characterization of Hidden Processor Memory Hierarchies

    Full text link
    A processor's memory hierarchy has a major impact on the performance of running code. However, computing platforms, where the actual hardware characteristics are hidden from both the end user and the tools that mediate execution, such as a compiler, a JIT and a runtime system, are used more and more, for example, performing large scale computation in cloud and cluster. Even worse, in such environments, a single computation may use a collection of processors with dissimilar characteristics. Ignorance of the performance-critical parameters of the underlying system makes it difficult to improve performance by optimizing the code or adjusting runtime-system behaviors; it also makes application performance harder to understand. To address this problem, we have developed a suite of portable tools that can efficiently derive many of the parameters of processor memory hierarchies, such as levels, effective capacity and latency of caches and TLBs, in a matter of seconds. The tools use a series of carefully considered experiments to produce and analyze cache response curves automatically. The tools are inexpensive enough to be used in a variety of contexts that may include install time, compile time or runtime adaption, or performance understanding tools.Comment: 14 pages, International Conference on Computational Science 201

    Systematic energy characterization of CMP/SMT processor systems via automated micro-benchmarks

    Get PDF
    Microprocessor-based systems today are composed of multi-core, multi-threaded processors with complex cache hierarchies and gigabytes of main memory. Accurate characterization of such a system, through predictive pre-silicon modeling and/or diagnostic postsilicon measurement based analysis are increasingly cumbersome and error prone. This is especially true of energy-related characterization studies. In this paper, we take the position that automated micro-benchmarks generated with particular objectives in mind hold the key to obtaining accurate energy-related characterization. As such, we first present a flexible micro-benchmark generation framework (MicroProbe) that is used to probe complex multi-core/multi-threaded systems with a variety and range of energy-related queries in mind. We then present experimental results centered around an IBM POWER7 CMP/SMT system to demonstrate how the systematically generated micro-benchmarks can be used to answer three specific queries: (a) How to project application-specific (and if needed, phase-specific) power consumption with component-wise breakdowns? (b) How to measure energy-per-instruction (EPI) values for the target machine? (c) How to bound the worst-case (maximum) power consumption in order to determine safe, but practical (i.e. affordable) packaging or cooling solutions? The solution approaches to the above problems are all new. Hardware measurement based analysis shows superior power projection accuracy (with error margins of less than 2.3% across SPEC CPU2006) as well as max-power stressing capability (with 10.7% increase in processor power over the very worst-case power seen during the execution of SPEC CPU2006 applications).Peer ReviewedPostprint (author’s final draft

    On the Overhead of Topology Discovery for Locality-aware Scheduling in HPC

    Get PDF
    International audienceThe increasing complexity of parallel computing platforms requires a deep knowledge of the hardware and of the application needs. Locality a key criteria for performance optimization. It involves software tools to expose information about the hardware topology to high performance runtime libraries. We show that the overhead of gathering such information from the operating system is significant on large computing nodes that run Linux. This overhead also increases more than linearly with the number of processes that perform it simultaneously. We then study the actual needs of the HPC software ecosystem in terms of topology information. We propose some ways to avoid multiple expensive topology discovery and to share topology information between components such as the resource manager or the runtime libraries

    Towards the Structural Modeling of the Topology of next-generation heterogeneous cluster Nodes with hwloc

    Get PDF
    Parallel computing platforms are increasingly complex, with multiple cores, shared caches, and NUMA memory interconnects, as well as asymmetric I/O access. Upcoming architectures will add a heterogeneous memory subsystem with non-volatile and/or high-bandwidth memory banks. Parallel applications developers have to take locality into account before they can expect good efficiency on these platforms. Thus there is a strong need for a portable tool gathering and exposing this information. The Hardware Locality project (hwloc) offers a tree representation of the hardware based on the inclusion of CPU resources and localities of memory and I/O devices. It is already widely used for affinity-based task placement in high performance computing. We present how hwloc represents parallel computing nodes, from the hierarchy of computing and memory resources to I/O device locality. It builds a structural model of the hardware to help application find the best resources fitting their needs. hwloc also annotates objects to ease identification of resources from different programming points of view. We finally describe how it helps process managers and batch schedulers to deal with the topology of multiple cluster nodes, by offering different compression techniques for better management of thousands of nodes
    corecore