219 research outputs found

    RePP-C: runtime estimation of performance-power with workload consolidation in CMPs

    Configuration of hardware knobs in multicore environments to meet performance-power demands is a desirable feature in modern data centers. At the same time, high energy efficiency (performance per watt) requires an optimal thread-to-core assignment. In this paper, we present a runtime estimator (RePP-C) for performance-power, characterized by processor frequency states (P-states), a wide range of sleep states (C-states) and workload consolidation. We also present a schema for frequency- and contention-aware thread-to-core assignment (FACTS) which considers various thread demands. The proposed solution (RePP-C) selects a hardware configuration for each active core that satisfies the performance-power demands, while the scheduling schema (FACTS) maps threads to cores. Our results show that FACTS improves over other state-of-the-art schedulers such as Distributed Intensity Online (DIO) and the native Linux scheduler by 8.25% and 37.56% in performance, with a simultaneous improvement in energy efficiency of 6.2% and 14.17%, respectively. Moreover, we demonstrate the usability of RePP-C by predicting performance and power for 7 different types of workloads and 10 different QoS targets. The results show an average error of 7.55% and 8.96% (with a 95% confidence interval) when predicting energy and performance, respectively. This work has been partially supported by the European Union FP7 program through the Mont-Blanc-2 project (FP7-ICT-610402), by the Ministerio de Economia y Competitividad under contract Computacion de Altas Prestaciones VII (TIN2015-65316-P), and by the Departament d’Innovacio, Universitats i Empresa de la Generalitat de Catalunya under project MPEXPAR: Models de Programacio i Entorns d’Execucio Paral.lels (2014-SGR-1051).
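
    The abstract does not spell out the selection algorithm, so the following C sketch only illustrates the general idea of choosing a per-core hardware configuration from runtime performance-power estimates: among candidate (P-state, C-state) settings, pick the lowest-power one whose predicted performance still meets the QoS target. The struct layout, candidate table and numbers are assumptions made for illustration, not taken from the paper.

        /* Minimal sketch (assumptions only, not the paper's algorithm): pick the
         * lowest-power (P-state, C-state) configuration whose predicted
         * performance still meets the per-core QoS target. */
        #include <stdio.h>

        struct config {
            int    p_state;         /* frequency level */
            int    c_state;         /* idle (sleep) state depth */
            double perf_estimate;   /* predicted performance, e.g. instructions/s */
            double power_estimate;  /* predicted power in watts */
        };

        /* Returns the index of the cheapest configuration meeting the target, or -1. */
        static int select_config(const struct config *cfg, int n, double perf_target)
        {
            int best = -1;
            for (int i = 0; i < n; i++) {
                if (cfg[i].perf_estimate < perf_target)
                    continue;                                   /* would violate the QoS target */
                if (best < 0 || cfg[i].power_estimate < cfg[best].power_estimate)
                    best = i;                                   /* lowest power so far */
            }
            return best;
        }

        int main(void)
        {
            /* Illustrative numbers only. */
            struct config cfg[] = {
                { 0, 1, 1.00e9, 9.0 },   /* highest frequency, shallow sleep */
                { 1, 1, 0.80e9, 6.5 },
                { 2, 3, 0.55e9, 4.0 },   /* lowest frequency, deep sleep */
            };
            int i = select_config(cfg, 3, 0.75e9);
            if (i >= 0)
                printf("chosen: P%d/C%d at %.1f W\n", cfg[i].p_state, cfg[i].c_state, cfg[i].power_estimate);
            return 0;
        }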

    Parallel programming issues and what the compiler can do to help

    Twenty-first-century parallel programming models are becoming really complex due to the diversity of architectures they need to target (multi- and many-cores, GPUs, FPGAs, etc.). What if we could use one programming model to rule them all, one programming model to find them, one programming model to bring them all and in the darkness bind them, in the land of MareNostrum where the Applications lie. The OmpSs programming model is an attempt to do so, by means of compiler directives. Compilers are essential tools to exploit applications and the architectures they run on. In this sense, compiler analysis and optimization techniques have been widely studied in order to produce code that performs better and consumes less. In this paper we present two uses of several analyses we have implemented in the Mercurium [3] source-to-source compiler: a) the first is to give users correctness hints regarding the usage of OpenMP and OmpSs tasks; b) the second is to make it possible to execute OpenMP in embedded systems with very little memory, by computing the Task Dependency Graph of the application at compile time. We also present the next steps of our work: a) extending range analysis to handle recursive OpenMP and OmpSs applications, and b) modeling applications using OmpSs and the future OpenMP 4.1 task priorities feature.
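
    As a concrete illustration of the kind of task code such analyses reason about, here is a minimal OpenMP example in C (not taken from the paper) whose dependences are declared with depend clauses on array elements; because the clauses use simple expressions of the loop variable, a source-to-source compiler could in principle derive the task dependency graph of this loop statically. Compile with, e.g., gcc -fopenmp.

        /* Illustrative OpenMP tasking example (an assumption for illustration,
         * not code from the paper). Task i reads a[i-1] and writes a[i], so the
         * tasks form a statically known dependence chain. */
        #include <stdio.h>

        #define N 8

        int main(void)
        {
            int a[N] = {0};

            #pragma omp parallel
            #pragma omp single
            {
                for (int i = 1; i < N; i++) {
                    #pragma omp task depend(in: a[i-1]) depend(out: a[i])
                    a[i] = a[i - 1] + i;
                }
            }   /* implicit barriers guarantee all tasks have finished here */

            printf("a[%d] = %d\n", N - 1, a[N - 1]);
            return 0;
        }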

    REPP-H: runtime estimation of power and performance on heterogeneous data centers

    Modern data centers increasingly demand improved performance with minimal power consumption. Managing the power and performance requirements of the applications is challenging because these data centers, incidentally or intentionally, have to deal with server architecture heterogeneity [19], [22]. One critical challenge that data centers have to face is how to manage system power and performance given the different application behavior across multiple different architectures. This work has been supported by the EU FP7 program (Mont-Blanc 2, ICT-610402), by the Ministerio de Economia (CAP-VII, TIN2015-65316-P), and the Generalitat de Catalunya (MPEXPAR, 2014-SGR-1051). The material herein is based in part upon work supported by the US NSF, grant numbers ACI-1535232 and CNS-1305220.

    Runtime estimation of performance–power in CMPs under QoS constraints

    One of the main challenges in data center systems is operating under certain Quality of Service (QoS) constraints while minimizing power consumption. Increasingly, data centers are exploring and adopting heterogeneous server architectures with different power and performance trade-offs. This requires not only a careful understanding of application behavior across multiple architectures at runtime, so as to enable meeting power and performance requirements, but also an understanding of the individual and aggregated behavior of application- and server-level performance and power metrics.

    Using shared-data localization to reduce the cost of inspector-execution in unified-parallel-C programs

    Programs written in the Unified Parallel C (UPC) language can access any location of the entire local and remote address space via read/write operations. However, UPC programs that contain fine-grained shared accesses can exhibit performance degradation. One solution is to use the inspector-executor technique to coalesce fine-grained shared accesses into larger remote access operations. A straightforward implementation of the inspector-executor transformation results in excessive instrumentation that hinders performance. This paper addresses this issue and introduces several techniques that aim at reducing the generated instrumentation code: a shared-data localization transformation based on Constant-Stride Linear Memory Descriptors (CSLMADs) [S. Aarseth, Gravitational N-Body Simulations: Tools and Algorithms, Cambridge Monographs on Mathematical Physics, Cambridge University Press, 2003.], the inlining of data locality checks, and the usage of an index vector to aggregate the data. Finally, the paper introduces a lightweight loop code motion transformation to privatize shared scalars that were propagated through the loop body. A performance evaluation, using up to 2048 cores of a POWER 775, explores the impact of each optimization and characterizes the overheads of UPC programs. It also shows that the presented optimizations increase the performance of UPC programs by up to 1.8x over their hand-optimized UPC counterparts for applications with regular accesses, and by up to 6.3x for applications with irregular accesses.
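
    To make the inspector-executor idea concrete, the following plain-C sketch (an illustration under assumptions, not the paper's implementation or the real UPC runtime API) splits an indirect-access loop into an inspection pass that records which accesses are remote in an index vector, one aggregated fetch, and an execution pass that reads local elements directly. The helpers is_local, local_read and remote_get_vector are hypothetical stand-ins.

        #include <stdio.h>
        #include <stdlib.h>

        #define LOCAL_SIZE 1024

        static double local_part[LOCAL_SIZE];   /* stand-in for the locally owned block of the shared array */

        /* Hypothetical stand-ins for the runtime (assumptions, not the real API). */
        static int is_local(long idx) { return idx < LOCAL_SIZE; }
        static double local_read(long idx) { return local_part[idx]; }
        static void remote_get_vector(const long *idx, double *out, long n)
        {
            /* A real runtime would issue one coalesced remote fetch here;
             * the stub just fabricates values so the sketch runs. */
            for (long i = 0; i < n; i++) out[i] = (double)idx[i];
        }

        /* Inspector-executor over an indirect access pattern access[0..n-1]. */
        static void inspector_executor(const long *access, long n, double *result)
        {
            long   *remote_idx = malloc(n * sizeof *remote_idx);
            double *remote_val = malloc(n * sizeof *remote_val);
            long nremote = 0;

            /* Inspector: collect remote indices instead of fetching them one by one. */
            for (long i = 0; i < n; i++)
                if (!is_local(access[i]))
                    remote_idx[nremote++] = access[i];

            /* One aggregated communication replaces many fine-grained remote reads. */
            remote_get_vector(remote_idx, remote_val, nremote);

            /* Executor: compute, taking remote values from the prefetched buffer. */
            for (long i = 0, r = 0; i < n; i++)
                result[i] = is_local(access[i]) ? local_read(access[i]) : remote_val[r++];

            free(remote_idx);
            free(remote_val);
        }

        int main(void)
        {
            long access[4] = { 3, 5000, 7, 9000 };   /* two local, two remote accesses */
            double result[4];
            inspector_executor(access, 4, result);
            for (int i = 0; i < 4; i++) printf("result[%d] = %g\n", i, result[i]);
            return 0;
        }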

    Energy optimizing methodologies on heterogeneous data centers

    In 2013, U.S. data centers accounted for 2.2% of the country’s total electricity consumption, a figure that is projected to increase rapidly over the next decade. Many important workloads are interactive, and they demand strict levels of quality-of-service (QoS) to meet user expectations, making it challenging to reduce power consumption while meeting increasing performance demands.

    Analyzing the performance of hierarchical collective algorithms on ARM-based multicore clusters

    MPI is the de facto standard communication library for parallel applications on distributed-memory architectures. The performance of collective operations is critical in HPC applications, as they can become the bottleneck of an execution. The advent of larger node sizes in multicore clusters has motivated the exploration of hierarchical collective algorithms that are aware of the process placement in the cluster and of the memory hierarchy. This work analyses and compares several hierarchical collective algorithms from the literature that are not part of the current MPI standard. We implement the algorithms on top of Open MPI using the shared-memory facility provided by MPI-3 at the intra-node level and evaluate them on ARM-based multicore clusters. Our results highlight aspects of the algorithms that impact their performance and applicability. Finally, we propose a model that helps us analyze the scalability of the algorithms. This work has been supported by the Spanish Ministry of Education (PID2019-107255GB-C22) and the Generalitat de Catalunya (2017-SGR-1414).
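
    The following minimal C sketch (an illustration of the general structure, not one of the algorithms evaluated in the paper) shows how a hierarchical collective can be built with the MPI-3 shared-memory facility: an intra-node communicator is created with MPI_Comm_split_type(MPI_COMM_TYPE_SHARED), one leader per node takes part in the inter-node step, and the result is broadcast back within each node.

        /* Hierarchical allreduce sketch: intra-node reduce over an MPI-3
         * shared-memory communicator, allreduce among one leader per node,
         * then an intra-node broadcast of the result. */
        #include <mpi.h>
        #include <stdio.h>

        int main(int argc, char **argv)
        {
            MPI_Init(&argc, &argv);

            int rank;
            MPI_Comm_rank(MPI_COMM_WORLD, &rank);

            /* Intra-node communicator: the ranks that can share memory. */
            MPI_Comm node;
            MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, rank,
                                MPI_INFO_NULL, &node);
            int node_rank;
            MPI_Comm_rank(node, &node_rank);

            /* Inter-node communicator: one leader (node_rank == 0) per node. */
            MPI_Comm leaders;
            MPI_Comm_split(MPI_COMM_WORLD, node_rank == 0 ? 0 : MPI_UNDEFINED,
                           rank, &leaders);

            double local = (double)rank, node_sum = 0.0, global_sum = 0.0;

            MPI_Reduce(&local, &node_sum, 1, MPI_DOUBLE, MPI_SUM, 0, node);  /* step 1: intra-node */
            if (node_rank == 0)                                              /* step 2: across leaders */
                MPI_Allreduce(&node_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM, leaders);
            MPI_Bcast(&global_sum, 1, MPI_DOUBLE, 0, node);                  /* step 3: intra-node */

            if (rank == 0) printf("global sum = %.0f\n", global_sum);

            if (leaders != MPI_COMM_NULL) MPI_Comm_free(&leaders);
            MPI_Comm_free(&node);
            MPI_Finalize();
            return 0;
        }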

    HBM, present and future of HPC based on FPGAs

    In the past decades, advances in the speed of commodity CPUs have far outpaced advances in memory latency. Main memory access is therefore increasingly a performance bottleneck for many computer applications, including HPC embedded systems. This translates into unprecedented memory performance requirements in critical systems that the commonly used DRAM memories struggle to provide. High-Bandwidth Memory (HBM) can satisfy these requirements, offering high bandwidth, low power and high integration capacity. However, it remains unclear whether the predictability and isolation properties of HBM are compatible with the requirements of critical embedded systems. In our research, we perform a deep characterization of HBM for its use in MEEP applications.
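
    As an illustration of the kind of measurement a bandwidth characterization typically starts from, here is a STREAM-like triad microbenchmark in C (a generic sketch, not the MEEP characterization methodology; the array size, timing method and reported metric are assumptions).

        /* Generic STREAM-like triad microbenchmark for sustained memory bandwidth. */
        #include <stdio.h>
        #include <stdlib.h>
        #include <time.h>

        #define N (1L << 25)   /* 32 Mi doubles, 256 MiB per array */

        int main(void)
        {
            double *a = malloc(N * sizeof *a);
            double *b = malloc(N * sizeof *b);
            double *c = malloc(N * sizeof *c);
            if (!a || !b || !c) return 1;

            for (long i = 0; i < N; i++) { b[i] = 1.0; c[i] = 2.0; }

            struct timespec t0, t1;
            clock_gettime(CLOCK_MONOTONIC, &t0);
            for (long i = 0; i < N; i++)          /* triad: two loads and one store per element */
                a[i] = b[i] + 3.0 * c[i];
            clock_gettime(CLOCK_MONOTONIC, &t1);

            double secs  = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
            double bytes = 3.0 * (double)N * sizeof(double);   /* bytes moved by the kernel */
            /* Reading a[0] keeps the compiler from discarding the kernel. */
            printf("triad: %.2f GB/s (check a[0] = %.1f)\n", bytes / secs / 1e9, a[0]);

            free(a); free(b); free(c);
            return 0;
        }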