RePP-C: runtime estimation of performance-power with workload consolidation in CMPs
Configuring hardware knobs in multicore environments to meet performance-power demands is a desirable feature in modern data centers. At the same time, high energy efficiency (performance per watt) requires optimal thread-to-core assignment. In this paper, we present a runtime estimator (RePP-C) for performance and power, characterized by processor frequency states (P-states), a wide range of sleep intervals (C-states), and workload consolidation. We also present a scheme for frequency- and contention-aware thread-to-core assignment (FACTS) that considers the demands of individual threads. The proposed solution (RePP-C) selects a hardware configuration for each active core so that the performance-power demands are satisfied, while using the scheduling scheme (FACTS) to map threads to cores. Our results show that FACTS improves over other state-of-the-art schedulers such as Distributed Intensity Online (DIO) and the native Linux scheduler by 8.25% and 37.56% in performance, with simultaneous improvements in energy efficiency of 6.2% and 14.17%, respectively. Moreover, we demonstrate the usability of RePP-C by predicting performance and power for 7 different types of workloads and 10 different QoS targets. The results show an average error of 7.55% and 8.96% (with a 95% confidence interval) when predicting energy and performance, respectively.
This work has been partially supported by the European Union FP7 program through the Mont-Blanc-2 project (FP7-ICT-610402), by the Ministerio de Economía y Competitividad under contract Computación de Altas Prestaciones VII (TIN2015-65316-P), and by the Departament d'Innovació, Universitats i Empresa de la Generalitat de Catalunya under project MPEXPAR: Models de Programació i Entorns d'Execució Paral·lels (2014-SGR-1051).
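As an illustration of the kind of hardware knobs RePP-C manipulates, the following C sketch sets a core's P-state through the Linux cpufreq sysfs interface and pins the calling thread to that core. The sysfs path, the use of the userspace governor, and the chosen frequency are assumptions made for illustration; they are not the paper's actual mechanism.

```c
/* Minimal sketch: set a core's P-state via the Linux cpufreq sysfs
 * interface and pin the calling thread to that core. Assumes root
 * privileges and the "userspace" governor; RePP-C's real mechanism
 * may differ. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

static int set_frequency_khz(int cpu, long freq_khz) {
    char path[128];
    snprintf(path, sizeof path,
             "/sys/devices/system/cpu/cpu%d/cpufreq/scaling_setspeed", cpu);
    FILE *f = fopen(path, "w");
    if (!f) return -1;                 /* needs root and userspace governor */
    fprintf(f, "%ld", freq_khz);
    fclose(f);
    return 0;
}

static int pin_self_to_cpu(int cpu) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    return sched_setaffinity(0, sizeof set, &set);   /* 0 = calling thread */
}

int main(void) {
    if (set_frequency_khz(0, 1200000) != 0)    /* 1.2 GHz P-state on core 0 */
        perror("set_frequency_khz");
    if (pin_self_to_cpu(0) != 0)
        perror("pin_self_to_cpu");
    return 0;
}
```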
Parallel programming issues and what the compiler can do to help
Twenty-first century parallel programming models are becoming truly complex due to the diversity of architectures they need to target (multi- and many-cores, GPUs, FPGAs, etc.). What if we could use one programming model to rule them all, one programming model to find them, one programming model to bring them all and in the darkness bind them, in the land of MareNostrum where the Applications lie? The OmpSs programming model is an attempt to do so, by means of compiler directives.
Compilers are essential tools for exploiting applications and the architectures they run on. In this sense, compiler analysis and optimization techniques have been widely studied in order to produce better-performing and less resource-consuming code.
In this paper we present two uses of several analyses we have implemented in the Mercurium [3] source-to-source compiler: a) the first is to provide users with correctness hints regarding the usage of OpenMP and OmpSs tasks; b) the second is to enable executing OpenMP on embedded systems with very little memory, by computing the Task Dependency Graph of the application at compile time. We also present the next steps of our work: a) extending range analysis to handle recursive OpenMP and OmpSs applications, and b) modeling applications that use the OmpSs and future OpenMP 4.1 task priorities feature.
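For concreteness, the snippet below is a minimal, generic OpenMP tasking example in C with explicit dependences, the kind of code whose Task Dependency Graph a source-to-source compiler such as Mercurium could derive statically; it is illustrative and does not come from the paper.

```c
/* Minimal OpenMP tasking example with explicit dependences. A compiler
 * that sees x written by the first task and read by the other two can
 * build the task dependency graph statically: t1 -> {t2, t3}. */
#include <stdio.h>

int main(void) {
    int x = 0, y = 0, z = 0;
    #pragma omp parallel
    #pragma omp single
    {
        #pragma omp task depend(out: x)                /* t1: produces x */
        x = 42;
        #pragma omp task depend(in: x) depend(out: y)  /* t2: consumes x */
        y = x + 1;
        #pragma omp task depend(in: x) depend(out: z)  /* t3: consumes x */
        z = x * 2;
        #pragma omp taskwait                           /* join all tasks */
        printf("y=%d z=%d\n", y, z);
    }
    return 0;
}
```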
REPP-H: runtime estimation of power and performance on heterogeneous data centers
Modern data centers increasingly demand improved performance with minimal power consumption. Managing the power and performance requirements of applications is challenging because these data centers, incidentally or intentionally, have to deal with server architecture heterogeneity [19], [22]. One critical challenge data centers face is how to manage system power and performance given the different application behavior across multiple different architectures.
This work has been supported by the EU FP7 program (Mont-Blanc 2, ICT-610402), by the Ministerio de Economía (CAP-VII, TIN2015-65316-P), and by the Generalitat de Catalunya (MPEXPAR, 2014-SGR-1051). The material herein is based in part upon work supported by the US NSF, grant numbers ACI-1535232 and CNS-1305220.
Dynamic Scheduling of Parallel Applications on Shared-Memory Multiprocessors
Runtime estimation of performance–power in CMPs under QoS constraints
One of the main challenges in data center systems is operating under given Quality of Service (QoS) constraints while minimizing power consumption. Increasingly, data centers are exploring and adopting heterogeneous server architectures with different power and performance trade-offs. This requires not only a careful understanding of application behavior across multiple architectures at runtime, so as to meet power and performance requirements, but also an understanding of the individual and aggregated behavior of application-level and server-level performance and power metrics.
Using shared-data localization to reduce the cost of inspector-execution in unified-parallel-C programs
Programs written in the Unified Parallel C (UPC) language can access any location of the entire local and remote address space via read/write operations. However, UPC programs that contain fine-grained shared accesses can exhibit performance degradation. One solution is to use the inspector-executor technique to coalesce fine-grained shared accesses into larger remote access operations. A straightforward implementation of the inspector-executor transformation results in excessive instrumentation that hinders performance.
This paper addresses this issue and introduces various techniques aimed at reducing the generated instrumentation code: a shared-data localization transformation based on Constant-Stride Linear Memory Descriptors (CSLMADs), the inlining of data locality checks, and the usage of an index vector to aggregate the data. Finally, the paper introduces a lightweight loop code motion transformation to privatize shared scalars that were propagated through the loop body.
A performance evaluation, using up to 2048 cores of a POWER 775, explores the impact of each optimization and characterizes the overheads of UPC programs. It also shows that the presented optimizations increase the performance of UPC programs by up to 1.8x over their hand-optimized UPC counterparts for applications with regular accesses, and by up to 6.3x for applications with irregular accesses.
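To make the inspector-executor transformation concrete, the following plain-C sketch shows its generic shape: an inspector loop records the indices the loop body will touch, a single coalesced transfer replaces the fine-grained remote reads, and the executor loop computes on the privatized local copy. The shared_data array and bulk_fetch helper are hypothetical stand-ins, not the UPC runtime's API.

```c
/* Generic inspector-executor sketch in plain C. shared_data simulates a
 * remote shared array; bulk_fetch() is a hypothetical stand-in for the
 * single coalesced remote get that replaces N fine-grained shared reads. */
#include <stddef.h>
#include <stdio.h>

#define N 8
static double shared_data[64];   /* stands in for the UPC shared array */

/* Hypothetical coalesced transfer: one call instead of N remote reads. */
static void bulk_fetch(double *local, const size_t *idx, size_t n) {
    for (size_t i = 0; i < n; i++)
        local[i] = shared_data[idx[i]];
}

int main(void) {
    size_t col[N] = {3, 7, 1, 42, 5, 9, 0, 13};   /* irregular accesses */
    size_t idx[N];
    double local[N], out[N];

    for (size_t i = 0; i < 64; i++) shared_data[i] = (double)i;

    /* Inspector: record which remote indices the loop body will touch. */
    for (size_t i = 0; i < N; i++)
        idx[i] = col[i];

    bulk_fetch(local, idx, N);   /* aggregate via the index vector */

    /* Executor: compute on the privatized local copy. */
    for (size_t i = 0; i < N; i++)
        out[i] = 2.0 * local[i];

    printf("out[0]=%.1f\n", out[0]);
    return 0;
}
```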
Energy optimizing methodologies on heterogeneous data centers
In 2013, U.S. data centers accounted for 2.2% of the country's total electricity consumption, a figure that is projected to increase rapidly over the next decade. Many important workloads are interactive, and they demand strict levels of quality of service (QoS) to meet user expectations, making it challenging to reduce power consumption in the face of increasing performance demands.
Analyzing the performance of hierarchical collective algorithms on ARM-based multicore clusters
MPI is the de facto standard communication library for parallel applications on distributed-memory architectures. The performance of collective operations is critical in HPC applications, as they can become the bottleneck of their executions. The advent of larger node sizes in multicore clusters has motivated the exploration of hierarchical collective algorithms that are aware of the process placement in the cluster and of the memory hierarchy. This work analyses and compares several hierarchical collective algorithms from the literature that are not part of the current MPI standard. We implement the algorithms on top of OpenMPI using the shared-memory facility provided by MPI-3 at the intra-node level and evaluate them on ARM-based multicore clusters. From our results, we identify aspects of the algorithms that impact their performance and applicability. Finally, we propose a model that helps us analyze the scalability of the algorithms.
This work has been supported by the Spanish Ministry of Education (PID2019-107255GB-C22) and the Generalitat de Catalunya (2017-SGR-1414).
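A common building block for such hierarchical collectives is to split the world communicator into per-node shared-memory communicators plus a leaders communicator, using only standard MPI-3 calls. The sketch below shows a two-level reduce built this way; it illustrates the general pattern rather than the specific algorithms evaluated in the work.

```c
/* Two-level hierarchical reduce: reduce inside each node over the
 * shared-memory communicator, then across one leader per node. Uses
 * only standard MPI-3 calls; the evaluated algorithms may differ. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Intra-node communicator: ranks sharing memory on the same node. */
    MPI_Comm node;
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, rank,
                        MPI_INFO_NULL, &node);
    int node_rank;
    MPI_Comm_rank(node, &node_rank);

    /* Leaders communicator: one rank (node_rank == 0) per node. */
    MPI_Comm leaders;
    MPI_Comm_split(MPI_COMM_WORLD, node_rank == 0 ? 0 : MPI_UNDEFINED,
                   rank, &leaders);

    double local = (double)rank, node_sum = 0.0, total = 0.0;
    MPI_Reduce(&local, &node_sum, 1, MPI_DOUBLE, MPI_SUM, 0, node); /* level 1 */
    if (leaders != MPI_COMM_NULL) {
        MPI_Reduce(&node_sum, &total, 1, MPI_DOUBLE, MPI_SUM, 0,
                   leaders);                                        /* level 2 */
        if (rank == 0) printf("global sum = %f\n", total);
        MPI_Comm_free(&leaders);
    }
    MPI_Comm_free(&node);
    MPI_Finalize();
    return 0;
}
```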
HBM, present and future of HPC based on FPGAs
In the past decades, advances in the speed of commodity CPUs have far outpaced advances in memory latency. Main-memory access is therefore increasingly a performance bottleneck for many computer applications, including HPC embedded systems. This translates into unprecedented memory performance requirements in critical systems that commonly used DRAM memories struggle to provide. High-Bandwidth Memory (HBM) can satisfy these requirements, offering high bandwidth, low power, and high integration capacity. However, it remains unclear whether the predictability and isolation properties of HBM are compatible with the requirements of critical embedded systems. In our research, a deep characterization of HBM for its use in MEEP applications is performed.
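As a hedged illustration of one piece such a characterization could include, a STREAM-style triad kernel is a standard first-order probe of sustained memory bandwidth; the sketch below is generic C and is not the MEEP project's actual benchmark suite.

```c
/* STREAM-style triad: a standard kernel for measuring sustained memory
 * bandwidth, usable as a first-order probe of HBM vs. DRAM behavior.
 * Generic sketch, not the MEEP project's characterization suite. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N (1L << 24)   /* ~16M doubles per array: far larger than caches */

int main(void) {
    double *a = malloc(N * sizeof *a);
    double *b = malloc(N * sizeof *b);
    double *c = malloc(N * sizeof *c);
    if (!a || !b || !c) return 1;
    for (long i = 0; i < N; i++) { b[i] = 1.0; c[i] = 2.0; }

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long i = 0; i < N; i++)          /* triad: a = b + s * c */
        a[i] = b[i] + 3.0 * c[i];
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double secs  = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    double bytes = 3.0 * N * sizeof(double);   /* 2 reads + 1 write */
    /* Print a result value so the compiler cannot drop the loop. */
    printf("a[N/2]=%.1f, triad bandwidth: %.2f GB/s\n",
           a[N / 2], bytes / secs / 1e9);
    free(a); free(b); free(c);
    return 0;
}
```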