RePP-C: runtime estimation of performance-power with workload consolidation in CMPs
Configuring hardware knobs in multicore environments to meet performance-power demands is a desirable feature in modern data centers. At the same time, high energy efficiency (performance per watt) requires optimal thread-to-core assignment. In this paper, we present a runtime estimator (RePP-C) for performance and power, characterized by processor frequency states (P-states), a wide range of sleep intervals (C-states), and workload consolidation. We also present a scheme for frequency- and contention-aware thread-to-core assignment (FACTS) that considers the demands of individual threads. The proposed solution (RePP-C) selects a hardware configuration for each active core so that the performance-power demands are satisfied, while using the scheduling scheme (FACTS) to map threads to cores. Our results show that FACTS improves over other state-of-the-art schedulers such as Distributed Intensity Online (DIO) and the native Linux scheduler by 8.25% and 37.56% in performance, with simultaneous improvements in energy efficiency of 6.2% and 14.17%, respectively. Moreover, we demonstrate the usability of RePP-C by predicting performance and power for 7 different types of workloads and 10 different QoS targets. The results show an average error of 7.55% and 8.96% (with a 95% confidence interval) when predicting energy and performance, respectively.
This work has been partially supported by the European Union FP7 program through the Mont-Blanc-2 project (FP7-ICT-610402), by the Ministerio de Economía y Competitividad under contract Computación de Altas Prestaciones VII (TIN2015-65316-P), and by the Departament d'Innovació, Universitats i Empresa de la Generalitat de Catalunya under project MPEXPAR: Models de Programació i Entorns d'Execució Paral·lels (2014-SGR-1051).
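As an illustration of the kind of hardware knobs RePP-C manipulates, the following C sketch sets a core's P-state through the Linux cpufreq sysfs interface and pins the calling thread to that core. The sysfs path, the use of the userspace governor, and the chosen frequency are assumptions made for illustration; they are not the paper's actual mechanism.

```c
/* Minimal sketch: set a core's P-state via the Linux cpufreq sysfs
 * interface and pin the calling thread to that core. Assumes root
 * privileges and the "userspace" governor; RePP-C's real mechanism
 * may differ. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

static int set_frequency_khz(int cpu, long freq_khz) {
    char path[128];
    snprintf(path, sizeof path,
             "/sys/devices/system/cpu/cpu%d/cpufreq/scaling_setspeed", cpu);
    FILE *f = fopen(path, "w");
    if (!f) return -1;                 /* needs root and userspace governor */
    fprintf(f, "%ld", freq_khz);
    fclose(f);
    return 0;
}

static int pin_self_to_cpu(int cpu) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    return sched_setaffinity(0, sizeof set, &set);   /* 0 = calling thread */
}

int main(void) {
    if (set_frequency_khz(0, 1200000) != 0)    /* 1.2 GHz P-state on core 0 */
        perror("set_frequency_khz");
    if (pin_self_to_cpu(0) != 0)
        perror("pin_self_to_cpu");
    return 0;
}
```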
Parallel programming issues and what the compiler can do to help
Twenty-first century parallel programming models are becoming truly complex due to the diversity of architectures they need to target (multi- and many-cores, GPUs, FPGAs, etc.). What if we could use one programming model to rule them all, one programming model to find them, one programming model to bring them all and in the darkness bind them, in the land of MareNostrum where the Applications lie? The OmpSs programming model is an attempt to do so, by means of compiler directives.
Compilers are essential tools for exploiting applications and the architectures they run on. In this sense, compiler analysis and optimization techniques have been widely studied in order to produce better-performing and less resource-consuming code.
In this paper we present two uses of several analyses we have implemented in the Mercurium [3] source-to-source compiler: a) the first is to provide users with correctness hints regarding the usage of OpenMP and OmpSs tasks; b) the second is to enable executing OpenMP on embedded systems with very little memory, by computing the Task Dependency Graph of the application at compile time. We also present the next steps of our work: a) extending range analysis to handle recursive OpenMP and OmpSs applications, and b) modeling applications that use the OmpSs and future OpenMP 4.1 task priorities feature.
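For concreteness, the snippet below is a minimal, generic OpenMP tasking example in C with explicit dependences, the kind of code whose Task Dependency Graph a source-to-source compiler such as Mercurium could derive statically; it is illustrative and does not come from the paper.

```c
/* Minimal OpenMP tasking example with explicit dependences. A compiler
 * that sees x written by the first task and read by the other two can
 * build the task dependency graph statically: t1 -> {t2, t3}. */
#include <stdio.h>

int main(void) {
    int x = 0, y = 0, z = 0;
    #pragma omp parallel
    #pragma omp single
    {
        #pragma omp task depend(out: x)                /* t1: produces x */
        x = 42;
        #pragma omp task depend(in: x) depend(out: y)  /* t2: consumes x */
        y = x + 1;
        #pragma omp task depend(in: x) depend(out: z)  /* t3: consumes x */
        z = x * 2;
        #pragma omp taskwait                           /* join all tasks */
        printf("y=%d z=%d\n", y, z);
    }
    return 0;
}
```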
REPP-H: runtime estimation of power and performance on heterogeneous data centers
Modern data centers increasingly demand improved performance with minimal power consumption. Managing the power and performance requirements of applications is challenging because these data centers, incidentally or intentionally, have to deal with server architecture heterogeneity [19], [22]. One critical challenge data centers face is how to manage system power and performance given the different application behavior across multiple different architectures.
This work has been supported by the EU FP7 program (Mont-Blanc 2, ICT-610402), by the Ministerio de Economía (CAP-VII, TIN2015-65316-P), and by the Generalitat de Catalunya (MPEXPAR, 2014-SGR-1051). The material herein is based in part upon work supported by the US NSF, grant numbers ACI-1535232 and CNS-1305220.
Dynamic Scheduling of Parallel Applications on Shared-Memory Multiprocessors
Runtime estimation of performance–power in CMPs under QoS constraints
One of the main challenges in data center systems is operating under given Quality of Service (QoS) constraints while minimizing power consumption. Increasingly, data centers are exploring and adopting heterogeneous server architectures with different power and performance trade-offs. This requires not only a careful understanding of application behavior across multiple architectures at runtime, so as to meet power and performance requirements, but also an understanding of the individual and aggregated behavior of application-level and server-level performance and power metrics.
Using shared-data localization to reduce the cost of inspector-execution in unified-parallel-C programs
Programs written in the Unified Parallel C (UPC) language can access any location of the entire local and remote address space via read/write operations. However, UPC programs that contain fine-grained shared accesses can exhibit performance degradation. One solution is to use the inspector-executor technique to coalesce fine-grained shared accesses into larger remote access operations. A straightforward implementation of the inspector-executor transformation results in excessive instrumentation that hinders performance.
This paper addresses this issue and introduces various techniques aimed at reducing the generated instrumentation code: a shared-data localization transformation based on Constant-Stride Linear Memory Descriptors (CSLMADs), the inlining of data locality checks, and the usage of an index vector to aggregate the data. Finally, the paper introduces a lightweight loop code motion transformation to privatize shared scalars that were propagated through the loop body.
A performance evaluation, using up to 2048 cores of a POWER 775, explores the impact of each optimization and characterizes the overheads of UPC programs. It also shows that the presented optimizations increase the performance of UPC programs by up to 1.8x over their hand-optimized UPC counterparts for applications with regular accesses, and by up to 6.3x for applications with irregular accesses.
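To make the inspector-executor transformation concrete, the following plain-C sketch shows its generic shape: an inspector loop records the indices the loop body will touch, a single coalesced transfer replaces the fine-grained remote reads, and the executor loop computes on the privatized local copy. The shared_data array and bulk_fetch helper are hypothetical stand-ins, not the UPC runtime's API.

```c
/* Generic inspector-executor sketch in plain C. shared_data simulates a
 * remote shared array; bulk_fetch() is a hypothetical stand-in for the
 * single coalesced remote get that replaces N fine-grained shared reads. */
#include <stddef.h>
#include <stdio.h>

#define N 8
static double shared_data[64];   /* stands in for the UPC shared array */

/* Hypothetical coalesced transfer: one call instead of N remote reads. */
static void bulk_fetch(double *local, const size_t *idx, size_t n) {
    for (size_t i = 0; i < n; i++)
        local[i] = shared_data[idx[i]];
}

int main(void) {
    size_t col[N] = {3, 7, 1, 42, 5, 9, 0, 13};   /* irregular accesses */
    size_t idx[N];
    double local[N], out[N];

    for (size_t i = 0; i < 64; i++) shared_data[i] = (double)i;

    /* Inspector: record which remote indices the loop body will touch. */
    for (size_t i = 0; i < N; i++)
        idx[i] = col[i];

    bulk_fetch(local, idx, N);   /* aggregate via the index vector */

    /* Executor: compute on the privatized local copy. */
    for (size_t i = 0; i < N; i++)
        out[i] = 2.0 * local[i];

    printf("out[0]=%.1f\n", out[0]);
    return 0;
}
```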
Energy optimizing methodologies on heterogeneous data centers
In 2013, U.S. data centers accounted for 2.2% of the country's total electricity consumption, a figure that is projected to increase rapidly over the next decade. Many important workloads are interactive, and they demand strict levels of quality of service (QoS) to meet user expectations, making it challenging to reduce power consumption in the face of increasing performance demands.
Analyzing the performance of hierarchical collective algorithms on ARM-based multicore clusters
MPI is the de facto standard communication library for parallel applications on distributed-memory architectures. The performance of collective operations is critical in HPC applications, as they can become the bottleneck of their executions. The advent of larger node sizes in multicore clusters has motivated the exploration of hierarchical collective algorithms that are aware of the process placement in the cluster and of the memory hierarchy. This work analyses and compares several hierarchical collective algorithms from the literature that are not part of the current MPI standard. We implement the algorithms on top of OpenMPI using the shared-memory facility provided by MPI-3 at the intra-node level and evaluate them on ARM-based multicore clusters. From our results, we identify aspects of the algorithms that impact their performance and applicability. Finally, we propose a model that helps us analyze the scalability of the algorithms.
This work has been supported by the Spanish Ministry of Education (PID2019-107255GB-C22) and the Generalitat de Catalunya (2017-SGR-1414).
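A common building block for such hierarchical collectives is to split the world communicator into per-node shared-memory communicators plus a leaders communicator, using only standard MPI-3 calls. The sketch below shows a two-level reduce built this way; it illustrates the general pattern rather than the specific algorithms evaluated in the work.

```c
/* Two-level hierarchical reduce: reduce inside each node over the
 * shared-memory communicator, then across one leader per node. Uses
 * only standard MPI-3 calls; the evaluated algorithms may differ. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Intra-node communicator: ranks sharing memory on the same node. */
    MPI_Comm node;
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, rank,
                        MPI_INFO_NULL, &node);
    int node_rank;
    MPI_Comm_rank(node, &node_rank);

    /* Leaders communicator: one rank (node_rank == 0) per node. */
    MPI_Comm leaders;
    MPI_Comm_split(MPI_COMM_WORLD, node_rank == 0 ? 0 : MPI_UNDEFINED,
                   rank, &leaders);

    double local = (double)rank, node_sum = 0.0, total = 0.0;
    MPI_Reduce(&local, &node_sum, 1, MPI_DOUBLE, MPI_SUM, 0, node); /* level 1 */
    if (leaders != MPI_COMM_NULL) {
        MPI_Reduce(&node_sum, &total, 1, MPI_DOUBLE, MPI_SUM, 0,
                   leaders);                                        /* level 2 */
        if (rank == 0) printf("global sum = %f\n", total);
        MPI_Comm_free(&leaders);
    }
    MPI_Comm_free(&node);
    MPI_Finalize();
    return 0;
}
```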
HBM, present and future of HPC based on FPGAs
In the past decades, advances in the speed of commodity CPUs have far outpaced advances in memory latency. Main-memory access is therefore increasingly a performance bottleneck for many computer applications, including HPC embedded systems. This translates into unprecedented memory performance requirements in critical systems that commonly used DRAM memories struggle to provide. High-Bandwidth Memory (HBM) can satisfy these requirements, offering high bandwidth, low power, and high integration capacity. However, it remains unclear whether the predictability and isolation properties of HBM are compatible with the requirements of critical embedded systems. In our research, a deep characterization of HBM for its use in MEEP applications is performed.
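As a hedged illustration of one piece such a characterization could include, a STREAM-style triad kernel is a standard first-order probe of sustained memory bandwidth; the sketch below is generic C and is not the MEEP project's actual benchmark suite.

```c
/* STREAM-style triad: a standard kernel for measuring sustained memory
 * bandwidth, usable as a first-order probe of HBM vs. DRAM behavior.
 * Generic sketch, not the MEEP project's characterization suite. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N (1L << 24)   /* ~16M doubles per array: far larger than caches */

int main(void) {
    double *a = malloc(N * sizeof *a);
    double *b = malloc(N * sizeof *b);
    double *c = malloc(N * sizeof *c);
    if (!a || !b || !c) return 1;
    for (long i = 0; i < N; i++) { b[i] = 1.0; c[i] = 2.0; }

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long i = 0; i < N; i++)          /* triad: a = b + s * c */
        a[i] = b[i] + 3.0 * c[i];
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double secs  = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    double bytes = 3.0 * N * sizeof(double);   /* 2 reads + 1 write */
    /* Print a result value so the compiler cannot drop the loop. */
    printf("a[N/2]=%.1f, triad bandwidth: %.2f GB/s\n",
           a[N / 2], bytes / secs / 1e9);
    free(a); free(b); free(c);
    return 0;
}
```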