14 research outputs found

    Performance impact of the interconnection network on MareNostrum applications

    Interconnection networks are one of the fundamental components of a supercomputing facility, and one of the most expensive parts. They represent one of the main differences between two supercomputers built from the same processor, and they have a significant impact on how applications should be developed. However, very little is known about how these expensive interconnection networks are used by the real applications running on supercomputing facilities. Furthermore, in the near future, chip multiprocessors offering near-supercomputing capabilities, with 64 to 256 processors per chip, will be readily available. On-chip interconnection networks offer the possibility of new designs with lower latencies and much higher bandwidths. In this paper we present an analysis of the impact of the interconnection network for some of the most representative applications running on MareNostrum, at the Barcelona Supercomputing Center. We have collected traces of real runs of the applications and verified that our performance model (Dimemas) accurately predicts the real machine performance. We then present hypothetical scenarios in which we change the network's latency, bandwidth, number of simultaneous connections, and CPU speed in order to quantify their importance for final application performance in the context of future on-chip interconnections. Our results show that CPU speed is more important than the interconnection network, and that among the network parameters, bandwidth is far more important than latency (which has very little impact) or connectivity (which is only relevant when per-connection bandwidth is low).
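
    The kind of what-if study described above can be pictured with a minimal trace-replay sketch. The model below is an illustrative assumption, not the actual Dimemas simulator: it charges scaled compute time plus a linear latency-plus-bandwidth cost per message, and sweeps hypothetical network parameters.

```python
# Minimal sketch (not the actual Dimemas model): replay a simplified trace and
# sweep hypothetical network parameters. The trace format and all numbers are
# illustrative assumptions.

def predict_runtime(trace, cpu_speedup=1.0, latency_s=5e-6, bandwidth_Bps=1e9):
    """Estimate run time as scaled compute time plus a simple cost per message."""
    total = 0.0
    for compute_s, msg_bytes in trace:                  # one (compute, message) pair per step
        total += compute_s / cpu_speedup                # faster CPU shrinks compute phases
        total += latency_s + msg_bytes / bandwidth_Bps  # linear point-to-point cost
    return total

# Tiny synthetic "trace": 1000 steps of 2 ms compute followed by a 1 MiB message.
trace = [(2e-3, 1 << 20)] * 1000

for bw in (1e8, 1e9, 1e10):                 # sweep bandwidth (bytes/s)
    for lat in (1e-6, 1e-5, 1e-4):          # sweep latency (s)
        t = predict_runtime(trace, latency_s=lat, bandwidth_Bps=bw)
        print(f"bw={bw:.0e} B/s lat={lat:.0e} s -> {t:.3f} s")
```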

    Generation of simple analytical models for message passing applications

    We present a methodology for deriving simple, accurate models that describe the performance of parallel applications without looking at the source code. A trace is obtained and linear models are derived by fitting the outcomes of a set of simulations in which the influential parameters, such as processor speed, network latency, and bandwidth, are varied. The simplicity of the linear models makes their factors naturally interpretable, so both prediction accuracy and interpretability are maintained. We also explain how we plan to extend this approach so that the models can extrapolate to processor counts different from the one of the given traces.
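
    As a rough illustration of the fitting step, the sketch below fits a linear model to run times obtained from a handful of simulated configurations. The model form, column choices, and all values are assumptions for demonstration, not the paper's actual models or coefficients.

```python
import numpy as np

# Illustrative sketch: fit a linear model to run times predicted by simulations
# with varied parameters. The model form and data are assumptions.

# Columns: 1/bandwidth (s/GB), latency (us), 1/cpu_speed_ratio, constant term.
X = np.array([
    [1.0, 1.0, 1.0, 1.0],
    [0.5, 1.0, 1.0, 1.0],
    [1.0, 5.0, 1.0, 1.0],
    [1.0, 1.0, 0.5, 1.0],
    [0.5, 5.0, 0.5, 1.0],
])
# Simulated total run times (s) for each configuration (hypothetical values).
t = np.array([12.0, 9.5, 12.3, 7.1, 5.0])

coeffs, *_ = np.linalg.lstsq(X, t, rcond=None)
labels = ["bandwidth term", "latency term", "compute term", "constant"]
for name, c in zip(labels, coeffs):
    print(f"{name}: {c:.3f}")   # each factor has a direct interpretation
```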

    On An Improved Parallel Construction Of Suffix Arrays For Low Bandwidth Pc-Cluster.

    An algorithm for the parallel construction of suffix arrays for texts over large alphabets on distributed-memory architectures is presented.
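
    For readers unfamiliar with the data structure, the snippet below shows a naive sequential suffix array construction as a conceptual reference only; the paper's contribution is a distributed-memory parallel algorithm suited to low-bandwidth PC clusters, which this sketch does not reproduce.

```python
# Conceptual reference only: a naive sequential suffix array construction.
# The paper's distributed-memory parallel algorithm is not reproduced here.

def suffix_array(text: str) -> list[int]:
    """Return the starting indices of all suffixes of `text` in sorted order."""
    return sorted(range(len(text)), key=lambda i: text[i:])

print(suffix_array("banana"))   # [5, 3, 1, 0, 4, 2]
```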

    Including the workload effect in the parallel program signature

    Performance prediction and application behavior modeling have been the subject of extensive research aiming to estimate application performance with acceptable precision. A novel approach to predicting the performance of parallel applications is based on the concept of Parallel Application Signatures, which consists of extracting an application's most relevant parts (phases) and the number of times they repeat (weights). By executing these phases on a target machine and multiplying each execution time by its weight, an estimate of the application's total execution time can be obtained. One of the problems is that the performance of an application depends on the program workload. Every type of workload affects differently how an application performs on a given system, and therefore affects the signature execution time. Since the workloads used in most scientific parallel applications have well-known dimensions and data ranges, and the behavior of these applications is mostly deterministic, a model of how a program's workload affects its performance can be obtained. We create a new methodology to model how a program's workload affects the parallel application signature. Using regression analysis, we are able to generalize the execution time and weight functions of each phase to predict an application's performance on a target system for any type of workload within a predefined range. We validate our methodology using a synthetic program, benchmark applications, and well-known real scientific applications.
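
    The prediction step described above can be summarized in a short sketch: per-phase regression models give execution time and repetition weight as functions of a workload parameter, and the predicted total is the weighted sum over phases. The phase data and linear fits below are synthetic placeholders; the paper's actual phases come from traces.

```python
import numpy as np

# Hedged sketch of the signature idea: per-phase regressions for time and weight,
# combined into a total-time estimate. All data is synthetic.

# For each phase: (workload sizes used for fitting, measured times, measured weights)
phases = [
    (np.array([100, 200, 400]), np.array([0.8, 1.6, 3.3]), np.array([10, 10, 10])),
    (np.array([100, 200, 400]), np.array([0.1, 0.4, 1.7]), np.array([5, 9, 17])),
]

def predict_total(workload: float) -> float:
    total = 0.0
    for sizes, times, weights in phases:
        t_fit = np.polyfit(sizes, times, 1)    # linear fit: phase time vs. workload
        w_fit = np.polyfit(sizes, weights, 1)  # linear fit: phase weight vs. workload
        total += np.polyval(t_fit, workload) * np.polyval(w_fit, workload)
    return total

print(f"Predicted total time for workload 300: {predict_total(300):.2f} s")
```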

    PerfBound: Conserving Energy with Bounded Overheads in On/Off-Based HPC Interconnects

    Energy and power are key challenges in high-performance computing. System energy efficiency must be significantly improved, and this requires greater efficiency in all subcomponents. An important target of optimization is the interconnect, since network links are always on, consuming power even during idle periods. A large number of HPC machines have a primary interconnect based on Ethernet (about 40 percent of TOP500 machines), which, since 2010, has included support for saving power via Energy Efficient Ethernet (EEE). Nevertheless, it is unlikely that HPC interconnects will use these energy-saving modes unless the performance overhead is known and small. This paper presents PerfBound, a self-contained technique to manage on/off-based networks such as EEE, minimizing interconnect link energy consumption subject to a bound on the performance degradation. PerfBound does not require changes to the applications, uses only local information already available at switches and NICs without introducing additional communication messages, and is compatible with multi-hop networks. PerfBound is evaluated using traces from a production supercomputer. For twelve out of fourteen applications, PerfBound achieves high energy savings, up to 70 percent for only 1 percent performance degradation. This paper also presents DynamicFastwake, which extends PerfBound to exploit multiple low-power states. DynamicFastwake achieves an energy-delay product 10 percent lower than the original PerfBound technique. This research was supported by the European Union's 7th Framework Programme [FP7/2007-2013] under the Mont-Blanc-3 (FP7-ICT-671697) and EUROSERVER (FP7-ICT-610456) projects, the Ministry of Economy and Competitiveness of Spain (TIN2012-34557 and TIN2015-65316), the Generalitat de Catalunya (FI-AGAUR 2012 FI B 00644, 2014-SGR-1051 and 2014-SGR-1272), the European Union's Horizon 2020 research and innovation programme under the HiPEAC-3 Network of Excellence (ICT-287759), and the Severo Ochoa Program (SEV-2011-00067) of the Spanish Government.
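
    To make the "bounded overhead" idea concrete, the sketch below shows a toy on/off link policy that sleeps only while the accumulated wake-up delay stays under a fixed fraction of the link's busy time. This is not the actual PerfBound algorithm; the state machine, thresholds, and timing constants are simplifying assumptions.

```python
# Illustrative on/off link policy in the spirit of bounding performance overhead.
# NOT the actual PerfBound algorithm; all constants are assumptions.

WAKE_TIME_S = 3e-6          # assumed time to bring a link out of its sleep state
OVERHEAD_BOUND = 0.01       # allow at most 1% added delay from wake-ups

class LinkPolicy:
    def __init__(self):
        self.busy_time = 0.0        # time spent transmitting so far
        self.overhead_time = 0.0    # accumulated wake-up delay charged so far

    def on_idle_start(self) -> bool:
        """Decide locally whether to put the link to sleep for this idle period."""
        # Sleep only if one more wake-up keeps total overhead within the bound.
        projected = self.overhead_time + WAKE_TIME_S
        return projected <= OVERHEAD_BOUND * max(self.busy_time, 1e-9)

    def on_wakeup(self):
        self.overhead_time += WAKE_TIME_S

    def on_transmit(self, duration_s: float):
        self.busy_time += duration_s
```

    Because the decision uses only counters the switch or NIC already maintains locally, no extra messages are needed, which is the property the paper emphasizes.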

    ACOTES project: Advanced compiler technologies for embedded streaming

    Streaming applications are built from data-driven computational components that consume and produce unbounded data streams. Streaming-oriented systems have become dominant in a wide range of domains, including embedded applications and DSPs. However, programming efficiently for streaming architectures is a challenging task: the computation must be carefully partitioned and mapped to processes in a way that best matches the underlying streaming architecture, taking into account the distributed resources (memory, processing, real-time requirements) and communication overheads (processing and delay). These challenges have led to a number of suggested solutions whose goal is to improve the programmer's productivity in developing applications that process massive streams of data on programmable, parallel embedded architectures. StreamIt is one such example. Another more recent approach is that developed by the ACOTES project (Advanced Compiler Technologies for Embedded Streaming). The ACOTES approach for streaming applications consists of compiler-assisted mapping of streaming tasks to highly parallel systems in order to maximize cost-effectiveness, both in terms of energy and in terms of design effort. The analysis and transformation techniques automate large parts of the partitioning and mapping process, based on the properties of the application domain, on quantitative information about the target systems, and on programmer directives. This paper presents the outcomes of the ACOTES project, a 3-year collaborative work of industrial (NXP, ST, IBM, Silicon Hive, NOKIA) and academic (UPC, INRIA, MINES ParisTech) partners, and advocates the use of the Advanced Compiler Technologies that we developed to support Embedded Streaming.
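
    The partitioning and mapping problem mentioned above can be pictured with a toy example: assign the tasks of a linear pipeline to a fixed number of cores, balancing compute load while keeping communicating neighbors on the same core. The greedy heuristic and the task costs below are assumptions for illustration, not the ACOTES compiler's actual algorithm.

```python
# Toy illustration of streaming-task partitioning and mapping. The heuristic and
# numbers are assumptions, not the ACOTES project's actual techniques.

tasks = [("read", 2), ("filter", 8), ("fft", 10), ("scale", 3), ("write", 2)]
NUM_CORES = 2

def map_pipeline(tasks, num_cores):
    """Split a linear pipeline into contiguous groups with roughly equal load."""
    total = sum(cost for _, cost in tasks)
    target = total / num_cores
    mapping, current, load = [], [], 0.0
    for name, cost in tasks:
        if current and load + cost > target and len(mapping) < num_cores - 1:
            mapping.append(current)          # close the current group
            current, load = [], 0.0
        current.append(name)
        load += cost
    mapping.append(current)
    return mapping

for core, group in enumerate(map_pipeline(tasks, NUM_CORES)):
    print(f"core {core}: {group}")
```

    Keeping each group contiguous preserves producer/consumer locality, which is one way to limit the communication overheads the abstract refers to.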

    Validation of Dimemas Communication Model for MPI Collective Operations

    This paper presents an extension of Dimemas to enable accurate performance prediction of message passing applications with collective communication primitives. The main contribution is a simple, user-parameterizable model for collective communication operations. The experiments performed with a set of MPI benchmarks demonstrate the utility of the model.
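
    A user-parameterizable collective model in the general spirit described here might take a logarithmic-tree form like the sketch below. This is a generic illustration under that assumption, not necessarily the exact Dimemas formulation; the latency and bandwidth factors play the role of the user parameters.

```python
import math

# Hedged sketch of a parameterizable collective-communication model
# (generic log-tree form; not necessarily the exact Dimemas model).

def collective_time(nprocs, msg_bytes, latency_s, bandwidth_Bps,
                    lat_factor=1.0, bw_factor=1.0):
    """Estimate a collective's duration with tunable latency/bandwidth factors."""
    steps = math.ceil(math.log2(max(nprocs, 2)))   # tree-like number of stages
    return steps * (lat_factor * latency_s + bw_factor * msg_bytes / bandwidth_Bps)

# Example: a 1 MiB broadcast-style collective across 64 processes.
print(collective_time(64, 1 << 20, latency_s=5e-6, bandwidth_Bps=1e9))
```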

    Performance Projections of HPC Applications on Chip Multiprocessor (CMP) Based Systems

    Performance projections of High Performance Computing (HPC) applications onto various hardware platforms are important for hardware vendors and HPC users. The projections aid hardware vendors in the design of future systems and help HPC users with system procurement and application refinements. In this dissertation, we present an efficient method to project the performance of HPC applications onto Chip Multiprocessor (CMP) based systems using widely available standard benchmark data. The main advantage of this method is the use of published data about the target machine; the target machine need not be available. With the current trend in HPC platforms shifting towards cluster systems with chip multiprocessors (CMPs), efficient and accurate performance projection becomes a challenging task. Typically, CMP-based systems are configured hierarchically, which significantly impacts the performance of HPC applications. The goal of this research is to develop an efficient method to project the performance of HPC applications onto systems that utilize CMPs. To provide for efficiency, our projection methodology is automated (projections are done using a tool) and fast (with small overhead). Our method, called the surrogate-based workload application projection method, utilizes surrogate benchmarks to project an HPC application's performance on target systems, in which the computation component of an HPC application is projected separately from the communication component. Our methodology was validated with high accuracy and efficiency on a variety of systems utilizing different processor and interconnect architectures. The average projection error on three target systems was 11.22 percent, with a standard deviation of 1.18 percent, across twelve HPC workloads.
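
    The core idea of projecting computation and communication separately via published surrogate scores can be sketched as below. The scaling rule, function name, and all numbers are hypothetical placeholders used only to illustrate the separation, not the dissertation's actual method.

```python
# Minimal sketch of surrogate-based projection: scale measured computation and
# communication components separately using published benchmark scores of a
# reference and a target machine. All values are hypothetical placeholders.

def project_runtime(comp_time_ref, comm_time_ref,
                    comp_score_ref, comp_score_tgt,
                    comm_score_ref, comm_score_tgt):
    """Project total run time on the target from reference-machine measurements."""
    comp_time_tgt = comp_time_ref * comp_score_ref / comp_score_tgt
    comm_time_tgt = comm_time_ref * comm_score_ref / comm_score_tgt
    return comp_time_tgt + comm_time_tgt

# Example with made-up numbers: a compute surrogate (e.g. a published rate-style
# score) and a communication surrogate (e.g. a published bandwidth score).
print(project_runtime(comp_time_ref=100.0, comm_time_ref=30.0,
                      comp_score_ref=50.0, comp_score_tgt=80.0,
                      comm_score_ref=5.0, comm_score_tgt=10.0))
```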