14 research outputs found

    Performance impact of the interconnection network on MareNostrum applications

    Interconnection networks are one of the fundamental components of a supercomputing facility, and one of the most expensive parts. They represent one of the main differences between two supercomputers built from the same processor, and they have a significant impact on how applications should be developed. However, very little is known about how these expensive interconnection networks are used by the real applications running on supercomputing facilities. Furthermore, in the near future, chip multiprocessors offering near-supercomputing capabilities, with 64 to 256 processors per chip, will be readily available. On-chip interconnection networks offer the possibility of new designs with lower latencies and much higher bandwidths. In this paper we present an analysis of the impact of the interconnection network for some of the most representative applications running on MareNostrum, at the Barcelona Supercomputing Center. We have collected traces of real runs of the applications and verified that our performance model (Dimemas) accurately predicts the real machine performance. We then present hypothetical scenarios in which we change the network's latency, bandwidth, number of simultaneous connections, and CPU speed in order to quantify their importance for final application performance in the context of future on-chip interconnections. Our results show that CPU speed is more important than the interconnection network, and that among the network parameters, bandwidth is far more important than latency (which has very little impact) or connectivity (which is only relevant when per-connection bandwidth is low).
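
    The kind of what-if study described above can be pictured with a minimal trace-replay sketch. The model below is an illustrative assumption, not the actual Dimemas simulator: it charges scaled compute time plus a linear latency-plus-bandwidth cost per message, and sweeps hypothetical network parameters.

```python
# Minimal sketch (not the actual Dimemas model): replay a simplified trace and
# sweep hypothetical network parameters. The trace format and all numbers are
# illustrative assumptions.

def predict_runtime(trace, cpu_speedup=1.0, latency_s=5e-6, bandwidth_Bps=1e9):
    """Estimate run time as scaled compute time plus a simple cost per message."""
    total = 0.0
    for compute_s, msg_bytes in trace:                  # one (compute, message) pair per step
        total += compute_s / cpu_speedup                # faster CPU shrinks compute phases
        total += latency_s + msg_bytes / bandwidth_Bps  # linear point-to-point cost
    return total

# Tiny synthetic "trace": 1000 steps of 2 ms compute followed by a 1 MiB message.
trace = [(2e-3, 1 << 20)] * 1000

for bw in (1e8, 1e9, 1e10):                 # sweep bandwidth (bytes/s)
    for lat in (1e-6, 1e-5, 1e-4):          # sweep latency (s)
        t = predict_runtime(trace, latency_s=lat, bandwidth_Bps=bw)
        print(f"bw={bw:.0e} B/s lat={lat:.0e} s -> {t:.3f} s")
```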

    Generation of simple analytical models for message passing applications

    We present a methodology for deriving simple, accurate models that describe the performance of parallel applications without looking at the source code. A trace is obtained and linear models are derived by fitting the outcomes of a set of simulations in which the influential parameters, such as processor speed, network latency, and bandwidth, are varied. The simplicity of the linear models makes their factors naturally interpretable, so both prediction accuracy and interpretability are maintained. We also explain how we plan to extend this approach so that the models can extrapolate to processor counts different from the one of the given traces.
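
    As a rough illustration of the fitting step, the sketch below fits a linear model to run times obtained from a handful of simulated configurations. The model form, column choices, and all values are assumptions for demonstration, not the paper's actual models or coefficients.

```python
import numpy as np

# Illustrative sketch: fit a linear model to run times predicted by simulations
# with varied parameters. The model form and data are assumptions.

# Columns: 1/bandwidth (s/GB), latency (us), 1/cpu_speed_ratio, constant term.
X = np.array([
    [1.0, 1.0, 1.0, 1.0],
    [0.5, 1.0, 1.0, 1.0],
    [1.0, 5.0, 1.0, 1.0],
    [1.0, 1.0, 0.5, 1.0],
    [0.5, 5.0, 0.5, 1.0],
])
# Simulated total run times (s) for each configuration (hypothetical values).
t = np.array([12.0, 9.5, 12.3, 7.1, 5.0])

coeffs, *_ = np.linalg.lstsq(X, t, rcond=None)
labels = ["bandwidth term", "latency term", "compute term", "constant"]
for name, c in zip(labels, coeffs):
    print(f"{name}: {c:.3f}")   # each factor has a direct interpretation
```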

    On An Improved Parallel Construction Of Suffix Arrays For Low Bandwidth Pc-Cluster.

    An algorithm for the parallel construction of suffix arrays for texts over large alphabets on distributed-memory architectures is presented.
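
    For readers unfamiliar with the data structure, the snippet below shows a naive sequential suffix array construction as a conceptual reference only; the paper's contribution is a distributed-memory parallel algorithm suited to low-bandwidth PC clusters, which this sketch does not reproduce.

```python
# Conceptual reference only: a naive sequential suffix array construction.
# The paper's distributed-memory parallel algorithm is not reproduced here.

def suffix_array(text: str) -> list[int]:
    """Return the starting indices of all suffixes of `text` in sorted order."""
    return sorted(range(len(text)), key=lambda i: text[i:])

print(suffix_array("banana"))   # [5, 3, 1, 0, 4, 2]
```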

    Including the workload effect in the parallel program signature

    Performance prediction and application behavior modeling have been the subject of extensive research aiming to estimate application performance with acceptable precision. A novel approach to predicting the performance of parallel applications is based on the concept of Parallel Application Signatures, which consists of extracting an application's most relevant parts (phases) and the number of times they repeat (weights). By executing these phases on a target machine and multiplying each execution time by its weight, an estimate of the application's total execution time can be obtained. One of the problems is that the performance of an application depends on the program workload. Every type of workload affects differently how an application performs on a given system, and therefore affects the signature execution time. Since the workloads used in most scientific parallel applications have well-known dimensions and data ranges, and the behavior of these applications is mostly deterministic, a model of how a program's workload affects its performance can be obtained. We create a new methodology to model how a program's workload affects the parallel application signature. Using regression analysis, we are able to generalize the execution time and weight functions of each phase to predict an application's performance on a target system for any type of workload within a predefined range. We validate our methodology using a synthetic program, benchmark applications, and well-known real scientific applications.
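
    The prediction step described above can be summarized in a short sketch: per-phase regression models give execution time and repetition weight as functions of a workload parameter, and the predicted total is the weighted sum over phases. The phase data and linear fits below are synthetic placeholders; the paper's actual phases come from traces.

```python
import numpy as np

# Hedged sketch of the signature idea: per-phase regressions for time and weight,
# combined into a total-time estimate. All data is synthetic.

# For each phase: (workload sizes used for fitting, measured times, measured weights)
phases = [
    (np.array([100, 200, 400]), np.array([0.8, 1.6, 3.3]), np.array([10, 10, 10])),
    (np.array([100, 200, 400]), np.array([0.1, 0.4, 1.7]), np.array([5, 9, 17])),
]

def predict_total(workload: float) -> float:
    total = 0.0
    for sizes, times, weights in phases:
        t_fit = np.polyfit(sizes, times, 1)    # linear fit: phase time vs. workload
        w_fit = np.polyfit(sizes, weights, 1)  # linear fit: phase weight vs. workload
        total += np.polyval(t_fit, workload) * np.polyval(w_fit, workload)
    return total

print(f"Predicted total time for workload 300: {predict_total(300):.2f} s")
```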

    PerfBound: Conserving Energy with Bounded Overheads in On/Off-Based HPC Interconnects

    Energy and power are key challenges in high-performance computing. System energy efficiency must be significantly improved, and this requires greater efficiency in all subcomponents. An important target of optimization is the interconnect, since network links are always on, consuming power even during idle periods. A large number of HPC machines have a primary interconnect based on Ethernet (about 40 percent of TOP500 machines), which, since 2010, has included support for saving power via Energy Efficient Ethernet (EEE). Nevertheless, it is unlikely that HPC interconnects will use these energy-saving modes unless the performance overhead is known and small. This paper presents PerfBound, a self-contained technique to manage on/off-based networks such as EEE, minimizing interconnect link energy consumption subject to a bound on the performance degradation. PerfBound does not require changes to the applications, uses only local information already available at switches and NICs without introducing additional communication messages, and is compatible with multi-hop networks. PerfBound is evaluated using traces from a production supercomputer. For twelve out of fourteen applications, PerfBound achieves high energy savings, up to 70 percent for only 1 percent performance degradation. This paper also presents DynamicFastwake, which extends PerfBound to exploit multiple low-power states. DynamicFastwake achieves an energy-delay product 10 percent lower than the original PerfBound technique. This research was supported by the European Union's 7th Framework Programme [FP7/2007-2013] under the Mont-Blanc-3 (FP7-ICT-671697) and EUROSERVER (FP7-ICT-610456) projects, the Ministry of Economy and Competitiveness of Spain (TIN2012-34557 and TIN2015-65316), the Generalitat de Catalunya (FI-AGAUR 2012 FI B 00644, 2014-SGR-1051 and 2014-SGR-1272), the European Union's Horizon 2020 research and innovation programme under the HiPEAC-3 Network of Excellence (ICT-287759), and the Severo Ochoa Program (SEV-2011-00067) of the Spanish Government.
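
    To make the "bounded overhead" idea concrete, the sketch below shows a toy on/off link policy that sleeps only while the accumulated wake-up delay stays under a fixed fraction of the link's busy time. This is not the actual PerfBound algorithm; the state machine, thresholds, and timing constants are simplifying assumptions.

```python
# Illustrative on/off link policy in the spirit of bounding performance overhead.
# NOT the actual PerfBound algorithm; all constants are assumptions.

WAKE_TIME_S = 3e-6          # assumed time to bring a link out of its sleep state
OVERHEAD_BOUND = 0.01       # allow at most 1% added delay from wake-ups

class LinkPolicy:
    def __init__(self):
        self.busy_time = 0.0        # time spent transmitting so far
        self.overhead_time = 0.0    # accumulated wake-up delay charged so far

    def on_idle_start(self) -> bool:
        """Decide locally whether to put the link to sleep for this idle period."""
        # Sleep only if one more wake-up keeps total overhead within the bound.
        projected = self.overhead_time + WAKE_TIME_S
        return projected <= OVERHEAD_BOUND * max(self.busy_time, 1e-9)

    def on_wakeup(self):
        self.overhead_time += WAKE_TIME_S

    def on_transmit(self, duration_s: float):
        self.busy_time += duration_s
```

    Because the decision uses only counters the switch or NIC already maintains locally, no extra messages are needed, which is the property the paper emphasizes.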

    ACOTES project: Advanced compiler technologies for embedded streaming

    Streaming applications are built from data-driven computational components that consume and produce unbounded data streams. Streaming-oriented systems have become dominant in a wide range of domains, including embedded applications and DSPs. However, programming efficiently for streaming architectures is a challenging task: the computation must be carefully partitioned and mapped to processes in a way that best matches the underlying streaming architecture, taking into account the distributed resources (memory, processing, real-time requirements) and communication overheads (processing and delay). These challenges have led to a number of suggested solutions whose goal is to improve the programmer's productivity in developing applications that process massive streams of data on programmable, parallel embedded architectures. StreamIt is one such example. Another more recent approach is that developed by the ACOTES project (Advanced Compiler Technologies for Embedded Streaming). The ACOTES approach for streaming applications consists of compiler-assisted mapping of streaming tasks to highly parallel systems in order to maximize cost-effectiveness, both in terms of energy and in terms of design effort. The analysis and transformation techniques automate large parts of the partitioning and mapping process, based on the properties of the application domain, on quantitative information about the target systems, and on programmer directives. This paper presents the outcomes of the ACOTES project, a 3-year collaborative work of industrial (NXP, ST, IBM, Silicon Hive, NOKIA) and academic (UPC, INRIA, MINES ParisTech) partners, and advocates the use of the Advanced Compiler Technologies that we developed to support Embedded Streaming.
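
    The partitioning and mapping problem mentioned above can be pictured with a toy example: assign the tasks of a linear pipeline to a fixed number of cores, balancing compute load while keeping communicating neighbors on the same core. The greedy heuristic and the task costs below are assumptions for illustration, not the ACOTES compiler's actual algorithm.

```python
# Toy illustration of streaming-task partitioning and mapping. The heuristic and
# numbers are assumptions, not the ACOTES project's actual techniques.

tasks = [("read", 2), ("filter", 8), ("fft", 10), ("scale", 3), ("write", 2)]
NUM_CORES = 2

def map_pipeline(tasks, num_cores):
    """Split a linear pipeline into contiguous groups with roughly equal load."""
    total = sum(cost for _, cost in tasks)
    target = total / num_cores
    mapping, current, load = [], [], 0.0
    for name, cost in tasks:
        if current and load + cost > target and len(mapping) < num_cores - 1:
            mapping.append(current)          # close the current group
            current, load = [], 0.0
        current.append(name)
        load += cost
    mapping.append(current)
    return mapping

for core, group in enumerate(map_pipeline(tasks, NUM_CORES)):
    print(f"core {core}: {group}")
```

    Keeping each group contiguous preserves producer/consumer locality, which is one way to limit the communication overheads the abstract refers to.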

    Validation of Dimemas Communication Model for MPI Collective Operations

    This paper presents an extension of Dimemas to enable accurate performance prediction of message passing applications with collective communication primitives. The main contribution is a simple, user-parameterizable model for collective communication operations. The experiments performed with a set of MPI benchmarks demonstrate the utility of the model.
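
    A user-parameterizable collective model in the general spirit described here might take a logarithmic-tree form like the sketch below. This is a generic illustration under that assumption, not necessarily the exact Dimemas formulation; the latency and bandwidth factors play the role of the user parameters.

```python
import math

# Hedged sketch of a parameterizable collective-communication model
# (generic log-tree form; not necessarily the exact Dimemas model).

def collective_time(nprocs, msg_bytes, latency_s, bandwidth_Bps,
                    lat_factor=1.0, bw_factor=1.0):
    """Estimate a collective's duration with tunable latency/bandwidth factors."""
    steps = math.ceil(math.log2(max(nprocs, 2)))   # tree-like number of stages
    return steps * (lat_factor * latency_s + bw_factor * msg_bytes / bandwidth_Bps)

# Example: a 1 MiB broadcast-style collective across 64 processes.
print(collective_time(64, 1 << 20, latency_s=5e-6, bandwidth_Bps=1e9))
```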

    Performance Projections of HPC Applications on Chip Multiprocessor (CMP) Based Systems

    Performance projections of High Performance Computing (HPC) applications onto various hardware platforms are important for hardware vendors and HPC users. The projections aid hardware vendors in the design of future systems and help HPC users with system procurement and application refinements. In this dissertation, we present an efficient method to project the performance of HPC applications onto Chip Multiprocessor (CMP) based systems using widely available standard benchmark data. The main advantage of this method is the use of published data about the target machine; the target machine need not be available. With the current trend in HPC platforms shifting towards cluster systems with chip multiprocessors (CMPs), efficient and accurate performance projection becomes a challenging task. Typically, CMP-based systems are configured hierarchically, which significantly impacts the performance of HPC applications. The goal of this research is to develop an efficient method to project the performance of HPC applications onto systems that utilize CMPs. To provide for efficiency, our projection methodology is automated (projections are done using a tool) and fast (with small overhead). Our method, called the surrogate-based workload application projection method, utilizes surrogate benchmarks to project an HPC application's performance on target systems, in which the computation component of an HPC application is projected separately from the communication component. Our methodology was validated with high accuracy and efficiency on a variety of systems utilizing different processor and interconnect architectures. The average projection error on three target systems was 11.22 percent, with a standard deviation of 1.18 percent, across twelve HPC workloads.
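
    The core idea of projecting computation and communication separately via published surrogate scores can be sketched as below. The scaling rule, function name, and all numbers are hypothetical placeholders used only to illustrate the separation, not the dissertation's actual method.

```python
# Minimal sketch of surrogate-based projection: scale measured computation and
# communication components separately using published benchmark scores of a
# reference and a target machine. All values are hypothetical placeholders.

def project_runtime(comp_time_ref, comm_time_ref,
                    comp_score_ref, comp_score_tgt,
                    comm_score_ref, comm_score_tgt):
    """Project total run time on the target from reference-machine measurements."""
    comp_time_tgt = comp_time_ref * comp_score_ref / comp_score_tgt
    comm_time_tgt = comm_time_ref * comm_score_ref / comm_score_tgt
    return comp_time_tgt + comm_time_tgt

# Example with made-up numbers: a compute surrogate (e.g. a published rate-style
# score) and a communication surrogate (e.g. a published bandwidth score).
print(project_runtime(comp_time_ref=100.0, comm_time_ref=30.0,
                      comp_score_ref=50.0, comp_score_tgt=80.0,
                      comm_score_ref=5.0, comm_score_tgt=10.0))
```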