    Domain knowledge specification for energy tuning

    To address the energy consumption of HPC systems, the European Union Horizon 2020 READEX (Runtime Exploitation of Application Dynamism for Energy-efficient Exascale computing) project uses an online auto-tuning approach to improve the energy efficiency of HPC applications. At design time, the READEX methodology pre-computes optimal system configurations, such as the CPU frequency, for instances of program regions; at runtime, it switches to the configuration given in the tuning model when the region is executed. READEX goes beyond previous approaches by exploiting dynamic changes in a region's characteristics through region- and characteristic-specific system configurations. While the tool suite supports an automatic approach, specifying domain knowledge, such as the structure and characteristics of the application and application tuning parameters, can significantly help to create a more refined tuning model. This paper presents the means available for an application expert to provide domain knowledge and presents tuning results for some benchmarks. Web of Science 316, art. no. E465
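
The runtime switching described above can be pictured as a table lookup. The following is only a minimal sketch, not the READEX tool suite's API: the region names, characteristic labels, frequencies, and the `config_for` helper are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Config:
    cpu_freq_mhz: int  # CPU frequency to apply while the region runs


# Hypothetical tuning model produced at design time:
# (region name, region characteristic) -> best configuration found.
TUNING_MODEL = {
    ("assemble", "small_grid"): Config(2400),
    ("assemble", "large_grid"): Config(2000),
    ("solve", "small_grid"): Config(1800),
}

# Fallback for region instances the model does not cover (assumed value).
DEFAULT = Config(2500)


def config_for(region, characteristic):
    """Return the configuration to switch to when a region instance starts."""
    return TUNING_MODEL.get((region, characteristic), DEFAULT)


if __name__ == "__main__":
    print(config_for("assemble", "large_grid"))
    print(config_for("io", "any"))  # not in the model, falls back to DEFAULT
```

Keying the table on both the region and its characteristic is what lets the same region receive different configurations as its behaviour changes between instances.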

    Development of an oceanographic application in HPC

    High Performance Computing (HPC) is used to run advanced application programs efficiently, reliably, and quickly. In earlier decades, the performance of HPC applications was evaluated in terms of speed, thread scalability, and use of the memory hierarchy. Now it is also essential to consider the energy or power consumed by the system while executing an application. In fact, high power consumption is one of the biggest problems for the HPC community and one of the major obstacles to exascale system design. New generations of HPC systems aim to achieve exaflop performance and will demand even more energy for processing and cooling. The growth of HPC systems is now limited by energy issues. Recently, many research centers have focused on automatic tuning of HPC applications, which requires a broad study of those applications in terms of power efficiency. In this context, this work studies an oceanographic application named OceanVar, which implements the Domain Decomposition based 4D Variational model (DD-4DVar), one of the most commonly used HPC applications, and evaluates not only the classic aspects of performance but also aspects related to power efficiency in different case studies. The work was carried out at BSC (Barcelona Supercomputing Center), Spain, within the Mont-Blanc project, with tests performed first on the HCA server with Intel technology and then on the Thunder mini-cluster with ARM technology. The thesis first explains the concept of data assimilation and the context in which it is developed, and gives a brief description of the 4DVAR mathematical model. After a close examination of this problem, the data assimilation problem was ported from its Matlab description to a sequential version in C.
Secondly, after identifying the most time-consuming computational kernels, a parallel version of the application was developed in a multiprocessor programming style using the MPI (Message Passing Interface) standard. The experimental results show that, when running on the HCA server (an Intel architecture), the efficiency of the two most expensive functions stays at approximately 80% as the number of processes grows. When running on the ARM architecture, specifically on the Thunder mini-cluster, the observed trend is instead a "superlinear speedup", which in our case can be explained by a more efficient use of resources (cache memory accesses) compared with the sequential case. The second part of this work analyzes some aspects of the application that affect energy efficiency. After a brief discussion of the energy consumption characteristics of the Thunder chip within the technological landscape, the energy consumption of the Thunder mini-cluster was measured with a power meter, the Yokogawa Power Meter, in order to obtain an overview of the power-to-solution of this application, to be used as a baseline for subsequent analyses with other parallel styles. Finally, a comprehensive performance evaluation, aimed at assessing the quality of the MPI parallelization, is conducted using Paraver, a performance tool developed by BSC. Paraver is a performance analysis and visualisation tool that can be used to analyse MPI, threaded, or mixed-mode programs; it is key to parallel profiling and to optimising code for High Performance Computing. A set of graphical representations of these statistics makes it easy for a developer to identify performance problems.
Some of the problems that can be easily identified are load-imbalanced decompositions, excessive communication overheads, and a poor average rate of floating-point operations per second. Paraver can also report statistics based on hardware counters, which are provided by the underlying hardware. This project aimed to use Paraver configuration files to allow certain metrics to be analysed for this application. To explain the performance trend observed on the Thunder mini-cluster, traces were extracted from various case studies, and the results are as expected: a drastic drop in cache misses from the case ppn (processes per node) = 1 to the case ppn = 16. This partly explains the more efficient use of cluster resources as the number of processes increases.
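
The metrics discussed in this abstract follow the usual textbook definitions, which can be sketched as below. The timings and power figure in the example are invented for illustration and are not measurements from the thesis; energy-to-solution is taken here as average power multiplied by runtime, which is an assumption about how the meter readings would be reduced.

```python
def speedup(t_seq, t_par):
    """Sequential runtime divided by parallel runtime."""
    return t_seq / t_par


def efficiency(t_seq, t_par, procs):
    """Speedup per process; 1.0 means ideal (linear) scaling."""
    return speedup(t_seq, t_par) / procs


def is_superlinear(t_seq, t_par, procs):
    """True when the speedup exceeds the number of processes."""
    return speedup(t_seq, t_par) > procs


def energy_to_solution(avg_power_watts, runtime_s):
    """Energy in joules, assuming a constant average power draw."""
    return avg_power_watts * runtime_s


if __name__ == "__main__":
    # Illustrative numbers only.
    print(efficiency(100.0, 15.625, 8))      # about 0.8, i.e. 80% efficiency
    print(is_superlinear(100.0, 10.0, 8))    # speedup 10 on 8 processes
    print(energy_to_solution(50.0, 10.0))    # joules
```

A superlinear result usually points to a change in resource behaviour, such as a per-process working set that starts to fit in cache, which is consistent with the cache-miss explanation given above.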

    PowerPack: Energy Profiling and Analysis of High-Performance Systems and Applications

    Energy efficiency is a major concern in modern high-performance computing system design. In the past few years, there has been mounting evidence that power usage limits system scale and computing density, and thus, ultimately, system performance. However, despite the impact of power and energy on the computer systems community, few studies provide insight into where and how power is consumed in high-performance systems and applications. In previous work, we designed a framework called PowerPack that was the first tool to isolate the power consumption of devices including disks, memory, NICs, and processors in a high-performance cluster and correlate these measurements with application functions. In this work, we extend our framework to support systems with multicore, multiprocessor-based nodes, and then provide in-depth analyses of the energy consumption of parallel applications on clusters of these systems. These analyses include the impact of chip multiprocessing on power and energy efficiency, and its interaction with application execution. In addition, we use PowerPack to study the power dynamics and energy efficiency of dynamic voltage and frequency scaling (DVFS) techniques on clusters. Our experiments reveal conclusively how intelligent DVFS scheduling can enhance system energy efficiency while maintaining performance.
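
Correlating power measurements with application functions can be illustrated with a small sketch. This is not PowerPack's interface: the sample format, the fixed sampling period, and the function names are assumptions made for the example.

```python
def energy_per_function(samples, intervals, dt):
    """Attribute sampled power to the function active at each sample time.

    samples:   list of (timestamp_s, watts) taken every `dt` seconds
    intervals: list of (function_name, start_s, end_s), end exclusive
    Returns a dict mapping function name to energy in joules.
    """
    energy = {}
    for t, watts in samples:
        for name, start, end in intervals:
            if start <= t < end:
                # Each sample stands for `dt` seconds at the measured power.
                energy[name] = energy.get(name, 0.0) + watts * dt
                break
    return energy


if __name__ == "__main__":
    # Hypothetical trace: one sample per second for a single component.
    samples = [(0.0, 10.0), (1.0, 10.0), (2.0, 20.0), (3.0, 20.0)]
    intervals = [("fft", 0.0, 2.0), ("comm", 2.0, 4.0)]
    print(energy_per_function(samples, intervals, dt=1.0))
```

Running the same attribution once per component (processor, memory, disk, NIC) would give a per-function, per-device energy breakdown of the kind the abstract describes.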

    DR.SGX: Hardening SGX Enclaves against Cache Attacks with Data Location Randomization

    Recent research has demonstrated that Intel's SGX is vulnerable to various software-based side-channel attacks. In particular, attacks that monitor CPU caches shared between the victim enclave and untrusted software enable accurate leakage of secret enclave data. Known defenses assume developer assistance, require hardware changes, impose high overhead, or prevent only some of the known attacks. In this paper, we propose data location randomization as a novel defensive approach to address the threat of side-channel attacks. Our main goal is to break the link between the cache observations made by the privileged adversary and the actual data accesses made by the victim. We design and implement a compiler-based tool called DR.SGX that instruments enclave code such that data locations are permuted at the granularity of cache lines. We realize the permutation with the CPU's cryptographic hardware-acceleration units, which provide secure randomization. To prevent correlation of repeated memory accesses, we continuously re-randomize all enclave data during execution. Our solution effectively protects many (but not all) enclaves from cache attacks and provides a complementary enclave hardening technique that is especially useful against unpredictable information leakage.
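
The idea of permuting data at cache-line granularity and re-randomizing it can be sketched as follows. This is a toy model, not DR.SGX's implementation: it uses Python's non-cryptographic `random` module as a stand-in for the hardware-backed permutation, and the `RandomizedBuffer` class and its methods are hypothetical.

```python
import random

LINE_SIZE = 64  # bytes per cache line (assumed, typical for x86)


class RandomizedBuffer:
    """Byte buffer whose cache lines are stored in a secret, permuted order."""

    def __init__(self, num_lines, seed=None):
        # NOTE: random.Random is NOT cryptographically secure; it only
        # stands in for a secure permutation in this illustration.
        self._rng = random.Random(seed)
        self._perm = list(range(num_lines))  # logical line -> physical line
        self._rng.shuffle(self._perm)
        self._store = [bytearray(LINE_SIZE) for _ in range(num_lines)]

    def _locate(self, addr):
        line, offset = divmod(addr, LINE_SIZE)
        return self._perm[line], offset

    def read(self, addr):
        line, offset = self._locate(addr)
        return self._store[line][offset]

    def write(self, addr, value):
        line, offset = self._locate(addr)
        self._store[line][offset] = value

    def rerandomize(self):
        """Draw a fresh permutation and move every line to its new place."""
        new_perm = list(range(len(self._perm)))
        self._rng.shuffle(new_perm)
        new_store = [None] * len(self._store)
        for logical, old_phys in enumerate(self._perm):
            new_store[new_perm[logical]] = self._store[old_phys]
        self._perm, self._store = new_perm, new_store


if __name__ == "__main__":
    buf = RandomizedBuffer(num_lines=8, seed=1)
    buf.write(130, 42)   # logical line 2, offset 2
    buf.rerandomize()    # physical placement changes, contents do not
    print(buf.read(130))
```

The program sees the same logical addresses before and after `rerandomize`, while the physical line an access touches changes, which is the link the abstract says the defense aims to break.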

    A tool for modeling and simulation of approximate computing in hardware

    Advisor: Lucas Francisco Wanner. Dissertation (master's), Universidade Estadual de Campinas, Instituto de Computação. Abstract: Recent research has introduced approximate hardware units that produce incorrect outputs deterministically or probabilistically for some small subset of inputs. In exchange, they allow significantly higher throughput or lower power than their error-free counterparts. The integration, validation, and evaluation of these approximate units in architectures and processors, however, remains challenging. The lack of tools to represent and evaluate approximate hardware leads designers to verify their solutions independently, without considering interactions with other components, which demands high-effort modeling and simulation. In this work, we introduce ADeLe, a high-level language for the description, configuration, and integration of approximate hardware units into processors. ADeLe reduces the design effort for approximate hardware by modeling approximations at a high level of abstraction and automatically injecting them into a processor model for architectural simulation. In the ADeLe framework, approximations may modify or completely replace the functional behavior of instructions according to user-defined policies.
Instructions may be approximated deterministically or probabilistically (e.g., based on operating voltage and frequency). To allow for controlled testing, approximations may be enabled and disabled from software. Energy is automatically accounted for based on customizable models that consider the potential power savings of the approximations that are enabled in the system. Thus, the framework provides a generic and flexible verification method, allowing for easy evaluation of the energy-quality trade-off of applications subjected to approximate hardware. We demonstrate the framework by introducing different approximation techniques into a processor model, on top of which we run selected applications. Modeling dedicated hardware modules, we show how ADeLe can represent approximate arithmetic and reduced-precision computation units executing 4 image processing and 2 floating-point applications. Using a different method of approximation, we also show how the framework is used to study the impact of voltage-overscaled memories on 9 applications. Our experiments show the framework's capabilities and how it may be used to generate approximate virtual CPUs and to evaluate energy-quality trade-offs for different applications with reduced effort. Degree: Master's, Computer Science (Master in Computer Science). 2017/08015-8 FAPES
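
A probabilistic approximation that can be switched on and off from software, with simple energy accounting, might look like the sketch below. It is written in Python rather than in ADeLe's own language, whose syntax is not shown in the abstract; the `ApproxAdder` class, the bit-dropping policy, and the energy costs are all hypothetical.

```python
import random


class ApproxAdder:
    """32-bit adder that sometimes clears its low-order result bits."""

    EXACT_COST = 1.0    # hypothetical energy units per exact add
    APPROX_COST = 0.6   # hypothetical energy units per approximate add

    def __init__(self, error_prob, dropped_bits, seed=None):
        self.error_prob = error_prob                    # chance an add is approximated
        self.keep_mask = ~((1 << dropped_bits) - 1)     # clears the low bits
        self.enabled = False                            # toggled by the running software
        self.energy = 0.0                               # accumulated energy estimate
        self._rng = random.Random(seed)

    def add(self, a, b):
        exact = (a + b) & 0xFFFFFFFF
        if self.enabled and self._rng.random() < self.error_prob:
            self.energy += self.APPROX_COST
            return exact & self.keep_mask
        self.energy += self.EXACT_COST
        return exact


if __name__ == "__main__":
    adder = ApproxAdder(error_prob=0.5, dropped_bits=4, seed=0)
    adder.enabled = True
    results = [adder.add(1000, i) for i in range(10)]
    print(results, adder.energy)
```

Comparing `results` against the exact sums, alongside the accumulated `energy`, gives one point on the energy-quality trade-off; sweeping `error_prob` or `dropped_bits` would trace out more of it.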

    Using Code Perforation to Improve Performance, Reduce Energy Consumption, and Respond to Failures

    Many modern computations (such as video and audio encoders, Monte Carlo simulations, and machine learning algorithms) are designed to trade off accuracy in return for increased performance. To date, such computations typically use ad-hoc, domain-specific techniques developed specifically for the computation at hand. We present a new general technique, code perforation, for automatically augmenting existing computations with the capability of trading off accuracy in return for performance. In contrast to existing approaches, which typically require the manual development of new algorithms, our implemented SpeedPress compiler can automatically apply code perforation to existing computations with no developer intervention whatsoever. The result is a transformed computation that can respond almost immediately to a range of increased performance demands while keeping any resulting output distortion within acceptable user-defined bounds. We have used SpeedPress to automatically apply code perforation to applications from the PARSEC benchmark suite. The results show that the transformed applications can run as much as two to three times faster than the original applications while distorting the output by less than 10%. Because the transformed applications can operate successfully at many points in the performance/accuracy tradeoff space, they can (dynamically and on demand) navigate the tradeoff space to either maximize performance subject to a given accuracy constraint, or maximize accuracy subject to a given performance constraint. We also demonstrate the SpeedGuard runtime system, which uses code perforation to enable applications to automatically adapt to challenging execution environments such as multicore machines that suffer core failures or machines that dynamically adjust the clock speed to reduce power consumption or to protect the machine from overheating.
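
One common form of code perforation skips a fraction of loop iterations. The sketch below applies that idea by hand to a simple averaging loop; it is not the SpeedPress compiler's transformation, and the `stride` parameter and the relative-error distortion metric are assumptions made for illustration.

```python
def mean_perforated(values, stride=1):
    """Average every `stride`-th element; stride=1 is the original loop."""
    total = 0.0
    count = 0
    for i in range(0, len(values), stride):  # perforation: skip iterations
        total += values[i]
        count += 1
    return total / count


def distortion(exact, approx):
    """Relative error of the perforated result against the exact one."""
    return abs(exact - approx) / abs(exact)


if __name__ == "__main__":
    data = [float(i) for i in range(1, 101)]
    exact = mean_perforated(data, stride=1)
    approx = mean_perforated(data, stride=2)  # does half the work
    print(exact, approx, distortion(exact, approx))
```

A runtime that can change `stride` on demand can move along the performance/accuracy trade-off, for example by raising it only while the distortion stays under a user-defined bound.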

    In Situ Visualization of Performance Data in Parallel CFD Applications

    This thesis summarizes the author's work on the visualization of performance data in parallel Computational Fluid Dynamics (CFD) simulations. Current performance analysis tools are unable to show their data on top of complex simulation geometries (e.g. an aircraft engine). But in CFD simulations, performance is expected to be affected by the computations being carried out, which in turn are tightly related to the underlying computational grid. It is therefore imperative that performance data be visualized on top of the same computational geometry from which it originates. However, performance tools have no native knowledge of the underlying mesh of the simulation. This gap can be filled by merging the fields of HPC performance analysis and in situ visualization of CFD simulation data, which is done by integrating existing, well-established, state-of-the-art tools from each field. To this end, an extension for the open-source performance tool Score-P was designed and developed. It intercepts an arbitrary number of manually selected code regions (mostly functions) and sends their respective measurements (number of executions and cumulative time spent) to the visualization software ParaView through its in situ library, Catalyst, as if they were any other flow-related variable. Subsequently, the tool was extended with the capacity to also show communication data (messages sent between MPI ranks) on top of the CFD mesh. Testing and evaluation are done with two industry-grade codes: Rolls-Royce’s CFD code, Hydra, and Onera, DLR and Airbus’ CFD code, CODA. It was also noticed that current performance tools have limited capacity to display their data on top of three-dimensional, framed (i.e. time-stepped) representations of the cluster’s topology.
In parallel, so that the approach is not limited to codes that already have the in situ adapter, it was extended to take the performance data and display it, also for codes without in situ support, on a three-dimensional, framed representation of the hardware resources being used by the simulation. Testing is done with the Multi-Grid and Block Tri-diagonal NAS Parallel Benchmarks (NPB), as well as with Hydra and CODA again. The benchmarks are used to explain how the new visualizations work, while real performance analyses are done with the industry-grade CFD codes. The proposed solution provides concrete performance insights that would not have been reached with the current performance tools and that motivated beneficial changes in the respective source codes in real life. Finally, its overhead is discussed and shown to be suitable for use with CFD codes. The dissertation provides a valuable addition to the state of the art of highly parallel CFD performance analysis and serves as a basis for further suggested research directions.
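
Treating a performance measurement as if it were a flow variable amounts to giving every mesh cell a value taken from the MPI rank that owns it. The sketch below shows only that mapping step; it does not use the Score-P or Catalyst APIs, and the data layout and the `performance_field` helper are hypothetical.

```python
def performance_field(cell_owner, rank_metrics, region, metric):
    """Build a per-cell field from per-rank measurements.

    cell_owner:   list where entry i is the MPI rank that owns mesh cell i
    rank_metrics: {rank: {region_name: {"visits": int, "time_s": float}}}
    Returns one value per mesh cell, ready to be attached to the mesh
    like any other cell-centred variable.
    """
    return [rank_metrics[rank][region][metric] for rank in cell_owner]


if __name__ == "__main__":
    # Hypothetical 5-cell mesh split across two ranks.
    cell_owner = [0, 0, 1, 1, 1]
    rank_metrics = {
        0: {"flux": {"visits": 10, "time_s": 1.5}},
        1: {"flux": {"visits": 10, "time_s": 2.5}},
    }
    print(performance_field(cell_owner, rank_metrics, "flux", "time_s"))
```

Colouring the mesh by such a field makes it visible where in the geometry a region is slow, for instance when the cells owned by one rank take longer than those of its neighbours.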

    Reliability-Aware Optimization of Approximate Computational Kernels with Rely

    Emerging high-performance architectures are anticipated to contain unreliable components (e.g., ALUs) that offer low power consumption at the expense of soft errors. Some applications (such as multimedia processing, machine learning, and big data analytics) can often naturally tolerate soft errors and can therefore trade accuracy of their results for reduced energy consumption by utilizing these unreliable hardware components. We present and evaluate a technique for reliability-aware optimization of approximate computational kernel implementations. Our technique takes a standard implementation of a computation and automatically replaces some of its arithmetic operations with unreliable versions that consume less power but may produce incorrect results with some probability. Our technique works with a developer-provided specification of the required reliability of a computation (the probability that it returns the correct result) and produces an unreliable implementation that satisfies that specification. We evaluate our approach on five applications from the image processing, numerical analysis, and financial analysis domains and demonstrate how our technique enables automatic exploration of the trade-off between the reliability of a computation and its performance.
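
The relationship between a reliability specification and the number of operations that can be made unreliable can be illustrated with a deliberately simplified model. The sketch assumes every unreliable operation is correct with the same probability, that failures are independent, and that any single failure makes the result incorrect; these are assumptions for the example, not the paper's analysis or optimization method, and the function names are hypothetical.

```python
def kernel_reliability(op_reliability, num_unreliable_ops):
    """Probability that all unreliable operations produce correct results."""
    return op_reliability ** num_unreliable_ops


def max_unreliable_ops(op_reliability, total_ops, required):
    """Largest number of operations that can be made unreliable while the
    kernel still meets the `required` reliability under this simple model."""
    best = 0
    for n in range(total_ops + 1):
        if kernel_reliability(op_reliability, n) >= required:
            best = n
    return best


if __name__ == "__main__":
    # Illustrative numbers: each unreliable op is correct with p = 0.999,
    # the kernel has 200 arithmetic ops, and the spec asks for 0.9.
    print(max_unreliable_ops(0.999, 200, 0.9))
```

A stricter specification lowers the number of operations that may be replaced, which is the reliability versus energy trade-off the abstract refers to.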
