89 research outputs found

    Power Bounded Computing on Current & Emerging HPC Systems

    Get PDF
    Power has become a critical constraint for the evolution of large scale High Performance Computing (HPC) systems and commercial data centers. This constraint spans almost every level of computing technologies, from IC chips all the way up to data centers due to physical, technical, and economic reasons. To cope with this reality, it is necessary to understand how available or permissible power impacts the design and performance of emergent computer systems. For this reason, we propose power bounded computing and corresponding technologies to optimize performance on HPC systems with limited power budgets. We have multiple research objectives in this dissertation. They center on the understanding of the interaction between performance, power bounds, and a hierarchical power management strategy. First, we develop heuristics and application aware power allocation methods to improve application performance on a single node. Second, we develop algorithms to coordinate power across nodes and components based on application characteristic and power budget on a cluster. Third, we investigate performance interference induced by hardware and power contentions, and propose a contention aware job scheduling to maximize system throughput under given power budgets for node sharing system. Fourth, we extend to GPU-accelerated systems and workloads and develop an online dynamic performance & power approach to meet both performance requirement and power efficiency. Power bounded computing improves performance scalability and power efficiency and decreases operation costs of HPC systems and data centers. This dissertation opens up several new ways for research in power bounded computing to address the power challenges in HPC systems. The proposed power and resource management techniques provide new directions and guidelines to green exscale computing and other computing systems

    Soft Computing Techiniques for the Protein Folding Problem on High Performance Computing Architectures

    Get PDF
    The protein-folding problem has been extensively studied during the last fifty years. The understanding of the dynamics of global shape of a protein and the influence on its biological function can help us to discover new and more effective drugs to deal with diseases of pharmacological relevance. Different computational approaches have been developed by different researchers in order to foresee the threedimensional arrangement of atoms of proteins from their sequences. However, the computational complexity of this problem makes mandatory the search for new models, novel algorithmic strategies and hardware platforms that provide solutions in a reasonable time frame. We present in this revision work the past and last tendencies regarding protein folding simulations from both perspectives; hardware and software. Of particular interest to us are both the use of inexact solutions to this computationally hard problem as well as which hardware platforms have been used for running this kind of Soft Computing techniques.This work is jointly supported by the FundaciónSéneca (Agencia Regional de Ciencia y Tecnología, Región de Murcia) under grants 15290/PI/2010 and 18946/JLI/13, by the Spanish MEC and European Commission FEDER under grant with reference TEC2012-37945-C02-02 and TIN2012-31345, by the Nils Coordinated Mobility under grant 012-ABEL-CM-2014A, in part financed by the European Regional Development Fund (ERDF). We also thank NVIDIA for hardware donation within UCAM GPU educational and research centers.Ingeniería, Industria y Construcció

    Per-task energy metering and accounting in the multicore era

    Get PDF
    Chip multi-core processors (CMPs) are the preferred processing platform across different domains such as data centers, real-time systems and mobile devices. In all those domains, energy is arguably the most expensive resource in a computing system, in particular, with fastest growth. Therefore, measuring the energy usage draws vast attention. Current studies mostly focus on obtaining finer-granularity energy measurement, such as measuring power in smaller time intervals, distributing energy to hardware components or software components. Such studies focus on scenarios where system energy is measured under the assumption that only one program is running in the system. So far, there is no hardware-level mechanism proposed to distribute the system energy to multiple running programs in a resource sharing multi-core system in an exact way. For the first time, we have formalized the need for per-task energy measurement in multicore by establishing a two-fold concept: Per-Task Energy Metering (PTEM) and Sensible Energy Accounting (SEA). In the scenario where many tasks running in parallel in a multicore system: For each task, the target of PTEM is to provide estimate of the actual energy consumption at runtime based on its resource usage during execution; and SEA aims at providing estimates on the energy it would have consumed when running in isolation with a particular fraction of system's resources. Accurately determining the energy consumed by each task in a system will become of prominent importance in future multi-core based systems as it offers several benefits including (i) Selection of appropriate co-runners, (ii) improved energy-aware task scheduling and (iii) energy-aware billing in data centers. We have shown how these two concepts can be applied to the main components of a computing system: the processor and the memory system. At first, we have applied PTEM to the processor by means of tracking the activities and occupancy of all the resources in a per-task basis. Secondly, we have applied PTEM to the memory system by means of tracking the activities and the state switches of memory banks. Then, we have applied SEA to the processor by predicting the activities and execution time for each task when they run with an fraction of chip resources alone. And last, we apply SEA to the memory system, by means of predicting activities, execution time and the time invoking memory system for each task. As for all these works, by trading-off the hardware cost with the estimation accuracy, we have obtained the implementable and affordable cost mechanisms with high accuracy. We have also shown how these techniques can be applied in different scenarios, such as, to detect significant energy usage variations for any particular task and to develop more energy efficient scheduling policy for the multi-core system. These works in this thesis have been published into IEEE/ACM journals and conferences proceedings that can be found in the publication chapter of this thesis.Los "Chip Multi-core Processors" (CMPs) son la plataforma de procesado preferida en diferentes dominios, tales como los centros de datos, sistemas de tiempo real y dispositivos móviles. En todos estos dominios, la energía puede ser el recurso más caro en el sistema de computación, concretamente, lo rápido que está creciendo. Por lo tanto, como medir el consumo energético está ganando mucha atención. Los estudios actuales se centran mayormente en cómo obtener medidas muy detalladas (finer granularity). Por ejemplo, tomar medidas de potencia en pequeños intervalos de tiempo, usando medidores de energía hardware o software. Estos estudios se centran en escenarios donde el consumo del sistema se mide bajo la suposición de que solo un programa se está ejecutando en el sistema. Aun no hay ninguna propuesta de un mecanismo a nivel de hardware para medir el consumo entre múltiples programas ejecutándose a la vez en un sistema multi-core con recursos compartidos. Por primera vez, hemos formalizado la necesidad de medir el consumo energético por-tarea en un multi-core estableciendo un concepto dual: Per-Taks Energy Metering (PTEM) y Sensible Energy Accounting (SEA). En un escenario donde varias tareas se ejecutan en paralelo en un sistema multi-core, por cada tarea, el objetivo de PTEM es estimar el consumo real energético durante tiempo de ejecución basándose en los recursos usados durante la ejecución, y SEA trata de proveer una estimación del consumo que tendría en solitario con solo una fracción concreta de los recursos del sistema. Determinar el consumo energético con precisión para cada tarea en un sistema tomara gran importancia en el futuro de los sistemas basados en multi-cores, ya que ofrecen varias ventajas tales como: (i) determinar los co-runners apropiados, (ii) mejorar la planificación de tareas teniendo en cuenta su consumo y (iii) facturación de los servicios de los data centers basada en el consumo. Hemos mostrado como estos dos conceptos pueden aplicarse a los principales componentes de un sistema de computación: el procesador y el sistema de memoria. Para empezar, hemos aplicado PTEM al procesador para registrar la actividad y la ocupación de todos los recursos por cada tarea. Luego, hemos aplicado SEA al procesador prediciendo la actividad y tiempo de ejecución por tarea cuando se ejecutan con solo una parte de los recursos del chip. Por último, hemos aplicado SEA al sistema de memoria para predecir la activada, el tiempo ejecución y cuando el sistema de memoria es invocado por cada tarea. Con todo ello, hemos alcanzado un compromiso entre el coste del hardware y la precisión en las estimaciones para obtener mecanismos implementables con un coste aceptable y una alta precisión. Durante nuestros estudios mostramos como esas técnicas pueden ser aplicadas a diferente escenarios, tales como: detectar variaciones significativas en el consumo energético por una tarea en concreto o como desarrollar políticas de planificación energéticamente más eficientes para sistemas multi-core. Los trabajos que hemos publicado durante el desarrollo de esta tesis en los IEEE/ACM journals y en varias conferencias pueden encontrarse en el capítulo de "publicaciones" de este documentoPostprint (published version

    Proceedings of the First PhD Symposium on Sustainable Ultrascale Computing Systems (NESUS PhD 2016)

    Get PDF
    Proceedings of the First PhD Symposium on Sustainable Ultrascale Computing Systems (NESUS PhD 2016) Timisoara, Romania. February 8-11, 2016.The PhD Symposium was a very good opportunity for the young researchers to share information and knowledge, to present their current research, and to discuss topics with other students in order to look for synergies and common research topics. The idea was very successful and the assessment made by the PhD Student was very good. It also helped to achieve one of the major goals of the NESUS Action: to establish an open European research network targeting sustainable solutions for ultrascale computing aiming at cross fertilization among HPC, large scale distributed systems, and big data management, training, contributing to glue disparate researchers working across different areas and provide a meeting ground for researchers in these separate areas to exchange ideas, to identify synergies, and to pursue common activities in research topics such as sustainable software solutions (applications and system software stack), data management, energy efficiency, and resilience.European Cooperation in Science and Technology. COS

    Hardware-Software Co-Design, Acceleration and Prototyping of Control Algorithms on Reconfigurable Platforms

    Full text link
    Differential equations play a significant role in many disciplines of science and engineering. Solving and implementing Ordinary Differential Equations (ODEs) and partial Differential Equations (PDEs) effectively are very essential as most complex dynamic systems are modeled based on these equations. High Performance Computing (HPC) methodologies are required to compute and implement complex and data intensive applications modeled by differential equations at higher speed. There are, however, some challenges and limitations in implementing dynamic system, modeled by non-linear ordinary differential equations, on digital hardware. Modeling an integrator involves data approximation which results in accuracy error if data values are not considered properly. Accuracy and precision are dependent on the data types defined for each block of a system and subsystems. Also, digital hardware mostly works on fixed point data which leads to some data approximations. Using Field Programmable Gate Array (FPGA), it is possible to solve ordinary differential equations (ODE) at high speed. FPGA also provides scalable, flexible and reconfigurable features. The goal of this thesis is to explore and compare implementation of control algorithms on reconfigurable logic. This thesis focuses on implementing control algorithms modeled by second and fourth order PDEs and ODEs using Xilinx System Generator (XSG) and LabVIEW FPGA module synthesis tools. Xilinx System Generator for DSP allows integration of legacy HDL code, embedded IP cores, MATLAB functions, and hardware components targeted for Xilinx FPGAs to create complete system models that can be simulated and synthesized within the Simulink environment. The National Instruments (NI) LabVIEW FPGA Module extends LabVIEW graphical development to Field-Programmable Gate Arrays (FPGAs) on NI Reconfigurable I/O hardware. This thesis also focuses on efficient implementation and performance comparison of these implementations. Optimization of area, latency and power has also been explored during implementation and comparison results are discussed

    PiCo: A Domain-Specific Language for Data Analytics Pipelines

    Get PDF
    In the world of Big Data analytics, there is a series of tools aiming at simplifying programming applications to be executed on clusters. Although each tool claims to provide better programming, data and execution models—for which only informal (and often confusing) semantics is generally provided—all share a common under- lying model, namely, the Dataflow model. Using this model as a starting point, it is possible to categorize and analyze almost all aspects about Big Data analytics tools from a high level perspective. This analysis can be considered as a first step toward a formal model to be exploited in the design of a (new) framework for Big Data analytics. By putting clear separations between all levels of abstraction (i.e., from the runtime to the user API), it is easier for a programmer or software designer to avoid mixing low level with high level aspects, as we are often used to see in state-of-the-art Big Data analytics frameworks. From the user-level perspective, we think that a clearer and simple semantics is preferable, together with a strong separation of concerns. For this reason, we use the Dataflow model as a starting point to build a programming environment with a simplified programming model implemented as a Domain-Specific Language, that is on top of a stack of layers that build a prototypical framework for Big Data analytics. The contribution of this thesis is twofold: first, we show that the proposed model is (at least) as general as existing batch and streaming frameworks (e.g., Spark, Flink, Storm, Google Dataflow), thus making it easier to understand high-level data-processing applications written in such frameworks. As result of this analysis, we provide a layered model that can represent tools and applications following the Dataflow paradigm and we show how the analyzed tools fit in each level. Second, we propose a programming environment based on such layered model in the form of a Domain-Specific Language (DSL) for processing data collections, called PiCo (Pipeline Composition). The main entity of this programming model is the Pipeline, basically a DAG-composition of processing elements. This model is intended to give the user an unique interface for both stream and batch processing, hiding completely data management and focusing only on operations, which are represented by Pipeline stages. Our DSL will be built on top of the FastFlow library, exploiting both shared and distributed parallelism, and implemented in C++11/14 with the aim of porting C++ into the Big Data world

    Scalable system software for high performance large-scale applications

    Get PDF
    In the last decades, high-performance large-scale systems have been a fundamental tool for scientific discovery and engineering advances. The sustained growth of supercomputing performance and the concurrent reduction in cost have made this technology available for a large number of scientists and engineers working on many different problems. The design of next-generation supercomputers will include traditional HPC requirements as well as the new requirements to handle data-intensive computations. Data intensive applications will hence play an important role in a variety of fields, and are the current focus of several research trends in HPC. Due to the challenges of scalability and power efficiency, next-generation of supercomputers needs a redesign of the whole software stack. Being at the bottom of the software stack, system software is expected to change drastically to support the upcoming hardware and to meet new application requirements. This PhD thesis addresses the scalability of system software. The thesis start at the Operating System level: first studying general-purpose OS (ex. Linux) and then studying lightweight kernels (ex. CNK). Then, we focus on the runtime system: we implement a runtime system for distributed memory systems that includes many of the system services required by next-generation applications. Finally we focus on hardware features that can be exploited at user-level to improve applications performance, and potentially included into our advanced runtime system. The thesis contributions are the following: Operating System Scalability: We provide an accurate study of the scalability problems of modern Operating Systems for HPC. We design and implement a methodology whereby detailed quantitative information may be obtained for each OS noise event. We validate our approach by comparing it to other well-known standard techniques to analyze OS noise, such FTQ (Fixed Time Quantum). Evaluation of the address translation management for a lightweight kernel: we provide a performance evaluation of different TLB management approaches ¿ dynamic memory mapping, static memory mapping with replaceable TLB entries, and static memory mapping with fixed TLB entries (no TLB misses) on a IBM BlueGene/P system. Runtime System Scalability: We show that a runtime system can efficiently incorporate system services and improve scalability for a specific class of applications. We design and implement a full-featured runtime system and programming model to execute irregular appli- cations on a commodity cluster. The runtime library is called Global Memory and Threading library (GMT) and integrates a locality-aware Partitioned Global Address Space communication model with a fork/join program structure. It supports massive lightweight multi-threading, overlapping of communication and computation and small messages aggregation to tolerate network latencies. We compare GMT to other PGAS models, hand-optimized MPI code and custom architectures (Cray XMT) on a set of large scale irregular applications: breadth first search, random walk and concurrent hash map access. Our runtime system shows performance orders of magnitude higher than other solutions on commodity clusters and competitive with custom architectures. User-level Scalability Exploiting Hardware Features: We show the high complexity of low-level hardware optimizations for single applications, as a motivation to incorporate this logic into an adaptive runtime system. We evaluate the effects of controllable hardware-thread priority mechanism that controls the rate at which each hardware-thread decodes instruction on IBM POWER5 and POWER6 processors. Finally, we show how to effectively exploits cache locality and network-on-chip on the Tilera many-core architecture to improve intra-core scalability

    Model-Based Performance Prediction for Concurrent Software on Multicore Architectures

    Get PDF
    Model-based performance prediction is a well-known concept to ensure the quality of software.Current approaches are based on a single-metric model, which leads to inaccurate predictions for modern architectures. This thesis presents a multi-strategies approach to extend performance prediction models to support multicore architectures.We implemented the strategies into Palladio and significantly increased the performance prediction power
    • …
    corecore