79 research outputs found

    Many-Core Architectures: Hardware-Software Optimization and Modeling Techniques

    Get PDF
    During the last few decades an unprecedented technological growth has been at the center of the embedded systems design paramount, with Moore’s Law being the leading factor of this trend. Today in fact an ever increasing number of cores can be integrated on the same die, marking the transition from state-of-the-art multi-core chips to the new many-core design paradigm. Despite the extraordinarily high computing power, the complexity of many-core chips opens the door to several challenges. As a result of the increased silicon density of modern Systems-on-a-Chip (SoC), the design space exploration needed to find the best design has exploded and hardware designers are in fact facing the problem of a huge design space. Virtual Platforms have always been used to enable hardware-software co-design, but today they are facing with the huge complexity of both hardware and software systems. In this thesis two different research works on Virtual Platforms are presented: the first one is intended for the hardware developer, to easily allow complex cycle accurate simulations of many-core SoCs. The second work exploits the parallel computing power of off-the-shelf General Purpose Graphics Processing Units (GPGPUs), with the goal of an increased simulation speed. The term Virtualization can be used in the context of many-core systems not only to refer to the aforementioned hardware emulation tools (Virtual Platforms), but also for two other main purposes: 1) to help the programmer to achieve the maximum possible performance of an application, by hiding the complexity of the underlying hardware. 2) to efficiently exploit the high parallel hardware of many-core chips in environments with multiple active Virtual Machines. This thesis is focused on virtualization techniques with the goal to mitigate, and overtake when possible, some of the challenges introduced by the many-core design paradigm

    HyperFPGA: SoC-FPGA Cluster Architecture for Supercomputing and Scientific applications

    Get PDF
    Since their inception, supercomputers have addressed problems that far exceed those of a single computing device. Modern supercomputers are made up of tens of thousands of CPUs and GPUs in racks that are interconnected via elaborate and most of the time ad hoc networks. These large facilities provide scientists with unprecedented and ever-growing computing power capable of tackling more complex and larger problems. In recent years, the most powerful supercomputers have already reached megawatt power consumption levels, an important issue that challenges sustainability and shows the impossibility of maintaining this trend. With more pressure on energy efficiency, an alternative to traditional architectures is needed. Reconfigurable hardware, such as FPGAs, has repeatedly been shown to offer substantial advantages over the traditional supercomputing approach with respect to performance and power consumption. In fact, several works that advanced the field of heterogeneous supercomputing using FPGAs are described in this thesis \cite{survey-2002}. Each cluster and its architectural characteristics can be studied from three interconnected domains: network, hardware, and software tools, resulting in intertwined challenges that designers must take into account. The classification and study of the architectures illustrate the trade-offs of the solutions and help identify open problems and research lines, which in turn served as inspiration and background for the HyperFPGA. In this thesis, the HyperFPGA cluster is presented as a way to build scalable SoC-FPGA platforms to explore new architectures for improved performance and energy efficiency in high-performance computing, focusing on flexibility and openness. The HyperFPGA is a modular platform based on a SoM that includes power monitoring tools with high-speed general-purpose interconnects to offer a great level of flexibility and introspection. By exploiting the reconfigurability and programmability offered by the HyperFPGA infrastructure, which combines FPGAs and CPUs, with high-speed general-purpose connectors, novel computing paradigms can be implemented. A custom Linux OS and drivers, along with a custom script for hardware definition, provide a uniform interface from application to platform for a programmable framework that integrates existing tools. The development environment is demonstrated using the N-Queens problem, which is a classic benchmark for evaluating the performance of parallel computing systems. Overall, the results of the HyperFPGA using the N-Queens problem highlight the platform's ability to handle computationally intensive tasks and demonstrate its suitability for its use in supercomputing experiments.Since their inception, supercomputers have addressed problems that far exceed those of a single computing device. Modern supercomputers are made up of tens of thousands of CPUs and GPUs in racks that are interconnected via elaborate and most of the time ad hoc networks. These large facilities provide scientists with unprecedented and ever-growing computing power capable of tackling more complex and larger problems. In recent years, the most powerful supercomputers have already reached megawatt power consumption levels, an important issue that challenges sustainability and shows the impossibility of maintaining this trend. With more pressure on energy efficiency, an alternative to traditional architectures is needed. Reconfigurable hardware, such as FPGAs, has repeatedly been shown to offer substantial advantages over the traditional supercomputing approach with respect to performance and power consumption. In fact, several works that advanced the field of heterogeneous supercomputing using FPGAs are described in this thesis \cite{survey-2002}. Each cluster and its architectural characteristics can be studied from three interconnected domains: network, hardware, and software tools, resulting in intertwined challenges that designers must take into account. The classification and study of the architectures illustrate the trade-offs of the solutions and help identify open problems and research lines, which in turn served as inspiration and background for the HyperFPGA. In this thesis, the HyperFPGA cluster is presented as a way to build scalable SoC-FPGA platforms to explore new architectures for improved performance and energy efficiency in high-performance computing, focusing on flexibility and openness. The HyperFPGA is a modular platform based on a SoM that includes power monitoring tools with high-speed general-purpose interconnects to offer a great level of flexibility and introspection. By exploiting the reconfigurability and programmability offered by the HyperFPGA infrastructure, which combines FPGAs and CPUs, with high-speed general-purpose connectors, novel computing paradigms can be implemented. A custom Linux OS and drivers, along with a custom script for hardware definition, provide a uniform interface from application to platform for a programmable framework that integrates existing tools. The development environment is demonstrated using the N-Queens problem, which is a classic benchmark for evaluating the performance of parallel computing systems. Overall, the results of the HyperFPGA using the N-Queens problem highlight the platform's ability to handle computationally intensive tasks and demonstrate its suitability for its use in supercomputing experiments

    Scalable system software for high performance large-scale applications

    Get PDF
    In the last decades, high-performance large-scale systems have been a fundamental tool for scientific discovery and engineering advances. The sustained growth of supercomputing performance and the concurrent reduction in cost have made this technology available for a large number of scientists and engineers working on many different problems. The design of next-generation supercomputers will include traditional HPC requirements as well as the new requirements to handle data-intensive computations. Data intensive applications will hence play an important role in a variety of fields, and are the current focus of several research trends in HPC. Due to the challenges of scalability and power efficiency, next-generation of supercomputers needs a redesign of the whole software stack. Being at the bottom of the software stack, system software is expected to change drastically to support the upcoming hardware and to meet new application requirements. This PhD thesis addresses the scalability of system software. The thesis start at the Operating System level: first studying general-purpose OS (ex. Linux) and then studying lightweight kernels (ex. CNK). Then, we focus on the runtime system: we implement a runtime system for distributed memory systems that includes many of the system services required by next-generation applications. Finally we focus on hardware features that can be exploited at user-level to improve applications performance, and potentially included into our advanced runtime system. The thesis contributions are the following: Operating System Scalability: We provide an accurate study of the scalability problems of modern Operating Systems for HPC. We design and implement a methodology whereby detailed quantitative information may be obtained for each OS noise event. We validate our approach by comparing it to other well-known standard techniques to analyze OS noise, such FTQ (Fixed Time Quantum). Evaluation of the address translation management for a lightweight kernel: we provide a performance evaluation of different TLB management approaches ¿ dynamic memory mapping, static memory mapping with replaceable TLB entries, and static memory mapping with fixed TLB entries (no TLB misses) on a IBM BlueGene/P system. Runtime System Scalability: We show that a runtime system can efficiently incorporate system services and improve scalability for a specific class of applications. We design and implement a full-featured runtime system and programming model to execute irregular appli- cations on a commodity cluster. The runtime library is called Global Memory and Threading library (GMT) and integrates a locality-aware Partitioned Global Address Space communication model with a fork/join program structure. It supports massive lightweight multi-threading, overlapping of communication and computation and small messages aggregation to tolerate network latencies. We compare GMT to other PGAS models, hand-optimized MPI code and custom architectures (Cray XMT) on a set of large scale irregular applications: breadth first search, random walk and concurrent hash map access. Our runtime system shows performance orders of magnitude higher than other solutions on commodity clusters and competitive with custom architectures. User-level Scalability Exploiting Hardware Features: We show the high complexity of low-level hardware optimizations for single applications, as a motivation to incorporate this logic into an adaptive runtime system. We evaluate the effects of controllable hardware-thread priority mechanism that controls the rate at which each hardware-thread decodes instruction on IBM POWER5 and POWER6 processors. Finally, we show how to effectively exploits cache locality and network-on-chip on the Tilera many-core architecture to improve intra-core scalability

    Studies in Exascale Computer Architecture: Interconnect, Resiliency, and Checkpointing

    Full text link
    Today’s supercomputers are built from the state-of-the-art components to extract as much performance as possible to solve the most computationally intensive problems in the world. Building the next generation of exascale supercomputers, however, would require re-architecting many of these components to extract over 50x more performance than the current fastest supercomputer in the United States. To contribute towards this goal, two aspects of the compute node architecture were examined in this thesis: the on-chip interconnect topology and the memory and storage checkpointing platforms. As a first step, a skeleton exascale system was modeled to meet 1 exaflop of performance along with 100 petabytes of main memory. The model revealed that large kilo-core processors would be necessary to meet the exaflop performance goal; existing topologies, however, would not scale to those levels. To address this new challenge, we investigated and proposed asymmetric high-radix topologies that decoupled local and global communications and used different radix routers for switching network traffic at each level. The proposed topologies scaled more readily to higher numbers of cores with better latency and energy consumption than before. The vast number of components that the model revealed would be needed in these exascale systems cautioned towards better fault tolerance mechanisms. To address this challenge, we showed that local checkpoints within the compute node can be saved to a hybrid DRAM and SSD platform in order to write them faster without wearing out the SSD or consuming a lot of energy. A hybrid checkpointing platform allowed more frequent checkpoints to be made without sacrificing performance. Subsequently, we proposed switching to a DIMM-based SSD in order to perform fine-grained I/O operations that would be integral in interleaving checkpointing and computation while still providing persistence guarantees. Two more techniques that consolidate and overlap checkpointing were designed to better hide the checkpointing latency to the SSD.PHDComputer Science & EngineeringUniversity of Michigan, Horace H. Rackham School of Graduate Studieshttps://deepblue.lib.umich.edu/bitstream/2027.42/137096/1/sabeyrat_1.pd

    Research And Application Of Parallel Computing Algorithms For Statistical Phylogenetic Inference

    Get PDF
    Estimating the evolutionary history of organisms, phylogenetic inference, is a critical step in many analyses involving biological sequence data such as DNA. The likelihood calculations at the heart of the most effective methods for statistical phylogenetic analyses are extremely computationally intensive, and hence these analyses become a bottleneck in many studies. Recent progress in computer hardware, specifically the increase in pervasiveness of highly parallel, many-core processors has created opportunities for new approaches to computationally intensive methods, such as those in phylogenetic inference. We have developed an open source library, BEAGLE, which uses parallel computing methods to greatly accelerate statistical phylogenetic inference, for both maximum likelihood and Bayesian approaches. BEAGLE defines a uniform application programming interface and includes a collection of efficient implementations that use NVIDIA CUDA, OpenCL, and C++ threading frameworks for evaluating likelihoods under a wide variety of evolutionary models, on GPUs as well as on multi-core CPUs. BEAGLE employs a number of different parallelization techniques for phylogenetic inference, at different granularity levels and for distinct processor architectures. On CUDA and OpenCL devices, the library enables concurrent computation of site likelihoods, data subsets, and independent subtrees. The general design features of the library also provide a model for software development using parallel computing frameworks that is applicable to other domains. BEAGLE has been integrated with some of the leading programs in the field, such as MrBayes and BEAST, and is used in a diverse range of evolutionary studies, including those of disease causing viruses. The library can provide significant performance gains, with the exact increase in performance depending on the specific properties of the data set, evolutionary model, and hardware. In general, nucleotide analyses are accelerated on the order of 10-fold and codon analyses on the order of 100-fold

    Performance and Energy Optimization of the Iterative Solution of Sparse Linear Systems on Multicore Processors

    Get PDF
    En esta tesis doctoral se aborda la solución de sistemas dispersos de ecuaciones lineales utilizando métodos iterativos precondicionados basados en subespacios de Krylov. En concreto, se centra en ILUPACK, una biblioteca que implementa precondicionadores de tipo ILU multinivel para la solución eficiente de sistemas lineales dispersos. El incremento en el número de ecuaciones, y la aparición de nuevas arquitecturas, motiva el desarrollo de una versión paralela de ILUPACK que optimice tanto el tiempo de ejecución como el consumo energético en arquitecturas multinúcleo actuales y en clusters de nodos construidos con esta tecnología. El objetivo principal de la tesis es el diseño, implementación y valuación de resolutores paralelos energéticamente eficientes para sistemas lineales dispersos orientados a procesadores multinúcleo así como aceleradores hardware como el Intel Xeon Phi. Para lograr este objetivo, se aprovecha el paralelismo de tareas mediante OmpSs y MPI, y se desarrolla un entorno automático para detectar ineficiencias energéticas.In this dissertation we target the solution of large sparse systems of linear equations using preconditioned iterative methods based on Krylov subspaces. Specifically, we focus on ILUPACK, a library that offers multi-level ILU preconditioners for the effective solution of sparse linear systems. The increase of the number of equations and the introduction of new HPC architectures motivates us to develop a parallel version of ILUPACK which optimizes both execution time and energy consumption on current multicore architectures and clusters of nodes built from this type of technology. Thus, the main goal of this thesis is the design, implementation and evaluation of parallel and energy-efficient iterative sparse linear system solvers for multicore processors as well as recent manycore accelerators such as the Intel Xeon Phi. To fulfill the general objective, we optimize ILUPACK exploiting task parallelism via OmpSs and MPI, and also develope an automatic framework to detect energy inefficiencies

    Modelli e strumenti di programmazione parallela per piattaforme many-core

    Get PDF
    The negotiation between power consumption, performance, programmability, and portability drives all computing industry designs, in particular the mobile and embedded systems domains. Two design paradigms have proven particularly promising in this context: architectural heterogeneity and many-core processors. Parallel programming models are key to effectively harness the computational power of heterogeneous many-core SoC. This thesis presents a set of techniques and HW/SW extensions that enable performance improvements and that simplify programmability for heterogeneous many-core platforms. The thesis contributions cover vertically the entire software stack for many-core platforms, from hardware abstraction layers running on top of bare-metal, to programming models; from hardware extensions for efficient parallelism support to middleware that enables optimized resource management within many-core platforms. First, we present mechanisms to decrease parallelism overheads on parallel programming runtimes for many-core platforms, targeting fine-grain parallelism. Second, we present programming model support that enables the offload of computational kernels within heterogeneous many-core systems. Third, we present a novel approach to dynamically sharing and managing many-core platforms when multiple applications coded with different programming models execute concurrently. All these contributions were validated using STMicroelectronics STHORM, a real embodiment of a state-of-the-art many-core system. Hardware extensions and architectural explorations were explored using VirtualSoC, a SystemC based cycle-accurate simulator of many-core platforms

    Design of robust scheduling methodologies for high performance computing

    Get PDF
    Scientific applications are often large, complex, computationally-intensive, and irregular. Loops are often an abundant source of parallelism in scientific applications. Due to the ever-increasing computational needs of scientific applications, high performance computing (HPC) systems have become larger and more complex, offering increased parallelism at multiple hardware levels. Load imbalance, caused by irregular computational load per task and unpredictable computing system characteristics (system variability), often degrades the performance of applications. Besides, perturbations, such as reduced computing power, network latency availability, or failures, can severely impact the performance of the applications. System variability and perturbations are only expected to increase in future extreme-scale computing systems. Extrapolating the current failure rate to Exascale would result in a failure every 20 minutes. Such failure rate and perturbations would render the computing systems unusable. This doctoral thesis improves the performance of computationally-intensive scientific applications on HPC systems via robust load balancing. Robust scheduling ensures and maintains improved load balanced execution under unpredictable application and system characteristics. A number of dynamic loop self-scheduling (DLS) techniques have been introduced and successfully used in scientific applications between the 1980s and 2000s. These DLS techniques are not fault-tolerant as they were originally introduced. In this thesis, we identify three major research questions to achieve robust scheduling (1) How to ensure that the DLS techniques employed in scientific applications today adhere to their original design goals and specifications? (2) How to select a DLS technique that will achieve improved performance under perturbations? (3) How to tolerate perturbations during execution and maintain a load balanced execution on HPC systems? To answer the first question, we reproduced the original experiments that introduced the DLS techniques to verify their present implementation. Simulation is used to reproduce experiments on systems from the past. Realistic simulation induces a similar analysis and conclusions to the analysis of the native results. To this end, we devised an approach for bridging the native and simulative executions of parallel applications on HPC systems. This simulation approach is used to reproduce scheduling experiments on past and present systems to verify the implementation of DLS techniques. Given the multiple levels of parallelism offered by the present HPC systems, we analyzed the load imbalance in scientific applications, from computer vision, astrophysics, and mathematical kernels, at both thread and process levels. This analysis revealed a significant interplay between thread level and process level load balancing. We found that dynamic load balancing at the thread level propagates to the process level and vice versa. However, the best application performance is only achieved by two-level dynamic load balancing. Next, we examined the performance of applications under perturbations. We found that the most robust DLS technique does not deliver the best performance under various perturbations. The most efficient DLS technique changes by changing the application, the system, or perturbations during execution. This signifies the algorithm selection problem in the DLS. We leveraged realistic simulations to address the algorithm selection problem of scheduling under perturbations via a simulation assisted approach (SimAS), which answers the second question. SimAS dynamically selects DLS techniques that improve the performance depending on the application, system, and perturbations during the execution. To answer the third question, we introduced a robust dynamic load balancing (rDLB) approach for the robust self-scheduling of scientific applications under failures (question 3). rDLB proactively reschedules already allocated tasks and requires no detection of perturbations. rDLB tolerates up to P −1 processor failures (P is the number of processors allocated to the application) and boosts the flexibility of applications against nonfatal perturbations, such as reduced availability of resources. This thesis is the first to provide insights into the interplay between thread and process level dynamic load balancing in scientific applications. Verified DLS techniques, SimAS, and rDLB are integrated into an MPI-based dynamic load balancing library (DLS4LB), which supports thirteen DLS techniques, for robust dynamic load balancing of scientific applications on HPC systems. Using the methods devised in this thesis, we improved the performance of scientific applications by up to 21% via two-level dynamic load balancing. Under perturbations, we enhanced their performance by a factor of 7 and their flexibility by a factor of 30. This thesis opens up the horizons into understanding the interplay of load balancing between various levels of software parallelism and lays the ground for robust multilevel scheduling for the upcoming Exascale HPC systems and beyond
    • …
    corecore