20 research outputs found

    Unleashing GPU acceleration for symmetric band linear algebra kernels and model reduction

    Get PDF
    Linear algebra operations arise in a myriad of scientific and engineering applications and, therefore, their optimization is targeted by a significant number of high performance computing (HPC) research efforts. In particular, the matrix multiplication and the solution of linear systems are two key problems with efficient implementations (or kernels) for a variety of high per- formance parallel architectures. For these specific prob- lems, leveraging the structure of the associated matrices often leads to remarkable time and memory savings, as is the case, e.g., for symmetric band problems. In this work, we exploit the ample hardware concurrency of many-core graphics processors (GPUs) to accelerate the solution of symmetric positive definite band linear systems, introducing highly tuned versions of the corre- sponding LAPACK routines. The experimental results with the new GPU kernels reveal important reductions of the execution time when compared with tuned imple- mentations of the same operations provided in Intel’s MKL. In addition, we evaluate the performance of the GPU kernels when applied to the solution of model or- der reduction problems and the associated matrix equa- tions.Ernesto Dufrechou and Pablo Ezzatti acknowledge the support from Programa de Desarrollo de las Ciencias Básicas, and Agencia Nacional de Investigación e Innovacioón, Uruguay. Enrique S. Quintana-Ortí was sup- ported by project TIN2011-23283 of the Ministry of Science and Competitiveness (MINECO) and EU FEDER, and project P1-1B2013-20 of the Fundació Caixa Castelló-Bancaixa and UJI

    GPU PERFORMANCE MODELLING AND OPTIMIZATION

    Get PDF
    Ph.DNUS-TU/E JOINT PH.D

    Towards Closing the Programmability-Efficiency Gap using Software-Defined Hardware

    Full text link
    The past decade has seen the breakdown of two important trends in the computing industry: Moore’s law, an observation that the number of transistors in a chip roughly doubles every eighteen months, and Dennard scaling, that enabled the use of these transistors within a constant power budget. This has caused a surge in domain-specific accelerators, i.e. specialized hardware that deliver significantly better energy efficiency than general-purpose processors, such as CPUs. While the performance and efficiency of such accelerators are highly desirable, the fast pace of algorithmic innovation and non-recurring engineering costs have deterred their widespread use, since they are only programmable across a narrow set of applications. This has engendered a programmability-efficiency gap across contemporary platforms. A practical solution that can close this gap is thus lucrative and is likely to engender broad impact in both academic research and the industry. This dissertation proposes such a solution with a reconfigurable Software-Defined Hardware (SDH) system that morphs parts of the hardware on-the-fly to tailor to the requirements of each application phase. This system is designed to deliver near-accelerator-level efficiency across a broad set of applications, while retaining CPU-like programmability. The dissertation first presents a fixed-function solution to accelerate sparse matrix multiplication, which forms the basis of many applications in graph analytics and scientific computing. The solution consists of a tiled hardware architecture, co-designed with the outer product algorithm for Sparse Matrix-Matrix multiplication (SpMM), that uses on-chip memory reconfiguration to accelerate each phase of the algorithm. A proof-of-concept is then presented in the form of a prototyped 40 nm Complimentary Metal-Oxide Semiconductor (CMOS) chip that demonstrates energy efficiency and performance per die area improvements of 12.6x and 17.1x over a high-end CPU, and serves as a stepping stone towards a full SDH system. The next piece of the dissertation enhances the proposed hardware with reconfigurability of the dataflow and resource sharing modes, in order to extend acceleration support to a set of common parallelizable workloads. This reconfigurability lends the system the ability to cater to discrete data access and compute patterns, such as workloads with extensive data sharing and reuse, workloads with limited reuse and streaming access patterns, among others. Moreover, this system incorporates commercial cores and a prototyped software stack for CPU-level programmability. The proposed system is evaluated on a diverse set of compute-bound and memory-bound kernels that compose applications in the domains of graph analytics, machine learning, image and language processing. The evaluation shows average performance and energy-efficiency gains of 5.0x and 18.4x over the CPU. The final part of the dissertation proposes a runtime control framework that uses low-cost monitoring of hardware performance counters to predict the next best configuration and reconfigure the hardware, upon detecting a change in phase or nature of data within the application. In comparison to prior work, this contribution targets multicore CGRAs, uses low-overhead decision tree based predictive models, and incorporates reconfiguration cost-awareness into its policies. Compared to the best-average static (non-reconfiguring) configuration, the dynamically reconfigurable system achieves a 1.6x improvement in performance-per-Watt in the Energy-Efficient mode of operation, or the same performance with 23% lower energy in the Power-Performance mode, for SpMM across a suite of real-world inputs. The proposed reconfiguration mechanism itself outperforms the state-of-the-art approach for dynamic runtime control by up to 2.9x in terms of energy-efficiency.PHDComputer Science & EngineeringUniversity of Michigan, Horace H. Rackham School of Graduate Studieshttp://deepblue.lib.umich.edu/bitstream/2027.42/169859/1/subh_1.pd

    Fundamentals

    Get PDF
    Volume 1 establishes the foundations of this new field. It goes through all the steps from data collection, their summary and clustering, to different aspects of resource-aware learning, i.e., hardware, memory, energy, and communication awareness. Machine learning methods are inspected with respect to resource requirements and how to enhance scalability on diverse computing architectures ranging from embedded systems to large computing clusters

    Fundamentals

    Get PDF
    Volume 1 establishes the foundations of this new field. It goes through all the steps from data collection, their summary and clustering, to different aspects of resource-aware learning, i.e., hardware, memory, energy, and communication awareness. Machine learning methods are inspected with respect to resource requirements and how to enhance scalability on diverse computing architectures ranging from embedded systems to large computing clusters

    Tools and Algorithms for the Construction and Analysis of Systems

    Get PDF
    This open access book constitutes the proceedings of the 28th International Conference on Tools and Algorithms for the Construction and Analysis of Systems, TACAS 2022, which was held during April 2-7, 2022, in Munich, Germany, as part of the European Joint Conferences on Theory and Practice of Software, ETAPS 2022. The 46 full papers and 4 short papers presented in this volume were carefully reviewed and selected from 159 submissions. The proceedings also contain 16 tool papers of the affiliated competition SV-Comp and 1 paper consisting of the competition report. TACAS is a forum for researchers, developers, and users interested in rigorously based tools and algorithms for the construction and analysis of systems. The conference aims to bridge the gaps between different communities with this common interest and to support them in their quest to improve the utility, reliability, exibility, and efficiency of tools and algorithms for building computer-controlled systems

    A mixed-signal computer architecture and its application to power system problems

    Get PDF
    Radical changes are taking place in the landscape of modern power systems. This massive shift in the way the system is designed and operated has been termed the advent of the ``smart grid''. One of its implications is a strong market pull for faster power system analysis computing. This work is concerned in particular with transient simulation, which is one of the most demanding power system analyses. This refers to the imitation of the operation of the real-world system over time, for time scales that cover the majority of slow electromechanical transient phenomena. The general mathematical formulation of the simulation problem includes a set of non-linear differential algebraic equations (DAEs). In the algebraic part of this set, heavy linear algebra computations are included, which are related to the admittance matrix of the topology. These computations are a critical factor to the overall performance of a transient simulator. This work proposes the use of analog electronic computing as a means of exceeding the performance barriers of conventional digital computers for the linear algebra operations. Analog computing is integrated in the frame of a power system transient simulator yielding significant computational performance benefits to the latter. Two hybrid, analog and digital computers are presented. The first prototype has been implemented using reconfigurable hardware. In its core, analog computing is used for linear algebra operations, while pipelined digital resources on a field programmable gate array (FPGA) handle all remaining computations. The properties of the analog hardware are thoroughly examined, with special attention to accuracy and timing. The application of the platform to the transient analysis of power system dynamics showed a speedup of two orders of magnitude against conventional software solutions. The second prototype is proposed as a future conceptual architecture that would overcome the limitations of the already implemented hardware, while retaining its virtues. The design space of this future architecture has been thoroughly explored, with the help of a software emulator. For one possible suggested implementation, speedups of four orders of magnitude against software solvers have been observed for the linear algebra operations

    Tools and Algorithms for the Construction and Analysis of Systems

    Get PDF
    This open access book constitutes the proceedings of the 28th International Conference on Tools and Algorithms for the Construction and Analysis of Systems, TACAS 2022, which was held during April 2-7, 2022, in Munich, Germany, as part of the European Joint Conferences on Theory and Practice of Software, ETAPS 2022. The 46 full papers and 4 short papers presented in this volume were carefully reviewed and selected from 159 submissions. The proceedings also contain 16 tool papers of the affiliated competition SV-Comp and 1 paper consisting of the competition report. TACAS is a forum for researchers, developers, and users interested in rigorously based tools and algorithms for the construction and analysis of systems. The conference aims to bridge the gaps between different communities with this common interest and to support them in their quest to improve the utility, reliability, exibility, and efficiency of tools and algorithms for building computer-controlled systems

    Low-power System-on-Chip Processors for Energy Efficient High Performance Computing: The Texas Instruments Keystone II

    No full text
    The High Performance Computing (HPC) community recognizes energy consumption as a major problem. Extensive research is underway to identify means to increase energy efficiency of HPC systems including consideration of alternative building blocks for future systems. This thesis considers one such system, the Texas Instruments Keystone II, a heterogeneous Low-Power System-on-Chip (LPSoC) processor that combines a quad core ARM CPU with an octa-core Digital Signal Processor (DSP). It was first released in 2012. Four issues are considered: i) maximizing the Keystone II ARM CPU performance; ii) implementation and extension of the OpenMP programming model for the Keystone II; iii) simultaneous use of ARM and DSP cores across multiple Keystone SoCs; and iv) an energy model for applications running on LPSoCs like the Keystone II and heterogeneous systems in general. Maximizing the performance of the ARM CPU on the Keystone II system is fundamental to adoption of this system by the HPC community and, of the ARM architecture more broadly. Key to achieving good performance is exploitation of the ARM vector instructions. This thesis presents the first detailed comparison of the use of ARM compiler intrinsic functions with automatic compiler vectorization across four generations of ARM processors. Comparisons are also made with x86 based platforms and the use of equivalent Intel vector instructions. Implementation of the OpenMP programming model on the Keystone II system presents both challenges and opportunities. Challenges in that the OpenMP model was originally developed for a homogeneous programming environment with a common instruction set architecture, and in 2012 work had only just begun to consider how OpenMP might work with accelerators. Opportunities in that shared memory is accessible to all processing elements on the LPSoC, offering performance advantages over what typically exists with attached accelerators. This thesis presents an analysis of a prototype version of OpenMP implemented as a bare-metal runtime on the DSP of a Keystone I system. An implementation for the Keystone II that maps OpenMP 4.0 accelerator directives to OpenCL runtime library operations is presented and evaluated. Exploitation of some of the underlying hardware features of the Keystone II is also discussed. Simultaneous use of the ARM and DSP cores across multiple Keystone II boards is fundamental to the creation of commercially viable HPC offerings based on Keystone technology. The nCore BrownDwarf and HPE Moonshot systems represent two such systems. This thesis presents a proof-of-concept implementation of matrix multiplication (GEMM) for the BrownDwarf system. The BrownDwarf utilizes both Keystone II and Keystone I SoCs through a point-to-point interconnect called Hyperlink. Details of how a novel message passing communication framework across Hyperlink was implemented to support this complex environment are provided. An energy model that can be used to predict energy usage as a function of what fraction of a particular computation is performed on each of the available compute devices offers the opportunity for making runtime decisions on how best to minimize energy usage. This thesis presents a basic energy usage model that considers rates of executions on each device and their active and idle power usages. Using this model, it is shown that only under certain conditions does there exist an energy-optimal work partition that uses multiple compute devices. To validate the model a high resolution energy measurement environment is developed and used to gather energy measurements for a matrix multiplication benchmark running on a variety of systems. Results presented support the model. Drawing on the four issues noted above and other developments that have occurred since the Keystone II system was first announced, the thesis concludes by making comments regarding the future of LPSoCs as building blocks for HPC systems
    corecore