977 research outputs found

    Doctor of Philosophy

    Get PDF
    Memory access irregularities are a major bottleneck for bandwidth-limited problems on Graphics Processing Unit (GPU) architectures. GPU memory systems are designed to allow consecutive memory accesses to be coalesced into a single memory transaction, so noncontiguous accesses within a parallel group of threads working in lockstep may cause serialized memory transfers. Irregular algorithms may have data-dependent control flow and memory access, which require runtime information to evaluate. Compile-time methods for evaluating parallelism, such as static dependence graphs, cannot evaluate irregular algorithms. The goals of this dissertation are to study irregularities in the context of unstructured mesh and sparse matrix problems, analyze the impact of vectorization widths on irregularities, and present data-centric methods that reduce control-flow and memory-access irregularity in those contexts. Reordering associative operations has often been exploited for performance gains in parallel algorithms. This dissertation presents a method for associative reordering of stencil computations over unstructured meshes that increases data reuse through caching; this novel parallelization scheme offers considerable speedups over standard methods. Vectorization widths can have a significant impact on the performance of vectorized computations. Although the hardware vector width is generally fixed, the logical vector width used within a computation can range from one up to the width of the computation, and significant performance differences can occur due to thread scheduling and resource limitations. This dissertation analyzes the impact of vectorization widths on dense numerical computations such as 3D dG postprocessing. It is difficult to efficiently perform dynamic updates on traditional sparse matrix formats. Explicitly controlling memory segmentation allows for in-place dynamic updates in sparse matrices, and dynamically updating the matrix without rebuilding or sorting greatly improves processing time and overall throughput. This dissertation presents a new sparse matrix format, dynamic compressed sparse row (DCSR), which allows dynamic streaming updates to a sparse matrix. A new method for parallel sparse matrix-matrix multiplication (SpMM) that uses dynamic updates is also presented.
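The coalescing behavior described in this abstract is easy to see in a toy kernel. Below is a minimal, hypothetical CUDA sketch (kernel and variable names are illustrative, not taken from the dissertation) contrasting an access pattern that coalesces with one that does not: when consecutive threads in a warp touch consecutive addresses, the hardware combines them into a few wide transactions, while a strided pattern forces many separate transfers.

```cuda
#include <cuda_runtime.h>

// Consecutive threads read/write consecutive addresses: fully coalesced.
__global__ void copy_coalesced(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

// Neighbouring threads touch addresses `stride` elements apart: the warp's
// requests cannot be merged and memory traffic is serialized.
__global__ void copy_strided(const float* in, float* out, int n, int stride) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    int j = (int)(((long long)i * stride) % n);
    out[j] = in[j];
}

int main() {
    const int n = 1 << 24;
    float *in, *out;
    cudaMalloc(&in,  n * sizeof(float));
    cudaMalloc(&out, n * sizeof(float));
    dim3 block(256), grid((n + 255) / 256);
    copy_coalesced<<<grid, block>>>(in, out, n);
    copy_strided<<<grid, block>>>(in, out, n, 32);  // 32-element stride defeats coalescing
    cudaDeviceSynchronize();
    cudaFree(in); cudaFree(out);
    return 0;
}
```

On most GPUs the strided variant runs several times slower despite moving the same amount of data, which is the irregularity the dissertation targets.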

    Fundamentals

    Get PDF
    Volume 1 establishes the foundations of this new field. It covers all the steps from data collection, through summarization and clustering, to the different aspects of resource-aware learning, i.e., hardware, memory, energy, and communication awareness. Machine learning methods are examined with respect to their resource requirements and how to enhance scalability on diverse computing architectures, ranging from embedded systems to large computing clusters.

    A Scalable Near-Memory Architecture for Training Deep Neural Networks on Large In-Memory Datasets

    Get PDF
    Most investigations into near-memory hardware accelerators for deep neural networks have focused primarily on inference, while the potential of accelerating training has received relatively little attention so far. Based on an in-depth analysis of the key computational patterns in state-of-the-art gradient-based training methods, we propose an efficient near-memory acceleration engine called NTX that can be used to train state-of-the-art deep convolutional neural networks at scale. Our main contributions are: (i) a loose coupling of RISC-V cores and NTX co-processors that reduces offloading overhead by 7x over previously published results; (ii) an optimized IEEE 754 compliant data path for fast high-precision convolutions and gradient propagation; (iii) an evaluation of near-memory computing with NTX embedded into the residual area on the Logic Base die of a Hybrid Memory Cube; and (iv) a scaling analysis to meshes of HMCs in a data center scenario. We demonstrate a 2.7x energy efficiency improvement of NTX over contemporary GPUs at 4.4x less silicon area, and a compute performance of 1.2 Tflop/s for training large state-of-the-art networks with full floating-point precision. At the data center scale, a mesh of NTX achieves above 95 percent parallel and energy efficiency, while providing 2.1x energy savings or 3.1x performance improvement over a GPU-based system.
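As a rough illustration of the kind of computational pattern a training-oriented engine must handle, the sketch below shows the weight-gradient accumulation of a 1-D convolution in CUDA. This is a hedged, minimal example: the kernel name, problem sizes, and the use of atomics are assumptions made for illustration and are not part of the NTX design.

```cuda
#include <cuda_runtime.h>

// dW[k] = sum_i dY[i] * X[i + k]; x must hold n + k_width - 1 samples.
__global__ void conv1d_weight_grad(const float* x, const float* dy,
                                   float* dw, int n, int k_width) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // one output-gradient sample per thread
    if (i >= n) return;
    for (int k = 0; k < k_width; ++k) {
        atomicAdd(&dw[k], dy[i] * x[i + k]);        // scattered accumulation into the small dW buffer
    }
}

int main() {
    const int n = 1 << 20, kw = 5;
    float *x, *dy, *dw;
    cudaMalloc(&x,  (n + kw - 1) * sizeof(float));
    cudaMalloc(&dy, n * sizeof(float));
    cudaMalloc(&dw, kw * sizeof(float));
    cudaMemset(dw, 0, kw * sizeof(float));
    conv1d_weight_grad<<<(n + 255) / 256, 256>>>(x, dy, dw, n, kw);
    cudaDeviceSynchronize();
    cudaFree(x); cudaFree(dy); cudaFree(dw);
    return 0;
}
```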

    Large-Scale Simulations of Complex Turbulent Flows: Modulation of Turbulent Boundary Layer Separation and Optimization of Discontinuous Galerkin Methods for Next-Generation HPC Platforms

    Full text link
    The separation of spatially evolving turbulent boundary layer flow near regions of adverse pressure gradients has been the subject of numerous studies in the context of flow control. Although many studies have demonstrated the efficacy of passive flow control devices, such as vortex generators (VGs), in reducing the size of the separated region, the interactions between the salient flow structures produced by the VG and those of the separated flow are not fully understood. Here, wall-resolved large-eddy simulation of a model problem of flow over a backward-facing ramp is studied, with a submerged, wall-mounted cube used as a canonical VG. In particular, the turbulent transport that results in the modulation of the separated flow over the ramp is investigated by varying the size and location of the VG and the spanwise spacing between multiple VGs, which in turn are expected to modify the interactions between the VG-induced flow structures and those of the separated region. The horseshoe vortices produced by the cube entrain the freestream turbulent flow towards the plane of symmetry. These localized regions of high vorticity correspond to regions of turbulent kinetic energy production, which effectively transfer energy from the freestream to the near-wall regions. Numerical simulations indicate that: (i) the gradients and the fluctuations scale with the size of the cube and thus lead to more effective modulation for large cubes; (ii) for a given cube height, different upstream cube positions affect the behavior of the horseshoe vortex: when placed too close to the leading edge, the horseshoe vortex is not sufficiently strong to affect the large-scale structures of the separated region, and when placed too far, the dispersed core of the streamwise vortex is unable to modulate the flow over the ramp; (iii) if the spanwise spacing between neighboring VGs is too small, the counter-rotating vortices are not sufficiently strong to affect the large-scale structures of the separated region, and if the spacing is too large, the flow modulation is similar to that of an isolated VG.
    Turbulent boundary layer flows are inherently multiscale, and numerical simulations of such systems often require high spatial and temporal resolution to capture the unsteady flow dynamics accurately. While innovations in computer hardware and distributed computing have enabled advances in the modeling of such large-scale systems, computations of many practical problems of interest are infeasible, even on the largest supercomputers. The need for high accuracy and the evolving heterogeneous architecture of next-generation high-performance computing centers have spurred interest in the development of high-order methods. While the new class of recovery-assisted discontinuous Galerkin (RADG) methods can provide arbitrarily high orders of accuracy, the large number of degrees of freedom increases the cost of the arithmetic operations performed and of the data transferred on-node. The purpose of the second part of this thesis is to explore optimization strategies that improve the parallel efficiency of RADG. A cache data-tiling strategy is investigated for polynomial orders 1 through 6, which enhances the arithmetic intensity of RADG and makes better use of on-node floating-point capability. In addition, a power-aware compute framework is proposed by analyzing the power-performance trade-offs of changing from double- to single-precision floating-point types; energy savings of 5 W per node are observed, which suggests that a transprecision framework is likely to offer a better power-performance balance on modern HPC platforms.
    PhD, Mechanical Engineering, University of Michigan, Horace H. Rackham School of Graduate Studies. http://deepblue.lib.umich.edu/bitstream/2027.42/163206/1/suyashtn_1.pd
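The cache data-tiling idea can be illustrated with a generic shared-memory tiling kernel. The sketch below is not the thesis's RADG code; it assumes a square matrix multiply with N divisible by TILE, and shows how staging operand tiles in on-chip memory lets each loaded value be reused TILE times, raising arithmetic intensity.

```cuda
#include <cuda_runtime.h>

#define TILE 16

// C = A * B for N x N row-major matrices; N is assumed divisible by TILE.
__global__ void matmul_tiled(const float* A, const float* B, float* C, int N) {
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];
    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;
    for (int t = 0; t < N / TILE; ++t) {
        // Stage one tile of each operand in shared memory.
        As[threadIdx.y][threadIdx.x] = A[row * N + t * TILE + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * N + col];
        __syncthreads();
        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];  // TILE-fold reuse of staged data
        __syncthreads();
    }
    C[row * N + col] = acc;
}

int main() {
    const int N = 512;
    float *A, *B, *C;
    cudaMalloc(&A, N * N * sizeof(float));
    cudaMalloc(&B, N * N * sizeof(float));
    cudaMalloc(&C, N * N * sizeof(float));
    dim3 block(TILE, TILE), grid(N / TILE, N / TILE);
    matmul_tiled<<<grid, block>>>(A, B, C, N);
    cudaDeviceSynchronize();
    cudaFree(A); cudaFree(B); cudaFree(C);
    return 0;
}
```

The same idea carries over to element-local DG operators: keeping a working set in fast memory converts a bandwidth-bound loop into one limited by floating-point throughput.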

    Tools for efficient Deep Learning

    Get PDF
    In the era of Deep Learning (DL), there is a fast-growing demand for building and deploying Deep Neural Networks (DNNs) on various platforms. This thesis proposes five tools that address the challenges of designing DNNs that are efficient in time, resources, and power consumption. We first present Aegis and SPGC to address the challenges of improving the memory efficiency of DL training and inference. Aegis makes mixed-precision training (MPT) more stable through layer-wise gradient scaling; empirical experiments show that Aegis can improve MPT accuracy by up to 4%. SPGC focuses on structured pruning: replacing standard convolution with group convolution (GConv) to avoid irregular sparsity. SPGC formulates GConv pruning as a channel permutation problem and proposes a novel heuristic polynomial-time algorithm. Common DNNs pruned by SPGC achieve up to 1% higher accuracy than prior work. This thesis also addresses the gap between DNN descriptions and executables, with Polygeist for software and POLSCA for hardware. Many novel techniques, e.g., statement splitting and memory partitioning, are explored and used to extend polyhedral optimisation. Polygeist speeds up sequential and parallel software execution by 2.53x and 9.47x, respectively, on Polybench/C. POLSCA achieves a 1.5x speedup over hardware designs directly generated from high-level synthesis on Polybench/C. Moreover, this thesis presents Deacon, a framework that generates FPGA-based DNN accelerators with streaming architectures and advanced pipelining techniques to address the challenges posed by heterogeneous convolutions and residual connections. Deacon provides fine-grained pipelining, graph-level optimisation, and heuristic exploration by graph colouring. Compared with prior designs, Deacon improves resource/power-consumption efficiency by 1.2x/3.5x for MobileNets and 1.0x/2.8x for SqueezeNets. All these tools are open source, and some have already gained public engagement. We believe they can make efficient deep learning applications easier to build and deploy.
    Open Access
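As a rough sketch of the general idea behind per-layer gradient scaling in mixed-precision training (this is not the Aegis algorithm itself; the kernel names and the single per-layer scale factor are assumptions), the CUDA kernels below scale fp32 gradients before casting them to fp16 so that small values are not flushed to zero, and divide the scale back out when the update is applied in fp32.

```cuda
#include <cuda_fp16.h>
#include <cuda_runtime.h>

// Scale gradients before the fp16 cast so small magnitudes stay representable.
__global__ void scale_and_cast(const float* grad_fp32, __half* grad_fp16,
                               float layer_scale, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) grad_fp16[i] = __float2half(grad_fp32[i] * layer_scale);
}

// Undo the scaling in fp32 when the optimizer consumes the gradient.
__global__ void unscale_and_apply(float* weights, const __half* grad_fp16,
                                  float layer_scale, float lr, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) weights[i] -= lr * (__half2float(grad_fp16[i]) / layer_scale);
}
```

Choosing the scale per layer, e.g. from the layer's gradient statistics, rather than one global loss scale, is the kind of refinement the abstract attributes to Aegis.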

    Bridging the Scalability Gap by Exploiting Error Tolerance for Emerging Applications

    Full text link
    In recent years, there has been a surge in demand for intelligent applications. These emerging applications are powered by algorithms from domains such as computer vision, image processing, pattern recognition, and machine learning. Across these algorithms, there are two key computational characteristics. First, the computational demands they place on computing infrastructure are large, with the potential to substantially outstrip existing compute resources. Second, they are necessarily resilient to errors because their inputs and outputs are inherently noisy and imprecise. Despite the staggering computational requirements and error resilience of intelligent applications, current infrastructure uses conventional software and hardware methodologies, and these systems needlessly consume resources for every bit of precision and arithmetic. To address this inefficiency and help bridge the performance gap caused by intelligent applications, this dissertation investigates exploiting error tolerance across the hardware-software stack. Specifically, we propose (1) statistical machinery to guarantee that accuracy is not compromised when removing work or precision, (2) a GPU optimization framework for work skipping and bottleneck mitigation, and (3) an exploration of unconventional numerical representations to steer future hardware designs.
    PhD, Computer Science & Engineering, University of Michigan, Horace H. Rackham School of Graduate Studies. https://deepblue.lib.umich.edu/bitstream/2027.42/144025/1/parkerhh_1.pd
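To make the work-skipping idea concrete, the sketch below shows a hypothetical CUDA kernel (the name, the threshold, and the stand-in computation are illustrative, not the dissertation's framework) that skips the expensive per-element computation whenever an input's magnitude falls below a tolerance, accepting a small, bounded output error in exchange for fewer operations.

```cuda
#include <cuda_runtime.h>

// Approximate element-wise transform: inputs with |v| < tau contribute almost
// nothing to v * tanh(v), so their computation is skipped entirely.
__global__ void approx_transform(const float* in, float* out, int n, float tau) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float v = in[i];
    if (fabsf(v) < tau) {      // skip negligible work; error bounded by tau * tanh(tau)
        out[i] = 0.0f;
        return;
    }
    out[i] = v * tanhf(v);     // stand-in for the "expensive" computation
}

int main() {
    const int n = 1 << 22;
    float *in, *out;
    cudaMalloc(&in,  n * sizeof(float));
    cudaMalloc(&out, n * sizeof(float));
    approx_transform<<<(n + 255) / 256, 256>>>(in, out, n, 1e-3f);
    cudaDeviceSynchronize();
    cudaFree(in); cudaFree(out);
    return 0;
}
```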